hkaiser changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/ | GSoC2018: https://wp.me/p4pxJf-k1
diehlpk has joined #ste||ar
diehlpk has quit [Read error: Connection reset by peer]
Vir has quit [Ping timeout: 265 seconds]
Vir has joined #ste||ar
diehlpk has joined #ste||ar
Vir has quit [Ping timeout: 240 seconds]
Vir has joined #ste||ar
mcopik has quit [Ping timeout: 268 seconds]
diehlpk has quit [Remote host closed the connection]
K-ballo has quit [Quit: K-ballo]
hkaiser has quit [Quit: bye]
<jbjnr>
is anyone else back at work and raring to go?
<zao>
I'm on vacation and being about as unproductive as usual :D
<jbjnr>
vacation - anywhere nice?
<jbjnr>
I believe St. Petersburg could be nice for you this week :)
<zao>
Nah, just staycation.
<zao>
Went back home around midsummer, probably going again in a few weeks.
<zao>
(split my vacation days up in two sections this year, so two weeks now, then three weeks later)
jbjnr has quit [Ping timeout: 245 seconds]
jbjnr has joined #ste||ar
jaafar has quit [Ping timeout: 268 seconds]
<jbjnr>
grrrr. my windows machine is just terrible these days. blue screen of death and reboots all the time.
<zao>
How bothersome.
<zao>
Were you running pycicle infra on it?
<jbjnr>
yes
<zao>
I was building one of the SoC PRs the other day, my CI stuff apparently still has trouble with a test :(
<jbjnr>
not sure what 'infra' means, but I have two pycicle instances spawning work on the cray. They are just python loops polling github, not building here.
<zao>
I wonder if the container setup interferes with it.
<zao>
Infrastructure.
<zao>
Anyway, welcome back :D
<heller>
jbjnr: welcome back!
jbjnr has quit [Ping timeout: 240 seconds]
jbjnr has joined #ste||ar
<jbjnr>
rebooted again!
<jbjnr>
heller: hi. Have you finished the kokkos integration yet! :)
<heller>
jbjnr: no, the kokkos and HPX models are, interestingly, very different and not really compatible
<heller>
what i'd like instead is a thorough comparison between the two
<jbjnr>
I believe we must make our stuff compatible if we are to get peak performance on a node
<heller>
that's what I'm still wondering
<heller>
Kokkos doesn't come for free either
<jbjnr>
I've already made a lot of progress with my ability to provide hints to the scheduler about where to put tasks, but we need to go much further.
<heller>
we get nice performance for the stream benchmark, for example - something the kokkos model is supposed to be perfect for
<heller>
right
<heller>
I am not arguing that what we have is perfect
<jbjnr>
the stream benchmark is not really a good example though as it does not use the 'standard' api that the rest of hpx uses
<jbjnr>
did you reach any conclusions about N-ary tasks?
<heller>
I am not even sure what that standard API is ...
<heller>
N-ary tasks: that's just a byproduct of their model. I don't think it buys us anything
<heller>
but yes, the stream benchmark needs to be streamlined again
<heller>
the point is: It is able to deliver
<jbjnr>
N-ary: I like the idea of creating 1 task instead of 32 (or some other number) and decrementing the ranges used.
<heller>
yes, I guess that's one point where we need to optimize
<heller>
instead of calculating the partitions upfront, each thread should do it on its own based on some index
<heller>
or so
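(To make the partition-on-demand idea above concrete, here is a minimal sketch in plain C++ rather than HPX's own task machinery; run_chunked, chunk_size and the worker lambda are illustrative names only, not anything from the HPX codebase. Each worker claims the next chunk index from a shared atomic counter and derives its own sub-range from it, so nothing has to be partitioned up front.)

    #include <algorithm>
    #include <atomic>
    #include <cstddef>
    #include <functional>
    #include <iostream>
    #include <thread>
    #include <vector>

    // Sketch only (not HPX code): workers derive their own sub-ranges from a
    // shared chunk counter instead of receiving precomputed partitions.
    void run_chunked(std::size_t n, std::size_t chunk_size, unsigned num_workers,
        std::function<void(std::size_t, std::size_t)> const& process)
    {
        std::atomic<std::size_t> next_chunk{0};
        std::size_t const num_chunks = (n + chunk_size - 1) / chunk_size;

        auto worker = [&] {
            for (;;)
            {
                // claim the next chunk index; the sub-range is computed here,
                // not in the code that spawned the workers
                std::size_t const i = next_chunk.fetch_add(1);
                if (i >= num_chunks)
                    break;
                process(i * chunk_size, std::min(n, (i + 1) * chunk_size));
            }
        };

        std::vector<std::thread> threads;
        for (unsigned t = 0; t != num_workers; ++t)
            threads.emplace_back(worker);
        for (auto& t : threads)
            t.join();
    }

    int main()
    {
        std::atomic<std::size_t> sum{0};
        run_chunked(1000, 32, 4, [&](std::size_t b, std::size_t e) {
            for (std::size_t i = b; i != e; ++i)
                sum += i;
        });
        std::cout << sum << '\n';   // prints 499500
    }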
<heller>
also: you are saying that we aren't able to reach peak on a single node with what we have today. On what grounds are you making that statement? Do you have a comparison of your cholesky stuff using Kokkos?
<heller>
and more importantly: I must get out of this overwhelming, productivity-killing thesis swamp that has been draining my energy for too long
david_pfander1 has quit [Ping timeout: 245 seconds]
hkaiser has joined #ste||ar
jakub_golinowski has joined #ste||ar
jakub_golinowski has quit [Quit: Ex-Chat]
jakub_golinowski has joined #ste||ar
K-ballo has joined #ste||ar
mcopik has joined #ste||ar
nikunj has joined #ste||ar
<nikunj>
hkaiser: yt?
<hkaiser>
here
<nikunj>
So I just tried integrating my Apple implementation into HPX. Things are working fine as of now (examples are running well). I'm on to running the tests now
<hkaiser>
nice!
<hkaiser>
good job!
<nikunj>
Could you please review my PR so that I can add another PR for the Apple integration as well? (It adds onto hpx_wrap.cpp and I do not want to combine the Linux and Mac OS integration into the same PR.)
<hkaiser>
nikunj: will try to get to it today
<nikunj>
thanks, I'll add another pr as soon as it is reviewed
hkaiser has quit [Quit: bye]
anushi has joined #ste||ar
Anushi1998 has joined #ste||ar
<Anushi1998>
nikunj: Why don't you add a branch and make a second PR? Is there any problem with that, or can the second PR only be made once the first one is merged?
<nikunj>
Anushi1998: the second PR cannot be worked on until the first one is merged
<Anushi1998>
okay
<nikunj>
It involves additional code in the file of my first pr.
mcopik has quit [Ping timeout: 245 seconds]
aserio has joined #ste||ar
mcopik has joined #ste||ar
<jakub_golinowski>
M-ms, the build in release mode has linking errors as before in a clean dir
<M-ms>
jakub_golinowski: ok, thanks
<M-ms>
still rebuilding here
<jbjnr>
M-ms: are you in zurich or basel?
<M-ms>
jbjnr: basel
<jbjnr>
ok. see you tomorrow. Is the conf. centre small enough that I'll find everyone easily?
<jbjnr>
I probably won't arrive until lunchtime
<M-ms>
I see you're coming here as well...
<jbjnr>
yup. meeting
hkaiser has joined #ste||ar
<M-ms>
yep, it's reasonably small
<M-ms>
coffee breaks in one hall, otherwise write on slack
<M-ms>
jakub_golinowski: getting the linker errors now on my work laptop, must have something different on my personal one... but now I can at least start looking into it
<nikunj>
hkaiser: can we reschedule our skype meeting to Wednesday or Thursday? I mainly wanted to talk about my Linux and Mac OS implementations. Now that they are (almost) done, I can work on Windows. I think I can get some visible leads by Wednesday to discuss with you.
nikunj97 has joined #ste||ar
nikunj has quit [Ping timeout: 276 seconds]
<aserio>
heller: yt?
<heller>
aserio: hey
<aserio>
heller: welcome to the team
<heller>
aserio: he, thanks ;)
<heller>
aserio: see pm please ;)
<hkaiser>
nikunj97: sure, works for me (Thursday)
<hkaiser>
let's rather do Friday
<nikunj97>
hkaiser: ok
<nikunj97>
I'll research ways to get things done on Windows until then
nikunj1997 has joined #ste||ar
nikunj97 has quit [Ping timeout: 264 seconds]
anushi has quit [Read error: Connection reset by peer]
anushi has joined #ste||ar
anushi has quit [Remote host closed the connection]
Anushi1998 has quit [Quit: Bye]
<jakub_golinowski>
M-ms, I realized that it's 6 CEST now :D do you have time to look at the gdoc?
aserio1 has joined #ste||ar
aserio has quit [Ping timeout: 265 seconds]
aserio1 is now known as aserio
<M-ms>
jakub_golinowski: yep, thanks
nikunj1997 has quit [Ping timeout: 240 seconds]
<github>
[hpx] hkaiser created destroy_parcel (+1 new commit): https://git.io/fySM9
<github>
hpx/destroy_parcel a79d051 Hartmut Kaiser: Making sure all parcels get destroyed on an HPX thread (TCP pp)
anushi has joined #ste||ar
anushi has quit [Remote host closed the connection]
<github>
[hpx] hkaiser force-pushed destroy_parcel from a79d051 to 0d9a425: https://git.io/fyH4L
<github>
hpx/destroy_parcel 0d9a425 Hartmut Kaiser: Making sure all parcels get destroyed on an HPX thread (TCP pp)
anushi has joined #ste||ar
<github>
[hpx] hkaiser force-pushed destroy_parcel from 0d9a425 to 8e2d7c1: https://git.io/fyH4L
<github>
hpx/destroy_parcel 8e2d7c1 Hartmut Kaiser: Making sure all parcels get destroyed on an HPX thread (TCP pp)...
aserio has quit [Ping timeout: 255 seconds]
Anushi1998 has joined #ste||ar
jakub_golinowski has quit [Ping timeout: 276 seconds]
<Guest87328>
[hpx] hkaiser opened pull request #3361: Making sure all parcels get destroyed on an HPX thread (TCP pp) (master...destroy_parcel) https://git.io/fSs66
jaafar has joined #ste||ar
mcopik has quit [Ping timeout: 248 seconds]
mcopik has joined #ste||ar
mcopik has quit [Ping timeout: 276 seconds]
jakub_golinowski has joined #ste||ar
diehlpk_mobile has joined #ste||ar
<hkaiser>
jbjnr: could you give me the link to the nvidia gpu layering workshop announcement, please
jakub_golinowski has quit [Ping timeout: 276 seconds]
aserio has joined #ste||ar
hkaiser has quit [Quit: bye]
jbjnr has quit [Remote host closed the connection]
hkaiser has joined #ste||ar
aserio1 has joined #ste||ar
<github>
[hpx] khuck pushed 1 new commit to apex_fixing_null_wrapper: https://git.io/fSsQa
<github>
hpx/apex_fixing_null_wrapper e63fcf6 Kevin Huck: Trying to make circleci happy
aserio has quit [Ping timeout: 240 seconds]
aserio1 has quit [Ping timeout: 240 seconds]
aserio has joined #ste||ar
jakub_golinowski has joined #ste||ar
<parsa[w]>
is it possible to determine if we're on locality#0 after hpx::finalize()?
<parsa[w]>
hkaiser: ^
aserio has quit [Ping timeout: 240 seconds]
hkaiser has quit [Read error: Connection reset by peer]
hkaiser has joined #ste||ar
<parsa[w]>
hkaiser: is it possible to determine if we're on locality#0 after hpx::finalize()?
<hkaiser>
parsa[w]: not sure what you mean
aserio has joined #ste||ar
<parsa[w]>
hpx_main finishes execution, and I expect some string to be printed, which happens on locality 0. I want to check for that string when I'm on locality 0
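(One possible approach, sketched only and not necessarily what was settled on here: hpx::get_locality_id() is only meaningful while the runtime is up, so capture it inside hpx_main before hpx::finalize() and test the saved value after hpx::init() returns. This assumes the usual hpx_main/hpx::init entry points; the static variable is just one illustrative way to carry the value out.)

    #include <hpx/hpx.hpp>
    #include <hpx/hpx_init.hpp>

    #include <cstdint>
    #include <iostream>

    // Captured while the runtime is still alive; each locality is its own
    // process, so this records which locality this process hosted.
    static std::uint32_t locality_id = 0;

    int hpx_main(int argc, char* argv[])
    {
        locality_id = hpx::get_locality_id();   // valid only before shutdown
        // ... application work; the expected string is printed on locality 0 ...
        return hpx::finalize();
    }

    int main(int argc, char* argv[])
    {
        int const ret = hpx::init(argc, argv);

        // After finalize/init have returned, the saved id still tells us
        // whether this process was locality 0, e.g. to check the output.
        if (locality_id == 0)
            std::cout << "this process hosted locality 0\n";

        return ret;
    }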
<heller>
hkaiser: what's your stance on kicking the asio-based PP in favor of a libfabric-only solution?
<hkaiser>
heller: if we can get it to work on any platform that supports sockets, sure - we won't fully get rid of asio this way though
<hkaiser>
heller: what would be the rationale of doing this?
<heller>
Simplifying the whole parcelhandler code by just using libfabric for the communication. This way, we can fully utilize the network. I have a prototype implementation that's on par with MPI in terms of latency and bandwidth, for a window size of 1 and a single thread
<heller>
For the OSU test
<hkaiser>
what about bootstrapping?
<heller>
Solved
<hkaiser>
nice
<heller>
Even without PMI
<hkaiser>
well, as a first step I'd say - let's add it as an additional pp
<heller>
Full zero copy capable ;)
<heller>
Hmm
<heller>
Not sure if that's going to work out though
<hkaiser>
why?
<K-ballo>
PMI?
<K-ballo>
MPI
<K-ballo>
(I read PMI in passing and thought of Philip Morris International)
<heller>
I changed the serialization stuff. Mainly to have easier preprocessing and rdma reads on demand
<K-ballo>
oh, it's a thing
<heller>
K-ballo: process management interface
<hkaiser>
heller: what about the mpi pp?
<heller>
I started bottom up. As said, it's just a prototype so far and not yet fully integrated
<hkaiser>
heller: will that make the mpi pp obsolete as well?
<heller>
The mpi pp has no need to exist anymore :p
<heller>
Yes
<heller>
That's the goal
<hkaiser>
ok
<hkaiser>
this needs some discussion
<heller>
It will certainly be a disruptive step since I expect some bugs
<heller>
Sure, that's why I'm bringing it up
<hkaiser>
I'm not in favor of throwing away everything we have in terms of networking and replacing it with something new in one big sweep
<heller>
I understand. The two things could happily coexist
<heller>
They in fact do at the moment
<hkaiser>
so what's the problem with leaving the existing tcp pp in place for a while?
<heller>
No problem at all. This new code would make the current parcel handling obsolete.
<hkaiser>
I understand
<heller>
Having a plan for when to remove it would be good
<hkaiser>
but as said, I think this change should be done in steps over at least 2 releases
<heller>
Ok
<heller>
No problem.
<hkaiser>
one release has the new stuff in but not as the default, the next release has it on by default, leaving the old stuff available on demand
<hkaiser>
third release - remove things
<hkaiser>
now, the quicker you do the releases, the quicker the stuff gets in ;-)
<heller>
The risks are: bugs, a changed cmake step (you need to point to a libfabric install), and a potential problem when not using slurm/pbs/alps for distributed applications. Libfabric might get discontinued and we'd end up with a tightly coupled code base and would need to invest there
<heller>
The gain: significantly faster distributed applications
<heller>
And making John happy with the rdma transfers
<hkaiser>
heller: sure, I'm behind this - just a bit cautious
<heller>
Good
<heller>
I hope that it works reasonably on Windows and osx
<heller>
They claim it does...
<hkaiser>
heller: sure, if not we can create some pressure through Chris
diehlpk_mobile has quit [Read error: Connection reset by peer]
jakub_golinowski has quit [Ping timeout: 256 seconds]
<Anushi1998>
Why do we need to add new split credits? Since we have acquired the lock, the credits will be replenished, and whenever it is split again it would simply be divided.
<jakub_golinowski>
I tried rebuilding OpenCV with the options suggested in the install instructions of MartyCam but it still did not help. Now my guess is that I am using a recent master and this might be the issue. In the meantime I am reading the source code of the app
<K-ballo>
hkaiser: nope
jakub_golinowski has quit [Ping timeout: 260 seconds]
nikunj1997 has joined #ste||ar
<github>
[hpx] khuck pushed 1 new commit to apex_fixing_null_wrapper: https://git.io/fSGte
<github>
hpx/apex_fixing_null_wrapper a68ef88 Kevin Huck: Merge branch 'master' into apex_fixing_null_wrapper
<github>
[hpx] khuck pushed 1 new commit to apex_fixing_null_wrapper: https://git.io/fSGtJ
<github>
hpx/apex_fixing_null_wrapper ee55d5d Kevin Huck: Merge branch 'master' into apex_fixing_null_wrapper
<nikunj1997>
hkaiser: 4 tests failed in my Mac OS test run (2 of them timed out). 1 test passed later when I reran it, so overall 99% of the tests passed. The timed-out tests could be due to a RAM shortage (I'm running in a VM).