aserio changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/
<jbjnr>
Does anyone know why I have so many async::continuation and dataflow::finalize bars in this trace?
<hkaiser>
jbjnr: png: what do I look at?
<jbjnr>
a profile/trace of the matrix stuff, but I do not understand the black bars (colour=arbitrary)
<jbjnr>
they are mostly dataflow_frame::finalize
<jbjnr>
but what is that really?
<jbjnr>
and why is thread 12 sitting in pre_main?
<hkaiser>
annotate the functions passed to dataflow
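For reference, annotating a callable passed to dataflow - so the tracer shows a task name instead of the generic dataflow_frame::finalize entry - might look like the sketch below. It is written against the HPX API of this era, requires an HPX build to compile, and the function and annotation names are illustrative, not taken from jbjnr's actual code:

```cpp
#include <hpx/hpx_main.hpp>
#include <hpx/include/lcos.hpp>
#include <hpx/util/annotated_function.hpp>

#include <utility>

int update_panel(int a, int b) { return a + b; }

int main() {
    hpx::future<int> fa = hpx::make_ready_future(1);
    hpx::future<int> fb = hpx::make_ready_future(2);

    // The wrapper attaches the string "update_panel" to the spawned
    // task, so APEX/ITT traces show that name instead of an anonymous
    // dataflow frame.
    hpx::future<int> r = hpx::dataflow(
        hpx::util::annotated_function(&update_panel, "update_panel"),
        std::move(fa), std::move(fb));

    return r.get() == 3 ? 0 : 1;
}
```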
<jbjnr>
everything is annotated AFAIK
<hkaiser>
I don't know what pre_main is doing
<jbjnr>
unless they are apex related....
<hkaiser>
hmmm, then the annotation does not work as expected
<jbjnr>
I am experimenting with schedulers
<hkaiser>
dataflow::finalize calls your functions you passed to dataflow
<jbjnr>
and making interesting discoveries
<hkaiser>
what discoveries?
<hkaiser>
the Americas?
<jbjnr>
if I steal High Priority threads - under certain combinations of settings - it runs more slowly than if I don't steal, but instead allow them to run on the thread they are assigned to - even if they start much later etc etc
<hkaiser>
do you use #2656 and the recent changes we applied to the thread priorities?
<jbjnr>
yes
<hkaiser>
btw, can I merge #2656 now?
<jbjnr>
let me comment on it first
<hkaiser>
k
<jbjnr>
but thanks for doing it
<jbjnr>
I have one dataflow in the code - it triggers "HP Update Panel #x" - these dataflow::finalize always appear right before HP update panel tasks
<hkaiser>
jbjnr: sure
<jbjnr>
could it be cleaning up of dead tasks and that sort of thing?
K-ballo has joined #ste||ar
<hkaiser>
dataflow::finalize is annotated as well
<hkaiser>
so apex first sees finalize, then your annotation
<hkaiser>
it's the same task
<jbjnr>
the finalize:: tasks seem to take longer than the actual tasks though, so I'm concerned
<hkaiser>
sure they do
<hkaiser>
first finalize is started, then your function, then your function returns, then finalize returns
<jbjnr>
I mean the black bits are longer than the yellowy bits
<hkaiser>
sure
<hkaiser>
finalize fully overlaps your function
<jbjnr>
so the task doing my work is only a fraction of the main task
<hkaiser>
don't forget that finalize makes the returned future ready, this also means this includes the execution of all continuations attached to that future
<hkaiser>
jbjnr: ^^
<hkaiser>
jbjnr: this also means that if you use that future and pass it to another dataflow, the initial finalize may cause a whole chain of continuations to be executed
<jbjnr>
I wish I could see the dependencies of the tasks as lines in these trace plots
<hkaiser>
nod
<jbjnr>
then when one task completes, I could see the tasks it triggered. If the black finalize sections had lots of lines emerging from them, then we could correlate the time there with task creation
<hkaiser>
jbjnr: we could do the annotation of finalize differently
<jbjnr>
in what way?
<hkaiser>
end the annotation region before the value is set in the future
<jbjnr>
each finalize is broken into two pieces btw
<jbjnr>
interesting indeed. I wish I understood what was happening
<jbjnr>
seems like with stealing on, the HP update panel task takes MUCH much longer ....
<jbjnr>
and the other tasks are delayed too
<hkaiser>
or it executes more continuations as more work is ready to go
<jbjnr>
if many tasks are being created - what can I tweak that would improve the lag?
<jbjnr>
There are flags for tasks being created in bunches etc?
<hkaiser>
nothing really
Matombo has joined #ste||ar
<hkaiser>
you can't stop pending tasks from being created
<hkaiser>
look at the perf counters for task numbers, queue length etc.
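Those counters can be dumped from the command line; a sketch of what that might look like (the application name is a placeholder, and exact counter availability depends on the HPX build):

```shell
# Print cumulative task counts and instantaneous thread-queue lengths,
# sampled periodically while the application runs.
./my_hpx_app --hpx:threads=8 \
    --hpx:print-counter=/threads{locality#0/total}/count/cumulative \
    --hpx:print-counter=/threadqueue{locality#0/total}/length \
    --hpx:print-counter-interval=100
```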
<jbjnr>
I was thinking about those flags that affect cleaning up terminated threads every N and also creating N at once - that kind of thing
wash has quit [Remote host closed the connection]
<heller>
jbjnr: btw, I now have access to a 728 node omnipath and 560 node IB cluster. This should get us going to fix those parcelports for real now
<jbjnr>
great
<jbjnr>
are these the ones that I can access if I fill in that form?
<heller>
yes
<jbjnr>
I'll do it then.
<jbjnr>
thanks
<heller>
the omnipath one is not in production yet (read: empty queues all the time!)
ajaivgeorge has quit [Read error: Connection reset by peer]
<heller>
and no quota...
ajaivgeorge has joined #ste||ar
<jbjnr>
we should be able to get the omnipath libfabric running easily. I already made it compile, just didn't fix the necessary bits to make it do anything on init etc
<heller>
ok
<heller>
is omnipath connection less?
hkaiser_ has joined #ste||ar
hkaiser has quit [Ping timeout: 268 seconds]
<jbjnr>
heller: omnipath uses the psm provider and yes it is connectionless I believe
<jbjnr>
thread number assignments are different, but apart from that all looks ok
<hkaiser>
jbjnr: so does that branch help in any way?
<jbjnr>
I don't think I can see anything, that wasn't there before.
<jbjnr>
I confused myself by doing the colouring inconsistently
<jbjnr>
I don't think it helps, but the question is : what else is there to see that is smaller than the dataflow::finalize that we might get if we improve it?
<jbjnr>
I guess that all tasks are busy adding newly generated tasks to the queues and there is a lot of contention.
<jbjnr>
When I turn off the stealing, then the other threads get on with lower priority work once it is available and the HP threads do the hp tasks ...
<jbjnr>
the queues are all lock free, so I can't use try_lock ... and then move on if the queue is being used ...
<jbjnr>
the very very long "update panel 0" task in the centre of the plots is 10x longer than it ought to be and it appears to be related to the other finalize events taking place....
EverYoun_ has joined #ste||ar
EverYoung has quit [Ping timeout: 258 seconds]
aserio has quit [Quit: aserio]
aserio has joined #ste||ar
<hkaiser>
well, you should see if finalize invokes continuations
pree has quit [Ping timeout: 255 seconds]
zbyerly_ has quit [Remote host closed the connection]
zbyerly_ has joined #ste||ar
<jbjnr>
well, all of those tasks are really continuations. It is surely invoking them. Is that what you meant - or do you mean something else?
mbremer_ has joined #ste||ar
<heller>
jbjnr: did you try experimenting with launching some of the continuations synchronously?
<jbjnr>
heller: no. There are rather a lot of them, but it might be interesting to see what changes ....
<jbjnr>
maybe we're just creating too many tasks in one go and causing a big bottleneck ...
<heller>
especially those short running ones ...
<jbjnr>
looks a bit locky though
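heller's suggestion above - running short continuations synchronously rather than spawning a fresh task for each - amounts to passing a launch policy to dataflow. A minimal sketch, assuming an HPX build; `update_panel` and the input futures are placeholders:

```cpp
#include <hpx/hpx_main.hpp>
#include <hpx/include/lcos.hpp>

#include <utility>

int update_panel(int a, int b) { return a + b; }

int main() {
    hpx::future<int> f1 = hpx::make_ready_future(20);
    hpx::future<int> f2 = hpx::make_ready_future(22);

    // hpx::launch::sync makes dataflow invoke the continuation inline
    // on the thread that makes the last input ready - no scheduler
    // round-trip for short-running work, at the cost of lengthening
    // the task that triggers it.
    hpx::future<int> r = hpx::dataflow(hpx::launch::sync, &update_panel,
                                       std::move(f1), std::move(f2));

    return r.get() == 42 ? 0 : 1;
}
```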
pree has joined #ste||ar
david_pfander has quit [Ping timeout: 255 seconds]
pree has quit [Read error: Connection reset by peer]
<heller>
mbremer_: the "configuring HPX" section should be fine
<heller>
are you interested in phase II or I?
<heller>
the only problem is that they removed jemalloc from modules
<mbremer_>
Ahh, yes. I saw that :)
bikineev has joined #ste||ar
<mbremer_>
So I should just install it myself?
<heller>
yes
<mbremer_>
Great! Thanks.
<heller>
should be straight forward. usual autotools stuff
<heller>
don't forget to load the craype-mic-knl module if you are targeting the KNLs ;)
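The flow heller describes might look like this - paths, versions, and the install prefix are placeholders, and `HPX_WITH_MALLOC`/`JEMALLOC_ROOT` are the usual HPX CMake knobs for pointing at a self-built allocator:

```shell
# Target the KNL partition (skip this for Haswell nodes).
module load craype-mic-knl

# Usual autotools dance for jemalloc.
./configure --prefix=$HOME/install/jemalloc
make -j8 && make install

# Tell HPX's CMake where to find it.
cmake -DHPX_WITH_MALLOC=jemalloc \
      -DJEMALLOC_ROOT=$HOME/install/jemalloc \
      /path/to/hpx
```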
<mbremer_>
Yeah, everything else seemed to work. But it definitely wasn't able to find the jemalloc dir. Maybe I'll open a ticket with NERSC so they clean up their docs.
<mbremer_>
Will do
<heller>
mbremer_: contact alice directly
<mbremer_>
Gotcha. Thanks!
<heller>
she probably outsources it to me anyways :P
Matombo has quit [Ping timeout: 255 seconds]
<mbremer_>
@heller: lol
<heller>
mbremer_: at the very least, you could ask for a jemalloc module. they got rid of it in favor of memkind. but that currently breaks HPX
<heller>
feel free to CC me, my nersc username is sithhell
pree has joined #ste||ar
<github>
[hpx] K-ballo force-pushed std-atomic from 307571d to 4ea0081: https://git.io/vQFAO
<github>
hpx/std-atomic 4ea0081 Agustin K-ballo Berge: Replace boost::atomic with std::atomic
zbyerly_ has quit [Ping timeout: 240 seconds]
<heller>
hkaiser: now that boost is finally dying, we need to work on our serialization docs
<hkaiser>
lol
<mbremer_>
@heller: Not sure how to write a ticket that cc's someone for nersc, but I'll just CC your email.
<K-ballo>
kidding aside, would we consider replacing boost.format with that nice fmt library, and boost.program_options with one of the many nice single header C++11 libraries out there?
<hkaiser>
K-ballo: I'm not opposed to that
<hkaiser>
although I doubt any existing nice single header library would give us all the program option functionality
<hkaiser>
but we could reimplement stuff
<hkaiser>
for the formatting - by all means, should be possible with limited effort
<heller>
K-ballo: I am all for it
<heller>
K-ballo: would be nice if we could just keep a copy of it in our repo or so
<heller>
and update on a need-by-need basis
<heller>
program options: yes, all that ini file support etc. will be hard to replace
<K-ballo>
it's only worth it if we don't have to modify them at this point
<hkaiser>
heller: program options depends on half of boost itself
<K-ballo>
oh, the ini file support, had forgotten about that
<heller>
however, might be worth to take it as an opportunity to modernize the code there
<heller>
hkaiser: I am not talking about program_options. A replacement of
akheir has joined #ste||ar
<heller>
it
<hkaiser>
heller: k
<jbjnr>
hkaiser: I'll join today
<jbjnr>
(just to find out what's going on)
<heller>
oh gosh ... buildbot is on branch rostam ... not master ... that's why it didn't pick up my changes :/
<heller>
gonna restart it now...
Matombo has joined #ste||ar
<heller>
sorry for the hiccups
<aserio>
akheir: what is another node I can use which is not used for buildbot?
<heller>
aserio: type "sinfo" to see which nodes are idle
<aserio>
heller: I was hoping to avoid the nodes that buildbot uses... but I suppose it doesn't matter much
<heller>
aserio: buildbot is currently active. so it will give you a pretty accurate picture
<heller>
aserio: those are the partitions buildbot uses
<aserio>
thanks!
<heller>
akheir: you are welcome
mars0000 has joined #ste||ar
bikineev has quit [Ping timeout: 255 seconds]
aserio has quit [Ping timeout: 255 seconds]
<mbremer_>
@heller: Getting the following build error: /usr/common/software/boost/1.61/hsw/intel/include/boost/fusion/support/unused.hpp(61): error: a constexpr member function is only permitted in a literal class type operator=(unused_type const&) BOOST_NOEXCEPT
<mbremer_>
Should I just compile with gcc?
<hkaiser>
mbremer_: should help
<heller>
mbremer_: yes, use the PrgEnv-gnu module
<mbremer_>
I was already all surprised that we were compiling with intel on cori
EverYoun_ has quit [Ping timeout: 240 seconds]
<jbjnr>
hkaiser: do you know who is speaking
<hkaiser>
jbjnr: some guy from PNNL
<hkaiser>
Marcin something something
<jbjnr>
ok. I got that too. it was the something something I wondered about. I'm bored.
<hkaiser>
jbjnr: yah, me too - all very well known
<hkaiser>
jbjnr: google says Marcin Joachimiak
<hkaiser>
or Marcin Zalewski... one of those ;)
<jbjnr>
ok
<jbjnr>
anyway. who cares about layers on top of mpi etc, PGAS. yawn.
EverYoung has joined #ste||ar
<hkaiser>
jbjnr: come on, they just invented the best thing since sliced bread
<hkaiser>
;)
<mbremer_>
@heller: So it seems like that fixed the compilation issues. Libfabric still seems to be giving me grief: in function 'int hpx::parcelset::policies::libfabric::libfabric_controller::poll_event_queue(bool)': /global/cscratch1/sd/mb2010/hpx/plugins/parcelport/libfabric/libfabric_controller.hpp:907:22: error: 'FI_JOIN_COMPLETE' was not declared in this scope case FI_JOIN_COMPLETE:
<mbremer_>
I also had to explicitly load a libfabric module, which wasn't in the configure section
<hkaiser>
yah, the same reported just today as a ticket
<hkaiser>
was*
<hkaiser>
mars0000: use mpi for now
<hkaiser>
mbremer_: ^^
<mbremer_>
Well, I can reproduce the bug :)
<mbremer_>
Kk, So just use the normal MPI_parcelport and comment out the stuff in the CrayKNL toolchain?
<hkaiser>
heller: ^^ ??
<mbremer_>
(That's what we did for stampede2)
<jbjnr>
there's an hour of my life I won't get back
<heller>
mbremer: just doing a cmake . -DHPX_WITH_PARCELPORT_LIBFABRIC=OFF should work
<mbremer_>
kk. I'll also detail these steps in my response to alice
<heller>
Thanks!
<heller>
Well, I hope to be able to recommend the fabrics pp soon...
vamatya has joined #ste||ar
parsa[w] has joined #ste||ar
akheir has quit [Remote host closed the connection]
akheir has joined #ste||ar
mars0000 has quit [Quit: mars0000]
aserio has joined #ste||ar
aserio1 has joined #ste||ar
aserio has quit [Ping timeout: 240 seconds]
aserio1 is now known as aserio
bikineev has joined #ste||ar
eschnett has quit [Read error: Connection reset by peer]
<hkaiser>
let's fry the bigger fish before going down that rabbithole
<heller>
sure
<heller>
I won't oppose to a PR replacing boost.format with that though
<mbremer_>
@heller: It compiles!! Thanks :)
<hkaiser>
I would
<heller>
mbremer_: yw
<hkaiser>
as an internal dependency it wouldn't work (license) and as an external library it just increases our list of dependencies without any substantial gain
<K-ballo>
it's being proposed for standardization, which is why I wanted to incorporate it (and to cut another boost dependency, that's a given)
<K-ballo>
but note I'm not complaining
<hkaiser>
K-ballo: well, let's wait for it being standard, then ;)
<K-ballo>
I'll figure something out, or find something else
<hkaiser>
K-ballo: do you imaging being able to get rid of boost completely?
<hkaiser>
imagine*
<K-ballo>
in hpx's core? es
<K-ballo>
yes
<hkaiser>
replace spirit?
<K-ballo>
although there's stuff like INI file handling and other obscure corners I haven't considered
<K-ballo>
yes, I have something in mind for that spirit piece, for the long term
<hkaiser>
ok
<hkaiser>
interesting
<K-ballo>
the one in parse command line handles something
<hkaiser>
what about the perf-counter names?
<hkaiser>
could be rewritten using a hand-rolled RD parser for sure...
<K-ballo>
the perf-counter names use spirit? sounds overkill
<hkaiser>
look at the parser, there is also the bind-specification parser
<heller>
doesn't sound like a real alternative though
<K-ballo>
anyways, in the "short term" (for some definition of short) I think the viable option is std::error_code in interfaces, and a helper error_code that interoperates between the different types in implementations
<K-ballo>
kind of like we did with chrono
<heller>
yes
<heller>
so logging seems to be the only part where we still depend on date_time
<heller>
with a hard dependency that is
<K-ballo>
I almost said thread :P
<heller>
;)
<heller>
this boost cmake decision reminds me of trumpcare
pree has quit [Ping timeout: 255 seconds]
bibek_desktop has quit [Ping timeout: 246 seconds]
pree has joined #ste||ar
<K-ballo>
#NotMySteeringCommittee
pree has quit [Ping timeout: 248 seconds]
bibek_desktop has joined #ste||ar
pree has joined #ste||ar
<jbjnr>
what is "this boost cmake decision" - will boost finally see the light and adopt it?
<jbjnr>
yes we can!
<jbjnr>
why is it like obamacare?
<jbjnr>
obamacare is good - cmake is good - I see now!
denis_blank has quit [Quit: denis_blank]
eschnett has quit [Quit: eschnett]
hkaiser has joined #ste||ar
hkaiser has quit [Read error: Connection reset by peer]
hkaiser has joined #ste||ar
<hkaiser>
jbjnr: Obamacare is an un-affordable mess you have to opt into, built solely for the insurance companies to make money
<hkaiser>
it is being glorified in the European media only
Matombo has joined #ste||ar
<zao>
jbjnr: Do you really believe that the madmen driving Boost will not botch a CMake implementation, badly?
<zao>
hkaiser: Not going to get political, but if you have a fundamentally broken insurance system you can't get rid of, I'm of the impression that ACA is better than nothing.
<zao>
But hey, that's a discussion for another time and place.
* zao
hides in bed, after spreading the Boost news
<hkaiser>
zao: that breaks the existing insurance system even further and increases the gap between those who can afford things and those who can't
<zao>
Most people I talked with were "people use boost?" followed by lamentations of being locked into some niche libraries they can't get rid of.
<zao>
I feel really sorry for the person stuck with Preprocessor.
<hkaiser>
lol - it's soo cool ;)
<hkaiser>
we've been using it for a long time!
<K-ballo>
I have a replacement on a branch
<hkaiser>
what for?
<K-ballo>
had for a long time now
<hkaiser>
boost?
<K-ballo>
preprocessor
<jbjnr>
hkaiser: I read the washington post and the new york times.
<jbjnr>
(online obviously)
<hkaiser>
sure
<hkaiser>
jbjnr: so you're saying those are not favouring the insurance companies?
<zao>
K-ballo: I still hold a dependency on PropertyTree and of all things, recursive_mutex :D
<jbjnr>
hkaiser: trump is a delusional idiot. the republican party is full of arseholes. Obamacare was/is a not bad effort to make the USA into a half civilized nation. There are no conspiracies out to get you. stop believing fox news.
<jbjnr>
\rant over
akheir has quit [Remote host closed the connection]
<hkaiser>
jbjnr: yah and the shepherd does not work in a team with the dog to keep the sheep at bay - that's a conspiracy as well
<jbjnr>
info looks about the same. How can I find out what's going on in those finalize tasks ...
<hkaiser>
I was hoping the changes I made would give us more info
<hkaiser>
jbjnr: hold on, I have another idea to at least get at the function addresses
<jbjnr>
there's always one update panel task that goes on too long ... it's mysterious
<jbjnr>
PS. (your shepherd and dog are not related to the tragedy that is American politics)
eschnett has joined #ste||ar
<jbjnr>
the dataflow_frame::finalize tasks are now made of many very small blocks it seems, not one single big one. not sure if that's something that changed with your branch
mars0000 has quit [Quit: mars0000]
<jbjnr>
I think I should add a sliding semaphore ...
<hkaiser>
jbjnr: I agree
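The sliding-semaphore throttle jbjnr is considering might be sketched as below, using HPX's sliding_semaphore to cap how far the task generator can run ahead of completed work. This assumes an HPX build; `spawn_iteration` and the window size are placeholders for the real panel-update tasks:

```cpp
#include <hpx/hpx_main.hpp>
#include <hpx/include/lcos.hpp>

#include <cstdint>

// Stand-in for whatever actually creates one iteration's work.
hpx::future<void> spawn_iteration(std::int64_t) {
    return hpx::make_ready_future();
}

int main() {
    std::int64_t const window = 16;   // max iterations in flight
    std::int64_t const n = 1000;

    hpx::lcos::local::sliding_semaphore sem(window);
    for (std::int64_t i = 0; i != n; ++i) {
        // Blocks once iteration i - window has not yet signalled,
        // so at most 'window' iterations' worth of tasks exist at once.
        sem.wait(i);
        spawn_iteration(i).then(
            [&sem, i](hpx::future<void>&&) { sem.signal(i); });
    }
    return 0;
}
```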
<hkaiser>
but it's related to you telling me that everything I say is a conspiracy theory
<hkaiser>
jbjnr: I think that's what changed
<hkaiser>
the small bits and pieces that is
<jbjnr>
it might be that the DAG we are creating is sooo huge that those finalize blocks are just the reality of how long it takes to traverse all our tasks
<jbjnr>
adding a sl-sem might help
<hkaiser>
jbjnr: could help, yes
<jbjnr>
have to find the right trigger ...
aserio has quit [Quit: aserio]
patg[[w]] has joined #ste||ar
<patg[[w]]>
hkaiser: I'm just now seeing the space commit
<hkaiser>
patg[[w]]: what commit?
<patg[[w]]>
The stuff Christoph put in space that you commented on
<patg[[w]]>
spack
<jbjnr>
hkaiser: sliding sem didn't help. I'll try more experiments tomorrow. thanks for the help with the function annotations.
pree has quit [Quit: AaBbCc]
patg[[w]] has quit [Quit: Leaving]
patg[[w]] has joined #ste||ar
patg[[w]] has quit [Quit: Leaving]
Matombo has quit [Remote host closed the connection]
<hkaiser>
zbyerly: yt?
mbremer has quit [Quit: Page closed]
EverYoung has quit [Ping timeout: 246 seconds]
EverYoung has joined #ste||ar
EverYoung has quit [Remote host closed the connection]