hkaiser changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/
diehlpk has joined #ste||ar
diehlpk has quit [Ping timeout: 264 seconds]
jaafar has joined #ste||ar
hkaiser has quit [Quit: bye]
jaafar_ has joined #ste||ar
jaafar has quit [Ping timeout: 252 seconds]
jaafar_ has quit [Ping timeout: 246 seconds]
david_pfander has joined #ste||ar
jaafar has joined #ste||ar
eschnett has joined #ste||ar
eschnett has quit [Ping timeout: 252 seconds]
aserio has joined #ste||ar
eschnett has joined #ste||ar
eschnett has quit [Client Quit]
bibek has joined #ste||ar
hkaiser has joined #ste||ar
<jbjnr_> NB. The jemalloc munmap errors have been common recently. I see them when I'm running and the linear algebra people have seen them too on daint (using their own build of hpx). (I have my own theory about it).
<heller> which is?
<jbjnr_> hkaiser: is there any easy call like find_here() that returns the rank 0 locality? I'm looking for find_agas() or something, but can't seem to locate one
<jbjnr_> heller: stack misuse
<jbjnr_> I had much better behaviour on the lazy stack init branch. I wish there were more hours in the day to fix all this stuff.
<jbjnr_> I think hpx::lcos::barrier might be buggy. It may be causing my lockups
<heller> ok
aserio has quit [Ping timeout: 252 seconds]
<heller> so you are saying that there are too many stacks allocated?
<heller> while that might be true
<hkaiser> jbjnr_: sure
<heller> it still is a hint that the application just creates too many threads
<hkaiser> sec
<jbjnr_> heller: absolutely correct
<jbjnr_> combine that with bad reuse of memory ...
<heller> *nod*
<heller> fixing this will still mean that the application has a performance problem, no?
<jbjnr_> (I've been working on that in my scheduler cleanup, but it's not finished)
<jbjnr_> ^^not sure.
<hkaiser> jbjnr_: hpx::agas::get_console_locality()
<jbjnr_> heller: we just shouldn't allow executors to keep creating tasks without limit ... (limiting executor)
<jbjnr_> hkaiser: thanks. perfect
<jbjnr_> I searched for all regexes apart from that one
<heller> jbjnr_: if the application is written that way ... we can't bound the creation of tasks in general, that will mean trouble for our forward progress guarantee
<jbjnr_> not necessarily. as long as locks aren't held....
<jbjnr_> progress is still made, just not on the spawning thread
<heller> right, which might be a problem
<heller> even without locks
akheir has quit [Quit: Konversation terminated!]
<jbjnr_> the spawning thread will be resumed - that is guaranteed, so nothing changes. If any other thread suspends, then the spawning thread can resume. The only case where it doesn't is if all tasks run indefinitely or take OS locks - in that case, there is a problem anyway.
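[editor's note] The "limiting executor" idea discussed above can be sketched with the C++ standard library alone. This is not HPX code and not an HPX API: the class name limiting_spawner, the helper run_demo, and the one-OS-thread-per-task model are all assumptions made for illustration. It only shows the mechanism jbjnr_ describes: the spawner waits once a fixed number of tasks are in flight and resumes as soon as any task finishes.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical sketch (not an HPX type): caps the number of tasks in flight.
// Assumes limit >= 1; a limit of 0 would make spawn() wait forever.
class limiting_spawner {
public:
    explicit limiting_spawner(std::size_t limit) : limit_(limit) {}
    ~limiting_spawner() { wait_all(); }

    // Waits while `limit_` tasks are already running, then starts `task`.
    void spawn(std::function<void()> task) {
        {
            std::unique_lock<std::mutex> lock(mutex_);
            slot_free_.wait(lock, [this] { return in_flight_ < limit_; });
            ++in_flight_;
            if (in_flight_ > peak_) peak_ = in_flight_;
        }
        workers_.emplace_back([this, task = std::move(task)] {
            task();
            {
                std::lock_guard<std::mutex> lock(mutex_);
                --in_flight_;
            }
            // A finished task frees a slot, so the spawner can make progress.
            slot_free_.notify_one();
        });
    }

    // Joins every task started so far; call from the spawning thread only.
    void wait_all() {
        for (auto& worker : workers_) worker.join();
        workers_.clear();
    }

    // Highest number of simultaneously running tasks observed.
    std::size_t peak() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return peak_;
    }

private:
    std::size_t limit_;
    std::size_t in_flight_ = 0;
    std::size_t peak_ = 0;
    mutable std::mutex mutex_;
    std::condition_variable slot_free_;
    std::vector<std::thread> workers_;
};

// Spawns `tasks` trivial tasks under `limit`; returns {peak in flight, completed}.
std::pair<std::size_t, std::size_t> run_demo(std::size_t limit, std::size_t tasks) {
    std::atomic<std::size_t> completed{0};
    limiting_spawner spawner(limit);
    for (std::size_t i = 0; i < tasks; ++i)
        spawner.spawn([&completed] { ++completed; });
    spawner.wait_all();
    return {spawner.peak(), completed.load()};
}
```

Calling run_demo(4, 100) should complete all 100 tasks while never having more than 4 in flight. The sketch blocks an OS thread in spawn(); in a user-level threading runtime the equivalent wait would presumably suspend only the spawning task, which is the case the forward-progress argument above is about. As noted in the discussion, the spawner can only stall permanently if every running task runs indefinitely.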
<jbjnr_> hkaiser: is it possible that barrier has bugs?
<jbjnr_> how well tested is it?
<jbjnr_> could the order that ranks enter the barrier trigger a corner case?
akheir has joined #ste||ar
<hkaiser> jbjnr_: anything is possible
<hkaiser> jbjnr_: local::barrier or the distributed one?
<jbjnr_> distributed
<hkaiser> ask heller ;-)
<jbjnr_> I thought you rewrote it. sorry.
<hkaiser> we use it heavily during startup, so it works at least for those use cases
<hkaiser> no heller rewrote it
<jbjnr_> Seems like I have a case where if rank 0 is last to enter the barrier, I get a deadlock. Not certain about it, but it looks suspicious
<hkaiser> jbjnr_: could be
diehlpk_work has quit [Remote host closed the connection]
diehlpk_work has joined #ste||ar
aserio has joined #ste||ar
david_pfander has quit [Ping timeout: 250 seconds]
aserio has quit [Ping timeout: 252 seconds]
aserio has joined #ste||ar
eschnett has joined #ste||ar
aserio has quit [Ping timeout: 252 seconds]
<diehlpk_work> jbjnr_, Will you attend the meeting today?
eschnett has quit [Quit: eschnett]
eschnett has joined #ste||ar
eschnett has quit [Quit: eschnett]
aserio has joined #ste||ar
<aserio> jbjnr_: Will you be joining the Operation Bell Meeting today?
aserio has quit [Ping timeout: 252 seconds]
hkaiser has quit [Quit: bye]
<diehlpk_work> jbjnr_, yet?
aserio has joined #ste||ar
bibek has quit [Quit: Konversation terminated!]
hkaiser has joined #ste||ar
<hkaiser> parsa: yt?
<hkaiser> parsa: any idea what happened here: https://circleci.com/gh/STEllAR-GROUP/phylanx/22748 ?
<hkaiser> (many others are failing similarly)
aserio has quit [Quit: aserio]
<parsa> hkaiser: looking right now
<hkaiser> parsa: thanks!
<parsa> hkaiser: there is an extra <CR> at the beginning of the conv.xsl. no clue where it came from. it doesn't exist in the live url and we don't manipulate the file anywhere
<hkaiser> parsa: let's rerun and see what happens