aserio changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/
EverYoun_ has joined #ste||ar
EverYoung has quit [Ping timeout: 246 seconds]
EverYoun_ has quit [Ping timeout: 255 seconds]
EverYoung has joined #ste||ar
EverYoung has quit [Remote host closed the connection]
EverYoung has joined #ste||ar
EverYoung has quit [Remote host closed the connection]
EverYoung has joined #ste||ar
hkaiser has quit [Ping timeout: 268 seconds]
hkaiser has joined #ste||ar
EverYoung has quit [Ping timeout: 265 seconds]
parsa has quit [Quit: Zzzzzzzzzzzz]
daissgr has joined #ste||ar
rtohid has left #ste||ar [#ste||ar]
hkaiser has quit [Quit: bye]
K-ballo has quit [Quit: K-ballo]
jaafar_ has quit [Ping timeout: 268 seconds]
hkaiser has joined #ste||ar
<heller_> those segfaults are indeed strange
hkaiser has quit [Quit: bye]
<heller_> that's the segfault backtrace
<heller_> why is it building with generic context coroutines?
parsa has joined #ste||ar
eschnett has quit [Quit: eschnett]
eschnett has joined #ste||ar
EverYoung has joined #ste||ar
<github> [hpx] sithhell force-pushed thread_data_refcount from 9559988 to ed9396c: https://git.io/vNacO
<github> hpx/thread_data_refcount ed9396c Thomas Heller: Don't use boost::intrusive_ptr for thread_id_type...
parsa has quit [Quit: Zzzzzzzzzzzz]
nanashi55 has quit [Ping timeout: 264 seconds]
nanashi55 has joined #ste||ar
parsa has joined #ste||ar
parsa has quit [Client Quit]
parsa has joined #ste||ar
parsa has quit [Ping timeout: 264 seconds]
jaafar_ has joined #ste||ar
jaafar_ has quit [Remote host closed the connection]
jaafar_ has joined #ste||ar
vamatya has joined #ste||ar
daissgr has quit [Ping timeout: 246 seconds]
jaafar_ is now known as jaafar
jaafar_ has joined #ste||ar
jaafar_ has quit [Client Quit]
jaafar has quit [Ping timeout: 255 seconds]
jaafar has joined #ste||ar
EverYoun_ has joined #ste||ar
EverYoung has quit [Ping timeout: 246 seconds]
<github> [hpx] sithhell force-pushed thread_data_refcount from ed9396c to 3f3143e: https://git.io/vNacO
<github> hpx/thread_data_refcount 3f3143e Thomas Heller: Don't use boost::intrusive_ptr for thread_id_type...
mcopik has quit [Ping timeout: 240 seconds]
david_pfander has joined #ste||ar
vamatya has quit [Ping timeout: 256 seconds]
<jbjnr_> heller_: what's the status of your branch/branches?
<jbjnr_> I've got problems with master and I feel like I'm wasting time fixing things if I'm going to use your branch anyway, where things like wait_or_add_new no longer exist
<jbjnr_> home/biddisco/src/hpx/src/runtime/parcelset/parcel.cpp:456:78: error: invalid user-defined conversion from ‘hpx::threads::thread_id_type’ to ‘uint64_t {aka long unsigned int}’ [-fpermissive]
gedaj has quit [Ping timeout: 260 seconds]
gedaj has joined #ste||ar
<jbjnr_> ok. Everything is broken if I use the thread_overheads branch.
<github> [hpx] StellarBot pushed 1 new commit to gh-pages: https://git.io/vNwnk
<github> hpx/gh-pages 8376734 StellarBot: Updating docs
RostamLog has joined #ste||ar
<heller_> jbjnr_: ok
<heller_> jbjnr_: figured as much ... the stuff in core should compile fine though
<heller_> there are a few conflicts with master as I am gradually trying to merge stuff
<heller_> jbjnr_: there are still a few open questions to discuss with that branch
<heller_> and I failed compiling with apex...
<heller_> that error should happen on my branch though, shouldn't it?
<heller_> jbjnr_: I am going to rebase thread_overheads onto latest master once the thread_data_refcount branch is merged
K-ballo has joined #ste||ar
Vir is now known as Guest2733
<github> [hpx] sithhell force-pushed thread_data_refcount from e0c1b07 to 9d4fcca: https://git.io/vNacO
<github> hpx/thread_data_refcount 9d4fcca Thomas Heller: Don't use boost::intrusive_ptr for thread_id_type...
hkaiser has joined #ste||ar
<hkaiser> heller_: yt?
<heller_> hkaiser: hey
<hkaiser> jbjnr_: thanks, your patch to pycicle seems to have fixed the problem I was seeing
<hkaiser> heller_: see pm, pls
parsa has joined #ste||ar
<heller_> hkaiser: the segfault on rostam is not on our side btw
<hkaiser> thought so
<hkaiser> something is off there
<heller_> why is it building with generic context coroutines, btw?
<heller_> yeah, something with the PAPI rapl counters
<hkaiser> shouldn't
<heller_> it does though ;)
<hkaiser> ok, so it tests this at least ;)
<hkaiser> I'll talk to Al today, he might have changed something lately
<heller_> maybe some kernel updates related to meltdown?
parsa has quit [Quit: Zzzzzzzzzzzz]
parsa has joined #ste||ar
<heller_> jbjnr_: are you compiling with the idle backoff thingy?
<jbjnr_> I've got everything to compile now
<jbjnr_> doesn't work, but ...
<jbjnr_> going to go back to master anyway
<heller_> so, one thing I discovered just now is that the idle backoff mechanism might be one of the reasons for the remaining 10% of performance you are looking for
<jbjnr_> ok, keep talking
<heller_> especially in your use case, where the queues probably drain quite often
<jbjnr_> no
<jbjnr_> queues are always super full
<heller_> what we do is a timed wait on a condition variable
<heller_> oh, ok
<heller_> then nevermind
<jbjnr_> (or at least they should be)
<jbjnr_> I will disable the idle backoff anyway. I think I usually do, but I'll check
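For readers unfamiliar with the pattern heller_ describes ("a timed wait on a condition variable"), the following is a minimal sketch of an idle backoff in standard C++. It is not HPX code: the class name backoff_queue, its members, and the timeout parameter are all hypothetical and chosen only for illustration. The idea is that a worker with an empty queue sleeps for a bounded time instead of spinning, and is woken early when new work is pushed.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <utility>

// Hypothetical task queue with an idle backoff; none of these names are HPX APIs.
class backoff_queue {
public:
    // Enqueue a task and wake one worker that may be sleeping in run_one().
    void push(std::function<void()> task) {
        assert(task && "task must be callable");
        {
            std::lock_guard<std::mutex> lock(mtx_);
            tasks_.push_back(std::move(task));
        }
        cv_.notify_one();
    }

    // Run one task if there is one. If the queue is empty, sleep for at most
    // `timeout` (the backoff), waking early on push(). Returns true if a task ran.
    bool run_one(std::chrono::milliseconds timeout) {
        std::function<void()> task;
        {
            std::unique_lock<std::mutex> lock(mtx_);
            // Timed wait: returns false if the queue is still empty at timeout.
            if (!cv_.wait_for(lock, timeout, [this] { return !tasks_.empty(); }))
                return false;
            task = std::move(tasks_.front());
            tasks_.pop_front();
        }
        task();  // Execute outside the lock.
        return true;
    }

private:
    std::mutex mtx_;
    std::condition_variable cv_;
    std::deque<std::function<void()>> tasks_;
};
```

In this sketch the cost of the backoff only shows up when the queue actually drains, which matches the exchange above: if the queues are always full, the timed wait is rarely entered.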
<jbjnr_> my custom scheduler is broken somehow and I don't know why
<jbjnr_> I'm using this -DHPX_WITH_THREAD_MANAGER_IDLE_BACKOFF:BOOL=OFF \
<heller_> with the thread_overhead branch?
<heller_> ah ok, that's good then
<jbjnr_> no. my scheduler seems to be broken with master, so I thought rather than fixing it there, I'd try your branch cos then the thread queues are simpler, but it isn't any better
<heller_> hmm
<jbjnr_> it was faulting in wait_or_add_new, so I figured as you removed that ...
<heller_> what symptoms do you see?
<jbjnr_> and never terminating
<heller_> tell me more...
<jbjnr_> entering destroy_thread millions of times, but never terminating
<jbjnr_> and eventually locking the machine up
<heller_> ok
<heller_> with my branch, right?
<jbjnr_> with your branch it's worse. with master it was flakey
<jbjnr_> it used to work fine, but I didn't touch it for 2 months
<heller_> ok
<heller_> I think I know what's going on
<jbjnr_> and rebasing it onto latest master has taken me 2 days and it still doesn't work.
<heller_> can I see your updated scheduler against my branch please?
<jbjnr_> ok will try
<jbjnr_> ta
<jbjnr_> the second one calls the first, so I only need it in the second one
<heller_> jbjnr_: thread_queue::destroy_thread now always returns true
<jbjnr_> what if i call it with a thread that's in another queue?
<heller_> thread_id_type::get_queue returns a void pointer to the thread_queue that allocated it
<jbjnr_> aha. ok then I can simplify everything
<heller_> *nod*
<jbjnr_> the first link you showed is now obsolete then
hkaiser has quit [Quit: bye]
<jbjnr_> cos that loops over queues
<jbjnr_> (and so does the first, I have holders of queues)
<heller_> yeah, those loops should be obsolete now
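The simplification discussed above (each thread remembers the queue that allocated it, so destroying a thread no longer needs a loop over all queues) can be sketched as follows. This is an illustrative toy, not the HPX implementation: toy_thread, toy_queue, and the free function destroy are hypothetical names, and only the idea of a get_queue accessor returning a void pointer to the owning queue is taken from the conversation.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a thread object; it records its allocating queue.
struct toy_thread {
    void* owner = nullptr;
    // Mirrors the idea from the chat: a void pointer to the owning queue.
    void* get_queue() const { return owner; }
};

// Hypothetical stand-in for a scheduler queue.
class toy_queue {
public:
    toy_thread* create_thread() {
        toy_thread* t = new toy_thread;
        t->owner = this;  // Remember which queue allocated the thread.
        ++live_;
        return t;
    }

    // Only ever called on the owning queue, so it cannot "miss".
    void destroy_thread(toy_thread* t) {
        assert(t->get_queue() == this);
        delete t;
        --live_;
    }

    std::size_t live() const { return live_; }

private:
    std::size_t live_ = 0;
};

// Dispatch straight to the owning queue instead of trying every queue in turn.
inline void destroy(toy_thread* t) {
    static_cast<toy_queue*>(t->get_queue())->destroy_thread(t);
}
```

With a back-pointer like this, a custom scheduler that holds several queues can route a destroy request in constant time, which is why the per-queue loops mentioned above become unnecessary.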
<jbjnr_> hmmm. still calling destroy_thread millions of times and looping forever
<heller_> hmpf
<jbjnr_> no. worked that time.
<jbjnr_> must have run the wrong binary first time maybe....
<heller_> good
<jbjnr_> (I think I ran the binary from the terminal before the IDE had finished writing out the new compiled version)
<jbjnr_> seems to be working now. thanks
<jbjnr_> I can begin testing again (if I can fix the other unrelated bug)
<jbjnr_> might be able to make use of that queue member elsewhere to remove some other loops
<jbjnr_> heller_: hard to tell on the laptop, but things do look a bit better for smaller blocks. I will try to get binaries ready for some big tests this evening on daint
<jbjnr_> rats. other bug still there. locks up on N>1 numa domains.
hkaiser has joined #ste||ar
hkaiser has quit [Client Quit]
hkaiser has joined #ste||ar
<heller_> jbjnr_: do you know where the lockups are?
<heller_> At shutdown or somewhere else?
rtohid has joined #ste||ar
<jbjnr_> my lockups are nearer to start than shutdown
<jbjnr_> when I make my single socket laptop appear to have 2 numa domains for testing
<heller_> Hmm
<heller_> How do you do that?
<jbjnr_> cores go into a queue to domain lookup table
<jbjnr_> by fudging that at startup.
<jbjnr_> it used to work, but I broke it and now I can't find out what I did wrong
eschnett has quit [Quit: eschnett]
eschnett has joined #ste||ar
aserio has joined #ste||ar
daissgr has joined #ste||ar
EverYoun_ has quit [Ping timeout: 255 seconds]
daissgr has quit [Ping timeout: 265 seconds]
<diehlpk_work> hkaiser, jbjnr_ Did you go through the proposals?
daissgr has joined #ste||ar
EverYoung has joined #ste||ar
Smasher has joined #ste||ar
vamatya has joined #ste||ar
jaafar_ has joined #ste||ar
jaafar has quit [Ping timeout: 276 seconds]
eschnett has quit [Quit: eschnett]
<hkaiser> yes, I did
aserio has quit [Ping timeout: 276 seconds]
EverYoung has quit [Remote host closed the connection]
EverYoung has joined #ste||ar
<diehlpk_work> Cool
jaafar_ is now known as jaafar
<diehlpk_work> The Ariel nodes seem to be unavailable: srun: Required node not available (down, drained or reserved)
jaafar has quit [Read error: Connection reset by peer]
jaafar has joined #ste||ar
aserio has joined #ste||ar
<zao> I assume the information out of `sinfo` and `sinfo -R` is of little help.
<diehlpk_work> ariel up 3-00:00:00 2 drain ariel[00-01]
<diehlpk_work> Duplicate jobid slurm 2018-01-23T04:47:28 ariel[00-01]
EverYoun_ has joined #ste||ar
<zao> We typically drain ours on node failures or patch-reboots, heaven knows what the local sysapes at LSU do.
<diehlpk_work> I sent alirez this error
EverYoung has quit [Ping timeout: 276 seconds]
eschnett has joined #ste||ar
aserio1 has joined #ste||ar
aserio has quit [Ping timeout: 255 seconds]
aserio1 is now known as aserio
<jbjnr_> diehlpk_work: I went through some and added a new one, but mostly they are unchanged.
<diehlpk_work> ok, thanks
<diehlpk_work> I added the blaze one this morning
<diehlpk_work> So we can just wait
daissgr has quit [Ping timeout: 276 seconds]
<github> [hpx] stevenrbrandt created fix_hello (+1 new commit): https://git.io/vNrZm
<github> hpx/fix_hello 4af283e Steven R. Brandt: Add new constructor.
<github> [hpx] stevenrbrandt opened pull request #3123: Add new constructor. (master...fix_hello) https://git.io/vNrZW
jaafar has quit [Ping timeout: 276 seconds]
jaafar has joined #ste||ar
<K-ballo> that reminds me, I heard hello_world_component was failing to compile
<K-ballo> it should be added to CI with the rest of the examples
<hkaiser> create a ticket, pls
jaafar has quit [Read error: Connection reset by peer]
jaafar has joined #ste||ar
aserio has quit [Ping timeout: 276 seconds]
david_pfander has quit [Ping timeout: 260 seconds]
<github> [hpx] sithhell created after_3120 (+1 new commit): https://git.io/vNr4P
<github> hpx/after_3120 d07c0e1 Thomas Heller: Replacing nullptr with hpx::threads::invalid_thread_id...
<Guest5549> [hpx] sithhell opened pull request #3125: Replacing nullptr with hpx::threads::invalid_thread_id (master...after_3120) https://git.io/vNr4d
aserio has joined #ste||ar
mcopik has joined #ste||ar
eschnett_ has joined #ste||ar
eschnett has quit [Ping timeout: 256 seconds]
aserio1 has joined #ste||ar
<github> [hpx] sithhell pushed 1 new commit to after_3120: https://git.io/vNr04
<github> hpx/after_3120 941ac5b Thomas Heller: Replacing constexpr with HPX_CXX14_CONSTEXPR where appropriate
daissgr has joined #ste||ar
aserio has quit [Ping timeout: 265 seconds]
aserio1 is now known as aserio
<github> [hpx] sithhell force-pushed after_3120 from 941ac5b to a3bacea: https://git.io/vNr0F
<github> hpx/after_3120 2654964 Thomas Heller: Replacing nullptr with hpx::threads::invalid_thread_id...
<github> hpx/after_3120 a3bacea Thomas Heller: Replacing constexpr with HPX_CXX14_CONSTEXPR where appropriate
aserio has quit [Ping timeout: 265 seconds]
daissgr has quit [Ping timeout: 256 seconds]
hkaiser has quit [Quit: bye]
daissgr has joined #ste||ar
aserio has joined #ste||ar
<heller_> ok ... after_3120 should be good now
jaafar_ has joined #ste||ar
jaafar has quit [Ping timeout: 276 seconds]
hkaiser has joined #ste||ar
aserio has quit [Quit: aserio]
aserio has joined #ste||ar
aserio has quit [Client Quit]
jaafar_ is now known as jaafar
eschnett_ has quit [Quit: eschnett_]
<github> [hpx] stevenrbrandt pushed 1 new commit to fix_hello: https://git.io/vNrD4
<github> hpx/fix_hello baacf09 Steven R. Brandt: Merge branch 'master' into fix_hello
daissgr has quit [Ping timeout: 248 seconds]
rtohid has left #ste||ar [#ste||ar]
heller_ has quit [Quit: http://quassel-irc.org - Chat comfortably. Anywhere.]
heller_ has joined #ste||ar
parsa has quit [Quit: Zzzzzzzzzzzz]