aserio changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/
EverYoun_ has joined #ste||ar
EverYoung has quit [Ping timeout: 246 seconds]
EverYoun_ has quit [Ping timeout: 255 seconds]
EverYoung has joined #ste||ar
EverYoung has quit [Remote host closed the connection]
EverYoung has joined #ste||ar
EverYoung has quit [Remote host closed the connection]
<heller_>
why is it building with generic context coroutines?
parsa has joined #ste||ar
eschnett has quit [Quit: eschnett]
eschnett has joined #ste||ar
EverYoung has joined #ste||ar
<github>
[hpx] sithhell force-pushed thread_data_refcount from 9559988 to ed9396c: https://git.io/vNacO
<github>
hpx/thread_data_refcount ed9396c Thomas Heller: Don't use boost::intrusive_ptr for thread_id_type...
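For context on that commit: boost::intrusive_ptr performs an atomic reference-count update on every copy, which a plain pointer-like id avoids. A minimal sketch of the overhead being removed, with illustrative names rather than the actual HPX types:

    #include <atomic>

    struct thread_data
    {
        std::atomic<long> count{0};
    };

    // boost::intrusive_ptr<thread_data> calls these hooks: every copy of
    // the id is an atomic increment, every destruction an atomic decrement.
    inline void intrusive_ptr_add_ref(thread_data* p) { ++p->count; }
    inline void intrusive_ptr_release(thread_data* p)
    {
        if (--p->count == 0)
            delete p;
    }
    // A raw thread_data* (or a thin wrapper around one) copies for free,
    // which matters on hot paths that pass thread ids around constantly.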
parsa has quit [Quit: Zzzzzzzzzzzz]
nanashi55 has quit [Ping timeout: 264 seconds]
nanashi55 has joined #ste||ar
parsa has joined #ste||ar
parsa has quit [Client Quit]
parsa has joined #ste||ar
parsa has quit [Ping timeout: 264 seconds]
jaafar_ has joined #ste||ar
jaafar_ has quit [Remote host closed the connection]
jaafar_ has joined #ste||ar
vamatya has joined #ste||ar
daissgr has quit [Ping timeout: 246 seconds]
jaafar_ is now known as jaafar
jaafar_ has joined #ste||ar
jaafar_ has quit [Client Quit]
jaafar has quit [Ping timeout: 255 seconds]
jaafar has joined #ste||ar
EverYoun_ has joined #ste||ar
EverYoung has quit [Ping timeout: 246 seconds]
<github>
[hpx] sithhell force-pushed thread_data_refcount from ed9396c to 3f3143e: https://git.io/vNacO
<github>
hpx/thread_data_refcount 3f3143e Thomas Heller: Don't use boost::intrusive_ptr for thread_id_type...
mcopik has quit [Ping timeout: 240 seconds]
david_pfander has joined #ste||ar
vamatya has quit [Ping timeout: 256 seconds]
<jbjnr_>
heller_: what's the status of your branch/branches?
<jbjnr_>
I've got problems with master and I feel like I'm wasting time fixing things if I'm going to use your branch anyway, where things like wait_or_add_new no longer exist
<jbjnr_>
home/biddisco/src/hpx/src/runtime/parcelset/parcel.cpp:456:78: error: invalid user-defined conversion from ‘hpx::threads::thread_id_type’ to ‘uint64_t {aka long unsigned int}’ [-fpermissive]
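The error above is the compiler rejecting an implicit user-defined conversion. A toy reproduction (not HPX code) of the failure and the kind of explicit cast that resolves it:

    #include <cstdint>

    struct toy_thread_id            // stand-in for hpx::threads::thread_id_type
    {
        void* ptr = nullptr;
        explicit operator void*() const { return ptr; }
    };

    std::uint64_t as_integer(toy_thread_id id)
    {
        // std::uint64_t v = id;    // error: invalid user-defined conversion
        return reinterpret_cast<std::uint64_t>(static_cast<void*>(id));
    }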
gedaj has quit [Ping timeout: 260 seconds]
gedaj has joined #ste||ar
<jbjnr_>
ok. Everything is broken if I use the thread_overheads branch.
<heller_>
why is it building with generic context coroutines, btw?
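HPX selects its coroutine backend at configure time, so the answer usually lives in the CMake cache. Assuming the build in question (which the log doesn't show), the relevant switch would be checked or forced like this:

    # in the build directory; HPX_WITH_GENERIC_CONTEXT_COROUTINES is the
    # CMake option controlling the generic (Boost.Context) backend
    grep GENERIC_CONTEXT CMakeCache.txt
    cmake -DHPX_WITH_GENERIC_CONTEXT_COROUTINES:BOOL=OFF .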
<heller_>
yeah, something with the PAPI rapl counters
<hkaiser>
shouldn't
<heller_>
it does though ;)
<hkaiser>
ok, so it tests this at least ;)
<hkaiser>
I'll talk to Al today, he might have changed something lately
<heller_>
maybe some kernel updates related to meltdown?
parsa has quit [Quit: Zzzzzzzzzzzz]
parsa has joined #ste||ar
<heller_>
jbjnr_: are you compiling with the idle backoff thingy?
<jbjnr_>
I've got everything to compile now
<jbjnr_>
doesn't work, but ...
<jbjnr_>
going to go back to master anyway
<heller_>
so, one thing I discovered just now is that the idle backoff mechanism might be one of the reasons for the remaining 10% of performance you are looking for
<jbjnr_>
ok, keep talking
<heller_>
especially in your use case, where the queues probably drain quite often
<jbjnr_>
no
<jbjnr_>
queues are always super full
<heller_>
what we do is a timed wait on a condition variable
<heller_>
oh, ok
<heller_>
then nevermind
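A minimal sketch of the timed wait heller_ describes, with illustrative names and constants rather than the actual HPX scheduler internals:

    #include <algorithm>
    #include <chrono>
    #include <condition_variable>
    #include <mutex>

    std::mutex mtx;
    std::condition_variable cv;     // notified when new work arrives

    void idle_backoff(std::chrono::microseconds& wait)
    {
        std::unique_lock<std::mutex> lk(mtx);
        // A worker that keeps finding its queues empty sleeps here; the
        // wake-up latency is the cost heller_ suspects, which only bites
        // when the queues actually drain.
        cv.wait_for(lk, wait);
        wait = std::min(wait * 2, std::chrono::microseconds(1000));
    }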
<jbjnr_>
(or at least they should be)
<jbjnr_>
I will disable the idle backoff anyway. I think I usually do, but I'll check
<jbjnr_>
my custom scheduler is broken somehow and I don't know why
<jbjnr_>
I'm using this -DHPX_WITH_THREAD_MANAGER_IDLE_BACKOFF:BOOL=OFF \
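For context, that flag sits in a configure line roughly like the following (the path and build type are placeholders, not taken from the log):

    cmake -DCMAKE_BUILD_TYPE=Release \
          -DHPX_WITH_THREAD_MANAGER_IDLE_BACKOFF:BOOL=OFF \
          /path/to/hpx-source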
<heller_>
with the thread_overhead branch?
<heller_>
ah ok, that's good then
<jbjnr_>
no. my scheduler seems to be broken with master, so I thought rather than fixing it there, I'd try your branch cos then the thread queues are simpler, but it isn't any better
<heller_>
hmm
<jbjnr_>
it was faulting in wait_or_add_new, so I figured as you removed that ...
<heller_>
what symptoms do you see?
<jbjnr_>
and never terminating
<heller_>
tell me more...
<jbjnr_>
entering destroy_thread millions of times, but never terminating
<jbjnr_>
and eventually locking the macie up
<jbjnr_>
machine^
<heller_>
ok
<heller_>
with my branch, right?
<jbjnr_>
with your branch it's worse. with master it was flakey
<jbjnr_>
it used to work fine, but I didn't touch it for 2 months
<heller_>
ok
<heller_>
I think I know what's going on
<jbjnr_>
and rebasing it onto latest master has taken me 2 days and it still doesn't work.
<heller_>
can I see your updated scheduler against my branch please?
<jbjnr_>
the second one calls the first, so I only need it in the second one
<heller_>
jbjnr_: thread_queue::destroy_thread now always returns true
<jbjnr_>
what if i call it with a thread that's in another queue?
<heller_>
thread_id_type::get_queue returns a void pointer to the thread_queue that allocated it
<jbjnr_>
aha. ok then I can simplify everything
<heller_>
*nod*
<jbjnr_>
the first link you showed is now obsolete then
hkaiser has quit [Quit: bye]
<jbjnr_>
cos that loops over queues
<jbjnr_>
(and so does the first, I have holders of queues)
<heller_>
yeah, those loops should be obsolete now
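A sketch of the simplification being discussed, using stand-in types; get_queue mirrors the accessor heller_ describes, everything else is illustrative:

    struct thread_queue;

    struct thread_id_type
    {
        void* queue = nullptr;      // set by the thread_queue that allocated it
        void* get_queue() const { return queue; }
    };

    struct thread_queue
    {
        // on heller_'s branch this now always returns true
        bool destroy_thread(thread_id_type) { return true; }
    };

    // Before: loop over every queue (or holder of queues) until one accepts
    // the thread.  After: dispatch straight to the owning queue.
    bool destroy(thread_id_type id)
    {
        return static_cast<thread_queue*>(id.get_queue())->destroy_thread(id);
    }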
<jbjnr_>
hmmm. still calling destroy_thread millions of times and looping forever
<heller_>
hmpf
<jbjnr_>
no. worked that time.
<jbjnr_>
must have run the wrong binary first time maybe....
<heller_>
good
<jbjnr_>
(I think I ran the binary from the terminal before the IDE had finished writing out the new compiled version)
<jbjnr_>
seems to be working now. thanks
<jbjnr_>
I can begin testing again (if I can fix the other unrelated bug)
<jbjnr_>
might be able to make use of that queue member elsewhere to remove some other loops
<jbjnr_>
heller_: hard to tell on the laptop, but things do look a bit better for smaller blocks. I will try to get binaries ready for some big tests this evening on daint
<jbjnr_>
rats. other bug still there. locks up on N>1 numa domains.
hkaiser has joined #ste||ar
hkaiser has quit [Client Quit]
hkaiser has joined #ste||ar
<heller_>
jbjnr_: do you know where the lockups are?
<heller_>
At shutdown or somewhere else?
rtohid has joined #ste||ar
<jbjnr_>
my lockups are nearer to startup than shutdown
<jbjnr_>
when I make my single-socket laptop appear to have 2 numa domains for testing
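One way to make a single-socket machine report two NUMA domains is hwloc's synthetic topology support; whether this is how jbjnr_ does it is an assumption, and the description syntax varies between hwloc versions:

    # hwloc builds a fake topology from HWLOC_SYNTHETIC and HPX sees it;
    # my_hpx_app is a placeholder binary name
    HWLOC_SYNTHETIC="node:2 core:2 pu:2" ./my_hpx_app --hpx:threads=4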