aserio changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/
<simbergm>
K-ballo: did you see why I reverted the cxx11_std_atomic test changes?
<heller_>
jbjnr_: prepare for impact...
kisaacs has joined #ste||ar
<heller_>
jbjnr_: just have to fix this little shutdown problem ... and then we're flying...
<simbergm>
heller_: shutdown = the spurious thread pool executor etc failures?
<heller_>
jbjnr_: initial tests show that I was able to reduce the non-contention overhead by 1/3 and the 10 us grain size task spawning benchmark clocks in with a 16x speedup
<heller_>
simbergm: nope... different thing
<simbergm>
ah, related to your optimizations now?
<simbergm>
very impressive btw
<heller_>
yeah
<heller_>
we'll see
<heller_>
especially how it'll work in a real application
<simbergm>
yeah, promising in any case
kisaacs has quit [Ping timeout: 255 seconds]
<simbergm>
jbjnr_: it seems like the cxx11_std_atomic PR which I reverted might not have failed with pycicle because of a stale cache, but I'm not certain
<jbjnr_>
simbergm: EITHER WAY. I'LL PUT IN THE BINARY WIPE BEFORE BUILD. IT'S THE 'RIGHT' THING TO DO
<jbjnr_>
ARRGH. CAPS LOCK!
<jbjnr_>
sorry
<simbergm>
OK!
<simbergm>
thanks :)
<jbjnr_>
lol
<jbjnr_>
heller_: this is great news indeed. I am reasonably confident that this is the last point of contention that is holding us back. if we get a cholesky speedup, then we're submitting an SC paper with the cholesky results.
<simbergm>
jbjnr_: also, do you maybe want to pull the timeout change for pycicle as well?
<jbjnr_>
I merged them immediately
<jbjnr_>
and restarted pycicle. did they not stick?
<jbjnr_>
oh crap. I know what's wrong
<simbergm>
hmm, seems like it didn't
<jbjnr_>
[pause]
<heller_>
jbjnr_: even with my changes, the scheduling overhead is still fairly large (85%), but got it reduced from 3000 cycles to 2000 ...
<jbjnr_>
every little helps
<github>
[hpx] msimberg closed pull request #3109: Fixing thread scheduling when yielding a thread id. (master...fix_scheduling) https://git.io/vNRSG
jfbastien_ has quit [Ping timeout: 265 seconds]
david_pfander has joined #ste||ar
jaafar has quit [Ping timeout: 255 seconds]
<jbjnr_>
simbergm: restarted pycicle - hopefully with correct timeout now
<simbergm>
jbjnr_: great, thanks!
<github>
[hpx] sithhell pushed 1 new commit to fix_thread_overheads: https://git.io/vNuvT
<github>
hpx/fix_thread_overheads 1cdb0f6 Thomas Heller: Optimizing thread scheduling (WIP)...
<heller_>
woah .... this is insane
<heller_>
with a fine grain run, the concurrency is bad due to freeing the memory associated with the thread function object :/
david_pfander has quit [Ping timeout: 248 seconds]
david_pfander1 is now known as david_pfander
quaz0r has joined #ste||ar
nanashi55 has quit [Ping timeout: 240 seconds]
nanashi55 has joined #ste||ar
parsa has joined #ste||ar
parsa has quit [Client Quit]
<heller_>
woah ... you won't believe this
<K-ballo>
alright, I won't
<heller_>
register_work, register_work_nullary, register_thread and register_thread_nullary lead to dynamic memory allocation inside of hpx::util::function
<heller_>
I brought the scheduling overheads down such that the dynamic memory management of the thread functions becomes the bottleneck :/
<K-ballo>
why would that not be believable?
<heller_>
it is, once you stumble over that
<K-ballo>
actually, I thought we were putting the callable in the "stack"?
<heller_>
the callable has to be captured first
<heller_>
somewhere
<K-ballo>
so I was mistaken?
<heller_>
yes
<heller_>
more or less
<heller_>
we store the address of a trampoline function in the stack, this trampoline function knows the real entry point
<github>
[hpx] hkaiser created exclusive_scan (+1 new commit): https://git.io/vNu3H
<github>
hpx/exclusive_scan 24831ed Hartmut Kaiser: Minor fixes to exclusive_scan algorithm...
<K-ballo>
I wonder why I thought we were emplacing the callable in the "stack"
<K-ballo>
I think I even remember something reserving a bunch of space, then having a function re-target it via reference wrapper
<heller_>
when calling one of the functions above, we essentially do a function<thread_result_type>(bind(one_shot(thread_function), move(callable))); where callable is a function<R()> by itself
<github>
[hpx] hkaiser opened pull request #3112: Minor fixes to exclusive_scan algorithm (master...exclusive_scan) https://git.io/vNu3d
<K-ballo>
maybe it is something I experimented with when I touched the coroutines
<heller_>
could be, yeah
<heller_>
we more or less just store the address of the coroutine itself on the stack, which has an operator() overload, which then calls the user provided thread function
<heller_>
the problem at hand doesn't go away if we somehow emplace the callable on the stack though
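For illustration, a toy sketch of the trampoline idea described above (the names here are made up and this is nothing like the real coroutine machinery; it only shows the shape): the new stack carries the address of one fixed entry point plus a pointer to the object that knows the user-provided thread function.

    #include <cstdio>
    #include <functional>

    // stand-in for the coroutine object: it owns the user-provided thread function
    struct coroutine_like {
        std::function<void()> thread_function;
        void operator()() { thread_function(); }
    };

    // the fixed entry point whose address would be placed on the freshly set-up stack
    void trampoline(void* self) {
        (*static_cast<coroutine_like*>(self))();
    }

    int main() {
        coroutine_like c{[] { std::puts("user thread function"); }};
        trampoline(&c);   // in the real thing this call happens on the new stack
    }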
<hkaiser>
heller_: util::function should perform small value optimizations, is the function object we create too large for this?
<heller_>
hkaiser: yeah, it's essentially a function pointer + a function itself, which is impossible to fit into the small object space
<heller_>
by definition
<hkaiser>
heller_: why?
<hkaiser>
the small object space is 3 * sizeof(void*)
<hkaiser>
iirc
<heller_>
because sizeof(util::function<...>) == sizeof(void*) + sizeof(void*)*3
<hkaiser>
so we need 4*sizeof(void*)?
<heller_>
no
<K-ballo>
lol no
<K-ballo>
we need recursively many
<heller_>
right
<hkaiser>
shrug, you lost me - but that's fine ;)
<heller_>
we now want to store an object with sizeof(void*) + sizeof(void*)*3 + sizeof(void(*)())
<heller_>
because our bound function is a function pointer + util::function
<heller_>
you see ;)
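To make the arithmetic concrete, a rough back-of-the-envelope sketch (the layouts are assumptions taken from this conversation, not the actual HPX types): a 3-pointer small-object buffer cannot hold a function pointer plus another util::function, so the outer wrapper has to heap-allocate.

    #include <cstdio>

    struct fake_function {             // stand-in for hpx::util::function<R()>
        void* vtable;                  // dispatch pointer
        void* sbo[3];                  // small-object buffer, 3 * sizeof(void*)
    };

    struct bound_thread_function {     // stand-in for bind(one_shot(thread_function), move(callable))
        void (*entry)();               // the thread function pointer
        fake_function callable;        // the user callable, itself a util::function
    };

    int main() {
        // 24 vs 40 bytes on a typical 64-bit ABI: the bound object cannot fit
        // into the buffer, so a dynamic allocation happens on every task spawn.
        std::printf("buffer: %zu bytes, bound object: %zu bytes\n",
                    sizeof(fake_function::sbo), sizeof(bound_thread_function));
    }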
<hkaiser>
k
<hkaiser>
let's change THAT, then
<heller_>
yes
<heller_>
on it
<hkaiser>
just don't 'forget' to create a PR afterwards ;)
<heller_>
hkaiser: do you remember what the difference between register_work and register_thread is?
<hkaiser>
sure
<hkaiser>
register_work does not actually create the thread
<hkaiser>
register_thread does
<hkaiser>
(if runnow == true)
<heller_>
ok, so no real difference ;)
<hkaiser>
almost none, yah
<heller_>
i got rid of the task queue anyway :P
<hkaiser>
so you lazily allocate the stack now?
<heller_>
yes
<heller_>
will this be a problem for windows?
<hkaiser>
no 'staged' threads anymore?
<heller_>
right
<hkaiser>
remove the perf-counters as well, pls
<heller_>
sure
<heller_>
it will be a massive PR :P
<heller_>
breaking everything...
<hkaiser>
might be better to do in stages
<heller_>
yeah
<hkaiser>
the guys want to do a release soon
<heller_>
sure
<heller_>
I want to release my dissertation as well ;)
<hkaiser>
so they will not accept this before the release :/
<heller_>
I am fine with that
<heller_>
but I am pretty sure jbjnr_ is more than interested to get this into the release ;)
<K-ballo>
better to do after the release
<heller_>
the overall changes, I agree
<heller_>
these thread function changes shouldn't have any impact on the functionality at all
<jbjnr_>
heller_: hkaiser if heller's PR gives a cholesky speedup then we will delay any release until it's in :)
<heller_>
lol
<jbjnr_>
and of course, I'll stay up late to make sure it gets merged.
kisaacs has joined #ste||ar
<hkaiser>
jbjnr_: ok
<hkaiser>
this will not be an easy ride, though
<hkaiser>
too many things are interrelated here
<heller_>
yeah, I expect it to take at least 3 months until every problem has been ironed out
<simbergm>
the hard deadline for the release is the next HPX tutorial; shifting a week or two from the current plan should be okay
<simbergm>
but preferably not three months...
<heller_>
the scheduler changes are fairly high risk
<hkaiser>
heller_: as I said, let's do this gradually, not everything at once
<simbergm>
if things stay smooth on master I would be ready to do a 1.1.1 whenever these improvements are ready (sometime sooner than in 6 months)
<heller_>
hkaiser: sure
<heller_>
hkaiser: I am doing the thread function work in complete isolation, from top of master, now
parsa has joined #ste||ar
<hkaiser>
good
kisaacs has quit [Ping timeout: 248 seconds]
<jbjnr_>
I want to cancel the tutorial. shall I send an email to Rolf and see how many are signed up?
<jbjnr_>
hkaiser: there's no reason to panic over heller_'s changes. If it works, it works; we have enough tests that if something is wrong, we'll find it.
<jbjnr_>
(famous last words)
<jbjnr_>
at least PRs are actually tested now :)
* jbjnr_
pats himself on the back
<heller_>
jbjnr_: yes, do it
kisaacs has joined #ste||ar
<heller_>
I hate bind...
<heller_>
how did we name the guard against bind_eval again?
<K-ballo>
don't use bind
<K-ballo>
util::protect ?
<K-ballo>
really, don't use bind
<heller_>
right...
<heller_>
what would you suggest?
<K-ballo>
try bind_front/back first, if lazy
<heller_>
thanks
<K-ballo>
if the context in which the callable is used is constrained, an internal hand crafted thingy might be lighter
<heller_>
yeah
<heller_>
good call...
<K-ballo>
oh and don't forget deferred_call too
<heller_>
deferred_call is what I used first
<heller_>
after I discovered that I need to swallow the argument that was passed
<heller_>
a handcrafted thingy it is
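Purely as an illustration of such a hand-crafted wrapper (an assumption of what it might look like, not what ended up in HPX): capture the user callable by value and swallow whatever argument the scheduler passes in.

    #include <type_traits>
    #include <utility>

    template <typename F>
    struct one_shot_thread_function {
        F f;
        template <typename Ignored>
        void operator()(Ignored&&) {   // the argument passed by the scheduler is swallowed
            std::move(f)();            // invoke the captured callable exactly once
        }
    };

    template <typename F>
    one_shot_thread_function<typename std::decay<F>::type>
    make_thread_function(F&& f) {
        return {std::forward<F>(f)};
    }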
<zao>
jbjnr_: Once again, good job!
<heller_>
K-ballo: now, if I'd want to do an empty base optimization .... what would I do? just derive from it?
<heller_>
or is this a bad idea in general?
<K-ballo>
derive from what?
<heller_>
right now I have: template<typename F> struct thread_function { decay_t<F> f; R operator()(...); };
<heller_>
simplified ... but then, F is probably almost never empty... hmmm
<jbjnr_>
hkaiser: finally managed to rebase my guided_pool_executor back onto master after your reworking of the future continuations.
<heller_>
wait, if it is empty, I don't need to store it in the first place ;)
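A minimal sketch of the derive-from-it idea (illustrative only; the names are assumptions and this is not the actual HPX wrapper): an empty callable is folded into a base class so the wrapper carries no storage for it, while stateful callables are stored as a member.

    #include <cstdio>
    #include <type_traits>

    struct stateless_task {                          // hypothetical empty callable
        void operator()() const { std::puts("task ran"); }
    };

    template <typename F, bool = std::is_empty<F>::value>
    struct thread_function {                         // general case: store the callable
        F f;
        explicit thread_function(F f) : f(f) {}
        void operator()() { f(); }
    };

    template <typename F>
    struct thread_function<F, true> : private F {    // empty case: derive, store nothing
        explicit thread_function(F f) : F(f) {}
        void operator()() { F::operator()(); }
    };

    int main() {
        thread_function<stateless_task> tf{stateless_task{}};
        tf();                                            // prints "task ran"
        std::printf("wrapper size: %zu\n", sizeof(tf));  // 1: the empty base adds nothing
    }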
<jbjnr_>
phew!
<jbjnr_>
gtg
<zao>
jbjnr_: Are there any configurations you're not running with your pycicles?
<zao>
Like clang+libc++, or something similarly weird?
<zao>
Do we have Intel testers at all?
<github>
[hpx] hkaiser created mpi_cmake_v3.10.2 (+1 new commit): https://git.io/vNulQ
<github>
hpx/mpi_cmake_v3.10.2 5807fb5 Hartmut Kaiser: cmake V3.10.2 has changed the variable names used for MPI
<hkaiser>
jbjnr_: nice
<hkaiser>
I hope it simplified your code
<jbjnr_>
zao: got to run, but no. I am doing gcc and clang only
<jbjnr_>
hkaiser: not yet, I just rebased and made sure it compiles, not actually looked at anything properly yet
<github>
[hpx] hkaiser opened pull request #3113: cmake V3.10.2 has changed the variable names used for MPI (master...mpi_cmake_v3.10.2) https://git.io/vNu8v
<hkaiser>
jbjnr_: it should simplify your code significantly
<jbjnr_>
zao: no odd configs at the moment. if you look at the daint/greina config files you can see all the stuff that's set in there
<jbjnr_>
hkaiser: really? please leave details here so I can read them tonight or tomorrow. really leaving now. I would appreciate clues as to what has changed cos I've forgotten the details now
<zao>
I'm still a bit unsure what to actually do with my spare Skylake/Ryzen boxes at home.
<zao>
All I ever do with them nowadays is reproduce stuff for AMD support and build software with EasyBuild.
<hkaiser>
jbjnr_: you assume I remember ? ;)
<hkaiser>
need to look at your code to see what's changed
<zao>
We need a deadline_timer bot that responds to async_wait.
<hkaiser>
heller_: would you mind me working on #3105?
<heller_>
hkaiser: not at all
<hkaiser>
k - I think all of your changes can go, we need to fix the test - that's all
<heller_>
hkaiser: I agree
<hkaiser>
k
<heller_>
that's what I wrote in the comment as well, I think
<hkaiser>
k
<heller_>
alright, the register_thread changes alone brought a noticeable speedup
<hkaiser>
heller_: I think you should split the refcnt removal from the stack-related changes
<hkaiser>
and make the removal of the staging stuff separate as well
<hkaiser>
all of those things are independent
<heller_>
refcnt: I agree
<heller_>
the others: maybe
<heller_>
I squashed them all together now ... so it's a little hard to disentangle
<hkaiser>
the stack changes don't touch the scheduler - we should keep the scheduler changes isolated
eschnett has quit [Quit: eschnett]
<heller_>
ok, that we can agree on
<heller_>
it will be lotsa small patches then
<hkaiser>
heller_: that's fine
<hkaiser>
makes reviewing simpler as well
kisaacs_ has joined #ste||ar
kisaacs has quit [Ping timeout: 248 seconds]
kisaacs has joined #ste||ar
kisaacs_ has quit [Ping timeout: 256 seconds]
hkaiser has quit [Quit: bye]
diehlpk_work has quit [Ping timeout: 240 seconds]
diehlpk_work has joined #ste||ar
jaafar has joined #ste||ar
mcopik has joined #ste||ar
hkaiser has joined #ste||ar
<github>
[hpx] hkaiser force-pushed fix_traversal_frame_refcount from 3a711a6 to 8fcaf8e: https://git.io/vNuVr
<github>
hpx/fix_traversal_frame_refcount 8fcaf8e Hartmut Kaiser: Fixing test to start off with an initial refcnt of 1
<hkaiser>
heller_: if dataflow(f, std::vector<future<T>>) is called with an empty vector, 'f' is never called - would you consider this to be a bug?
<heller_>
hkaiser: yes, not an expected behavior
<hkaiser>
ok, I'll look into it
<K-ballo>
what does it do when the vector has one future? and two?
<K-ballo>
it does not "unwrap", or does it?
<hkaiser>
K-ballo: it calls 'f' once the futures have become ready
<diehlpk_work>
hkaiser, After cleaning the configure script, blazemark supports hpx threads
<hkaiser>
K-ballo: not implicitly, no
<hkaiser>
diehlpk_work: wonderful!
<diehlpk_work>
Will run the benchmarks next week, have to finish some other things. Was traveling this week and experienced chaos at all airports
<hkaiser>
diehlpk_work: thanks! looking forward to seeing those numbers
<hkaiser>
diehlpk_work: does Klaus generate his magic numbers from that benchmark as well?
<diehlpk_work>
It seems so
<diehlpk_work>
I have to ask him
<hkaiser>
so we could use it to generate the magic numbers for hpx, great
<diehlpk_work>
Once I have a clean version of the code, I will do the pull request
daissgr has joined #ste||ar
<hkaiser>
he'll appreciate that, for sure
<diehlpk_work>
I think that for the magic numbers we need his support
<hkaiser>
k
aserio has joined #ste||ar
<hkaiser>
but he should be interested in generating those for the hpx backend
<diehlpk_work>
I wanted to first do the measurements and write a blog post or technical report for this
<diehlpk_work>
After that, the magic numbers
<hkaiser>
k, great
<hkaiser>
:D
eschnett has joined #ste||ar
eschnett has quit [Ping timeout: 248 seconds]
<aserio>
simbergm: yt?
<simbergm>
aserio: yep
<hkaiser>
heller_: ok, I take that back, the function _is_ called
<aserio>
simbergm: please see pm :)
hkaiser has quit [Quit: bye]
eschnett has joined #ste||ar
hkaiser has joined #ste||ar
<aserio>
daissgr: What compiler are you using for your work?
<github>
[hpx] sithhell created thread_function (+1 new commit): https://git.io/vNu6X
<github>
hpx/thread_function 19414ab Thomas Heller: Avoid using util::function for thread function wrappers...
<github>
[hpx] sithhell opened pull request #3114: Avoid using util::function for thread function wrappers (master...thread_function) https://git.io/vNu6D
hkaiser has quit [Quit: bye]
<aserio>
Has anyone else noticed that Rostam is slow?
<K-ballo>
heller_: you ended up using bind after all?
<K-ballo>
mmh, no, looks like is just the include
<K-ballo>
heller_: let me suggest pre-decaying the F given to `thread_function`, then do decay_unwrap inside
<K-ballo>
if you can add the wait_signaled ignoring to thread_function itself, all the better, but it might not work
david_pfander has quit [Ping timeout: 255 seconds]
hkaiser has joined #ste||ar
<heller_>
K-ballo: hm, not sure I'm following
<heller_>
The unwrapping in the callable leads to less code
<github>
[hpx] sithhell force-pushed thread_function from 19414ab to 134b498: https://git.io/vNuyh
<github>
hpx/thread_function 134b498 Thomas Heller: Avoid using util::function for thread function wrappers...
<hkaiser>
heller_: is #3104 ok now? also, I changed #3105
<heller_>
hkaiser: #3105: no, for some reason, this didn't work for me :/
<github>
[hpx] hkaiser closed pull request #3104: Local execution of direct actions is now actually performed directly (master...fixing_local_direct_actions) https://git.io/vN84W
daissgr has quit [Ping timeout: 248 seconds]
kisaacs has quit [Ping timeout: 256 seconds]
jaafar has joined #ste||ar
aserio has quit [Ping timeout: 260 seconds]
<github>
[hpx] sithhell force-pushed thread_function from 134b498 to ba9e815: https://git.io/vNuyh
<github>
hpx/thread_function ba9e815 Thomas Heller: Avoid using util::function for thread function wrappers...
kisaacs has joined #ste||ar
kisaacs has quit [Ping timeout: 248 seconds]
<hkaiser>
heller_: I figured #3105 out
<hkaiser>
the test itself is completely broken
kisaacs has joined #ste||ar
<heller_>
hkaiser: figured
daissgr has joined #ste||ar
<github>
[hpx] hkaiser force-pushed fix_traversal_frame_refcount from 8fcaf8e to 6a6b13f: https://git.io/vNuVr
<github>
hpx/fix_traversal_frame_refcount 6a6b13f Hartmut Kaiser: Fixing test to start off with an initial refcnt of 1...
<hkaiser>
heller_: would you mind trying again to see whether it's still broken for you?
<diehlpk_work>
hkaiser, With support from Klaus, I could finish the hpx blazemark
<hkaiser>
nice!
<heller_>
hkaiser: give me a few minutes...
<diehlpk_work>
He has to adapt the configure script to Linux and we are done
<hkaiser>
sure
<diehlpk_work>
Without hacking, it compiles on Mac OS X only
daissgr has quit [Ping timeout: 276 seconds]
<heller_>
hkaiser: I think the non-in_place version should go ...
<heller_>
it's completely useless
<hkaiser>
yah
<hkaiser>
that's orthogonal, though
<heller_>
yes
<heller_>
that way, you could have used future_data_base just as well ;)
<heller_>
doesn't really matter though
<hkaiser>
I wanted to avoid that
<hkaiser>
needed to figure out what's wrong
<heller_>
sure, separating the tests...
<hkaiser>
the boost::intrusive_ref_counter is completely broken, btw
<hkaiser>
that was the reason why we didn't see the problems - the test should have failed to compile
daissgr has joined #ste||ar
<heller_>
yeah ... pulling copies with atomics...
<K-ballo>
broken?
<hkaiser>
K-ballo: yah, it overloads the copy-ctor and copy-assignment to hide the non-copyability of atomic_count
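The pattern being objected to, sketched from memory (an approximation for illustration, not a verbatim copy of Boost; check the actual source): the copy operations are user-provided, so copying a counted object compiles, but the reference count is simply not copied.

    #include <boost/smart_ptr/detail/atomic_count.hpp>

    class counted_base {                  // shaped like boost::intrusive_ref_counter
        mutable boost::detail::atomic_count count_;
    public:
        counted_base() : count_(0) {}
        // atomic_count itself is non-copyable; these overloads hide that, so an
        // accidental copy of a counted object compiles instead of being rejected:
        counted_base(counted_base const&) : count_(0) {}
        counted_base& operator=(counted_base const&) { return *this; }   // count_ untouched
    };

    int main() {
        counted_base a;
        counted_base b = a;   // compiles fine: the non-copyability never surfaces
        (void)b;
    }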