aserio changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/
diehlpk has quit [Ping timeout: 248 seconds]
<heller_>
hkaiser: tatata, got the scheduling overhead down by 7% with my changes so far, the interesting optimizations are still to come ;)
<hkaiser>
cool
<hkaiser>
don't forget to actually delete the threads ;)
<heller_>
that was version 0 :P
<heller_>
the htts scaling benchmark already shows a nice speedup of 1.3x
<hkaiser>
nice
<heller_>
the only contention I removed so far was just due to refcounting...
<heller_>
of 1) thread_id_type and 2) coroutine
<heller_>
getting rid of the intrusive_ptr in coroutine didn't have a huge impact, though
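For context, a minimal sketch of why intrusive reference counting shows up as a contention point (illustrative names only, not HPX's actual types): every copy and destruction of the handle is an atomic read-modify-write on a counter shared by all holders, so frequently passed handles such as thread ids become contended cache lines.

    #include <atomic>
    #include <boost/intrusive_ptr.hpp>

    // Illustrative stand-in for a reference-counted thread object.
    struct thread_data
    {
        std::atomic<long> count{0};
    };

    // Every intrusive_ptr copy does an atomic increment on the shared counter ...
    void intrusive_ptr_add_ref(thread_data* p)
    {
        p->count.fetch_add(1, std::memory_order_relaxed);
    }

    // ... and every destruction an atomic decrement; with many cores passing the
    // same handle around, this counter turns into cache-line ping-pong.
    void intrusive_ptr_release(thread_data* p)
    {
        if (p->count.fetch_sub(1, std::memory_order_acq_rel) == 1)
            delete p;
    }

    int main()
    {
        boost::intrusive_ptr<thread_data> id(new thread_data);  // one atomic RMW
        auto copy = id;                                          // another atomic RMW
    }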
diehlpk has joined #ste||ar
rod_t has quit [Quit: My MacBook has gone to sleep. ZZZzzz…]
jaafar has quit [Ping timeout: 268 seconds]
kisaacs has quit [Ping timeout: 256 seconds]
diehlpk has quit [Ping timeout: 248 seconds]
parsa has quit [Quit: Zzzzzzzzzzzz]
rod_t has joined #ste||ar
<hkaiser>
heller_: I'm not sure if things just work because you don't actually delete the threads
<hkaiser>
or do you do that now?
eschnett has joined #ste||ar
kisaacs has joined #ste||ar
auviga has quit [*.net *.split]
auviga has joined #ste||ar
kisaacs has quit [Ping timeout: 240 seconds]
rod_t has quit [Quit: My MacBook has gone to sleep. ZZZzzz…]
hkaiser has quit [Quit: bye]
kisaacs has joined #ste||ar
parsa has joined #ste||ar
vamatya has quit [Ping timeout: 256 seconds]
vamatya has joined #ste||ar
bibek has quit [Remote host closed the connection]
bibek has joined #ste||ar
heller_ has quit [*.net *.split]
parsa[w] has quit [*.net *.split]
zbyerly__ has quit [*.net *.split]
heller_ has joined #ste||ar
rtohid has joined #ste||ar
zbyerly__ has joined #ste||ar
parsa[w] has joined #ste||ar
rod_t has joined #ste||ar
rod_t has quit [Quit: My MacBook has gone to sleep. ZZZzzz…]
nanashi55 has quit [Ping timeout: 248 seconds]
nanashi55 has joined #ste||ar
parsa has quit [Quit: *yawn*]
kisaacs has quit [Quit: leaving]
rod_t has joined #ste||ar
jbjnr_ has joined #ste||ar
jbjnr has quit [Ping timeout: 255 seconds]
jaafar has joined #ste||ar
vamatya has quit [Ping timeout: 264 seconds]
rod_t has quit [Quit: My MacBook has gone to sleep. ZZZzzz…]
mcopik has joined #ste||ar
<github>
[hpx] msimberg created revert-3080-suspend-pool (+1 new commit): https://git.io/vNRrL
<github>
hpx/revert-3080-suspend-pool a772ea8 Mikael Simberg: Revert "Suspend thread pool"
<heller_>
so ... thread_map_ removal almost complete... tests are passing so far.
<hkaiser>
heller_: the PR you submitted makes (almost) all tests fail
<heller_>
/scratch/snx3000/biddisco/pycicle/build/3109-clang-6.0.0/bin/for_each_annotated_function_test: error while loading shared libraries: /scratch/snx3000/biddisco/pycicle/build/3109-clang-6.0.0/lib/libhpx.so.1: file too short
<heller_>
looks like a filesystem problem
<hkaiser>
ok
<heller_>
anyways ... regarding thread_map_ removal, there are two problems right now: 1) abort_all_suspended_threads 2) enumerate_threads
<heller_>
I'd like to solve the second point by conditionally enabling thread_map_ again if the feature is needed; IIUC, that's only needed for debugging purposes
<heller_>
the first point ... I am actually not sure if it is needed after all
<heller_>
if it is needed, I guess the only way around it is to have a suspended_threads_ member, which is populated when a thread suspends
<hkaiser>
heller_: and for the perf counters
<heller_>
for the perf counters, I only need the count
<hkaiser>
k
<heller_>
which should be straightforward, I guess
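A minimal sketch of what such bookkeeping might look like, assuming a per-scheduler registry guarded by a lock plus an atomic count for the performance counter (hypothetical names, not HPX's actual data structures):

    #include <atomic>
    #include <cstdint>
    #include <mutex>
    #include <unordered_set>

    // Hypothetical per-scheduler registry of truly suspended threads.
    struct suspended_thread_registry
    {
        void on_suspend(std::uintptr_t thrd_id)
        {
            std::lock_guard<std::mutex> l(mtx_);   // paid at every suspension point
            suspended_threads_.insert(thrd_id);
            ++count_;
        }

        void on_resume(std::uintptr_t thrd_id)
        {
            std::lock_guard<std::mutex> l(mtx_);   // and again at every resumption
            suspended_threads_.erase(thrd_id);
            --count_;
        }

        // Enough for the performance counter: just the count, no enumeration.
        std::int64_t suspended_count() const
        {
            return count_.load(std::memory_order_relaxed);
        }

        std::mutex mtx_;
        std::unordered_set<std::uintptr_t> suspended_threads_;
        std::atomic<std::int64_t> count_{0};
    };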
<hkaiser>
creating a list of suspended threads adds the overhead of adding/removing a thread to/from some list at each suspension point, instead of once as we have it now
<heller_>
yes ... that's why I am hesitant here
<hkaiser>
it also would require acquiring a lock for each thread suspension/resumption
<heller_>
right, so more or less the same effect. it does, of course, do proper cleanup of the stack etc.
<hkaiser>
and the thread might handle the exception properly, releasing other threads
<hkaiser>
the exception is thrown in the context of the suspended thread
<heller_>
yes
<hkaiser>
I agree it's a corner case, but it's there for a reason, I believe
<heller_>
yeah, the big question is whether all we do by omitting this is leak resources
<heller_>
since it's the shutdown of the process ...
<hkaiser>
then other localities might never be notified - so you need a way to tell the others to cancel everything as well
<heller_>
yeah, such that a future is never set to ready
<hkaiser>
or will they naturally hit the same code path?
<hkaiser>
right
<heller_>
it will only hit it if that function is called after the termination detection
<heller_>
if it is, I would even assert that there are no remaining suspended threads
<heller_>
which it is
<hkaiser>
k
<hkaiser>
I don't fully understand the consequences
<heller_>
me neither
<heller_>
requires more testing, obviously
<heller_>
and understanding
<heller_>
let me first check if all this is worth the effort ;)
<hkaiser>
k
<hkaiser>
but remember, as always - making a fast program correct is close to impossible
<heller_>
that's the big question here
<heller_>
looks like this corner case is a road block
<heller_>
at least at first sight
<hkaiser>
it's always the corner cases ruining everything ;)
<heller_>
I'd still opt for a solution that tracks suspended tasks. After all, what we want are continuation-based applications, which rarely suspend
<hkaiser>
uhh
<heller_>
;)
<hkaiser>
what about spinlock?
<heller_>
never suspends
<hkaiser>
not true
<heller_>
it yields, but not as 'suspended', just as 'pending' or 'pending_boost'
<heller_>
the pending ones are not a big deal
<hkaiser>
what about mutex or CV
<heller_>
without the thread_map_, the only ones that aren't known to the scheduling policy are the ones that are truly suspended, waiting to be put into pending again (for example condition_variable::wait)
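Roughly, the distinction being drawn, as a simplified sketch (not the complete HPX thread-state set):

    // Simplified illustration of the states mentioned above.
    enum class thread_state
    {
        pending,        // sits in a scheduler queue, so the scheduling policy sees it
        pending_boost,  // like pending, but re-scheduled with priority after a yield
        active,         // currently running on a worker thread
        suspended       // parked outside the queues (e.g. condition_variable::wait);
                        // without thread_map_, only these are invisible to the policy
    };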
<hkaiser>
well, try it - this will only make things worse
<heller_>
lcos::local::mutex is currently only used in a test :P
<hkaiser>
shrug
<hkaiser>
CV is used everywhere
<heller_>
yes, in a future, for example
<heller_>
in this scenario, you only pay when suspending, not always
<heller_>
that's the idea ...
<heller_>
for all the tests I ran so far, it wasn't needed after all
<github>
[hpx] msimberg opened pull request #3110: WIP: Suspend thread pool (master...suspend-pool) https://git.io/vNRj7
<heller_>
the problem is, removing thread_map_ actually slowed down the whole thing :/
<heller_>
well, brought it back to where I started...
<hkaiser>
heller_: but you pay more than once for a thread, possibly - whereas before we paid once per thread
<jbjnr_>
heller_: 7% (ignoring your latest fail) - I'm starting to get interested. Let me know when you want a tester.
<jbjnr_>
Also - bear in mind that CSCS is using HPX without networking, so we don't care about distributed HPX problems for now
<heller_>
jbjnr_: feel free to try it, the current state of the branch is what gave me the 7%
<jbjnr_>
not sure why removing the thread map would slow things down ...
<heller_>
I think I know why...
<jbjnr_>
why
<heller_>
the problem is that thread creation still requires a lock
<heller_>
and thread recycling
<heller_>
that's the next step...
<heller_>
and the one after that, would be to remove the extra staging queue
<heller_>
but as hkaiser always points out: conservation of energy is everywhere...
<hkaiser>
conservation of contention ;)
<hkaiser>
but you're onto something, for sure
<heller_>
contention is our friction point, where the energy escapes the system ;)
<hkaiser>
7% is significant in this context, as it essentially allows us to reduce the minimal efficient thread length by the same amount
<heller_>
yeah
<hkaiser>
and that's at least a factor of 100 in terms of time
<hkaiser>
if you save 100 ns per thread, that gives us a win of 10 us for the minimal efficient thread runtime
<heller_>
yeah, which is huge
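Spelled out, the arithmetic behind those two statements (using the factor-of-100 rule of thumb mentioned above):

    \Delta t_{\mathrm{min}} \approx 100 \cdot \Delta t_{\mathrm{overhead}} = 100 \cdot 100\,\mathrm{ns} = 10\,\mu\mathrm{s}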
<hkaiser>
nod
<heller_>
my big problem is the huge time spent on the whole turnaround to schedule threads ... just creating a coroutine and doing the switching is in the order of 100 nanoseconds ... once the scheduler is involved, we are in the order of a microsecond
<heller_>
and I don't accept that ;)
<hkaiser>
heller_: well, I'd say the switching is in the order of 300-400ns in one direction
<heller_>
not on my system with my benchmark
<hkaiser>
shrug
<hkaiser>
your benchmark - your rules ;)
<heller_>
149 ns for thrd() to return, that is switching to the context and immediately returning, aka yielding back to the caller context
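For reference, a rough stand-in for such a measurement using POSIX ucontext; HPX's coroutines are not implemented this way, and co_fn/co_stack are illustrative names - the point is only the timed create/switch/return round trip:

    #include <ucontext.h>

    #include <chrono>
    #include <cstdint>
    #include <iostream>

    static ucontext_t main_ctx, co_ctx;
    static char co_stack[64 * 1024];

    // The "thread" body: returns immediately, resuming main_ctx via uc_link.
    static void co_fn() {}

    int main()
    {
        constexpr std::int64_t n = 1000000;

        auto t0 = std::chrono::steady_clock::now();
        for (std::int64_t i = 0; i != n; ++i)
        {
            getcontext(&co_ctx);
            co_ctx.uc_stack.ss_sp = co_stack;
            co_ctx.uc_stack.ss_size = sizeof(co_stack);
            co_ctx.uc_link = &main_ctx;          // where to go when co_fn returns
            makecontext(&co_ctx, co_fn, 0);
            swapcontext(&main_ctx, &co_ctx);     // switch in; co_fn yields straight back
        }
        auto dt = std::chrono::steady_clock::now() - t0;

        std::cout << std::chrono::duration_cast<std::chrono::nanoseconds>(dt).count() / n
                  << " ns per create + switch + return\n";
        return 0;
    }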
<hkaiser>
heller_: sure, in that case everything is in the cache
<hkaiser>
L1 cache, that is
<hkaiser>
that's not realistic in the general case
<heller_>
that's in the order of 300 cycles on my machine, btw
<hkaiser>
nod
<heller_>
it sure is an optimal scenario
<jbjnr_>
heller_: those locks are there to reduce contention, but they could probably be replaced with spinlocks instead of full pthread locks (I'd guess). gtg, meeting.
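For illustration, the kind of replacement being suggested - a minimal test-and-set spinlock built on std::atomic_flag (HPX ships its own spinlock type; this is just the generic idea):

    #include <atomic>

    // A minimal test-and-set spinlock: cheaper than a pthread mutex when the
    // critical section is tiny, but it burns CPU while waiting.
    class spinlock
    {
        std::atomic_flag flag_ = ATOMIC_FLAG_INIT;

    public:
        void lock()
        {
            while (flag_.test_and_set(std::memory_order_acquire))
                ;   // spin; a real implementation would add a pause/yield here
        }

        void unlock()
        {
            flag_.clear(std::memory_order_release);
        }
    };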
<heller_>
jbjnr_: yeah ... contention is a problem now...
simbergm has quit [Ping timeout: 268 seconds]
samuel has joined #ste||ar
rod_t has joined #ste||ar
simbergm has joined #ste||ar
eschnett has quit [Quit: eschnett]
<github>
[hpx] hkaiser force-pushed fixing_local_direct_actions from 1ebb3a3 to 6706984: https://git.io/vNBVp
<github>
hpx/fixing_local_direct_actions 6706984 Hartmut Kaiser: Local execution of direct actions is now actually performed directly...
<github>
[hpx] hkaiser force-pushed fixing_3102 from f0055e7 to d73fdb0: https://git.io/vNBjY
<github>
hpx/fixing_3102 d73fdb0 Hartmut Kaiser: Adding support for generic counter_raw_values performance counter type
<heller_>
guess the test frames need to derive from future_data and use the in_place tags
<simbergm>
you keep the cmake cache around between commits?
<heller_>
hkaiser: but you were absolutely correct, it's only a problem with the test; the when_all and dataflow frames are fine. The async traversal stuff is conflating various things without a consistent way of handling them
eschnett has joined #ste||ar
<hkaiser>
heller_: it's better than what we've had before, though
<hkaiser>
simbergm: I do, but that requires care and awareness
<hkaiser>
;)
<heller_>
in terms of keeping us busy, yeah :P
<hkaiser>
come on
<heller_>
I didn't have the time to attempt a proper fix
<heller_>
but it looks like the in_place thing is the only sensible option to use with the async traversal
<heller_>
NB: my latest improvements brought down the time for a single task roundtrip by 26%. The only problem now is that it doesn't scale as much as I'd like it to ;)
<github>
[hpx] msimberg created revert-3096-fixing_3031 (+1 new commit): https://git.io/vN0z8
<github>
hpx/revert-3096-fixing_3031 bef7f3e Mikael Simberg: Revert "fix detection of cxx11_std_atomic"
<K-ballo>
what's the story there ^ ?
<github>
[hpx] msimberg opened pull request #3111: Revert "fix detection of cxx11_std_atomic" (master...revert-3096-fixing_3031) https://git.io/vN0z6
<github>
hpx/master 8bdfa43 Mikael Simberg: Merge pull request #3111 from STEllAR-GROUP/revert-3096-fixing_3031...
<K-ballo>
woa
<K-ballo>
simbergm: ?
daissgr has joined #ste||ar
<github>
[hpx] hkaiser deleted revert-3096-fixing_3031 at bef7f3e: https://git.io/vN020
jfbastien_ has joined #ste||ar
kisaacs has joined #ste||ar
<kisaacs>
So I had Katy also build hpx on rostam with "cmake -DHPX_WITH_CXX14=On -DCMAKE_CXX_COMPILER=clang++ -DCMAKE_CXX_FLAGS=-stdlib=libc++ -DCMAKE_BUILD_TYPE=Debug .." and the cmake command failed on the std::atomic check. She was on commit bb55a8b4 (merge cmake test for std::decay_t to fix cuda build). When she rolled back to the commit I was on yesterday (3507be48) it worked.
<hkaiser>
kisaacs: yah, we just rolled that change back - needs more work
<heller_>
I got the overall single-threaded task/scheduling overhead reduced by a third
<heller_>
so it's <1 us now on my system (started at 1.5 us); it scales nicely to 8 cores... once it crosses NUMA domains ... scaling stops :/
<heller_>
this is 10 us grain size :D
<heller_>
essentially, hpx::async([](){}).get() completes in <1us in that scenario
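A minimal sketch of such a measurement, averaged over many iterations rather than timing a single call (assuming the usual hpx_main.hpp setup and the convenience headers of HPX from that era):

    #include <hpx/hpx_main.hpp>        // replaces main() so the code runs on the HPX runtime
    #include <hpx/include/async.hpp>

    #include <chrono>
    #include <cstdint>
    #include <iostream>

    int main()
    {
        constexpr std::int64_t n = 100000;

        auto t0 = std::chrono::steady_clock::now();
        for (std::int64_t i = 0; i != n; ++i)
            hpx::async([]() {}).get();   // schedule an empty task and wait for it
        auto dt = std::chrono::steady_clock::now() - t0;

        std::cout << std::chrono::duration_cast<std::chrono::nanoseconds>(dt).count() / n
                  << " ns per task round trip\n";
        return 0;
    }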
<hkaiser>
nice
<heller_>
getting there ... yes
<heller_>
and greatly simplified the scheduling policies!
<heller_>
no wait_or_add_new right now
<heller_>
and potentially completely lock-free, if it weren't for contention avoidance
daissgr has joined #ste||ar
<hkaiser>
good job!
<heller_>
I found a promising concurrent queue implementation which I'll try tomorrow :D
<heller_>
I ran their benchmark, which tests various multi-producer/multi-consumer scenarios against boost::lockfree::queue. The results showed at minimum a 2x speedup... and it is completely C++11 and BSL-licensed
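For comparison, a minimal multi-producer/multi-consumer exercise of boost::lockfree::queue along those lines (thread and item counts are arbitrary; a real benchmark would also time the run):

    #include <boost/lockfree/queue.hpp>

    #include <atomic>
    #include <cstdint>
    #include <iostream>
    #include <thread>
    #include <vector>

    int main()
    {
        boost::lockfree::queue<int> q(1024);   // fixed-capacity MPMC queue
        constexpr int producers = 4, consumers = 4, items = 100000;
        std::atomic<std::int64_t> consumed{0};

        std::vector<std::thread> threads;
        for (int p = 0; p != producers; ++p)
            threads.emplace_back([&] {
                for (int i = 0; i != items; ++i)
                    while (!q.push(i)) {}      // retry while the queue is full
            });
        for (int c = 0; c != consumers; ++c)
            threads.emplace_back([&] {
                int value;
                while (consumed.load() < std::int64_t(producers) * items)
                    if (q.pop(value))          // pop returns false if currently empty
                        ++consumed;
            });

        for (auto& t : threads)
            t.join();
        std::cout << "consumed " << consumed.load() << " items\n";
        return 0;
    }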
<aserio>
hkaiser: please see pm
vamatya has joined #ste||ar
vamatya has quit [Read error: Connection reset by peer]
<aserio>
hkaiser: yt?
<hkaiser>
aserio: here
<aserio>
hkaiser: see pm
Guest82936 is now known as Vir
parsa has quit [Read error: Connection reset by peer]
gedaj has quit [Quit: Konversation terminated!]
parsa has joined #ste||ar
rod_t has quit [Quit: My MacBook has gone to sleep. ZZZzzz…]
rod_t has joined #ste||ar
<jbjnr_>
simbergm: yes. Currently I keep the build directory between rebuilds of PRs and do not wipe the cache or binaries, because I wanted a fast turnaround. But if you've been bitten by a faulty/stale cache entry that made the tests give the wrong answers, then I'll put back the binary wipe between builds.
<jbjnr_>
heller_: shall I test your stuff now?
<hkaiser>
heller_: lock-free performance depends on the amount of contention
<hkaiser>
and if it's their benchmark they will favour their own containers
<jbjnr_>
kisaacs: yt? if you are, then I'd like to have a GSoC project to get our APEX results working in your trace viewer
<hkaiser>
but if it gives better results, by all means...
<jbjnr_>
kisaacs: and then add new features like plotting statistics etc. (particularly things like histograms of timings of certain tasks and linking them to the trace view)
<aserio>
simbergm: yt?
mbremer has quit [Quit: Page closed]
Smasher has quit [Remote host closed the connection]
gedaj has joined #ste||ar
<heller_>
hkaiser: not only the contended case is interesting, but also the uncontended one
<heller_>
And I looked at the benchmark and actually ran it myself
<heller_>
jbjnr_: sure, test it - I haven't pushed the latest yet
simbergm has quit [Quit: WeeChat 1.4]
kisaacs has quit [Ping timeout: 248 seconds]
ct-clmsn has joined #ste||ar
aserio has quit [Quit: aserio]
rod_t has quit [Quit: My MacBook has gone to sleep. ZZZzzz…]