aserio changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/
diehlpk has quit [Ping timeout: 248 seconds]
<heller_> hkaiser: tatata, got the scheduling overhead down by 7% with my changes so far, the interesting optimizations are still to come ;)
<hkaiser> cool
<hkaiser> don't forget to actually delete the threads ;)
<heller_> that was version 0 :P
<heller_> the htts scaling got a nice speedup of 1.3 already
<hkaiser> nice
<heller_> the only contention I removed so far was just due to refcounting...
<heller_> of 1) thread_id_type and 2) coroutine
<heller_> getting rid of the intrusive ptr in coroutine didn't have a huge impact though
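For context, the change being discussed amounts to replacing a reference-counted handle with a plain non-owning wrapper, so that copying a thread id no longer touches an atomic refcount. A minimal sketch, with illustrative names (thread_data, thread_id) rather than the actual HPX types from the branch:

    #include <functional>

    struct thread_data;   // opaque; owned and recycled by the scheduler

    // Non-owning thread id: copying it is a plain pointer copy, so there is
    // no atomic increment/decrement per copy as with boost::intrusive_ptr.
    class thread_id
    {
        thread_data* ptr_ = nullptr;

    public:
        thread_id() = default;
        explicit thread_id(thread_data* p) noexcept : ptr_(p) {}

        thread_data* get() const noexcept { return ptr_; }

        friend bool operator==(thread_id lhs, thread_id rhs) noexcept
        {
            return lhs.ptr_ == rhs.ptr_;
        }
        friend bool operator<(thread_id lhs, thread_id rhs) noexcept
        {
            // std::less avoids unspecified behavior for unrelated pointers
            return std::less<thread_data*>{}(lhs.ptr_, rhs.ptr_);
        }
    };

The trade-off is that something else (the scheduler) must now own the thread_data lifetime, which is exactly why the threads have to be deleted explicitly.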
diehlpk has joined #ste||ar
rod_t has quit [Quit: My MacBook has gone to sleep. ZZZzzz…]
jaafar has quit [Ping timeout: 268 seconds]
kisaacs has quit [Ping timeout: 256 seconds]
diehlpk has quit [Ping timeout: 248 seconds]
parsa has quit [Quit: Zzzzzzzzzzzz]
rod_t has joined #ste||ar
<hkaiser> heller_: I'm not sure if things just work because you don't actually delete the threads
<hkaiser> or do you do that now?
eschnett has joined #ste||ar
kisaacs has joined #ste||ar
auviga has quit [*.net *.split]
auviga has joined #ste||ar
kisaacs has quit [Ping timeout: 240 seconds]
rod_t has quit [Quit: My MacBook has gone to sleep. ZZZzzz…]
hkaiser has quit [Quit: bye]
kisaacs has joined #ste||ar
parsa has joined #ste||ar
vamatya has quit [Ping timeout: 256 seconds]
vamatya has joined #ste||ar
bibek has quit [Remote host closed the connection]
bibek has joined #ste||ar
heller_ has quit [*.net *.split]
parsa[w] has quit [*.net *.split]
zbyerly__ has quit [*.net *.split]
heller_ has joined #ste||ar
rtohid has joined #ste||ar
zbyerly__ has joined #ste||ar
parsa[w] has joined #ste||ar
rod_t has joined #ste||ar
rod_t has quit [Quit: My MacBook has gone to sleep. ZZZzzz…]
nanashi55 has quit [Ping timeout: 248 seconds]
nanashi55 has joined #ste||ar
parsa has quit [Quit: *yawn*]
kisaacs has quit [Quit: leaving]
rod_t has joined #ste||ar
jbjnr_ has joined #ste||ar
jbjnr has quit [Ping timeout: 255 seconds]
jaafar has joined #ste||ar
vamatya has quit [Ping timeout: 264 seconds]
rod_t has quit [Quit: My MacBook has gone to sleep. ZZZzzz…]
mcopik has joined #ste||ar
<github> [hpx] msimberg created revert-3080-suspend-pool (+1 new commit): https://git.io/vNRrL
<github> hpx/revert-3080-suspend-pool a772ea8 Mikael Simberg: Revert "Suspend thread pool"
<github> [hpx] msimberg opened pull request #3108: Revert "Suspend thread pool" (master...revert-3080-suspend-pool) https://git.io/vNRrB
<github> [hpx] msimberg pushed 1 new commit to master: https://git.io/vNRrR
<github> hpx/master e1e41ca Mikael Simberg: Merge pull request #3108 from STEllAR-GROUP/revert-3080-suspend-pool...
david_pfander has joined #ste||ar
Smasher has joined #ste||ar
Smasher has quit [Remote host closed the connection]
<heller_> jbjnr_: got a 7% performance increase so far ... thread_map_ is next
jaafar has quit [Ping timeout: 276 seconds]
<heller_> also found a nice concurrentqueue implementation: https://github.com/cameron314/concurrentqueue
<heller_> which is supposed to be super fast and seems to support our scheduler model *pretty* nicely
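The library's core API is small; a sketch of how it could hold pending tasks (thread_data here is just a stand-in, not the scheduler integration from the branch):

    #include "concurrentqueue.h"   // single header from cameron314/concurrentqueue

    struct thread_data {};         // stand-in for a task/thread descriptor

    int main()
    {
        // lock-free multi-producer/multi-consumer queue
        moodycamel::ConcurrentQueue<thread_data*> pending;

        thread_data task;
        pending.enqueue(&task);            // any thread may push work

        thread_data* next = nullptr;
        while (pending.try_dequeue(next))  // any worker may pop work
        {
            // ... schedule/run *next ...
        }
        return 0;
    }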
simbergm has quit [Ping timeout: 256 seconds]
simbergm has joined #ste||ar
<github> [hpx] sithhell force-pushed fix_thread_overheads from 09a6fa1 to 77fdf0d: https://git.io/vNByp
<github> hpx/fix_thread_overheads dc01f0e Thomas Heller: Don't use boost::intrusive_ptr for thread_id_type...
<github> hpx/fix_thread_overheads 9d421e6 Thomas Heller: Fixing thread scheduling when yielding a thread id....
<github> hpx/fix_thread_overheads 77fdf0d Thomas Heller: Cleaning up coroutine implementation...
<github> [hpx] sithhell created fix_scheduling (+1 new commit): https://git.io/vNRyA
<github> hpx/fix_scheduling 5a0d3f8 Thomas Heller: Fixing thread scheduling when yielding a thread id....
<github> [hpx] sithhell opened pull request #3109: Fixing thread scheduling when yielding a thread id. (master...fix_scheduling) https://git.io/vNRSG
Vir is now known as Guest82936
heller_ has quit [Quit: http://quassel-irc.org - Chat comfortably. Anywhere.]
heller_ has joined #ste||ar
simbergm has quit [Ping timeout: 256 seconds]
<github> [hpx] sithhell force-pushed fix_scheduling from 5a0d3f8 to b709d11: https://git.io/vNRHg
<github> hpx/fix_scheduling b709d11 Thomas Heller: Fixing thread scheduling when yielding a thread id....
simbergm has joined #ste||ar
<github> [hpx] StellarBot pushed 1 new commit to gh-pages: https://git.io/vNRdh
<github> hpx/gh-pages c8dcc15 StellarBot: Updating docs
gedaj has quit [Ping timeout: 248 seconds]
hkaiser has joined #ste||ar
<heller_> so ... thread_map_ removal almost complete... tests are passing so far.
<hkaiser> heller_: the PR you submitted makes (almost) all tests fail
<heller_> /scratch/snx3000/biddisco/pycicle/build/3109-clang-6.0.0/bin/for_each_annotated_function_test: error while loading shared libraries: /scratch/snx3000/biddisco/pycicle/build/3109-clang-6.0.0/lib/libhpx.so.1: file too short
<heller_> looks like a filesystem problem
<hkaiser> ok
<heller_> anyways ... regarding thread_map_ removal, there are two problems right now: 1) abort_all_suspended_threads 2) enumerate_threads
<heller_> I'd like to solve the second point by conditionally enabling thread_map_ again if the feature is needed; IIUC, that's only needed for debugging purposes
<heller_> the first point ... I am actually not sure if it is needed after all
<heller_> if it is needed, I guess the only way around it is to have a suspended_threads_ member, which is populated when a thread suspends
<hkaiser> heller_: and for the perf counters
<heller_> for the perf counters, I only need the count
<hkaiser> k
<heller_> which should be straight forward, I guess
<hkaiser> creating a list of suspended threads adds the overhead of adding/removing a thread to/from some list at each suspension point, instead of once as we have it now
<heller_> yes ... that's why I am hesitant here
<hkaiser> it also would require acquiring a lock for each thread suspension/resumption
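The trade-off under discussion, as a minimal sketch (not the actual branch): an atomic count is enough for the performance counters, while a separate suspended-thread set would cost a locked insert/erase at every suspension point instead of one registration per thread.

    #include <atomic>
    #include <cstddef>
    #include <mutex>
    #include <unordered_set>

    struct thread_data;

    class scheduler_state
    {
        std::atomic<std::size_t> thread_count_{0};   // enough for the counters

        // only needed if abort_all_suspended_threads/enumerate_threads must
        // still work; the price is a lock per suspension/resumption
        std::mutex mtx_;
        std::unordered_set<thread_data*> suspended_;

    public:
        void on_create()  { ++thread_count_; }
        void on_destroy() { --thread_count_; }

        void on_suspend(thread_data* t)
        {
            std::lock_guard<std::mutex> l(mtx_);
            suspended_.insert(t);
        }
        void on_resume(thread_data* t)
        {
            std::lock_guard<std::mutex> l(mtx_);
            suspended_.erase(t);
        }

        std::size_t count() const { return thread_count_.load(); }
    };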
xxl- has quit [Quit: AndroIRC - Android IRC Client ( http://www.androirc.com )]
<heller_> so the question is, what harm is there if we just let those die ungracefully
<heller_> one harm would be that we'd leak memory
<hkaiser> things might hang in distributed if we did that
<heller_> not sure
<heller_> the test suite completes so far. the question is how often we run into that corner case in real, large-scale runs
<hkaiser> heller_: shrug
<hkaiser> the test suite will not trigger this case
<heller_> calling the coroutine with wait_abort will just make it yield again immediately, without further executing user code, won't it?
<hkaiser> no
<heller_> the task itself can't continue, at least
<heller_> since there was a reason why it got suspended in the first place
<heller_> right, so more or less the same effect. it does do proper cleanup of the stack etc., of course
<hkaiser> and the thread might handle the exception properly, releasing other threads
<hkaiser> the exception is thrown in the context of the suspended thread
<heller_> yes
<hkaiser> I agree its a corner case, but it's there for a reason, I believe
<heller_> yeah, the big question is whether all we do, if we omit this, is leak resources
<heller_> since it's the shutdown of the process ...
<hkaiser> then other localities might never be notified - so you need a way to tell the others to cancel everything as well
<heller_> yeah, such that a future is never set to ready
<hkaiser> or will they naturally hit the same code path?
<hkaiser> right
<heller_> it will only hit it, if that function is called after the termination detection
<heller_> if it is, I would even assert that there are no remaining suspended threads
<heller_> which it is
<hkaiser> k
<hkaiser> I don't fully understand the consequences
<heller_> me neither
<heller_> requires more testing, obviously
<heller_> and understanding
<heller_> let me first check if all this is worth the effort ;)
<hkaiser> k
<hkaiser> but remember, as always - making a fast program correct is nearly impossible
<heller_> that's the big question here
<heller_> looks like this corner case is a road block
<heller_> at least at first sight
<hkaiser> it's always the corner cases ruining everything ;)
<heller_> I'd still opt for a solution to track suspended tasks. After all, what we want are continuation-based applications, which rarely suspend
<hkaiser> uhh
<heller_> ;)
<hkaiser> what about spinlock?
<heller_> never suspends
<hkaiser> not true
<heller_> it yields, but not as 'suspended', just as 'pending' or 'pending_boost'
<heller_> the pending ones are not a big deal
<hkaiser> what about mutex or CV
<heller_> without the thread_map, the only ones that aren't known to the scheduling policy are the ones that are truly suspended, waiting to be put into pending again (for example condition_variable::wait)
<hkaiser> well try it - this will make things only worse
<heller_> lcos::local::mutex is currently only used in a test :P
<hkaiser> shrug
<hkaiser> CV is used everywhere
<heller_> yes, in a future, for example
<heller_> in this scenario, you only pay when suspending, not always
<heller_> that's the idea ...
<heller_> for all the tests I ran so far, it wasn't needed after all
<github> [hpx] msimberg opened pull request #3110: WIP: Suspend thread pool (master...suspend-pool) https://git.io/vNRj7
<heller_> the problem is, removing thread_map_ actually slowed down the whole thing :/
<heller_> well, brought it back to where I started...
<hkaiser> heller_: but you pay more than once for a thread, possibly - whereas before we paid once per thread
<jbjnr_> heller_: 7% (ignoring your latest fail) - I'm starting to get interested. Let me know when you want a tester.
<jbjnr_> Also - bear in mind that cscs is using hpx without networking, so we don't care about distributed hpx problems for now
<heller_> jbjnr_: feel free to try it, the current state of the branch is what gave me the 7%
<jbjnr_> not sure why removing thread map would slow things down ...
<heller_> I think I know why...
<jbjnr_> why
<heller_> the problem is that thread creation still requires a lock
<heller_> and thread recycling
<heller_> that's the next step...
<heller_> and the one after that, would be to remove the extra staging queue
<heller_> but as hkaiser always points out: conservation of energy is everywhere...
<hkaiser> conservation of contention ;)
<hkaiser> but you're onto something, for sure
<heller_> contention is our friction point, where the energy escapes the system ;)
<hkaiser> 7% is significant in this context as it essentially allows us to reduce the minimal efficient thread length by the same amount
<heller_> yeah
<hkaiser> and that's at least a factor of 100 in terms of time
<hkaiser> if you save 100 ns per thread, that gives us a win of 10 us for the minimal efficient thread runtime
<heller_> yeah, which is huge
<hkaiser> nod
<heller_> my big problem is the huge time spent on the whole turnaround to schedule threads ... just creating a coroutine and doing the switching is on the order of 100 nanoseconds ... once the scheduler is involved, we are on the order of a microsecond
<heller_> and I don't accept that ;)
<hkaiser> heller_: well, I'd say the switching is in the order of 300-400ns in one direction
<heller_> not on my system with my benchmark
<hkaiser> shrug
<hkaiser> your benchmark - your rules ;)
<heller_> 149 ns for thrd() to return, that is switching to the context and immediately returning, aka yielding back to the caller context
<hkaiser> heller_: sure, in that case everything is in the cache
<hkaiser> L1 cache, that is
<hkaiser> that's not realistic in the general case
<heller_> that's in the order of 300 cycles on my machine, btw
<hkaiser> nod
<heller_> it sure is an optimal scenario
<jbjnr_> heller_: those locks are there to reduce contention, but they could probably be replaced with spinlocks instead of full pthread locks. (I'd guess). gtg meeting.
<heller_> jbjnr_: yeah ... contention is a problem now...
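A generic test-and-set spinlock of the kind jbjnr_ has in mind (this is not HPX's own hpx::lcos::local::spinlock, just a sketch) is small:

    #include <atomic>
    #include <thread>

    // Cheap when uncontended; spins and yields instead of blocking in the
    // kernel the way a full pthread mutex does.
    class spinlock
    {
        std::atomic_flag flag_ = ATOMIC_FLAG_INIT;

    public:
        void lock()
        {
            while (flag_.test_and_set(std::memory_order_acquire))
                std::this_thread::yield();   // back off under contention
        }
        void unlock()
        {
            flag_.clear(std::memory_order_release);
        }
    };

Inside HPX tasks one would yield the HPX thread rather than the OS thread, but the structure is the same.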
simbergm has quit [Ping timeout: 268 seconds]
samuel has joined #ste||ar
rod_t has joined #ste||ar
simbergm has joined #ste||ar
eschnett has quit [Quit: eschnett]
<github> [hpx] hkaiser force-pushed fixing_local_direct_actions from 1ebb3a3 to 6706984: https://git.io/vNBVp
<github> hpx/fixing_local_direct_actions 6706984 Hartmut Kaiser: Local execution of direct actions is now actually performed directly...
<github> [hpx] hkaiser force-pushed fixing_3102 from f0055e7 to d73fdb0: https://git.io/vNBjY
<github> hpx/fixing_3102 d73fdb0 Hartmut Kaiser: Adding support for generic counter_raw_values performance counter type
<hkaiser> heller_: what will happen to #3105?
<github> [hpx] hkaiser closed pull request #3107: Remove UB from thread::id relational operators (master...thread-id-relops) https://git.io/vNBi8
<github> [hpx] hkaiser pushed 4 new commits to master: https://git.io/vN0Ck
<github> hpx/master 8ef85e4 Mikael Simberg: Add cmake test for std::decay_t to fix cuda build
<github> hpx/master 9fdf28e Mikael Simberg: Revert "Add cmake test for std::decay_t to fix cuda build"...
<github> hpx/master a55436c Mikael Simberg: Use std::decay instead of decay_t in optional...
david_pfander has quit [Ping timeout: 240 seconds]
simbergm has quit [Ping timeout: 256 seconds]
gedaj has joined #ste||ar
simbergm has joined #ste||ar
<heller_> hkaiser: the test is broken...
<heller_> it has to be fixed somehow, I guess
<heller_> guess the test frames need to derive from future_data, and use the inplace tags
<simbergm> you keep the cmake cache around between commits?
<heller_> hkaiser: but you were absolutely correct, it's only a problem with the test, the when_all and dataflow frames are fine. The async traversal stuff is conflating several different things without a consistent way of handling them
eschnett has joined #ste||ar
<hkaiser> heller_: it's better than what we've had before, though
<hkaiser> simbergm: I do, but that requires care and awareness
<hkaiser> ;)
<heller_> in terms of keeping us busy, yeah :P
<hkaiser> come on
<heller_> I didn't have the time to attempt a proper fix
<heller_> but it looks like the in_place thing is the only sensible option to use with the async traversal
<heller_> NB: my latest improvements brought down the time for a single task roundtrip by 26%. The only problem now is that it doesn't scale as much as I'd like it to ;)
<github> [hpx] msimberg created revert-3096-fixing_3031 (+1 new commit): https://git.io/vN0z8
<github> hpx/revert-3096-fixing_3031 bef7f3e Mikael Simberg: Revert "fix detection of cxx11_std_atomic"
<K-ballo> what's the story there ^ ?
<github> [hpx] msimberg opened pull request #3111: Revert "fix detection of cxx11_std_atomic" (master...revert-3096-fixing_3031) https://git.io/vN0z6
<github> [hpx] msimberg pushed 1 new commit to master: https://git.io/vN0zP
<github> hpx/master 8bdfa43 Mikael Simberg: Merge pull request #3111 from STEllAR-GROUP/revert-3096-fixing_3031...
<K-ballo> woa
<K-ballo> simbergm: ?
daissgr has joined #ste||ar
<github> [hpx] hkaiser deleted revert-3096-fixing_3031 at bef7f3e: https://git.io/vN020
jfbastien_ has joined #ste||ar
kisaacs has joined #ste||ar
<kisaacs> So I had Katy also build hpx on rostam with "cmake -DHPX_WITH_CXX14=On -DCMAKE_CXX_COMPILER=clang++ -DCMAKE_CXX_FLAGS=-stdlib=libc++ -DCMAKE_BUILD_TYPE=Debug .." and the cmake command failed on the std::atomic check. She was on commit bb55a8b4 (merge cmake test for std::decay_t to fix cuda build). When she rolled back to the commit I was on yesterday (3507be48) it worked.
<hkaiser> kisaacs: yah, we just rolled that change back - needs more work
vamatya has joined #ste||ar
kisaacs has quit [Ping timeout: 240 seconds]
daissgr has quit [Ping timeout: 264 seconds]
rod_t has left #ste||ar ["Textual IRC Client: www.textualapp.com"]
vamatya has quit [Ping timeout: 268 seconds]
rod_t has joined #ste||ar
aserio has joined #ste||ar
diehlpk_work has joined #ste||ar
jaafar has joined #ste||ar
kisaacs has joined #ste||ar
rod_t has quit [Quit: My MacBook has gone to sleep. ZZZzzz…]
Smasher has joined #ste||ar
rod_t has joined #ste||ar
parsa has joined #ste||ar
<hkaiser> heller_: how do I make an external project link against the iostreams component?
<heller_> hkaiser: do you use hpx_setup_target or hpx_add_XXX?
<aserio> hkaiser: Good Afternoon!
<aserio> hkaiser: Does FLeCSI have an account number yet?
<heller_> hkaiser: for both, you should be able to say: COMPONENT_DEPENDENCIES iostreams (as an additional parameter to either of those functions)
<hkaiser> tks
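Put together, a hypothetical external CMakeLists.txt along the lines heller_ describes (project and target names are made up):

    cmake_minimum_required(VERSION 3.3)
    project(my_hpx_app CXX)

    find_package(HPX REQUIRED)

    # plain CMake target, wired up to HPX afterwards
    add_executable(my_hpx_app main.cpp)
    hpx_setup_target(my_hpx_app COMPONENT_DEPENDENCIES iostreams)

    # alternatively, the HPX helper does both steps in one go:
    # add_hpx_executable(my_hpx_app SOURCES main.cpp COMPONENT_DEPENDENCIES iostreams)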
<heller_> NB: I hate contention :/
<hkaiser> heh
<hkaiser> slippery like soap
<heller_> I got the overall single-threaded task/scheduling overhead reduced by a third
<heller_> so it's <1 us now on my system (started at 1.5 us); it scales nicely to 8 cores... once it crosses NUMA domains ... scaling stops :/
<heller_> this is 10 us grain size :D
<heller_> essentially, hpx::async([](){}).get() completes in <1us in that scenario
<hkaiser> nice
<heller_> getting there ... yes
<heller_> and greatly simplified the scheduling policies!
<heller_> no wait_or_add_new right now
<heller_> and potentially completely lock free, if it wasn't for contention avoidance
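As a rough illustration of the number being quoted, a minimal sketch of timing that round trip (this is not heller_'s actual benchmark; it assumes the usual hpx_main.hpp setup):

    #include <hpx/hpx_main.hpp>
    #include <hpx/include/async.hpp>

    #include <chrono>
    #include <iostream>

    int main()
    {
        constexpr int n = 100000;

        auto start = std::chrono::steady_clock::now();
        for (int i = 0; i != n; ++i)
            hpx::async([]() {}).get();   // one empty-task round trip
        auto stop = std::chrono::steady_clock::now();

        std::cout << std::chrono::duration<double, std::nano>(
                         stop - start).count() / n
                  << " ns per task round trip\n";
        return 0;
    }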
daissgr has joined #ste||ar
<hkaiser> good job!
<heller_> i found a promising concurrent queue implementation which I'll try tomorrow :D
<heller_> I ran their benchmark, which tests various multi producer/multi consumer scenarios against boost::lockfree::queue. the results showed at a minimum 2x speedup... and it is completely C++11 and BSL
<aserio> hkaiser: please see pm
vamatya has joined #ste||ar
vamatya has quit [Read error: Connection reset by peer]
<aserio> hkaiser: yt?
<hkaiser> aserio: here
<aserio> hkaiser: see pm
Guest82936 is now known as Vir
parsa has quit [Read error: Connection reset by peer]
gedaj has quit [Quit: Konversation terminated!]
parsa has joined #ste||ar
rod_t has quit [Quit: My MacBook has gone to sleep. ZZZzzz…]
rod_t has joined #ste||ar
<jbjnr_> simbergm: yes. Currently I leave the build directory between rebuilds of PRs and do not wipe the cache or binaries because I wanted a fast turnaround. But if you've been bitten by a faulty/stale cache entry and it made the tests give the wrong answers, then I'll put back the binary wipe between builds.
<jbjnr_> heller_: shall I test your stuff now?
<hkaiser> heller_: lock-free performance depends on the amount of contention
<hkaiser> and if it's their benchmark they will favour their own containers
<jbjnr_> kisaacs: yt? if you are, then I'd like to have a gsoc project to get our apex results working in your trace viewer
<hkaiser> but if it gives better results, by all means...
<jbjnr_> kisaacs: and then add new features like plotting statistics etc (particularly things like histograms of timing of certain tasks and linking them to the trace view)
<aserio> simbergm: yt?
mbremer has quit [Quit: Page closed]
Smasher has quit [Remote host closed the connection]
gedaj has joined #ste||ar
<heller_> hkaiser: not only the contended case is interesting, but also the uncontended one
<heller_> And I looked at the benchmark and actually ran it myself
<heller_> jbjnr_: sure test it, haven't pushed the latest yet
simbergm has quit [Quit: WeeChat 1.4]
kisaacs has quit [Ping timeout: 248 seconds]
ct-clmsn has joined #ste||ar
aserio has quit [Quit: aserio]
rod_t has quit [Quit: My MacBook has gone to sleep. ZZZzzz…]
kisaacs has joined #ste||ar
<hkaiser> heller_: sure, sure - you're my hero