aserio changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/
diehlpk has quit [Ping timeout: 248 seconds]
<heller_> hkaiser: tatata, got the scheduling overhead down by 7% with my changes so far, the interesting optimizations are still to come ;)
<hkaiser> cool
<hkaiser> don't forget to actually delete the threads ;)
<heller_> that was version 0 :P
<heller_> the htts scaling got a nice speedup of 1.3 already
<hkaiser> nice
<heller_> the only contention I removed so far was just due to refcounting...
<heller_> of 1) thread_id_type and 2) coroutine
<heller_> getting rid of the intrusive ptr in coroutine didn't have a huge impact though
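For context, the change being discussed amounts to replacing a reference-counted handle with a plain non-owning wrapper, so that copying a thread id no longer touches an atomic refcount. A minimal sketch, with illustrative names (thread_data, thread_id) rather than the actual HPX types from the branch:

    #include <functional>

    struct thread_data;   // opaque; owned and recycled by the scheduler

    // Non-owning thread id: copying it is a plain pointer copy, so there is
    // no atomic increment/decrement per copy as with boost::intrusive_ptr.
    class thread_id
    {
        thread_data* ptr_ = nullptr;

    public:
        thread_id() = default;
        explicit thread_id(thread_data* p) noexcept : ptr_(p) {}

        thread_data* get() const noexcept { return ptr_; }

        friend bool operator==(thread_id lhs, thread_id rhs) noexcept
        {
            return lhs.ptr_ == rhs.ptr_;
        }
        friend bool operator<(thread_id lhs, thread_id rhs) noexcept
        {
            // std::less avoids unspecified behavior for unrelated pointers
            return std::less<thread_data*>{}(lhs.ptr_, rhs.ptr_);
        }
    };

The trade-off is that something else (the scheduler) must now own the thread_data lifetime, which is exactly why the threads have to be deleted explicitly.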
diehlpk has joined #ste||ar
rod_t has quit [Quit: My MacBook has gone to sleep. ZZZzzz…]
jaafar has quit [Ping timeout: 268 seconds]
kisaacs has quit [Ping timeout: 256 seconds]
diehlpk has quit [Ping timeout: 248 seconds]
parsa has quit [Quit: Zzzzzzzzzzzz]
rod_t has joined #ste||ar
<hkaiser> heller_: I'm not sure if things just work because you don't actually delete the threads
<hkaiser> or do you do that now?
eschnett has joined #ste||ar
kisaacs has joined #ste||ar
auviga has quit [*.net *.split]
auviga has joined #ste||ar
kisaacs has quit [Ping timeout: 240 seconds]
rod_t has quit [Quit: My MacBook has gone to sleep. ZZZzzz…]
hkaiser has quit [Quit: bye]
kisaacs has joined #ste||ar
parsa has joined #ste||ar
vamatya has quit [Ping timeout: 256 seconds]
vamatya has joined #ste||ar
bibek has quit [Remote host closed the connection]
bibek has joined #ste||ar
heller_ has quit [*.net *.split]
parsa[w] has quit [*.net *.split]
zbyerly__ has quit [*.net *.split]
heller_ has joined #ste||ar
rtohid has joined #ste||ar
zbyerly__ has joined #ste||ar
parsa[w] has joined #ste||ar
rod_t has joined #ste||ar
rod_t has quit [Quit: My MacBook has gone to sleep. ZZZzzz…]
nanashi55 has quit [Ping timeout: 248 seconds]
nanashi55 has joined #ste||ar
parsa has quit [Quit: *yawn*]
kisaacs has quit [Quit: leaving]
rod_t has joined #ste||ar
jbjnr_ has joined #ste||ar
jbjnr has quit [Ping timeout: 255 seconds]
jaafar has joined #ste||ar
vamatya has quit [Ping timeout: 264 seconds]
rod_t has quit [Quit: My MacBook has gone to sleep. ZZZzzz…]
mcopik has joined #ste||ar
<github> [hpx] msimberg created revert-3080-suspend-pool (+1 new commit): https://git.io/vNRrL
<github> hpx/revert-3080-suspend-pool a772ea8 Mikael Simberg: Revert "Suspend thread pool"
<github> [hpx] msimberg opened pull request #3108: Revert "Suspend thread pool" (master...revert-3080-suspend-pool) https://git.io/vNRrB
<github> [hpx] msimberg pushed 1 new commit to master: https://git.io/vNRrR
<github> hpx/master e1e41ca Mikael Simberg: Merge pull request #3108 from STEllAR-GROUP/revert-3080-suspend-pool...
david_pfander has joined #ste||ar
Smasher has joined #ste||ar
Smasher has quit [Remote host closed the connection]
<heller_> jbjnr_: got a 7% performance increase so far ... thread_map_ is next
jaafar has quit [Ping timeout: 276 seconds]
<heller_> also found a nice concurrentqueue implementation: https://github.com/cameron314/concurrentqueue
<heller_> which is supposed to be super fast and seems to support our scheduler model *pretty* nicely
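The library's core API is small; a sketch of how it could hold pending tasks (thread_data here is just a stand-in, not the scheduler integration from the branch):

    #include "concurrentqueue.h"   // single header from cameron314/concurrentqueue

    struct thread_data {};         // stand-in for a task/thread descriptor

    int main()
    {
        // lock-free multi-producer/multi-consumer queue
        moodycamel::ConcurrentQueue<thread_data*> pending;

        thread_data task;
        pending.enqueue(&task);            // any thread may push work

        thread_data* next = nullptr;
        while (pending.try_dequeue(next))  // any worker may pop work
        {
            // ... schedule/run *next ...
        }
        return 0;
    }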
simbergm has quit [Ping timeout: 256 seconds]
simbergm has joined #ste||ar
<github> [hpx] sithhell force-pushed fix_thread_overheads from 09a6fa1 to 77fdf0d: https://git.io/vNByp
<github> hpx/fix_thread_overheads dc01f0e Thomas Heller: Don't use boost::intrusive_ptr for thread_id_type...
<github> hpx/fix_thread_overheads 9d421e6 Thomas Heller: Fixing thread scheduling when yielding a thread id....
<github> hpx/fix_thread_overheads 77fdf0d Thomas Heller: Cleaning up coroutine implementation...
<github> [hpx] sithhell created fix_scheduling (+1 new commit): https://git.io/vNRyA
<github> hpx/fix_scheduling 5a0d3f8 Thomas Heller: Fixing thread scheduling when yielding a thread id....
<github> [hpx] sithhell opened pull request #3109: Fixing thread scheduling when yielding a thread id. (master...fix_scheduling) https://git.io/vNRSG
Vir is now known as Guest82936
heller_ has quit [Quit: http://quassel-irc.org - Chat comfortably. Anywhere.]
heller_ has joined #ste||ar
simbergm has quit [Ping timeout: 256 seconds]
<github> [hpx] sithhell force-pushed fix_scheduling from 5a0d3f8 to b709d11: https://git.io/vNRHg
<github> hpx/fix_scheduling b709d11 Thomas Heller: Fixing thread scheduling when yielding a thread id....
simbergm has joined #ste||ar
<github> [hpx] StellarBot pushed 1 new commit to gh-pages: https://git.io/vNRdh
<github> hpx/gh-pages c8dcc15 StellarBot: Updating docs
gedaj has quit [Ping timeout: 248 seconds]
hkaiser has joined #ste||ar
<heller_> so ... thread_map_ removal almost complete... tests are passing so far.
<hkaiser> heller_: the PR you submitted makes (almost) all tests fail
<heller_> /scratch/snx3000/biddisco/pycicle/build/3109-clang-6.0.0/bin/for_each_annotated_function_test: error while loading shared libraries: /scratch/snx3000/biddisco/pycicle/build/3109-clang-6.0.0/lib/libhpx.so.1: file too short
<heller_> looks like a filesystem problem
<hkaiser> ok
<heller_> anyways ... regarding thread_map_ removal, there are two problems right now: 1) abort_all_suspended_threads 2) enumerate_threads
<heller_> I'd like to solve the second point by conditionally enabling thread_map_ again if the feature is needed; IIUC, that's only needed for debugging purposes
<heller_> the first point ... I am actually not sure if it is needed after all
<heller_> if it is needed, I guess the only way around it is to have a suspended_threads_ member, which is populated when a thread suspends
<hkaiser> heller_: and for the perf counters
<heller_> for the perf counters, I only need the count
<hkaiser> k
<heller_> which should be straight forward, I guess
<hkaiser> creating a list of suspended threads adds the overhead of adding/removing a thread to/from some list at each suspension point, instead of once as we have it now
<heller_> yes ... that's why I am hesitant here
<hkaiser> it also would require acquiring a lock for each thread suspension/resumption
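The trade-off under discussion, as a minimal sketch (not the actual branch): an atomic count is enough for the performance counters, while a separate suspended-thread set would cost a locked insert/erase at every suspension point instead of one registration per thread.

    #include <atomic>
    #include <cstddef>
    #include <mutex>
    #include <unordered_set>

    struct thread_data;

    class scheduler_state
    {
        std::atomic<std::size_t> thread_count_{0};   // enough for the counters

        // only needed if abort_all_suspended_threads/enumerate_threads must
        // still work; the price is a lock per suspension/resumption
        std::mutex mtx_;
        std::unordered_set<thread_data*> suspended_;

    public:
        void on_create()  { ++thread_count_; }
        void on_destroy() { --thread_count_; }

        void on_suspend(thread_data* t)
        {
            std::lock_guard<std::mutex> l(mtx_);
            suspended_.insert(t);
        }
        void on_resume(thread_data* t)
        {
            std::lock_guard<std::mutex> l(mtx_);
            suspended_.erase(t);
        }

        std::size_t count() const { return thread_count_.load(); }
    };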
xxl- has quit [Quit: AndroIRC - Android IRC Client ( http://www.androirc.com )]
<heller_> so the question is, what harm is there if we just let those die ungracefully
<heller_> one harm would be that we'd leak memory
<hkaiser> things might hang in distributed if we did that
<heller_> not sure
<heller_> the test suite completes so far. the question is how often we run into that corner case in real, large-scale runs
<hkaiser> heller_: shrug
<hkaiser> the test suite will not trigger this case
<heller_> calling the coroutine with wait_abort will just make it yield again immediately, without further executing user code, won't it?
<hkaiser> no
<heller_> the task itself can't continue, at least
<heller_> since there was a reason why it got suspended in the first place
<heller_> right, so more or less the same effect. it does do proper cleanup of the stack etc., of course
<hkaiser> and the thread might handle the exception properly, releasing other threads
<hkaiser> the exception is thrown in the context of the suspended thread
<heller_> yes
<hkaiser> I agree its a corner case, but it's there for a reason, I believe
<heller_> yeah, the big question is whether all we do, if we omit this, is leak resources
<heller_> since it's the shutdown of the process ...
<hkaiser> then other localities might never be notified - so you need a way to tell the others to cancel everything as well
<heller_> yeah, such that a future is never set to ready
<hkaiser> or will they naturally hit the same code path?
<hkaiser> right
<heller_> it will only hit it, if that function is called after the termination detection
<heller_> if it is, I would even assert that there are no remaining suspended threads
<heller_> which it is
<hkaiser> k
<hkaiser> I don't fully understand the consequences
<heller_> me neither
<heller_> requires more testing, obviously
<heller_> and understanding
<heller_> let me first check if all this is worth the effort ;)
<hkaiser> k
<hkaiser> but remember, as always - making a fast program correct is nearly impossible
<heller_> that's the big question here
<heller_> looks like this corner case is a road block
<heller_> at least at first sight
<hkaiser> it's always the corner cases ruining everything ;)
<heller_> I'd still opt for a solution to track suspended tasks. After all, what we want are continuation-based applications, which rarely suspend
<hkaiser> uhh
<heller_> ;)
<hkaiser> what about spinlock?
<heller_> never suspends
<hkaiser> not true
<heller_> it yields, but not as 'suspended', just as 'pending' or 'pending_boost'
<heller_> the pending ones are not a big deal
<hkaiser> what about mutex or CV
<heller_> without the thread_map, the only ones that aren't known to the scheduling policy are the ones that are truly suspended, waiting to be put into pending again (for example condition_variable::wait)
<hkaiser> well try it - this will make things only worse
<heller_> lcos::local::mutex is currently only used in a test :P
<hkaiser> shrug
<hkaiser> CV is used everywhere
<heller_> yes, in a future, for example
<heller_> in this scenario, you only pay when suspending, not always
<heller_> that's the idea ...
<heller_> for all the tests I ran so far, it wasn't needed after all
<github> [hpx] msimberg opened pull request #3110: WIP: Suspend thread pool (master...suspend-pool) https://git.io/vNRj7
<heller_> the problem is, removing thread_map_ actually slowed down the whole thing :/
<heller_> well, brought it back to where I started...
<hkaiser> heller_: but you pay more than once for a thread, possibly - whereas before we paid once per thread
<jbjnr_> heller_: 7% (ignoring your latest fail) - I'm starting to get interested. Let me know when you want a tester.
<jbjnr_> Also - bear in mind that cscs is using hpx without networking, so we don't care about distributed hpx problems for now
<heller_> jbjnr_: feel free to try it, the current state of the branch is what gave me the 7%
<jbjnr_> not sure why removing thread map would slow things down ...
<heller_> I think I know why...
<jbjnr_> why
<heller_> the problem is that thread creation still requires a lock
<heller_> and thread recycling
<heller_> that's the next step...
<heller_> and the one after that, would be to remove the extra staging queue
<heller_> but as hkaiser always points out: conservation of energy is everywhere...
<hkaiser> conservation of contention ;)
<hkaiser> but you're onto something, for sure
<heller_> contention is our friction point, where the energy escapes the system ;)
<hkaiser> 7% is significant in this context as it essentially allows us to reduce the minimal efficient thread length by the same amount
<heller_> yeah
<hkaiser> and that's at least a factor of 100 in terms of time
<hkaiser> if you save 100 ns per thread, that gives us a win of 10 us for the minimal efficient thread runtime
<heller_> yeah, which is huge
<hkaiser> nod
<heller_> my big problem is the huge time spent on the whole turnaround to schedule threads ... just creating a coroutine and doing the switching is on the order of 100 nanoseconds ... once the scheduler is involved, we are on the order of a microsecond
<heller_> and I don't accept that ;)
<hkaiser> heller_: well, I'd say the switching is in the order of 300-400ns in one direction
<heller_> not on my system with my benchmark
<hkaiser> shrug
<hkaiser> your benchmark - your rules ;)
<heller_> 149 ns for thrd() to return, that is switching to the context and immediately returning, aka yielding back to the caller context
<hkaiser> heller_: sure, in that case everything is in the cache
<hkaiser> L1 cache, that is
<hkaiser> that's not realistic in the general case
<heller_> that's in the order of 300 cycles on my machine, btw
<hkaiser> nod
<heller_> it sure is an optimal scenario
<jbjnr_> heller_: those locks are there to reduce contention, but they could probably be replaced with spinlocks instead of full pthread locks. (I'd guess). gtg meeting.
<heller_> jbjnr_: yeah ... contention is a problem now...
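A generic test-and-set spinlock of the kind jbjnr_ has in mind (this is not HPX's own hpx::lcos::local::spinlock, just a sketch) is small:

    #include <atomic>
    #include <thread>

    // Cheap when uncontended; spins and yields instead of blocking in the
    // kernel the way a full pthread mutex does.
    class spinlock
    {
        std::atomic_flag flag_ = ATOMIC_FLAG_INIT;

    public:
        void lock()
        {
            while (flag_.test_and_set(std::memory_order_acquire))
                std::this_thread::yield();   // back off under contention
        }
        void unlock()
        {
            flag_.clear(std::memory_order_release);
        }
    };

Inside HPX tasks one would yield the HPX thread rather than the OS thread, but the structure is the same.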
simbergm has quit [Ping timeout: 268 seconds]
samuel has joined #ste||ar
rod_t has joined #ste||ar
simbergm has joined #ste||ar
eschnett has quit [Quit: eschnett]
<github> [hpx] hkaiser force-pushed fixing_local_direct_actions from 1ebb3a3 to 6706984: https://git.io/vNBVp
<github> hpx/fixing_local_direct_actions 6706984 Hartmut Kaiser: Local execution of direct actions is now actually performed directly...
<github> [hpx] hkaiser force-pushed fixing_3102 from f0055e7 to d73fdb0: https://git.io/vNBjY
<github> hpx/fixing_3102 d73fdb0 Hartmut Kaiser: Adding support for generic counter_raw_values performance counter type
<hkaiser> heller_: what will happen to #3105?
<github> [hpx] hkaiser closed pull request #3107: Remove UB from thread::id relational operators (master...thread-id-relops) https://git.io/vNBi8
<github> [hpx] hkaiser pushed 4 new commits to master: https://git.io/vN0Ck
<github> hpx/master 8ef85e4 Mikael Simberg: Add cmake test for std::decay_t to fix cuda build
<github> hpx/master 9fdf28e Mikael Simberg: Revert "Add cmake test for std::decay_t to fix cuda build"...
<github> hpx/master a55436c Mikael Simberg: Use std::decay instead of decay_t in optional...
david_pfander has quit [Ping timeout: 240 seconds]
simbergm has quit [Ping timeout: 256 seconds]
gedaj has joined #ste||ar
simbergm has joined #ste||ar
<heller_> hkaiser: the test is broken...
<heller_> it has to be fixed somehow, I guess
<heller_> guess the test frames need to derive from future_data, and use the inplace tags
<simbergm> you keep the cmake cache around between commits?
<heller_> hkaiser: but you were absolutely correct, it's only a problem with the test, the when_all and dataflow frames are fine. The async traversal stuff is conflating several different things without a consistent way of handling them
eschnett has joined #ste||ar
<hkaiser> heller_: it's better than what we've had before, though
<hkaiser> simbergm: I do, but that requires care and awareness
<hkaiser> ;)
<heller_> in terms of keeping us busy, yeah :P
<hkaiser> come on
<heller_> I didn't have the time to attempt a proper fix
<heller_> but it looks like the in_place thing is the only sensible option to use with the async traversal
<heller_> NB: my latest improvements brought down the time for a single task roundtrip by 26%. The only problem now is that it doesn't scale as much as I'd like it to ;)
<github> [hpx] msimberg created revert-3096-fixing_3031 (+1 new commit): https://git.io/vN0z8
<github> hpx/revert-3096-fixing_3031 bef7f3e Mikael Simberg: Revert "fix detection of cxx11_std_atomic"
<K-ballo> what's the story there ^ ?
<github> [hpx] msimberg opened pull request #3111: Revert "fix detection of cxx11_std_atomic" (master...revert-3096-fixing_3031) https://git.io/vN0z6
<github> [hpx] msimberg pushed 1 new commit to master: https://git.io/vN0zP
<github> hpx/master 8bdfa43 Mikael Simberg: Merge pull request #3111 from STEllAR-GROUP/revert-3096-fixing_3031...
<K-ballo> woa
<K-ballo> simbergm: ?
daissgr has joined #ste||ar
<github> [hpx] hkaiser deleted revert-3096-fixing_3031 at bef7f3e: https://git.io/vN020
jfbastien_ has joined #ste||ar
kisaacs has joined #ste||ar
<kisaacs> So I had Katy also build hpx on rostam with "cmake -DHPX_WITH_CXX14=On -DCMAKE_CXX_COMPILER=clang++ -DCMAKE_CXX_FLAGS=-stdlib=libc++ -DCMAKE_BUILD_TYPE=Debug .." and the cmake command failed on the std::atomic check. She was on commit bb55a8b4 (merge cmake test for std::decay_t to fix cuda build). When she rolled back to the commit I was on yesterday (3507be48) it worked.
<hkaiser> kisaacs: yah, we just rolled that change back - needs more work
vamatya has joined #ste||ar
kisaacs has quit [Ping timeout: 240 seconds]
daissgr has quit [Ping timeout: 264 seconds]
rod_t has left #ste||ar ["Textual IRC Client: www.textualapp.com"]
vamatya has quit [Ping timeout: 268 seconds]
rod_t has joined #ste||ar
aserio has joined #ste||ar
diehlpk_work has joined #ste||ar
jaafar has joined #ste||ar
kisaacs has joined #ste||ar
rod_t has quit [Quit: My MacBook has gone to sleep. ZZZzzz…]
Smasher has joined #ste||ar
rod_t has joined #ste||ar
parsa has joined #ste||ar
<hkaiser> heller_: how do I make an external project link against the iostreams component?
<heller_> hkaiser: do you use hpx_setup_target or hpx_add_XXX?
<aserio> hkaiser: Good Afternoon!
<aserio> hkaiser: Does FLeCSI have an account number yet?
<heller_> hkaiser: for both, you should be able to say: COMPONENT_DEPENDENCIES iostreams (as an additional parameter to either of those functions)
<hkaiser> tks
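Put together, a hypothetical external CMakeLists.txt along the lines heller_ describes (project and target names are made up):

    cmake_minimum_required(VERSION 3.3)
    project(my_hpx_app CXX)

    find_package(HPX REQUIRED)

    # plain CMake target, wired up to HPX afterwards
    add_executable(my_hpx_app main.cpp)
    hpx_setup_target(my_hpx_app COMPONENT_DEPENDENCIES iostreams)

    # alternatively, the HPX helper does both steps in one go:
    # add_hpx_executable(my_hpx_app SOURCES main.cpp COMPONENT_DEPENDENCIES iostreams)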
<heller_> NB: I hate contention :/
<hkaiser> heh
<hkaiser> slippery like soap
<heller_> I got the overall single-threaded task/scheduling overhead reduced by a third
<heller_> so it's <1 us now on my system (started at 1.5 us); it scales nicely to 8 cores... once it crosses NUMA domains ... scaling stops :/
<heller_> this is 10 us grain size :D
<heller_> essentially, hpx::async([](){}).get() completes in <1us in that scenario
<hkaiser> nice
<heller_> getting there ... yes
<heller_> and greatly simplified the scheduling policies!
<heller_> no wait_or_add_new right now
<heller_> and potentially completely lock free, if it wasn't for contention avoidance
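As a rough illustration of the number being quoted, a minimal sketch of timing that round trip (this is not heller_'s actual benchmark; it assumes the usual hpx_main.hpp setup):

    #include <hpx/hpx_main.hpp>
    #include <hpx/include/async.hpp>

    #include <chrono>
    #include <iostream>

    int main()
    {
        constexpr int n = 100000;

        auto start = std::chrono::steady_clock::now();
        for (int i = 0; i != n; ++i)
            hpx::async([]() {}).get();   // one empty-task round trip
        auto stop = std::chrono::steady_clock::now();

        std::cout << std::chrono::duration<double, std::nano>(
                         stop - start).count() / n
                  << " ns per task round trip\n";
        return 0;
    }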
daissgr has joined #ste||ar
<hkaiser> good job!
<heller_> i found a promising concurrent queue implementation which I'll try tomorrow :D
<heller_> I ran their benchmark, which tests various multi producer/multi consumer scenarios against boost::lockfree::queue. the results showed at a minimum 2x speedup... and it is completely C++11 and BSL
<aserio> hkaiser: please see pm
vamatya has joined #ste||ar
vamatya has quit [Read error: Connection reset by peer]
<aserio> hkaiser: yt?
<hkaiser> aserio: here
<aserio> hkaiser: see pm
Guest82936 is now known as Vir
parsa has quit [Read error: Connection reset by peer]
gedaj has quit [Quit: Konversation terminated!]
parsa has joined #ste||ar
rod_t has quit [Quit: My MacBook has gone to sleep. ZZZzzz…]
rod_t has joined #ste||ar
<jbjnr_> simbergm: yes. Currently I leave the build directory between rebuilds of PRs and do not wipe the cache or binaries because I wanted a fast turnaround. But if you've been bitten by a faulty/stale cache entry and it made the tests give the wrong answers, then I'll put back the binary wipe between builds.
<jbjnr_> heller_: shall I test your stuff now?
<hkaiser> heller_: lock-free performance depends on the amount of contention
<hkaiser> and if it's their benchmark they will favour their own containers
<jbjnr_> kisaacs: yt? if you are, then I'd like to have a gsoc project to get our apex results working in your trace viewer
<hkaiser> but if it gives better results, by all means...
<jbjnr_> kisaacs: and then add new features like plotting statistics etc (particularly things like histograms of timing of certain tasks and linking them to the trace view)
<aserio> simbergm: yt?
mbremer has quit [Quit: Page closed]
Smasher has quit [Remote host closed the connection]
gedaj has joined #ste||ar
<heller_> hkaiser: not only the contended case is interesting, but also the uncontended one
<heller_> And I looked at the benchmark and actually ran it myself
<heller_> jbjnr_: sure test it, haven't pushed the latest yet
simbergm has quit [Quit: WeeChat 1.4]
kisaacs has quit [Ping timeout: 248 seconds]
ct-clmsn has joined #ste||ar
aserio has quit [Quit: aserio]
rod_t has quit [Quit: My MacBook has gone to sleep. ZZZzzz…]
kisaacs has joined #ste||ar
<hkaiser> heller_: sure, sure - you're my hero