hkaiser changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/
<heller_>
unless there is some strange locking going on or so
<khuck>
not at the os level
<khuck>
is there any way to capture "branching factor" - i.e. avg number of subtasks the tasks have
<khuck>
there are definitely synchronous points in the application where concurrency reduces to 1.
<khuck>
and they are frequent
<heller_>
looks like it, yeah
<heller_>
re branching factor: not that I know of
<heller_>
that would be good though
<khuck>
I think that's what our task dependency graphs are supposed to do. :)
<heller_>
an easy way to find out if only one OS thread executes all the work is by looking at the distribution of the tasks onto the different queues
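[Editor's note, not part of the log: one way to inspect that task distribution is via HPX's performance counters. A hedged sketch, assuming the standard counter CLI and the documented `/threads/count/cumulative` counter; the application name is hypothetical, and the counter syntax should be checked against your HPX version's manual.]

```shell
# Print, at shutdown, the cumulative number of tasks executed by each
# worker thread. A heavily skewed distribution suggests one OS thread
# is doing most of the work.
./my_hpx_app --hpx:threads=16 \
    --hpx:print-counter=/threads{locality#0/worker-thread#*}/count/cumulative
```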
<khuck>
I don't think that's the problem. The trace doesn't show that.
<heller_>
ok, so if the distribution of the tasks is good, there has to be some other factor that leads to the low CPU utilization
<heller_>
first thing popping to my mind here is IO ... but there isn't any
<heller_>
the other thing would be extensive usage of locks that are highly contended and access OS synchronization primitives
<khuck>
if they are HPX locks, they don't show up as system calls, do they?
<heller_>
nope
<heller_>
and if they are HPX locks, it would look like complete busy waiting to the CPU
<khuck>
hmmm
<khuck>
how could I see those?
<heller_>
unless they are really highly contended and the idle callback kicks in (which waits on an OS sync primitive, leading to reduced CPU utilization of that OS thread)
<heller_>
the itt notify API has a hook for them
<heller_>
HPX_ITT_SYNC
<khuck>
would that result in pthread_cond_timedwait()?
<khuck>
how would changing the lock make a difference?
hkaiser has joined #ste||ar
<heller_>
khuck: the lock is acquired for some operations on the GID; phylanx makes some use of those. this change switches from OS primitives to HPX ones, which should give you at least higher CPU utilization
<heller_>
what I am not 100% sure about is how this changes the actual wallclock time
<hkaiser>
there is simply too little parallelism in those examples
<khuck>
the movie database?
<khuck>
heller_: setting -DHPX_WITH_THREAD_MANAGER_IDLE_BACKOFF=OFF didn't have an effect
<heller_>
there are 30k tasks per second for 16 cores
<heller_>
that should be enough
<hkaiser>
thought so, it's just not the pthread_cond_wait that is the culprit
<heller_>
it is the symptom
<khuck>
not the culprit, could be the symptom
<hkaiser>
heller_: those are generated by the blaze backend, so it's very bursty
<heller_>
I see
<heller_>
so what you are saying is that we see lots of sequential HPX tasks which lead to that behaviour?
<hkaiser>
the execution_tree parallelism gives a factor of two only (A + B launches A and B concurrently)
<hkaiser>
yes
<heller_>
so if everything is executed as a direct action, this should improve things significantly, correct?
<khuck>
so I shouldn't worry about it?
<heller_>
at least in that example
<hkaiser>
the jump in execution time happened after the false sharing fixes we applied to the scheduler
<heller_>
that doesn't make sense
<hkaiser>
heller_: try it
<hkaiser>
but then you will not have any parallelism (except in the blaze backend)
<heller_>
sure, i'd just cut down the overhead of creating and scheduling tasks
<heller_>
for the execution tree at least
<hkaiser>
and remove the parallelism exposed by concurrently executing the tree branches on all levels
<heller_>
you are contradicting yourself
<hkaiser>
am I?
<hkaiser>
if you execute things using direct actions you cut off the execution tree parallelism
<heller_>
you argue that the example doesn't expose enough parallelism since the execution tree is mostly serial (except for the blaze based primitives)
<hkaiser>
right
<hkaiser>
by executing everything directly you make things worse
<hkaiser>
(to a certain extent)
<hkaiser>
the truth lies in between, we need to find the points at which to switch to direct execution
<heller_>
and at the same time you are saying that the execution tree is what leads to parallelism in the first place
<hkaiser>
doesn't it?
<heller_>
well, my assumption is that executing the execution tree as HPX tasks creates overhead, and that the true gain comes from blaze
<heller_>
when reducing this overhead, the overall execution time should go down
<heller_>
not saying that it's optimal, but should be better
<khuck>
btw, I think the power/clang stack bug still exists. I am disabling direct tasks on power again
<hkaiser>
yah, that PR I created a while back still needs attention
<heller_>
I rebased it onto master
<heller_>
should be good to go
<khuck>
hkaiser: btw, we talked about the policy for direct actions today, we are planning on discussing it tomorrow at the usual time
<hkaiser>
khuck: ok, I wanted to ask whether we plan to meet
<khuck>
heller_: the patch didn't make a difference, either
<heller_>
ok
<khuck>
it may just be the test case. but that doesn't explain why the performance keeps getting worse for the same problem over time.
<Yorlik>
I started building hpx with "vcpkg install hpx --triplet x64-windows" and I realize it's now also building Boost Coroutine as a dependency. AFAIK Coroutine is supposedly deprecated and replaced with Coroutine2. Could it be that the Boost dependencies in the vcpkg package need rework / are too many?
hello has quit [Ping timeout: 250 seconds]
<Yorlik>
On Linux (Debian 9) the hwloc build from vcpkg broke.
<simbergm>
we haven't usually uploaded RCs to stellar.cct.lsu.edu; can you use the tarball from github for testing it? I'll upload the actual release to stellar.cct.lsu.edu of course
<simbergm>
just wanted to avoid making 1.2.1 and not having it work
<hkaiser>
simbergm: we could just upload the files without exposing them through the web page
<simbergm>
hkaiser: good point...
<simbergm>
I'll have to do it on monday though
aserio has quit [Ping timeout: 240 seconds]
aserio has joined #ste||ar
<diehlpk_work>
simbergm, Your release candidate compiles
<heller_>
With gcc 9?
<hkaiser>
yes, with strange warnings
<heller_>
indeed
<heller_>
the migrate_component test has been hanging very frequently lately
<heller_>
in this specific piece of code or in general?
aserio has quit [Ping timeout: 240 seconds]
<K-ballo>
there are hierarchical when_alls in that specific piece of code, and I'd expect each when_all in general to have a corresponding shared-state allocation
aserio1 has quit [Ping timeout: 240 seconds]
<heller_>
oh, right, good catch
<heller_>
especially since the call to it is immediately blocking on the completion
aserio has joined #ste||ar
<diehlpk_work>
heller_, x86, i686 passed, including the example test
<diehlpk_work>
aarch64 and ppc is still compiling
<diehlpk_work>
Somehow arm7 failed
<diehlpk_work>
Ok, arm failed BUILDSTDERR: cc1plus: out of memory allocating 1333152 bytes after a total of 73457664 bytes
<diehlpk_work>
So we might not have an arm package
<heller_>
I wouldn't run HPX on 32 bit anyways
<diehlpk_work>
Why not?
<heller_>
address space limitations
<heller_>
it works, but not very nicely
<diehlpk_work>
I think as long as it works, we should keep it