hkaiser changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/
<heller_>
unless there is some strange locking going on or so
<khuck>
not at the os level
<khuck>
is there any way to capture "branching factor" - i.e. avg number of subtasks the tasks have
<khuck>
there are definitely synchronous points in the application where concurrency reduces to 1.
<khuck>
and they are frequent
<heller_>
looks like it, yeah
<heller_>
re branching factor: not that I know of
<heller_>
that would be good though
<khuck>
I think that's what our task dependency graphs are supposed to do. :)
<heller_>
an easy way to find out if only one OS thread executes all the work is by looking at the distribution of the tasks onto the different queues
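[Editor's note, not part of the log: one way to inspect that task distribution is via HPX's performance counters. A hedged sketch, assuming the standard counter CLI and the documented `/threads/count/cumulative` counter; the application name is hypothetical, and the counter syntax should be checked against your HPX version's manual.]

```shell
# Print, at shutdown, the cumulative number of tasks executed by each
# worker thread. A heavily skewed distribution suggests one OS thread
# is doing most of the work.
./my_hpx_app --hpx:threads=16 \
    --hpx:print-counter=/threads{locality#0/worker-thread#*}/count/cumulative
```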
<khuck>
I don't think that's the problem. The trace doesn't show that.
<heller_>
ok, so if the distribution of the tasks is good, there has to be some other factor that leads to the low CPU utilization
<heller_>
first thing popping to my mind here is IO ... but there isn't any
<heller_>
the other thing would be extensive usage of locks that are highly contended and access OS synchronization primitives
<khuck>
if they are HPX locks, they don't show up as system calls, do they?
<heller_>
nope
<heller_>
and if they are HPX locks, it would look like complete busy waiting to the CPU
<khuck>
hmmm
<khuck>
how could I see those?
<heller_>
unless they are really highly contended and the idle callback kicks in (which waits on an OS sync primitive, leading to reduced CPU utilization of that OS thread)
<heller_>
the itt notify API has a hook for them
<heller_>
HPX_ITT_SYNC
<khuck>
would that result in pthread_cond_timedwait()?
<khuck>
how would changing the lock make a difference?
hkaiser has joined #ste||ar
<heller_>
khuck: the lock is acquired for some operations on the GID; phylanx makes some use of those. this change switches from OS primitives to HPX ones, which should give you at least higher CPU utilization
<heller_>
what I am not 100% sure about is how this changes the actual wallclock time
<hkaiser>
there is simply too little parallelism in those examples
<khuck>
the movie database?
<khuck>
heller_: setting -DHPX_WITH_THREAD_MANAGER_IDLE_BACKOFF=OFF didn't have an effect
<heller_>
there are 30k tasks per second for 16 cores
<heller_>
that should be enough
<hkaiser>
thought so, it's just not the pthread_cond_wait that is the culprit
<heller_>
it is the symptom
<khuck>
not the culprit, could be the symptom
<hkaiser>
heller_: those are generated by the blaze backend, so it's very bursty
<heller_>
I see
<heller_>
so what you are saying is that we see lots of sequential HPX tasks which lead to that behaviour?
<hkaiser>
the execution_tree parallelism gives a factor of two only (A + B launches A and B concurrently)
<hkaiser>
yes
<heller_>
so if everything is executed as a direct action, this should improve things significantly, correct?
<khuck>
so I shouldn't worry about it?
<heller_>
at least in that example
<hkaiser>
the jump in execution time happened after the false sharing fixes we applied to the scheduler
<heller_>
that doesn't make sense
<hkaiser>
heller_: try it
<hkaiser>
but then you will not have any parallelism (except in the blaze backend)
<heller_>
sure, i'd just cut down the overhead of creating and scheduling tasks
<heller_>
for the execution tree at least
<hkaiser>
and remove the parallelism exposed by concurrently executing the tree branches on all levels
<heller_>
you are contradicting yourself
<hkaiser>
am I?
<hkaiser>
if you execute things using direct actions you cut off the execution tree parallelism
<heller_>
you argue that the example doesn't expose enough parallelism since the execution tree is mostly serial (except for the blaze based primitives)
<hkaiser>
right
<hkaiser>
by executing everything directly you make things worse
<hkaiser>
(to a certain extent)
<hkaiser>
the truth lies in between, we need to find the points at which to switch to direct execution
<heller_>
and at the same time you are saying that the execution tree is what leads to parallelism in the first place
<hkaiser>
doesn't it?
<heller_>
well, my assumption is that executing the execution tree as HPX tasks creates overhead, and that the true gain comes from blaze
<heller_>
when reducing this overhead, the overall execution time should go down
<heller_>
not saying that it's optimal, but should be better
<khuck>
btw, I think the power/clang stack bug still exists. I am disabling direct tasks on power again
<hkaiser>
yah, that PR I created a while back still needs attention
<heller_>
I rebased it onto master
<heller_>
should be good to go
<khuck>
hkaiser: btw, we talked about the policy for direct actions today, we are planning on discussing it tomorrow at the usual time
<hkaiser>
khuck: ok, I wanted to ask whether we plan to meet
<khuck>
heller_: the patch didn't make a difference, either
<heller_>
ok
<khuck>
it may just be the test case. but that doesn't explain why the performance keeps getting worse for the same problem over time.
<Yorlik>
I started building hpx with "vcpkg install hpx --triplet x64-windows" and I realize it's now also building Boost Coroutine as a dependency. AFAIK Coroutine is supposedly deprecated and replaced with Coroutine2. Could it be that the Boost dependencies in the vcpkg package need rework / are too many?
hello has quit [Ping timeout: 250 seconds]
<Yorlik>
On Linux (Debian 9) the hwloc build from vcpkg broke.
<simbergm>
we haven't usually uploaded RCs to stellar.cct.lsu.edu; can you use the tarball from github for testing it? I'll upload the actual release to stellar.cct.lsu.edu of course
<simbergm>
just wanted to avoid making 1.2.1 and not having it work
<hkaiser>
simbergm: we could just upload the files without exposing them through the web page
<simbergm>
hkaiser: good point...
<simbergm>
I'll have to do it on monday though
aserio has quit [Ping timeout: 240 seconds]
aserio has joined #ste||ar
<diehlpk_work>
simbergm, Your release candidate compiles
<heller_>
With gcc 9?
<hkaiser>
yes, with strange warnings
<heller_>
indeed
<heller_>
the migrate_component test has been hanging very frequently lately
<heller_>
in this specific piece of code or in general?
aserio has quit [Ping timeout: 240 seconds]
<K-ballo>
there are hierarchical when_alls in that specific piece of code, and I'd expect each when_all in general to have a corresponding shared-state allocation
aserio1 has quit [Ping timeout: 240 seconds]
<heller_>
oh, right, good catch
<heller_>
especially since the call to it is immediately blocking on the completion
aserio has joined #ste||ar
<diehlpk_work>
heller_, x86, i686 passed, including the example test
<diehlpk_work>
aarch64 and ppc is still compiling
<diehlpk_work>
Somehow arm7 failed
<diehlpk_work>
Ok, arm failed BUILDSTDERR: cc1plus: out of memory allocating 1333152 bytes after a total of 73457664 bytes
<diehlpk_work>
So we might not have an arm package
<heller_>
I wouldn't run HPX on 32 bit anyways
<diehlpk_work>
Why not?
<heller_>
address space limitations
<heller_>
it works, but not very nicely
<diehlpk_work>
I think as long as it works, we should keep it