<jaafar>
Is there any place I could log scheduling decisions within HPX? I'm seeing some idle periods that would be very interesting to understand
<jaafar>
Mysterious gaps, for one - every so often a benchmark run has what appears to be idle periods in the middle
<jaafar>
but there are also just "decisions" I don't understand.
<jaafar>
For example, the scan algorithms begin by launching async tasks for each chunk as the first stage
<jaafar>
Then all the dataflow items are entered, which depend on the async tasks and also each other
<jaafar>
What I see is that in almost all cases the initial async tasks are executed before the dataflow continuations, even if the dataflow is ready to go
<jaafar>
I'm going to attach some pictures to my bug report
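For context, a minimal sketch of the pattern described above, assuming hpx::async and hpx::dataflow; this is not HPX's actual scan implementation, and the chunk count, lambdas, and stage split are illustrative assumptions only:

    // Stage 1: one async task per chunk; stage 2: dataflow continuations
    // that depend on the stage-1 tasks and on each other (running offsets).
    #include <hpx/hpx_main.hpp>
    #include <hpx/hpx.hpp>

    #include <iostream>
    #include <numeric>
    #include <vector>

    int main()
    {
        std::vector<int> data(1 << 20, 1);
        std::size_t const num_chunks = 8;
        std::size_t const chunk = data.size() / num_chunks;

        // Stage 1: per-chunk async tasks computing local sums.
        std::vector<hpx::shared_future<int>> stage1;
        for (std::size_t i = 0; i != num_chunks; ++i)
        {
            stage1.push_back(hpx::async([&data, i, chunk] {
                return std::accumulate(data.begin() + i * chunk,
                    data.begin() + (i + 1) * chunk, 0);
            }));
        }

        // Stage 2: dataflow continuations forming a running offset; each one
        // depends on the previous offset and on the matching stage-1 result.
        std::vector<hpx::shared_future<int>> offsets;
        offsets.push_back(hpx::make_ready_future(0));
        for (std::size_t i = 0; i != num_chunks; ++i)
        {
            offsets.push_back(hpx::dataflow(
                [](hpx::shared_future<int> prev, hpx::shared_future<int> sum) {
                    return prev.get() + sum.get();
                },
                offsets.back(), stage1[i]));
        }

        // Stage 3 (applying each offset back to its chunk) is elided here.
        std::cout << "total: " << offsets.back().get() << "\n";
        return 0;
    }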
<hkaiser>
jaafar: interesting insight
<hkaiser>
we have no way of logging this, sorry
<jaafar>
hkaiser: if you can point me to somewhere in the code I might be able to :)
<jaafar>
also see my issue update for a nice picture
<hkaiser>
point where
<hkaiser>
?
<jaafar>
I mean, assuming there is a point where the "next task" is chosen from a set of available work
<jaafar>
a point in the code
<jaafar>
I am using linux tracepoints
<hkaiser>
interesting graph
<hkaiser>
jaafar: I can point you to the scheduling loop
<hkaiser>
jaafar: good luck - this is the real center of the action, but usually difficult to follow, especially with more than one thread
<hkaiser>
(core)
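In case it helps others following along: one way to hook such a spot up to Linux tracing is a static userspace (SDT/USDT) probe via <sys/sdt.h>, which perf or bpftrace can attach to. The function and argument names below are placeholders only, not HPX's actual scheduling-loop code:

    // Hypothetical placement: call this where the scheduler has just picked
    // the next HPX thread to run.  Requires the systemtap SDT header.
    #include <sys/sdt.h>
    #include <cstdint>

    inline void trace_thread_selected(std::uint64_t thread_id, int priority)
    {
        // Provider "hpx", probe name "thread_selected", two arguments;
        // appears as sdt_hpx:thread_selected once the binary is registered
        // with perf (perf buildid-cache --add <binary>).
        DTRACE_PROBE2(hpx, thread_selected, thread_id, priority);
    }

After registering the binary, perf list shows the probe and perf record (or a bpftrace usdt: probe) can record every scheduling decision with its arguments.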
<hkaiser>
jaafar: so from your picture, stage 3 starts only after stage 1 is done
<hkaiser>
it should start right away, shouldn't it?
<hkaiser>
do we have that over-constrained somehow? is the dependency logic too strict?
<hkaiser>
essentially our algorithm is a glorified sequential one :/
<hkaiser>
doh!
<hkaiser>
jaafar: I'm convinced that your cache related conclusions are a red herring (sorry)
<hkaiser>
I think the underlying algorithm is just plain wrong
<jaafar>
hkaiser: as I (barely) understand it, the dataflow items and the async tasks that are initially launched are equally valid things to run
<jaafar>
because the dataflow items actually have their inputs available well before they are run
<hkaiser>
well, we can raise the priority of certain tasks, if that helps
<jaafar>
but the async stage 1 things happen instead - don't know why
<hkaiser>
but I'm not sure we have too strict dependencies defined
<jaafar>
I understand your skepticism about my caching theories :)
<jaafar>
One thing I could easily do is benchmark the "warm cache" vs "cold cache" situation and measure the performance difference
<hkaiser>
you can raise task priorities by using launch::async(threads::thread_priority_high)
<hkaiser>
jaafar: we're talking about milliseconds here, caches will not make a dent
<hkaiser>
if you tried raising the priorities of stage 2 and 3 it might change the picture
<jaafar>
hkaiser: the perf data suggests L3 cache misses are dominating the performance costs
<hkaiser>
nah
<jaafar>
:)
<jaafar>
OK!
<hkaiser>
it's a logic error here or some wrong assumption
<hkaiser>
this is too glaring
<hkaiser>
things are usually executed in the order they are scheduled
<hkaiser>
so if you schedule a lot of stage 1 tasks before stage 3, the latter will be executed too late
<jaafar>
I figured
<jaafar>
well, the sad truth is I did try launching stage 2 and 3 with async and high priority
<jaafar>
result: worse performance :)
<hkaiser>
so doing dataflow(launch::async(thread_priority_high), f, ...) for stage 3 might change the picture as that will make those tasks execute right away
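For reference, a small self-contained sketch of that suggestion; the exact enumerator spelling (threads::thread_priority_high vs. threads::thread_priority::high) varies between HPX versions, and the stage lambdas are placeholders:

    #include <hpx/hpx_main.hpp>
    #include <hpx/hpx.hpp>
    #include <iostream>

    int main()
    {
        // Stage 1: ordinary async task (default priority), standing in for
        // one per-chunk task.
        hpx::shared_future<int> s1 = hpx::async([] { return 21; });

        // Stage 3: scheduled as its own high-priority task so it can run as
        // soon as its input is ready instead of waiting behind queued
        // stage-1 work.
        hpx::future<int> s3 = hpx::dataflow(
            hpx::launch::async(hpx::threads::thread_priority_high),
            [](hpx::shared_future<int> in) { return 2 * in.get(); }, s1);

        std::cout << s3.get() << "\n";    // prints 42
        return 0;
    }

Stage 2 would stay synchronous (a plain dataflow without a separate task), as discussed below, since the per-item work is minimal.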
<jaafar>
you would think so, wouldn't you :)
<hkaiser>
ok
<hkaiser>
you're way ahead of me here
<hkaiser>
can you produce such an image using high priority?
<jaafar>
my intuition is clearly lacking in some important ways
<jaafar>
yes! I will do that
<hkaiser>
stage 2 should still be sync, I think
<hkaiser>
no point in creating a separate task for those as the work is minimal
<jaafar>
yeah
<hkaiser>
jaafar: anyways - many thanks for your insights, very interesting!
<jaafar>
you're welcome! I hope it helps
<hkaiser>
it will!
<hkaiser>
jaafar: one last question - how many cores did you use for creating that image ?
<jaafar>
4 cores
<hkaiser>
k
<hkaiser>
thanks
<jaafar>
IIRC that was the sweet spot on my system
<jaafar>
I think that's also the number of "true" cores, i.e. not counting hyperthreads
<hkaiser>
yah, you can see that, there are mostly 4 tasks running concurrently in stage 1
<jaafar>
yep
<hkaiser>
fun!
<jaafar>
OK gotta get out of my window system to make the system closer to idle for benchmarking brb
jaafar has quit [Quit: Konversation terminated!]
hkaiser has quit [Quit: bye]
jaafar has joined #ste||ar
<jaafar>
oops
<jaafar>
well, the picture looks very similar
jaafar has quit [Quit: Konversation terminated!]
jaafar has joined #ste||ar
<jaafar>
correction: actually that did change things
jaafar has quit [Quit: Konversation terminated!]
jaafar has joined #ste||ar
<jbjnr>
jaafar: I can help you with tracing activity in the scheduler