K-ballo changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/
bita_ has joined #ste||ar
K-ballo has quit [Quit: K-ballo]
bita_ has quit [Ping timeout: 256 seconds]
weilewei has quit [Remote host closed the connection]
hkaiser has quit [Quit: bye]
<jaafar>
Anyone around who knows about how the scheduler works?
<jaafar>
I can always ask tomorrow
gdaiss[m] has quit [Ping timeout: 260 seconds]
klaus[m] has quit [Ping timeout: 260 seconds]
gdaiss[m] has joined #ste||ar
bobakk3r_ has quit [*.net *.split]
klaus[m] has joined #ste||ar
bobakk3r_ has joined #ste||ar
parsa[m] has quit [Quit: Idle for 30+ days]
<ms[m]>
jaafar: I can probably tell you a thing or two about it
<jaafar>
Good morning, or whatever your time zone is ms[m]
<jaafar>
I'd be glad to get some hints that I can use to understand the behavior I'm seeing in the scheduler
<jaafar>
From my reading it looks like there is one queue per OS thread from which runnable tasks are selected whenever the currently executing task needs to block on an "LCOS"
<jaafar>
runnable tasks are not chosen from other queues unless the queue for the current thread is empty
<jaafar>
Do I have the big picture correct?
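A rough model of what jaafar is describing, one queue per OS worker thread with stealing only when the local queue runs dry, is sketched below. This is illustrative C++ only, not HPX's actual scheduler code, and it omits all synchronization.

```cpp
// Illustrative model of the behaviour described above -- NOT HPX's
// scheduler code. One queue per OS worker thread; a worker pops from its
// own queue first and steals from another queue only when its own is
// empty. All synchronization is omitted for brevity.
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>
#include <vector>

struct worker_queues
{
    // one task queue per OS worker thread
    std::vector<std::deque<std::function<void()>>> queues;

    // Hypothetical helper: pick the next runnable task for worker `self`.
    std::function<void()> next_task(std::size_t self)
    {
        if (!queues[self].empty())
        {
            auto task = std::move(queues[self].front());
            queues[self].pop_front();
            return task;
        }
        // local queue is empty: try to steal from the back of another queue
        for (std::size_t victim = 0; victim != queues.size(); ++victim)
        {
            if (victim != self && !queues[victim].empty())
            {
                auto task = std::move(queues[victim].back());
                queues[victim].pop_back();
                return task;
            }
        }
        return {};    // nothing to run anywhere
    }
};
```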
bita_ has joined #ste||ar
<jaafar>
Well, that's my general understanding ^^^
<jaafar>
I'll describe something specific I see in my traces
<jaafar>
in whatever thread hpx_main() runs on, I call hpx::async() four times, with tasks A, B, C, and D
<jaafar>
then I wait on some future
<jaafar>
What I observe from my traces is that B, C, and D start running on other OS threads, followed by A once the launching code blocks
<jaafar>
My interpretation is: B, C, and D were stolen by the other queues and A begins once there is no further work in the original OS thread's queue
<jaafar>
Does that sound right?
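A minimal sketch of the launch pattern described in this exchange, assuming a stock HPX build; the task bodies and the awaited futures are placeholders.

```cpp
// Minimal sketch of the launch pattern described above; task bodies are
// placeholders and header names may vary slightly between HPX releases.
#include <hpx/hpx_main.hpp>
#include <hpx/include/async.hpp>
#include <hpx/include/lcos.hpp>

int main()
{
    // Four tasks created in order A, B, C, D from the hpx_main thread.
    hpx::future<void> a = hpx::async([] { /* task A */ });
    hpx::future<void> b = hpx::async([] { /* task B */ });
    hpx::future<void> c = hpx::async([] { /* task C */ });
    hpx::future<void> d = hpx::async([] { /* task D */ });

    // Blocking here suspends the launching HPX thread, which frees this
    // OS worker to pick up queued work (often task A, per the trace).
    hpx::wait_all(a, b, c, d);
    return 0;
}
```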
hkaiser has joined #ste||ar
<hkaiser>
ms[m]: heads-up: I have force-pushed to your branch msimberg/fix-5035
weilewei has joined #ste||ar
<weilewei>
hkaiser can you quickly join the meeting and enable the screen sharing for me?
<weilewei>
I want to share my screen later during the meeting
<hkaiser>
sure, will be there in a sec
<weilewei>
thanks
elfring has quit [Quit: Konversation terminated!]
bita__ has joined #ste||ar
bita_ has quit [Ping timeout: 264 seconds]
<jaafar>
Is there anyone who might be willing to take some (hopefully simple) scheduler questions?
<hkaiser>
jaafar: here now
<jaafar>
hey!
<jaafar>
I know you all have lots going on so I'll be brief
<jaafar>
I'm trying to understand the behavior of the scan algorithm, which I'm tracing
weilewei has quit [Remote host closed the connection]
<jaafar>
IIUC there's one work-stealing queue for each OS thread
<jaafar>
and queues will grab work when they are empty
<jaafar>
Imagine I am in hpx_main and I launch four tasks A, B, C, and D and I'm using four OS threads
<hkaiser>
jaafar: yes, in its most simplified form, this is true
<jaafar>
My guess is all four will begin life on a single queue associated with hpx_main
<jaafar>
and they get stolen into the other three queues
<jaafar>
right so far?
<hkaiser>
jaafar: depends
<hkaiser>
I think the default is to round robin the queues, but you can specify hints
<jaafar>
Does that mean the "thief" is chosen round robin?
<hkaiser>
no, the initial queue is chosen round robin
<jaafar>
i.e. queue 1 can steal, then 2, then 3
<hkaiser>
all queues steal concurrently
<jaafar>
OK what is the "initial queue"?
<hkaiser>
well, the core the empty queue is associated with is the one that steals
<hkaiser>
the task is created and put into an 'initial queue' from where it might get stolen
<jaafar>
ah, so in fact my tasks A, B, C, D may actually begin life in different queues
<jaafar>
chosen round robin
<hkaiser>
yes
<jaafar>
OK. Is there any way to ensure that A (the first created) begins executing first? From my traces it seems random
<jaafar>
I was hoping that they would begin execution A, B, C, D but sometimes A ends up on the main thread I'm creating the tasks from
<jaafar>
and thus starts only after I've created all of A through D
<hkaiser>
jaafar: as I said, we can specify hints, or even ask the scheduler which core(s) are currently free
<jaafar>
Is that doable through hpx::async()? I can see that executors can have scheduling hints, but if I have just the one executor I worry I don't have that flexibility
<jaafar>
or dataflow()
<jaafar>
It's clear I can control the launch policy and the chunk size
<jaafar>
but not sure about anything else
<hkaiser>
jaafar: let's create that capability ;-)
<jaafar>
ha OK gotcha
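For reference, a sketch of the two knobs mentioned in this exchange: a scheduling hint attached to an executor that is then passed to hpx::async, and a chunk size attached to a parallel execution policy. The exact parallel_executor constructor arguments and enum spellings differ between HPX releases, so treat them as assumptions rather than a definitive API.

```cpp
// Sketch only: the parallel_executor constructor shown here (priority,
// stacksize, schedule hint) and the enum spellings are assumptions that
// vary between HPX releases -- check the version you build against.
#include <hpx/include/async.hpp>
#include <hpx/include/parallel_algorithm.hpp>
#include <hpx/include/parallel_executors.hpp>

void launch_with_hint()
{
    namespace ex = hpx::parallel::execution;

    // Ask that work submitted through this executor start on worker
    // thread 2 (a hint, not a guarantee).
    ex::parallel_executor exec(
        hpx::threads::thread_priority_default,
        hpx::threads::thread_stacksize_default,
        hpx::threads::thread_schedule_hint(2));

    hpx::future<void> f = hpx::async(exec, [] { /* task A */ });

    // Chunk size is a separate knob, attached to the execution policy of
    // a parallel algorithm via an executor parameter.
    auto policy = ex::par.with(ex::static_chunk_size(1024));
    (void) policy;

    f.get();
}
```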
<jaafar>
just a couple left
<jaafar>
if a future is ready, can get() still cause a context switch to a different queued task?
<hkaiser>
we could dynamically attach new executor parameters as needed
<hkaiser>
no, if the future is ready, the get will return right away with the result
<jaafar>
hm, OK
<jaafar>
I'll check my traces again. Thought I saw that happening.
<hkaiser>
well, it can happen
<hkaiser>
future states are protected by a spinlock - if for some reason the spinlock is not available, then even a get could suspend
<jaafar>
I see
<hkaiser>
shouldn't happen too often, though
<jaafar>
and promise::set_value() should normally not block - but maybe this spinlock could prevent that, also?
<hkaiser>
right - I need to look at the code to be sure, though
<jaafar>
I imagine it might not return right away either - I could see a change of context to some waiting code
<jaafar>
like dataflow with sync launch policy
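A small sketch of the scenario jaafar has in mind: a continuation attached with hpx::launch::sync may run inline on the thread that fulfils the promise, so set_value() only returns once that continuation has finished. Header names are approximate.

```cpp
// Sketch: with a sync launch policy the continuation can run inline on
// the thread that makes the future ready, so set_value() may not return
// until the continuation has finished. Header names are approximate.
#include <hpx/include/async.hpp>
#include <hpx/include/lcos.hpp>

#include <utility>

void sync_continuation_example()
{
    hpx::lcos::local::promise<int> p;
    hpx::future<int> f = p.get_future();

    // Continuation with hpx::launch::sync: executed as soon as f becomes
    // ready, potentially on the thread calling set_value().
    hpx::future<int> g = hpx::dataflow(
        hpx::launch::sync,
        [](hpx::future<int> v) { return v.get() + 1; },
        std::move(f));

    p.set_value(41);          // may run the continuation before returning
    int result = g.get();     // 42
    (void) result;
}
```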
<jaafar>
OK thanks for your help hkaiser. I feel like getting the tasks to run on the right cores in the right order is the key to unlocking the performance of exclusive_scan
<hkaiser>
jaafar: nod, sounds about right
<hkaiser>
jaafar: there is one more thing, though
<hkaiser>
the scan partitioner launches one task more than it has to
<hkaiser>
the last partition doesn't need to run f1 separately, but f1 and f3 can be combined
<hkaiser>
sorry for my typos
<hkaiser>
anyways, gtg now
<jaafar>
The first partition, right?
<jaafar>
:)
<jaafar>
I was looking at how to do that
<jaafar>
TTYL
<jaafar>
Along those lines (for later) I've also seen in my traces that sometimes f2 of the previous chunk finishes prior to f1 starting, which means the same optimization is available
<jaafar>
i.e. the elimination of f3
<jaafar>
but it's timing-dependent so I don't know how to properly exploit it. Cancellation seemed like a possibility, but... I'm not sure how to do it properly.
<hkaiser>
jaafar: no, I meant the last partition
<hkaiser>
there, f1 can be run at the point when f3 is ready to run, at the very end
<jaafar>
Huh
<jaafar>
Well, the first partition can be run this way, because the init value is always "ready"
<hkaiser>
sure
<jaafar>
I don't see it for the last one :)
<hkaiser>
the last one can run that way as well, since you don't need the result of its f1
<jaafar>
I think the f1's perform a prefix scan
<jaafar>
and the f3's add a single value to all the entries in the output
<jaafar>
so if you didn't run f1...
<hkaiser>
sure, but the result of f1 is used as the initial value for the next partition
<jaafar>
yes
<hkaiser>
I'm not saying that f1 shouldn't run
<jaafar>
OK!
<hkaiser>
you can run f1 together with f3 on an element-by-element basis
<hkaiser>
in the same task
<jaafar>
ah
<jaafar>
Sounds like what I had in mind for the first stage
<hkaiser>
yes, there it works as well
<jaafar>
OK great I get it, I think :)
<jaafar>
Although - bear with me - f3 is much faster than f1
<jaafar>
so if we're waiting for a previous f2, doing f1 while that other work completes, then f3 when its data arrives, is a win
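To ground the f1/f2/f3 discussion, below is a schematic chunked exclusive scan in plain C++. It is not HPX's scan_partitioner implementation and is written sequentially for clarity: f1 scans each chunk locally, f2 turns the chunk totals into per-chunk offsets, and f3 adds each chunk's offset to its elements. The comments mark why the first and last chunks can fuse f1 with f3.

```cpp
// Schematic of the three phases discussed above -- not HPX's actual
// scan_partitioner code, and written sequentially for clarity.
// Assumes every chunk is non-empty.
#include <cstddef>
#include <numeric>
#include <vector>

std::vector<int> chunked_exclusive_scan(
    std::vector<std::vector<int>> const& chunks, int init)
{
    std::size_t const n = chunks.size();
    std::vector<std::vector<int>> partial(n);
    std::vector<int> offsets(n, init);

    // f1: exclusive scan of each chunk in isolation (these are the tasks
    // that can run in parallel).
    for (std::size_t i = 0; i != n; ++i)
    {
        partial[i].resize(chunks[i].size());
        std::exclusive_scan(
            chunks[i].begin(), chunks[i].end(), partial[i].begin(), 0);
    }

    // f2: combine chunk totals into per-chunk offsets. The f1 result of
    // the *last* chunk is never read here, which is why its f1 can be
    // fused with its f3; the *first* chunk's offset is just init, so its
    // f1 and f3 can be fused as well.
    for (std::size_t i = 1; i != n; ++i)
    {
        int chunk_total = partial[i - 1].back() + chunks[i - 1].back();
        offsets[i] = offsets[i - 1] + chunk_total;
    }

    // f3: add each chunk's offset to its scanned elements (parallel again
    // in the real algorithm).
    std::vector<int> result;
    for (std::size_t i = 0; i != n; ++i)
        for (int v : partial[i])
            result.push_back(v + offsets[i]);

    return result;
}
```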