K-ballo changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/
bita_ has joined #ste||ar
K-ballo has quit [Quit: K-ballo]
bita_ has quit [Ping timeout: 256 seconds]
weilewei has quit [Remote host closed the connection]
hkaiser has quit [Quit: bye]
<jaafar> Anyone around who knows about how the scheduler works?
<jaafar> I can always ask tomorrow
gdaiss[m] has quit [Ping timeout: 260 seconds]
klaus[m] has quit [Ping timeout: 260 seconds]
gdaiss[m] has joined #ste||ar
bobakk3r_ has quit [*.net *.split]
klaus[m] has joined #ste||ar
bobakk3r_ has joined #ste||ar
parsa[m] has quit [Quit: Idle for 30+ days]
<ms[m]> jaafar: I can probably tell you a thing or two about it
<ms[m]> just ask away
hkaiser has joined #ste||ar
K-ballo has joined #ste||ar
hkaiser has quit [Quit: bye]
hkaiser has joined #ste||ar
elfring has joined #ste||ar
<jaafar> Good morning, or whatever your time zone is ms[m]
<jaafar> I'd be glad to get some hints that I can use to understand the behavior I'm seeing in the scheduler
<jaafar> From my reading it looks like there is one queue per OS thread from which runnable tasks are selected whenever the currently executing task needs to block on an "LCOS"
<jaafar> runnable tasks are not chosen from other queues unless the queue for the current thread is empty
<jaafar> Do I have the big picture correct?
hkaiser has quit [Quit: bye]
bita_ has joined #ste||ar
<jaafar> Well, that's my general understanding ^^^
<jaafar> I'll describe something specific I see in my traces
<jaafar> in whatever thread hpx_main() is I run hpx::async() four times, with tasks A, B, C, and D
<jaafar> then I wait on some future
<jaafar> What I observe from my traces is that B, C, and D start running on other OS threads, followed by A once the launching code blocks
<jaafar> My interpretation is: B, C, and D were stolen by the other queues and A begins once there is no further work in the original OS thread's queue
<jaafar> Does that sound right?
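A minimal sketch of the scenario jaafar describes, assuming the usual hpx_main setup (task bodies are placeholders and header paths vary between HPX versions):

    #include <hpx/hpx_main.hpp>
    #include <hpx/include/async.hpp>
    #include <hpx/include/lcos.hpp>

    int main()
    {
        auto a = hpx::async([] { /* task A */ });
        auto b = hpx::async([] { /* task B */ });
        auto c = hpx::async([] { /* task C */ });
        auto d = hpx::async([] { /* task D */ });

        // Blocking here suspends the hpx_main task; the worker thread that was
        // running it is then free to pick up queued work (e.g. A), while B, C
        // and D may already have been stolen by other worker threads.
        hpx::wait_all(a, b, c, d);
        return 0;
    }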
hkaiser has joined #ste||ar
<hkaiser> ms[m]: headsup: I have force-pushed to your branch msimberg/fix-5035
weilewei has joined #ste||ar
<weilewei> hkaiser can you quickly join the meeting and enable the screen sharing for me?
<weilewei> I want to share my screen later during the meeting
<hkaiser> sure, will be there in a sec
<weilewei> thanks
elfring has quit [Quit: Konversation terminated!]
bita__ has joined #ste||ar
bita_ has quit [Ping timeout: 264 seconds]
<jaafar> Is there anyone who might be willing to take some (hopefully simple) scheduler questions?
<hkaiser> jaafar: here now
<jaafar> hey!
<jaafar> I know you all have lots going on so I'll be brief
<jaafar> I'm trying to understand the behavior of the scan algorithm, which I'm tracing
weilewei has quit [Remote host closed the connection]
<jaafar> IIUC there's one work-stealing queue for each OS thread
<jaafar> and queues will grab work when they are empty
<jaafar> Imagine I am in hpx_main and I launch four tasks A, B, C, and D and I'm using four OS threads
<hkaiser> jaafar: yes, in the most simplified view that's true
<jaafar> My guess is all four will begin life on a single queue associated with hpx_main
<jaafar> and they get stolen into the other three queues
<jaafar> right so far?
<hkaiser> jaafar: depends
<hkaiser> I think the default is to round robin the queues, but you can specify hints
<jaafar> Does that mean the "thief" is chosen round robin?
<hkaiser> no, the initial queue is chosen round robin
<jaafar> i.e. queue 1 can steal, then 2, then 3
<hkaiser> all queues steal concurrently
<jaafar> OK what is the "initial queue"?
<hkaiser> well, the core that the empty queue is associated with is the one doing the stealing
<hkaiser> the task is created and put into an 'initial queue' from where it might get stolen
<jaafar> ah, so in fact my tasks A, B, C, D may actually begin life in different queues
<jaafar> chosen round robin
<hkaiser> yes
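One way to see this round-robin placement in a trace is to print the worker thread each task actually runs on; a small sketch (output ordering will of course vary from run to run):

    #include <hpx/hpx_main.hpp>
    #include <hpx/include/async.hpp>
    #include <hpx/include/lcos.hpp>
    #include <hpx/include/runtime.hpp>
    #include <cstdio>
    #include <vector>

    int main()
    {
        std::vector<hpx::future<void>> tasks;
        for (char name : {'A', 'B', 'C', 'D'})
        {
            tasks.push_back(hpx::async([name] {
                // Report which OS worker thread this HPX task ended up on.
                std::printf("task %c on worker %zu\n", name,
                    hpx::get_worker_thread_num());
            }));
        }
        hpx::wait_all(tasks);
        return 0;
    }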
<jaafar> OK. Is there any way to ensure that A (the first created) begins executing first? From my traces it seems random
<jaafar> I was hoping that they would begin execution A, B, C, D but sometimes A ends up on the main thread I'm creating the tasks from
<jaafar> and thus starts after I create A-Z
<hkaiser> jaafar: as said we can specify hints, or even ask the scheduler what core(s) are currently free
<jaafar> Is that doable through hpx::async()? I can see that executors can have scheduling hints, but if I have just the one executor I worry I don't have that flexibility
<jaafar> or dataflow()
<jaafar> It's clear I can control the launch policy and the chunk size
<jaafar> but not sure about anything else
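For reference, the two knobs jaafar mentions are exposed through the execution policy passed to the algorithm; a sketch along those lines (namespaces and header paths moved around between HPX releases, so treat the exact spellings as approximate):

    #include <hpx/hpx_main.hpp>
    #include <hpx/include/parallel_scan.hpp>
    #include <hpx/include/parallel_executor_parameters.hpp>
    #include <vector>

    int main()
    {
        std::vector<int> in(1 << 20, 1), out(in.size());

        // Asynchronous launch policy plus an explicit chunk size; per-chunk
        // scheduling hints are the part that is not expressible here (yet).
        auto policy =
            hpx::parallel::execution::par(hpx::parallel::execution::task)
                .with(hpx::parallel::execution::static_chunk_size(
                    in.size() / 4));

        auto f = hpx::parallel::exclusive_scan(
            policy, in.begin(), in.end(), out.begin(), 0);
        f.wait();
        return 0;
    }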
<hkaiser> jaafar: let's create that capability ;-)
<jaafar> ha OK gotcha
<jaafar> just a couple left
<jaafar> if a future is ready, can get() still cause a context switch to a different queued task?
<hkaiser> we could dynamically attach new executor parameters as needed
<hkaiser> no, if the future is ready, the get will return right away with the result
<jaafar> hm, OK
<jaafar> I'll check my traces again. Thought I saw that happening.
<hkaiser> well, it can happen
<hkaiser> future states are protected by a spinlock - if for some reason the spinlock is not available, then even a get could suspend
<jaafar> I see
<hkaiser> shouldn't happen too often, though
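A small illustration of the distinction being made here; is_ready() is the non-blocking query, and the spinlock contention mentioned above is internal and not visible at this level:

    #include <hpx/hpx_main.hpp>
    #include <hpx/include/async.hpp>

    int main()
    {
        hpx::future<int> f = hpx::async([] { return 42; });

        f.wait();                  // suspend until the result is available
        bool ready = f.is_ready(); // true at this point, no suspension involved
        int value = f.get();       // normally returns right away once ready,
                                   // barring the rare internal lock contention
                                   // described above
        return ready && value == 42 ? 0 : 1;
    }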
<jaafar> and promise::set_value() should normally not block - but maybe this spinlock could prevent that, also?
<hkaiser> right - I need to look at the code to be sure, though
<jaafar> I imagine it might not return either - I could see a change of context to some waiting code
<jaafar> like dataflow with sync launch policy
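Roughly the situation jaafar has in mind, as a sketch: with a sync launch policy the continuation is not scheduled as a separate task and may run inline in the context of whoever makes the future ready (header paths are approximate):

    #include <hpx/hpx_main.hpp>
    #include <hpx/include/async.hpp>
    #include <hpx/include/lcos.hpp>
    #include <utility>

    int main()
    {
        hpx::lcos::local::promise<int> p;
        hpx::future<int> f = p.get_future();

        // With hpx::launch::sync the continuation may execute inline when the
        // future becomes ready, i.e. inside the thread calling set_value().
        auto cont = hpx::dataflow(hpx::launch::sync,
            [](hpx::future<int> v) { return v.get() + 1; }, std::move(f));

        p.set_value(41);    // may run the continuation before returning
        return cont.get() == 42 ? 0 : 1;
    }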
<jaafar> OK thanks for your help hkaiser. I feel like getting the tasks to run on the right cores in the right order is the key to unlocking the performance of exclusive_scan
<hkaiser> jaafar: nod, sounds about right
<hkaiser> jaafar: there is one more thing, though
<hkaiser> the scan partitioner launches one task more than it has to
<hkaiser> the last partition doesn't need to run f1 separately, but f1 and f3 can be combined
<hkaiser> sorry for my typos
<hkaiser> anyways, gtg now
<jaafar> The first partition, right?
<jaafar> :)
<jaafar> I was looking at how to do that
<jaafar> TTYL
<jaafar> Along those lines (for later) I've also seen in my traces that sometimes f2 of the previous chunk finishes prior to f1 starting, which means the same optimization is available
<jaafar> i.e. the elimination of f3
<jaafar> but it's timing-dependent so I don't know how to properly exploit it. Cancellation seemed like a possibility, but... I'm not sure how to do it properly.
<hkaiser> jaafar: no, I meant the last partition
<hkaiser> there, f1 can be run when it's time to run f3, at the very end
<jaafar> Huh
<jaafar> Well, the first partition can be run this way, because the init value is always "ready"
<hkaiser> sure
<jaafar> I don't see it for the last one :)
<hkaiser> the last can run as well, as you don't need the result of its f1
<jaafar> I think the f1's perform a prefix scan
<jaafar> and the f3's add a single value to all the entries in the output
<jaafar> so if you didn't run f1...
<hkaiser> sure, but the result of f1 is used as the initial value for the next partition
<jaafar> yes
<hkaiser> I'm not saying that f1 shouldn't run
<jaafar> OK!
<hkaiser> you can run f1 together with f3 on an element-by-element basis
<hkaiser> in the same task
<jaafar> ah
<jaafar> Sounds like what I had in mind for the first stage
<hkaiser> yes, there it works as well
<jaafar> OK great I get it, I think :)
<jaafar> Although - bear with me - f3 is much faster than f1
<jaafar> so if we're waiting for a previous f2, doing f1 while other work is being completed, then f3 when that data arrives is a win
<hkaiser> jaafar: see https://en.wikipedia.org/wiki/Prefix_sum under Shared memory: Two-level algorithm
<hkaiser> jaafar: yes, f3 is faster
<hkaiser> (most of the time)
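For reference, the f1/f2/f3 split discussed above mirrors the two-level algorithm on that page; a plain sequential sketch of the structure (chunking and names are illustrative, this is not the scan_partitioner code itself):

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Two-level exclusive scan over fixed-size chunks:
    //   f1: scan each chunk independently (parallelizable)
    //   f2: combine the per-chunk totals into per-chunk offsets (the carries)
    //   f3: add each chunk's offset to its scanned elements (parallelizable)
    std::vector<int> two_level_exclusive_scan(
        std::vector<int> const& in, std::size_t chunk, int init)
    {
        std::vector<int> out(in.size());
        std::size_t const nchunks = (in.size() + chunk - 1) / chunk;
        std::vector<int> totals(nchunks), offsets(nchunks);

        // f1: exclusive scan of each chunk starting from 0, remember its sum
        for (std::size_t c = 0; c != nchunks; ++c)
        {
            std::size_t const b = c * chunk;
            std::size_t const e = std::min(in.size(), b + chunk);
            int sum = 0;
            for (std::size_t i = b; i != e; ++i)
            {
                out[i] = sum;
                sum += in[i];
            }
            totals[c] = sum;
        }

        // f2: sequential carry propagation across the chunk totals
        int carry = init;
        for (std::size_t c = 0; c != nchunks; ++c)
        {
            offsets[c] = carry;
            carry += totals[c];
        }

        // f3: add the carry to every element of its chunk; for the first
        // chunk the offset is just init, so f1 and f3 could be fused there
        for (std::size_t c = 0; c != nchunks; ++c)
        {
            std::size_t const b = c * chunk;
            std::size_t const e = std::min(in.size(), b + chunk);
            for (std::size_t i = b; i != e; ++i)
                out[i] += offsets[c];
        }
        return out;
    }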
<jaafar> If my benchmarking is any clue, we will be waiting for the previous f2 at the end
<jaafar> better to run f3 than f1 in that case, I think
<hkaiser> right, so having f1 run before that might indeed help
<hkaiser> (for the last partition)
<jaafar> cool
<hkaiser> so I take back what I said ;-)
<jaafar> OK whew, glad we're on the same page!
<hkaiser> thanks jaafar!
jaafar has quit [Quit: Konversation terminated!]
jaafar has joined #ste||ar