K-ballo changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/
bita_ has joined #ste||ar
K-ballo has quit [Quit: K-ballo]
bita_ has quit [Ping timeout: 256 seconds]
weilewei has quit [Remote host closed the connection]
hkaiser has quit [Quit: bye]
<jaafar> Anyone around who knows about how the scheduler works?
<jaafar> I can always ask tomorrow
gdaiss[m] has quit [Ping timeout: 260 seconds]
klaus[m] has quit [Ping timeout: 260 seconds]
gdaiss[m] has joined #ste||ar
bobakk3r_ has quit [*.net *.split]
klaus[m] has joined #ste||ar
bobakk3r_ has joined #ste||ar
parsa[m] has quit [Quit: Idle for 30+ days]
<ms[m]> jaafar: I can probably tell you a thing or two about it
<ms[m]> just ask away
hkaiser has joined #ste||ar
K-ballo has joined #ste||ar
hkaiser has quit [Quit: bye]
hkaiser has joined #ste||ar
elfring has joined #ste||ar
<jaafar> Good morning, or whatever your time zone is ms[m]
<jaafar> I'd be glad to get some hints that I can use to understand the behavior I'm seeing in the scheduler
<jaafar> From my reading it looks like there is one queue per OS thread from which runnable tasks are selected whenever the currently executing task needs to block on an "LCOS"
<jaafar> runnable tasks are not chosen from other queues unless the queue for the current thread is empty
<jaafar> Do I have the big picture correct?
hkaiser has quit [Quit: bye]
bita_ has joined #ste||ar
<jaafar> Well, that's my general understanding ^^^
<jaafar> I'll describe something specific I see in my traces
<jaafar> in whatever thread hpx_main() is I run hpx::async() four times, with tasks A, B, C, and D
<jaafar> then I wait on some future
<jaafar> What I observe from my traces is that B, C, and D start running on other OS threads, followed by A once the launching code blocks
<jaafar> My interpretation is: B, C, and D were stolen by the other queues and A begins once there is no further work in the original OS thread's queue
<jaafar> Does that sound right?
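A minimal sketch of the scenario jaafar describes, assuming the usual hpx_main setup (task bodies are placeholders and header paths vary between HPX versions):

    #include <hpx/hpx_main.hpp>
    #include <hpx/include/async.hpp>
    #include <hpx/include/lcos.hpp>

    int main()
    {
        auto a = hpx::async([] { /* task A */ });
        auto b = hpx::async([] { /* task B */ });
        auto c = hpx::async([] { /* task C */ });
        auto d = hpx::async([] { /* task D */ });

        // Blocking here suspends the hpx_main task; the worker thread that was
        // running it is then free to pick up queued work (e.g. A), while B, C
        // and D may already have been stolen by other worker threads.
        hpx::wait_all(a, b, c, d);
        return 0;
    }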
hkaiser has joined #ste||ar
<hkaiser> ms[m]: headsup: I have force-pushed to your branch msimberg/fix-5035
weilewei has joined #ste||ar
<weilewei> hkaiser can you quickly join the meeting and enable the screen sharing for me?
<weilewei> I want to share my screen later during the meeting
<hkaiser> sure, will be there in a sec
<weilewei> thanks
elfring has quit [Quit: Konversation terminated!]
bita__ has joined #ste||ar
bita_ has quit [Ping timeout: 264 seconds]
<jaafar> Is there anyone who might be willing to take some (hopefully simple) scheduler questions?
<hkaiser> jaafar: here now
<jaafar> hey!
<jaafar> I know you all have lots going on so I'll be brief
<jaafar> I'm trying to understand the behavior of the scan algorithm, which I'm tracing
weilewei has quit [Remote host closed the connection]
<jaafar> IIUC there's one work-stealing queue for each OS thread
<jaafar> and queues will grab work when they are empty
<jaafar> Imagine I am in hpx_main and I launch four tasks A, B, C, and D and I'm using four OS threads
<hkaiser> jaafar: yes, in the most simplified view that's true
<jaafar> My guess is all four will begin life on a single queue associated with hpx_main
<jaafar> and they get stolen into the other three queues
<jaafar> right so far?
<hkaiser> jaafar: depends
<hkaiser> I think the default is to round robin the queues, but you can specify hints
<jaafar> Does that mean the "thief" is chosen round robin?
<hkaiser> no, the initial queue is chosen round robin
<jaafar> i.e. queue 1 can steal, then 2, then 3
<hkaiser> all queues steal concurrently
<jaafar> OK what is the "initial queue"?
<hkaiser> well, the core that the empty queue is associated with is the one doing the stealing
<hkaiser> the task is created and put into an 'initial queue' from where it might get stolen
<jaafar> ah, so in fact my tasks A, B, C, D may actually begin life in different queues
<jaafar> chosen round robin
<hkaiser> yes
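One way to see this round-robin placement in a trace is to print the worker thread each task actually runs on; a small sketch (output ordering will of course vary from run to run):

    #include <hpx/hpx_main.hpp>
    #include <hpx/include/async.hpp>
    #include <hpx/include/lcos.hpp>
    #include <hpx/include/runtime.hpp>
    #include <cstdio>
    #include <vector>

    int main()
    {
        std::vector<hpx::future<void>> tasks;
        for (char name : {'A', 'B', 'C', 'D'})
        {
            tasks.push_back(hpx::async([name] {
                // Report which OS worker thread this HPX task ended up on.
                std::printf("task %c on worker %zu\n", name,
                    hpx::get_worker_thread_num());
            }));
        }
        hpx::wait_all(tasks);
        return 0;
    }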
<jaafar> OK. Is there any way to ensure that A (the first created) begins executing first? From my traces it seems random
<jaafar> I was hoping that they would begin execution A, B, C, D but sometimes A ends up on the main thread I'm creating the tasks from
<jaafar> and thus starts after I create A-Z
<hkaiser> jaafar: as said we can specify hints, or even ask the scheduler what core(s) are currently free
<jaafar> Is that doable through hpx::async()? I can see that executors can have scheduling hints, but if I have just the one executor I worry I don't have that flexibility
<jaafar> or dataflow()
<jaafar> It's clear I can control the launch policy and the chunk size
<jaafar> but not sure about anything else
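For reference, the two knobs jaafar mentions are exposed through the execution policy passed to the algorithm; a sketch along those lines (namespaces and header paths moved around between HPX releases, so treat the exact spellings as approximate):

    #include <hpx/hpx_main.hpp>
    #include <hpx/include/parallel_scan.hpp>
    #include <hpx/include/parallel_executor_parameters.hpp>
    #include <vector>

    int main()
    {
        std::vector<int> in(1 << 20, 1), out(in.size());

        // Asynchronous launch policy plus an explicit chunk size; per-chunk
        // scheduling hints are the part that is not expressible here (yet).
        auto policy =
            hpx::parallel::execution::par(hpx::parallel::execution::task)
                .with(hpx::parallel::execution::static_chunk_size(
                    in.size() / 4));

        auto f = hpx::parallel::exclusive_scan(
            policy, in.begin(), in.end(), out.begin(), 0);
        f.wait();
        return 0;
    }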
<hkaiser> jaafar: let's create that capability ;-)
<jaafar> ha OK gotcha
<jaafar> just a couple left
<jaafar> if a future is ready, can get() still cause a context switch to a different queued task?
<hkaiser> we could dynamically attach new executor parameters as needed
<hkaiser> no, if the future is ready, the get will return right away with the result
<jaafar> hm, OK
<jaafar> I'll check my traces again. Thought I saw that happening.
<hkaiser> well, it can happen
<hkaiser> future states are protected by a spinlock - if for some reason the spinlock is not available, then even a get could suspend
<jaafar> I see
<hkaiser> shouldn't happen too often, though
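A small illustration of the distinction being made here; is_ready() is the non-blocking query, and the spinlock contention mentioned above is internal and not visible at this level:

    #include <hpx/hpx_main.hpp>
    #include <hpx/include/async.hpp>

    int main()
    {
        hpx::future<int> f = hpx::async([] { return 42; });

        f.wait();                  // suspend until the result is available
        bool ready = f.is_ready(); // true at this point, no suspension involved
        int value = f.get();       // normally returns right away once ready,
                                   // barring the rare internal lock contention
                                   // described above
        return ready && value == 42 ? 0 : 1;
    }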
<jaafar> and promise::set_value() should normally not block - but maybe this spinlock could prevent that, also?
<hkaiser> right - I need to look at the code to be sure, though
<jaafar> I imagine it might not return either - I could see a change of context to some waiting code
<jaafar> like dataflow with sync launch policy
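Roughly the situation jaafar has in mind, as a sketch: with a sync launch policy the continuation is not scheduled as a separate task and may run inline in the context of whoever makes the future ready (header paths are approximate):

    #include <hpx/hpx_main.hpp>
    #include <hpx/include/async.hpp>
    #include <hpx/include/lcos.hpp>
    #include <utility>

    int main()
    {
        hpx::lcos::local::promise<int> p;
        hpx::future<int> f = p.get_future();

        // With hpx::launch::sync the continuation may execute inline when the
        // future becomes ready, i.e. inside the thread calling set_value().
        auto cont = hpx::dataflow(hpx::launch::sync,
            [](hpx::future<int> v) { return v.get() + 1; }, std::move(f));

        p.set_value(41);    // may run the continuation before returning
        return cont.get() == 42 ? 0 : 1;
    }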
<jaafar> OK thanks for your help hkaiser. I feel like getting the tasks to run on the right cores in the right order is the key to unlocking the performance of exclusive_scan
<hkaiser> jaafar: nod, sounds about right
<hkaiser> jaafar: there is one more thing, though
<hkaiser> the scan partitioner launches one task more than it has to
<hkaiser> the last partition doesn't need to run f1 separately, but f1 and f3 can be combined
<hkaiser> sorry for my typos
<hkaiser> anyways, gtg now
<jaafar> The first partition, right?
<jaafar> :)
<jaafar> I was looking at how to do that
<jaafar> TTYL
<jaafar> Along those lines (for later) I've also seen in my traces that sometimes f2 of the previous chunk finishes prior to f1 starting, which means the same optimization is available
<jaafar> i.e. the elimination of f3
<jaafar> but it's timing-dependent so I don't know how to properly exploit it. Cancellation seemed like a possibility, but... I'm not sure how to do it properly.
<hkaiser> jaafar: no, I meant the last partition
<hkaiser> there, f1 can be run when it's time to run f3, at the very end
<jaafar> Huh
<jaafar> Well, the first partition can be run this way, because the init value is always "ready"
<hkaiser> sure
<jaafar> I don't see it for the last one :)
<hkaiser> the last can run as well, as you don't need the result of its f1
<jaafar> I think the f1's perform a prefix scan
<jaafar> and the f3's add a single value to all the entries in the output
<jaafar> so if you didn't run f1...
<hkaiser> sure, but the result of f1 is used as the initial value for the next partition
<jaafar> yes
<hkaiser> I'm not saying that f1 shouldn't run
<jaafar> OK!
<hkaiser> you can run f1 together with f3 on an element-by-element basis
<hkaiser> in the same task
<jaafar> ah
<jaafar> Sounds like what I had in mind for the first stage
<hkaiser> yes, there it works as well
<jaafar> OK great I get it, I think :)
<jaafar> Although - bear with me - f3 is much faster than f1
<jaafar> so if we're waiting for a previous f2, doing f1 while other work is being completed, then f3 when that data arrives is a win
<hkaiser> jaafar: see https://en.wikipedia.org/wiki/Prefix_sum under Shared memory: Two-level algorithm
<hkaiser> jaafar: yes, f3 is faster
<hkaiser> (most of the time)
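For reference, the f1/f2/f3 split discussed above mirrors the two-level algorithm on that page; a plain sequential sketch of the structure (chunking and names are illustrative, this is not the scan_partitioner code itself):

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Two-level exclusive scan over fixed-size chunks:
    //   f1: scan each chunk independently (parallelizable)
    //   f2: combine the per-chunk totals into per-chunk offsets (the carries)
    //   f3: add each chunk's offset to its scanned elements (parallelizable)
    std::vector<int> two_level_exclusive_scan(
        std::vector<int> const& in, std::size_t chunk, int init)
    {
        std::vector<int> out(in.size());
        std::size_t const nchunks = (in.size() + chunk - 1) / chunk;
        std::vector<int> totals(nchunks), offsets(nchunks);

        // f1: exclusive scan of each chunk starting from 0, remember its sum
        for (std::size_t c = 0; c != nchunks; ++c)
        {
            std::size_t const b = c * chunk;
            std::size_t const e = std::min(in.size(), b + chunk);
            int sum = 0;
            for (std::size_t i = b; i != e; ++i)
            {
                out[i] = sum;
                sum += in[i];
            }
            totals[c] = sum;
        }

        // f2: sequential carry propagation across the chunk totals
        int carry = init;
        for (std::size_t c = 0; c != nchunks; ++c)
        {
            offsets[c] = carry;
            carry += totals[c];
        }

        // f3: add the carry to every element of its chunk; for the first
        // chunk the offset is just init, so f1 and f3 could be fused there
        for (std::size_t c = 0; c != nchunks; ++c)
        {
            std::size_t const b = c * chunk;
            std::size_t const e = std::min(in.size(), b + chunk);
            for (std::size_t i = b; i != e; ++i)
                out[i] += offsets[c];
        }
        return out;
    }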
<jaafar> If my benchmarking is any clue, we will be waiting for the previous f2 at the end
<jaafar> better to run f3 than f1 in that case, I think
<hkaiser> right, so having f1 run before that might indeed help
<hkaiser> (for the last partition)
<jaafar> cool
<hkaiser> so I take back what I said ;-)
<jaafar> OK whew, glad we're on the same page!
<hkaiser> thanks jaafar!
jaafar has quit [Quit: Konversation terminated!]
jaafar has joined #ste||ar