hkaiser changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/ | GSoC: https://github.com/STEllAR-GROUP/hpx/wiki/Google-Summer-of-Code-%28GSoC%29-2020
<nikunj> hkaiser: idle rates are low and the grain size of the parallel for loop is ~250ms. This work is divided among up to 4 threads.
<nikunj> Also, it's only a small amount of memory being sent. A single double, to be precise
<hkaiser> ok
<nikunj> A separate program doing a send/receive over 1000 iterations shows a single thread completing in 0.2s while 64 threads take 7+s
<nikunj> It is just a for loop with 1000 iterations, with one set and one get call per iteration.
<nikunj> I noticed this behavior while testing on an rpi4, where the 1-thread times were significantly lower than the 4-thread times for sending and receiving the data.
<nikunj> hkaiser: is this expected behavior?
<hkaiser> nikunj: and the idle rates are still low when running on 64 threads?
<nikunj> Yes
<hkaiser> I don't see how this can happen, frankly
<hkaiser> it should spend the 6.8s somewhere
<hkaiser> do all of the threads work with the same channel instance?
<nikunj> Yes
<nikunj> It's a single instance created on every node
<hkaiser> so you have very high contention on the channel
<nikunj> How would that affect if I increase the number of threads?
<nikunj> It is a single instance per node
nikunj97 has joined #ste||ar
<nikunj97> the code simply records the time for sending/receiving data items over 1000 iterations. For 1 thread the results are great, but for 64 threads they are horrible
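A minimal sketch of the kind of timing loop being described here, assuming exactly two localities, illustrative channel names, and that the program is launched with --hpx:run-hpx-main so main() runs on every locality (this is not the file nikunj shared):

```cpp
#include <hpx/hpx_main.hpp>
#include <hpx/hpx.hpp>
#include <hpx/include/lcos.hpp>

#include <chrono>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <string>

int main()
{
    std::uint32_t const here = hpx::get_locality_id();
    std::uint32_t const there = (here + 1) % 2;    // the other locality

    // one channel owned by this locality (it sends through this one) ...
    hpx::lcos::channel<double> send(hpx::find_here());
    send.register_as("/benchmark/channel/" + std::to_string(here));

    // ... and a client connected to the neighbor's channel (it receives from this one)
    hpx::lcos::channel<double> recv;
    recv.connect_to("/benchmark/channel/" + std::to_string(there));

    auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i != 1000; ++i)
    {
        send.set(3.14);                             // one set ...
        double d = recv.get(hpx::launch::sync);     // ... and one get per iteration
        (void) d;
    }
    std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;
    std::cout << "locality " << here << ": " << elapsed.count() << "s\n";

    return 0;
}
```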
<hkaiser> nikunj: the channel uses spinlocks to protect the data
<hkaiser> if you hit it with 64 threads concurrently none of the threads can make progress in the end
<hkaiser> typical livelock situation
<nikunj> Aha
<hkaiser> use one channel per core
<nikunj> But I really want to use just one per node
<nikunj> The communication is from one node to another and not one thread to another
<nikunj> hkaiser: btw the program runs with 64 hpx threads, but it's 1 thread that sets it and 1 thread that receives it. All other 63 threads are idling at that point.
<hkaiser> so why is the idle rate low if this is the case?
<nikunj> I'm talking about the example code I shared
<nikunj> If you can explain why I see the behavior in this minimal example, I will be able to extrapolate it to my actual code.
<hkaiser> nikunj: can I look tomorrow? I can't do it right now
<nikunj> Sure!
weilewei has quit [Ping timeout: 245 seconds]
Nikunj__ has joined #ste||ar
nikunj97 has quit [Ping timeout: 260 seconds]
hkaiser has quit [Quit: bye]
kale[m] has quit [Ping timeout: 260 seconds]
kale[m] has joined #ste||ar
Nikunj__ has quit [Ping timeout: 265 seconds]
akheir has quit [Quit: Leaving]
kale[m] has quit [Ping timeout: 244 seconds]
kale[m] has joined #ste||ar
bita__ has quit [Ping timeout: 244 seconds]
bita__ has joined #ste||ar
bita__ has quit [Ping timeout: 260 seconds]
nan11 has quit [Ping timeout: 245 seconds]
nikunj97 has joined #ste||ar
Nikunj__ has joined #ste||ar
nikunj97 has quit [Ping timeout: 240 seconds]
kale[m] has quit [Ping timeout: 264 seconds]
kale[m] has joined #ste||ar
kale[m] has quit [Ping timeout: 244 seconds]
kale[m] has joined #ste||ar
Nikunj__ is now known as nikunj97
<nikunj97> ms[m], is there a function that translates ranks to hpx::id_type?
<nikunj97> For example, let's say my locality rank is 0, I want the id of the locality with rank 1
<ms[m]> nikunj97: don't know, but I don't think so
<ms[m]> maybe index into the vector returned by find_all_localities?
<nikunj97> ms[m], so what is the standard way to communicate with my neighboring rank?
<nikunj97> does find_all_localities sort by rank?
<ms[m]> jbjnr: heller ^?
<ms[m]> not sure if the contents of that vector are actually guaranteed to be ordered the same way on all ranks
<ms[m]> I don't know :)
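Since there does not seem to be a ready-made rank-to-id function, one possible workaround along the lines of the find_all_localities() suggestion is sketched below; locality_from_rank is a hypothetical helper, and it matches locality ids explicitly rather than relying on the ordering of the returned vector:

```cpp
#include <hpx/hpx.hpp>

#include <cstdint>
#include <vector>

// hypothetical helper: find the locality whose locality id ("rank") matches,
// instead of assuming find_all_localities() is sorted by rank
hpx::id_type locality_from_rank(std::uint32_t rank)
{
    std::vector<hpx::id_type> localities = hpx::find_all_localities();
    for (hpx::id_type const& id : localities)
    {
        if (hpx::naming::get_locality_id_from_id(id) == rank)
            return id;
    }
    return hpx::naming::invalid_id;    // no locality with that rank
}

// e.g. the locality "to the right" of this one:
//   hpx::id_type right = locality_from_rank(hpx::get_locality_id() + 1);
```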
mcopik has joined #ste||ar
mcopik has quit [Client Quit]
wladimir[m] has joined #ste||ar
hkaiser has joined #ste||ar
kale_ has joined #ste||ar
<nikunj97> hkaiser, yt?
<hkaiser> here
<nikunj97> hkaiser, good time to talk about lcos channels?
<hkaiser> I have a couple of minutes, yes
<nikunj97> when I run it with hpx:threads=1 it runs in no time, but when I run it with 64 threads, it takes seconds to complete. This is with 2 nodes
<hkaiser> where is that communicator defined?
<nikunj97> it's in the file above
<hkaiser> as said yesterday, you have a lot of contention on the channel instances
<nikunj97> how so? and how do I mitigate it?
<hkaiser> hold on
<hkaiser> one thread is doing all the work and 63 cores are idle?
<nikunj97> yes
<nikunj97> that's what the program is intended to do
<hkaiser> so why did you tell me you have a low idle rate, then?
<nikunj97> it was with a different program. I found out this was the issue so I narrowed it down to this example
<hkaiser> this example has 100% idle-rate
<nikunj97> and this example has the same behavior. When I run with 1 thread, it runs perfectly fine. But with 64 threads it takes forever
<nikunj97> yes, this example is only meant to do send/receive from one thread on all nodes
<nikunj97> one node sends it, and the other receives it
<nikunj97> it should technically not depend on the no. of threads per node
<nikunj97> but it does and it takes significantly higher execution time with more no. of threads per node
<hkaiser> sure
<hkaiser> you have 64 cores fighting over scraps of work
<hkaiser> that means constant cache disruption on all levels
<nikunj97> are channels not bound to a single thread?
<hkaiser> well nothing is bound to a particular thread, but that's not the issue
<nikunj97> what do channels do in order to send/receive then?
<hkaiser> 63 threads have no work in their queues, so they try to steal work which disrupts the caches of every other thread, constantly
<nikunj97> aha
<nikunj97> so that channel work is being demanded on all threads then
<hkaiser> you're seeing the effect of what happens if caches are disabled, essentially
<hkaiser> no
<nikunj97> how do I mitigate that?
<hkaiser> the channel is inconsequential to the outcome, this will happen with any other code that has no parallelism
<hkaiser> how to mitigate? add parallelism
<hkaiser> give work to the starving cores
<nikunj97> I did add .then to the get in the actual code so that other cores could work on other tasks, but it didn't really improve the situation
<hkaiser> nikunj: do you have sufficient work to keep all cores busy?
<nikunj97> yes, I do
<nikunj97> in the actual application, I do
<hkaiser> how many partitions did you run this with?
<nikunj97> I ran it on 1,2,4 nodes each with 64 threads and 4.8 billion stencil points
<hkaiser> ok
<nikunj97> it takes 95s on single node
<hkaiser> but all of this still uses only 2 channels?
<nikunj97> no, on 1 node it doesn't use any channels
<nikunj97> on 2 nodes, each node has 4 channels, 2 receive and 2 send
<nikunj97> same for 4 nodes
<hkaiser> so you have one partition per node?
<nikunj97> every node has 4 channels, 2 listen for incoming data and 2 are used to send data
<nikunj97> no I have 1024 partitions per node
<nikunj97> number of partitions is set with Nlp i.e. number of local partitions
<hkaiser> how do the partitions communicate amongst themselves?
<hkaiser> I think each partition has to have 4 channels
<nikunj97> partitions are on the same node. They have all the data locally so they don't communicate using channels.
<nikunj97> I guess you're comparing it with stencil 8
<hkaiser> how do they communicate, then?
<nikunj97> it is slightly different.
<nikunj97> So each node has a partition of the actual array.
<nikunj97> each node has local partitions
<nikunj97> so there is no need for explicit communication between local partitions. For accessing elements from neighboring nodes, there are 4 channels in place
<nikunj97> in each iteration only the first and last element is shared through the channel to the left and right neighbors
<hkaiser> ok, understood
<nikunj97> therefore, only 4 channels are required for each node instead of for each local partition. Now, while the local partitions are iterating over the time step, we wait to receive data from the neighbors. We expect to have received the data before we're done with the local iteration. Once the local iteration ends, we take care of the first and last element
<nikunj97> There is enough parallelism and the time taken for that parallel_for loop is in the range of 100-250ms. This should be enough to get data from neighbors, but it apparently isn't. It takes another 50-250ms to retrieve data.
<nikunj97> That's why I narrowed it down to lcos channels not working correctly
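A rough sketch of the per-node channel layout being described, with four channel clients per node and only the boundary values crossing node boundaries each iteration; boundary_channels, the channel names, and the periodic neighbor arithmetic are illustrative, not the actual benchmark code:

```cpp
#include <hpx/hpx.hpp>
#include <hpx/include/lcos.hpp>

#include <cstdint>
#include <string>

struct boundary_channels
{
    hpx::lcos::channel<double> send_left, send_right;    // owned by this node
    hpx::lcos::channel<double> recv_left, recv_right;    // owned by the neighbors

    boundary_channels(std::uint32_t rank, std::uint32_t nranks)
    {
        auto name = [](char const* dir, std::uint32_t r) {
            return std::string("/stencil/") + dir + "/" + std::to_string(r);
        };

        // the two channels this node writes its boundary elements into
        send_left = hpx::lcos::channel<double>(hpx::find_here());
        send_left.register_as(name("left", rank));
        send_right = hpx::lcos::channel<double>(hpx::find_here());
        send_right.register_as(name("right", rank));

        // connect to the neighbors' channels: the left neighbor's "right"
        // channel carries the value arriving from the left, and vice versa
        recv_left.connect_to(name("right", (rank + nranks - 1) % nranks));
        recv_right.connect_to(name("left", (rank + 1) % nranks));
    }
};

// per iteration, only the first and last element cross node boundaries:
//   ch.send_left.set(first_element);
//   ch.send_right.set(last_element);
//   double from_left  = ch.recv_left.get(hpx::launch::sync);
//   double from_right = ch.recv_right.get(hpx::launch::sync);
```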
<hkaiser> ok
<hkaiser> I don't see the issue here yet
<nikunj97> where are the distributed lcos channels implemented? Maybe I'll find something there
<hkaiser> module lcos_distributed
<nikunj97> what could be the fault if not channels?
<hkaiser> difficult to tell
<hkaiser> try making the .then(hpx::launch::sync, ...)
<nikunj97> I doubt that will work. Let me try.
<hkaiser> nikunj: the amount of work is minimal and as channels communicate using direct actions (I believe) this would invoke the continuation right away once the parcel was received
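In code, that suggestion roughly amounts to the sketch below: with hpx::launch::sync the continuation runs inline on the thread that makes the future ready instead of being scheduled as a separate task. recv and apply_boundary are placeholder names, not from the actual code:

```cpp
#include <hpx/hpx.hpp>
#include <hpx/include/lcos.hpp>

// hypothetical placeholder for whatever consumes the received boundary value
void apply_boundary(double) {}

hpx::future<void> receive_boundary(hpx::lcos::channel<double>& recv)
{
    // hpx::launch::sync: run the continuation directly on the thread that
    // satisfies the future (the one handling the incoming parcel) instead of
    // spawning a separate HPX task for this tiny piece of work
    return recv.get().then(hpx::launch::sync,
        [](hpx::future<double> value) { apply_boundary(value.get()); });
}
```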
<hkaiser> also, what parcelport do you use?
<hkaiser> mpi or tcp?
<nikunj97> tcp
<hkaiser> k
<hkaiser> what's the performance if you run using 63 cores?
<hkaiser> better?
<nikunj97> it is better, but it doesn't scale as well as it should
<hkaiser> much better?
<nikunj97> well, with x86 I see about 5x difference when going from 1 node to 8
<nikunj97> so I won't call it bad, but the excess time spent in channels can be reduced, as it is taking exorbitantly more time than it should
<nikunj97> I believe I can squeeze out up to 6 to 7x performance for 8 nodes
<hkaiser> I think the channels are a red herring, but pls correct me if I'm wrong
<nikunj97> Not really sure, but I used timers to measure the times
<nikunj97> if they're essentially actions invoked on other nodes, should they not take only a few ms?
<nikunj97> they take 100+ms per set/get cycle and it is bothering me
<nikunj97> if I'm able to solve this issue, I'm certain that we will have an almost linear scaling 1d stencil in our hands that will look good as a benchmark everywhere
<nikunj97> hkaiser, I do have a workaround right now, which is rather bleak. It involves using 2 global variables, one for the leftmost and one for the rightmost element, and then making each node update that value on the other node using plain actions.
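A sketch of that workaround with illustrative names, omitting the signalling one would still need to know when a neighbor has actually delivered a fresh value:

```cpp
#include <hpx/hpx.hpp>

#include <atomic>

// global slots for the values arriving from the left/right neighbor
std::atomic<double> leftmost_value{0.0};
std::atomic<double> rightmost_value{0.0};

void set_left_boundary(double v)  { leftmost_value.store(v); }
void set_right_boundary(double v) { rightmost_value.store(v); }

HPX_PLAIN_ACTION(set_left_boundary, set_left_boundary_action);
HPX_PLAIN_ACTION(set_right_boundary, set_right_boundary_action);

// each node pushes its first/last element into its neighbors' slots:
//   hpx::apply(set_right_boundary_action{}, left_neighbor_id,  first_element);
//   hpx::apply(set_left_boundary_action{},  right_neighbor_id, last_element);
```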
<nikunj97> btw .then(hpx::launch::sync, ...) didn't work :/
<nikunj97> hkaiser, one last thing. Is there a function that maps ranks to hpx::id_type?
<nikunj97> For example, let's say my locality rank is 0, I want the id of the locality with rank 1
<hkaiser> nikunj97: what does that mean: 'it didn't work'?
<nikunj97> the execution time didn't decrease
<hkaiser> ok
<nikunj97> it is about the same, like a 1s difference in a 95s run
<nikunj97> I'd call it noise more or less
<hkaiser> btw channels use nothing else but actions to communicate
<hkaiser> ok, just wanted to make sure that the continuation is not stalled
<nikunj97> I'm trying the hack I just talked about. If that works, we'll have to dig into channels. Otherwise, I believe it has to do with something else.
<hkaiser> k
<hkaiser> good luck
nan11 has joined #ste||ar
karame_ has joined #ste||ar
_kale_ has joined #ste||ar
kale[m] has quit [Ping timeout: 264 seconds]
kale[m] has joined #ste||ar
kale_ has quit [Ping timeout: 265 seconds]
weilewei has joined #ste||ar
akheir has joined #ste||ar
bita__ has joined #ste||ar
<nikunj97> hkaiser, no improvements using explicit actions.
<hkaiser> so it's not the channel, as expected
<nikunj97> yes. Where should I look at now?
<hkaiser> profiling?
<nikunj97> ughh...
<hkaiser> :D
<nikunj97> hkaiser, before I get into profiling. Why did you say it is due to contention?
nan11 has quit [Remote host closed the connection]
<nikunj97> because you thought the grain size is too small?
<hkaiser> in your small example it's contention, the actual app seems to have sufficient parallelism, so it's probably ok
<nikunj97> hkaiser, I was able to improve the code further! The set function was launching asynchronously and had no grain size. So I made that sync. Furthermore, I increased the grain size of .then to allow for better scheduling. Results: 1 node 95s, 8 nodes 15s
<nikunj97> now that's what I call linear scaling :D
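Roughly what that fix amounts to, as a sketch; it assumes the channel client accepts a sync launch policy on set(), mirroring get(hpx::launch::sync), and reuses the per-node channels from the earlier sketch:

```cpp
#include <hpx/hpx.hpp>
#include <hpx/include/lcos.hpp>

// exchange one boundary value: send synchronously so no extra task is spawned
// for the tiny set, and it cannot linger behind other work in a scheduler queue
double exchange_with_right(hpx::lcos::channel<double>& send_right,
    hpx::lcos::channel<double>& recv_left, double last_element)
{
    // assumed overload: set() taking a sync launch policy, like get() does
    send_right.set(hpx::launch::sync, last_element);
    return recv_left.get(hpx::launch::sync);
}
```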
nanmiao has joined #ste||ar
weilewei has quit [Remote host closed the connection]
<hkaiser> nice
nanmiao has quit [Remote host closed the connection]
Nanmiao11 has joined #ste||ar
weilewei has joined #ste||ar
_kale_ has quit [Quit: Leaving]
sayefsakin has joined #ste||ar
Nanmiao11 has quit [Remote host closed the connection]
nan666 has joined #ste||ar
weilewei has quit [Remote host closed the connection]
nikunj97 has quit [Read error: Connection reset by peer]
mcopik has joined #ste||ar
mcopik has quit [Client Quit]
<hkaiser> nikunj: I talked to Kevin, he's aware of you wanting to use APEX for the perf analysis, he said to just get in contact
<nikunj> hkaiser: thanks!
nikunj has quit [Read error: Connection reset by peer]
kale[m] has quit [Ping timeout: 265 seconds]
nikunj has joined #ste||ar
kale[m] has joined #ste||ar
nikunj has quit [Ping timeout: 264 seconds]
karame_ has quit [Remote host closed the connection]
nan666 has quit [Remote host closed the connection]
nanmiao99 has joined #ste||ar
nikunj has joined #ste||ar
weilewei has joined #ste||ar
kale[m] has quit [Ping timeout: 260 seconds]
kale[m] has joined #ste||ar
kale[m] has quit [Ping timeout: 258 seconds]
kale[m] has joined #ste||ar
kale[m] has quit [Ping timeout: 240 seconds]
kale[m] has joined #ste||ar