hkaiser changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/ | GSoC: https://github.com/STEllAR-GROUP/hpx/wiki/Google-Summer-of-Code-%28GSoC%29-2020
<nikunj> hkaiser: idle rates are low and the grain size of the parallel for loop is ~250ms. This work is divided among up to 4 threads.
<nikunj> Also, it's only a small amount of memory being sent. A single double, to be precise
<hkaiser> ok
<nikunj> A separate program doing a send/receive over 1000 iterations shows a single thread completing in 0.2s while 64 threads take 7+s
<nikunj> It is just a for loop with 1000 iterations, with one set and one get call per iteration.
<nikunj> I noticed this behavior while testing on an rpi4, where the 1-thread times were significantly lower than the 4-thread times for sending and receiving the data.
<nikunj> hkaiser: is this expected behavior?
<hkaiser> nikunj: and the idle rates are still low when running on 64 threads?
<nikunj> Yes
<hkaiser> I don't see how this can happen, frankly
<hkaiser> it should spend the 6.8s somewhere
<hkaiser> do all of the threads work with the same channel instance?
<nikunj> Yes
<nikunj> It's a single instance created on every node
<hkaiser> so you have very high contention on the channel
<nikunj> How would that affect if I increase the number of threads?
<nikunj> It is a single instance per node
nikunj97 has joined #ste||ar
<nikunj97> the code simply records the time for sending/receiving data items over 1000 iterations. For 1 thread the results are great, but for 64 threads they are horrible
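A minimal sketch of the kind of timing loop being described here, assuming exactly two localities, illustrative channel names, and that the program is launched with --hpx:run-hpx-main so main() runs on every locality (this is not the file nikunj shared):

```cpp
#include <hpx/hpx_main.hpp>
#include <hpx/hpx.hpp>
#include <hpx/include/lcos.hpp>

#include <chrono>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <string>

int main()
{
    std::uint32_t const here = hpx::get_locality_id();
    std::uint32_t const there = (here + 1) % 2;    // the other locality

    // one channel owned by this locality (it sends through this one) ...
    hpx::lcos::channel<double> send(hpx::find_here());
    send.register_as("/benchmark/channel/" + std::to_string(here));

    // ... and a client connected to the neighbor's channel (it receives from this one)
    hpx::lcos::channel<double> recv;
    recv.connect_to("/benchmark/channel/" + std::to_string(there));

    auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i != 1000; ++i)
    {
        send.set(3.14);                             // one set ...
        double d = recv.get(hpx::launch::sync);     // ... and one get per iteration
        (void) d;
    }
    std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;
    std::cout << "locality " << here << ": " << elapsed.count() << "s\n";

    return 0;
}
```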
<hkaiser> nikunj: the channel uses spinlocks to protect the data
<hkaiser> if you hit it with 64 threads concurrently none of the threads can make progress in the end
<hkaiser> typical livelock situation
<nikunj> Aha
<hkaiser> use one channel per core
<nikunj> But I really want to use just one per node
<nikunj> The communication is from one node to another and not one thread to another
<nikunj> hkaiser: btw the program runs with 64 hpx threads, but it's 1 thread that sets it and 1 thread that receives it. All other 63 threads are idling at that point.
<hkaiser> so why is the idle rate low if this is the case?
<nikunj> I'm talking about the example code I shared
<nikunj> If you can explain why I see the behavior in this minimal example, I will be able to extrapolate it to my actual code.
<hkaiser> nikunj: can I look tomorrow? I can't do it right now
<nikunj> Sure!
weilewei has quit [Ping timeout: 245 seconds]
Nikunj__ has joined #ste||ar
nikunj97 has quit [Ping timeout: 260 seconds]
hkaiser has quit [Quit: bye]
kale[m] has quit [Ping timeout: 260 seconds]
kale[m] has joined #ste||ar
Nikunj__ has quit [Ping timeout: 265 seconds]
akheir has quit [Quit: Leaving]
kale[m] has quit [Ping timeout: 244 seconds]
kale[m] has joined #ste||ar
bita__ has quit [Ping timeout: 244 seconds]
bita__ has joined #ste||ar
bita__ has quit [Ping timeout: 260 seconds]
nan11 has quit [Ping timeout: 245 seconds]
nikunj97 has joined #ste||ar
Nikunj__ has joined #ste||ar
nikunj97 has quit [Ping timeout: 240 seconds]
kale[m] has quit [Ping timeout: 264 seconds]
kale[m] has joined #ste||ar
kale[m] has quit [Ping timeout: 244 seconds]
kale[m] has joined #ste||ar
Nikunj__ is now known as nikunj97
<nikunj97> ms[m], is there a function that translates ranks to hpx::id_type?
<nikunj97> For example, let's say my locality rank is 0, I want the id of the locality with rank 1
<ms[m]> nikunj97: don't know, but I don't think so
<ms[m]> maybe index into the vector returned by find_all_localities?
<nikunj97> ms[m], so what is the standard way to communicate with my neighboring rank?
<nikunj97> does find_all_localities sort by rank?
<ms[m]> jbjnr: heller ^?
<ms[m]> not sure if the contents of that vector are actually guaranteed to be ordered the same way on all ranks
<ms[m]> I don't know :)
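Since there does not seem to be a ready-made rank-to-id function, one possible workaround along the lines of the find_all_localities() suggestion is sketched below; locality_from_rank is a hypothetical helper, and it matches locality ids explicitly rather than relying on the ordering of the returned vector:

```cpp
#include <hpx/hpx.hpp>

#include <cstdint>
#include <vector>

// hypothetical helper: find the locality whose locality id ("rank") matches,
// instead of assuming find_all_localities() is sorted by rank
hpx::id_type locality_from_rank(std::uint32_t rank)
{
    std::vector<hpx::id_type> localities = hpx::find_all_localities();
    for (hpx::id_type const& id : localities)
    {
        if (hpx::naming::get_locality_id_from_id(id) == rank)
            return id;
    }
    return hpx::naming::invalid_id;    // no locality with that rank
}

// e.g. the locality "to the right" of this one:
//   hpx::id_type right = locality_from_rank(hpx::get_locality_id() + 1);
```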
mcopik has joined #ste||ar
mcopik has quit [Client Quit]
wladimir[m] has joined #ste||ar
hkaiser has joined #ste||ar
kale_ has joined #ste||ar
<nikunj97> hkaiser, yt?
<hkaiser> here
<nikunj97> hkaiser, good time to talk about lcos channels?
<hkaiser> I have a couple of minutes, yes
<nikunj97> when I run it with hpx:threads=1 it runs in no time, but when I run it with 64 threads, it takes seconds to complete. This is with 2 nodes
<hkaiser> where is that communicator defined?
<nikunj97> it's in the file above
<hkaiser> as said yesterday, you have a lot of contention on the channel instances
<nikunj97> how so? and how do I mitigate it?
<hkaiser> hold on
<hkaiser> one thread is doing all the work and 63 cores are idle?
<nikunj97> yes
<nikunj97> that's what the program is intended to do
<hkaiser> so why did you tell me you have a low idle rate, then?
<nikunj97> it was with a different program. I found out this was the issue so I narrowed it down to this example
<hkaiser> this example has 100% idle-rate
<nikunj97> and this example has the same behavior. When I run with 1 thread, it runs perfectly fine. But with 64 threads it takes forever
<nikunj97> yes, this example is only meant to do send/receive from one thread on all nodes
<nikunj97> one node sends it, and the other receives it
<nikunj97> it should technically not depend on the no. of threads per node
<nikunj97> but it does and it takes significantly higher execution time with more no. of threads per node
<hkaiser> sure
<hkaiser> you have 64 cores fighting over scraps of work
<hkaiser> that means constant cache disruption on all levels
<nikunj97> are channels not bound to a single thread?
<hkaiser> well nothing is bound to a particular thread, but that's not the issue
<nikunj97> what do channels do in order to send/receive then?
<hkaiser> 63 threads have no work in their queues, so they try to steal work which disrupts the caches of every other thread, constantly
<nikunj97> aha
<nikunj97> so that channel work is being demanded on all threads then
<hkaiser> you're seeing the effect of what happens if caches are disabled, essentially
<hkaiser> no
<nikunj97> how do I mitigate that?
<hkaiser> the channel is inconsequential to the outcome, this will happen with any other code that has no parallelism
<hkaiser> how to mitigate? add parallelism
<hkaiser> give work to the starving cores
<nikunj97> I did add .then to the get in the actual code so that other cores could work on other tasks, but it didn't really improve the situation
<hkaiser> nikunj: do you have sufficient work to keep all cores busy?
<nikunj97> yes, I do
<nikunj97> in the actual application, I do
<hkaiser> how many partitions did you run this with?
<nikunj97> I ran it on 1,2,4 nodes each with 64 threads and 4.8 billion stencil points
<hkaiser> ok
<nikunj97> it takes 95s on single node
<hkaiser> but all of this still uses only 2 channels?
<nikunj97> no, on 1 node it doesn't use any channels
<nikunj97> on 2 nodes, each node has 4 channels, 2 receive and 2 send
<nikunj97> same for 4 nodes
<hkaiser> so you have one partition per node?
<nikunj97> every node has 4 channels, 2 listen for incoming data and 2 are used to send data
<nikunj97> no I have 1024 partitions per node
<nikunj97> number of partitions is set with Nlp i.e. number of local partitions
<hkaiser> how do the partitions communicate amongst themselves?
<hkaiser> I think each partition has to have 4 channels
<nikunj97> partitions are on the same node. They have all the data locally so they don't communicate using channels.
<nikunj97> I guess you're comparing it with stencil 8
<hkaiser> how do they communicate, then?
<nikunj97> it is slightly different.
<nikunj97> So each node has a partition of the actual array.
<nikunj97> each node has local partitions
<nikunj97> so there is no need for explicit communication between local partitions. For accessing elements from neighboring nodes, there are 4 channels in place
<nikunj97> in each iteration only the first and last element is shared through the channel to the left and right neighbors
<hkaiser> ok, understood
<nikunj97> therefore, only 4 channels are required for each node instead of for each local partition. Now, while the local partitions are iterating over the time step, we wait to receive data from the neighbors. We expect to have received the data before we're done with the local iteration. Once the local iteration ends, we take care of the first and last element
<nikunj97> There is enough parallelism and the time taken for that parallel_for loop is in the range of 100-250ms. This should be enough to get data from neighbors, but it apparently isn't. It takes another 50-250ms to retrieve data.
<nikunj97> That's why I narrowed it down to lcos channels not working correctly
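A rough sketch of the per-node channel layout being described, with four channel clients per node and only the boundary values crossing node boundaries each iteration; boundary_channels, the channel names, and the periodic neighbor arithmetic are illustrative, not the actual benchmark code:

```cpp
#include <hpx/hpx.hpp>
#include <hpx/include/lcos.hpp>

#include <cstdint>
#include <string>

struct boundary_channels
{
    hpx::lcos::channel<double> send_left, send_right;    // owned by this node
    hpx::lcos::channel<double> recv_left, recv_right;    // owned by the neighbors

    boundary_channels(std::uint32_t rank, std::uint32_t nranks)
    {
        auto name = [](char const* dir, std::uint32_t r) {
            return std::string("/stencil/") + dir + "/" + std::to_string(r);
        };

        // the two channels this node writes its boundary elements into
        send_left = hpx::lcos::channel<double>(hpx::find_here());
        send_left.register_as(name("left", rank));
        send_right = hpx::lcos::channel<double>(hpx::find_here());
        send_right.register_as(name("right", rank));

        // connect to the neighbors' channels: the left neighbor's "right"
        // channel carries the value arriving from the left, and vice versa
        recv_left.connect_to(name("right", (rank + nranks - 1) % nranks));
        recv_right.connect_to(name("left", (rank + 1) % nranks));
    }
};

// per iteration, only the first and last element cross node boundaries:
//   ch.send_left.set(first_element);
//   ch.send_right.set(last_element);
//   double from_left  = ch.recv_left.get(hpx::launch::sync);
//   double from_right = ch.recv_right.get(hpx::launch::sync);
```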
<hkaiser> ok
<hkaiser> I don't see the issue here yet
<nikunj97> where are the distributed lcos channels implemented? Maybe I'll find something there
<hkaiser> module lcos_distributed
<nikunj97> what could be the fault if not channels?
<hkaiser> difficult to tell
<hkaiser> try making the .then(hpx::launch::sync, ...)
<nikunj97> I doubt that will work. Let me try.
<hkaiser> nikunj: the amount of work is minimal and as channels communicate using direct actions (I believe) this would invoke the continuation right away once the parcel was received
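In code, that suggestion roughly amounts to the sketch below: with hpx::launch::sync the continuation runs inline on the thread that makes the future ready instead of being scheduled as a separate task. recv and apply_boundary are placeholder names, not from the actual code:

```cpp
#include <hpx/hpx.hpp>
#include <hpx/include/lcos.hpp>

// hypothetical placeholder for whatever consumes the received boundary value
void apply_boundary(double) {}

hpx::future<void> receive_boundary(hpx::lcos::channel<double>& recv)
{
    // hpx::launch::sync: run the continuation directly on the thread that
    // satisfies the future (the one handling the incoming parcel) instead of
    // spawning a separate HPX task for this tiny piece of work
    return recv.get().then(hpx::launch::sync,
        [](hpx::future<double> value) { apply_boundary(value.get()); });
}
```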
<hkaiser> also, what parcelport do you use?
<hkaiser> mpi or tcp?
<nikunj97> tcp
<hkaiser> k
<hkaiser> what's the performance if you run using 63 cores?
<hkaiser> better?
<nikunj97> it is better, but it doesn't scale as well as it should
<hkaiser> much better?
<nikunj97> well, with x86 I see about 5x difference when going from 1 node to 8
<nikunj97> so I won't call it bad, but the excess time spent in channels can be reduced, as it is taking exorbitantly more time than it should
<nikunj97> I believe I can squeeze out up to 6 to 7x performance for 8 nodes
<hkaiser> I think the channels are a red herring, but pls correct me if I'm wrong
<nikunj97> Not really sure, but I used timers to measure the times
<nikunj97> if they're essentially actions invoked on other nodes, should they not take only a few ms?
<nikunj97> they take 100+ms per set/get cycle and it is bothering me
<nikunj97> if I'm able to solve this issue, I'm certain that we will have an almost linear scaling 1d stencil in our hands that will look good as a benchmark everywhere
<nikunj97> hkaiser, I do have a workaround right now, which is rather bleak. It involves using 2 global variables, one for the leftmost and one for the rightmost element, and then making each node update that value on the other node using plain actions.
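A sketch of that workaround with illustrative names, omitting the signalling one would still need to know when a neighbor has actually delivered a fresh value:

```cpp
#include <hpx/hpx.hpp>

#include <atomic>

// global slots for the values arriving from the left/right neighbor
std::atomic<double> leftmost_value{0.0};
std::atomic<double> rightmost_value{0.0};

void set_left_boundary(double v)  { leftmost_value.store(v); }
void set_right_boundary(double v) { rightmost_value.store(v); }

HPX_PLAIN_ACTION(set_left_boundary, set_left_boundary_action);
HPX_PLAIN_ACTION(set_right_boundary, set_right_boundary_action);

// each node pushes its first/last element into its neighbors' slots:
//   hpx::apply(set_right_boundary_action{}, left_neighbor_id,  first_element);
//   hpx::apply(set_left_boundary_action{},  right_neighbor_id, last_element);
```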
<nikunj97> btw .then(hpx::launch::sync, ...) didn't work :/
<nikunj97> hkaiser, one last thing. Is there a function that maps ranks to hpx::id_type?
<nikunj97> For example, let's say my locality rank is 0, I want the id of the locality with rank 1
<hkaiser> nikunj97: what does that mean: 'it didn't work'?
<nikunj97> the execution time didn't decrease
<hkaiser> ok
<nikunj97> it is about the same, like a 1s difference in a 95s run
<nikunj97> I'd call it noise more or less
<hkaiser> btw channels use nothing else but actions to communicate
<hkaiser> ok, just wanted to make sure that the continuation is not stalled
<nikunj97> I'm trying the hack I just talked about. If that works, we'll have to dig into channels. Otherwise, I believe it has to do with something else.
<hkaiser> k
<hkaiser> good luck
nan11 has joined #ste||ar
karame_ has joined #ste||ar
_kale_ has joined #ste||ar
kale[m] has quit [Ping timeout: 264 seconds]
kale[m] has joined #ste||ar
kale_ has quit [Ping timeout: 265 seconds]
weilewei has joined #ste||ar
akheir has joined #ste||ar
bita__ has joined #ste||ar
<nikunj97> hkaiser, no improvements using explicit actions.
<hkaiser> so it's not the channel, as expected
<nikunj97> yes. Where should I look at now?
<hkaiser> profiling?
<nikunj97> ughh...
<hkaiser> :D
<nikunj97> hkaiser, before I get into profiling. Why did you say it is due to contention?
nan11 has quit [Remote host closed the connection]
<nikunj97> because you thought the grain size is too small?
<hkaiser> in your small example it's contention, the actual app seems to have sufficient parallelism, so it's probably ok
<nikunj97> hkaiser, I was able to improve the code further! The set function was launching asynchronously and had no grain size. So I made that sync. Furthermore, I increased the grain size of .then to allow for better scheduling. Results: 1 node 95s, 8 nodes 15s
<nikunj97> now that's what I call linear scaling :D
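Roughly what that fix amounts to, as a sketch; it assumes the channel client accepts a sync launch policy on set(), mirroring get(hpx::launch::sync), and reuses the per-node channels from the earlier sketch:

```cpp
#include <hpx/hpx.hpp>
#include <hpx/include/lcos.hpp>

// exchange one boundary value: send synchronously so no extra task is spawned
// for the tiny set, and it cannot linger behind other work in a scheduler queue
double exchange_with_right(hpx::lcos::channel<double>& send_right,
    hpx::lcos::channel<double>& recv_left, double last_element)
{
    // assumed overload: set() taking a sync launch policy, like get() does
    send_right.set(hpx::launch::sync, last_element);
    return recv_left.get(hpx::launch::sync);
}
```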
nanmiao has joined #ste||ar
weilewei has quit [Remote host closed the connection]
<hkaiser> nice
nanmiao has quit [Remote host closed the connection]
Nanmiao11 has joined #ste||ar
weilewei has joined #ste||ar
_kale_ has quit [Quit: Leaving]
sayefsakin has joined #ste||ar
Nanmiao11 has quit [Remote host closed the connection]
nan666 has joined #ste||ar
weilewei has quit [Remote host closed the connection]
nikunj97 has quit [Read error: Connection reset by peer]
mcopik has joined #ste||ar
mcopik has quit [Client Quit]
<hkaiser> nikunj: I talked to Kevin, he's aware of you wanting to use APEX for the perf analysis, he said to just get in contact
<nikunj> hkaiser: thanks!
nikunj has quit [Read error: Connection reset by peer]
kale[m] has quit [Ping timeout: 265 seconds]
nikunj has joined #ste||ar
kale[m] has joined #ste||ar
nikunj has quit [Ping timeout: 264 seconds]
karame_ has quit [Remote host closed the connection]
nan666 has quit [Remote host closed the connection]
nanmiao99 has joined #ste||ar
nikunj has joined #ste||ar
weilewei has joined #ste||ar
kale[m] has quit [Ping timeout: 260 seconds]
kale[m] has joined #ste||ar
kale[m] has quit [Ping timeout: 258 seconds]
kale[m] has joined #ste||ar
kale[m] has quit [Ping timeout: 240 seconds]
kale[m] has joined #ste||ar