<nikunj>
hkaiser: idle rates are low and the grain size of the parallel for loop is in the ~250ms range. This work is divided across up to 4 threads.
<nikunj>
Also, it is about memory: sending a double, to be precise.
<hkaiser>
ok
<nikunj>
A separate program with send/receive over 1000 iterations shows that a single thread completes in 0.2s while 64 threads take 7+s.
<nikunj>
It is just a for loop with 1000 iterations, with one set and one get call per iteration.
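A minimal sketch of what such a micro-benchmark might look like (hypothetical, not the exact code discussed here): it times 1000 set/get round trips on a single hpx::lcos::channel<double>. Header paths vary between HPX versions; newer releases also provide hpx/channel.hpp.

    #include <hpx/hpx_main.hpp>
    #include <hpx/include/lcos.hpp>
    #include <chrono>
    #include <cstddef>
    #include <iostream>

    int main()
    {
        // one channel instance living on this locality
        hpx::lcos::channel<double> c(hpx::find_here());

        auto start = std::chrono::steady_clock::now();
        for (std::size_t i = 0; i != 1000; ++i)
        {
            c.set(3.14);                          // one set ...
            double d = c.get(hpx::launch::sync);  // ... and one get per iteration
            (void) d;
        }
        std::chrono::duration<double> elapsed =
            std::chrono::steady_clock::now() - start;
        std::cout << "elapsed: " << elapsed.count() << " s\n";
        return 0;
    }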
<nikunj>
I noticed this behavior while testing on an rpi4, where 1-thread times were significantly lower than 4-thread times for sending and receiving the data.
<nikunj>
hkaiser: is this an expected behavior?
<hkaiser>
nikunj: and the idle rates are still low when running on 64 threads?
<nikunj>
Yes
<hkaiser>
I don't see how this can happen, frankly
<hkaiser>
it should spend the 6.8s somewhere
<hkaiser>
do all of the threads work with the same channel instance?
<nikunj>
Yes
<nikunj>
It's a single instance created on every node
<hkaiser>
so you have very high contention on the channel
<nikunj>
How would that be affected if I increase the number of threads?
<nikunj97>
The code simply records the time for sending/receiving data items over 1000 iterations. For 1 thread the results are great, but for 64 threads they are horrible.
<hkaiser>
nikunj: the channel uses spinlocks to protect the data
<hkaiser>
if you hit it with 64 threads concurrently none of the threads can make progress in the end
<hkaiser>
a typical livelock situation
<nikunj>
Aha
<hkaiser>
use one channel per core
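A hedged sketch of that suggestion (names and structure are illustrative only): give every HPX worker thread its own channel so the spinlock protecting the channel data is not shared by all 64 threads.

    #include <hpx/include/lcos.hpp>
    #include <hpx/include/runtime.hpp>
    #include <cstddef>
    #include <vector>

    std::vector<hpx::lcos::channel<double>> make_per_core_channels()
    {
        std::vector<hpx::lcos::channel<double>> channels;
        std::size_t const num_threads = hpx::get_os_thread_count();
        channels.reserve(num_threads);
        for (std::size_t i = 0; i != num_threads; ++i)
            channels.emplace_back(hpx::find_here());   // one channel per worker thread
        return channels;
    }

    // each worker thread then only ever touches its own channel:
    //   auto& my_channel = channels[hpx::get_worker_thread_num()];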
<nikunj>
But I really want to use just one per node
<nikunj>
The communication is from one node to another and not one thread to another
<nikunj>
hkaiser: btw the program runs with 64 HPX threads, but it's 1 thread that calls set on it and one thread that receives it. All the other 63 threads are using at that point.
<nikunj>
s/using/idling
<hkaiser>
so why is the idle rate low if this is the case?
<nikunj>
I'm talking about the example code I shared
<hkaiser>
how many partitions did you run this with?
<nikunj97>
I ran it on 1,2,4 nodes each with 64 threads and 4.8 billion stencil points
<hkaiser>
ok
<nikunj97>
it takes 95s on single node
<hkaiser>
but all of this still uses only 2 channels?
<nikunj97>
no, on 1 node it doesn't use any channels
<nikunj97>
on 2 nodes, each node has 4 channels, 2 receive and 2 send
<nikunj97>
same for 4 nodes
<hkaiser>
so you have one partition per node?
<nikunj97>
every node has 4 channels: 2 listen for receives and 2 are used to send data
<nikunj97>
no I have 1024 partitions per node
<nikunj97>
The number of partitions is set with Nlp, i.e. the number of local partitions.
<hkaiser>
how do the partitions communicate amongst themselves?
<hkaiser>
I think each partition has to have 4 channels
<nikunj97>
Partitions are on the same node. They have all the data locally, so they don't communicate using channels.
<nikunj97>
I guess you're comparing it with stencil 8
<hkaiser>
how do they communicate, then?
<nikunj97>
it is slightly different.
<nikunj97>
So each node has a partition of the actual array.
<nikunj97>
each node has local partitions
<nikunj97>
So there is no need for explicit communication between local partitions. For accessing elements from neighboring nodes, there are 4 channels in place.
<nikunj97>
In each iteration, only the first and last elements are shared through the channels to the left and right neighbors.
<hkaiser>
ok, understood
<nikunj97>
Therefore, only 4 channels are required per node instead of per local partition. Now, while the local partitions are iterating over the time step, we wait to receive data from the neighbors. We expect to have received the data before we're done with the local iteration. Once the local iteration ends, we take care of the first and last elements.
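A hedged sketch of the per-node channel layout described above (the names and structure are illustrative, not the actual stencil code): every locality registers the two channels it receives on and connects to the matching channels of its left and right neighbours to send its boundary elements.

    #include <hpx/hpx.hpp>
    #include <hpx/include/lcos.hpp>
    #include <cstddef>
    #include <string>

    struct halo_channels
    {
        hpx::lcos::channel<double> recv_left, recv_right;  // we listen on these
        hpx::lcos::channel<double> send_left, send_right;  // neighbours listen on these
    };

    halo_channels setup_channels(std::size_t rank, std::size_t num_localities)
    {
        halo_channels h;

        // channels this node receives on, registered under a well-known name
        h.recv_left = hpx::lcos::channel<double>(hpx::find_here());
        h.recv_right = hpx::lcos::channel<double>(hpx::find_here());
        h.recv_left.register_as("stencil/from_left/" + std::to_string(rank));
        h.recv_right.register_as("stencil/from_right/" + std::to_string(rank));

        // connect to the receive channels of the left and right neighbours
        std::size_t left = (rank + num_localities - 1) % num_localities;
        std::size_t right = (rank + 1) % num_localities;
        h.send_left.connect_to("stencil/from_right/" + std::to_string(left));
        h.send_right.connect_to("stencil/from_left/" + std::to_string(right));
        return h;
    }

    // per iteration t, only the boundary elements travel:
    //   h.send_left.set(first_element, t);
    //   h.send_right.set(last_element, t);
    //   double from_left  = h.recv_left.get(hpx::launch::sync, t);
    //   double from_right = h.recv_right.get(hpx::launch::sync, t);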
<nikunj97>
There is enough parallelism and the time taken for that parallel_for loop is in the range of 100-250ms. This should be enough to get data from neighbors, but it apparently isn't. It takes another 50-250ms to retrieve data.
<nikunj97>
That's why I narrowed it down to lcos channels not working correctly
<hkaiser>
ok
<hkaiser>
I don't see the issue here yet
<nikunj97>
Where are the distributed lcos channels implemented? Maybe I'll find something there.
<hkaiser>
module lcos_distributed
<nikunj97>
what could be at fault, if not the channels?
<hkaiser>
difficult to tell
<hkaiser>
try making the .then(hpx::launch::sync, ...)
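A hedged illustration of that suggestion (recv_left, process_boundary, and the generation counter t are hypothetical names): attach the continuation with a sync launch policy so it runs inline when the value arrives instead of being scheduled as a separate HPX thread.

    #include <hpx/hpx.hpp>
    #include <hpx/include/lcos.hpp>
    #include <cstddef>

    void process_boundary(double);    // hypothetical consumer of the boundary value

    hpx::future<void> receive_boundary(
        hpx::lcos::channel<double>& recv_left, std::size_t t)
    {
        return recv_left.get(t).then(hpx::launch::sync,
            [](hpx::future<double>&& val) {
                // consume the boundary element right away, on the receiving thread
                process_boundary(val.get());
            });
    }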
<nikunj97>
I doubt that will work. Let me try.
<hkaiser>
nikunj: the amount of work is minimal, and as channels communicate using direct actions (I believe), this would invoke the continuation right away once the parcel was received
<hkaiser>
also, what parcelport do you use?
<hkaiser>
mpi or tcp?
<nikunj97>
tcp
<hkaiser>
k
<hkaiser>
what's the performance if you run using 63 cores?
<hkaiser>
better?
<nikunj97>
it is better, but it doesn't scale as well as it should
<hkaiser>
much better?
<nikunj97>
well, with x86 I see about 5x difference when going from 1 node to 8
<nikunj97>
So I wouldn't call it bad, but the excess time spent in the channels can be reduced, as they take exorbitantly longer than they should.
<nikunj97>
I believe I can squeeze out up to 6 to 7x performance for 8 nodes
<hkaiser>
I think the channels are a red herring, but please correct me if I'm wrong
<nikunj97>
Not really sure, but I used two timers to calculate the times
<nikunj97>
if they're essentially actions invoked on other nodes, should they not take only a few ms?
<nikunj97>
they take 100+ms per set/get cycle, and it is bothering me
<nikunj97>
If I'm able to solve this issue, I'm certain that we will have an almost linearly scaling 1d stencil on our hands that will look good as a benchmark everywhere.
<nikunj97>
hkaiser, I do have a workaround right now, which is rather crude. It involves using 2 global variables, one for the leftmost and one for the rightmost element, and then making each node update that value on the other node using plain actions.
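A hedged sketch of that workaround (all names are hypothetical): each node exposes two globals, and the neighbours overwrite them with a plain action instead of going through a channel.

    #include <hpx/hpx.hpp>
    #include <atomic>

    std::atomic<double> leftmost_value{0.0};
    std::atomic<double> rightmost_value{0.0};

    void set_leftmost(double v)  { leftmost_value.store(v); }
    void set_rightmost(double v) { rightmost_value.store(v); }

    HPX_PLAIN_ACTION(set_leftmost, set_leftmost_action);
    HPX_PLAIN_ACTION(set_rightmost, set_rightmost_action);

    // e.g. the right neighbour pushes its first element into our rightmost slot:
    //   hpx::apply(set_rightmost_action{}, left_neighbour_id, first_element);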
<nikunj97>
btw .then(hpx::launch::sync, ...) didn't work :/
<nikunj97>
hkaiser, one last thing. Is there a function that maps ranks to hpx::id_type?
<nikunj97>
For example, let's say my locality rank is 0, I want the id of the locality with rank 1
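One hedged way to do that mapping: index into the vector returned by hpx::find_all_localities(), assuming it is ordered by rank (locality 0 first), which is normally the case.

    #include <hpx/hpx.hpp>
    #include <cstdint>
    #include <vector>

    hpx::id_type locality_from_rank(std::uint32_t rank)
    {
        // assuming find_all_localities() returns the localities ordered by rank
        std::vector<hpx::id_type> localities = hpx::find_all_localities();
        return localities[rank % localities.size()];
    }

    // e.g. from rank 0 (hpx::get_locality_id() == 0), the id of rank 1:
    //   hpx::id_type neighbour = locality_from_rank(1);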
<hkaiser>
nikunj97: what does that mean: 'it didn't work'?
<nikunj97>
the execution time didn't decrease
<hkaiser>
ok
<nikunj97>
it is about the same, like 1s difference in 95s run
<nikunj97>
I'd call it noise more or less
<hkaiser>
btw channels use nothing else but actions to communicate
<hkaiser>
ok, just wanted to make sure that the continuation is not stalled
<nikunj97>
I'm trying the hack I just talked about. If that works, we'll have to dig into the channels. Otherwise, I believe it has to do with something else.
<hkaiser>
k
<hkaiser>
good luck
nan11 has joined #ste||ar
karame_ has joined #ste||ar
_kale_ has joined #ste||ar
kale[m] has quit [Ping timeout: 264 seconds]
kale[m] has joined #ste||ar
kale_ has quit [Ping timeout: 265 seconds]
weilewei has joined #ste||ar
akheir has joined #ste||ar
bita__ has joined #ste||ar
<nikunj97>
hkaiser, no improvements using explicit actions.
<hkaiser>
so it's not the channel, as expected
<nikunj97>
Yes. Where should I look now?
<hkaiser>
profiling?
<nikunj97>
ughh...
<hkaiser>
:D
<nikunj97>
hkaiser, before I get into profiling: why did you say it is due to contention?
nan11 has quit [Remote host closed the connection]
<nikunj97>
Because you thought the grain size was too small?
<hkaiser>
in your small example it's contention, the actual app seems to have sufficient parallelism, so it's probably ok
<nikunj97>
hkaiser, I was able to improve the code further! The set function was launching asynchronously and had no grain size, so I made it sync. Furthermore, I increased the grain size of the .then continuation to allow for better scheduling. Results: 1 node 95s, 8 nodes 15s
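For reference, a hedged illustration of the first change (variable names are hypothetical, and it assumes channel::set accepts a launch policy, as the conversation implies): send the boundary values synchronously so no extra task is spawned per set.

    // assuming set() takes a launch policy; t is the generation/time step
    send_left.set(hpx::launch::sync, first_element, t);
    send_right.set(hpx::launch::sync, last_element, t);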
<nikunj97>
now that's what I call linear scaling :D
nanmiao has joined #ste||ar
weilewei has quit [Remote host closed the connection]
<hkaiser>
nice
nanmiao has quit [Remote host closed the connection]
Nanmiao11 has joined #ste||ar
weilewei has joined #ste||ar
_kale_ has quit [Quit: Leaving]
sayefsakin has joined #ste||ar
Nanmiao11 has quit [Remote host closed the connection]
nan666 has joined #ste||ar
weilewei has quit [Remote host closed the connection]
nikunj97 has quit [Read error: Connection reset by peer]
mcopik has joined #ste||ar
mcopik has quit [Client Quit]
<hkaiser>
nikunj: I talked to Kevin; he's aware that you want to use APEX for the perf analysis and said to just get in contact
<nikunj>
hkaiser: thanks!
nikunj has quit [Read error: Connection reset by peer]
kale[m] has quit [Ping timeout: 265 seconds]
nikunj has joined #ste||ar
kale[m] has joined #ste||ar
nikunj has quit [Ping timeout: 264 seconds]
karame_ has quit [Remote host closed the connection]
nan666 has quit [Remote host closed the connection]