<weilewei>
with the NVLink method, more time is spent in the G4 update kernel, and also more time on memory copies, because we have more peer-to-peer copies
<weilewei>
time spent on memory copies: 50.3% (with NVLink) vs. 2.5% (without NVLink)
<hkaiser>
weilewei: why more memory copy operations? did you say things go directly to the device?
<weilewei>
hkaiser I think when you perform MPI_Isend(), it does a memory copy that copies the remote G2 into the local G2. Also, the ring algorithm requires copying the local G2 into a local send buffer, so more copies happen
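For reference, a minimal sketch of such a ring exchange (hypothetical names and sizes, not the actual DCA++ code): each rank copies its local G2 into a send buffer and then circulates it with MPI_Isend/MPI_Irecv, which is where the extra copies come from.

    // ring_g2_exchange.cpp -- hedged sketch, not the DCA++ implementation
    #include <mpi.h>
    #include <vector>
    #include <cstring>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const int n = 1 << 20;                // illustrative G2 element count
        std::vector<double> g2(n, rank);      // stand-in for the local G2
        std::vector<double> send_buf(n);
        std::vector<double> recv_buf(n);

        int right = (rank + 1) % size;
        int left  = (rank - 1 + size) % size;

        // Extra copy: local G2 -> local send buffer (required by the ring scheme).
        std::memcpy(send_buf.data(), g2.data(), n * sizeof(double));

        for (int step = 0; step < size - 1; ++step) {
            MPI_Request reqs[2];
            MPI_Irecv(recv_buf.data(), n, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
            MPI_Isend(send_buf.data(), n, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);
            MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

            // ... roughly where contributions to G4 would be accumulated from recv_buf ...

            // The received G2 becomes the payload forwarded in the next step.
            send_buf.swap(recv_buf);
        }

        MPI_Finalize();
        return 0;
    }

With a CUDA-aware MPI and NVLink, these buffers could be device pointers, so each send/receive would turn into a GPU peer-to-peer copy, which would match the much higher memory-copy share reported above.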
<hkaiser>
weilewei: but what's the point of the nvlink, then?
<weilewei>
hkaiser currently, I am testing the scenario where the total G4 can still fit into one GPU. But when G2 becomes larger (from 30MB now to 1GB), the total G4 (= G2*G2) can no longer fit into one node, so the original DCA (without NVLink) can no longer perform the computation
<weilewei>
At that point, the memory-bound issue can only be solved with NVLink, and accumulating G4 also becomes impossible (according to the DCA team, it is not necessary to accumulate G4 at that point)
<weilewei>
So with NVLink, one suffers the penalty of a lot of communication time (MPI_Isend, etc.) but gains more usable memory.
<weilewei>
hkaiser does that make sense? For a small G4, using NVLink is no benefit at all; but when the size increases (and more science can be done), NVLink comes into play
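As a rough illustration of that decision point, a hedged sketch (the G4 size and the cudaMemGetInfo check are assumptions for illustration, not DCA++'s actual logic): if the full G4 fits into a single GPU's free memory, accumulate it locally; otherwise distribute it across GPUs over NVLink.

    // g4_fit_check.cpp -- hedged sketch; link against the CUDA runtime; numbers are illustrative
    #include <cuda_runtime.h>
    #include <cstdio>

    int main() {
        size_t free_bytes = 0, total_bytes = 0;
        cudaMemGetInfo(&free_bytes, &total_bytes);

        // Hypothetical G4 footprint; the real size depends on the DCA++ input.
        size_t g4_bytes = 12ull * 1024 * 1024 * 1024;

        if (g4_bytes <= free_bytes)
            std::printf("G4 fits on one GPU: accumulate locally (original path)\n");
        else
            std::printf("G4 does not fit: distribute G4 across GPUs (NVLink path)\n");
        return 0;
    }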
<hkaiser>
weilewei: ok
<weilewei>
hkaiser thanks, we can talk more tomorrow when we have our meeting
<weilewei>
that's a lot of time in the communication phase
<hkaiser>
nod
<weilewei>
on average, each MPI_Wait takes 0.006908625 seconds, but the max can go as high as 0.252 s; the standard deviation is not clear, but this implicitly points to potential load imbalance. Sometimes a rank waits long enough to receive G2
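One way per-wait numbers like these could be gathered is to wrap each wait in MPI_Wtime and track the mean and maximum per rank; below is a hedged sketch with made-up message sizes and iteration counts, not the actual instrumentation.

    // wait_timing.cpp -- hedged sketch of per-wait timing, not the real profiler output
    #include <mpi.h>
    #include <cstdio>
    #include <vector>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const int iters = 100;
        const int n = 1 << 20;                       // illustrative message size
        std::vector<double> send_buf(n, rank), recv_buf(n);

        int right = (rank + 1) % size;
        int left  = (rank - 1 + size) % size;

        double sum = 0.0, max_wait = 0.0;

        for (int i = 0; i < iters; ++i) {
            MPI_Request reqs[2];
            MPI_Irecv(recv_buf.data(), n, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
            MPI_Isend(send_buf.data(), n, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

            double t0 = MPI_Wtime();
            MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);  // time spent blocked here
            double dt = MPI_Wtime() - t0;

            sum += dt;
            if (dt > max_wait) max_wait = dt;
        }

        std::printf("rank %d: mean wait %.9f s, max wait %.9f s\n",
                    rank, sum / iters, max_wait);

        MPI_Finalize();
        return 0;
    }

A large gap between the mean and the maximum across ranks is what would hint at the load imbalance mentioned above.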