2021-08-06 22:55
hkaiser changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar-group.org | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | This channel is logged: irclog.cct.lsu.edu
00:22
HHN93 has joined #ste||ar
00:24
HHN93 has quit [Client Quit]
00:57
sarkar_t[m] has joined #ste||ar
01:24
HHN93 has joined #ste||ar
01:24
<HHN93> is there any reason we unroll loops in loop.hpp?
01:24
<HHN93> why don't we trust the compiler?
01:24
<hkaiser_> HHN93: you can't trust the compiler, ever
01:25
<HHN93> that's a very scary statement
01:25
<hkaiser_> unrolling by hand guarantees it
01:25
<HHN93> also, can I trust AMD profilers?
01:26
<hkaiser_> if you can make sense of the data they provide you with? it's mostly trusting yourself when it comes to optimization ;-)
01:26
<HHN93> so basically for the parallel reverse
01:26
<HHN93> there's a function with 10 CPI
01:26
<hkaiser_> which one?
01:26
<HHN93> 10 cycles per instruction (not exactly 10 always, but very high)
01:27
<HHN93> obviously the loop
01:27
<hkaiser_> most likely because of cache issues
01:27
<HHN93> the 2 instructions which are marked as hotspots are
01:27
<HHN93> 1. a mov instruction
01:27
<HHN93> 2. an add 1 instruction
01:27
<HHN93> yeah, cache issues for sure
01:28
<hkaiser_> one of the iterators in reverse goes backwards, so it works against the cache
01:28
<HHN93> the mov instruction makes sense; I tried to jump through the assembly to find out what the add instruction is
01:28
<HHN93> it is the unrolled loop
01:28
<hkaiser_> the add is the iterator increment
01:29
<HHN93> if it's the iterator increment, I don't think it'll be +1
01:29
<HHN93> there are +-10 for it, I guess
01:29
<HHN93> can I share the screenshot?
01:30
<HHN93> I jumped through the assembly and am pretty sure it is the unrolled loop
01:30
<hkaiser_> do you understand how reverse is implemented?
01:30
<HHN93> we break it into chunks and reverse each chunk
01:30
<HHN93> not reverse the chunk
01:31
<HHN93> but reverse each chunk and its corresponding chunk
01:31
<hkaiser_> we parallelize over a zip of two iterators, one going from first to last and the other going from last to first
01:32
<hkaiser_> the second is a reverse iterator, so both can be incremented to achieve what's needed
01:32
<HHN93> yeah, the zip is a zip of a forward and a backward iterator
01:32
<HHN93> how can I share an image?
01:32
<HHN93> gdrive or some other way?
01:33
<hkaiser_> paste it somewhere, e.g. imgbb.com
01:34
<HHN93> 0x1ae4 and 0x1ae8 are considered hot
01:34
<HHN93> taking too much time
01:36
<HHN93> ok wait, could the addq be taking too much time due to a stall in the pipeline caused by the execution of the movdqu instruction?
01:37
<HHN93> `one of the iterators in reverse goes backwards, so it works against the cache`
01:37
<HHN93> I don't understand why that'd be the case. We are going to fetch the blocks corresponding to the forward and reverse iterators, right?
01:38
<HHN93> also, the slowdown is the same even for 1 hpx thread, so I feel we can try optimising the implementation
01:38
<hkaiser_> in my experience, timing data for assembly instructions are often one operation off
01:38
<hkaiser_> that would mean in your case that the two loads are the critical ones
01:40
<hkaiser_> or is it stores?
01:40
<hkaiser_> doesn't matter, either way it would make sense
01:40
<HHN93> `in my experience, timing data for assembly instructions are often one operation off`
01:40
<HHN93> it makes sense if we assume they count a stalled instruction as adding to the execution time
01:40
<hkaiser_> even more so, as I said, one of the iterators goes against the cache
01:41
<HHN93> yes, makes sense
01:41
<HHN93> but with a single thread it's still slower than seq
01:42
<HHN93> so I feel we can improve the performance
01:42
<hkaiser_> that's for different reasons
01:42
<hkaiser_> try running with --hpx:threads=1
01:42
<HHN93> `that's for different reasons`
01:42
<HHN93> yes, that's what I tried
01:42
<HHN93> still slower than seq
01:43
<HHN93> same hotspots
01:43
<hkaiser_> interesting
01:43
<hkaiser_> what's different in the seq implementation?
01:43
<HHN93> seq doesn't use movdqu, is one thing I observed
01:43
<hkaiser_> is it just that the compiler can't see through the zip and badly optimizes things?
01:44
<HHN93> I am not sure yet
01:45
<HHN93> in the case of seq, we don't zip and there's also no manual unrolling of the loop
01:45
<HHN93> the compiler doesn't unroll, and also doesn't use vector registers for swapping (at least g++ doesn't)
01:45
<hkaiser_> remove the unrolling to try
01:46
<HHN93> sure, will try it
01:46
<HHN93> also, in the case of seq, the whole implementation seems to be inlined into main
01:46
<HHN93> my hpx main has only the reverse function
01:55
<HHN93> also, if we do fix this, we are basically fixing the g++ version of std::reverse too
01:55
<HHN93> in the case of g++, par reverse seems to fall back to seq, and according to the blog it's the same for msvc too
01:58
Yorlik_ has joined #ste||ar
02:02
Yorlik__ has quit [Ping timeout: 256 seconds]
02:18
HHN93 has quit [Quit: Client closed]
03:21
hkaiser_ has quit [Quit: Bye!]
08:50
Yorlik_ is now known as Yorlik
10:25
K-ballo1 has joined #ste||ar
10:26
K-ballo has quit [Ping timeout: 268 seconds]
10:26
K-ballo1 is now known as K-ballo
13:39
hkaiser has joined #ste||ar
15:35
hkaiser_ has joined #ste||ar
15:38
hkaiser has quit [Ping timeout: 240 seconds]
17:00
Srini has joined #ste||ar
17:07
Srini has quit [Quit: Ping timeout (120 seconds)]
17:18
sarkar_t[m] has quit [Ping timeout: 265 seconds]
17:18
sarkar_t[m] has joined #ste||ar
17:23
srinivasyadav18[ has joined #ste||ar
19:27
K-ballo1 has joined #ste||ar
19:28
K-ballo has quit [Ping timeout: 268 seconds]
19:28
K-ballo1 is now known as K-ballo