2021-08-06 22:55
hkaiser changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar-group.org | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | This channel is logged: irclog.cct.lsu.edu
00:22
HHN93 has joined #ste||ar
00:24
HHN93 has quit [Client Quit]
00:57
sarkar_t[m] has joined #ste||ar
01:24
HHN93 has joined #ste||ar
01:24
<HHN93> is there any reason we unroll loops in loop.hpp?
01:24
<HHN93> why don't we trust the compiler?
01:24
<hkaiser_> HHN93: you can't trust the compiler, ever
01:25
<HHN93> that's a very scary statement
01:25
<hkaiser_> unrolling by hand guarantees it
01:25
<HHN93> also, can I trust AMD profilers?
01:26
<hkaiser_> if you can make sense of the data they provide you with? it's mostly trusting yourself when it comes to optimization ;-)
01:26
<HHN93> so basically for the parallel reverse
01:26
<HHN93> there's a function with 10 CPI
01:26
<hkaiser_> which one?
01:26
<HHN93> 10 cycles per instruction (not exactly 10 always, but very high)
01:27
<HHN93> obviously the loop
01:27
<hkaiser_> most likely because of cache issues
01:27
<HHN93> the 2 instructions which are marked as hotspots are
01:27
<HHN93> 1. a mov instruction
01:27
<HHN93> 2. an add 1 instruction
01:27
<HHN93> yeah, cache issues for sure
01:28
<hkaiser_> one of the iterators in reverse goes backwards, so it works against the cache
01:28
<HHN93> the mov instruction makes sense; I tried to jump through the assembly to find out what the add instruction is
01:28
<HHN93> it is the unrolled loop
01:28
<hkaiser_> the add is the iterator increment
01:29
<HHN93> if it's the iterator increment, I don't think it'll be +1
01:29
<HHN93> there are +-10 for it, I guess
01:29
<HHN93> can I share the screenshot?
01:30
<HHN93> I jumped through the assembly and am pretty sure it is the unrolled loop
01:30
<hkaiser_> do you understand how reverse is implemented?
01:30
<HHN93> we break it into chunks and reverse each chunk
01:30
<HHN93> not reverse the chunk
01:31
<HHN93> but reverse each chunk and its corresponding chunk
01:31
<hkaiser_> we parallelize over a zip of two iterators, one going from first to last and the other going from last to first
01:32
<hkaiser_> the second is a reverse iterator, so both can be incremented to achieve what's needed
01:32
<HHN93> yeah, the zip is a zip of a forward and a backward iterator
01:32
<HHN93> how can I share an image?
01:32
<HHN93> gdrive or some other way?
01:33
<hkaiser_> paste it somewhere, e.g. imgbb.com
01:34
<HHN93> 0x1ae4 and 0x1ae8 are considered hot
01:34
<HHN93> taking too much time
01:36
<HHN93> ok wait, could the addq be taking too much time due to a stall in the pipeline caused by the execution of the movdqu instruction?
01:37
<HHN93> `one of the iterators in reverse goes backwards, so it works against the cache`
01:37
<HHN93> I don't understand why that'd be the case. We are going to fetch the blocks corresponding to the forward and reverse iterators, right?
01:38
<HHN93> also, the slowdown is the same even for 1 hpx thread, so I feel we can try optimising the implementation
01:38
<hkaiser_> in my experience, timing data for assembly instructions are often one operation off
01:38
<hkaiser_> that would mean in your case that the two loads are the critical ones
01:40
<hkaiser_> or is it stores?
01:40
<hkaiser_> doesn't matter, either way it would make sense
01:40
<HHN93> `in my experience, timing data for assembly instructions are often one operation off`
01:40
<HHN93> it makes sense if we assume they count a stalled instruction as adding to the execution time
01:40
<hkaiser_> even more so, as I said, one of the iterators goes against the cache
01:41
<HHN93> yes, makes sense
01:41
<HHN93> but with a single thread it's still slower than seq
01:42
<HHN93> so I feel we can improve the performance
01:42
<hkaiser_> that's for different reasons
01:42
<hkaiser_> try running with --hpx:threads=1
01:42
<HHN93> `that's for different reasons`
01:42
<HHN93> yes, that's what I tried
01:42
<HHN93> still slower than seq
01:43
<HHN93> same hotspots
01:43
<hkaiser_> interesting
01:43
<hkaiser_> what's different in the seq implementation?
01:43
<HHN93> seq doesn't use movdqu, is one thing I observed
01:43
<hkaiser_> is it just that the compiler can't see through the zip and badly optimizes things?
01:44
<HHN93> I am not sure yet
01:45
<HHN93> in the case of seq, we don't zip and there's also no manual unrolling of the loop
01:45
<HHN93> the compiler doesn't unroll, and also doesn't use vector registers for swapping (at least g++ doesn't)
01:45
<hkaiser_> remove the unrolling to try
01:46
<HHN93> sure, will try it
01:46
<HHN93> also, in the case of seq, the whole implementation seems to be inlined into main
01:46
<HHN93> my hpx main has only the reverse function
01:55
<HHN93> also, if we do fix this, we are basically fixing the g++ version of std::reverse too
01:55
<HHN93> in the case of g++, par reverse seems to fall back to seq, and according to the blog it's the same for msvc too
01:58
Yorlik_ has joined #ste||ar
02:02
Yorlik__ has quit [Ping timeout: 256 seconds]
02:18
HHN93 has quit [Quit: Client closed]
03:21
hkaiser_ has quit [Quit: Bye!]
08:50
Yorlik_ is now known as Yorlik
10:25
K-ballo1 has joined #ste||ar
10:26
K-ballo has quit [Ping timeout: 268 seconds]
10:26
K-ballo1 is now known as K-ballo
13:39
hkaiser has joined #ste||ar
15:35
hkaiser_ has joined #ste||ar
15:38
hkaiser has quit [Ping timeout: 240 seconds]
17:00
Srini has joined #ste||ar
17:07
Srini has quit [Quit: Ping timeout (120 seconds)]
17:18
sarkar_t[m] has quit [Ping timeout: 265 seconds]
17:18
sarkar_t[m] has joined #ste||ar
17:23
srinivasyadav18[ has joined #ste||ar
19:27
K-ballo1 has joined #ste||ar
19:28
K-ballo has quit [Ping timeout: 268 seconds]
19:28
K-ballo1 is now known as K-ballo