hkaiser changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar-group.org | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | This channel is logged: irclog.cct.lsu.edu
HHN93 has joined #ste||ar
HHN93 has quit [Client Quit]
sarkar_t[m] has joined #ste||ar
HHN93 has joined #ste||ar
<HHN93> is there any reason we unroll loops in loop.hpp?
<HHN93> why don't we trust the compiler?
<hkaiser_> HHN93: you can't trust the compiler, ever
<HHN93> oh ok
<hkaiser_> ;-)
<HHN93> that's a very scare statement
<hkaiser_> unrolling by hand guarantees it
<HHN93> scary*
<HHN93> ok
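A minimal sketch of what "unrolling by hand" means in the exchange above, assuming a 4-way unroll factor. The name `for_each_index_unrolled` is hypothetical and this is not the code in HPX's loop.hpp; it only illustrates the general technique of writing out several loop bodies per iteration so the result does not depend on the compiler's unrolling heuristics.

```cpp
#include <cstddef>

// Hypothetical helper (not an HPX API): calls f(i) for i in [0, count),
// handling four indices per trip through the main loop.
template <typename F>
void for_each_index_unrolled(std::size_t count, F&& f)
{
    std::size_t i = 0;

    // Main loop: four calls written out explicitly per iteration.
    for (; i + 4 <= count; i += 4)
    {
        f(i);
        f(i + 1);
        f(i + 2);
        f(i + 3);
    }

    // Remainder loop: handles the last count % 4 indices.
    for (; i < count; ++i)
    {
        f(i);
    }
}
```

Whether this beats the compiler's own unrolling depends on the loop body and target, which is why measuring both variants (as suggested later in the log) is the sensible check.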
<HHN93> also can I trust AMD profilers?
<hkaiser_> if you can make sense of the data they provide you with? it's mostly about trusting yourself when it comes to optimization ;-)
<HHN93> so basically for the reverse par
<HHN93> there's a function with 10 CPI
<hkaiser_> which one?
<HHN93> 10 cycles per instruction (not exactly 10 always but very high)
<HHN93> obviously the loop
<hkaiser_> ok
<hkaiser_> most likely because of cache issues
<HHN93> the 2 instructions which are marked as hotspots are
<HHN93> 1. mov instruction
<HHN93> 2. add 1 instruction
<HHN93> yeah cache issues for sure
<hkaiser_> one of the iterators in reverse goes backwards, so it works against the cache
<HHN93> mov instruction makes sense, tried to jump through assembly and find out what the add instruction is
<HHN93> it is the unrolled loop
<hkaiser_> the add is the iterator increment
<HHN93> no
<HHN93> if it's the iterator increment I don't think it'll be +1
<HHN93> there are +- 10 for it ig
<HHN93> can I share the screenshot?
<hkaiser_> ok
<HHN93> I jumped through assembly and am pretty sure it is the unrolled loop
<hkaiser_> do you understand how reverse is implemented?
<HHN93> we break it into chunks and reverse each chunk
<HHN93> not reverse the chunk
<HHN93> but reverse chunk and corresponding chunk
<hkaiser_> we parallelize over a zip of two iterators, one going from first to last and the other goes from last to first
<hkaiser_> the second is a reverse iterator, so both can be incremented to achieve what's needed
<HHN93> yeah the zip is zip of a forward and backward iterator
<hkaiser_> yes
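The scheme described above (a forward iterator paired with a reverse iterator, both incremented in lockstep) can be sketched sequentially with the standard library alone. This is an illustration under assumptions, not HPX's implementation: the name `reverse_paired` is hypothetical, and the zip plus chunked parallel execution are left out so only the access pattern is shown.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>

// Hypothetical helper (not an HPX API): reverses [first, last) by walking
// a forward iterator and a reverse iterator toward each other.
template <typename BidirIt>
void reverse_paired(BidirIt first, BidirIt last)
{
    auto const n = std::distance(first, last);

    auto fwd = first;                             // walks first -> last
    auto rev = std::make_reverse_iterator(last);  // walks last -> first

    // Only the first half needs visiting; each step swaps one pair.
    // Both iterators are advanced with ++, although rev moves backwards
    // through memory.
    for (std::ptrdiff_t i = 0; i < n / 2; ++i, ++fwd, ++rev)
    {
        std::iter_swap(fwd, rev);
    }
}
```

Each step touches one element near the front and one near the back of the range, so the loop works on two separate memory streams, one of them traversed in descending address order.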
<HHN93> how can I share an image?
<HHN93> gdrive or some other way?
<hkaiser_> paste it somewhere, e.g. imgbb.com
<HHN93> 0x1ae4 and 0x1ae8 are considered hot
<HHN93> taking too much time
<HHN93> ok wait, could addq be taking too much time due to stall in the pipeline because of execution of the movdqu instruction?
<HHN93> `one of the iterators in reverse goes backwards, so it works against the cache`
<HHN93> I don't understand why that'd be the case. We are going to fetch the blocks corresponding to the forward and reverse iterator, right?
<HHN93> also the slowdown is same even for 1 hpx thread, so I feel we can try optimising the implementation
<hkaiser_> in my experience, timing data for assembly instructions is often one operation off
<hkaiser_> that would mean in your case that the two loads are the critical ones
<HHN93> oh ok
<hkaiser_> or is it stores?
<hkaiser_> doesn't matter, either way it would make sense
<HHN93> `in my experience, timing data for assembly instructions is often one operation off`
<HHN93> it makes sense if we assume they consider an instruction stalled as adding up to the execution time
<hkaiser_> even more as I said, one of the iterators goes against the cache
<HHN93> yes makes sense
<HHN93> but with a single thread it's still slower than seq
<HHN93> so I feel we can improve the performance
<hkaiser_> that's for different reasons
<hkaiser_> try running with --hpx:threads=1
<HHN93> `that's for different reasons`
<HHN93> oh, like?
<HHN93> yes that's what I tried
<hkaiser_> ahh
<HHN93> still slower than seq
<HHN93> same hotspots
<hkaiser_> interesting
<hkaiser_> what's different in the seq implementation?
<HHN93> seq doesn't use movdqu, that's one thing I observed
<hkaiser_> is it just that the compiler can't see through the zip and badly optimizes things?
<HHN93> I am not sure yet
<HHN93> in case of seq we don't zip and also no manual unrolling of loop
<HHN93> compiler doesn't unroll, and also doesn't use vector registers for swapping (at least g++ doesn't)
<hkaiser_> remove the unrolling to try
<HHN93> sure, will try it
<HHN93> also in case of seq, the whole implementation seems to be inlined into main
<HHN93> my hpx main has only reverse function
<HHN93> also, if we do fix this we are basically fixing the G++ version of std::reverse too
<HHN93> in case of g++, par reverse seems to fall back to seq, and according to the blog it's the same for msvc too
Yorlik_ has joined #ste||ar
Yorlik__ has quit [Ping timeout: 256 seconds]
HHN93 has quit [Quit: Client closed]
hkaiser_ has quit [Quit: Bye!]
Yorlik_ is now known as Yorlik
K-ballo1 has joined #ste||ar
K-ballo has quit [Ping timeout: 268 seconds]
K-ballo1 is now known as K-ballo
hkaiser has joined #ste||ar
hkaiser_ has joined #ste||ar
hkaiser has quit [Ping timeout: 240 seconds]
Srini has joined #ste||ar
Srini has quit [Quit: Ping timeout (120 seconds)]
sarkar_t[m] has quit [Ping timeout: 265 seconds]
sarkar_t[m] has joined #ste||ar
srinivasyadav18[ has joined #ste||ar
K-ballo1 has joined #ste||ar
K-ballo has quit [Ping timeout: 268 seconds]
K-ballo1 is now known as K-ballo