hkaiser changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar-group.org | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | This channel is logged: irclog.cct.lsu.edu
soulreaper has quit [Quit: Client closed]
Yorlik_ has joined #ste||ar
Yorlik has quit [Ping timeout: 248 seconds]
hkaiser has quit [Quit: Bye!]
K-ballo1 has joined #ste||ar
K-ballo has quit [Ping timeout: 255 seconds]
K-ballo1 is now known as K-ballo
Yorlik_ is now known as Yorlik
K-ballo1 has joined #ste||ar
K-ballo has quit [Ping timeout: 248 seconds]
K-ballo1 is now known as K-ballo
hkaiser has joined #ste||ar
soulreaper has joined #ste||ar
HHN93 has joined #ste||ar
HHN93 has quit [Ping timeout: 260 seconds]
soulreaper has quit [Ping timeout: 260 seconds]
HHN93 has joined #ste||ar
<HHN93> In most cases -O3 vectorizes loops already, but it does seem to make sense that we add explicit vectorization
<hkaiser> HHN93: in cases where no user defined lambdas are involved, using experimental::simd is certainly an option
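A minimal sketch of what the experimental::simd option can look like, assuming the Parallelism TS v2 interface as shipped in GCC's libstdc++ (`<experimental/simd>`, GCC 11+). The function name `add` and the loop structure are illustrative, not HPX's actual implementation:

```cpp
// Sketch (assumption: GCC's <experimental/simd>): explicit SIMD
// element-wise add, processing full vector lanes then a scalar tail.
#include <experimental/simd>
#include <cstddef>
#include <vector>

namespace stdx = std::experimental;

// Hypothetical helper, not an HPX API.
void add(std::vector<float> const& a, std::vector<float> const& b,
         std::vector<float>& out)
{
    using simd_t = stdx::native_simd<float>;
    std::size_t i = 0;
    // Full SIMD lanes: load, add, store one vector at a time.
    for (; i + simd_t::size() <= a.size(); i += simd_t::size())
    {
        simd_t va(&a[i], stdx::element_aligned);
        simd_t vb(&b[i], stdx::element_aligned);
        (va + vb).copy_to(&out[i], stdx::element_aligned);
    }
    // Scalar tail for the remaining elements.
    for (; i < a.size(); ++i)
        out[i] = a[i] + b[i];
}
```

Because the arithmetic is spelled out in vector registers, this does not rely on the optimizer seeing through the loop - which is the point of option b).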
<HHN93> Any advice on how I should approach this issue, as performance is often not improved significantly by adding explicit vectorization?
<hkaiser> if the compiler can't apply vectorization because it isn't able to look through the loop, we have two options
<hkaiser> a) simplify the loop such that the compiler can understand it better (i.e. integral boundaries instead of iterators, no function calls inside the loop, etc.)
<HHN93> no the issue I am highlighting is that sometimes the loops are already vectorized
<hkaiser> b) use truly explicit vectorization, i.e. experimental::simd
<hkaiser> well, there is c) don't do anything, obviously
<hkaiser> HHN93: sure, but that's not guaranteed, not even close
<HHN93> aren't compiler optimisations deterministic?
<hkaiser> but if all compilers already apply vectorization without being asked to do it, then we certainly don't need to do anything special
<HHN93> no, my issue is showing that my changes improve the performance, because it seems that the loops are sometimes already vectorized
<hkaiser> most pragmas we use are not there to force vectorization, though - they're there to give certain assurances to the compiler so that it can actually consider vectorizing the code
<HHN93> but it does make sense to still add vectorization pragmas
<hkaiser> like #pragma ivdep, which tells the compiler that there is no aliasing going on
<hkaiser> yes, that's what I'm trying to say - it is worth adding the pragmas if the execution policies request it
<hkaiser> in general, execution policies are not there to instruct the implementation to do things in certain ways (i.e. par doesn't mean 'do parallelize')
<hkaiser> execution policies convey guarantees to the implementation that may enable certain optimizations
<HHN93> `yes, that's what I'm trying to say - it is worth adding the pragmas if the execution policies request it`
<HHN93> I agree with it
<HHN93> But sometimes there are no performance/assembly instructions improvements on adding vectorization pragmas
<hkaiser> so par means 'it's safe to execute the iterations in any order and potentially concurrently'
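The policy-as-guarantee idea can be sketched as follows - a hypothetical mini `for_each` whose unseq overload merely adds a vectorization hint that the caller's guarantee makes legal. The tag names and overloads are made up for illustration; this is not HPX's actual dispatch machinery:

```cpp
// Sketch (hypothetical tags, not HPX internals): the policy tag carries
// a caller guarantee; only the unseq overload emits a SIMD hint.
#include <cstddef>
#include <vector>

struct seq_tag {};    // no guarantee beyond sequential semantics
struct unseq_tag {};  // caller guarantees iterations may interleave

template <typename F>
void my_for_each(seq_tag, float* first, std::size_t n, F f)
{
    // Plain loop: the implementation promises nothing extra.
    for (std::size_t i = 0; i < n; ++i)
        f(first[i]);
}

template <typename F>
void my_for_each(unseq_tag, float* first, std::size_t n, F f)
{
    // The caller guaranteed independence, so we may pass that
    // assurance on to the compiler (ignored unless -fopenmp-simd).
#pragma omp simd
    for (std::size_t i = 0; i < n; ++i)
        f(first[i]);
}
```

Note the pragma doesn't force anything; it only removes an obstacle the compiler could not have removed on its own, which matches the point that par/unseq convey guarantees rather than commands.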
<HHN93> because -O3 had already vectorized them
<hkaiser> not always
<HHN93> yes, it is not always the case, just sometimes
<hkaiser> the compiler for instance can't assume that involved pointers are not aliasing the same data
<hkaiser> #pragma ivdep tells the compiler that this can be assumed, etc.
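A hedged example of the ivdep idea: the pragma asserts that there are no loop-carried dependencies through the pointers, so the compiler may vectorize without emitting a runtime aliasing check. GCC spells it `#pragma GCC ivdep`, the Intel compiler uses `#pragma ivdep`; the function itself is illustrative:

```cpp
// Sketch: ivdep promises the compiler that out[] and in[] do not
// overlap, enabling vectorization it otherwise couldn't prove safe.
#include <cstddef>

void scale(float* out, float const* in, std::size_t n, float f)
{
#if defined(__GNUC__) && !defined(__clang__)
#pragma GCC ivdep
#endif
    for (std::size_t i = 0; i < n; ++i)
        out[i] = in[i] * f;  // caller promises out does not alias in
}
```

If the promise is broken (the ranges do overlap), the behaviour is wrong-but-fast - which is exactly why such pragmas are assurances from the programmer, not optimizations the compiler can apply on its own.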
<hkaiser> I agree
<HHN93> in the case of generate_n, par and par_unseq have very similar performance despite the change; I observed that this is because std::generate_n has the same performance for seq and unseq
<hkaiser> the compiler applies vectorization on its own if it can prove that this doesn't change semantics
<HHN93> so in the github PR is there anything I can add to prove that my PR is an improvement
<hkaiser> it could be that the implementation doesn't do anything extra for unseq if std::generate_n has the same performance for seq and unseq
<hkaiser> well, you showed that unseq is faster than seq
<HHN93> `the compiler applies vectorization on its own if it can prove that this doesn't change semantics`
<HHN93> I am not sure how generate_n can actually assume this to be true. But as seen in the benchmarks, with -O3 enabled both have the same performance
<HHN93> `well, you showed that unseq is faster than seq`
<HHN93> when no optimisations are on
<HHN93> on -O3 both are very close
<hkaiser> ahh, that's not a criterion, then
<HHN93> `ahh, that's not a criterion, then`
<HHN93> can you please elaborate?
<hkaiser> are you sure your implementation of std::generate_n is actually doing additional vectorization for unseq?
<hkaiser> can you please elaborate? - doing perf measurements with anything but -O3 is pointless
<hkaiser> what I may suggest is to implement unseq using experimental::simd and see if that improves the picture
<HHN93> yes, I have checked they do add a #pragma simd.
<hkaiser> for stdlibc++ or libc++? or for the msvc std library?
<HHN93> `can you please elaborate? - doing perf measurements with anything but -O3 is pointless`
<HHN93> ok so the fact that unseq is faster when no optimisations are enabled doesn't prove anything?
<hkaiser> no it doesn't
<HHN93> g++ compiler
<hkaiser> it just forces vectorization even for not optimized code
HHN93 has quit [Quit: Client closed]
hkaiser has quit [Quit: Bye!]
K-ballo1 has joined #ste||ar
K-ballo has quit [Ping timeout: 276 seconds]
K-ballo1 is now known as K-ballo
soulreaper has joined #ste||ar
hkaiser has joined #ste||ar
soulreaper has quit [Quit: Client closed]