hkaiser changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar-group.org | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | This channel is logged: irclog.cct.lsu.edu
soulreaper has quit [Quit: Client closed]
Yorlik_ has joined #ste||ar
Yorlik has quit [Ping timeout: 248 seconds]
hkaiser has quit [Quit: Bye!]
K-ballo1 has joined #ste||ar
K-ballo has quit [Ping timeout: 255 seconds]
K-ballo1 is now known as K-ballo
Yorlik_ is now known as Yorlik
K-ballo1 has joined #ste||ar
K-ballo has quit [Ping timeout: 248 seconds]
K-ballo1 is now known as K-ballo
hkaiser has joined #ste||ar
soulreaper has joined #ste||ar
HHN93 has joined #ste||ar
HHN93 has quit [Ping timeout: 260 seconds]
soulreaper has quit [Ping timeout: 260 seconds]
HHN93 has joined #ste||ar
<HHN93>
In most cases -O3 vectorizes loops already, but it does seem to make sense that we add explicit vectorization
<hkaiser>
HHN93: in cases where no user defined lambdas are involved, using experimental::simd is certainly an option
<HHN93>
Any advice on how I should approach this issue, as performance is often not improved significantly by adding explicit vectorization?
<hkaiser>
if the compiler can't apply vectorization because it isn't able to look through the loop, we have two options
<hkaiser>
a) simplify the loop such that the compiler can understand it better (i.e. integral boundaries instead of iterators, no function calls inside the loop, etc.)
<HHN93>
no the issue I am highlighting is that sometimes the loops are already vectorized
<hkaiser>
b) use truly explicit vectorization, i.e. experimental::simd
<hkaiser>
well, there is c) don't do anything, obviously
<hkaiser>
HHN93: sure, but that's not guaranteed, not even close
<hkaiser>
but if all compilers already apply vectorization without being asked to do it, then we certainly don't need to do anything special
<HHN93>
no, my issue is demonstrating that my changes improve performance, because it seems that the loops are sometimes already vectorized
<hkaiser>
most pragmas we use are not to force vectorization, though - they give certain assurances to the compiler so that it can actually consider vectorizing the code
<HHN93>
but it does make sense to still add vectorization pragmas
<hkaiser>
like #pragma ivdep, which tells the compiler that there is no aliasing going on
<hkaiser>
yes, that's what I'm trying to say - it is worth adding the pragmas if the execution policies request it
<hkaiser>
in general, execution policies are not meant to instruct the implementation to do things in certain ways (i.e. par doesn't mean 'do parallelize')
<hkaiser>
execution policies convey guarantees to the implementation that may enable certain optimizations
<HHN93>
`yes, that's what I'm trying to say - it is worth adding the pragmas if the execution policies request it`
<HHN93>
I agree with it
<HHN93>
But sometimes there are no performance/assembly instructions improvements on adding vectorization pragmas
<hkaiser>
so par means 'it's safe to execute the iterations in any order and potentially concurrently'
<HHN93>
because -O3 had already vectorized them
<hkaiser>
not always
<HHN93>
yes it is not always the case but rather sometimes
<hkaiser>
the compiler for instance can't assume that involved pointers are not aliasing the same data
<hkaiser>
#pragma ivdep tells the compiler that this can be assumed, etc.
<hkaiser>
I agree
<HHN93>
in the case of generate_n, par and par_unseq have very similar performance despite making the change; I observed that this is because std::generate_n has the same performance for seq and unseq
<hkaiser>
the compiler applies vectorization on its own if it can prove that this doesn't change semantics
<HHN93>
so in the github PR is there anything I can add to prove that my PR is an improvement
<hkaiser>
it could be that the implementation doesn't do anything special for unseq if std::generate_n has the same performance for seq and unseq
<hkaiser>
well, you showed that unseq is faster than seq
<HHN93>
`the compiler applies vectorization on its own if it can prove that this doesn't change semantics`
<HHN93>
I am not sure how generate_n actually proves this to be true. But as seen in the benchmarks, on enabling -O3 both have the same performance
<HHN93>
`well, you showed that unseq is faster than seq`
<HHN93>
when no optimisations are on
<HHN93>
on -O3 both are very close
<hkaiser>
ahh, that's not a criteria, then
<HHN93>
`ahh, that's not a criteria, then`
<HHN93>
can you please elaborate?
<hkaiser>
are you sure your implementation of std::generate_n is actually doing additional vectorization for unseq?
<hkaiser>
can you please elaborate? - doing perf measurements with anything but -O3 is pointless
<hkaiser>
what I may suggest is to implement unseq using experimental::simd and see if that improves the picture
<HHN93>
yes, I have checked they do add a #pragma simd.
<hkaiser>
for stdlibc++ or libc++? or for the msvc std library?
<HHN93>
`can you please elaborate? - doing perf measurements with anything but -O3 is pointless`
<HHN93>
ok so the fact that unseq is faster when no optimisations are enabled doesn't prove anything?
<hkaiser>
no it doesn't
<HHN93>
g++ compiler
<hkaiser>
it just forces vectorization even for not optimized code