hkaiser changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar-group.org | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | This channel is logged: irclog.cct.lsu.edu
<gonidelis[m]> hkaiser:
<hkaiser> gonidelis[m]: it's not specialized
<gonidelis[m]> yeah scratch that... i missed a closing `>` on my reading
rtohid[m] has joined #ste||ar
Yorlik_ has joined #ste||ar
Yorlik__ has quit [Ping timeout: 265 seconds]
hkaiser has quit [Quit: Bye!]
Yorlik__ has joined #ste||ar
Yorlik_ has quit [Ping timeout: 264 seconds]
Yorlik__ is now known as Yorlik
hkaiser has joined #ste||ar
diehlpk has joined #ste||ar
tufei has joined #ste||ar
<gonidelis[m]> why was tag_invoke ambiguous when other libraries were implementing it too? https://github.com/STEllAR-GROUP/hpx/pull/6146/files
weilewei has joined #ste||ar
<hkaiser> this call, for instance, found two different tag_invoke implementations: the one from MWGraph and ours.
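The ambiguity hkaiser describes comes from argument-dependent lookup: an unqualified tag_invoke call considers the associated namespaces of every argument, so two libraries can each contribute a viable overload. The sketch below is a minimal, self-contained illustration with made-up namespaces (lib_a, lib_b), not the actual HPX/MWGraph code from the PR.

```cpp
namespace lib_a {                          // stands in for one library (e.g. HPX)
    struct my_tag_t {};
    inline constexpr my_tag_t my_tag{};

    // customizes on the tag type, generic in the argument
    template <typename T>
    void tag_invoke(my_tag_t, T&&) {}
}

namespace lib_b {                          // stands in for another library
    struct thing {};

    // customizes on the argument type, generic in the tag
    template <typename Tag>
    void tag_invoke(Tag, thing&) {}
}

int main()
{
    lib_b::thing t;

    // ADL finds lib_a::tag_invoke (through my_tag_t) and lib_b::tag_invoke
    // (through thing); neither beats the other in partial ordering, so the
    // unqualified call below would fail to compile as ambiguous:
    //
    //     tag_invoke(lib_a::my_tag, t);

    (void) t;
    return 0;
}
```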
<weilewei> nvc++ std::execution::seq: 3.65969 sec. averaged over 10 runs.
<weilewei> nvc++ std::execution::par: 0.0699834 sec. averaged over 10 runs.
<weilewei> nvc++ std::execution::par_unseq: 0.0716834 sec. averaged over 10 runs.
<weilewei> hpx::execution::seq: 18.2115 sec. averaged over 10 runs.
<weilewei> hpx::execution::par: 0.277915 sec. averaged over 10 runs.
<weilewei> hpx::execution::par_unseq: 0.253258 sec. averaged over 10 runs.
<weilewei> kokkos::parallel_for transform: 0.562939 sec. averaged over 10 runs.
<weilewei> std::execution::seq: 16.2453 sec. averaged over 10 runs.
<weilewei> std::execution::par: 16.1129 sec. averaged over 10 runs.
<weilewei> std::execution::par_unseq: 16.3805 sec. averaged over 10 runs.
<weilewei> __gnu_parallel::transform: 0.460436 sec. averaged over 10 runs.
<weilewei> It seems that for the seq version, hpx is about 2 sec slower than std? The code: https://github.com/weilewei/parSTL
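The parSTL code itself isn't reproduced in the log; the following is only a minimal sketch (problem size, operation, and policy choice are made up) of how such a comparison is commonly structured: run the same transform under an hpx::execution policy and a std::execution policy and average the wall-clock time over several runs. Whether std::execution::par actually runs in parallel depends on the compiler and the standard library's parallel backend, which the log doesn't show.

```cpp
#include <hpx/hpx_main.hpp>     // lets plain main() run on the HPX runtime
#include <hpx/algorithm.hpp>    // hpx::transform and hpx::execution policies

#include <algorithm>
#include <chrono>
#include <cmath>
#include <execution>
#include <iostream>
#include <vector>

// time a callable and average the wall-clock duration over `runs` repetitions
template <typename F>
double average_seconds(F&& f, int runs = 10)
{
    double total = 0.0;
    for (int i = 0; i != runs; ++i)
    {
        auto const t0 = std::chrono::steady_clock::now();
        f();
        auto const t1 = std::chrono::steady_clock::now();
        total += std::chrono::duration<double>(t1 - t0).count();
    }
    return total / runs;
}

int main()
{
    std::vector<double> in(1 << 26, 1.5), out(in.size());
    auto op = [](double x) { return std::sqrt(x) * x; };

    std::cout << "hpx::execution::par: "
              << average_seconds([&] {
                     hpx::transform(hpx::execution::par, in.begin(), in.end(),
                         out.begin(), op);
                 })
              << " sec. averaged over 10 runs.\n";

    std::cout << "std::execution::par: "
              << average_seconds([&] {
                     std::transform(std::execution::par, in.begin(), in.end(),
                         out.begin(), op);
                 })
              << " sec. averaged over 10 runs.\n";

    return 0;
}
```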
<hkaiser> par is not too good either :/
<weilewei> really? hpx par runs at 0.277915 sec and std par runs at 16.1129 sec
<weilewei> it seems hpx is faster. But I don't know why std par is so slow.
<hkaiser> but the nvc++ version? is it running on the gpu?
<weilewei> yes nvc++ is running on gpu
<hkaiser> ahh, so apples and bananas
<weilewei> yeah, ignore nvc++ for now. Later I will run Kokkos with nvc++ backend
<hkaiser> we have not necessarily optimized our seq execution
<hkaiser> I can have a look, however
<weilewei> ok, so hpx seq is slower than std seq, this is expected for now?
<hkaiser> also, you might want to try par_simd ;-) instead of par_unseq (which in our case is the same as par)
<hkaiser> Srinivas can help with that
<weilewei> ok, I will add par_simd to the todo list
<weilewei> if I would like to run hpx par on gpu, how does that work?
<hkaiser> or just simd for sequential with vectorization
<hkaiser> weilewei: we don't support that
<weilewei> adding Kokkos?
<hkaiser> not sure if kokkos-hpx supports that, ms[m]1 might know
<weilewei> got it
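A sketch of hkaiser's par_simd/simd suggestion, assuming an HPX build with datapar support enabled (e.g. the HPX_WITH_DATAPAR CMake option; the exact build requirements and headers are assumptions to verify). The callable has to be a generic lambda, because the datapar policies invoke it on SIMD packs rather than on individual doubles.

```cpp
#include <hpx/hpx_main.hpp>
#include <hpx/algorithm.hpp>    // hpx::transform, hpx::execution::simd / par_simd

#include <vector>

int main()
{
    std::vector<double> in(1 << 26, 1.5), out(in.size());

    // generic lambda: works both for plain doubles and for SIMD pack types
    auto op = [](auto x) { return x * x + x; };

    // sequential but vectorized
    hpx::transform(hpx::execution::simd, in.begin(), in.end(), out.begin(), op);

    // parallel and vectorized
    hpx::transform(
        hpx::execution::par_simd, in.begin(), in.end(), out.begin(), op);

    return 0;
}
```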
<hkaiser> weilewei: btw, how many cores is that benchmark running on?
<weilewei> hmm I have not specified it yet
<hkaiser> how many cores has that node?
<weilewei> 64 cpu cores, 256 processors
<weilewei> on Perlmutter
<hkaiser> so it's most likely running on 64 threads (try --hpx:print-bind)
<weilewei> 127: PU L#254(P#127), Core L#127(P#63), Socket L#1(P#1), on pool "default"
<hkaiser> so 127 cores
<hkaiser> 128
<weilewei> why does it print locality 0 twice?
<hkaiser> that's 64 cores with 2 HT each
<hkaiser> I'd suggest specifying --hpx:threads=64
<weilewei> ok
<hkaiser> hmmm. not sure - how do you launch the test?
<hkaiser> what's your slurm command?
<hkaiser> I'll have a look why it's printing things twice, probably a bug
<weilewei> #SBATCH -C gpu
<weilewei> #SBATCH -t 20:00
<weilewei> #SBATCH -N 1
<weilewei> #SBATCH --ntasks-per-node=1
<weilewei> #SBATCH -o parSTL.out
<weilewei> #SBATCH -e parSTL.err
<weilewei> cd /global/homes/w/wwei/src/parSTL
<weilewei> ./scripts.sh
<hkaiser> yah, most likely a bug - I'll investigate
<weilewei> Thanks
<hkaiser> don't run using HTs
<weilewei> ok, I will use  --hpx:threads=64 to run experiments again
<weilewei> let's see how it changes
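Alongside --hpx:print-bind, a tiny sanity check like the sketch below (not part of parSTL; hpx::get_num_worker_threads is assumed to be the right call for the HPX version in use) can confirm how many worker threads the runtime actually got when the benchmark binary is launched with --hpx:threads=64.

```cpp
#include <hpx/hpx_main.hpp>
#include <hpx/include/runtime.hpp>   // hpx::get_num_worker_threads

#include <iostream>

int main()
{
    // with "--hpx:threads=64" this should report 64, i.e. one worker
    // thread per core instead of one per hyperthread
    std::cout << "HPX worker threads: " << hpx::get_num_worker_threads()
              << "\n";
    return 0;
}
```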
<weilewei> With --hpx:threads=64, it seems a bit slower than without
<weilewei> hpx::execution::seq: 17.0798 sec. averaged over 10 runs.
<weilewei> hpx::execution::par: 0.334277 sec. averaged over 10 runs.
<weilewei> hpx::execution::par_unseq: 0.341671 sec. averaged over 10 runs.
<hkaiser> a bit worse, ok
<weilewei> let me wait for std results in the same job run
<weilewei> std::execution::seq: 16.138 sec. averaged over 10 runs.
<weilewei> std::execution::par: 16.0406 sec. averaged over 10 runs.
<weilewei> std::execution::par_unseq: 16.2172 sec. averaged over 10 runs.
<weilewei> well, hpx seq is close to std seq
weilewei has quit [Ping timeout: 260 seconds]
weilewei has joined #ste||ar
<weilewei> hkaiser: if I want to learn more about the sender/receiver work for HPX, who should be the point of contact?
<hkaiser> shreyas, myself, and certainly ms[m]1
<weilewei> Got it, I will learn a bit more for my next project :)
<weilewei> will reach out about this later
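For a first look at the sender/receiver facilities that already ship with HPX, here is a minimal, hedged sketch (the experimental namespaces and headers below should be checked against the HPX version in use): build a small pipeline with just/then and wait for it to complete.

```cpp
#include <hpx/hpx_main.hpp>
#include <hpx/execution.hpp>    // hpx::execution::experimental senders/adaptors

#include <iostream>
#include <utility>

int main()
{
    namespace ex = hpx::execution::experimental;
    namespace tt = hpx::this_thread::experimental;

    // a sender pipeline: start with a value, transform it, then print it
    auto pipeline = ex::just(21)
        | ex::then([](int i) { return 2 * i; })
        | ex::then([](int i) { std::cout << "result: " << i << "\n"; });

    // block until the pipeline has run to completion
    tt::sync_wait(std::move(pipeline));
    return 0;
}
```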
<hkaiser> weilewei: as said, we've never tried to make seq optimal
<weilewei> Got it, I will note it
weilewei has quit [Quit: Ping timeout (120 seconds)]