hkaiser changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar-group.org | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | This channel is logged: irclog.cct.lsu.edu
<gonidelis[m]> hkaiser:
<hkaiser> gonidelis[m]: it's not specialized
<gonidelis[m]> yeah scratch that... i missed a closing `>` on my reading
rtohid[m] has joined #ste||ar
Yorlik_ has joined #ste||ar
Yorlik__ has quit [Ping timeout: 265 seconds]
hkaiser has quit [Quit: Bye!]
Yorlik__ has joined #ste||ar
Yorlik_ has quit [Ping timeout: 264 seconds]
Yorlik__ is now known as Yorlik
hkaiser has joined #ste||ar
diehlpk has joined #ste||ar
tufei has joined #ste||ar
<gonidelis[m]> why was tag_invoke ambiguous when other libraries were implementing it too? https://github.com/STEllAR-GROUP/hpx/pull/6146/files
weilewei has joined #ste||ar
<hkaiser> this call, for instance, found two different tag_invoke implementations: the one from MWGraph and ours.
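The ambiguity hkaiser describes comes from argument-dependent lookup: an unqualified tag_invoke call considers the associated namespaces of every argument, so two libraries can each contribute a viable overload. The sketch below is a minimal, self-contained illustration with made-up namespaces (lib_a, lib_b), not the actual HPX/MWGraph code from the PR.

```cpp
namespace lib_a {                          // stands in for one library (e.g. HPX)
    struct my_tag_t {};
    inline constexpr my_tag_t my_tag{};

    // customizes on the tag type, generic in the argument
    template <typename T>
    void tag_invoke(my_tag_t, T&&) {}
}

namespace lib_b {                          // stands in for another library
    struct thing {};

    // customizes on the argument type, generic in the tag
    template <typename Tag>
    void tag_invoke(Tag, thing&) {}
}

int main()
{
    lib_b::thing t;

    // ADL finds lib_a::tag_invoke (through my_tag_t) and lib_b::tag_invoke
    // (through thing); neither beats the other in partial ordering, so the
    // unqualified call below would fail to compile as ambiguous:
    //
    //     tag_invoke(lib_a::my_tag, t);

    (void) t;
    return 0;
}
```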
<weilewei> nvc++ std::execution::seq: 3.65969 sec. averaged over 10 runs.
<weilewei> nvc++ std::execution::par: 0.0699834 sec. averaged over 10 runs.
<weilewei> nvc++ std::execution::par_unseq: 0.0716834 sec. averaged over 10 runs.
<weilewei> hpx::execution::seq: 18.2115 sec. averaged over 10 runs.
<weilewei> hpx::execution::par: 0.277915 sec. averaged over 10 runs.
<weilewei> hpx::execution::par_unseq: 0.253258 sec. averaged over 10 runs.
<weilewei> kokkos::parallel_for transform: 0.562939 sec. averaged over 10 runs.
<weilewei> std::execution::seq: 16.2453 sec. averaged over 10 runs.
<weilewei> std::execution::par: 16.1129 sec. averaged over 10 runs.
<weilewei> std::execution::par_unseq: 16.3805 sec. averaged over 10 runs.
<weilewei> __gnu_parallel::transform: 0.460436 sec. averaged over 10 runs.
<weilewei> It seems that for the seq version, hpx is about 2 sec slower than std? The code: https://github.com/weilewei/parSTL
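The parSTL code itself isn't reproduced in the log; the following is only a minimal sketch (problem size, operation, and policy choice are made up) of how such a comparison is commonly structured: run the same transform under an hpx::execution policy and a std::execution policy and average the wall-clock time over several runs. Whether std::execution::par actually runs in parallel depends on the compiler and the standard library's parallel backend, which the log doesn't show.

```cpp
#include <hpx/hpx_main.hpp>     // lets plain main() run on the HPX runtime
#include <hpx/algorithm.hpp>    // hpx::transform and hpx::execution policies

#include <algorithm>
#include <chrono>
#include <cmath>
#include <execution>
#include <iostream>
#include <vector>

// time a callable and average the wall-clock duration over `runs` repetitions
template <typename F>
double average_seconds(F&& f, int runs = 10)
{
    double total = 0.0;
    for (int i = 0; i != runs; ++i)
    {
        auto const t0 = std::chrono::steady_clock::now();
        f();
        auto const t1 = std::chrono::steady_clock::now();
        total += std::chrono::duration<double>(t1 - t0).count();
    }
    return total / runs;
}

int main()
{
    std::vector<double> in(1 << 26, 1.5), out(in.size());
    auto op = [](double x) { return std::sqrt(x) * x; };

    std::cout << "hpx::execution::par: "
              << average_seconds([&] {
                     hpx::transform(hpx::execution::par, in.begin(), in.end(),
                         out.begin(), op);
                 })
              << " sec. averaged over 10 runs.\n";

    std::cout << "std::execution::par: "
              << average_seconds([&] {
                     std::transform(std::execution::par, in.begin(), in.end(),
                         out.begin(), op);
                 })
              << " sec. averaged over 10 runs.\n";

    return 0;
}
```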
<hkaiser> par is not too good either :/
<weilewei> really? hpx par runs at 0.277915 sec and std par runs at 16.1129 sec
<weilewei> it seems hpx is faster. But I don't know why std par is so slow.
<hkaiser> but the nvc++ version? is it running on the gpu?
<weilewei> yes nvc++ is running on gpu
<hkaiser> ahh, so apples and bananas
<weilewei> yeah, ignore nvc++ for now. Later I will run Kokkos with nvc++ backend
<hkaiser> we have not necessarily optimized our seq execution
<hkaiser> I can have a look, however
<weilewei> ok, so hpx seq is slower than std seq, this is expected for now?
<hkaiser> also, you might want to try par_simd ;-) instead of par_unseq (which in our case is the same as par)
<hkaiser> Srinivas can help with that
<weilewei> ok, I will add par_simd to the todo list
<weilewei> if I would like to run hpx par on gpu, how does that work?
<hkaiser> or just simd for sequential with vectorization
<hkaiser> weilewei: we don't support that
<weilewei> adding Kokkos?
<hkaiser> not sure if kokkos-hpx supports that, ms[m]1 might know
<weilewei> got it
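A sketch of hkaiser's par_simd/simd suggestion, assuming an HPX build with datapar support enabled (e.g. the HPX_WITH_DATAPAR CMake option; the exact build requirements and headers are assumptions to verify). The callable has to be a generic lambda, because the datapar policies invoke it on SIMD packs rather than on individual doubles.

```cpp
#include <hpx/hpx_main.hpp>
#include <hpx/algorithm.hpp>    // hpx::transform, hpx::execution::simd / par_simd

#include <vector>

int main()
{
    std::vector<double> in(1 << 26, 1.5), out(in.size());

    // generic lambda: works both for plain doubles and for SIMD pack types
    auto op = [](auto x) { return x * x + x; };

    // sequential but vectorized
    hpx::transform(hpx::execution::simd, in.begin(), in.end(), out.begin(), op);

    // parallel and vectorized
    hpx::transform(
        hpx::execution::par_simd, in.begin(), in.end(), out.begin(), op);

    return 0;
}
```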
<hkaiser> weilewei: btw, how many cores is that benchmark running on?
<weilewei> hmm I have not specified it yet
<hkaiser> how many cores has that node?
<weilewei> 64 cpu cores, 256 processors
<weilewei> on Perlmutter
<hkaiser> so it's most likely running on 64 threads (try --hpx:print-bind)
<weilewei> 127: PU L#254(P#127), Core L#127(P#63), Socket L#1(P#1), on pool "default"
<hkaiser> so 127 cores
<hkaiser> 128
<weilewei> why does it print locality 0 twice?
<hkaiser> that's 64 cores with 2 HT each
<hkaiser> I'd suggest specifying --hpx:threads=64
<weilewei> ok
<hkaiser> hmmm. not sure - how do you launch the test?
<hkaiser> what's your slurm command?
<hkaiser> I'll have a look why it's printing things twice, probably a bug
<weilewei> #SBATCH -C gpu
<weilewei> #SBATCH -t 20:00
<weilewei> #SBATCH -N 1
<weilewei> #SBATCH --ntasks-per-node=1
<weilewei> #SBATCH -o parSTL.out
<weilewei> #SBATCH -e parSTL.err
<weilewei> cd /global/homes/w/wwei/src/parSTL
<weilewei> ./scripts.sh
<hkaiser> yah, most likely a bug - I'll investigate
<weilewei> Thanks
<hkaiser> don't run using HTs
<weilewei> ok, I will use  --hpx:threads=64 to run experiments again
<weilewei> let's see how it changes
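Alongside --hpx:print-bind, a tiny sanity check like the sketch below (not part of parSTL; hpx::get_num_worker_threads is assumed to be the right call for the HPX version in use) can confirm how many worker threads the runtime actually got when the benchmark binary is launched with --hpx:threads=64.

```cpp
#include <hpx/hpx_main.hpp>
#include <hpx/include/runtime.hpp>   // hpx::get_num_worker_threads

#include <iostream>

int main()
{
    // with "--hpx:threads=64" this should report 64, i.e. one worker
    // thread per core instead of one per hyperthread
    std::cout << "HPX worker threads: " << hpx::get_num_worker_threads()
              << "\n";
    return 0;
}
```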
<weilewei> With --hpx:threads=64, it seems a bit slower than without
<weilewei> hpx::execution::seq: 17.0798 sec. averaged over 10 runs.
<weilewei> hpx::execution::par: 0.334277 sec. averaged over 10 runs.
<weilewei> hpx::execution::par_unseq: 0.341671 sec. averaged over 10 runs.
<hkaiser> a bit worse, ok
<weilewei> let me wait for std results in the same job run
<weilewei> std::execution::seq: 16.138 sec. averaged over 10 runs.
<weilewei> std::execution::par: 16.0406 sec. averaged over 10 runs.
<weilewei> std::execution::par_unseq: 16.2172 sec. averaged over 10 runs.
<weilewei> well, hpx seq is close to std seq
weilewei has quit [Ping timeout: 260 seconds]
weilewei has joined #ste||ar
<weilewei> hkaiser: if I want to learn more about the sender/receiver work for HPX, who should be the point of contact?
<hkaiser> shreyas, myself, and certainly ms[m]1
<weilewei> Got it, I will learn a bit more for my next project :)
<weilewei> will reach out about this later
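For a first look at the sender/receiver facilities that already ship with HPX, here is a minimal, hedged sketch (the experimental namespaces and headers below should be checked against the HPX version in use): build a small pipeline with just/then and wait for it to complete.

```cpp
#include <hpx/hpx_main.hpp>
#include <hpx/execution.hpp>    // hpx::execution::experimental senders/adaptors

#include <iostream>
#include <utility>

int main()
{
    namespace ex = hpx::execution::experimental;
    namespace tt = hpx::this_thread::experimental;

    // a sender pipeline: start with a value, transform it, then print it
    auto pipeline = ex::just(21)
        | ex::then([](int i) { return 2 * i; })
        | ex::then([](int i) { std::cout << "result: " << i << "\n"; });

    // block until the pipeline has run to completion
    tt::sync_wait(std::move(pipeline));
    return 0;
}
```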
<hkaiser> weilewei: as said, we've never tried to make seq optimal
<weilewei> Got it, I will note it
weilewei has quit [Quit: Ping timeout (120 seconds)]