2021-08-06 22:55 
hkaiser changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar-group.org | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | This channel is logged: irclog.cct.lsu.edu
 
00:05 
<gonidelis[m]> hkaiser:
 
00:39 
<hkaiser> gonidelis[m]: it's not specialized
 
00:41 
<gonidelis[m]> yeah scratch that... i missed a closing `>` on my reading
 
00:50 
rtohid[m] has joined #ste||ar
 
00:51 
diehlpk has joined #ste||ar
 
00:56 
diehlpk has quit [Ping timeout: 255 seconds]
 
00:58 
diehlpk has joined #ste||ar
 
01:03 
diehlpk has quit [Ping timeout: 255 seconds]
 
01:31 
diehlpk has joined #ste||ar
 
01:35 
diehlpk has quit [Ping timeout: 255 seconds]
 
02:04 
diehlpk has joined #ste||ar
 
02:09 
diehlpk has quit [Ping timeout: 255 seconds]
 
02:26 
diehlpk has joined #ste||ar
 
02:26 
diehlpk has quit [Client Quit]
 
02:51 
Yorlik_ has joined #ste||ar
 
02:55 
Yorlik__ has quit [Ping timeout: 265 seconds]
 
03:53 
hkaiser has quit [Quit: Bye!]
 
05:52 
Yorlik__ has joined #ste||ar
 
05:56 
Yorlik_ has quit [Ping timeout: 264 seconds]
 
07:55 
Yorlik__ is now known as Yorlik
 
12:10 
hkaiser has joined #ste||ar
 
20:39 
diehlpk has joined #ste||ar
 
21:33 
tufei has joined #ste||ar
 
22:04 
diehlpk has quit [Quit: Leaving.]
 
22:07 
diehlpk has joined #ste||ar
 
22:10 
diehlpk has quit [Client Quit]
 
22:31 
weilewei has joined #ste||ar
 
22:31 
<hkaiser> this call for instance did find two different tag_invoke implementations, the one from MWGraph and ours.
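For readers unfamiliar with the failure mode being described: when several libraries each provide `tag_invoke` overloads, an unqualified call can pick up candidates from more than one of them via ADL. Below is a minimal, purely hypothetical illustration; none of the names come from HPX or MWGraph.

```cpp
// Hypothetical illustration only: lib_a and lib_b stand in for two unrelated
// libraries that both make a tag_invoke overload visible to the same call.
#include <utility>

namespace lib_a {
    struct some_tag_t
    {
        // Hidden-friend overload, found via ADL on some_tag_t.
        template <typename T>
        friend auto tag_invoke(some_tag_t, T&& t)
        {
            return std::forward<T>(t);
        }
    };
    inline constexpr some_tag_t some_tag{};
}

namespace lib_b {
    struct widget {};

    // Second overload, found via ADL on widget's namespace.
    int tag_invoke(lib_a::some_tag_t, widget)
    {
        return 42;
    }
}

int main()
{
    // The unqualified call sees candidates from both lib_a and lib_b; here
    // the non-template exact match from lib_b wins, but when two candidates
    // are equally good the call becomes ambiguous and fails to compile.
    auto r = tag_invoke(lib_a::some_tag, lib_b::widget{});
    return r == 42 ? 0 : 1;
}
```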
 
22:35 
<weilewei> nvc++ std::execution::seq: 3.65969 sec, averaged over 10 runs.
 
22:35 
<weilewei> nvc++ std::execution::par: 0.0699834 sec, averaged over 10 runs.
 
22:35 
<weilewei> nvc++ std::execution::par_unseq: 0.0716834 sec, averaged over 10 runs.
 
22:35 
<weilewei> hpx::execution::seq: 18.2115 sec, averaged over 10 runs.
 
22:35 
<weilewei> hpx::execution::par: 0.277915 sec, averaged over 10 runs.
 
22:35 
<weilewei> hpx::execution::par_unseq: 0.253258 sec, averaged over 10 runs.
 
22:35 
<weilewei> kokkos::parallel_for transform: 0.562939 sec, averaged over 10 runs.
 
22:35 
<weilewei> std::execution::seq: 16.2453 sec, averaged over 10 runs.
 
22:35 
<weilewei> std::execution::par: 16.1129 sec, averaged over 10 runs.
 
22:35 
<weilewei> std::execution::par_unseq: 16.3805 sec, averaged over 10 runs.
 
22:35 
<weilewei> __gnu_parallel::transform: 0.460436 sec, averaged over 10 runs.
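For context, a rough sketch of the kind of measurement being compared above, assuming a simple element-wise transform over a large vector averaged over 10 runs; the actual kernel, problem size, and driver from this benchmark are not shown in the log.

```cpp
// Sketch only: the kernel, vector size, and timing loop are illustrative and
// are NOT the benchmark from the log.
#include <algorithm>
#include <chrono>
#include <execution>
#include <iostream>
#include <vector>

#include <hpx/algorithm.hpp>
#include <hpx/execution.hpp>
#include <hpx/hpx_main.hpp>

int main()
{
    std::vector<double> in(1 << 26, 1.0), out(in.size());

    // Run a callable 10 times and return the average wall-clock time.
    auto time_avg = [&](auto&& run) {
        double total = 0.0;
        for (int i = 0; i != 10; ++i)
        {
            auto t0 = std::chrono::steady_clock::now();
            run();
            total += std::chrono::duration<double>(
                std::chrono::steady_clock::now() - t0).count();
        }
        return total / 10.0;
    };

    double t_std = time_avg([&] {
        std::transform(std::execution::par, in.begin(), in.end(),
            out.begin(), [](double x) { return 2.0 * x + 1.0; });
    });

    double t_hpx = time_avg([&] {
        hpx::transform(hpx::execution::par, in.begin(), in.end(),
            out.begin(), [](double x) { return 2.0 * x + 1.0; });
    });

    std::cout << "std::execution::par: " << t_std << " sec, averaged over 10 runs.\n"
              << "hpx::execution::par: " << t_hpx << " sec, averaged over 10 runs.\n";
    return 0;
}
```

The near-identical seq/par/par_unseq timings in the std:: rows above are consistent with a serial fallback (for example libstdc++ parallel algorithms without a TBB backend), though the log does not confirm which standard library or build flags were used.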
 
22:36 
<hkaiser> par is not too good either :/
 
22:37 
<weilewei> really? hpx par runs at 0.277915 sec and std par runs at 16.1129 sec
 
22:38 
<weilewei> it seems hpx is faster, but I don't know why std par is so slow.
 
22:38 
<hkaiser> but the nvcc version? is it running on gpu?
 
22:38 
<weilewei> yes nvc++ is running on gpu
 
22:38 
<hkaiser> ahh, so apples and bananas
 
22:39 
<weilewei> yeah, ignore nvc++ for now. Later I will run Kokkos with the nvc++ backend
 
22:39 
<hkaiser> we have not necessarily optimized our seq execution
 
22:39 
<hkaiser> I can have a look, however
 
22:39 
<weilewei> ok, so hpx seq is slower than std seq, this is expected for now?
 
22:39 
<hkaiser> also, you might want to try par_simd ;-) instead of par_unseq (which in our case is the same as par)
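A hedged sketch of what that suggestion might look like, assuming HPX is built with datapar support (-DHPX_WITH_DATAPAR=ON) and that the vectorizing policies are spelled hpx::execution::simd and hpx::execution::par_simd as named in this conversation; both the names and their availability may differ between HPX versions. With these policies the callable is invoked on SIMD packs, hence the generic lambda.

```cpp
// Hedged sketch only: assumes an HPX build with -DHPX_WITH_DATAPAR=ON and
// that the policies are hpx::execution::simd / hpx::execution::par_simd
// (the names used in the chat); these assumptions may not hold everywhere.
#include <hpx/algorithm.hpp>
#include <hpx/execution.hpp>
#include <hpx/hpx_main.hpp>

#include <vector>

int main()
{
    std::vector<double> in(1 << 26, 1.0), out(in.size());

    // Sequential but vectorized ("just simd for sequential with vectorization").
    hpx::transform(hpx::execution::simd, in.begin(), in.end(), out.begin(),
        [](auto x) { return 2.0 * x + 1.0; });

    // Parallel and vectorized (the suggested replacement for par_unseq).
    hpx::transform(hpx::execution::par_simd, in.begin(), in.end(), out.begin(),
        [](auto x) { return 2.0 * x + 1.0; });

    return 0;
}
```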
 
22:40 
<hkaiser> Srinivas can help with that
 
22:40 
<weilewei> ok, I will add par_simd to the todo list
 
22:40 
<weilewei> if I would like to run hpx par on gpu, how does that work?
 
22:41 
<hkaiser> or just simd for sequential with vectorization
 
22:41 
<hkaiser> weilewei: we don't support that
 
22:41 
<weilewei> adding Kokkos?
 
22:41 
<hkaiser> not sure if kokkos-hpx supports that, ms[m]1 might know
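For reference, a minimal Kokkos sketch of the same element-wise transform; the kernel and sizes are illustrative, and whether the Kokkos-HPX backend can target a GPU for this pattern is exactly the open question above. With a CUDA-enabled Kokkos build the default execution space would be the GPU.

```cpp
// Minimal Kokkos sketch (illustrative kernel and size, not the benchmark
// from the log). Runs on Kokkos' default execution space: CUDA, OpenMP,
// HPX, ... depending on how Kokkos was configured.
#include <Kokkos_Core.hpp>

int main(int argc, char* argv[])
{
    Kokkos::initialize(argc, argv);
    {
        int const n = 1 << 26;
        Kokkos::View<double*> in("in", n), out("out", n);
        Kokkos::deep_copy(in, 1.0);

        Kokkos::parallel_for("transform", n,
            KOKKOS_LAMBDA(const int i) { out(i) = 2.0 * in(i) + 1.0; });
        Kokkos::fence();
    }
    Kokkos::finalize();
    return 0;
}
```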
 
22:43 
<hkaiser> weilewei: btw, how many cores is that benchmark running on?
 
22:43 
<weilewei> hmm I have not specified it yet
 
22:43 
<hkaiser> how many cores has that node?
 
22:44 
<weilewei> 64 cpu cores, 256 processors
 
22:44 
<weilewei> on Perlmutter
 
22:45 
<hkaiser> so it's most likely running on 64 threads (try --hpx:print-bind)
 
22:46 
<weilewei> 127: PU L#254(P#127), Core L#127(P#63), Socket L#1(P#1), on pool "default"
 
22:49 
<hkaiser> so 127 cores
 
22:50 
<weilewei> why does it print locality 0 twice?
 
22:50 
<hkaiser> that's 64 cores with 2 HT each
 
22:50 
<hkaiser> I'd suggest specifying --hpx:threads=64
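A small hedged check that can be dropped into the benchmark to confirm what the runtime actually got, assuming the hpx_main-style setup used in the sketches above: hpx::get_os_thread_count() reports the number of HPX worker (OS) threads, while std::thread::hardware_concurrency() reports the number of logical PUs the node exposes.

```cpp
// Hedged sketch: confirm the effect of --hpx:threads=64 from inside the
// application (assumes hpx_main-style initialization as in the sketches above).
#include <hpx/hpx_main.hpp>
#include <hpx/include/runtime.hpp>

#include <iostream>
#include <thread>

int main()
{
    std::cout << "HPX worker threads:  " << hpx::get_os_thread_count() << '\n'
              << "Logical PUs on node: " << std::thread::hardware_concurrency()
              << '\n';
    return 0;
}
```

Running the same binary with and without --hpx:threads=64 should then show the difference between the default (all hardware threads) and the requested 64.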
 
22:50 
<hkaiser> hmmm, not sure - how do you launch the test?
 
22:51 
<hkaiser> what's your slurm command?
 
22:51 
<hkaiser> I'll have a look why it's printing things twice, probably a bug
 
22:51 
<weilewei> #SBATCH -C gpu
 
22:51 
<weilewei> #SBATCH -t 20:00
 
22:51 
<weilewei> #SBATCH -N 1
 
22:51 
<weilewei> #SBATCH --ntasks-per-node=1
 
22:51 
<weilewei> #SBATCH -o parSTL.out
 
22:51 
<weilewei> #SBATCH -e parSTL.err
 
22:51 
<weilewei> cd /global/homes/w/wwei/src/parSTL
 
22:51 
<weilewei> ./scripts.sh
 
22:52 
<hkaiser> yah, most likely a bug - I'll investigate
 
22:52 
<hkaiser> don't run using HTs
 
22:52 
<weilewei> ok, I will use --hpx:threads=64 to run the experiments again
 
22:52 
<weilewei> let's see how it changes
 
22:56 
<weilewei> With --hpx:threads=64, it seems a bit slower than without. hpx::execution::seq: 17.0798 sec, averaged over 10 runs.
 
22:56 
<weilewei> hpx::execution::par: 0.334277 sec, averaged over 10 runs.
 
22:56 
<weilewei> hpx::execution::par_unseq: 0.341671 sec, averaged over 10 runs.
 
22:57 
<hkaiser> a bit worse, ok
 
22:57 
<weilewei> let me wait for the std results in the same job run
 
23:03 
<weilewei> std::execution::seq: 16.138 sec, averaged over 10 runs.
 
23:03 
<weilewei> std::execution::par: 16.0406 sec, averaged over 10 runs.
 
23:03 
<weilewei> std::execution::par_unseq: 16.2172 sec, averaged over 10 runs.
 
23:04 
<weilewei> well, hpx seq is close to std seq
 
23:19 
weilewei has quit [Ping timeout: 260 seconds]
 
23:21 
weilewei has joined #ste||ar
 
23:26 
<weilewei> hkaiser if I want to learn more about the sender/receiver work for HPX, who should be the point of contact?
 
23:29 
<hkaiser> shreyas, myself, and certainly ms[m]1
 
23:30 
<weilewei> Got it, I will learn a bit more for my next project :)
 
23:30 
<weilewei> will reach out later in this regard
 
23:31 
<hkaiser> weilewei: as said, we've never tried to make seq optimal
 
23:31 
<weilewei> Got it, I will note it
 
23:42 
weilewei has quit [Quit: Ping timeout (120 seconds)]