#ste||ar on 2020-04-01 — irc logs at irclog.cct.lsu.edu

2020-02-24 20:46 hkaiser changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/ | GSoC: https://github.com/STEllAR-GROUP/hpx/wiki/Google-Summer-of-Code-%28GSoC%29-2020

00:07 nan11 has joined #ste||ar

00:09 <diehlpk_mobile[m> The student submission period is now over - and the numbers were much higher than ever before -- over 51,000 students registered for the program this year (a 65% increase from the previous high)! 6,335 students submitted their final proposals and applications for you all to review over these next couple of weeks (that's a 13% increase over last year).

00:09 <diehlpk_mobile[m> Wow

00:13 <hkaiser> diehlpk_mobile[m: do you know how many organizations they have?

00:22 <diehlpk_mobile[m> Hkaiser 200 and 30 novel ones

00:22 akheir1 has quit [Read error: Connection reset by peer]

00:22 <diehlpk_mobile[m> 200 in total and 30 out of them are novel

00:23 akheir1 has joined #ste||ar

00:28 <wate123_Jun> wow, but also not surprising because of this pandemic.

00:36 hkaiser has quit [Read error: Connection reset by peer]

00:42 weilewei has quit [Remote host closed the connection]

00:45 hkaiser has joined #ste||ar

01:18 <Yorlik> hkaiser: I think I have run into my first deadlocks

01:19 <Yorlik> It seems to depend on the upper and lower limit of the task limiting algorithm.

01:19 <Yorlik> What happens is, that all workers are in the yield_while

01:20 <Yorlik> Still investigating.

01:20 <hkaiser> nod, bad one

01:20 <Yorlik> I'm doing a bit of reading on the topic. Do you have any recommendation what to look for?

01:21 <Yorlik> I am thinking about modifying the yield_while predicate I'm using and have a chance to continue or add a time based release

01:22 <Yorlik> But I don't want to kill performance ofc.

01:22 <Yorlik> You know - I always want my free lunch and eat it too :)

01:24 nan11 has quit [Ping timeout: 240 seconds]

01:24 wate123_Jun has quit [Remote host closed the connection]

01:25 wate123_Jun has joined #ste||ar

01:29 wate123_Jun has quit [Ping timeout: 246 seconds]

01:44 wate123_Jun has joined #ste||ar

02:03 wate123_Jun has quit [Remote host closed the connection]

02:04 wate123_Jun has joined #ste||ar

02:05 Vir has quit [Ping timeout: 256 seconds]

02:05 Vir has joined #ste||ar

02:05 Vir has quit [Changing host]

02:05 Vir has joined #ste||ar

02:07 weilewei has joined #ste||ar

02:08 <weilewei> do mentors need to review proposal from other organizations?

02:08 wate123_Jun has quit [Ping timeout: 240 seconds]

02:13 <hkaiser> weilewei: no

02:13 <weilewei> hkaiser ok, got it

02:23 diehlpk_work has quit [Remote host closed the connection]

02:40 wate123_Jun has joined #ste||ar

02:41 akheir1 has quit [Read error: Connection reset by peer]

02:41 akheir1 has joined #ste||ar

02:45 wate123_Jun has quit [Ping timeout: 252 seconds]

03:01 hkaiser has quit [Quit: bye]

03:01 akheir1 has quit [Read error: Connection reset by peer]

03:01 akheir1 has joined #ste||ar

03:15 wate123_Jun has joined #ste||ar

03:20 wate123_Jun has quit [Ping timeout: 252 seconds]

03:20 akheir1 has quit [Read error: Connection reset by peer]

03:20 akheir1 has joined #ste||ar

03:33 parsa has quit [Ping timeout: 252 seconds]

03:48 weilewei has quit [Remote host closed the connection]

03:48 wate123_Jun has joined #ste||ar

03:52 wate123_Jun has quit [Ping timeout: 252 seconds]

03:56 akheir1 has quit [Remote host closed the connection]

04:07 parsa has joined #ste||ar

04:23 wate123_Jun has joined #ste||ar

04:27 wate123_Jun has quit [Ping timeout: 240 seconds]

04:58 wate123_Jun has joined #ste||ar

05:03 wate123_Jun has quit [Ping timeout: 252 seconds]

05:37 wate123_Jun has joined #ste||ar

05:41 wate123_Jun has quit [Ping timeout: 252 seconds]

05:41 nikunj97 has joined #ste||ar

06:12 wate123_Jun has joined #ste||ar

06:17 wate123_Jun has quit [Ping timeout: 252 seconds]

06:50 wate123_Jun has joined #ste||ar

06:55 wate123_Jun has quit [Ping timeout: 252 seconds]

06:56 kale_ has joined #ste||ar

07:01 kale_ has quit [Ping timeout: 265 seconds]

07:06 kale_ has joined #ste||ar

07:06 kale_ has quit [Client Quit]

07:29 wate123_Jun has joined #ste||ar

07:33 wate123_Jun has quit [Ping timeout: 252 seconds]

07:46 <nikunj97> heller1, just confirming. If expected peak of double is x, then expected peak of float will be 2x. Right?

07:52 <heller1> depends on the architecture

07:57 <nikunj97> I see. My expected performance for HiSilicon1616 are way off the observed results

07:59 <nikunj97> heller1, I've sent you the drive link to the initial benchmark results I've obtained

08:00 <nikunj97> not the best looking graph, but I wanted to know your opinions on it

08:00 <nikunj97> I'm considering 2 loads and 1 store for calculating peak expected performance

08:03 <heller1> Where is the roofline in those graphs?

08:04 wate123_Jun has joined #ste||ar

08:04 <nikunj97> roofline is the expected peak

08:05 <nikunj97> Since the problem is memory bound, I calculated the peak expected value from the values obtained from roofline and plotted that as expected peak performance

08:06 <nikunj97> would that not suffice?

08:06 <heller1> sorry, I have no idea what I am looking at

08:06 <nikunj97> aah, they're that bad damn. Let me try to beautify them for you

08:06 <heller1> no, the 'expected' looks very wrong

08:07 <nikunj97> how so?

08:07 <heller1> I think

08:08 <heller1> for the hisilicon, what's that drop at 24 cores?

08:08 wate123_Jun has quit [Ping timeout: 240 seconds]

08:08 <nikunj97> I calculated expected as follows: 1. Calculate peak bandwidth (from stream triad) for that many cores, 2. Calculate the peak performance considering 2 loads and 1 store i.e. 1 MLUP per 24 Byte

08:08 <heller1> it should be a line that's monotonically increasing and eventually converging to the value of AI*peak_bw

08:09 <nikunj97> HiSilicon had a drop in memory bandwidth going from 16 to 24 cores

08:09 <heller1> that's expectd

08:09 <heller1> * that's expected

08:09 <heller1> look again at its architecture

08:09 <nikunj97> yes, so you see a drop in performance across the board as well

08:09 <heller1> no, you just measured wrong

08:10 <heller1> or interpreted the result wrongly

08:10 <nikunj97> also the cores go up til 64 cores (with hyperthreading)

08:10 <heller1> remember that I told you to watch out for SMT and NUMA stuff?

08:10 <nikunj97> hartmut asked me to measure till 32 if you don't hyperthreading stuff

08:11 <heller1> 2 threads on a single core (with hyperthreading) isn't the same as 2 threads on two different cores

08:11 <nikunj97> yes, I remember that

08:11 <nikunj97> should I use something like numactl then?

08:11 <heller1> did you pin the openmp threads for the stream benchmark?

08:11 <nikunj97> yes, I did

08:12 <nikunj97> you mean this: export OMP_NUM_THREADS=4, right?

08:12 <heller1> no

08:12 <nikunj97> crap

08:13 <nikunj97> so I use numactl to pin cores and then run everything again?

08:13 <heller1> HPX does what you want here. OpenMP not so much, at least not the gnu implementation

08:13 <heller1> or hwloc-bind

08:13 <heller1> which is a tad easier, since you can visualize the topology nicely

08:14 <simbergm> `OMP_PROC_BIND=thread` might be what you're looking for

08:14 <nikunj97> stream used that in FAQ, so I thought, you meant that by thread pinning

08:14 <nikunj97> simbergm, thanks!

08:14 <nikunj97> heller1, how should I proceed then?

08:14 <nikunj97> what would you suggest?

08:15 <heller1> redo your measurements with the correct pinning ;)

08:15 Abhishek09 has joined #ste||ar

08:15 <nikunj97> using hwloc-bind, alright :/

08:15 <simbergm> nikunj97: if you're using slurm to set the cores, you can get the same bindings in openmp and hpx by using `--hpx:use-process-mask` (openmp will automatically use the mask)

08:15 <heller1> choose whichever method you think fits best

08:16 <heller1> ms[m]: even the GNU implementation?

08:16 <simbergm> heller: it might not

08:16 <simbergm> not sure about the differences between the implementations, some will at least

08:16 <heller1> yeah ...

08:17 Hashmi has joined #ste||ar

08:17 <simbergm> doing it manually is the surest way of course

08:17 <nikunj97> btw what if I do srun -c <num_cores> --threads-per-core=1?

08:17 <nikunj97> will that also be similar to thread pinning?

08:18 <heller1> depends on 1) How slurm is configured 2) What your OpenMP implementation does

08:18 <nikunj97> could you elaborate?

08:19 <heller1> slurm and OpenMP do interact sometimes, sometimes not

08:19 <nikunj97> will it work for HPX?

08:19 <heller1> huge option space there. Best consult the docs of your cluster

08:19 <nikunj97> if I run my benchmarks with that, I mean

08:19 <heller1> yes, HPX always does the right thing (tm)

08:19 <heller1> well

08:20 <heller1> check it

08:20 <heller1> --hpx:print-bind is yoru friend

08:20 <heller1> lstopo is your friend

08:20 <heller1> the manual of your OpenMP implementation is your friend, as well as the manual of your cluster ;)

08:21 <nikunj97> will have to look at the manual somewhere, all I have right now is a specification of the cluster

08:21 <heller1> what you have to ensure is that you always measure the same thing.

08:23 <heller1> and be sure to know what you measure

08:28 <nikunj97> simbergm, if I do slurm ... -c 20 ./hpx_executable --hpx:user-process-mask. Will it make sure to bind only to slurm allocated cores? Is that what you meant?

08:31 <heller1> nikunj97: check with `--hpx:print-bind`

08:31 <simbergm> nikunj97: yeah, that's what it's supposed to do (it's *use*-process-mask, not user btw)

08:32 <simbergm> and you can always check with print-bind like heller suggested

08:32 <simbergm> it might depend on slurm how slurm is configured as well, but as far as I know it'll always set a mask for the process

08:33 <nikunj97> simbergm, heller1 thanks! I'll try doing it this way and check

08:37 wate123_Jun has joined #ste||ar

08:41 wate123_Jun has quit [Ping timeout: 252 seconds]

08:58 <simbergm> mdiers I forgot to mention yesterday: 4306 was cherry picked into the 1.4.1 release, so those changes should already be there

08:58 <simbergm> in any case I've just updated the pr so hopefully we can get it in this week

09:16 wate123_Jun has joined #ste||ar

09:16 <mdiers[m]> <simbergm " mdiers I forgot to mention yes"> the cherry picks are already in master?

09:16 <simbergm> mdiers: good question, apparently not

09:16 <simbergm> they are in 1.4.1 though

09:17 <simbergm> normally we'd merge release branches back into master, but I think 1.4.1 might not have gotten that treatment

09:17 <simbergm> in that case you'll still have to wait for the PR... sorry for the confusion!

09:19 <mdiers[m]> ms: all right, no problem ;-)

09:20 wate123_Jun has quit [Ping timeout: 252 seconds]

09:32 nikunj97 has quit [Remote host closed the connection]

09:33 nikunj97 has joined #ste||ar

09:43 <nikunj97> heller1, just realized that hisilicon one is incomplete. That's why you don't see it hit the peak

09:44 <nikunj97> it has 64 cores and 64 threads and I plotted only till 32 cores

09:50 Abhishek09 has quit [Remote host closed the connection]

09:50 <nikunj97> also with xeon, afaik --hpx-threads=x uses the first x cores from machine. So the xeon graphs should've used the correct cores as well (since hyperthreads attached 0,21;1,22;and so on)

09:50 wate123_Jun has joined #ste||ar

09:54 <nikunj97> *0,20;1,21; and so on

09:55 wate123_Jun has quit [Ping timeout: 252 seconds]

10:11 <heller1> yes, as said, HPX should be fine. I am concerned about your OMP comparisons

10:12 <nikunj97> yea, I'll do those peak bandwidth calculations again

10:12 <nikunj97> with numactl

10:12 <heller1> never used that

10:14 <nikunj97> it's pretty easy, `numactl --localalloc --physcpubind=0-20` would mean that cores 0 to 20 are used and each of them is bound to their specific node

10:14 <nikunj97> that's what JSC people asked me to use for thread pinning

10:24 wate123_Jun has joined #ste||ar

10:28 wate123_Jun has quit [Ping timeout: 252 seconds]

10:46 Hashmi has quit [Quit: Connection closed for inactivity]

10:54 Hashmi has joined #ste||ar

11:01 wate123_Jun has joined #ste||ar

11:06 wate123_Jun has quit [Ping timeout: 252 seconds]

11:41 wate123_Jun has joined #ste||ar

11:45 wate123_Jun has quit [Ping timeout: 240 seconds]

12:02 hkaiser has joined #ste||ar

12:21 wate123_Jun has joined #ste||ar

12:26 wate123_Jun has quit [Ping timeout: 252 seconds]

12:42 <nikunj97> heller1, what did you think was wrong with xeon e5 graphs?

12:43 <nikunj97> I have tried with both numactl and hwloc-bind and the updated results looks skewed. My runs are running better than expected peak.

12:45 <nikunj97> other core performance for hisilicon are still running. Will let you know the final results when they're done

12:48 <nikunj97> heller1, I've updated e5 graphs with the new one's. There's not much of a difference, except for the benchmark running faster than 5 core peak

12:58 wate123_Jun has joined #ste||ar

13:02 wate123_Jun has quit [Ping timeout: 252 seconds]

13:04 <nikunj97> heller1, I also added the rooflines for hisilicon1616 and xeon e5 as well (for double precision peak performance for CPU)

13:09 <nikunj97> I think arithmetic intensity of 1/8 (assuming 3 loads and 1 store) is not apt for the calculations. I don't like seeing values above peak expected performance

13:30 wate123_Jun has joined #ste||ar

13:32 hkaiser has quit [Quit: bye]

13:34 wate123_Jun has quit [Ping timeout: 252 seconds]

13:42 weilewei has joined #ste||ar

14:06 Hashmi has quit [Quit: Connection closed for inactivity]

14:07 diehlpk_work has joined #ste||ar

14:08 wate123_Jun has joined #ste||ar

14:12 wate123_Jun has quit [Ping timeout: 240 seconds]

14:20 wate123_Jun has joined #ste||ar

14:24 akheir has joined #ste||ar

14:24 <nikunj97> heller1, one thing that I noticed from your paper. You decompose the stencil to blocks, while I'm decomposing into lines

14:26 <nikunj97> I'm thinking to take this route as well. This way I can work on evenly sized stencils (i.e. x and y dimensions does not differ by a large factor)

14:35 nikunj97 has quit [Ping timeout: 252 seconds]

14:39 nikunj97 has joined #ste||ar

14:43 Hashmi has joined #ste||ar

14:43 akheir1 has joined #ste||ar

14:46 <heller1> Yes, blocking is the way to go. Just makes it significantly harder in distributed

14:46 hkaiser has joined #ste||ar

14:46 akheir has quit [Read error: Connection reset by peer]

14:46 <heller1> Well, did you plot the values in the normal roofline?

14:47 <Yorlik> o/

14:47 <heller1> I'm just very confused about the presentation about your expected performance

14:47 <Yorlik> hkaiser: I ended up writing my own primitive version of a yield_while: https://gist.github.com/McKillroy/5b851aebc0ebfe2048123b48857d9d6c

14:48 <Yorlik> It's like a lightweight selfmade fake scheduler with limited scope (parloop chunks)

14:49 <Yorlik> But so far no more deadlocks.

14:50 nikunj97 has quit [Ping timeout: 260 seconds]

14:50 <hkaiser> Yorlik: sure, even if you use HPX that doesn't prevent you from writing your own facilities

14:50 <Yorlik> It all comes over time.

14:51 <Yorlik> Can't learn everything at once. :)

14:56 <hkaiser> simbergm: ?

15:02 nan11 has joined #ste||ar

15:09 <simbergm> hkaiser: ??

15:10 <hkaiser> hey

15:10 <hkaiser> #4485

15:10 <hkaiser> you propose to movethe function serialization to the functional module

15:10 <simbergm> yeah

15:10 <simbergm> yes, not sure about that yet... hence the draft

15:11 <hkaiser> I'm confused - when creating the functional module we intentionally left that out, didn't we?

15:11 <simbergm> however, since you made serialization more lightweight it wouldn't be too bad

15:12 <hkaiser> simbergm: I don't remember what we discussed

15:12 <hkaiser> did we want to make essentially all modules dpend on serialization or make serialization depend on all modules

15:13 <simbergm> we have made other things depend directly on serialization now as well, instead of having the serialization separately

15:13 <hkaiser> iirc, we wanted for serialization to provide the infrastructure needed and serialization support for ambient data types (from the std library, etc.)

15:13 <simbergm> that was also before you had made the serialization module

15:14 <simbergm> it would still be nice to have it separate, but there are too many other things that still have serialization implemented intrusively that it doesn't make sense to have them in a separate module at the moment

15:14 <hkaiser> simbergm: I'm not saying what you propose is wrong, I'm trying to discuss to find a uniform solution

15:15 <hkaiser> so I think we can agree that the serialization module itself should have serialization support for ambient types

15:15 <simbergm> hkaiser: yeah, didn't take it that way

15:15 <simbergm> yep, agreed on that

15:15 <simbergm> and that's pretty much the way it is right now

15:15 <hkaiser> std:: and possibly boost::

15:15 <simbergm> yep

15:15 <simbergm> external dependencies essentially

15:16 <hkaiser> and serialization support for our own types go into their respective module

15:16 <simbergm> we have lot of functionality that still has intrusive serialization, those would have to be separated out as well for this whole concept of separate serialization to make sense

15:16 <hkaiser> I'm not sure I'd like that

15:17 <simbergm> in that case having function serialization separate never made sense either :P

15:17 <hkaiser> so yes, I think moving the function serialization into the functional module is a correct move

15:17 <simbergm> the original reason for separating function serialization (it was the first one I did it for) was that serialization was a heavy dependency

15:17 <hkaiser> yah, not sure anymore why we did this (same with any, btw)

15:18 <simbergm> since it's lightweight now it makes much less sense to have it in a separate module

15:18 <hkaiser> right

15:18 <hkaiser> ok, count me in, I think this is sensible

15:19 <simbergm> well, the benefit of having serialization completely separate is that it would allow building hpx with serialization, but consumers not having to pull in any serialization if they don't want to

15:19 <simbergm> but that is a lot of effort for not too much gain at the moment

15:19 <simbergm> there are more important things before that

15:20 <simbergm> so I think having it in the functional module makes most sense at least right now

15:20 <hkaiser> nod, I agree

15:21 nikunj97 has joined #ste||ar

15:21 <simbergm> did you see my reply on the async modules? does what I wrote make sense?

15:21 <simbergm> actually, jbjnr yt? do you think it would be feasible to not use dataflow in the guided_pool_executor?

15:23 <simbergm> see https://github.com/STEllAR-GROUP/hpx/pull/4478

15:23 <simbergm> we can of course both get rid of dataflow in that executor and have some sort of base module for async/apply/dataflow

15:24 bita has joined #ste||ar

15:25 <nikunj97> heller1, I did not plot it. Let me do that now and see how it looks

15:26 <hkaiser> simbergm: in a meeting now, will get back to you

15:26 <simbergm> hkaiser: np

16:23 <hkaiser> simbergm: ok, I'm back

16:24 <hkaiser> simbergm: I added a comment to the ticket

16:24 <hkaiser> we can break the circular dependencies by separating the executor related specializations of the dataflow implementation

16:24 <hkaiser> similar to what you've done for hpx::future

16:26 gonidelis has joined #ste||ar

16:27 <hkaiser> simbergm: hmm, I was sure I did add a comment on the ticket, but apparently I didn't :/

16:39 rtohid has joined #ste||ar

16:41 <bita> hkaiser, I set two localities for this, https://github.com/STEllAR-GROUP/phylanx/blob/e3639f42e37ad302f84c0bdcf9e3e8016470e485/tests/performance/temp.cpp, and I get the "thread pool is not running" error

16:42 <hkaiser> bita: ok, I'll have a look, thanks

16:42 <hkaiser> bita: running on 2 localities?

16:42 <bita> This test should fail throwing another exception, because cannon_product does not work on two localities.

16:42 <bita> yes

16:42 <hkaiser> ok

16:43 <bita> If you like to run it, I can make another test and create an issue. should I do it?

16:43 <hkaiser> this test is fine for now, thanks

16:43 <bita> got it :)

16:43 <hkaiser> but please feel free to create an issue

16:43 <bita> Okay :)

16:51 Hashmi has quit [Quit: Connection closed for inactivity]

16:53 <bita> hkaiser, I made the example using dot_d and it worked. So, I think the problem is setting 2 localities for cannon. I tested with cannon on 4 localities and it worked too

16:53 <hkaiser> ok

16:54 <bita> so the problem is we always get the "thread pool is not running" exception instead of the real exception. The good news is everything else is working

16:56 <bita> I will create an issue if you think it makes sense to you. however, fixing this should not be the priority

16:58 <hkaiser> I'd like to find out what's wrong - it should work

16:58 <bita> of course

17:09 Hashmi has joined #ste||ar

17:17 <hkaiser> bita: btw, do you think it would be sensible to make the last two arguments to random_d and constand_d optional (the current locality and the number of localities)?

17:17 <hkaiser> those could be initialized from the current HPX locality, if needed

17:21 <bita> I am confused, I think the user should decide how many localities she needs for her generated new array (of constants or randoms)

17:21 <hkaiser> bita: yes, of course

17:22 <hkaiser> bita: but most of the time the number of localities to use is the same as the number of localities the HPX applications runs on

17:22 <hkaiser> so it could be derived at runtime as default values

17:22 <bita> Okay, I get it

17:23 <bita> I will work on that after getting perftests done

17:25 <nikunj97> heller1, I updated the x86 roofline with 2d stencil performance recorded

17:25 <nikunj97> you can check it in the drive link

17:26 <nikunj97> 15 and 20 core performance is not so great compared to 5 and 10 core ones

17:27 <nikunj97> btw the files are named as x86_64_<data_type>.png, where data_type denotes the performance of the stencil with that type

17:41 weilewei has quit [Remote host closed the connection]

17:46 <nikunj97> heller1, just updated the hisilicon runs as well

17:55 <simbergm> hkaiser: yep, I think how to get rid of the dependency is clear (let's see tomorrow...)

17:55 <simbergm> I was more thinking about how to structure the rest of the files

17:56 <simbergm> I don't think it makes much sense to only move the dataflow executor specializations to the execution module, I would in that case also move them for async, apply, and sync

17:56 <simbergm> which doesn't leave much in the local_async module...

17:57 <simbergm> but I would perhaps add the slightly silly async_base module with the bare minimum, move the specializations to the execution module

17:58 <simbergm> and maybe still have a local_async module that just pulls in all of that, but not sure about this yet

17:59 gonidelis has quit [Remote host closed the connection]

18:00 <simbergm> if I get rid of the local_async module a user can also include just say hpx/{executors,execution}/async.hpp for the local only versions

18:00 <hkaiser> simbergm: there is much more than the executor specializations for async and friends in local_async

18:00 <simbergm> I guess the only reason I don't like that naming is because it breaks the symmetry with the async module... but then that could be called distributed_execution

18:00 <hkaiser> but yah, all of this applies to async, apply, dataflow, future etc

18:01 <simbergm> hkaiser: there's the launch policy and executor specializations

18:01 <hkaiser> launch policy can stay in local_async, no?

18:01 gonidelis has joined #ste||ar

18:02 <simbergm> yeah, it can

18:02 <simbergm> wait, no, it can't

18:02 nan11 has quit [Remote host closed the connection]

18:02 <simbergm> that's the problem

18:03 <hkaiser> and the base templates for the specializations too? or should those be somewhere else?

18:03 <simbergm> at least the way dataflow is used currently in guided_pool_executor

18:03 <simbergm> guided_pool_executor uses the launch policy specializations

18:04 ibalampanis has joined #ste||ar

18:04 <simbergm> but maybe it can just specify an explicit executor instead

18:04 <hkaiser> well that exeutor could directly dispatch to something like async_launch_policy_dispatch

18:05 <hkaiser> that one was created for similar reasons, to break the dependency of the parallel executor on async

18:05 <simbergm> yep, that's what I would like it to do

18:05 <simbergm> I mean, we managed to get rid of dataflow elsewhere

18:05 <hkaiser> just dataflow_dispatch or simila

18:05 <hkaiser> r

18:06 <ibalampanis> Hello to everyone! Have a nice day!

18:06 <simbergm> right

18:06 <simbergm> hi ibalampanis!

18:07 <simbergm> I'll try to talk to jbjnr tomorrow (latest at the meeting)

18:07 <ibalampanis> hkaiser: I 'd like to ask you, how many proposals for GSoC have you received?

18:08 <simbergm> I think if it can be implemented without dataflow that'd nice

18:08 <simbergm> but otherwise I'll do some reshuffling

18:09 <simbergm> I'd move the base templates into execution then as well, or into a completely separate module

18:09 <simbergm> ibalampanis: about 15 I think

18:09 <ibalampanis> Interesting! Good luck to everyone!

18:10 <hkaiser> simbergm: nah, let's extract the dataflow engine

18:11 <simbergm> hkaiser: what do you mean by that exactly?

18:11 <hkaiser> or hide it in a template specialization such that we need to expose an unimplemented base template only, same as for future

18:14 <simbergm> expose to what? the guided_pool_executor will need the specialization as well

18:14 <simbergm> we might be thinking about the same thing, I'm just not communicating it well (I hope...)

18:14 <simbergm> the base templates can be in execution or in a dependency of execution

18:15 <simbergm> the executor specializations can be in execution

18:16 <hkaiser> nod, I see what you mean

18:17 <hkaiser> but local_async and friends could depend on the specializations that does not depend on executors

18:17 <hkaiser> so the executors can depend on local_async

18:18 <hkaiser> I'd consider local_async to be fairly low level

18:18 <simbergm> ok, we might have different ideas about what local_async should contain

18:18 <simbergm> I consider it a user facing module

18:19 <simbergm> but the only other specialization is the launch policy specialization and that one depends even on a concrete executor, not just execution traits

18:19 nan11 has joined #ste||ar

18:21 <simbergm> and yes, local_async could contain just this https://github.com/msimberg/hpx/blob/6d3fef5d738ff7e421366538ffab4535e433a55a/libs/local_async/include/hpx/local_async/async.hpp#L70-L79 and the base template

18:21 akheir1_ has joined #ste||ar

18:22 <simbergm> I was just calling it async_base :P

18:22 <hkaiser> simbergm: sure

18:22 <hkaiser> sounds like a plan

18:23 <hkaiser> async_base is fine as it would hold all base templates for async, dataflow, even future

18:24 <simbergm> I'll put something together tomorrow and then we can discuss at the meeting

18:24 <hkaiser> bita: the exception is thrown because of bad exception handling for exceptions that escape hpx_main()

18:24 akheir1 has quit [Ping timeout: 256 seconds]

18:24 <hkaiser> the reson is that cannon require a perfect square of tiles, but your example has only 1*2 tiles

18:25 <hkaiser> that also explains why it works for 4 localities

18:25 <simbergm> hkaiser: exactly, didn't even think about future_base, but that could make sense as well

18:26 <simbergm> I think we might go for a separate future_base module as there's a bit more code there, but it could go either way

18:28 <hkaiser> sure

18:28 <hkaiser> but the ideas should be the same

18:30 <hkaiser> bita: see #4487 (HPX)

18:30 <bita> hkaiser, I thought I would get "All tiles in the tile row/column do not have equal height/width" but I was mistaken. We don't have an exception for not being a perfect square

18:30 <hkaiser> bita: we do, it was just not properly propagated to you

18:30 <hkaiser> #4487 fixes that

18:31 <bita> got it, thank you

18:36 rtohid has left #ste||ar [#ste||ar]

18:42 nan11 has quit [Remote host closed the connection]

18:43 nan11 has joined #ste||ar

18:51 ibalampanis has quit [Remote host closed the connection]

18:51 weilewei has joined #ste||ar

18:55 rtohid has joined #ste||ar

18:58 rtohid has quit [Remote host closed the connection]

19:04 rtohid has joined #ste||ar

19:27 nk__ has joined #ste||ar

19:29 nikunj97 has quit [Ping timeout: 256 seconds]

19:36 akheir1 has joined #ste||ar

19:38 akheir1_ has quit [Ping timeout: 240 seconds]

19:44 nk__ has quit [Read error: Connection reset by peer]

19:44 nikunj97 has joined #ste||ar

19:57 nan11 has quit [Ping timeout: 240 seconds]

20:12 <nikunj97> heller1, interesting way to use a proxy vector<shared_future<void>> for dataflow!

20:13 <heller1> ?

20:15 <nikunj97> in your paper, you use vector of shared future

20:15 <nikunj97> and create dependencies with it, while working on the actual grid

20:17 <nikunj97> I find that a creative solution. I was thinking of ways to create a dataflow but couldn't think of anything other than vector<shared_future<vector> >

20:17 <heller1> Yeah

20:17 <heller1> Cool results

20:17 <nikunj97> you understood the graphs?

20:18 <heller1> Well hkaiser followed a different philosophy with the 1d stencil examples in the hpx repo

20:19 <nikunj97> he uses vector < shared_future < partition_data > > in 1d stencil

20:19 <heller1> Yes

20:19 <heller1> I like the "traditional" roofline ones most

20:19 <nikunj97> and partition_data is a vector again

20:19 <heller1> Yes, which is what you said

20:19 <nikunj97> yes

20:20 <nikunj97> gotcha, will send you plots wrt rooflines next time

20:20 <heller1> Yes

20:21 <heller1> Anyways, the x86 graphs look good as well

20:21 <nikunj97> what do you think of the results btw? I don't understand why it would reduce post a certain core count instead of keeping consistent?

20:21 <hkaiser> nikunj97: where can I see those graphs?

20:21 <nikunj97> hkaiser, wait let me send you the drive link

20:22 <nikunj97> hkaiser, see pm for the link

20:22 <heller1> And as expected

20:22 <hkaiser> git it, thanks

20:23 <heller1> The arm ones still look odd with those drops in them, those shouldn't be there

20:23 <heller1> Could be a bug in the binding code

20:24 <nikunj97> could be

20:24 <heller1> Or with how the os reports the topology

20:24 <heller1> Can you share the output of lstopo on one it those arm nodes please?

20:24 <nikunj97> their arm node doesn't have hyperthreading though. So that's the performance you see

20:24 <nikunj97> sure, a sec

20:24 <nikunj97> heller1, you'd like a photo or console output?

20:26 <heller1> nikunj97: doesn't matter

20:27 <heller1> nikunj97: also, the output of when you run the application with 64 cores and --hpx-print-bind

20:27 <nikunj97> heller1, here: https://gist.github.com/NK-Nikunj/bf2836e63215eb51d1eb13d8ef7c8462

20:28 <heller1> nikunj97: it's especially the measurement at 24 cores which puts me off

20:30 <nikunj97> the 24 core example is running. hpx:print-bind prints only 0-23 PUs only

20:30 <heller1> nikunj97: regarding the quality of the results: Those are as expected. you nicely observe that you hit the memory bottleneck (when the curve flattens), once that's happening, you mostly observe the effect of too little parallelism

20:30 <nikunj97> 24 core run: https://gist.github.com/NK-Nikunj/bf2836e63215eb51d1eb13d8ef7c8462#file-24core_run

20:31 <heller1> nikunj97: ugh, this thing has 4 numa nodes?

20:31 <heller1> that sucks

20:31 <heller1> and explains everything ;)

20:31 <nikunj97> ohh did I not tell you that?

20:31 <nikunj97> I should've mentioned

20:32 <nikunj97> btw 64 core run: https://gist.github.com/NK-Nikunj/bf2836e63215eb51d1eb13d8ef7c8462#file-64core_run

20:32 <heller1> nope

20:33 <nikunj97> that was the reason why we were seeing peaks at 16 and 32

20:33 <nikunj97> and a dip at 24

20:34 <heller1> https://en.wikichip.org/wiki/hisilicon/kunpeng/hi1616

20:35 <heller1> can you confirm that this is the one that you are using?

20:35 <heller1> yeah, more or less

20:35 <heller1> you should see another dip at 48 cores

20:36 <heller1> so what you need to do is to fill one NUMA domain first, and then just fill the next ones entirely

20:36 <heller1> or go 1 core per domain, 2 per domain etc.

20:36 <nikunj97> lscpu: https://gist.github.com/NK-Nikunj/bf2836e63215eb51d1eb13d8ef7c8462#file-lscpu

20:37 <nikunj97> I see, I finally know nicely how to use hwloc-bind. I can use that

20:37 wate123_Jun has quit [Remote host closed the connection]

20:38 <heller1> yes

20:38 wate123_Jun has joined #ste||ar

20:38 <heller1> for HPX, you need something similar, but a tad different

20:38 <heller1> but I am sure you'll figure it out

20:39 <nikunj97> I see

20:39 <nikunj97> I'll try to figure it out

20:39 <heller1> so your task now is to explain why that drop is happening and why there's a peak at those specific core counts

20:39 <nikunj97> btw hwloc-bind with --hpx:use-process-mask works flawlessly

20:40 <heller1> ok

20:40 <nikunj97> so that's one solution for HPX

20:40 <heller1> whatever floats your boat ;)

20:40 <nikunj97> :D

20:42 wate123_Jun has quit [Ping timeout: 240 seconds]

20:45 <nikunj97> heller1, cross NUMA traffic is all I can think of

20:45 <nikunj97> basically 1 NUMA is completely filled and other is only half filled

20:46 <heller1> well, convince me with data

20:46 <nikunj97> heller1, sure :)

20:46 <nikunj97> I'm working on a futurized version now

20:46 <nikunj97> I feel that may provide minor optimization

20:47 <heller1> yes

20:47 <heller1> spoiler alert: it is not cross numa traffic.

20:48 <heller1> https://en.wikichip.org/wiki/hisilicon/kunpeng/hi1616

20:48 gonidelis has quit [Remote host closed the connection]

20:49 <heller1> has all the hints you need, do the measurements I suggested, maybe even find a tool that allows you to read out some hardware performance counters that might give you a hint here

20:49 <nikunj97> I wouldn't believe memory bandwidth coz that was either consistent or increasing (or very minor drops)

20:49 <heller1> well, try and see

20:50 <nikunj97> will do!

20:57 <heller1> wee, nice number of proposals

20:59 <heller1> hkaiser: diehlpk_mobile jbjnr ms are we having a call about the selection of the gsoc proposals next week?

20:59 nikunj97 has quit [Quit: Leaving]

21:00 <hkaiser> heller1: might be a good idea

21:00 <hkaiser> heller1: also we have the pmc call tomorrow

21:00 <heller1> yeah, I know

21:01 <heller1> I am not sure I can make the PMC call tomorrow

21:02 <heller1> I am on forced vacation until the 8th. and my wife is working tomorrow afternoon

21:09 <hkaiser> heller1: ok

21:10 <heller1> I'll try though

21:12 wate123_Jun has joined #ste||ar

21:17 wate123_Jun has quit [Ping timeout: 252 seconds]

21:20 wate123_Jun has joined #ste||ar

21:42 weilewei has quit [Remote host closed the connection]

21:42 rtohid has quit [Remote host closed the connection]

21:46 rtohid has joined #ste||ar

21:46 weilewei has joined #ste||ar

21:51 wate123_Jun has quit [Remote host closed the connection]

22:02 wate123_Jun has joined #ste||ar

22:07 bita has quit [Read error: Connection reset by peer]

22:13 rtohid has quit [Remote host closed the connection]

22:13 akheir1 has quit [Quit: Leaving]

22:14 bita has joined #ste||ar

22:21 Hashmi has quit [Quit: Connection closed for inactivity]

22:23 wate123_Jun has quit [Remote host closed the connection]

22:23 wate123_Jun has joined #ste||ar

22:28 wate123_Jun has quit [Ping timeout: 240 seconds]

22:32 bita has quit [Ping timeout: 240 seconds]

22:56 wate123_Jun has joined #ste||ar

22:57 nikunj has quit [Ping timeout: 252 seconds]

22:57 nikunj has joined #ste||ar

23:00 nan11 has joined #ste||ar

23:00 wate123_Jun has quit [Ping timeout: 252 seconds]

23:36 wate123_Jun has joined #ste||ar

23:40 wate123_Jun has quit [Ping timeout: 252 seconds]