#ste||ar on 2021-03-03 — irc logs at irclog.cct.lsu.edu

2020-09-17 16:16 K-ballo changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/

00:19 pedro_barbosa[m] has quit [Ping timeout: 240 seconds]

00:19 ms[m] has quit [Ping timeout: 244 seconds]

00:19 tiagofg[m] has quit [Ping timeout: 258 seconds]

00:20 heller1 has quit [Ping timeout: 258 seconds]

00:22 pedro_barbosa[m] has joined #ste||ar

00:23 beauty2 has quit [Ping timeout: 264 seconds]

00:24 tiagofg[m] has joined #ste||ar

00:24 heller1 has joined #ste||ar

00:27 ms[m] has joined #ste||ar

01:37 nanmiao has quit [Quit: Connection closed]

01:59 <srinivasyadav227> does datapar_execution mean running the algorithm in parallel?

01:59 <srinivasyadav227> in #2330 it shows that certain transformations are applied

01:59 <srinivasyadav227> for random-access input/output sequences of arithmetic types only.

01:59 <srinivasyadav227> So does that mean the transformations which are applied for vectorization?

02:00 <srinivasyadav227> and for others it will use par execution policiy. (is this hpx::execution::parallel_policy)

02:04 <srinivasyadav227> I did not understand what datapar_execution clearly means and how its different from std::execution::par or hpx::execution::parallel_policy

02:06 K-ballo has quit [Quit: K-ballo]

02:06 <hkaiser> srinivasyadav227: it means running using vectorization and in parallel

02:07 <hkaiser> hpx::execution::par is similar to std:;execution::par, just for HPX

02:07 <srinivasyadav227> thats analogous to std::par_unseq right?

02:07 <hkaiser> no

02:07 <hkaiser> par is par, and par_unseq is par_unseq

02:08 <hkaiser> see https://en.cppreference.com/w/cpp/algorithm/execution_policy_tag

02:10 <hkaiser> datapar is analogous to wg21.link/p0350, it's very experimental

02:10 * srinivasyadav227 sent a long message: < https://matrix.org/_matrix/media/r0/download/matrix.org/CJmVNSBkauVZRodbpDeaEOLw/message.txt >

02:10 <srinivasyadav227> is this right?

02:11 <hkaiser> exactly

02:11 <hkaiser> HPX doesn't do anything special for par_unseq (it's mapped to par) and unseq, not supported at all

02:11 <srinivasyadav227> this means datapar is similar to par_unseq?

02:11 <hkaiser> kind of

02:13 <hkaiser> par_unseq parallelises and vectorizes based on the compiler's capabilities, while datapar uses special vector-register types to perform the vectorization (see p0350)

02:13 <srinivasyadav227> ok..i was confused here..because it was mentioned in #2330 and 2271

02:13 <srinivasyadav227> so should we implement both? datapar and par_unseq

02:14 <hkaiser> par_unseq/unseq are more important, but we should disentangle our experimental datapar implementation from teh base algorithms

02:15 <hkaiser> unseq/par_unseq and datapar are independent

02:15 <srinivasyadav227> vectorisation means we definitely have to use vector registers right?.. in par_unseq compiler figures how to use them, datapar means uses some specific (say intel avx512).. is this right?

02:18 <hkaiser> vectorization means that in the end we use vector registers, but this can be implicit (leave it to the compiler, i.e.e par_unseq/unseq) or force it through datapar and simd types, see p0350 and N4755)

02:24 <srinivasyadav227> okay, par_unseq/unseq means its uses vectorization (that is implicit, compiler figures out), datapar means explicit use of some specific vector registers for vectorization

02:25 <hkaiser> correct

02:25 <hkaiser> datapar uses special c++ types (simd types) that represent vector registers

02:26 <srinivasyadav227> like avx512 right?

02:26 <hkaiser> there should also be dataseq, not sure if we have that

02:27 <hkaiser> simd<double, 8> would represent avx512, yes

02:28 <hkaiser> it's a c++ type that holds 8 doubles and has operators overloaded, and those are implemented using compiler intrinsics mapping onto vector operations

02:28 <srinivasyadav227> ahh..finally got clarity..thank you :)

02:30 <srinivasyadav227> in#2330 its mentioned that dataseq would be available in future, in think not implemented yet

02:31 <srinivasyadav227> * in#2330 its mentioned that dataseq would be available in future, i think not implemented yet

02:31 <hkaiser> nod

02:32 <hkaiser> I think we should focus on two things: a) disentangle the existing experimental datapar implementation, and b) implement par_unseq/unseq (I think that's a separate ticket)

02:32 <hkaiser> #2271

02:33 <hkaiser> I did start implementing that (#3063), but that didn't go anywhere

02:41 <srinivasyadav227> in summary

02:41 <srinivasyadav227> currently hpx supports hpx::execution::par, hpx::datapar_execution::seq

02:41 <srinivasyadav227> and needs hpx::execution::unseq and hpx::execution::par_unseq (#2271)

02:41 <srinivasyadav227> and for datapar (its supports for 4 algorithms from #2330)

02:41 <srinivasyadav227> and we should remove datapar experimental and adapt N4755 (#5157)

02:42 <hkaiser> right

02:42 <hkaiser> not really remove datapar, just make the base algorithm implementations independent of it

02:43 <hkaiser> perhaps have the same using a separate tag_invoke based implementation

02:45 <srinivasyadav227> ok, could you provide some example or give some insight about first step of 5157, I now have some clarity with tag_invoke finally :)

02:46 <hkaiser> well, currently the datapar implementation is part of the base algorithm implementation based on some internal functions being implemented twice: once normal and once for datapar

02:47 <hkaiser> I'd rather simplify the normal algorithms by removing support for datapar and reimplement the datapar stuff independently behind a tag_invoke specialization

02:48 <hkaiser> the datapar was an experiment that doesn't give too good results and ended up complicating the implementations

02:49 <hkaiser> with datapar all internal loops needed to have a prefix and a postfix loop to handle the tapering of available iterations (i.e. if the overall number of iterations is not divisible by the width of the vector registers)

02:50 <hkaiser> also to account for (mis-)alignment of array eleement at the start of the iterations

02:50 <srinivasyadav227> oh..thats seems to be bad actually, we use some kind of masking right

02:50 <hkaiser> if things are misaligned masking doesn't help either

02:51 <hkaiser> well, it might help, actually - you're right

02:51 <hkaiser> I'm not a vectorization expert

02:53 <srinivasyadav227> even I am not that good at it, but I have worked with intel vector registers and its intrinsics related stuff...so got some idea abt it :)

02:54 <hkaiser> good

02:54 <hkaiser> to summarize - there is plenty of work to be done related to vectorization and you're more than welcome to pick something interesting for you

02:56 <srinivasyadav227> yea definitely, so should I go with #5157 first step or #2271?,, which would you recommend ? :)

02:57 <hkaiser> whatever is more interesting to you ;-)

02:58 <srinivasyadav227> haha.. ok :-), thank you so much for your time, got some real clarity about #5157 #2271 #2330

03:01 hkaiser has quit [Quit: bye]

03:44 <jedi18[m]1> The failing tests in my PRs are not related to my changes right?

04:32 bita has joined #ste||ar

05:00 diehlpk_work has quit [Remote host closed the connection]

05:34 bita has quit [Ping timeout: 240 seconds]

08:23 jehelset has joined #ste||ar

09:10 <rori> which failures are we talking about? I have to trigger the testing, but the failure on macOS is known and not due to your changes

10:49 <jedi18[m]1> Ohh ok, no I meant the "build-and-test" one but yeah I guess it's not related, nevermind

11:52 K-ballo has joined #ste||ar

12:58 hkaiser has joined #ste||ar

13:42 K-ballo has quit [Quit: K-ballo]

13:43 K-ballo has joined #ste||ar

13:51 <hkaiser> rori: yt?

13:52 <hkaiser> rori: what's wrong with the clang-oldest and gcc-oldest builders on pizdaint? was the boost version removed?

13:53 <rori> yes I had a pb with boost on daint but it should be solved

13:54 <hkaiser> rori: ok, thanks!

13:54 <rori> At least it passed on [master](https://cdash.cscs.ch/index.php?project=HPX)

13:54 <hkaiser> also, is the hip problem on rostam something we can help with?

13:55 <K-ballo> hip problems? how old is rostam?

13:55 <hkaiser> lol

13:57 <rori> I think the question is how old are the AMD GPUs on Rostam? :P

13:57 <rori> I don't know how to solve it for now, I think ms was on it with alireza but not easily fixable

13:58 <hkaiser> rori: ok, I'll talk to Al

13:59 <hkaiser> parsa: would you be able to help with this: https://app.circleci.com/pipelines/github/STEllAR-GROUP/phylanx/975/workflows/df8a186f-6fba-47bb-8fb6-11c39f037d32?

13:59 <hkaiser> something broke again

14:19 <rori> thanks hkaiser

14:38 <hkaiser> rori: do you have any test programs we could use to verify the setup is working properly?

14:41 hkaiser has quit [Read error: Connection reset by peer]

14:42 hkaiser has joined #ste||ar

14:55 diehlpk_work has joined #ste||ar

15:04 <rori> which setup are we talking about?

15:05 <rori> If it's to test the AMD gpu, any of the failing tests, I can give you a link

15:06 <rori> https://cdash.cscs.ch/viewTest.php?onlyfailed&buildid=149552

15:18 <hkaiser> rori: I meant how to test whether the hip setup is working on rostam (independently of hpx)

15:27 <rori> I think we would have to wait for ms, cause I'm not entirely sure what the problem is, in my mind it was a hardware problem, especially as the same setup was working before and also as the new rocm/4.0.1 installed also have some problems ([see run on new PR](https://cdash.cscs.ch/buildSummary.php?buildid=149552)) but I can have a look at the hip setup tomorrow if this is the real problem

15:27 bita has joined #ste||ar

15:28 <rori> `hipErrorInvalidDevice`

15:29 <rori> Just compiling a simple kernel (without hpx) would probably throw the error

15:32 nanmiao has joined #ste||ar

15:33 nanmiao has quit [Client Quit]

15:37 nanmiao has joined #ste||ar

16:00 <hkaiser> rori: Alireza is looking into it

16:11 <hkaiser> wash[m]: yt?

16:17 <hkaiser> jedi18[m]1: you're unstoppable ;-)

16:18 <rori> <hkaiser "rori: Alireza is looking into it"> Thanks a lot!

16:18 <hkaiser> rori: pizdaint seems to be still unhappy :/

16:18 <hkaiser> the boost issues still pop up, I believe

16:19 <jedi18[m]1> Thanks :D, also I noticed this is one of the gsoc projects. https://github.com/STEllAR-GROUP/hpx/wiki/Google-Summer-of-Code-(GSoC)-2021#disentangle-segmented-algorithms . It's okay if I keep doing this since you won't be able to provide it for gsoc anymore right?

16:19 <hkaiser> sure, please keep going - we don't know if anybody would be interested...

16:20 <hkaiser> and we can always modify tasks as we go

16:20 <jedi18[m]1> Oh ok sure! :)

16:21 <hkaiser> jedi18[m]1: also, I believe that after the 5th algorithm you will have enough and you will want to move on to more interesting things ;-)

16:22 <rori> The last builds are good cf cdash

16:22 <rori> Have you retriggered the builds?

16:22 <hkaiser> rori: ahh ok - cool

16:22 <hkaiser> I tried

16:23 <rori> which PR?

16:23 <jedi18[m]1> Yeah gotta admit, it is getting a tad bit repetitive. I'll do one or two more and then try the rest later after some other unrelated tasks

16:23 <hkaiser> ahh now it's fine

16:23 <hkaiser> rori: I take everything back - sorry for the noise

16:24 <hkaiser> jedi18[m]1: perfect

16:28 <jedi18[m]1> A doubt, I'm not catching errors when compiling locally and running the tests so I think I'm doing it wrong. As of now I'm right clicking the project (Modules/Parallelism/hpx_algorithms) and then running it. I guess that's the wrong way of doing it?

16:29 <jedi18[m]1> Some of the errors, mostly the ones on the deprecated overloads

16:32 <hkaiser> jedi18[m]1: the tests are under tests/unit/modules/algorithms or similar

16:33 <jedi18[m]1> No I can run the tests that's not the problem

16:33 <hkaiser> jedi18[m]1: what's the problem, then?

16:35 <jedi18[m]1> I mean some overloads which are not instantiated by any of the for any of the tests, so I don't get an error if I named some variables wrong or something until I run it on the CI

16:36 <jedi18[m]1> Like just now for a deprecated overload on count.hpp in container algorithms, I'd typed "first, last instead of hpx::util::begin(rng),

16:36 <jedi18[m]1> hpx::util::end(rng)" https://github.com/STEllAR-GROUP/hpx/pull/5220/checks?check_run_id=2023688803

16:36 <jedi18[m]1> That error dindt show up when running any of the tests

16:37 <jedi18[m]1> Which I'm guessing is because the tests don't use hpx::parallel::count so that overload is never instantiated

16:38 <K-ballo> jedi18[m]1: sounds like you are using msvc?

16:39 <jedi18[m]1> Yeah

16:39 <hkaiser> jedi18[m]1: you might not have run all the tests

16:40 <jedi18[m]1> Do I need to run all the tests? I only ran the ones with "count" in it

16:41 <hkaiser> try figuring out what test is failing on the ci

16:41 <hkaiser> and then see if you have really run that locally

16:41 <K-ballo> there's a number of checks on templates that can be done eagerly, or deferred until instantiation.. msvc always defers them

16:41 <K-ballo> that said, there shouldn't be "uncovered" paths

16:42 <K-ballo> if an overloa disn't instantiated by any test, then a test is missing

16:42 <jedi18[m]1> <hkaiser "try figuring out what test is fa"> It's the build that's failing, not a test

16:43 <jedi18[m]1> <K-ballo "if an overloa disn't instantiate"> Hmm wait I'll tell you which all tests I ran

16:43 <hkaiser> jedi18[m]1: looks like the ci is right

16:44 <hkaiser> and there could be tests for certains overloads missing - msvc will not report errors on those, but clang/gcc will

16:45 <jedi18[m]1> Oh ok, so do I set up another build or just keep relying on the CI?

16:46 <jedi18[m]1> I guess another build would help me learn more

16:46 <K-ballo> definitely do learn another build, but still add whatever tests are missing

16:47 <jedi18[m]1> <K-ballo "definitely do learn another buil"> Was gonna say that :D, ok I'll add the missing tests

16:48 <rori> <K-ballo "if an overloa disn't instantiate"> +1

16:53 <jedi18[m]1> Yeah only hpx::count is being called in the tests, hpx::parallel:count isn't called anywhere

17:04 <jedi18[m]1> Do I have to add the test in test_count_async, test_count_exception etc as well, or will calling test_count only be enough? (since the overload is deprecated)

17:25 <gnikunj[m]> hkaiser: here you go ;) https://github.com/NK-Nikunj/hpx-kokkos-resilience/blob/main/tests/async_replay_device.cpp#L38-L41

17:25 <gnikunj[m]> full support for direct API and with replay executor on device and host \o/

17:27 <gnikunj[m]> we need to discuss about bulk async and error to return now. I think the above progress is enough for them (we're doing what they don't support, so we can talk at length about how we make that happen). I'll try to get things implemented for bulk async before the meeting as well (if it's not too complicated).

17:30 <hkaiser> \o/

17:44 nanmiao has quit [Quit: Connection closed]

17:50 hkaiser has quit [Read error: Connection reset by peer]

17:51 hkaiser has joined #ste||ar

18:17 RostamLog has joined #ste||ar

18:39 <pedro_barbosa[m]> Hey guys, I was attempting to execute this example (https://gist.github.com/PADBarbosa/693331b4ebe31e59958f27c9723fb65a) both in the GPU and CPU, but I don't know how to, is there anywhere I can read about doing these sort of things?

18:42 <hkaiser> can't you just run it?

18:43 <gnikunj[m]> pedro_barbosa: do you mean writing a cmake for it?

18:44 <pedro_barbosa[m]> Yeah it runs but I wanted to multiply half of the array in the GPU and the other half in the CPU concurrently

18:44 <gnikunj[m]> also, https://gist.github.com/PADBarbosa/693331b4ebe31e59958f27c9723fb65a#file-gistfile1-txt-L49 does not look right to me. v and vh are on the device and you're using copy on the host. it should either not compile or throw error at runtime.

18:45 <gnikunj[m]> unless hkaiser has internal trickery with this stuff (in which case I should do something in kokkos too)

18:46 <hkaiser> the vector does different things depending on the allocator

18:46 <pedro_barbosa[m]> Is the vh array on the device as well? the idea was to multiply v on the GPU and then copy the result to v nad print it

18:46 <pedro_barbosa[m]> * Is the vh array on the device as well? the idea was to multiply v on the GPU and then copy the result to v and print it

18:47 <hkaiser> if it's a noraml (CPU) allocator, the data is on the CPU, if it's hpx::cuda::experimental::allocator, then the data is on the device

18:49 <pedro_barbosa[m]> * Is the vh array on the device as well? the idea was to multiply v on the GPU and then copy the result to vh and print it

18:50 <pedro_barbosa[m]> So, is it possible to multiply half of the array on the GPU and the other half on the CPU and then copy the values from the device and merge it with the one that was multiplied on the CPU?

18:52 <gnikunj[m]> apologies, didn't see vh was on the host

18:55 nanmiao has joined #ste||ar

19:02 <pedro_barbosa[m]> No problem, but do you know how to divide the work between the GPU and CPU, I could do it manually without implementing any HPX function but I think there are HPX functions to do it in a more efficient way

19:03 <hkaiser> pedro_barbosa[m]: there is no way for us to do that with one vector

19:05 <pedro_barbosa[m]> I'm using 2, one in the CPU and the other in the GPU, my idea was to copy half of the array on the CPU to the GPU and then multiply each value and then copy it back and merge it with the one on the CPU which will also be multiplied by 2

19:14 <hkaiser> makes sense

19:15 nanmiao has quit [Quit: Connection closed]

19:16 <hkaiser> if you launch one of our parallel algorithms with a vector that refers to the device, then the kernel should run on the device as well if you use the device executor

19:16 <hkaiser> so what you wrote should work

19:59 nanmiao has joined #ste||ar

20:43 V|r has joined #ste||ar

20:43 Vir has quit [Ping timeout: 246 seconds]

20:51 V|r is now known as Vir

22:22 hkaiser has quit [Read error: Connection reset by peer]

22:23 hkaiser has joined #ste||ar

22:33 bita has quit [Read error: Connection reset by peer]

22:33 bita has joined #ste||ar

23:09 nanmiao has quit [Quit: Connection closed]