K-ballo changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/
pedro_barbosa[m] has quit [Ping timeout: 240 seconds]
ms[m] has quit [Ping timeout: 244 seconds]
tiagofg[m] has quit [Ping timeout: 258 seconds]
heller1 has quit [Ping timeout: 258 seconds]
pedro_barbosa[m] has joined #ste||ar
beauty2 has quit [Ping timeout: 264 seconds]
tiagofg[m] has joined #ste||ar
heller1 has joined #ste||ar
ms[m] has joined #ste||ar
nanmiao has quit [Quit: Connection closed]
<srinivasyadav227> does datapar_execution mean running the algorithm in parallel?
<srinivasyadav227> in #2330 it shows that certain transformations are applied
<srinivasyadav227> for random-access input/output sequences of arithmetic types only.
<srinivasyadav227> So does that mean those are the transformations which are applied for vectorization?
<srinivasyadav227> and for others it will use the par execution policy (is this hpx::execution::parallel_policy?)
<srinivasyadav227> I did not clearly understand what datapar_execution means and how it's different from std::execution::par or hpx::execution::parallel_policy
K-ballo has quit [Quit: K-ballo]
<hkaiser> srinivasyadav227: it means running using vectorization and in parallel
<hkaiser> hpx::execution::par is similar to std::execution::par, just for HPX
<srinivasyadav227> that's analogous to std::par_unseq right?
<hkaiser> no
<hkaiser> par is par, and par_unseq is par_unseq
<hkaiser> datapar is analogous to wg21.link/p0350, it's very experimental
<srinivasyadav227> is this right?
<hkaiser> exactly
<hkaiser> HPX doesn't do anything special for par_unseq (it's mapped to par), and unseq is not supported at all
<srinivasyadav227> this means datapar is similar to par_unseq?
<hkaiser> kind of
<hkaiser> par_unseq parallelises and vectorizes based on the compiler's capabilities, while datapar uses special vector-register types to perform the vectorization (see p0350)
<srinivasyadav227> ok..i was confused here..because it was mentioned in #2330 and 2271
<srinivasyadav227> so should we implement both? datapar and par_unseq
<hkaiser> par_unseq/unseq are more important, but we should disentangle our experimental datapar implementation from the base algorithms
<hkaiser> unseq/par_unseq and datapar are independent
<srinivasyadav227> vectorisation means we definitely have to use vector registers right?.. in par_unseq the compiler figures out how to use them, datapar means using some specific ones (say intel avx512).. is this right?
<hkaiser> vectorization means that in the end we use vector registers, but this can be implicit (leave it to the compiler, i.e. par_unseq/unseq) or forced through datapar and simd types, see p0350 and N4755
<srinivasyadav227> okay, par_unseq/unseq means it uses vectorization (that is implicit, the compiler figures it out), datapar means explicit use of some specific vector registers for vectorization
<hkaiser> correct
<hkaiser> datapar uses special c++ types (simd types) that represent vector registers
<srinivasyadav227> like avx512 right?
<hkaiser> there should also be dataseq, not sure if we have that
<hkaiser> simd<double, 8> would represent avx512, yes
<hkaiser> it's a c++ type that holds 8 doubles and has operators overloaded, and those are implemented using compiler intrinsics mapping onto vector operations
<srinivasyadav227> ahh..finally got clarity..thank you :)
<srinivasyadav227> in #2330 it's mentioned that dataseq would be available in future, I think it's not implemented yet
<hkaiser> nod
<hkaiser> I think we should focus on two things: a) disentangle the existing experimental datapar implementation, and b) implement par_unseq/unseq (I think that's a separate ticket)
<hkaiser> #2271
<hkaiser> I did start implementing that (#3063), but that didn't go anywhere
<srinivasyadav227> in summary
<srinivasyadav227> currently hpx supports hpx::execution::par, hpx::datapar_execution::seq
<srinivasyadav227> and needs hpx::execution::unseq and hpx::execution::par_unseq (#2271)
<srinivasyadav227> and for datapar (it's supported for 4 algorithms from #2330)
<srinivasyadav227> and we should remove datapar experimental and adapt N4755 (#5157)
<hkaiser> right
<hkaiser> not really remove datapar, just make the base algorithm implementations independent of it
<hkaiser> perhaps have the same using a separate tag_invoke based implementation
<srinivasyadav227> ok, could you provide some example or give some insight about the first step of #5157? I now have some clarity with tag_invoke finally :)
<hkaiser> well, currently the datapar implementation is part of the base algorithm implementation based on some internal functions being implemented twice: once normal and once for datapar
<hkaiser> I'd rather simplify the normal algorithms by removing support for datapar and reimplement the datapar stuff independently behind a tag_invoke specialization
<hkaiser> the datapar was an experiment that doesn't give too good results and ended up complicating the implementations
<hkaiser> with datapar all internal loops needed to have a prefix and a postfix loop to handle the tapering of available iterations (i.e. if the overall number of iterations is not divisible by the width of the vector registers)
<hkaiser> also to account for (mis-)alignment of array elements at the start of the iterations
<srinivasyadav227> oh..that seems to be bad actually, we use some kind of masking right?
<hkaiser> if things are misaligned masking doesn't help either
<hkaiser> well, it might help, actually - you're right
<hkaiser> I'm not a vectorization expert
<srinivasyadav227> even I am not that good at it, but I have worked with intel vector registers and their intrinsics-related stuff...so got some idea about it :)
<hkaiser> good
<hkaiser> to summarize - there is plenty of work to be done related to vectorization and you're more than welcome to pick something interesting for you
<srinivasyadav227> yea definitely, so should I go with #5157 first step or #2271?,, which would you recommend ? :)
<hkaiser> whatever is more interesting to you ;-)
<srinivasyadav227> haha.. ok :-), thank you so much for your time, got some real clarity about #5157 #2271 #2330
hkaiser has quit [Quit: bye]
<jedi18[m]1> The failing tests in my PRs are not related to my changes right?
bita has joined #ste||ar
diehlpk_work has quit [Remote host closed the connection]
bita has quit [Ping timeout: 240 seconds]
jehelset has joined #ste||ar
<rori> which failures are we talking about? I have to trigger the testing, but the failure on macOS is known and not due to your changes
<jedi18[m]1> Ohh ok, no I meant the "build-and-test" one but yeah I guess it's not related, nevermind
K-ballo has joined #ste||ar
hkaiser has joined #ste||ar
K-ballo has quit [Quit: K-ballo]
K-ballo has joined #ste||ar
<hkaiser> rori: yt?
<hkaiser> rori: what's wrong with the clang-oldest and gcc-oldest builders on pizdaint? was the boost version removed?
<rori> yes I had a problem with boost on daint but it should be solved
<hkaiser> rori: ok, thanks!
<rori> At least it passed on [master](https://cdash.cscs.ch/index.php?project=HPX)
<hkaiser> also, is the hip problem on rostam something we can help with?
<K-ballo> hip problems? how old is rostam?
<hkaiser> lol
<rori> I think the question is how old are the AMD GPUs on Rostam? :P
<rori> I don't know how to solve it for now, I think ms was on it with alireza but not easily fixable
<hkaiser> rori: ok, I'll talk to Al
<hkaiser> something broke again
<rori> thanks hkaiser
<hkaiser> rori: do you have any test programs we could use to verify the setup is working properly?
hkaiser has quit [Read error: Connection reset by peer]
hkaiser has joined #ste||ar
diehlpk_work has joined #ste||ar
<rori> which setup are we talking about?
<rori> If it's to test the AMD gpu, any of the failing tests, I can give you a link
<hkaiser> rori: I meant how to test whether the hip setup is working on rostam (independently of hpx)
<rori> I think we would have to wait for ms, cause I'm not entirely sure what the problem is. In my mind it was a hardware problem, especially as the same setup was working before, and the newly installed rocm/4.0.1 also has some problems ([see run on new PR](https://cdash.cscs.ch/buildSummary.php?buildid=149552)). But I can have a look at the hip setup tomorrow if this is the real problem
bita has joined #ste||ar
<rori> `hipErrorInvalidDevice`
<rori> Just compiling a simple kernel (without hpx) would probably throw the error
nanmiao has joined #ste||ar
nanmiao has quit [Client Quit]
nanmiao has joined #ste||ar
<hkaiser> rori: Alireza is looking into it
<hkaiser> wash[m]: yt?
<hkaiser> jedi18[m]1: you're unstoppable ;-)
<rori> <hkaiser "rori: Alireza is looking into it"> Thanks a lot!
<hkaiser> rori: pizdaint seems to be still unhappy :/
<hkaiser> the boost issues still pop up, I believe
<jedi18[m]1> Thanks :D, also I noticed this is one of the gsoc projects. https://github.com/STEllAR-GROUP/hpx/wiki/Google-Summer-of-Code-(GSoC)-2021#disentangle-segmented-algorithms . It's okay if I keep doing this since you won't be able to provide it for gsoc anymore right?
<hkaiser> sure, please keep going - we don't know if anybody would be interested...
<hkaiser> and we can always modify tasks as we go
<jedi18[m]1> Oh ok sure! :)
<hkaiser> jedi18[m]1: also, I believe that after the 5th algorithm you will have enough and you will want to move on to more interesting things ;-)
<rori> The last builds are good cf cdash
<rori> Have you retriggered the builds?
<hkaiser> rori: ahh ok - cool
<hkaiser> I tried
<rori> which PR?
<jedi18[m]1> Yeah gotta admit, it is getting a tad bit repetitive. I'll do one or two more and then try the rest later after some other unrelated tasks
<hkaiser> ahh now it's fine
<hkaiser> rori: I take everything back - sorry for the noise
<hkaiser> jedi18[m]1: perfect
<jedi18[m]1> A doubt, I'm not catching errors when compiling locally and running the tests so I think I'm doing it wrong. As of now I'm right clicking the project (Modules/Parallelism/hpx_algorithms) and then running it. I guess that's the wrong way of doing it?
<jedi18[m]1> Some of the errors, mostly the ones on the deprecated overloads
<hkaiser> jedi18[m]1: the tests are under tests/unit/modules/algorithms or similar
<jedi18[m]1> No I can run the tests that's not the problem
<hkaiser> jedi18[m]1: what's the problem, then?
<jedi18[m]1> I mean some overloads aren't instantiated by any of the tests, so I don't get an error if I named some variables wrong or something until I run it on the CI
<jedi18[m]1> Like just now for a deprecated overload on count.hpp in container algorithms, I'd typed "first, last" instead of hpx::util::begin(rng),
<jedi18[m]1> That error didn't show up when running any of the tests
<jedi18[m]1> Which I'm guessing is because the tests don't use hpx::parallel::count so that overload is never instantiated
<K-ballo> jedi18[m]1: sounds like you are using msvc?
<jedi18[m]1> Yeah
<hkaiser> jedi18[m]1: you might not have run all the tests
<jedi18[m]1> Do I need to run all the tests? I only ran the ones with "count" in it
<hkaiser> try figuring out what test is failing on the ci
<hkaiser> and then see if you have really run that locally
<K-ballo> there's a number of checks on templates that can be done eagerly, or deferred until instantiation.. msvc always defers them
<K-ballo> that said, there shouldn't be "uncovered" paths
<K-ballo> if an overload isn't instantiated by any test, then a test is missing
<jedi18[m]1> <hkaiser "try figuring out what test is fa"> It's the build that's failing, not a test
<jedi18[m]1> <K-ballo "if an overloa disn't instantiate"> Hmm wait I'll tell you which all tests I ran
<hkaiser> jedi18[m]1: looks like the ci is right
<hkaiser> and there could be tests for certain overloads missing - msvc will not report errors on those, but clang/gcc will
<jedi18[m]1> Oh ok, so do I set up another build or just keep relying on the CI?
<jedi18[m]1> I guess another build would help me learn more
<K-ballo> definitely do learn another build, but still add whatever tests are missing
<jedi18[m]1> <K-ballo "definitely do learn another buil"> Was gonna say that :D, ok I'll add the missing tests
<rori> <K-ballo "if an overloa disn't instantiate"> +1
<jedi18[m]1> Yeah only hpx::count is being called in the tests, hpx::parallel::count isn't called anywhere
<jedi18[m]1> Do I have to add the test in test_count_async, test_count_exception etc as well, or will calling test_count only be enough? (since the overload is deprecated)
<gnikunj[m]> full support for direct API and with replay executor on device and host \o/
<gnikunj[m]> we need to discuss bulk async and the error to return now. I think the above progress is enough for them (we're doing what they don't support, so we can talk at length about how we make that happen). I'll try to get things implemented for bulk async before the meeting as well (if it's not too complicated).
<hkaiser> \o/
nanmiao has quit [Quit: Connection closed]
hkaiser has quit [Read error: Connection reset by peer]
hkaiser has joined #ste||ar
RostamLog has joined #ste||ar
<pedro_barbosa[m]> Hey guys, I was attempting to execute this example (https://gist.github.com/PADBarbosa/693331b4ebe31e59958f27c9723fb65a) both in the GPU and CPU, but I don't know how to, is there anywhere I can read about doing these sort of things?
<hkaiser> can't you just run it?
<gnikunj[m]> pedro_barbosa: do you mean writing a cmake for it?
<pedro_barbosa[m]> Yeah it runs but I wanted to multiply half of the array in the GPU and the other half in the CPU concurrently
<gnikunj[m]> also, https://gist.github.com/PADBarbosa/693331b4ebe31e59958f27c9723fb65a#file-gistfile1-txt-L49 does not look right to me. v and vh are on the device and you're using copy on the host. it should either not compile or throw error at runtime.
<gnikunj[m]> unless hkaiser has internal trickery with this stuff (in which case I should do something in kokkos too)
<hkaiser> the vector does different things depending on the allocator
<pedro_barbosa[m]> Is the vh array on the device as well? the idea was to multiply v on the GPU and then copy the result to vh and print it
<hkaiser> if it's a normal (CPU) allocator, the data is on the CPU, if it's hpx::cuda::experimental::allocator, then the data is on the device
<pedro_barbosa[m]> So, is it possible to multiply half of the array on the GPU and the other half on the CPU and then copy the values from the device and merge it with the one that was multiplied on the CPU?
<gnikunj[m]> apologies, didn't see vh was on the host
nanmiao has joined #ste||ar
<pedro_barbosa[m]> No problem, but do you know how to divide the work between the GPU and CPU? I could do it manually without using any HPX function, but I think there are HPX functions to do it in a more efficient way
<hkaiser> pedro_barbosa[m]: there is no way for us to do that with one vector
<pedro_barbosa[m]> I'm using 2, one on the CPU and the other on the GPU; my idea was to copy half of the array on the CPU to the GPU, multiply each value there, then copy it back and merge it with the half on the CPU, which will also be multiplied by 2
<hkaiser> makes sense
nanmiao has quit [Quit: Connection closed]
<hkaiser> if you launch one of our parallel algorithms with a vector that refers to the device, then the kernel should run on the device as well if you use the device executor
<hkaiser> so what you wrote should work
nanmiao has joined #ste||ar
V|r has joined #ste||ar
Vir has quit [Ping timeout: 246 seconds]
V|r is now known as Vir
hkaiser has quit [Read error: Connection reset by peer]
hkaiser has joined #ste||ar
bita has quit [Read error: Connection reset by peer]
bita has joined #ste||ar
nanmiao has quit [Quit: Connection closed]