#ste||ar on 2022-05-27 — irc logs at irclog.cct.lsu.edu

2021-08-06 22:55 hkaiser changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar-group.org | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | This channel is logged: irclog.cct.lsu.edu

01:16 K-ballo has quit [Quit: K-ballo]

01:19 hkaiser has joined #ste||ar

02:18 hkaiser has quit [Quit: Bye!]

06:20 Yorlik has joined #ste||ar

12:00 K-ballo has joined #ste||ar

12:01 bohrch has joined #ste||ar

12:10 hkaiser has joined #ste||ar

12:16 bohrch has quit [Ping timeout: 252 seconds]

13:40 aalekhn has joined #ste||ar

14:09 bohrch has joined #ste||ar

14:17 bohrch has quit [Ping timeout: 252 seconds]

14:26 bohr has joined #ste||ar

14:33 bohr has quit [Ping timeout: 252 seconds]

14:38 bohr has joined #ste||ar

14:42 bohr has quit [Client Quit]

14:51 bohr has joined #ste||ar

14:53 bohr has quit [Client Quit]

14:57 bohr has joined #ste||ar

15:01 bohr has quit [Client Quit]

15:05 bohr has joined #ste||ar

15:25 bohr has quit [Ping timeout: 252 seconds]

15:27 bohr has joined #ste||ar

15:33 bohr has quit [Quit: Client closed]

15:35 bohr has joined #ste||ar

15:58 bohr has quit [Ping timeout: 252 seconds]

16:29 diehlpk_work has joined #ste||ar

18:58 aalekhn has quit [Quit: Connection closed for inactivity]

19:38 <gonidelis[m]> Coroutines are meant for asynchornous interraction between algorithms. While asynchrony is not to be mixed with parallelism and/or performance purposes it is of HPX's interest that we actually do gain perf benefit from them (beyond ease of codability). The question is, how could a mechanism that aims towards perf. be dependent on dynamic memory allocation :| ?

19:38 <gonidelis[m]> Shouldn't dynamic frame alloc. be a no-no for a whatever parallel programming paradigm? I am playing the devil's advocate once more.

19:38 <gonidelis[m]> K-ballo hkaiser opinions? ^^

19:38 <gonidelis[m]> or even satacker

19:38 <K-ballo> mp

19:38 <K-ballo> no

19:39 <K-ballo> spawning a thread involves memory allocation, both for its internal control structures and for its stack

19:39 <K-ballo> should spawning threads be a no-no in parallel programming?

19:42 <gonidelis[m]> yes!

19:42 <gonidelis[m]> no!

19:42 <gonidelis[m]> i mean i thhought of that you are right

19:43 <gonidelis[m]> parallel execution is completely dynamic is what you say

19:43 <K-ballo> it doesn't have to be

19:44 <gonidelis[m]> p2300 tries to do as much work as possible pre-execution. set up the whole imp. and then fire the execution. coroutines inhibit that cause, no?

19:44 <gonidelis[m]> oh how/

19:44 <gonidelis[m]> ?

19:44 <satacker[m]> I think it'll also depend somewhat on a particular implementation of coroutine. For example sometimes it could only be a syntactic sugar, sometimes it's beneficial for an optimization (will add a source later)

19:44 <K-ballo> why would coroutines inhibit that?

19:47 <gonidelis[m]> yeah I don't know. I guess I am still figuring it out

19:48 <hkaiser> gonidelis[m]: minimizing dynamic allocation is definitely a good thing

19:48 <gonidelis[m]> is operation_state dyanmic in senders receivers? I reckon no, coroutine state on the other hand is allocated on the heap

19:48 <K-ballo> anything senders/receivers is probably allocation free, it's a pretty low level primitive

19:49 <hkaiser> gonidelis[m]: usually no allocations are needed for s/r, just some of those need it

19:49 <gonidelis[m]> is operation_state dynamic? hm....

19:49 <gonidelis[m]> yes

19:49 <gonidelis[m]> !

19:49 <K-ballo> and by that I mean the core of sender/receivers, not the things you build on top of it

19:49 <gonidelis[m]> yy of course ^^

19:49 <hkaiser> some of the s/r algorithms need allocations

19:50 <gonidelis[m]> so K-ballo you are basically saying that s/r just skip the dirty work on their definition and do actually involve dynamic memory allocation deeper on

19:50 <K-ballo> no

19:50 <K-ballo> if your primitive does memory allocation when it doesn't need to, you can't remove it from the outside

19:50 <gonidelis[m]> also hkaiser https://godbolt.org/z/PvY5PMezd

19:51 <gonidelis[m]> check lines #94-#101. `then` does have a state

19:51 <gonidelis[m]> K-ballo: ahh... go again?

19:52 <hkaiser> gonidelis[m]: ok

19:53 <hkaiser> gonidelis[m]: ours doesn't need one: https://github.com/STEllAR-GROUP/hpx/blob/master/libs/core/execution/include/hpx/execution/algorithms/then.hpp

19:55 <gonidelis[m]> hkaiser: https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2022/p2300r5.html#example-then

19:55 <gonidelis[m]> but you are right afaiu there is no op_state here

19:55 <gonidelis[m]> huh

19:55 <gonidelis[m]> will ask eric

19:55 <Yorlik> What would currently the best system for convergence of SIMD, parallel algorithms and GPU based computations? Does something like that already exist?

19:55 <satacker[m]> Also, coroutines go (more) with concurrency right?

19:57 <hkaiser> gonidelis[m]: e.g. split() needs an allocation

19:57 <gonidelis[m]> satacker: I would promise that the original idea was communication

19:57 <hkaiser> start_detached as well

19:57 <gonidelis[m]> why split?

19:58 <gonidelis[m]> Yorlik: HPX B)

19:58 <satacker[m]> > <@satacker:matrix.org> ```... (full message at https://libera.ems.host/_matrix/media/r0/download/libera.chat/9de6a936cc32f109a55019da7e3d9ea8435515cc)

19:58 <Yorlik> HPX ofc is part of that movement. But i guess GPU programming is another beast isn't it?

19:59 <gonidelis[m]> yay...we have people to blame for that if you want

19:59 <gonidelis[m]> I kno gdaiss has been writing some integration code of HPX on GPUs

19:59 <hkaiser> gonidelis[m]: yes, you're right - our implementation of start_detached relies on our split - so the need for an allocation propagates

20:00 <Yorlik> I'm simply wondering what to look out for in the future. Especially if I can organize my data better (Maybe more SoA instead of AoS) to make it capable for autovectorization and accelarator use.

20:00 <hkaiser> Yorlik: we focussed on being able to integrate existing GPU kernels into our async execution infrastructure

20:00 <hkaiser> e.g. you can launch a CUDA kernel and get an hpx::future back that becomes ready when the kernel has fnished running

20:01 <satacker[m]> Can GPGPU be generalized to include it in HPX? All the hardwares being different?

20:01 <Yorlik> Existing kernels? Like pre-written shaders/kernels for specific problems?

20:01 <hkaiser> yes

20:02 <Yorlik> The whole topic still looks very chaotic to me: Compute shaders, OpenCL, Cuda, SyCl, It looks a bit like a mess and vendor wars.

20:02 <Yorlik> So - I guess every sane person want unification for obvious reasons.

20:03 <hkaiser> that's what we tried to do, we support CUDA, HIP, and kokkos, currently

20:03 <hkaiser> gdaiss[m]: plans to add support for sycl

20:03 <Yorlik> I wonder if there is anything happening at the side of C++ standardization concerning heterogenous computing already. I guess executors are a start, right?

20:03 <hkaiser> Yorlik: senders/receivers

20:04 <Yorlik> I think I heard that recently. Not yet in C++23 I guess?

20:04 <hkaiser> however writing the kernels themselves is not under consideration, that is vendor specific

20:04 <hkaiser> nope, target is now c++26

20:04 <hkaiser> :/

20:05 <Yorlik> So it's like design patterns for algorithms implemented by vendors and plugged in into e.g. HPX?

20:05 <Yorlik> E.g. matrix math, linear algebra, ML or whatever?

20:06 <hkaiser> Yorlik: it's more that different vendors require different extensions to compile things down to the device

20:06 <hkaiser> like CUDA's __host__ and __device__ directives, etc.

20:07 <Yorlik> And no common abstraction in sight?

20:07 <hkaiser> have not seen any

20:07 <Yorlik> Allright. Thanks for the info!

20:08 <hkaiser> there is a clang version that supports compiling for nvidia devices, and another clang version for hip

20:08 <hkaiser> etc.

20:09 <Yorlik> For our physics engine we have now decided to not write our own for now, but write an abstraction layer and plug bullet physics into it. That way we can make our own engine later or swap out. Might be easier to do one later when APIs have further developed and stabilized.

20:09 <Yorlik> Bullet already has some GPGPU capabilities.

20:17 <hkaiser> sure

21:21 <diehlpk_work> It seems that ranges are broken in clang 13

21:49 <hkaiser> diehlpk_work: what ranges?

21:52 <diehlpk_work> https://github.com/ModernCPPBook/Examples/blob/main/appendix/ranges.cpp

21:52 <diehlpk_work> std::view has some bug in clang 13

21:52 <diehlpk_work> I can compile it with gcc 11 and gcc 12

21:53 <diehlpk_work> but not with clang 13 and clang 14

21:53 <hkaiser> ok

21:53 <hkaiser> diehlpk_work: btw, I'd suggest to use clang-format to unify the code formatting

21:53 <hkaiser> for instance use https://github.com/STEllAR-GROUP/hpx/blob/master/.clang-format

21:54 <diehlpk_work> https://pastebin.com/TwK1jgmt

21:54 <diehlpk_work> hkaiser, I do

21:54 <diehlpk_work> https://github.com/ModernCPPBook/Examples/blob/main/format.sh

21:54 <hkaiser> ok

21:54 <diehlpk_work> I use the Google style

21:55 <diehlpk_work> But we can use the HPX one, I do not care

21:55 <hkaiser> well, the code you linked doesn't look formatted ;-)

21:55 <hkaiser> for the error - this is from the clang v12 - is that correct?

21:56 <diehlpk_work> No it is formated

21:56 <diehlpk_work> clang version 13.0.0 (Fedora 13.0.0-3.fc35)

21:56 <diehlpk_work> Fedora should use clang 13

21:57 <hkaiser> ahh, now it is

21:57 <diehlpk_work> and Fedora 36 uses clang 14

21:57 <diehlpk_work> I could not compile it with both of them

21:57 <diehlpk_work> gcc works for me.

21:57 <diehlpk_work> Just wanted to test the book exmaples with clang and gcc to make sure they will work

21:57 <hkaiser> that's more than strange - will try

21:58 <diehlpk_work> Yes, I forgot to run the script

21:58 <diehlpk_work> I found a issue on clang's github for 13

22:23 diehlpk_work has quit [Remote host closed the connection]

22:48 hkaiser has quit [Quit: Bye!]