hkaiser changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar-group.org | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | This channel is logged: irclog.cct.lsu.edu
K-ballo has quit [Quit: K-ballo]
hkaiser has joined #ste||ar
hkaiser has quit [Quit: Bye!]
Yorlik has joined #ste||ar
K-ballo has joined #ste||ar
bohrch has joined #ste||ar
hkaiser has joined #ste||ar
bohrch has quit [Ping timeout: 252 seconds]
aalekhn has joined #ste||ar
bohrch has joined #ste||ar
bohrch has quit [Ping timeout: 252 seconds]
bohr has joined #ste||ar
bohr has quit [Ping timeout: 252 seconds]
bohr has joined #ste||ar
bohr has quit [Client Quit]
bohr has joined #ste||ar
bohr has quit [Client Quit]
bohr has joined #ste||ar
bohr has quit [Client Quit]
bohr has joined #ste||ar
bohr has quit [Ping timeout: 252 seconds]
bohr has joined #ste||ar
bohr has quit [Quit: Client closed]
bohr has joined #ste||ar
bohr has quit [Ping timeout: 252 seconds]
diehlpk_work has joined #ste||ar
aalekhn has quit [Quit: Connection closed for inactivity]
<gonidelis[m]>
Coroutines are meant for asynchronous interaction between algorithms. While asynchrony is not to be conflated with parallelism and/or performance, it is in HPX's interest that we actually gain a perf benefit from them (beyond ease of codability). The question is, how could a mechanism that aims towards perf be dependent on dynamic memory allocation :| ?
<gonidelis[m]>
Shouldn't dynamic frame allocation be a no-no for any parallel programming paradigm? I am playing the devil's advocate once more.
<gonidelis[m]>
K-ballo hkaiser opinions? ^^
<gonidelis[m]>
or even satacker
<K-ballo>
no
<K-ballo>
spawning a thread involves memory allocation, both for its internal control structures and for its stack
<K-ballo>
should spawning threads be a no-no in parallel programming?
<gonidelis[m]>
yes!
<gonidelis[m]>
no!
<gonidelis[m]>
I mean, I thought of that, you are right
<gonidelis[m]>
so what you are saying is that parallel execution is completely dynamic
<K-ballo>
it doesn't have to be
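A minimal sketch of K-ballo's point, using a plain std::thread rather than an HPX thread: counting calls to the global allocator around a spawn shows that thread creation alone goes through it (the stack itself typically comes from mmap and would not show up in this count).

    // Counts global allocations performed while spawning and joining a thread.
    #include <atomic>
    #include <cstdio>
    #include <cstdlib>
    #include <new>
    #include <thread>

    std::atomic<std::size_t> allocations{0};

    void* operator new(std::size_t n)
    {
        ++allocations;
        if (n == 0) n = 1;
        if (void* p = std::malloc(n)) return p;
        throw std::bad_alloc{};
    }
    void operator delete(void* p) noexcept { std::free(p); }
    void operator delete(void* p, std::size_t) noexcept { std::free(p); }

    int main()
    {
        std::size_t const before = allocations.load();
        std::thread t([] {});    // spawning alone goes through the allocator
        t.join();
        std::printf("allocations during spawn/join: %zu\n",
            allocations.load() - before);
    }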
<gonidelis[m]>
P2300 tries to do as much work as possible pre-execution: set up the whole implementation and then fire the execution. Coroutines inhibit that, no?
<gonidelis[m]>
oh, how?
<satacker[m]>
I think it'll also depend somewhat on the particular implementation of coroutines. For example, sometimes it could be only syntactic sugar, and sometimes it's beneficial for an optimization (will add a source later)
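A small, self-contained illustration of that implementation dependence, not HPX-specific: the coroutine frame is notionally heap-allocated (made visible here by giving the promise its own operator new), but compilers are allowed to elide the allocation (HALO) when the coroutine's lifetime is fully visible to the caller.

    #include <coroutine>
    #include <cstddef>
    #include <cstdio>
    #include <cstdlib>

    struct task
    {
        struct promise_type
        {
            int result = 0;

            // Makes the frame allocation visible; normally the global
            // allocator is used (or the allocation is elided entirely).
            void* operator new(std::size_t n)
            {
                std::printf("coroutine frame: %zu bytes\n", n);
                return std::malloc(n);
            }
            void operator delete(void* p, std::size_t) noexcept { std::free(p); }

            task get_return_object()
            {
                return task{std::coroutine_handle<promise_type>::from_promise(*this)};
            }
            std::suspend_always initial_suspend() noexcept { return {}; }
            std::suspend_always final_suspend() noexcept { return {}; }
            void return_value(int v) { result = v; }
            void unhandled_exception() { std::abort(); }
        };

        explicit task(std::coroutine_handle<promise_type> h) : handle(h) {}
        ~task() { if (handle) handle.destroy(); }

        std::coroutine_handle<promise_type> handle;
    };

    task compute() { co_return 42; }

    int main()
    {
        task t = compute();    // frame allocated here (unless elided)
        t.handle.resume();     // run the coroutine to completion
        std::printf("result: %d\n", t.handle.promise().result);
    }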
<K-ballo>
why would coroutines inhibit that?
<gonidelis[m]>
yeah I don't know. I guess I am still figuring it out
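For reference, a minimal sketch of the "set everything up first, fire afterwards" model being discussed, written against P2300-style names; the header and namespace spellings vary between implementations (std::execution in the proposal, stdexec in the reference implementation, hpx::execution::experimental in HPX), so treat them as an assumption.

    // P2300-style names; adjust the header/namespaces to your implementation.
    #include <execution>
    #include <utility>

    namespace ex = std::execution;

    int main()
    {
        // Composing the pipeline does no work - it only builds a description
        // of the computation; for simple adaptors like these, no allocation
        // is needed either.
        auto work = ex::just(21) | ex::then([](int i) { return 2 * i; });

        // Execution is triggered explicitly; connecting sender and receiver
        // produces an operation state that can live on the stack.
        auto [result] = std::this_thread::sync_wait(std::move(work)).value();
        // result == 42
        (void) result;
    }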
<hkaiser>
gonidelis[m]: minimizing dynamic allocation is definitely a good thing
<gonidelis[m]>
is operation_state dynamic in senders/receivers? I reckon not; coroutine state, on the other hand, is allocated on the heap
<K-ballo>
anything senders/receivers is probably allocation free, it's a pretty low level primitive
<hkaiser>
gonidelis[m]: usually no allocations are needed for s/r, just some of those need it
<gonidelis[m]>
is operation_state dynamic? hm....
<gonidelis[m]>
yes
<gonidelis[m]>
!
<K-ballo>
and by that I mean the core of sender/receivers, not the things you build on top of it
<gonidelis[m]>
yy of course ^^
<hkaiser>
some of the s/r algorithms need allocations
<gonidelis[m]>
so K-ballo you are basically saying that s/r just skip the dirty work in their definition and do actually involve dynamic memory allocation further down
<K-ballo>
no
<K-ballo>
if your primitive does memory allocation when it doesn't need to, you can't remove it from the outside
<gonidelis[m]>
but you are right afaiu there is no op_state here
<gonidelis[m]>
huh
<gonidelis[m]>
will ask eric
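To make K-ballo's "the core is allocation free" point concrete, here is a toy, not the real P2300 protocol (which goes through the connect/start customization points and completion signatures), just the shape of it: the operation state is a plain object returned by value, so the caller decides where it lives, typically on the stack.

    #include <cstdio>
    #include <utility>

    struct print_receiver
    {
        void set_value(int v) { std::printf("got %d\n", v); }
    };

    template <typename Receiver>
    struct just_operation
    {
        int value;
        Receiver receiver;
        void start() { receiver.set_value(value); }
    };

    struct just_sender
    {
        int value;
        template <typename Receiver>
        just_operation<Receiver> connect(Receiver r)
        {
            return {value, std::move(r)};
        }
    };

    int main()
    {
        just_sender s{42};
        auto op = s.connect(print_receiver{});    // operation state on the stack
        op.start();                               // no heap allocation anywhere
    }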
<Yorlik>
What would currently be the best system for the convergence of SIMD, parallel algorithms, and GPU-based computations? Does something like that already exist?
<satacker[m]>
Also, coroutines go (more) with concurrency, right?
<hkaiser>
gonidelis[m]: e.g. split() needs an allocation
<gonidelis[m]>
satacker: I would presume that the original idea was communication
<Yorlik>
HPX ofc is part of that movement. But I guess GPU programming is another beast, isn't it?
<gonidelis[m]>
yay...we have people to blame for that if you want
<gonidelis[m]>
I know gdaiss has been writing some integration code for HPX on GPUs
<hkaiser>
gonidelis[m]: yes, you're right - our implementation of start_detached relies on our split - so the need for an allocation propagates
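A hedged usage sketch of where those allocations come from: split() parks the result in shared state that several consumers can reach, so it allocates, while plain just/then do not. The header and namespace spellings below follow recent HPX releases and may differ between versions.

    #include <hpx/execution.hpp>
    #include <hpx/init.hpp>

    #include <utility>

    namespace ex = hpx::execution::experimental;

    int hpx_main(int argc, char* argv[])
    {
        auto work = ex::just(42) | ex::then([](int i) { return i + 1; });

        // split() allocates shared state for the result.
        auto shared = ex::split(std::move(work));

        // start_detached() builds on split() in HPX, so the allocation
        // propagates; here the split sender is consumed twice.
        ex::start_detached(shared);
        ex::start_detached(std::move(shared));

        // A real program would synchronize before shutting down.
        return hpx::finalize();
    }

    int main(int argc, char* argv[])
    {
        return hpx::init(argc, argv);
    }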
<Yorlik>
I'm simply wondering what to look out for in the future, especially whether I can organize my data better (maybe more SoA instead of AoS) to make it suitable for autovectorization and accelerator use.
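A small illustration of the SoA-vs-AoS point, nothing HPX-specific: with one contiguous array per field, the update loop is unit-stride, which compilers autovectorize far more readily and which also maps naturally onto SIMD types and GPU buffers.

    #include <cstddef>
    #include <vector>

    // AoS: fields are interleaved, so updating only x strides over the whole
    // struct, which hampers autovectorization.
    struct particle { float x, y, z, mass; };

    void advance_aos(std::vector<particle>& ps, float dt, float vx)
    {
        for (std::size_t i = 0; i < ps.size(); ++i)
            ps[i].x += vx * dt;       // stride of sizeof(particle)
    }

    // SoA: one contiguous array per field; the same update is unit-stride.
    struct particles
    {
        std::vector<float> x, y, z, mass;
    };

    void advance_soa(particles& ps, float dt, float vx)
    {
        for (std::size_t i = 0; i < ps.x.size(); ++i)
            ps.x[i] += vx * dt;       // unit stride, easy to vectorize
    }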
<hkaiser>
Yorlik: we focussed on being able to integrate existing GPU kernels into our async execution infrastructure
<hkaiser>
e.g. you can launch a CUDA kernel and get an hpx::future back that becomes ready when the kernel has finished running
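Roughly what that looks like on the CUDA side: the kernel and stream handling below are plain CUDA, while future_for_stream is a hypothetical placeholder for HPX's facility (hpx::cuda::experimental provides this kind of stream-to-future integration; check the HPX docs for the exact names in your version).

    #include <cuda_runtime.h>

    __global__ void saxpy(int n, float a, float const* x, float* y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }

    void run(int n, float a, float const* d_x, float* d_y)
    {
        cudaStream_t stream;
        cudaStreamCreate(&stream);

        saxpy<<<(n + 255) / 256, 256, 0, stream>>>(n, a, d_x, d_y);

        // Hypothetical: wrap the stream's completion in an hpx::future so the
        // GPU work composes with the rest of the asynchronous task graph.
        // hpx::future<void> f = future_for_stream(stream);
        // f.then([](hpx::future<void>) { /* continue once the kernel is done */ });

        cudaStreamSynchronize(stream);    // stand-in for waiting on the future
        cudaStreamDestroy(stream);
    }

    int main()
    {
        int const n = 1 << 20;
        float *d_x = nullptr, *d_y = nullptr;
        cudaMalloc(&d_x, n * sizeof(float));
        cudaMalloc(&d_y, n * sizeof(float));
        run(n, 2.0f, d_x, d_y);
        cudaFree(d_x);
        cudaFree(d_y);
    }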
<satacker[m]>
Can GPGPU be generalized to include it in HPX, with all the hardware being different?
<Yorlik>
Existing kernels? Like pre-written shaders/kernels for specific problems?
<hkaiser>
yes
<Yorlik>
The whole topic still looks very chaotic to me: compute shaders, OpenCL, CUDA, SYCL. It looks a bit like a mess and vendor wars.
<Yorlik>
So - I guess every sane person wants unification, for obvious reasons.
<hkaiser>
that's what we tried to do; we currently support CUDA, HIP, and Kokkos
<hkaiser>
gdaiss[m] plans to add support for SYCL
<Yorlik>
I wonder if there is anything happening on the C++ standardization side concerning heterogeneous computing already. I guess executors are a start, right?
<hkaiser>
Yorlik: senders/receivers
<Yorlik>
I think I heard that recently. Not yet in C++23 I guess?
<hkaiser>
however, writing the kernels themselves is not under consideration; that is vendor-specific
<hkaiser>
nope, the target is now C++26
<hkaiser>
:/
<Yorlik>
So it's like design patterns for algorithms, implemented by vendors and plugged into e.g. HPX?
<Yorlik>
E.g. matrix math, linear algebra, ML or whatever?
<hkaiser>
Yorlik: it's more that different vendors require different extensions to compile things down to the device
<hkaiser>
like CUDA's __host__ and __device__ directives, etc.
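For context, the kind of vendor-specific annotation being referred to: CUDA's execution space specifiers (which HIP mirrors) tell the compiler where a function may run, while SYCL instead stays in standard C++ and submits kernels through its queue API.

    // __host__ __device__ compiles the function for both CPU and GPU;
    // __global__ marks a kernel entry point.
    __host__ __device__ float lerp(float a, float b, float t)
    {
        return a + t * (b - a);
    }

    __global__ void lerp_kernel(int n, float const* a, float const* b,
        float t, float* out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = lerp(a[i], b[i], t);
    }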
<Yorlik>
And no common abstraction in sight?
<hkaiser>
have not seen any
<Yorlik>
Allright. Thanks for the info!
<hkaiser>
there is a clang version that supports compiling for NVIDIA devices, and another clang version for HIP
<hkaiser>
etc.
<Yorlik>
For our physics engine we have decided not to write our own for now, but to write an abstraction layer and plug Bullet Physics into it. That way we can write our own engine later or swap it out. It might be easier to do one later, once the APIs have further developed and stabilized.
<Yorlik>
Bullet already has some GPGPU capabilities.
<hkaiser>
sure
<diehlpk_work>
It seems that ranges are broken in clang 13