K-ballo changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/
hkaiser has joined #ste||ar
nanmiao has joined #ste||ar
nanmiao has quit [Quit: Connection closed]
K-ballo has quit [Quit: K-ballo]
hkaiser has quit [Quit: bye]
jehelset has joined #ste||ar
jehelset has quit [Ping timeout: 268 seconds]
jehelset has joined #ste||ar
bita has joined #ste||ar
sivoais has quit [Ping timeout: 252 seconds]
jehelset has quit [Remote host closed the connection]
sivoais has joined #ste||ar
jehelset has joined #ste||ar
bita has quit [Ping timeout: 276 seconds]
K-ballo has joined #ste||ar
hkaiser has joined #ste||ar
<hkaiser>
gonidelis[m]: yt?
<gonidelis[m]>
hkaiser: yes
<hkaiser>
see pm pls
<gonidelis[m]>
hkaiser: i have no pm
bita has joined #ste||ar
<hkaiser>
hey ms[m], you around?
<ms[m]>
hkaiser: yep, here
<ms[m]>
what's up?
<hkaiser>
see pm, pls
nanmiao has joined #ste||ar
<hkaiser>
gonidelis[m]: see pm, pls
diehlpk_work has joined #ste||ar
<gnikunj[m]>
ms: yt?
<hkaiser>
gonidelis[m]: I'm ready whenever you are
<gonidelis[m]>
ok
<gonidelis[m]>
give me a sec
<rachitt_shah[m]>
<ms[m] "rachitt_shah, dashohoxha, Roshee"> Hey everyone, when is the GSoD deadline closing to fill this form?
nanmiao has quit [Quit: Connection closed]
<rachitt_shah[m]>
ms[m] can you please share, if possible.
<gnikunj[m]>
rachitt_shah[m]: iirc it's May 2nd
<ms[m]>
rachitt_shah[m]: it's right there on the application form ;) may 11
<rachitt_shah[m]>
I assumed today would be the deadline.
<gnikunj[m]>
turns out it's really slow to use default_executor in hpx-kokkos
<gnikunj[m]>
ms: btw kokkos parallel-for has a ton of contention when you have loops that iterate over a large range. When I had a loop with range 0-1000, it took 2s to complete the kokkos parallel-for, which could've been done in 0.02s with HPX
<gnikunj[m]>
I've started looking into the code. If I find anything, I'll let you know.
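By way of illustration, a minimal sketch of the kind of comparison being described here, assuming host builds of Kokkos and HPX; the empty loop bodies and the timer are placeholders, not the actual hpx-kokkos benchmark:

```cpp
// Illustrative only: times a Kokkos parallel_for against an equivalent
// hpx::for_loop over the same 0..1000 range discussed above.
#include <Kokkos_Core.hpp>
#include <hpx/algorithm.hpp>
#include <hpx/chrono.hpp>
#include <hpx/execution.hpp>
#include <hpx/init.hpp>
#include <iostream>

int hpx_main(int argc, char* argv[])
{
    Kokkos::initialize(argc, argv);
    {
        hpx::chrono::high_resolution_timer timer;
        Kokkos::parallel_for("bench", Kokkos::RangePolicy<>(0, 1000),
            KOKKOS_LAMBDA(int i) { /* per-iteration work */ });
        Kokkos::fence();    // wait for the asynchronous dispatch to finish
        std::cout << "kokkos: " << timer.elapsed() << "s\n";

        timer.restart();
        hpx::for_loop(hpx::execution::par, 0, 1000,
            [](int i) { /* per-iteration work */ });
        std::cout << "hpx:    " << timer.elapsed() << "s\n";
    }
    Kokkos::finalize();
    return hpx::finalize();
}

int main(int argc, char* argv[]) { return hpx::init(argc, argv); }
```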
nanmiao has joined #ste||ar
K-ballo has quit [Quit: K-ballo]
K-ballo has joined #ste||ar
<ms[m]>
gnikunj: the overhead of kokkos::parallel_for compared to the manual loop with async is not surprising
<ms[m]>
especially for cuda: creating those independent instances creates streams (which iirc is a blocking call, on top of everything else) and that's expensive
<ms[m]>
so just creating the executors for that manual loop is expensive
<ms[m]>
for the hpx backend I thought it would be lighter weight, but looking at the code I realize I have some dynamic allocation there (related to handling of the global/independent instances and scratch space; it could perhaps be lazily allocated, not sure)
<ms[m]>
which while relatively cheap, might add up in that case
<ms[m]>
the hpx parallel_executor is pretty much as lightweight as it gets
<ms[m]>
so I'd recommend you see how much of the overhead you get rid of by moving the creation of the kokkos executors outside the loop
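A hedged sketch of that suggestion; hpx::kokkos::default_executor is the hpx-kokkos type named earlier in the discussion, but the header name and the hpx::async interop shown here are assumptions, not confirmed API:

```cpp
// Assumed API: an hpx-kokkos executor usable with hpx::async. The point is
// only *where* the executor is constructed, not the exact spelling.
#include <hpx/future.hpp>
#include <hpx/kokkos.hpp>    // hpx-kokkos header (assumed name)
#include <vector>

void manual_loop(int n)
{
    std::vector<hpx::future<void>> futures;
    futures.reserve(n);

    // slow variant: a fresh executor per task; on the cuda backend each
    // independent instance may create a stream, which is expensive
    // for (int i = 0; i != n; ++i) {
    //     hpx::kokkos::default_executor exec;
    //     futures.push_back(hpx::async(exec, [] { /* work */ }));
    // }

    // suggested variant: create the executor once, outside the loop
    hpx::kokkos::default_executor exec;
    for (int i = 0; i != n; ++i)
        futures.push_back(hpx::async(exec, [] { /* work */ }));

    hpx::wait_all(futures);
}
```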
<ms[m]>
whatever is remaining may be related to the internal synchronization in the kokkos backend's parallel for loop (it uses e.g. a latch internally, and wraps that in a future, which for sure adds more overhead than just the plain hpx parallel_executor)
<ms[m]>
I'd still also have to check that it actually just creates one task when the size of the range is 1 (and this could be optimized to run inline)
<ms[m]>
in general my comment is that it's not really optimized for your use case of one-off tasks
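A generic illustration of the latch-wrapped-in-a-future synchronization described a few messages above; this is the shape of the overhead, not the actual hpx-kokkos implementation:

```cpp
// Generic sketch: worker tasks count down a latch, and completion is
// exposed as a future. Each layer (latch, wrapping task, future) adds
// overhead that a plain hpx parallel_executor avoids.
#include <hpx/future.hpp>
#include <hpx/init.hpp>
#include <hpx/latch.hpp>
#include <iostream>

int hpx_main()
{
    constexpr std::ptrdiff_t chunks = 4;
    hpx::latch done(chunks);

    for (std::ptrdiff_t i = 0; i != chunks; ++i)
    {
        hpx::apply([&done] {
            /* one chunk of the parallel loop */
            done.count_down(1);
        });
    }

    // wrap completion of the latch in a future
    hpx::future<void> all_done = hpx::async([&done] { done.wait(); });
    all_done.get();
    std::cout << "loop complete\n";
    return hpx::finalize();
}

int main(int argc, char* argv[]) { return hpx::init(argc, argv); }
```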
Vir has quit [*.net *.split]
Vir has joined #ste||ar
<ms[m]>
rachitt_shah[m]: no worries, looking forward to seeing your application!
<rachitt_shah[m]>
> rachitt_shah[m]: no worries, looking forward to seeing your application!
<rachitt_shah[m]>
Thank you ms! Will finish it over this weekend. Would there be further interviews too?
<ms[m]>
rachitt_shah: depends on the quality and quantity of applications we get, but it's possible, yes
<rachitt_shah[m]>
<ms[m] "rachitt_shah: depends on the qua"> Got it. Will update here with my questions.
jehelset has quit [Ping timeout: 268 seconds]
diehlpk_work has quit [Remote host closed the connection]
<gnikunj[m]>
ms: it seems like it. What about kokkos tasks?
<ms[m]>
gnikunj: maybe, it really depends what the final use case is
<ms[m]>
also, there's nothing wrong with just using hpx's parallel_executor on the host side
<ms[m]>
right tool for the right job and all that
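For reference, the plain HPX executor being recommended for the host side; a minimal sketch of a one-off task:

```cpp
// Host-side one-off task via hpx's parallel_executor, as suggested above.
#include <hpx/execution.hpp>
#include <hpx/future.hpp>
#include <hpx/init.hpp>
#include <iostream>

int hpx_main()
{
    hpx::execution::parallel_executor exec;    // cheap to construct
    hpx::future<void> f =
        hpx::async(exec, [] { std::cout << "one-off task\n"; });
    f.get();
    return hpx::finalize();
}

int main(int argc, char* argv[]) { return hpx::init(argc, argv); }
```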
<gnikunj[m]>
ms: yeah, got it. I was trying to explore what I should use for the backend. Given that kokkos tasks are specialized for handling single tasks, they should be a better fit for my use case.
<gnikunj[m]>
<ms[m] "right tool for the right job and"> yup that makes sense!
<ms[m]>
could be better, but it goes in the other direction with very fine-grained tasks (stackless)
<gnikunj[m]>
could you elaborate?
<ms[m]>
you might not gain anything by that either
<gnikunj[m]>
damn I see :/
<ms[m]>
if it's just portability for single tasks that you're looking for you might not even need kokkos
<ms[m]>
but if this is eventually supposed to be extended to bulk execution and various parallel algorithms kokkos probably makes sense
<ms[m]>
I'd take a step back and think what you're really aiming for
<gnikunj[m]>
yeah, I figured. I believe things will be better once I add a resilience execution space.
<gnikunj[m]>
coz then we'll have resilience over bulk execution
<gnikunj[m]>
<ms[m] "but if this is eventually suppos"> right
<ms[m]>
and by "might not gain anything" I mean that it'll probably be faster than using the parallel algorithms, but it's still kind of overkill to use kokkos tasks for a single task
<ms[m]>
in any case, I'd first try to figure out how much of those overheads come just from the creation of those kokkos executors/instances