K-ballo changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/
hkaiser has joined #ste||ar
nanmiao has joined #ste||ar
nanmiao has quit [Quit: Connection closed]
K-ballo has quit [Quit: K-ballo]
hkaiser has quit [Quit: bye]
jehelset has joined #ste||ar
jehelset has quit [Ping timeout: 268 seconds]
jehelset has joined #ste||ar
bita has joined #ste||ar
sivoais has quit [Ping timeout: 252 seconds]
jehelset has quit [Remote host closed the connection]
sivoais has joined #ste||ar
jehelset has joined #ste||ar
bita has quit [Ping timeout: 276 seconds]
K-ballo has joined #ste||ar
hkaiser has joined #ste||ar
<hkaiser> gonidelis[m]: yt?
<gonidelis[m]> hkaiser: yes
<hkaiser> see pm pls
<gonidelis[m]> hkaiser: i have no pm
bita has joined #ste||ar
<hkaiser> hey ms[m], you around?
<ms[m]> hkaiser: yep, here
<ms[m]> what's up?
<hkaiser> see pm, pls
nanmiao has joined #ste||ar
<hkaiser> gonidelis[m]: see pm, pls
diehlpk_work has joined #ste||ar
<gnikunj[m]> ms: yt?
<hkaiser> gonidelis[m]: I'm ready whenever you are
<gonidelis[m]> ok
<gonidelis[m]> give me a sec
<rachitt_shah[m]> <ms[m] "rachitt_shah, dashohoxha, Roshee"> Hey everyone, when is the deadline for filling out this GSoD form?
nanmiao has quit [Quit: Connection closed]
<rachitt_shah[m]> ms[m] can you please share, if possible.
<gnikunj[m]> rachitt_shah[m]: iirc it's May 2nd
<ms[m]> rachitt_shah[m]: it's right there on the application form ;) may 11
<ms[m]> gnikunj: on my phone but write away
<rachitt_shah[m]> <ms[m] "rachitt_shah[m]: it's right ther"> Got it, didn't check. Sorry :(
<gnikunj[m]> ms: putting it here for you to see later. I did run some microbenchmarks. You can check the results and the benchmark here: https://gist.github.com/NK-Nikunj/f88c7ddd30d1f33438c33ea96ba32b4d
<rachitt_shah[m]> I assumed today would be the deadline.
<gnikunj[m]> turns out it's really slow to use default_executor in hpx-kokkos
<gnikunj[m]> ms: btw Kokkos parallel_for has a ton of contention when you have loops that iterate over a large range. When I had a loop over the range 0 to 1000, the Kokkos parallel_for took 2s to complete, where HPX could've done it in 0.02s
<gnikunj[m]> I've started looking into the code. If I find anything, I'll let you know.
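For context, a rough sketch of the kind of comparison being described: many tiny launches through Kokkos::parallel_for versus plain hpx::async over the same range. This is illustrative only; the actual benchmark and numbers are in the gist linked above, and the work inside each task is a placeholder.

    // Illustrative only: one launch per iteration, Kokkos vs. plain hpx::async.
    #include <Kokkos_Core.hpp>
    #include <hpx/hpx_main.hpp>
    #include <hpx/include/async.hpp>
    #include <hpx/include/lcos.hpp>
    #include <vector>

    int main(int argc, char* argv[])
    {
        Kokkos::initialize(argc, argv);
        {
            constexpr int n = 1000;

            // One Kokkos::parallel_for launch per iteration of the outer loop.
            for (int i = 0; i < n; ++i)
            {
                Kokkos::parallel_for("tiny", 1, KOKKOS_LAMBDA(int) { /* work */ });
            }
            Kokkos::fence();

            // The same shape with plain hpx::async and futures.
            std::vector<hpx::future<void>> futures;
            futures.reserve(n);
            for (int i = 0; i < n; ++i)
            {
                futures.push_back(hpx::async([] { /* work */ }));
            }
            hpx::wait_all(futures);
        }
        Kokkos::finalize();
        return 0;
    }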
nanmiao has joined #ste||ar
K-ballo has quit [Quit: K-ballo]
K-ballo has joined #ste||ar
<ms[m]> gnikunj: the overhead comparing kokkos::parallel_for to the manual loop with async is not surprising
<ms[m]> especially for cuda, creating those independent instances creates streams (which is a blocking call iirc on top of things) and that's expensive
<ms[m]> so just creating the executors for that manual loop is expensive
<ms[m]> for the hpx backend I thought it would be lighter weight, but looking at the code I realize I have some dynamic allocation there (related to handling of the global/independent instances and scratch space; it could perhaps be lazily allocated, not sure)
<ms[m]> which while relatively cheap, might add up in that case
<ms[m]> the hpx parallel_executor is pretty much as lightweight as it gets
<ms[m]> so I'd recommend you see how much of the overhead you get rid of by moving the creation of the kokkos executors outside the loop
<ms[m]> whatever is remaining may be related to the internal synchronization in the kokkos backend's parallel for loop (it uses e.g. a latch internally, and wraps that in a future, which for sure adds more overhead than just the plain hpx parallel_executor)
<ms[m]> I'd still also have to check that it actually just creates one task when the size of the range is 1 (and this could be optimized to run inline)
<ms[m]> in general my comment is that it's not really optimized for your use case of one-off tasks
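A minimal sketch of what hoisting the executor creation out of the loop could look like, assuming hpx-kokkos-style executors; the header and the default_executor name follow the hpx-kokkos naming mentioned above, but the exact API is an assumption here.

    // Sketch only: create the executor once and reuse it, instead of per iteration.
    #include <hpx/kokkos.hpp>          // hpx-kokkos (assumed main header)
    #include <hpx/include/async.hpp>

    void per_iteration_executor(int n)
    {
        for (int i = 0; i < n; ++i)
        {
            // A new executor (and, with CUDA, potentially a new stream) every time.
            hpx::kokkos::default_executor exec;
            hpx::async(exec, [] { /* work */ }).get();
        }
    }

    void hoisted_executor(int n)
    {
        // Created once, outside the loop, and reused for every iteration.
        hpx::kokkos::default_executor exec;
        for (int i = 0; i < n; ++i)
        {
            hpx::async(exec, [] { /* work */ }).get();
        }
    }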
Vir has quit [*.net *.split]
Vir has joined #ste||ar
<ms[m]> rachitt_shah[m]: no worries, looking forward to seeing your application!
<rachitt_shah[m]> > rachitt_shah[m]: no worries, looking forward to seeing your application!
<rachitt_shah[m]> Thank you ms! Will finish it over this weekend. Would there be further interviews too?
<ms[m]> rachitt_shah: depends on the quality and quantity of applications we get, but it's possible, yes
<rachitt_shah[m]> <ms[m] "rachitt_shah: depends on the qua"> Got it. Will update here with my questions.
jehelset has quit [Ping timeout: 268 seconds]
diehlpk_work has quit [Remote host closed the connection]
<gnikunj[m]> ms: it seems like it. What about kokkos tasks?
<ms[m]> gnikunj: maybe, it really depends what the final use case is
<ms[m]> also, there's nothing wrong with just using hpx's parallel_executor on the host side
<ms[m]> right tool for the right job and all that
<gnikunj[m]> ms: yeah, got it. I was trying to explore what I should use for the backend. Given that Kokkos tasks are specialized for handling single tasks, they should be better suited to my use case.
<gnikunj[m]> <ms[m] "right tool for the right job and"> yup that makes sense!
<ms[m]> could be better, but it goes in the other direction with very fine-grained tasks (stackless)
<gnikunj[m]> could you elaborate?
<ms[m]> you might not gain anything by that either
<gnikunj[m]> damn I see :/
<ms[m]> if it's just portability for single tasks that you're looking for you might not even need kokkos
<ms[m]> but if this is eventually supposed to be extended to bulk execution and various parallel algorithms kokkos probably makes sense
<ms[m]> I'd take a step back and think what you're really aiming for
<gnikunj[m]> yeah, I figured. I believe things will be better once I add a resilience execution space.
<gnikunj[m]> coz then we'll have resilience over bulk execution
<gnikunj[m]> <ms[m] "but if this is eventually suppos"> right
<ms[m]> and by "might not gain anything" I mean that it'll probably be faster than using the parallel algorithms, but it's still kind of overkill to use kokkos tasks for a single task
<ms[m]> in any case, I'd first try to figure out how much of those overheads come just from the creation of those kokkos executors/instances