<nikunj97>
ms[m], btw I wanted to ask if the HPX backend for Kokkos is complete (for on-node use)
<ms[m]>
nikunj97: ok, but what is the BibTeX you're looking for? https://zenodo.org/record/3675272/export/hx and the same for the JOSS paper is essentially what we want to add to the docs
<ms[m]>
or are you actually looking for a link to the hpx docs to give to someone else?
<nikunj97>
Kokkos has a port of miniFE and I wanted to benchmark it wrt HPX. Will the performance be comparable to having an HPX port itself?
<nikunj97>
ms[m], I wanted to cite the repo in my paper. I already have other HPX related citations in place
<ms[m]>
yeah, the kokkos backend is feature complete
<nikunj97>
ok great!
<ms[m]>
performance will be worse than the openmp backend naturally, but compared to vanilla hpx it may even be faster (it uses a slightly different executor)
<ms[m]>
it essentially uses what is now the thread_pool_executor in hpx itself
<ms[m]>
it's just not called that in the kokkos backend, because it came first
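(For context, running an existing Kokkos code such as miniFE on the HPX backend needs no source changes; below is a minimal sketch, assuming Kokkos was built with -DKokkos_ENABLE_HPX=ON so that the HPX execution space, Kokkos::Experimental::HPX, serves as the host backend. The kernel itself is plain Kokkos.)

```cpp
#include <Kokkos_Core.hpp>

int main(int argc, char* argv[])
{
    // With -DKokkos_ENABLE_HPX=ON, Kokkos starts the HPX runtime under the
    // hood and schedules parallel regions onto HPX tasks.
    Kokkos::initialize(argc, argv);
    {
        // Nothing HPX-specific here: the backend decides how iterations
        // are mapped to worker threads.
        double sum = 0.0;
        Kokkos::parallel_reduce(
            "sum", 1000,
            KOKKOS_LAMBDA(int i, double& local) { local += i; }, sum);
        Kokkos::fence();
    }
    Kokkos::finalize();
    return 0;
}
```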
<ms[m]>
for citing the repo zenodo is the right thing
<nikunj97>
is thread_pool_executor better than the block_executor we have?
<nikunj97>
alright, I'll cite from zenodo
<ms[m]>
better is subjective; block_executor uses the thread_pool_executor
<ms[m]>
the thread_pool_executor has a more limited interface; block_executor lets you choose to run work on an arbitrary set of PUs/cores/NUMA nodes
<nikunj97>
that's why the block_executor improved significantly recently
<ms[m]>
thread_pool_executor only allows a contiguous range of worker thread ids (actually it's restricted_thread_pool_executor)
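(A rough illustration of the difference: block_executor is handed an explicit set of targets, e.g. one per NUMA domain, and distributes work across them in blocks. This is a sketch only; header and namespace spellings have moved around between HPX versions, so treat the exact names as approximate.)

```cpp
#include <hpx/hpx_main.hpp>
#include <hpx/include/compute.hpp>
#include <hpx/include/parallel_for_each.hpp>

#include <vector>

int main()
{
    // One target per NUMA domain of the machine; block_executor splits the
    // iteration range across this arbitrary set of targets.
    std::vector<hpx::compute::host::target> targets =
        hpx::compute::host::numa_domains();
    hpx::compute::host::block_executor<> exec(targets);

    std::vector<int> v(1000000, 1);
    hpx::parallel::for_each(hpx::parallel::execution::par.on(exec),
        v.begin(), v.end(), [](int& x) { x *= 2; });

    return 0;
}
```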
<ms[m]>
yeah, it improved after me having made it worse ;)
<nikunj97>
why do we not have papers on executor performance improvements btw?
<ms[m]>
nikunj97: this is what you get with the zenodo doi doi.org/10.5281/zenodo.598202
<ms[m]>
no, it's all too recent
<nikunj97>
ohh yea, this is what I wanted. This will do!
<ms[m]>
it's the state of the branch before I removed the change to object libraries
<ms[m]>
it does contain quite a few other changes as well (the libs have been split into two parts) which I don't think affect cmake generation time, but I don't know for sure
Nikunj__ has joined #ste||ar
mcopik has joined #ste||ar
mcopik has quit [Client Quit]
Nikunj__ has quit [Read error: Connection reset by peer]
nikunj97 has quit [Ping timeout: 246 seconds]
Amy2 has joined #ste||ar
Amy1 has quit [Ping timeout: 240 seconds]
Amy2 has quit [Ping timeout: 264 seconds]
Amy2 has joined #ste||ar
nikunj97 has joined #ste||ar
Amy2 has quit [Ping timeout: 265 seconds]
Amy2 has joined #ste||ar
kale[m] has joined #ste||ar
kale[m] has quit [Client Quit]
kale[m] has joined #ste||ar
nikunj97 has quit [Remote host closed the connection]
Amy2 has quit [Ping timeout: 264 seconds]
Amy2 has joined #ste||ar
Amy2 has quit [Ping timeout: 256 seconds]
Amy2 has joined #ste||ar
hkaiser has joined #ste||ar
<weilewei>
ms[m] meeting now
<weilewei>
gsoc
<ms[m]>
weilewei: thanks!
<ms[m]>
weilewei: please start without me, I'll join as soon as I can
Amy2 has quit [Ping timeout: 256 seconds]
Amy2 has joined #ste||ar
nan111 has joined #ste||ar
mcopik has joined #ste||ar
mcopik has quit [Client Quit]
diehlpk_work has joined #ste||ar
kale[m] has quit [Ping timeout: 256 seconds]
kale[m] has joined #ste||ar
karame_ has joined #ste||ar
<K-ballo>
ms[m]: there are cyclic dependencies between modules, is that
<K-ballo>
known?
<hkaiser>
on master?
<K-ballo>
yes
<hkaiser>
uhh
<hkaiser>
why does circleci pass, then?
<ms[m]>
K-ballo: there's one I know of which isn't caught by cpp-dependencies
<ms[m]>
does the circleci check complain?
<K-ballo>
there's a circle ci check for dependencies?
<hkaiser>
yes
<ms[m]>
in any case, what is the cyclic dependency?
<ms[m]>
cpp-dependencies doesn't know about our generated headers
<ms[m]>
hkaiser: isn't that for vulnerabilities in external dependencies?
<hkaiser>
ms[m]: it can scan cmake dependencies, I think
<hkaiser>
haven't looked too closely, though
<ms[m]>
K-ballo: iirc I had to fix some cyclic cmake target dependencies on the object libraries branch, it wouldn't have compiled otherwise
<ms[m]>
hkaiser: I may be misunderstanding it as well
LiliumAtratum has joined #ste||ar
LiliumAtratum has quit [Remote host closed the connection]
nikunj has quit [Ping timeout: 260 seconds]
nikunj has joined #ste||ar
kale[m] has quit [Ping timeout: 260 seconds]
kale[m] has joined #ste||ar
karame_ has quit [Quit: Ping timeout (120 seconds)]
kale[m] has quit [Ping timeout: 256 seconds]
kale[m] has joined #ste||ar
karame_ has joined #ste||ar
akheir has quit [Ping timeout: 246 seconds]
kale[m] has quit [Ping timeout: 272 seconds]
kale[m] has joined #ste||ar
<nikunj>
heller1: is it possible to have decreasing memory bandwidth with increasing core counts? I'm seeing this behavior on a Raspberry Pi and am unable to explain it.
sayefsakin has joined #ste||ar
rtohid has left #ste||ar [#ste||ar]
kale[m] has quit [Ping timeout: 265 seconds]
kale[m] has joined #ste||ar
<heller1>
Yes, that's possible
<heller1>
If the bus and/or memory controller can't deal with the concurrency
kale[m] has quit [Ping timeout: 260 seconds]
Amy2 has quit [Ping timeout: 256 seconds]
Amy2 has joined #ste||ar
<nikunj>
Why would anyone want more processing units when the bus can't handle concurrency?
<hkaiser>
nikunj: more PUs doesn't necessarily mean more memory bus pressure
<nikunj>
Ohh, so you mean the bus can handle only a certain amount of memory bandwidth and concurrency at the same time?
<hkaiser>
yes
<nikunj>
That answers why my results were distorted. Thanks heller1 and hkaiser
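(The saturation effect is easy to reproduce without HPX: run the same streaming copy on 1..N threads and watch the aggregate GB/s stop scaling, or even drop, once the memory controller is saturated. A minimal standard-C++ sketch:)

```cpp
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <thread>
#include <vector>

int main()
{
    constexpr std::size_t n = std::size_t(1) << 22;   // 4M doubles (~32 MB), well past the caches
    unsigned const max_threads = std::thread::hardware_concurrency();
    double sink = 0.0;                                // keeps the copies from being optimized away

    for (unsigned t = 1; t <= max_threads; ++t)
    {
        // private source/destination arrays per thread
        std::vector<std::vector<double>> src(t, std::vector<double>(n, 1.0));
        std::vector<std::vector<double>> dst(t, std::vector<double>(n, 0.0));

        auto start = std::chrono::steady_clock::now();
        std::vector<std::thread> workers;
        for (unsigned i = 0; i < t; ++i)
            workers.emplace_back([&, i] {
                for (std::size_t j = 0; j < n; ++j)   // one read + one write stream
                    dst[i][j] = src[i][j];
            });
        for (auto& w : workers)
            w.join();
        double secs = std::chrono::duration<double>(
            std::chrono::steady_clock::now() - start).count();

        sink += dst[0][n / 2];
        // bytes moved: n doubles read + n doubles written, per thread
        double gb = 2.0 * n * sizeof(double) * t / 1e9;
        std::printf("%u thread(s): %.2f GB/s\n", t, gb / secs);
    }
    std::printf("(checksum %.1f)\n", sink);
    return 0;
}
```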
<nikunj>
hkaiser: see pm pls
joe[m]1 has joined #ste||ar
<Yorlik>
How could HPX help me improve L3 (level 3) cache locality in this topology? Is there a way to keep tasks local to the group of cores they were put on? Is HPX already trying this? link: https://i.imgur.com/Z83PFaz.png
sayef_ has joined #ste||ar
<hkaiser>
Yorlik: you can create separate thread pools and keep those confined to a numa domain
sayefsakin has quit [Ping timeout: 256 seconds]
<hkaiser>
however, that would prevent stealing across the pools
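(A sketch of the setup hkaiser describes, following the resource-partitioner examples: one extra thread pool per NUMA domain, with the first domain left in the default pool. Exact constructor and function signatures differ between HPX versions, and the "numa-N" pool names are made up here.)

```cpp
#include <hpx/hpx_init.hpp>
#include <hpx/include/resource_partitioner.hpp>

#include <cstddef>
#include <string>

int hpx_main()
{
    // Work submitted through an executor bound to one of the "numa-N" pools
    // stays on that domain's cores (and therefore its shared L3 cache), but
    // is not stolen by the other pools.
    return hpx::finalize();
}

int main(int argc, char* argv[])
{
    // The partitioner has to be set up before hpx::init() starts the runtime.
    hpx::resource::partitioner rp(argc, argv);

    auto const& domains = rp.numa_domains();
    // Keep the first domain in the "default" pool; give every other domain
    // its own pool ("numa-1", "numa-2", ... are hypothetical names).
    for (std::size_t i = 1; i < domains.size(); ++i)
    {
        std::string pool = "numa-" + std::to_string(i);
        rp.create_thread_pool(pool);
        rp.add_resource(domains[i], pool);
    }

    return hpx::init();
}
```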
<Yorlik>
A parallel loop would probably distribute only to one pool, or could I spread it across pools?
<Yorlik>
Like round robin the pools
<Yorlik>
At the moment I'm trying to take stress off the cache by using small object pools. The pool I wrote works nicely in a test and is also faster; unfortunately there's some Lua interop issue I need to solve first.
<Yorlik>
The speedup from just a naive vector-based pool is already like 3-10x.
<Yorlik>
It's crazy.
<Yorlik>
But the memory bandwidth I calculated from the loss in performance can go down to an abysmal 25-50 MB per second - so I really need to do something. Still not fully understanding the problem ...
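(Not HPX-specific, but a "naive vector based pool" in the spirit Yorlik describes could look like the following minimal, single-threaded sketch, assuming a default-constructible T; keeping all objects in one contiguous block is what buys the cache locality.)

```cpp
#include <cstddef>
#include <vector>

// Fixed-capacity pool for a single type T: all slots live in one contiguous
// vector, and freed slots are recycled through an index free list, so
// repeated acquire/release keeps touching the same cache-friendly block.
template <typename T>
class simple_pool
{
    std::vector<T> storage_;         // contiguous slots
    std::vector<std::size_t> free_;  // indices of currently unused slots

public:
    explicit simple_pool(std::size_t capacity)
      : storage_(capacity)
    {
        free_.reserve(capacity);
        for (std::size_t i = capacity; i-- > 0;)
            free_.push_back(i);
    }

    // hands out a pointer into the contiguous block; nullptr when exhausted
    T* acquire()
    {
        if (free_.empty())
            return nullptr;
        std::size_t i = free_.back();
        free_.pop_back();
        return &storage_[i];
    }

    // p must have been obtained from this pool
    void release(T* p)
    {
        free_.push_back(static_cast<std::size_t>(p - storage_.data()));
    }
};
```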