<zao>
I've got a feeling I've got to dive into dataflow soon.
<zao>
I've got a codebase that has a lot of input and output arrays in void functions.
<zao>
I can move most of the output arrays into a struct or tuple that I return as I hpx::async the function, but there's a lot of tie and unpacking going on.
<zao>
A big messy main function to untangle here, and I'm not even at the simulation part yet.
<zao>
Now time to sleep before I break this code more :D
akheir has quit [Quit: Leaving]
hkaiser has quit [Quit: bye]
kale_ has joined #ste||ar
kale_ has quit [Client Quit]
nikunj97 has joined #ste||ar
gonidelis has joined #ste||ar
<zao>
Right now I return a future of a struct with all the output arguments, but there's a lot of moving out of that after the call site.
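A minimal sketch of the pattern zao describes, with hypothetical names (`Outputs`, `compute`): the void function's output arrays move into a struct that is returned through the future, and structured bindings replace the std::tie/unpack dance at the call site.

```cpp
// Sketch only: refactor a void function with output-array parameters into
// one that returns a struct, so hpx::async can hand the results back.
#include <hpx/hpx_main.hpp>
#include <hpx/include/async.hpp>

#include <functional>
#include <iostream>
#include <vector>

struct Outputs
{
    std::vector<double> density;
    std::vector<double> pressure;
};

Outputs compute(std::vector<double> const& input)
{
    Outputs out;
    out.density.resize(input.size());
    out.pressure.resize(input.size());
    // ... fill the output arrays from input ...
    return out;    // moved/NRVO'd, no copies of the arrays
}

int main()
{
    std::vector<double> input(1000, 1.0);

    hpx::future<Outputs> f = hpx::async(compute, std::cref(input));

    // Structured bindings instead of std::tie at the call site;
    // get() moves the struct out of the future.
    auto [density, pressure] = f.get();

    std::cout << density.size() << " " << pressure.size() << "\n";
    return 0;
}
```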
gonidelis has quit [Remote host closed the connection]
hkaiser has joined #ste||ar
<zao>
I've got a legacy loop that I'd like to HPX-ify, which accesses a lot of arrays with the loop index. Do I do something like hpx::parallel::for_each with some sort of counting iterator, or is there a better way to invoke a function in parallel for each of [0..n)?
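A sketch of one way to do this with HPX's for_loop, which runs a callable over [0..n) directly so no counting iterator is needed; exact headers and namespaces vary between HPX versions, the spelling below is the hpx::parallel one from the 1.x era.

```cpp
// Sketch: parallelize an index-based legacy loop with hpx::parallel::for_loop.
#include <hpx/hpx_main.hpp>
#include <hpx/include/parallel_for_loop.hpp>

#include <cstddef>
#include <vector>

int main()
{
    std::size_t const n = 1000000;
    std::vector<double> a(n, 1.0), b(n, 2.0), c(n);

    hpx::parallel::for_loop(hpx::parallel::execution::par,
        std::size_t(0), n,
        [&](std::size_t i) {
            // body of the legacy loop, indexing into the arrays as before
            c[i] = a[i] + b[i];
        });

    return 0;
}
```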
<nikunj97>
the code I run is the following: `hwloc-bind "NUMANode:0-3.cores:0-$((${core}/4 - 1))" ./stream.sh 10`
<nikunj97>
stream array size is 150M
<nikunj97>
./stream.sh 10 runs stream benchmark 10 times and returns the maximum bandwidth amongst those 10 runs
<nikunj97>
heller1, is this a strange behavior or can we explain such a behavior?
<heller1>
This looks like dynamic frequency scaling?
<nikunj97>
I do not play with frequency
<nikunj97>
afaik, frequency is kept at base clock
<heller1>
You don't
<heller1>
What about the system?
<nikunj97>
the system should also be kept at the same frequency
<nikunj97>
I have to ask about this
<heller1>
cpufreq something
<nikunj97>
cpupower is not installed on the node I'm using unfortunately
<nikunj97>
I've emailed the prof asking if the CPU clock speed is kept at a constant or not
<heller1>
cpufreq?
<nikunj97>
cpufreq isn't installed either
<heller1>
You can also query the stuff on the sys filesystem
<nikunj97>
there's no cpuinfo_cur_freq in /sys/devices/system/cpu/cpu*/cpufreq/
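Assuming a Linux node, the per-core frequency can usually still be read straight out of sysfs even without cpupower/cpufreq installed; `scaling_cur_freq` (in kHz) is typically world-readable where `cpuinfo_cur_freq` is root-only or absent. A minimal sketch:

```cpp
// Sketch: read the per-core frequency directly from sysfs on Linux.
#include <fstream>
#include <iostream>
#include <string>

int main()
{
    for (int cpu = 0;; ++cpu)
    {
        std::string path = "/sys/devices/system/cpu/cpu" +
            std::to_string(cpu) + "/cpufreq/scaling_cur_freq";

        std::ifstream in(path);
        if (!in)
            break;    // no more CPUs (or no cpufreq support at all)

        long khz = 0;
        in >> khz;
        std::cout << "cpu" << cpu << ": " << khz / 1000.0 << " MHz\n";
    }
    return 0;
}
```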
<nikunj97>
zao, why does the generating phase of cmake take so long on some architectures?
<zao>
On some platforms it's expensive to create processes or run the compiler.
<zao>
In particular, checking out Intel licenses tends to take time.
<nikunj97>
what does the generating phase do btw?
<zao>
Other slowdowns can be due to using a network file system for things, which adds latency to metadata operations.
<nikunj97>
heller1, with hwloc-bind cores:0-9.PU:0 the stream results show a peak bandwidth of 42Gb/s, while with HPX my results are around 4500 MLUPS, which is about 1.7 times the performance expected at that bandwidth
<nikunj97>
however, the results are as expected if I run stream with OMP_NUM_THREADS instead of using hwloc-bind
<nikunj97>
heller1, what could be the reason behind it?
<nikunj97>
this is the command I run, where ${core} comes from a for loop
<hkaiser>
Amy1: optimize in what sense?
<Amy1>
improve the code performance.
<diehlpk_work>
hkaiser, see pm
<nikunj97>
Amy1, 2 quick suggestions: swap the loops on line 6 and line 7 - that will give you cache benefits. Furthermore, the loops look embarrassingly parallel, so they can be parallelized.
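Amy1's code isn't in the log, so the following is a hypothetical illustration of the loop-order point with made-up arrays: for row-major C/C++ arrays, the inner loop should run over the last (contiguous) index.

```cpp
// Hypothetical illustration of the loop-interchange advice, not Amy1's code.
// C/C++ 2D arrays are row-major: a[i][j] and a[i][j+1] are adjacent in
// memory, so the innermost loop should run over the last index.
constexpr int N = 1024;
static double a[N][N], b[N][N];

void column_order()              // slow: inner loop strides by N doubles,
{                                // every access lands on a new cache line
    for (int j = 0; j < N; ++j)
        for (int i = 0; i < N; ++i)
            a[i][j] = 2.0 * b[i][j];
}

void row_order()                 // fast: loops swapped, consecutive j values
{                                // reuse the same cache line; also trivially
    for (int i = 0; i < N; ++i)  // parallel over the outer i loop
        for (int j = 0; j < N; ++j)
            a[i][j] = 2.0 * b[i][j];
}

int main()
{
    column_order();
    row_order();
    return 0;
}
```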
<zao>
Do you really need the dense `delta_pos` array, or could you get away with something less?
<zao>
Like the distance to grid cells or other reference points or something.
<zao>
nikunj97: Amusingly enough, this is pretty spot on the shape of loop I'm trying to apply HPX to in my codebase.
<zao>
Finding it hard to beat plain old OpenMP reductions on a single node.
<nikunj97>
zao, the one Amy1 shared?
<zao>
aye
<zao>
(mine's worse tho, as there are sums made over the second nesting)
<zao>
n=10M
<nikunj97>
iirc array indexing works like arr[y][x] and not arr[x][y], where x,y are points on the Cartesian plane with x and y components. Also, if you're working with contiguous memory it's easier to make it cache-optimized. Is there something wrong with what I'm saying?
weilewei has joined #ste||ar
<zao>
I'm a bit surprised that there's three separate arrays for the three coordinates.
<Amy1>
Are there IRC channels about program optimization?
<zao>
So used to them being interleaved, coming from graphics.
<zao>
As nikunj says, the primary factor here if you're not changing anything algorithmically would be to consider data layout and loop order.
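A generic illustration of the two layouts under discussion (made-up types, not Amy1's code): structure-of-arrays keeps X/Y/Z in disjoint blocks and vectorizes well, while the interleaved array-of-structures layout keeps one point's coordinates in the same cache line. Neither is universally better; it depends on the access pattern.

```cpp
// Generic sketch of the two data layouts, with hypothetical names.
#include <cstddef>
#include <vector>

// Structure-of-arrays: three disjoint blocks of memory.
struct PositionsSoA
{
    std::vector<double> x, y, z;
};

// Array-of-structures: coordinates interleaved, as graphics code often does.
struct Pos
{
    double x, y, z;
};
using PositionsAoS = std::vector<Pos>;

double sum_x_soa(PositionsSoA const& p)
{
    double s = 0.0;
    for (std::size_t i = 0; i < p.x.size(); ++i)
        s += p.x[i];    // unit stride, SIMD-friendly
    return s;
}

double sum_x_aos(PositionsAoS const& p)
{
    double s = 0.0;
    for (std::size_t i = 0; i < p.size(); ++i)
        s += p[i].x;    // stride of sizeof(Pos), but y/z ride along in cache
    return s;
}
```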
Amy1 has quit [Quit: WeeChat 2.2]
<zao>
...
Amy1 has joined #ste||ar
<nikunj97>
Amy1, if you can't change the algorithm itself, then they're the only 2 options for optimization I can think of
<Amy1>
Are there IRC channels about program optimization?
<Amy1>
nikunj97: which optimization?
<nikunj97>
swapping the loops on lines 6 and 7 for cache benefits, and parallelizing the loops since they're embarrassingly parallel
<nikunj97>
cache benefits with contiguous memory in such cases can give about a 10x boost in performance. Never tried it myself, but the guy from CppCon says so.
<nikunj97>
heller1, quick question: why are we considering the stream triad benchmark for memory bandwidth and not the other 3 mentioned?
<nikunj97>
if we take the results from the stream copy benchmark, my results look like the expected behavior
<Amy1>
nikunj97: only a 1.2x speedup.
<Amy1>
but the simd width is 4.
<nikunj97>
Amy1, can you show me the code again?
<nikunj97>
Amy1, I don't think simd will help in your case. It looks like a memory-bound problem.
<nikunj97>
Amy1, just realized that you cannot work with more than 1 array's contiguous memory. Your loop structure right now is inefficient.
<Amy1>
nikunj97: it seems reasonable. So how do I optimize it?
<nikunj97>
Amy1, I don't see any other optimizations really, sorry
<nikunj97>
you may have to change the algorithm itself to gain a significant speed-up
<zao>
Not sure if you saw my previous message as you quit somehow.
<Amy1>
zao: sorry, I didn't see it
<zao>
20:26 <zao> As nikunj says, the primary factor here if you're not changing anything algorithmically would be to consider data layout and loop order.
<zao>
Which in essence implies keeping your caches happy.
<Amy1>
zao: you can do any optimization, including changing the algorithm
<zao>
Going to disjoint blocks of memory for X/Y/Z may be sub-optimal.
<zao>
It could help with vectorization, of course, depending on whether you actually do something like that.
<zao>
It's hard to reason about something like this without knowing anything about why the input and output data is shaped like it is.
<hkaiser>
weilewei: first I'd suggest you try to understand what the code does and what the threading interface is needed for
<hkaiser>
weilewei: could be that some of that still needs to use pthreads, not sure - depends on what it's used for
<weilewei>
hkaiser I see, yea, I am trying to understand the code; it looks like pthreads is the core part
<hkaiser>
yah, could very well be
<weilewei>
does hpx have a direct function for creating an hpx thread? I only use async...
<weilewei>
I am sure hpx has one, but where is the hpx interface?
<weilewei>
hkaiser ^^
<weilewei>
just calling hpx::thread ?
<hkaiser>
yes
<weilewei>
thanks, let me look into code and understand it more
nikunj97 has quit [Ping timeout: 260 seconds]
<Yorlik>
Noob question: I made a custom depth-first forward iterator for a tree and want to access the internal state of the iterator inside the loop (a recursion level variable). But I am getting only the initial state on every iteration. Is there a way to fix that?
<Yorlik>
To me it looks as if the iterator I am passing is copied and the copy is being used for the loop.
<Yorlik>
OK fixed. Rewrote my for to "for ( auto e = it.begin(); e != it.end(); ++e )" :D Thank you Rubberducky!
* zao
quacks
<Yorlik>
:)
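A minimal sketch of the pitfall Yorlik hit, with a hypothetical `Tree`/`iterator` pair (not his code): a range-based for advances its own iterator obtained from begin(), so state read off a separate iterator object never changes; the explicit loop form exposes the loop iterator itself.

```cpp
// Sketch: why the explicit iterator loop lets you see the iterator's state.
#include <iostream>

struct Tree
{
    struct iterator
    {
        int depth;    // internal state we want to observe on each step
        int pos;

        int operator*() const { return pos; }
        iterator& operator++() { ++pos; ++depth; return *this; }
        bool operator!=(iterator const& other) const { return pos != other.pos; }
    };

    int size = 5;
    iterator begin() const { return {0, 0}; }
    iterator end() const { return {0, size}; }
};

int main()
{
    Tree tree;

    // Explicit loop: 'it' is the real loop iterator, so it.depth is usable.
    for (auto it = tree.begin(); it != tree.end(); ++it)
        std::cout << *it << " at recursion level " << it.depth << '\n';

    // A range-for ("for (int v : tree)") only exposes the dereferenced
    // value; the iterator advancing behind the scenes is inaccessible.
    return 0;
}
```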
<Yorlik>
Measured 76 ns/item on a 1,000,000-item quadtree. Not sure if that's fast or slow though.
<hkaiser>
that's very fast - too fast
<Yorlik>
Measurement error?
<hkaiser>
ahh, per item - seems to be ok
<Yorlik>
I put a volatile assignment in the loop so it doesn't get optimized away.
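A sketch of that timing pattern, assuming a simple std::vector traversal in place of the quadtree: a volatile sink gives the loop an observable side effect so the compiler cannot elide it, and the elapsed time is divided by the item count.

```cpp
// Sketch: time a traversal and report ns/item; the volatile sink keeps the
// loop from being optimized away.
#include <chrono>
#include <cstddef>
#include <iostream>
#include <numeric>
#include <vector>

int main()
{
    std::size_t const n = 1000000;
    std::vector<int> items(n);
    std::iota(items.begin(), items.end(), 0);

    volatile int sink = 0;    // the volatile assignment mentioned in the chat

    auto t0 = std::chrono::steady_clock::now();
    for (int v : items)
        sink = v;             // observable side effect on every iteration
    auto t1 = std::chrono::steady_clock::now();

    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0);
    std::cout << static_cast<double>(ns.count()) / n << " ns/item\n";
    return 0;
}
```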
<zao>
Yorlik: What was that thing that let you chunk iterations yourself?
<Yorlik>
Like statically or the autochunker?
<zao>
I don't remember the divisions, but something that gave you ranges you got to process yourself, instead of invoking something for every single element.
<Yorlik>
Oh this - I'd have to look it up. hkaiser mentioned it
<Yorlik>
I'm not currently using it.
<zao>
ah, don't worry about it.
<Yorlik>
OK
<Yorlik>
12us now with a small tree - I think I am measuring cache effects with the large trees
<hkaiser>
zao: use the bulk_async_execute executor customization point
<hkaiser>
for_loop and friends are just fancy wrappers around that anyway
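A sketch of driving chunked work through that customization point, using a parallel_executor and an explicit vector of chunk indices as the shape; header and namespace spellings follow the HPX 1.x era and may differ in other versions.

```cpp
// Sketch: process explicit chunks of a range via bulk_async_execute,
// so each invocation handles a whole sub-range rather than one element.
#include <hpx/hpx_main.hpp>
#include <hpx/include/parallel_executors.hpp>
#include <hpx/include/lcos.hpp>

#include <cstddef>
#include <vector>

int main()
{
    std::size_t const n = 1000000;
    std::size_t const num_chunks = 16;
    std::vector<double> data(n, 1.0);

    // One shape element per chunk.
    std::vector<std::size_t> chunks(num_chunks);
    for (std::size_t c = 0; c != num_chunks; ++c)
        chunks[c] = c;

    hpx::parallel::execution::parallel_executor exec;

    auto futures = hpx::parallel::execution::bulk_async_execute(exec,
        [&](std::size_t c) {
            std::size_t const begin = c * n / num_chunks;
            std::size_t const end = (c + 1) * n / num_chunks;
            for (std::size_t i = begin; i != end; ++i)
                data[i] *= 2.0;    // the chunk's share of the work
        },
        chunks);

    hpx::wait_all(futures);
    return 0;
}
```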
<Yorlik>
Holy cow: 9 ns/item on 10,000,000 items, 15500 ns on a 100-item tree. There is no amortized creation cost in it; it's really just the loop. I guess that's the power of the cache.
<Yorlik>
FFS artifact - lol. Damned bugs.
<zao>
"Done with density kernel calculations in 75.08"
<zao>
Assuming the code actually does what it should, this is nice.
hkaiser has quit [Read error: Connection reset by peer]
<Yorlik>
LOL - my bug was a feature I had forgotten about: I had set a maximum tree depth parameter to a low value, so leaves got pooled, which made it appear faster. Times are much slower now, ~9-12 microseconds / item - too slow I think. But there is an interesting observation: it seems the number of nodes in the quadtree for randomly distributed items is around 2.43 times the number of items. Is that a known relation? Is there
<Yorlik>
some theory for it? I think it can surely be explained statistically.