hkaiser changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/ | GSoC: https://github.com/STEllAR-GROUP/hpx/wiki/Google-Summer-of-Code-%28GSoC%29-2020
<hkaiser> zao: heh
<zao> I've got a feeling I've got to dive into dataflow soon.
<zao> I've got a codebase that has a lot of input and output arrays in void functions.
<zao> I can move most of the output arrays into a struct or tuple that I return as I hpx::async the function, but there's a lot of tie and unpacking going on.
<zao> A big messy main function to untangle here, and I'm not even at the simulation part yet.
<zao> Now time to sleep before I break this code more :D
akheir has quit [Quit: Leaving]
hkaiser has quit [Quit: bye]
kale_ has joined #ste||ar
kale_ has quit [Client Quit]
nikunj97 has joined #ste||ar
gonidelis has joined #ste||ar
<zao> Right now I return a future of a struct with all the output arguments, but there's a lot of moving out of that after the call site.
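A minimal sketch of the pattern zao describes (struct, field, and function names are hypothetical): bundle the legacy out-parameters in a struct, return it from the task, and unpack with structured bindings, which move the arrays out rather than copying them.

```cpp
#include <hpx/include/async.hpp>

#include <functional>
#include <vector>

// Hypothetical bundle of what used to be out-parameters of a void function.
struct step_outputs
{
    std::vector<double> density;
    std::vector<double> pressure;
};

step_outputs compute_step(std::vector<double> const& input)
{
    step_outputs out;
    // ... fill out.density and out.pressure from input ...
    return out;
}

void caller(std::vector<double> const& input)
{
    hpx::future<step_outputs> f = hpx::async(compute_step, std::cref(input));

    // Structured bindings move the members out of the returned struct,
    // avoiding the repeated std::tie/std::get unpacking.
    auto [density, pressure] = f.get();
    // ... use density and pressure ...
}
```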
gonidelis has quit [Remote host closed the connection]
hkaiser has joined #ste||ar
<zao> I've got a legacy loop that I'd like to HPX-ify, which accesses a lot of arrays with the loop index. Do I do something like hpx::parallel::for_each with some sort of counting iterator, or is there a better way to invoke a function in parallel for each of [0..n)?
<heller1> for_loop
<zao> oooh
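A minimal sketch of heller1's suggestion, assuming the HPX 1.4-era header and namespace spelling (the arrays are stand-ins for the legacy code):

```cpp
#include <hpx/include/parallel_for_loop.hpp>

#include <cstddef>
#include <vector>

// Parallel equivalent of: for (std::size_t i = 0; i != n; ++i) ...
// No counting iterator needed; the body receives the index directly.
void add_arrays(std::vector<double> const& a, std::vector<double> const& b,
    std::vector<double>& out)
{
    hpx::parallel::for_loop(hpx::parallel::execution::par,
        std::size_t(0), out.size(),
        [&](std::size_t i) { out[i] = a[i] + b[i]; });
}
```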
<nikunj97> heller1, this may help us understand why we're not getting higher performance: https://gist.github.com/NK-Nikunj/f9cce956d7e29da80f0f577e866f2df3
<nikunj97> the command I run is the following: `hwloc-bind "NUMANode:0-3.cores:0-$((${core}/4 - 1))" ./stream.sh 10`
<nikunj97> stream array size is 150M
<nikunj97> ./stream.sh 10 runs stream benchmark 10 times and returns the maximum bandwidth amongst those 10 runs
<nikunj97> heller1, is this a strange behavior or can we explain such a behavior?
<heller1> This looks like dynamic frequency scaling?
<nikunj97> I do not play with frequency
<nikunj97> afaik, frequency is kept at base clock
<heller1> You don't
<heller1> What about the system?
<nikunj97> the system should also be kept at the same frequency
<nikunj97> I have to ask about this
<heller1> cpufreq something
<nikunj97> cpupower is not installed on the node I'm using unfortunately
<nikunj97> I've emailed the prof asking if the CPU clock speed is kept at a constant or not
<heller1> cpufreq?
<nikunj97> cpufreq isn't installed as well
<heller1> You can also query the stuff on the sys filesystem
<nikunj97> there's no cpuinfo_cur_freq in /sys/devices/system/cpu/cpu*/cpufreq/
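For what it's worth, a small self-contained sketch of querying the sysfs interface heller1 mentions: scaling_cur_freq is often world-readable even on nodes where cpuinfo_cur_freq (which usually needs root) is missing.

```cpp
#include <fstream>
#include <iostream>
#include <string>

int main()
{
    // Current frequency of cpu0 as reported by the cpufreq driver, in kHz.
    std::ifstream f("/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq");
    std::string khz;
    if (f >> khz)
        std::cout << "cpu0: " << khz << " kHz\n";
    else
        std::cout << "no cpufreq information exposed on this node\n";
}
```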
<nikunj97> zao, why does it take so long for the generating phase of cmake on some architectures?
<zao> On some platforms it's expensive to create processes or run the compiler.
<zao> In particular, checking out Intel licenses tends to take time.
<nikunj97> what does the generating phase do btw?
<zao> Other slowdowns can be due to using a network file system, which adds latency to metadata operations.
<nikunj97> heller1, hwloc-bind cores:0-9.PU:0 with stream shows a peak bandwidth of 42 GB/s, while with HPX my results are around 4500 MLUPS, which is about 1.7 times the performance expected at that bandwidth
<nikunj97> however, the results match expectations if I run stream with OMP_NUM_THREADS instead of using hwloc-bind
<nikunj97> heller1, what could be the reason behind it?
Amy1 has quit [Quit: WeeChat 2.2]
Amy1 has joined #ste||ar
Amy1 has quit [Client Quit]
Amy1 has joined #ste||ar
diehlpk_work has joined #ste||ar
<Amy1> how to optimize this code?
<heller1> nikunj97: you set the number of threads in one case but not in the other?
<nikunj97> I set the number of threads every time I run
<nikunj97> hwloc-bind "cores:0-$((${core}-1)).PU:0" "${ROOT}"/stencil_parallel --Nx=8192 --Ny=131072 --hpx:use-process-mask
<nikunj97> this is the command I run, where ${core} comes from a for loop
<hkaiser> Amy1: optimize in what sense?
<Amy1> improve the code performance.
<diehlpk_work> hkaiser, see pm
<nikunj97> Amy1, 2 quick suggestions: switch the loops on lines 6 and 7 - that will give you cache benefits. Furthermore, the loops look embarrassingly parallel, so they can be parallelized.
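Illustrative only, since Amy1's pasted code isn't in the log (the array names and bounds below are made up): for row-major C/C++ arrays the innermost loop should walk the last index so that accesses stay contiguous and each cache line is fully used.

```cpp
constexpr int ny = 1024;
constexpr int nx = 1024;

// Hypothetical row-major [ny][nx] arrays: iterating x in the inner loop
// walks contiguous memory; iterating y innermost would stride by nx
// doubles per access and miss the cache far more often.
void add(double (&out)[ny][nx], double const (&a)[ny][nx],
    double const (&b)[ny][nx])
{
    for (int y = 0; y < ny; ++y)      // outer: slow-varying index
        for (int x = 0; x < nx; ++x)  // inner: contiguous in memory
            out[y][x] = a[y][x] + b[y][x];
}
```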
<zao> Do you really need the dense `delta_pos` array, or could it be possible that you could get away with something less?
<zao> Like the distance to grid cells or other reference points or something.
<zao> nikunj97: Amusingly enough, this is pretty much exactly the shape of loop I'm trying to apply HPX to in my codebase.
<zao> Finding it hard to beat plain old OpenMP reduces on single-node.
<nikunj97> zao, the one Amy1 shared?
<zao> aye
<zao> (mine's worse tho, as there's sums made over the second nesting)
<zao> n=10M
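Not from the log, but since zao mentions OpenMP-style reductions: the closest HPX counterpart is probably transform_reduce (HPX 1.4-era spelling; the per-element kernel is a stand-in).

```cpp
#include <hpx/include/parallel_transform_reduce.hpp>

#include <functional>
#include <vector>

// Parallel sum over a transformed range, roughly the HPX analogue of
// #pragma omp parallel for reduction(+:sum).
double sum_of_squares(std::vector<double> const& data)
{
    return hpx::parallel::transform_reduce(
        hpx::parallel::execution::par,
        data.begin(), data.end(),
        0.0, std::plus<double>{},
        [](double v) { return v * v; });  // stand-in per-element kernel
}
```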
<nikunj97> iirc array indexing works like arr[y][x] and not arr[x][y], where x, y are points on the Cartesian plane. Also, if you're working with contiguous memory it's easier to make it cache-optimized. Is there something wrong in what I'm saying?
weilewei has joined #ste||ar
<zao> I'm a bit surprised that there's three separate arrays for the three coordinates.
<Amy1> Are there IRC channels about program optimization?
<zao> I'm so used to them being interleaved, coming from graphics.
<zao> As nikunj says, the primary factor here if you're not changing anything algorithmically would be to consider data layout and loop order.
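The two layouts under discussion, side by side (the particle naming is hypothetical):

```cpp
#include <vector>

// Three separate arrays (structure-of-arrays), as in the pasted code:
// SIMD-friendly, but each particle's coordinates live in three distant
// memory regions.
struct particles_soa
{
    std::vector<double> x, y, z;
};

// Interleaved (array-of-structures), the graphics-style layout zao is
// used to: one cache line brings in all coordinates of a particle.
struct vec3
{
    double x, y, z;
};
using particles_aos = std::vector<vec3>;
```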
Amy1 has quit [Quit: WeeChat 2.2]
<zao> ...
Amy1 has joined #ste||ar
<nikunj97> Amy1, if you can't change the algorithm itself, then those are the only 2 options for optimization I can think of
<Amy1> Are there IRC channels about program optimization?
<Amy1> nikunj97: which optimization?
<nikunj97> switching the loops on lines 6 and 7 for cache benefits, and parallelizing the loops since they're embarrassingly parallel
<nikunj97> cache benefits with contiguous memory in such cases can give about a 10x boost in performance. Never tried it myself, but the speaker at CppCon says so...
<nikunj97> heller1, quick question: why are we considering the stream triad benchmark for memory bandwidth and not the other 3 mentioned?
<nikunj97> if we take results from the stream copy benchmark, my results look like expected behavior
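For reference, the standard STREAM kernels (not from the log): triad moves the most data per iteration, two loads plus a store, and includes a multiply-add, which makes it closest to real compute loops and the usual headline figure, while copy is the lightest; so the four kernels can legitimately report different bandwidths.

```cpp
// The four canonical STREAM kernels over large double arrays a, b, c.
void stream_kernels(double* a, double* b, double* c, long n, double scalar)
{
    for (long i = 0; i < n; ++i) c[i] = a[i];                  // copy:  2 words/iter
    for (long i = 0; i < n; ++i) b[i] = scalar * c[i];         // scale: 2 words/iter
    for (long i = 0; i < n; ++i) c[i] = a[i] + b[i];           // add:   3 words/iter
    for (long i = 0; i < n; ++i) a[i] = b[i] + scalar * c[i];  // triad: 3 words/iter
}
```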
<Amy1> nikunj97: only 1.2 speedup.
<Amy1> but simd width is 4.
<nikunj97> Amy1, can you show me the code again?
<nikunj97> Amy1, I don't think simd will help in your case. It looks like a memory bound problem.
<Amy1> Nbody is 4*1024 and Ndim is 3.
<Amy1> Yeah, it's a memory bound.
<nikunj97> Amy1, a speedup of 1.2 with a memory-bound problem is expected, since you're not compute-bound
<nikunj97> Amy1, I don't see a change in loop structure in your new code
<nikunj97> also are you working with simd in the code you share?
<Amy1> nikunj97: you can see this.
Amy1 has quit [Quit: WeeChat 2.2]
Amy1 has joined #ste||ar
<nikunj97> Amy1, just realized that you can't access more than one array's memory contiguously at a time. Your loop structure right now is inefficient.
<Amy1> nikunj97: it seems reasonable. so how to optimize it?
<nikunj97> Amy1, I don't see any other optimizations really, sorry
<nikunj97> you may have to change the algorithm itself to gain significant speed up
<zao> Not sure if you saw my previous message as you quit somehow.
<Amy1> zao: sorry, I don't see
<zao> 20:26 <zao> As nikunj says, the primary factor here if you're not changing anything algorithmically would be to consider data layout and loop order.
<zao> Which in essence implies keeping your caches happy.
<Amy1> zao: you can do any optimization, including algorithmic ones
<zao> Going to disjoint blocks of memory for X/Y/Z may be sub-optimal.
<zao> It could help with vectorization, of course, depending on whether you actually do something like that.
<zao> It's hard to reason about something like this without knowing anything about why the input and output data is shaped like it is.
Amy1 has quit [Quit: WeeChat 2.2]
Amy1 has joined #ste||ar
Nikunj__ has joined #ste||ar
nikunj97 has quit [Ping timeout: 260 seconds]
<weilewei> hkaiser i just started looking at LibCDS (concurrent data structure) code, is the main action provide hpx thread interface similar to this interface: https://github.com/khizmax/libcds/blob/master/cds/threading/details/pthread_manager.h
<weilewei> i also noticed msvc interface: https://github.com/khizmax/libcds/blob/master/cds/threading/details/msvc_manager.h, similar to the pthread one
<weilewei> *is the main action to provide ... ?
nikunj97 has joined #ste||ar
Nikunj__ has quit [Ping timeout: 256 seconds]
Nikunj__ has joined #ste||ar
nikunj97 has quit [Ping timeout: 256 seconds]
nikunj97 has joined #ste||ar
Nikunj__ has quit [Ping timeout: 240 seconds]
<hkaiser> weilewei: yah, that
<hkaiser> weilewei: first I'd suggest you try to understand what the code does and what the threading interface is needed for
<hkaiser> weilewei: could be that some of that needs to still use pthreads, not sure - depends on what it's used for
<weilewei> hkaiser I see, yea, I am trying to understand the code; pthread seems to be the core part
<hkaiser> yah, could very well be
<weilewei> does hpx have a direct function for calling an hpx thread? I only use async...
<weilewei> I am sure hpx has one, but where is the hpx interface?
<weilewei> hkaiser ^^
<weilewei> just calling hpx::thread ?
<hkaiser> yes
<weilewei> thanks, let me look into code and understand it more
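A minimal sketch of what hkaiser confirms: hpx::thread intentionally mirrors the std::thread interface, so a pthread-style create/join manager maps onto it directly (the exact header spelling varies a bit across HPX versions).

```cpp
#include <hpx/hpx_init.hpp>
#include <hpx/include/threads.hpp>

#include <iostream>

int hpx_main()
{
    // Spawn/join shape analogous to pthread_create/pthread_join,
    // but the body runs on an HPX worker thread.
    hpx::thread t([] { std::cout << "hello from an HPX thread\n"; });
    t.join();
    return hpx::finalize();
}

int main(int argc, char* argv[])
{
    return hpx::init(argc, argv);
}
```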
nikunj97 has quit [Ping timeout: 260 seconds]
<Yorlik> Noob Question: I made a custom depth first forward iterator for a tree and want to access the internal state of the iterator inside the loop (recursion level variable). But I am getting only the initial state on every iteration. Is there a way to fix that?
<Yorlik> To me it looks as if the iterator I am passing is copied and a copy is being used for the loop.
<Yorlik> OK fixed. Rewrote my for to "for ( auto e = it.begin(); e != it.end(); ++e )" :D Thank you Rubberducky!
* zao quacks
<Yorlik> :)
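A sketch of the fix Yorlik describes (the tree type and the depth accessor are invented names): a range-for only exposes the element, so to read iterator-internal state such as the recursion level, keep the iterator itself in scope.

```cpp
template <typename Tree>
void walk(Tree& tree)
{
    // A range-for (for (auto& v : tree)) hides the iterator; an explicit
    // loop keeps it accessible, including its internal recursion state.
    for (auto it = tree.begin(); it != tree.end(); ++it)
    {
        auto& value = *it;
        auto level = it.depth();  // hypothetical accessor on the iterator
        // ... use value and level ...
    }
}
```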
<Yorlik> Measured 76 ns/item on a 1,000,000-item quadtree. Not sure if that's fast or slow though.
<hkaiser> that's very fast - too fast
<Yorlik> Measurement error?
<hkaiser> ahh, per item - seems to be ok
<Yorlik> Put a volatile assign in the loop so it doesn't get optimized away.
<zao> Yorlik: What was that thing that let you chunk iterations yourself?
<Yorlik> Like statically or the autochunker?
<zao> I don't remember the divisions, but something that gave you ranges you got to process yourself, instead of invoking something for every single element.
<Yorlik> Oh this - I'd have to look it up. hkaiser mentioned it
<Yorlik> I'm not currently using it.
<zao> ah, don't worry about it.
<Yorlik> OK
<Yorlik> 12 us now with a small tree - I think I am measuring cache effects with the large trees
<hkaiser> zao: use the bulk_async_execute executor customization point
<hkaiser> for_loop and friends are just fancy wrappers around that anyway
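A hedged sketch of that suggestion, with HPX 1.4-era namespaces (the chunking scheme and process() are made up): hand whole index ranges to the executor and iterate each range yourself, so you get one task per chunk rather than per element.

```cpp
#include <hpx/include/lcos.hpp>
#include <hpx/include/parallel_executors.hpp>

#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

void process(std::size_t i);  // hypothetical per-element work

void run_chunked(std::size_t n, std::size_t chunk_size)
{
    std::size_t const num_chunks = (n + chunk_size - 1) / chunk_size;
    std::vector<std::size_t> chunks(num_chunks);
    std::iota(chunks.begin(), chunks.end(), std::size_t(0));

    hpx::parallel::execution::parallel_executor exec;

    // One invocation (and one future) per chunk of the index space.
    auto futures = hpx::parallel::execution::bulk_async_execute(exec,
        [&](std::size_t chunk) {
            std::size_t const begin = chunk * chunk_size;
            std::size_t const end = (std::min)(begin + chunk_size, n);
            for (std::size_t i = begin; i != end; ++i)
                process(i);
        },
        chunks);

    hpx::wait_all(futures);
}
```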
<Yorlik> Holy cow: 9 ns/item on 10,000,000 items, 15500 ns on a 100-item tree. There is no amortized creation cost in it. It's really just the loop. I guess that's the power of the cache.
<Yorlik> FFS artifact - lol. Damned bugs.
<zao> "Done with density kernel calculations in 75.08"
<zao> Assuming the code actually does what it should, this is nice.
hkaiser has quit [Read error: Connection reset by peer]
<Yorlik> LOL - my bug was a feature I had forgotten about: I had set a maximum tree depth parameter to a low value, so leaves got pooled, which made it appear faster. Times are much slower now, ~9-12 microseconds/item - too slow I think. But there is an interesting observation: it seems the number of nodes in the quadtree for randomly distributed items is around 2.43 times the number of items. Is that a known relation? Is there
<Yorlik> some theory for it? I think it can surely be explained statistically.
<Yorlik> Loops: 2424 Diff: 12470 ns/loop ( 1000 items)
<Yorlik> Loops: 24330 Diff: 13243 ns/loop ( 10000 items)
<Yorlik> Loops: 243724 Diff: 11149 ns/loop ( 100000 items)
<Yorlik> Loops: 2440763 Diff: 9825 ns/loop ( 1000000 items)
hkaiser has joined #ste||ar
<Yorlik> hkaiser: Fixed that bug which gave these nice iteration times. Now it's a horrible ~12 microseconds/item :(
<Yorlik> It was a feature kicking in: my limitation of tree depth, so it started pooling items into buckets even over the limit.
<hkaiser> Yorlik: debug or release build?
<Yorlik> Debug.
<hkaiser> never measure perf in Debug builds
* Yorlik goes measuring release
<Yorlik> hkaiser: But I need to justify me optimizing !!!! ;)
<hkaiser> right
<Yorlik> hkaiser: A wee bit better:
<Yorlik> Loops: 2424 Diff: 579 ns/loop ( 1000 items)
<Yorlik> Loops: 243724 Diff: 594 ns/loop ( 100000 items)
<Yorlik> Loops: 2440763 Diff: 487 ns/loop ( 1000000 items)
<Yorlik> Loops: 24330 Diff: 630 ns/loop ( 10000 items)
<Yorlik> Interesting: this constant factor of ~2.4 nodes/item
<Yorlik> (random distribution)
<hkaiser> Yorlik: yah, you're filling your tree evenly
<Yorlik> The good news is: this scales by a constant factor
<Yorlik> Just made the 64-ary tree btw
<Yorlik> Needed minimal corrections
<Yorlik> It's generic now
<Yorlik> Iteration time increases, it seems - still testing
<hkaiser> that quad-tree is filled by a bit more than half
<Yorlik> But nodes per item decrease
<Yorlik> Still working on the sausage.
<hkaiser> log2(8 * fillfactor) == 2.4
<Yorlik> For the 64-ary it seems to go towards 1.7
<hkaiser> makes sense too, I guess
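One back-of-envelope (my assumption, not from the discussion) that reproduces both observed ratios: the internal levels of a k-ary tree add a geometric factor over the leaf count, and the average leaf fill factor f converts leaves to items.

```latex
N_{\mathrm{nodes}} \approx N_{\mathrm{leaves}} \sum_{j \ge 0} k^{-j}
  = \frac{k}{k-1}\, N_{\mathrm{leaves}}, \qquad
N_{\mathrm{leaves}} = \frac{N_{\mathrm{items}}}{f}
\quad\Longrightarrow\quad
\frac{N_{\mathrm{nodes}}}{N_{\mathrm{items}}} \approx \frac{k}{(k-1)\,f}.
% k = 4,  f \approx 0.55:  ratio \approx 2.4  (the quadtree observation)
% k = 64, f \approx 0.60:  ratio \approx 1.7  (the 64-ary observation)
```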
<Yorlik> Argh no
Yorlik has quit [Read error: Connection reset by peer]
Yorlik has joined #ste||ar
<Yorlik> Seems the 64-ary is more memory-hungry - lol - just crashed explorer and some apps
<Yorlik> Loops: 1738 Diff: 3072 ns/loop ( 1000 items)
<Yorlik> Loops: 20946 Diff: 3455 ns/loop ( 10000 items)
<Yorlik> Loops: 228397 Diff: 4023 ns/loop ( 100000 items)
<Yorlik> Loops: 2378127 Diff: 3337 ns/loop ( 1000000 items)
<Yorlik> It's getting expensive
<Yorlik> And this was release
Yorlik has quit [Read error: Connection reset by peer]
Yorlik has joined #ste||ar
Yorlik has quit [Read error: Connection reset by peer]
Yorlik has joined #ste||ar