hkaiser changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/ | GSoC: https://github.com/STEllAR-GROUP/hpx/wiki/Google-Summer-of-Code-%28GSoC%29-2020
<hkaiser> zao: heh
<zao> I've got a feeling I've got to dive into dataflow soon.
<zao> I've got a codebase that has a lot of input and output arrays in void functions.
<zao> I can move most of the output arrays into a struct or tuple that I return as I hpx::async the function, but there's a lot of tie and unpacking going on.
<zao> A big messy main function to untangle here, and I'm not even at the simulation part yet.
<zao> Now time to sleep before I break this code more :D
akheir has quit [Quit: Leaving]
hkaiser has quit [Quit: bye]
kale_ has joined #ste||ar
kale_ has quit [Client Quit]
nikunj97 has joined #ste||ar
gonidelis has joined #ste||ar
<zao> Right now I return a future of a struct with all the output arguments, but there's a lot of moving out of that after the call site.
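A minimal sketch of the pattern zao describes (struct, field, and function names are hypothetical): bundle the legacy out-parameters in a struct, return it from the task, and unpack with structured bindings, which move the arrays out rather than copying them.

```cpp
#include <hpx/include/async.hpp>

#include <functional>
#include <vector>

// Hypothetical bundle of what used to be out-parameters of a void function.
struct step_outputs
{
    std::vector<double> density;
    std::vector<double> pressure;
};

step_outputs compute_step(std::vector<double> const& input)
{
    step_outputs out;
    // ... fill out.density and out.pressure from input ...
    return out;
}

void caller(std::vector<double> const& input)
{
    hpx::future<step_outputs> f = hpx::async(compute_step, std::cref(input));

    // Structured bindings move the members out of the returned struct,
    // avoiding the repeated std::tie/std::get unpacking.
    auto [density, pressure] = f.get();
    // ... use density and pressure ...
}
```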
gonidelis has quit [Remote host closed the connection]
hkaiser has joined #ste||ar
<zao> I've got a legacy loop that I'd like to HPX-ify, which accesses a lot of arrays with the loop index. Do I do something like hpx::parallel::for_each with some sort of counting iterator, or is there a better way to invoke a function in parallel for each of [0..n)?
<heller1> for_loop
<zao> oooh
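A minimal sketch of heller1's suggestion, assuming the HPX 1.4-era header and namespace spelling (the arrays are stand-ins for the legacy code):

```cpp
#include <hpx/include/parallel_for_loop.hpp>

#include <cstddef>
#include <vector>

// Parallel equivalent of: for (std::size_t i = 0; i != n; ++i) ...
// No counting iterator needed; the body receives the index directly.
void add_arrays(std::vector<double> const& a, std::vector<double> const& b,
    std::vector<double>& out)
{
    hpx::parallel::for_loop(hpx::parallel::execution::par,
        std::size_t(0), out.size(),
        [&](std::size_t i) { out[i] = a[i] + b[i]; });
}
```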
<nikunj97> heller1, this may help us understand why we're not getting higher performance: https://gist.github.com/NK-Nikunj/f9cce956d7e29da80f0f577e866f2df3
<nikunj97> the command I run is the following: `hwloc-bind "NUMANode:0-3.cores:0-$((${core}/4 - 1))" ./stream.sh 10`
<nikunj97> stream array size is 150M
<nikunj97> ./stream.sh 10 runs stream benchmark 10 times and returns the maximum bandwidth amongst those 10 runs
<nikunj97> heller1, is this a strange behavior or can we explain such a behavior?
<heller1> This looks like dynamic frequency scaling?
<nikunj97> I do not play with frequency
<nikunj97> afaik, frequency is kept at base clock
<heller1> You don't
<heller1> What about the system?
<nikunj97> the system should also be kept at the same frequency
<nikunj97> I have to ask about this
<heller1> cpufreq something
<nikunj97> cpupower is not installed on the node I'm using unfortunately
<nikunj97> I've emailed the prof asking if the CPU clock speed is kept at a constant or not
<heller1> cpufreq?
<nikunj97> cpufreq isn't installed as well
<heller1> You can also query the stuff on the sys filesystem
<nikunj97> there's no cpuinfo_cur_freq in /sys/devices/system/cpu/cpu*/cpufreq/
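For what it's worth, a small self-contained sketch of querying the sysfs interface heller1 mentions: scaling_cur_freq is often world-readable even on nodes where cpuinfo_cur_freq (which usually needs root) is missing.

```cpp
#include <fstream>
#include <iostream>
#include <string>

int main()
{
    // Current frequency of cpu0 as reported by the cpufreq driver, in kHz.
    std::ifstream f("/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq");
    std::string khz;
    if (f >> khz)
        std::cout << "cpu0: " << khz << " kHz\n";
    else
        std::cout << "no cpufreq information exposed on this node\n";
}
```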
<nikunj97> zao, why does it take so long for the generating phase of cmake on some architectures?
<zao> On some platforms it's expensive to create processes or run the compiler.
<zao> In particular, checking out Intel licenses tends to take time.
<nikunj97> what does the generating phase do btw?
<zao> Other slowdowns can be due to using a network file system, which adds latency to metadata operations.
<nikunj97> heller1, hwloc-bind cores:0-9.PU:0 with stream shows a peak bandwidth of 42 GB/s, while with HPX my results are around 4500 MLUPS, which is about 1.7 times the performance expected at that bandwidth
<nikunj97> however, the results match expectations if I run stream with OMP_NUM_THREADS instead of using hwloc-bind
<nikunj97> heller1, what could be the reason behind it?
Amy1 has quit [Quit: WeeChat 2.2]
Amy1 has joined #ste||ar
Amy1 has quit [Client Quit]
Amy1 has joined #ste||ar
diehlpk_work has joined #ste||ar
<Amy1> how to optimize this code?
<heller1> nikunj97: you set the number of threads in one case but not in the other?
<nikunj97> I set the number of threads every time I run
<nikunj97> hwloc-bind "cores:0-$((${core}-1)).PU:0" "${ROOT}"/stencil_parallel --Nx=8192 --Ny=131072 --hpx:use-process-mask
<nikunj97> this is the command I run, where ${core} comes from a for loop
<hkaiser> Amy1: optimize in what sense?
<Amy1> improve the code performance.
<diehlpk_work> hkaiser, see pm
<nikunj97> Amy1, 2 quick suggestions: switch the loops on lines 6 and 7 - that will give you cache benefits. Furthermore, the loops look embarrassingly parallel, so they can be parallelized.
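Illustrative only, since Amy1's pasted code isn't in the log (the array names and bounds below are made up): for row-major C/C++ arrays the innermost loop should walk the last index so that accesses stay contiguous and each cache line is fully used.

```cpp
constexpr int ny = 1024;
constexpr int nx = 1024;

// Hypothetical row-major [ny][nx] arrays: iterating x in the inner loop
// walks contiguous memory; iterating y innermost would stride by nx
// doubles per access and miss the cache far more often.
void add(double (&out)[ny][nx], double const (&a)[ny][nx],
    double const (&b)[ny][nx])
{
    for (int y = 0; y < ny; ++y)      // outer: slow-varying index
        for (int x = 0; x < nx; ++x)  // inner: contiguous in memory
            out[y][x] = a[y][x] + b[y][x];
}
```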
<zao> Do you really need the dense `delta_pos` array, or could it be possible that you could get away with something less?
<zao> Like the distance to grid cells or other reference points or something.
<zao> nikunj97: Amusingly enough, this is pretty much exactly the shape of loop I'm trying to apply HPX to in my codebase.
<zao> Finding it hard to beat plain old OpenMP reduces on single-node.
<nikunj97> zao, the one Amy1 shared?
<zao> aye
<zao> (mine's worse tho, as there's sums made over the second nesting)
<zao> n=10M
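Not from the log, but since zao mentions OpenMP-style reductions: the closest HPX counterpart is probably transform_reduce (HPX 1.4-era spelling; the per-element kernel is a stand-in).

```cpp
#include <hpx/include/parallel_transform_reduce.hpp>

#include <functional>
#include <vector>

// Parallel sum over a transformed range, roughly the HPX analogue of
// #pragma omp parallel for reduction(+:sum).
double sum_of_squares(std::vector<double> const& data)
{
    return hpx::parallel::transform_reduce(
        hpx::parallel::execution::par,
        data.begin(), data.end(),
        0.0, std::plus<double>{},
        [](double v) { return v * v; });  // stand-in per-element kernel
}
```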
<nikunj97> iirc array indexing works like arr[y][x] and not arr[x][y], where x, y are points on the Cartesian plane. Also, if you're working with contiguous memory it's easier to make it cache-optimized. Is there something wrong in what I'm saying?
weilewei has joined #ste||ar
<zao> I'm a bit surprised that there's three separate arrays for the three coordinates.
<Amy1> Are there IRC channels about program optimization?
<zao> I'm so used to them being interleaved, coming from graphics.
<zao> As nikunj says, the primary factor here if you're not changing anything algorithmically would be to consider data layout and loop order.
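The two layouts under discussion, side by side (the particle naming is hypothetical):

```cpp
#include <vector>

// Three separate arrays (structure-of-arrays), as in the pasted code:
// SIMD-friendly, but each particle's coordinates live in three distant
// memory regions.
struct particles_soa
{
    std::vector<double> x, y, z;
};

// Interleaved (array-of-structures), the graphics-style layout zao is
// used to: one cache line brings in all coordinates of a particle.
struct vec3
{
    double x, y, z;
};
using particles_aos = std::vector<vec3>;
```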
Amy1 has quit [Quit: WeeChat 2.2]
<zao> ...
Amy1 has joined #ste||ar
<nikunj97> Amy1, if you can't change the algorithm itself, then those are the only 2 options for optimization I can think of
<Amy1> Are there IRC channels about program optimization?
<Amy1> nikunj97: which optimization?
<nikunj97> switching the loops on lines 6 and 7 for cache benefits, and parallelizing the loops since they're embarrassingly parallel
<nikunj97> cache benefits with contiguous memory in such cases can give about a 10x boost in performance. Never tried it myself, but the speaker at CppCon says so...
<nikunj97> heller1, quick question: why are we considering the stream triad benchmark for memory bandwidth and not the other 3 mentioned?
<nikunj97> if we take results from the stream copy benchmark, my results look like expected behavior
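For reference, the standard STREAM kernels (not from the log): triad moves the most data per iteration, two loads plus a store, and includes a multiply-add, which makes it closest to real compute loops and the usual headline figure, while copy is the lightest; so the four kernels can legitimately report different bandwidths.

```cpp
// The four canonical STREAM kernels over large double arrays a, b, c.
void stream_kernels(double* a, double* b, double* c, long n, double scalar)
{
    for (long i = 0; i < n; ++i) c[i] = a[i];                  // copy:  2 words/iter
    for (long i = 0; i < n; ++i) b[i] = scalar * c[i];         // scale: 2 words/iter
    for (long i = 0; i < n; ++i) c[i] = a[i] + b[i];           // add:   3 words/iter
    for (long i = 0; i < n; ++i) a[i] = b[i] + scalar * c[i];  // triad: 3 words/iter
}
```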
<Amy1> nikunj97: only 1.2 speedup.
<Amy1> but simd width is 4.
<nikunj97> Amy1, can you show me the code again?
<nikunj97> Amy1, I don't think simd will help in your case. It looks like a memory bound problem.
<Amy1> Nbody is 4*1024 and Ndim is 3.
<Amy1> Yeah, it's a memory bound.
<nikunj97> Amy1, a speedup of 1.2 with a memory-bound problem is expected, since you're not compute-bound
<nikunj97> Amy1, I don't see a change in loop structure in your new code
<nikunj97> also are you working with simd in the code you share?
<Amy1> nikunj97: you can see this.
Amy1 has quit [Quit: WeeChat 2.2]
Amy1 has joined #ste||ar
<nikunj97> Amy1, just realized that you can't access more than one array's memory contiguously at a time. Your loop structure right now is inefficient.
<Amy1> nikunj97: it seems reasonable. so how to optimize it?
<nikunj97> Amy1, I don't see any other optimizations really, sorry
<nikunj97> you may have to change the algorithm itself to gain significant speed up
<zao> Not sure if you saw my previous message as you quit somehow.
<Amy1> zao: sorry, I don't see
<zao> 20:26 <zao> As nikunj says, the primary factor here if you're not changing anything algorithmically would be to consider data layout and loop order.
<zao> Which in essence implies keeping your caches happy.
<Amy1> zao: you can do any optimization, including algorithmic ones
<zao> Going to disjoint blocks of memory for X/Y/Z may be sub-optimal.
<zao> It could help with vectorization, of course, depending on whether you actually do something like that.
<zao> It's hard to reason about something like this without knowing anything about why the input and output data is shaped like it is.
Amy1 has quit [Quit: WeeChat 2.2]
Amy1 has joined #ste||ar
Nikunj__ has joined #ste||ar
nikunj97 has quit [Ping timeout: 260 seconds]
<weilewei> hkaiser i just started looking at LibCDS (concurrent data structure) code, is the main action provide hpx thread interface similar to this interface: https://github.com/khizmax/libcds/blob/master/cds/threading/details/pthread_manager.h
<weilewei> i also noticed msvc interface: https://github.com/khizmax/libcds/blob/master/cds/threading/details/msvc_manager.h, similar to the pthread one
<weilewei> *is the main action to provide ... ?
nikunj97 has joined #ste||ar
Nikunj__ has quit [Ping timeout: 256 seconds]
Nikunj__ has joined #ste||ar
nikunj97 has quit [Ping timeout: 256 seconds]
nikunj97 has joined #ste||ar
Nikunj__ has quit [Ping timeout: 240 seconds]
<hkaiser> weilewei: yah, that
<hkaiser> weilewei: first I'd suggest you try to understand what the code does and what the threading interface is needed for
<hkaiser> weilewei: could be that some of that needs to still use pthreads, not sure - depends on what it's used for
<weilewei> hkaiser I see, yea, I am trying to understand the code; pthread seems to be the core part
<hkaiser> yah, could very well be
<weilewei> does hpx have a direct function for calling an hpx thread? I only use async...
<weilewei> I am sure hpx has one, but where is the hpx interface?
<weilewei> hkaiser ^^
<weilewei> just calling hpx::thread ?
<hkaiser> yes
<weilewei> thanks, let me look into code and understand it more
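A minimal sketch of what hkaiser confirms: hpx::thread intentionally mirrors the std::thread interface, so a pthread-style create/join manager maps onto it directly (the exact header spelling varies a bit across HPX versions).

```cpp
#include <hpx/hpx_init.hpp>
#include <hpx/include/threads.hpp>

#include <iostream>

int hpx_main()
{
    // Spawn/join shape analogous to pthread_create/pthread_join,
    // but the body runs on an HPX worker thread.
    hpx::thread t([] { std::cout << "hello from an HPX thread\n"; });
    t.join();
    return hpx::finalize();
}

int main(int argc, char* argv[])
{
    return hpx::init(argc, argv);
}
```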
nikunj97 has quit [Ping timeout: 260 seconds]
<Yorlik> Noob Question: I made a custom depth first forward iterator for a tree and want to access the internal state of the iterator inside the loop (recursion level variable). But I am getting only the initial state on every iteration. Is there a way to fix that?
<Yorlik> To me it looks as if the iterator I am passing is copied and a copy is being used for the loop.
<Yorlik> OK fixed. Rewrote my for to "for ( auto e = it.begin(); e != it.end(); ++e )" :D Thank you Rubberducky!
* zao quacks
<Yorlik> :)
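A sketch of the fix Yorlik describes (the tree type and the depth accessor are invented names): a range-for only exposes the element, so to read iterator-internal state such as the recursion level, keep the iterator itself in scope.

```cpp
template <typename Tree>
void walk(Tree& tree)
{
    // A range-for (for (auto& v : tree)) hides the iterator; an explicit
    // loop keeps it accessible, including its internal recursion state.
    for (auto it = tree.begin(); it != tree.end(); ++it)
    {
        auto& value = *it;
        auto level = it.depth();  // hypothetical accessor on the iterator
        // ... use value and level ...
    }
}
```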
<Yorlik> Measured 76 ns/item on a 1,000,000-item quadtree. Not sure if that's fast or slow though.
<hkaiser> that's very fast - too fast
<Yorlik> Measurement error?
<hkaiser> ahh, per item - seems to be ok
<Yorlik> Put a volatile assign in the loop so it doesn't get optimized away.
<zao> Yorlik: What was that thing that let you chunk iterations yourself?
<Yorlik> Like statically or the autochunker?
<zao> I don't remember the divisions, but something that gave you ranges you got to process yourself, instead of invoking something for every single element.
<Yorlik> Oh this - I'd have to look it up. hkaiser mentioned it
<Yorlik> I'm not currently using it.
<zao> ah, don't worry about it.
<Yorlik> OK
<Yorlik> 12 us now with a small tree - I think I am measuring cache effects with the large trees
<hkaiser> zao: use the bulk_async_execute executor customization point
<hkaiser> for_loop and friends are just fancy wrappers around that anyway
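A hedged sketch of that suggestion, with HPX 1.4-era namespaces (the chunking scheme and process() are made up): hand whole index ranges to the executor and iterate each range yourself, so you get one task per chunk rather than per element.

```cpp
#include <hpx/include/lcos.hpp>
#include <hpx/include/parallel_executors.hpp>

#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

void process(std::size_t i);  // hypothetical per-element work

void run_chunked(std::size_t n, std::size_t chunk_size)
{
    std::size_t const num_chunks = (n + chunk_size - 1) / chunk_size;
    std::vector<std::size_t> chunks(num_chunks);
    std::iota(chunks.begin(), chunks.end(), std::size_t(0));

    hpx::parallel::execution::parallel_executor exec;

    // One invocation (and one future) per chunk of the index space.
    auto futures = hpx::parallel::execution::bulk_async_execute(exec,
        [&](std::size_t chunk) {
            std::size_t const begin = chunk * chunk_size;
            std::size_t const end = (std::min)(begin + chunk_size, n);
            for (std::size_t i = begin; i != end; ++i)
                process(i);
        },
        chunks);

    hpx::wait_all(futures);
}
```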
<Yorlik> Holy cow: 9 ns/item on 10,000,000 items, 15500 ns on a 100-item tree. There is no amortized creation cost in it. It's really just the loop. I guess that's the power of the cache.
<Yorlik> FFS artifact - lol. Damned bugs.
<zao> "Done with density kernel calculations in 75.08"
<zao> Assuming the code actually does what it should, this is nice.
hkaiser has quit [Read error: Connection reset by peer]
<Yorlik> LOL - my bug was a feature I had forgotten about: I had set a maximum tree depth parameter to a low value, so leaves got pooled, which made it appear faster. Times are much slower now, ~9-12 microseconds/item - too slow I think. But there is an interesting observation: it seems the number of nodes in the quadtree for randomly distributed items is around 2.43 times the number of items. Is that a known relation? Is there
<Yorlik> some theory for it? I think it can surely be explained statistically.
<Yorlik> Loops: 2424 Diff: 12470 ns/loop ( 1000 items)
<Yorlik> Loops: 24330 Diff: 13243 ns/loop ( 10000 items)
<Yorlik> Loops: 243724 Diff: 11149 ns/loop ( 100000 items)
<Yorlik> Loops: 2440763 Diff: 9825 ns/loop ( 1000000 items)
hkaiser has joined #ste||ar
<Yorlik> hkaiser: Fixed that bug which gave these nice iteration times. Now it's a horrible ~12 microseconds/item :(
<Yorlik> It was a feature kicking in: my limitation of tree depth, so it started pooling items into buckets even over the limit.
<hkaiser> Yorlik: debug or release build?
<Yorlik> Debug.
<hkaiser> never measure perf in Debug builds
* Yorlik goes measuring release
<Yorlik> hkaiser: But I need to justify me optimizing !!!! ;)
<hkaiser> right
<Yorlik> hkaiser: A wee bit better:
<Yorlik> Loops: 2424 Diff: 579 ns/loop ( 1000 items)
<Yorlik> Loops: 243724 Diff: 594 ns/loop ( 100000 items)
<Yorlik> Loops: 2440763 Diff: 487 ns/loop ( 1000000 items)
<Yorlik> Loops: 24330 Diff: 630 ns/loop ( 10000 items)
<Yorlik> Interesting: this constant factor of ~2.4 nodes/item
<Yorlik> (random distribution)
<hkaiser> Yorlik: yah, you're filling your tree evenly
<Yorlik> The good news is: this scales by a constant factor
<Yorlik> Just made the 64-ary tree btw
<Yorlik> Needed minimal corrections
<Yorlik> It's generic now
<Yorlik> Iteration time increases, it seems - still testing
<hkaiser> that quad-tree is filled by a bit more than half
<Yorlik> But nodes per item decrease
<Yorlik> Still working on the sausage.
<hkaiser> log2(8 * fillfactor) == 2.4
<Yorlik> For the 64-ary it seems to go towards 1.7
<hkaiser> makes sense too, I guess
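One back-of-envelope (my assumption, not from the discussion) that reproduces both observed ratios: the internal levels of a k-ary tree add a geometric factor over the leaf count, and the average leaf fill factor f converts leaves to items.

```latex
N_{\mathrm{nodes}} \approx N_{\mathrm{leaves}} \sum_{j \ge 0} k^{-j}
  = \frac{k}{k-1}\, N_{\mathrm{leaves}}, \qquad
N_{\mathrm{leaves}} = \frac{N_{\mathrm{items}}}{f}
\quad\Longrightarrow\quad
\frac{N_{\mathrm{nodes}}}{N_{\mathrm{items}}} \approx \frac{k}{(k-1)\,f}.
% k = 4,  f \approx 0.55:  ratio \approx 2.4  (the quadtree observation)
% k = 64, f \approx 0.60:  ratio \approx 1.7  (the 64-ary observation)
```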
<Yorlik> Argh no
Yorlik has quit [Read error: Connection reset by peer]
Yorlik has joined #ste||ar
<Yorlik> Seems the 64-ary is more memory-hungry - lol - just crashed explorer and some apps
<Yorlik> Loops: 1738 Diff: 3072 ns/loop ( 1000 items)
<Yorlik> Loops: 20946 Diff: 3455 ns/loop ( 10000 items)
<Yorlik> Loops: 228397 Diff: 4023 ns/loop ( 100000 items)
<Yorlik> Loops: 2378127 Diff: 3337 ns/loop ( 1000000 items)
<Yorlik> It's getting expensive
<Yorlik> And this was release
Yorlik has quit [Read error: Connection reset by peer]
Yorlik has joined #ste||ar
Yorlik has quit [Read error: Connection reset by peer]
Yorlik has joined #ste||ar