hkaiser changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/ | GSoC: https://github.com/STEllAR-GROUP/hpx/wiki/Google-Summer-of-Code-%28GSoC%29-2020
hkaiser has quit [Quit: bye]
diehlpk_work_ has quit [Remote host closed the connection]
<heller1> That looks pretty fine grain already
mreese3 has joined #ste||ar
maxwellr96 has quit [Ping timeout: 260 seconds]
nikunj97 has joined #ste||ar
nikunj97 has quit [Quit: Leaving]
nikunj97 has joined #ste||ar
nikunj97 has quit [Read error: Connection reset by peer]
nikunj97 has joined #ste||ar
<zao> If I'm writing an HPX application from scratch on a cluster, which version is most likely to actually work: 1.4.1, stable, or master-HEAD?
<zao> A bit tempted to see if I can futurize this thing.
<nikunj97> zao, from my experience I'd say master-HEAD
<nikunj97> 1.4.1 does not have the new block executor
<heller1> 1.4.1 or stable would be my choice
<nikunj97> also there were problems with APEX integration
<nikunj97> that I faced, which recently got fixed in master
<heller1> stable is almost HEAD but with circle having passed
<nikunj97> aah, then stable I'd say
<nikunj97> heller1, I'm finally getting better performance on the Hi1616
<nikunj97> at least near peak bandwidth performance
<heller1> Great, what did you change?
<nikunj97> it was previously using an older arm-hpc-compiler
<nikunj97> plus there was a problem with nsimd that I noticed yesterday
<nikunj97> they've corrected that
<nikunj97> these 2 changes were honestly enough to make the difference
<nikunj97> plus nsimd was previously built with gcc while everything else was arm-hpc-compiler
<nikunj97> I was looking at the HPX side of things while the problem lay somewhere else
<nikunj97> I realized all of this when I was building everything from scratch yesterday for my final benchmarking
<zao> `-fgo-fast-plz`
<nikunj97> heller1, I didn't think this would make this big a difference, but this is what it is ;)
<nikunj97> 64 core performance is still not optimal though, even when the thread-idle rate is about 11% for both 32 and 64 cores. If it's about the same for both, I speculate that the CPU is stalling while memory is being fetched. I'll get STREAM results again and see for myself.
gonidelis has joined #ste||ar
<nikunj97> heller1, I'll run the benchmarks on all platforms now. That will be enough for my lab-based project. After this, I'll work on writing iterators for various stencil traversal algorithms for better performance. Those will be both distributed and futurized
<nikunj97> heller1, ok. looking at the code I realized that it's using the default chunk policy
<nikunj97> that was the major reason for the performance difference
<heller1> Cool!
<heller1> What chunker do you use?
<nikunj97> auto_chunk_size
<nikunj97> and I passed anywhere from 50-5000 as a parameter
<nikunj97> but all of them ended with about 2.4GLUPS, while the default is performing at ~4GLUPS
<heller1> Ok, so the default is better?
<nikunj97> apparently yes
<heller1> Anyway, the 64 core issue could be some other overhead related to some concurrency problem
<heller1> Run it with perf and see where the top 10 hotspots are
<nikunj97> perf is the one you shared yesterday, right?
<heller1> Yes
<heller1> Similar to vtune
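(A typical perf workflow for finding hotspots looks like this; the flags are standard perf options, but `./stencil_bench` is a placeholder for the actual benchmark binary:)

```shell
# Record with call graphs; DWARF unwinding usually gives usable C++ stacks.
perf record --call-graph dwarf -o perf.data ./stencil_bench

# Interactive hotspot listing: -g shows call chains, -n shows sample counts.
perf report -g -n -i perf.data
```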
hkaiser has joined #ste||ar
akheir has joined #ste||ar
<nikunj97> heller1, do I need to build hotspot at my cluster node? It won't allow me to use copied perf.data on a local machine
akheir1 has joined #ste||ar
akheir has quit [Ping timeout: 260 seconds]
<heller1> Why can't you open any perf data locally?
<nikunj97> it opens but with broken stack trace
<nikunj97> ?? everywhere basically
<heller1> Well, yeah
<heller1> You need to tell it where to find the debugging symbols
<nikunj97> how do I do that? they're not available locally
<nikunj97> also it's very slow to load results
<heller1> Yeah, that's difficult
<nikunj97> I switched to perf report -g -n btw
<nikunj97> on the cluster node to check hotspots
<nikunj97> heller1, what I wanted was perf-archive http://man7.org/linux/man-pages/man1/perf-archive.1.html
<nikunj97> this will enable me to use it locally
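(The perf-archive workflow mentioned above goes roughly like this; hostnames and paths are illustrative:)

```shell
# On the cluster node: bundle the debug symbols referenced by perf.data
# into perf.data.tar.bz2 so another machine can resolve the stack traces.
perf archive perf.data

# On the local machine: fetch both files, unpack the symbols into the
# local debug cache (~/.debug), then open the report as usual.
scp cluster:perf.data cluster:perf.data.tar.bz2 .
mkdir -p ~/.debug
tar xf perf.data.tar.bz2 -C ~/.debug
perf report -i perf.data
```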
<nikunj97> heller1, https://imgur.com/a/RNxSqNN
<nikunj97> heller1, how do I analyse the result? Pls help me
<heller1> Compare it with a 32 core run
<heller1> And use a better gui
mcopik has joined #ste||ar
mcopik has quit [Client Quit]
gonidelis has quit [Ping timeout: 240 seconds]
wate123__ has joined #ste||ar
<zao> Whoa, is stable really a good two weeks old?
<zao> Had to check that my Git client wasn't stuck :)
<zao> Hrm no, web claims it's two days ago.
<zao> Ah, it refused to fetch a tag that would clobber :D
<zao> (carry on, nothing to see here)
Nikunj__ has joined #ste||ar
wate123__ has quit [Remote host closed the connection]
nikunj97 has quit [Ping timeout: 260 seconds]
<zao> Oh gods, not again... C:\stellar\hpx\libs\topology\include\hpx\topology\topology.hpp(27): fatal error C1083: Cannot open include file: 'hwloc.h': No such file or directory
<zao> Someone should confiscate my computer license tonight. I hadn't checked out the tag :D
Hashmi has joined #ste||ar
<hkaiser> heller1: hey
<heller1> Hey hkaiser
<hkaiser> heller1: I could use a second pair of eyes looking at #4512, to make sure I have not created any races
<hkaiser> would you have the time for that?
<heller1> Sure
<hkaiser> thanks
<Nikunj__> heller1, one clear difference between 32 and 64 cores is that the scheduler takes 28% in the hotspot result for 64 cores but only 12% for 32 cores
<Nikunj__> I think the reduced performance even with the same idle rates is not due to limited bandwidth but due to contention in the scheduler
<hkaiser> Nikunj__: then the idlerates would be different
<Nikunj__> hkaiser, idle rates were about the same
<Nikunj__> 32 core had 10-11% while 64 core had 12-13%
<Nikunj__> and both 32 and 64 core performance were very similar even though memory bandwidth still increased
<hkaiser> Nikunj__: that means that the scheduler contention is about the same in both cases, no?
<Nikunj__> sure, but both peak performance and peak bandwidth increased in that range while our application didn't scale even with the same idle rates
<Nikunj__> I didn't get why it's behaving like this. I'll test some more with it and see what's causing it.
<Nikunj__> hkaiser, does thread-idle rate not count scheduler invocations etc.?
<heller1> There are some spots the idle rates don't get
<heller1> Have yourself a build without idle rates and try again
<heller1> My guess would be that there is contention in some atomic. The idle rate counter is one of them
<hkaiser> heller1: the idle-rates are non-atomic
<hkaiser> but it could be the timers
<heller1> Hu?
<heller1> Ah right... it's the time-taking itself that the overheads come from
<zao> I'm using Visual Studio's CMake and I think it's getting upset over HPX's CMake config - https://gist.github.com/zao/49a425d97f76ec41fe781bf222745821
<zao> HPXConfig.cmake contains:
<zao> set(HPX_HWLOC_ROOT "C:\local\hwloc-win64-build-2.1.0")
<zao> Which is how I specified HWLOC_ROOT on the command line when configuring HPX.
<zao> Should HPX normalize or escape these paths in some way?
<zao> Same problem when generating with regular CMake on command line too.
<zao> simbergm: Are you involved with these parts?
<zao> If I specify paths with forward slashes when configuring HPX it generates the right thing, but that's unholy and unnatural in CMD.
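(For context, the underlying issue: in CMake strings, a backslash starts an escape sequence, so a raw Windows path written into `HPXConfig.cmake` is mis-parsed. A sketch of the usual fixes — the variable names mirror the snippet above but the exact HPX fix may differ:)

```cmake
# "C:\local\hwloc-win64-build-2.1.0" is parsed by CMake as containing the
# (invalid) escapes \l and \h. Two common remedies:
#
# 1. Pass forward slashes on the command line when configuring:
#      cmake -DHWLOC_ROOT=C:/local/hwloc-win64-build-2.1.0 ..
#
# 2. Normalize a user-supplied native path before embedding it in a
#    generated config file:
file(TO_CMAKE_PATH "C:\\local\\hwloc-win64-build-2.1.0" _hwloc_root)
set(HPX_HWLOC_ROOT "${_hwloc_root}")  # now uses forward slashes
```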
<hkaiser> zao: yah, that's a bug
<hkaiser> zao: care to create a ticket?
<zao> 4513
Hashmi has quit [Quit: Connection closed for inactivity]
<hkaiser> thanks zao!
<Yorlik> o/
<hkaiser> Yorlik: hey
<Yorlik> Heyo !
<hkaiser> talk now?
<Yorlik> Great !
<Yorlik> Same link?
<hkaiser> yep
<Yorlik> For some reason the link doesn't work anymore for me
<hkaiser> sec
<Yorlik> Now I'm in
<Nikunj__> heller1, those numbers were with idle rate off
<Nikunj__> I have 2 builds of HPX; all the tests I do are with the idle rate and APEX OTF trace. All the numbers I quote are with a release build of HPX without idle rates and APEX. The difference in results between the two builds is usually low, never exceeding 100 MLUPS
Nikunj__ has quit [Read error: Connection reset by peer]
<heller1> I see
<heller1> Then it's something else, you either dig through the stack traces, or accept what you have and write it down
<heller1> You've got pretty nice results!
wate123_Jun has joined #ste||ar
wate123_Jun has quit [Ping timeout: 265 seconds]
akheir1 has quit [Remote host closed the connection]
akheir has joined #ste||ar
nikunj has quit [Read error: Connection reset by peer]
nikunj has joined #ste||ar
<zao> Only worked through a bit of my init code, but HPX is surprisingly usable.
nikunj has quit [Ping timeout: 265 seconds]
nikunj has joined #ste||ar