hkaiser changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/ | GSoC: https://github.com/STEllAR-GROUP/hpx/wiki/Google-Summer-of-Code-%28GSoC%29-2020
hkaiser has quit [Quit: bye]
diehlpk_work_ has quit [Remote host closed the connection]
<heller1> That looks pretty fine grain already
mreese3 has joined #ste||ar
maxwellr96 has quit [Ping timeout: 260 seconds]
nikunj97 has joined #ste||ar
nikunj97 has quit [Quit: Leaving]
nikunj97 has joined #ste||ar
nikunj97 has quit [Read error: Connection reset by peer]
nikunj97 has joined #ste||ar
<zao> If I'm writing an HPX application from scratch on a cluster, which version is most likely to actually work: 1.4.1, stable, or master-HEAD?
<zao> A bit tempted to see if I can futurize this thing.
<nikunj97> zao, from my experience I'd say master-HEAD
<nikunj97> 1.4.1 does not have the new block executor
<heller1> 1.4.1 or stable would be my choice
<nikunj97> also there were problems with APEX integration
<nikunj97> that I faced, which recently got fixed in master
<heller1> stable is almost HEAD but with circle having passed
<nikunj97> aah, then stable I'd say
<nikunj97> heller1, I'm finally getting better performance on the Hi1616
<nikunj97> at least near peak bandwidth performance
<heller1> Great, what did you change?
<nikunj97> it was previously using an older arm-hpc-compiler
<nikunj97> plus there was a problem with nsimd that I noticed yesterday
<nikunj97> they've corrected that
<nikunj97> these 2 changes were honestly enough to make the difference
<nikunj97> plus nsimd was previously built with gcc while everything else was arm-hpc-compiler
<nikunj97> I was looking at the HPX side of things while the problem lay somewhere else
<nikunj97> I realized all of this when I was building everything from scratch yesterday for my final benchmarking
<zao> `-fgo-fast-plz`
<nikunj97> heller1, I didn't think this would make this big a difference, but this is what it is ;)
<nikunj97> 64 core performance is still not optimal though, even when the thread-idle rate is about 11% for both 32 and 64 cores. If it's about the same for both, I speculate that the CPU is stalling while memory is being fetched. I'll get STREAM results again and see for myself.
gonidelis has joined #ste||ar
<nikunj97> heller1, I'll run the benchmarks on all platforms now. That will be enough for my lab-based project. After this, I'll work on writing iterators for various stencil traversal algorithms for better performance. Those will be both distributed and futurized
<nikunj97> heller1, ok. looking at the code I realized that it's using the default chunk policy
<nikunj97> that was the major reason for the performance difference
<heller1> Cool!
<heller1> What chunker do you use?
<nikunj97> auto_chunk_size
<nikunj97> and I passed anywhere from 50-5000 as a parameter
<nikunj97> but all of them ended with about 2.4GLUPS, while the default is performing at ~4GLUPS
<heller1> Ok, so the default is better?
<nikunj97> apparently yes
<heller1> Anyway, the 64 core issue could be some other overhead related to some concurrency problem
<heller1> Run it with perf and see where the top 10 hotspots are
<nikunj97> perf is the one you shared yesterday, right?
<heller1> Yes
<heller1> Similar to vtune
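(A typical perf workflow for finding hotspots looks like this; the flags are standard perf options, but `./stencil_bench` is a placeholder for the actual benchmark binary:)

```shell
# Record with call graphs; DWARF unwinding usually gives usable C++ stacks.
perf record --call-graph dwarf -o perf.data ./stencil_bench

# Interactive hotspot listing: -g shows call chains, -n shows sample counts.
perf report -g -n -i perf.data
```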
hkaiser has joined #ste||ar
akheir has joined #ste||ar
<nikunj97> heller1, do I need to build hotspot at my cluster node? It won't allow me to use copied perf.data on a local machine
akheir1 has joined #ste||ar
akheir has quit [Ping timeout: 260 seconds]
<heller1> Why can't you open any perf data locally?
<nikunj97> it opens but with broken stack trace
<nikunj97> ?? everywhere basically
<heller1> Well, yeah
<heller1> You need to tell it where to find the debugging symbols
<nikunj97> how do I do that? they're not available locally
<nikunj97> also it's very slow to load results
<heller1> Yeah, that's difficult
<nikunj97> I switched to perf report -g -n btw
<nikunj97> on the cluster node to check hotspots
<nikunj97> heller1, what I wanted was perf-archive http://man7.org/linux/man-pages/man1/perf-archive.1.html
<nikunj97> this will enable me to use it locally
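(The perf-archive workflow mentioned above goes roughly like this; hostnames and paths are illustrative:)

```shell
# On the cluster node: bundle the debug symbols referenced by perf.data
# into perf.data.tar.bz2 so another machine can resolve the stack traces.
perf archive perf.data

# On the local machine: fetch both files, unpack the symbols into the
# local debug cache (~/.debug), then open the report as usual.
scp cluster:perf.data cluster:perf.data.tar.bz2 .
mkdir -p ~/.debug
tar xf perf.data.tar.bz2 -C ~/.debug
perf report -i perf.data
```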
<nikunj97> heller1, https://imgur.com/a/RNxSqNN
<nikunj97> heller1, how do I analyse the result? Pls help me
<heller1> Compare it with a 32 core run
<heller1> And use a better gui
mcopik has joined #ste||ar
mcopik has quit [Client Quit]
gonidelis has quit [Ping timeout: 240 seconds]
wate123__ has joined #ste||ar
<zao> Whoa, is stable really a good two weeks old?
<zao> Had to check that my Git client wasn't stuck :)
<zao> Hrm no, web claims it's two days ago.
<zao> Ah, it refused to fetch a tag that would clobber :D
<zao> (carry on, nothing to see here)
Nikunj__ has joined #ste||ar
wate123__ has quit [Remote host closed the connection]
nikunj97 has quit [Ping timeout: 260 seconds]
<zao> Oh gods, not again... C:\stellar\hpx\libs\topology\include\hpx\topology\topology.hpp(27): fatal error C1083: Cannot open include file: 'hwloc.h': No such file or directory
<zao> Someone should confiscate my computer license tonight. I hadn't checked out the tag :D
Hashmi has joined #ste||ar
<hkaiser> heller1: hey
<heller1> Hey hkaiser
<hkaiser> heller1: I could use a second pair of eyes looking at #4512, to make sure I have not created any races
<hkaiser> would you have the time for that?
<heller1> Sure
<hkaiser> thanks
<Nikunj__> heller1, one clear difference between 32 and 64 cores is that the scheduler takes 28% in the hotspot result for 64 cores but only 12% for 32 cores
<Nikunj__> I think the reduced performance even with the same idle rates is not due to limited bandwidth but due to contention in the scheduler
<hkaiser> Nikunj__: then the idlerates would be different
<Nikunj__> hkaiser, idle rates were about the same
<Nikunj__> 32 core had 10-11% while 64 core had 12-13%
<Nikunj__> and both 32 and 64 core performance were very similar even though memory bandwidth still increased
<hkaiser> Nikunj__: that means that the scheduler contention is about the same in both cases, no?
<Nikunj__> sure, but both peak performance and peak bandwidth increased in that range while our application didn't scale even with the same idle rates
<Nikunj__> I didn't get why it's behaving like this. I'll test some more with it and see what's causing it.
<Nikunj__> hkaiser, does thread-idle rate not count scheduler invocations etc.?
<heller1> There are some spots the idle rates don't get
<heller1> Have yourself a build without idle rates and try again
<heller1> My guess would be that there is contention in some atomic. The idle rate counter is one of them
<hkaiser> heller1: the idle-rates are non-atomic
<hkaiser> but it could be the timers
<heller1> Hu?
<heller1> Ah right... it's the time-taking itself that the overheads come from
<zao> I'm using Visual Studio's CMake and I think it's getting upset over HPX's CMake config - https://gist.github.com/zao/49a425d97f76ec41fe781bf222745821
<zao> HPXConfig.cmake contains:
<zao> set(HPX_HWLOC_ROOT "C:\local\hwloc-win64-build-2.1.0")
<zao> Which is how I specified HWLOC_ROOT on the command line when configuring HPX.
<zao> Should HPX normalize or escape these paths in some way?
<zao> Same problem when generating with regular CMake on command line too.
<zao> simbergm: Are you involved with these parts?
<zao> If I specify paths with forward slashes when configuring HPX it generates the right thing, but that's unholy and unnatural in CMD.
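(For context, the underlying issue: in CMake strings, a backslash starts an escape sequence, so a raw Windows path written into `HPXConfig.cmake` is mis-parsed. A sketch of the usual fixes — the variable names mirror the snippet above but the exact HPX fix may differ:)

```cmake
# "C:\local\hwloc-win64-build-2.1.0" is parsed by CMake as containing the
# (invalid) escapes \l and \h. Two common remedies:
#
# 1. Pass forward slashes on the command line when configuring:
#      cmake -DHWLOC_ROOT=C:/local/hwloc-win64-build-2.1.0 ..
#
# 2. Normalize a user-supplied native path before embedding it in a
#    generated config file:
file(TO_CMAKE_PATH "C:\\local\\hwloc-win64-build-2.1.0" _hwloc_root)
set(HPX_HWLOC_ROOT "${_hwloc_root}")  # now uses forward slashes
```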
<hkaiser> zao: yah, that's a bug
<hkaiser> zao: care to create a ticket?
<zao> 4513
Hashmi has quit [Quit: Connection closed for inactivity]
<hkaiser> thanks zao!
<Yorlik> o/
<hkaiser> Yorlik: hey
<Yorlik> Heyo !
<hkaiser> talk now?
<Yorlik> Great !
<Yorlik> Same link?
<hkaiser> yep
<Yorlik> For some reason the link doesn't work anymore for me
<hkaiser> sec
<Yorlik> Now I'm in
<Nikunj__> heller1, those numbers were with idle rate off
<Nikunj__> I have 2 builds of HPX; all the tests I do are with the idle rate and APEX OTF trace. All the numbers I quote are with a release build of HPX without idle rates and APEX. The difference in results between the two builds is usually low, never exceeding 100 MLUPS
Nikunj__ has quit [Read error: Connection reset by peer]
<heller1> I see
<heller1> Then it's something else, you either dig through the stack traces, or accept what you have and write it down
<heller1> You've got pretty nice results!
wate123_Jun has joined #ste||ar
wate123_Jun has quit [Ping timeout: 265 seconds]
akheir1 has quit [Remote host closed the connection]
akheir has joined #ste||ar
nikunj has quit [Read error: Connection reset by peer]
nikunj has joined #ste||ar
<zao> Only worked through a bit of my init code, but HPX is surprisingly usable.
nikunj has quit [Ping timeout: 265 seconds]
nikunj has joined #ste||ar