hkaiser changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/ | GSoC: https://github.com/STEllAR-GROUP/hpx/wiki/Google-Summer-of-Code-%28GSoC%29-2020
gonidelis has quit [Ping timeout: 240 seconds]
karame_ has quit [Ping timeout: 240 seconds]
<diehlpk_work> weilewei, I think you can take the course as some external course
<weilewei> diehlpk_work ok, I need to think about it, as I also have some requirements for taking external courses. I need to check
bita_ has quit [Quit: Leaving]
shahrzad has quit [Ping timeout: 240 seconds]
shahrzad has joined #ste||ar
shahrzad has quit [Ping timeout: 252 seconds]
wate123_Jun has quit [Remote host closed the connection]
wate123_Jun has joined #ste||ar
shahrzad has joined #ste||ar
diehlpk_work has quit [Remote host closed the connection]
hkaiser has quit [Quit: bye]
shahrzad has quit [Remote host closed the connection]
shahrzad has joined #ste||ar
shahrzad has quit [Remote host closed the connection]
shahrzad has joined #ste||ar
nan11 has quit [Remote host closed the connection]
wate123_Jun has quit [Remote host closed the connection]
Vir has quit [Ping timeout: 256 seconds]
Vir has joined #ste||ar
Vir has joined #ste||ar
Vir has quit [Changing host]
wate123_Jun has joined #ste||ar
wate123_Jun has quit [Ping timeout: 256 seconds]
weilewei has quit [Remote host closed the connection]
shahrzad has quit [Ping timeout: 240 seconds]
wate123_Jun has joined #ste||ar
wate123_Jun has quit [Ping timeout: 240 seconds]
shahrzad has joined #ste||ar
shahrzad has quit [Quit: Leaving]
nikunj97 has joined #ste||ar
jaafar has quit [Remote host closed the connection]
jaafar has joined #ste||ar
<nikunj97> heller1, on some testing with hwloc-bind I can see that indeed E5 consistently performs at about 2700-2800MLUPS (float) and 1100-1300MLUPS (double)
<nikunj97> that's when I use hwloc-bind to bind equal number of cores from each package
<nikunj97> it starts off low and then saturates to that number no matter how many cores I put in
<nikunj97> which is the behavior we're looking at
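For reference, a binding like the one described might look as follows with hwloc-bind (the binary name and core ranges are placeholders; hwloc combines multiple location arguments into the union of their cpusets):

```shell
# Bind the process to an equal number of cores from each package (illustrative):
hwloc-bind package:0.core:0-4 package:1.core:0-4 -- ./stencil_benchmark
```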
<simbergm> everyone interested in having a say in whether the pragma once pr goes in or not, please respond to the email on hpx-devel
<simbergm> we're doing a vote!
nikunj97 has quit [Ping timeout: 240 seconds]
nikunj has quit [Read error: Connection reset by peer]
nikunj has joined #ste||ar
nikunj has quit [Ping timeout: 264 seconds]
nikunj has joined #ste||ar
nikunj97 has joined #ste||ar
<heller1> ms: weeeh!
<heller1> nikunj97: that's expected behavior
<nikunj97> heller1, yes it is, and I'm happy about it
<heller1> good!
<heller1> now, you will be able to see the effect of cache blocking
<nikunj97> heller1, btw I couldn't understand why performance dropped when we started performing on both CPUs
<nikunj97> both on Xeon and Hi1616
<heller1> nikunj97: not always, just in some configurations
<nikunj97> for ex: the performance dropped when going from 10 cores to 20 cores (1 CPU and 2 CPU respectively)
<nikunj97> in xeon e5
<heller1> right
<nikunj97> while the memory bandwidth doubled, which I think is expected since each CPU is attached to quad-channel RAM
<heller1> yes
<heller1> in theory
<heller1> here is the catch
<heller1> you had more than half (I think it was even more than 2/3) of the threads bound to one NUMA domain and the remaining few to another
<heller1> now
<heller1> a single core is not designed to max out the available bandwidth of its attached memory controller
<heller1> however, you usually see a saturation of the bandwidth when using about 2/3 of the cores attached to a single memory controller
<heller1> what you then see, is that 2/3 of your program is operating at maximum speed, while the remaining 1/3 is operating way slower, because they don't get as much memory bandwidth
<heller1> as such, slowing down the entire algorithm, since we are basically operating in a lock step fashion, where the slowest path dominates the overall runtime
<nikunj97> aah, that makes sense! but why the drop going from one complete CPU to 2 complete CPUs?
<heller1> that's another story
<heller1> compare the idle rates in your runs
<nikunj97> using a profiler?
<heller1> the number of tasks executed, and the average runtime of those tasks
<heller1> the HPX performance counter should suffice here
<nikunj97> you think that the stencil dimensions are not large enough to keep all CPUs busy all the time?
<nikunj97> also, if we make it into a dataflow style, we should not observe a sharp drop in cases of non uniform memory bandwidth usage
<heller1> figure it out
<heller1> we still do
<nikunj97> why's that?
<heller1> there's still the slowest path
<nikunj97> aah, alright. we're not doing anything to speed up the slowest path
<heller1> well, you will mitigate the effect to some degree
<nikunj97> thanks for the lead, I'll try to figure idle rate now!
<heller1> because potentially, you can hide some latencies
<nikunj97> btw, I also found some improvements with loop unrolling, about ~100-150MLUPS
<heller1> the catch here however is that we don't have asynchronous memory operations, so we can't schedule other useful work in between
<heller1> the dataflow approach as such will only be really beneficial once going distributed
<nikunj97> so is it beneficial to convert the code to a dataflow approach at this point?
<heller1> and only if the "communication surface" is small enough to be hidden by computation
<heller1> depends
<heller1> for this static problem: unlikely. for a more dynamic one: there are lots of benefits
<nikunj97> aah, I think I got some of what you said. But it is making sense to me
<nikunj97> I also think that I took a wrong example for simd performance
<nikunj97> I should've chosen something on the compute bound side
<heller1> :P
<heller1> get those cache optimizations in, and you'll be compute bound
<nikunj97> am trying. I want to get as much performance as possible :D
tiagofg[m] has joined #ste||ar
nikunj97 has quit [Quit: Leaving]
kale_ has joined #ste||ar
nikunj97 has joined #ste||ar
nikunj97 has quit [Remote host closed the connection]
kale_ has quit [Ping timeout: 258 seconds]
nikunj97 has joined #ste||ar
gonidelis has joined #ste||ar
nikunj97 has quit [Ping timeout: 240 seconds]
nikunj97 has joined #ste||ar
parsa has quit [Quit: Free ZNC ~ Powered by LunarBNC: https://LunarBNC.net]
<nikunj97> how do I call a templated function from hpx::async?
<nikunj97> found it: do an async lambda and call the function from the lambda
<heller1> or spell out the type
parsa has joined #ste||ar
<nikunj97> spelling out the type still doesn't work
<nikunj97> foo<type>, args... doesn't seem to work
<nikunj97> pure virtual method called
<nikunj97> terminate called without an active exception
<nikunj97> cool error message from HPX ^^
diehlpk_work has joined #ste||ar
Hashmi has joined #ste||ar
hkaiser has joined #ste||ar
wate123_Jun has joined #ste||ar
shahrzad has joined #ste||ar
weilewei has joined #ste||ar
hkaiser_ has joined #ste||ar
hkaiser has quit [Ping timeout: 240 seconds]
<nikunj97> heller1, you were right about CPU idling. The execution time is significantly higher than task execution time * num tasks.
<nikunj97> let me try profiling to identify where things are going wrong
karame_ has joined #ste||ar
shahrzad has quit [Ping timeout: 240 seconds]
<hkaiser_> nikunj97: idle-rate!
<nikunj97> hkaiser_, how do I use that?
<nikunj97> I am using Valgrind to check for timings and stuff
<hkaiser_> --hpx:print-counter=/threads/idle-rate ?
<nikunj97> ohh, didn't know it existed. wait let me try
<hkaiser_> nikunj97: needs to be enabled at compile time, however
<nikunj97> ohh with compile time option?
<nikunj97> you mean compiling HPX right?
<hkaiser_> -DHPX_WITH_THREAD_IDLE_RATES=On while compiling HPX
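Putting the two steps together (the application name is a placeholder):

```shell
# Configure HPX with idle-rate counters enabled (adds some measurement overhead):
cmake -DHPX_WITH_THREAD_IDLE_RATES=On <hpx-source-dir>

# Then request the counter when running the application:
./my_app --hpx:print-counter=/threads/idle-rate
```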
<nikunj97> why did we decide to keep it off by default?
<hkaiser_> simbergm: is there a way to add an anchor in the HPX docs for each perf counter in the table?
<hkaiser_> it adds some (possibly minor) overhead
<nikunj97> simbergm, that will be of great help!
<nikunj97> hkaiser_, aah! I see
<nikunj97> hkaiser_, is there any other compile time option that I'd want to use?
<hkaiser_> nikunj97: I think not
<nikunj97> alright, let me compile it again and use that
nan11 has joined #ste||ar
<simbergm> hkaiser_: maybe... it looks like it's possible to create a target, but you'd have to manually know the link (i.e. there'd be no button like with headings)
<simbergm> I'll look around
<hkaiser_> simbergm: thanks, I'd volunteer to add those
K-ballo has quit [Remote host closed the connection]
K-ballo has joined #ste||ar
<nikunj97> hkaiser_, /threads{locality#0/pool#default/worker-thread#19}/idle-rate,1,5.473154,[s],6652,[0.01%]
<nikunj97> this shows a different story somehow
shahrzad has joined #ste||ar
bita has joined #ste||ar
K-ballo has quit [Remote host closed the connection]
K-ballo has joined #ste||ar
<hkaiser_> nikunj97: numa 0 has ~33% and numa 1 has ~66% idle-rate
<nikunj97> how did you calculate that?
<nikunj97> I thought 0.01% mentioned is the idle rate
<hkaiser_> well, your counters show that
<hkaiser_> no, 0.01% is the unit of measure
<nikunj97> what are all the values then?
<nikunj97> comma separated values I mean
<hkaiser_> RTFM
<hkaiser_> nikunj97: HPX docs <quote>These lines have 6 fields, the counter name, the sequence number of the counter invocation, the time stamp at which this information has been sampled, the unit of measure for the time stamp, the actual counter value, and an optional unit of measure for the counter value.</quote>
<nikunj97> aah so 3596 -> 35.96% idle rate
<hkaiser_> right
<nikunj97> what should I do to decrease the idle rate? more tasks?
<hkaiser_> more parallelism
<hkaiser_> ask yourself why numa domain 1 is more idle than numa domain 0
rtohid has joined #ste||ar
<nikunj97> good question. Because they're allocated on numa 0 and migrated to numa 1?
<nikunj97> causing a delay and idling cpus
<hkaiser_> that's what you can improve, then
<nikunj97> how many tasks does parallel_for create?
<hkaiser_> as many as it creates chunks
<nikunj97> I found that using plain futures improved performance
<heller1> nikunj97: you are doing strong scaling right now. Try the same with scaling the number of elements with the number of CPUs
<hkaiser_> depends on the chunk sizes, I guess
<hkaiser_> also, try using the parallel_aggregated executor, that is more efficient than the default one
<nikunj97> heller1, isn't the current dimension of 8192x131072 enough for strong scaling?
<nikunj97> hkaiser_, I'm using hpx::compute::host::block_executor<>
<hkaiser_> ok
<hkaiser_> that's fine, I guess
<nan11> hkaiser_ will we have meeting today?
<hkaiser_> nan11: if you would like to meet, I'm ready
shahrzad has quit [Ping timeout: 256 seconds]
<nan11> Yes, could we have a short meeting
<hkaiser_> sure
Hashmi has quit [Quit: Connection closed for inactivity]
shahrzad has joined #ste||ar
<nikunj97> hkaiser_, what happened to executor_traits? I can't see them in the documentation anymore
<nikunj97> did we change the syntax for `exec.async_apply`?
shahrzad has quit [Ping timeout: 246 seconds]
<hkaiser_> no
<hkaiser_> nikunj97: but you shouldn't ever use exec.async_execute or similar
<hkaiser_> always use the customization point async_execute(exec, ...)
<nikunj97> what does async_execute return?
<nikunj97> a future?
<hkaiser_> the same as executor.async_execute returns
<hkaiser_> simbergm: yt?
<hkaiser_> simbergm: I'm trying to fix the issues with #4487
<hkaiser_> but I'm running into errors in the shared_priority_queue_scheduler that are unrelated
<nikunj97> hkaiser_, got it. thanks!
<simbergm> hkaiser_: half here
<hkaiser_> simbergm: I added some asserts to the scheduler that fire now on that PR
<simbergm> not sure I can help you but what errors do you get? jbjnr would be the one to ask about shared_priority_queue_scheduler
<hkaiser_> not sure who can look at that scheduler, I don't know anything about it
<simbergm> I guess the errors are in the pr status? I'll have a look
<hkaiser_> basically the local_nums used for the threads end up being -1
<hkaiser_> and then are used as array indices
<simbergm> the cross_pool_injection test also still has problems and it's using the shared priority queue scheduler
<simbergm> hmm, ok
<hkaiser_> yes, all of those fire the assert now
<simbergm> that could be me
<simbergm> related to the thread num refactoring
<hkaiser_> nod, right
<simbergm> but you say only in the shared priority queue scheduler? that would be the one that actually uses the local numbers, the others I'm not sure they even set the local numbers (only global)
<hkaiser_> I just pushed the changes, you'll see it in the CI runs
<hkaiser_> yes
<simbergm> in any case, I'll have a look latest monday
<hkaiser_> have not seen it for others
<hkaiser_> sure, no rush
<hkaiser_> I have it off my desk now ;-)
<simbergm> as in I'll have a look at fixing it
<simbergm> if I get any ideas I'll let you know over the weekend
<simbergm> :P
<hkaiser_> ok, thanks - try to relax over the free days
<simbergm> but you got around the other failing tests? the actual exception handling? or is this in the way of that?
<hkaiser_> yes
<hkaiser_> that's fixed, I believe
hkaiser_ has quit [Read error: Connection reset by peer]
Yorlik has quit [Read error: Connection reset by peer]
hkaiser has joined #ste||ar
Yorlik has joined #ste||ar
<nikunj97> heller1, hkaiser using `hpx::compute::host::block_executor` was leading to thread idling. Replacing it with `hpx::parallel::execution::parallel_executor_aggregated` fixed most of the thread idling. Now floats idle about 1.5-2.5% and doubles at 14%
<nikunj97> also in case of doubles, it's numa domain 1 that's causing the idling. Is there a way through which I could tell `hpx::parallel::execution::parallel_executor_aggregated` about the numa domains, just like there was with block_executor?
<nikunj97> hpx::compute::host::numa_domains() doesn't seem to work with parallel_executor_aggregated
<hkaiser> nikunj97: that's complicated and you are hitting a gap in our functionality
<nikunj97> is it too complicated?
<nikunj97> if not, I can try to add that to it
<heller1> Hmmm, interesting
<heller1> That's definitely a performance regression
<simbergm> nikunj97: master or 1.4.1?
<nikunj97> this is with master
<simbergm> from when exactly? or which commit?
<nikunj97> simbergm, e89bb10
<nikunj97> it's about 1.5 months old now
<heller1> So the idle rate is better, what about the actual performance though?
<nikunj97> it's at almost peak now
<nikunj97> so it gets to the peak expected and then saturates
<nikunj97> more like what we expected
<nikunj97> but couldn't see in the graph. Now I can reproduce what we expected
<simbergm> you might want to try latest master, the block executor changed to use a different executor in its implementation
<simbergm> also, are you scheduling individual tasks or doing parallel fors/bulk_async_execute?
<nikunj97> I'm doing a parallel_for
<nikunj97> without any chunk based input
<simbergm> ok, good
<nikunj97> let me try it with latest master then
<heller1> Can I see a graph please?
<simbergm> in that case I would compare parallel_executor (the default), parallel_executor_aggregate, thread_pool_executor, and block_executor
<nikunj97> heller1, I haven't plotted yet. I just ran it a bunch of times with the parameters I used in graph to see how it was behaving
<simbergm> you might also want to try `--hpx:queuing=shared-priority` (a different scheduler that handles numa better than the default one)
<nikunj97> simbergm, I'll try it as well
<nikunj97> sure, I can do a comparison about executors as well
<simbergm> and `--hpx:numa-sensitive`...
<simbergm> so many combinations you could try :P
<nikunj97> heller1, let me plot them tomorrow
<nikunj97> simbergm, I'm using numa-sensitive already
<simbergm> ok, good
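One way to sweep the combinations suggested above (binary name is a placeholder):

```shell
for flags in "" \
             "--hpx:queuing=shared-priority" \
             "--hpx:queuing=shared-priority --hpx:numa-sensitive"; do
  ./stencil_benchmark $flags --hpx:print-counter=/threads/idle-rate
done
```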
shahrzad has joined #ste||ar
shahrzad has quit [Remote host closed the connection]
<heller1> It should look a bit like this. Just noticed that I have that drop there as well ;)
<heller1> If the graph is flattening in a uniform way, you're doing it wrong. Adding a second numa domain should double the bandwidth
<nikunj97> heller1, let me run a script tonight and see what the results are
<nikunj97> and then we can discuss over it tomorrow
<nikunj97> should I run them with hwloc-bind just to make sure?
<nikunj97> also about arm, you said running n cores from each numa domain. Should I write a script that does that?
<nikunj97> heller1, fig 6.13, 6.14 pg-102. Why are HPX stream results significantly different from OpenMP ones?
<heller1> Sure, do it
rtohid has left #ste||ar [#ste||ar]
<heller1> Grain size
<nikunj97> simbergm, just tried with the new block_executor. It's much better now. My results finally got to about 6700MLUPS on E5 at 20 core limit
<nikunj97> that's a 2300MLUP jump from my previous result
<simbergm> nikunj97: very good to hear!
<nikunj97> let me take a closer look at the idle rates now
gonidelis has quit [Remote host closed the connection]
<diehlpk_work> hkaiser, Do you know if Ali is around?
<diehlpk_work> I have issues with the boost modules
<nikunj97> diehlpk_work, you too?
<nikunj97> is it related to boost not found?
<nikunj97> even when you load the module?
<hkaiser> diehlpk_work: send him mail
Hashmi has joined #ste||ar
<diehlpk_work> nikunj97, Using my own build of boost works
<diehlpk_work> The preloaded module files does not
<nikunj97> yes that's what I'm facing as well
<nikunj97> I thought I messed up with my configuration. Looks like it's happening to others as well
<Yorlik> o/
<nikunj97> heller1, I see a lot of idling on Hi1616
<nikunj97> even with single numanode
<nikunj97> the application is not scaling on arm basically
wate123_Jun has quit [Remote host closed the connection]
wate123_Jun has joined #ste||ar
<nikunj97> hkaiser, why does --hpx:numa-sensitive not have an option for 4 numa domains?
<nikunj97> my Hi1616 have 4 numa nodes and I'm lacking performance
<nikunj97> is there a way around it?
<weilewei> hkaiser the mpi_isend approach you introduced works on my code. However, there seems to be some physics knowledge prevent me to do so. So I use a simpler version and waiting for physicist to get back my questions to me. Thanks!
<hkaiser> weilewei: sure
<hkaiser> nikunj97: --hpx:numa-sensitive has essentially no effect on the execution, iirc - also it had never a way to specify the number of numa domains
<nikunj97> what is it used for then?
<nikunj97> also, `hpx::init: std::exception caught: Invalid argument value for --hpx:numa-sensitive. Allowed values are 0, 1, or 2`
<hkaiser> right, don't think about using it, I doubt it has any effect whatsoever
<nikunj97> alright
<nikunj97> I still have some 40% idle rate on ARM to tackle
<nikunj97> and about 30
<nikunj97> on x86
<nikunj97> but with x86 it's on full node utilisation while on ARM it starts as early as 16 cores
<nikunj97> increasing the problem size by 10x seems to make cpu utilisation a bit better. But it still is pretty significant idle time
<nikunj97> hkaiser, anything from hpx to improve performance?
<hkaiser> nikunj97: high idle-rates almost always point towards too little parallelism
<nikunj97> hkaiser, got it to work :D
<nikunj97> 11k MLUPS on ARM
<nikunj97> from 3K MLUPS
<hkaiser> nice
<nikunj97> I had to use basic futures to see the grain-size
<nikunj97> now it's idling at 10%
<nikunj97> I'm trying to fine tune and once I find the right grain-size, I'll use static_chunk_size to define the number of splits
<nikunj97> I just hit 41k MLUPS on ARM
<hkaiser> nikunj97: you could use the auto-chunker and print the value it has come up with
<hkaiser> and then use that value
<nikunj97> doesn't auto_chunker require a constructor value?
<hkaiser> have you looked at it?
<nikunj97> you mean this right?
<hkaiser> you _can_ specify both, the time to use as the minimal chunk duration and the number of iterations to measure
<hkaiser> yes
<hkaiser> but a bit newer than the one you linked
<nikunj97> how does it know if it's num iterations or the duration?
<nikunj97> hkaiser, you might want to look at this: https://gist.github.com/NK-Nikunj/2f95affe08743873471658116af1b2bd
<nikunj97> too good for my eye
<nikunj97> took me some time to figure it out, but these are some mind boggling results for me
<nikunj97> I feel like checking for correctness of my algorithm just to make sure if these are even right numbers
<hkaiser> nikunj97: idle-rates are excellent
<nikunj97> yes! they're finally under control
<nikunj97> but seriously 80GLUPS is a lot really
<nikunj97> I feel like checking for correctness now, let me try printing a smaller matrix for correctness
<nikunj97> hkaiser, I can confirm that the algorithm is indeed correct!
<nikunj97> the print results are accurate
<nikunj97> damn, I never thought that I was doing 28x slower than achievable
<nikunj97> also, I have no clue why I have these high performance numbers. will have to ask heller1 about this
wate123_Jun has quit [Remote host closed the connection]
Hashmi has quit [Quit: Connection closed for inactivity]
wate123_Jun has joined #ste||ar
<nikunj97> ohh, they were apparently cache effects. so many cores. So much L2 cache...
<Yorlik> :D
<Yorlik> The cache is your best friend :)
<nikunj97> Yorlik, :D
<nikunj97> my application is crashing, something is wrong with my code :/
<nikunj97> found the mistake
<nikunj97> the performance was all a hoax :/
<nikunj97> wrong placement of hpx::wait_all
<nikunj97> I had my hopes so high
<nikunj97> hkaiser, they're not real numbers, my bad :/
<hkaiser> lol
<hkaiser> that's how you write papers! you finally figured it out ;-)
<nikunj97> xD
<nikunj97> I will get there... eventually
<nikunj97> will plot some graphs comparing with roofline to compare with the peak performance available at that arithmetic intensity
<nikunj97> for now I'll go sleep
nikunj97 has quit [Read error: Connection reset by peer]
bita has quit [Ping timeout: 252 seconds]
<Yorlik> Any idea to improve this? Speed? Something from <algorithm>? https://godbolt.org/z/GQ8pdW
nan11 has quit [Remote host closed the connection]