hkaiser changed the topic of #ste||ar to: The topic is 'STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/
<hkaiser> Yorlik: here
<Yorlik> Would you have a moment to help me understand the profiler results from my server? It seems my memory bandwidth is much smaller than it should be, and it also seems I am spending a ton of time in HPX functions
<Yorlik> It seems like 54% of the time is in hpx.dll
<hkaiser> Yorlik: could be if you have not enough work
<hkaiser> look at idle rates
<Yorlik> I am just looping over the objects and loading them.
<Yorlik> And repeat that over and over again.
<Yorlik> I think the parallel for call could probably be optimized
<hkaiser> ok
<Yorlik> At the moment it looks like this:
<Yorlik> futures.push_back(
<Yorlik>     hpx::parallel::for_loop(
<Yorlik>         hpx::parallel::execution::par( hpx::parallel::execution::task ),
<Yorlik>         0, m_e_type::maxindex,
<Yorlik>         &update_entity<I> ) );
<Yorlik> ---
<Yorlik> (It's called inside a function template for the specific entity specialization, that's the <I>)
<hkaiser> what's your question?
<Yorlik> Is there anything wrong with this par loop?
<Yorlik> Something that could make it slow
<Yorlik> And how could I tweak it if possible?
<Yorlik> Like give it a chunk size.
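For reference, a chunk size can be attached to the execution policy before it is handed to for_loop. A minimal sketch, assuming the hpx::parallel::execution::static_chunk_size executor parameter and the headers named below (check the HPX docs for the exact spelling in your version); the value 1024 is illustrative:

    #include <hpx/include/parallel_for_loop.hpp>
    #include <hpx/include/parallel_executor_parameters.hpp>

    // Attach a fixed chunk size to the asynchronous parallel policy so
    // each HPX thread processes a batch of iterations instead of a few.
    auto policy = hpx::parallel::execution::par(hpx::parallel::execution::task)
                      .with(hpx::parallel::execution::static_chunk_size(1024));

    futures.push_back(hpx::parallel::for_loop(
        policy, 0, m_e_type::maxindex, &update_entity<I>));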
<hkaiser> why do you think it's 'slow'?
<Yorlik> Because my frametimes suck
<Yorlik> The usable memory bandwidth I see is about 1.5 GB/s
<Yorlik> Using the size of all entities and the frametime
<Yorlik> I'm just doing some very simple rolling statistics at the frame level and the very simple update function
<hkaiser> how many hpx threads are being created and what is the average thread execution time (length)?
<Yorlik> You mean how many tasks?
<hkaiser> yes
<hkaiser> how many hpx threads
<hkaiser> (the for_loop creates some as well)
<Yorlik> I think I need to install the counters before I can answer that question. Gotta work on that.
<hkaiser> Yorlik: ok, no need to 'install' them, just pass them on the command line
<Yorlik> OK - how do I do that again?
<hkaiser> see docs
<Yorlik> OK. Reading up ...
<Yorlik> This one? /agas{locality#0/total}/count/num_threads,17,16.052212,[s],0
<hkaiser> I think it's /threads{locality#0/total}/count/cumulative
<Yorlik> OK
<hkaiser> and /threads/time/average
<hkaiser> this one is the most useful one, in my book: /threads/idle-rate
<hkaiser> but needs to be enabled at compile time
<Yorlik> How do I add several counters? Just a comma-separated list?
<hkaiser> -DHPX_WITH_THREAD_IDLE_RATES=On
<hkaiser> several command line options
<hkaiser> --hpx:print-counter=/counter1 --hpx:print-counter=/counter2 ...
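Putting those together, a hypothetical invocation printing several counters once per second might look like this (the application name is a placeholder; --hpx:print-counter-interval takes milliseconds):

    my_app --hpx:print-counter=/threads{locality#0/total}/count/cumulative \
           --hpx:print-counter=/threads{locality#0/total}/idle-rate \
           --hpx:print-counter-interval=1000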
<Yorlik> OK - THX :)
nikunj97 has quit [Ping timeout: 240 seconds]
<Yorlik> hkaiser:
<Yorlik> /threads{locality#0/total/total}/idle-rate,4,3.083962,[s],6477,[0.01%]
<Yorlik> /threads{locality#0/total/total}/time/average,4,3.100528,[s],877050,[ns]
<hkaiser> right, as expected - 64% idle-rate
<hkaiser> 900 microseconds thread length is not too bad
<Yorlik> That's the for loop chunks?
<hkaiser> everything
<Yorlik> IC
<hkaiser> average
<hkaiser> tells us that there is no parallelism (thus 64% idle-rate)
<Yorlik> I also have this: /threads{locality#0/pool#default/worker-thread#0}/idle-rate,4,3.100500,[s],8120,[0.01%]
<hkaiser> even worse
<Yorlik> /threads{locality#0/pool#default/worker-thread#0}/time/average,4,3.098264,[s],457046,[ns]
<hkaiser> that's the per-core values
<hkaiser> the total values are the averages over the core values
<Yorlik> Let me drop one period (1 sec):
<Yorlik> /threads{locality#0/total/total}/count/cumulative,4,3.096336,[s],4913
<Yorlik> /threads{locality#0/pool#default/worker-thread#1}/count/cumulative,4,3.098132,[s],1217
<Yorlik> /threads{locality#0/pool#default/worker-thread#3}/count/cumulative,4,3.097165,[s],1213
<Yorlik> /threads{locality#0/pool#default/worker-thread#0}/count/cumulative,4,3.096396,[s],1256
<Yorlik> /threads{locality#0/pool#default/worker-thread#2}/count/cumulative,4,3.098139,[s],1236
<Yorlik> /threads{locality#0/total/total}/idle-rate,4,3.083962,[s],6477,[0.01%]
<Yorlik> /threads{locality#0/pool#default/worker-thread#0}/idle-rate,4,3.100500,[s],8120,[0.01%]
<Yorlik> /threads{locality#0/pool#default/worker-thread#1}/idle-rate,4,3.098207,[s],7931,[0.01%]
<Yorlik> /threads{locality#0/pool#default/worker-thread#2}/idle-rate,4,3.084335,[s],1695,[0.01%]
<Yorlik> /threads{locality#0/pool#default/worker-thread#3}/idle-rate,4,3.084015,[s],8154,[0.01%]
<Yorlik> /threads{locality#0/total/total}/time/average,4,3.100528,[s],877050,[ns]
<Yorlik> /threads{locality#0/pool#default/worker-thread#0}/time/average,4,3.098264,[s],457046,[ns]
<Yorlik> /threads{locality#0/pool#default/worker-thread#1}/time/average,4,3.084364,[s],521684,[ns]
<Yorlik> /threads{locality#0/pool#default/worker-thread#2}/time/average,4,3.084062,[s],2.06616e+06,[ns]
<Yorlik> /threads{locality#0/pool#default/worker-thread#3}/time/average,4,3.100560,[s],465189,[ns]
<Yorlik> /threads{locality#0/total/total}/count/cumulative,5,4.103032,[s],10439
<Yorlik> That's 1 second, should be roughly 10-13 frames
<hkaiser> 10000 threads overall
<hkaiser> one core is doing almost all of the work
<hkaiser> core2 has 16% idle rate (which is ok), the rest aren't doing anything
<Yorlik> That matches my measurements from when I did manual thread id prints from std::thread and HPX too
<Yorlik> How can I fix this?
<hkaiser> shrug
<Yorlik> I'd like to round robin the cores
<Yorlik> I mean it's your runtime - you should be able to tell me :D
<hkaiser> work should get stolen if there is any
<Yorlik> The result is low framerate
<hkaiser> Yorlik: sum all the work and compare to execution (wall clock) time
<hkaiser> that will give you a sense of how much work was done
<hkaiser> the frame rate is low as everything is done on one core
<hkaiser> (well, almost everything)
<Yorlik> What is "work" in your book here? I simply wanted to maximize the loops/sec here.
<hkaiser> the idle-rate is derived from the ratio of thread-execution-time to wall-clock-time
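Spelled out (sketching the definition; the exact accounting is internal to the HPX scheduler):

    idle-rate = 1 - (time spent executing HPX threads / wall-clock time)

so a 64% idle-rate means only about a third of the available core time went into actual task execution.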
<Yorlik> So they're lazy for some reason
<Yorlik> Wall clock is 1000 ms here
<hkaiser> no, just not enough work
<Yorlik> What is work?
<hkaiser> tasks
<Yorlik> I mean -- loading the objects at 1.5 GB/s is lame!
<hkaiser> 4000 tasks, 4 cores, that's about 1000 tasks per core
<hkaiser> average length is about 500 microsecs
<hkaiser> that means that you're using only half of your compute resources
<hkaiser> you would get the same when running on just 2 cores
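The arithmetic behind that estimate, using the rounded figures from the counters above:

    ~1000 tasks/core x ~500 us average length ~= 0.5 s of execution
    per core per ~1 s of wall clock, i.e. ~50% utilization -
    the same throughput two fully busy cores would deliver.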
<Yorlik> This test is not so much about computing but memory bandwidth. I don't understand why the bandwidth is so low
<hkaiser> bandwidth is low because you don't do a lot of work
<Yorlik> So doing more work would make it faster? That sounds crazy
<Yorlik> After all I just want it to loop as fast as possible over all the entities
<hkaiser> I don't know what you're doing - but there is no reason why using hpx should limit your bandwidth
<Yorlik> The cores are actually at > 90% in resource monitor
<hkaiser> Yorlik: yah, that is because the hpx scheduler tries to run things constantly, that is a red herring
<Yorlik> So that's the idle time
<hkaiser> well, sure - instead of suspending the thread, it keeps running in case new work is created
<hkaiser> so for the OS it looks as if the thread was 'doing things'
<Yorlik> Why is it idle and not running the next task of the parallel loop instead?
<hkaiser> because there is no 'next task' otherwise it would run it
<Yorlik> I don't understand why the memory bandwidth is so low. I'll do a test with an empty update function
<hkaiser> Yorlik: is that a release build? or debug?
<Yorlik> There is not much difference between them
<hkaiser> I doubt that
<Yorlik> frametime is about 73-100 ms
<hkaiser> there is usually a factor of 10 between them
<Yorlik> I get the same numbers with the update function being empty - I'm definitely burning time elsewhere. So I'm not really measuring memory bandwidth
<Yorlik> I'll double check the entire call chain
<Yorlik> OK - my frametime is down to 20-30 ms now and I don't know why. I had just played with some smallish things and I actually undid them, and the time stays low, which kinda corresponds to ~4.0 GB/sec. Now I'm scared
<Yorlik> Maybe the server shaped up just by me looking at it ... :D
hkaiser has quit [Quit: bye]
<Yorlik> Arrived at 4.9 GB/sec. Heap profiling "off" helps a lot ... :D
<zao> :D
nikunj has joined #ste||ar
<Yorlik> Still - it should be faster, imo :D
nikunj97 has joined #ste||ar
nikunj has quit [Ping timeout: 260 seconds]
nikunj97 has quit [Ping timeout: 260 seconds]
nikunj97 has joined #ste||ar
nikunj97 has quit [Ping timeout: 265 seconds]
nikunj97 has joined #ste||ar
<Yorlik> Is it possible to cancel a scheduled task? E.g. for a timer application?
<Yorlik> I mean a task that has yielded.
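The question goes unanswered in the log; there is no way shown here to forcibly kill a task that has already been scheduled, and the usual pattern is cooperative cancellation, where the task polls a shared flag. A minimal sketch, assuming only hpx::async and hpx::this_thread::yield (the flag and the loop body are illustrative):

    #include <hpx/include/async.hpp>
    #include <hpx/include/threads.hpp>
    #include <atomic>
    #include <memory>

    // Shared flag the timer task polls; setting it requests cancellation.
    auto cancelled = std::make_shared<std::atomic<bool>>(false);

    auto timer = hpx::async([cancelled] {
        while (!cancelled->load(std::memory_order_relaxed))
        {
            // ... perform one step of the timer work ...
            hpx::this_thread::yield();    // let other tasks run
        }
    });

    cancelled->store(true);    // request cancellation
    timer.get();               // wait for the task to observe the flag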
hkaiser has joined #ste||ar
<Yorlik> When I'm starting a lambda with hpx::async, does the return type of the lambda determine the template parameter of the future?
<Yorlik> Like hpx::future<lambda_rettype>
<Yorlik> NVM - figured it out
nikunj97 has quit [Ping timeout: 240 seconds]
nikunj has joined #ste||ar
<hkaiser> Yorlik: yes
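A minimal example confirming the deduction (the lambda and value are illustrative):

    #include <hpx/include/async.hpp>

    // The future's template parameter is deduced from the lambda's
    // return type: an int-returning lambda yields hpx::future<int>.
    hpx::future<int> f = hpx::async([] { return 42; });
    int value = f.get();    // 42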
nikunj has quit [Read error: Connection reset by peer]
nikunj has joined #ste||ar
nikunj has quit [Read error: Connection reset by peer]
nikunj97 has joined #ste||ar
nikunj97 has quit [Ping timeout: 268 seconds]
hkaiser has quit [Ping timeout: 240 seconds]
V|r has joined #ste||ar
hkaiser has joined #ste||ar