hkaiser changed the topic of #ste||ar to: The topic is 'STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/
<hkaiser> Yorlik: here
<Yorlik> Would you have a moment to help me understand the profiler results from my server? It seems my memory bandwidth is much smaller than it should be, and it also seems I am spending a ton of time in HPX functions
<Yorlik> It seems like 54% of the time is in hpx.dll
<hkaiser> Yorlik: could be if you have not enough work
<hkaiser> look at idle rates
<Yorlik> I am just looping over the objects and loading them.
<Yorlik> And repeat that over and over again.
<Yorlik> I think the parallel for call could probably be optimized
<hkaiser> ok
<Yorlik> At the moment it looks like this:
<Yorlik> futures.push_back(
<Yorlik>     hpx::parallel::for_loop(
<Yorlik>         hpx::parallel::execution::par( hpx::parallel::execution::task ),
<Yorlik>         0, m_e_type::maxindex,
<Yorlik>         &update_entity<I> ) );
<Yorlik> ---
<Yorlik> (It's called inside a function template for the specific entity specialization, that's the <I>)
<hkaiser> what's your question?
<Yorlik> Is there anything wrong with this par loop?
<Yorlik> Something that could make it slow
<Yorlik> And how could I tweak it if possible?
<Yorlik> Like give it a chunk size.
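For reference, a chunk size can be attached to the execution policy before it is handed to for_loop. A minimal sketch, assuming the hpx::parallel::execution::static_chunk_size executor parameter and the headers named below (check the HPX docs for the exact spelling in your version); the value 1024 is illustrative:

    #include <hpx/include/parallel_for_loop.hpp>
    #include <hpx/include/parallel_executor_parameters.hpp>

    // Attach a fixed chunk size to the asynchronous parallel policy so
    // each HPX thread processes a batch of iterations instead of a few.
    auto policy = hpx::parallel::execution::par(hpx::parallel::execution::task)
                      .with(hpx::parallel::execution::static_chunk_size(1024));

    futures.push_back(hpx::parallel::for_loop(
        policy, 0, m_e_type::maxindex, &update_entity<I>));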
<hkaiser> why do you think it's 'slow'?
<Yorlik> Because my frametimes suck
<Yorlik> The usable memory bandwidth I see is about 1.5 GB/s
<Yorlik> Using the size of all entities and the frametime
<Yorlik> I'm just doing some very simple rolling statistics at the frame level and the very simple update function
<hkaiser> how many hpx threads are being created and what is the average thread execution time (length)?
<Yorlik> You mean how many tasks?
<hkaiser> yes
<hkaiser> how many hpx threads
<hkaiser> (the for_loop creates some as well)
<Yorlik> I think I need to install the counters before I can answer that question. Gotta work on that.
<hkaiser> Yorlik: ok, no need to 'install' them, just pass them on the command line
<Yorlik> OK - how do I do that again?
<hkaiser> see docs
<Yorlik> OK. Reading up ...
<Yorlik> This one? /agas{locality#0/total}/count/num_threads,17,16.052212,[s],0
<hkaiser> I think it's /threads{locality#0/total}/count/cumulative
<Yorlik> OK
<hkaiser> and /threads/time/average
<hkaiser> this one is the most useful one, in my book: /threads/idle-rate
<hkaiser> but needs to be enabled at compile time
<Yorlik> How do I add several counters? Just a comma-separated list?
<hkaiser> -DHPX_WITH_THREAD_IDLE_RATES=On
<hkaiser> several command line options
<hkaiser> --hpx:print-counter=/counter1 --hpx:print-counter=/counter2 ...
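Putting those together, a hypothetical invocation printing several counters once per second might look like this (the application name is a placeholder; --hpx:print-counter-interval takes milliseconds):

    my_app --hpx:print-counter=/threads{locality#0/total}/count/cumulative \
           --hpx:print-counter=/threads{locality#0/total}/idle-rate \
           --hpx:print-counter-interval=1000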
<Yorlik> OK - THX :)
nikunj97 has quit [Ping timeout: 240 seconds]
<Yorlik> hkaiser:
<Yorlik> /threads{locality#0/total/total}/idle-rate,4,3.083962,[s],6477,[0.01%]
<Yorlik> /threads{locality#0/total/total}/time/average,4,3.100528,[s],877050,[ns]
<hkaiser> right, as expected - 64% idle-rate
<hkaiser> 900 microseconds thread length is not too bad
<Yorlik> That's the for loop chunks?
<hkaiser> everything
<Yorlik> IC
<hkaiser> average
<hkaiser> tells us that there is no parallelism (thus 64% idle-rate)
<Yorlik> I also have this: /threads{locality#0/pool#default/worker-thread#0}/idle-rate,4,3.100500,[s],8120,[0.01%]
<hkaiser> even worse
<Yorlik> /threads{locality#0/pool#default/worker-thread#0}/time/average,4,3.098264,[s],457046,[ns]
<hkaiser> that's the per-core values
<hkaiser> the total values are the averages over the core values
<Yorlik> Let me drop one period (1 sec):
<Yorlik> /threads{locality#0/total/total}/count/cumulative,4,3.096336,[s],4913
<Yorlik> /threads{locality#0/pool#default/worker-thread#1}/count/cumulative,4,3.098132,[s],1217
<Yorlik> /threads{locality#0/pool#default/worker-thread#3}/count/cumulative,4,3.097165,[s],1213
<Yorlik> /threads{locality#0/pool#default/worker-thread#0}/count/cumulative,4,3.096396,[s],1256
<Yorlik> /threads{locality#0/pool#default/worker-thread#2}/count/cumulative,4,3.098139,[s],1236
<Yorlik> /threads{locality#0/total/total}/idle-rate,4,3.083962,[s],6477,[0.01%]
<Yorlik> /threads{locality#0/pool#default/worker-thread#0}/idle-rate,4,3.100500,[s],8120,[0.01%]
<Yorlik> /threads{locality#0/pool#default/worker-thread#1}/idle-rate,4,3.098207,[s],7931,[0.01%]
<Yorlik> /threads{locality#0/pool#default/worker-thread#2}/idle-rate,4,3.084335,[s],1695,[0.01%]
<Yorlik> /threads{locality#0/pool#default/worker-thread#3}/idle-rate,4,3.084015,[s],8154,[0.01%]
<Yorlik> /threads{locality#0/total/total}/time/average,4,3.100528,[s],877050,[ns]
<Yorlik> /threads{locality#0/pool#default/worker-thread#0}/time/average,4,3.098264,[s],457046,[ns]
<Yorlik> /threads{locality#0/pool#default/worker-thread#1}/time/average,4,3.084364,[s],521684,[ns]
<Yorlik> /threads{locality#0/pool#default/worker-thread#2}/time/average,4,3.084062,[s],2.06616e+06,[ns]
<Yorlik> /threads{locality#0/pool#default/worker-thread#3}/time/average,4,3.100560,[s],465189,[ns]
<Yorlik> /threads{locality#0/total/total}/count/cumulative,5,4.103032,[s],10439
<Yorlik> That's 1 second, should be roughly 10-13 frames
<hkaiser> 10000 threads overall
<hkaiser> one core is doing almost all of the work
<hkaiser> core2 has 16% idle rate (which is ok), the rest aren't doing anything
<Yorlik> That matches my measurements from when I did manual thread id prints from std::thread and HPX too
<Yorlik> How can I fix this?
<hkaiser> shrug
<Yorlik> I'd like to round robin the cores
<Yorlik> I mean it's your runtime - you should be able to tell me :D
<hkaiser> work should get stolen if there is any
<Yorlik> The result is low framerate
<hkaiser> Yorlik: sum all the work and compare to execution (wall clock) time
<hkaiser> that will give you a sense of how much work was done
<hkaiser> the frame rate is low as everything is done on one core
<hkaiser> (well, almost everything)
<Yorlik> What is "work" in your book here? I simply wanted to maximize the loops/sec here.
<hkaiser> the idle-rate is derived from the ratio of thread-execution-time to wall-clock-time
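Spelled out (sketching the definition; the exact accounting is internal to the HPX scheduler):

    idle-rate = 1 - (time spent executing HPX threads / wall-clock time)

so a 64% idle-rate means only about a third of the available core time went into actual task execution.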
<Yorlik> So they're lazy for some reason
<Yorlik> Wall clock is 1000 ms here
<hkaiser> no, just not enough work
<Yorlik> What is work?
<hkaiser> tasks
<Yorlik> I mean -- loading the objects at 1.5 GB/s is lame!
<hkaiser> 4000 tasks, 4 cores, that's about 1000 tasks per core
<hkaiser> average length is about 500 microsecs
<hkaiser> that means that you're using only half of your compute resources
<hkaiser> you would get the same when running on just 2 cores
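The arithmetic behind that estimate, using the rounded figures from the counters above:

    ~1000 tasks/core x ~500 us average length ~= 0.5 s of execution
    per core per ~1 s of wall clock, i.e. ~50% utilization -
    the same throughput two fully busy cores would deliver.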
<Yorlik> This test is not so much about computing but memory bandwidth. I don't understand why the bandwidth is so low
<hkaiser> bandwidth is low because you don't do a lot of work
<Yorlik> So doing more work would make it faster? That sounds crazy
<Yorlik> After all I just want it to loop as fast as possible over all the entities
<hkaiser> I don't know what you're doing - but there is no reason why using hpx should limit your bandwidth
<Yorlik> The cores are actually at > 90% in resource monitor
<hkaiser> Yorlik: yah, that is because the hpx scheduler tries to run things constantly, that is a red herring
<Yorlik> So that's the idle time
<hkaiser> well, sure - instead of suspending the thread, it keeps running in case new work is created
<hkaiser> so for the OS it looks as if the thread was 'doing things'
<Yorlik> Why is it idle and not running the next task of the parallel loop instead?
<hkaiser> because there is no 'next task' otherwise it would run it
<Yorlik> I don't understand why the memory bandwidth is so low. I'll do a test with an empty update function
<hkaiser> Yorlik: is that a release build? or debug?
<Yorlik> There is not much difference between them
<hkaiser> I doubt that
<Yorlik> frametime is about 73-100 ms
<hkaiser> there is usually a factor of 10 between them
<Yorlik> I get the same numbers with the update function being empty - I'm definitely burning time elsewhere. So I'm not really measuring memory bandwidth
<Yorlik> I'll double check the entire call chain
<Yorlik> OK - my frametime is down to 20-30 ms now and I don't know why. I had just played with some smallish things and I actually undid them, and the time stays low, which kinda corresponds to ~4.0 GB/sec. Now I'm scared
<Yorlik> Maybe the server shaped up just by me looking at it ... :D
hkaiser has quit [Quit: bye]
<Yorlik> Arrived at 4.9 GB/sec. Heap profiling "off" helps a lot ... :D
<zao> :D
nikunj has joined #ste||ar
<Yorlik> Still - it should be faster, imo :D
nikunj97 has joined #ste||ar
nikunj has quit [Ping timeout: 260 seconds]
nikunj97 has quit [Ping timeout: 260 seconds]
nikunj97 has joined #ste||ar
nikunj97 has quit [Ping timeout: 265 seconds]
nikunj97 has joined #ste||ar
<Yorlik> Is it possible to cancel a scheduled task? E.g. for a timer application?
<Yorlik> I mean a task that has yielded.
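The question goes unanswered in the log; there is no way shown here to forcibly kill a task that has already been scheduled, and the usual pattern is cooperative cancellation, where the task polls a shared flag. A minimal sketch, assuming only hpx::async and hpx::this_thread::yield (the flag and the loop body are illustrative):

    #include <hpx/include/async.hpp>
    #include <hpx/include/threads.hpp>
    #include <atomic>
    #include <memory>

    // Shared flag the timer task polls; setting it requests cancellation.
    auto cancelled = std::make_shared<std::atomic<bool>>(false);

    auto timer = hpx::async([cancelled] {
        while (!cancelled->load(std::memory_order_relaxed))
        {
            // ... perform one step of the timer work ...
            hpx::this_thread::yield();    // let other tasks run
        }
    });

    cancelled->store(true);    // request cancellation
    timer.get();               // wait for the task to observe the flag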
hkaiser has joined #ste||ar
<Yorlik> When I'm starting a lambda with hpx::async, does the return type of the lambda determine the template parameter of the future?
<Yorlik> Like hpx::future<lambda_rettype>
<Yorlik> NVM - figured it out
nikunj97 has quit [Ping timeout: 240 seconds]
nikunj has joined #ste||ar
<hkaiser> Yorlik: yes
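A minimal example confirming the deduction (the lambda and value are illustrative):

    #include <hpx/include/async.hpp>

    // The future's template parameter is deduced from the lambda's
    // return type: an int-returning lambda yields hpx::future<int>.
    hpx::future<int> f = hpx::async([] { return 42; });
    int value = f.get();    // 42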
nikunj has quit [Read error: Connection reset by peer]
nikunj has joined #ste||ar
nikunj has quit [Read error: Connection reset by peer]
nikunj97 has joined #ste||ar
nikunj97 has quit [Ping timeout: 268 seconds]
hkaiser has quit [Ping timeout: 240 seconds]
V|r has joined #ste||ar
hkaiser has joined #ste||ar