hkaiser changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/ | GSoD: https://developers.google.com/season-of-docs/
eschnett has joined #ste||ar
K-ballo1 has joined #ste||ar
K-ballo has quit [Ping timeout: 268 seconds]
K-ballo1 is now known as K-ballo
mdiers_ has quit [Remote host closed the connection]
mdiers_ has joined #ste||ar
nk__ has joined #ste||ar
nikunj has quit [Ping timeout: 252 seconds]
nk__ has quit [Ping timeout: 250 seconds]
nk__ has joined #ste||ar
nk__ has quit [Ping timeout: 246 seconds]
eschnett has quit [Quit: eschnett]
eschnett has joined #ste||ar
K-ballo1 has joined #ste||ar
K-ballo has quit [Ping timeout: 245 seconds]
K-ballo1 is now known as K-ballo
Yorlik has quit [Ping timeout: 245 seconds]
hkaiser has joined #ste||ar
diehlpk has joined #ste||ar
<diehlpk> hkaiser, we have scaling results; Octo-Tiger does not make PVFMM look good
<hkaiser> diehlpk: so pvfmm is worse than OT?
<diehlpk> Yes
<diehlpk> compute nodes: [1, 2, 4, 8, 16, 32]
<diehlpk> PVFMM [s]: [2.7171, 1.9655, 1.3204, 0.9433, 0.8271, 0.7105]
<diehlpk> O-T [s]: [0.8062, 0.5564, 0.3135, 0.17149, 0.1737, 0.1466]
<diehlpk> We do not scale due to the small problem size
<diehlpk> But we compared apples with oranges since we use different integration rules, data structures, and so on
rori has joined #ste||ar
<hkaiser> diehlpk: ok, will that be added to the paper?
<diehlpk> hkaiser, No, Gregor and Dirk wrote an answer to the committee and attached the plots
<diehlpk> I found a bug in Dominic's code after some intensive testing because the solid sphere was never tested in distributed runs. Gregor was able to fix the code and run the small benchmark
<diehlpk> So we do not have time to add it to the paper
<diehlpk> But at least that one reviewer has seen the comparison he wished for
<hkaiser> ok
<hkaiser> thanks for your effort!
<diehlpk> Yeah, Gregor, Dirk, and I did not sleep much the last two days to do this comparison, and we will see if they are happy with it
<hkaiser> I think so, all will be well
<diehlpk> I hope so; we are uploading the paper right now and then we are done with it
<hkaiser> nod, take a rest, you deserved it
<diehlpk> Need to prepare my 30-minute talk for Wed
<hkaiser> diehlpk: you can do that on the plane ;-)
<diehlpk> Yes, good point
K-ballo1 has joined #ste||ar
K-ballo has quit [Ping timeout: 246 seconds]
K-ballo1 has quit [Ping timeout: 258 seconds]
diehlpk has quit [Ping timeout: 252 seconds]
K-ballo has joined #ste||ar
Yorlik has joined #ste||ar
<Yorlik> hkaiser: yt?
<hkaiser> here
<Yorlik> I just had a very nice test - pleasant for HPX I think
<hkaiser> ok
<Yorlik> I was running my test program with 104 workers
<Yorlik> using HPX to schedule the tasks
<Yorlik> each slot was seen as a task, but not run as a task
<Yorlik> I made it so that each slot took at least 200us
<hkaiser> nod
<Yorlik> So - the theoretical minimum task was always 200+ us
<Yorlik> The measurements also showed it worked - the average was about 210us or so
<Yorlik> Then I let these workers - which are very dependent on each other - run in a buffer of 8192 items
<hkaiser> is that a lot?
<Yorlik> So each had about 80 slots of space on average
<Yorlik> It was 64 bytes of data per item - one cache line
<hkaiser> k
<Yorlik> So - 64*8192 bytes total buffer size
<hkaiser> not much data
<Yorlik> I made a time measurement INSIDE the work() method
<Yorlik> So in the end I had a total of the real work time spent
<Yorlik> Without any overhead
<Yorlik> I let it run for 2 hours
<Yorlik> And then divided the total sum of time spent in work() by the time the program was running
<Yorlik> this I used as an efficiency parameter
K-ballo1 has joined #ste||ar
<hkaiser> what did you get?
<Yorlik> What's the real work done compared to the total runtime
<Yorlik> I controlled the batch size to not exceed a certain limit and I forced a yield when it became too small
<Yorlik> Running with 6 worker threads I yielded an efficiency of 5.79
<hkaiser> nice
<Yorlik> That's a parallel efficiency of 96%
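A quick check of the arithmetic behind the two figures above, assuming the 5.79 is the ratio of summed work() time to wall-clock time (i.e. the effective speedup):

    speedup             ~ 5.79
    worker threads      = 6
    parallel efficiency = 5.79 / 6 ~ 0.965, i.e. roughly 96%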
K-ballo has quit [Ping timeout: 248 seconds]
K-ballo1 is now known as K-ballo
<Yorlik> And in an environment with extreme mutual dependency
<Yorlik> 104 workers running in a pipeline
<Yorlik> There's a lot of possibility to get in each others way
<Yorlik> I'm pretty happy with that result
<Yorlik> :)
<hkaiser> good
<hkaiser> I'm glad you are
<Yorlik> The big problem with the pipeline, I think, is that a task can only run ahead so far - depending on its predecessor in the queue
<Yorlik> That could horribly raise your kappa
<Yorlik> It would be interesting to do some measurements on a many core machine
<Yorlik> And see how the USL applies
<hkaiser> btw hpx has a perf counter (idle-rate, needs to be enabled at compile time) that should have given you the same information
<Yorlik> I'll get into perf counters later
<Yorlik> I think I'll now work a bit more on that data structure to make it as good and usable as I can
<hkaiser> but - nice result
<Yorlik> Yeah - I was afraid it would be abysmal - but this is nice
<hkaiser> now you should exit the task instead of yielding and start a new task whenever work is available
<Yorlik> Once I have it in a usable shape I'll work on instrumentation
<hkaiser> that would make things more dynamic as the number of workers would adapt itself to the amount of work
<Yorlik> The setup here is different
<Yorlik> Its a pipeline
<Yorlik> Not a parallel setup
<hkaiser> still
<Yorlik> But I want to have a possibility for parallel work inside a stage of the pipeline
<Yorlik> If I find a good way to do that, that would be the killer
<Yorlik> Because slow stages would autoscale
<hkaiser> so your pipeline has 104 stages?
<Yorlik> Yes
<hkaiser> nod
<Yorlik> one worker each
<hkaiser> ok
<Yorlik> 4 are the default and 100 are just for the measurement, to increase load
<Yorlik> It's parameterized - running a 1000-stage pipeline is easy
<Yorlik> just a number
<hkaiser> is that useful at all?
<Yorlik> I don't think so
<Yorlik> The main use case is a single-producer single-consumer queue
<Yorlik> Just to shovel data asap between threads
<Yorlik> And I can dynamically insert or remove clients
<Yorlik> So - hooking a logger in really quickly and removing it again is easy
<Yorlik> I might use it to coordinate between physics and gameplay updates in a frame
<Yorlik> or to shovel messages between updaters and the dispatcher
<Yorlik> Those are the main use cases for us
<Yorlik> I think I just got an idea for a next experiment
<Yorlik> And how to solve the task creation you mentioned
<Yorlik> Its actually easy
<Yorlik> Just need to make the instrumentation atomic now :)
<hkaiser> or use the existing hpx instrumentation ;-)
<Yorlik> How can I, in a for loop, add a bunch of futures to be waited on after the for loop?
<Yorlik> In the loop I'm launching a bunch of tasks I have to wait for
<Yorlik> What's the best way to do that ?
<Yorlik> Just vector of hpx::future?
<Yorlik> but hpx::future is a template right? Or is it type erased?
<Yorlik> Woops - I can just use future<void>
<Yorlik> I am getting an error for this:
<Yorlik> tasks.push_back( hpx::async( hpx::launch::async, &Client::work, item ) );
<Yorlik> probably because work is a member function and it's not an HPX object
<Yorlik> How can I run this strictly locally again for a test?
K-ballo1 has joined #ste||ar
K-ballo has quit [Ping timeout: 272 seconds]
K-ballo1 is now known as K-ballo
rori has quit [Quit: bye]
<hkaiser> Yorlik: this should actually work, if anything you can use async(hpx::util::bind(&Client::work, item)); or an equivalent solution involving a lambda
<Yorlik> I think I need CRTP - the call is inside the runner and work is an abstract member
<Yorlik> So I created a CRTP interface like this:
<hkaiser> should work anyways
<hkaiser> try passing (&Client::work, &item)
<hkaiser> either way should be fine, (if 'item' is copyable)
<hkaiser> if it's not copyable do async(&Client::work, std::ref(item))
<Yorlik> Don't I have to pass 'this'?
<hkaiser> well, I thought 'item' was the object to use, i.e. item.work()
<Yorlik> work is a member function
<hkaiser> of item?
<Yorlik> no - of Client
<hkaiser> what type is item?
<Yorlik> the workers inherit from Client
<Yorlik> item is just the data in the buffer
<hkaiser> or is item to be passed to work(...)?
<Yorlik> Yes
<hkaiser> ahh
<hkaiser> then you need to supply the 'this' of the Client object you want to invoke 'work' on
<Yorlik> it seems this compiles - futures.push_back( hpx::async( &Client::work, this, std::ref(item) ) );
<hkaiser> yes, that's it
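A minimal sketch that pulls this exchange together (collecting futures from async invocations of a member function). Client, Item, and run_batch are stand-ins for Yorlik's actual types, and the header names are from memory and may differ between HPX versions:

    #include <hpx/include/async.hpp>
    #include <hpx/include/lcos.hpp>     // hpx::future

    #include <functional>               // std::ref
    #include <vector>

    struct Item { char data[64]; };     // one 64-byte buffer slot

    struct Client
    {
        void work(Item&) { /* process one slot */ }

        void run_batch(std::vector<Item>& items)
        {
            std::vector<hpx::future<void>> futures;
            futures.reserve(items.size());

            for (Item& item : items)
            {
                // member-function pointer, then the object ('this'), then the argument;
                // 'item' is passed by reference, so it must outlive the task
                futures.push_back(hpx::async(&Client::work, this, std::ref(item)));
            }

            // waiting on the collected futures is discussed a bit further down
        }
    };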
<Yorlik> Now I have to work on a last error
<Yorlik> And remove that dreaded CRTP :D
<Yorlik> Though I can use it to save one indirection
<hkaiser> this way however you have to make sure 'item' outlives the invocation of work()
<Yorlik> And since we're in a tight loop - maybe I keep it
<Yorlik> It will always
<Yorlik> Why does visual studio crash when something starts working ? :D
<zao> Only one of you and VS can be working at the same time.
<zao> Also, you're using sophisticated C++, it's supposed to kill tools :)
<Yorlik> XD
<Yorlik> It's so far over my head, and VS smells it and uses it to stab me in the back
<Yorlik> hkaiser: It seems to work now - I have run a longer test to check the impact on efficiency.
<hkaiser> Yorlik: as long as your workload is ~200us you will not see any impact
<hkaiser> 200us or more
<Yorlik> It's a bit of an atypical workload I think
<Yorlik> Because the clients can step on each others toes
<Yorlik> I expect an impact when varying the runtimes of the clients
<Yorlik> Like slow and fast clients between 200 and 400 us or so
<Yorlik> Because they will start to bump against each other's section of the buffer
<Yorlik> But HPX will probably mitigate that a lot
<Yorlik> because now I have autoscaling
<Yorlik> I need to test
<Yorlik> In a USL model I'd expect the kappa value to change a lot depending on the variation of the item times per worker and the size of the buffer
<Yorlik> The "bumping into each other" could be seen as a form of crosstalk, I believe.
<Yorlik> Seems with taskifying I introduced new bugs - time to fix ...
<hkaiser> Yorlik: btw, for running a predefined number of tasks you could run a parallel::for_each() instead of running separate tasks - especially if you are not interested in separate result values from those tasks
<Yorlik> Makes sense
<hkaiser> (I'm only reading back now - wrt your question of how to create a bunch of tasks)
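A hedged sketch of that suggestion, reusing the hypothetical Client/Item names from the sketch further up; the header name is a guess and may vary across HPX versions:

    #include <hpx/include/parallel_for_each.hpp>

    // inside Client, replacing the async()/push_back loop from the earlier sketch:
    // one parallel algorithm call instead of N tasks with N futures;
    // fork-join: the call returns only once all iterations are done
    hpx::parallel::for_each(
        hpx::parallel::execution::par,
        items.begin(), items.end(),
        [this](Item& item) { work(item); });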
<Yorlik> It seems to work, but there is some odd behavior - I think I might have introduced bugs, maybe even a race
<Yorlik> I do not yet fully understand what's going on.
<Yorlik> Oh - got it - a race
<Yorlik> Code that was running single-threaded is now taskified - my instrumentation counter needs to become atomic
nk__ has joined #ste||ar
<Yorlik> Can I query the hpx:threads parameter from inside the program?
<hkaiser> Yorlik: what parameters?
<Yorlik> The hpx::threads
<Yorlik> I'd like to compute the thread efficiency in the output
<hkaiser> I'm not sure I understand
<Yorlik> I start the program with --hpx:threads=6
<Yorlik> I want to get the 6 inside my code
<Yorlik> BTW: I have big hopes for the results of the GSOC
<Yorlik> E.g. I didn't find any doc on the parallel for
<Yorlik> It was just mentioned in some release notes
<Yorlik> I often fall back to looking stuff up in my doxygen
<hkaiser> ahh
<hkaiser> hpx::get_num_os_threads()
<Yorlik> And that's just the workers, right?
<Yorlik> NOt IO OR ANYTHING?
<Yorlik> Woops, caps
<hkaiser> except it's in hpx::parallel namespace
<Yorlik> Sweet :)
<Yorlik> I like the standard conformance of HPX a lot
<hkaiser> num_os_threads is just the number of worker threads
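A small sketch of pulling that number into the efficiency calculation. The function name and namespace are taken verbatim from hkaiser's answer above and may differ in other HPX versions; the umbrella header <hpx/hpx.hpp> is used since the exact header is unclear:

    #include <hpx/hpx.hpp>

    #include <cstddef>
    #include <iostream>

    void report_efficiency(double measured_speedup)
    {
        // number of HPX worker (OS) threads, i.e. the value given via --hpx:threads
        std::size_t const num_threads = hpx::parallel::get_num_os_threads();

        std::cout << "parallel efficiency: "
                  << measured_speedup / static_cast<double>(num_threads) << '\n';
    }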
<Yorlik> is parallel for considered faster / less overhead?
<hkaiser> depends on what you want to do
<Yorlik> as opposed to manually making tasks
<hkaiser> it has less overhead compared to launching N tasks with N futures
<hkaiser> but it's fork-join
<Yorlik> I sent you a git link in PM
<hkaiser> obviously
<Yorlik> to the central part
<Yorlik> You might wanna have a look there
* Yorlik looks really cute all of a sudden :)
<hkaiser> Yorlik: calling wait on the futures can be done better with wait_all(futures)
<Yorlik> My efficiency has gone down dramatically with taskifying. I just finished a 20 minute test: https://gitlab.com/arcanimanext/arcanimalibs/disruptor/blob/y_wip/src/include/disruptor_hpx.hpp#L479
<Yorlik> woops
<hkaiser> in any case, future::wait and hpx::wait_all do not rethrow exceptions, if you need those you will have to call future::get on all futures
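A short illustration of both points, continuing the run_batch() sketch from earlier and assuming 'futures' is the std::vector<hpx::future<void>> built there; the header name is again a guess:

    #include <hpx/include/lcos.hpp>     // hpx::wait_all

    // at the end of run_batch() from the earlier sketch:

    // block until every task has finished; exceptions stay stored in the futures
    hpx::wait_all(futures);

    // to surface stored exceptions, call get() on each future afterwards
    for (auto& f : futures)
        f.get();                        // rethrows if that particular task failed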
<Yorlik> Efficiency = ~1.03 times faster than serial processing.
<Yorlik> Used to be 5+
<Yorlik> So - efficiency has suffered a great deal from my changes
<hkaiser> too many tasks, too little work
<Yorlik> The tasks are 200+us each
<Yorlik> I need to check my measurements - there might be dragons hidden
<hkaiser> use /threads/count/cumulative and /threads/time/average perf counters
<Yorlik> average is the task average?
<hkaiser> also, enable /threads/idle-rate (add -DHPX_WITH_THREAD_IDLE_RATES=On to cmake)
<hkaiser> those will give you the information you need
<Yorlik> OK - rebuilding HPX ..
<Yorlik> Time to read perf counter docs ..
* Yorlik is in the toy shop ..
<Yorlik> Thanks !!
eschnett has quit [Quit: eschnett]
hkaiser has quit [Quit: bye]
<Yorlik> This explodes: hpx::performance_counters::performance_counter average_thread_counter ("/threads/time/average");
<Yorlik> Seems I missed the locality
<Yorlik> Argh - compiled wrongly
<Yorlik> Recompiling ... the punishment ...
<Yorlik> By the end of the century I will be an HPX expert :D
eschnett has joined #ste||ar
K-ballo1 has joined #ste||ar
K-ballo has quit [Ping timeout: 246 seconds]
K-ballo1 is now known as K-ballo
nk__ has quit [Remote host closed the connection]
nikunj has joined #ste||ar
<Yorlik> The counter "/threads{locality#0/total}/count/cumulative" should give me all time spent in my local tasks, is that correct?
<Yorlik> In nanoseconds
<Yorlik> And this should give me seconds: cumulative_thread_counter.get_value<double>().get()/1000000000
<Yorlik> correct?
<Yorlik> I'm asking because I'm getting implausible values
hkaiser has joined #ste||ar
<Yorlik> hkaiser: I got the counters running, but I am getting weird results:
<Yorlik> MAIN: Elapsed 65 of 120 seconds. cum: 0.000872788 s., avg: 0.000258611 s
<Yorlik> I am dividing the counters by 1e9 to get seconds
nikunj has quit [Ping timeout: 276 seconds]
<Yorlik> assuming they are in ns
nikunj has joined #ste||ar
<hkaiser> I think they are
<hkaiser> the cumulative counter gives you thread counts, the other one average execution time
<Yorlik> Ohhh
<Yorlik> That makes more sense - I was misunderstanding the thread counter then
<Yorlik> So - it's like the number of tasks ever launched and finished?
<hkaiser> yes
<Yorlik> Makes sense
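A sketch of reading the two counters as clarified above. The counter class and get_value<double>().get() usage are taken from this log; the "/threads{locality#0/total}/time/average" instance name is extrapolated from the "/threads{locality#0/total}/count/cumulative" form used earlier, and the header name is an assumption:

    #include <hpx/include/performance_counters.hpp>

    void report_thread_counters()
    {
        using hpx::performance_counters::performance_counter;

        // cumulative number of HPX threads (tasks) executed so far on locality 0
        performance_counter count_counter("/threads{locality#0/total}/count/cumulative");
        double const tasks_executed = count_counter.get_value<double>().get();

        // average per-task execution time; values appear to be reported in nanoseconds
        performance_counter avg_counter("/threads{locality#0/total}/time/average");
        double const avg_task_time_s = avg_counter.get_value<double>().get() / 1e9;

        (void) tasks_executed;
        (void) avg_task_time_s;
    }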
<Yorlik> What would be the best way to trace a function or an instance of a member function invocation?
<Yorlik> Like measuring a single worker
<hkaiser> what do you mean by 'trace'?
<Yorlik> The workers are instances of a class
<Yorlik> they invoke their worker function
<Yorlik> I want to know the timings and numbers for each instance of my Client Class
<hkaiser> that you have to do yourself, but you could create your own perf counter and use the min/max counters to give you the extrema
<Yorlik> because I will give them different setups
<Yorlik> OK
<Yorlik> I'll check that out
<Yorlik> I was pretty impressed skimming over the amount of stuff already in the system.
<Yorlik> Time to spam useless statistics ;)
<hkaiser> those are /arithmetics/min and /arithmetics/max
<Yorlik> I'll work on that ranged for first
<Yorlik> Then counters
<Yorlik> I have a feeling I am creating way too many tasks
<hkaiser> I'm sure of that
<Yorlik> I wonder how I would convert this:
<Yorlik> for ( size_t slot = cur_counter + 1; slot <= ( cur_batch_size > 20 ? cur_counter + 20 : current_limit_ ); slot++ )
<Yorlik> into a parallel loop
<hkaiser> use parallel::for_loop()
<Yorlik> :)
<Yorlik> Do I need a special header for this?
<Yorlik> OK found it
K-ballo1 has joined #ste||ar
<Yorlik> hkaiser: For some reason VS suggests only hpx::parallel::v1 ... but the docs say v2 ... what's that about?
K-ballo has quit [Ping timeout: 245 seconds]
K-ballo1 is now known as K-ballo
<hkaiser> ignore the ::v1/::v2, those are inline namespaces that disappear
<Yorlik> OK
<Yorlik> intellisense just seems to need more time scanning then
K-ballo1 has joined #ste||ar
<Yorlik> Is there any ETA for these?
<Yorlik> 2023?
<hkaiser> not sure about the parallelism TS2, K-ballo do you know that?
<hkaiser> K-ballo1: ^^
<hkaiser> Yorlik: it's not in c++20, I think
<Yorlik> do they accept any callable or does it have to be a lambda?
<hkaiser> any compatible callable
<Yorlik> compatible as in taking an int?
K-ballo has quit [Ping timeout: 245 seconds]
K-ballo1 is now known as K-ballo
<hkaiser> compatible as in taking whatever loop variables you use
<Yorlik> What execution policy should I choose? It's one of hpx::parallel::execution:: ... right?
<hkaiser> hpx::parallel::execution::par
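A hedged sketch of how the loop quoted above might look as an hpx::parallel::for_loop. The variables (cur_counter, cur_batch_size, current_limit_) come from the quoted loop; note that for_loop takes a half-open range [first, last) while the original upper bound is inclusive, and the header name may differ between HPX versions:

    #include <hpx/include/parallel_for_loop.hpp>

    // original (from above):
    //   for (size_t slot = cur_counter + 1;
    //        slot <= (cur_batch_size > 20 ? cur_counter + 20 : current_limit_);
    //        slot++) { ... }

    std::size_t const first = cur_counter + 1;
    std::size_t const last =
        (cur_batch_size > 20 ? cur_counter + 20 : current_limit_) + 1;  // half-open

    hpx::parallel::for_loop(
        hpx::parallel::execution::par,    // parallel execution policy
        first, last,
        [&](std::size_t slot) {
            // body of the original loop; iterations must now be safe to run
            // concurrently for different values of slot
        });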
<hkaiser> gtg
hkaiser has quit [Quit: bye]
<Yorlik> O/
nikunj has quit [Remote host closed the connection]
<Yorlik> Can't believe I finally figured this out ... holy parallel cow loop...
<Yorlik> Someone needs to lock hkaiser in to finally write the HPX bible ...