hkaiser changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/ | GSoD: https://developers.google.com/season-of-docs/
eschnett has joined #ste||ar
K-ballo1 has joined #ste||ar
K-ballo has quit [Ping timeout: 268 seconds]
K-ballo1 is now known as K-ballo
mdiers_ has quit [Remote host closed the connection]
mdiers_ has joined #ste||ar
nk__ has joined #ste||ar
nikunj has quit [Ping timeout: 252 seconds]
nk__ has quit [Ping timeout: 250 seconds]
nk__ has joined #ste||ar
nk__ has quit [Ping timeout: 246 seconds]
eschnett has quit [Quit: eschnett]
eschnett has joined #ste||ar
K-ballo1 has joined #ste||ar
K-ballo has quit [Ping timeout: 245 seconds]
K-ballo1 is now known as K-ballo
Yorlik has quit [Ping timeout: 245 seconds]
hkaiser has joined #ste||ar
diehlpk has joined #ste||ar
<diehlpk> hkaiser, we have scaling results; Octo-Tiger does not make PVFMM look good
<hkaiser> diehlpk: so pvfmm is worse than OT?
<diehlpk> Yes
<diehlpk> compute nodes: [1, 2, 4, 8, 16, 32]
<diehlpk> PVFMM [s]: [2.7171, 1.9655, 1.3204, 0.9433, 0.8271, 0.7105]
<diehlpk> O-T [s]: [0.8062, 0.5564, 0.3135, 0.17149, 0.1737, 0.1466]
<diehlpk> We do not scale due to the small problem size
<diehlpk> But we compared apples with oranges since we use different integration rules, data structures, and so on
rori has joined #ste||ar
<hkaiser> diehlpk: ok, will that be added to the paper?
<diehlpk> hkaiser, No, Gregor and Dirk wrote an answer to the committee and attached the plots
<diehlpk> I found a bug in Dominic's code after some intensive testing because the solid sphere was never tested in distributed runs. Gregor was able to fix the code and run the small benchmark
<diehlpk> So we do not have time to add it to the paper
<diehlpk> But at least that one reviewer has seen the comparison he wished for
<hkaiser> ok
<hkaiser> thanks for your effort!
<diehlpk> Yeah, Gregor, Dirk, and I did not sleep much the last two days to do this comparison, and we will see if they are happy with it
<hkaiser> I think so, all will be well
<diehlpk> I hope so; we are uploading the paper right now and then we are done with it
<hkaiser> nod, take a rest, you deserved it
<diehlpk> Need to prepare my 30-minute talk for Wed
<hkaiser> diehlpk: you can do that on the plane ;-)
<diehlpk> Yes, good point
K-ballo1 has joined #ste||ar
K-ballo has quit [Ping timeout: 246 seconds]
K-ballo1 has quit [Ping timeout: 258 seconds]
diehlpk has quit [Ping timeout: 252 seconds]
K-ballo has joined #ste||ar
Yorlik has joined #ste||ar
<Yorlik> hkaiser: yt?
<hkaiser> here
<Yorlik> I just had a very nice test - pleasant for HPX I think
<hkaiser> ok
<Yorlik> I was running my test program with 104 workers
<Yorlik> using HPX to schedule the tasks
<Yorlik> each slot was seen as a task, but not run as a task
<Yorlik> I made it so that each slot took at least 200us
<hkaiser> nod
<Yorlik> So - the theoretical minimum task was always 200+ us
<Yorlik> The measurements also showed it worked - the average was about 210us or so
<Yorlik> Then I let these workers - which are very dependent on each other - run in a buffer of 8192 items
<hkaiser> is that a lot?
<Yorlik> So each had about 80 slots of space on average
<Yorlik> It was 64 bytes of data per item - one cache line
<hkaiser> k
<Yorlik> So - 64*8192 bytes total buffer size
<hkaiser> not much data
<Yorlik> I made a time measurement INSIDE the work() method
<Yorlik> So in the end I had a total of the real work time spent
<Yorlik> Without any overhead
<Yorlik> I let it run for 2 hours
<Yorlik> And then divided the total sum of time spent in work() by the time the program was running
<Yorlik> this I used as an efficiency parameter
K-ballo1 has joined #ste||ar
<hkaiser> what did you get?
<Yorlik> What's the real work done compared to the total runtime
<Yorlik> I controlled the batch size to not exceed a certain limit and I forced a yield when it became too small
<Yorlik> Running with 6 worker threads I yielded an efficiency of 5.79
<hkaiser> nice
<Yorlik> That's a parallel efficiency of 96%
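A quick check of the arithmetic behind the two figures above, assuming the 5.79 is the ratio of summed work() time to wall-clock time (i.e. the effective speedup):

    speedup             ~ 5.79
    worker threads      = 6
    parallel efficiency = 5.79 / 6 ~ 0.965, i.e. roughly 96%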
K-ballo has quit [Ping timeout: 248 seconds]
K-ballo1 is now known as K-ballo
<Yorlik> And in an environment with extreme mutual dependency
<Yorlik> 104 workers running in a pipeline
<Yorlik> There's a lot of possibility to get in each others way
<Yorlik> I'm pretty happy with that result
<Yorlik> :)
<hkaiser> good
<hkaiser> I'm glad you are
<Yorlik> The big problem with the pipeline, I think, is that a task can only run ahead so far - depending on its predecessor in the queue
<Yorlik> That could horribly raise your kappa
<Yorlik> It would be interesting to do some measurements on a many core machine
<Yorlik> And see how the USL applies
<hkaiser> btw hpx has a perf counter (idle-rate, needs to be enabled at compile time) that should have given you the same information
<Yorlik> I'll get into perf counters later
<Yorlik> I think I'll now work a bit more on that data structure to make it as good and usable as I can
<hkaiser> but - nice result
<Yorlik> Yeah - I was afraid it would be abysmal - but this is nice
<hkaiser> now you should exit the task instead of yielding and start a new task whenever work is available
<Yorlik> Once I have it in a usable shape I'll work on instrumentation
<hkaiser> that would make things more dynamic as the number of workers would adapt itself to the amount of work
<Yorlik> The setup here is different
<Yorlik> Its a pipeline
<Yorlik> Not a parallel setup
<hkaiser> still
<Yorlik> But I want to have a possibility for parallel work inside a stage of the pipeline
<Yorlik> If I find a good way to do that, that would be the killer
<Yorlik> Because slow stages would autoscale
<hkaiser> so your pipeline has 104 stages?
<Yorlik> Yes
<hkaiser> nod
<Yorlik> one worker each
<hkaiser> ok
<Yorlik> 4 are the default and 100 are just for the measurement, to increase load
<Yorlik> It's parameterized - running a 1000-stage pipeline is easy
<Yorlik> just a number
<hkaiser> is that useful at all?
<Yorlik> I don't think so
<Yorlik> The main use case is a single-producer single-consumer queue
<Yorlik> Just to shovel data asap between threads
<Yorlik> And I can dynamically insert or remove clients
<Yorlik> So - hooking a logger in really quickly and removing it again is easy
<Yorlik> I might use it to coordinate between physics and gameplay updates in a frame
<Yorlik> or to shovel messages between updaters and the dispatcher
<Yorlik> Those are the main use cases for us
<Yorlik> I think I just got an idea for a next experiment
<Yorlik> And how to solve the task creation you mentioned
<Yorlik> Its actually easy
<Yorlik> Just need to make the instrumentation atomic now :)
<hkaiser> or use the existing hpx instrumentation ;-)
<Yorlik> How can I, in a for loop, add a bunch of futures to be waited on after the for loop?
<Yorlik> In the loop I'm launching a bunch of tasks I have to wait for
<Yorlik> What's the best way to do that ?
<Yorlik> Just vector of hpx::future?
<Yorlik> but hpx::future is a template right? Or is it type erased?
<Yorlik> Woops - I can just use future<void>
<Yorlik> I am getting an error for this:
<Yorlik> tasks.push_back( hpx::async( hpx::launch::async, &Client::work, item ) );
<Yorlik> probably because work is a member function and it's not an HPX object
<Yorlik> How can I run this strictly locally again for a test?
K-ballo1 has joined #ste||ar
K-ballo has quit [Ping timeout: 272 seconds]
K-ballo1 is now known as K-ballo
rori has quit [Quit: bye]
<hkaiser> Yorlik: this should actually work, if anything you can use async(hpx::util::bind(&Client::work, item)); or an equivalent solution involving a lambda
<Yorlik> I think I need CRTP - the call is inside the runner and work is an abstract member
<Yorlik> So I created a CRTP interface like this:
<hkaiser> should work anyways
<hkaiser> try passing (&Client::work, &item)
<hkaiser> either way should be fine, (if 'item' is copyable)
<hkaiser> if it's not copyable do async(&Client::work, std::ref(item))
<Yorlik> Don't I have to pass 'this'?
<hkaiser> well, I thought 'item' was the object to use, i.e. item.work()
<Yorlik> work is a member function
<hkaiser> of item?
<Yorlik> no - of Client
<hkaiser> what type is item?
<Yorlik> the workers inherit from Client
<Yorlik> item is just the data in the buffer
<hkaiser> or is item to be passed to work(...)?
<Yorlik> Yes
<hkaiser> ahh
<hkaiser> then you need to supply the 'this' of the Client object you want to invoke 'work' on
<Yorlik> it seems this compiles - futures.push_back( hpx::async( &Client::work, this, std::ref(item) ) );
<hkaiser> yes, that's it
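A minimal sketch that pulls this exchange together (collecting futures from async invocations of a member function). Client, Item, and run_batch are stand-ins for Yorlik's actual types, and the header names are from memory and may differ between HPX versions:

    #include <hpx/include/async.hpp>
    #include <hpx/include/lcos.hpp>     // hpx::future

    #include <functional>               // std::ref
    #include <vector>

    struct Item { char data[64]; };     // one 64-byte buffer slot

    struct Client
    {
        void work(Item&) { /* process one slot */ }

        void run_batch(std::vector<Item>& items)
        {
            std::vector<hpx::future<void>> futures;
            futures.reserve(items.size());

            for (Item& item : items)
            {
                // member-function pointer, then the object ('this'), then the argument;
                // 'item' is passed by reference, so it must outlive the task
                futures.push_back(hpx::async(&Client::work, this, std::ref(item)));
            }

            // waiting on the collected futures is discussed a bit further down
        }
    };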
<Yorlik> Now I have to work on a last error
<Yorlik> And remove that dreaded CRTP :D
<Yorlik> Though I can use it to save one indirection
<hkaiser> this way however you have to make sure 'item' outlives the invocation of work()
<Yorlik> And since we're in a tight loop - maybe I keep it
<Yorlik> It will always
<Yorlik> Why does visual studio crash when something starts working ? :D
<zao> Only one of you and VS can be working at the same time.
<zao> Also, you're using sophisticated C++, it's supposed to kill tools :)
<Yorlik> XD
<Yorlik> It's so far over my head, and VS smells it and uses it to stab me in the back
<Yorlik> hkaiser: It seems to work now - I have run a longer test to check the impact on efficiency.
<hkaiser> Yorlik: as long as your workload is ~200us you will not see any impact
<hkaiser> 200us or more
<Yorlik> It's a bit of an atypical workload I think
<Yorlik> Because the clients can step on each others toes
<Yorlik> I expect an impact when varying the runtimes of the clients
<Yorlik> Like slow and fast clients between 200 and 400 us or so
<Yorlik> Because they will start to bump against each other's section of the buffer
<Yorlik> But HPX will probably mitigate that a lot
<Yorlik> because now I have autoscaling
<Yorlik> I need to test
<Yorlik> In a USL model I'd expect the kappa value to change a lot depending on the variation of the item times per worker and the size of the buffer
<Yorlik> The "bumping into each other" could be seen as a form of crosstalk, I believe.
<Yorlik> Seems with taskifying I introduced new bugs - time to fix ...
<hkaiser> Yorlik: btw, for running a predefined number of tasks you could run a parallel::for_each() instead of running separate tasks - especially if you are not interested in separate result values from those tasks
<Yorlik> Makes sense
<hkaiser> (I'm only reading back now - wrt your question of how to create a bunch of tasks)
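A hedged sketch of that suggestion, reusing the hypothetical Client/Item names from the sketch further up; the header name is a guess and may vary across HPX versions:

    #include <hpx/include/parallel_for_each.hpp>

    // inside Client, replacing the async()/push_back loop from the earlier sketch:
    // one parallel algorithm call instead of N tasks with N futures;
    // fork-join: the call returns only once all iterations are done
    hpx::parallel::for_each(
        hpx::parallel::execution::par,
        items.begin(), items.end(),
        [this](Item& item) { work(item); });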
<Yorlik> It seems to work, but there is some odd behavior - I think I might have introduced bugs, maybe even a race
<Yorlik> I do not yet fully understand what's going on.
<Yorlik> Oh - got it - a race
<Yorlik> Code that was running single-threaded is now taskified - my instrumentation counter needs to become atomic
nk__ has joined #ste||ar
<Yorlik> Can I query the hpx:threads parameter from inside the program?
<hkaiser> Yorlik: what parameters?
<Yorlik> The hpx::threads
<Yorlik> I'd like to compute the thread efficiency in the output
<hkaiser> I'm not sure I understand
<Yorlik> I start the program with --hpx:threads=6
<Yorlik> I want to get the 6 inside my code
<Yorlik> BTW: I have big hopes for the results of the GSOC
<Yorlik> E.g. I didn't find any doc on the parallel for
<Yorlik> It was just mentioned in some release notes
<Yorlik> I often fall back to looking stuff up in my doxygen
<hkaiser> ahh
<hkaiser> hpx::get_num_os_threads()
<Yorlik> And that's just the workers, right?
<Yorlik> NOt IO OR ANYTHING?
<Yorlik> Woops, caps
<hkaiser> except it's in hpx::parallel namespace
<Yorlik> Sweet :)
<Yorlik> I like the standard conformance of HPX a lot
<hkaiser> num_os_threads is just the number of worker threads
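A small sketch of pulling that number into the efficiency calculation. The function name and namespace are taken verbatim from hkaiser's answer above and may differ in other HPX versions; the umbrella header <hpx/hpx.hpp> is used since the exact header is unclear:

    #include <hpx/hpx.hpp>

    #include <cstddef>
    #include <iostream>

    void report_efficiency(double measured_speedup)
    {
        // number of HPX worker (OS) threads, i.e. the value given via --hpx:threads
        std::size_t const num_threads = hpx::parallel::get_num_os_threads();

        std::cout << "parallel efficiency: "
                  << measured_speedup / static_cast<double>(num_threads) << '\n';
    }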
<Yorlik> is parallel for considered faster / less overhead?
<hkaiser> depends on what you want to do
<Yorlik> as opposed to manually making tasks
<hkaiser> it has less overhead compared to launching N tasks with N futures
<hkaiser> but it's fork-join
<Yorlik> I sent you a git link in PM
<hkaiser> obviously
<Yorlik> to the central part
<Yorlik> You might wanna have a look there
* Yorlik looks really cute all of a sudden :)
<hkaiser> Yorlik: calling wait on the futures can be done better with wait_all(futures)
<Yorlik> My efficiency has gone down dramatically with taskifying. I just finished a 20 minute test: https://gitlab.com/arcanimanext/arcanimalibs/disruptor/blob/y_wip/src/include/disruptor_hpx.hpp#L479
<Yorlik> woops
<hkaiser> in any case, future::wait and hpx::wait_all do not rethrow exceptions, if you need those you will have to call future::get on all futures
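A short illustration of both points, continuing the run_batch() sketch from earlier and assuming 'futures' is the std::vector<hpx::future<void>> built there; the header name is again a guess:

    #include <hpx/include/lcos.hpp>     // hpx::wait_all

    // at the end of run_batch() from the earlier sketch:

    // block until every task has finished; exceptions stay stored in the futures
    hpx::wait_all(futures);

    // to surface stored exceptions, call get() on each future afterwards
    for (auto& f : futures)
        f.get();                        // rethrows if that particular task failed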
<Yorlik> Efficiency = ~1.03 times faster than serial processing.
<Yorlik> Used to be 5+
<Yorlik> So - efficiency has suffered a great deal from my changes
<hkaiser> too many tasks, too little work
<Yorlik> The tasks are 200+us each
<Yorlik> I need to check my measurements - there might be dragons hidden
<hkaiser> use /threads/count/cumulative and /threads/time/average perf counters
<Yorlik> average is the task average?
<hkaiser> also, enable /threads/idle-rate (add -DHPX_WITH_THREAD_IDLE_RATES=On to cmake)
<hkaiser> those will give you the information you need
<Yorlik> OK - rebuilding HPX ..
<Yorlik> Time to read perf counter docs ..
* Yorlik is in the toy shop ..
<Yorlik> Thanks !!
eschnett has quit [Quit: eschnett]
hkaiser has quit [Quit: bye]
<Yorlik> This explodes: hpx::performance_counters::performance_counter average_thread_counter ("/threads/time/average");
<Yorlik> Seems I missed the locality
<Yorlik> Argh - compiled wrongly
<Yorlik> Recompiling ... the punishment ...
<Yorlik> By the end of the century I will be an HPX expert :D
eschnett has joined #ste||ar
K-ballo1 has joined #ste||ar
K-ballo has quit [Ping timeout: 246 seconds]
K-ballo1 is now known as K-ballo
nk__ has quit [Remote host closed the connection]
nikunj has joined #ste||ar
<Yorlik> The counter "/threads{locality#0/total}/count/cumulative" should give me all time spent in my local tasks, is that correct?
<Yorlik> In nanoseconds
<Yorlik> And this should give me seconds: cumulative_thread_counter.get_value<double>().get()/1000000000
<Yorlik> correct?
<Yorlik> I'm asking because I'm getting implausible values
hkaiser has joined #ste||ar
<Yorlik> hkaiser: I got the counters running, but I am getting weird results:
<Yorlik> MAIN: Elapsed 65 of 120 seconds. cum: 0.000872788 s., avg: 0.000258611 s
<Yorlik> I am dividing the counters by 1e9 to get seconds
nikunj has quit [Ping timeout: 276 seconds]
<Yorlik> assuming they are in ns
nikunj has joined #ste||ar
<hkaiser> I think they are
<hkaiser> the cumulative counter gives you thread counts, the other one average execution time
<Yorlik> Ohhh
<Yorlik> That makes more sense - I was misunderstanding the thread counter then
<Yorlik> So - it's like the number of tasks ever launched and finished?
<hkaiser> yes
<Yorlik> Makes sense
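A sketch of reading the two counters as clarified above. The counter class and get_value<double>().get() usage are taken from this log; the "/threads{locality#0/total}/time/average" instance name is extrapolated from the "/threads{locality#0/total}/count/cumulative" form used earlier, and the header name is an assumption:

    #include <hpx/include/performance_counters.hpp>

    void report_thread_counters()
    {
        using hpx::performance_counters::performance_counter;

        // cumulative number of HPX threads (tasks) executed so far on locality 0
        performance_counter count_counter("/threads{locality#0/total}/count/cumulative");
        double const tasks_executed = count_counter.get_value<double>().get();

        // average per-task execution time; values appear to be reported in nanoseconds
        performance_counter avg_counter("/threads{locality#0/total}/time/average");
        double const avg_task_time_s = avg_counter.get_value<double>().get() / 1e9;

        (void) tasks_executed;
        (void) avg_task_time_s;
    }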
<Yorlik> What would be the best way to trace a function or an instance of a member function invocation?
<Yorlik> Like measuring a single worker
<hkaiser> what do you mean by 'trace'?
<Yorlik> The workers are instances of a class
<Yorlik> they invoke their worker function
<Yorlik> I want to know the timings and numbers for each instance of my Client Class
<hkaiser> that you have to do yourself, but you could create your own perf counter and use the min/max counters to give you the extrema
<Yorlik> because I will give them different setups
<Yorlik> OK
<Yorlik> I'll check that out
<Yorlik> I was pretty impressed skimming over the amount of stuff already in the system.
<Yorlik> Time to spam useless statistics ;)
<hkaiser> those are /arithmetics/min and /arithmetics/max
<Yorlik> I'll work on that ranged for first
<Yorlik> Then counters
<Yorlik> I have a feeling I am creating way too many tasks
<hkaiser> I'm sure of that
<Yorlik> I wonder how I would convert this:
<Yorlik> for ( size_t slot = cur_counter + 1; slot <= ( cur_batch_size > 20 ? cur_counter + 20 : current_limit_ ); slot++ )
<Yorlik> into a parallel loop
<hkaiser> use parallel::for_loop()
<Yorlik> :)
<Yorlik> Do I need a special header for this?
<Yorlik> OK found it
K-ballo1 has joined #ste||ar
<Yorlik> hkaiser: For some reason VS suggests only hpx::parallel::v1 ... but the docs say v2 ... what's that about?
K-ballo has quit [Ping timeout: 245 seconds]
K-ballo1 is now known as K-ballo
<hkaiser> ignore the ::v1/::v2, those are inline namespaces that disappear
<Yorlik> OK
<Yorlik> intellisense just seems to need more time scanning then
K-ballo1 has joined #ste||ar
<Yorlik> Is there any ETA for these?
<Yorlik> 2023?
<hkaiser> not sure about the parallelism TS2, K-ballo do you know that?
<hkaiser> K-ballo1: ^^
<hkaiser> Yorlik: it's not in c++20, I think
<Yorlik> do they accept any callable or does it have to be a lambda?
<hkaiser> any compatible callable
<Yorlik> compatible as in taking an int?
K-ballo has quit [Ping timeout: 245 seconds]
K-ballo1 is now known as K-ballo
<hkaiser> compatible as in taking whatever loop variables you use
<Yorlik> What execution policy should I choose? It's one of hpx::parallel::execution:: ... right?
<hkaiser> hpx::parallel::execution::par
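A hedged sketch of how the loop quoted above might look as an hpx::parallel::for_loop. The variables (cur_counter, cur_batch_size, current_limit_) come from the quoted loop; note that for_loop takes a half-open range [first, last) while the original upper bound is inclusive, and the header name may differ between HPX versions:

    #include <hpx/include/parallel_for_loop.hpp>

    // original (from above):
    //   for (size_t slot = cur_counter + 1;
    //        slot <= (cur_batch_size > 20 ? cur_counter + 20 : current_limit_);
    //        slot++) { ... }

    std::size_t const first = cur_counter + 1;
    std::size_t const last =
        (cur_batch_size > 20 ? cur_counter + 20 : current_limit_) + 1;  // half-open

    hpx::parallel::for_loop(
        hpx::parallel::execution::par,    // parallel execution policy
        first, last,
        [&](std::size_t slot) {
            // body of the original loop; iterations must now be safe to run
            // concurrently for different values of slot
        });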
<hkaiser> gtg
hkaiser has quit [Quit: bye]
<Yorlik> O/
nikunj has quit [Remote host closed the connection]
<Yorlik> Can't believe I finally figured this out ... holy parallel cow loop...
<Yorlik> Someone needs to lock hkaiser in to finally write the HPX bible ...