<diehlpk>
We do not scale due to the small problem size
<diehlpk>
But we compared apples with oranges, since we used different integration rules, data structures, and so on
rori has joined #ste||ar
<hkaiser>
diehlpk: ok, will that be added to the paper?
<diehlpk>
hkaiser, No, Gregor and Dirk wrote an answer to the committee and attached the plots
<diehlpk>
I found a bug in Dominic's code after some intense testing, because the solid sphere was never tested in distributed mode. Gregor was able to fix the code and could run the small benchmark
<diehlpk>
So we do not have time to add it to the paper
<diehlpk>
But at least the one reviewer has seen the comparison he wished for
<hkaiser>
ok
<hkaiser>
thanks for your effort!
<diehlpk>
Yeah, Gregor, Dirk, and I did not sleep much the last two days to do this comparison, and we will see if they are happy with it
<hkaiser>
I think so, all will be well
<diehlpk>
I hope so, we are uploading the paper right now and then we are done with it
<hkaiser>
nod, take a rest, you deserved it
<diehlpk>
Need to prepare my 30-minute talk for Wednesday
<hkaiser>
diehlpk: you can do that on the plane ;-)
<diehlpk>
Yes, good point
K-ballo1 has joined #ste||ar
K-ballo has quit [Ping timeout: 246 seconds]
K-ballo1 has quit [Ping timeout: 258 seconds]
diehlpk has quit [Ping timeout: 252 seconds]
K-ballo has joined #ste||ar
Yorlik has joined #ste||ar
<Yorlik>
hkaiser: yt?
<hkaiser>
here
<Yorlik>
I just had a very nice test - pleasant for HPX I think
<hkaiser>
ok
<Yorlik>
I was running my test program with 104 workers
<Yorlik>
using hpx to schedule the tasks
<Yorlik>
each slot was seen as a task, but not run as a task
<Yorlik>
I made it so that each slot took at least 200us
<hkaiser>
nod
<Yorlik>
So - the theoretical minimum task was always 200+ us
<Yorlik>
The measurements also showed it worked - the average was about 210us or so
<Yorlik>
Then I let these workers - which are very dependent on each other - run in a buffer of 8192 items
<hkaiser>
is that a lot?
<Yorlik>
So each had about 80 slots of space on average
<Yorlik>
It was 64 bytes of data per item - one cache line
<hkaiser>
k
<Yorlik>
So - 64*8192 byte total buffer size
<hkaiser>
not much data
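For scale, that is 64 bytes * 8192 slots = 524,288 bytes, i.e. 512 KiB of buffer in total.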
<Yorlik>
I made a time measurement INSIDE the work() method
<Yorlik>
So in the end I had a total of the real work time spent
<Yorlik>
Without any overhead
<Yorlik>
I let it run for 2 hours
<Yorlik>
And then divided the total sum of time spent in work() by the time the program was running
<Yorlik>
this I used as an efficiency parameter
K-ballo1 has joined #ste||ar
<hkaiser>
what did you get?
<Yorlik>
What's the real work done compared to the total runtime
<Yorlik>
I controlled the batch size to not exceed a certain limit and I forced a yield when it became too small
<Yorlik>
Running with 6 worker threads I got an efficiency of 5.79
<hkaiser>
nice
<Yorlik>
That's a parallel efficiency of 96%
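For reference, the arithmetic behind that figure: the summed work() time divided by the wall-clock runtime gives 5.79 "effective cores", and 5.79 / 6 worker threads is about 0.965, i.e. roughly 96% parallel efficiency.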
K-ballo has quit [Ping timeout: 248 seconds]
K-ballo1 is now known as K-ballo
<Yorlik>
And in an environment with extreme mutual dependency
<Yorlik>
104 workers running in a pipeline
<Yorlik>
There's a lot of possibility to get in each other's way
<Yorlik>
I'm pretty happy with that result
<Yorlik>
:)
<hkaiser>
good
<hkaiser>
I'm glad you are
<Yorlik>
The big problem with the pipeline, I think, is that a task can only run ahead so far - depending on its predecessor in the queue
<Yorlik>
That could horribly raise your kappa
<Yorlik>
It would be interesting to do some measurements on a many core machine
<Yorlik>
And see how the USL applies
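For reference, the USL mentioned here is commonly written as X(N) = N / (1 + sigma*(N - 1) + kappa*N*(N - 1)), where sigma models contention and kappa models the crosstalk/coherency cost referred to as "kappa" above.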
<hkaiser>
btw hpx has a perf counter (idle-rate, needs to be enabled at compile time) that should have given you the same information
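Roughly, enabling and printing that counter looks like the following sketch, where pipeline_test is a placeholder binary name and the exact CMake option and counter path may differ between HPX versions:

    cmake -DHPX_WITH_THREAD_IDLE_RATES=ON ...
    ./pipeline_test --hpx:threads=6 \
        --hpx:print-counter=/threads{locality#0/total}/idle-rate \
        --hpx:print-counter-interval=1000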
<Yorlik>
I'll get into perf counters later
<Yorlik>
I think I'll now work a bit more on that data structure to make it as good and usable as I can
<hkaiser>
but - nice result
<Yorlik>
Yeah - I was afraid it would be abysmal - but this is nice
<hkaiser>
now you should exit the task instead of yielding and restart a new task whenever work is available
<Yorlik>
Once I have it in a usable shape I'll work on instrumentation
<hkaiser>
that would make things more dynamic as the number of workers would adapt itself to the amount of work
<Yorlik>
The setup here is different
<Yorlik>
Its a pipeline
<Yorlik>
Not a parallel setup
<hkaiser>
still
<Yorlik>
But I want to have a possibility for parallel work inside a stage of the pipeline
<Yorlik>
If I find a good way to do that, that would be the killer
<Yorlik>
Because slow stages would autoscale
<hkaiser>
so your pipeline has 104 stages?
<Yorlik>
Yes
<hkaiser>
nod
<Yorlik>
one worker each
<hkaiser>
ok
<Yorlik>
4 are the default, and 100 are just for the measurement and to increase the load
<Yorlik>
It's parameterized - running a 1000-stage pipeline is easy
<Yorlik>
just a number
<hkaiser>
is that useful at all?
<Yorlik>
I don't think so
<Yorlik>
The main use case is a single-producer single-consumer queue
<Yorlik>
Just to shovel data asap between threads
<Yorlik>
And I can dynamically insert or remove clients
<Yorlik>
So - dropping a logger in really quickly and removing it again is easy
<Yorlik>
I might use it to coordinate between physics and gameplay updates in a frame
<Yorlik>
or to shovel messages between updaters and the dispatcher
<Yorlik>
Those are the main use cases for us
<Yorlik>
I think I just got an idea for a next experiment
<Yorlik>
And how to solve the task creation you mentioned
<Yorlik>
It's actually easy
<Yorlik>
Just need to make the instrumentation atomic now :)
<hkaiser>
or use the existing hpx instrumentation ;-)
<Yorlik>
How can I, in a for loop, add a bunch of futures to be watched after the for loop?
<Yorlik>
In the loop I'm launching a bunch of tasks I have to wait for
<Yorlik>
What's the best way to do that ?
<Yorlik>
Just a vector of hpx::future?
<Yorlik>
but hpx::future is a template, right? Or is it type-erased?
<Yorlik>
probably because work is a member function and it's not an hpx object
<Yorlik>
How can I run this strictly locally again for a test?
K-ballo1 has joined #ste||ar
K-ballo has quit [Ping timeout: 272 seconds]
K-ballo1 is now known as K-ballo
rori has quit [Quit: bye]
<hkaiser>
Yorlik: this should actually work, if anything you can use async(hpx::util::bind(&Client::work, item)); or an equivalent solution involving a lambda
<Yorlik>
I think I need CRTP - the call is inside the runner and work is an abstract member
<Yorlik>
So I created a CRTP interface like this:
<hkaiser>
should work anyways
<hkaiser>
try passing (&Client::work, &item)
<hkaiser>
either way should be fine (if 'item' is copyable)
<hkaiser>
if it's not copyable do async(&Client::work, std::ref(item))
<Yorlik>
Don't I have to pass this?
<hkaiser>
well, I thought 'item' was the object to use, i.e. item.work()
<Yorlik>
work is a member function
<hkaiser>
of item?
<Yorlik>
no - of Client
<hkaiser>
what type is item?
<Yorlik>
the workers inherit from Client
<Yorlik>
item is just the data in the buffer
<hkaiser>
or is item to be passed to work(...)?
<Yorlik>
Yes
<hkaiser>
ahh
<hkaiser>
then you need to supply the 'this' of the Client object you want to invoke 'work' on
<Yorlik>
it seems this compiles - futures.push_back( hpx::async( &Client::work, this, std::ref(item) ) );
<hkaiser>
yes, that's it
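A minimal self-contained sketch of that pattern, with hypothetical Item and Client types standing in for the real buffer slot and worker classes (which are not shown in the log); header names may differ slightly between HPX versions:

    #include <hpx/include/async.hpp>
    #include <hpx/include/lcos.hpp>   // hpx::future, hpx::wait_all

    #include <functional>             // std::ref
    #include <vector>

    struct Item { char payload[64]; };   // one cache line per slot, as above

    struct Client
    {
        void work(Item& item)
        {
            // ... process one buffer slot (>= 200us of work) ...
        }

        void process(std::vector<Item>& batch)
        {
            std::vector<hpx::future<void>> futures;
            futures.reserve(batch.size());

            for (Item& item : batch)
            {
                // Member function, explicit 'this', and std::ref so 'item' is
                // not copied; 'item' must outlive the task, as noted below.
                futures.push_back(
                    hpx::async(&Client::work, this, std::ref(item)));
            }

            // Suspend this HPX thread until every task has finished.
            hpx::wait_all(futures);
        }
    };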
<Yorlik>
Now I have to work on a last error
<Yorlik>
And remove that dreaded CRTP :D
<Yorlik>
Though I can use it to save one indirection
<hkaiser>
this way however you have to make sure 'item' outlives the invocation of work()
<Yorlik>
And since we're in a tight loop - maybe I keep it
<Yorlik>
It always will
<Yorlik>
Why does Visual Studio crash when something starts working? :D
<zao>
Only one of you and VS can be working at the same time.
<zao>
Also, you're using sophisticated C++, it's supposed to kill tools :)
<Yorlik>
XD
<Yorlik>
It's so much over my head, and VS smells it and uses that to stab me in the back
<Yorlik>
hkaiser: It seems to work now - I have run a longer test to check the impact on efficiency.
<hkaiser>
Yorlik: as long as your workload is ~200us you will not see any impact
<hkaiser>
200us or more
<Yorlik>
It's a bit of an atypical workload I think
<Yorlik>
Because the clients can step on each others toes
<Yorlik>
I expect an impact when varying the runtimes of the clients
<Yorlik>
Like slow and fast clients between 200 and 400 us or so
<Yorlik>
Because they will start to bump against each other's section of the buffer
<Yorlik>
But HPX will probably mitigate that a lot
<Yorlik>
because now I have autoscaling
<Yorlik>
I need to test
<Yorlik>
In a USL model I'd expect the kappa value to change a lot, depending on the variation of the item times per worker and the size of the buffer
<Yorlik>
the "bumping into each other" could be seen as a form of crosstalk, I belkieve.#
<Yorlik>
Seems with taskifying I introduced new bugs - time to fix ...
<hkaiser>
Yorlik: btw, for running a predefined number of tasks you could run a parallel::for_each() instead of running separate tasks - especially if you are not interested in separate result values from those tasks
<Yorlik>
Makes sense
<hkaiser>
(I'm reading back only now - wrt your question how to create a bunch of tasks)
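A sketch of what that could look like with the parallel algorithm instead of hand-rolled tasks, reusing the hypothetical Client and Item types from the earlier sketch (in newer HPX releases the spelling is hpx::for_each with hpx::execution::par):

    #include <hpx/include/parallel_for_each.hpp>

    #include <vector>

    // A variant of Client::process() from the sketch above: run work() on every
    // item in the batch; for_each only returns once all iterations are done, so
    // no futures have to be collected or waited on by hand.
    void Client::process(std::vector<Item>& batch)
    {
        hpx::parallel::for_each(hpx::parallel::execution::par,
            batch.begin(), batch.end(),
            [this](Item& item) { work(item); });
    }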
<Yorlik>
It seems to work, but there is some odd behavior - I think I might have introduced bugs, maybe even a race
<Yorlik>
I do not yet fully understand what's going on.
<Yorlik>
Oh - got it - a race
<Yorlik>
Code that was running single threaded is now taskified - my instrumenting counter needs to become atomic
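A minimal sketch of what that change could look like, again reusing the hypothetical Client/Item types, with total_work_ns as a made-up name for the instrumentation counter:

    #include <atomic>
    #include <chrono>
    #include <cstdint>

    // Accumulated time spent inside work(), updated from many tasks at once;
    // std::atomic makes the concurrent addition safe.
    std::atomic<std::uint64_t> total_work_ns{0};

    void Client::work(Item& item)
    {
        auto const start = std::chrono::steady_clock::now();
        // ... actual work on 'item' ...
        auto const stop = std::chrono::steady_clock::now();

        total_work_ns.fetch_add(
            static_cast<std::uint64_t>(
                std::chrono::duration_cast<std::chrono::nanoseconds>(
                    stop - start).count()),
            std::memory_order_relaxed);
    }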
nk__ has joined #ste||ar
<Yorlik>
Can I query the hpx:threads parameter from inside the program?
<hkaiser>
Yorlik: what parameters?
<Yorlik>
The hpx:threads setting
<Yorlik>
I'd like to compute the thread efficiency in the output
<hkaiser>
I'm not sure I understand
<Yorlik>
I start the program with --hpx:threads=6
<Yorlik>
I want to get the 6 inside my code
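For reference, one way to read that inside the program is hpx::get_os_thread_count(), which reports the number of worker OS threads the runtime was started with; header paths vary between HPX versions, and effective_cores below is a hypothetical parameter standing for the work-time / wall-time ratio computed earlier:

    #include <hpx/include/runtime.hpp>

    #include <cstddef>

    // Call from HPX code after the runtime is up.
    double parallel_efficiency(double effective_cores)
    {
        // Number of worker OS threads, e.g. 6 when started with --hpx:threads=6.
        std::size_t const num_threads = hpx::get_os_thread_count();
        return effective_cores / static_cast<double>(num_threads);
    }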
<Yorlik>
BTW: I have big hopes for the results of the GSoC
<Yorlik>
E.g. I didn't find any documentation on the parallel for
<Yorlik>
It was just mentioned in some release notes
<Yorlik>
I often fall back to looking stuff up in my local Doxygen docs