hkaiser changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/ | GSoC2018: https://wp.me/p4pxJf-k1
nikunj97 has quit [Quit: Leaving]
hkaiser has quit [Quit: bye]
diehlpk_mobile has joined #ste||ar
eschnett has joined #ste||ar
parsa[w] has quit [Read error: Connection reset by peer]
stmatengss has joined #ste||ar
parsa[w] has joined #ste||ar
K-ballo has quit [Quit: K-ballo]
stmatengss1 has joined #ste||ar
stmatengss has quit [Read error: Connection reset by peer]
diehlpk_mobile has quit [Quit: Yaaic - Yet another Android IRC client - http://www.yaaic.org]
nanashi55 has quit [Ping timeout: 264 seconds]
nanashi55 has joined #ste||ar
anushi has quit [Quit: Bye]
eschnett has quit [Quit: eschnett]
Anushi1998 has joined #ste||ar
Anushi1998 has quit [Remote host closed the connection]
Anushi1998 has joined #ste||ar
Anushi1998 has quit [Remote host closed the connection]
jakub_golinowski has joined #ste||ar
stmatengss1 has quit [Quit: Leaving.]
Anushi1998 has joined #ste||ar
<jakub_golinowski>
M-ms, yt?
stmatengss has joined #ste||ar
stmatengss has quit [Client Quit]
stmatengss has joined #ste||ar
<M-ms>
jakub_golinowski: here
<jakub_golinowski>
Hey, so I am now running different configurations in debug mode to learn more about nstripes and numThreads
<jakub_golinowski>
the first observation is that the dependency on nstripes does not seem to be U-shaped
<jakub_golinowski>
at least for Mandelbrot example
<jakub_golinowski>
I did not plot it but just by looking at the numbers: it is decreasing at the beginning (steeply for very small nstripes) and then saturates and stays at the same level for nstripes = 100 and nstripes = num_pixels (max possible value)
<jakub_golinowski>
and any other value in between
<jakub_golinowski>
My guess is that in Mandelbrot there is a hidden for loop with 500 iterations inside each body of the parallel_for loop, so even if we divide the work into single-pixel tasks the performance does not drop that much
<jakub_golinowski>
because each parallel_for body is still significantly bigger than the overhead
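For context, a minimal sketch of the kind of Mandelbrot kernel being discussed (illustrative only, not the actual OpenCV sample): the inner escape-time loop is the "hidden for loop" that keeps even per-pixel tasks large relative to the scheduling overhead.

    #include <opencv2/core/utility.hpp>
    #include <complex>

    // Illustrative Mandelbrot body: each pixel runs an inner escape-time loop of up
    // to maxIter (e.g. 500) iterations, so even a single-pixel chunk carries far
    // more work than the cost of scheduling it.
    class MandelbrotBody : public cv::ParallelLoopBody
    {
    public:
        MandelbrotBody(cv::Mat& dst, int maxIter) : dst_(dst), maxIter_(maxIter) {}

        void operator()(const cv::Range& range) const override
        {
            for (int idx = range.start; idx < range.end; ++idx)   // pixels in this chunk
            {
                int const x = idx % dst_.cols, y = idx / dst_.cols;
                std::complex<double> z(0, 0), c(x / 300.0 - 2.0, y / 300.0 - 1.0);
                int iter = 0;
                while (std::abs(z) < 2.0 && iter < maxIter_)      // the "hidden" for loop
                {
                    z = z * z + c;
                    ++iter;
                }
                dst_.at<uchar>(y, x) = static_cast<uchar>(255 * iter / maxIter_);
            }
        }

    private:
        cv::Mat& dst_;
        int maxIter_;
    };

    // usage: nstripes asks the backend for that many chunks over all pixels
    // cv::Mat img(4320, 6000, CV_8UC1);
    // cv::parallel_for_(cv::Range(0, img.rows * img.cols), MandelbrotBody(img, 500), nstripes);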
<M-ms>
yeah that's a good observation, the decrease with small nstripes is still correct behaviour
<M-ms>
then you'd need to check the absolute performance compared to tbb or pthreads
<M-ms>
so this is not really worrying yet
<M-ms>
(in release mode of course)
<M-ms>
btw, remember that if you want to step through code but still have decent performance there's also the RelWithDebInfo build type
<M-ms>
but you shouldn't need that now (and proper benchmarks should still be done with Release)
<jakub_golinowski>
So I am also checking that with tbb
<jakub_golinowski>
and tbb has basically the same behaviour
<M-ms>
ok, that's good
<jakub_golinowski>
But I am confused, because I was convinced that there should be some cost to chunking that is too fine
<M-ms>
yeah...
<jakub_golinowski>
and with nstripes=0 (in the HPX case each loop body is then executed by a separate task) the performance seems to be one of the best
<M-ms>
still here btw, I'm not sure what the problem could be so will have a look at the code
<M-ms>
as a sanity check, nstripes = 1 should give the same performance with the HPX and pthreads backends
<M-ms>
(with 4 or less threads though, so you don't use hyperthreads)
<M-ms>
and tbb as well...
<jakub_golinowski>
yes yes, for nstripes in the range 1-8 the behaviour is as expected
<M-ms>
do you mean absolute performance between hpx, tbb and pthreads?
<jakub_golinowski>
hpx_nstripes vs tbb so far
<M-ms>
(I'm not talking about performance improving going from 1 to higher nstripes)
<M-ms>
ok, so you should do this with Release now
<jakub_golinowski>
Ah yeah, I already am talking about release
<jakub_golinowski>
so I was looking at debug first to confirm the paths, but then I switched to release for the numbers
<M-ms>
right, ok
<jakub_golinowski>
by paths I mean which code was actually executed in which configuration
<M-ms>
and you mean now that hpx and tbb are the same with nstripes = 1 but tbb scales better as you increase nstripes?
<jakub_golinowski>
to make it more tangible I am running it again now with smaller workload per loop
<jakub_golinowski>
I changed the number of iterations within 1 parallel_for body to 10 (from 500)
<M-ms>
with the hpx backend, if nstripes = 0 it uses one task per pixel? or do you let hpx do the chunking in that case?
<jakub_golinowski>
it is the old/current version in hpx_nstripes mode so there is one task per pixel
<M-ms>
try running them next time with only 4 threads, then you don't have to guess about effects from hyperthreading
<M-ms>
so the minimum is around 0.3 seconds for both backends, which is good
hkaiser has joined #ste||ar
<M-ms>
jakub_golinowski: so please try still letting hpx do the chunking when nstripes = 0, each pixel still has to go through ParallelLoopBodyWrapper but it's the same as for tbb so performance should be similar
<jakub_golinowski>
M-ms Ok this will be done for the next compilation of OpenCV
<jakub_golinowski>
so for any nstripes > 0 should we say that the user knows what he is doing and accept it?
<M-ms>
ok, good
<M-ms>
and yes, nstripes > 0 should be respected
<jakub_golinowski>
Oh that is interesting
<jakub_golinowski>
I am running it on 4 cores and the U-shape shows
<M-ms>
4 threads?
<jakub_golinowski>
--hpx:threads=4 for now (numThreads not yet supported for hpx backend)
<jakub_golinowski>
and for tbb I call cv::setNumThreads(4)
<M-ms>
there's a bit of a U shape in the logs you sent earlier as well
<jakub_golinowski>
and HPX is doing better than TBB; it seems it might be due to PU pinning
<hkaiser>
jakub_golinowski: could you create a graph?
<jakub_golinowski>
hkaiser, for now I can quickly do it in Libre calc
<hkaiser>
or google spreadsheet
<jakub_golinowski>
but I think I will do a proper benchmark from it because it is interesting to investigate it further (run it a few times, compute stds etc)
<hkaiser>
sure
<jakub_golinowski>
so Ok I am going for google spreadsheet for now
<hkaiser>
graphs are just better for understanding the data than lists of numbers
<M-ms>
hpx is almost 5 times faster at its best compared to nstripes = 1 (with 4 threads), slightly weird but not bad...
<hkaiser>
wow
<M-ms>
indeed :P
stmatengss has quit [Quit: Leaving.]
<github>
[hpx] hkaiser force-pushed ready_future from 343eaf1 to 78842e0: https://git.io/vhrDS
<github>
hpx/ready_future 78842e0 hkaiser: Adding direct data value to future to avoid allocation for make_ready_future....
<jakub_golinowski>
Took a bit longer than expected
<jakub_golinowski>
but I have to say that google spreadsheets have nice plotting tools
<hkaiser>
thanks
<hkaiser>
now please explain what I'm seeing
<jakub_golinowski>
All sheets have the same schema but present data for different configs
mcopik has joined #ste||ar
<jakub_golinowski>
The variable nstripes is the job partitioning argument used within OpenCV to enforce chunking of the job on the parallel backend
<jakub_golinowski>
So let us focus on first sheet: hpx_nstripes-r4-10i
<hkaiser>
k
<jakub_golinowski>
this is OpenCV built with the HPX backend, respecting the nstripes variable and introducing as many tasks to deal with the job as given by nstripes
<jakub_golinowski>
There are two ranges in which nstripes was varied: a small beginning range (0-99) and a linspace over the whole range (50 points)
<jakub_golinowski>
the small range was introduced to see the behaviour when the number of tasks is very small. For example, for nstripes=1 we have basically sequential execution, hence the steep drop in the 1-8 subrange of nstripes
<jakub_golinowski>
I set the range to 100 to see if there would be some changes visible beyond nstripes=4*num_threads
<jakub_golinowski>
The linspace over the whole range was introduced to observe the U-shape effect (or at least the second part of the U)
<jakub_golinowski>
which is deemed to be the result of too-small task sizes
<hkaiser>
k
<hkaiser>
good
<hkaiser>
jakub_golinowski: so the hpx times going up for a large number of tasks are caused by the overheads
<hkaiser>
tbb does not exhibit that behavior
<jakub_golinowski>
So this is what I think: big nstripes means a lot of very small tasks, and therefore more overhead for task management
<hkaiser>
yes
<jakub_golinowski>
I am not sure what tbb is doing exactly with the nstripes parameter, but it is interesting that they have a bump for nstripes close to half of the whole range
<hkaiser>
jakub_golinowski: is the problem size constant for all measurements?
<jakub_golinowski>
Yes it is, the range in both experiments is 25920000
<hkaiser>
k
<jakub_golinowski>
because our current approach is to set hpx::parallel::execution::static_chunk_size fixed(1);
<github>
[hpx] hkaiser force-pushed ready_future from 78842e0 to edbedb5: https://git.io/vhrDS
<github>
hpx/ready_future edbedb5 hkaiser: Adding direct data value to future to avoid allocation for make_ready_future....
<hkaiser>
jakub_golinowski: thanks for this, very interesting
<jakub_golinowski>
and then just use as many chunks as nstripes says
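For reference, a hypothetical sketch of the "respect nstripes" strategy described above (not the actual backend source; header names and the exact for_loop overload may differ between HPX versions): the range is split into nstripes stripes, and static_chunk_size(1) makes each stripe its own HPX task.

    #include <hpx/include/parallel_for_loop.hpp>
    #include <hpx/include/parallel_executor_parameters.hpp>
    #include <cstdint>

    // Hypothetical backend helper: one HPX task per stripe, nstripes stripes in total.
    template <typename Body>   // e.g. the ParallelLoopBodyWrapper used by the backend
    void parallel_for_hpx_nstripes(int begin, int end, int nstripes, Body const& body)
    {
        hpx::parallel::execution::static_chunk_size fixed(1);   // one stripe per chunk/task
        std::int64_t const length = end - begin;

        hpx::parallel::for_loop(
            hpx::parallel::execution::par.with(fixed), 0, nstripes,
            [&](int stripe)
            {
                int const lo = begin + static_cast<int>(length * stripe / nstripes);
                int const hi = begin + static_cast<int>(length * (stripe + 1) / nstripes);
                body(lo, hi);   // the body iterates over its own sub-range of pixels
            });
    }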
mcopik has quit [Ping timeout: 276 seconds]
anushi has joined #ste||ar
mcopik has joined #ste||ar
<github>
[hpx] hkaiser force-pushed ready_future from edbedb5 to 6b9f8dc: https://git.io/vhrDS
<github>
hpx/ready_future 6b9f8dc hkaiser: Adding direct data value to future to avoid allocation for make_ready_future....
<M-ms>
jakub_golinowski: last question: the data point with hpx and nstripes = 0 was also with static_chunk_size(1)?
<jakub_golinowski>
yes
<M-ms>
I have a feeling tbb stays fast even with bigger nstripes because it sets a lower bound on the size of tasks but it's still allowed to use bigger chunks if it feels like it's necessary
<hkaiser>
how does it know how large the tasks are?
<hkaiser>
or is it purely a question of the number of tasks
<M-ms>
I don't know tbh
<M-ms>
otherwise they just have really low overheads
<hkaiser>
yah, they don't have separate stacks for their 'tasks'
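M-ms's guess matches how a TBB grainsize behaves under the auto_partitioner: the grainsize is only a lower bound on chunk size and the partitioner may hand out larger chunks. Whether OpenCV's TBB backend actually maps nstripes to a grainsize like this is an assumption here, not something confirmed in the discussion; the snippet below only illustrates the mechanism.

    #include <tbb/parallel_for.h>
    #include <tbb/blocked_range.h>
    #include <tbb/partitioner.h>

    // Illustrative only: grainsize acts as a lower bound on chunk size.
    template <typename Func>
    void run_with_grainsize(int begin, int end, int grainsize, Func const& work)
    {
        tbb::parallel_for(
            tbb::blocked_range<int>(begin, end, grainsize),   // never split below grainsize
            [&](tbb::blocked_range<int> const& r)
            {
                for (int i = r.begin(); i != r.end(); ++i)
                    work(i);                                   // per-element work
            },
            tbb::auto_partitioner());   // may still choose chunks larger than grainsize
    }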
<jakub_golinowski>
also just to add: if you look at sheet 3 you can see that the bump for tbb is again clearly visible
stmatengss has joined #ste||ar
stmatengss has quit [Client Quit]
<M-ms>
those bumps are a bit weird, you weren't doing anything else on your laptop at the same time?
<hkaiser>
M-ms: could be an artifact of how tbb sizes the chunks
<jakub_golinowski>
I was doing other stuff, so I am not yet going too deep into analyzing the bumps
<jakub_golinowski>
if in the overnight run they also show up I will start thinking harder about them
<M-ms>
and I know I said we should respect nstripes, but could you do the same plot with the default hpx chunker?
<jakub_golinowski>
right away
<M-ms>
thanks
mcopik_ has joined #ste||ar
mcopik has quit [Read error: Connection reset by peer]
<jakub_golinowski>
it is done
<jakub_golinowski>
sheet 5
nikunj97 has joined #ste||ar
<M-ms>
nice, so this is just parallel_for(par, ...)?
mcopik_ has quit [Ping timeout: 256 seconds]
<jakub_golinowski>
M-ms, exactly
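So the sheet-5 variant presumably reduces to something like the following sketch: no executor parameters, and HPX's default chunker decides how the iterations are grouped (names are illustrative).

    #include <hpx/include/parallel_for_loop.hpp>

    // Hypothetical default-chunker variant: no executor parameters, so HPX's own
    // chunker decides how many iterations go into each task.
    template <typename Body>
    void parallel_for_hpx_default(int begin, int end, Body const& body)
    {
        hpx::parallel::for_loop(
            hpx::parallel::execution::par, begin, end,
            [&](int idx) { body(idx, idx + 1); });   // one logical iteration per pixel
    }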
<M-ms>
I'm tempted to say we should use this instead, it's so much more predictable. But then it's a bit slower compared to the static chunk size version. So you will rerun all the tests overnight?
<M-ms>
jakub_golinowski: was the one you just added with 4 threads, not 8? if that's the case the performance is good
<jakub_golinowski>
M-ms, the plots in sheet 5 are now in accordance with the sheet name, so 4 threads
mcopik_ has joined #ste||ar
<jakub_golinowski>
I just finished the 8 threads run and will add it to sheet 6 for completeness
<M-ms>
ok, good
<M-ms>
very good, so now the results from earlier (the ~50% overhead for the hpx backend) make sense, because they were the same as static_chunk_size(1) and nstripes = num_pixels
<jakub_golinowski>
?
<M-ms>
the plots we saw on monday, where hpx was maybe 50% slower
<M-ms>
those were run with nstripes = 0 and static_chunk_size(1)
<M-ms>
just talking out loud
<M-ms>
uhm, the results using hpx, default chunker and e.g. nstripes are strange
<M-ms>
nstripes = 2
<M-ms>
it shouldn't be 4 times faster than nstripes = 1
<jakub_golinowski>
well wait
<jakub_golinowski>
First: I shuffled the order of sheets but the names are still correct
<jakub_golinowski>
So now hpx-4t-10i is the hpx build that ignores nstripes and always uses the default HPX chunking
<jakub_golinowski>
it should be 4 times faster than for nstripes=1 because for nstripes=1 the backend is doing this check
<jakub_golinowski>
and ignores the parallel implementation and just runs sequentially
<jakub_golinowski>
And for hpx_nstripes-4t-10i the difference is correct: the time is 2 times smaller for nstripes=2 compared to nstripes=1
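The check referred to above is, roughly, the guard in OpenCV's parallel_for_ dispatcher; the snippet below is a paraphrase with hypothetical helper names, not the actual parallel.cpp code.

    #include <opencv2/core/utility.hpp>

    // Hypothetical paraphrase of the dispatch guard: with nstripes <= 1, a single
    // thread, or a trivial range, the parallel backend is skipped entirely and the
    // body runs sequentially, which is why nstripes = 1 behaves like a serial run.
    void parallel_for_sketch(cv::Range const& range, cv::ParallelLoopBody const& body,
                             double nstripes, int numThreads)
    {
        if (numThreads > 1 && nstripes > 1 && range.end - range.start > 1)
        {
            // hand the (range, body, nstripes) triple to the active backend (HPX, TBB, ...)
        }
        else
        {
            body(range);   // sequential fallback
        }
    }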
<M-ms>
thanks, that explains it
<M-ms>
I was thinking about using nstripes but default chunker
<M-ms>
you'll still want to call ParallelLoopBodyWrapper
<jbjnr>
hkaiser: jakub_golinowski during our talks we discussed with heller hkaiser M-ms + the kokkos people how to improve task times for simple for loops by having N-ary tasks. I think this would help enormously with the OpenCV style where things are done in one huge for loop (or similar)
<jakub_golinowski>
jbjnr, could you tell me something more or give some links so I can read more about it?
<M-ms>
jbjnr: you'll have to specify n-ary a bit better
<jbjnr>
jakub_golinowski: I will create an issue on the tracker and you can read that. give me 5 mins
<hkaiser>
everything is doable, the question is how much effort that would be
<jbjnr>
I suspect not as much as we might think.
<hkaiser>
k
<jbjnr>
well - the main task reworking should be doable, but reworking the parallel algorithms might take some effort.
<jbjnr>
I don't recall how we actually generate the tasks in a parallel for/foreach/etc
anushi has quit [Ping timeout: 264 seconds]
<nikunj97>
hkaiser: yt?
<heller>
jbjnr: you know what the kokkos people do though?
<jbjnr>
tell me
<heller>
the "n-ary task" thing is a bit of an overstatement
<jbjnr>
and ...
<jbjnr>
???
<heller>
their scheduling loop goes a bit like this: while(true){ sleep_while(state == inactive); run_task(); }
<heller>
that runs on all cores
<jbjnr>
ok
<jbjnr>
The main thing is that we could still create one task instead of N, and save a ton of queue-buggery
<jbjnr>
I do wonder what we do there if the last one to pull a task off finishes it before one of the earlier ones.
<heller>
now, when a parallel for is to be executed, they set a function pointer (the one that executes the loop body) and set the state to active
<jbjnr>
I see
<heller>
once the loop body finishes, it sets the state to inactive again
daissgr has joined #ste||ar
<jbjnr>
so the task isn't really a task 'proper'
<heller>
nope.
<jbjnr>
just a jump into the loop
<heller>
yup
<heller>
and then some atomics etc.
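A rough sketch of the scheme heller describes, with illustrative names rather than actual Kokkos internals: persistent workers spin on a shared state, and a parallel_for is "launched" by publishing a loop-body function and flipping the state to active.

    #include <atomic>
    #include <functional>

    std::atomic<bool> active{false};
    std::function<void(int /*rank*/)> loop_body;   // published by the thread launching the parallel_for

    // Persistent worker: no new task and no new stack per parallel_for, just a jump
    // into the published loop body.
    void worker_loop(int rank)
    {
        while (true)
        {
            while (!active.load(std::memory_order_acquire))
                ;                    // sleep/spin while state == inactive
            loop_body(rank);         // execute this worker's share of the loop body
            active.store(false, std::memory_order_release);
            // (a real implementation would count finished workers / use a completion
            //  barrier before declaring the parallel region inactive again)
        }
    }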
<jbjnr>
well I still like the N-ary idea.
<heller>
sure
<heller>
we could sure add similar logic to our scheduling loop
<jbjnr>
we'd probably have to do something similar - the thread_data is non-copyable, and the stack might be an issue - we'd end up running the task on the thread/stack of the scheduling loop itself
<jbjnr>
<what could go wrong>
<jbjnr>
:(
<jbjnr>
a nested async inside a parallel loop! yikes
hkaiser has quit [Quit: bye]
<heller>
jbjnr: kokkos is really going completely out of its way to try to map directly to the underlying hardware
<jbjnr>
correct
<jbjnr>
that's why people are using it for HPC
<heller>
so, an HPX backend (doing everything with dynamic tasking) and comparing the performance will give us the answer as to what we need to do as the next step
<heller>
and there'll be two answers: 1) the microbenchmark performance 2) the mini apps that have been ported to kokkos
<heller>
and I'm pretty sure the microbenchmarks will just rock on the existing kokkos benchmarks and we won't stand a chance there (maybe within 10% of the performance). But I am not so certain regarding "real" applications
K-ballo has joined #ste||ar
stmatengss has joined #ste||ar
mcopik_ has quit [Ping timeout: 245 seconds]
eschnett has joined #ste||ar
<jbjnr>
heller: agreed - and this is why I have been banging on about affinity etc recently. We need to support the kind of things that kokkos is doing well if we want to be competitive for those tight loop apps that have been carefully optimized for the kokkos model.
hkaiser has joined #ste||ar
mcopik_ has joined #ste||ar
<M-ms>
jakub_golinowski: with https://github.com/STEllAR-GROUP/hpx/pull/3349 and `--hpx:ini=hpx.max_idle_loop_count=100` you should see CPU usage graphs similar to what you saw with TBB
<M-ms>
and if you're running more benchmarks tonight, could you add the openmp backend to the list as well?
<jakub_golinowski>
M-ms, nice one with the exponential idle backoff
<jakub_golinowski>
I am currently changing OpenCV to include the conditional fixing of static_chunk_size
<jakub_golinowski>
and also to use numThreads in the start-stop case
<jakub_golinowski>
but then I have to generate my own argc/argv
<jakub_golinowski>
and when hpx_main.hpp is included by the user then I have no chance to change numThreads
<hkaiser>
why not?
<hkaiser>
if you use the technique from init_globally you'll still be able to supply hpx command line options
<hkaiser>
jakub_golinowski: hold on, let me say that again
<hkaiser>
jakub_golinowski: using hpx_main.hpp still allows hpx to see the command line options
<M-ms>
I think he meant that with hpx_main.hpp one can't change the number of threads through cvSetNumThreads (since it would be called after HPX has been started)
<M-ms>
but one could restrict the number of used threads with a thread_pool_executor(num_threads)
<hkaiser>
ok, makes sense
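A sketch of the workaround M-ms suggests: instead of restarting the runtime with a different --hpx:threads, give the parallel loop an executor limited to num_threads. The exact executor type and headers differ between HPX versions; local_priority_queue_executor is used below as a stand-in for the thread_pool_executor mentioned above, so treat the names as assumptions.

    #include <hpx/include/parallel_for_loop.hpp>
    #include <hpx/include/thread_executors.hpp>
    #include <cstddef>

    // Hypothetical sketch: run a parallel loop on at most num_threads worker threads
    // without restarting the HPX runtime (executor type/headers are assumptions).
    template <typename Body>
    void parallel_for_limited(int begin, int end, std::size_t num_threads, Body const& body)
    {
        hpx::threads::executors::local_priority_queue_executor exec(num_threads);

        hpx::parallel::for_loop(
            hpx::parallel::execution::par.on(exec), begin, end,
            [&](int idx) { body(idx, idx + 1); });
    }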
rtohid has joined #ste||ar
twwright_ has joined #ste||ar
<hkaiser>
twwright_: any news on the networking end?
<heller>
jbjnr: I think I know how to achieve the kokkos-style scheduling easily
<jakub_golinowski>
hkaiser, I meant what is in M-ms's clarification
<hkaiser>
nod
<nikunj97>
hkaiser: I bring some bad news
<hkaiser>
nikunj97: uhh ohh
<twwright_>
hkaiser, yes; internet2 made some routing changes which should have had an effect on the intermittent network timeouts. I let them know yesterday that we were still seeing timeouts
<nikunj97>
I talked to the glibc people about the C runtime and static initialization of global objects
<hkaiser>
twwright_: so they changed things but nothing changed?
<nikunj97>
hkaiser: They told me that to make things work for global objects, I will either have to create a separate API to register them on HPX threads or create a separate toolchain
<hkaiser>
ok
<nikunj97>
Creating a separate toolchain will make things difficult
<nikunj97>
and is very time-consuming as well
<hkaiser>
right, we're not going to do that
<jakub_golinowski>
M-ms, did you try building opencv with openmp before? Any nasty surprises there?
<nikunj97>
so I was thinking of creating an API, but does HPX support registering objects?
<hkaiser>
sure, why not?
<twwright_>
hkaiser, it seems that way and it’s not even consistent. I’ve been sending ITS updates on the open ticket (the same ticket that Carola is on)
<hkaiser>
twwright_: thanks
<hkaiser>
anything we can do in addition?
<M-ms>
jakub_golinowski: no, I did not but it should be straightforward since you don't need any external libraries
<nikunj97>
hkaiser: so should I explore the API way?
<jakub_golinowski>
ok, hope for the best
<hkaiser>
nikunj97: if you want
<nikunj97>
But there's a catch: the runtime system currently starts at C main, and global objects are constructed before that. So registering them when the runtime system is not initialized will make things uglier
<hkaiser>
indeed
<nikunj97>
Does registering provide complete HPX functionality?
<hkaiser>
you can register functions which will be run right before hpx_main as an HPX thread
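A minimal sketch of the registration API hkaiser refers to; it assumes that hpx::register_startup_function called before hpx::init queues the function until the runtime starts. Whether this can be done safely from a global object's constructor (i.e. before the runtime exists at all) is exactly the open question discussed above.

    #include <hpx/hpx_init.hpp>
    #include <hpx/hpx.hpp>
    #include <iostream>

    // Runs as an HPX thread right before hpx_main, with full HPX functionality available.
    void init_my_globals()
    {
        std::cout << "initializing globals before hpx_main\n";
    }

    int hpx_main(int, char**)
    {
        return hpx::finalize();
    }

    int main(int argc, char* argv[])
    {
        // Assumed behaviour: registering before hpx::init queues the function until
        // the runtime is up; it is then executed right before hpx_main.
        hpx::register_startup_function(&init_my_globals);
        return hpx::init(argc, argv);
    }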
<twwright_>
hkaiser, I’m kicking ITS as hard as I can. They’ve gotten a lot of complaints from other departments on campus about this same issue as well and they do have an open ticket with internet2
<nikunj97>
hkaiser: I see, then I will look for ways to change their startup as well.
<hkaiser>
twwright_: ahh good - if others push as well it will get fixed eventually
<twwright_>
hkaiser, I’m hoping that it’s sooner
<hkaiser>
me too
<hkaiser>
makes you understand how much we depend on functioning networking infrastructure nowadays...
<nikunj97>
hkaiser: thanks for the help, I'll see what I can do to make things work
<hkaiser>
nikunj97: thanks
Anushi1998 has quit [Quit: Bye]
<K-ballo>
hkaiser: I don't like the stashing future<T> in general, but it should at the very least be constrained to T's that are movable, preferably nothrow, so as not to weaken future's exception specifications
twwright_ has quit [Quit: twwright_]
Anushi1998 has joined #ste||ar
anushi has joined #ste||ar
Anushi1998 has quit [Ping timeout: 264 seconds]
anushi is now known as Anushi1998
<hkaiser>
K-ballo: it's an experiment
jakub_golinowski has quit [Quit: Ex-Chat]
hkaiser has quit [Quit: bye]
stmatengss has quit [Quit: Leaving.]
eschnett has quit [Quit: eschnett]
<diehlpk_work>
heller, yt?
<diehlpk_work>
jbjnr, heller Please fill in your GSoC evaluation
<diehlpk_work>
Only 21.5 hours left
<heller>
diehlpk_work: still waiting on something from my student :(
akheir has joined #ste||ar
<heller>
Hope dies last
<nikunj97>
heller: did you fill the evaluation for me?
<diehlpk_work>
Sure, at least you are aware of it
<heller>
nikunj97: I guess I will, yeah
<nikunj97>
heller: thank you
<heller>
nikunj97: but you've been mostly interacting with Hartmut, I guess
<diehlpk_work>
nikunj97, Hartmut already did it
<nikunj97>
heller: yes. I can get you up to date with everything I've done
<nikunj97>
diehlpk_work: oh, so heller does not need to evaluate me?
<heller>
nikunj97: np. As long as you're not feeling left alone, all is good from my side
<heller>
nikunj97: only one mentor
<heller>
nikunj97: and Hartmut is admin. But I'll double check tomorrow morning
<nikunj97>
heller: oh ok. I'll still email you about everything I've done and get you up to date with all my research and integration.
<heller>
Sure, I'm very interested in the overall outcome
<diehlpk_work>
mcopik_, yt?
twwright_ has joined #ste||ar
twwright_ has quit [Client Quit]
twwright_ has joined #ste||ar
akheir has quit [Quit: Leaving]
twwright_ has quit [Ping timeout: 256 seconds]
twwright_ has joined #ste||ar
hkaiser has joined #ste||ar
mcopik_ has quit [Ping timeout: 256 seconds]
twwright_ has quit [Quit: twwright_]
rtohid has left #ste||ar ["Leaving"]
jbjnr has quit [Read error: Connection reset by peer]