hkaiser changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | | HPX: A cure for performance impaired parallel applications | | Buildbot: | Log: | GSoC2018:
nikunj97 has quit [Quit: Leaving]
hkaiser has quit [Quit: bye]
diehlpk_mobile has joined #ste||ar
eschnett has joined #ste||ar
parsa[w] has quit [Read error: Connection reset by peer]
stmatengss has joined #ste||ar
parsa[w] has joined #ste||ar
K-ballo has quit [Quit: K-ballo]
stmatengss1 has joined #ste||ar
stmatengss has quit [Read error: Connection reset by peer]
diehlpk_mobile has quit [Quit: Yaaic - Yet another Android IRC client -]
nanashi55 has quit [Ping timeout: 264 seconds]
nanashi55 has joined #ste||ar
anushi has quit [Quit: Bye]
eschnett has quit [Quit: eschnett]
Anushi1998 has joined #ste||ar
Anushi1998 has quit [Remote host closed the connection]
Anushi1998 has joined #ste||ar
Anushi1998 has quit [Remote host closed the connection]
jakub_golinowski has joined #ste||ar
stmatengss1 has quit [Quit: Leaving.]
Anushi1998 has joined #ste||ar
M-ms, yt?
stmatengss has joined #ste||ar
stmatengss has quit [Client Quit]
stmatengss has joined #ste||ar
jakub_golinowski: here
Hey, so I am now running different configurations in debug mode to learn more about nstripes and numThreads
the first observation is that the dependency on the nstripes does not seem to be U-shaped
at least for Mandelbrot example
I did not plot it but just by looking at the numbers: it is decreasing at the beginning (steeply for very small nstripes) and then saturates and stays at the same level for nstripes = 100 and nstripes = num_pixels (max possible value)
and any other value in between
My guess is that in mandelbrot there is a hidden for loop with 500 iterations for each body of the parallel_for loop so even if we divide into single tasks the performance does not drop that much
because still each parallel_for body is significantly bigger than overhead
yeah that's a good observation, the decrease with small nstripes is still correct behaviour
then you'd need to check the absolute performance compared to tbb or pthreads
so this is not really worrying yet
(in release mode of course)
btw, remember that if you want to step through code but still have decent performance there's also the RelWithDebInfo build type
but you shouldn't need that now (and proper benchmarks should still be done with Release)
So I am also checking that with tbb
and tbb has basically the same behaviour
ok, that's good
But I am confused, because I was convinced that there should be some cost of too small chunking
and with nstripes=0 (in which case, for hpx, each loop body is executed by a separate task) the performance seems to be one of the best
still here btw, I'm not sure what the problem could be so will have a look at the code
as a sanity check, nstripes = 1 should give the same performance with the HPX and pthreads backends
(with 4 or less threads though, so you don't use hyperthreads)
and tbb as well...
yes yes for the nstripes in ranges 1-8 the behaviour is as expected
do you mean absolute performance between hpx, tbb and pthreads?
hpx_nstripes vs tbb so fat
(I'm not talking about performance improving going from 1 to higher nstripes)
so far
ok, so you should now do this with Release
Ah yeah, I already am talking about release
so I was looking at debug first to confirm the paths but then I switch to release for the numbers
right, ok
by paths I mean what code was actually executed in which configuration
and you mean now that hpx and tbb are the same with nstripes = 1 but tbb scales better as you increase nstripes?
to make it more tangible I am running it again now with smaller workload per loop
I changed the number of iterations within 1 parallel_for body to 10 (from 500)
with the hpx backend, if nstripes = 0 it uses one task per pixel? or do you let hpx do the chunking in that case?
it is the old/current version in hpx_nstripes mode so there is one task per pixel
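To make the role of nstripes concrete, here is a minimal sketch of how a range could be cut into stripes, with one task per stripe. The helper `split_into_stripes` is hypothetical and is not the OpenCV or HPX implementation; treating nstripes = 0 as "one stripe per element" is an assumption that mirrors the hpx_nstripes mode described above.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper (not OpenCV/HPX code): split the half-open range
// [0, range_size) into contiguous stripes of near-equal size.
// Assumption: nstripes == 0 means one stripe per element, as in the
// hpx_nstripes mode discussed in the chat (one task per pixel).
std::vector<std::pair<std::size_t, std::size_t>>
split_into_stripes(std::size_t range_size, std::size_t nstripes)
{
    std::vector<std::pair<std::size_t, std::size_t>> stripes;
    if (range_size == 0)
        return stripes;

    // Never create more stripes than there are elements.
    if (nstripes == 0 || nstripes > range_size)
        nstripes = range_size;

    std::size_t const base = range_size / nstripes;   // minimum stripe length
    std::size_t const extra = range_size % nstripes;  // first `extra` stripes get one more

    stripes.reserve(nstripes);
    std::size_t begin = 0;
    for (std::size_t i = 0; i < nstripes; ++i)
    {
        std::size_t const len = base + (i < extra ? 1 : 0);
        stripes.emplace_back(begin, begin + len);  // [begin, begin + len)
        begin += len;
    }
    return stripes;
}
```

With nstripes = 1 the whole range ends up in a single stripe, which is why that setting behaves like sequential execution.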
try running them next time with only 4 threads, then you don't have to guess about effects from hyperthreading
so the minimum is around 0.3 seconds for both backends, which is good
hkaiser has joined #ste||ar
jakub_golinowski: so please try still letting hpx do the chunking when nstripes = 0, each pixel still has to go through ParallelLoopBodyWrapper but it's the same as for tbb so performance should be similar
M-ms Ok this will be done for the next compilation of OpenCVs
so for any nstripes > 0 should we say that the user knows what they are doing and accept it?
ok, good
and yes, nstripes > 0 should be respected
Oh that is interesting
I am running it on 4 cores and the U-shape shows
4 threads?
--hpx:threads=4 for now (numThreads not yet supported for hpx backend)
and for tbb I call cv::setNumThreads(4)
there's a bit of a U shape in the logs you sent earlier as well
and HPX is doing better than TBB; it seems it might be due to PU pinning
jakub_golinowski: could you create a graph?
hkaiser, for now I can quickly do it in Libre calc
or google spreadsheet
but I think I will do a proper benchmark from it because it is interesting to investigate it further (run it a few times, compute stds etc)
so Ok I am going for google spreadsheet for now
graphs are just better for understanding the data than lists of numbers
hpx is almost 5 times faster at its best compared to nstripes = 1 (with 4 threads), slightly weird but not bad...
indeed :P
stmatengss has quit [Quit: Leaving.]
[hpx] hkaiser force-pushed ready_future from 343eaf1 to 78842e0:
hpx/ready_future 78842e0 hkaiser: Adding direct data value to future to avoid allocation for make_ready_future....
Took a bit longer than expected
but I have to say that google spreadsheets have nice plotting tools
now please explain what I'm seeing
All sheets have the same schema but present data for different configs
mcopik has joined #ste||ar
The variable nstripes is the job partitioning argument used within OpenCV to enforce chunking of the job on the parallel backend
So let us focus on first sheet: hpx_nstripes-r4-10i
this is the OpenCV built with HPX backend respecting the nstripes variable and introducing as many tasks to deal with the job as given by the nstripes variable
There are two ranges in which nstripes was varied: small beginning range (0-99) and linspace of the whole range (50 points)
the small range was introduced to see the behaviour when number of tasks is very small. For example for nstripes=1 we have basically sequential execution, hence the steep drop in the subrange 1-8 of nstripes
I set the range to 100 to see if there will be some changes visible beyond nstripes=4*num_threads
The linspace over the whole range was introduced to observe the U-shape effect (or at least the second part of the U)
which is deemed to be the result of the too small task size
jakub_golinowski: so hpx times going up for large number of tasks is caused by the overheads
tbb does not expose that behavior
So this is what I think: a big nstripes means a lot of very small tasks, and therefore more overhead for task management
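The overhead argument can be illustrated with a toy cost model. This is only a sketch under stated assumptions, not a measurement or a description of HPX or TBB internals: every task is assumed to pay the same fixed overhead, the work is assumed to divide evenly, and tasks are assumed to run in rounds of `num_threads` at a time. The function name and parameters are hypothetical.

```cpp
#include <cstddef>

// Toy model (assumption, not measured behaviour):
//   time = ceil(num_tasks / num_threads) * (total_work / num_tasks + per_task_overhead)
// Few tasks leave threads idle; many tasks pay the fixed overhead many times.
double modeled_time(double total_work, double per_task_overhead,
                    std::size_t num_tasks, std::size_t num_threads)
{
    if (num_tasks == 0 || num_threads == 0)
        return 0.0;

    // Number of "rounds" needed when num_threads tasks run at once.
    std::size_t const rounds = (num_tasks + num_threads - 1) / num_threads;

    // Cost of one task: its share of the work plus the fixed overhead.
    double const per_task =
        total_work / static_cast<double>(num_tasks) + per_task_overhead;

    return static_cast<double>(rounds) * per_task;
}
```

In this model the time first drops as the number of tasks approaches the number of threads and then rises again as the per-task overhead starts to dominate, which is the U shape under discussion.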
I am not sure what tbb is doing exactly with the nstripes parameter but it is interesting that they have a bump for the nstripes close to the half of the whole range
jakub_golinowski: is the problem size constant for all measurements?
Yes it is, the range in both experiments is 25920000
because our current approach is to set hpx::parallel::execution::static_chunk_size fixed(1);
[hpx] hkaiser force-pushed ready_future from 78842e0 to edbedb5:
hpx/ready_future edbedb5 hkaiser: Adding direct data value to future to avoid allocation for make_ready_future....
jakub_golinowski: thanks for this, very interesting
and then just use as many chunks as nstripes says
mcopik has quit [Ping timeout: 276 seconds]
anushi has joined #ste||ar
mcopik has joined #ste||ar
[hpx] hkaiser force-pushed ready_future from edbedb5 to 6b9f8dc:
hpx/ready_future 6b9f8dc hkaiser: Adding direct data value to future to avoid allocation for make_ready_future....
jakub_golinowski: last question: the data point with hpx and nstripes = 0 was also with static_chunk_size(1)?
I have a feeling tbb stays fast even with bigger nstripes because it sets a lower bound on the size of tasks but it's still allowed to use bigger chunks if it feels like it's necessary
how does it know how large the tasks are?
or is it purely a question of the number of tasks
I don't know tbh
otherwise they just have really low overheads
yah, they don't have separate stacks for their 'tasks'
also just to add: If you look at the sheet 3 you can see that again the bump for tbb is clearly visible
stmatengss has joined #ste||ar
stmatengss has quit [Client Quit]
those bumps are a bit weird, you weren't doing anything else on your laptop at the same time?
M-ms: could be an artifact of how tbb sizes the chunks
I was doing other stuff so I am not yet going too deep into analyzing the bumps
if in the overnight run they also show up I will start thinking harder about them
and I know I said we should respect nstripes, but could you do the same plot with the default hpx chunker?
right away
mcopik_ has joined #ste||ar
mcopik has quit [Read error: Connection reset by peer]
it is done
sheet 5
nikunj97 has joined #ste||ar
nice, so this is just parallel_for(par, ...)?
mcopik_ has quit [Ping timeout: 256 seconds]
M-ms, exactly
I'm tempted to say we should use this instead, it's so much more predictable. But then it's a bit slower compared to the static chunk size version. So you will rerun all the tests overnight?
jakub_golinowski: was the one you just added with 4 threads, not 8? if that's the case the performance is good
M-ms, the plots in sheet 5 are now in accordance with sheet name so 4 threads
mcopik_ has joined #ste||ar
I just finished the 8 threads run and will add it to sheet 6 for completeness
ok, good
very good, so now the results from earlier (the ~50% overhead for the hpx backend) make sense, because they were the same as static_chunk_size(1) and nstripes = num_pixels
the plots we saw on monday, where hpx was maybe 50% slower
those were run with num_stripes = 0 and static_chunk_size(0)
just talking out loud
static_chunk_size(1) of course
uhm, the results using hpx, default chunker and e.g. nstripes are strange
nstripes = 2
it shouldn't be 4 times faster than nstripes = 1
well wait
First: I shuffled the order of sheets but the names are still correct
So now the hpx-4t-10i is the hpx that ignores the nstripes and always uses the default HPX chunking
it should be 4 times faster than for nstripes=1 because for nstripes=1 the backend is doing this check
and ignores the parallel implementation and just runs sequential
And for hpx_nstripes-4t-10i the difference is correct: the time is 2 times smaller for nstripes=2 compared to nstripes=1
thanks, that explains it
I was thinking about using nstripes but default chunker
you'll still want to call ParallelLoopBodyWrapper
hkaiser: jakub_golinowski during our talks, we discussed with heller hkaiser M-ms + kokkos people how to improve task times for simple for loop, by haning N-ary tasks. I think this would help enormously with the opencv style where things are done in one huge for loop (or similar)
haning = having ^^
jbjnr, could you tell sth more or give some links so I can read more on it?
jbjnr: you'll have to specify n-ary a bit better
jakub_golinowski: I will create an issue on the tracker and you can read that. give me 5 mins
everything is doable, the question is how much effort would that be
I suspect not as much as we might think.
well - the main task reworking should be doable, but reworking the parallel algorithms might take some effort.
I don't recall how we actually generate the tasks in a parallel for/foreach/etc
anushi has quit [Ping timeout: 264 seconds]
hkaiser: yt?
jbjnr: you know what the kokkos people do though?
tell me
the "n-ary task" thing is a bit of an overstatement
and ...
their scheduling loop goes a bit like this: while(true){ sleep_while(state == inactive) run_task(); }
that runs on all cores
The main thing is that we could still create one task instead of N, and save a ton of queue-buggery
I do wonder if the last task to pull one off, finishes it before one of the earlier ones - what we do there.
now, when a parallel for should get executed, they set a function pointer (the one that executes the loop body), and set the state to active
I see
once the loop body finishes, it sets the state to inactive again
daissgr has joined #ste||ar
so the task isn't really a task 'proper'
just a jump into the loop
and then some atomics etc.
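The scheduling loop described above can be sketched with the standard library alone. This is a hedged illustration, not Kokkos or HPX code: the class `loop_pool` and its members are hypothetical, a mutex and condition variables stand in for the atomics mentioned in the chat, and each worker takes one static slice of the range. It supports neither nested parallel loops nor concurrent callers.

```cpp
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical sketch: persistent workers sleep while inactive and jump
// into the published loop body when a parallel_for makes them active.
class loop_pool
{
public:
    explicit loop_pool(std::size_t num_workers) : num_workers_(num_workers)
    {
        workers_.reserve(num_workers_);
        for (std::size_t id = 0; id < num_workers_; ++id)
            workers_.emplace_back([this, id] { worker_loop(id); });
    }

    ~loop_pool()
    {
        {
            std::lock_guard<std::mutex> lock(m_);
            shutdown_ = true;
        }
        wake_.notify_all();
        for (auto& t : workers_)
            t.join();
    }

    // Runs body(i) for every i in [0, n); returns when all workers are done.
    void parallel_for(std::size_t n, std::function<void(std::size_t)> body)
    {
        std::unique_lock<std::mutex> lock(m_);
        body_ = std::move(body);     // publish the loop body
        n_ = n;
        remaining_ = num_workers_;
        ++generation_;               // state becomes "active"
        wake_.notify_all();
        done_.wait(lock, [this] { return remaining_ == 0; });  // "inactive" again
    }

private:
    void worker_loop(std::size_t id)
    {
        std::size_t seen = 0;
        for (;;)
        {
            std::size_t n = 0;
            {
                std::unique_lock<std::mutex> lock(m_);
                // sleep_while(state == inactive)
                wake_.wait(lock, [&] { return shutdown_ || generation_ != seen; });
                if (shutdown_)
                    return;
                seen = generation_;
                n = n_;
            }
            // Static slice of the range for this worker.
            std::size_t const begin = n * id / num_workers_;
            std::size_t const end = n * (id + 1) / num_workers_;
            for (std::size_t i = begin; i < end; ++i)
                body_(i);
            {
                std::lock_guard<std::mutex> lock(m_);
                if (--remaining_ == 0)
                    done_.notify_one();
            }
        }
    }

    std::size_t num_workers_;
    std::mutex m_;
    std::condition_variable wake_, done_;
    std::function<void(std::size_t)> body_;
    std::size_t n_ = 0;
    std::size_t remaining_ = 0;
    std::size_t generation_ = 0;
    bool shutdown_ = false;
    std::vector<std::thread> workers_;
};
```

A caller would construct `loop_pool pool(4);` once and then call `pool.parallel_for(n, body)` repeatedly. No task object is queued per iteration; the workers only observe a state change and run the shared body.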
well I still like the N-ary idea.
we could sure add similar logic to our scheduling loop
we'd probably have to do something similar - the thread_data is non-copyable, and the stack might be an issue- we'd end up running the task on the thread/stack of the scheduling loop itself
<what could go wrong>
a nested async inside a parallel loop! yikes
hkaiser has quit [Quit: bye]
jbjnr: kokkos is really going completely out of their way to try to directly map to the underlying hardware
that's why people are using it for HPC
so, a HPX backend (doing everything with dynamic tasking) and comparing the performance, will give us the answer of what we need to do as the next step
and there'll be two answers: 1) the microbenchmark performance 2) the mini apps that have been ported to kokkos
and I'm pretty sure the microbenchmarks will just rock on the existing kokkos benchmarks and we won't stand a chance there (maybe within 10% of the performance). But I am not so certain regarding "real" applications
K-ballo has joined #ste||ar
stmatengss has joined #ste||ar
mcopik_ has quit [Ping timeout: 245 seconds]
eschnett has joined #ste||ar
heller: agreed - and this is why I have been banging on about affinity etc recently. We need to support the kind of things that kokkos is doing well if we want to be competitive for those tight loop apps that have been carefully optimized for the kokkos model.
hkaiser has joined #ste||ar
mcopik_ has joined #ste||ar
jakub_golinowski: with and `--hpx:ini=hpx.max_idle_loop_count=100` you should see CPU usage graphs similar to what you saw with TBB
and if you're running more benchmarks tonight, could you add the openmp backend to the list as well?
M-ms, nice one with the exponential idle backoff
I am currently changing the OpenCV to include the conditional fixing of static_chunk_size
and also to use numThreads in the start-stop case
but then I have to generate my own argc/argv
and when there is hpx_main.hpp included by user then I have no chance to change numThreads
why not?
if you use the technique from init_globally you'll still be able to supply hpx command line options
jakub_golinowski: hold on, let me say that again
jakub_golinowski: using hpx_main.hpp still allows for hpx to see the command line options
I think he meant that with hpx_main.hpp one can't change the number of threads through cvSetNumThreads (since it would be called after HPX has been started)
but one could restrict the number of used threads with a thread_pool_executor(num_threads)
ok, makes sense
rtohid has joined #ste||ar
twwright_ has joined #ste||ar
twwright_: any news on the networking end?
jbjnr: i think I know how to achieve the kokkos style scheduling easily
hkaiser, I meant what is in M-ms's clarification
hkaiser: I bring some bad news
nikunj97: uhh ohh
hkaiser, yes; internet2 made some routing changes which should have had an effect on the intermittent network timeouts. I let them know yesterday that we were still seeing timeouts
I talked to glibc about the C runtime and static initialization of global objects
twwright_: so they changed things but nothing changed?
hkaiser: They told me that to make things work for global objects, I will either have to create a separate api to register them on HPX threads or create a separate toolchain
Creating a separate toolchain will make things difficult
and is very time costly as well
right, we're not going to do that
M-ms, did you try building opencv with openmp before? Any nasty surprises there?
so I was thinking of creating an api, but does HPX support registering objects?
sure, why not?
hkaiser, it seems that way and it’s not even consistent. I’ve been sending ITS updates on the open ticket (the same ticket that Carola is on)
twwright_: thanks
anything we can do in addition?
jakub_golinowski: no, I did not but it should be straightforward since you don't need any external libraries
hkaiser: so should I explore the api way?
ok, hope for the best
nikunj97: if you want
But there's a catch in it. The runtime system starts at C main currently and global objects are constructed before. So registering them when the runtime system is not initialized will make things uglier
Does registering provide complete HPX functionality?
you can register functions which will be run right before hpx_main as a HPX thread
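The idea of registering work from global constructors and running it later, once a runtime is up, can be sketched generically. This is not the HPX API: the class `startup_registry` and its members are hypothetical, and the sketch runs the registered functions on the calling thread rather than as HPX threads.

```cpp
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical sketch: global objects register deferred initialisers
// instead of touching the runtime directly; the runtime drains the list
// once it has started (for example right before the user's entry point).
class startup_registry
{
public:
    // Function-local static, so it is constructed on first use and can be
    // called safely from constructors of other global objects.
    static startup_registry& instance()
    {
        static startup_registry registry;
        return registry;
    }

    // Called from global constructors, before the runtime exists.
    void add(std::function<void()> f)
    {
        pending_.push_back(std::move(f));
    }

    // Called once the runtime is running; executes the functions in
    // registration order and returns how many were run.
    std::size_t run_all()
    {
        std::vector<std::function<void()>> to_run;
        to_run.swap(pending_);  // allow re-registration while running
        for (auto& f : to_run)
            f();
        return to_run.size();
    }

private:
    startup_registry() = default;
    std::vector<std::function<void()>> pending_;
};
```

The constructor of a global object would then call `startup_registry::instance().add(...)` with the part of its initialisation that needs the runtime, and the runtime would call `run_all()` after it has started.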
hkaiser, I’m kicking ITS as hard as I can. They’ve gotten a lot of complaints from other departments on campus about this same issue as well and they do have an open ticket with internet2
hkaiser: I see, then I will look for ways to change their startup as well.
twwright_: ahh good - if others push as well it will get fixed eventually
hkaiser, I’m hoping that it’s sooner
me too
makes you understand how much we depend on functioning networking infrastructure nowadays...
hkaiser: thanks for the help, I'll see what I can do to make things work
nikunj97: thanks
Anushi1998 has quit [Quit: Bye]
hkaiser: I don't like the stashing future<T> in general, but it should at the very least be constrained to T's that are movable, preferably nothrow as to not weaken future's exception specifications
twwright_ has quit [Quit: twwright_]
Anushi1998 has joined #ste||ar
anushi has joined #ste||ar
Anushi1998 has quit [Ping timeout: 264 seconds]
anushi is now known as Anushi1998
K-ballo: it's an experiment
jakub_golinowski has quit [Quit: Ex-Chat]
hkaiser has quit [Quit: bye]
stmatengss has quit [Quit: Leaving.]
eschnett has quit [Quit: eschnett]
heller, yt?
jbjnr, heller Please fill in your GSoC evaluation
Only 21.5 hours left
diehlpk_work: still waiting on something from my student :(
akheir has joined #ste||ar
Hope dies last
heller: did you fill the evaluation for me?
Sure, at least you are aware of it
nikunj97: I guess I will, yeah
heller: thank you
nikunj97: but you've been mostly interacting with Hartmut, I guess
nikunj97, Hartmut already did it
heller: yes. I can get you up to date with everything I've done
diehlpk_work: oh, so heller does not need to evaluate me?
nikunj97: np. As long as you're not feeling left alone, all is good from my side
nikunj97: only one mentor
nikunj97: and Hartmut is admin. But I'll double check tomorrow morning
heller: oh ok. I'll still email you about everything I've done and get you up to date with all my research and integration.
Sure, I'm very interested in the overall outcome
mcopik_, yt?
twwright_ has joined #ste||ar
twwright_ has quit [Client Quit]
twwright_ has joined #ste||ar
akheir has quit [Quit: Leaving]
twwright_ has quit [Ping timeout: 256 seconds]
twwright_ has joined #ste||ar
hkaiser has joined #ste||ar
mcopik_ has quit [Ping timeout: 256 seconds]
twwright_ has quit [Quit: twwright_]
rtohid has left #ste||ar ["Leaving"]
jbjnr has quit [Read error: Connection reset by peer]