aserio changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/
zbyerly__ has quit [Ping timeout: 240 seconds]
mcopik has quit [Ping timeout: 260 seconds]
parsa[w] has quit [Read error: Connection reset by peer]
hkaiser has quit [Quit: bye]
pagrubel has joined #ste||ar
pagrubel has quit [Ping timeout: 276 seconds]
K-ballo has quit [Quit: K-ballo]
vamatya_ has quit [Ping timeout: 246 seconds]
bikineev has quit [Remote host closed the connection]
mcopik has joined #ste||ar
bikineev has joined #ste||ar
bikineev has quit [Remote host closed the connection]
bikineev has joined #ste||ar
http_GK1wmSU has joined #ste||ar
http_GK1wmSU has left #ste||ar [#ste||ar]
bikineev has quit [Remote host closed the connection]
<hkaiser>
jbjnr: you around?
pagrubel has joined #ste||ar
<jbjnr>
here for a min
<jbjnr>
hkaiser: ^^
<jbjnr>
what's up
<hkaiser>
jbjnr: just a q
<hkaiser>
how did you envision exposing oversubscription to the rp?
<jbjnr>
if the user says "--hpx:threads=1" but then in main() adds the same thread to more than one pool, it would throw unless the user added --hpx:allow-oversubscription
<jbjnr>
(or similar)
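A minimal sketch of the scenario jbjnr describes, written against the resource_partitioner API of roughly this period (exact names and signatures may differ); --hpx:allow-oversubscription is the flag proposed above, not a confirmed option:

```cpp
// Sketch only: the same PU handed to two pools. With --hpx:threads=1 and no
// oversubscription allowed this is expected to throw; with the proposed
// --hpx:allow-oversubscription flag (name from the discussion) it would pass.
#include <hpx/hpx_init.hpp>
#include <hpx/include/resource_partitioner.hpp>

int hpx_main(int, char**)
{
    return hpx::finalize();
}

int main(int argc, char* argv[])
{
    hpx::resource::partitioner rp(argc, argv);

    rp.create_thread_pool("pool-a");
    rp.create_thread_pool("pool-b");

    // add the same PU (PU 0 of core 0 on NUMA domain 0) to both pools
    auto const& pu = rp.numa_domains()[0].cores()[0].pus()[0];
    rp.add_resource(pu, "pool-a");
    rp.add_resource(pu, "pool-b");    // oversubscribed PU

    return hpx::init();
}
```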
<hkaiser>
ok, that's the command line, but how would the rp do that?
<jbjnr>
currently the RP only provides access to the threads that hpx:threads=N gives from the "old" thread affinity/binding
<hkaiser>
nod
<jbjnr>
so if the user wanted to get all N cores and ignore the hpx:threads option, we'd have to add some new functions
<hkaiser>
and it throws if two pools want to use the same core
<hkaiser>
is oversubscription a feature of the rp, the pool, or is it PU-related?
<jbjnr>
yes, the user code might need two pools, and add one numa domain to one, and another to the other, but on a single socket machine ... unless they allow oversubscription ...
<jbjnr>
feature of RP I guess
<hkaiser>
ok
<jbjnr>
because you can currently add the same PU to N pools
<hkaiser>
right
<jbjnr>
if you want - but we have not "handled" it yet
<hkaiser>
I know, but I'd like to implement that
<jbjnr>
me too
<jbjnr>
feel free to start
<hkaiser>
also we will need dynamic footprints for a pool
<jbjnr>
all we need to do is generate a pu_mask and get the thread indexing right
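For reference, a minimal sketch of the pu_mask part, assuming HPX's topology helpers (mask_type, resize, set); the header location has moved between versions, so treat the include as approximate:

```cpp
// Sketch: build a bitmask with one bit set per PU owned by a pool.
#include <hpx/runtime/threads/topology.hpp>

#include <cstddef>
#include <vector>

hpx::threads::mask_type make_pool_mask(
    std::vector<std::size_t> const& pus, std::size_t num_pus_on_machine)
{
    hpx::threads::mask_type mask;
    hpx::threads::resize(mask, num_pus_on_machine);  // one bit per PU on the machine
    for (std::size_t pu : pus)
        hpx::threads::set(mask, pu);                 // this PU belongs to the pool
    return mask;
}
```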
<jbjnr>
for dynamic indexing I have a nice idea
<jbjnr>
dynamic pools I mean
<hkaiser>
our pools should already support that
<jbjnr>
We should use a system that is similar to the way Qt layouts work, where a pool can be stretchy, or fixed
<jbjnr>
each Qt widget on screen can expand, or not and the user constrains it
<hkaiser>
yes
<jbjnr>
when we create pools, we should use similar semantics
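A purely hypothetical sketch of what such layout-style semantics could look like; none of these names exist in HPX, they only illustrate the fixed-versus-stretchy idea:

```cpp
// Hypothetical, not HPX API: a per-pool sizing policy in the spirit of Qt
// layout size policies.
enum class pool_size_policy
{
    fixed,        // pool keeps exactly the PUs it was given
    expanding,    // pool may grow onto idle PUs at runtime
    shrinkable    // pool may hand PUs back when they are idle
};

// usage idea (hypothetical overload):
//   rp.create_thread_pool("mpi",  pool_size_policy::fixed);
//   rp.create_thread_pool("work", pool_size_policy::expanding);
```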
<hkaiser>
that's what we have for the nested schedulers using the old resource_manager
<jbjnr>
(currently there is a problem in the schedulers etc., that if I add 2 cores to a pool, but they are core 0 on domain 0 and core 0 on domain 1 - then the indexing might be dodgy)
<jbjnr>
(I need to look into that)
<hkaiser>
the pools don't care, the rp should get that right
<jbjnr>
yes. I'm just not sure if we create them all in the right order and get the pu numbering right for all cases
<hkaiser>
let's write tests
<jbjnr>
so far all my pools have been 'contiguous' etc
<jbjnr>
ran tests this morning and got good mpi 1,2 ranks, but code locks up on N=4
<jbjnr>
is there anything we need to check when running in distributed that we might have messed up?
<hkaiser>
not sure what you mean
<jbjnr>
(I am using run-hpx-main on all ranks - the code needs it)
<hkaiser>
we need tests for a set of use cases to support
<jbjnr>
yes
<hkaiser>
manual testing using the example does not scale
<jbjnr>
(my tests don't work on N=4, so I am worried that I might have messed up and we create pools more than once by mistake or something like that.)
<hkaiser>
turn it into a test and commit
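A sketch of the kind of test being discussed, assuming the resource_partitioner interface of the time; depending on where validation happens, the failure might surface in hpx::init() rather than in add_resource():

```cpp
// Sketch: adding the same PU to two pools without enabling oversubscription
// should fail. API names are approximate to the era.
#include <hpx/hpx_init.hpp>
#include <hpx/include/resource_partitioner.hpp>
#include <hpx/util/lightweight_test.hpp>

#include <exception>

int hpx_main(int, char**)
{
    return hpx::finalize();
}

int main(int argc, char* argv[])
{
    hpx::resource::partitioner rp(argc, argv);

    rp.create_thread_pool("a");
    rp.create_thread_pool("b");

    auto const& pu = rp.numa_domains()[0].cores()[0].pus()[0];
    rp.add_resource(pu, "a");

    bool caught = false;
    try
    {
        rp.add_resource(pu, "b");    // same PU twice, no oversubscription
    }
    catch (std::exception const&)
    {
        caught = true;
    }
    HPX_TEST(caught);

    hpx::init();
    return hpx::util::report_errors();
}
```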
<jbjnr>
(matrix code btw)
<jbjnr>
but initial results look much better than before - got some normal scaling instead of the massive drop on n=2
<hkaiser>
good
<jbjnr>
still super shit compared to parsec
<jbjnr>
:(
<hkaiser>
so you carve out one or two cores just for the network?
<jbjnr>
yes. I disable all hpx networking and then put all mpi tasks onto an mpi pool
<jbjnr>
cos raffaele has wrapped all his mpi comms in small tasks
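A sketch of that setup, assuming an executor that targets a named pool (the exact executor type and header are an assumption here and have varied across HPX versions); read it as the shape of the idea rather than the exact API:

```cpp
// Sketch: tasks that make blocking MPI calls run on a dedicated "mpi" pool
// so they cannot stall workers in the default pool.
#include <hpx/include/async.hpp>
#include <hpx/include/lcos.hpp>
#include <hpx/runtime/threads/executors/pool_executor.hpp>

#include <mpi.h>

hpx::future<void> exchange_halo(
    double* buf, int count, int peer, MPI_Comm comm)
{
    // assumed: an executor bound to the thread pool named "mpi"
    hpx::threads::executors::pool_executor mpi_exec("mpi");

    return hpx::async(mpi_exec, [=]() {
        // the blocking MPI call only ever occupies the reserved core(s)
        MPI_Sendrecv_replace(buf, count, MPI_DOUBLE, peer, 0, peer, 0,
            comm, MPI_STATUS_IGNORE);
    });
}
```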
<hkaiser>
that will not scale ever
<jbjnr>
should work the same as raw mpi really
<hkaiser>
as in a futurized execution tree, if some of the tasks do mpi they will block the whole execution
<jbjnr>
that's why they are in their own pool, and the DAG has been carefully generated to handle them
<hkaiser>
ok
<jbjnr>
this is why I am doing all this work in the first place
<hkaiser>
jbjnr: even if the mpi runs on separate cores, those tasks will still block all of the futurized execution graph as the mpi will 'cut through' this tree
<jbjnr>
the tasks that need the mpi data cannot run without it - they are blocked on those tasks
<jbjnr>
there is nothing we can do about that
<jbjnr>
but moving them onto their own pool at least stops the blocking mpi calls from blocking other work that is not using the mpi data
<jbjnr>
parsec is just so much better, it is really depressing
<hkaiser>
jbjnr: do you know why?
mcopik has quit [Ping timeout: 260 seconds]
<jbjnr>
if I say "because hpx is a bit shit" then you will just shout at me. They schedule tasks before execution, so they optimize the dag, but I am truly stunned by the scale of the difference - I can get 870 GFlops on one node, they get 1000 - it all seems to be to do with numa placement as far as I can tell, and the scheduling
<jbjnr>
bbiab
<hkaiser>
jbjnr: we know hpx is shit
<hkaiser>
and yah, numa placement is a big thing
zbyerly__ has joined #ste||ar
mcopik has joined #ste||ar
david_pf_ has quit [Quit: david_pf_]
mcopik has quit [Ping timeout: 255 seconds]
hkaiser has quit [Read error: Connection reset by peer]
hkaiser has joined #ste||ar
<github>
[hpx] hkaiser pushed 1 new commit to partitioned_vector: https://git.io/v7ovn
<github>
hpx/partitioned_vector 1860cc2 Hartmut Kaiser: Fixing test (cherry pick from master)
<jbjnr>
(hkaiser: when I said numa placement earlier - I should really include all cache related effects)
<jbjnr>
hkaiser: I have 3 threads that I cannot account for. if I disable IO_POOL and TIMER_POOL and run with threads=1, I expect one app thread that gets suspended once the worker threads kick in - any idea what the other two are?
pree_ has joined #ste||ar
<hkaiser>
tcp threads
<hkaiser>
2 of them
<hkaiser>
jbjnr: ^^
<github>
[hpx] hkaiser pushed 1 new commit to resource_partitioner: https://git.io/v7ofo
<hkaiser>
jbjnr: what did you change in the scheduler for this?
<jbjnr>
in these runs, I'm using my scheduler and an mpi pool with 1 thread reserved. the main difference in the scheduler is the placement of tasks and stealing
<jbjnr>
I'm not finished with the scheduler yet, but getting a bit fed up.
pree_ has joined #ste||ar
bikineev has quit [Read error: Connection reset by peer]
<hkaiser>
jbjnr: I hear you, it's like a piece of soap in the shower
<jbjnr>
lol
thundergroudon[m has quit [Ping timeout: 255 seconds]
taeguk[m] has quit [Ping timeout: 246 seconds]
<hkaiser>
you might have to use one pool per numa domain and be careful about placing tasks
bikineev has joined #ste||ar
<hkaiser>
so the main difference is the dedicated core for the tasks which do MPI calls
<jbjnr>
I'm almost doing that - in my scheduler, I allocate HP queues based on numa domain and can control the stealing, so it is almost like having two pools.
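A hypothetical sketch of the stealing policy described, not actual HPX scheduler code: workers prefer queues on their own NUMA domain and only cross domains as a fallback:

```cpp
// Hypothetical sketch: one queue per PU, NUMA-aware stealing order.
#include <cstddef>
#include <deque>
#include <optional>
#include <vector>

struct task { int id; };    // placeholder for a schedulable task

struct task_queue
{
    std::deque<task> items;

    std::optional<task> try_pop()
    {
        if (items.empty())
            return std::nullopt;
        task t = items.front();
        items.pop_front();
        return t;
    }
};

struct numa_aware_queues
{
    std::vector<task_queue> queues;     // one queue per PU
    std::vector<std::size_t> numa_of;   // PU index -> NUMA domain

    std::optional<task> steal(std::size_t my_pu)
    {
        // first pass: steal only from queues on the same NUMA domain
        for (std::size_t q = 0; q != queues.size(); ++q)
            if (q != my_pu && numa_of[q] == numa_of[my_pu])
                if (auto t = queues[q].try_pop())
                    return t;

        // second pass: allow cross-domain stealing as a last resort
        for (std::size_t q = 0; q != queues.size(); ++q)
            if (numa_of[q] != numa_of[my_pu])
                if (auto t = queues[q].try_pop())
                    return t;

        return std::nullopt;
    }
};
```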
auviga has joined #ste||ar
ABresting has joined #ste||ar
<hkaiser>
but if the tasks run on the other numa domain you could be still in trouble
<jbjnr>
the mpi pool is helping, but the scheduling improvements make a big difference too. the speedup in some cases is very significant over the old hpx
<jbjnr>
^^yes, the problem is that if you constrain tasks to one domain, then the other cores are idle, and finding a good balance has been tricky
<hkaiser>
well, you don't know whether this is caused by the scheduler alone
<hkaiser>
the one-node diffs are minimal
pagrubel is now known as patg[[w]]
<jbjnr>
in these plots the one node diffs are not evident, in my other runs they are. I have so many settings to adjust I don't know what I'm actually doing any more.
<hkaiser>
nod
<hkaiser>
jbjnr: are those changes generalizable?
<jbjnr>
IMHO we should throw away the six schedulers in hpx and replace them with mine. so yes, very generalizable!
<hkaiser>
yes, let's do that
<jbjnr>
:)
<jbjnr>
I wasn't serious
<hkaiser>
I was
<hkaiser>
we use just one scheduler 100% anyways
<jbjnr>
Let me do some graphs with all six though before we pursue this further
<hkaiser>
sure
<jbjnr>
I need to fix mine so that it 'always' outperforms the others - currently there are some combinations of params where it is worse
<hkaiser>
k
<jbjnr>
which is why I say that I don't know what I'm doing. The single node diffs should usually be larger
<jbjnr>
(also - not all codes use high priority tasks the way this matrix stuff does, so other schedulers might be appropriate)
<patg[[w]]>
jbjnr: I'd be interested in your graphs
<jbjnr>
patg[[w]]: I'll make sure you see them if and when I do them
<patg[[w]]>
If and when???
vamatya_ has joined #ste||ar
patg[[w]] has quit [Quit: Leaving]
pat[[w]] has joined #ste||ar
<jbjnr>
it takes some work!
<hkaiser>
jbjnr: do you have to stick with mpi for this?
<hkaiser>
can't you additionally use your pp?
<pat[[w]]>
jbjnr: what is the status of your pp?
<jbjnr>
hkaiser: cscs is not interested in my PP and I am under orders to make HPX work with MPI - hence the new pools and RP work
<hkaiser>
understand
<hkaiser>
so heller will need to fix those
<jbjnr>
pat[[w]]: the PP should work ok; we think it can still produce lockups at high ranks and intensive thread counts, but we're not sure - it hasn't been tested recently
<jbjnr>
hkaiser: I'm still going to work on the PP - just because cscs says no, that won't stop me
<jbjnr>
they are wrong about mpi+x
<hkaiser>
lol
<hkaiser>
sure they are
<jbjnr>
when they realize they were wrong, I'll be there to save them!
<hkaiser>
jbjnr: we've got the first small chunk of the big project, btw
<hkaiser>
keeps us afloat, though, it's just for one year - so not too bad
vamatya_ has quit [Ping timeout: 240 seconds]
pat[[w]] has quit [Quit: Leaving]
bikineev has quit [Read error: Connection reset by peer]
bikineev has joined #ste||ar
<jbjnr>
hkaiser: looks like I screwed up and used some of the new scheduler (but no mpi pool) numbers for the old hpx scheduler line in the plot, so that's why they seem almost the same.
<jbjnr>
(^just fyi)
bikineev has quit [Read error: Connection reset by peer]
bikineev has joined #ste||ar
<hkaiser>
jbjnr: ok
bikineev has quit [Read error: Connection reset by peer]
bikineev has joined #ste||ar
bikineev_ has joined #ste||ar
bikineev has quit [Read error: Connection reset by peer]
bikineev has joined #ste||ar
bikineev_ has quit [Ping timeout: 246 seconds]
bikineev has quit [Read error: Connection reset by peer]
bikineev has joined #ste||ar
thundergroudon[m has joined #ste||ar
bikineev_ has joined #ste||ar
bikineev has quit [Read error: Connection reset by peer]
bikineev has joined #ste||ar
bikineev_ has quit [Read error: Connection reset by peer]
bikineev_ has joined #ste||ar
bikineev has quit [Read error: Connection reset by peer]
taeguk[m] has joined #ste||ar
pree_ has quit [Quit: AaBbCc]
bikineev_ has quit [Ping timeout: 246 seconds]
bikineev has joined #ste||ar
bikineev has quit [Read error: Connection reset by peer]
bikineev has joined #ste||ar
bikineev has quit [Read error: Connection reset by peer]
<github>
[hpx] hkaiser pushed 1 new commit to resource_partitioner: https://git.io/v7onN
<github>
hpx/resource_partitioner cc295bf Hartmut Kaiser: Enable over-subscription of pus in resource_partitioner...