aserio changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/
EverYoun_ has quit [Remote host closed the connection]
Bibek has quit [Remote host closed the connection]
Bibek has joined #ste||ar
parsa has quit [Quit: Zzzzzzzzzzzz]
eschnett has quit [Quit: eschnett]
diehlpk has joined #ste||ar
EverYoung has joined #ste||ar
EverYoung has quit [Ping timeout: 255 seconds]
diehlpk has quit [Ping timeout: 268 seconds]
EverYoung has joined #ste||ar
hkaiser has quit [Quit: bye]
K-ballo has quit [Quit: K-ballo]
diehlpk has joined #ste||ar
EverYoung has quit [Remote host closed the connection]
diehlpk has quit [Ping timeout: 248 seconds]
david_pfander has joined #ste||ar
david_pfander has quit [Ping timeout: 248 seconds]
david_pfander has joined #ste||ar
david_pfander1 has joined #ste||ar
david_pfander has quit [Read error: Connection reset by peer]
david_pfander1 is now known as david_pfander
david_pfander has quit [Ping timeout: 240 seconds]
david_pfander has joined #ste||ar
parsa has joined #ste||ar
parsa has quit [Read error: Connection reset by peer]
parsa| has joined #ste||ar
parsa| has quit [Read error: Connection reset by peer]
parsa has joined #ste||ar
david_pfander has quit [Ping timeout: 248 seconds]
parsa has quit [Quit: Zzzzzzzzzzzz]
parsa has joined #ste||ar
jbjnr has joined #ste||ar
parsa has quit [Quit: Zzzzzzzzzzzz]
<msimberg>
heller: did you already have more changes for the throttle test?
<msimberg>
was going to apply the patch from yesterday so you don't have to make a pr...
<heller>
msimberg: ok, I am testing stuff at the moment...
<heller>
msimberg: it looks like the exit decisions are sometimes too relaxed, and sometimes too strict
<heller>
so if I make this_thread_executor work, other applications start to hang
<msimberg>
okay, what was the fix for that?
<msimberg>
so it really seems like it could use some option to say how strict it should be
<msimberg>
btw, thanks for the inspect link
<msimberg>
do I need special permissions on circleci to find those myself?
<heller>
no
<heller>
you just need to be logged in
<msimberg>
ok, that's it then
<heller>
msimberg: the fix seems to be that we should check how many threads are still available when doing wait_or_add_new, and not base our decision on whether or not there is more work to add
<msimberg>
so threads available != more work to add? what do you mean exactly by threads available?
<heller>
the value in thread_map_count_
<heller>
there could be situations where we have suspended threads that still need to run on the scheduler
<heller>
(this is what happens with the timed executor tests)
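For context, a rough C++ sketch of the exit check heller describes here (wait_or_add_new and thread_map_count_ are the names used in this discussion; the surrounding types and loop are illustrative assumptions, not HPX's actual scheduling_loop):

    #include <atomic>
    #include <cstdint>

    // Illustrative stand-in for the scheduler state referenced above.
    struct scheduler
    {
        std::atomic<std::int64_t> thread_map_count_{0};  // all registered threads
        void wait_or_add_new(bool& added);               // stages new work, if any
    };

    // Suspended threads (e.g. from the timed executor tests) live only in the
    // thread map, not in the work queues, so an empty queue alone is not enough
    // to terminate; exit only when no threads remain registered at all.
    inline bool may_exit(scheduler const& sched)
    {
        return sched.thread_map_count_.load(std::memory_order_acquire) == 0;
    }

    void scheduling_loop(scheduler& sched)
    {
        while (true)
        {
            bool added = false;
            sched.wait_or_add_new(added);

            if (!added && may_exit(sched))  // no new work *and* no threads left
                break;
        }
    }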
<msimberg>
mmh, okay, but then that breaks the throttle test again (when you remove yourself) :)
<msimberg>
so more or less like it was before, right?
<heller>
I have a patch now, i think
<msimberg>
ok, nice
<heller>
testing right now...
<heller>
I'll file a PR against your branch soon then
<msimberg>
or does something trigger it automatically?
<jbjnr>
(I worry that you are spending time fixing things like this_thread_executor when really these executors should not exist).
<msimberg>
jbjnr: valid concern, but I'm not really trying to fix it as I hope heller will have a fix soon ;)
<msimberg>
but I guess your worry applies to him as well
<msimberg>
besides, I'm forced to read new parts of the codebase which is at least not a bad thing...
<jbjnr>
here's a task for you then - find out why we actually need some of these executors and see if their functionality can be derived from the other executors that are used in the code base
<msimberg>
ooh, sure, I'll try
<heller>
yay, I wrote a python script to compute the USL graphs based on measurement points now ;)
<heller>
hope to get some nice insights with this
<msimberg>
btw, I think in this case it's not really a problem in the executors but in the scheduling_loop termination, it just happens to show up in those tests
<msimberg>
heller: nice
<heller>
in the meantime, I have no idea what's going on with those dreaded tests :/
<msimberg>
shame
<msimberg>
so I tried to look around a bit more, and suspended threads only stay in the thread map, right?
<msimberg>
not in any of the other queues
<jbjnr>
my point is that the scheduling loop and its termination criteria become more complex as they must support other execution models that might not actually be used any more and are just relics of some fudgery introduced years ago to support certain os_thread execution stuff that might be better handled by some other approach.
<heller>
msimberg: correct
<heller>
jbjnr: I share your point
<msimberg>
jbjnr: fair point
<jbjnr>
heller: thanks for your support
<heller>
jbjnr: however, what I am trying to do is to make everything that's there work
<heller>
jbjnr: cleaning everything up is, I guess, a very HUGE task
<jbjnr>
correct
<msimberg>
and heller, to continue: it's the timed tasks that are the problem in this case, no? because they would be suspended but not in any queues? so this would probably appear with any normal executor/scheduler/whatever
<heller>
msimberg: correct.
<msimberg>
I guess we can't check if a thread has been stolen at the moment?
<heller>
msimberg: you are fast ;)
<heller>
why would that be important?
<msimberg>
well, for the throttle test
<heller>
jbjnr: on a related note, I think our current design for task scheduling etc. is overly complex, we carry along a lot of technical depth
<msimberg>
the checks were relaxed because of the case where a thread has been stolen but is still in the thread map
<msimberg>
i'm imagining that maybe the check should be thread_map_count_not_stolen == 0 or something like this
<msimberg>
but it gets complicated as well
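A small sketch of the thread_map_count_not_stolen idea msimberg floats here; the second counter and where it would be maintained are purely hypothetical:

    #include <atomic>
    #include <cstdint>

    // Hypothetical counters: all threads registered in the thread map, and the
    // subset of those that have been stolen by another scheduler.
    std::atomic<std::int64_t> thread_map_count_{0};
    std::atomic<std::int64_t> thread_map_count_stolen_{0};

    // Only threads this scheduler still owns should keep it alive.
    inline bool no_owned_threads_left()
    {
        return thread_map_count_.load() - thread_map_count_stolen_.load() == 0;
    }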
<jbjnr>
"we carry along a lot of technical depth" ? not sure what you mean here
<heller>
jbjnr: I am almost at the point of saying that it is quicker to start from scratch with a clear design for what we want to support. We currently have several mechanisms for doing essentially the same thing, which we have to carry along because nobody actually cleans up the code base
<heller>
jbjnr: sorry, not depth, debt
<jbjnr>
I agree
<jbjnr>
I am starting on this by throwing away all the schedulers - except my new one that I have rewritten over the last could of days whilst daint was down :)
<jbjnr>
^couple of days
<jbjnr>
hmmm. my new chat client does not show smileys :(
<msimberg>
something has to be done for the release though; even if it's in 6 months, I guess cleaning up the schedulers and executors might be tight?
<msimberg>
properly, that is
<jbjnr>
schedulers is easy
<jbjnr>
rm -rf *
<msimberg>
:)
<jbjnr>
and remove a few CMake options
<jbjnr>
most of them are not used anywhere.
<msimberg>
okay executors then
<jbjnr>
trickier - hence my issue that nobody read
<jbjnr>
half of them are undocumented and unexplained - must wait till hartmut returns from SC
<msimberg>
I'm reading it! but I can't do much about it yet...
<msimberg>
(finding out about executors now) so all the local_priority_queue_executor, local_queue_executor etc. are for the case when you realize you need a different scheduler but are too lazy to create another pool with that scheduler?
<msimberg>
if most of the schedulers go then I guess at least those executors can (have to) go as well
<msimberg>
is it just convenience?
<heller>
I would look into the executors proposal
<heller>
I think that's the biggest issue
<heller>
that we don't have a clear line between execution agents, execution contexts and executors etc.
<heller>
If that had been there, and properly applied to the code base, everything would be simpler
<jbjnr>
I don't think msimberg should go there just yet.
<jbjnr>
(I only want to remove unnecessary execs at the moment)
<msimberg>
yeah, and runtime suspension goes before any other (bigger) task for me until that's done
<msimberg>
but the throttling stuff is relevant for that... which is relevant to executors...
<heller>
hkaiser: can we get back to merging PRs only once they are green and reviewed?
parsa| has quit [Quit: Zzzzzzzzzzzz]
<heller>
hkaiser: I've been thinking about your dataflow issue... Could it also be a use after move? You should try the clang_tidy branch which fixes some of those
<K-ballo>
what's holding up the clang tidy cleanup?
<heller>
There's this one use after move resolution in for_loop which hkaiser doesn't like. He wanted to look into it
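For readers following along, a generic example of the use-after-move pattern that clang-tidy's bugprone-use-after-move check flags (this is not the actual for_loop or dataflow code):

    #include <utility>
    #include <vector>

    void consume(std::vector<int> v) { /* takes ownership of v */ }

    void example()
    {
        std::vector<int> v = {1, 2, 3};
        consume(std::move(v));  // v is left in a valid but unspecified state

        // clang-tidy warns here: reading v after the move may silently
        // observe an empty (or otherwise unspecified) container.
        auto n = v.size();
        (void) n;
    }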
<hkaiser>
jbjnr: it's a special thread used to properly synchronize shutdown
<jbjnr>
where is it supposed to execute? on any hpx thread or somewhere special?
<jbjnr>
it uses id=-1 - so anywhere ...
<jbjnr>
but it should not do that
<hkaiser>
it's a kernel-thread
<hkaiser>
it's mostly sleeping
<hkaiser>
no sorry
<hkaiser>
mixing up things
<jbjnr>
register-thread triggers pool-create-thread which is an hpx task
<hkaiser>
jbjnr: that is the thread that eventually executes hpx_main
<jbjnr>
hmm. thanks
<jbjnr>
the thread id is messing up my stuff
<jbjnr>
I don't like having to add special cases - can I launch it on thread 0 instead of thread -1?
<hkaiser>
sure
<jbjnr>
that should not make any difference
<hkaiser>
doesn't make a difference
<jbjnr>
ta
<jbjnr>
I'll check it doesn't cause harm
<hkaiser>
sure, it won't
parsa has quit [Quit: Zzzzzzzzzzzz]
parsa has joined #ste||ar
parsa has quit [Client Quit]
<heller>
hkaiser: optional had been sitting there for 6 days, it turned green today ;)
<hkaiser>
so you're following what's going on - nice
<heller>
i sure do
<hkaiser>
it was a low-risk patch - I would like to get rid of the experimental optional
mbremer has joined #ste||ar
<mbremer>
@hkaiser yt?
<heller>
the reason I didn't review it was mainly that I think there are other patches that should have had a higher priority; not submitting a review is my way of keeping those patches at lower priority... looks like this technique is not effective
<jbjnr>
it is a suboptimal strategy
<heller>
sure, we have different priorities and should apply that to what's important to us, I guess
jakemp has joined #ste||ar
<heller>
merging without notice after a given period without review is suboptimal as well, I think
<heller>
hkaiser: what's wrong about experimental::optional?
parsa[[w]] has joined #ste||ar
<heller>
or std::optional
<K-ballo>
C++17 only
<heller>
sure
parsa[w] has quit [Ping timeout: 250 seconds]
<heller>
looks like all compilers we test on rostam do have it
<heller>
what about MSVC?
<K-ballo>
you mean std:: or std::experimental?
<K-ballo>
neither libstdc++ nor libc++ has std::optional pre-17, no idea about MSVC
<hkaiser>
K-ballo: I meant std::experimental
hkaiser has quit [Quit: bye]
<heller>
Isn't either good enough?
<heller>
So it could be just a template alias, if either is there
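A minimal sketch of the alias heller is suggesting; the namespace and the version check are illustrative assumptions, not HPX's actual configuration machinery:

    // Use std::optional when building as C++17 or later, otherwise fall back
    // to std::experimental::optional (assumed to be available on the
    // compilers being discussed here).
    #if __cplusplus >= 201703L
    #  include <optional>
       namespace hpx { namespace util {
           template <typename T> using optional = std::optional<T>;
       }}
    #else
    #  include <experimental/optional>
       namespace hpx { namespace util {
           template <typename T> using optional = std::experimental::optional<T>;
       }}
    #endif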
hkaiser has joined #ste||ar
Bibek has quit [Quit: Leaving]
Bibek has joined #ste||ar
mbremer has quit [Quit: Page closed]
EverYoung has joined #ste||ar
EverYoung has quit [Remote host closed the connection]
EverYoung has joined #ste||ar
nanashi55 has joined #ste||ar
<nanashi55>
Hello. I am trying to find information about the handling of locality failures (especially failure of the AGAS server). I also wonder if hpx supports connecting localities while hpx_main is being executed (elasticity)?
<hkaiser>
nanashi55: what failures do you encounter?
<nanashi55>
I am writing a small paper for university which gives a brief overview of HPX. Is a crash of the AGAS server somehow recoverable? Or does it mean all tasks are aborted?
<jbjnr>
nanashi55: if the root node's AGAS server goes down, then I think it'll be unrecoverable in the current implementation.
<jbjnr>
if another node went down, then in principle things could be recovered - though that would require a lot of extra exception code to be put in place
<hkaiser>
jbjnr: not sure
<jbjnr>
about node 0 or the others?
msimberg has quit [Ping timeout: 260 seconds]
hkaiser has quit [Read error: Connection reset by peer]
<jbjnr>
I'm sure other nodes failing could be handled, but node 0, I am not sure either
hkaiser has joined #ste||ar
<hkaiser>
nanashi55: I think currently any node going down would be the end of it
<hkaiser>
we have not invested any time in making things resilient
<jbjnr>
but in principle a non-root node failure could be handled
<jbjnr>
it's just a case of putting the right exception handlers in place
<hkaiser>
jbjnr: erm, sorry for spoiling your illusions ;)
<hkaiser>
I don't think so
<jbjnr>
it'll be on our list eventually
<hkaiser>
absolutely
<hkaiser>
even more if we run on top of mpi
<hkaiser>
in this case it's game over anyways
<hkaiser>
nanashi55: but we support elasticity
<nanashi55>
Thank you. This helps a lot. I'm looking forward to seeing resilience in HPX
<jbjnr>
I'm assuming we are not falling foul of MPI's problems, I spent too much time in the LF PP for that :)
<hkaiser>
jbjnr: right
<hkaiser>
nanashi55: would you mind showing us your paper once it's published?
<nanashi55>
hkaiser: I tried to start another node while the root node was already calculating and it resulted in a serialization error. So I was not sure. I must have made a mistake
<nanashi55>
hkaiser: I don't think it will be published outside of university. And it will be in German. I can send it to you nevertheless once it's finished
<hkaiser>
nanashi55: I'd be interested in seeing it
<hkaiser>
nanashi55: for adding nodes you need to do something special, see the heartbeat example
<nanashi55>
Okay. I will send it to you then
<hkaiser>
thanks
<jbjnr>
nanashi55: it is possible to add and remove nodes at runtime - if done correctly (i.e. not node failures, but connect and disconnect)
<hkaiser>
also, nodes added after the fact are not part of the distributed AGAS, so disconnecting them does not cause issues
<jbjnr>
aha - I didn't realize there was a distinction - I see now why you would expect a failure on a non-root node then, if it was part of the original startup as part of AGAS.
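For the mechanism being discussed, a rough sketch of a locality that connects to (and later cleanly disconnects from) an already running application, loosely modelled on what the heartbeat example shows; the init overload and the --hpx:connect handling are stated from memory, so treat examples/heartbeat as the authoritative reference:

    #include <hpx/hpx.hpp>
    #include <hpx/hpx_init.hpp>

    // Runs on the connecting locality once it has joined the running application.
    int hpx_main()
    {
        // ... look up the running application's components via AGAS and use them ...

        // Leave the application again without shutting it down.
        return hpx::disconnect();
    }

    int main(int argc, char* argv[])
    {
        // Launched with something like --hpx:connect so the runtime attaches
        // to an existing application instead of bootstrapping a new one.
        return hpx::init(argc, argv);
    }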
eschnett has quit [Quit: eschnett]
daissgr has quit [Quit: WeeChat 1.4]
gedaj has quit [Quit: leaving]
eschnett has joined #ste||ar
gedaj has joined #ste||ar
gedaj has quit [Client Quit]
gedaj has joined #ste||ar
gedaj has quit [Client Quit]
gedaj has joined #ste||ar
gedaj has quit [Quit: leaving]
gedaj has joined #ste||ar
gedaj has quit [Quit: leaving]
<nanashi55>
hkaiser: The heartbeat example showed me how to do it. Thanks
gedaj has joined #ste||ar
gedaj has quit [Client Quit]
gedaj has joined #ste||ar
<jakemp>
I'm updating my hpxMP runtime, and I'm having an issue with hpx not stopping. It seems to happen when I use dataflow with an executor. Has the way dataflow uses executors changed?