aserio changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/
EverYoun_ has quit [Remote host closed the connection]
Bibek has quit [Remote host closed the connection]
Bibek has joined #ste||ar
parsa has quit [Quit: Zzzzzzzzzzzz]
eschnett has quit [Quit: eschnett]
diehlpk has joined #ste||ar
EverYoung has joined #ste||ar
EverYoung has quit [Ping timeout: 255 seconds]
diehlpk has quit [Ping timeout: 268 seconds]
EverYoung has joined #ste||ar
hkaiser has quit [Quit: bye]
K-ballo has quit [Quit: K-ballo]
diehlpk has joined #ste||ar
EverYoung has quit [Remote host closed the connection]
diehlpk has quit [Ping timeout: 248 seconds]
david_pfander has joined #ste||ar
david_pfander has quit [Ping timeout: 248 seconds]
david_pfander has joined #ste||ar
david_pfander1 has joined #ste||ar
david_pfander has quit [Read error: Connection reset by peer]
david_pfander1 is now known as david_pfander
david_pfander has quit [Ping timeout: 240 seconds]
david_pfander has joined #ste||ar
parsa has joined #ste||ar
parsa has quit [Read error: Connection reset by peer]
parsa| has joined #ste||ar
parsa| has quit [Read error: Connection reset by peer]
parsa has joined #ste||ar
david_pfander has quit [Ping timeout: 248 seconds]
parsa has quit [Quit: Zzzzzzzzzzzz]
parsa has joined #ste||ar
jbjnr has joined #ste||ar
parsa has quit [Quit: Zzzzzzzzzzzz]
<msimberg> heller: did you already have more changes for the throttle test?
<msimberg> was going to apply the patch from yesterday so you don't have to make a pr...
<heller> msimberg: ok, I am testing stuff at the moment...
<heller> msimberg: it looks like the exit decisions are sometimes too relaxed, and sometimes too strict
<heller> so if I make the this_thread_executor work, other applications start to hang
<msimberg> okay, what was the fix for that?
<msimberg> so it really seems like it could use some option to say how strict it should be
<msimberg> btw, thanks for the inspect link
<msimberg> do I need special permissions on circleci to find those myself?
<heller> no
<heller> you just need to be logged in
<msimberg> ok, that's it then
<heller> msimberg: the fix seems to be that we should check how many threads are still available when doing wait_or_add_new, and not base our decision on whether there is more work to add or not
<msimberg> so threads available != more work to add? what do you mean exactly by threads available?
<heller> the value in thread_map_count_
<heller> there could be situations where we have suspended threads that still need to run on the scheduler
<heller> (this is what happens with the timed executor tests)
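A minimal sketch of the exit condition heller describes, for readers following along; thread_map_count_ and wait_or_add_new are names from the chat and the HPX scheduler, but the helper below is an assumption, not the actual scheduling_loop code:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: only let the scheduling loop exit when the thread map
// is empty, not merely when wait_or_add_new() found no new work to add --
// suspended threads (e.g. timed tasks) still sit in the thread map and must
// get another chance to run.
bool may_exit_scheduling_loop(std::int64_t thread_map_count, bool added_new_work)
{
    return !added_new_work && thread_map_count == 0;
}

int main()
{
    assert(!may_exit_scheduling_loop(1, false));  // a suspended thread remains: keep going
    assert(may_exit_scheduling_loop(0, false));   // nothing mapped, nothing added: may exit
}
```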
<msimberg> mmh, okay, but then that breaks the throttle test again (when you remove yourself) :)
<msimberg> so more or less like it was before, right?
<heller> I have a patch now, i think
<msimberg> ok, nice
<heller> testing right now...
<heller> I'll file a PR against your branch soon then
<msimberg> ok
<heller> something is still very odd though...
<heller> look at the output here...
<heller> thread 1 appears three times in the output
<heller> but I guess that's a separate issue
<msimberg> yes, I think I've seen this on master as well
<msimberg> maybe it needs an issue so we don't forget about it...
<heller> we should really use that more often... should get a nice hint on where to look for improvements
<msimberg> yeah, watched the videos and read that yesterday, seems very useful :)
<heller> yup
<heller> just wanted to refresh the cache line ;)
<msimberg> yeah, doesn't hurt :)
jaafar_ has quit [Ping timeout: 246 seconds]
<github> [hpx] msimberg opened pull request #3012: Silence warning about casting away qualifiers in itt_notify.hpp (master...silence-const-cast) https://git.io/vFyau
<msimberg> heller: umm, meminfo should never be printed unless asked for, right? https://gist.github.com/msimberg/0620c9c1b8b23fa8224cc717bdcbdd48
<msimberg> or does something trigger it automatically?
<jbjnr> (I worry that you are spending time fixing things like this_thread_executor when really these executors should not exist).
<msimberg> jbjnr: valid concern, but I'm not really trying to fix it as I hope heller will have a fix soon ;)
<msimberg> but I guess your worry applies to him as well
<msimberg> besides, I'm forced to read new parts of the codebase which is at least not a bad thing...
<jbjnr> here's a task for you then - find out why we actually need some of these executors and see if their functionality can be derived from the other executors that are used in the code base
<msimberg> ooh, sure, I'll try
<heller> yay, I wrote a python script to compute the USL graphs based on measurement points now ;)
<heller> hope to get some nice insights with this
<msimberg> btw, I think in this case it's not really a problem in the executors but the scheduling_loop termination, it just happens to appear in those tests
<msimberg> heller: nice
<heller> in the meantime, I have no idea what's going on with those dreaded tests :/
<msimberg> shame
<msimberg> so I tried to look around a bit more, and suspended threads only stay in the thread map, right?
<msimberg> not in any of the other queues
<jbjnr> my point is that the scheduling loop and its termination criteria become more complex as they must support other execution models that might not actually be used any more and are just relics of some fudgery introduced years ago to support certain os_thread execution stuff that might be better served by some other approach.
<heller> msimberg: correct
<heller> jbjnr: I share your point
<msimberg> jbjnr: fair point
<jbjnr> heller: thanks for your support
<heller> jbjnr: however, what I am trying to do is make everything that's there work
<heller> jbjnr: cleaning everything up is, I guess, a very HUGE task
<jbjnr> correct
<msimberg> and heller to continue, it's the timed tasks that are the problem in this case, no? because they would be suspended but not in any queues? so this would probably appear on any normal executor/scheduler/whatever
<heller> msimberg: correct.
<msimberg> I guess we can't check if a thread has been stolen at the moment?
<heller> msimberg: you are fast ;)
<heller> why would that be important?
<msimberg> well, for the throttle test
<heller> jbjnr: on a related note, I think our current design for task scheduling etc. is overly complex, we carry along a lot of technical depth
<msimberg> the checks were relaxed because of the case when a thread has been stolen but is in the thread map
<msimberg> i'm imagining that maybe the check should be thread_map_count_not_stolen == 0 or something like this
<msimberg> but it gets complicated as well
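And a tiny sketch of the check msimberg imagines here; thread_map_count_not_stolen is purely hypothetical, nothing like it exists in the scheduler:

```cpp
#include <cassert>
#include <cstdint>

// Purely hypothetical: split the thread-map count into stolen and not-stolen
// threads, and only block shutdown on the threads this scheduler still owns.
struct thread_counts
{
    std::int64_t thread_map_count;         // everything registered with this scheduler
    std::int64_t thread_map_count_stolen;  // registered here, but stolen by another PU
};

bool may_exit(thread_counts const& c)
{
    std::int64_t thread_map_count_not_stolen =
        c.thread_map_count - c.thread_map_count_stolen;
    return thread_map_count_not_stolen == 0;
}

int main()
{
    assert(may_exit({2, 2}));   // all remaining threads were stolen: exiting is fine
    assert(!may_exit({2, 1}));  // one thread still belongs here: keep running
}
```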
<jbjnr> "we carry along a lot of technical depth" ? not sure what you mean here
<heller> jbjnr: I am almost at a point in saying that it is quicker to start from scratch with a clear design on what we want to support. We currently have several mechanisms for doing essentially the same thing, which we have to carry on because nobody actually cleans up the code base
<heller> jbjnr: sorry, not depth, debt
<jbjnr> I agreee
<jbjnr> I am starting on this by throwing away all the schedulers - except my new one that I have rewritten over the last could of days whilst daint was down :)
<jbjnr> ^couple of days
<jbjnr> hmmm. my new chat client does not show smileys :(
<msimberg> something has to be done for the release though, even if it's in 6 months I guess cleaning up the schedulers and executors might be tight?
<msimberg> properly, that is
<jbjnr> schedulers is easy
<jbjnr> rm -rf *
<msimberg> :)
<jbjnr> and remove a few CMake options
<jbjnr> most of them are not used anywhere.
<msimberg> okay executors then
<jbjnr> trickier - hence my issue that nobody read
<jbjnr> half of them are undocumented and unexplained - must wait till hartmut returns from SC
<msimberg> I'm reading it! but I can't do much about it yet...
<msimberg> (finding out about executors now) so all the local_priority_queue_executor, local_queue_executor etc. are for the case when you realize you need a different scheduler but are too lazy to create another pool with that scheduler?
<msimberg> if most of the schedulers go then I guess at least those executors can (have to) go as well
<msimberg> is it just convenience?
<heller> I would look into the executors proposal
<heller> I think that's the biggest issue
<heller> that we don't have clear line between execution agents, execution context and executors etc
<heller> If that would have been there, and properly applied to the code base, everything would be simpler
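To make the terminology concrete, here is a rough illustrative sketch of the separation heller means, in the spirit of the executors proposal; none of these types are HPX or standard APIs:

```cpp
#include <thread>
#include <utility>
#include <vector>

// Illustrative only: the execution context owns the execution resources
// (plain std::threads stand in for a thread pool here)...
struct execution_context
{
    std::vector<std::thread> agents;   // each std::thread plays an "execution agent"

    ~execution_context()
    {
        for (auto& t : agents)
            t.join();
    }
};

// ...while the executor is just a cheap handle used to create work on that
// context; one execution agent is created per submitted task.
struct executor
{
    execution_context* ctx;

    template <typename F>
    void execute(F&& f)
    {
        ctx->agents.emplace_back(std::forward<F>(f));
    }
};

int main()
{
    execution_context ctx;
    executor exec{&ctx};
    exec.execute([] { /* work runs on its own execution agent */ });
}
```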
<jbjnr> I don't think msimberg should go there just yet.
<jbjnr> (I only want to remove unnecessary execs at the moment)
<msimberg> yeah, and runtime suspension goes before any other (bigger) task for me until that's done
<msimberg> but the throttling stuff is relevant for that... which is relevant to executors...
<github> [hpx] StellarBot pushed 1 new commit to gh-pages: https://git.io/vFyMX
<github> hpx/gh-pages 7461267 StellarBot: Updating docs
<msimberg> what's the idea with the throttling scheduler being replaced by some rp functionality (#2876)?
<msimberg> ah, just use remove/add_processing_unit from user code? more or less...
<jbjnr> more or less
<jbjnr> ideally the rp would have an api for expanding/shrinking pools etc
<jbjnr> and I guess some aspects of suspend/resume too
<msimberg> which it does (I was wondering what those were for...), but I guess it's completely untested
<jbjnr> correct
<jbjnr> you're the expert now :)
<msimberg> yay, I thought it would take 10000 hours though...
<jbjnr> did you start counting?
<msimberg> uhm, no... maybe I should start, and then I can get an expert certificate once I'm at 10000
<jbjnr> if you are bored, here's a podcast on the subject http://freakonomics.com/podcast/peak/
<jbjnr> there was a followup to it too - can't remember the title though
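Going back to the #2876 discussion above, this is roughly the shape of "remove/add_processing_unit from user code"; the pool type and member names below are placeholders taken from the chat, not the real HPX resource-partitioner API:

```cpp
#include <cstddef>
#include <iostream>
#include <set>

// Placeholder stand-in for a thread pool obtained from the resource
// partitioner -- NOT the real HPX interface, just the shape of the idea:
// user code shrinks and grows its pool instead of using a throttling scheduler.
struct my_thread_pool
{
    std::set<std::size_t> active_pus{0, 1, 2, 3};

    void remove_processing_unit(std::size_t pu) { active_pus.erase(pu); }
    void add_processing_unit(std::size_t pu) { active_pus.insert(pu); }
};

int main()
{
    my_thread_pool pool;
    pool.remove_processing_unit(3);   // "throttle": stop scheduling on PU 3
    std::cout << "PUs in use: " << pool.active_pus.size() << "\n";  // prints 3
    pool.add_processing_unit(3);      // expand the pool again
    std::cout << "PUs in use: " << pool.active_pus.size() << "\n";  // prints 4
}
```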
<jbjnr> heller: what can you tell me about min_tasks_to_steal_pending + min_tasks_to_steal_staged
<jbjnr> I'm troubled, because I thought they were used in the scheduler to control stealing, but I see them being used inside the thread_queue itself
<jbjnr> ok. ignore my question
hkaiser has joined #ste||ar
<heller> hkaiser: saw that, yeah
<heller> hkaiser: pretty cool!
<hkaiser> indeed
<heller> now we need to convince them to incorporate HPX futures and forget about OpenMP and TBB ;)
<hkaiser> let's do that!
<heller> didn't we have some guys from muenster in here a while back?
<hkaiser> shrug
<hkaiser> did we?
<heller> hkaiser: approach them, you are at the same conference as they are ;)
<hkaiser> you're closer to them ;)
<heller> i vaguely remember, might be wrong
<hkaiser> I'm home already
<heller> they presented at SC :P
<heller> ahh
<heller> too bad
<heller> how was the conference?
<hkaiser> tutorial went well, so did the BoF
<heller> hkaiser: btw, I am playing with USL again ;)
<hkaiser> everything else is a blur ;)
<hkaiser> what's USL?
<heller> universal scalability law
<hkaiser> ahh, good
<heller> want to incorporate it for my results section...
<hkaiser> add a 3rd dimension - grainsize
<heller> i want to relate alpha and beta to idle rate
<heller> but we'll see
<hkaiser> it needs a 3rd dimension
<hkaiser> gives me a 404
<heller> hmm
<heller> grain size is already accounted for
<hkaiser> in the USL?
<hkaiser> ok, last time I looked it wasn't
<heller> can you access that?
<heller> from what I understand, grain size is part of the model
<hkaiser> nope
<heller> hmm
<hkaiser> ok, then I misunderstand the model ;)
<hkaiser> which is perfectly possible
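For context, the USL heller and hkaiser are discussing is usually written as follows (standard form, nothing HPX-specific); the basic model has no explicit grain-size term, so grain size can only enter through the fitted parameters:

```latex
% Universal Scalability Law: relative capacity/throughput at N workers,
% with contention parameter \alpha and coherency-delay parameter \beta.
C(N) = \frac{N}{1 + \alpha (N - 1) + \beta N (N - 1)}
```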
K-ballo has joined #ste||ar
<hkaiser> K-ballo: same semantics as co_await?
<K-ballo> not sure, just started reading
<hkaiser> interesting
hkaiser has quit [Quit: bye]
<K-ballo> heller: what's holding the component factory removals?
<heller> K-ballo: a review
hkaiser has joined #ste||ar
eschnett has joined #ste||ar
<github> [hpx] hkaiser pushed 1 new commit to master: https://git.io/vFSVS
<github> hpx/master 42a588b Hartmut Kaiser: Merge pull request #3004 from STEllAR-GROUP/future_data_void-warn...
<github> [hpx] hkaiser deleted optional at fcc888b: https://git.io/vFSrr
jaafar_ has joined #ste||ar
parsa has joined #ste||ar
parsa| has joined #ste||ar
parsa has quit [Ping timeout: 260 seconds]
<heller> hkaiser: can we get back to merging PRs only once they are green and reviewed?
parsa| has quit [Quit: Zzzzzzzzzzzz]
<heller> hkaiser: I've been thinking about your dataflow issue... Could it also be a use after move? You should try the clang_tidy branch which fixes some of those
<K-ballo> what's holding the clang tidy cleanup?
<heller> There's this one use after move resolution in for_loop which hkaiser doesn't like. He wanted to look into it
<K-ballo> mmh
<K-ballo> captured by the lambda, whether it is after move or before move is unspecified
<K-ballo> but it is a forward_as_tuple, it is a tuple of references
<K-ballo> but the move was a copy anyhow
<heller> K-ballo: right
<heller> Wait, the move was a copy?
<heller> How so?
<K-ballo> the target is a tuple of references of the same type
<K-ballo> it picks the defaulted copy constructor
<heller> Ok
<K-ballo> or the defaulted move constructor, which does the same
<K-ballo> there's some special wording for defaulted ops and rvalue ref members
<heller> TBH, I don't get exactly what happened in the original code
<K-ballo> the point is it isn't actually moving anything
<heller> So it's safe to revert it to the original version and mark it as false positive?
<K-ballo> both approaches seem equivalent on a cursory look
<K-ballo> it could have potentially mattered if the tuple ops weren't `=default`ed, as it was in the early days
<K-ballo> it is possible the move was necessary back then, in order for the code to compile
<K-ballo> rvalue references as members are weird
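A small self-contained example of what K-ballo is describing, assuming the for_loop code captures a forward_as_tuple of lvalue arguments (the string here is only an illustration):

```cpp
#include <cassert>
#include <string>
#include <tuple>
#include <utility>

int main()
{
    std::string s = "hello";

    // forward_as_tuple over an lvalue yields std::tuple<std::string&>.
    std::tuple<std::string&> refs = std::forward_as_tuple(s);

    // "Moving" a tuple of lvalue references picks the defaulted move (or copy)
    // constructor, which simply copies the reference -- nothing is moved from.
    auto moved = std::move(refs);

    // The referred-to object is untouched, so there is no real use-after-move.
    assert(s == "hello");
    assert(&std::get<0>(moved) == &s);
}
```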
<hkaiser> heller: sure, I just merge things once they are sitting for more than 10 days without review
<hkaiser> also, the tests were green
parsa has joined #ste||ar
<jbjnr> what is the run_helper thread/task for
<hkaiser> jbjnr: it's a special thread used to properly synchronize shutdown
<jbjnr> where is it supposed to execute? on any hpx thread or somewhere special?
<jbjnr> it uses id=-1 - so anywhere ....
<jbjnr> but it should not do that
<hkaiser> it's a kernel-thread
<hkaiser> it's mostly sleeping
<hkaiser> no sorry
<hkaiser> mixing up things
<jbjnr> register-thread triggers pool-create-thread which is an hpx task
<hkaiser> jbjnr: that is the thread that eventually executes hpx_main
<jbjnr> hmm. thanks
<jbjnr> the thread id is messing up my stuff
<jbjnr> I don't like having to add special cases - can I launch it on thread 0 instead of thread -1
<hkaiser> sure
<jbjnr> that should not make any difference
<hkaiser> doesn't make a difference
<jbjnr> ta
<jbjnr> I'll check it doesn't cause harm
<hkaiser> sure, it won't
parsa has quit [Quit: Zzzzzzzzzzzz]
parsa has joined #ste||ar
parsa has quit [Client Quit]
<heller> hkaiser: optional was sitting there for 6 days, it turned green today ;)
<hkaiser> so you're following what's going on - nice
<heller> i sure do
<hkaiser> it was a low-risk patch - I would like to get rid of the experimental optional
mbremer has joined #ste||ar
<mbremer> @hkaiser yt?
<heller> the reason I didn't review it was mainly that I think there are other patches that should have had a higher priority; not submitting a review is my way of keeping the lower-priority ones at lower priority... looks like this technique is not effective
<jbjnr> it is a suboptimal strategy
<heller> sure, we have different priorities and should apply that to what's important to us, I guess
jakemp has joined #ste||ar
<heller> merging without notice after a given period without review is suboptimal as well, I think
<heller> hkaiser: what's wrong about experimental::optional?
parsa[[w]] has joined #ste||ar
<heller> or std::optional
<K-ballo> C++17 only
<heller> sure
parsa[w] has quit [Ping timeout: 250 seconds]
<heller> looks like all compilers we test on rostam do have it
<heller> what about MSVC?
<K-ballo> you mean std:: or std::experimental?
<K-ballo> neither libstdc++ nor libc++ have std::optional pre 17, no idea about msvc
<hkaiser> K-ballo: I meant std::experimental
hkaiser has quit [Quit: bye]
<heller> Isn't either good enough?
<heller> So it could be just a template alias, if either is there
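A rough sketch of the alias heller suggests; the namespace and the exact feature checks are made up for illustration, this is not the actual HPX header:

```cpp
// Hypothetical fallback chain: prefer std::optional, then
// std::experimental::optional, otherwise boost::optional.
#if __cplusplus >= 201703L && __has_include(<optional>)
#  include <optional>
namespace compat { template <typename T> using optional = std::optional<T>; }
#elif __has_include(<experimental/optional>)
#  include <experimental/optional>
namespace compat { template <typename T> using optional = std::experimental::optional<T>; }
#else
#  include <boost/optional.hpp>
namespace compat { template <typename T> using optional = boost::optional<T>; }
#endif

#include <cassert>

int main()
{
    compat::optional<int> o;
    assert(!o);
    o = 42;
    assert(o && *o == 42);
}
```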
hkaiser has joined #ste||ar
Bibek has quit [Quit: Leaving]
Bibek has joined #ste||ar
mbremer has quit [Quit: Page closed]
EverYoung has joined #ste||ar
EverYoung has quit [Remote host closed the connection]
EverYoung has joined #ste||ar
nanashi55 has joined #ste||ar
<nanashi55> Hello. I am trying to find information about the handling of locality failures (especially failure of the AGAS server). I also wonder if hpx supports connecting localities while hpx_main is being executed (elasticity)?
<hkaiser> nanashi55: what failures do you encounter?
<nanashi55> I am writing a short paper for university which gives a brief overview of hpx. Is a crash of the agas server somehow recoverable? Or does it mean all tasks are aborted?
<jbjnr> nanashi55: if the root node agas server goes down, then I think it'll be unrecoverable in the current implementation.
<jbjnr> if another node went down, then in principle things could be recovered - though that would require a lot of extra exception code to be put in place
<hkaiser> jbjnr: not sure
<jbjnr> about node 0 or the others?
msimberg has quit [Ping timeout: 260 seconds]
hkaiser has quit [Read error: Connection reset by peer]
<jbjnr> I'm sure other nodes failing could be handled, but node 0, I am not sure either
hkaiser has joined #ste||ar
<hkaiser> nanashi55: I think currently any node going down would be the end of it
<hkaiser> we have not invested any time in making things resilient
<jbjnr> but in principle a non root node fail could be handled
<jbjnr> it's just a case of putting the right exception handlers in place
<hkaiser> jbjnr: erm, sorry for spoiling your illusions ;)
<hkaiser> I don't think so
<jbjnr> it'll be on our list eventually
<hkaiser> absolutely
<hkaiser> even more if we run on top of mpi
<hkaiser> in this case it's game over anyways
<hkaiser> nanashi55: but we support elasticity
<nanashi55> Thank you. This helps a lot. I'm looking forward to see resilience in hpx
<jbjnr> I'm assuming we are not falling on mpi's problems, I spent too much time in the LF PP for that :)
<hkaiser> jbjnr: right
<hkaiser> nanashi55: would you mind showing us your paper once it's published?
<nanashi55> hkaiser: I tried to start another node while the root node was already calculating and it resulted in a serialization error. So I was not sure. I must have made a mistake
<nanashi55> hkaiser: I don't think it will be published outside of university. And it will be in German. I can send it to you nevertheless once it's finished
<hkaiser> nanashi55: I'd be interested in seeing it
<hkaiser> nanashi55: for adding nodes you need to do something special, see the heartbeat example
<nanashi55> Okay. I will send it to you then
<hkaiser> thanks
<jbjnr> nanashi55: it is possible to add and remove nodes during runtime - if done correctly (i.e. not node failures, but connect and disconnect)
<hkaiser> also nodes added after the fact are not part of the distributed agas, so disconnecting them does not cause issues
<jbjnr> aha - I didn't realize there was a distinction - I see now why you would expect a failure when a non-root node fails, then - if it was part of the original startup and thus part of agas.
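Very loosely, the heartbeat-style late connect hkaiser points at looks something like this; treat the init overload and config strings as assumptions and check the actual heartbeat example in the HPX repository for the authoritative version:

```cpp
#include <hpx/hpx_init.hpp>

#include <string>
#include <vector>

int hpx_main(int argc, char* argv[])
{
    // ... interact with the already-running application via AGAS / its components ...
    hpx::disconnect();   // leave again without shutting down the other localities
    return 0;
}

int main(int argc, char* argv[])
{
    // runtime_mode_connect attaches this process to an existing application
    // instead of bootstrapping a new one (roughly what --hpx:connect does on
    // the command line).
    std::vector<std::string> const cfg = {"hpx.run_hpx_main!=1"};
    return hpx::init(argc, argv, cfg, hpx::runtime_mode_connect);
}
```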
eschnett has quit [Quit: eschnett]
daissgr has quit [Quit: WeeChat 1.4]
gedaj has quit [Quit: leaving]
eschnett has joined #ste||ar
gedaj has joined #ste||ar
gedaj has quit [Client Quit]
gedaj has joined #ste||ar
gedaj has quit [Client Quit]
gedaj has joined #ste||ar
gedaj has quit [Quit: leaving]
gedaj has joined #ste||ar
gedaj has quit [Quit: leaving]
<nanashi55> hkaiser: The heartbeat example showed me how to do it. Thanks
gedaj has joined #ste||ar
gedaj has quit [Client Quit]
gedaj has joined #ste||ar
<jakemp> I'm updating my hpxMP runtime, and I'm having an issue with hpx not stopping. It seems to happen when I use dataflow with an executor. Has the way dataflow uses executors changed?
hkaiser has quit [Quit: bye]