hkaiser changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/
eschnett has joined #ste||ar
eschnett has quit [Quit: eschnett]
hkaiser has quit [Quit: bye]
jaafar has quit [Ping timeout: 252 seconds]
Yorlik has quit [Read error: Connection reset by peer]
nikunj1997 has quit [Ping timeout: 258 seconds]
nikunj has joined #ste||ar
david_pfander has joined #ste||ar
<simbergm> heller: something up with the gitlab runner? https://gitlab.com/stellar-group/hpx/pipelines/52858122
david_pfander has quit [Quit: david_pfander]
<heller> simbergm: yeah, I shut them off
<heller> Give me a few hours please
<heller> simbergm: I want to replace it with buildbot
<heller> simbergm: stellar-ci.org/buildbot
<jbjnr_> heller: thanks. I will test that. I believe I understand what I was/am doing wrong. I made a simple barrier of my own because I wanted to be certain I didn't have a bug elsewhere, and I had the same lockup with my barrier. The initialization step in my barrier was done wrong - if ranks n=1,2,3,4..... entered the barrier before rank 0 had entered, then they did a count-down step before rank 0 had reset its counter from the last use,
<jbjnr_> so it worked on the first go, but failed on subsequent tries (if rank 0 wasn't first; it worked when rank 0 was first). I have fixed that and now my deadlock has gone. I suspect that with the hpx barrier I was creating the barrier on rank 0 after the other ranks had already created theirs and started counting down, possibly hitting the same problem: rank 0 would reset the count after some had already started - but I have not
<jbjnr_> looked at the hpx barrier code properly yet. I will do so now that I have established that my code is ok. Or at least it appears that the test and the PP are working as expected.
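For reference, a minimal shared-memory sketch of one way to avoid the reset race described above: rather than having rank 0 explicitly reset the counter between uses, a generation number marks each round, so arrivals from the next round can never race with a late reset. This is illustrative only (a local cyclic_barrier class, not the parcelport barrier or HPX's barrier discussed here):

    // Counting barrier that is safe to reuse: the last arrival bumps the
    // generation and clears the count, so there is no separate "reset by
    // rank 0" step that latecomers could overtake.
    #include <condition_variable>
    #include <cstddef>
    #include <mutex>

    class cyclic_barrier
    {
        std::mutex mtx_;
        std::condition_variable cv_;
        std::size_t const count_;     // participants per round
        std::size_t waiting_ = 0;     // arrivals in the current round
        std::size_t generation_ = 0;  // round counter

    public:
        explicit cyclic_barrier(std::size_t count) : count_(count) {}

        void wait()
        {
            std::unique_lock<std::mutex> lk(mtx_);
            std::size_t gen = generation_;
            if (++waiting_ == count_)
            {
                // Last arrival opens the barrier and starts a fresh round.
                ++generation_;
                waiting_ = 0;
                cv_.notify_all();
                return;
            }
            cv_.wait(lk, [&] { return gen != generation_; });
        }
    };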
<heller> ok
<heller> that rank 0 problem is interesting
<jbjnr_> currently the test is running with over 500k messages exchanged randomly between 3 nodes and no fails yet
<heller> ok
<heller> sounds good
<jbjnr_> it looks good too, but until I test on daint with 4k nodes ....
<heller> ;)
<jbjnr_> passed the 1million write test, now starting the 1m read test \o/
<simbergm> heller: uhm, you can have a few months ;) just wondered what's going on with them
<simbergm> the new buildbot at least looks nice!
<heller> thanks, i'll disable gitlab right away
<heller> so we don't have those nasty false positives
<jbjnr_> buildbot looks the same to me, is there a new link?
<simbergm> stellar-ci.org/buildbot
<simbergm> the official one is still the same as before
<jbjnr_> link doesn't show anything for me :(
<heller> the buildbot runs render to this directly on github: https://github.com/sithhell/stellarci-ci/pull/4/checks
<heller> and you can trigger reruns directly from the github UI
<heller> the spack part is the most interesting one, with which we can have a maintainable option to select different dependencies by defining them in the config itself
<heller> spack has a nice feature of build caches, so the dependencies don't need to get rebuilt all the time
<heller> and we don't need to maintain a billion docker images
Yorlik has joined #ste||ar
hkaiser has joined #ste||ar
zao_ has joined #ste||ar
zao has quit [Disconnected by services]
zao_ is now known as zao
david_pfander has joined #ste||ar
<hkaiser> simbergm: yt?
<hkaiser> simbergm: I think #3745 manifests the hangs in the suspend/resume test nicely (https://circleci.com/workflow-run/7f902a50-f036-4521-9e2d-a3a71170167b)
<hkaiser> I tried rerunning the tests 3 or 4 times, but it consistently fails there...
<hkaiser> could be what you need to diagnose things
david_pfander1 has joined #ste||ar
<simbergm> heller: nice, thanks for pointing that out
<simbergm> trying to hunt down the set_thread_state one at the moment
david_pfander1 has quit [Ping timeout: 245 seconds]
<jbjnr_> hkaiser diehlpk_work : I missed the octo call yesterday, but the news today is that I think the libfabric PP is now working well. After the latest tweaks and a quick hack to create a barrier, my tests are passing. I have submitted jobs to daint and will hopefully have some scaling graphs (by tonight/tomorrow, the queue is a bit full) of the parcelport that give me the confidence I need to try octotiger out using it.
<hkaiser> jbjnr_: nice
aserio has joined #ste||ar
hkaiser has quit [Quit: bye]
<diehlpk_work> jbjnr_, Nice
bibek has joined #ste||ar
daissgr has joined #ste||ar
<diehlpk_work> daissgr, Can we have a short phone call
<daissgr> diehlpk_work, Yeah! How about in one hour? My plane only landed a few hours ago (got further stuck in London) so I am just now catching up on things at work.
<diehlpk_work> Sure, just ping me in irc and I will call you
<diehlpk_work> We have to look into the scaling results. For now it is always best to run it on one node
<diehlpk_work> As long as the problem fits into the memory of one machine
<nikunj> I'm working on one of my OS assignments involving concurrency. So I decided to use HPX. Do I only need to copy the boost license header that we use for HPX files or do I need to mention anything else as well?
<jbjnr_> nikunj: coursework?
<nikunj> jbjnr_, yes
<jbjnr_> should be sufficient to just say "I used HPX, distributed under the boost license" and then add a link
<nikunj> I could probably add them later to HPX examples section as well. I'm implementing solutions to basic synchronization issues (ex: dining philosophers)
<nikunj> jbjnr_, ok thanks!
<jbjnr_> oh, you mean contributing code? then add the boost header to any file you want to submit
<jbjnr_> (the licenses are really for people doing commercial work)
<nikunj> jbjnr_, I see
<nikunj> I will add the boost headers anyway
<zao> For using HPX, you just need to honor any requirements set forth in the license file. For contributions, you need to license your stuff in whatever way we want.
<nikunj> if we later decide to add those in the examples section I'll simply create a pull request from master
<nikunj> zao, makes sense
<zao> Note that some educational institutions have restrictions on whether you own your coursework at all.
<zao> (which is quite hecked up, but hey)
<nikunj> zao, ours requires you to use any library available for concurrency
<nikunj> so I naturally decided on HPX
<zao> Brave ;)
* nikunj I'm up for some adventure right now
<jbjnr_> (the main thing about the boost license that differs from other open source licenses is that you don't have to include anything in your distributed binaries)
<nikunj> jbjnr_, ohh
<jbjnr_> other licenses say you have to include the license in software, not just source, but binaries too. You can use hpx in anything and never tell anyone if you don't want to
<jbjnr_> if you copy the source, you must still keep the copyright/license in the source, but apart from that...
<nikunj> jbjnr_, no I'm implementing everything myself except for the features that HPX provides me
<jbjnr_> so you don't have to do anything
<nikunj> jbjnr_, ok. That clears my doubts, thanks!
hkaiser has joined #ste||ar
<daissgr> diehlpk_work, I am back at the keyboard! What software should we use for the call?
<simbergm> heller: yt?
<simbergm> or hkaiser?
<diehlpk_work> daissgr, Was in another call
<diehlpk_work> daissgr, skype or jitsi
<diehlpk_work> landline
<diehlpk_work> daissgr, yet?
<daissgr> diehlpk_work, let's try skype
<daissgr> I want to see whether their new webapp still works under linux
<hkaiser> simbergm: here
<simbergm> hkaiser: great, wanted to ask if you might have some good ideas for this
<hkaiser> for the hang?
<simbergm> so I think the(/a) failure in the thread_test is trying to set the state for a thread that's already gone
<simbergm> yep
<hkaiser> ahh, that one
<hkaiser> no idea :/
<simbergm> specifically set_active_state might end up running after a thread has been deleted
<hkaiser> uhh
daissgr has quit [Ping timeout: 250 seconds]
<simbergm> we don't keep them alive longer than until the scheduler decides to delete them
<hkaiser> yes
daissgr has joined #ste||ar
<hkaiser> simbergm: heller has removed ref-counting from the thread ids at some point
<simbergm> yeah, that's why I was hoping to catch him first...
<simbergm> in the worst case set_active_state has a stale or even a recycled thread_id
aserio has quit [Ping timeout: 252 seconds]
<simbergm> not really sure what to do about
aserio has joined #ste||ar
<simbergm> it
<daissgr> diehlpk_work, Alternatively, what is your skype handle?
<diehlpk_work> see pm
<daissgr> oh I missed that! One second
Yorlik has quit [Ping timeout: 272 seconds]
<nikunj> is hpx::util::format_to thread safe?
<nikunj> or do I have to create my own formatter if I need a thread safe version
<K-ballo> thread safe with respect to what? itself? yeah
<nikunj> I mean if I call it from multiple threads, will it make sure to format correctly?
<nikunj> i.e. every thread produces the desired formatted output
<K-ballo> it does reads on format arguments, writes to the output stream
<simbergm> K-ballo: I think he's wondering if it behaves like std::cout << a << b << c where a, b, c can be interleaved with output from another thread or if they're written all in one go
<nikunj> simbergm, yes that's what I mean
<K-ballo> ok.. that's unspecified
<nikunj> K-ballo, ok
<K-ballo> this reminds me we need to move the util::format definitions back to a source file
<K-ballo> perhaps we can have whichever benchmark that depends on it include the source file directly
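A small sketch of how a caller can sidestep the unspecified interleaving discussed above: format the whole line into a thread-local buffer first, then hand it to the shared stream in a single insertion. This assumes hpx::util::format_to(std::ostream&, fmt, args...) and the <hpx/util/format.hpp> header path, both of which may differ between HPX versions; log_line is a hypothetical helper. A single insertion is still not guaranteed to be atomic, but it avoids splitting one logical line over several insertions:

    #include <hpx/util/format.hpp>   // header path assumed; may vary by HPX version

    #include <iostream>
    #include <sstream>
    #include <string>

    void log_line(int rank, std::string const& msg)
    {
        std::ostringstream buf;      // private to this thread, no sharing here
        hpx::util::format_to(buf, "[{}] {}\n", rank, msg);
        std::cout << buf.str();      // one insertion for the whole line
    }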
<heller> hkaiser: simbergm: yeah, that's a problem
<heller> the question is: how to fix it for such a corner case?
<heller> reintroducing ref counting for all thread_id handles should be fairly easy
<heller> but I think it will have quite an impact on performance in general
<heller> a possible solution might be to handle them as some kind of weak pointer
<heller> I think you only need to increase the ref count in the case where you want to set the state of an active thread
<heller> but that won't solve the problem in general, I guess
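A rough sketch of the weak-pointer idea mentioned above, using plain std::shared_ptr/std::weak_ptr. thread_data, thread_state and set_thread_state here are simplified stand-ins for illustration, not HPX's actual internals: the scheduler holds the only strong reference, ids handed out elsewhere are weak, and a state change pins the thread only for its own duration, failing cleanly if the thread is already gone.

    #include <atomic>
    #include <memory>

    enum class thread_state { pending, active, suspended, terminated };

    struct thread_data
    {
        std::atomic<thread_state> state{thread_state::pending};
    };

    // The scheduler keeps the only strong reference; ids handed out elsewhere
    // are weak, so they can never keep a terminated thread alive by accident.
    using thread_id = std::weak_ptr<thread_data>;

    bool set_thread_state(thread_id const& id, thread_state new_state)
    {
        // Pin the thread only for the duration of the state change.
        if (std::shared_ptr<thread_data> thrd = id.lock())
        {
            thrd->state.store(new_state, std::memory_order_release);
            return true;
        }
        return false;   // thread already deleted: report, don't touch recycled memory
    }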
<simbergm> heller: yeah, it'd be a shame to slow everything down for that case
<simbergm> the suspension tests look like they'll be straightforward to fix
<simbergm> it's just the shared priority scheduler that's timing out
<simbergm> not sure what changed though...
aserio has quit [Ping timeout: 250 seconds]
david_pfander has quit [Quit: david_pfander]
david_pfander1 has joined #ste||ar
<hkaiser> simbergm: on that branch timings might be different
david_pfander1 has quit [Ping timeout: 250 seconds]
<simbergm> hkaiser: I'm surprised it ever worked ;)
<hkaiser> simbergm: that's a good sign ;-)
<diehlpk_work> jbjnr_, The new version of hpx can load the large files
<diehlpk_work> daissgr, and I will run some scaling experiments tomorrow
<diehlpk_work> hkaiser, Have you had time to send the e-mails to Michael and the other person?
aserio has joined #ste||ar
nikunj97 has joined #ste||ar
hkaiser has quit [Quit: bye]
nikunj has quit [Ping timeout: 272 seconds]
aserio has quit [Ping timeout: 250 seconds]
jaafar has joined #ste||ar
<jbjnr_> simbergm: what's wrong with the scheduler - my version here has changed substantially, so it might not be a real problem any more
<jbjnr_> also - what's the corner case that causes set_thread_state to run after a thread has terminated
<jbjnr_> diehlpk_work: great
<jbjnr_> I have a problem though. my work on the parcelport is based on recent master and I have a lot of changes to serialization etc that come from my parcelport RDMA work that is also on this branch.
<jbjnr_> If I do octotiger runs to compare, I'll be working from a much newer master than yours and it would be a lot of work to remove my rdma stuff and backport my changes to be on your octohpx branch
<diehlpk_work> We can switch hpx for octotiger
<diehlpk_work> We do not need this specific branch
<jbjnr_> I guess I'll have to run my branch mpi vs my branch libfabric and not compare directly to yours
<jbjnr_> ^^ok
<diehlpk_work> This branch was only needed for ppc
<diehlpk_work> But for x86 it should be fine
<jbjnr_> that'd be nice. if you tested it on latest master and it runs fine, then we should use that
<jbjnr_> then my stuff will sit nicely on top
<diehlpk_work> Can you compile hpx release with the branch you want to use?
<jbjnr_> shall I overwrite the installed octohpx-tcmalloc version you're using?
<diehlpk_work> yes
<jbjnr_> ok. I'll do it now
<jbjnr_> bbiab
<diehlpk_work> perfect, I will compile octotiger with it and run some tests
<diehlpk_work> We just like to use the same version of hpx for the scaling test tomorrow
<simbergm> jbjnr_: the scheduler problem is that wait_or_add_new returns false when it should be returning true when suspending in the shared priority scheduler (or vice versa, don't remember)
<simbergm> the other schedulers return early if it's time to suspend
<simbergm> and I don't understand how it wasn't a problem before
<simbergm> if wait_or_add_new goes then it won't be a problem anymore, but we'll still have to make sure it's able to suspend worker threads
<simbergm> the other one is when interrupting hpx threads
aserio has joined #ste||ar
<simbergm> if a thread is active and we try to change the state we schedule a new thread to change the state once it's not active anymore
<simbergm> and in the worst case the thread is already gone when the new thread runs
<jbjnr_> simbergm: ok, the wait_or_add_new has been fixed in my branch I think - which test times out?
<simbergm> but now that I think about it a bit more, at least in the case where we try to set the state to pending it doesn't make any sense to schedule another thread
<simbergm> active is better than pending
<jbjnr_> I will run locally on master and on my branch and see if anyhthing changes
<simbergm> various suspension tests
<simbergm> shutdown_suspended_pus for example
<jbjnr_> ok
<jbjnr_> I will try it out
<simbergm> but not necessarily on master
<simbergm> they fail all the time on#3745
<jbjnr_> "if a thread is active and we try to change the state we schedule a new thread to change the state once it's not active anymore" WTF?
<jbjnr_> #3745 - I will look at it
<simbergm> lol
<simbergm> we can't change the state of a thread while it's active so we delay doing that by spawning another thread which will try to set the state later
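For illustration, a self-contained sketch of that deferral: if the target is active, a small retry task is queued to attempt the state change later. retry_queue and the surrounding types are hypothetical scaffolding, not HPX's register_thread machinery; holding only a weak id means the retry detects a thread that has been deleted in the meantime instead of touching recycled memory.

    #include <atomic>
    #include <deque>
    #include <functional>
    #include <memory>

    enum class thread_state { pending, active, suspended, terminated };

    struct thread_data
    {
        std::atomic<thread_state> state{thread_state::pending};
    };

    using thread_id = std::weak_ptr<thread_data>;

    // Stands in for "spawn another thread that sets the state later".
    std::deque<std::function<void()>> retry_queue;

    void set_thread_state(thread_id tid, thread_state new_state)
    {
        std::shared_ptr<thread_data> thrd = tid.lock();
        if (!thrd)
            return;    // thread already deleted: the deferred attempt just drops out

        if (thrd->state.load() == thread_state::active)
        {
            // Cannot change an active thread's state in place: queue a retry
            // that runs once the thread is no longer active.
            retry_queue.push_back(
                [tid, new_state] { set_thread_state(tid, new_state); });
            return;
        }
        thrd->state.store(new_state);
    }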
<jbjnr_> diehlpk_work: /apps/daint/UES/biddisco/gcc/7.3.0/hpx4octotiger-tcmalloc-release now has build updated to master from 10 minutes ago
<jbjnr_> simbergm: and when does a thread state get changed whilst it is running? when does that happen?
<simbergm> probably other cases as well, but interrupting a thread sets the state to pending (to wake it up in case it was suspended)
<jbjnr_> when a thread suspends - it can presumably change its own state, when it is activated, it can be changed - but
<simbergm> it's tests.unit.threads.thread that's been failing pretty regularly recently
<jbjnr_> another thread changing it whilst running - what's the use case ...
<simbergm> quite a few of those state changes are disallowed because they don't make sense
<simbergm> but suspended to pending does make sense
<simbergm> and active to pending is allowed, but I'm thinking that one can be ignored
<simbergm> need to try that out
<jbjnr_> nonsense all of it!
<simbergm> absolutely!
<jbjnr_> suspended to pending should not be a problem
<simbergm> and it's not
<jbjnr_> active to pending? threads can't be suspended from outside the thread - only when they suspend themselves surely
<jbjnr_> anyway, as long as you're onto it ...
<simbergm> jbjnr_: I appreciate you questioning it because it's giving me ideas ;)
<simbergm> but yes, active to pending is weird
<diehlpk_work> jbjnr_, Thanks, I will check if octotiger works
<jbjnr_> all my tests of libfabric on daint so far have passed, but 64 nodes and bigger are all stuck in the queue all day
<jbjnr_> been waiting since lunch time for them :(
<jbjnr_> but all tests on 2,4,8,16,32 nodes passed. doing random messages between all nodes
jaafar has quit [Quit: Konversation terminated!]
jaafar has joined #ste||ar
eschnett has joined #ste||ar
<jbjnr_> diehlpk_work: just fyi - debug tcmalloc build is now updated to master too
<diehlpk_work> Ok, I will let Dominic know
<diehlpk_work> jbjnr_, How can we check our remaining node hours?
<parsa> We've used 119 node hours
hkaiser has joined #ste||ar
<aserio> hkaiser: I have a FLeCSI question for you
<hkaiser> aserio: ok
<diehlpk_work> jbjnr_, the rotating star test works with the new version of hpx
<aserio> hkaiser: see pm
eschnett has quit [Quit: eschnett]
bibek has quit [Quit: Konversation terminated!]
nikunj97 has quit [Quit: Leaving]
<jbjnr_> diehlpk_work: parsa the web thingy is not updated in real time, so use `sbucheck` on daint/ela to check usage
<jbjnr_> * d69: Authorized Daint constraints: gpu
<jbjnr_> DAINT Usage: 120 NODE HOURS (NH) Quota: 9,000 NH 1.3%
<jbjnr_> TAVE Usage: 0 NODE HOURS (NH) Quota: 300 NH 0.0%
<jbjnr_> actually it looks about right, but sbucheck is easy to use on the machine
<jbjnr_> ^easier
<jbjnr_> diehlpk_work: rotating star. great. This is a big relief. any problem with master and we'd be stuffed.
aserio has quit [Quit: aserio]
mbremer has quit [Quit: Leaving.]