hkaiser changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/
eschnett has joined #ste||ar
eschnett has quit [Quit: eschnett]
hkaiser has quit [Quit: bye]
jaafar has quit [Ping timeout: 252 seconds]
Yorlik has quit [Read error: Connection reset by peer]
nikunj1997 has quit [Ping timeout: 258 seconds]
nikunj has joined #ste||ar
david_pfander has joined #ste||ar
<simbergm> heller: something up with the gitlab runner? https://gitlab.com/stellar-group/hpx/pipelines/52858122
david_pfander has quit [Quit: david_pfander]
<heller> simbergm: yeah, I shut them off
<heller> Give me a few hours please
<heller> simbergm: I want to replace it with buildbot
<heller> simbergm: stellar-ci.org/buildbot
<jbjnr_> heller: thanks. I will test that. I believe I understand what I was/am doing wrong. I made a simple barrier of my own because I wanted to be certain I didn't have a bug elsewhere, and I had the same lockup with my barrier. The initialization step in my barrier was done wrong - if ranks n=1,2,3,4..... entered the barrier before rank 0 had entered, then they did a count-down step before rank 0 had reset its counter from the last use,
<jbjnr_> so it worked on the first go, but failed on subsequent tries (if rank 0 wasn't first; it worked when rank 0 was first). I have fixed that and now my deadlock has gone. I suspect that with the hpx barrier I was creating the barrier on rank 0 after the other ranks had already created theirs and started counting down, possibly hitting the same problem: rank 0 would reset the count after some had already started - but I have not
<jbjnr_> looked at the hpx barrier code properly yet. I will do so now that I have established that my code is ok. Or at least it appears that the test and the PP are working as expected.
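For reference, a minimal shared-memory sketch of one way to avoid the reset race described above: rather than having rank 0 explicitly reset the counter between uses, a generation number marks each round, so arrivals from the next round can never race with a late reset. This is illustrative only (a local cyclic_barrier class, not the parcelport barrier or HPX's barrier discussed here):

    // Counting barrier that is safe to reuse: the last arrival bumps the
    // generation and clears the count, so there is no separate "reset by
    // rank 0" step that latecomers could overtake.
    #include <condition_variable>
    #include <cstddef>
    #include <mutex>

    class cyclic_barrier
    {
        std::mutex mtx_;
        std::condition_variable cv_;
        std::size_t const count_;     // participants per round
        std::size_t waiting_ = 0;     // arrivals in the current round
        std::size_t generation_ = 0;  // round counter

    public:
        explicit cyclic_barrier(std::size_t count) : count_(count) {}

        void wait()
        {
            std::unique_lock<std::mutex> lk(mtx_);
            std::size_t gen = generation_;
            if (++waiting_ == count_)
            {
                // Last arrival opens the barrier and starts a fresh round.
                ++generation_;
                waiting_ = 0;
                cv_.notify_all();
                return;
            }
            cv_.wait(lk, [&] { return gen != generation_; });
        }
    };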
<heller> ok
<heller> that rank 0 problem is interesting
<jbjnr_> currently the test is running with over 500k messages exchanged randomly between 3 nodes and no fails yet
<heller> ok
<heller> sounds good
<jbjnr_> it looks good too, but until I test on daint with 4k nodes ....
<heller> ;)
<jbjnr_> passed the 1million write test, now starting the 1m read test \o/
<simbergm> heller: uhm, you can have a few months ;) just wondered what's going on with them
<simbergm> the new buildbot at least looks nice!
<heller> thanks, i'll disable gitlab right away
<heller> so we don't have those nasty false positives
<jbjnr_> buildbot looks the same to me, is there a new link?
<simbergm> stellar-ci.org/buildbot
<simbergm> the official one is still the same as before
<jbjnr_> link doesn't show anything for me :(
<heller> the buildbot runs render to this directly on github: https://github.com/sithhell/stellarci-ci/pull/4/checks
<heller> and you can trigger reruns directly from the github UI
<heller> the spack part is the most interesting one, with which we can have a maintainable option to select different dependencies by defining them in the config itself
<heller> spack has a nice feature of build caches, so the dependencies don't need to get rebuilt all the time
<heller> and we don't need to maintain a billion docker images
Yorlik has joined #ste||ar
hkaiser has joined #ste||ar
zao_ has joined #ste||ar
zao has quit [Disconnected by services]
zao_ is now known as zao
david_pfander has joined #ste||ar
<hkaiser> simbergm: yt?
<hkaiser> simbergm: I think #3745 manifests the hangs in the suspend/resume test nicely (https://circleci.com/workflow-run/7f902a50-f036-4521-9e2d-a3a71170167b)
<hkaiser> I tried rerunning the tests 3 or 4 times, but it consistently fails there...
<hkaiser> could be what you need to diagnose things
david_pfander1 has joined #ste||ar
<simbergm> heller: nice, thanks for pointing that out
<simbergm> trying to hunt down the set_thread_state one at the moment
david_pfander1 has quit [Ping timeout: 245 seconds]
<jbjnr_> hkaiser diehlpk_work : I missed the octo call yesterday, but the news today is that I think the libfabric PP is now working well. After the latest tweaks and a quick hack to create a barrier, my tests are passing. I have submitted jobs to daint and will hopefully have some scaling graphs (by tonight/tomorrow, the queue is a bit full) of the parcelport that give me the confidence I need to try octotiger out using it.
<hkaiser> jbjnr_: nice
aserio has joined #ste||ar
hkaiser has quit [Quit: bye]
<diehlpk_work> jbjnr_, Nice
bibek has joined #ste||ar
daissgr has joined #ste||ar
<diehlpk_work> daissgr, Can we have a short phone call
<daissgr> diehlpk_work, Yeah! How about in one hour? My plane only landed a few hours ago (got further stuck in London) so I am just now catching up on things at work.
<diehlpk_work> Sure, just ping me in irc and I will call you
<diehlpk_work> We have to look into the scaling results. For now it is always best to run it on one node
<diehlpk_work> As long as the problem fits into the memory of one machine
<nikunj> I'm working on one of my OS assignments involving concurrency. So I decided to use HPX. Do I only need to copy the boost license header that we use for HPX files or do I need to mention anything else as well?
<jbjnr_> nikunj: coursework?
<nikunj> jbjnr_, yes
<jbjnr_> should be sufficient to just say "I used HPX, distributed under the boost license" and then add a link
<nikunj> I could probably add them later to HPX examples section as well. I'm implementing solutions to basic synchronization issues (ex: dining philosophers)
<nikunj> jbjnr_, ok thanks!
<jbjnr_> oh, you mean contributing code? then add the boost header to any file you want to submit
<jbjnr_> (the licenses are really for people doing commercial work)
<nikunj> jbjnr_, I see
<nikunj> I will add the boost headers anyway
<zao> For using HPX, you just need to honor any requirements set forth in the license file. For contributions, you need to license your stuff in whatever way we want.
<nikunj> if we later decide to add those in the examples section I'll simply create a pull request from master
<nikunj> zao, makes sense
<zao> Note that some educational institutions have restrictions on whether you own your coursework at all.
<zao> (which is quite hecked up, but hey)
<nikunj> zao, ours requires you to use any library available for concurrency
<nikunj> so I naturally decided on HPX
<zao> Brave ;)
* nikunj I'm up for some adventure right now
<jbjnr_> (the main thing about the boost license that differs from other open source licenses is that you don't have to include anything in your distributed binaries)
<nikunj> jbjnr_, ohh
<jbjnr_> other licenses say you have to include the license in software, not just source, but binaries too. You can use hpx in anything and never tell anyone if you don't want to
<jbjnr_> if you copy the source, you must still keep the copyright/license in the source, but apart from that...
<nikunj> jbjnr_, no I'm implementing everything myself except for the features that HPX provides me
<jbjnr_> so you don't have to do anything
<nikunj> jbjnr_, ok. That clears my doubts, thanks!
hkaiser has joined #ste||ar
<daissgr> diehlpk_work, I am back at the keyboard! What software should we use for the call?
<simbergm> heller: yt?
<simbergm> or hkaiser?
<diehlpk_work> daissgr, Was in another call
<diehlpk_work> daissgr, skype or jitsi
<diehlpk_work> landline
<diehlpk_work> daissgr, yet?
<daissgr> diehlpk_work, let's try skype
<daissgr> I want to see whether their new webapp still works under linux
<hkaiser> simbergm: here
<simbergm> hkaiser: great, wanted to ask if you might have some good ideas for this
<hkaiser> for the hang?
<simbergm> so I think the(/a) failure in the thread_test is trying to set the state for a thread that's already gone
<simbergm> yep
<hkaiser> ahh, that one
<hkaiser> no idea :/
<simbergm> specifically set_active_state might end up running after a thread has been deleted
<hkaiser> uhh
daissgr has quit [Ping timeout: 250 seconds]
<simbergm> we don't keep them alive longer than until the scheduler decides to delete them
<hkaiser> yes
daissgr has joined #ste||ar
<hkaiser> simbergm: heller has removed ref-counting from the thread ids at some point
<simbergm> yeah, that's why I was hoping to catch him first...
<simbergm> in the worst case set_active_state has a stale or even a recycled thread_id
aserio has quit [Ping timeout: 252 seconds]
<simbergm> not really sure what to do about
aserio has joined #ste||ar
<simbergm> it
<daissgr> diehlpk_work, Alternatively, what is your skype handle?
<diehlpk_work> see pm
<daissgr> oh I missed that! One second
Yorlik has quit [Ping timeout: 272 seconds]
<nikunj> is hpx::util::format_to thread safe?
<nikunj> or do I have to create my own formatter if I need a thread safe version
<K-ballo> thread safe with respect to what? itself? yeah
<nikunj> I mean if I call it from multiple threads, will it make sure to format correctly?
<nikunj> i.e. every thread produces the desired formatted output
<K-ballo> it does reads on format arguments, writes to the output stream
<simbergm> K-ballo: I think he's wondering if it behaves like std::cout << a << b << c where a, b, c can be interleaved with output from another thread or if they're written all in one go
<nikunj> simbergm, yes that's what I mean
<K-ballo> ok.. that's unspecified
<nikunj> K-ballo, ok
<K-ballo> this reminds me we need to move the util::format definitions back to a source file
<K-ballo> perhaps we can have whichever benchmark that depends on it include the source file directly
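A small sketch of how a caller can sidestep the unspecified interleaving discussed above: format the whole line into a thread-local buffer first, then hand it to the shared stream in a single insertion. This assumes hpx::util::format_to(std::ostream&, fmt, args...) and the <hpx/util/format.hpp> header path, both of which may differ between HPX versions; log_line is a hypothetical helper. A single insertion is still not guaranteed to be atomic, but it avoids splitting one logical line over several insertions:

    #include <hpx/util/format.hpp>   // header path assumed; may vary by HPX version

    #include <iostream>
    #include <sstream>
    #include <string>

    void log_line(int rank, std::string const& msg)
    {
        std::ostringstream buf;      // private to this thread, no sharing here
        hpx::util::format_to(buf, "[{}] {}\n", rank, msg);
        std::cout << buf.str();      // one insertion for the whole line
    }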
<heller> hkaiser: simbergm: yeah, that's a problem
<heller> the question is: how to fix it for such a corner case?
<heller> reintroducing ref counting for all thread_id handles should be fairly easy
<heller> but I think it will have quite an impact on performance in general
<heller> a possible solution might be to handle them as some kind of weak pointer
<heller> I think you only need to increase the ref count in the case where you want to set the state of an active thread
<heller> but that won't solve the problem in general, I guess
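A rough sketch of the weak-pointer idea mentioned above, using plain std::shared_ptr/std::weak_ptr. thread_data, thread_state and set_thread_state here are simplified stand-ins for illustration, not HPX's actual internals: the scheduler holds the only strong reference, ids handed out elsewhere are weak, and a state change pins the thread only for its own duration, failing cleanly if the thread is already gone.

    #include <atomic>
    #include <memory>

    enum class thread_state { pending, active, suspended, terminated };

    struct thread_data
    {
        std::atomic<thread_state> state{thread_state::pending};
    };

    // The scheduler keeps the only strong reference; ids handed out elsewhere
    // are weak, so they can never keep a terminated thread alive by accident.
    using thread_id = std::weak_ptr<thread_data>;

    bool set_thread_state(thread_id const& id, thread_state new_state)
    {
        // Pin the thread only for the duration of the state change.
        if (std::shared_ptr<thread_data> thrd = id.lock())
        {
            thrd->state.store(new_state, std::memory_order_release);
            return true;
        }
        return false;   // thread already deleted: report, don't touch recycled memory
    }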
<simbergm> heller: yeah, it'd be a shame to slow everything down for that case
<simbergm> the suspension tests look like they'll be straightforward to fix
<simbergm> it's just the shared priority scheduler that's timing out
<simbergm> not sure what changed though...
aserio has quit [Ping timeout: 250 seconds]
david_pfander has quit [Quit: david_pfander]
david_pfander1 has joined #ste||ar
<hkaiser> simbergm: on that branch timings might be different
david_pfander1 has quit [Ping timeout: 250 seconds]
<simbergm> hkaiser: I'm surprised it ever worked ;)
<hkaiser> simbergm: that's a good sign ;-)
<diehlpk_work> jbjnr_, The new version of hpx can load the large files
<diehlpk_work> daissgr, and I will run some scaling experiments tomorrow
<diehlpk_work> hkaiser, Have you had time to send the e-mails to Michael and the other person?
aserio has joined #ste||ar
nikunj97 has joined #ste||ar
hkaiser has quit [Quit: bye]
nikunj has quit [Ping timeout: 272 seconds]
aserio has quit [Ping timeout: 250 seconds]
jaafar has joined #ste||ar
<jbjnr_> simbergm: what's wrong with the scheduler - my version here has changed substantially, so it might not be a real problem any more
<jbjnr_> also - what's the corner case that causes set_thread_state to run after a thread has terminated
<jbjnr_> diehlpk_work: great
<jbjnr_> I have a problem though. my work on the parcelport is based on recent master and I have a lot of changes to serialization etc that come from my parcelport RDMA work that is also on this branch.
<jbjnr_> If I do octotiger runs to compare, I'll be working from a much newer master than yours and it would be a lot of work to remove my rdma stuff and backport my changes to be on your octohpx branch
<diehlpk_work> We can switch hpx for octotiger
<diehlpk_work> We do not need this specific branch
<jbjnr_> I guess I'll have to run my branch mpi vs my branch libfabric and not compare directly to yours
<jbjnr_> ^^ok
<diehlpk_work> This branch was only needed for ppc
<diehlpk_work> But for x86 it should be fine
<jbjnr_> that'd be nice. if you tested it on latest master and it runs fine, then we should use that
<jbjnr_> then my stuff will sit nicely on top
<diehlpk_work> Can you compile hpx release with the branch you want to use?
<jbjnr_> shall I overwrite the installed octohpx-tcmalloc version you're using?
<diehlpk_work> yes
<jbjnr_> ok. I'll do it now
<jbjnr_> bbiab
<diehlpk_work> perfect, I will compile octotiger with it and run some tests
<diehlpk_work> We just like to use the same version of hpx for the scaling test tomorrow
<simbergm> jbjnr_: the scheduler problem is that wait_or_add_new returns false when it should be returning true when suspending in the shared priority scheduler (or vice versa, don't remember)
<simbergm> the other schedulers return early if it's time to suspend
<simbergm> and I don't understand how it wasn't a problem before
<simbergm> if wait_or_add_new goes then it won't be a problem anymore, but we'll still have to make sure it's able to suspend worker threads
<simbergm> the other one is when interrupting hpx threads
aserio has joined #ste||ar
<simbergm> if a thread is active and we try to change the state we schedule a new thread to change the state once it's not active anymore
<simbergm> and in the worst case the thread is already gone when the new thread runs
<jbjnr_> simbergm: ok, the wait_or_add_new has been fixed in my branch I think - which test times out?
<simbergm> but now that I think about it a bit more, at least in the case where we try to set the state to pending it doesn't make any sense to schedule another thread
<simbergm> active is better than pending
<jbjnr_> I will run locally on master and on my branch and see if anyhthing changes
<simbergm> various suspension tests
<simbergm> shutdown_suspended_pus for example
<jbjnr_> ok
<jbjnr_> I will try it out
<simbergm> but not necessarily on master
<simbergm> they fail all the time on#3745
<jbjnr_> "if a thread is active and we try to change the state we schedule a new thread to change the state once it's not active anymore" WTF?
<jbjnr_> #3745 - I will look at it
<simbergm> lol
<simbergm> we can't change the state of a thread while it's active so we delay doing that by spawning another thread which will try to set the state later
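For illustration, a self-contained sketch of that deferral: if the target is active, a small retry task is queued to attempt the state change later. retry_queue and the surrounding types are hypothetical scaffolding, not HPX's register_thread machinery; holding only a weak id means the retry detects a thread that has been deleted in the meantime instead of touching recycled memory.

    #include <atomic>
    #include <deque>
    #include <functional>
    #include <memory>

    enum class thread_state { pending, active, suspended, terminated };

    struct thread_data
    {
        std::atomic<thread_state> state{thread_state::pending};
    };

    using thread_id = std::weak_ptr<thread_data>;

    // Stands in for "spawn another thread that sets the state later".
    std::deque<std::function<void()>> retry_queue;

    void set_thread_state(thread_id tid, thread_state new_state)
    {
        std::shared_ptr<thread_data> thrd = tid.lock();
        if (!thrd)
            return;    // thread already deleted: the deferred attempt just drops out

        if (thrd->state.load() == thread_state::active)
        {
            // Cannot change an active thread's state in place: queue a retry
            // that runs once the thread is no longer active.
            retry_queue.push_back(
                [tid, new_state] { set_thread_state(tid, new_state); });
            return;
        }
        thrd->state.store(new_state);
    }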
<jbjnr_> diehlpk_work: /apps/daint/UES/biddisco/gcc/7.3.0/hpx4octotiger-tcmalloc-release now has build updated to master from 10 minutes ago
<jbjnr_> simbergm: and when does a thread state get changed whilst it is running? when does that happen?
<simbergm> probably other cases as well, but interrupting a thread sets the state to pending (to wake it up in case it was suspended)
<jbjnr_> when a thread suspends - it can presumably change its own state, when it is activated, it can be changed - but
<simbergm> it's tests.unit.threads.thread that's been failing pretty regularly recently
<jbjnr_> another thread changing it whilst running - what's the use case ...
<simbergm> quite a few of those state changes are disallowed because they don't make sense
<simbergm> but suspended to pending does make sense
<simbergm> and active to pending is allowed, but I'm thinking that one can be ignored
<simbergm> need to try that out
<jbjnr_> nonsense all of it!
<simbergm> absolutely!
<jbjnr_> suspended to pending should not be a problem
<simbergm> and it's not
<jbjnr_> active to pending? threads can't be suspended from outside the thread - only when they suspend themselves surely
<jbjnr_> anyway, as long as you're onto it ...
<simbergm> jbjnr_: I appreciate you questioning it because it's giving me ideas ;)
<simbergm> but yes, active to pending is weird
<diehlpk_work> jbjnr_, Thanks, I will check if octotiger works
<jbjnr_> all my tests of libfabric on daint so far have passed, but 64 nodes and bigger are all stuck in the queue all day
<jbjnr_> been waiting since lunch time for them :(
<jbjnr_> but all tests on 2,4,8,16,32 nodes passed. doing random messages between all nodes
jaafar has quit [Quit: Konversation terminated!]
jaafar has joined #ste||ar
eschnett has joined #ste||ar
<jbjnr_> diehlpk_work: just fyi - debug tcmalloc build is now updated to master too
<diehlpk_work> Ok, I will let Dominic know
<diehlpk_work> jbjnr_, How can we check our remaining node hours?
<parsa> We've used 119 node hours
hkaiser has joined #ste||ar
<aserio> hkaiser: I have a FLeCSI question for you
<hkaiser> aserio: ok
<diehlpk_work> jbjnr_, the rotating star test works with the new version of hpx
<aserio> hkaiser: see pm
eschnett has quit [Quit: eschnett]
bibek has quit [Quit: Konversation terminated!]
nikunj97 has quit [Quit: Leaving]
<jbjnr_> diehlpk_work: parsa the web thingy is not updated in real time, so use `sbucheck` on daint/ela to check usage
<jbjnr_> * d69: Authorized Daint constraints: gpu
<jbjnr_> DAINT Usage: 120 NODE HOURS (NH) Quota: 9,000 NH 1.3%
<jbjnr_> TAVE Usage: 0 NODE HOURS (NH) Quota: 300 NH 0.0%
<jbjnr_> actually it looks about right, but sbucheck is easy to use on the machine
<jbjnr_> ^easier
<jbjnr_> diehlpk_work: rotating star. great. This is a big relief. any problem with master and we'd be stuffed.
aserio has quit [Quit: aserio]
mbremer has quit [Quit: Leaving.]