hkaiser changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/
jbjnr has quit [Ping timeout: 268 seconds]
eschnett has joined #ste||ar
K-ballo has quit [Quit: K-ballo]
hkaiser has quit [Quit: bye]
eschnett has quit [Quit: eschnett]
daissgr has joined #ste||ar
jbjnr has joined #ste||ar
nikunj97 has joined #ste||ar
nikunj97 has quit [Remote host closed the connection]
nikunj has joined #ste||ar
nikunj has quit [Remote host closed the connection]
daissgr has quit [Ping timeout: 240 seconds]
daissgr has joined #ste||ar
hkaiser has joined #ste||ar
nikunj97 has joined #ste||ar
daissgr has quit [Quit: WeeChat 1.9.1]
<hkaiser> simbergm: yt?
<simbergm> hkaiser: here
<hkaiser> simbergm: what's wrong with the doc builder? any idea?
<hkaiser> does it run oom?
<simbergm> no, it's not that but I don't know what else it is
<simbergm> I've just updated the sphinx packages in the docker image to see if it's just a buggy old version
<hkaiser> everything else is so nicely green nowadays ;-)
<simbergm> if that doesn't help I'll disable some or all of it
<simbergm> it is :D
<simbergm> mostly...
<simbergm> but we're getting there
<hkaiser> things are improving at least
<simbergm> definitely
<hkaiser> and we're seeing beautiful speedup across the board
<simbergm> I never managed to reproduce that docs build failure, either locally or by sshing into circleci
<hkaiser> grrr
<simbergm> could be some dodgy dependencies in the cmake configuration, but I'm out of ideas on how to fix it (besides updating the docker image)
<simbergm> or how to debug it
<simbergm> well, almost out of ideas
<simbergm> I'll try to get a stack trace if it keeps failing
<simbergm> the actual failure is that something is trying to create a directory that already exists, and it elegantly just bombs out on that
<simbergm> hkaiser: seems like there are more timeouts in the parallel algorithms after the last few merges
<hkaiser> are there now?
<hkaiser> have a link?
<simbergm> I don't think it's anything new but we need to have a look at it before the release
<simbergm> I'm cleaning up the sanitizers branch, might contain some fixes for that
<simbergm> I've seen them before, just not this frequently
<hkaiser> ok, I'll try to find the time to look, thanks
<simbergm> ok
<hkaiser> the sanitizer branch might give us some clue
<simbergm> I wouldn't worry yet
<simbergm> but it seems confined to parallel algorithms only so I hope it's not the new executor
<simbergm> or something it's using
<hkaiser> nod
<hkaiser> the latch perhaps
<zao> simbergm: It's interesting how the failed docs build took over 35 minutes while a good one takes about a minute.
<simbergm> zao: sure? the pdf docs take a loooong time, but it's only built on master
<simbergm> the html docs build quickly
<zao> Looking at this one I got a mail about - https://circleci.com/gh/STEllAR-GROUP/hpx/78052
<zao> vs. this one that seems to have gone well - https://circleci.com/gh/STEllAR-GROUP/hpx/78271
<zao> Have we selectively disabled PDF generation or something somehow?
<zao> (I have no idea what we're doing)
<zao> Ah, the second one was on a branch.
<simbergm> correction, pdf docs build only on master and tags
<simbergm> yeah
<simbergm> I've selectively disabled it so that PRs wouldn't take that long
<simbergm> don't care much if the pdf build breaks, as long as the html build works
K-ballo has joined #ste||ar
<simbergm> I'm lucky the pdf works at all...
nikunj97 has quit [Ping timeout: 245 seconds]
<hkaiser> simbergm: Mr. grostig has disappeared again, I think we can disable the pdf builds and do it for releases only
<simbergm> :P
<simbergm> yeah, I'm not going to try too hard
<hkaiser> k
<simbergm> I'm hoping updating the dependencies will do it, but that'll be it
<simbergm> I'll disable it if it doesn't work
<hkaiser> k
<zao> Is it safe to run the three doc flavours at the same time, btw?
<simbergm> zao: it's supposed to be, but clearly something isn't right so maybe not
<simbergm> I don't think sphinx has anything against it in principle though
<zao> Just noticed in my trial run that there's three Sphinx instances running at once, so got curious :)
<hkaiser> jbjnr: yt?
<K-ballo> simbergm: there's little value in dropping C++11 for just C++14, other than perhaps not having to actively identify C++14 patterns
<simbergm> K-ballo: fair enough, you and hkaiser probably have to deal with it the most, so if it's not a big deal for you then we can leave it
<K-ballo> we may not actually need it.. but it is not as big a deal as 11 was or 17 would be
<simbergm> sure
<simbergm> less code is less code though ;)
<K-ballo> yeah.. which code would go?
<simbergm> I'm not sure, maybe nothing
<K-ballo> some lambda bits with move only captures, maybe
<simbergm> yeah
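(For context on the "lambda bits with move only captures": a minimal sketch, not lifted from the HPX code base, of the C++11 workaround that a C++14 init-capture would replace:)

    #include <memory>
    #include <utility>

    int main()
    {
        // C++11: a move-only object cannot be moved into a lambda capture
        // directly; a common workaround is to wrap it in a shared_ptr.
        auto p = std::make_unique<int>(42);
        auto shared = std::make_shared<std::unique_ptr<int>>(std::move(p));
        auto f11 = [shared]() { return **shared; };

        // C++14: an init-capture moves the object straight into the closure.
        auto q = std::make_unique<int>(42);
        auto f14 = [q = std::move(q)]() { return *q; };

        return f11() + f14();
    }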
<simbergm> zao: ah, make -j? it might not be happy with that
<zao> `ninja docs` is automatically -j based on cores and moon phase.
<simbergm> really? uh-oh
<simbergm> that could actually be the problem
<zao> -j N run N jobs in parallel [default=18, derived from CPUs available]
<zao> I ran a -j1 in a container, PDF generation still takes ages.
<simbergm> yeah, it does
<simbergm> but is that default of 18 used with the -j flag or even without it?
<zao> Defaults on.
<zao> 18 is derived from my 16 cores + 2, I reckon.
<zao> If you want a serial build with Ninja you need to -j1
<simbergm> indeed... I'll change that
<simbergm> thanks for pointing that out
<zao> Something I found - Sphinx 1.7.3 has a mention of `#4803: latex: too slow in proportion to number of auto numbered footnotes`; we have like 3.5k footnotes in our PDF.
<zao> Installed the latest 1.8 series in my container to see if it runs faster or not.
<zao> (build_env:ubuntu ships with 1.6.7)
<zao> Ah, didn't notice the PR to use pip; guess the published image hasn't been rebuilt yet.
K-ballo has quit [Ping timeout: 246 seconds]
<simbergm> it's just been rebuilt but I don't think anything has used it yet
<simbergm> I have 1.7.6 locally and it's still pretty slow
<zao> simbergm: I mean, I pulled it from dockerhub today when I started looking at things.
<simbergm> haven't timed it though
<simbergm> ah, ok
<zao> Regarding the compiler version bump, btw: going for a higher GCC fully rules out manylinux1 for Python packaging, which is nice as we don't have to think about that anymore :)
<jbjnr> hkaiser: here
eschnett has joined #ste||ar
eschnett has quit [Client Quit]
K-ballo has joined #ste||ar
<jbjnr> hkaiser: still here
<jbjnr> but going to make a cup of tea since you're not
<simbergm> hkaiser: I'm guessing you don't have any objections against #3785?
<jbjnr> simbergm: interesting that the thread_pool_executors_test actually bombs out on start for me because "hpx.os_threads=all" isn't recognized
<jbjnr> oops
<jbjnr> not true
<jbjnr> the exception is caught and then things go on
<jbjnr> it's a bit sloppy the way we handle that stuff, and very annoying to debug
<hkaiser> jbjnr: how many cores did we run octotiger on daint (per node)?
<jbjnr> 12
<hkaiser> what was the rationale for this?
<jbjnr> nobody said to use a different number
<jbjnr> (to me)
<hkaiser> nod
<hkaiser> this paper is one of the most disorganized ones I've ever had the honor of being part of
<jbjnr> in the paper repo, there is a folder called scripts. I added my big run script launcher to that so you can see all my settings for validation etc.
<hkaiser> sure
<hkaiser> not blaming you
<jbjnr> hkaiser: yes. That is why I am so cross.
<jbjnr> I wasn't involved, remember, until 3 weeks ago
<hkaiser> sure
<hkaiser> jbjnr: but the full system run was done on 5400 nodes, correct?
<jbjnr> yes but we only got one data point on our graphs for that
<hkaiser> sure, np
<hkaiser> that doesn't matter
<hkaiser> thanks
<jbjnr> one libfabric run and maybe one MPI one; I think I killed the MPI one when I knew we were almost out of time
<hkaiser> jbjnr: the graph looks decent enough
<jbjnr> got told off today for using too many nodes
<hkaiser> lol
<hkaiser> any advertisement is good advertisement
<jbjnr> what amazes me is that the only data in this sodding paper is the stuff I got in the last few runs. WTF have you lot been doing for the last 6 months!
<hkaiser> jbjnr: slacking off! ;-)
<jbjnr> I thought you all had a plan
<jbjnr> I will never work with you lot again.
<jbjnr> until next time
<hkaiser> I know
<jbjnr> I think I said it last time too
<jbjnr> I need more backbone
<hkaiser> never happened before - and yet again
<hkaiser> :D
nikunj has joined #ste||ar
<jbjnr> I wrote about thread contention fighting for the network in the results and said it's a future work thing
<hkaiser> jbjnr: yah, saw that
<jbjnr> feel free to add more stuff. I got bored
<hkaiser> fair description of where we are
<hkaiser> nah, more than enough for this paper
<jbjnr> (not worth wasting too much more time on it since we know it's going in the bin anyway)
<hkaiser> jbjnr: not sure about that
<hkaiser> jbjnr: it's interesting that the LF stuff is actually slower than MPI for smaller node counts
<jbjnr> yes. I must look into that and fix it. Might be a 12-core contention problem
<hkaiser> well, the OS had 32 cores to play with
<jbjnr> I maxed out everything to make sure the runs went through
<jbjnr> lots of background work being done at every moment
<jbjnr> to make sure queues didn't overfill
<hkaiser> right
<jbjnr> might have put a strain on the scheduling
<hkaiser> more cores might have helped, but who knows...
nikunj97 has joined #ste||ar
aserio has joined #ste||ar
<jbjnr> hkaiser: are you involved with https://sourceryinstitute.github.io/PAW/
<jbjnr> for SC19 I mean not 18
<hkaiser> I helped with the paper from Max Bremer
<hkaiser> ahh, no
<hkaiser> we might submit a paper there
<jbjnr> ok. I'm on the committee this year
<hkaiser> nice
nikunj has quit [Ping timeout: 244 seconds]
<jbjnr> don't know how they found me, must've been someone like you giving them my name
<hkaiser> apparently I was on the PC last year ;-)
<hkaiser> forgot about that
<jbjnr> lol
<hkaiser> jbjnr: we have that AGAS paper in the pipeline, might be a good place to submit to
nikunj97 has quit [Remote host closed the connection]
<jbjnr> yup. I want to do a parcelport paper too, but I wanted to aim for a better conference. I wonder if there are other things I can write about for that PAW one
nikunj97 has joined #ste||ar
<hkaiser> jbjnr: submit to the MPI conference ;-)
<jbjnr> I'd like to try IPDPS 2020
<hkaiser> nod
<jbjnr> hkaiser: see pm please
K-ballo has quit [Ping timeout: 250 seconds]
hkaiser has quit [Quit: bye]
<jbjnr> simbergm: the fact that the thread test runs with my scheduler, but locks up with others is very interesting
<jbjnr> cos I do all my debuggin with my scheduler :)
<jbjnr> and hence I never saw problems
<jbjnr> aha. fifo is bad, lifo is good
<simbergm> jbjnr: ah yes, forgot about that
<simbergm> lifo is default I guess?
K-ballo has joined #ste||ar
<simbergm> on your scheduler
<jbjnr> because the moodycamel queue only supports FIFO - LIFO won't work any more for back-end thread stealing.
K-ballo has quit [Read error: Connection reset by peer]
<jbjnr> I have a clue on what to look for now
<simbergm> lifo won't work for the staged threads
<simbergm> pending threads still works with either
<jbjnr> yup
<jbjnr> something must have got mixed up in those bits
<jbjnr> bin/thread_pool_executors_test --hpx:queuing=local-priority-fifo
<jbjnr> bin/thread_pool_executors_test --hpx:queuing=local-priority-lifo
<jbjnr> gives me something to use to investigate
<jbjnr> very annoying that ctrl-C doesn't kill hanging hpx jobs and you have to use pkill
<simbergm> just srun --pty bash ;)
K-ballo has joined #ste||ar
<jbjnr> ?
<jbjnr> I don't use srun on my laptop!
<simbergm> hum, why doesn't ctrl-C kill your jobs?
<zao> We're sneakily hogging signals?
<simbergm> fix it
<simbergm> zao: probably
<simbergm> usually it works
<jbjnr> it breaks into the process, we get the stack trace dumped out, then it hangs forever
<jbjnr> maybe not forever, but too long for my patience
<jbjnr> maybe just this test is annoying
<jbjnr> since it hangs anyway
<simbergm> jbjnr: when I was testing it was enough to do just one executor and the test_sync part of it
<simbergm> stick a for loop around it and it'll hang pretty much right away
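(A shell-level variant of the same idea, assuming the test binary lives in bin/ as in the commands above — illustrative only, not the exact reproducer simbergm used:)

    for i in $(seq 1 100); do bin/thread_pool_executors_test --hpx:queuing=local-priority-fifo || break; done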
<diehlpk_work> jbjnr, Which version of tcmalloc are we using?
<jbjnr> diehlpk_work: tcmalloc comes from "gperftools 2.7"
<jbjnr> I'm using a system-installed one rather than one we built
<jbjnr> sorry about my poor typing
<jbjnr> @#$$
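(For the reproducibility notes: HPX picks its allocator at configure time, so a build against a system gperftools 2.7 would be configured roughly like this — the TCMALLOC_ROOT path is illustrative, not the one used for the paper runs:)

    cmake -DHPX_WITH_MALLOC=tcmalloc -DTCMALLOC_ROOT=/usr <path-to-hpx-source>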
<simbergm> do we have an equivalent to execution_policy.on(executor) for launch policies?
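(For reference, a minimal sketch of the execution-policy form being asked about — header paths and namespaces as of roughly this era of HPX; whether launch policies have an equivalent is exactly the open question here:)

    #include <hpx/hpx_main.hpp>
    #include <hpx/include/parallel_executors.hpp>
    #include <hpx/include/parallel_for_each.hpp>

    #include <vector>

    int main()
    {
        std::vector<int> v(100, 1);

        // An execution policy can be rebound to a particular executor via .on():
        hpx::parallel::execution::parallel_executor exec;
        hpx::parallel::for_each(
            hpx::parallel::execution::par.on(exec), v.begin(), v.end(),
            [](int& i) { ++i; });

        return 0;
    }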
<diehlpk_work> jbjnr, Thanks, I'm finishing up the reproducibility part
<jbjnr> diehlpk_work: I added my launch scripts as well so they can be referenced if need be
<diehlpk_work> yes, I saw them. I just have to find a way to reference them without letting the reviewers know who we are. Double blind reviews
<jbjnr> I have no idea what you need, but hopefully it's enough to say that they exist!
<diehlpk_work> No, they want a link to them
<jbjnr> gosh.
<jbjnr> pastebin?
<diehlpk_work> But the link should not let them know who the authors are
<diehlpk_work> Yes, was thinking about pastebin too
eschnett has joined #ste||ar
hkaiser has joined #ste||ar
<jbjnr> simbergm: it's definitely got to be a race condition, but I'm not sure where.
aserio1 has joined #ste||ar
aserio has quit [Ping timeout: 250 seconds]
aserio1 is now known as aserio
<simbergm> jbjnr yep, what else...
<simbergm> parsa: don't know if you already had a look at the project ideas page for GSoD, but if not, have a look
<simbergm> I wrote something there but it'll need some more thought
<parsa> sorry. will look today
aserio has quit [Ping timeout: 250 seconds]
jaafar has joined #ste||ar
jaafar has quit [Client Quit]
aserio has joined #ste||ar
jaafar has joined #ste||ar
aserio1 has joined #ste||ar
<parsa> jbjnr: can you give me the changeset for the HPX you used for the runs in the paper?
jaafar_ has joined #ste||ar
aserio has quit [Ping timeout: 240 seconds]
aserio1 is now known as aserio
jaafar has quit [Ping timeout: 240 seconds]
aserio has quit [Quit: aserio]
jaafar_ is now known as jaafar
daissgr has joined #ste||ar
<jbjnr> parsa: it's the rdma_object branch with the sha we use in the paper
<jbjnr> I have pushed it to stellar/repo - the diff is very large compared to "merge-base master rdma_object"
<jbjnr> yes
<jbjnr> that is correct
<parsa> thanks
<simbergm> diehlpk_work, parsa: do we actually have something to talk about for GSoD today? you guys have been busy the last week with other things
<diehlpk_work> simbergm, let's move it to Friday. I need to apologize, but I did not get much done due to the SC paper
<diehlpk_work> I just added the two wiki pages and answered some of the questions
<simbergm> no need to apologize
<simbergm> friday sounds good
<parsa> yeah Friday works
daissgr1 has joined #ste||ar
hkaiser has quit [Quit: bye]
<daissgr1> are you guys coming to the meeting?
<parsa> daissgr, daissgr1, jbjnr: are you guys on?
<daissgr1> Juhan and I are here
<daissgr1> where are you guys?
mbremer has joined #ste||ar
<mbremer> Is anyone having issues with slurm on the current master? hpx seems to be unable to find the correct number of localities. On Stampede2, with -N 2 -n 2, hpx::get_num_localities() is returning 1. I thought I'd ask before opening an issue.
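(mbremer's script isn't shown in the log; a minimal stand-in that exercises the same call might look like this — just a sketch, not the actual reproducer:)

    #include <hpx/hpx_main.hpp>
    #include <hpx/include/runtime.hpp>

    #include <iostream>

    int main()
    {
        // With two localities launched under SLURM/ibrun this should print 2;
        // the reported symptom is that it prints 1 on current master.
        std::cout << hpx::get_num_localities().get() << std::endl;
        return 0;
    }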
daissgr1 has quit [Quit: WeeChat 1.9.1]
eschnett has quit [Quit: eschnett]
<zao> Do we have any test I can check with on my cluster?
khuck has joined #ste||ar
<heller> hello boys and girls
<heller> mbremer: is TACC still using its ibrun utilities?
<mbremer> heller: yup
<mbremer> I can give you the underlying call if you're interested
<heller> please do
<K-ballo> heller: were you out?
<heller> a little
<heller> took a small break after my defense...
<mbremer> Congrats!
<heller> thanks
<zao> Heh, my old build of HPX (2ae4f48b6bae45d70e49f59c72582fc9369bf85c) doesn't work on multi-node at all.
<zao> Master process listens only on IPv4 (b-cn0233.hpc2n.umu.se:7910), other locality tries to connect to IPv6 (sin6_port=htons(7910), inet_pton(AF_INET6, "2001:6b0:e:4240::240:183"))
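(One thing that might work around the dual-stack mismatch — assuming only the standard --hpx:hpx/--hpx:agas options; the application name and IPv4 addresses are illustrative — is to pin both endpoints explicitly on each locality:)

    ./my_hpx_app --hpx:hpx=192.0.2.183:7910 --hpx:agas=192.0.2.233:7910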
<zao> Has anyone ever run HPX on a dual-stacked cluster? :)
<K-ballo> isn't that your job?
<zao> Also channeling some jbjnr, as the processes don't care about interruption :)
<zao> K-ballo: Still waiting for those juicy LSU checks ;)
<K-ballo> ......they pay in karma
<zao> Can't test much anyway, parallel file system is still angry after all the kernel panics last week :)
<zao> Wish I had time to play around with HPX, got some soul-crushing amounts of other work to do.
Vir has quit [Ping timeout: 264 seconds]
Vir has joined #ste||ar
<mbremer> (grrr stampede2 queues)
<mbremer> heller: /opt/intel/compilers_and_libraries_2017.4.196/linux/mpi/intel64/bin/mpiexec.hydra -np 2 -machinefile /home1/02578/bremer/.slurm/job.3268183.hostlist.OX0BttkQ /work/02578/bremer/stampede2/hpx_mwe/build_fork/hello_world --hpx:threads=2
<mbremer> And the output is 1 (two times, i.e. I suspect each node is running a copy of the executable)
<heller> mbremer: alright, my guess is that you haven't compiled with the MPI parcelport
<mbremer> Will nuke dirs and retry
<K-ballo> heller: drop C++11 support, yey or meh?
<heller> K-ballo: you mean allow C++14 features unconditionally?
<heller> yay
<jbjnr> yay from me too please (though I know you don't care :( )
<heller> jbjnr: why?
<heller> jbjnr: btw, congrats on those nice LF results!
<mbremer> heller: Issue still exists; networking on, parcelport mpi is on. I'm going to go back a few commits, because this was working a few days ago for me.
<heller> mbremer: wait a second
<heller> mbremer: could you please run with --hpx:list-parcel-ports?
<zao> Oddly enough my -N1 -n2 doesn't start properly... *shakes cluster*
<mbremer> *******************************************************************************
<mbremer> locality: 0
<mbremer> 1
<mbremer> *******************************************************************************
<mbremer> locality: 0
<mbremer> 1
<heller> is that all?
<heller> also: please use some paste site
<mbremer> so that's my ./hello_world --hpx:threads=2 --hpx:list-parcel-ports (yeah sorry about that)
<heller> inside a ibrun command?
<mbremer> Yup; "c506-134[skx](1023)$ ibrun ./hello_world --hpx:threads=2 --hpx:list-parcel-ports" that's verbatim
<heller> and the above is all you got?
<mbremer> Yeah, modulo TACC: Starting up, etc. stuff
<heller> so no hello world output or anything?
<mbremer> Sorry, it's not a hello world. (I'm just abusing the script; all it does is print the number of localities (in this case 1); I'll send you the script)
<heller> ok
<heller> makes sense
<heller> it looks like you somehow have no parcelports compiled in
<mbremer> Hmm weird. So I was building with this hpx https://github.com/STEllAR-GROUP/hpx/pull/3773 and today I wanted to update to the newest master, and that's when I started having this issue
<mbremer> Is there a particular library for which I should see if MPI is being linked in?
<heller> not really, if CMake doesn't find an MPI installation, it will complain
<heller> well
<heller> check libhpx.so
<heller> or libhpxd.so
<heller> can you show me your cmake command?
<mbremer> The MPI libraries are definitely in the linker line for libhpxd.so
<mbremer> This commit behaves as expected: bb4dec2777e70cdddb2f52948c306150831ff7c1, i.e. get_num_localities().get() returns 2 for -N 1 -n 2
<heller> alright, will have a look
<mbremer> Sweet! Thanks, that's a pretty recent commit too. I hope it's not hard to track down. Let me know if there's anything I can do
<heller> could you try reverting that and test again?
<mbremer> On it, working on it
<mbremer> So running git revert with that hash seems to do the trick as well.
<mbremer> I need to bounce, but I'll be back tomorrow.
khuck_ has joined #ste||ar
khuck has quit [Read error: Connection reset by peer]
hkaiser has joined #ste||ar
khuck_ has quit []
eschnett has joined #ste||ar