hkaiser changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/
jbjnr has quit [Ping timeout: 268 seconds]
eschnett has joined #ste||ar
K-ballo has quit [Quit: K-ballo]
hkaiser has quit [Quit: bye]
eschnett has quit [Quit: eschnett]
daissgr has joined #ste||ar
jbjnr has joined #ste||ar
nikunj97 has joined #ste||ar
nikunj97 has quit [Remote host closed the connection]
nikunj has joined #ste||ar
nikunj has quit [Remote host closed the connection]
daissgr has quit [Ping timeout: 240 seconds]
daissgr has joined #ste||ar
hkaiser has joined #ste||ar
nikunj97 has joined #ste||ar
daissgr has quit [Quit: WeeChat 1.9.1]
<hkaiser>
simbergm: yt?
<simbergm>
hkaiser: here
<hkaiser>
simbergm: what's wrong with the doc builder? any idea?
<hkaiser>
does it run oom?
<simbergm>
no, it's not that but I don't know what else it is
<simbergm>
I've just updated the sphinx packages in the docker image to see if it's just a buggy old version
<hkaiser>
everything else is so nicely green nowadays ;-)
<simbergm>
if that doesn't help I'll disable some or all of it
<simbergm>
it is :D
<simbergm>
mostly...
<simbergm>
but we're getting there
<hkaiser>
things are improving at least
<simbergm>
definitely
<hkaiser>
and we're seeing beautiful speedup across the board
<simbergm>
I never managed to reproduce that docs build failure either locally or by sshing into CircleCI
<hkaiser>
grrr
<simbergm>
could be some dodgy dependencies in the cmake configuration, but I'm out of ideas on how to fix it (besides updating the docker image)
<simbergm>
or how to debug it
<simbergm>
well, almost out of ideas
<simbergm>
I'll try to get a stack trace if it keeps failing
<simbergm>
the actual failure is that something is trying to create a directory that already exists, and it elegantly just bombs out on that
<simbergm>
hkaiser: seems like there are more timeouts in the parallel algorithms after the last few merges
<hkaiser>
are there now?
<hkaiser>
have a link?
<simbergm>
I don't think it's anything new but we need to have a look at it before the release
<simbergm>
I'm cleaning up the sanitizers branch, might contain some fixes for that
<zao>
Have we selectively disabled PDF generation or something somehow?
<zao>
(I have no idea what we're doing)
<zao>
Ah, the second one was on a branch.
<simbergm>
correction, pdf docs build only on master and tags
<simbergm>
yeah
<simbergm>
I've selectively disabled it so that PRs wouldn't take that long
<simbergm>
don't care much if the pdf build breaks, as long as the html build works
K-ballo has joined #ste||ar
<simbergm>
I'm lucky the pdf works at all...
nikunj97 has quit [Ping timeout: 245 seconds]
<hkaiser>
simbergm: Mr. grostig has disappeared again, I think we can disable the pdf builds and do it for releases only
<simbergm>
:P
<simbergm>
yeah, I'm not going to try too hard
<hkaiser>
k
<simbergm>
I'm hoping updating the dependencies will do it, but that'll be it
<simbergm>
I'll disable it if it doesn't work
<hkaiser>
k
<zao>
Is it safe to run the three doc flavours at the same time, btw?
<simbergm>
zao: it's supposed to be, but clearly something isn't right so maybe not
<simbergm>
I don't think sphinx has anything against it in principle though
<zao>
Just noticed in my trial run that there are three Sphinx instances running at once, so I got curious :)
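The "directory already exists" failure mentioned above is what you would expect if several builders race through a check-then-create step at once: both see the directory missing, one creates it, the other then dies on the existing path. A minimal C++ sketch of a tolerant variant, purely illustrative (the real code path here is in Sphinx/CMake, not HPX):

```cpp
#include <filesystem>
#include <iostream>

namespace fs = std::filesystem;

// Treat "already exists" as success instead of bombing out on it.
bool make_dir_tolerant(fs::path const& p)
{
    std::error_code ec;
    fs::create_directories(p, ec);      // does not report an error if p exists
    return !ec || fs::is_directory(p);
}

int main()
{
    // hypothetical output path, just for illustration
    std::cout << std::boolalpha << make_dir_tolerant("_build/html") << '\n';
}
```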
<hkaiser>
jbjnr: yt?
<K-ballo>
simbergm: there's little value in dropping 11 for just 14, other than perhaps not having to be actively identifying 14 patterns
<simbergm>
K-ballo: fair enough, you and hkaiser probably have to deal with it the most, so if it's not a big deal for you then we can leave it
<K-ballo>
we may not actually need it.. but it is not as big a deal as 11 was or 17 would be
<simbergm>
sure
<simbergm>
less code is less code though ;)
<K-ballo>
yeah.. which code would go?
<simbergm>
I'm not sure, maybe nothing
<K-ballo>
some lambda bits with move only captures, maybe
<simbergm>
yeah
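For context on the "lambda bits with move only captures" mentioned above: C++14 init-captures let a closure own a move-only object directly, where C++11 needs std::bind or a hand-written function object. A minimal sketch, not actual HPX code:

```cpp
#include <memory>
#include <utility>

int main()
{
    auto p = std::make_unique<int>(42);

    // C++14 generalized (init-) capture: the unique_ptr is moved into the
    // closure. In C++11 this requires std::bind or a custom function object.
    auto f = [q = std::move(p)] { return *q; };

    return f() == 42 ? 0 : 1;
}
```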
<simbergm>
zao: ah, make -j? it might not be happy with that
<zao>
`ninja docs` is automatically -j based on cores and moon phase.
<simbergm>
really? uh-oh
<simbergm>
that could actually be the problem
<zao>
-j N run N jobs in parallel [default=18, derived from CPUs available]
<zao>
I ran a -j1 in a container, PDF generation still takes ages.
<simbergm>
yeah, it does
<simbergm>
but is that default 18 with the -j flag or even without
<simbergm>
?
<zao>
Defaults on.
<zao>
18 is derived from my 16 cores + 2, I reckon.
<zao>
If you want a serial build with Ninja you need to -j1
<simbergm>
indeed... I'll change that
<simbergm>
thanks for pointing that out
<zao>
Something I found - Sphinx 1.7.3 has a mention of `#4803: latex: too slow in proportion to number of auto numbered footnotes`, we have like 3.5k footnotes in our PDF.
<zao>
Installed the latest 1.8 series in my container to see if it runs faster or not.
<zao>
(build_env:ubuntu ships with 1.6.7)
<zao>
Ah, didn't notice the PR to use pip, guess that the published image isn't rebuilt yet.
K-ballo has quit [Ping timeout: 246 seconds]
<simbergm>
it's just been rebuilt but I don't think anything has used it yet
<simbergm>
I have 1.7.6 locally and it's still pretty slow
<zao>
simbergm: I mean, I pulled it from dockerhub today when I started looking at things.
<simbergm>
haven't timed it though
<simbergm>
ah, ok
<zao>
Regarding the compiler version bump btw, going for a higher GCC rules out manylinux1 for Python packaging fully, which is nice as we don't have to think about that anymore :)
<jbjnr>
hkaiser: here
eschnett has joined #ste||ar
eschnett has quit [Client Quit]
K-ballo has joined #ste||ar
<jbjnr>
hkaiser: still here
<jbjnr>
but going to make a cup of tea since you're not
<simbergm>
hkaiser: I'm guessing you don't have any objections against #3785?
<jbjnr>
simbergm: interesting that the thread_pool_executors_test actually bombs out on start for me because "hpx.os_threads=all" isn't recognized
<jbjnr>
oops
<jbjnr>
not true
<jbjnr>
the exception is caught and then things go on
<jbjnr>
bit sloppy the way we handle that stuff. and very annoying to debug
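The complaint here is that the failure is swallowed: "hpx.os_threads=all" is rejected, the exception is caught, and the run continues with whatever default applies. A hedged sketch of the alternative, making the parse failure loud instead of silent; this is illustrative only and not how HPX actually parses the setting:

```cpp
#include <stdexcept>
#include <string>
#include <thread>

// Hypothetical helper: accept "all" or a plain number, fail loudly otherwise.
std::size_t parse_os_threads(std::string const& value)
{
    if (value == "all")
        return std::thread::hardware_concurrency();

    try
    {
        return std::stoul(value);
    }
    catch (std::exception const&)
    {
        throw std::invalid_argument(
            "hpx.os_threads: expected a number or \"all\", got \"" + value + "\"");
    }
}
```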
<hkaiser>
jbjnr: how many cores did we run octotiger on daint (per node)?
<jbjnr>
12
<hkaiser>
what was the rationale for this?
<jbjnr>
nobody said to use a different number
<jbjnr>
(to me)
<hkaiser>
nod
<hkaiser>
this paper is one of the best-disorganized ones I've ever had the honor of being part of
<jbjnr>
in the paper repo, there is a folder called scripts. I added my big run script launcher to that so you can see all my settings for validation etc.
<hkaiser>
sure
<hkaiser>
not blaming you
<jbjnr>
hkaiser: yes. That is why I am so cross.
<jbjnr>
I wasn't involved remember until 3 weeks ago
<hkaiser>
sure
<hkaiser>
jbjnr: but the full system run was done on 5400 nodes, correct?
<jbjnr>
yes but we only got one data point on our graphs for that
<hkaiser>
sure, np
<hkaiser>
that doesn't matter
<hkaiser>
thanks
<jbjnr>
one libfabric run and maybe one MPI one; I think I killed the MPI one when I knew we were almost out of time.
<hkaiser>
jbjnr: the graph looks decent enough
<jbjnr>
got told off today for using too many nodes
<hkaiser>
lol
<hkaiser>
any advertisement is good advertisement
<jbjnr>
what amazes me is that the only data in this sodding paper is the stuff I got in the last few runs. WTF have you lot been doing for the last 6 months!
<hkaiser>
jbjnr: slacking off! ;-)
<jbjnr>
I thought you all had a plan
<jbjnr>
I will never work with you lot again.
<jbjnr>
until next time
<hkaiser>
I know
<jbjnr>
I think I said it last time too
<jbjnr>
I need more backbone
<hkaiser>
never happened before - and yet again
<hkaiser>
:D
nikunj has joined #ste||ar
<jbjnr>
I wrote about thread contention fighting for the network in the results and said it's a future work thing
<hkaiser>
jbjnr: yah, saw that
<jbjnr>
feel free to add more stuff. I got bored
<hkaiser>
fair description of where we are
<hkaiser>
nah, more than enough for this paper
<jbjnr>
(not worth wasting too much more time on it since we know it's going in the bin anyway)
<hkaiser>
jbjnr: not sure about that
<hkaiser>
jbjnr: it's interesting that the LF stuff is actually slower than MPI for smaller node counts
<jbjnr>
yes. I must look into that and fix it. Might be a 12 core fighting problem
<hkaiser>
well, the OS had 32 cores to play with
<jbjnr>
I maxed out everything to make sure the runs went through
<jbjnr>
lots of background work being done at every moment
<jbjnr>
to make sure queues didn't overfill
<hkaiser>
right
<jbjnr>
might have put a strain on the scheduling
<hkaiser>
more cores might have helped, but who knows...
<jbjnr>
don't know how they found me, must've been someone like you giving them my name
<hkaiser>
apparently I was on the PC last year ;-)
<hkaiser>
forgot about that
<jbjnr>
lol
<hkaiser>
jbjnr: we have that AGAS paper in the pipeline, might be a good place to submit to
nikunj97 has quit [Remote host closed the connection]
<jbjnr>
yup. I want to do a parcelport paper too, but I wanted to aim for a better conference. I wonder if there are other things I can write about for that PAW one
nikunj97 has joined #ste||ar
<hkaiser>
jbjnr: submit to the MPI conference ;-)
<jbjnr>
I'd like to try IPDPS 2020
<hkaiser>
nod
<jbjnr>
hkaiser: see pm please
K-ballo has quit [Ping timeout: 250 seconds]
hkaiser has quit [Quit: bye]
<jbjnr>
simbergm: the fact that the thread test runs with my scheduler, but locks up with others is very interesting
<jbjnr>
cos I do all my debugging with my scheduler :)
<jbjnr>
and hence I never saw problems
<jbjnr>
aha. fifo is bad, lifo is good
<simbergm>
jbjnr: ah yes, forgot about that
<simbergm>
lifo is default I guess?
K-ballo has joined #ste||ar
<simbergm>
on your scheduler
<jbjnr>
because the moodycamel queue only supports fifo - lifo won't work any more for back-end thread stealing.
K-ballo has quit [Read error: Connection reset by peer]
<jbjnr>
I have a clue on what to look for now
<simbergm>
lifo won't work for the staged threads
<simbergm>
pending threads still works with either
<jbjnr>
yup
<jbjnr>
something must have got mixed up in those bits
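Rough illustration of the ordering difference being discussed, assuming nothing about the actual HPX schedulers: a LIFO pending queue hands back the most recently added task, while a moodycamel-style concurrent queue is strictly FIFO, so anything that implicitly relied on LIFO ordering behaves differently on top of it.

```cpp
#include <deque>
#include <iostream>

int main()
{
    std::deque<int> tasks = {1, 2, 3};   // enqueued in the order 1, 2, 3

    // LIFO (stack-like): pop the most recently added task first.
    int lifo_next = tasks.back();        // 3

    // FIFO (moodycamel-style concurrent queue): dequeue the oldest task.
    int fifo_next = tasks.front();       // 1

    std::cout << "lifo: " << lifo_next << ", fifo: " << fifo_next << '\n';
}
```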
<mbremer>
Is anyone having issues with slurm on the current master? hpx seems to be unable to find the correct number of localities. On Stampede2, with -N 2 -n 2, hpx::get_num_localities() is returning 1. I thought I'd ask before opening an issue.
daissgr1 has quit [Quit: WeeChat 1.9.1]
eschnett has quit [Quit: eschnett]
<zao>
Do we have any test I can check with on my cluster?
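A minimal locality check along the lines of what mbremer describes, assuming the classic hpx_main/hpx::init entry points; headers may need adjusting for the HPX version in use:

```cpp
#include <hpx/hpx_init.hpp>
#include <hpx/hpx.hpp>
#include <iostream>

int hpx_main(int, char**)
{
    // With -N 2 -n 2 under SLURM this should report "of 2" from each rank;
    // printing 1 twice matches the symptom described above.
    std::cout << "locality " << hpx::get_locality_id() << " of "
              << hpx::get_num_localities(hpx::launch::sync) << std::endl;
    return hpx::finalize();
}

int main(int argc, char* argv[])
{
    return hpx::init(argc, argv);
}
```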
khuck has joined #ste||ar
<heller>
hello boys and girls
<heller>
mbremer: is TACC still using its ibrun utilities?
<mbremer>
heller: yup
<mbremer>
I can give you the underlying call if you're interested
<heller>
please do
<K-ballo>
heller: were you out?
<heller>
a little
<heller>
took a small break after my defense...
<mbremer>
Congrats!
<heller>
thanks
<zao>
Heh, my old build of HPX (2ae4f48b6bae45d70e49f59c72582fc9369bf85c) doesn't work on multi-node at all.
<zao>
Master process listens only on IPv4 (b-cn0233.hpc2n.umu.se:7910), other locality tries to connect to IPv6 (sin6_port=htons(7910), inet_pton(AF_INET6, "2001:6b0:e:4240::240:183")
<zao>
Has anyone ever run HPX on a dual-stacked cluster? :)
<K-ballo>
isn't that your job?
<zao>
Also channeling some jbjnr as the processes don't care about interruption :)
<zao>
K-ballo: Still waiting for those juicy LSU checks ;)
<K-ballo>
......they pay in karma
<zao>
Can't test much anyway, parallel file system is still angry after all the kernel panics last week :)
<zao>
Wish I had time to play around with HPX, got some soul-crushing amounts of other work to do.
<mbremer>
And the output is 1 (printed twice, i.e. I suspect each node is running its own copy of the executable)
<heller>
mbremer: alright, my guess is that you haven't compiled with the MPI parcelport
<mbremer>
Will nuke dirs and retry
<K-ballo>
heller: drop C++11 support, yey or meh?
<heller>
K-ballo: you mean allow C++14 features unconditionally?
<heller>
yay
<jbjnr>
yay from me too please (though I know you don't care :( )
<heller>
jbjnr: why?
<heller>
jbjnr: btw, congrats on those nice LF results!
<mbremer>
heller: Issue still exists; networking on, parcelport mpi is on. I'm going to go back a few commits, because this was working a few days ago for me.
<heller>
mbremer: wait a second
<heller>
mbremer: could you please run with --hpx:list-parcel-ports?
<mbremer>
Yeah, modulo TACC: Starting up, etc. stuff
<heller>
so no hello world output or anything?
<mbremer>
Sorry, it's not a hello world. (I'm just abusing the script; all it does is print the number of localities (in this case 1); I'll send you the script)
<heller>
it looks like you somehow have no parcelports compiled in
<mbremer>
Hmm weird. So I was building with this hpx https://github.com/STEllAR-GROUP/hpx/pull/3773 and today I wanted to update to the newest master, and that's when I started having this issue
<mbremer>
Is there a particular library for which I should see if MPI is being linked in?
<heller>
not really, if CMake doesn't find an MPI installation, it will complain
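Besides --hpx:list-parcel-ports, the runtime configuration can also be queried from inside hpx_main; a hedged sketch, assuming hpx::get_config_entry and the "hpx.parcel.bootstrap"/"hpx.parcel.mpi.enable" ini keys are available in the HPX version being used:

```cpp
#include <hpx/hpx_init.hpp>
#include <hpx/hpx.hpp>
#include <iostream>

int hpx_main(int, char**)
{
    // Both keys are assumptions about the ini schema; verify against your build.
    std::cout << "bootstrap parcelport: "
              << hpx::get_config_entry("hpx.parcel.bootstrap", "unknown") << '\n'
              << "mpi parcelport enabled: "
              << hpx::get_config_entry("hpx.parcel.mpi.enable", "0") << std::endl;
    return hpx::finalize();
}

int main(int argc, char* argv[])
{
    return hpx::init(argc, argv);
}
```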