<wash[m]>
Someone tell Jules the Blaze fixes will be in the next major release of CUDA
<wash[m]>
(not the next release, but the one after that)
diehlpk has quit [Remote host closed the connection]
hkaiser has quit [Quit: bye]
<nikunj>
wash[m]: will let him know
lsl88 has quit [Ping timeout: 258 seconds]
lsl88 has joined #ste||ar
<tarzeau>
simbergm: dpkg -L libhpx-dev and dpkg -L libhpx1 show you the files each package installs
nikunj has quit [Remote host closed the connection]
Yorlik has quit [Read error: Connection reset by peer]
<simbergm>
heller: works quite nicely now
<simbergm>
tarzeau: so e.g. libboost-program-options1.67 installs /.../lib/libboost_program_options.so.1.67 but we expect libboost_program_options.so to be there when linking a program with hpx
<simbergm>
libboost-program-options-dev gives the .so
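[editor's note] The point above (the runtime package ships only the versioned libboost_program_options.so.1.67, while linking needs the plain .so from the -dev package) can be checked mechanically. What follows is a minimal sketch, not anything from the HPX or Debian tooling: it assumes you have collected the file lists that `dpkg -L` prints for the installed packages, and the function name and sample paths are hypothetical.

```python
import re

# Matches shared-library file names such as "libfoo.so" or "libfoo.so.1.67".
# Group 1 is the unversioned name, group 2 the optional version suffix.
SO_RE = re.compile(r"^(lib.+?\.so)(\.[0-9.]+)?$")

def missing_dev_symlinks(paths):
    """Return unversioned .so names that exist only in versioned form.

    `paths` is any iterable of file paths, e.g. the combined output of
    `dpkg -L <package>` for the packages that are installed.
    """
    versioned = set()
    unversioned = set()
    for path in paths:
        name = path.rsplit("/", 1)[-1]
        match = SO_RE.match(name)
        if not match:
            continue
        if match.group(2):
            versioned.add(match.group(1))
        else:
            unversioned.add(match.group(1))
    return sorted(versioned - unversioned)

# Hypothetical file list: the runtime package is installed, the -dev one is not.
sample_paths = [
    "/usr/lib/x86_64-linux-gnu/libboost_program_options.so.1.67",
    "/usr/lib/x86_64-linux-gnu/libjemalloc.so.2",
    "/usr/lib/x86_64-linux-gnu/libjemalloc.so",
    "/usr/share/doc/libboost-program-options1.67/copyright",
]
print(missing_dev_symlinks(sample_paths))
```

Any name it reports is a library the linker would not find via -l<name> until the matching -dev package is installed.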
<tarzeau>
is that the only one missing?
<tarzeau>
added, if there's no others, i'll rebuild the src/bin packages and that's it
<simbergm>
tarzeau: rest of the boost dev packages and jemalloc are also missing
<simbergm>
almost everything from build depends is needed in depends for libhpx-dev
<tarzeau>
i see, ok i'll add them as well
<tarzeau>
libhwloc-dev and libjemalloc-dev added
<tarzeau>
anything i forgot? mpi-default-dev ? libpapi-dev? libboost-filesystem-dev ?
<simbergm>
they're probably all needed
<tarzeau>
ok adding them all
<tarzeau>
rebuilding
<simbergm>
for a minimal hello world libboost-X-dev and libjemalloc-dev are enough, but the rest will most likely be needed
<simbergm>
tarzeau: thanks!
<zao>
tarzeau: Thanks by the way for the tip about eatmydata, we're probably deploying it for our FAI runs of puppet after the vacation.
<tarzeau>
zao: i'll be happy about hearing speed improvement :)
<tarzeau>
it's just avoiding the waits on sync(2)
<tarzeau>
which dpkg uses heavily
<tarzeau>
so basically one should count sync calls in the sources and somehow weight syncs inside loops higher to figure out the gains. same could be done with malloc in code, but that's difficult if the calls are in libraries. not sure if strace/ltrace output could help automatically figure out the counts and decide whether to use mimalloc/eatmydata on binaries
<tarzeau>
and i guess sync behaves different if not on local filesystems
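[editor's note] The counting idea above could be tried on a plain strace log, for instance one written with `strace -f -o`. The sketch below is only an illustration under stated assumptions: the set of "sync-like" syscall names is my own choice, the sample lines are made up, and a raw count says nothing about how long each call blocks, so it gives at most a rough hint of what eatmydata might save.

```python
import re
from collections import Counter

# Syscalls treated as "sync-like" in this sketch (an assumption, adjust as needed).
SYNC_CALLS = {"sync", "fsync", "fdatasync", "syncfs", "msync", "sync_file_range"}

# Matches the syscall name at the start of an strace line, allowing the
# optional "[pid N] " or "N " prefix that appears when following child processes.
LINE_RE = re.compile(r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?([a-z_0-9]+)\(")

def count_sync_calls(lines):
    """Count sync-like syscalls in an iterable of strace output lines."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match and match.group(1) in SYNC_CALLS:
            counts[match.group(1)] += 1
    return counts

# Made-up sample lines in strace's usual "name(args) = result" shape.
sample = [
    'openat(AT_FDCWD, "/var/lib/dpkg/status", O_RDONLY) = 3',
    "fsync(3) = 0",
    "[pid  4242] fsync(5) = 0",
    "4243 sync() = 0",
    'write(1, "fsync(", 6) = 6',
]
print(count_sync_calls(sample))
```

Only the syscall name at the start of a line is counted, so the text "fsync(" inside the write() argument in the last sample line is ignored.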
<zao>
Considered integrating it in puppet for all runs of apt, but figured it'd be least intrusive to wrap the whole thing but only during node installation.
<zao>
Those either fail or succeed, and it'd be nice to have them faster.
<tarzeau>
we're only using it for node installations so far
<tarzeau>
(for over 2 years now, without problems)
<tarzeau>
how many failing nodes/year?
<tarzeau>
i was wondering if anyone else besides us uses ruptime?
<zao>
A few node incidents per week for the old cluster now.
<zao>
Fans, memory, etc.
<tarzeau>
yeah fans+memory after psu+disks
<zao>
But it's from 2011, so most of the lemons are long gone.
<tarzeau>
are you aware of memtester and memtestcl?
<tarzeau>
i use both to test systemmemory+gpumemory (very successful)
<zao>
Installing a cluster node takes around one hour, would be nice if it went faster.
<zao>
Half an hour on the new cluster with SSDs.
<tarzeau>
our workstation installation is 5000 packages, +32GB /opt, takes about 15-30 minutes
<zao>
We just rely on SEL logs and ECC warnings, with support agreements to replace sticks given enough failures in a time window.
<tarzeau>
(no ssd, real disks because people have large datasets here, some have ssds, but that's maybe 10%)
<zao>
Anything broken ends up in Icinga from healthchecks.
<simbergm>
tarzeau: we do, but what does it give us? :P
<tarzeau>
that's a good question. gbit internet link with public ip? access to papers? (you probably have all of that already)
<mdiers_>
this week i get some configuration errors in git master from HPX_AddPseudoDependencies.cmake during the cmake step with -DHPX_WITH_EXAMPLES=OFF and -DHPX_WITH_TESTS=ON. With -DHPX_WITH_EXAMPLES=OFF and -DHPX_WITH_TESTS=OFF it works fine.
kordejong has joined #ste||ar
kordejong has quit [Ping timeout: 260 seconds]
K-ballo has joined #ste||ar
<simbergm>
mdiers_: can you open an issue please?
<mdiers_>
simbergm: #3953, you can set dependencies to #3879 and #3897 if necessary
hkaiser has joined #ste||ar
<simbergm>
mdiers_: thanks
nikunj has joined #ste||ar
<nikunj>
hkaiser: yt?
<hkaiser>
here
<nikunj>
hkaiser: the package was delivered yesterday
<simbergm>
hkaiser: it seems to happen only for older gccs or boosts again, not sure if that helps at all
<hkaiser>
ok
<hkaiser>
works for me :/
<K-ballo>
1.64 is old? :/
<hkaiser>
ancient ;-)
kordejong has joined #ste||ar
hkaiser has quit [Quit: bye]
kordejong has quit [Ping timeout: 260 seconds]
hkaiser has joined #ste||ar
bibek has quit [Quit: Konversation terminated!]
bibek has joined #ste||ar
<hkaiser>
simbergm: I have possibly fixed the all_to_all problem on the all_reduce branch
nikunj has quit [Remote host closed the connection]
<simbergm>
hkaiser: nice
<simbergm>
I could try reproducing the error and see if it's fixed if you'd like
<simbergm>
have you pushed a branch?
<heller>
simbergm: status?
<simbergm>
heller: got vampir to read the files with a newer otf2
<simbergm>
silly task names of course but it should work for a demo still
<hkaiser>
simbergm: also, I see linker errors for the assert module branch
<hkaiser>
ok to push directly?
<simbergm>
hkaiser: sec
<simbergm>
go ahead
<hkaiser>
thanks
<hkaiser>
need 5 more mins
<simbergm>
np
<simbergm>
hkaiser: should we try to get travis running? it didn't fix it by itself unfortunately...
<hkaiser>
simbergm: yah
<hkaiser>
also, I pushed my changes
<simbergm>
github tells me I have to be an admin on the account that owns the repository to change settings (which makes sense, I shouldn't be an admin for all of STEllAR-GROUP)