hkaiser changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/ | GSoD: https://developers.google.com/season-of-docs/
<lsl88> simbergm: sorry, saw the message late. Yeah! :)
nikunj has quit [Remote host closed the connection]
nikunj has joined #ste||ar
hkaiser has quit [Quit: bye]
K-ballo has quit [Quit: K-ballo]
nikunj has quit [Remote host closed the connection]
nikunj has joined #ste||ar
nikunj has quit [Remote host closed the connection]
<tarzeau> heller, simbergm: i think i activated jemalloc now, added -D for mpi and fortran, and the same url for testing again (also the dep from -dev to lib pkg is there now)
<heller> The missing -dev packages as well?
<tarzeau> heller: yes debian has a meta mpi package, i think it's mpi-default-dev using the /etc/alternatives system
<heller> Yes
<tarzeau> heller: there's just one for now libhpx-dev and libhpx1
<tarzeau> ahhh, the missing boost headers?
<heller> Yeah
<heller> And hwloc, IIRC
<tarzeau> thanks for the reminder :)
<tarzeau> but there's already libhwloc-dev and libblas-dev ?
<tarzeau> why would hpx want to have a copy of them?
<tarzeau> and there's also libopenblas-dev
<heller> What'd be nice is to have a procedure we can follow for generating new packages for newer releases
<tarzeau> take the debian/ directory from my old package, put it in, get the names right (hpx_VERSION.orig.tar.gz) and run debuild
<heller> We don't need a copy, we need libboost-dev and libhwloc-dev as a dependency of libhpx-dev
<tarzeau> oh, and i missed them, let me fix that
<Yorlik> Anyone having experience with spdlog?
<tarzeau> libhwloc-dev is already in
<tarzeau> which blas -dev do i need, libblas-dev or libopenblas-dev?
<tarzeau> does it need a -D for cmake?
<zao> heller: Yeah, Ubuntu LTS is doomed, as it's a compile-time flag that might affect ABI and there's no maintainer, while debian has moved to newer packages that don't pass that flag anymore.
<zao> So there's nothing to directly backport, as is common with Ubuntu.
<zao> #$@#$ toy distro
<tarzeau> 18.04 you mean?
<heller> tarzeau: no blas, boost
<tarzeau> ah crap, rebuilding
<tarzeau> the boost build-dep i already have, removing the blas again
<tarzeau> although we do use ubuntu at work (100-200 machines), i only do the packaging for debian (and ubuntu just copies my stuff, fine for me as long as they don't break it)
<tarzeau> i'm on 19.04 still, but should move to 19.10 alongside debian sid for building/testing things
<heller> [19:34:01] <9c27b0simbergm> and the second is that if I install libhpx-dev and libhpx1 I don't actually get any of the boost or hwloc headers installed
<heller> tarzeau: ^^
<heller> Are you planning to upstream the package?
simbergm has quit [Remote host closed the connection]
rori has joined #ste||ar
<tarzeau> heller: what do you mean by upstreaming the package? i plan to officially maintain it in debian
<tarzeau> heller: i'll add the depends for hwloc-dev and boost-dev then...
<zao> tarzeau: Yeah, 18.04 LTS, the current one :)
<tarzeau> 18.04 is released, it won't get it, but feel free to dget the source .dsc and rebuild it there for use
<zao> I'm just talking about the openmpi messup, no idea what you and the folks here are up to :)
<zao> Ran into it on our site when I accidentally used the OS openmpi.
<tarzeau> heller: if you want to retry and confirm: the dep on libhpx-dev for hwloc+boost is there now
<tarzeau> (only the source package, the binaries are still being built)
<tarzeau> zao: how many machines with ubuntu 18.04 do you run?
<tarzeau> https://readme.phys.ethz.ch/services/linuxws/ i've been running them here for 15 years now
<zao> 800-some compute nodes
<tarzeau> used to be debian, meanwhile it's ubuntu
<tarzeau> including GPU (nvidia/cuda)?
<zao> Most infra is on Ubuntu as well, apart from the ones we need RHEL for support reasons.
<zao> Yeah, GPU nodes run the same OS as everything else.
<tarzeau> we've got all kinds of software i built over the years: https://readme.phys.ethz.ch/linux/software_on_the_d-phys_linux_computers/
<tarzeau> cuda10.0/10.1 i guess? with pycharm and tensorflow mainly?
<zao> We use EasyBuild for our software deployment.
<zao> We have _everything_ installed.
<tarzeau> ouf
<tarzeau> we also have _everything_ installed :)
<tarzeau> but i'm looking into spack, not easybuild
<tarzeau> so people get used/prepared for cluster usage
<zao> We're the primary GPU resource in the country apart from some small pre/post systems, so we get all the lovely deep-learning people.
<tarzeau> the country being?
<zao> Spack's fine too, but doesn't fit our needs.
<zao> Sweden.
<tarzeau> we've got like 40 gpu nodes, one with 8 gpus
simbergm has joined #ste||ar
<tarzeau> hej hej, i've been in sweden, beautiful! power is expensive over there
<zao> One of our selling points when procuring clusters is that it's slightly cheaper here up north.
<tarzeau> wow, 3 TB memory, real memory nodes?
<tarzeau> we've got two with 1.5 TB and 2 TB nvme ssd as swap
<zao> Yeah, 3T of actual memory.
<tarzeau> they use it with what kind of software?
<zao> We've historically had users that were very happy with our previous generation that had 512G, and desired more.
<tarzeau> our users need "form": https://www.nikhef.nl/~form/
<zao> Not sure really, I personally don't keep track of it.
<tarzeau> are you using compressed memory on any of the nodes?
<zao> There's definitely users that can use it all.
<tarzeau> (zram-config is the package in ubuntu, debian got their own for buster+ meanwhile)
<zao> I don't think we run anything like that.
<tarzeau> same here, with form
<tarzeau> and filesystems? we've had xfs for a decade, and moved to btrfs now (partly with compression)
<zao> Local scratch and OS is XFS, homedirs are AFS, project/cluster filesystem is Lustre on (assumedly) ext4.
<zao> Used to run GPFS before that.
<tarzeau> interesting...
<zao> All software lives on Lustre currently with a migration to CVMFS ongoing.
<tarzeau> if you want to test btrfs: http://aiei.ch/linux/btrfs-compression
<zao> Something like 19 million inodes currently.
<tarzeau> and some very basic filesystem check: http://aiei.ch/linux/fsbench
<zao> I don't think we consider btrfs production-ready :)
<tarzeau> the inventory and sw scripts are also fun
<tarzeau> we have, for 1-2 years now
<zao> Most jobs don't use local I/O much at all, it's pretty much all against Lustre.
<tarzeau> we also use eatmydata (for installing via ipxe/fully automatic workstation installs); the speed-up is 100%
<tarzeau> (eatmydata apt-get install ALLTHEPACKAGESWEWANT, about 5000)
<zao> We use FAI for setting up enough of a node to run Puppet.
<tarzeau> i was wondering: since debian builds software in a PORTABLE/COMPATIBLE way (the baseline is SSE2 only, and -march is forbidden), it's not optimized for speed at all,
<tarzeau> are you rebuilding some of the packages with optimizations?
<tarzeau> we use debian-installer preseeding
<zao> We use as little as possible from the distro installation.
<tarzeau> and ansible (before it was dphys-config) and aptitude-robot for automatic updates (and xymon.phys.ethz.ch for monitoring)
<zao> All software is built with our own toolchains, we have our own MPIs, BLAS impls, everything.
<tarzeau> i see
<zao> Scientific software, that is.
<tarzeau> yep, i plan on addressing this problem in debian
<zao> Regular boring distro software comes from the distro, but we try to minimize the amount of deps it pulls in.
<zao> Kind of need our own builds of SLURM/MPI/PMIX/etc. to work well on our interconnects.
<tarzeau> we like debian packages. and use reprepro for own packages
<zao> reprepro...
<tarzeau> works like a champion :)
<zao> Nice software, but has some ideas about having multiple versions of something in the repo at once.
<zao> Makes it bothersome to selectively roll things out.
<zao> We mirror all vendor-supplied packages into our own reprepro repos, to have some control over what madness Dell/HP/etc. do.
<tarzeau> every node exports /scratch* (local disks without backup) and /export/data* (local disks that we do backup) to other machines, so that's very useful
<tarzeau> we run the swiss official debian mirror and ubuntu
<zao> Not pointing at anyone in particular, but `find / -name 'blargh'` isn't a good way to find tools.
<zao> :D
<tarzeau> what is most cumbersome is ibm pc bios. are you playing with coreboot/flashrom?
<tarzeau> (the time lost when booting/rebooting/installing/debugging/adding/removing hw)
<zao> Stock firmware from whatever vendor there is.
<tarzeau> i wouldn't mind getting rid of non-free vendor bios, and replacing it with something faster
<zao> We'd value vendor support more, after all we pay for it :)
<tarzeau> who is your vendor? intel doesn't give a shit about broken bioses
<tarzeau> i had 14 ssd disks with broken firmware flashed yesterday (if not, they suddenly lose all data, haha)
<zao> We've had IBM, Supermicro in the past, now Lenovo.
<heller> Depends on how big a customer you are :p
<zao> They're a bit slow at times, we're one of the first academic customers in this area after splitting off from IBM.
<tarzeau> yeah supermicro we also have, but all kinds of vendors really. we're too small; eth zurich is not a commercial company, but it has about 20k people, and about the same number of public ip addresses
<zao> We have KNL nodes that are very vanilla Intel, and they're horrible.
<tarzeau> you don't turn off hyperthreading for that intel-microcode stuff?
<zao> For regular compute nodes, HT has been off since day one.
<tarzeau> losing so much cpu power / so many resources?
<zao> For the KNLs, we're currently poking at the vendor for mitigations for the current MDS mess, but do not have high hopes.
<tarzeau> KNL being?
<tarzeau> ah found it on your web
<zao> Knight's Landing, the 270-some core (SMT4) things.
<zao> We bought those, and then Intel discontinued the whole product line and all future development :D
<tarzeau> haha
<zao> I'm not sure of the reasoning, but I would guess that we value predictability w.r.t hyperthreading and compressed memory.
<zao> Allocation-wise, it tends to be memory that's constraining, but the variability in compression means that users can't really ask for less than expected anyway.
<zao> Not sure how it would interact with cgroups either.
<zao> Speaking of mirrors, our computer club runs the swedish mirror for ubuntu, debian, and everything else under the sun (ftp.acc.umu.se)
<tarzeau> how many users do you have?
<tarzeau> do you know/use nvtop?
<zao> No idea, thousands?
<zao> We're all remote, no local user workstations.
<zao> (our users are all remote)
<tarzeau> i see
<zao> CS department runs a setup more similar to yours I'd reckon.
<zao> End user access via SSH, graphical desktop over ThinLinc, and eventually some sort of JupyterHub maybe.
<zao> We also have a compute cloud.
<tarzeau> what will you do about python2?
<zao> Regarding nvtop, don't really use that much, just the odd nvidia-smi.
<zao> Any user-facing jobs run Python 2 and Python 3 from EasyBuild.
<tarzeau> we're mainly the physics department, but we also support the geomatic engineering team; they use 3d scanning/processing. packaged: colmap, cloudcompare; working on meshroom/alicevision
<zao> Whatever the distro carries will not matter much, apart from powering EasyBuild and whatever things users run interactively if they don't load modules.
<zao> There's probably Python in our services, but that's a bridge to burn when that day comes :)
<tarzeau> i see. your machines work with submit/login nodes, and are restricted to your users
<tarzeau> our machines are on the internet, no firewall (restricted just by users, with ssh)
<tarzeau> the geomatic people do stuff like: https://motchallenge.net/ and https://eth3d.ethz.ch/
<jbjnr> tarzeau: are you affiliated with the ETHZ people?
<tarzeau> jbjnr: i'm employee of ETHZ yes
<tarzeau> jbjnr: we met at cscs cmake course :) hi john, alex myczko
<tarzeau> the a5 driver with private parking lot
<tarzeau> department physics here, on hoenggerberg. anyone welcome to meet us, we have a nice view! and parking for free (just ask me how ;)
<tarzeau> zao: having local users is so helpful for all processes (solving all kinds of problems): you can observe them and give tips, and knowing your users is generally a big plus
<zao> Indeed.
<tarzeau> but we also have remote users (travelers) and users from cern, psi.
<zao> Some of our users are on campus, just in other departments.
<zao> We also have the occasional courses where we interact in meatspace as well.
<jbjnr> tarzeau: aha. great. Mikael is in the HIT building, you should track him down and chat.
<tarzeau> simbergm: hahaha he's right nearby, in the building where i support most people (on the J/K floors), just on the G floor
<tarzeau> he could've said so
<simbergm> tarzeau: oops, I assumed you were eth but didn't realize you'd be that close
<simbergm> come by for a coffee at some point :D (or the other way around)
<tarzeau> simbergm: you know nicolas deutschmann? he was also at the course and is next to your office
<simbergm> hmm, no, doesn't sound familiar
<tarzeau> he's among the form developers, the one reimplementing form (called reform) in a strange programming language
<tarzeau> we're in HPT H 7, but i can visit too, just not right now, maybe after lunch?
<simbergm> tarzeau: yep, sounds good (most days are fine actually, you can come by next week as well)
<tarzeau> btw, model zoo, for gpu-interested people: i just met schawinski (ex-prof) who went on to develop autolearning, so you can skip training a cnn and go right ahead and use it (i met two of the developers because i was in contact with them for software support)
<tarzeau> i was wondering if anyone is using distcc, or oclint ?
<heller> simbergm: ping
<heller> simbergm: does the login work for you now?
<simbergm> heller: nope, still not
<simbergm> for you?
<heller> simbergm: works for me (tm). I can give you an account on my jumphost
<simbergm> hmm, just hangs logging in
<simbergm> yeah, that'd be helpful, thanks
<heller> if you send me a key, I can set you up
<simbergm> heller: how
<simbergm> I'll be back later, lunch now
<heller> simbergm: add this: https://gist.github.com/sithhell/02a3335a95d6260c309b7c615f85434d to your .ssh/config and you are good to go to do a "ssh hazelhen"
<jbjnr> heller: simbergm - https://gist.github.com/biddisco/033ea839bb8b1aaec7bad1536660fc19 either of you already fixed this?
<heller> jbjnr: no :/
<heller> looks like a Debug vs Release build mismatch
<simbergm> jbjnr: nope
<jbjnr> release/debug mismatch. bad me
<jbjnr> must add a warning to the cmake if there isn't one already
<jbjnr> are we using the same tutorial examples as before, or should we use simbergm new stuff?
<heller> i thought we were using simbergm's on day 1 and the stencil on day 2
<jbjnr> should we add simbergm's stuff to the tutorial repo to keep everything in one place and make building everything in one go easier?
<heller> sounds good to me
<jbjnr> at one point we discussed a 'distributed hpx' section, but then forgot about it. Do we cover this in any significant way? Does the stencil stuff run distributed?
<heller> yes
<jbjnr> ah yes. I remember now
<jbjnr> it's just saxpy_parallel_cuda that fails now. Are we going to cover the hpx::compute stuff - or do we ditch it and instead throw in the cuda futures and cublas_ examples
<jbjnr> I'll ditch it for now and add the cuda/cublas examples
K-ballo has joined #ste||ar
<simbergm> heller: thanks, login works
<simbergm> jbjnr: I'll add my exercises to the tutorials repo
<heller> jbjnr: ditch it indeed in favor of the cuda future example
hkaiser has joined #ste||ar
<jbjnr> heller: just a thought. You probably do more debugging with the gdb command line than I do, so please be ready to step in with help during the debugging session if I'm doing it.
<jbjnr> (I'm using QtCreator these days for my desktop development)
<heller> Noted
K-ballo has quit [Quit: K-ballo]
K-ballo has joined #ste||ar
<simbergm> hkaiser: yt?
<hkaiser> here
<simbergm> thanks for fixing my mess on your branch...
<hkaiser> np
<simbergm> but don't merge it yet, please
<hkaiser> ok
<simbergm> turns out the header tests are actually called tests.headers.headers.blah
<hkaiser> ahh, I messed that up in the yml file
<hkaiser> thought this was a typo
<simbergm> auriane's PR actually changes the names to be tests.headers.blah but doesn't change the circleci config
<simbergm> so I've asked her to cherry pick that change to her branch
<simbergm> and then we can remove almost all of that commit from your PR
<simbergm> it was kind of unrelated anyway
<hkaiser> if it's separable, sure
<hkaiser> yes
<simbergm> yeah, np, I thought it was a typo as well, hence the approval
<hkaiser> just adds a new target to circle
<simbergm> yeah, that one commit comes out pretty cleanly
<hkaiser> ok, let me know once it's fine to merge the all2all, I really need it for Phylanx
<simbergm> ok, I'll clean it up asap
<simbergm> ok to force push?
<hkaiser> thanks a lot!
<hkaiser> if you don't overwrite my latest changes, sure
<hkaiser> ;-)
<simbergm> I'll try not to
<simbergm> hkaiser: I ended up just changing the headers.headers part back, you had a bunch of other useful changes there that I left in
<hkaiser> ok
<hkaiser> thanks
<simbergm> do you mind waiting for circleci before merging?
<hkaiser> sure, let's wait
<K-ballo> heh
<hkaiser> K-ballo: btw, Windows install should be fixed now
<K-ballo> the location of the libs and the dlls? I'll give it a try
<hkaiser> yes
<jbjnr> simbergm: just commented on https://github.com/STEllAR-GROUP/hpx/pull/3942
<simbergm> jbjnr: it needs the NOEXPORT in APEX as well to fully fix it
<hkaiser> simbergm: possibly the same in hpxMP
<simbergm> hkaiser: ah yes...
<simbergm> actually, no
<simbergm> it doesn't use add_hpx_library
<jbjnr> correct
<K-ballo> install works
<hkaiser> K-ballo: nod, thanks
<hkaiser> simbergm: ahh, ok
hkaiser has quit [Quit: bye]
<jbjnr> simbergm: I have fixed it in my build
<simbergm> hmm, the noexport in apex?
<simbergm> jbjnr: ^
<jbjnr> yes. apex calls add_hpx_library and I added it there. I made another change that actually might be wrong. better check
<parsa> daissgr_work: do you have time to talk about the eurohack2019?
<simbergm> what's the other change?
<jbjnr> aha. That's fine. same fix I just made. pity I didn't know you'd already done it. other fix wasn't needed, removed it
<simbergm> sorry :/
<jbjnr> diehlpk_work: yt?
<diehlpk_work> yes
<diehlpk_work> jbjnr, How can I help
<jbjnr> see pm
<diehlpk_work> simbergm, parsa Michael applied for GSoD
<parsa> diehlpk_work: good, good
<simbergm> diehlpk_work: yep, very happy about that
<simbergm> did you already get his proposal? we don't have a dashboard like for gsoc, right?
<simbergm> heller: have you had time to test anything on hazelhen?
<simbergm> we had an allocation only for today, but we'll probably need another one for next week
hkaiser has joined #ste||ar
<hkaiser> simbergm: using the generated force_linking files for config defeats the purpose :/
<simbergm> hkaiser: hmm, why? :(
<simbergm> what's different with config?
<hkaiser> simbergm: we need to force the versioning symbols to be linked into the main module
<hkaiser> if we separate the force_linking() {} from those symbols (TU-wise) they will not be linked
<simbergm> ah, ok, let's see if I can make a bigger
<simbergm> mess
<hkaiser> alternatively we need to reference the version symbols from inside the force_linking() {}
<hkaiser> that second option might be even cleaner as it explicitly forces the symbols to be linked
<hkaiser> just having things in the same TU makes it work implicitly only
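A minimal sketch of that second option (referencing the version symbols from inside force_linking() so the reference survives even optimized builds); the names and file layout here are illustrative placeholders, not the actual HPX symbols:

    // version.cpp (conceptually): the symbol that must end up in the main module
    extern "C" const char* illustrative_version_symbol() { return "1.0.0"; }

    // force_linking.cpp (conceptually): explicitly reference that symbol so the
    // optimizer/linker cannot drop the translation unit that defines it.
    namespace illustrative {
        void force_linking()
        {
            // Storing the address in a volatile pointer keeps the reference
            // alive even in Release builds.
            static volatile auto keep = &illustrative_version_symbol;
            (void) keep;
        }
    }

    int main()
    {
        illustrative::force_linking();    // called once from the main module
    }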
<simbergm> ugh
<heller> simbergm: not yet...
<heller> simbergm: later today only :(
<simbergm> heller: same, no time today
<heller> It'll work out
<simbergm> "ENDE=18:00_06/28/19"
<heller> Ugh
<heller> We'll have to ask for another one next week, I guess
<simbergm> hkaiser: I'm definitely lost
<simbergm> I can make the first happen I think
<simbergm> the force_linking.hpp header was missing for config
<hkaiser> was it?
<hkaiser> ohh, I forgot to add it :/
<hkaiser> sorry
<hkaiser> simbergm: do you want me to work on this?
<diehlpk_work> simberg, No, I am not sure how it will work for GSoD, but we will see tomorrow or on Monday at the latest
rori has quit [Quit: WeeChat 1.9.1]
<simbergm> if you want the second option it's probably best you do it
<hkaiser> simbergm: ok
<hkaiser> pls push what you have, I'll take it on
<simbergm> hkaiser: I can add the header
<hkaiser> k
<hkaiser> I simply forgot to add it while committing
<simbergm> ah, ok
<simbergm> hkaiser: hopefully better now
<hkaiser> thanks
<hkaiser> simbergm: its hopefully ok now, let's see
<simbergm> hkaiser: yep, thanks
<simbergm> and as I said, feel free to merge when circleci passes
<hkaiser> ok, thanks
<hkaiser> simbergm: so yes, the latest commit breaks in release :/ the compiler optimizes away the references - heh
<simbergm> hkaiser sigh :( it should be roughly the same as master, no? Except for some of the files being generated
<simbergm> Or is it broken on master as well?
<hkaiser> simbergm: no, don't worry I found a way to fix it
<hkaiser> it's committed
<hkaiser> more involved than I had hoped, but ohh well...
<simbergm> bleh, thanks for doing this
<hkaiser> simbergm: compile fails on circle now as -fPIC is missing
<hkaiser> how do I add that?
<hkaiser> or should we remove all that force_linking nonsense for non-windows platforms?
<K-ballo> we should probably be setting the POSITION_INDEPENDENT_CODE property on each (STATIC) module
<K-ballo> were all modules STATIC libraries, or only some?
<hkaiser> all of them
<hkaiser> I'll add it to the add_module function, then, thanks!
<simbergm> either will do
<hkaiser> I added it to add_module
<simbergm> just let me know if you add it to add_hpx_module so that we can remove it from the other PR
<simbergm> thanks, let's hope that's the last problem for this round
<hkaiser> yah
<diehlpk_work> simbergm, I just got the e-mail from Google. They will review the applications first and will send me the links of the accepted proposals from their side
<diehlpk_work> We need to decide which one we accepted by Jul 23
hkaiser has quit [Quit: bye]
hkaiser has joined #ste||ar
<K-ballo> today is slack's turn to be down https://status.slack.com/
hkaiser has quit [Ping timeout: 276 seconds]
hkaiser has joined #ste||ar
<diehlpk_work> hkaiser, Parsa and I implemented the major compiler version check when building hpx as binary packages
<hkaiser> diehlpk_work: cool!
<hkaiser> thanks
<hkaiser> thanks parsa as well!
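Purely as an illustration of the idea (the real check lives in the HPX build/packaging scripts; the macro name and version value below are made up), a generated guard comparing the major compiler version against the one the binary package was built with might look like this:

    // Illustrative only: value assumed to be baked in at package build time.
    #define ILLUSTRATIVE_HPX_BUILT_WITH_GCC_MAJOR 9

    #if defined(__GNUC__) && !defined(__clang__)
    #  if __GNUC__ != ILLUSTRATIVE_HPX_BUILT_WITH_GCC_MAJOR
    #    error "This HPX binary package was built with a different major GCC version"
    #  endif
    #endif

    int main() { return 0; }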
<heller> simbergm: jbjnr: vampir is available
<heller> simbergm: jbjnr: we did provide a preinstalled HPX last time
<heller> simbergm: jbjnr: we can test without a reservation as well *phew*
nikunj97 has quit [Remote host closed the connection]
diehlpk_mobile has joined #ste||ar
diehlpk_mobile has quit [Read error: Connection reset by peer]
<Yorlik> Do you guys have suggestions for instrumentation? I know HPX offers performance counters, just wondering what to use to collect and present data - not only HPX specifics. I'm currently looking at https://prometheus.io/ and https://github.com/jupp0r/prometheus-cpp as a possible instrumentation solution. Ideas? Suggestions?
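A rough sketch of the prometheus-cpp side of that idea; the metric name, label, and sampled values are placeholders, and wiring the value to an actual HPX performance counter is left out:

    #include <chrono>
    #include <memory>
    #include <thread>

    #include <prometheus/exposer.h>
    #include <prometheus/gauge.h>
    #include <prometheus/registry.h>

    int main()
    {
        // Expose metrics on http://127.0.0.1:8080/metrics for Prometheus to scrape.
        prometheus::Exposer exposer{"127.0.0.1:8080"};
        auto registry = std::make_shared<prometheus::Registry>();

        // A placeholder gauge; in practice the value would come from an HPX
        // performance counter or any other application-specific source.
        auto& family = prometheus::BuildGauge()
                           .Name("app_sampled_value")
                           .Help("Illustrative sampled value")
                           .Register(*registry);
        auto& gauge = family.Add({{"source", "placeholder"}});

        exposer.RegisterCollectable(registry);

        for (int i = 0; i < 10; ++i)
        {
            gauge.Set(static_cast<double>(i));  // sample and publish a value
            std::this_thread::sleep_for(std::chrono::seconds(1));
        }
    }

Prometheus would then scrape the exposed endpoint on its own schedule, and something like Grafana could be used for the presentation side.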