#ste||ar on 2020-06-23 — irc logs at irclog.cct.lsu.edu

2020-02-24 20:46 hkaiser changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/ | GSoC: https://github.com/STEllAR-GROUP/hpx/wiki/Google-Summer-of-Code-%28GSoC%29-2020

00:00 <K-ballo> (you may note from the above that default initialization is not necessary here, and I could just reserve and push_back as I go)

00:01 <dd> That is essentially how my loop looks except I have an intermediate call to dataflow that combines the two futures

00:01 <dd> yeah I was thinking that push_back might solve our problem

00:01 <K-ballo> I don't think it will

00:01 <K-ballo> you'd just try to use garbage memory instead of an uninitialized future

00:02 <K-ballo> if your loop truly did behave as that, then you wouldn't be tripping over uninitialized futures
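
Note: a minimal sketch of the distinction K-ballo draws above, not dd's actual code (which is not shown in this log). It assumes standard std::future as a stand-in for the HPX futures under discussion; the function names and the deferred lambda are hypothetical and exist only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <vector>

// Sized construction: every slot holds a default-constructed future.
// Such a future has no shared state, so valid() is false until a real
// future is assigned into the slot.
std::size_t count_invalid_after_sized_construction(std::size_t n)
{
    std::vector<std::future<int>> futures(n);
    std::size_t invalid = 0;
    for (auto const& f : futures)
        if (!f.valid())
            ++invalid;
    return invalid;
}

// reserve + push_back: no element exists until it is pushed, so every
// element that does exist is a valid future.
std::size_t count_valid_after_push_back(std::size_t n)
{
    std::vector<std::future<int>> futures;
    futures.reserve(n);        // capacity only; size() is still 0
    assert(futures.empty());

    for (std::size_t i = 0; i != n; ++i)
    {
        // Hypothetical stand-in for real work. The deferred policy means
        // no thread is started; the lambda would only run on get().
        futures.push_back(std::async(
            std::launch::deferred, [i] { return static_cast<int>(i); }));
    }

    std::size_t valid = 0;
    for (auto const& f : futures)
        if (f.valid())
            ++valid;
    return valid;
}
```

With the sized vector, a loop that reads slot i before assigning it sees an invalid future. With reserve and push_back, the same premature read indexes past size(), which is undefined behaviour: the "garbage memory" case. Either way the ordering bug in the loop remains.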

00:02 <dd> ok well I think you gave me what I need to keep working on this - much appreciated

00:02 <dd> BTW it interestingly works for small enough meshes

00:03 <K-ballo> if you turn it into a minimal test case reproducing the exact problem without all the complexity, I'm sure people here would have a look

00:03 <dd> ok thanks again - I will keep debugging and try to come up with a reproducer if I can't sort it out

00:31 kale[m] has quit [Ping timeout: 260 seconds]

00:34 kale[m] has joined #ste||ar

01:46 akheir has quit [Quit: Leaving]

02:04 hkaiser has quit [Quit: bye]

02:37 kale[m] has quit [Ping timeout: 260 seconds]

02:41 nan11 has quit [Remote host closed the connection]

03:05 dd has quit [Ping timeout: 245 seconds]

04:50 bita__ has quit [Ping timeout: 260 seconds]

05:15 nikunj97 has joined #ste||ar

05:19 Nikunj__ has joined #ste||ar

05:22 nikunj97 has quit [Ping timeout: 244 seconds]

06:51 Nikunj__ has quit [Read error: Connection reset by peer]

06:59 <zao> gonidelis[m]: the exact issue I said it was... smh

07:09 Nikunj__ has joined #ste||ar

07:47 nikunj97 has joined #ste||ar

07:51 Nikunj__ has quit [Ping timeout: 260 seconds]

08:13 kale[m] has joined #ste||ar

10:01 kale[m] has quit [Ping timeout: 256 seconds]

10:47 nikunj97 has quit [Read error: Connection reset by peer]

11:03 nikunj97 has joined #ste||ar

11:16 Nikunj__ has joined #ste||ar

11:19 nikunj97 has quit [Ping timeout: 260 seconds]

11:30 kale[m] has joined #ste||ar

11:35 kale[m] has quit [Ping timeout: 260 seconds]

11:48 nikunj97 has joined #ste||ar

11:50 kale[m] has joined #ste||ar

11:51 Nikunj__ has quit [Ping timeout: 260 seconds]

11:53 Nikunj__ has joined #ste||ar

11:57 nikunj97 has quit [Ping timeout: 260 seconds]

12:23 hkaiser has joined #ste||ar

13:05 kale[m] has quit [Read error: Connection reset by peer]

13:06 kale[m] has joined #ste||ar

13:41 kale[m] has quit [Ping timeout: 264 seconds]

13:42 kale[m] has joined #ste||ar

14:18 Nikunj__ has quit [Quit: Leaving]

14:18 weilewei has joined #ste||ar

14:20 dd has joined #ste||ar

14:20 nan11 has joined #ste||ar

14:23 weilewei has quit [Remote host closed the connection]

14:32 weilewei has joined #ste||ar

14:34 kale[m] has quit [Read error: Connection reset by peer]

14:35 kale[m] has joined #ste||ar

14:39 karame_ has joined #ste||ar

14:41 kale[m] has quit [Read error: Connection reset by peer]

14:41 kale[m] has joined #ste||ar

14:43 hkaiser has quit [Read error: Connection reset by peer]

14:45 hkaiser has joined #ste||ar

14:58 akheir has joined #ste||ar

15:07 rtohid has joined #ste||ar

15:07 <weilewei> HPX is mentioned in Kokkos training tutorial

15:20 <diehlpk_work_> June 29, 2020 - July 3, 2020 -> First GSoC evaluations next week

15:27 <jbjnr> hkaiser: yt?

15:30 <hkaiser> jbjnr: here

15:32 <jbjnr> agas - service_mode hosted and service_mode bootstrap - I've got a problem with bootstrapping the libfabric stuff and I'm curious how agas decides which mode is set. Can you elucidate please?

15:35 <jbjnr> what I want to do is start N worker nodes and say --hpx:agas=root_node:port - but when I use --hpx:agas on the command line, it thinks the worker is the root node and not that I want to connect to agas on the root node.

15:36 <hkaiser> mode 'bootstrap' means it's locality zero, otherwise 'hosted'

15:37 <hkaiser> --hpx:agas shouldn't have any relation to the service mode

15:37 <hkaiser> it just specifies where the bootstrapped agas instance lives

15:37 <jbjnr> my problem is that when I use "--hpx:agas=localhost:7910 --hpx:localities=2 --hpx:worker" it sets the service mode to bootstrap even though I'm a worker

15:38 <hkaiser> add a --hpx:node=0/1 accordingly

15:38 <hkaiser> then --hpx:worker is not needed

15:38 <jbjnr> you can't use hpx:node when you use hpx:agas

15:38 <hkaiser> see also https://stackoverflow.com/questions/35367816/hpx-minimal-two-node-example-set-up

15:39 <jbjnr> hpx::init: std::exception caught: Command line option --hpx:node is not compatible with --hpx:agas

15:39 <jbjnr> hkaiser: I know how it used to work.

15:39 <hkaiser> doesn't it work as described anymore?

15:39 <jbjnr> my libfabric stuff does not work any more and I'm trying to find out what has been broken

15:39 <jbjnr> and I have no idea why hpx:node is 'incompatible' with hpx:agas

15:40 <hkaiser> do those instructions work with tcp?

15:40 <jbjnr> no idea. not interested in tcp

15:40 <hkaiser> generally I wouldn't exclude the possibility that we have broken things during startup

15:41 <hkaiser> jbjnr: please don't be annoyed

15:41 <jbjnr> why is hpx:node incompatible with hpx:agas?

15:41 <hkaiser> finding out whether things still work with tcp would at least tell us whether it's a general problem or just something with the libfabric pp

15:42 <hkaiser> jbjnr: I don't know off the top of my head - need to look at the code and think about it

15:42 <hkaiser> most probably because using hpx:agas and hpx:node might create ambiguities, but I'm not sure

15:43 <jbjnr> ok. just wanted to know about the service mode stuff. it is as I expected and there are new bugs

15:43 <hkaiser> or it's simply an invalid restriction

15:43 <jbjnr> ^this

15:43 <jbjnr> imho

15:43 <jbjnr> but I suspect there was a reason for it once upon a time

15:48 <hkaiser> indeed

15:48 <ms[m]> weilewei: nice! are you attending? is it public?

15:48 bita__ has joined #ste||ar

15:49 <weilewei> ms[m] yes I am attending. I think you need to register to get zoom password and send email to celmont@sandia.gov: https://github.com/kokkos/kokkos-tutorials/issues/36

15:50 <weilewei> it's like half public/private. I think their target audience is summer students at Sandia lab

15:53 <ms[m]> weilewei: ok, thanks

15:53 <ms[m]> then just pass on all the gossip to here ;)

16:21 weilewei has quit [Remote host closed the connection]

16:32 dd has quit [Remote host closed the connection]

16:43 rtohid has quit [Ping timeout: 245 seconds]

16:50 karame_ has quit [Remote host closed the connection]

17:05 rtohid has joined #ste||ar

17:29 rtohid has quit [Remote host closed the connection]

17:30 rtohid has joined #ste||ar

17:39 <akheir> what happened to github? it is ugly!

17:39 <akheir> is it just me?

17:43 <zao> akheir: Hehe, round and nice.

17:43 <zao> akheir: I saw you ran into the GCC fixincludes bug on your cluster too, isn't it fun?

17:43 <zao> We had it in EasyBuild a while ago, I linked our thread last night in here.

17:43 karame_ has joined #ste||ar

17:44 <zao> In short, GCC is quite tied to the glibc version on the host OS, so for minor OS upgrades you may need to rebuild it.

17:44 <akheir> zao: yeah. I didn't know the name, but it was confusing

17:45 <akheir> oh, nasty. it is ok now since I only have two versions, but later on when the number grows it could be difficult to handle

17:45 <akheir> zao: how about the libraries compiled with that gcc? should I recompile my boost as well?

17:46 <zao> In this particular case, I don't believe it's required.

17:47 <akheir> good to know. I did it to be safe though

17:51 <zao> How do you build the software stack on the cluster, mostly manual or EB/spack/something?

17:52 <zao> We used to build everything manually for four different compiler vendors with just README.sysop files with vague instructions on how to build something :D

17:57 <akheir> zao: I gave up on EB on the old cluster. Its dependency management was a headache. I have my own set of bash scripts that does it for me, from download to creating the lmod module

18:00 <zao> We find it quite nice for the vast bulk of software that our researchers need to run, but for a department cluster more aimed at development, maybe not quite as good.

18:03 <akheir> yeah. tell it to install openmpi it goes and installs two versions of gcc first and then compiles openmpi. lol

18:03 karame_ has quit [Remote host closed the connection]

18:07 <zao> The relative isolation from the underlying OS is nice when you want some semblance of things working the same across systems.

18:07 <zao> Most of the horrors I run into when trying it on weirdo distros are in getting the system-ish things going, the rest tends to be smooth sailing.

18:10 nikunj97 has joined #ste||ar

18:13 nan11 has quit [Remote host closed the connection]

18:18 nan11 has joined #ste||ar

18:18 Nikunj__ has joined #ste||ar

18:21 nikunj97 has quit [Ping timeout: 260 seconds]

18:21 Nikunj__ has quit [Read error: Connection reset by peer]

18:22 Nikunj__ has joined #ste||ar

18:25 Nikunj__ has quit [Read error: Connection reset by peer]

19:25 hkaiser has quit [Read error: Connection reset by peer]

19:26 hkaiser has joined #ste||ar

19:43 kale[m] has quit [Ping timeout: 246 seconds]

19:43 kale[m] has joined #ste||ar

19:47 <nikunj> hkaiser: yt?

19:50 <bita__> thanks to Al, we were trying to build Phylanx on Rostam. Using yesterday's hpx master, Phylanx cannot be built: https://gist.github.com/taless474/05fb593094a2eed051c94540927cf6d3

19:50 <bita__> Phylanx image is good, as I just built it on docker

19:51 <bita__> Any suggestions?

19:52 <nikunj> hkaiser: I just noticed that time increases exponentially when using hpx::lcos::channel for distributed send/receive with increasing number of hpx threads. So if I keep 1 hpx thread per node and use lcos communication, it is blazing fast. But if I use 64 hpx threads per node, it is terribly slow for the same amount of send/receive compared to 1 hpx thread.

19:54 <nikunj> hkaiser: is it an expected behavior? If yes, any workaround if only 1 thread invokes set and get functions?

19:56 <hkaiser> what are your idle rates?

20:00 <hkaiser> bita__: missing header ?

20:00 <hkaiser> <memory> or <string>?

20:00 <bita__> well I am using master

20:01 <bita__> I see that hpx master yesterday was failing for a few hours. How can I get the latest stable version?

20:01 <hkaiser> stable tag?

20:01 <bita__> yes

20:02 <hkaiser> there is a tag 'stable' that is the latest commit that passed testing

20:03 <bita__> Steve's Phylanx branch was successfully built about 3 hours ago. I just don't know what I miss

20:03 <bita__> thanks, I will try that

20:04 <hkaiser> bita__: do you build with c++14?

20:04 <bita__> I have this -DPHYLANX_WITH_CXX17=ON, so I would say no

20:06 <hkaiser> well, c++17 should work as well

21:06 mariella[m] has joined #ste||ar

22:45 kale[m] has quit [Ping timeout: 260 seconds]

22:56 kale[m] has joined #ste||ar

22:58 rtohid has left #ste||ar [#ste||ar]

23:26 weilewei has joined #ste||ar

23:27 <weilewei> Does github have new user interface from today?

23:48 <hkaiser> weilewei: I think you need to explicitly enable it

23:49 <weilewei> hkaiser I didn't do anything at all, and then the github repo page has a new look

23:49 <hkaiser> ok

23:50 <hkaiser> I had it enabled for some time, they might have made it broadly available now

23:50 <weilewei> I see, I'm just not familiar with the new look yet