hkaiser changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/
hkaiser has quit [Quit: bye]
khuck has quit [Ping timeout: 250 seconds]
<simbergm> bad news: my change to apply made some dataflow tests fail almost all the time, good news: we might actually be able to find out what's wrong with dataflow...
<simbergm> there's no way that change in itself should be the source of those problems, right?
nanashi55 has quit [Ping timeout: 252 seconds]
nanashi55 has joined #ste||ar
<heller> simbergm: how do they fail?
eschnett_ has quit [Quit: eschnett_]
<heller> that's bad
<simbergm> it looks like the same that we've seen occasionally, except that it happens almost every time now
<simbergm> heller: dataflow(f, std::vector<future<blah>>) is meant to wait for the futures in the vector, right? because if not the test is just broken
<heller> yes, it is supposed to wait
<heller> the continuation should only be executed once all inputs are ready
<simbergm> mmh
<jbjnr_> simbergm: great. I saw those dataflow fails on my guided executor branch and was worried it was that. But if it's on all the branches, I can relax.
<jbjnr_> maybe I can take a look at one of them, I have some practice with dataflow etc when I worked on the new executors ...
mcopik has joined #ste||ar
mcopik has quit [Ping timeout: 268 seconds]
mcopik has joined #ste||ar
mcopik has quit [Ping timeout: 272 seconds]
ste||ar-github has joined #ste||ar
<ste||ar-github> [hpx] sithhell created sithhell-patch-1 (+1 new commit): https://github.com/STEllAR-GROUP/hpx/commit/9d0e25221eef
<ste||ar-github> hpx/sithhell-patch-1 9d0e252 Thomas Heller: Changing Base docker image
ste||ar-github has left #ste||ar [#ste||ar]
ste||ar-github has joined #ste||ar
<ste||ar-github> [hpx] sithhell pushed 1 new commit to sithhell-patch-1: https://github.com/STEllAR-GROUP/hpx/commit/4cf1c10c00f3e0363ee6249e8cbaa5b2f237ea68
<ste||ar-github> hpx/sithhell-patch-1 4cf1c10 Thomas Heller: Changing base Docker image for the HPX image
ste||ar-github has left #ste||ar [#ste||ar]
ste||ar-github has joined #ste||ar
<ste||ar-github> [hpx] sithhell opened pull request #3491: Changing Base docker image (master...sithhell-patch-1) https://github.com/STEllAR-GROUP/hpx/pull/3491
ste||ar-github has left #ste||ar [#ste||ar]
heller has quit [Read error: Connection reset by peer]
heller__ has joined #ste||ar
nikunj has joined #ste||ar
mcopik has joined #ste||ar
hkaiser has joined #ste||ar
<heller__> hkaiser: so ... what do you propose to solve the issue?
<hkaiser> no idea
<hkaiser> I would like to avoid having to set up a separate HPX build for the Phylanx tests
<heller__> yes
<heller__> limited resources are a problem
<heller__> still, the two issues are orthogonal
<hkaiser> so compiling HPX in c++14 would be a viable workaround
<hkaiser> (on circleci)
<heller__> it's only viable until there's a new problem
<hkaiser> the actual problem is the way Klaus detects c++17 features
<heller__> right
<heller__> the change to clang7 should solve the problem (also comes with the gcc8 libstdc++)
<heller__> unfortunately, there seems to be a problem with the new hpx_main setup...
<hkaiser> do they have all algorithms by now?
<heller__> that's my understanding
<heller__> I only tested std::destroy, TBH
mcopik has quit [Remote host closed the connection]
<hkaiser> k
<hkaiser> there should be feature macros defined going with each of the algorithms
<hkaiser> so Klaus could check those
nikunj has quit [Ping timeout: 252 seconds]
<heller__> if we fix the failures for clang and lld appearing in llvm7, we could at least hide the problem for now
aserio has joined #ste||ar
<heller__> hkaiser: what about freezing the blaze version?
<hkaiser> heller__: for how long?
<heller__> hkaiser: I'd suggest forever
<heller__> hkaiser: it probably won't be the last time that external dependencies break something
<jbjnr_> sorry I caused so much trouble.
<hkaiser> heller__: that's not realistic
<hkaiser> jbjnr_: not just your fault ;-)
<hkaiser> it's heller__'s fault too
<jbjnr_> I could've fixed the mpi stuff the easy way, but I chose to try and do it the 'right' way. Shan't make that mistake again!
<hkaiser> heller__: but thanks for setting up clang7 on circleci
<heller__> hkaiser: well, then you have to live with breakage...
<hkaiser> heller__: I'll talk to klaus as well
<hkaiser> for now freezing the blaze version is an option, but that can only be a stop-gap measure
<heller__> hkaiser: i disagree
<heller__> hkaiser: there only needs to be an automated mechanism that bumps the versions
<heller__> That way, you ensure smooth operation
<heller__> If you always assume that pulling the latest commit on master should work, that's bound to fail
<hkaiser> heller__: that means that you force people to manually install blaze even if they have a (possibly newer) version already installed
<heller__> We can't ensure that for hpx or blaze
<heller__> No, that's what I mean
<hkaiser> heller__: I'm not talking about latest master, I'm talking about released versions
<heller__> Phylanx currently uses the latest master versions of hpx, blaze, pybind11 and highfive
<heller__> For testing on circleci
<heller__> And that's what I'm suggesting: use fixed versions there.
<hkaiser> heller__: yes, this is not a necessary setup, it's for us to be sure that dependencies don't break our builds ;-)
<hkaiser> heller__: ok, sure
<heller__> Mission accomplished then :p
hkaiser has quit [Quit: bye]
<heller__> The advantage: we test a curated set of dependencies. Nicely documented
<heller__> New versions can be updated via PRs. They get checked via circle, and you can report potential problems upstream, without disturbing the normal workflow
<heller__> This update could even be automated
nikunj has joined #ste||ar
aserio has quit [Quit: aserio]
aserio has joined #ste||ar
eschnett has joined #ste||ar
parsa[w] has quit [Read error: Connection reset by peer]
parsa[w] has joined #ste||ar
jaafar has joined #ste||ar
RostamLog has joined #ste||ar
jaafar has quit [Ping timeout: 246 seconds]
hkaiser has joined #ste||ar
<hkaiser> simbergm: yt?
<simbergm> hkaiser: here
<hkaiser> simbergm: where is the generated documentation hosted now?
khuck has joined #ste||ar
khuck has quit [Remote host closed the connection]
khuck has joined #ste||ar
david_pfander has quit [Ping timeout: 246 seconds]
<simbergm> that redirects to the generated docs from master now, but I'll change it to latest release once we have 1.2.0 out
<simbergm> I'm not sure what would be a good place to document this...
<simbergm> heller suggested I write a blog post about the move to sphinx, I could write about this there as well
<nikunj> can we not buy a free domain to host the documentation?
<simbergm> jbjnr_: yeah, if you know something about dataflow please do have a look, I'm looking as well but starting from scratch
<jbjnr_> I had a very quick look this morning, but the futures are created with make_ready_future, so the fact that is_ready is false is very strange. Implies the construction is flawed?
<simbergm> no, it's push_back(dataflow(f, make_ready_future))
<simbergm> so the future from dataflow is not ready immediately (and shouldn't be)
<jbjnr_> I misread it then. sorry, was only a quick look
<jbjnr_> will look again later maybe
<simbergm> np
<hkaiser> simbergm: could you update the main README to point there, pls?
<simbergm> hkaiser: yeah, I'll do that
akheir has joined #ste||ar
<hkaiser> simbergm: thanks, the README still points to the old docs
aserio has quit [Ping timeout: 252 seconds]
<simbergm> hkaiser: yes, it does
<simbergm> I'm going to update it on the release branch so that it's changed once we have the release out
<jbjnr_> can we merge any of my prs before release. squeak squeak
<hkaiser> simbergm: the README points to the docs in progress as well, that could be changed right away
<simbergm> jbjnr_: on clang there's this: error: no member named 'free' in namespace 'std'; did you mean simply 'free'?
<simbergm> if they work :)
<simbergm> clang 3.8 that is
<jbjnr_> missing cstdlib then. I fix now
<simbergm> to all, keep opening prs to master still, I will just merge/cherry-pick to the release branch once that becomes relevant
ste||ar-github has joined #ste||ar
<ste||ar-github> [hpx] biddisco force-pushed demangle_helper from d7710d6 to 8239943: https://github.com/STEllAR-GROUP/hpx/commits/demangle_helper
<ste||ar-github> hpx/demangle_helper 8239943 John Biddiscombe: Use std::unique_ptr in demangler
ste||ar-github has left #ste||ar [#ste||ar]
<khuck> is there an ETA on the release?
<hkaiser> khuck: before sc
<khuck> thanks
<jbjnr_> or when we fix ALL the bugs!
<hkaiser> whatever comes first ;-)
<K-ballo> sc then
<K-ballo> there's only a month left
khuck has quit [Read error: Connection reset by peer]
khuck_ has joined #ste||ar
<jbjnr_> jesus - I just tried turning on CUDA in my build and the list of cmake errors is huge
<jbjnr_> do we test this?
<jbjnr_> configure bombs out completely
<hkaiser> jbjnr_: apparently we don't test it
<jbjnr_> do we want cuda to be linked PUBLIC or PRIVATE? I vote PRIVATE initially.
<heller__> Yes
<heller__> No
<jbjnr_> No
<jbjnr_> Yes
<jbjnr_> :)
<heller__> Has to be public, because of headers and inline functions
<jbjnr_> it affects transitivity
<jbjnr_> we might link to cuda, but the user might not want to
<jbjnr_> do we force cuda onto the user or let him add cuda himself if he wants it
<heller__> He chose to enable it in the first place
<khuck_> heller__: quick question - do we need any scratch space for the NERSC / ERCAP proposal? The format changed slightly from last year, and I don't think we requested it.
<khuck_> heller__: we requested 1TB for project space, but unknown for scratch space.
<jbjnr_> this is hard to argue with, ^^
<heller__> khuck_: hmm. I usually build on scratch
<jbjnr_> but if the system has cuda-enabled hpx on it and a user builds a hello world app. Should they be forced to pull in cuda?
<heller__> khuck_: having at least a TB would be appropriate
<khuck_> agreed
<heller__> jbjnr_: it should only be pulled in if it's a cuda app, no?
<heller__> jbjnr_: hmm you have a point. We're only talking about the rt libs and friends, right?
<jbjnr_> exactly, but making cuda target_link_libraries(PUBLIC...) forces all the includes, links etc on the user too
<heller__> Yeah, those should be private
<jbjnr_> ok, done
<heller__> So we need to fix the headers, maybe
<heller__> Not too sure there
<jbjnr_> cmake 3.12 fixes all this for us with set(CUDA_LINK_LIBRARIES_KEYWORD "PRIVATE")
<jbjnr_> don't know if this is in earlier cmake versions
<heller__> If they are implemented properly already...
<heller__> I think previous versions are pretty broken
<heller__> Let's just document that we need 3.12 for cuda support.
<jbjnr_> looks like it was added around cmake 3.9. good
ste||ar-github has joined #ste||ar
<ste||ar-github> [hpx] biddisco created cuda_cmake_fix (+1 new commit): https://github.com/STEllAR-GROUP/hpx/commit/81374c5b0441
<ste||ar-github> hpx/cuda_cmake_fix 81374c5 John Biddiscombe: Add CUDA_LINK_LIBRARIES_KEYWORD to allow PRIVATE keyword in linkage to cuda
ste||ar-github has left #ste||ar [#ste||ar]
ste||ar-github has joined #ste||ar
<ste||ar-github> [hpx] biddisco opened pull request #3492: Add CUDA_LINK_LIBRARIES_KEYWORD to allow PRIVATE keyword in linkage t… (master...cuda_cmake_fix) https://github.com/STEllAR-GROUP/hpx/pull/3492
ste||ar-github has left #ste||ar [#ste||ar]
mcopik has joined #ste||ar
aserio has joined #ste||ar
khuck_ has quit [Remote host closed the connection]
khuck has joined #ste||ar
hkaiser has quit [Quit: bye]
eschnett has quit [Quit: eschnett]
<khuck> aserio: do you have a brief description of the Rostam cluster?
<aserio> khuck: do you mean like the types of nodes it contains?
<khuck> aserio: yes, but it can be short. i.e. x nodes of y processors, z memory.
<khuck> and if you have a web page that describes it, even better
<aserio> khuck: so this is the long form of the information: https://github.com/STEllAR-GROUP/hpx/wiki/Running-HPX-on-Rostam
<aserio> khuck: most students report on the node that they are running on
<khuck> thanks
<aserio> khuck: Perhaps we could just talk about Marvin and say ... and other nodes
<khuck> Here's what I am adding:
<khuck> Rostam cluster at LSU (47 node heterogeneous test cluster): https://github.com/STEllAR-GROUP/hpx/wiki/Running-HPX-on-Rostam
<khuck> Talapas cluster at UO (128 nodes, 28 cores per node): https://hpcf.uoregon.edu/content/talapas
<khuck> and that's enough
<aserio> Ok :)
<khuck> Alice wanted it
<khuck> I don't want it to look like we have other options. :)
<khuck> aserio: does HPX have compute allocations on other machines in Europe?
<aserio> The team at LSU does not
<khuck> fair enough
<aserio> lol I am not certain what heller__ has access to
<aserio> and jbjnr_ has access to his institutional resources
<khuck> right, but HPX doesn't have a project account to bill against.
<khuck> so that doesn't count.
<zao> 232 - tests.unit.lcos.local_dataflow (Failed)
<zao> 234 - tests.unit.lcos.local_dataflow_executor (Failed)
<zao> I'm guessing by the earlier discussion that these problems are known?
hkaiser has joined #ste||ar
<zao> (building d4bd36da10d)
akheir has quit [Quit: Leaving]
aserio has quit [Quit: aserio]
<simbergm> zao: yep, known
<zao> Was a bit surprised that only those two tests failed building master :)
<khuck> hkaiser: I am thoroughly confused now. Phylanx crashes with blaze 3.4. Or maybe it just crashes.
<khuck> hkaiser: is blaze master fixed?
<khuck> or are we still using 3.4?
<hkaiser> khuck: blaze master is fixed
<hkaiser> khuck: on circleci I'd like to get everything else under control first before going back to blaze master
<khuck> yeah, I am seeing other failures. right now on my main test system everything is failing because there's a zombie HPX process out there, hogging the port. :/
<khuck> hkaiser: " what(): <unknown>: HPX(network_error) " means that there is another HPX app running, right?
mcopik has quit [Ping timeout: 252 seconds]
mcopik has joined #ste||ar
<K-ballo> is that all the information? if so we regressed
<khuck> no, there's more
<K-ballo> another HPX app running would mention something about binding ports or similar
<khuck> good point
ste||ar-github has joined #ste||ar
<ste||ar-github> [hpx] msimberg opened pull request #3493: Remove deprecated options for 1.2.0 part 2 (master...remove-deprecated-options-2) https://github.com/STEllAR-GROUP/hpx/pull/3493
ste||ar-github has left #ste||ar [#ste||ar]
<khuck> K-ballo: found it. there was a hung 'fibonacci' test from another user - 6 days old.
<K-ballo> heh
<hkaiser> khuck: was this the issue you thought being caused by blaze?
<khuck> no
<khuck> the blaze issue prevented compilation
<khuck> phylanx is currently building with blaze 3.4, but there is one failing test
<khuck> simple_als
<khuck> 0!=1
<khuck> (in the output)