<hkaiser>
diehlpk_work: are there any JOSS rules for the ordering of authors?
<hkaiser>
should we sort them based on their contributions?
<hkaiser>
currently it's all over the place
<diehlpk_work>
hkaiser, it was supposed to be alphabetical
<diehlpk_work>
Our SC panel is on Wednesday, 18 November 2020
<diehlpk_work>
3pm - 4:30pm
<hkaiser>
diehlpk_work: it is not alphabetical, however
hkaiser has quit [Quit: bye]
nanmiao11 has quit [Remote host closed the connection]
akheir has quit [Quit: Leaving]
bita has joined #ste||ar
bita has quit [Ping timeout: 244 seconds]
kordejong has joined #ste||ar
kale[m] has joined #ste||ar
kordejong has quit [Changing host]
kordejong has joined #ste||ar
kale[m] has quit [Read error: Connection reset by peer]
kale[m] has joined #ste||ar
<kordejong>
My current experiment distributes HPX processes over the 8 NUMA nodes in a single cluster node. `main` is executed on one of these localities. During some of the runs, one of the processes keeps hanging. `hpx::finalize` has always exited, but `hpx::init` hasn't, it seems. `top` shows that most NUMA nodes are idling, but at least one NUMA node is still doing something (all 6 cores at 100%). Any idea what happens after calling `hpx::finalize` in `hpx::init` that could be causing this behaviour? I could use some inspiration as to where to look for the cause of this. Platform: Linux, gcc-10, HPX-1.5.0-rc2, Slurm, OpenMPI.
<kordejong>
BTW, the information I need from the experiment is ready and fine. It seems some finalization code is hanging.
<kordejong>
Since experiments are launched from shell scripts, hanging experiments prevent subsequent experiments from being launched.
<ms[m]>
"Kor de Jong" (https://matrix.to/#/@kordejong:matrix.org): are you able to determine if it's locality 0 that hangs or some other locality? Can you reproduce the hangs with --hpx:debug-hpx-log? How reproducible are the hangs?
<kordejong>
The hangs are pretty reproducible, with a Release build. Just ran a Release build with `--hpx:debug-hpx-log` and the formerly hanging experiment finished without hanging. Log is 13G. `hpx::init` exited OK.
<kordejong>
It seems subtle. The calculations done in my code always seem fine and finished (JSON file with results, only written after waiting for all tasks to finish). But some array sizes + partition sizes result in a process that hangs upon exiting.
<kordejong>
So even though some processes hang, they do so after finishing my tasks and writing the results. Even after finishing `hpx::finalize`.
<kordejong>
Same thing with multiple cluster nodes (tested 4 cluster nodes == 32 NUMA nodes == 32 localities). Upon exiting, one locality (== NUMA node) is active (all cores at 100%), while the other 31 are idling. The whole thing hangs.
<ms[m]>
Kor de Jong: do you do any explicit work on the main thread pool or timer pool? unfortunately I think those stack traces are a red herring and just show us that hpx is waiting to shut down
<ms[m]>
"some array sizes + partition sizes result in a process that hangs" does this mean it only happens with these sizes and it happens every time with those sizes, and never with other sizes?
K-ballo has quit [Ping timeout: 240 seconds]
K-ballo has joined #ste||ar
<kordejong>
No, I don't do work explicitly on certain pools. Yes, some array/partition sizes consistently hang, while others don't. Many combinations don't hang, but some do. During scalability tests, there often is a combination that hangs. I haven't seen a hang while scaling over cores in a single NUMA node (== 1 locality). The issue seems restricted to using multiple localities.
<kordejong>
Also, I have not been able to reproduce a hang using a Debug build. With the same array/partition sizes, neither a Debug build nor a Release build run with the `--hpx:debug-hpx-log` argument hangs.
hkaiser has joined #ste||ar
<kordejong>
Using the tcp parcelport instead of mpi doesn't help (`--hpx:ini="hpx.parcel.mpi.enable=0" --hpx:ini="hpx.parcel.tcp.enable=1"`)
<gonidelis[m]>
it just pops up when running `make tests.unit.modules.algorithms.partitioned_vector_transform_binary3`, while everything else builds fine
<hkaiser>
that's not a cmake error
<ms[m]>
Kor de Jong: not sure how to best debug that... you could try printing the thread count performance counter
<ms[m]>
that way we'd know if you have some thread lingering longer than it should
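One way to print that counter at runtime, per the HPX performance-counter documentation (the executable name and interval are placeholders):

```sh
./my_app --hpx:print-counter=/threads{locality#*/total}/count/instantaneous/all \
         --hpx:print-counter-interval=1000
```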
<hkaiser>
gonidelis[m]: you most likely have a mismatch between the definition and the declaration of transform_()
<hkaiser>
kordejong: forgot to call hpx::finalize()?
<zao>
gonidelis[m]: It's a linker error, so either the object file/library containing that symbol isn't provided when linking this thing, or the symbol isn't present in the libraries.
<zao>
You may have a declaration without a corresponding definition, or a promise of a specialization that isn't there, or something. Squint at the error and try to see what it's missing.
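A generic illustration of the kind of mismatch zao and hkaiser describe (hypothetical code, not HPX's actual transform_):

```cpp
// header-style declaration: promises transform_(int)
void transform_(int value);

// definition in some source file: actually defines transform_(long),
// a *different* overload, so transform_(int) is never emitted
void transform_(long value) {}

int main()
{
    transform_(42);  // overload resolution picks transform_(int):
                     // compiles fine, fails at link time with
                     // "undefined reference to transform_(int)"
}
```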
<ms[m]>
hkaiser: most likely not... would've been nice ;)
<ms[m]>
there's a bit more context in the logs if you want to catch up on that
<hkaiser>
ok
<hkaiser>
could have been easy ;-)
<hkaiser>
ms[m]: Kokkos meeting today?
<ms[m]>
hkaiser: yes
<ms[m]>
gdaiss as well!? :)
<kordejong>
<hkaiser "Kor de Jong: forgot to call hpx:"> No, it is there, and a print afterwards gets printed, just before returning the result of calling `hpx::finalize` from `hpx_main`. That's why I think the hanging code is in `hpx::init`, after its call to `hpx_main`.
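For reference, the startup/shutdown pattern under discussion looks roughly like this (a minimal sketch; the print stands in for kordejong's debug output):

```cpp
#include <hpx/hpx_init.hpp>
#include <iostream>

int hpx_main(int argc, char* argv[])
{
    // ... application work ...
    std::cout << "about to finalize" << std::endl;  // this print is seen
    return hpx::finalize();  // only *requests* shutdown; hpx_main then returns
}

int main(int argc, char* argv[])
{
    // hpx::init blocks until the runtime has fully shut down; this is the
    // call that never returns in the hanging runs.
    return hpx::init(argc, argv);
}
```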
<hkaiser>
in 15 minutes?
<ms[m]>
in 10-15 minutes
<hkaiser>
kordejong: finalize only tells the runtime to exit once all activity has ceased
<hkaiser>
that means that there is still some work in the queues if init does not return, most likely some suspended thread that didn't get resumed
<hkaiser>
ms[m]: ok
<hkaiser>
possibly a future that was never made ready or an exception that got swallowed by some code not handling them
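A contrived sketch of that failure mode (hypothetical code, not from kordejong's application): a task suspended on a future that is never made ready keeps work in the queues, so `hpx::init` cannot return even though `hpx::finalize` was already called.

```cpp
#include <hpx/hpx_init.hpp>
#include <hpx/include/async.hpp>
#include <hpx/include/lcos.hpp>
#include <utility>

int hpx_main()
{
    // Deliberately leaked so the shared state never becomes ready
    // (a stand-in for a lost set_value()/set_exception()).
    auto* p = new hpx::lcos::local::promise<void>();
    hpx::future<void> never_ready = p->get_future();

    // This HPX thread suspends waiting for a value that is never set ...
    hpx::future<void> task = hpx::async(
        [f = std::move(never_ready)]() mutable { f.get(); });

    // ... so the runtime keeps waiting for it during shutdown.
    return hpx::finalize();
}

int main(int argc, char* argv[])
{
    return hpx::init(argc, argv);  // hangs: work is still queued
}
```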
<kordejong>
@hkaiser Ah, yes. I will think about that some more. Thanks.
<hkaiser>
hpx::wait_all() is often a culprit as it waits for the futures, but doesn't rethrow exceptions
<hkaiser>
kordejong: run it with --hpx:attach-debugger=exception in interactive mode, that will break execution if an exception is thrown
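A short sketch of the `wait_all` pitfall (hypothetical code): the exception stays stored in the future until `get()` is called.

```cpp
#include <hpx/include/async.hpp>
#include <hpx/include/lcos.hpp>
#include <stdexcept>
#include <vector>

void example()
{
    std::vector<hpx::future<void>> futures;
    futures.push_back(
        hpx::async([] { throw std::runtime_error("swallowed?"); }));

    hpx::wait_all(futures);  // returns normally, does *not* rethrow

    for (auto& f : futures)
        f.get();  // rethrows the stored exception, making it visible
}
```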
kale[m] has quit [Ping timeout: 264 seconds]
kale[m] has joined #ste||ar
akheir has joined #ste||ar
<gonidelis[m]>
hkaiser: There is a serious possibility that I have successfully changed all the result types in `transform`, all three of them: under /algorithms, /container_algorithms and /segmented_algorithms
<hkaiser>
lol
<gonidelis[m]>
I 'll wait for the github tests to verify
<gonidelis[m]>
hkaiser: meanwhile I will try to complete all the remaining overloads with tag_invoke
<hkaiser>
thanks a lot
<rori>
freenode_gonidelis[m]: serious probability or serious possibility? that changes the odds :P
<gonidelis[m]>
rori_[m]: let's stay with "possibility", although I don't know what the difference between the two is...
nanmiao11 has joined #ste||ar
<gonidelis[m]>
question
<gonidelis[m]>
http://eel.is/c++draft/alg.transform#itemdecl:1 The algos with exec-policy return `ForwardIterator2` while the algos without exec-policy return sth like `constexpr OutputIterator`
<gonidelis[m]>
does that mean that my `tag_invoke`s return sth like `friend FwdIter2` when there's no exec-policy, and sth like `friend typename parallel::util::detail::algorithm_result<ExPolicy, FwdIter2>::type` when we do have an exec-policy?
<gonidelis[m]>
is that analogy correct? hkaiser
<K-ballo>
friend is not part of the type
<gonidelis[m]>
yeah, you've said that before... sorry
<K-ballo>
no that was constexpr
<K-ballo>
same thing though
<gonidelis[m]>
K-ballo: I hate your hard-drive capacity
<hkaiser>
turn it into an hpx::ranges::transform, then
<gonidelis[m]>
hkaiser: ok great thanks... should I move it under the corresponding test directory too then?
<hkaiser>
please feel free to do that, it's not that important, however
<gonidelis[m]>
ok thanks :)
<gonidelis[m]>
hkaiser: does a `parallel::util::detail::algorithm_result<ExPolicy, in_out_result<FwdIter1, FwdIter2>>::type foo` have `foo.in` and `foo.out` members?
<hkaiser>
only if it's being instantiated with a non-task execution policy, otherwise it will be `future<in_out_result<>>`
<gonidelis[m]>
hmm ok
shahrzad has joined #ste||ar
<hkaiser>
gonidelis[m]: that's the whole point of having the algorithm_result<> type, to be able to distinguish between synchronous and asynchronous execution results
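A sketch of that distinction at the call site (headers and policy names vary between HPX versions; the containers and the `twice` lambda are made up for illustration):

```cpp
#include <hpx/algorithm.hpp>
#include <hpx/execution.hpp>
#include <hpx/hpx_init.hpp>
#include <iterator>
#include <vector>

int hpx_main()
{
    std::vector<int> c{1, 2, 3};
    std::vector<int> d(c.size());
    auto twice = [](int x) { return 2 * x; };

    // Non-task policy: the in_out_result comes back directly,
    // so .in/.out are immediately accessible.
    auto r = hpx::ranges::transform(
        hpx::execution::par, c, std::begin(d), twice);
    auto last_read = r.in;     // iterator past the last element read
    auto last_written = r.out; // iterator past the last element written

    // Task policy: algorithm_result wraps the same thing in a future.
    auto f = hpx::ranges::transform(
        hpx::execution::par(hpx::execution::task), c, std::begin(d), twice);
    auto r2 = f.get();  // .in/.out only once the future is ready

    return hpx::finalize();
}

int main(int argc, char* argv[]) { return hpx::init(argc, argv); }
```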
<gonidelis[m]>
but does `tag_invoke(rng1, rng2)` recognize that?
<gonidelis[m]>
plus this last one that you sent me has a different `O` dest
<hkaiser>
why do you think it has a different 'O'?
<gonidelis[m]>
`transform(R1&& r1, R2&& r2, O result)` that's its decl
<hkaiser>
yes, that's exactly what this test is invoking, no?
<hkaiser>
well, plus the operator
<hkaiser>
transform(R1&& r1, R2&& r2, O result, BinaryOp)
<gonidelis[m]>
does `std::begin(c2)` indicate a range?
<gonidelis[m]>
if we match the args, that should be rng2
<hkaiser>
no, it's the same as `c2.begin()`
<gonidelis[m]>
and `std::begin(d1)` should be `O result`
<gonidelis[m]>
hkaiser: exactly
<hkaiser>
ah, now I see what you mean
<zao>
(you can std::begin other things too, like C-arrays)
<hkaiser>
ok, this should be `hpx::ranges::transform(policy, c1, c2, std::begin(d1), add)` instead
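Lining the corrected call up against the overload quoted above (hypothetical data; the ranges bind to `R1`/`R2`, the single iterator to `O`):

```cpp
#include <hpx/algorithm.hpp>
#include <hpx/execution.hpp>
#include <iterator>
#include <vector>

void sketch()
{
    std::vector<int> c1{1, 2, 3}, c2{4, 5, 6}, d1(3);
    auto add = [](int a, int b) { return a + b; };

    hpx::ranges::transform(hpx::execution::par,
        c1,              // R1&& r1: a whole range
        c2,              // R2&& r2: a whole range
        std::begin(d1),  // O result: a single output iterator, not a range
        add);            // BinaryOp
}
```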
<gonidelis[m]>
ahhh thanks
<gonidelis[m]>
now it clicks
<gonidelis[m]>
:D
weilewei40 has joined #ste||ar
weilewei40 has quit [Remote host closed the connection]
weilewei31 has joined #ste||ar
<gonidelis[m]>
hkaiser: there is a strong possibility that I am done with the CPOs :D ... like, all of them!!!
<gonidelis[m]>
I shall wait for the GitHub tests to complain if they want to, and then I need to fix the docs, render as depracated the depracated ones, and move stuff into the proper namespace in general, but these are trivial. The basic functionality should be complete ;)
nanmiao1176 has quit [Ping timeout: 245 seconds]
weilewei has quit [Ping timeout: 245 seconds]
<gonidelis[m]>
(plus add some binary tests overloads without exec-policy ;p)
<hkaiser>
\o/
<K-ballo>
deprecate, e-e-a
<gonidelis[m]>
K-ballo: that changes everything... kidding. thanks for the tip ;)
kale[m] has quit [Ping timeout: 260 seconds]
kale[m] has joined #ste||ar
<gonidelis[m]>
hkaiser: I have pushed the changes... you can take a look whenever you want if you want to discuss it in tomorrow's meeting
<hkaiser>
k
shahrzad_ has joined #ste||ar
nanmiao11 has joined #ste||ar
shahrzad has quit [Ping timeout: 264 seconds]
nanmiao11 has quit [Remote host closed the connection]
nanmiao11 has joined #ste||ar
nanmiao11 has quit [Remote host closed the connection]
weilewei31 has quit [Remote host closed the connection]
<hkaiser>
K-ballo: thanks! looks much better than the last one ;-)
<K-ballo>
the last one was at 331, you can see it under "revisions"
<K-ballo>
the first one in there was 702 :/
<hkaiser>
I added it to #3440
<hkaiser>
strange, I thought we had gotten rid of program_options
<hkaiser>
it's the compatibility fallback that's including the boost/program_options stuff, makes sense
nanmiao11 has joined #ste||ar
weilewei has quit [Remote host closed the connection]
weilewei has joined #ste||ar
<weilewei>
hkaiser For our paper, I am thinking of creating a tag for the HPX-enabled DCA in the ste||ar repo: https://github.com/STEllAR-GROUP/DCA/. Shall I merge the hpx_pr branch into the master branch of ste||ar/dca, and then create a tag for it?
<weilewei>
Or what will be an alternative solution?
<hkaiser>
weilewei: sounds good
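A possible command sequence for that (the tag name is made up):

```sh
git checkout master
git merge hpx_pr
git tag -a hpx-paper-2020 -m "HPX-enabled DCA used for the paper"
git push origin master hpx-paper-2020
```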
<weilewei>
ok, nice
weilewei has quit [Remote host closed the connection]
weilewei has joined #ste||ar
kale[m] has quit [Ping timeout: 260 seconds]
kale[m] has joined #ste||ar
kale[m] has quit [Ping timeout: 264 seconds]
kale[m] has joined #ste||ar
akheir has quit [Remote host closed the connection]
akheir has joined #ste||ar
kale[m] has quit [Ping timeout: 272 seconds]
kale[m] has joined #ste||ar
<hkaiser>
gnikunj[m]: would you mind having a look at Nan's problem on QBC?
<hkaiser>
things seem to have worked there for you without issues...
<nanmiao11>
=D I think it's early morning for him right now