#ste||ar on 2021-01-27 — irc logs at irclog.cct.lsu.edu

2020-09-17 16:16 K-ballo changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/

00:20 <zao> hkaiser: I'm heading off to bed real soon now, but the test (numa_allocator_test) I used seems to loop forever with hwloc 2.3 and the current combination of the two PRs. Two Ctrl-C states of the process: https://gist.github.com/zao/3b9a60aea46e75af970bf0035500baf9

00:21 <zao> This hwloc 2 is the unpatched version before the update to the ports that the original reporter made, so it should probably reproduce the problem original problem if I go and remove the fix PR.

00:22 <zao> (I've shot myself in the foot a bit by using the version from ports, can only ever have one version installed at once :D )

00:30 <hkaiser> yeah, I added some locks which might have been wrong :/

02:41 <hkaiser> removed now

03:01 hkaiser has quit [Quit: bye]

03:04 shahrzad has quit [Remote host closed the connection]

04:20 teonnik has quit [Ping timeout: 246 seconds]

04:20 klaus[m] has quit [Ping timeout: 246 seconds]

04:20 parsa[m] has quit [Ping timeout: 240 seconds]

04:20 k-ballo[m] has quit [Ping timeout: 240 seconds]

04:20 ms[m] has quit [Ping timeout: 240 seconds]

04:20 mdiers[m] has quit [Ping timeout: 240 seconds]

04:20 gnikunj[m] has quit [Ping timeout: 240 seconds]

04:20 jpinto[m] has quit [Ping timeout: 248 seconds]

04:20 tiagofg[m] has quit [Ping timeout: 265 seconds]

04:20 rori has quit [Ping timeout: 260 seconds]

04:21 pedro_barbosa[m] has quit [Ping timeout: 268 seconds]

04:21 gonidelis[m] has quit [Ping timeout: 268 seconds]

04:42 klaus[m] has joined #ste||ar

04:45 rori has joined #ste||ar

04:46 bita has quit [Ping timeout: 264 seconds]

04:48 ms[m] has joined #ste||ar

05:00 ms[m] has quit [Ping timeout: 246 seconds]

05:00 klaus[m] has quit [Ping timeout: 265 seconds]

05:01 rori has quit [Ping timeout: 260 seconds]

05:30 powderluv has quit [Quit: powderluv]

05:36 teonnik has joined #ste||ar

05:37 parsa[m] has joined #ste||ar

05:38 k-ballo[m] has joined #ste||ar

05:40 mdiers[m] has joined #ste||ar

05:41 gnikunj[m] has joined #ste||ar

05:42 pedro_barbosa[m] has joined #ste||ar

05:42 tiagofg[m] has joined #ste||ar

05:42 jpinto[m] has joined #ste||ar

05:43 gonidelis[m] has joined #ste||ar

06:04 ms[m] has joined #ste||ar

06:06 klaus[m] has joined #ste||ar

06:12 rori has joined #ste||ar

08:43 <ms[m]> mdiers: if you still have the debugger open with those threads (or a core dump) I'd also be interested in knowing which assertion on threads 61, 62, 64 is hit (I'm pretty sure it can only be the second), but particularly what the value of stacksize is

11:07 <mdiers[m]> ms: Sorry for my late reply. The line numbers are at the end of the line between the filename and the address in the callstacks.

11:08 <mdiers[m]> * ms: Sorry for my late reply. The line numbers are at the end of the line between the filename and the address in the callstacks. `scheduler_base.hpp:273`

11:28 <ms[m]> mdiers: ah, indeed, thanks, so that's the expected assertion

11:28 <ms[m]> do you have access to the value of stacksize?

11:45 <mdiers[m]> ms: I'll let you know as soon as I catch it in the debugger again.

13:05 <diehlpk_work> ms[m], I will push the rc to Fedora today?

13:06 <ms[m]> mdiers: thanks

13:06 <ms[m]> diehlpk_work: thanks as well

13:07 <ms[m]> note that I've extended the stellar group signing key, but I'm not sure it's propagated to all/most servers yet

13:25 hkaiser has joined #ste||ar

13:29 <hkaiser> ms[m], rori: thanks for all the work on the release!

13:42 <ms[m]> hkaiser: thank you, all the hard work has been done before the release :)

13:43 <ms[m]> note that I didn't include the freebsd environment nor the hwloc prs in the rc because I wasn't sure what the status of those were

13:43 <hkaiser> ms[m]: nod

13:43 <hkaiser> ms[m]: zao confirmed yesterday that the freebsd PR is fine, not sure what his verdict on the hwloc was

13:43 <ms[m]> did you conclude yesterday that they were ready to go in or do they need further testing (they probably do, but are they as tested as we will get them right now)?

13:44 <ms[m]> ok, so we can likely go ahead with that one then

13:44 <ms[m]> let's wait a bit to see if we get a confirmation about the hwloc one (either from zao or one of the reporters)

13:48 <hkaiser> nod

13:55 <gnikunj[m]> hkaiser: K-ballo why is it that some C++ features come in as warnings while others as errors while using let's say C++17/20 features on a compiler that defaults to C++14? For instance, using fold expressions on gcc 10.2 throws a warning (the executable runs as expected) saying that I should enable -std=c++17 to use fold-expressions

13:55 <hkaiser> gnikunj[m]: ask the compiler developers

13:55 <K-ballo> conforming extension, use pedantic if you want an error instead of a warning

13:56 <hkaiser> most likely because the feature was available before it got standardized, so this keeps existing code compiling

13:56 <gnikunj[m]> aah, that makes sense.

13:57 <ms[m]> gnikunj: also, clang is stricter than gcc about sticking to features in the version you specify

13:58 <ms[m]> *sticking only to features...

13:58 <gnikunj[m]> got it. I was really curious what went in the discussions that they decided to throw warnings at some and errors at others

13:59 <gnikunj[m]> things related to constexpr always seem to throw errors while things like inline variables and fold expressions throw warnings and the executable behaves as expected

14:04 <K-ballo> there are no conforming constexpr extensions

14:05 <K-ballo> for instance, gcc's constexpr math extensions are non-conforming

14:06 <gnikunj[m]> ohh. That reminds me of another question. Why aren't more algorithms constepxr?

14:07 <gnikunj[m]> C++20 does bring more constexpr algorithms but what is stopping from making most of the algorithms constexpr?

14:07 <K-ballo> mostly lack of proposals, but also some memcpy related concerns

14:07 <K-ballo> all non-allocating algorithms should by constexpr by now

14:07 <gnikunj[m]> but new and delete are now allowed with C++20 when used in the same constexpr function

14:08 <gnikunj[m]> why is memcpy an issue then?

14:08 <K-ballo> you can't memcpy in a constant expression

14:09 <K-ballo> but that got was resolved with is_constant_evaluated()

14:09 <K-ballo> for constexpr allocating algorithms it will take some constexpr implementations first, so that reference implementations can be tested, in order for proposals to come forward

14:10 <gnikunj[m]> makes sense. So we can expect some constexpr allocating algorithms by C++23?

14:12 <K-ballo> assuming they are actually implementable (they should be, memory is only used temporarily), then by some C++next yes

14:12 <gnikunj[m]> nod. Nice!

14:28 <zao> ms[m]: just got up, gonna see how much work I have I must do before I can get something built again.

14:28 <hkaiser> zao: no rush

15:01 <hkaiser> ms[m]: yt?

15:03 <ms[m]> hkaiser: here

15:04 <hkaiser> ms[m]: wrt #5117

15:04 <hkaiser> have you seen the stack backtraces mdiers[m] posted yesterday?

15:04 <ms[m]> yep

15:04 <ms[m]> we very briefly talked about it earlier today

15:05 <hkaiser> ahh

15:05 <hkaiser> it's a follow-up error, I'm just not sure what's causing what

15:05 <ms[m]> we know which assertion it is (the second, and it could really only be that one)

15:05 <hkaiser> right

15:05 <ms[m]> yeah, I suspect so too

15:05 <hkaiser> is it that the assert is the cause or the effect?

15:06 <ms[m]> hard to say

15:06 <ms[m]> I'm really struggling to see what could be wrong if it's the assert that's the cause

15:06 <hkaiser> the only explanation for the assert I have is that the thread_data went out of scope

15:07 <ms[m]> right, that would make sense

15:08 <ms[m]> but still, I've no idea what could cause that :/

15:08 <hkaiser> me neither

15:10 <ms[m]> in the stacktraces #62 is one level further down than what we saw earlier, i.e. in resume rather than notify_one

15:10 <ms[m]> mdiers: yt? in the set of callstacks you posted yesterday, which one is the one with the segfault?

15:11 <hkaiser> ms[m]: that could be caused by different inlining strategies the compiler applied in different places

15:12 <ms[m]> right, my point is that if it's in fact resume rather than notify_one it might be the agent_ref pointer that's messed up, not something in future_data

15:12 <ms[m]> not that that helps us much...

15:13 <hkaiser> ahh, that would coincide with the assert, possibly

15:19 <ms[m]> indeed, possibly :/

15:19 <ms[m]> you all know already, but in any case: https://github.com/STEllAR-GROUP/hpx/releases/tag/1.6.0-rc1

15:22 <k-ballo[m]> in that case, the x3 changes target 1.7.0

15:22 <ms[m]> k-ballo[m]: that can still go in (note: it's rc1)

15:23 <ms[m]> afaict it's only that one build failure with boost 1.66 remaining, the others seem ok

15:23 <k-ballo[m]> yeah but i don't want to, better merge them right after the release to give it plenty of time to be used

15:24 <ms[m]> as you wish

15:27 <zao> Started some builds of HPX with FBSD and hwloc changes, against hwloc 1.x, 2.3.0, 2.3.0 w/ patch, 2.4.0

15:31 <gnikunj[m]> ms: why is hpxMP support removed from 1.6?

15:31 <ms[m]> gnikunj: unmaintained

15:32 <ms[m]> (note that it might still be possible to build it separately, it's just not part of core hpx now)

15:32 <gnikunj[m]> got it

15:32 <gnikunj[m]> last I knew, Tianyi was maintaining it

15:37 <zao> (also 2.4.0 patch)

15:37 <zao> This seems unpromising.

15:48 bita has joined #ste||ar

15:49 <zao> hkaiser: Assuming I've managed to apply the patches correctly, everything not hwloc1 is broken in various ways: https://gist.github.com/zao/2b3e7f0edb6574aa8b3d25f256806deb

15:50 <zao> (gist has hwloc-info trees and logs for running numa_allocator_test on all five variants)

15:52 <zao> "patched" being https://github.com/open-mpi/hwloc/commit/45aaeb2010f32543cb3dd66fdfa784fad2597497.patch

15:58 <zao> hkaiser: Backtrace for the segfault in 240-patched: https://gist.github.com/zao/cc3e0d3fcf91fd1dd2a5f71949f8cd50

16:08 <hkaiser> zao: thanks! I thinkn we can safely disable this test for freebsd as it depends on functionalities that are disabled now

16:09 <zao> If you have recommendations of a test/example that I can run to verify that the topology stuff is working, it'd be great.

16:10 <hkaiser> zao: hello_world --hpx:print-bind might be a start

16:11 <zao> `hello_world_2` segfaults in the same `hpx::threads::topology::extract_node_mask` as the backtrace above.

16:13 <mdiers[m]> hkaiser: ms I have managed to extract the problem in an example. I'll just clear it up for a moment.

16:13 <hkaiser> mdiers[m]: excellent!

16:13 <hkaiser> zao: grrr

16:16 <hkaiser> zao: doesn't make sense if we assume that hwloc is not faulty

16:16 <hkaiser> the only thing that could cause a segfault is that obj points into nowhere

16:16 <hkaiser> and we've got that back from hwloc

16:17 <hkaiser> here: https://github.com/STEllAR-GROUP/hpx/blob/a03b1ebb35ecf8792a533857f41235d588fa1dc0/libs/core/topology/src/topology.cpp#L659

16:18 <zao> For ref: hwloc1 hello_world_1 either runs correctly or hangs, 230 and 240 bails with `hpx::init: hpx::exception caught: failed to initialize machine affinity mask: HPX(kernel_error)`, 230-patched and 240-patched has the crash.

16:19 <hkaiser> ok, so they screwed it up

16:19 <hkaiser> with that patch

16:37 <mdiers[m]> hkaiser: ms https://github.com/STEllAR-GROUP/hpx/pull/5117#issuecomment-768410664

16:39 <zao> Debugging is so much harder when the debugger crashes on printing values.

16:52 <zao> Some fields of this struct seem sketchy, arity is 0x3AB8800 which looks quite a lot like a pointer, elements of the children-array look weird.

16:53 <zao> https://gist.github.com/zao/bd65725cba83430120e9501105787c9c

17:03 <hkaiser> yah, the obj is messed up, thus the segfault

17:13 <zao> hrm, it may be a build issue, a clean rebuild after removing the system hwloc package doesn't crash, just does the good old hang.

17:14 <zao> hkaiser: Good news - https://gist.github.com/zao/c826f73531fccb5029061e06df9b94de

17:15 <hkaiser> what did you change?

17:15 <zao> Removed system hwloc1 package.

17:15 <hkaiser> lol

17:15 <hkaiser> good move

17:16 <zao> Our build may be finding those headers if something sneaks in an /usr/include on the build command line, haven't inspected those.

17:17 <zao> Anyway, an unpatched hwloc-2.4.0 still hangs our code around:

17:17 <zao> frame #3: 0x00000008031a058e libhpx_cored.so`hpx::threads::topology::extract_node_count(this=0x0000000802a6b038, parent=0x0000000803a63900, type=HWLOC_OBJ_PU, count=0) const at topology.cpp:689:42

17:17 <zao> frame #4: 0x00000008031a0a27 libhpx_cored.so`hpx::threads::topology::get_number_of_core_pus(this=0x0000000802a6b038, core=4) const at topology.cpp:841:20

17:21 <hkaiser> zao: ok, thanks

17:21 <hkaiser> that's their bug

17:22 <zao> We seem to be never managing to increment num_thread in decode_balanced_distribution.

17:22 <zao> But yeah, as long as we run right with the ports patch, it sounds fine.

17:30 <ms[m]> mdiers: thank you

17:30 <ms[m]> btw, you do use the default scheduler, right?

17:31 <ms[m]> or do you use any particular build/runtime options that are not the defaults?

17:31 <hkaiser> zao: if the cores are not available through hwloc, num-pus should be the same as num-cores

17:31 <hkaiser> or in this case the result should be 'one'

17:31 <hkaiser> (i.e. number of PUs per core)

18:17 <zao> I'm going to find a post-it note and put on this disk, telling me it's for HPX FreeBSD work and keep it for a while.

20:03 hkaiser has quit [Quit: bye]

21:46 hkaiser has joined #ste||ar

21:50 gonidelis[m] has quit [Ping timeout: 246 seconds]

21:50 parsa[m] has quit [Ping timeout: 246 seconds]

21:50 jpinto[m] has quit [Ping timeout: 240 seconds]

21:50 tiagofg[m] has quit [Ping timeout: 258 seconds]

21:50 mdiers[m] has quit [Ping timeout: 244 seconds]

21:50 teonnik has quit [Ping timeout: 268 seconds]

21:50 ms[m] has quit [Ping timeout: 260 seconds]

21:51 klaus[m] has quit [Ping timeout: 240 seconds]

21:51 pedro_barbosa[m] has quit [Ping timeout: 265 seconds]

21:51 rori has quit [Ping timeout: 246 seconds]

21:51 gnikunj[m] has quit [Ping timeout: 258 seconds]

21:51 k-ballo[m] has quit [Ping timeout: 265 seconds]

21:58 diehlpk has joined #ste||ar

21:58 diehlpk has quit [Changing host]

21:59 <diehlpk> hkaiser, https://koji.fedoraproject.org/koji/taskinfo?taskID=60679306

22:00 <diehlpk> HPX 1.6.0-rc1 on Fedora 34

22:00 <hkaiser> nice

22:01 <diehlpk_work> Let us see if HPX can handle gcc 11

22:02 <hkaiser> it's more the other way around ;-)

22:06 diehlpk has quit [Ping timeout: 244 seconds]

22:07 parsa has quit [Quit: Free ZNC ~ Powered by LunarBNC: https://LunarBNC.net]

22:09 parsa has joined #ste||ar

22:20 jaafar has quit [Ping timeout: 272 seconds]

22:25 <diehlpk_work> hkaiser, it seems to compile on x86

22:26 jaafar has joined #ste||ar

22:28 <zao> This is the machine I set up to debug with btw :D https://i.imgur.com/lSW7kea.jpg

22:28 <hkaiser> cool!

22:49 parsa has quit [Read error: Connection reset by peer]

22:50 parsa| has joined #ste||ar

22:50 parsa| is now known as parsa

23:12 gonidelis[m] has joined #ste||ar

23:13 parsa[m] has joined #ste||ar

23:23 gonidelis[m] has quit [Quit: Bridge terminating on SIGTERM]

23:23 parsa[m] has quit [Quit: Bridge terminating on SIGTERM]

23:24 pedro_barbosa[m] has joined #ste||ar

23:25 pedro_barbosa[m] has quit [Remote host closed the connection]

23:33 klaus[m] has joined #ste||ar

23:33 gonidelis[m] has joined #ste||ar

23:41 rori has joined #ste||ar

23:41 teonnik has joined #ste||ar

23:41 jpinto[m] has joined #ste||ar

23:41 heller1 has joined #ste||ar

23:41 pedro_barbosa[m] has joined #ste||ar

23:41 gnikunj[m] has joined #ste||ar

23:41 tiagofg[m] has joined #ste||ar

23:41 ms[m] has joined #ste||ar

23:41 mdiers[m] has joined #ste||ar

23:45 k-ballo[m] has joined #ste||ar

23:45 parsa[m] has joined #ste||ar