K-ballo changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/
<zao> hkaiser: I'm heading off to bed real soon now, but the test (numa_allocator_test) I used seems to loop forever with hwloc 2.3 and the current combination of the two PRs. Two Ctrl-C states of the process: https://gist.github.com/zao/3b9a60aea46e75af970bf0035500baf9
<zao> This hwloc 2 is the unpatched version from before the ports update that the original reporter made, so it should probably reproduce the original problem if I go and remove the fix PR.
<zao> (I've shot myself in the foot a bit by using the version from ports, can only ever have one version installed at once :D )
<hkaiser> yeah, I added some locks which might have been wrong :/
<hkaiser> removed now
hkaiser has quit [Quit: bye]
shahrzad has quit [Remote host closed the connection]
teonnik has quit [Ping timeout: 246 seconds]
klaus[m] has quit [Ping timeout: 246 seconds]
parsa[m] has quit [Ping timeout: 240 seconds]
k-ballo[m] has quit [Ping timeout: 240 seconds]
ms[m] has quit [Ping timeout: 240 seconds]
mdiers[m] has quit [Ping timeout: 240 seconds]
gnikunj[m] has quit [Ping timeout: 240 seconds]
jpinto[m] has quit [Ping timeout: 248 seconds]
tiagofg[m] has quit [Ping timeout: 265 seconds]
rori has quit [Ping timeout: 260 seconds]
pedro_barbosa[m] has quit [Ping timeout: 268 seconds]
gonidelis[m] has quit [Ping timeout: 268 seconds]
klaus[m] has joined #ste||ar
rori has joined #ste||ar
bita has quit [Ping timeout: 264 seconds]
ms[m] has joined #ste||ar
ms[m] has quit [Ping timeout: 246 seconds]
klaus[m] has quit [Ping timeout: 265 seconds]
rori has quit [Ping timeout: 260 seconds]
powderluv has quit [Quit: powderluv]
teonnik has joined #ste||ar
parsa[m] has joined #ste||ar
k-ballo[m] has joined #ste||ar
mdiers[m] has joined #ste||ar
gnikunj[m] has joined #ste||ar
pedro_barbosa[m] has joined #ste||ar
tiagofg[m] has joined #ste||ar
jpinto[m] has joined #ste||ar
gonidelis[m] has joined #ste||ar
ms[m] has joined #ste||ar
klaus[m] has joined #ste||ar
rori has joined #ste||ar
<ms[m]> mdiers: if you still have the debugger open with those threads (or a core dump) I'd also be interested in knowing which assertion on threads 61, 62, 64 is hit (I'm pretty sure it can only be the second), but particularly what the value of stacksize is
<mdiers[m]> ms: Sorry for my late reply. The line numbers are at the end of each line in the callstacks, between the filename and the address: `scheduler_base.hpp:273`
<ms[m]> mdiers: ah, indeed, thanks, so that's the expected assertion
<ms[m]> do you have access to the value of stacksize?
<mdiers[m]> ms: I'll let you know as soon as I catch it in the debugger again.
<diehlpk_work> ms[m], I will push the rc to Fedora today.
<ms[m]> mdiers: thanks
<ms[m]> diehlpk_work: thanks as well
<ms[m]> note that I've extended the stellar group signing key, but I'm not sure it's propagated to all/most servers yet
hkaiser has joined #ste||ar
<hkaiser> ms[m], rori: thanks for all the work on the release!
<ms[m]> hkaiser: thank you, all the hard work has been done before the release :)
<ms[m]> note that I didn't include the freebsd environment nor the hwloc prs in the rc because I wasn't sure what the status of those were
<hkaiser> ms[m]: nod
<hkaiser> ms[m]: zao confirmed yesterday that the freebsd PR is fine, not sure what his verdict on the hwloc one was
<ms[m]> did you conclude yesterday that they were ready to go in or do they need further testing (they probably do, but are they as tested as we will get them right now)?
<ms[m]> ok, so we can likely go ahead with that one then
<ms[m]> let's wait a bit to see if we get a confirmation about the hwloc one (either from zao or one of the reporters)
<hkaiser> nod
<gnikunj[m]> hkaiser: K-ballo why is it that some C++ features come in as warnings while others come in as errors when using, say, C++17/20 features on a compiler that defaults to C++14? For instance, using fold expressions on gcc 10.2 throws a warning (and the executable runs as expected) saying that I should enable -std=c++17 to use fold expressions
<hkaiser> gnikunj[m]: ask the compiler developers
<K-ballo> conforming extension, use -pedantic-errors if you want an error instead of a warning
<hkaiser> most likely because the feature was available before it got standardized, so this keeps existing code compiling
<gnikunj[m]> aah, that makes sense.
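A minimal sketch (not from the log) of the behaviour gnikunj[m] describes, assuming gcc 10.2 invoked as `g++ -std=c++14`: the fold expression compiles as a conforming extension and gcc only warns that -std=c++17 is needed, while -pedantic-errors turns that warning into a hard error.

```cpp
#include <iostream>

// C++17 fold expression; in -std=c++14 mode gcc accepts it as an extension
// and warns "fold-expressions only available with -std=c++17".
template <typename... Ts>
auto sum(Ts... ts)
{
    return (ts + ...);
}

int main()
{
    std::cout << sum(1, 2, 3, 4) << '\n';   // prints 10
}
```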
<ms[m]> gnikunj: also, clang is stricter than gcc about sticking only to features in the version you specify
<gnikunj[m]> got it. I was really curious what went on in the discussions that led them to emit warnings for some features and errors for others
<gnikunj[m]> things related to constexpr always seem to produce errors, while things like inline variables and fold expressions produce warnings and the executable behaves as expected
<K-ballo> there are no conforming constexpr extensions
<K-ballo> for instance, gcc's constexpr math extensions are non-conforming
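For reference, a tiny example of the non-conforming extension K-ballo mentions (a sketch, assuming gcc with libstdc++): gcc constant-folds the builtin behind std::sqrt and accepts the declaration, while a compiler that sticks to the standard, e.g. clang, rejects it because std::sqrt is not specified as constexpr here.

```cpp
#include <cmath>

// Accepted by gcc as a non-conforming constexpr math extension;
// rejected by clang ("not a constant expression").
constexpr double root_two = std::sqrt(2.0);

int main()
{
    return root_two > 1.0 ? 0 : 1;
}
```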
<gnikunj[m]> ohh. That reminds me of another question. Why aren't more algorithms constexpr?
<gnikunj[m]> C++20 does bring more constexpr algorithms, but what is stopping us from making most of the algorithms constexpr?
<K-ballo> mostly lack of proposals, but also some memcpy related concerns
<K-ballo> all non-allocating algorithms should be constexpr by now
<gnikunj[m]> but new and delete are now allowed with C++20 when used in the same constexpr function
<gnikunj[m]> why is memcpy an issue then?
<K-ballo> you can't memcpy in a constant expression
<K-ballo> but that was resolved with is_constant_evaluated()
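A hedged sketch of the std::is_constant_evaluated() escape hatch K-ballo refers to (C++20; the helper name copy_ints is made up for illustration): the constant-evaluation branch copies element by element, while the runtime branch is free to call std::memcpy, which is not usable in a constant expression.

```cpp
#include <cstring>
#include <type_traits>

constexpr void copy_ints(int* dst, int const* src, std::size_t n)
{
    if (std::is_constant_evaluated())
    {
        for (std::size_t i = 0; i != n; ++i)    // fine during constant evaluation
            dst[i] = src[i];
    }
    else
    {
        std::memcpy(dst, src, n * sizeof(int)); // runtime-only fast path
    }
}

constexpr int first_of_copy()
{
    int src[3] = {7, 8, 9};
    int dst[3] = {};
    copy_ints(dst, src, 3);
    return dst[0];
}

static_assert(first_of_copy() == 7);

int main() {}
```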
<K-ballo> for constexpr allocating algorithms it will take some constexpr implementations first, so that reference implementations can be tested, in order for proposals to come forward
<gnikunj[m]> makes sense. So we can expect some constexpr allocating algorithms by C++23?
<K-ballo> assuming they are actually implementable (they should be, memory is only used temporarily), then by some C++next yes
<gnikunj[m]> nod. Nice!
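As a small illustration of the C++20 rule the exchange above relies on (a sketch, not HPX code): new and delete are usable during constant evaluation as long as the allocation is released before the evaluation finishes, which is exactly the "memory is only used temporarily" property K-ballo mentions for constexpr allocating algorithms.

```cpp
// Requires -std=c++20 (transient constexpr allocation, P0784).
constexpr int sum_first_n(int n)
{
    int* buf = new int[n];          // allocated during constant evaluation
    for (int i = 0; i != n; ++i)
        buf[i] = i + 1;

    int sum = 0;
    for (int i = 0; i != n; ++i)
        sum += buf[i];

    delete[] buf;                   // must not leak out of the evaluation
    return sum;
}

static_assert(sum_first_n(4) == 10);

int main() {}
```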
<zao> ms[m]: just got up, gonna see how much work I must do before I can get something built again.
<hkaiser> zao: no rush
<hkaiser> ms[m]: yt?
<ms[m]> hkaiser: here
<hkaiser> ms[m]: wrt #5117
<hkaiser> have you seen the stack backtraces mdiers[m] posted yesterday?
<ms[m]> yep
<ms[m]> we very briefly talked about it earlier today
<hkaiser> ahh
<hkaiser> it's a follow-up error, I'm just not sure what's causing what
<ms[m]> we know which assertion it is (the second, and it could really only be that one)
<hkaiser> right
<ms[m]> yeah, I suspect so too
<hkaiser> is it that the assert is the cause or the effect?
<ms[m]> hard to say
<ms[m]> I'm really struggling to see what could be wrong if it's the assert that's the cause
<hkaiser> the only explanation for the assert I have is that the thread_data went out of scope
<ms[m]> right, that would make sense
<ms[m]> but still, I've no idea what could cause that :/
<hkaiser> me neither
<ms[m]> in the stacktraces #62 is one level further down than what we saw earlier, i.e. in resume rather than notify_one
<ms[m]> mdiers: yt? in the set of callstacks you posted yesterday, which one is the one with the segfault?
<hkaiser> ms[m]: that could be caused by different inlining strategies the compiler applied in different places
<ms[m]> right, my point is that if it's in fact resume rather than notify_one it might be the agent_ref pointer that's messed up, not something in future_data
<ms[m]> not that that helps us much...
<hkaiser> ahh, that would coincide with the assert, possibly
<ms[m]> indeed, possibly :/
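Not HPX code, but a minimal self-contained illustration of the lifetime pattern being discussed: if the waiter's shared state can be destroyed before the notifier runs, resume/notify_one touches freed memory; keeping a strong reference on both sides (here a shared_ptr, standing in for keeping thread_data alive) rules that out.

```cpp
#include <condition_variable>
#include <memory>
#include <mutex>
#include <thread>

struct shared_state
{
    std::mutex m;
    std::condition_variable cv;
    bool ready = false;
};

int main()
{
    // Both the waiter and the notifier hold a shared_ptr, so the state is
    // guaranteed to outlive cv.notify_one().
    auto state = std::make_shared<shared_state>();

    std::thread notifier([state] {              // copy keeps the state alive
        {
            std::lock_guard<std::mutex> l(state->m);
            state->ready = true;
        }
        state->cv.notify_one();                 // safe: state cannot be gone yet
    });

    {
        std::unique_lock<std::mutex> l(state->m);
        state->cv.wait(l, [&] { return state->ready; });
    }
    notifier.join();
}
```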
<ms[m]> you all know already, but in any case: https://github.com/STEllAR-GROUP/hpx/releases/tag/1.6.0-rc1
<k-ballo[m]> in that case, the x3 changes target 1.7.0
<ms[m]> k-ballo[m]: that can still go in (note: it's rc1)
<ms[m]> afaict it's only that one build failure with boost 1.66 remaining, the others seem ok
<k-ballo[m]> yeah, but I don't want to; better to merge them right after the release to give them plenty of time to be used
<ms[m]> as you wish
<zao> Started some builds of HPX with FBSD and hwloc changes, against hwloc 1.x, 2.3.0, 2.3.0 w/ patch, 2.4.0
<gnikunj[m]> ms: why is hpxMP support removed from 1.6?
<ms[m]> gnikunj: unmaintained
<ms[m]> (note that it might still be possible to build it separately, it's just not part of core hpx now)
<gnikunj[m]> got it
<gnikunj[m]> last I knew, Tianyi was maintaining it
<zao> (also 2.4.0 patch)
<zao> This seems unpromising.
bita has joined #ste||ar
<zao> hkaiser: Assuming I've managed to apply the patches correctly, everything not hwloc1 is broken in various ways: https://gist.github.com/zao/2b3e7f0edb6574aa8b3d25f256806deb
<zao> (gist has hwloc-info trees and logs for running numa_allocator_test on all five variants)
<zao> hkaiser: Backtrace for the segfault in 240-patched: https://gist.github.com/zao/cc3e0d3fcf91fd1dd2a5f71949f8cd50
<hkaiser> zao: thanks! I think we can safely disable this test for freebsd as it depends on functionality that is now disabled
<zao> If you have recommendations for a test/example that I can run to verify that the topology stuff is working, that'd be great.
<hkaiser> zao: hello_world --hpx:print-bind might be a start
<zao> `hello_world_2` segfaults in the same `hpx::threads::topology::extract_node_mask` as the backtrace above.
<mdiers[m]> hkaiser: ms I have managed to extract the problem into an example. I'll just clean it up quickly.
<hkaiser> mdiers[m]: excellent!
<hkaiser> zao: grrr
<hkaiser> zao: doesn't make sense if we assume that hwloc is not faulty
<hkaiser> the only thing that could cause a segfault is that obj points into nowhere
<hkaiser> and we've got that back from hwloc
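A standalone sketch against the plain hwloc C API (not the HPX topology code) of the defensive check implied here: every object returned by hwloc_get_obj_by_type has to be validated before its fields are dereferenced, since hwloc can legitimately return NULL for object types it does not report on a given platform.

```cpp
#include <hwloc.h>
#include <cstdio>

int main()
{
    hwloc_topology_t topo;
    if (hwloc_topology_init(&topo) != 0)
        return 1;
    if (hwloc_topology_load(topo) != 0)
    {
        hwloc_topology_destroy(topo);
        return 1;
    }

    int const ncores = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE);
    for (int i = 0; i < ncores; ++i)
    {
        hwloc_obj_t core = hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE, i);
        if (core == nullptr)
        {
            // don't touch arity/children of a missing object
            std::printf("core %d: not reported by hwloc\n", i);
            continue;
        }
        std::printf("core %d: arity %u\n", i, core->arity);
    }

    hwloc_topology_destroy(topo);
}
```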
<zao> For ref: hwloc1 hello_world_1 either runs correctly or hangs, 230 and 240 bail with `hpx::init: hpx::exception caught: failed to initialize machine affinity mask: HPX(kernel_error)`, and 230-patched and 240-patched have the crash.
<hkaiser> ok, so they screwed it up
<hkaiser> with that patch
<zao> Debugging is so much harder when the debugger crashes on printing values.
<zao> Some fields of this struct seem sketchy: arity is 0x3AB8800, which looks quite a lot like a pointer, and elements of the children array look weird.
<hkaiser> yah, the obj is messed up, thus the segfault
<zao> hrm, it may be a build issue: a clean rebuild after removing the system hwloc package doesn't crash, it just does the good old hang.
<hkaiser> what did you change?
<zao> Removed system hwloc1 package.
<hkaiser> lol
<hkaiser> good move
<zao> Our build may be finding those headers if something sneaks /usr/include onto the build command line; I haven't inspected that.
<zao> Anyway, an unpatched hwloc-2.4.0 still hangs our code around:
<zao> frame #3: 0x00000008031a058e libhpx_cored.so`hpx::threads::topology::extract_node_count(this=0x0000000802a6b038, parent=0x0000000803a63900, type=HWLOC_OBJ_PU, count=0) const at topology.cpp:689:42
<zao> frame #4: 0x00000008031a0a27 libhpx_cored.so`hpx::threads::topology::get_number_of_core_pus(this=0x0000000802a6b038, core=4) const at topology.cpp:841:20
<hkaiser> zao: ok, thanks
<hkaiser> that's their bug
<zao> We never seem to manage to increment num_thread in decode_balanced_distribution.
<zao> But yeah, as long as we run right with the ports patch, it sounds fine.
<ms[m]> mdiers: thank you
<ms[m]> btw, you do use the default scheduler, right?
<ms[m]> or do you use any particular build/runtime options that are not the defaults?
<hkaiser> zao: if the cores are not available through hwloc, num-pus should be the same as num-cores
<hkaiser> or in this case the result should be 'one'
<hkaiser> (i.e. number of PUs per core)
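A sketch of the fallback hkaiser describes (not the actual HPX implementation; pus_per_core is a made-up helper name): count the PUs below a core with hwloc, and if hwloc reports none, treat the core as having exactly one PU instead of looping or crashing.

```cpp
#include <hwloc.h>
#include <cstdio>

// hypothetical helper, for illustration only
int pus_per_core(hwloc_topology_t topo, hwloc_obj_t core)
{
    if (core == nullptr || core->cpuset == nullptr)
        return 1;                                   // fallback: one PU per core

    int const n = hwloc_get_nbobjs_inside_cpuset_by_type(
        topo, core->cpuset, HWLOC_OBJ_PU);
    return n > 0 ? n : 1;                           // fallback if hwloc reports none
}

int main()
{
    hwloc_topology_t topo;
    if (hwloc_topology_init(&topo) != 0 || hwloc_topology_load(topo) != 0)
        return 1;

    int const ncores = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE);
    for (int i = 0; i < ncores; ++i)
    {
        hwloc_obj_t core = hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE, i);
        std::printf("core %d: %d PU(s)\n", i, pus_per_core(topo, core));
    }

    hwloc_topology_destroy(topo);
}
```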
<zao> I'm going to find a post-it note and put on this disk, telling me it's for HPX FreeBSD work and keep it for a while.
hkaiser has quit [Quit: bye]
hkaiser has joined #ste||ar
gonidelis[m] has quit [Ping timeout: 246 seconds]
parsa[m] has quit [Ping timeout: 246 seconds]
jpinto[m] has quit [Ping timeout: 240 seconds]
tiagofg[m] has quit [Ping timeout: 258 seconds]
mdiers[m] has quit [Ping timeout: 244 seconds]
teonnik has quit [Ping timeout: 268 seconds]
ms[m] has quit [Ping timeout: 260 seconds]
klaus[m] has quit [Ping timeout: 240 seconds]
pedro_barbosa[m] has quit [Ping timeout: 265 seconds]
rori has quit [Ping timeout: 246 seconds]
gnikunj[m] has quit [Ping timeout: 258 seconds]
k-ballo[m] has quit [Ping timeout: 265 seconds]
diehlpk has joined #ste||ar
diehlpk has quit [Changing host]
<diehlpk> HPX 1.6.0-rc1 on Fedora 34
<hkaiser> nice
<diehlpk_work> Let us see if HPX can handle gcc 11
<hkaiser> it's more the other way around ;-)
diehlpk has quit [Ping timeout: 244 seconds]
parsa has quit [Quit: Free ZNC ~ Powered by LunarBNC: https://LunarBNC.net]
parsa has joined #ste||ar
jaafar has quit [Ping timeout: 272 seconds]
<diehlpk_work> hkaiser, it seems to compile on x86
jaafar has joined #ste||ar
<zao> This is the machine I set up to debug with btw :D https://i.imgur.com/lSW7kea.jpg
<hkaiser> cool!
parsa has quit [Read error: Connection reset by peer]
parsa| has joined #ste||ar
parsa| is now known as parsa
gonidelis[m] has joined #ste||ar
parsa[m] has joined #ste||ar
gonidelis[m] has quit [Quit: Bridge terminating on SIGTERM]
parsa[m] has quit [Quit: Bridge terminating on SIGTERM]
pedro_barbosa[m] has joined #ste||ar
pedro_barbosa[m] has quit [Remote host closed the connection]
klaus[m] has joined #ste||ar
gonidelis[m] has joined #ste||ar
rori has joined #ste||ar
teonnik has joined #ste||ar
jpinto[m] has joined #ste||ar
heller1 has joined #ste||ar
pedro_barbosa[m] has joined #ste||ar
gnikunj[m] has joined #ste||ar
tiagofg[m] has joined #ste||ar
ms[m] has joined #ste||ar
mdiers[m] has joined #ste||ar
k-ballo[m] has joined #ste||ar
parsa[m] has joined #ste||ar