<hkaiser>
diehlpk_work: are there any JOSS rules for the ordering of authors?
<hkaiser>
should we sort them based on their contributions?
<hkaiser>
currently it's all over the place
<diehlpk_work>
hkaiser, it was supposed to be alphabetical
<diehlpk_work>
Our SC panel is on Wednesday, 18 November 2020
<diehlpk_work>
3pm - 4:30pm
<hkaiser>
diehlpk_work: it is not alphabetical, however
hkaiser has quit [Quit: bye]
nanmiao11 has quit [Remote host closed the connection]
akheir has quit [Quit: Leaving]
bita has joined #ste||ar
bita has quit [Ping timeout: 244 seconds]
kordejong has joined #ste||ar
kale[m] has joined #ste||ar
kordejong has quit [Changing host]
kordejong has joined #ste||ar
kale[m] has quit [Read error: Connection reset by peer]
kale[m] has joined #ste||ar
<kordejong>
My current experiment distributes HPX processes over the 8 NUMA nodes in a single cluster node. `main` is executed on one of these localities. During some of the runs, one of the processes keeps hanging. `hpx::finalize` has always exited, but `hpx::init` hasn't, it seems. `top` shows that most NUMA nodes are idling, but at least one NUMA node is still doing something (all 6 cores at 100%). Any idea what happens after calling `hpx::finalize` in `hpx::init` that could be causing this behaviour? I could use some inspiration as to where to look for the cause of this. Platform: Linux, gcc-10, HPX-1.5.0-rc2, Slurm, OpenMPI.
<kordejong>
BTW, the information I need from the experiment is ready and fine. It seems some finalization code is hanging.
<kordejong>
Since experiments are launched from shell scripts, hanging experiments prevent subsequent experiments from being launched.
<ms[m]>
"Kor de Jong" (https://matrix.to/#/@kordejong:matrix.org): are you able to determine if it's locality 0 that hangs or some other locality? Can you reproduce the hangs with --hpx:debug-hpx-log? How reproducible are the hangs?
<kordejong>
The hangs are pretty reproducible, with a Release build. Just ran a Release build with `--hpx:debug-hpx-log` and the formerly hanging experiment finished without hanging. Log is 13G. `hpx::init` exited OK.
<kordejong>
It seems subtle. The calculations done in my code always seem fine and finished (JSON file with results, only written after waiting for all tasks to finish). But some array sizes + partition sizes result in a process that hangs upon exiting.
<kordejong>
So even though some processes hang, they do so after finishing my tasks and writing the results. Even after finishing `hpx::finalize`.
<kordejong>
Same thing with multiple cluster nodes (tested 4 cluster nodes == 32 NUMA nodes == 32 localities). Upon exiting, one locality (== NUMA node) is active (all cores at 100%), while the other 31 are idling. The whole thing hangs.
<ms[m]>
Kor de Jong: do you do any explicit work on the main thread pool or timer pool? unfortunately I think those stack traces are a red herring and just show us that hpx is waiting to shut down
<ms[m]>
"some array sizes + partition sizes result in a process that hangs" does this mean it only happens with these sizes and it happens every time with those sizes, and never with other sizes?
K-ballo has quit [Ping timeout: 240 seconds]
K-ballo has joined #ste||ar
<kordejong>
No, I don't do work explicitly on certain pools. Yes, some array/partition sizes consistently hang, while others don't. Many combinations don't hang, but some do. During scalability tests, there often is a combination that hangs. I haven't seen a hang while scaling over cores in a single NUMA node (== 1 locality). The issue seems restricted to using multiple localities.
<kordejong>
Also, I have not been able to reproduce a hang using a Debug build. With the same array/partition sizes, neither a Debug build nor a Release build run with the `--hpx:debug-hpx-log` argument hangs.
hkaiser has joined #ste||ar
<kordejong>
Using the tcp parcelport instead of mpi doesn't help (`--hpx:ini="hpx.parcel.mpi.enable=0" --hpx:ini="hpx.parcel.tcp.enable=1"`)
<gonidelis[m]>
it just pops up when running `make tests.unit.modules.algorithms.partitioned_vector_transform_binary3`, while everything else builds fine
<hkaiser>
that's not a cmake error
<ms[m]>
Kor de Jong: not sure how to best debug that... you could try printing the thread count performance counter
<ms[m]>
that way we'd know if you have some thread lingering longer than it should
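One way to print that counter at runtime, per the HPX performance-counter documentation (the executable name and interval are placeholders):

```sh
./my_app --hpx:print-counter=/threads{locality#*/total}/count/instantaneous/all \
         --hpx:print-counter-interval=1000
```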
<hkaiser>
gonidelis[m]: you most likely have a mismatch between the definition and the declaration of transform_()
<hkaiser>
kordejong: forgot to call hpx::finalize()?
<zao>
gonidelis[m]: It's a linker error, so either the object file/library containing that symbol isn't provided when linking this thing, or the symbol isn't present in the libraries.
<zao>
You may have a declaration without a corresponding definition, or a promise of a specialization that isn't there, or something. Squint at the error and try to see what it's missing.
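A generic illustration of the kind of mismatch zao and hkaiser describe (hypothetical code, not HPX's actual transform_):

```cpp
// header-style declaration: promises transform_(int)
void transform_(int value);

// definition in some source file: actually defines transform_(long),
// a *different* overload, so transform_(int) is never emitted
void transform_(long value) {}

int main()
{
    transform_(42);  // overload resolution picks transform_(int):
                     // compiles fine, fails at link time with
                     // "undefined reference to transform_(int)"
}
```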
<ms[m]>
hkaiser: most likely not... would've been nice ;)
<ms[m]>
there's a bit more context in the logs if you want to catch up on that
<hkaiser>
ok
<hkaiser>
could have been easy ;-)
<hkaiser>
ms[m]: Kokkos meeting today?
<ms[m]>
hkaiser: yes
<ms[m]>
gdaiss as well!? :)
<kordejong>
<hkaiser "Kor de Jong: forgot to call hpx:"> No, it is there, and a print afterwards gets printed, just before returning the result of calling `hpx::finalize` from `hpx_main`. That's why I think the hanging code is in `hpx::init`, after its call to `hpx_main`.
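For reference, the startup/shutdown pattern under discussion looks roughly like this (a minimal sketch; the print stands in for kordejong's debug output):

```cpp
#include <hpx/hpx_init.hpp>
#include <iostream>

int hpx_main(int argc, char* argv[])
{
    // ... application work ...
    std::cout << "about to finalize" << std::endl;  // this print is seen
    return hpx::finalize();  // only *requests* shutdown; hpx_main then returns
}

int main(int argc, char* argv[])
{
    // hpx::init blocks until the runtime has fully shut down; this is the
    // call that never returns in the hanging runs.
    return hpx::init(argc, argv);
}
```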
<hkaiser>
in 15 minutes?
<ms[m]>
in 10-15 minutes
<hkaiser>
kordejong: finalize only tells the runtime to exit once all activity has ceased
<hkaiser>
that means that there is still some work in the queues if init does not return, most likely some suspended thread that didn't get resumed
<hkaiser>
ms[m]: ok
<hkaiser>
possibly a future that was never made ready or an exception that got swallowed by some code not handling them
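A contrived sketch of that failure mode (hypothetical code, not from kordejong's application): a task suspended on a future that is never made ready keeps work in the queues, so `hpx::init` cannot return even though `hpx::finalize` was already called.

```cpp
#include <hpx/hpx_init.hpp>
#include <hpx/include/async.hpp>
#include <hpx/include/lcos.hpp>
#include <utility>

int hpx_main()
{
    // Deliberately leaked so the shared state never becomes ready
    // (a stand-in for a lost set_value()/set_exception()).
    auto* p = new hpx::lcos::local::promise<void>();
    hpx::future<void> never_ready = p->get_future();

    // This HPX thread suspends waiting for a value that is never set ...
    hpx::future<void> task = hpx::async(
        [f = std::move(never_ready)]() mutable { f.get(); });

    // ... so the runtime keeps waiting for it during shutdown.
    return hpx::finalize();
}

int main(int argc, char* argv[])
{
    return hpx::init(argc, argv);  // hangs: work is still queued
}
```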
<kordejong>
@hkaiser Ah, yes. I will think about that some more. Thanks.
<hkaiser>
hpx::wait_all() is often a culprit as it waits for the futures, but doesn't rethrow exceptions
<hkaiser>
kordejong: run it with --hpx:attach-debugger=exception in interactive mode, that will break execution if an exception is thrown
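A short sketch of the `wait_all` pitfall (hypothetical code): the exception stays stored in the future until `get()` is called.

```cpp
#include <hpx/include/async.hpp>
#include <hpx/include/lcos.hpp>
#include <stdexcept>
#include <vector>

void example()
{
    std::vector<hpx::future<void>> futures;
    futures.push_back(
        hpx::async([] { throw std::runtime_error("swallowed?"); }));

    hpx::wait_all(futures);  // returns normally, does *not* rethrow

    for (auto& f : futures)
        f.get();  // rethrows the stored exception, making it visible
}
```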
kale[m] has quit [Ping timeout: 264 seconds]
kale[m] has joined #ste||ar
akheir has joined #ste||ar
<gonidelis[m]>
hkaiser: There is a serious possibility that I have successfully changed all the result types in `transform`, all three of them: under /algorithms, /container_algorithms and /segmented_algorithms
<hkaiser>
lol
<gonidelis[m]>
I 'll wait for the github tests to verify
<gonidelis[m]>
hkaiser: meanwhile I will try to complete all the remaining overloads with tag_invoke
<hkaiser>
thanks a lot
<rori>
freenode_gonidelis[m]: serious probability or serious possibility? that changes the odds :P
<gonidelis[m]>
rori_[m]: let's stay with "possibility", although I don't know what the difference between the two is...
nanmiao11 has joined #ste||ar
<gonidelis[m]>
question
<gonidelis[m]>
http://eel.is/c++draft/alg.transform#itemdecl:1 The algos with exec-policy return `ForwardIterator2` while the algos without exec-policy return sth like `constexpr OutputIterator`
<gonidelis[m]>
does that mean that my `tag_invoke`s return sth like `friend FwdIter2` when there's no exec-policy, and sth like `friend typename parallel::util::detail::algorithm_result<ExPolicy, FwdIter2>::type` when we do have an exec-policy?
<gonidelis[m]>
is that analogy correct? hkaiser
<K-ballo>
friend is not part of the type
<gonidelis[m]>
yeah, you've said that before... sorry
<K-ballo>
no that was constexpr
<K-ballo>
same thing though
<gonidelis[m]>
K-ballo: I hate your hard-drive capacity
<hkaiser>
turn it into an hpx::ranges::transform, then
<gonidelis[m]>
hkaiser: ok great thanks... should I move it under the corresponding test directory too then?
<hkaiser>
please feel free to do that, it's not that important, however
<gonidelis[m]>
ok thanks :)
<gonidelis[m]>
hkaiser: does a `parallel::util::detail::algorithm_result<ExPolicy, in_out_result<FwdIter1, FwdIter2>>::type foo` have `foo.in` and `foo.out` members?
<hkaiser>
only if it's being instantiated with a non-task execution policy, otherwise it will be `future<in_out_result<>>`
<gonidelis[m]>
hmm ok
shahrzad has joined #ste||ar
<hkaiser>
gonidelis[m]: that's the whole point of having the algorithm_result<> type, to be able to distinguish between synchronous and asynchronous execution results
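A sketch of that distinction at the call site (headers and policy names vary between HPX versions; the containers and the `twice` lambda are made up for illustration):

```cpp
#include <hpx/algorithm.hpp>
#include <hpx/execution.hpp>
#include <hpx/hpx_init.hpp>
#include <iterator>
#include <vector>

int hpx_main()
{
    std::vector<int> c{1, 2, 3};
    std::vector<int> d(c.size());
    auto twice = [](int x) { return 2 * x; };

    // Non-task policy: the in_out_result comes back directly,
    // so .in/.out are immediately accessible.
    auto r = hpx::ranges::transform(
        hpx::execution::par, c, std::begin(d), twice);
    auto last_read = r.in;     // iterator past the last element read
    auto last_written = r.out; // iterator past the last element written

    // Task policy: algorithm_result wraps the same thing in a future.
    auto f = hpx::ranges::transform(
        hpx::execution::par(hpx::execution::task), c, std::begin(d), twice);
    auto r2 = f.get();  // .in/.out only once the future is ready

    return hpx::finalize();
}

int main(int argc, char* argv[]) { return hpx::init(argc, argv); }
```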
<gonidelis[m]>
but does `tag_invoke(rng1, rng2)` recognize that?
<gonidelis[m]>
plus this last one that you sent me has a different `O` dest
<hkaiser>
why do you think it has a different 'O'?
<gonidelis[m]>
`transform(R1&& r1, R2&& r2, O result)` that's its decl
<hkaiser>
yes, that's exactly what this test is invoking, no?
<hkaiser>
well, plus the operator
<hkaiser>
transform(R1&& r1, R2&& r2, O result, BinaryOp)
<gonidelis[m]>
does `std::begin(c2)` indicate a range?
<gonidelis[m]>
if we match the args, that should be rng2
<hkaiser>
no, it's the same as `c2.begin()`
<gonidelis[m]>
and `std::begin(d1)` should be `O result`
<gonidelis[m]>
hkaiser: exactly
<hkaiser>
ah, now I see what you mean
<zao>
(you can std::begin other things too, like C-arrays)
<hkaiser>
ok, this should be `hpx::ranges::transform(policy, c1, c2, std::begin(d1), add)` instead
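Lining the corrected call up against the overload quoted above (hypothetical data; the ranges bind to `R1`/`R2`, the single iterator to `O`):

```cpp
#include <hpx/algorithm.hpp>
#include <hpx/execution.hpp>
#include <iterator>
#include <vector>

void sketch()
{
    std::vector<int> c1{1, 2, 3}, c2{4, 5, 6}, d1(3);
    auto add = [](int a, int b) { return a + b; };

    hpx::ranges::transform(hpx::execution::par,
        c1,              // R1&& r1: a whole range
        c2,              // R2&& r2: a whole range
        std::begin(d1),  // O result: a single output iterator, not a range
        add);            // BinaryOp
}
```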
<gonidelis[m]>
ahhh thanks
<gonidelis[m]>
now it clicks
<gonidelis[m]>
:D
weilewei40 has joined #ste||ar
weilewei40 has quit [Remote host closed the connection]
weilewei31 has joined #ste||ar
<gonidelis[m]>
hkaiser: there is a strong possibility that I am done with the CPOs :D ... like, all of them!!!
<gonidelis[m]>
I shall wait for the GitHub tests to complain if they want to, and then I need to fix the docs, render as depracated the depracated ones, and move stuff into the proper namespace in general, but these are trivial. The basic functionality should be complete ;)
nanmiao1176 has quit [Ping timeout: 245 seconds]
weilewei has quit [Ping timeout: 245 seconds]
<gonidelis[m]>
(plus add some binary tests overloads without exec-policy ;p)
<hkaiser>
\o/
<K-ballo>
deprecate, e-e-a
<gonidelis[m]>
K-ballo: that changes everything... kidding. thanks for the tip ;)
kale[m] has quit [Ping timeout: 260 seconds]
kale[m] has joined #ste||ar
<gonidelis[m]>
hkaiser: I have pushed the changes... you can take a look whenever you want if you want to discuss it in tomorrow's meeting
<hkaiser>
k
shahrzad_ has joined #ste||ar
nanmiao11 has joined #ste||ar
shahrzad has quit [Ping timeout: 264 seconds]
nanmiao11 has quit [Remote host closed the connection]
nanmiao11 has joined #ste||ar
nanmiao11 has quit [Remote host closed the connection]
weilewei31 has quit [Remote host closed the connection]
<hkaiser>
K-ballo: thanks! looks much better than the last one ;-)
<K-ballo>
the last one was at 331, you can see it under "revisions"
<K-ballo>
the first one in there was 702 :/
<hkaiser>
I added it to #3440
<hkaiser>
strange, I thought we had gotten rid of program_options
<hkaiser>
it's the compatibility fallback that's including the boost/program_options stuff, makes sense
nanmiao11 has joined #ste||ar
weilewei has quit [Remote host closed the connection]
weilewei has joined #ste||ar
<weilewei>
hkaiser For our paper, I am thinking of creating a tag for the HPX-enabled DCA in the ste||ar repo: https://github.com/STEllAR-GROUP/DCA/. Shall I merge the hpx_pr branch into the master branch of ste||ar/dca, and then create a tag for it?
<weilewei>
Or what will be an alternative solution?
<hkaiser>
weilewei: sounds good
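A possible command sequence for that (the tag name is made up):

```sh
git checkout master
git merge hpx_pr
git tag -a hpx-paper-2020 -m "HPX-enabled DCA used for the paper"
git push origin master hpx-paper-2020
```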
<weilewei>
ok, nice
weilewei has quit [Remote host closed the connection]
weilewei has joined #ste||ar
kale[m] has quit [Ping timeout: 260 seconds]
kale[m] has joined #ste||ar
kale[m] has quit [Ping timeout: 264 seconds]
kale[m] has joined #ste||ar
akheir has quit [Remote host closed the connection]
akheir has joined #ste||ar
kale[m] has quit [Ping timeout: 272 seconds]
kale[m] has joined #ste||ar
<hkaiser>
gnikunj[m]: would you mind having a look at Nan's problem on QBC?
<hkaiser>
things seem to have worked there for you without issues...
<nanmiao11>
=D I think it's early morning for him right now