hkaiser changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/ | GSoC2018: https://wp.me/p4pxJf-k1
<nikunj>
zao, my implementation seems to be working fine other than 1 failing test that I think is unrelated to my implementation
<nikunj>
My friend is using a Macbook Air 2015 with High Sierra
<nikunj>
I can ask for his laptop's spec sheet if you want me to
jgolinowski has joined #ste||ar
<jbjnr>
nikunj: can you ask him to rerun the failing test a few times ("ctest -R tests.regressions.parallel.static_chunker_2282 -VV") and see if it fails every time with the same error, or if it is random and only fails occasionally.
<nikunj>
jbjnr, ok I'll ask him to rerun
<jbjnr>
or even just rerun all the tests and see if the same one fails, or if others fail occasionally
<nikunj>
alright, I'll ask him to run the test suite multiple times
<jbjnr>
I'm worried that certain tests randomly fail and that there is a bug somewhere deep in the runtime that we have not located (for years)
<zao>
Note that this is nikunj's PR, don't know how master fares.
<jbjnr>
interesting stack trace. the static chunker needs a config var to know how big the default chunk is. it has to query this from the config and that pulls in all the RP stuff. very messy.
<heller>
ugg
<jbjnr>
I will take a look when I have a moment - I think an #issue needs to be filed with that stack trace included
<heller>
--> lazy init :/
<zao>
ls
<jbjnr>
ms[m]1: and I have been chatting about lazy_init on slack
<ste||ar-github>
[hpx] msimberg opened pull request #3401: Fix cuda_future_helper.h when compiling with C++11 (master...fix-cuda-future-helper) https://github.com/STEllAR-GROUP/hpx/pull/3401
ste||ar-github has left #ste||ar [#ste||ar]
<zao>
nikunj: aye
<heller>
__linux
<jbjnr>
the error "no member named 'include_libhpx_wrap'" looks like it might be nikunj's fault
<nikunj>
jbjnr, yes
<nikunj>
seems like it is accessing it on non-Linux systems
<heller>
FIX IT!
<nikunj>
heller, on it
<jbjnr>
heller: ms[m]1 and I were talking about lazy_init and he says it's bad because stacks don't get reused properly. I want to fix it, but he thinks you had a plan for it already. Do you?
<heller>
jbjnr: I wish I had completed it a long time ago
<heller>
jbjnr: I don't have a concrete plan
<heller>
jbjnr: but I 100% agree with ms[m]1's verdict
<jbjnr>
we need thread-local stacks that are assigned to tasks and reused in a NUMA/thread-locally sensitive way
<heller>
++
<jbjnr>
ok
<jbjnr>
then I will look into it
<jbjnr>
ms[m]1: yt?
<jbjnr>
opinions?
<ms[m]1>
^heller's verdict
<ms[m]1>
uhm, nothing more to add
<ms[m]1>
I've told you everything I know(/remember)
<jbjnr>
ok
<ms[m]1>
heller, my problem was about where to handle the stack reuse
<jbjnr>
main question from me then is - if you said the coroutines are allocating the stacks - can we easily hook into that and change it - or is it in boost code somewhere?
<ms[m]1>
the scheduler does it now, but the coroutines handle stack allocation, did you have any thoughts on where it would be nicest to have it?
<heller>
no
<jbjnr>
asking me or heller?
<heller>
I haven't thought it through yet
<heller>
any place is as bad as the other, i guess
<ms[m]1>
great...
<ms[m]1>
ok
<jbjnr>
we need a memory manager that is visible from the scheduler and from the coroutine - it would manage the stacks of stacks
<jbjnr>
using the numa aware allocators
<heller>
so, the place where we need to get the stack is when calling thread_data::operator()
<heller>
that is, when there's no stack associated with the specific coroutine, we need to get some
<jbjnr>
ok
<ms[m]1>
basically where you have if (stack_ptr == nullptr) allocate_one, you would have if (stack_ptr == nullptr) get_existing_or_allocate_one
<heller>
right
<heller>
the question is, where the 'get_existing_or_allocate_one' state resides
<heller>
and also, once the coroutine is done, you need to give it back
<heller>
thread locally
<heller>
or delete it, if it is from another NUMA domain
<heller>
you could save the origin NUMA domain in the first bytes of the stack
<jbjnr>
not great - you have to hit the memory to query it and waste a possible cache line
<ms[m]1>
but you'd know when you allocate where you are
<ms[m]1>
unless there are no thread bindings
<jbjnr>
better to just say get_existing_or_allocate_one(thread_info) and allow the memory manager to do the right thing
ste||ar-github has joined #ste||ar
<ste||ar-github>
[hpx] NK-Nikunj opened pull request #3402: Allow debug option to be enabled only for Linux systems with dynamic main on (master...fix_debug) https://github.com/STEllAR-GROUP/hpx/pull/3402
ste||ar-github has left #ste||ar [#ste||ar]
<jbjnr>
(the scheduler knows where the task will run and it should give it a stack from its cache of stacks).
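A minimal sketch of the hook discussed above, assuming a hypothetical per-worker-thread stack cache. The names (stack_pool, get_existing_or_allocate_one, give_back, this_thread_stacks) are illustrative only and do not reflect HPX's actual classes; std::malloc stands in for the NUMA-aware allocators mentioned earlier.

```cpp
#include <cstddef>
#include <cstdlib>
#include <vector>

// Illustrative sketch only: a per-worker-thread cache of coroutine stacks.
struct stack_pool
{
    std::size_t stack_size;
    std::vector<void*> free_stacks;   // stacks handed back by finished tasks

    explicit stack_pool(std::size_t size = 0x10000) : stack_size(size) {}

    ~stack_pool()
    {
        for (void* s : free_stacks)
            std::free(s);
    }

    void* get_existing_or_allocate_one()
    {
        if (!free_stacks.empty())
        {
            void* s = free_stacks.back();   // reuse a warm, thread-local stack
            free_stacks.pop_back();
            return s;
        }
        // no cached stack available: fall back to a fresh allocation
        // (a real implementation would use the NUMA-aware allocators)
        return std::malloc(stack_size);
    }

    void give_back(void* s)
    {
        free_stacks.push_back(s);           // keep it around for the next task
    }
};

// one pool per scheduling thread, so the fast path needs no locking
thread_local stack_pool this_thread_stacks;
```

The call site ms[m]1 mentions would then change from "if (stack_ptr == nullptr) allocate_one" to "if (stack_ptr == nullptr) stack_ptr = this_thread_stacks.get_existing_or_allocate_one();".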
<nikunj>
heller, I just added a PR that should fix the current master
<heller>
jbjnr: well, since stacks might migrate along with the task, you need to know where the task was allocated...
<heller>
it's about dispensing the stack
<heller>
we don't want to keep it around if it came from another NUMA domain
<jbjnr>
if the thread_data stores the id of the pu it ran on, then that info is available
<jbjnr>
the scheduler knows all this
<heller>
how is that different from storing it in the first bytes of the stack ;)?
<jbjnr>
when the task finishes, it gives the stack back
<heller>
sure
<jbjnr>
because storing it in the stack itself is just a hack
<jbjnr>
smells
<jbjnr>
and won't be portable to new architectures with different memory semantics
<jbjnr>
oh my stack is on the GPU, I'll just fetch that memory and query it ...
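A rough sketch of the release path jbjnr argues for, reusing the hypothetical stack_pool above: the origin NUMA domain comes from metadata the scheduler already tracks (e.g. in thread_data) rather than from bytes stored in the stack itself. current_numa_domain and release_stack are made-up names for illustration.

```cpp
#include <cstdlib>

// Placeholder: in a real implementation the current domain would come from
// the scheduler / resource partitioner (e.g. via hwloc), not a constant.
int current_numa_domain() { return 0; }

// Hypothetical release path: decide what to do with a stack when its task
// finishes, based on scheduler metadata instead of tagging the stack memory.
void release_stack(stack_pool& local_pool, void* stack, int origin_numa_domain)
{
    if (origin_numa_domain == current_numa_domain())
        local_pool.give_back(stack);   // NUMA-local: cache it for reuse
    else
        std::free(stack);              // remote: don't pollute the local cache
}
```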
<nikunj>
this should fix the broken master on mac OS
<zao>
Not sure when it actually builds that bit, but seems to have gotten a fair bit into the build thus far.
nikunj[m] has joined #ste||ar
<zao>
nikunj: Is there any particular reason you're using all those different defines to identify Linux?
<zao>
All you need is __linux__, the rest are so obsolete that there's no chance of them ever occurring.
<nikunj[m]>
zao: yes, hkaiser told me that these different linux #defines specify different linux kernels
<nikunj[m]>
so he asked me to split it into these 3
<zao>
predef.sf.net claims that the other two are obsolete.
<nikunj[m]>
I previously had only __linux__
<zao>
I don't know what his source is, of course.
<nikunj[m]>
zao: the PR should have fixed things for master. The error was related to the compiler not being able to find the symbol out of the hpx_start namespace due to those #defines
<zao>
Anyway, it's not a fight I care to take, but I'd be EXTREMELY surprised if anything relied on the non-POSIX compliant definitions on any system made this century.
<zao>
It's just noise.
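For reference, a small sketch of the guard being discussed: __linux__ is the macro compilers define in standards-conforming mode, while "linux" and "__linux" are the older spellings predef.sf.net marks as obsolete. The MY_LINUX_BUILD name below is purely illustrative, not an HPX macro.

```cpp
// Detecting Linux at preprocessor time.
#if defined(__linux__) || defined(__linux) || defined(linux)
#  define MY_LINUX_BUILD 1
#endif

#include <cstdio>

int main()
{
#if defined(MY_LINUX_BUILD)
    std::puts("building for Linux");
#else
    std::puts("not a Linux build");
#endif
}
```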
<nikunj[m]>
jbjnr: did anything get updated on CDash?
<jbjnr>
I did not see anything
<jbjnr>
maybe I screwed up with the commands
<jbjnr>
lunch ... bbiab
<nikunj[m]>
jbjnr: I'll ask him to share a gist in that case. He did tell me that some of the tests later timed out. I don't know which ones though
daissgr has joined #ste||ar
daissgr has quit [Client Quit]
daissgr has joined #ste||ar
<nikunj[m]>
zao: did it build correctly?
<zao>
[1343/1433] Building CXX object tests/unit/util/CMakeFiles/any_serialization_test_exe.dir/any_serialization.cpp.o
<zao>
macs are many things, but fast they're not :)
<nikunj[m]>
so true!
<zao>
tests.regressions.parallel.static_chunker_2282 hangs on 3402 as well.
<zao>
(that is, pretty much master)
<nikunj[m]>
static_chunker is among the tests that usually fail on mac (from my observation)
<nikunj[m]>
I remember narrowing the failing tests down to static_chunker_2282
<nikunj[m]>
but that's the only test I could not pass during my mac OS implementation a month ago
<zao>
jbjnr: Could this kind of thing be related to the kind of generic/whatever context we have for coroutines on macOS vs. real OSes? I don't quite get why this doesn't manifest on Linux.
<hkaiser>
I think it's trying to initialize a global object before hpx is up and running
<hkaiser>
nikunj: do you copy this?
<nikunj>
yes I do, but that should not have happened
<hkaiser>
heh
mcopik has quit [Ping timeout: 240 seconds]
<hkaiser>
is there any bug that 'should happen'?
<nikunj>
my implementation only calls init
<hkaiser>
sure
<hkaiser>
not blaming you
<nikunj>
and it works as a function wrapper, so everything should have been initialized by then. I will look further to see if it's related to my implementation
<hkaiser>
ahh, the test is wrong
<hkaiser>
it should #include hpx_main.hpp, not hpx_init.hpp
<hkaiser>
hold on
<hkaiser>
it is supposed to fail
<hkaiser>
ok, I think I know how to fix this
<hkaiser>
thanks zao, very helpful
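For context, a simplified sketch of the hpx_init.hpp style hkaiser refers to (see the HPX docs for the exact signatures): with this variant main() runs outside the HPX runtime, so any global object is constructed before hpx::init() is called, whereas including <hpx/hpx_main.hpp> makes main() itself execute as an HPX thread. The global below is hypothetical.

```cpp
#include <hpx/hpx_init.hpp>

// hypothetical global: its constructor runs before the runtime starts
// some_type global_object;

int hpx_main(int argc, char** argv)
{
    // the HPX runtime is up and running here
    return hpx::finalize();
}

int main(int argc, char** argv)
{
    return hpx::init(argc, argv);   // starts the runtime and calls hpx_main()
}
```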
eschnett has joined #ste||ar
<zao>
Happy to cause work :)
<hkaiser>
I don't see why it fails on mac only, though
<zao>
nikunj: Built a commit from mid-February, same kind of hang :)
<nikunj>
<sigh>
<zao>
As that's before you entered our world, kind of means you're innocent :)
<zao>
Yeah, same stack trace.
<nikunj>
<sigh>
<hkaiser>
I'm not surprised
aserio has joined #ste||ar
bibek has quit [Quit: Konversation terminated!]
bibek has joined #ste||ar
hkaiser has quit [Quit: bye]
jaafar has joined #ste||ar
Vir has joined #ste||ar
Vir has quit [Client Quit]
Vir has joined #ste||ar
hkaiser has joined #ste||ar
david_pfander has quit [Ping timeout: 240 seconds]
hkaiser has quit [Ping timeout: 256 seconds]
aserio1 has joined #ste||ar
aserio1 has quit [Remote host closed the connection]
aserio has quit [Ping timeout: 265 seconds]
aserio has joined #ste||ar
Vir has quit [Ping timeout: 265 seconds]
Vir has joined #ste||ar
aserio has quit [Read error: Connection reset by peer]
aserio has joined #ste||ar
aserio has quit [Ping timeout: 248 seconds]
Vir has quit [Ping timeout: 240 seconds]
Vir has joined #ste||ar
jgolinowski has quit [Ping timeout: 240 seconds]
Vir has quit [Ping timeout: 244 seconds]
Vir has joined #ste||ar
nikunj[m] has joined #ste||ar
jgolinowski has joined #ste||ar
nikunj[m] has quit [Quit: Bye]
daissgr has quit [Quit: WeeChat 1.9.1]
Vir has quit [Ping timeout: 265 seconds]
aserio has joined #ste||ar
<diehlpk_work>
heller, please have a look at the PR for updating the docker_build_env repo to circle-ci 2.0
<diehlpk_work>
Once this is merged, all of our circle-ci projects will be updated to 2.0
nikunj has quit [Quit: Bye]
jgolinowski has quit [Read error: Connection timed out]
jgolinowski has joined #ste||ar
aserio1 has joined #ste||ar
aserio has quit [Ping timeout: 265 seconds]
aserio1 is now known as aserio
akheir has joined #ste||ar
quaz0r has quit [Ping timeout: 268 seconds]
aserio has quit [Ping timeout: 256 seconds]
quaz0r has joined #ste||ar
<jgolinowski>
ms[m]1, yt?
<ms[m]1>
jgolinowski: yep, here
<jgolinowski>
ms[m]1, did you have a look at the application?
<ms[m]1>
just did
<ms[m]1>
it looks good
<ms[m]1>
haven't had a look at the code yet
<jgolinowski>
So, any last changes I should introduce? Does it break for some combination of inputs or something like this? Does it build OK?
<ms[m]1>
works for me (TM)
<ms[m]1>
I noticed trying to save a video into a non-existent directory ends up printing "Failed to create AVI writer"
<jgolinowski>
Ah this
<jgolinowski>
did you set the path first?
<ms[m]1>
but that's not the end of the world
<ms[m]1>
yeah, once I set it, it works
<ms[m]1>
so you can leave it as it is for gsoc, you can of course continue afterwards ;)
<jgolinowski>
ms[m]1, ok
<jgolinowski>
I am finishing the first version of the readme for the opencv_hpx_backend repo
<jgolinowski>
will commit shortly
<ms[m]1>
regarding shared pointers: 1. you can use std::shared_ptr (I know there was a boost::shared_ptr there from before), 2. use unique_ptr by default, and shared only if you actually need to share it, 3. I would prefer not to typedef shared_ptr<something> and just use it directly but that's a matter of taste
<ms[m]1>
ok, nice
<jgolinowski>
so with the shared pointers
<jgolinowski>
one issue is that QT seems to be using bare pointers
<jgolinowski>
but I understand that since the bare pointer is not counted by the reference counter, it is not bad?
<jgolinowski>
I mean that when the martycam object gets destroyed, and at that point there is only one reference, the object behind the pointer will be destroyed anyway
<jgolinowski>
which might not be "nice" for the QT code, but anyway this is only at the very end of the app
<ms[m]1>
hmm, so what you say is true
<ms[m]1>
I'm not sure I understand what the problem is
<jgolinowski>
ms[m]1, well there is no problem as such
<ms[m]1>
why is it not "nice" for QT? I can't imagine they would recommend just letting their objects leak...
<jgolinowski>
just a thing I was thinking about while porting to smart pointers
<jgolinowski>
and the "not nice" is the potential situation in which the pointer points to nothing (nullptr) but I am pretty sure it is accounted for
<ms[m]1>
right, that's a general problem when dealing with raw pointers
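A small sketch of the ownership pattern ms[m]1 describes: own the object with a smart pointer (unique_ptr by default) and hand Qt-style APIs only a non-owning raw pointer. The Renderer and MartyCam types here are made up for the example and do not correspond to the real martycam or Qt classes.

```cpp
#include <memory>

// Illustrative stand-ins for the real martycam / Qt types.
struct Renderer
{
    void draw() {}
};

struct MartyCam
{
    // unique_ptr by default: MartyCam owns the renderer exclusively
    std::unique_ptr<Renderer> renderer = std::make_unique<Renderer>();

    // APIs that take raw pointers get a non-owning view; the raw pointer
    // does not take part in ownership, so this is fine as long as MartyCam
    // outlives whoever holds the pointer.
    Renderer* renderer_for_qt() { return renderer.get(); }
};

int main()
{
    MartyCam cam;
    cam.renderer_for_qt()->draw();
    // the renderer is destroyed automatically when cam goes out of scope
}
```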