hkaiser changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/ | GSoC2018: https://wp.me/p4pxJf-k1
daissgr has quit [Quit: WeeChat 1.9.1]
ste||ar-github has joined #ste||ar
<ste||ar-github> [hpx] hkaiser pushed 4 new commits to master: https://github.com/STEllAR-GROUP/hpx/compare/fdc279a748e1...6e19dce20b1f
<ste||ar-github> hpx/master 95e2fbf Nikunj Gupta: Updates doc with recent hpx_wrap implementation
<ste||ar-github> hpx/master d39f043 Nikunj Gupta: Adds info about libhpx_wrap implementations
<ste||ar-github> hpx/master 47b7945 Nikunj Gupta: Adds hpx_usage section to include list
ste||ar-github has left #ste||ar [#ste||ar]
K-ballo has quit [Quit: K-ballo]
hkaiser has quit [Quit: bye]
nanashi55 has quit [Ping timeout: 256 seconds]
nanashi55 has joined #ste||ar
<nikunj> I asked one of my friends to try building HPX on his Macbook Air (2015) with my Mac OSX implementation. Here are the results: https://gist.github.com/ac-alpha/e1efd4e078ce8577148596949da7bfc9
<nikunj> zao, my implementation seems to be working fine other than 1 failing test that I think is unrelated to my implementation
<nikunj> My friend is using a Macbook Air 2015 with High Sierra
<nikunj> I can ask for his laptop's spec sheet if you want me to
jgolinowski has joined #ste||ar
<jbjnr> nikunj: can you ask him to rerun the test that failed a few times "ctest -R tests.regressions.parallel.static_chunker_2282 -VV" and see if it fails every time with the same error, or if it is random and only fails occasionally.
<nikunj> jbjnr, ok I'll ask him to rerun
<jbjnr> or even just rerun all the tests and see if the same one fails, or if others fail occasionally
<nikunj> alright, I'll ask him to run the test suite multiple times
<jbjnr> I'm worried that certain tests randomly fail and that there is a bug somewhere deep in the runtime that we have not located (for years)
<jbjnr> if he wants to submit results to the dashboard here http://cdash.cscs.ch/index.php?project=HPX then the commands are ...
<jbjnr> ctest -D ExperimentalStart; ctest -D ExperimentalTest; ctest -D ExperimentalSubmit
<jbjnr> then the results will appear as an entry in CDash
<jbjnr> and I can look at the output from any failing ones
<nikunj> so I'll have to run those 3 commands only in the terminal and results will show up on CDash?
<jbjnr> correct
<jbjnr> that sequence assumes you have already compiled and only does the start,test,submit steps
<nikunj> ok, I'll send him a bash script for it. Should I ask him to run the test suite 5 times?
<jbjnr> 5 would be fine. thanks
<nikunj> it is compiled and built properly on his machine
<jbjnr> they will(should) appear in the Experimental section of the dashboard - scroll right down to see them
<nikunj> so those commands should work fine
<nikunj> ok
<nikunj> I'll ask him to run the tests right away
<jbjnr> great. thanks
jaafar has quit [Ping timeout: 248 seconds]
<zao> Seems like our HPX_FALLTHROUGH; annotations don't do what they should here.
<zao> Ah, it expands to _nothing_ if you build as C++14.
<zao> We didn't bother using the vendor-specific ways to indicate this?
<jbjnr> Those warnings annoy me too. so many of them
<heller> FIX IT
<jbjnr> zao: heller says you have to fix it :)
<zao> :D
jbjnr has quit [Ping timeout: 240 seconds]
david_pfander has joined #ste||ar
<zao> ohboy
<nikunj> woah! it's the first time I'm seeing an ICE
<zao> Could be machine-specific too, just put in more memory to make it not swap while building.
<nikunj> can swapping really result in an ICE?
<zao> nikunj: Running out of resources can, as allocations may fail in unexpected places.
<zao> nikunj: I added more physical memory to the machine, going from 2 sticks to 4 sticks.
<zao> The more sticks you have, the lower you might have to clock them.
<heller> nikunj: one of the primary reasons when compiling HPX code ;)
<nikunj> heller: true!
jbjnr has joined #ste||ar
<jbjnr> my windows machine keeps rebooting and I lose my IRC window. Hope I'm not missing any messages (doubtful).
<nikunj> jbjnr, quick update. My friend has fired up my bash script. You should be able to see uploaded results soon.
<jbjnr> thanks
<zao> jbjnr: On my Linux machine, the test completes in 50ms. First run on macOS wedges.
<jbjnr> wedges?
<jbjnr> =hangs?
<zao> Yeah.
<jbjnr> hmmm
<nikunj> zao, which version of mac os are you using?
<zao> I'm impatient, but at least 30s of no-go.
<zao> nikunj: Whatever is current non-beta.
<jbjnr> I'll have to get the old macbook out tonight to investigate
<zao> High Sierra?
<nikunj> also, iirc the last time I asked you to try building, it built successfully, right?
<nikunj> from what I can recollect 4-6 tests failed that time, with some passing when run again
<zao> It seems to be reliably stuck.
<zao> 3/3 times.
<nikunj> that's quite odd, because I really didn't change anything in it
<nikunj> I just integrated the 2 implementations to reduce code redundancy
<jbjnr> ooh. lock in get partitioner
<jbjnr> race on startup by the looks of things
<jbjnr> thanks. That looks useful.
<zao> Note that this is nikunj's PR, don't know how master fares.
<jbjnr> interesting stack trace. the static chunker needs a config var to know how big the chunk default is. it has to query this from the config and that pulls in all the RP stuff. very messy.
<heller> ugg
<jbjnr> I will take a look when I have a moment - I think an #issue needs to be filed with that stack trace in
<heller> --> lazy init :/
<zao> ls
<jbjnr> ms[m]1 and I have been chatting about lazy_init on slack
<zao> Blargh, current master doesn't compile.
<nikunj> zao, you mean in mac OS?
ste||ar-github has joined #ste||ar
<ste||ar-github> [hpx] msimberg opened pull request #3401: Fix cuda_future_helper.h when compiling with C++11 (master...fix-cuda-future-helper) https://github.com/STEllAR-GROUP/hpx/pull/3401
ste||ar-github has left #ste||ar [#ste||ar]
<zao> nikunj: aye
<heller> __linux
<jbjnr> error: no member named 'include_libhpx_wrap' looks like it might be nikunj's fault
<nikunj> jbjnr, yes
<nikunj> seems like it is accessing it for non-Linux systems
<heller> FIX IT!
<nikunj> heller, on it
<jbjnr> heller: ms[m]1 and I were talking about lazy_init and he says it's bad because stacks don't get reused properly. I want to fix it, but he thinks you had a plan for it already. Do you?
<heller> jbjnr: I wish I completed it long time ago
<heller> jbjnr: I don't have a concrete plan
<heller> jbjnr: but I 100% agree with ms[m]1's verdict
<jbjnr> we need thread local stacks that are assigned to tasks and reused in a numa/thread locally sensitive way
<heller> ++
<jbjnr> ok
<jbjnr> then I will look into it
<jbjnr> ms[m]1: yt?
<jbjnr> opinions?
<ms[m]1> ^heller's verdict
<ms[m]1> uhm, nothing more to add
<ms[m]1> I've told you everything I know(/remember)
<jbjnr> ok
<ms[m]1> heller, my problem was about where to handle the stack reuse
<jbjnr> main question from me then is - if you said the coroutines are allocating the stacks - can we easily hook into that and change it - or is it in boost code somewhere?
<ms[m]1> the scheduler does it now, but the coroutines handle stack allocation, did you have any thoughts on where it would be nicest to have it?
<heller> no
<jbjnr> asking me or heller?
<heller> I haven't thought it through yet
<heller> any place is as bad as the other, i guess
<ms[m]1> great...
<ms[m]1> ok
<jbjnr> we need a memory manager that is visible from the scheduler and from the coroutine - it would manage the stacks of stacks
<jbjnr> using the numa aware allocators
<heller> so, the place where we need to get the stack is when calling thread_data::operator()
<heller> that is, when there's no stack associated with the specific coroutine, we need to get some
<jbjnr> ok
<ms[m]1> basically where you have if (stack_ptr == nullptr) allocate_one, you would have if (stack_ptr == nullptr) get_existing_or_allocate_one
<heller> right
<heller> the question is, where the 'get_existing_or_allocate_one' state resides
<heller> and also, once the coroutine is done, you need to give it back
<heller> thread locally
<heller> or delete it, if it is from another NUMA domain
<heller> you could save the origin NUMA domain in the first bytes of the stack
<jbjnr> not great - you have to hit the memory to query it and waste a possible cache line
<ms[m]1> but you'd know when you allocate where you are
<ms[m]1> unless there are no thread bindings
<jbjnr> better to just say get_existing_or_allocate_one(thread_info) and allow the memory manager to do the right thing
ste||ar-github has joined #ste||ar
<ste||ar-github> [hpx] NK-Nikunj opened pull request #3402: Allow debug option to be enabled only for Linux systems with dynamic main on (master...fix_debug) https://github.com/STEllAR-GROUP/hpx/pull/3402
ste||ar-github has left #ste||ar [#ste||ar]
<jbjnr> (the scheduler knows where the task will run and it should give it a stack from its cache of stacks).
<nikunj> heller, I just added a PR that should fix the current master
<heller> jbjnr: well, since stacks might migrate along with the task, you need to know where the task was allocated...
<heller> it's about dispensing the stack
<heller> we don't want to keep it around if it came from another NUMA domain
<jbjnr> if the thread_data stores the id of the pu it ran on, then that info is available
<jbjnr> the scheduler knows all this
<heller> how is that different from storing it in the first bytes of the stack ;)?
<jbjnr> when the task finishes, it gives the stack back
<heller> sure
<jbjnr> because storing it in the stack itself is just a hack
<jbjnr> smells
<jbjnr> and won't be portable to new architectures with different memory semantics
<jbjnr> oh my stack is on the GPU, I'll just fetch that memory and query it ...
<heller> ok. point taken
<nikunj> this should fix the broken master on mac OS
<zao> Not sure when it actually builds that bit, but seems to have gotten a fair bit into the build thus far.
nikunj[m] has joined #ste||ar
<zao> nikunj: Is there any particular reason you're using all those different defines to identify Linux?
<zao> All you need is __linux__, the rest are so obsolete that there's no chance of them ever occurring.
<nikunj[m]> zao: yes, hkaiser told me that these different linux #defines specify different linux kernels
<nikunj[m]> so he asked me to split it into these 3
<zao> predef.sf.net claims that the other two are obsolete.
<nikunj[m]> I previously had only __linux__
<zao> I don't know what his source is, of course.
<nikunj[m]> zao: the PR should have fixed things for master. The error was related to compiler not being able to find out of the hpx_start namespace due to those #defines
<zao> Anyway, it's not a fight I care to take, but I'd be EXTREMELY surprised if anything relied on the non-POSIX compliant definitions on any system made this century.
<zao> It's just noise.
<nikunj[m]> jbjnr: did anything get updated on CDash?
<jbjnr> I did not see anything
<jbjnr> maybe I screwed up with the commands
<jbjnr> lunch ... bbiab
<nikunj[m]> jbjnr: I'll ask him to share a gist in that case. He did tell me that some of the tests later timed out. I don't know which ones though
daissgr has joined #ste||ar
daissgr has quit [Client Quit]
daissgr has joined #ste||ar
<nikunj[m]> zao: did it build correctly?
<zao> [1343/1433] Building CXX object tests/unit/util/CMakeFiles/any_serialization_test_exe.dir/any_serialization.cpp.o
<zao> macs are many things, but fast they're not :)
<nikunj[m]> so true!
<zao> tests.regressions.parallel.static_chunker_2282 hangs on 3402 as well.
<zao> (that is, pretty much master)
<nikunj[m]> static_chunker is amongst the tests that usually fail on mac (from my observation)
<nikunj[m]> I remember lowering failing tests down to static_chunker_2282
<nikunj[m]> but that's the only test I could not pass during my mac OS implementation a month ago
<zao> jbjnr: Could this kind of thing be related to the kind of generic/whatever context we have for coroutines on macOS vs. real OSes? I don't quite get why this doesn't manifest on Linux.
mcopik has joined #ste||ar
nikunj[m] has quit [Quit: brb]
nanashi55 has quit [Ping timeout: 244 seconds]
nanashi55 has joined #ste||ar
K-ballo has joined #ste||ar
<jbjnr> zao: yes. could be related.
hkaiser has joined #ste||ar
ste||ar-github has joined #ste||ar
<ste||ar-github> [hpx] hkaiser closed pull request #3367: Adds Mac OS implementation to hpx_main.hpp (master...Mac_OS_impl) https://github.com/STEllAR-GROUP/hpx/pull/3367
ste||ar-github has left #ste||ar [#ste||ar]
<nikunj> jbjnr, see pm please
<heller> supercomputers being super nice to me today
<heller> with an excellent job throughput
<hkaiser> that's why they are called super computers
<zao> jbjnr: A-ha!
<zao> I enabled generic coroutines on Linux and it wedges the chunker test.
<heller> hkaiser: I like my little 16 board the most :D
<jbjnr> zao: wedges = locks up?
<zao> Oh, wait.
<heller> zao: interesting
<zao> Might've run it on the wrong machine <_<
<zao> Accidentally ran it on the mac :D
<jbjnr> <sigh>
<hkaiser> yah, the chunker test fails on macs
<zao> It works on Linux, even with -DHPX_WITH_GENERIC_CONTEXT_COROUTINES=ON
<zao> (so many terminals open)
<hkaiser> so it's not the coroutines implementation
<zao> hkaiser: Did you see my backtrace for that test on macOS? https://gist.github.com/zao/0400380fae91561560fd2f4a74e6df99
<hkaiser> frame #26 points to a line that does not make sense
<hkaiser> do you run on top of master?
<zao> This was on one of nikunj's branches, but master + 3407 also exhibits the deadlock.
<hkaiser> what line on master does this correspond to?
<hkaiser> it tries to grab a lock twice recursively
<zao> Yes.
<hkaiser> I think it's trying to initialize a global object before hpx is up and running
<hkaiser> nikunj: do you copy this?
<nikunj> yes I do, but that should not have happened
<hkaiser> heh
mcopik has quit [Ping timeout: 240 seconds]
<hkaiser> is there any bug that 'should happen'?
<nikunj> my implementation only calls init
<hkaiser> sure
<hkaiser> not blaming you
<nikunj> and it works as a function wrapper, so everything should have been initialized by then. I will look further if it's related to my implementation
<hkaiser> ahh, the test is wrong
<hkaiser> it should #include hpx_main.hpp, not hpx_init.hpp
<hkaiser> hold on
<hkaiser> it is supposed to fail
<hkaiser> ok, I think I know how to fix this
<hkaiser> thanks zao, very helpful
eschnett has joined #ste||ar
<zao> Happy to cause work :)
<hkaiser> I don't see why it fails on mac only, though
<zao> nikunj: Built a commit from mid-February, same kind of hang :)
<nikunj> <sigh>
<zao> As that's before you entered our world, kind of means you're innocent :)
<zao> Yeah, same stack trace.
<nikunj> <sigh>
<hkaiser> I'm not surprised
aserio has joined #ste||ar
bibek has quit [Quit: Konversation terminated!]
bibek has joined #ste||ar
hkaiser has quit [Quit: bye]
jaafar has joined #ste||ar
Vir has joined #ste||ar
Vir has quit [Client Quit]
Vir has joined #ste||ar
hkaiser has joined #ste||ar
david_pfander has quit [Ping timeout: 240 seconds]
hkaiser has quit [Ping timeout: 256 seconds]
aserio1 has joined #ste||ar
aserio1 has quit [Remote host closed the connection]
aserio has quit [Ping timeout: 265 seconds]
aserio has joined #ste||ar
Vir has quit [Ping timeout: 265 seconds]
Vir has joined #ste||ar
aserio has quit [Read error: Connection reset by peer]
aserio has joined #ste||ar
aserio has quit [Ping timeout: 248 seconds]
Vir has quit [Ping timeout: 240 seconds]
Vir has joined #ste||ar
jgolinowski has quit [Ping timeout: 240 seconds]
Vir has quit [Ping timeout: 244 seconds]
Vir has joined #ste||ar
nikunj[m] has joined #ste||ar
jgolinowski has joined #ste||ar
nikunj[m] has quit [Quit: Bye]
daissgr has quit [Quit: WeeChat 1.9.1]
Vir has quit [Ping timeout: 265 seconds]
aserio has joined #ste||ar
<diehlpk_work> heller, Please have a look into the PR for updating the repo docker_build_env to circle-ci 2.0
<diehlpk_work> Once this is updated, all of our circle-ci projects are updated to 2.0
nikunj has quit [Quit: Bye]
jgolinowski has quit [Read error: Connection timed out]
jgolinowski has joined #ste||ar
aserio1 has joined #ste||ar
aserio has quit [Ping timeout: 265 seconds]
aserio1 is now known as aserio
akheir has joined #ste||ar
quaz0r has quit [Ping timeout: 268 seconds]
aserio has quit [Ping timeout: 256 seconds]
quaz0r has joined #ste||ar
<jgolinowski> ms[m]1, yt?
<ms[m]1> jgolinowski: yep, here
<jgolinowski> ms[m]1, did you have a look at the application?
<ms[m]1> just did
<ms[m]1> it looks good
<ms[m]1> haven't had a look at the code yet
<jgolinowski> So any last changes I should introduce? Does it break for some combination of inputs or something like that? Does it build OK?
<ms[m]1> works for me (TM)
<ms[m]1> I noticed trying to save a video into a non-existing directory ends up printing "Failed to create AVI writer"
<jgolinowski> Ah this
<jgolinowski> did you set the path first?
<ms[m]1> but that's not the end of the world
<ms[m]1> yeah, once I set it works
<ms[m]1> so you can leave it as it is for gsoc, you can of course continue afterwards ;)
<jgolinowski> ms[m]1, ok
<jgolinowski> I am finishing first version of the readme for the opencv_hpx_backend repo
<jgolinowski> will commit shortly
<ms[m]1> regarding shared pointers: 1. you can use std::shared_ptr (I know there was a boost::shared_ptr there from before), 2. use unique_ptr by default, and shared only if you actually need to share it, 3. I would prefer not to typedef shared_ptr<something> and just use it directly but that's a matter of taste
<ms[m]1> ok, nice
<jgolinowski> so with the shared pointers
<jgolinowski> one issue is that QT seems to be using bare pointers
<jgolinowski> but I understand that since the bare pointer is not counted by the reference counter, it is not bad?
<jgolinowski> I mean that when the martycam object gets destroyed, and at that point there is only one reference, the object under the pointer will be destroyed anyway
<jgolinowski> which might not be "nice" for the QT code but any way this is only at the very end of the app
<ms[m]1> hmm, so what you say is true
<ms[m]1> I'm not sure I understand what the problem is
<jgolinowski> ms[m]1, well there is no problem as such
<ms[m]1> why is it not "nice" for QT? I can't imagine they would recommend just letting their objects leak...
<jgolinowski> just a thing I was thinking about while porting to smart pointers
<jgolinowski> and the "not nice" is the potential situation in which the pointer points to nothing (nullptr) but I am pretty sure it is accounted for
<ms[m]1> right, that's a general problem when dealing with raw pointers
<ms[m]1> in that case you should actually be passing the shared_ptr directly by value rather than getting the raw pointer
eschnett has quit [Quit: eschnett]
<jgolinowski> ms[m]1, I was talking more about the lines 49-51
<jgolinowski> the places where QT enforces raw pointers
<ms[m]1> I see
<jgolinowski> ms[m]1, btw I pushed the README.md
<ms[m]1> thanks, will have a look in the morning
aserio has joined #ste||ar
aserio1 has joined #ste||ar
aserio has quit [Ping timeout: 256 seconds]
aserio1 is now known as aserio
ste||ar-github has joined #ste||ar
<ste||ar-github> [hpx] hapoo opened pull request #3404: fixing multiple difinition of main() in linux (master...master) https://github.com/STEllAR-GROUP/hpx/pull/3404
ste||ar-github has left #ste||ar [#ste||ar]
aserio1 has joined #ste||ar
jaafar has quit [Ping timeout: 260 seconds]
<heller> zao: could you take a look at the hpx-users ml? There's a guy with a question about freebsd
aserio has quit [Ping timeout: 256 seconds]
aserio1 is now known as aserio
eschnett has joined #ste||ar
jaafar has joined #ste||ar
aserio has quit [Quit: aserio]
aserio has joined #ste||ar
aserio has quit [Client Quit]
eschnett has quit [Quit: eschnett]
akheir has quit [Quit: Leaving]
eschnett has joined #ste||ar