hkaiser changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/ | GSoC2018: https://wp.me/p4pxJf-k1
hkaiser has joined #ste||ar
quaz0r has quit [Ping timeout: 260 seconds]
K-ballo has quit [Quit: K-ballo]
quaz0r has joined #ste||ar
hkaiser has quit [Quit: bye]
diehlpk has quit [Ping timeout: 244 seconds]
aserio has joined #ste||ar
aserio has quit [Quit: aserio]
nanashi64 has joined #ste||ar
nanashi55 has quit [Ping timeout: 265 seconds]
nanashi64 is now known as nanashi55
nikunj97 has joined #ste||ar
twwright has quit [Read error: Connection reset by peer]
twwright_ has joined #ste||ar
anushi has quit [Ping timeout: 260 seconds]
anushi has joined #ste||ar
<M-ms> [ PERFSTAT ] (samples=100 mean=6.85 median=4.40 min=3.83 stddev=3.83 (55.9%))
<M-ms> [ PERFSTAT ] (samples=18 mean=2.93 median=2.92 min=2.74 stddev=0.09 (3.0%))
jaafar has quit [Ping timeout: 240 seconds]
nikunj[m] has joined #ste||ar
mcopik has joined #ste||ar
jbjnr has joined #ste||ar
mcopik has quit [Read error: Connection reset by peer]
<nikunj97> jbjnr, yt?
<jbjnr> here
<nikunj97> jbjnr, could you try building this branch. https://github.com/STEllAR-GROUP/hpx/pull/3375
<nikunj97> with -DHPX_WITH_DYNAMIC_HPX_MAIN=ON
<nikunj97> I wanted to check if the patch made things work out on ppc
<jbjnr> ok. checking out now. will leave build running whilst I go for coffee
<jbjnr> report back in a bit
<nikunj97> jbjnr, sure
mcopik has joined #ste||ar
nikunj97 has quit [Quit: Leaving]
<jbjnr> nikunj[m]: I used cmake -DHPX_WITH_DYNAMIC_HPX_MAIN=ON . and now hello world runs correctly. Nice one.
nikunj[m] has quit [Ping timeout: 276 seconds]
anushi has quit [Remote host closed the connection]
anushi has joined #ste||ar
david_pfander has joined #ste||ar
nikunj[m] has joined #ste||ar
<nikunj[m]> @jbjnr, that's good to hear, hpx now runs well on powerpc too
<nikunj[m]> Did you run tests as well?
<jbjnr> yes
<jbjnr> 94% tests passed, 35 tests failed out of 573
<jbjnr> The same ones failing as I had with DYNAMIC MAIN off
<nikunj[m]> @jbjnr then I don't think it has anything to do with my code
<jbjnr> nope. it's something fishy on powerpc that doesn't happen on other linux flavours
<jbjnr> I will investigate what's going on.
<nikunj[m]> @jbjnr if you find out anything related to my implementation please let me know
<jbjnr> of course.
<jbjnr> I suspect a race condition that was previously unknown
<nikunj[m]> @jbjnr could you please try building phylanx if possible
<nikunj[m]> @jbjnr that's odd
<jbjnr> not phylanx. I have too much work to do to get involved with another project
<jbjnr> got deadlines to meet here
<jbjnr> sorry.
<nikunj[m]> @jbjnr no worries
mcopik has quit [Read error: Connection reset by peer]
<jbjnr> race condition, because we will be using 160 threads while most other tests use 8 or 16. Quite possibly there's a problem in the scheduling somewhere, or some place in parallel::algorithms that is not triggered frequently
<nikunj[m]> @jbjnr sounds right
<zao> Good morning, all!
<nikunj[m]> zao good morning!
<jbjnr> good morning
anushi has quit [Ping timeout: 276 seconds]
nikunj[m] has quit [Ping timeout: 252 seconds]
jbjnr_ has joined #ste||ar
jbjnr has quit [Ping timeout: 240 seconds]
mcopik has joined #ste||ar
<heller___> jbjnr_: which tests fail?
mcopik has quit [Ping timeout: 240 seconds]
jbjnr__ has joined #ste||ar
jbjnr_ has quit [Ping timeout: 240 seconds]
mcopik has joined #ste||ar
mcopik has quit [Ping timeout: 260 seconds]
mcopik has joined #ste||ar
mcopik has quit [Ping timeout: 244 seconds]
jakub_golinowski has joined #ste||ar
<jakub_golinowski> M-ms, so as you mentioned, one of the important things to spot is that the opencv tests seem to be somehow randomized
<jbjnr__> Testing on PowerPC Debug build : 99% tests passed, 2 tests failed out of 573
jakub_golinowski has quit [Ping timeout: 240 seconds]
<heller___> jbjnr__: not bad :)
<jbjnr__> one timeout, one fail
<jbjnr__> question is: why does release mode trigger problems ...
<heller___> no idea
<heller___> what's the output of the tests?
<heller___> why do they fail?
nikunj[m] has joined #ste||ar
nikunj[m] has quit [Ping timeout: 252 seconds]
mcopik has joined #ste||ar
<jbjnr__> heller___: is there a pattern in this set of fails : https://gist.github.com/biddisco/7980a7160183093e4e9f582fc150f234
mcopik has quit [Ping timeout: 240 seconds]
hkaiser has joined #ste||ar
<heller___> jbjnr__: can't see any
<heller___> could be anything
Chewbakka has joined #ste||ar
Chewbakk_ has joined #ste||ar
<Chewbakk_> Hi, we have a few questions regarding the distribution policy of partitioned vectors: Does the default distribution policy distribute the vector blockwise? If yes, is it possible to control which locality owns which block? I am also asking because I read that it could also be partitioned in a round-robin manner
<jbjnr__> I believe you can supply your own distribution policy
<jbjnr__> but I've never used the partitioned vectors so I can't recall the details
Chewbakka has quit [Ping timeout: 240 seconds]
<hkaiser> Chewbakk_: I think the default is to distribute the blocks round robin
anushi has joined #ste||ar
<Chewbakk_> Is there no given block chunking policy? I also read this here: https://github.com/STEllAR-GROUP/hpx/blob/master/hpx/components/containers/container_distribution_policy.hpp#L29
<Chewbakk_> This seems like a common scenario to me
<hkaiser> Chewbakk_: what would you like to achieve?
anushi has quit [Remote host closed the connection]
anushi has joined #ste||ar
<Chewbakk_> For k localities and a partitioned vector of size n: locality 0 owns the first n/k data entries, locality 1 the next n/k entries, ...
<hkaiser> Chewbakk_: that is the default, yes
anushi has quit [Remote host closed the connection]
<hkaiser> the blocks (not elements) are distributed round robin, by default across as many localities as are connected to the application
<hkaiser> one block per locality
<Chewbakk_> ah perfect, thank you
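A minimal sketch of the default block distribution just described, assuming HPX's documented partitioned_vector API (the container_layout helper and registration macro; exact header paths may differ between versions):

```cpp
#include <hpx/hpx_main.hpp>
#include <hpx/include/partitioned_vector.hpp>
#include <hpx/include/runtime.hpp>

#include <cstddef>
#include <vector>

// The element type of a partitioned_vector has to be registered once
// per application.
HPX_REGISTER_PARTITIONED_VECTOR(double);

int main()
{
    std::vector<hpx::id_type> localities = hpx::find_all_localities();

    // One block per locality, assigned round robin: with k localities,
    // locality 0 holds elements [0, n/k), locality 1 holds [n/k, 2n/k), ...
    std::size_t const n = 1000;
    hpx::partitioned_vector<double> v(n, hpx::container_layout(localities));

    return 0;
}
```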
anushi has joined #ste||ar
anushi has quit [Remote host closed the connection]
anushi has joined #ste||ar
K-ballo has joined #ste||ar
<jbjnr__> hkaiser: same question to you : is there a pattern in this set of fails : https://gist.github.com/biddisco/7980a7160183093e4e9f582fc150f234
<jbjnr__> if I run those tests using --hpx:threads=1 then they pass.
<jbjnr__> (didn't try all of them, but those I tried passed)
<hkaiser> don't see any - does this happen always, or just for a particular random number?
<jbjnr__> always AFAICT
<hkaiser> is that on the ppc system?
<jbjnr__> oooh
<jbjnr__> yes PPC
<hkaiser> k
<jbjnr__> I tried find_test again and again and it just passed once
<hkaiser> could be a race that does not show itself on x86
<jbjnr__> yup. it's a race.
<jbjnr__> same seed and it failed
<hkaiser> ppc has a more relaxed memory model than x86
<jbjnr__> that's annoying. How will I track it down?
<hkaiser> annoying, indeed
<hkaiser> well, as always - try to minimize to the smallest possible reliably failing test case
<hkaiser> looks to be something fundamental, not related to a specific algorithm
<hkaiser> even util.function fails
<hkaiser> jbjnr__: if you asked me to venture a guess, I'd say look at condition_variable
<hkaiser> but that's a guess only - could be anything
<jbjnr__> gdb shows me exceptions thrown in STL vector and algorithms
<jbjnr__> vector assign triggers exception in reduce_by_key, find in find_test
<jbjnr__> some corruption must be going on
<hkaiser> so something went out of scope too early
<jbjnr__> could be
<jbjnr__> I will play more. Odd that they pass in debug mode, but fail in release
<hkaiser> not odd at all
<jbjnr__> what do we do differently
<hkaiser> different timings, less contention, etc.
<hkaiser> less pressure on the synchronization
Chewbakka has joined #ste||ar
jbjnr__ has quit [Read error: Connection reset by peer]
<hkaiser> do you have something like the intel tools available on those machines?
<hkaiser> or the clang sanitizers
Chewbakk_ has quit [Ping timeout: 240 seconds]
jbjnr has joined #ste||ar
mbremer has quit [Quit: Page closed]
Chewbakk_ has joined #ste||ar
<hkaiser> jbjnr: clang sanitizers could help
<hkaiser> also, I'd start with making all atomics sequentially consistent, i.e. remove all memory_order parameters
<hkaiser> good chance we got those wrong...
<jbjnr> from a .load(std::memory_order_relaxed) - you mean just remove the order completely
<jbjnr> and do that for all
<hkaiser> yes
<jbjnr> .load, .store
<hkaiser> exchange
<jbjnr> to see if it gets better
<hkaiser> right
Chewbakka has quit [Ping timeout: 244 seconds]
<jbjnr> what about acquire/release - leave them?
<hkaiser> no, remove those
<hkaiser> make everything seq consistent
<jbjnr> worth a try
<hkaiser> slows things down a bit, but more predictable
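For reference, the experiment being proposed looks like this on a plain std::atomic (a generic sketch, not HPX's actual internals):

```cpp
#include <atomic>

std::atomic<int> flag{0};

void experiment()
{
    // Before: explicit weaker orderings, which are easy to get wrong on
    // weakly ordered architectures such as PowerPC:
    //   int v = flag.load(std::memory_order_relaxed);
    //   flag.store(v + 1, std::memory_order_release);

    // After: drop the memory_order arguments so every operation defaults
    // to std::memory_order_seq_cst - a bit slower, but maximally
    // predictable, which helps rule the orderings in or out as the bug.
    int v = flag.load();
    flag.store(v + 1);
}
```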
hkaiser has quit [Quit: bye]
anushi has quit [Remote host closed the connection]
hkaiser has joined #ste||ar
<jbjnr> hkaiser: I think your plan is working
<jbjnr> almost
<jbjnr> hmmm. perhaps not
<jbjnr> waiting for tests .....
<K-ballo> oh, right, powerpc
Chewbakka has joined #ste||ar
Chewbakk_ has quit [Ping timeout: 265 seconds]
<jbjnr> ok hkaiser, making everything sequentially consistent did not fix the errors
<jbjnr> just FYI
<jbjnr> so we have a race, but not an obvious atomic-ordering fix
Chewbakka has quit [Remote host closed the connection]
Chewbakka has joined #ste||ar
<hkaiser> jbjnr: so same picture even with seq consistent atomics?
<Chewbakka> Hey, we are trying to compile our HPX program with debug symbols using cmake (we set CMAKE_BUILD_TYPE to Debug). However, we get the following linking error https://gist.github.com/JeannedArk/1a3b21971ef997734533999cb2d30c4b What are we missing?
<hkaiser> Chewbakka: is everything built using Debug?
<hkaiser> i.e. your library too?
<hkaiser> how do you build your library/executable?
<hkaiser> cmake?
<Chewbakka> which libraries, e.g. boost? We only set the cmake flag in our application
<Chewbakka> yes with cmake
<hkaiser> you should build HPX using Debug as well
<Chewbakka> mh ok, that will take a while
<Chewbakka> thank you
jaafar has joined #ste||ar
galabc has joined #ste||ar
Chewbakka has quit [Quit: Leaving...]
nikunj has joined #ste||ar
mbremer has joined #ste||ar
<nikunj> hkaiser, yt?
<hkaiser> here
<mbremer> Hey guys, is hpx master broken at the moment? I pulled and rebuilt a docker container recently and keep getting errors like "hpx::init: can't initialize runtime system more than once! Exiting..."
<nikunj> hkaiser, wrapping main works fine on ppc as well
<nikunj> mbremer, did you try calling hpx::init from main after including hpx_main?
<hkaiser> mbremer: all should be well since yesterday, when did you pull last?
<mbremer> I rebuilt today. Let me look at the commit to be sure.
<mbremer> I guess I am calling hpx::init after including hpx_main
<nikunj> mbremer, if you remove hpx_main from there things should work fine
<hkaiser> no
<hkaiser> mbremer: try using -DHPX_WITH_DYNAMIC_HPX_MAIN=OFF
<nikunj> actually the new implementation works when you include hpx_main in your file. Therefore, the hpx system is already initialized from main
<mbremer> kk, @hkaiser will try (I'll be back after lunch; container takes a while)
<mbremer> Thanks nikunj, @hkaiser
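The error mbremer is seeing comes from mixing HPX's two entry-point styles; a hedged sketch of the distinction (details of hpx_main.hpp's wrapping vary by version):

```cpp
// Style 1: include hpx/hpx_main.hpp and write a plain main(); the HPX
// runtime is set up around main() automatically.
//
// Style 2 (shown below): include hpx/hpx_init.hpp, define hpx_main(),
// and start the runtime explicitly from main(). Including hpx_main.hpp
// on top of this makes hpx::init() run inside an already started
// runtime, hence "can't initialize runtime system more than once!".
#include <hpx/hpx_init.hpp>

int hpx_main(int argc, char* argv[])
{
    // ... application code running on the HPX runtime ...
    return hpx::finalize();
}

int main(int argc, char* argv[])
{
    return hpx::init(argc, argv);   // explicit runtime startup
}
```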
anushi has joined #ste||ar
<hkaiser> nikunj: I hope this lesson teaches us to be more careful with updating master - a lot of people are affected
<nikunj> hkaiser, yes. I won't hurry anymore. I will check everything thoroughly before adding a pr
<nikunj> hkaiser, this pr will enable things for ppc as well. I tested with jbjnr and things work pretty fine on his pc: https://github.com/STEllAR-GROUP/hpx/pull/3375
<hkaiser> nod, he said so himself
<github> [hpx] hkaiser created integrate_hpxmp (+1 new commit): https://git.io/fNLAw
<github> hpx/integrate_hpxmp c6df77a Hartmut Kaiser: Adding build system support to integrate hpxmp into hpx at the user's machine
<github> [hpx] hkaiser opened pull request #3377: Adding build system support to integrate hpxmp into hpx at the user's machine (master...integrate_hpxmp) https://git.io/fNLAM
<nikunj> hkaiser: the error "hpx::init: can't initialize runtime system more than once! Exiting..." does not explain much when hpx_main is included and hpx_init is then called. So I think I should add a check for the case where hpx_main is included and hpx_init is called afterwards, and print a corresponding error (something like: hpx system is already initialized from main; remove hpx_main.hpp to use the hpx_init functionality)
<nikunj> do you agree?
<hkaiser> nikunj: if you think you can diagnose that, sure - would be absolutely appreciated
quaz0r has quit [Ping timeout: 264 seconds]
<nikunj> hkaiser, added code specific to it. Testing it currently
anushi has quit [Read error: Connection reset by peer]
anushi has joined #ste||ar
quaz0r has joined #ste||ar
<nikunj> hkaiser: to produce the above error I will have to add another weak symbol (same as that of hpx_wrap.cpp) to libhpx_init.a
<hkaiser> ok
<hkaiser> let's keep this change independent of the current PR
<nikunj> wait no
<nikunj> I think I might be missing something
<nikunj> hkaiser: ok I will keep things independent of the current stable pr
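The weak-symbol trick under discussion, in generic form (the symbol and function names here are hypothetical, for illustration only, not HPX's actual code):

```cpp
#include <cstdio>
#include <cstdlib>

// In the support library: a weak default meaning "main() was not wrapped".
extern "C" bool entry_point_is_wrapped __attribute__((weak)) = false;

// A header such as hpx_main.hpp could ship a strong definition in the
// translation unit it is included from:
//   extern "C" bool entry_point_is_wrapped = true;
// At link time the strong definition overrides the weak default above.

// The runtime could then diagnose the double initialization precisely:
void diagnose_double_init()
{
    if (entry_point_is_wrapped)
    {
        std::fputs(
            "hpx system is already initialized from main; "
            "remove hpx_main.hpp to use the hpx_init functionality\n",
            stderr);
        std::exit(EXIT_FAILURE);
    }
}
```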
<nikunj> zao, yt?
<zao> meep
<nikunj> zao, could you test the pr as well, just to be sure everything is working on your end too? here: https://github.com/stellAR-GROUP/hpx/pull/3375
<K-ballo> case insensitive github
<zao> Oh right... this beeped.
<nikunj> zao, did you try running it?
<zao> nikunj: Build failure.
<nikunj> oh
<nikunj> what does it say
<zao> Not sure.
<zao> Hrm...
<zao> Ah no, not build failure.
<zao> `tests` target didn't run tests.
<zao> Keep mixing up the targets.
<mbremer> @hkaiser, nikunj: Cmake flag did the trick. Thanks
<nikunj> zao, could you share a gist
<zao> nikunj: It didn't fail. I just got so many warnings, and the test suite didn't start.
<zao> So I assumed there was a failure somewhere.
<nikunj> oh.
mbremer has quit [Quit: Page closed]
<zao> `100% tests passed, 0 tests failed out of 573`
<zao> boooring.
<nikunj> zao, good to hear that!
<nikunj> so things are working for both x86 and ppc
<K-ballo> that's not normal
<zao> :D
anushi has quit [Ping timeout: 265 seconds]
eschnett has joined #ste||ar
anushi has joined #ste||ar
galabc has quit [Quit: Leaving]
mcopik has joined #ste||ar
<nikunj> hkaiser: do you think I can make use of the weak symbol (the one in hpx_wrap) inside of libhpx.so?
<nikunj> hkaiser, I will think of a way to add the error to make debugging it easier in the morning.
nikunj has quit [Quit: goodnight]
<jbjnr> hkaiser: correct. seq_consistency didn't help. Got any other reasonably simple ideas to try that might shed light?
<hkaiser> jbjnr: clang sanitizers
hkaiser has quit [Quit: bye]
eschnett has quit [Quit: eschnett]
mcopik has quit [Ping timeout: 260 seconds]
hkaiser has joined #ste||ar
K-ballo has quit [Quit: K-ballo]
mcopik has joined #ste||ar
mcopik has quit [Ping timeout: 244 seconds]
galabc has joined #ste||ar
quaz0r has quit [Ping timeout: 240 seconds]
<hkaiser> parsa[w]: you can add languages in cmake after the project() statement: enable_language()
K-ballo has joined #ste||ar
jakub_golinowski has joined #ste||ar
quaz0r has joined #ste||ar
V|r has quit [Ping timeout: 265 seconds]
jakub_golinowski has quit [Ping timeout: 256 seconds]
galabc has quit [Quit: Leaving]
diehlpk has joined #ste||ar
diehlpk has quit [Ping timeout: 268 seconds]
K-ballo has quit [Ping timeout: 240 seconds]
K-ballo has joined #ste||ar