#ste||ar on 2018-07-12 — irc logs at irclog.cct.lsu.edu

2018-04-23 16:40 hkaiser changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/ | GSoC2018: https://wp.me/p4pxJf-k1

01:05 hkaiser has joined #ste||ar

01:11 quaz0r has quit [Ping timeout: 260 seconds]

01:20 K-ballo has quit [Quit: K-ballo]

01:33 quaz0r has joined #ste||ar

02:24 hkaiser has quit [Quit: bye]

03:04 diehlpk has quit [Ping timeout: 244 seconds]

03:17 aserio has joined #ste||ar

03:25 aserio has quit [Quit: aserio]

03:25 nanashi64 has joined #ste||ar

03:25 nanashi55 has quit [Ping timeout: 265 seconds]

03:25 nanashi64 is now known as nanashi55

04:04 nikunj97 has joined #ste||ar

05:41 twwright has quit [Read error: Connection reset by peer]

05:41 twwright_ has joined #ste||ar

05:41 anushi has quit [Ping timeout: 260 seconds]

05:45 anushi has joined #ste||ar

05:56 <M-ms> [ PERFSTAT ] (samples=100 mean=6.85 median=4.40 min=3.83 stddev=3.83 (55.9%))

05:56 <M-ms> [ PERFSTAT ] (samples=18 mean=2.93 median=2.92 min=2.74 stddev=0.09 (3.0%))

06:27 jaafar has quit [Ping timeout: 240 seconds]

06:29 nikunj[m] has joined #ste||ar

06:30 mcopik has joined #ste||ar

06:38 jbjnr has joined #ste||ar

06:40 mcopik has quit [Read error: Connection reset by peer]

06:48 <nikunj97> jbjnr, yt?

06:48 <jbjnr> here

06:48 <nikunj97> jbjnr, could you try building this branch. https://github.com/STEllAR-GROUP/hpx/pull/3375

06:49 <nikunj97> with -DHPX_WITH_DYNAMIC_HPX_MAIN=ON

06:49 <nikunj97> I wanted to check if the patch made things work out on ppc

06:49 <jbjnr> ok. checking out now. will leave build running whilst I go for coffee

06:49 <jbjnr> report back in a bit

06:49 <nikunj97> jbjnr, sure

07:03 mcopik has joined #ste||ar

07:04 nikunj97 has quit [Quit: Leaving]

07:11 <jbjnr> nikunj[m]: I used cmake -DHPX_WITH_DYNAMIC_HPX_MAIN=ON . and now hello world runs correctly. Nice one.

07:17 nikunj[m] has quit [Ping timeout: 276 seconds]

07:40 anushi has quit [Remote host closed the connection]

07:40 anushi has joined #ste||ar

07:47 david_pfander has joined #ste||ar

07:47 nikunj[m] has joined #ste||ar

07:48 <nikunj[m]> @jbjnr, that's good to hear that hpx now runs well on powerpc as well

07:48 <nikunj[m]> Did you run tests as well?

07:48 <jbjnr> yes

07:49 <jbjnr> 94% tests passed, 35 tests failed out of 573

07:49 <jbjnr> The same ones failing as I had with the DYNAM?/ic MAIN off

07:49 <jbjnr> ^oops

07:49 <nikunj[m]> @jbjnr then I don't think it has something to do with my code

07:50 <jbjnr> nope. it's something fishy on powerpc that doesn't happen on other linux falvours

07:50 <jbjnr> ^flavours

07:50 <jbjnr> I will investigate what's going on.

07:50 <nikunj[m]> @jbjnr if you find out anything related to my implementation please let me know

07:51 <jbjnr> of course.

07:51 <jbjnr> I suspect a race condition that was previously unknown

07:51 <nikunj[m]> @jbjnr could you please try building phylanx if possible

07:51 <nikunj[m]> @jbjnr that's odd

07:51 <jbjnr> not phylanx. I have too much work to do to get involved with another project

07:51 <jbjnr> got deadlines to meet here

07:51 <jbjnr> sorry.

07:52 <nikunj[m]> @jbjnr no worries

07:52 mcopik has quit [Read error: Connection reset by peer]

07:53 <jbjnr> race condition because we will be using 160 threads and most other tests are using 8 or 16. Quite possible there's a problem in the scheduling somewhere or some place in parallel:algorithms that is not triggered frequently

07:53 <nikunj[m]> @jbjnr sounds right

07:55 <zao> Good moroning, all!

07:56 <nikunj[m]> zao good morning!

07:58 <jbjnr> good morning

07:59 anushi has quit [Ping timeout: 276 seconds]

08:08 nikunj[m] has quit [Ping timeout: 252 seconds]

08:08 jbjnr_ has joined #ste||ar

08:09 jbjnr has quit [Ping timeout: 240 seconds]

08:09 mcopik has joined #ste||ar

08:13 <heller___> jbjnr_: which tests fail?

08:13 <jbjnr_> this is the list from yesterday https://gist.github.com/biddisco/10f1e782c86a0483bc752753c3d1c056

08:17 mcopik has quit [Ping timeout: 240 seconds]

08:41 jbjnr__ has joined #ste||ar

08:44 jbjnr_ has quit [Ping timeout: 240 seconds]

08:52 mcopik has joined #ste||ar

09:00 mcopik has quit [Ping timeout: 260 seconds]

09:06 mcopik has joined #ste||ar

09:16 mcopik has quit [Ping timeout: 244 seconds]

09:45 jakub_golinowski has joined #ste||ar

09:46 <jakub_golinowski> M-ms, so as you mentioned one of the important things to spot is that the opencv tests seem to be somehow randomized

09:49 <jbjnr__> Testing on PowerPC Debug build : 99% tests passed, 2 tests failed out of 573

09:50 jakub_golinowski has quit [Ping timeout: 240 seconds]

09:57 <heller___> jbjnr__: not bad :)

09:57 <jbjnr__> one timeout, one fail

09:57 <jbjnr__> questio is - why does release mode trigger problems ...

10:00 <heller___> no idea

10:00 <heller___> what's the output of the tests?

10:00 <heller___> why do they fail?

10:24 <jbjnr__> heller___: https://gist.github.com/biddisco/c05978233004587758b2abca1c954477

10:41 nikunj[m] has joined #ste||ar

10:50 nikunj[m] has quit [Ping timeout: 252 seconds]

10:54 mcopik has joined #ste||ar

11:24 <jbjnr__> heller___: is there a pattern in this set of fails : https://gist.github.com/biddisco/7980a7160183093e4e9f582fc150f234

12:04 mcopik has quit [Ping timeout: 240 seconds]

12:09 hkaiser has joined #ste||ar

12:15 <heller___> jbjnr__: can't see any

12:15 <heller___> could be anything

12:22 Chewbakka has joined #ste||ar

12:25 Chewbakk_ has joined #ste||ar

12:26 <Chewbakk_> Hi, we have a few questions regarding the distribution policy of partitioned vectors: Does the default distribution policy distributes the vector blockwise? If yes, is it possible to control which locality own which block chunk? I am also asking this because I read that it could also be partitioned in a round robin manner

12:27 <jbjnr__> I believe you can supply your own distribution policy

12:27 <jbjnr__> but I've never used the partitioned vectors so I can't recall the details

12:28 Chewbakka has quit [Ping timeout: 240 seconds]

12:28 <hkaiser> Chewbakk_: I think the default is to distribute the blocks round robin

12:30 anushi has joined #ste||ar

12:30 <Chewbakk_> Is there no given block chunking policy? I also read this here: https://github.com/STEllAR-GROUP/hpx/blob/master/hpx/components/containers/container_distribution_policy.hpp#L29

12:30 <Chewbakk_> This seems like a common scenario to me

12:31 <hkaiser> Chewbakk_: what would you like to achieve?

12:31 anushi has quit [Remote host closed the connection]

12:33 anushi has joined #ste||ar

12:33 <Chewbakk_> For given k localities and n the size of the partitioned vector: Locality 0 owns the first k data entries, locality 1 the next k data entries, ...

12:33 <hkaiser> Chewbakk_: that is the default, yes

12:33 anushi has quit [Remote host closed the connection]

12:34 <hkaiser> the blocks (not elements) are distributed round robin, by default to as many localities as there are connected to the application

12:35 <hkaiser> one block per locality

12:35 <Chewbakk_> ah perfect, thank you

12:44 anushi has joined #ste||ar

12:47 anushi has quit [Remote host closed the connection]

12:48 anushi has joined #ste||ar

12:53 K-ballo has joined #ste||ar

13:00 <jbjnr__> hkaiser: same question to you : is there a pattern in this set of fails : https://gist.github.com/biddisco/7980a7160183093e4e9f582fc150f234

13:01 <jbjnr__> if I run those tests using hpx:threads=1 then they are passing.

13:01 <jbjnr__> (didn't try all of them, but those I tried passed)

13:01 <hkaiser> don't see any - does this happen always of just for a particular rundom number?

13:01 <jbjnr__> always AFAICT

13:01 <hkaiser> is that on the ppc system?

13:02 <jbjnr__> oooh

13:02 <jbjnr__> yes PPC

13:02 <hkaiser> k

13:02 <jbjnr__> I tried find_test again and again and it just passed once

13:02 <hkaiser> could be a race that does not show itself on x86

13:02 <jbjnr__> yup. it's a race.

13:02 <jbjnr__> same seed and it failed

13:02 <hkaiser> ppc has a more rleaxed memory model than the x86

13:02 <jbjnr__> that's annoying. How will I track it down!

13:03 <hkaiser> annoying, indeed

13:03 <hkaiser> well, as always - try to minimize to the smallest possible reliably failing test case

13:05 <hkaiser> looks like to be something fundamental, not related to a specific algorithm

13:05 <hkaiser> even util.function fails

13:06 <hkaiser> jbjnr__: if you asked me to venture a guess, I'd say look at condition_variable

13:06 <hkaiser> but that's a guess only - could be anything

13:07 <jbjnr__> gdb shows me exceptions thrown in STL vector and algorithms

13:07 <jbjnr__> vector assign triggers exception in reduce_by_key, find in find_test

13:07 <jbjnr__> some corruption must be going on

13:07 <hkaiser> so something went out of scope too early

13:07 <jbjnr__> could be

13:08 <jbjnr__> I will play more. Odd that they pass in debug mode, but fail in release

13:08 <hkaiser> not odd at all

13:08 <jbjnr__> what do we do differently

13:08 <hkaiser> different timings, less contention, etc.

13:08 <hkaiser> less pressure on the synchronization

13:09 Chewbakka has joined #ste||ar

13:09 jbjnr__ has quit [Read error: Connection reset by peer]

13:09 <hkaiser> do you have something like the intel tools available on those machines?

13:10 <hkaiser> or the clang sanitizers

13:12 Chewbakk_ has quit [Ping timeout: 240 seconds]

13:18 jbjnr has joined #ste||ar

13:25 mbremer has quit [Quit: Page closed]

13:30 Chewbakk_ has joined #ste||ar

13:31 <hkaiser> jbjnr: clang sanitizers could help

13:32 <hkaiser> also, I'd start with making all atomics sequentially consistent, i.e. remove all memory_order parameters

13:33 <hkaiser> good chance we got those wrong...

13:33 <jbjnr> from a .load(thig, std:memory_order_relaxed) - you mean just remove the order completely

13:33 <jbjnr> and do that for all

13:34 <hkaiser> yes

13:34 <jbjnr> .load, .store

13:34 <hkaiser> exchange

13:34 <jbjnr> to see if it get's better

13:34 <hkaiser> right

13:34 Chewbakka has quit [Ping timeout: 244 seconds]

13:34 <jbjnr> what about acquire/release - leave them?

13:35 <hkaiser> no, remove those

13:35 <hkaiser> make everything seq consistent

13:35 <jbjnr> worth a try

13:35 <hkaiser> slows things down a bit, but more predictable

13:35 hkaiser has quit [Quit: bye]

13:55 anushi has quit [Remote host closed the connection]

14:12 hkaiser has joined #ste||ar

14:18 <jbjnr> hkaiser: I think your plan is working

14:18 <jbjnr> almost

14:20 <jbjnr> hmmm. perhaps not

14:20 <jbjnr> waiting for tests .....

14:22 <K-ballo> oh, right, powerpc

14:25 Chewbakka has joined #ste||ar

14:29 Chewbakk_ has quit [Ping timeout: 265 seconds]

14:41 <jbjnr> ok hkaiser making everything sequentially consistent did not fix the errors

14:42 <jbjnr> hkaiser: https://gist.github.com/biddisco/7980a7160183093e4e9f582fc150f234#file-powerpc-hpx-L64

14:42 <jbjnr> just FYI

14:43 <jbjnr> so we have a race, but not an abvious atomic type fix

14:44 Chewbakka has quit [Remote host closed the connection]

14:49 Chewbakka has joined #ste||ar

14:52 <hkaiser> jbjnr: so same picture even with seq consistent atomics?

14:57 <Chewbakka> Hey, we are trying to compile our HPX program with debug symbols using cmake (we set CMAKE_BUILD_TYPE to Debug). However, we get the following linking error https://gist.github.com/JeannedArk/1a3b21971ef997734533999cb2d30c4b What are we missing?

15:03 <hkaiser> Chewbakka: is everything built using Debug?

15:03 <hkaiser> i.e. your library too?

15:04 <hkaiser> how do you build your libarry/executable?

15:04 <hkaiser> cmake?

15:04 <Chewbakka> which libriaries e.g. boost? We only set the cmake flag in our application

15:04 <Chewbakka> yes with cmake

15:04 <hkaiser> you should build HPX using Debug as well

15:07 <Chewbakka> mh ok, that will take a while

15:07 <Chewbakka> thank you

15:11 jaafar has joined #ste||ar

15:24 galabc has joined #ste||ar

15:28 Chewbakka has quit [Quit: Leaving...]

15:59 nikunj has joined #ste||ar

16:01 mbremer has joined #ste||ar

16:03 <nikunj> hkaiser, yt?

16:03 <hkaiser> here

16:03 <mbremer> Hey guys, is hpx master broken at the moment? I pulled and rebuilt a docker container recently and keep getting errors like "hpx::init: can't initialize runtime system more than once! Exiting..."

16:03 <nikunj> hkaiser, wrapping main works fine on ppc as well

16:04 <nikunj> mbremer, did you try calling hpx::init from main after including hpx_main?

16:04 <hkaiser> mbremer: all should be well since yesterday, when did you pull last?

16:05 <mbremer> I rebuilt today. Let me look at the commit to be sure.

16:06 <mbremer> hkaiser: Here is my main https://github.com/UT-CHG/dgswemv2/blob/master/examples/ehdg_swe_manufactured_solution/manufactured_hpx_main_swe.cpp

16:06 <mbremer> I guess I am calling hpx::init after including hpx_main

16:07 <nikunj> mbremer, if you remove hpx_main from there things should work fine

16:07 <hkaiser> no

16:07 <hkaiser> mbremer: try using -DHPX_WITH_DYNAMIC_HPX_MAIN=OFF

16:07 <nikunj> actually the new implementation works when you include hpx_main in your file. Therefore, it is already initialized from main

16:08 <nikunj> hpx system ^^

16:09 <mbremer> kk, @hkaiser will try (I'll be back after lunch; container takes a while)

16:09 <mbremer> Thanks nikunj, @hkaiser

16:13 anushi has joined #ste||ar

16:14 <hkaiser> nikunj: I hope this lessen teaches us to be more careful with updating master - a lot of people are affected

16:14 <nikunj> hkaiser, yes. I won't hurry anymore. Will check thoroughly everything before adding a pr

16:18 <nikunj> hkaiser, this pr will enable things for ppc as well. I tested with jbjnr and things work pretty fine on his pc: https://github.com/STEllAR-GROUP/hpx/pull/3375

16:20 <hkaiser> nod, he said so him self

16:32 <github> [hpx] hkaiser created integrate_hpxmp (+1 new commit): https://git.io/fNLAw

16:32 <github> hpx/integrate_hpxmp c6df77a Hartmut Kaiser: Adding build system support to integrate hpxmp into hpx at the user's machine

16:33 <github> [hpx] hkaiser opened pull request #3377: Adding build system support to integrate hpxmp into hpx at the user's machine (master...integrate_hpxmp) https://git.io/fNLAM

16:50 <nikunj> hkaiser: since the error "hpx::init: can't initialize runtime system more than once! Exiting..." would not explain much if hpx_main is included and hpx_init is called once. So I think I should add another check if hpx_main is included and then hpx_init is called and print out an error corresponding to it (something like hpx system is already initialized from main. Remove hpx_main.hpp to use hpx_init functionality)

16:50 <nikunj> do you agree?

17:09 <hkaiser> nikunj: if you think you can diagnose that, sure - would be absolutely appreciated

17:09 quaz0r has quit [Ping timeout: 264 seconds]

17:10 <nikunj> hkaiser, added code specific to it. Testing it currently

17:20 anushi has quit [Read error: Connection reset by peer]

17:31 anushi has joined #ste||ar

17:45 quaz0r has joined #ste||ar

17:50 <nikunj> hkaiser: to achieve the above error symbol I will have to add another weak symbol (same as that of hpx_wrap.cpp) to libhpx_init.a

17:50 <hkaiser> ok

17:50 <hkaiser> let's keep this change independent of the current PR

17:50 <nikunj> wait no

17:51 <nikunj> I think I might be missing something

17:51 <nikunj> hkaiser: ok I will keep things independent of the current stable pr

17:52 <nikunj> zao, yt?

17:52 <zao> meep

17:53 <nikunj> zao, could you test the pr as well. Just to be sure everything is working on your end as well. here: https://github.com/stellAR-GROUP/hpx/pull/3375

18:01 <K-ballo> case insensitive github

18:10 <zao> Oh right... this beeped.

18:42 <nikunj> zao, did you try running it?

18:42 <zao> nikunj: Build failure.

18:42 <nikunj> oh

18:42 <nikunj> what does it says

18:43 <zao> Not sure.

18:43 <zao> Hrm...

18:43 <zao> Ah no, not build failure.

18:43 <zao> `tests` target didn't run tests.

18:43 <zao> Keep mixing up the targets.

18:43 <mbremer> @hkaiser, nikunj: Cmake flag did the trick. Thanks

18:44 <nikunj> zao, could you share a gist

18:45 <zao> nikunj: It didn't fail. I just got so many warnings and test-suite didn't start.

18:45 <zao> So I assumed there was a failure somewhere.

18:45 <nikunj> oh.

18:46 mbremer has quit [Quit: Page closed]

18:54 <zao> `100% tests passed, 0 tests failed out of 573`

18:54 <zao> boooring.

18:54 <nikunj> zao, good to hear that!

18:54 <nikunj> so things are working for both x86 and ppc

18:56 <K-ballo> that's not normal

18:57 <zao> :D

18:58 anushi has quit [Ping timeout: 265 seconds]

19:05 eschnett has joined #ste||ar

19:06 anushi has joined #ste||ar

19:16 galabc has quit [Quit: Leaving]

19:20 mcopik has joined #ste||ar

19:30 <nikunj> hkaiser: do you think I can make use of the weak symbol (the one in hpx_wrap) inside of libhpx.so?

20:10 <nikunj> hkaiser, I will think of a way to add error to make debugging it easier in the morning.

20:10 nikunj has quit [Quit: goodnight]

20:15 <jbjnr> hkaiser: correct. seq_consistency didn't help. Got any other reasonably simple to try ideas that might shed light?

20:19 <hkaiser> jbjnr: clang sanitizers

20:26 hkaiser has quit [Quit: bye]

20:45 eschnett has quit [Quit: eschnett]

20:50 mcopik has quit [Ping timeout: 260 seconds]

20:56 hkaiser has joined #ste||ar

21:11 K-ballo has quit [Quit: K-ballo]

21:18 mcopik has joined #ste||ar

21:23 mcopik has quit [Ping timeout: 244 seconds]

21:55 galabc has joined #ste||ar

22:01 quaz0r has quit [Ping timeout: 240 seconds]

22:05 <hkaiser> parsa[w]: you can add languages in cmake after the project() statement: enable_language()

22:10 K-ballo has joined #ste||ar

22:25 jakub_golinowski has joined #ste||ar

22:27 quaz0r has joined #ste||ar

22:43 V|r has quit [Ping timeout: 265 seconds]

22:49 jakub_golinowski has quit [Ping timeout: 256 seconds]

22:52 galabc has quit [Quit: Leaving]

23:01 diehlpk has joined #ste||ar

23:28 diehlpk has quit [Ping timeout: 268 seconds]

23:46 K-ballo has quit [Ping timeout: 240 seconds]

23:53 K-ballo has joined #ste||ar