aserio changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/
EverYoung has quit [Ping timeout: 240 seconds]
<github> [hpx] hkaiser deleted fixing_2940 at 738dec5: https://git.io/vdMKD
jaafar has quit [Ping timeout: 248 seconds]
gedaj has quit [Remote host closed the connection]
gedaj has joined #ste||ar
jaafar has joined #ste||ar
jaafar has quit [Remote host closed the connection]
jaafar has joined #ste||ar
hkaiser has quit [Quit: bye]
K-ballo has quit [Quit: K-ballo]
EverYoung has joined #ste||ar
EverYoung has quit [Ping timeout: 248 seconds]
EverYoung has joined #ste||ar
zbyerly_ has quit [Ping timeout: 252 seconds]
EverYoung has quit [Ping timeout: 252 seconds]
zbyerly_ has joined #ste||ar
gedaj has quit [Remote host closed the connection]
gedaj has joined #ste||ar
david_pfander has joined #ste||ar
<heller> jbjnr: where do you experience hangs?
<jbjnr> heller: just submitted issue #2949
<heller> jbjnr: thanks
<jbjnr> bin/simple_resource_partitioner --use-scheduler --use-pools --pool-threads=1 --hpx:threads=3 --hpx:bind=balanced
<jbjnr> for example
<heller> ok
<github> [hpx] biddisco force-pushed namespace_error from 01148c4 to 2e65f2a: https://git.io/vdMA2
<github> hpx/namespace_error 2e65f2a John Biddiscombe: Fix a namespace compilation error when some schedulers are disabled
<github> [hpx] biddisco opened pull request #2950: Fix a namespace compilation error when some schedulers are disabled (master...namespace_error) https://git.io/vdMAw
<heller> jbjnr: we have bigger problems right, as it seems
<jbjnr> ?
<jbjnr> don't understand your comment
<heller> right now*
<heller> the recent function changes seem to have broken distributed runs
<jbjnr> yes. things are not behaving as expected. hello world in distributed gives extra output etc.
<heller> yes
<heller> trying
<jbjnr> for my lockup, it seems that the scheduling loop gets stuck because the background thread does not complete properly
<heller> ok
<heller> I thought I fixed it
<heller> i am looking into it right now
EverYoung has joined #ste||ar
hkaiser has joined #ste||ar
EverYoung has quit [Ping timeout: 240 seconds]
david_pfander has quit [Ping timeout: 240 seconds]
heller has quit [Quit: http://quassel-irc.org - Chat comfortably. Anywhere.]
heller has joined #ste||ar
<heller> hkaiser: good morning
<heller> hkaiser: the vtable change broke function serialization
<hkaiser> heller: uhh
<hkaiser> why's that?
<heller> see here for example
<heller> the automatic registration doesn't kick in anymore
<heller> for some reason...
<hkaiser> ok, will have a look
<hkaiser> it might be that the global constructor doing the registration is now called too late
<heller> my guess is that the concrete vtable instance doesn't get instantiated in time
<heller> yes
<hkaiser> ok, will look - thanks for the heads-up
<heller> hkaiser: did you run into a concrete problem to create #2945?
<hkaiser> yes
<heller> where is the testcase for it ;)?
<hkaiser> in my code
<heller> do you have a minimal reproducible testcase for that?
<hkaiser> not really minimal, but yes
<hkaiser> happens whenever you initialize hpx in a global constructor
<hkaiser> well, can potentially happen - depends on many other things as well
<heller> static initialization order fiasco then, I guess
<hkaiser> right
<zao> Like Boost.Path, but more fun!
<zao> Mem[||||||||||||||||||||||||||||||||||31.1G/31.4G] Tasks: 73, 0 thr; 10 running
<zao> Swp[||||||||||||||||||||||||| 8.61G/16.0G] Load average: 18.78 18.34 17.81
* zao shakes a fist at tests
<hkaiser> heller: not quite clear why the registration fails now - the code looks ok
<heller> hkaiser: can you reproduce the failure on MSVC?
<hkaiser> will try
<hkaiser> heller: this happens during de-serialization, right?
<hkaiser> as a side effect of the change the construction of the vtable is now delayed until somebody actively asks for it
<hkaiser> we need to find the spot where we need to ask for it for the serialization to work
<hkaiser> I think we need to insert a call to serializable_function_vtable<VTable>::get_vtable() here: https://github.com/STEllAR-GROUP/hpx/blob/master/hpx/util/detail/vtable/serializable_function_vtable.hpp#L50
<hkaiser> heller: do you have it reproduced?
<heller> hkaiser: I have the error reproduced, yes, it happens during deserialization, yes. the type is not registered on the receiving end
<heller> hkaiser: I am not sure if we can live without the global ctors
<hkaiser> should be possible
<heller> we have them all over the place
<heller> what was the exact error you ran into?
<hkaiser> bad enough
<heller> how is this even possible :(
<hkaiser> heller: forcing the vtable initialization in the default constructor of basic_function should work as well
<hkaiser> well, hpx uses functions during initialization - this is triggered by the static initialization order being wrong
<heller> in order to instantiate a vtable, I need the concrete function type
<hkaiser> you have that in the default constructor of basic_function
<heller> I don't have the target type
<heller> which is the essential part of it
<hkaiser> R(Ts...)
<heller> no, that's the signature
<heller> I need the type of the object that is being called eventually
<hkaiser> k
<hkaiser> well, then we're screwed
<hkaiser> let's roll back the change, I'll find another way around it
<hkaiser> nod
<heller> hkaiser: one way to fix this would be to explicitly register the functions, I think
<hkaiser> yah
<hkaiser> could help
<hkaiser> heller: #2951
<heller> jbjnr: any news about this jenkins thingy?
<hkaiser> he's travelling
<heller> he was online earlier
david_pfander has joined #ste||ar
<jbjnr> heller: no news about jenkins yet. will have to wait till I get back at the end of next week.
<heller> ok
<jbjnr> Anyway, I thought you were doing some fancy super circle ci stuff ...
<jbjnr> but I guess that's only 4 cores.
<heller> yeah
<heller> the circle stuff isn't going anywhere ...
<heller> they are not willing to ramp up our resources
<heller> oh my gosh ... I just found a *very* stupid mistake
<heller> how did that *ever* work?
<jbjnr> what's wrong with it?
<heller> it's calling a function which is supposed to be guarded by a lock
<heller> looks like it was a merge error or something though
<heller> at least, I am to blame :/
<github> [hpx] sithhell created fix_thread_map (+1 new commit): https://git.io/vdDYW
<github> hpx/fix_thread_map d83c949 Thomas Heller: Removing wrong call to cleanup_terminated_locked...
<heller> jbjnr: https://github.com/STEllAR-GROUP/hpx/pull/2952 <-- this should fix the deadlocks
<github> [hpx] sithhell opened pull request #2952: Removing wrong call to cleanup_terminated_locked (master...fix_thread_map) https://git.io/vdDYz
<github> [hpx] hkaiser created fixing_2947 (+1 new commit): https://git.io/vdDOm
<github> hpx/fixing_2947 a5d12ef Hartmut Kaiser: Making sure any hpx.os_threads=N supplied through a --hpx::config file is taken into account...
<github> [hpx] hkaiser opened pull request #2953: Making sure any hpx.os_threads=N supplied through a --hpx::config file is taken into account (master...fixing_2947) https://git.io/vdDOG
<hkaiser> heller: great, thanks!
<heller> very stupid :/
K-ballo has joined #ste||ar
<zao> Blargh... seeing a ton of timeouts on distributed.tcp today.
<heller> yeah
<heller> we just reverted a bad commit
<heller> zao: what should we do about std::rand in unit tests now?
<zao> Kill with fire, replace reasonably mechanically with a MT + uniform int distribution where needed, and use a constant where you don't really need randomness?
<hkaiser> has std::rand been deprecated now?
<zao> My personal opinion is that unless you intend to run a test a lot of times to find problems, randomization only leads to flapping.
<hkaiser> zao: that's what we do
<zao> Once-per-commit is way too seldom if the goal is to test different datasets.
<hkaiser> run them very often ;)
<zao> Well, you don't run them more than once in CI/buildbot?
<hkaiser> zao: we do that for years now, quite successfully, btw
<heller> hkaiser: the problem is that apparently, we run into UB with some std::rand uses
<hkaiser> heller: what?
<hkaiser> how's that?
<zao> hkaiser: The problem last week, was that std::rand returned a number close to INT_MAX.
<hkaiser> ok
<zao> Which we overflowed and ended up calling uniform_int_distribution(base - x, base + x)
<zao> Which is UB.
<hkaiser> we don't use uniform_distribution, do we?
<zao> Found it due to blind luck where the seed for std::rand was such that it triggered the lingering bug.
<hkaiser> see, so there is your use case
<zao> hkaiser: I don't mind such soak tests, but I disagree with them being part of the regular test suite if they just run once per build.
<hkaiser> where is that uniform_int_distribution?
<zao> Via test_partition_heavy
<hkaiser> ok, but then this a bug, not the use of std::rand
<zao> The problem is using the result of std::rand without any range clamping.
<hkaiser> absolutely
<zao> std::rand has an implementation defined range, which hides the problem on say MSVC.
<hkaiser> right
<zao> It also may(?) have an arbitrary implementation, which makes the results hard to reproduce on other machines.
<hkaiser> that is definitely a good point
<zao> If it used say a MT, we'd be deterministic across boxen.
<hkaiser> are other generators portable across machines?
<zao> Yes.
<hkaiser> that would be beneficial indeed
<zao> Another benefit is that we are not affected by any other use of std::rand in the process.
<zao> Which would be rude by libraries, but quite possible.
<hkaiser> ok, I'll try to find somebody to do the work
<zao> For example, this test generates a rand_base per test, and if something like say hwloc invokes rand(), we get divergence.
<zao> Interesting fact, it took over four hundred runs of the test suite to trigger this problem on my machine.
<zao> And it feels like it's quite lucky even then :)
<hkaiser> zao: do we have a ticket for the partitioner test problem?
<zao> No, I have not filed this.
<zao> Only discussed it with heller and K-ballo on IRC as it came up.
<zao> (I was in meetings)
<hkaiser> may I ask to create a ticket?
<jbjnr> heller: I cherry picked your commit, but my test still locks up
<zao> hkaiser: In a workshop all day today, but I'll try to whip something up toward the evening.
<heller> jbjnr: yeah, just saw the lock up as well
<heller> jbjnr: still working on it
<heller> jbjnr: the PR fixes another, more serious problem
<jbjnr> ok
eschnett has quit [Quit: eschnett]
<hkaiser> zao: thanks
<K-ballo> hkaiser: the vtables change was bad?
<hkaiser> K-ballo: yah, it broke serialization
<K-ballo> interesting
<hkaiser> actually it broke de-serialization as the delayed vtable construction was not triggered
<K-ballo> ah, right, and the construction does the registration
<hkaiser> yes
<zao> Gah, ctest overwrote my results :(
eschnett has joined #ste||ar
parsa[[[w]]] has quit [Read error: Connection reset by peer]
parsa[[w]] has joined #ste||ar
eschnett has quit [Quit: eschnett]
K-ballo has quit [Ping timeout: 248 seconds]
K-ballo has joined #ste||ar
K-ballo has quit [Read error: Connection reset by peer]
aserio has joined #ste||ar
K-ballo has joined #ste||ar
<zao> Might've misspelled the test name, didn't have any output handy.
<msimberg> I was trying to account for all the threads that hpx starts, and with jbjnr we got to n worker threads, 2 (default) io pool threads, 2 timer pool threads, 2 parcel pool threads and the wait_helper to wait for finalize. Is this correct? Would hpx spawn threads for any other purposes?
<hkaiser> zao: ok, thanks!
<hkaiser> msimberg: no
<hkaiser> msimberg: wait, there is also the main thread - but that's not started by hpx
aserio has quit [Read error: Connection reset by peer]
<msimberg> no, meaning no other threads, or not correct? and yeah, I ignored the main thread
hkaiser has quit [Read error: Connection reset by peer]
hkaiser has joined #ste||ar
<hkaiser> msimberg: 'no' as in hpx does not spawn any other threads
aserio has joined #ste||ar
<msimberg> ok, thanks!
eschnett has joined #ste||ar
<hkaiser> aserio: yt?
<aserio> hkaiser: yes
<hkaiser> see pm, pls
EverYoung has joined #ste||ar
jaafar has quit [Ping timeout: 248 seconds]
gedaj has quit [Remote host closed the connection]
gedaj has joined #ste||ar
EverYoung has quit [Remote host closed the connection]
EverYoung has joined #ste||ar
diehlpk_work has joined #ste||ar
rod_t has joined #ste||ar
bibek_desktop has quit [Quit: Leaving]
Bibek has joined #ste||ar
david_pfander has quit [Ping timeout: 248 seconds]
EverYoung has quit [Ping timeout: 246 seconds]
EverYoung has joined #ste||ar
aserio1 has joined #ste||ar
aserio has quit [Ping timeout: 252 seconds]
aserio1 is now known as aserio
<heller> hkaiser: hmm, everything but the throttle thing is what I am able to fix right now :/
<heller> removing and adding PUs dynamically might need some more revised rework of the scheduling loop
<heller> the scheduling loop is currently designed to really run from start to finish ...
<hkaiser> ok
<hkaiser> I need this functionality so I will work on it soon
<heller> what do you need it for?
<hkaiser> switching back & forth between MPI/OpenMP step and HPX step
<heller> ok
<heller> for that it would be overkill to kill off the entire thread anyways
<heller> a condition variable which controls whether this thread is active or not sounds more suitable
<hkaiser> nobody said we should kill the threads
<heller> yes, that is how it was sketched to be implemented
<github> [hpx] sithhell pushed 1 new commit to fix_thread_map: https://git.io/vdDx6
<github> hpx/fix_thread_map 36544eb Thomas Heller: Partially reverting background thread handling during shutdown...
<heller> hkaiser: I'll do it properly then tomorrow
hkaiser has quit [Read error: Connection reset by peer]
hkaiser has joined #ste||ar
<hkaiser> heller: yah, let's see how it goes
<heller> hkaiser: let's merge what I have ASAP ... this should at least put master in a working condition again for the rest of the world
<hkaiser> fine by me
<heller> #2955
<heller> I am getting more and more angry emails from my EU partners ;)
<hkaiser> why?
<heller> "nothing ever works! Do you even do CI?!"
aserio has quit [Read error: Connection reset by peer]
<heller> My reply: "Yes, we do CI, that's how I know that it is currently broken"
<zao> Coincident Irritation.
<hkaiser> lol
<heller> made them even angrier
<hkaiser> idiots
<hkaiser> know everything better, as usual
<heller> yeah
<hkaiser> how many lines of code do they have? 10? 20?
<K-ballo> this channel is being recorded for quality purposes
<heller> K-ballo: thanks ;)
<heller> anyways
<heller> hkaiser: quite a few actually
<heller> the code that crosses the boundaries is always the hardest work
<hkaiser> absolutely
<heller> and we are in an unfortunate situation that the runtime is the place where the dots get connected
aserio has joined #ste||ar
<hkaiser> heller: do you have to work off top of master?
<hkaiser> why not selecting a 'stable' commit?
<heller> the recent pool executor changes forced us
<hkaiser> otoh, they can't expect for major changes in between releases to be 100% stable
<hkaiser> that's nonsense
<heller> the entire project is WIP, it's research
<heller> code breaks
<heller> always
<heller> or is dead
EverYoung has quit [Ping timeout: 246 seconds]
EverYoung has joined #ste||ar
hkaiser has quit [Quit: bye]
aserio has quit [Ping timeout: 240 seconds]
hkaiser has joined #ste||ar
aserio has joined #ste||ar
aserio has quit [Ping timeout: 240 seconds]
<jbjnr> hkaiser: just noticed this conversation and wanted to point out that adding an hpx::suspend and hpx::resume is what msimberg is heading for, so the two of you should have a skype call soon. I can join too if I'm not presenting or anything.
hkaiser has quit [Read error: Connection reset by peer]
hkaiser has joined #ste||ar
<hkaiser> jbjnr: absolutely
<hkaiser> jbjnr: what class will expose those?
<hkaiser> msimberg: ^^
hkaiser has quit [Client Quit]
aserio has joined #ste||ar
zbyerly_ has quit [Remote host closed the connection]
zbyerly_ has joined #ste||ar
rod_t has left #ste||ar [#ste||ar]
hkaiser has joined #ste||ar
EverYoun_ has joined #ste||ar
EverYoung has quit [Ping timeout: 252 seconds]
<jbjnr> heller: just fyi - I saw that you pushed another commit to the thread handling, but it does not fix the hangs for me. sorry.
<heller> jbjnr: shit
<heller> jbjnr: thanks for letting me know. Still the same reproduction?
rod_t has joined #ste||ar
<github> [hpx] chinz07 opened pull request #2957: Fixing errors generated by mixing different attribute syntaxes (master...fixing_2956) https://git.io/vdyWV
EverYoun_ has quit [Remote host closed the connection]
EverYoung has joined #ste||ar
eschnett has quit [Quit: eschnett]
<github> [hpx] hkaiser closed pull request #2943: Changing channel actions to be direct (master...channel_direct) https://git.io/vd6MX
<hkaiser> msimberg: yt?
aserio has quit [Quit: aserio]
sam29 has joined #ste||ar
sam29 has left #ste||ar [#ste||ar]
EverYoung has quit [Ping timeout: 252 seconds]