aserio changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/
EverYoung has quit [Ping timeout: 240 seconds]
<github> [hpx] hkaiser deleted fixing_2940 at 738dec5: https://git.io/vdMKD
jaafar has quit [Ping timeout: 248 seconds]
gedaj has quit [Remote host closed the connection]
gedaj has joined #ste||ar
jaafar has joined #ste||ar
jaafar has quit [Remote host closed the connection]
jaafar has joined #ste||ar
hkaiser has quit [Quit: bye]
K-ballo has quit [Quit: K-ballo]
EverYoung has joined #ste||ar
EverYoung has quit [Ping timeout: 248 seconds]
EverYoung has joined #ste||ar
zbyerly_ has quit [Ping timeout: 252 seconds]
EverYoung has quit [Ping timeout: 252 seconds]
zbyerly_ has joined #ste||ar
gedaj has quit [Remote host closed the connection]
gedaj has joined #ste||ar
david_pfander has joined #ste||ar
<heller> jbjnr: where do you experience hangs?
<jbjnr> heller: just submitted issue #2949
<heller> jbjnr: thanks
<jbjnr> bin/simple_resource_partitioner --use-scheduler --use-pools --pool-threads=1 --hpx:threads=3 --hpx:bind=balanced
<jbjnr> for example
<heller> ok
<github> [hpx] biddisco force-pushed namespace_error from 01148c4 to 2e65f2a: https://git.io/vdMA2
<github> hpx/namespace_error 2e65f2a John Biddiscombe: Fix a namespace compilation error when some schedulers are disabled
<github> [hpx] biddisco opened pull request #2950: Fix a namespace compilation error when some schedulers are disabled (master...namespace_error) https://git.io/vdMAw
<heller> jbjnr: we have bigger problems right, as it seems
<jbjnr> ?
<jbjnr> don't understand your comment
<heller> right now*
<heller> the recent function changes seem to have broken distributed runs
<jbjnr> yes. things are not behaving as expected. hello world in distributed gives extra output etc.
<heller> yes
<heller> trying
<jbjnr> for my lockup, it seems that the scheduling loop gets stuck because the background thread does not complete properly
<heller> ok
<heller> I thought I fixed it
<heller> i am looking into it right now
EverYoung has joined #ste||ar
hkaiser has joined #ste||ar
EverYoung has quit [Ping timeout: 240 seconds]
david_pfander has quit [Ping timeout: 240 seconds]
heller has quit [Quit: http://quassel-irc.org - Chat comfortably. Anywhere.]
heller has joined #ste||ar
<heller> hkaiser: good morning
<heller> hkaiser: the vtable change broke function serialization
<hkaiser> heller: uhh
<hkaiser> why's that?
<heller> see here for example
<heller> the automatic registration doesn't kick in anymore
<heller> for some reason...
<hkaiser> ok, will have a look
<hkaiser> it might be that the global constructor doing the registration is now called too late
<heller> my guess is that the concrete vtable instance doesn't get instantiated in time
<heller> yes
<hkaiser> ok, will look - thanks for the heads-up
<heller> hkaiser: did you run into a concrete problem to create #2945?
<hkaiser> yes
<heller> where is the testcase for it ;)?
<hkaiser> in my code
<heller> do you have a minimal reproducible testcase for that?
<hkaiser> not really minimal, but yes
<hkaiser> happens whenever you initialize hpx in a global constructor
<hkaiser> well, can potentially happen - depends on many other things as well
<heller> static initialization order fiasco then, I guess
<hkaiser> right
<zao> Like Boost.Path, but more fun!
<zao> Mem[||||||||||||||||||||||||||||||||||31.1G/31.4G] Tasks: 73, 0 thr; 10 running
<zao> Swp[||||||||||||||||||||||||| 8.61G/16.0G] Load average: 18.78 18.34 17.81
* zao shakes a fist at tests
<hkaiser> heller: not quite clear why the registration fails now - the code looks ok
<heller> hkaiser: can you reproduce the failure on MSVC?
<hkaiser> will try
<hkaiser> heller: this happens during de-serialization, right?
<hkaiser> as a side effect of the change the construction of the vtable is now delayed until somebody actively asks for it
<hkaiser> we need to find the spot where we need to ask for it for the serialization to work
<hkaiser> I think we need to insert a call to serializable_function_vtable<VTable>::get_vtable() here: https://github.com/STEllAR-GROUP/hpx/blob/master/hpx/util/detail/vtable/serializable_function_vtable.hpp#L50
<hkaiser> heller: do you have it reproduced?
<heller> hkaiser: I have the error reproduced, yes, it happens during deserialization, yes. the type is not registered on the receiving end
<heller> hkaiser: I am not sure if we can live without the global ctors
<hkaiser> should be possible
<heller> we have them all over the place
<heller> what was the exact error you ran into?
<hkaiser> bad enough
<heller> how is this even possible :(
<hkaiser> heller: forcing the vtable initialization in the default constructor of basic_function should work as well
<hkaiser> well, hpx uses functions during initialization - this is triggered by the static initialization order being wrong
<heller> in order to instantiate a vtable, I need the concrete function type
<hkaiser> you have that in the default constructor of basic_function
<heller> I don't have the target type
<heller> which is the essential part of it
<hkaiser> R(Ts...)
<heller> no, that's the signature
<heller> I need the type of the object that is being called eventually
<hkaiser> k
<hkaiser> well, then we're screwed
<hkaiser> let's roll back the change, I'll find another way around it
<hkaiser> nod
<heller> hkaiser: one way to fix this would be to explicitly register the functions, I think
<hkaiser> yah
<hkaiser> could help
<hkaiser> heller: #2951
<heller> jbjnr: any news about this jenkins thingy?
<hkaiser> he's travelling
<heller> he was online earlier
david_pfander has joined #ste||ar
<jbjnr> heller: no news about jenkins yet. will have to wait till I get back at the end of next week.
<heller> ok
<jbjnr> Anyway, I thought you were doing some fancy super circle ci stuff ...
<jbjnr> but I guess that's only 4 cores.
<heller> yeah
<heller> the circle stuff isn't going anywhere ...
<heller> they are not willing to ramp up our resources
<heller> oh my gosh ... I just found a *very* stupid mistake
<heller> how did that *ever* work?
<jbjnr> what's wrong with it?
<heller> it's calling a function which is supposed to be guarded by a lock
<heller> looks like it was a merge error or something though
<heller> at least, I am to blame :/
<github> [hpx] sithhell created fix_thread_map (+1 new commit): https://git.io/vdDYW
<github> hpx/fix_thread_map d83c949 Thomas Heller: Removing wrong call to cleanup_terminated_locked...
<heller> jbjnr: https://github.com/STEllAR-GROUP/hpx/pull/2952 <-- this should fix the deadlocks
<github> [hpx] sithhell opened pull request #2952: Removing wrong call to cleanup_terminated_locked (master...fix_thread_map) https://git.io/vdDYz
<github> [hpx] hkaiser created fixing_2947 (+1 new commit): https://git.io/vdDOm
<github> hpx/fixing_2947 a5d12ef Hartmut Kaiser: Making sure any hpx.os_threads=N supplied through a --hpx::config file is taken into account...
<github> [hpx] hkaiser opened pull request #2953: Making sure any hpx.os_threads=N supplied through a --hpx::config file is taken into account (master...fixing_2947) https://git.io/vdDOG
<hkaiser> heller: great, thanks!
<heller> very stupid :/
K-ballo has joined #ste||ar
<zao> Blargh... seeing a ton of timeouts on distributed.tcp today.
<heller> yeah
<heller> we just reverted a bad commit
<heller> zao: what should we do about std::rand in unit tests now?
<zao> Kill with fire, replace reasonably mechanically with a MT + uniform int distribution where needed, and use a constant where you don't really need randomness?
<hkaiser> has std::rand been deprecated now?
<zao> My personal opinion is that unless you intend to run a test a lot of times to find problems, randomization only leads to flapping.
<hkaiser> zao: that's what we do
<zao> Once-per-commit is way too seldom if the goal is to test different datasets.
<hkaiser> run them very often ;)
<zao> Well, you don't run them more than once in CI/buildbot?
<hkaiser> zao: we do that for years now, quite successfully, btw
<heller> hkaiser: the problem is that apparently, we run into UB with some std::rand uses
<hkaiser> heller: what?
<hkaiser> how's that?
<zao> hkaiser: The problem last week, was that std::rand returned a number close to INT_MAX.
<hkaiser> ok
<zao> Which we overflowed and ended up calling uniform_int_distribution(base - x, base + x)
<zao> Which is UB.
<hkaiser> we don't use uniform_distribution, do we?
<zao> Found it due to blind luck where the seed for std::rand was such that it triggered the lingering bug.
<hkaiser> see, so there is your use case
<zao> hkaiser: I don't mind such soak tests, but I disagree with them being part of the regular test suite if they just run once per build.
<hkaiser> where is that uniform_int_distribution?
<zao> Via test_partition_heavy
<hkaiser> ok, but then this a bug, not the use of std::rand
<zao> The problem is using the result of std::rand without any range clamping.
<hkaiser> absolutely
<zao> std::rand has an implementation defined range, which hides the problem on say MSVC.
<hkaiser> right
<zao> It also may(?) have an arbitrary implementation, which makes the results hard to reproduce on other machines.
<hkaiser> that is definitely a good point
<zao> If it used say a MT, we'd be deterministic across boxen.
<hkaiser> are other generators portable across machines?
<zao> Yes.
<hkaiser> that would be beneficial indeed
<zao> Another benefit is that we are not affected by any other use of std::rand in the process.
<zao> Which would be rude by libraries, but quite possible.
<hkaiser> ok, I'll try to find somebody to do the work
<zao> For example, this test generates a rand_base per test, and if something like say hwloc invokes rand(), we get divergence.
<zao> Interesting fact, it took over four hundred runs of the test suite to trigger this problem on my machine.
<zao> And it feels like it's quite lucky even then :)
<hkaiser> zao: do we have a ticket for the partitioner test problem?
<zao> No, I have not filed this.
<zao> Only discussed it with heller and K-ballo on IRC as it came up.
<zao> (I was in meetings)
<hkaiser> may I ask to create a ticket?
<jbjnr> heller: I cherry picked your commit, but my test still locks up
<zao> hkaiser: In a workshop all day today, but I'll try to whip something up toward the evening.
<heller> jbjnr: yeah, just saw the lock up as well
<heller> jbjnr: still working on it
<heller> jbjnr: the PR fixes another, more serious problem
<jbjnr> ok
eschnett has quit [Quit: eschnett]
<hkaiser> zao: thanks
<K-ballo> hkaiser: the vtables change was bad?
<hkaiser> K-ballo: yah, it broke serialization
<K-ballo> interesting
<hkaiser> actually it broke de-serialization as the delayed vtable construction was not triggered
<K-ballo> ah, right, and the construction does the registration
<hkaiser> yes
<zao> Gah, ctest overwrote my results :(
eschnett has joined #ste||ar
parsa[[[w]]] has quit [Read error: Connection reset by peer]
parsa[[w]] has joined #ste||ar
eschnett has quit [Quit: eschnett]
K-ballo has quit [Ping timeout: 248 seconds]
K-ballo has joined #ste||ar
K-ballo has quit [Read error: Connection reset by peer]
aserio has joined #ste||ar
K-ballo has joined #ste||ar
<zao> Might've misspelled the test name, didn't have any output handy.
<msimberg> I was trying to account for all the threads that hpx starts, and with jbjnr we got to n worker threads, 2 (default) io pool threads, 2 timer pool threads, 2 parcel pool threads and the wait_helper to wait for finalize. Is this correct? Would hpx spawn threads for any other purposes?
<hkaiser> zao: ok, thanks!
<hkaiser> msimberg: no
<hkaiser> msimberg: wait, there is also the main thread - but that's not started by hpx
aserio has quit [Read error: Connection reset by peer]
<msimberg> no, meaning no other threads, or not correct? and yeah, I ignored the main thread
hkaiser has quit [Read error: Connection reset by peer]
hkaiser has joined #ste||ar
<hkaiser> msimberg: 'no' as in hpx does not spawn any other threads
aserio has joined #ste||ar
<msimberg> ok, thanks!
eschnett has joined #ste||ar
<hkaiser> aserio: yt?
<aserio> hkaiser: yes
<hkaiser> see pm, pls
EverYoung has joined #ste||ar
jaafar has quit [Ping timeout: 248 seconds]
gedaj has quit [Remote host closed the connection]
gedaj has joined #ste||ar
EverYoung has quit [Remote host closed the connection]
EverYoung has joined #ste||ar
diehlpk_work has joined #ste||ar
rod_t has joined #ste||ar
bibek_desktop has quit [Quit: Leaving]
Bibek has joined #ste||ar
david_pfander has quit [Ping timeout: 248 seconds]
EverYoung has quit [Ping timeout: 246 seconds]
EverYoung has joined #ste||ar
aserio1 has joined #ste||ar
aserio has quit [Ping timeout: 252 seconds]
aserio1 is now known as aserio
<heller> hkaiser: hmm, everything but the throttle thing is what I am able to fix right now :/
<heller> removing and adding PUs dynamically might need some more revised rework of the scheduling loop
<heller> the scheduling loop is currently designed to really run from start to finish ...
<hkaiser> ok
<hkaiser> I need this functionality so I will work on it soon
<heller> what do you need it for?
<hkaiser> switching back & forth between MPI/OpenMP step and HPX step
<heller> ok
<heller> for that it would be overkill to kill off the entire thread anyways
<heller> a condition variable which controls whether this thread is active or not sounds more suitable
<hkaiser> nobody said we should kill the threads
<heller> yes, that is how it was sketched to be implemented
<github> [hpx] sithhell pushed 1 new commit to fix_thread_map: https://git.io/vdDx6
<github> hpx/fix_thread_map 36544eb Thomas Heller: Partially reverting background thread handling during shutdown...
<heller> hkaiser: I'll do it properly then tomorrow
hkaiser has quit [Read error: Connection reset by peer]
hkaiser has joined #ste||ar
<hkaiser> heller: yah, let's see how it goes
<heller> hkaiser: let's merge what I have ASAP ... this should at least put master in a working condition again for the rest of the world
<hkaiser> fine by me
<heller> #2955
<heller> I am getting more and more angry emails from my EU partners ;)
<hkaiser> why?
<heller> "nothing ever works! Do you even do CI?!"
aserio has quit [Read error: Connection reset by peer]
<heller> My reply: "Yes, we do CI, that's how I know that it is currently broken"
<zao> Coincident Irritation.
<hkaiser> lol
<heller> made them even angrier
<hkaiser> idiots
<hkaiser> know everything better, as usual
<heller> yeah
<hkaiser> how many lines of code do they have? 10? 20?
<K-ballo> this channel is being recorded for quality purposes
<heller> K-ballo: thanks ;)
<heller> anyways
<heller> hkaiser: quite a few actually
<heller> the code that crosses the boundaries is always the hardest work
<hkaiser> absolutely
<heller> and we are in an unfortunate situation that the runtime is the place where the dots get connected
aserio has joined #ste||ar
<hkaiser> heller: do you have to work off top of master?
<hkaiser> why not selecting a 'stable' commit?
<heller> the recent pool executor changes forced us
<hkaiser> otoh, they can't expect for major changes in between releases to be 100% stable
<hkaiser> that's nonsense
<heller> the entire project is WIP, it's research
<heller> code breaks
<heller> always
<heller> or is dead
EverYoung has quit [Ping timeout: 246 seconds]
EverYoung has joined #ste||ar
hkaiser has quit [Quit: bye]
aserio has quit [Ping timeout: 240 seconds]
hkaiser has joined #ste||ar
aserio has joined #ste||ar
aserio has quit [Ping timeout: 240 seconds]
<jbjnr> hkaiser: just noticed this conversation and wanted to point out that adding an hpx::suspend and hpx::resume is what msimberg is heading for, so the two of you should have a skype call soon. I can join too if I'm not presenting or anything.
hkaiser has quit [Read error: Connection reset by peer]
hkaiser has joined #ste||ar
<hkaiser> jbjnr: absolutely
<hkaiser> jbjnr: what class will expose those?
<hkaiser> msimberg: ^^
hkaiser has quit [Client Quit]
aserio has joined #ste||ar
zbyerly_ has quit [Remote host closed the connection]
zbyerly_ has joined #ste||ar
rod_t has left #ste||ar [#ste||ar]
hkaiser has joined #ste||ar
EverYoun_ has joined #ste||ar
EverYoung has quit [Ping timeout: 252 seconds]
<jbjnr> heller: just fyi - I saw that you pushed another commit to the thread handling, but it does not fix the hangs for me. sorry.
<heller> jbjnr: shit
<heller> jbjnr: thanks for letting me know. Still the same reproduction?
rod_t has joined #ste||ar
<github> [hpx] chinz07 opened pull request #2957: Fixing errors generated by mixing different attribute syntaxes (master...fixing_2956) https://git.io/vdyWV
EverYoun_ has quit [Remote host closed the connection]
EverYoung has joined #ste||ar
eschnett has quit [Quit: eschnett]
<github> [hpx] hkaiser closed pull request #2943: Changing channel actions to be direct (master...channel_direct) https://git.io/vd6MX
<hkaiser> msimberg: yt?
aserio has quit [Quit: aserio]
sam29 has joined #ste||ar
sam29 has left #ste||ar [#ste||ar]
EverYoung has quit [Ping timeout: 252 seconds]