aserio changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/
<diehlpk> Ok, hkaiser I will do it later
<diehlpk> zao, I invited you
parsa has quit [Quit: Zzzzzzzzzzzz]
<diehlpk> hkaiser, zao https://pastebin.com/HG9t3qEv
parsa has joined #ste||ar
<hkaiser> try again but run with 'catch throw'
<hkaiser> that will stop at the spot where the initial exception is thrown
<hkaiser> the one in your listing is the rethrown exception after it has been handled in the scheduler
eschnett has joined #ste||ar
<hkaiser> diehlpk: uhh
parsa has quit [Quit: Zzzzzzzzzzzz]
<diehlpk> hkaiser, I am not able to reproduce it on my local machine
<hkaiser> it's a strange one
<hkaiser> diehlpk: is it a debug/release mismatch?
<hkaiser> the hpx docker image has a debug build, yours seems to be a release build
<hkaiser> relwithdebinfo, that is
parsa has joined #ste||ar
<diehlpk> Ok, hpx is using this one here stellargroup/build_env:debian_clang
<hkaiser> yes, that's a debug build
<diehlpk> I am using stellargroup/hpx:dev
<hkaiser> that's a debug build as well
<hkaiser> the first is a barebone docker image, the second one has hpx built in
<hkaiser> (well, the first has the prerequisites installed)
<diehlpk> Ok, - docker run -v $PWD:/hpx -w /hpx/build -e "CIRCLECI=true" ${IMAGE_NAME} cmake -DCMAKE_BUILD_TYPE=Debug -DHPX_WITH_MALLOC=system -DPHPX_Doc=ON -DPHPX_Test=ON ..
<hkaiser> yes
<hkaiser> what CMAKE_BUILD_TYPE do you use?
<diehlpk> Debug
<hkaiser> hmm
<hkaiser> something is off, then
<diehlpk> This line above is from my circle-ci
<hkaiser> ah
<diehlpk> I will try to use the same kind of main as in the hpx example
<hkaiser> that's not the problem
<diehlpk> And we use the same line for HPXCL
<hkaiser> and it breaks there as well?
<diehlpk> I have to check it again, I removed the test cases there because they were not working and I can not remember why
<hkaiser> k
<hkaiser> this really looks like a release/debug mismatch to me
<diehlpk> Ok, I will check if there is a mismatch or not
<hkaiser> otoh, it all happens inside the hpx core library - hmmm
<hkaiser> no idea what's wrong
<diehlpk> But shouldn't hpx complain if I combine my own debug code with a release hpx build, or vice versa?
<hkaiser> yah, it should
<hkaiser> as I said, not sure what's going on
<hkaiser> should not link, actually
<diehlpk> Ok, I will investigate more tomorrow morning
<hkaiser> k
<diehlpk> The GSoC student is doing quite well writing the paper
hkaiser has quit [Quit: bye]
K-ballo has quit [Quit: K-ballo]
EverYoung has joined #ste||ar
<zao> I went and intentionally mismatched build-types in my own build of a simple executable, RWDI HPX with Debug project - https://gist.github.com/zao/98187c0fb7555344bb1735bac409a7ce
diehlpk has quit [Ping timeout: 264 seconds]
<zao> (results in a rather uninformative trace of a segfault)
EverYoung has quit [Remote host closed the connection]
nanashi55 has quit [Ping timeout: 248 seconds]
nanashi55 has joined #ste||ar
<zao> Debug HPX with RWDI project is more fun:
<zao> terminate called after throwing an instance of 'std::invalid_argument'
<zao> what(): hpx::resource::get_partitioner() can be called only after the resource partitioner has been allowed to parse the command line options.
<zao> Aborted (core dumped)
EverYoung has joined #ste||ar
parsa has quit [Quit: Zzzzzzzzzzzz]
parsa has joined #ste||ar
<zao> diehlpk_work: Are you aware of these warnings? https://gist.github.com/zao/485e19a0d2fa0662de7bbe8d7de6ccdc
<zao> This feels race:y... only triggers outside of GDB
<zao> terminate called after throwing an instance of 'hpx::detail::exception_with_info<hpx::exception>'
<zao> what(): description is nullptr: HPX(bad_parameter)
<zao> And not reliably, like half of the times I run it.
<zao> (debug HPX, debug PDHPX)
<zao> This happens on a "working" run, after a few minutes: https://gist.github.com/zao/2171b43bbb597a10e8b04a9f530c7615
<zao> Seems like it's spending its time slowly eating up my measly 32G of memory.
simbergm has joined #ste||ar
simbergm is now known as msimberg
david_pfander has joined #ste||ar
<heller> msimberg: still working on it ... got side tracked with this USL thingy ...
hkaiser has joined #ste||ar
<jbjnr> I just realized why gdb is so slow attaching to the hpx process on daint when I use hpx:attach-debugger=exception
<jbjnr> it's because all threads are spinning in the scheduling loop and using 100% cpu whilst gdb is trying to load.
<jbjnr> I wonder if we can use the fancy new suspend-runtime feature to halt hpx when an exception is hit and awaken it once we've attached the debugger!
<github> [hpx] StellarBot pushed 1 new commit to gh-pages: https://git.io/vFQXQ
<github> hpx/gh-pages e35141a StellarBot: Updating docs
<heller> jbjnr: go for it!
<jbjnr> msimberg: hope you're paying attention :)
<msimberg> jbjnr, heller: listening :)
<msimberg> heller, I'm also still working on it
<heller> msimberg: ok
<msimberg> I have the throttle and timed tests passing simultaneously, but not sure if this is the best way...
<msimberg> basically I took jbjnr's idea of different exit modes, so now the throttle test and shutdown remove processing units in different ways
<msimberg> are you going some other way?
<msimberg> jbjnr: your threads are correctly spinning at 100%? i.e. they actually have work to do?
<jbjnr> no, no work to do, but they sit in a wait state with the cpu consuming 100%
<msimberg> wondering if the idle backoff fix I was thinking of would help in your case
<msimberg> basically make the backoff exponential
<jbjnr> I didn't look closely, but probably a spinlock with no backoff or no actual suspend of the underlying pthread
<msimberg> do you have IDLE_BACKOFF on?
<jbjnr> doubt it
<msimberg> if yes, you can try a one-line change to try it
<msimberg> ok
<jbjnr> yes. I should try setting HPX_HAVE_THREAD_MANAGER_IDLE_BACKOFF / WITH...XXX
<jbjnr> that ought to fix it.
<jbjnr> ignore my previous comments then
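A minimal sketch of the exponential idle backoff msimberg suggests, with illustrative names and constants only (not the actual HPX scheduler code):

    #include <algorithm>
    #include <atomic>
    #include <chrono>
    #include <thread>

    // Sketch: an idle worker doubles its sleep interval every time it finds no
    // work, up to a cap, and resets as soon as work shows up, so an idle
    // scheduling loop stops pinning a core at 100%.
    void worker_loop(std::atomic<bool>& stop, bool (*try_execute_one_task)())
    {
        std::chrono::microseconds backoff(1);
        std::chrono::microseconds const max_backoff(1000);

        while (!stop.load(std::memory_order_relaxed))
        {
            if (try_execute_one_task())
            {
                backoff = std::chrono::microseconds(1);    // work found: reset
                continue;
            }
            std::this_thread::sleep_for(backoff);          // idle: back off
            backoff = (std::min)(backoff * 2, max_backoff);
        }
    }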
<msimberg> heller: see above question :)
<heller> msimberg: ahh, yeah, different exit modes make sense, I am perfectly ok with it
<heller> msimberg: I am trying a different approach
<msimberg> I'm thinking if it's even okay to allow remove_processing_unit outside of shutdown, and only allow suspend_processing_unit...
<msimberg> the difference being that a suspended pu will be woken up again to finish off its work
<msimberg> and then it can have strict termination criteria again
<msimberg> and secondly, whether the scheduling_loop should wait for suspended tasks when removing/suspending (doesn't matter which)? they could take arbitrarily long to finish, but then again one might have to wait for them anyway
<msimberg> if suspending the pu, suspended tasks get taken care of once it's resumed, but if removing the pu, one would have to make sure that the pu gets added back again to finish the suspended tasks
<msimberg> otherwise suspended tasks would have to be stolen or something
<msimberg> heller?
<msimberg> sorry :)
<heller> hm?
<msimberg> well, no specific question but thoughts?
<msimberg> on the above?
<heller> let's make what we have to work first
<msimberg> ok, fair enough
<msimberg> should I wait for what you have or can I go ahead with my approach?
<msimberg> it needs some cleaning up first though...
<msimberg> would be curious how you tried to fix it as well
parsa has quit [Quit: Zzzzzzzzzzzz]
parsa has joined #ste||ar
parsa has quit [Client Quit]
parsa has joined #ste||ar
parsa has quit [Client Quit]
<heller> msimberg: no, go ahead, let's pick whoever finishes first ;)
<msimberg> heller: ok :)
<jbjnr> I quite like the idea of not allowing pus to be added or removed unless the pool is suspended. that also solves the mutex issue on github
<jbjnr> I think I just found a major bug!
<jbjnr> no. I didn't
<zao> "Do you wish to present a poster?"
<zao> If only I had a poster :)
<jbjnr> poster of what?
<zao> We're inaugurating our "new" cluster later this month, apparently one of the questions when signing up for the event is "yo, want to present a poster (of stuff you've done with the resources)?"
<jbjnr> if in doubt, you can always go with something like http://emmascrivener.net/wp-content/uploads/2014/12/motivation-topdemotivators.jpg
<zao> Hehe...
<zao> One of the CS sysadmins had the "meetings - a practical alternative to work" poster up on his door.
<zao> Turns out that that's not appreciated by the kind of people who are always in meetings.
<jbjnr> lol
<jbjnr> is TSS broken?
<jbjnr> or rather TLS
<jbjnr> heller: hkaiser under what conditions could get_worker_thread_num return the wrong thread number? I am seeing some strange errors in my scheduler that appear to be caused by a thread that should be thread 35, returning a different number from that tss call.
<hkaiser> jbjnr: that also would explain why hello_world prints messages twice
<jbjnr> correct
<jbjnr> I am looking at it now
<jbjnr> but I'm flummoxed
<hkaiser> nod
<hkaiser> nothing would work if TLS was broken
<hkaiser> msimberg: I think that during shutdown all sleeping or waiting schedulers should be awoken
<hkaiser> that might help getting schedulers shut down properly
<msimberg> hkaiser: do you mean in general or in response to my comment on the throttle issue?
<msimberg> in general I agree, and I would say it's a must if suspending threads
<hkaiser> responding to your latest comment
<msimberg> ok, so that hangup is not during shutdown
<msimberg> it's earlier
<hkaiser> well a thread can't remove the PU it's running on
<msimberg> yeah, okay, then maybe that should be discussed (currently it's set up to allow that)
<msimberg> heller might have some opinion on this
<hkaiser> this would require suspending the current thread, but then nobody is there to do the actual removal of the PU
<msimberg> indeed, that's why it currently suspends itself until it's been stolen
<msimberg> and then it of course requires that stealing is enabled
<hkaiser> well, sure
<msimberg> but maybe a better direction would then be to use the shrink_pool interface
<hkaiser> I think for now we can prohibit removing a PU from the thread running on it
<msimberg> and just remove some pu, which we can check is not the one on which that's running at the moment
<msimberg> ok, but then that fixes a lot of things :)
<hkaiser> :D
<msimberg> would still wait for heller to comment, as he was working on a fix as well
<heller> well
<heller> I think it is non trivial to figure out if the thread you currently run in is the one that's going to be suspended or not, in the general case
<heller> but i think it is trivial if we rely on task stealing to occur
<hkaiser> heller: sure you can
<msimberg> get_worker_thread_num == num_thread?
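A minimal sketch of the check msimberg proposes; remove_processing_unit_checked and its forwarding target are hypothetical names here, only hpx::get_worker_thread_num() is existing HPX API:

    #include <cstddef>
    #include <stdexcept>

    #include <hpx/hpx.hpp>

    // Sketch: refuse to remove the processing unit the calling HPX worker
    // thread is currently running on.
    void remove_processing_unit_checked(std::size_t num_thread)
    {
        if (hpx::get_worker_thread_num() == num_thread)
        {
            throw std::invalid_argument(
                "cannot remove the processing unit this thread runs on");
        }
        // ... otherwise forward the request to the resource partitioner
    }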
<heller> sure, what if, down the line (aka when the suspension actually happens), the task has been migrated to another thread?
<hkaiser> can't
<hkaiser> for that it first needs to be suspended
<heller> might happen if locks used in those call paths are contended
<hkaiser> not in the scheduler
<heller> in the resource partitioner though
<hkaiser> the scheduler uses kernel primitives for synchronization
<hkaiser> ok
<heller> at least from my understanding ...
<heller> we might get away with compare_exchange for setting the new used pu masks though
<hkaiser> heller: no need
<hkaiser> the scheduler needs a flag `enabled` which is updated from the RP as needed
<hkaiser> that allows removing all of the masking business for checking whether the scheduler is active
<heller> ok
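A rough illustration of the `enabled` flag hkaiser describes (a sketch with assumed names only, not the actual scheduler code):

    #include <atomic>

    // Sketch: one flag per scheduler, toggled by the resource partitioner.
    struct scheduler_state
    {
        std::atomic<bool> enabled{true};
    };

    // Resource partitioner side: called when a PU is suspended/removed or re-added.
    void set_scheduler_enabled(scheduler_state& s, bool on)
    {
        s.enabled.store(on, std::memory_order_release);
    }

    // Scheduling loop side: consult the flag instead of re-checking PU masks.
    bool scheduler_active(scheduler_state const& s)
    {
        return s.enabled.load(std::memory_order_acquire);
    }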
<heller> I guess it is inevitable that we distinguish between suspending and removing
<hkaiser> that's fine
<heller> yes, that's the preferred option anyway
<hkaiser> even if removing will suspend in the end anyways
<zao> For things like diehlpk_work's test, is it possible to somehow break into a debugger post-fact or just plain coredump at a state where it's meaningful? If starting from GDB, I cannot get it to trigger the race.
<hkaiser> can you trigger the race outside of gdb?
<heller> zao: --hpx:attach-debugger=exception
<zao> Half of the runs break, approximately.
<github> [hpx] hkaiser closed pull request #3010: Remove useless decay_copy (master...remove-decay_copy) https://git.io/vF1rf
<hkaiser> interesting
<zao> In debug HPX with debug PDHPX build.
<zao> The runs that don't bail out consume all my memory and die, sloooowly.
<zao> Only RelWithDebInfo builds actually complete the test.
<zao> (only tested debug and rwdi)
<hkaiser> the back trace diehlpk posted yesterday appeared to be caused by a 'shift' of the arguments to a function by one position
<hkaiser> this makes me believe it's caused by a debug/release (or similar ABI breaking) mismatch
<heller> in other news: USL is just awesome
<hkaiser> it is
<hkaiser> just misses grainsize
<hkaiser> ;)
<heller> except not ;)
<hkaiser> each concrete pair of alpha/beta corresponds to a concrete grainsize
<zao> terminate called after throwing an instance of 'hpx::detail::exception_with_info<hpx::exception>'
<zao> what(): description is nullptr: HPX(bad_parameter)
<zao> Aborted (core dumped)
<zao> This one?
<hkaiser> yes
<zao> That's 100% correctly matched build-type on my machine.
<heller> hkaiser: sure, but they let you also predict optimal grainsizes for this fit. The other nice thing is that you can extract the overheads, and different reasons for overheads from the model
<hkaiser> sure
<hkaiser> that's not my point
<hkaiser> you can predict only two parameters at the same time
<hkaiser> zao: it's this call causing the exception further down: https://github.com/STEllAR-GROUP/hpx/blob/master/src/runtime_impl.cpp#L335
<hkaiser> the function called believes that "run_helper" is actually == nullptr
<hkaiser> well, register_thread(data, id); believes that data.desc (initialized from "run_helper") is == nullptr
<heller> hkaiser: still playing with the model and this is made up data: https://i.imgur.com/8aJCMw9.png
<hkaiser> where is the grainsize?
<msimberg> so hkaiser, heller: do we agree that calling remove_processing_unit on oneself is not allowed?
<hkaiser> msimberg: if you can detect it reliably?
<zao> PID: 21500 on lin ready for attaching debugger. Once attached set i = 1 and continue
<hkaiser> zao: nod
<hkaiser> go to the thread sitting in a nano_sleep
<zao> I have no idea what to do in my debugger there. All threads are in ?? and there's no i variable to set anywhere.
<hkaiser> the i variable is one level up
<zao> I don't think this is doing what it should :)
<msimberg> hkaiser: ok, so I probably didn't follow 100% what you discussed earlier, but I don't see a way for the thread to get stolen by the pu being shut down
<msimberg> once the pu is stopping it stops stealing
<zao> Ah.
<zao> Could not open `target:/tree/build-debug/lib/libhpxd.so.1' as an executable file: Operation not permitted
<heller> hkaiser: made up data, no actual relation to launching tasks yet. latency is essentially the average time a task takes. That is, overheads + grain size = latency
<zao> Seems like this gosh-darned debugger or OS is gimping me.
<zao> Might be in a different cgroup here.
<hkaiser> heller: that's nonsense
<heller> hkaiser: that's what the model gives you
<hkaiser> no
<hkaiser> the model does not include grainsize, this word is not used once in the book
<heller> it uses latency
<hkaiser> right
<heller> and latency is the time a task needs to be processed
<hkaiser> no
<hkaiser> well
<heller> sure
<hkaiser> that's quite a stretch, though
<heller> I don't think so
<hkaiser> I think the USL has latency and contention
<heller> the parameters for the function you mean?
<heller> alpha and beta?
<hkaiser> those contribute to alpha and beta
<heller> sure
<hkaiser> latency != grainsize
<heller> of course not, I never said that
<heller> the overheads are dependent on the degree of parallelism in the system (contention and coherency)
<hkaiser> no idea what coherency is
<heller> grain size, that is, the time a function needs to be executed, might depend on that as well
<hkaiser> and overheads are orthogonal to contention
<hkaiser> the useful work executed by a function are not overheads
<heller> contention is one of the reasons for overhead
<hkaiser> no it is not
<heller> no, but useful work might be limited by contention
<hkaiser> overhead is the work you need to perform in principle to manage parallelism
<hkaiser> contention might be caused by that work, but it is not part of it
<heller> and in order to manage parallelism, you need to synchronize; synchronization eventually leads to contention
<hkaiser> overheads are still there if you have no parallelism, contention might not be there in that case
<heller> of course
<hkaiser> only if you have contention on the sync resource
<hkaiser> so we agree
<hkaiser> overheads and contention are orthogonal
<heller> sure, you always have a constant overhead. in order to manage parallelism, your overheads increase depending on the workload and degree of concurrency
<heller> we are disagreeing on the terminology :P
<hkaiser> it does not
<hkaiser> it increases with the maximum amount of concurrency you want to support, not with the real amount of concurrency
<hkaiser> we disagree on the terminology - true, so you might want to go back and properly define your terminology
<hkaiser> as it is right now you have everything mixed up
<heller> well
<heller> one problem with the USL is that it's not easy to distinguish between contention happening in real work and the contention that is part of the scheduling overhead
<heller> i don't think everything is mixed up
<hkaiser> no need to do so
<hkaiser> not everything, sure
<heller> in order to determine the scalability of the task scheduling, i think we do have to make that distinction
<hkaiser> from the system's standpoint there is no way to distinguish contention caused by the scheduler and by real work
<heller> true
<heller> one result I want to get out of this is exactly that distinction, though
<hkaiser> the good thing about USL is that they found a _generalized_ expression and a heuristic to define alpha and beta
<hkaiser> but alpha and beta have no direct mapping onto SLOW as there is no real way to distinguish the effect each of the SLOW factors has
<heller> that's true
<hkaiser> I still believe that adding a 3rd dimension to the USL would clarify things
<heller> especially starvation is not really covered
<hkaiser> the USL gives you the relation of speedup over number of cores
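For reference, the standard form of the USL expresses the speedup (capacity) on N cores through a contention coefficient alpha and a coherency coefficient beta:

    C(N) = \frac{N}{1 + \alpha\,(N - 1) + \beta\,N\,(N - 1)}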
<heller> and also, the 'L' in SLOW is different to what the USL authors refer to as latency
<hkaiser> the bathtub gives you the relation of speedup over grainsize
<hkaiser> heller: sure, could be - they directly related alpha with L
<hkaiser> I just said there is no direct mapping to SLOW
<heller> yes, I don't think there has to be
<hkaiser> right
<hkaiser> making the USL 3 dimensional (speedup over number of cores and over grainsize) gives you the full picture
<heller> so varying the program input in two dimensions
<heller> the interesting thing here is that the underlying theory of the USL does indeed talk about grainsize
<heller> the USL by itself only fits the parameters for a given set of measurements
<heller> it also allows you to predict different throughputs etc. based on this fit
<heller> so what I want to figure out now is whether we can, based on a given measurement, figure out the optimal (average) grain size the application should have
<heller> so in the USL model, we get a set of formulas to calculate the degree of parallelism (number of CPUs), throughput (for example tasks/s) and latency (average time a task needs to complete)
<heller> those formulas come from the underlying queuing theory and Little's law
<heller> using this terminology for latency, and defining the grain size to be the amount of "real work", latency = scheduling overheads + grain size holds
<heller> I can calculate the latency with the fitted parameters, and know the grain size. Now I should be able to easily vary the grain size parameter, and calculate the optimum
<heller> at least that's the plan
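In symbols (notation assumed here, not taken from the chat), the plan reads roughly: with a fitted per-task latency W(N), grain size g as the useful work per task, and throughput X(N) in tasks/s,

    W(N) = T_{\mathrm{overhead}}(N) + g, \qquad X(N) = \frac{N}{W(N)} \quad \text{(Little's law)}

so once T_overhead is extracted from the fit, g can be varied in the model to look for the throughput optimum.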
<heller> hkaiser: does this make more sense to you now?
<heller> if this works out as intended, this could even be implemented as a performance counter giving you a prediction on how far away your application is from the optimal grain size
<hkaiser> heller: you can't predict an optimal pair of (number of cores, grainsize) based on measurements that do not take grainsize into account
<hkaiser> but if you take grainsize into account while doing the measurements, you automatically have 2 dependent parameters you're fitting over
<hkaiser> grainsize and number of cores
<hkaiser> so naturally, after the fitting you get an optimal pair of those - which is what you wanted in the first place
<heller> yeah sure
<hkaiser> if you do measurements where you change both, you naturally get a 3 dimensional dependency
<heller> ideally, I want to avoid varying the grain size
<hkaiser> then you have no way of doing the fitting for optimal grainsize
<heller> the problem, of course, is that the scheduling overhead kind of depends on the grain size
<hkaiser> heller: Goedel has shown that you can't prove the axioms of a system without exiting the system you're in
<heller> hkaiser: right, I don't want to fit for grainsize
<heller> it's hairy ;)
<hkaiser> you want to get an optimal grainsize
<heller> in the end, yes
<hkaiser> now you're contradicting yourself
<heller> well, I am hoping to get a prediction for an optimal grain size by using the USL and the underlying queuing theory
<hkaiser> ok
<heller> my initial motivation was something else anyways ;)
<hkaiser> heller: grainsize is a highly application specific value, I don't see how the underlying queuing theory can give you that
K-ballo has joined #ste||ar
<hkaiser> it may give you an estimate of how much work (i.e. in terms of time) you might want to execute in order to amortize overheads etc
<heller> it doesn't. The inputs for my models are: Number of PUs, overall throughput and grainsize
<hkaiser> so you have the grainsize from measurements?
<hkaiser> also - whatever 'throughput' is
<diehlpk_work> zao, I am in the office. Let me know if you need help
<hkaiser> do you mean speedup?
<heller> throughput would be tasks/s
<hkaiser> so it's speedup
<hkaiser> ok
<hkaiser> you're saying you measure throughput/speedup over number of pus and grainsize
<hkaiser> is that correct?
<heller> yes
<hkaiser> ok
<heller> while I don't want to vary grain size
<hkaiser> isn't that what I'm suggesting for the last hour or so?
<hkaiser> ok
<hkaiser> well, good luck with that
<heller> you suggested to vary the grain size as part of the measurement as well
<hkaiser> yes
<hkaiser> good luck
<zao> diehlpk_work: Did you see my note about the possible (unrelated?) bug?
eschnett has quit [Quit: eschnett]
<zao> Ah, debugging worked better in the same process tree.
<zao> diehlpk_work: In summary, half of the time I launch the program, I get the HPX error. The other half of the time it just eats my memory.
<zao> (in Debug)
<diehlpk_work> Ok, but in release it is working for you or?
<zao> relwithdebinfo seems to pass some tests.
<diehlpk_work> Ok, in Release all of the tests pass for me
<zao> Well, 1D x+ / x- quasistatic ones. Now it's possibly stuck or just not talking.
<diehlpk_work> No, 2D quasistatic tests are long
<zao> 1min+ ?
<zao> I'll let it chew for a bit then.
<zao> Hrm, these are all hpx_main-based, eh?
<zao> Wasn't there talk about hpx_init the other day, or was that someone else?
<hkaiser> zao: shouldn't make a difference (tm)
hkaiser has quit [Quit: bye]
<zao> Ah, there, 2dx- completed.
<zao> Finally caught the naughty one in GDB - https://gist.github.com/zao/a28a00555f0ee6ff6d858788dbb96f86
eschnett has joined #ste||ar
<diehlpk_work> zao, We agree that the error is inside of hpx, or?
<zao> Oh wait, this is another one, this is a network_error.
<zao> Probably due to running the tests at the same time.
<zao> Oh shoot... I'm not network-isolating.
<zao> I've got to look again.
<zao> Same backtrace that you got the other day, it seems.
<diehlpk_work> Yes
hkaiser has joined #ste||ar
<zao> hkaiser: My crash is the same as diehl's, which is nice. Still don't know why, tho :)
<diehlpk_work> heller, hkaiser Will you have time to proofread the HPXCL paper mid december?
<diehlpk_work> Deadline is the 22nd of December
<diehlpk_work> zao, me neither
<msimberg> heller: I'm back on the original plan of different shutdown modes as shutting down oneself wasn't really the (main) problem
<msimberg> but I'm getting "mmap() failed to allocate thread stack due to insufficient resources, increase /proc/sys/vm/max_map_count or add -Ihpx.stacks.use_guard_pages=0 to the command line"
<msimberg> after spawning a lot of tasks with async_execute_after, even without removing pus
<msimberg> is this expected?
<msimberg> also hkaiser ^
<msimberg> (although this can be related to my earlier changes, so will still test on master)
<msimberg> hkaiser: another question for you: do you try or intend to follow semantic versioning with hpx?
<msimberg> a lot of tasks = roughly 20000
<zao> msimberg: I got that one on diehlpk_work's code when it doesn't bail out early.
<msimberg> zao: oh, ok, also using some *_execute_after function?
<zao> No idea what the code uses, but it likes to eat address space.
<zao> Last time I squinted at it, it was at 24.5G of virt and 4.5G of RES.
<msimberg> ok
<zao> (I'm not actually supposed to do this, I just like to build HPX and see if it fails :) )
<zao> diehlpk_work: Did you say what compilers you used? I'm on debian's clang 3.8.1-24 and Boost 1.65.1.
<zao> (Boost and HPX built with C++14)
<jbjnr> execute_after is designed to work with timers so that tasks run after some interval - are you sure you want that?
<diehlpk_work> zao, gcc (GCC) 6.2.1 20160916 (Red Hat 6.2.1-2) on fedora 25
<zao> I'm on debian latest (9.2?) w/ libstdc++ from 6.3.0 in a singularity image on Ubuntu 17.10, so great fun.
<zao> So not exactly a single configuration failure either.
<diehlpk_work> zao, It is also working on Ubuntu with release
<zao> Well, yeah... talking about Debug here.
<jbjnr> hkaiser: just fyi. my mpi pool executor is somehow running on a pu that was allocated in the default pool. I'm tracking it down ...
daissgr has joined #ste||ar
jakemp has quit [Ping timeout: 248 seconds]
<jbjnr> msimberg: there has only been a single 1.0 release - so until now the answer is no, and I doubt we can restrict ourselves to changes that are non-API-modifying, so it's going to stay no
<jbjnr> unless we release 2.0
<K-ballo> we might just as well jump to 2.0 :P
<jbjnr> the async and executor APIs are not changing and that's what should really matter :)
<jbjnr> s/changing/changing much/g :)
<diehlpk_work> zao, Now I get terminate called after throwing an instance of 'hpx::detail::exception_with_info<hpx::exception>'
<diehlpk_work> what(): failed to insert console_print_action into typename to id registry.: HPX(invalid_status)
<diehlpk_work> Aborted (core dumped)
parsa has joined #ste||ar
EverYoung has quit [Ping timeout: 258 seconds]
twwright has quit [Read error: Connection reset by peer]
parsa has quit [Quit: Zzzzzzzzzzzz]
<msimberg> jbjnr: yeah, that's why I'm asking
<msimberg> you mean the public api of hpx has had only additions since 1.0?
<diehlpk_work> Ok, now I get a new, stranger error
<msimberg> as for the execute_after I was just testing that removing pus works correctly also with suspended threads
<jbjnr> stuff like the RP is new, but I'm not sure if anything was "removed" to make way for it. lots of threadmanager changes and runtime tweaks were made - they're not really public-facing API, but they are API. if we were strict, we'd jump to 2.0, but I don't think we should myself.
<K-ballo> hpx 1.0 won't link to hpx next
<jbjnr> K-ballo: so you'd prefer us to stick to the versioning scheme then?
<msimberg> jbjnr: I don't mind either way, but I think it should be a conscious decision at least; bumping version numbers arbitrarily is not helpful, otherwise we could almost just use 1, 2, 3, etc.
<K-ballo> jbjnr: I think it depends on what we want to convey by it
<jbjnr> agreed.
<diehlpk_work> hkaiser, I get the same error for HPXCL when using debug build there
<hkaiser> diehlpk_work: could it be that you're using different config settings for your stuff than were used for compiling HPX?
<hkaiser> some explicit HPX_HAVE_... set anywhere in your apps?
<diehlpk_work> No
<diehlpk_work> HPXCL was working with debug some time ago
<diehlpk_work> And both crash with the exact same error
EverYoung has joined #ste||ar
EverYoung has quit [Remote host closed the connection]
EverYoung has joined #ste||ar
parsa[[w]] has quit [Ping timeout: 246 seconds]
parsa[w] has joined #ste||ar
parsa[w] has quit [Ping timeout: 246 seconds]
mbremer has joined #ste||ar
<mbremer> @hkaiser: yt?
parsa[w] has joined #ste||ar
parsa[w] has quit [Read error: Connection reset by peer]
parsa has joined #ste||ar
jakemp has joined #ste||ar
parsa has quit [Client Quit]
parsa[w] has joined #ste||ar
aserio has joined #ste||ar
EverYoung has quit [Ping timeout: 240 seconds]
<diehlpk_work> zao, hkaiser With latest master HPXCL is working again with debug
david_pfander has quit [Ping timeout: 248 seconds]
EverYoung has joined #ste||ar
<zao> latest hpx?
<zao> My last test was with 33813a0, guess I should try a7f7eec
<zao> (which assumedly shouldn't change anything)
EverYoun_ has joined #ste||ar
<diehlpk_work> zao, I did a git pull and rebuilt HPX with debug
<diehlpk_work> For PDHPX I get issues with the stack size
<diehlpk_work> ./PeridynamicHPX -i input.yaml -k type -t threads
<diehlpk_work> Stack overflow in coroutine at address 0x00000000000002c0.
<diehlpk_work> Configure the hpx runtime to allocate a larger coroutine stack size.
<diehlpk_work> Use the hpx.stacks.small_size, hpx.stacks.medium_size,
<diehlpk_work> hpx.stacks.large_size, or hpx.stacks.huge_size configuration
<diehlpk_work> flags to configure coroutine stack sizes.
EverYoung has quit [Ping timeout: 240 seconds]
parsa has joined #ste||ar
parsa has quit [Client Quit]
<diehlpk_work> How can I change the stack size for my application?
<diehlpk_work> I do not want to pass it as command line arg
jaafar has joined #ste||ar
jaafar_ has quit [Ping timeout: 258 seconds]
<jbjnr> diehlpk_work: set it in the cfg in int main before calling hpx::init
<diehlpk_work> Ok, thanks
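A minimal sketch of what jbjnr suggests; the hpx::init overload taking a config vector and the concrete size values used here should be double-checked against the installed HPX version:

    #include <string>
    #include <vector>

    #include <hpx/hpx_init.hpp>

    int hpx_main(int argc, char** argv)
    {
        // application code ...
        return hpx::finalize();
    }

    int main(int argc, char** argv)
    {
        // Raise the coroutine stack sizes via runtime configuration entries
        // instead of passing them on the command line.
        std::vector<std::string> const cfg = {
            "hpx.stacks.small_size=0x20000",    // illustrative values only
            "hpx.stacks.medium_size=0x40000"
        };
        return hpx::init(argc, argv, cfg);
    }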
<jbjnr> hkaiser: I've got a thread with description "task_object::apply" that is on a thread in the default pool, but with the parent->pool set to the mpi pool. any ideas? now that I've found it, I can start debugging, but if you have thoughts ....
<hkaiser> jbjnr: no immediate idea, sorry
<jbjnr> I made a mistake - it's running in the mpi pool, but using a default pool thread.
<jbjnr> just once!
<jbjnr> but it's enough to screw everything up!
<jbjnr> thank you
<mbremer> hkaiser: Do you have time a for call to discuss the paper?
<hkaiser> any time
<hkaiser> mbremer: sure
<mbremer> When works for you?
<hkaiser> mbremer: do I need to read it before discussing things?
<mbremer> Oh no. I was going to walk you through it. And then discuss what needs to be changed.
<hkaiser> now is a good time
<mbremer> Great
<mbremer> Let me move. I'll call through gchat in a few minutes
<hkaiser> skype pls
<zao> Oh boy. Clang ICEs :D
<hkaiser> \o/
parsa has joined #ste||ar
<zao> Also, so much noise from deprecations :(
<hkaiser> zao: there is a cmake flag disabling those
<diehlpk_work> hkaiser, Which version of HPX does the docker image contain?
mbremer_ has joined #ste||ar
<hkaiser> HPX_WITH_DEPRECATION_WARNINGS=OFF
<hkaiser> diehlpk_work: the latest successful build off of master
<diehlpk_work> Ok, for now it works locally but not on circle-ci
<hkaiser> strange
<diehlpk_work> At least HPXCL works again
<diehlpk_work> For my code, I get a stack overflow for now
<hkaiser> diehlpk_work: that's a segfault in disguise
<hkaiser> the error is misleading - there is a ticket also
<diehlpk_work> Ok, because I get this error Stack overflow in coroutine at address 0x00000000000002c0.
<zao> smells rather low.
<hkaiser> nod, we know about this - fix is in the works
<mbremer_> hkaiser: what's your skype handle?
<diehlpk_work> Ok, I will wait for the fix
<hkaiser> mbremer_: sent a contact request
mbremer_ has quit [Quit: Page closed]
<zao> diehlpk_work: My build still occasionally gives the bad_parameter for the quasi tests.
<zao> (in debug, ofc)
<jbjnr> christ - it's this effing background thread that heller put in there
<zao> heh?
<diehlpk_work> And for explicit tests?
<zao> Same kind of behaviour, bogus description or doing _something_.
<diehlpk_work> zao, For PeridynamicHPX I get this one here
<diehlpk_work> For explicit tests, I get https://pastebin.com/JkpF34HT
<diehlpk_work> And the same for quasistatic
<diehlpk_work> The GIT commit I am using is a7f7eecfcdee680a04408300448db06482838ac9
<diehlpk_work> Boost version: 1.60.0
<zao> type_ == data_type_address is part of that whole description union, but the other branch of the type.
<zao> I'm on a7f7eec, 1.65.1.
<diehlpk_work> My Boost version is 1.60.0
<zao> A run of ./PeridynamicHPX just twiddles its thumbs, eating memory.
<diehlpk_work> Ok, strange
<zao> Seems like my case when it doesn't break immediately is to grow and die.
<diehlpk_work> It is very strange because for HPXCL it is working
<diehlpk_work> Ok, with gdb I get this error for explicit tests
<diehlpk_work> 0x00007ffff68ce269 in hpx::parcelset::parcelhandler::load_runtime_configuration[abi:cxx11]() ()
<diehlpk_work> at /calculs/git/hpx/src/runtime/parcelset/parcelhandler.cpp:1601
<diehlpk_work> 1601 f->get_plugin_info(ini_defs);
<zao> nb, your pastebins seem to expire after a day or two.
aserio has quit [Quit: aserio]
aserio has joined #ste||ar
<jbjnr> just fyi heller - not your background thread. sorry - false alarm
<jakemp> Is there any guarantee what hpx::get_worker_thread_num() will return if called in a function passed to run_as_hpx_thread() in main()?
<jakemp> also jbjnr, hkaiser said I could bug you about how I would need to change my code that uses executors
<diehlpk_work> zao, Yes, normally I set them for two weeks
<zao> I should probably kill this by now.
<diehlpk_work> Yes
aserio has quit [Ping timeout: 250 seconds]
* zao invents sleep instead
jbjnr_ has joined #ste||ar
jbjnr has quit [Ping timeout: 250 seconds]
jbjnr_ has quit [Client Quit]
jbjnr_ has joined #ste||ar
jbjnr_ is now known as jbjnr
EverYoun_ has quit [Remote host closed the connection]
Smasher has quit [Changing host]
Smasher has joined #ste||ar
Smasher has joined #ste||ar
<heller> hkaiser: no bathtub for my model :/
<heller> still no real data fed into it though...
<heller> hopefully my made up data is just bad
jaafar has quit [Ping timeout: 255 seconds]
jaafar has joined #ste||ar
aserio has joined #ste||ar
EverYoung has joined #ste||ar
EverYoung has quit [Remote host closed the connection]
EverYoung has joined #ste||ar
hkaiser has quit [Quit: bye]
twwright has joined #ste||ar
parsa has quit [Quit: Zzzzzzzzzzzz]
EverYoun_ has joined #ste||ar
EverYoun_ has quit [Remote host closed the connection]
EverYoung has quit [Ping timeout: 255 seconds]
<heller> nope, real data is just as bad
EverYoung has joined #ste||ar
<diehlpk_work> heller, Do we have a docker image with hpx release?
<heller> no
<diehlpk_work> Ok, so I will just turn off the test cases
<heller> or fix the problem
<heller> what's the point of having tests when they aren't run?
<diehlpk_work> Ok, you are right, I will look into this stack overflow stuff
<heller> you should really try to reproduce the error locally
<heller> I think it is some strange environment hiccup
<diehlpk_work> I was able to reproduce the error locally
<diehlpk_work> With the latest master the error is gone and a new one is there
<heller> ok
<diehlpk_work> it is related to coroutine stuff
eschnett has quit [Quit: eschnett]
<heller> why?
<diehlpk_work> I think it is related to this issue https://github.com/STEllAR-GROUP/hpx/issues/2987
EverYoung has quit [Remote host closed the connection]
EverYoung has joined #ste||ar
hkaiser has joined #ste||ar
eschnett has joined #ste||ar
aserio has quit [Quit: aserio]
<heller> hkaiser: ha!
<heller> hkaiser: the measurements match the model pretty nicely :D
<diehlpk_work> Cool
<diehlpk_work> Ok, I am confused: I turned off HPX_WITH_STACKOVERFLOW_DETECTION and HPX_WITH_STACKTRACES and got the same error
hkaiser has quit [Read error: Connection reset by peer]
<diehlpk_work> All fail here
hkaiser has joined #ste||ar
<heller> gdb shows an assertion failure, not a segfault
<hkaiser> heller: could you explain those graphs, please?
<heller> hkaiser: it's what we talked about earlier
<heller> hkaiser: I made some htts2 measurements (varying number of PUs and grain size)
<hkaiser> what is it I'm looking at? the axes have no titles etc.
<heller> yeah, still very drafty
<heller> I put the results for each grainsize in a separate USL model
<hkaiser> diehlpk_work: the STACKTRACES setting is unrelated
<heller> the result of the USL fitting is shown in the first graph
<diehlpk_work> Ok, I tried it with HPX_WITH_STACKOVERFLOW_DETECTION=Off only and get the same error
<hkaiser> let me have a look at the cmake scripts
<heller> the second graph is then using the model for each grainsize, to calculate what's called latency in the literature
<hkaiser> diehlpk_work: the setting is HPX_WITH_THREAD_STACKOVERFLOW_DETECTION=OFF
EverYoun_ has joined #ste||ar
<heller> then I use the fixed grain size, for each model, to calculate the scheduling overhead; based on that, I can model the expected throughput for different grain sizes, that's on the third graph
<diehlpk_work> Ok, this Option was already off
<hkaiser> diehlpk_work: it's actually messed up, use both for now
<heller> where throughput is tasks/second
<heller> and the latency is in seconds as well
<diehlpk_work> Ok, even when both are off, I get the coroutine error
eschnett has quit [Quit: eschnett]
<heller> the third graph compares the model with the actual measurement
<heller> the only problem I have with this, is that I don't have that nice bathtub curve
<hkaiser> that requires some touching up, though
EverYoung has quit [Ping timeout: 255 seconds]
<hkaiser> heller: so you created a 3d model after all ;)
<diehlpk_work> Ok, I will have a look tomorrow and try to fix it
<hkaiser> diehlpk_work: I'll try to fix this later today
<heller> hkaiser: well, it's using the USL as a foundation, it's more or less just juggling with equations
<diehlpk_work> Ok, cool, thanks
<hkaiser> heller: you said: the second graph is then using the model for each grainsize - that's exactly what I meant
<diehlpk_work> heller, Do you intend to go to ISC next year?
<diehlpk_work> And if yes, could you present the HPXCL paper, if accepted?
<heller> hkaiser: yeah, that's to play with the different models for the different grain sizes
<heller> I have to see where this is going somehow ;)
<diehlpk_work> hkaiser, heller Do you have time to proofread the HPXCL paper before the 22nd of December?
<heller> diehlpk_work: I have no plans for ISC so far
<heller> diehlpk_work: I guess, where do I find it?
<diehlpk_work> The student will commit the first two chapters this Thursday
<diehlpk_work> He will have the first version one week before deadline
<heller> ok
<diehlpk_work> I started to correct his sections already.
<diehlpk_work> He writes one section every week and I correct them all every Friday
<heller> hkaiser: from which benchmarks did you get those bathtub curves?
mbremer has quit [Quit: Page closed]
<diehlpk_work> GTG
<hkaiser> heller: Pat's paper
<heller> which one?
<heller> got it
<heller> hkaiser: what's also interesting when just looking at the USL models is that they seem to converge
<heller> hkaiser: I'll see if I am able to match those measurements tomorrow
<heller> still not sure if this isn't total nonsense, it has worked out too nicely so far
<heller> i find it odd that Fig. 6 in that paper doesn't match the others
<heller> it should show a bathtub as well, shouldn't it?
<heller> ah no, it's just wait time
jakemp has quit [Ping timeout: 240 seconds]
eschnett has joined #ste||ar
<hkaiser> where does it converge to?
<hkaiser> heller: ^^
<hkaiser> heller: I think you're proving axioms using the axioms
<K-ballo> why would anyone prove an axiom? isn't that..
<K-ballo> axiomatic?
* K-ballo wasn't even paying attention to the context, he apologizes in advance just in case
<github> [hpx] hkaiser created fixing_stackoverflow_options (+1 new commit): https://git.io/vF5BS
<github> hpx/fixing_stackoverflow_options 15a4868 Hartmut Kaiser: Unify stack-overflow detection options, remove reference to libsigsegv
<github> [hpx] hkaiser created action_performance (+1 new commit): https://git.io/vF5BQ
<github> hpx/action_performance 0ce15e2 Hartmut Kaiser: Speed up local action execution...
<hkaiser> K-ballo: ;)
<hkaiser> I meant this idiomatically
<github> [hpx] hkaiser opened pull request #3016: Unify stack-overflow detection options, remove reference to libsigsegv (master...fixing_stackoverflow_options) https://git.io/vF5BF
<github> [hpx] hkaiser opened pull request #3017: Speed up local action execution (master...action_performance) https://git.io/vF5BN
<hkaiser> heller: #3017 speeds up local action invocation _significantly_
EverYoun_ has quit [Remote host closed the connection]
EverYoung has joined #ste||ar
eschnett has quit [Quit: eschnett]