aserio changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/
EverYoung has quit [Ping timeout: 252 seconds]
zombieleet has quit [Ping timeout: 248 seconds]
galabc has joined #ste||ar
<github>
[hpx] hkaiser force-pushed refactor_base_action from 8d692d5 to c0673fc: https://git.io/vAvAI
<github>
hpx/refactor_base_action c0673fc Hartmut Kaiser: Refactoring component_base and base_action/transfer_base_action to reduce number of instantiated functions and exported symbols...
EverYoung has joined #ste||ar
EverYoung has quit [Remote host closed the connection]
<jbjnr>
heller_: good morning - question - are you done with small task launching/destruction cleanups or are there more PRs to come?
<heller_>
jbjnr: there are more, I'm currently waiting on the lazy init PR to get approved
<jbjnr>
shall I take a look at that now?
<heller_>
And fixing bugs in the meantime
<heller_>
Sure
<heller_>
The wait_or_add_new removal is next
<heller_>
And then the thread map removal
<jbjnr>
aha. that's a big one
<jbjnr>
which #PR is lazy_init
<jbjnr>
found it
<heller_>
3146
<heller_>
I guess the bug fixing PRs have higher priority though
<jbjnr>
do we still have a race condition somewhere deep inside hpx that causes random tests to fail from time to time
<heller_>
I just fixed one last night which was there since forever
<jbjnr>
great. PR?
<heller_>
Those will always pop up
<jbjnr>
s/always/should never/g
<heller_>
3153
<heller_>
And the stack size test is fixed with my other pr
<jbjnr>
you get a gold star
<heller_>
#3150
<heller_>
migrate test is next...
<heller_>
I want a green release
<jbjnr>
hmmm. I see that most tests are passing with the normal pycicle build, but simbergm's dodgy sanitizer build is flagging everything as bad.
<heller_>
Let's make master green, branch, create the RC, and call it a day
<heller_>
Yes
<heller_>
Right, I wanted to look into the leak sanitizer failures first
<jbjnr>
has anyone looked at the sanitizer problems - do we know where they are coming from?
<jbjnr>
^^ :)
<heller_>
Thread sanitizer is something we should turn off
<heller_>
It's too buggy in itself
<jbjnr>
fair enough
<jbjnr>
I shall merge 3150
<jbjnr>
simbergm: when you read this - please disable the thread sanitizer build - the leak sanitizer is finding enough for now and we have enough red. We can re-enable it sometime if necessary
<simbergm>
jbjnr: yeah, no problem
<simbergm>
stopped
<jbjnr>
cool.
<jbjnr>
How many pycicle instances have you got running?
<jbjnr>
^had
<heller_>
another thing: would be cool if we turned on debugging symbols for the sanitizer builds
<simbergm>
1
EverYoung has joined #ste||ar
<simbergm>
heller_: yep, can do that once we turn it on again
<jbjnr>
adding a debug builds would be useful in general too
<heller_>
simbergm: I mean for the leak sanitizer
<jbjnr>
I will do some options for pycicle today to simplify that
<simbergm>
heller_: yeah, that's what I understood
<heller_>
and I found that the leak/address sanitizer reports a lot of false positives when you haven't build boost with the same sanitizer flags
<heller_>
which might be the issue at hand ... since I ran the same tests with a debug build and don't encounter those reports
<heller_>
but will try with a relwithdebinfo build today
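[Editor's note: heller_'s point above — that the leak/address sanitizer reports false positives when Boost is not built with the same sanitizer flags — implies rebuilding Boost with matching instrumentation. A plausible b2 invocation follows; the exact toolset and flags are illustrative and should be adjusted to the local setup:]

```shell
# Build Boost with AddressSanitizer instrumentation so it matches an
# ASan-enabled HPX build; without this, Boost internals can show up as
# leak/ASan false positives in the application's reports.
./bootstrap.sh --with-toolset=clang
./b2 toolset=clang \
    cxxflags="-fsanitize=address -fno-omit-frame-pointer -g" \
    linkflags="-fsanitize=address" \
    variant=debug stage
```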
<jbjnr>
simbergm: please add the compile flags / build settings that you are using for your sanitizer builds to the pycicle issue so I can add them to the 'options' branch I will start on
<simbergm>
jbjnr: ok
EverYoung has quit [Ping timeout: 265 seconds]
<jbjnr>
what are our rules for merging several PRs at once? Do we wait for a build cycle, or do we now trust pycicle enough, since the PRs have been tested lots?
vamatya has quit [Ping timeout: 256 seconds]
<jbjnr>
and should I leave all merges to simbergm as release master, or can I do some?
<simbergm>
jbjnr: you're more than welcome to merge
<jbjnr>
multiple PRs at a time?
<simbergm>
I usually wait, but it's quite slow
<heller_>
I tend to wait as well
<simbergm>
I count on hkaiser merging some in the night :)
<heller_>
but it is upon your judgement, if you think they will cause no further damage ;)
<jbjnr>
yup. I'm thinking that when PRs are not overlapping and since they are tested independently, it ought to be ok to merge several now
<heller_>
right
<jbjnr>
ok.
<simbergm>
in normal circumstances I think waiting for them to finish is okay
<heller_>
also, something like doc fixes can be easily interleaved
<simbergm>
some PRs have been obviously okay so those can overlap
<github>
[hpx] biddisco closed pull request #3153: Fixing a race with timed suspension (master...fix_timed_suspension) https://git.io/vAJjp
<simbergm>
it's getting quite close
<jbjnr>
what is?
<simbergm>
the possibility of a release candidate
<jbjnr>
k
<heller_>
yeah
<simbergm>
meaning rostam is quite clean
<heller_>
let's try to get master as clean as possible on rostam
<simbergm>
yep
<heller_>
then branch the RC, let it settle and test it for a week and then release it
<jbjnr>
rostam? I forgot it was still there!
<simbergm>
:P
<simbergm>
it's still very useful because it does so many builds
<heller_>
yes
<heller_>
eventually, we need to migrate most of them to pycicle ;)
david_pfander has joined #ste||ar
<jbjnr>
heller_: feel free to comment on https://github.com/biddisco/pycicle/issues/13 with the kind of options for different builds you'd like to see in order to replace rostam/buildbot
<heller_>
simbergm: so I can't reproduce the leak sanitizer failures locally, that is with a sanitizer enabled boost build
<heller_>
so the pycicle errors seem to be one of those false positives I was talking about
<simbergm>
heller_: thanks for checking, did not realize that it needs that as well
<simbergm>
this is the split_gid error, right?
<heller_>
yes
<heller_>
hmmm
<heller_>
I might be wrong after all ...
<heller_>
the FAQ doesn't see a problem with only having parts of the program instrumented
<simbergm>
did you run it on multiple tests?
<heller_>
yes
<simbergm>
ok
<heller_>
I might just upgrade my clang
<simbergm>
I am running with gcc though
<simbergm>
this might make a difference
<heller_>
ahh!
<heller_>
no clang?
<simbergm>
well, not yet at least
<heller_>
ok, I wasn't aware of that
<heller_>
let me try with gcc as well
<simbergm>
but I can't tell if it's gcc giving false positives or clang missing something
<simbergm>
second seems more likely
<github>
[hpx] biddisco created ctest_warnings (+1 new commit): https://git.io/vAUru
<heller_>
simbergm: jbjnr: since hwloc2 is now released, should we make everything working with it before the release?
<jbjnr>
heller_: no
<jbjnr>
hwloc 2 completely changes the memory hierarchy and everything breaks completely. We cannot use hwloc 2 without a rewrite of the topology class.
<jbjnr>
unfortunately :(
<jbjnr>
we will need to add a check in cmake to give an error if hwloc 2 is used, for the moment
<jbjnr>
all the numa domain stuff needs to be redone.
simbergm has joined #ste||ar
<heller_>
ok
<heller_>
too bad
<heller_>
alright, I think I fixed component migration
<heller_>
nope.
<heller_>
:/
<heller_>
I think someone broke rostam
<simbergm>
heller_: #3153? :/
<simbergm>
pycicle seems to have timed out on that
<simbergm>
jbjnr: two feature requests for pycicle
<simbergm>
1. set the PR status to pending when starting a build
<simbergm>
2. merge the config, build, test statuses into one
<jbjnr>
already got an issue for that
<jbjnr>
#1 I mean
<simbergm>
very nice
<simbergm>
2 is not a must but I think they don't all need a separate status
<jbjnr>
2 - could do. Not a very big deal though, github already does an and/or operation on the results for us
<simbergm>
I can try to hack it together in pycicle as well
<simbergm>
yep
<simbergm>
but 1 would be very nice
<github>
[hpx] msimberg created revert-3153-fix_timed_suspension (+1 new commit): https://git.io/vAUHD
<github>
hpx/revert-3153-fix_timed_suspension 30135c6 Mikael Simberg: Revert "Fixing a race with timed suspension"
<heller_>
why did you revert it?
<jbjnr>
why do you think 3153 is the problem?
<simbergm>
not reverted yet but have a look on rostam
<heller_>
rostam is completely fried
<heller_>
as it seems
<simbergm>
ah, you think it's unrelated?
<heller_>
yes
<jbjnr>
when the builds don't complete - it's a disk/memory error or something
<heller_>
yeah
<heller_>
there are three jobs on rostam that have been running for over 12 hours at the moment
<simbergm>
sorry, missed that
<heller_>
and the binaries have been running as long as the jobs
<simbergm>
can you see that on the buildbot page? or is this insider knowledge?
<heller_>
so I am guessing that they just hammer the FS right now...
<heller_>
this is me logging onto rostam as root and looking at the jobs
<simbergm>
ok, thanks for stopping me
<heller_>
once the guys at LSU wake up, we should try to merge your doc PR
<github>
[hpx] msimberg deleted revert-3153-fix_timed_suspension at 30135c6: https://git.io/vAUQm
<heller_>
just take a look at this, doesn't get through the queue
<heller_>
simbergm: the failures are still suspicious though
<heller_>
I'll investigate
<simbergm>
I hope it's rostam
<simbergm>
but the builds didn't finish on pycicle either
<heller_>
just noticed
<heller_>
I'll give it a whirl
<heller_>
at least migrate_component is running super smooth now :P
hkaiser has joined #ste||ar
<simbergm>
that is good news!
<heller_>
and just the second after I said it, it crashed again :/
<jbjnr>
simbergm: the fact that rostam keeps dying on us is one of the reasons I wanted to do builds on daint. We have a whole team of people who keep daint running and maintain the filesystem etc. Rostam has one guy - and he's supposed to be doing a PhD, not maintaining the system.
<heller_>
K-ballo: yes
<heller_>
jbjnr: simbergm: I guess it was the patch after all
<heller_>
let me investigate though
<heller_>
before reverting and then commiting again
<simbergm>
jbjnr: yeah, that's a good thing, but daint is not exactly super stable either
<jbjnr>
so daint couldn't launch any jobs for some reason.
<heller_>
simbergm: works for me :/
<simbergm>
huh, I can check again
<heller_>
got it to hang now ... needs 4 cores
<simbergm>
ok
<simbergm>
"good"
<github>
[hpx] AntonBikineev created fix_3134 (+1 new commit): https://git.io/vAUAf
<github>
hpx/fix_3134 6b25fb4 AntonBikineev: Fixing serialization of classes with incompatible serialize signature...
<github>
[hpx] AntonBikineev opened pull request #3156: Fixing serialization of classes with incompatible serialize signature (master...fix_3134) https://git.io/vAUAT
<jbjnr>
action_invoke_no_more_than_test hung by the looks of it
<jbjnr>
so ....
<jbjnr>
action_invoke_no_more_than hung on daint. Ctest tried to kill it, but it didn't die. After that, all the subsequent tests timed out and failed because slurm couldn't get the job step. Once I manually killed the action_no_more_than test, slurm continued to work and the remaining tests began passing.
<jbjnr>
simbergm: ^ heller_ ^
<heller_>
yeah
<heller_>
looking into it right now
<simbergm>
hum, this happens on rostam as well
<simbergm>
seems like ctest is not able to properly kill tests sometimes
<jbjnr>
yes, because they are inside a dodgy python wrapper
<jbjnr>
note to self: get rid of hpxrun.py
<jbjnr>
timed_this_thread_executors is also bad by the looks of things
<heller_>
yeah, not sure what's going on
<jbjnr>
just remember rule #1
<simbergm>
it's heller's fault?
<jbjnr>
you're a fast learner!
<simbergm>
sorry heller, jbjnr has brainwashed me, I couldn't help it
<jbjnr>
though it's probably your dodgy thread suspension stuff that broke everything :)
<simbergm>
everything!
<simbergm>
probably
<jbjnr>
simbergm: if I had a pool called "stuff" and I wanted to put just that pool to sleep, could I launch a task with a pool_executor("stuff") on the stuff pool and then put the pool to sleep by saying async(stuff_executor, task).then(put_pool_to_sleep("stuff"))?
<jbjnr>
I want to create two pools, use them for something, and then put them to sleep until I need them again, and by attaching a continuation to the tasks on those pools, I could make sure my task runs, then make them sleep
<simbergm>
jbjnr: you should *not* launch put_pool_to_sleep on the pool you want to suspend
<simbergm>
i.e. it can't put itself to sleep
<jbjnr>
ok, fine, I do this: async(stuff_executor, task).then(default_executor, put_pool_to_sleep("stuff"))
<jbjnr>
so the put pool to sleep task runs on another pool
<simbergm>
yeah, something like that should work
<jbjnr>
but I can do it with your changes yes?
<jbjnr>
cool
<jbjnr>
I will implement it then
<simbergm>
assuming I didn't break everything, yes ;)
<simbergm>
it's there at least
<jbjnr>
to suspend one pool I call suspend("pool name") ?
<simbergm>
get_thread_pool("stuff").suspend()
<jbjnr>
lovely
<jbjnr>
thanks
<simbergm>
it's a member function of thread_pool_base
<jbjnr>
to wake it up again can I just get_thread_pool("stuff").resume()
<simbergm>
yep
<jbjnr>
lovely
<jbjnr>
tests.unit.threads.set_thread_state died too. must be a hanging on termination problem.
hkaiser has quit [Quit: bye]
mcopik has joined #ste||ar
<jbjnr>
simbergm: it's my bad - I am looking at the dashboard and I see that the PR tests for 3153 all failed, but I merged it anyway. I must have looked at the wrong PR number when I decided it was safe to merge. Sorry.
<jbjnr>
I guess it's a good idea to revert it.
<heller_>
it's all broken!
<heller_>
I hate race conditions
mcopik_ has joined #ste||ar
mcopik_ has quit [Client Quit]
<simbergm>
jbjnr: I've done the same... that's why the pending status on the PR is useful
<jbjnr>
ok. good point.
<simbergm>
heller_: I'll go ahead and revert for now so we can merge other stuff in the meantime
<simbergm>
so rostam seems to be slowly making progress, I guess it's okay to merge or do you think I should wait?
<heller_>
shouldn't matter
<heller_>
a fix will probably take until tomorrow, yeah
<github>
[hpx] msimberg created revert-3153-fix_timed_suspension (+1 new commit): https://git.io/vATsm
<github>
hpx/revert-3153-fix_timed_suspension 7e37600 Mikael Simberg: Revert "Fixing a race with timed suspension"
<github>
[hpx] msimberg opened pull request #3157: Revert "Fixing a race with timed suspension" (master...revert-3153-fix_timed_suspension) https://git.io/vATsE
<diehlpk_work>
I have both thresholds now, and for smaller vectors and matrices we are slightly better than omp
<hkaiser>
nice
EverYoung has quit [Remote host closed the connection]
EverYoung has joined #ste||ar
EverYoung has quit [Remote host closed the connection]
EverYoung has joined #ste||ar
vamatya has joined #ste||ar
EverYoun_ has joined #ste||ar
EverYoun_ has quit [Remote host closed the connection]
EverYoun_ has joined #ste||ar
EverYoung has quit [Ping timeout: 265 seconds]
kisaacs has quit [Ping timeout: 240 seconds]
<heller_>
diehlpk_work: release vs debug build?
vamatya has quit [Read error: Connection reset by peer]
vamatya has joined #ste||ar
EverYoung has joined #ste||ar
EverYoun_ has quit [Ping timeout: 276 seconds]
aserio has quit [Ping timeout: 252 seconds]
mbremer has joined #ste||ar
<heller_>
round 2.
<github>
[hpx] sithhell created fix_timed_suspension (+1 new commit): https://git.io/vAT5M
<github>
hpx/fix_timed_suspension 81b2856 Thomas Heller: Fixing a race with timed suspension...
<github>
[hpx] sithhell opened pull request #3158: Fixing a race with timed suspension (second attempt) (master...fix_timed_suspension) https://git.io/vAT55
aserio has joined #ste||ar
<diehlpk_work>
heller_, was in a meeting, will have a look soon
kisaacs has joined #ste||ar
aserio1 has joined #ste||ar
EverYoung has quit [Remote host closed the connection]
aserio has quit [Ping timeout: 276 seconds]
aserio1 is now known as aserio
parsa[w] has quit [Read error: Connection reset by peer]
kisaacs has quit [Ping timeout: 260 seconds]
parsa[w] has joined #ste||ar
sam29 has joined #ste||ar
EverYoung has joined #ste||ar
kisaacs has joined #ste||ar
autrilla has quit [Disconnected by services]
autrilla1 has joined #ste||ar
autrilla1 has quit [Client Quit]
<diehlpk_work>
hkaiser, heller_ The default behavior of hpx with parsing commandline args is not nice
<hkaiser>
diehlpk_work: ok?
<hkaiser>
what behavior is bothering you?
<diehlpk_work>
Passing an unknown argument does not result in an error
<hkaiser>
that depends
<diehlpk_work>
With the new version -t is not supported anymore and I switched to --hpx:threads=
<hkaiser>
that's only true if you use hpx_main.hpp - that is to allow for applications handling their own arguments without hpx complaining
<diehlpk_work>
Forgot to append the s to threads and HPX was running with all cores
<hkaiser>
-t is not supported only for hpx_main.hpp as well
<diehlpk_work>
Therefore, I got different results for BLAZE
<hkaiser>
diehlpk_work: how do you suggest we handle this?
<diehlpk_work>
Is it possible to print a warning that the provided commandline option is not known or used by hpx?
<diehlpk_work>
Some tools say unrecognized option
<hkaiser>
that would annoy people that use hpx_main.hpp and want to handle their own command line arguments
<diehlpk_work>
Yes, you are right
<hkaiser>
we could handle that using an application-wide pp constant
EverYoun_ has joined #ste||ar
<diehlpk_work>
Yes, or just let hpx start with one core as default and not all
<diehlpk_work>
I would make the default one core, and with --hpx:threads= one specifies the number of cores
EverYoung has quit [Ping timeout: 276 seconds]
<diehlpk_work>
I just used htop and realized that hpx uses too many cores
<diehlpk_work>
At least I know why my performance dropped
<diehlpk_work>
And can start to work on tune blaze again
EverYoun_ has quit [Remote host closed the connection]
sam29 has quit [Ping timeout: 260 seconds]
EverYoung has joined #ste||ar
<hkaiser>
diehlpk_work: fight this out with jbjnr, he changed it to use all cores
<diehlpk_work>
Ok, I will do this
hkaiser has quit [Quit: bye]
cogle has joined #ste||ar
EverYoung has quit [Remote host closed the connection]
EverYoung has joined #ste||ar
kisaacs has quit [Ping timeout: 276 seconds]
hkaiser has joined #ste||ar
kisaacs has joined #ste||ar
twwright_ has joined #ste||ar
twwright has quit [Read error: Connection reset by peer]
twwright_ is now known as twwright
twwright has quit [Client Quit]
twwright has joined #ste||ar
daissgr has quit [Quit: WeeChat 1.4]
daissgr has joined #ste||ar
<github>
[hpx] aserio opened pull request #3159: Support Checkpointing Components (master...checkpoint_component) https://git.io/vAkWZ
aserio has quit [Quit: aserio]
daissgr has quit [Ping timeout: 255 seconds]
<jbjnr>
hkaiser: if I want to create a task and it is already ready to run - all futures it depends on are ready - what is the most efficient way of creating it and inserting it directly onto the queues - is lcos::local::futures_factory inefficient?
<hkaiser>
it needs an allocation, otherwise it should be fine
<hkaiser>
creating a future always allocates
<hkaiser>
otherwise there shouldn't be much overhead
<jbjnr>
is there a flag I should pass to say: this task can be run - it is ready, not waiting for anything
<jbjnr>
I guess async says that already
daissgr has joined #ste||ar
ct-clmsn has joined #ste||ar
<ct-clmsn>
hkaiser, where in the source tree is the strassen test you put together?
<ct-clmsn>
i've done a local update and fgrep and can't seem to find the right keyword to search