aserio changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/
EverYoun_ has quit [Remote host closed the connection]
EverYoun_ has joined #ste||ar
EverYoun_ has quit [Remote host closed the connection]
vamatya has quit [Ping timeout: 246 seconds]
daissgr has joined #ste||ar
K-ballo has quit [Quit: K-ballo]
hkaiser has quit [Quit: bye]
EverYoung has joined #ste||ar
EverYoung has quit [Ping timeout: 276 seconds]
EverYoung has joined #ste||ar
EverYoung has quit [Remote host closed the connection]
<zao>
This is nifty... seems like a flag we've built our OpenMPI with for a long while has some rather massive perf shenanigans.
Guest2733 is now known as Vir
<heller_>
lol
<heller_>
jbjnr_: yt?
<jbjnr_>
here
<jbjnr_>
no
<heller_>
jbjnr_: I want to add a pycicle tester, how do I verify that my setup works? As in how would I start a test run of a build?
<jbjnr_>
on a local laptop/machine, or over ssh?
<jbjnr_>
once you've set up a config file like "my-laptop.cmake", then "python ./pycicle -m my-laptop"
<jbjnr_>
but I usually try -p 3118 and --debug
<jbjnr_>
to force just one PR to be checked and also to not trigger any builds (over ssh, to test job launching etc.)
<jbjnr_>
--help gives a quick summary of options
<heller_>
thanks
<jbjnr_>
use the -p option when running locally, because if N branches need rebuilding and you only have one machine, triggering just one branch is a good idea
<jbjnr_>
-f to force a rebuild even if it doesn't need it
<jbjnr_>
python pycicle -f -p 3118
<jbjnr_>
--debug (-d) prints out the commands without sending them over ssh
<github>
hpx/master 261fe3a Thomas Heller: Merge pull request #3125 from STEllAR-GROUP/after_3120...
<jbjnr_>
I might have broken the local build recently, I've only been using it over ssh
<jbjnr_>
if it doesn't work, tell me and I'll fix it
<heller_>
i want to test a ssh build anyway
<jbjnr_>
since it "works for me" but nobody else has tried it, there are probably several things broken that I jus assume people know even though they couldn't possibly know
<heller_>
sure
<heller_>
I think I've got the hang of it now
<heller_>
thanks
<jbjnr_>
my daint setup is just "python ./pycicle.py -m daint"
<heller_>
where would I set the build type?
<jbjnr_>
and I leave it running in a terminal and can see stuff scroll whenever you do a merge (like just now)
<jbjnr_>
then it settles into a quiet state and just prints out something like "check - time since last 86s"
<jbjnr_>
watch out that in the python script there is a hardcoded random clang/gcc option you might want to disable
<jbjnr_>
I need to turn those into proper config settings that can be changed on a per project/setup basis
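(A consolidated recap of the pycicle invocations mentioned above; this summary is pieced together from jbjnr_'s messages, and the exact script name (pycicle vs. pycicle.py) and the way options combine may differ in a given checkout:)

    python ./pycicle.py -m my-laptop             # use the config file my-laptop.cmake
    python ./pycicle.py -m my-laptop -p 3118 -d  # check only PR 3118; print the commands instead of sending them over ssh
    python ./pycicle.py -m my-laptop -f -p 3118  # force a rebuild of PR 3118 even if it isn't needed
    python ./pycicle.py --help                   # quick summary of options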
<heller_>
so ... just using papi_cost, for example, I can't reproduce the segfault
<heller_>
so it is either the generic context coroutines (who knows why those are activated on rostam) or something else
<heller_>
I'll try to reproduce this locally here
<hkaiser>
heller_: how do you know those are enabled?
<heller_>
hkaiser: from the stacktrace
<heller_>
hkaiser: red herring ... they are only enabled for the gcc 6.3 builds
<hkaiser>
I now remember we enabled those deliberately to have them tested on at least one or two platforms
<heller_>
*nod*
<K-ballo>
good thinking
<jbjnr_>
We will have to start running HPX on summit (ARM)
<hkaiser>
indeed
<jbjnr_>
so expect plenty of testing on different architectures soon
<hkaiser>
perfect
<jbjnr_>
ARM + 2 GPUs per node
<heller_>
arm?
<heller_>
I thought it was power9?
<hkaiser>
isn't summit power9?
<heller_>
I have a regular power tester
<jbjnr_>
so sorry. power 9
<jbjnr_>
I'm losing my mind
<jbjnr_>
risc-ish anyway :)
<jbjnr_>
the hpx+cuda situation is a total mess, and this is going to be a problem for us moving forward with our next big project on summit.
<jbjnr_>
ooh - cdash has a FAU machine incoming
<heller_>
hkaiser: regarding our discussion about two threads trying to resume one suspended one ... I am having a hard time figuring out what the semantics of this should be ... I am even thinking that this is UB in general, and the two "resumer" threads need to synchronize in a different way, for example the way our condition variable does it. Am I totally off there?
<heller_>
jbjnr_: I agree ... there would have been an easy way out of this mess...
<jbjnr_>
well. I blame you for that!
<hkaiser>
heller_: stop worrying about this ;-)
<hkaiser>
just leave it alone
<hkaiser>
jbjnr_: having a mess is only a 'problem' if nobody does anything about it
parsa has joined #ste||ar
<jbjnr_>
it'll be my next problem, so I'll have to do something about it
<hkaiser>
good!
<hkaiser>
should we clean up the existing half-way solved 'problems' first?
<jbjnr_>
cat hpx::compute > /dev/null
<jbjnr_>
which "we" are we talking about?
<hkaiser>
if you have a better solution, all the better
<hkaiser>
I always use the royal 'we' ;) I'm a Kaiser
<heller_>
lol
<jbjnr_>
<sigh>
Smasher has quit [Remote host closed the connection]
Smasher has joined #ste||ar
mcopik has quit [Ping timeout: 255 seconds]
<heller_>
hkaiser: the suspend/resume or the other thing?
<hkaiser>
what other thing?
<heller_>
the easy way out of the hpx+cuda mess ;)
<hkaiser>
ahh
<hkaiser>
the suspend/resume
<heller_>
I won't, that really bugs me ;)
<hkaiser>
heh, why did I know ...
<hkaiser>
it's very high risk with questionable outcomes
<hkaiser>
the changes you're working on currently are already noticeable in real applications under very high contention - this change would not be measurable at all - so why bother?
<heller_>
I am getting those changes in first
<heller_>
and I don't think they won't be measurable
<hkaiser>
the current changes will have some effect for sure
<heller_>
absolutely
<jbjnr_>
still trying to fix my bugs, so I can test your stuff heller_
<heller_>
jbjnr_: gradually getting it into master
<jbjnr_>
:)
<heller_>
jbjnr_: once I am completely done, we'll take care of your stuff
<heller_>
it just takes an insane amount of time, since I really want to ensure everything still works
<heller_>
hkaiser: FWIW, there might be a low-hanging fruit that doesn't require a high-risk change
<heller_>
but I won't share any details until I have more than just an educated guess, because I know what you're going to say
<simbergm>
currently hpx::start can return before the runtime is in state_running, is this a feature or a bug? I'd like to suspend the runtime as soon as possible after it's running
<simbergm>
and second, starting the runtime without a main function does not seem to be possible right now, or am I missing some overload?
<hkaiser>
heller_: ok
<hkaiser>
simbergm: I was not aware of start returning before the runtime actually 'runs'
<hkaiser>
but start is meant to signal to the runtime to start - so I guess it could happen, yah
<simbergm>
yeah, I don't think it's necessarily wrong, just wondering if I should do the checking separately in that case
<hkaiser>
simbergm: start(argc, argv) does not need a main-function?
<simbergm>
hkaiser: but that will then call hpx_main?
<hkaiser>
only on locality 0 (if not otherwise configured)
<hkaiser>
and only if the locality is not connecting late
<simbergm>
okay, I'm only dealing with one locality
<simbergm>
I'm just trying to streamline starting the runtime and getting it suspended
<hkaiser>
what would be the point of starting the runtime without running any functions?
<simbergm>
it's not a big deal to add an empty hpx_main though
<hkaiser>
nod
<simbergm>
it should just be initialized, I can then later resume; hpx::async(); suspend
<simbergm>
ideally faster than start/stop
<hkaiser>
makes sense
<hkaiser>
that was not part of the initial design ;)
<hkaiser>
I think passing nullptr as hpx_main will do the trick
<hkaiser>
or a default constructed function<...>{}
<simbergm>
mmh, I'll try that
<simbergm>
an empty function_nonser did not work when I tried it though
<hkaiser>
hold on
<K-ballo>
an empty one might throw, unless it is special cased to be ignored
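(A minimal sketch, not taken from the log, of the start/suspend/resume pattern being discussed; it assumes an HPX version in which hpx::start accepts a nullptr callback and in which hpx::suspend()/hpx::resume() are available, and the header names are assumptions about that version as well:)

    #include <hpx/hpx_start.hpp>
    #include <hpx/hpx_suspend.hpp>
    #include <hpx/include/apply.hpp>
    #include <hpx/include/async.hpp>

    int main(int argc, char* argv[])
    {
        // start the runtime without an hpx_main; start() only signals the
        // runtime to start and may return before it reaches state_running
        hpx::start(nullptr, argc, argv);

        hpx::suspend();                           // park the runtime until it is needed

        hpx::resume();                            // wake it up ...
        hpx::async([]() { /* do work */ }).get(); // ... run something on it ...
        hpx::suspend();                           // ... and park it again

        hpx::resume();
        hpx::apply([]() { hpx::finalize(); });    // initiate shutdown from an HPX thread
        return hpx::stop();                       // wait for the runtime to stop
    }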
<heller_>
I can't reproduce it anywhere else and it is not a stack overflow
<akheir>
heller_: I'll take a look, I patched the server for the meltdown thing last week, I've heard it may cause trouble with papi but I didn't expect this
<heller_>
you don't trust your users :)?
<heller_>
when I run some papi utilities, it seems to work fine
<heller_>
it's just within HPX
<heller_>
maybe you need to recompile PAPI against the new kernel?
<akheir>
heller_: this papi comes from upstream; I will compile a new one and put it in buildbot
<akheir>
heller_: I don't know half of the users ;-)
<heller_>
thanks
<heller_>
let's hope this'll fix it
<heller_>
akheir: another note, I undrained the ariel nodes this morning
<akheir>
heller_: thanks, Patrick told me about it but I wasn't at my desk anymore to fix it
rtohid has joined #ste||ar
daissgr has joined #ste||ar
daissgr has quit [Ping timeout: 276 seconds]
daissgr has joined #ste||ar
<jbjnr_>
heller_: FAU build didn't ever complete by the looks of it :(
<heller_>
jbjnr_: no, I cancelled it
<jbjnr_>
ok
<heller_>
because idiot me included the header tests
<jbjnr_>
Still can't remember what they are/do
<heller_>
they check that all headers are self-contained
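(To illustrate what those tests do: each one is roughly a tiny translation unit that includes exactly one public header and must compile on its own; the header below is just an example:)

    // hypothetical generated test TU for a single header: if the header is not
    // self-contained (i.e. it relies on includes it does not pull in itself),
    // this translation unit fails to compile
    #include <hpx/include/parallel_for_each.hpp>

    int main() { return 0; }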
EverYoung has quit [Ping timeout: 276 seconds]
<github>
[hpx] hkaiser closed pull request #3118: Adding performance_counter::reinit to allow for dynamically changing counter sets (master...reinit_counters) https://git.io/vN2Is
aserio has quit [Ping timeout: 276 seconds]
eschnett has joined #ste||ar
aserio has joined #ste||ar
<aserio>
wash[m]: Will you be joining us today?
parsa has joined #ste||ar
EverYoung has joined #ste||ar
EverYoung has quit [Remote host closed the connection]
EverYoung has joined #ste||ar
EverYoung has quit [Remote host closed the connection]
EverYoung has joined #ste||ar
akheir_ has joined #ste||ar
vamatya has joined #ste||ar
akheir_ has quit [Remote host closed the connection]
parsa has quit [Quit: Zzzzzzzzzzzz]
aserio has quit [Quit: aserio]
aserio has joined #ste||ar
aserio has quit [Ping timeout: 276 seconds]
jaafar has joined #ste||ar
EverYoun_ has joined #ste||ar
EverYoung has quit [Ping timeout: 276 seconds]
daissgr has quit [Read error: Connection reset by peer]
david_pfander has quit [Ping timeout: 240 seconds]
jaafar_ has joined #ste||ar
jaafar has quit [Ping timeout: 252 seconds]
jaafar_ is now known as jaafar
jaafar has quit [Remote host closed the connection]
jaafar_ has joined #ste||ar
jaafar_ is now known as jaafar
aserio has joined #ste||ar
jaafar has quit [Remote host closed the connection]
jaafar has joined #ste||ar
aserio1 has joined #ste||ar
aserio has quit [Ping timeout: 252 seconds]
aserio1 is now known as aserio
daissgr has joined #ste||ar
aserio has quit [Ping timeout: 252 seconds]
aserio has joined #ste||ar
jaafar has quit [Remote host closed the connection]
jaafar has joined #ste||ar
jaafar has quit [Ping timeout: 252 seconds]
jaafar has joined #ste||ar
<aserio>
twwright: yt?
<twwright>
aserio, yes
<aserio>
twwright: see pm please
aserio has quit [Quit: aserio]
aserio has joined #ste||ar
aserio1 has joined #ste||ar
jaafar_ has joined #ste||ar
jaafar has quit [Ping timeout: 276 seconds]
aserio1 has quit [Remote host closed the connection]
daissgr has quit [Quit: WeeChat 1.4]
aserio has quit [Ping timeout: 252 seconds]
daissgr has joined #ste||ar
eschnett has quit [Quit: eschnett]
aserio has joined #ste||ar
hkaiser has joined #ste||ar
<jbjnr_>
heller_: good news, bad news - good news - I fixed my problem and can run tests again - bad news, no noticeable speedup using your fix overheads branch (yet)
kisaacs has joined #ste||ar
<hkaiser>
jbjnr_: you most likely won't be able to see a measurable improvement in overall runtime, but you should be able to reduce your thread-granularity which then might improve runtime
<aserio>
hkaiser: would you forward me Jiangua's email?
<hkaiser>
done
<jbjnr_>
that's exactly what I'm testing. no noticeable speedup for the smaller block sizes
<hkaiser>
jbjnr_: ok, good to know
aserio has quit [Ping timeout: 252 seconds]
Smasher has quit [Remote host closed the connection]
Smasher has joined #ste||ar
akheir has quit [Remote host closed the connection]