aserio changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/
EverYoung has joined #ste||ar
EverYoung has quit [Ping timeout: 246 seconds]
EverYoung has joined #ste||ar
diehlpk has joined #ste||ar
EverYoung has quit [Ping timeout: 252 seconds]
vamatya has quit [Ping timeout: 240 seconds]
K-ballo has quit [Quit: K-ballo]
hkaiser has quit [Quit: bye]
diehlpk has quit [Ping timeout: 240 seconds]
vamatya has joined #ste||ar
vamatya_ has joined #ste||ar
vamatya has quit [Ping timeout: 246 seconds]
EverYoung has joined #ste||ar
EverYoung has quit [Ping timeout: 246 seconds]
Matombo has joined #ste||ar
vamatya_ has quit [Ping timeout: 248 seconds]
Matombo has quit [Remote host closed the connection]
EverYoung has joined #ste||ar
EverYoung has quit [Ping timeout: 246 seconds]
<heller>
jbjnr: got a running stream benchmark on daint again
<heller>
including GPU support
david_pfander has joined #ste||ar
bikineev has joined #ste||ar
bikineev has quit [Read error: Connection reset by peer]
bikineev has joined #ste||ar
<jbjnr>
heller: what's your skype id?
<jbjnr>
we are setting everything up here
<jbjnr>
we have a laptop ready for your face
bikineev_ has joined #ste||ar
bikineev_ has quit [Remote host closed the connection]
bikineev has quit [Ping timeout: 240 seconds]
david_pfander has quit [Ping timeout: 240 seconds]
<heller>
jbjnr: heller52
<jbjnr>
ok, I'll send an invite from the cscs machine/account
<heller>
thanks
david_pfander has joined #ste||ar
<jbjnr>
Will is giving a quick intro etc
<jbjnr>
On our team is Alan, from Nvidia
<jbjnr>
he will help us port our kernels, if we can explain them well enough to him
<heller>
great
<heller>
I am only having problems with cuda + clang in debug mode
<jbjnr>
ok, we were deciding that relwithdebinfo would be our mode of choice for the week
<jbjnr>
I'll do an hpx install that we will use at this end; we hope all of us can use the same basic set of binaries if possible
<heller>
ok
<heller>
I have some adjustments
<heller>
and we'll probably need to adjust the install over the course of the week
<jbjnr>
well, we'll probably have our own octo builds as we tweak stuff
<jbjnr>
yes^
<heller>
MPI support is missing for example
<jbjnr>
adjustments for sure
<jbjnr>
We will concentrate on single node to begin with
<heller>
sure
<heller>
jbjnr: speak up that we intend to use clang
<jbjnr>
nb. there is a 128 node reservation on daint, but only eurohack accounts can access it :(
<jbjnr>
I announced that at the mentors meeting this morning
<github>
hpx/cuda_clang 56a0a30 Thomas Heller: Fixing ICE with nvcc
bikineev_ has quit [Ping timeout: 240 seconds]
<jbjnr>
hkaiser: heller I have not been able to find a reasonable explanation for the double peaks in our task times https://pasteboard.co/GI0Dk5x.png - is there any conceivable way that when running hpx on many threads - it could accidentally run the task twice - due to a race in the deep internals?
<heller>
very unlikely
<jbjnr>
indeed.
<hkaiser>
unlikely indeed
<jbjnr>
just posted on slack that stream ok now, thanks for boost patch
<heller>
semi ok
<heller>
performance sucks
<hkaiser>
jbjnr: could be a matter of critical tasks being executed too late, holding back everything else
<heller>
which worries me
<heller>
i'd really check the hardware counters to see what kind of cache misses or other memory transfer we are dealing with here
<jbjnr>
no.
<heller>
likwid would be a perfect tool to check this
<jbjnr>
the cache cannot explain it and the task execution cannot cause it - the time is started inside the lambda, and stopped at the end of the lambda
<jbjnr>
memory bw calcs do not allow for the scale of the slowdown - cache is not the cause
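For reference, a minimal sketch of the measurement pattern described above, using plain std::async and hypothetical names rather than the actual HPX benchmark code: the clock starts inside the lambda and stops before it returns, so scheduler queueing delay cannot inflate the recorded task time.

#include <chrono>
#include <future>
#include <iostream>

// Stand-in for the real kernel; the actual benchmark code is not shown in the log.
double run_tile()
{
    volatile double sum = 0.0;
    for (int i = 0; i < 1000000; ++i)
        sum = sum + i * 0.5;
    return sum;
}

int main()
{
    auto task = std::async(std::launch::async, [] {
        auto t0 = std::chrono::steady_clock::now();   // timer started inside the lambda
        run_tile();
        auto t1 = std::chrono::steady_clock::now();   // timer stopped at the end of the lambda
        return std::chrono::duration<double, std::milli>(t1 - t0).count();
    });
    std::cout << "task time: " << task.get() << " ms\n";
}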
<hkaiser>
no suspension?
<jbjnr>
if we run a small tile, it takes 8ms and we see a peak at 8 and another at 16
<heller>
please just check it
<jbjnr>
if we run a big tile that takes 30, we see a peak at 30 and another at 60
<hkaiser>
loading tlbs?
<jbjnr>
only explanation is two threads bound to one core
<jbjnr>
but diagnostics disprove this
<jbjnr>
as I can dump out the core with each task
<jbjnr>
and they are all different and correct
<hkaiser>
TLBs?
<heller>
translation lookaside buffers?
<hkaiser>
yes
<heller>
instruction cache misses?
<jbjnr>
none of these would cause a 2x delay - they would add some overhead, but not scale with tile size
<heller>
well
<jbjnr>
bbiab
<hkaiser>
jbjnr: TLBs would scale
<heller>
we'll only know for sure once we actually look at the counters
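As a rough back-of-envelope (all constants below are assumptions for illustration, not measurements from daint): even if every 4 KiB page of a tile cost a full TLB miss, the overhead would indeed scale linearly with tile size, yet stay far below a 2x slowdown of a multi-millisecond task.

#include <cstdio>

int main()
{
    // Assumed constants, for illustration only.
    double const page_bytes   = 4096.0;              // typical small page size
    double const miss_cost_ns = 100.0;               // pessimistic cost per TLB miss
    double const tile_bytes   = 64.0 * 1024 * 1024;  // hypothetical 64 MiB tile
    double const task_time_ms = 30.0;                // task time mentioned in the log

    double const misses      = tile_bytes / page_bytes;
    double const overhead_ms = misses * miss_cost_ns * 1e-6;

    std::printf("worst-case TLB overhead: %.2f ms of a %.0f ms task (%.1f%%)\n",
                overhead_ms, task_time_ms, 100.0 * overhead_ms / task_time_ms);
    // Prints roughly 1.6 ms, i.e. about 5% of 30 ms -- it scales with tile size,
    // but cannot by itself account for a second peak at 2x the task time.
}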
<heller>
hkaiser: btw, nvcc on daint works now. as well as cuda clang
<hkaiser>
cool, what did you change?
<heller>
hooray for spending hours and hours in front of compiler error messages ;)
<heller>
really nothing
<hkaiser>
heller: you're my hero
<heller>
now, we need to bring back the performance :P
<hkaiser>
uhh, so why did it start to work?
<heller>
well, the strange segfaults last week were on my local test system
<heller>
now I am running on daint
<hkaiser>
and the compilation problems in unwrap?
<heller>
they are only showing up with the binpacking distribution policy
<hkaiser>
ahh, because that ties in the actions, right?
<heller>
the assertion (in the EDG frontend) is coming out of a file named "scope_tks.c"
<hkaiser>
lol
<hkaiser>
very helpful
<heller>
the policies are statically initialized, right?
<hkaiser>
might be
<heller>
with a global, that is, at namespace scope
<hkaiser>
don't remember
<heller>
yeah, they are
<hkaiser>
ok
<heller>
so this is my guess: it is ok when unwrap is called from within function scope
<hkaiser>
heller: could you comment about your findings on the related tickets, pls?
<heller>
but the assert fails once it is instantiated from a static scope
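A minimal sketch of the two instantiation contexts being discussed, with hypothetical names standing in for the actual HPX unwrap and binpacking distribution policy code: the same template machinery compiles when instantiated from function scope, while the failing case ties it into the initializer of a namespace-scope (statically initialized) global, which is reportedly where the EDG front end asserts.

#include <utility>

// Hypothetical stand-in for the unwrap machinery; not the actual HPX code.
template <typename F>
auto unwrap_like(F&& f)
{
    return [f = std::forward<F>(f)](auto&&... args) {
        return f(std::forward<decltype(args)>(args)...);
    };
}

// Hypothetical stand-in for a distribution policy whose constructor
// instantiates the unwrap machinery.
struct policy_like
{
    policy_like() { unwrap_like([](int i) { return i; })(42); }
};

// Case 1: instantiation tied into a statically initialized, namespace-scope
// global -- the context in which the front-end assertion reportedly fires.
policy_like const global_policy{};

// Case 2: the same expression instantiated from function scope, which is
// reported to compile fine.
int main()
{
    auto f = unwrap_like([](int i) { return i + 1; });
    return f(0) == 1 ? 0 : 1;
}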