hkaiser changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/
K-ballo has quit [Quit: K-ballo]
K-ballo has joined #ste||ar
bita has joined #ste||ar
bita_ has quit [Ping timeout: 252 seconds]
<zao>
heller_: I ran the test repeatedly for six hours, no crash. I then put some intermittent stress-test load on 14-16 cores (of the Ryzen) and the `tests.unit.components.distributed.mpi.migrate_component` test failed in a matter of minutes.
<zao>
Unloaded test runtime is 6-7s, loaded test runs were around 9-10s, this one took 13.5s.
<K-ballo>
I've seen that possible deadlock thing repeatedly, something coming from our spinlock IIRC, it spun more than a few billion times
hkaiser has quit [Quit: bye]
jaafar has quit [Ping timeout: 250 seconds]
<zao>
Not quite the same look as the last crash, but that one might've been provoked by running CP2K builds.
<zao>
K-ballo: Sounds like a lovely trap.
<zao>
I guess that on clusters you don't have much contention for compute on the nodes, unless you've got other threads in your process or some background maintenance happens.
<K-ballo>
the error suggests the compiler is interpreting `get_binding_helper_cast` as a variable rather than a template, but something else is missing
<jbjnr_>
the binder is templated on another T
<jbjnr_>
and the other class is templated on another T too
<jbjnr_>
using binder_type_2D = matrix_numa_binder<T>;
<jbjnr_>
ah shit. the template keyword does work. but I misread another error
<jbjnr_>
I thought it gave a new error, but it's actually ok
<jbjnr_>
there's a const error I need to fix too. Thanks K-ballo, that sorts me out
<zao>
Ah, will try after fika (coffee break)
hkaiser has joined #ste||ar
eschnett has joined #ste||ar
hkaiser has quit [Quit: bye]
<jbjnr_>
Did anyone fix the guided_pool_executor yet? I need to use it so will have to fix it if nobody else did
bita has joined #ste||ar
<K-ballo>
I had tried, but not me.. you said you knew what to fix, I did not
<jbjnr_>
ok. didn't want to spend time on it if someone else has already fixed it
<K-ballo>
I'm eager to understand the underlying causes
<jbjnr_>
the cause is that the tuple isn't unwrapped when it arrives at my executor
<jbjnr_>
or the other way around. I forget now
<jbjnr_>
I'll fix it soon.
<K-ballo>
I'm hoping for something "more underlying" than that
<K-ballo>
I don't understand the design
<jbjnr_>
aha
<jbjnr_>
that's above my pay grade
<K-ballo>
and I don't see how it could be entangled with dataflow's internals
<jbjnr_>
which design do you not understand?
<K-ballo>
whichever one is responsible for coupling the executor with dataflow implementation details, if I understood things correctly
<K-ballo>
it would suggest the guided executor cannot be implemented non-intrusively
<jbjnr_>
the guided executor uses dataflow to hold on to its arguments until the futures are ready; then it fetches their contents to query the memory placement and does a 'dynamic' schedule instead of a static one. I don't recall the details now - but a quirk in the executor design means the dataflow frame is passed in at some point with tuples, so I had to add a nasty overload to intercept the stuff
<jbjnr_>
once I fix the immediate problem, maybe you can see a better fix
<jbjnr_>
"more underlying"
<zao>
Heh, built without deadlock detection, test never completes :D
<K-ballo>
I'm hoping it can be implemented in a way that is independent of dataflow's internals, or failing that, I want to understand and document which internals will affect it so we know what we can and cannot change
<jbjnr_>
"skipping 32 instantiation contexts, use -ftemplate-backtrace-limit=0 to disable" - Gotta love C++ and HPX.
<jbjnr_>
<sigh>
<K-ballo>
zao: you don't say? :P
aserio has joined #ste||ar
akheir has quit [Quit: Konversation terminated!]
akheir has joined #ste||ar
jaafar has joined #ste||ar
eschnett has quit [Read error: Connection reset by peer]
eschnett has joined #ste||ar
<jbjnr_>
/me shakes his fist at K-ballo and swears he never wants to look at the guided pool executor code again
<zao>
`1/1 Test #225: tests.unit.components.distributed.mpi.migrate_component ...***Timeout 1500.03 sec`
<jbjnr_>
zao save yourself some pain and set the timeout to 60s
<jbjnr_>
waiting 1500s is too much dedication
<zao>
Yeah, I always forget when hand-running.
<zao>
I managed to generate (with deadlock detection) something like 5800 runs overnight.
<zao>
A lot of fun distinct failures.
<jbjnr_>
the great news is that when you find the problem, it'll be heller's fault :)
<zao>
Going to be fun to see what I get out of the runs that don't have deadlock detection or if it's all side effects.
eschnett has quit [Read error: Connection reset by peer]
eschnett has joined #ste||ar
eschnett has quit [Read error: Connection reset by peer]
eschnett has joined #ste||ar
jaafar has quit [Ping timeout: 252 seconds]
hkaiser has joined #ste||ar
aserio has quit [Ping timeout: 252 seconds]
<K-ballo>
what was the SC demo doing object classification at the LSU booth?
aserio has joined #ste||ar
aserio1 has joined #ste||ar
aserio has quit [Ping timeout: 252 seconds]
aserio1 is now known as aserio
<parsa_>
K-ballo: it was TensorFlow running on an NVIDIA Jetson
parsa_ is now known as parsa
bibek has joined #ste||ar
<K-ballo>
parsa: is there a link? more info?
<parsa>
K-ballo: i don't think there's a link. but i know the guy. i can put you in contact with him if you want
<K-ballo>
no, that's ok, that'd require human interaction