hkaiser changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/
K-ballo has quit [Quit: K-ballo]
K-ballo has joined #ste||ar
bita has joined #ste||ar
bita_ has quit [Ping timeout: 252 seconds]
<zao> heller_: I ran the test repeatedly for six hours, no crash. I then put some intermittent stress-test load on 14-16 cores (of the Ryzen) and the `tests.unit.components.distributed.mpi.migrate_component` test failed in a matter of minutes.
<zao> Hrm, deadlock detection?
<zao> Unloaded test runtime is 6-7s, loaded test runs were around 9-10s, this one took 13.5s.
<K-ballo> I've seen that possible deadlock thing repeatedly, something coming from our spinlock IIRC, it spun more than a few billion times
hkaiser has quit [Quit: bye]
jaafar has quit [Ping timeout: 250 seconds]
<zao> Not quite the same look as the last crash, but that one might've been provoked by running CP2K builds.
<zao> K-ballo: Sounds like a lovely trap.
<zao> I guess that on clusters you don't have much contention for compute on the nodes, unless you've got other threads in your process or some background maintenance happens.
<zao> Ooh, this failure seems more interesting and seems similar to the one before: https://gist.github.com/zao/53f006ddd21d6377da7599335cec38db
<zao> `{what}: assertion 'it != migrating_objects_.end() && get<0>(it->second)' failed: HPX(assertion_failure)`
<zao> This is all on that `reduce_iterators` commit, btw.
<zao> `{what}: assertion 'naming::detail::is_migratable(gid_)' failed: HPX(assertion_failure)`
<zao> `test 't1.get_data() == 42' failed in function 'test_migrate_busy_component2(hpx::naming::id_type, hpx::naming::id_type)::<lambda()>': '0' != '42'`
<zao> Seems reasonably easy to provoke it into being sad if you `stress -c 16` this machine.
<zao> Crashes every other minute or so.
<zao> Now sleep :D
<zao> Ooh, hpx/runtime/components/server/migration_support.hpp:79: void hpx::components::migration_support<BaseComponent, Mutex>::pin() [with BaseComponent = hpx::components::component_base<test_server>; Mutex = hpx::lcos::local::spinlock]: Assertion `pin_count_ != ~0x0u' failed.
<K-ballo> that one I do remember
eschnett has joined #ste||ar
<Yorlik> playing with atomics for concurrency control. Contention really takes its toll :D: Thread 15: 35414 / 1000000 success = 3.5414 %
<Yorlik> Threads: 16, Count: 416514, Runs Max: 16000000, runs real: 416514
<Yorlik> success means the thread could acquire the local lock on a struct using the atomic
<Yorlik> After this simple stupid test my respect for people writing proper, performant concurrent code and structures went up an order of magnitude.
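For context, a minimal sketch (hypothetical code, not Yorlik's actual test) of the kind of experiment described above: each thread repeatedly tries to take a per-struct lock built from a single atomic and counts how often the try-lock succeeds, which collapses quickly under contention.

```cpp
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

// A struct whose "local lock" is a single atomic bool.
struct guarded_data
{
    std::atomic<bool> locked{false};
    long value = 0;
};

int main()
{
    constexpr int num_threads = 16;
    constexpr long attempts_per_thread = 1000000;

    guarded_data data;
    std::vector<long> successes(num_threads, 0);
    std::vector<std::thread> threads;

    for (int t = 0; t != num_threads; ++t)
    {
        threads.emplace_back([&, t] {
            for (long i = 0; i != attempts_per_thread; ++i)
            {
                // try-lock: succeeds only if no other thread holds the lock
                bool expected = false;
                if (data.locked.compare_exchange_strong(
                        expected, true, std::memory_order_acquire))
                {
                    ++data.value;    // critical section
                    data.locked.store(false, std::memory_order_release);
                    ++successes[t];
                }
            }
        });
    }
    for (auto& th : threads)
        th.join();

    for (int t = 0; t != num_threads; ++t)
        std::printf("Thread %d: %ld / %ld success = %.4f %%\n", t, successes[t],
            attempts_per_thread, 100.0 * successes[t] / attempts_per_thread);
}
```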
hkaiser has joined #ste||ar
eschnett has quit [Quit: eschnett]
eschnett has joined #ste||ar
eschnett has quit [Quit: eschnett]
K-ballo has quit [Quit: K-ballo]
bita has quit [Ping timeout: 252 seconds]
hkaiser has quit [Quit: bye]
parsa_ has quit [Remote host closed the connection]
parsa has joined #ste||ar
parsa is now known as parsa_
jaafar has joined #ste||ar
david_pfander has joined #ste||ar
jaafar has quit [Ping timeout: 252 seconds]
K-ballo has joined #ste||ar
<jbjnr_> K-ballo: quick question?
<K-ballo> here
<jbjnr_> what's the canonical way of making this do the right thing
<jbjnr_> const Binder & /*unused*
<jbjnr_> grrr..
<jbjnr_> sorry. wrong paste
<jbjnr_> template <typename Binder>
<jbjnr_> std::shared_ptr<Binder> get_binding_helper_cast() {
<jbjnr_> return std::dynamic_pointer_cast<Binder>(binding_helper_);
<jbjnr_> }
<jbjnr_> I need to supply an empty param to make it pick the right type for Binder. But is there an easier way?
<jbjnr_> (where the param type depends on Binder type)
<K-ballo> not sure I understand.. as written you'd explicitly specify the template parameter Binder, get_binding_helper_cast<X>()
<jbjnr_> the compiler won't accept that.
<K-ballo> it would, for the snippet as shown
<jbjnr_> error: expected primary-expression before ‘>’ token
<jbjnr_> get_binding_helper_cast<binder_type_2D>()->leading_dim_;
<zao> K-ballo: Would it be possible to disable the deadlock detection somehow, or is it so far into the realm of high counts that we risk wrapping?
<jbjnr_> It's fine if I add a dummy param of type binder_type_2D, or a tag type, but I don't like it
<jbjnr_> zao: can't you just disable HPX_HAVE_MINIMAL_DEADLOCK_DETECTION?
<K-ballo> there's some extra context we are not seeing.. is get_binding_helper_cast a member of a template class? is binder_type_2D a type?
<jbjnr_> yes and yes
<K-ballo> do you need `this->template foo<X>()`?
<K-ballo> no, or you'd have already needed the `this->` in there
<K-ballo> zao: HPX_WITH_SPINLOCK_DEADLOCK_DETECTION
<K-ballo> the error suggests the compiler is interpreting `get_binding_helper_cast` as a variable rather than a template, but something else is missing
<jbjnr_> the binder is templated on another T
<jbjnr_> and the other class is templated on another T too
<K-ballo> how does the real line look like?
<jbjnr_> template
<jbjnr_> grrr.
<jbjnr_> a.localMatrix().get_allocator().get_binding_helper_cast<binder_type_2D>()->leading_dim_;
<jbjnr_> I tried adding template in various places
<K-ballo> do you need
<K-ballo> a.localMatrix().get_allocator().template get_binding_helper_cast<binder_type_2D>()->leading_dim_;
<K-ballo> ?
<jbjnr_> using binder_type_2D = matrix_numa_binder<T>;
<jbjnr_> ah shit. the template keyword does work. but I misread another error
<jbjnr_> I thought it gave a new error, but is actually ok
<jbjnr_> there is a const error I need to fix too. Thanks K-ballo, that sorts me out
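For reference, a self-contained illustration of the fix (the types here are invented stand-ins, not the real allocator/binder code): inside another template the allocator expression has a dependent type, so a member function template can only be called through it with the `template` disambiguator; without it the `<` is parsed as less-than and produces the "expected primary-expression before '>' token" error seen above.

```cpp
#include <memory>

// Hypothetical stand-ins for the real binding-helper types:
struct binding_helper_base
{
    virtual ~binding_helper_base() = default;
};

struct binder_2D : binding_helper_base
{
    int leading_dim_ = 128;
};

template <typename T>
struct allocator
{
    template <typename Binder>
    std::shared_ptr<Binder> get_binding_helper_cast()
    {
        return std::dynamic_pointer_cast<Binder>(binding_helper_);
    }

    std::shared_ptr<binding_helper_base> binding_helper_ =
        std::make_shared<binder_2D>();
};

template <typename T>
int query_leading_dim(allocator<T>& alloc)
{
    // `alloc` has a dependent type here, so the `template` keyword is required
    // to tell the parser that get_binding_helper_cast names a member template.
    return alloc.template get_binding_helper_cast<binder_2D>()->leading_dim_;
}

int main()
{
    allocator<double> a;
    return query_leading_dim(a) == 128 ? 0 : 1;
}
```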
<zao> Ah, will try after fika
hkaiser has joined #ste||ar
eschnett has joined #ste||ar
hkaiser has quit [Quit: bye]
<jbjnr_> Did anyone fix the guided_pool_executor yet? I need to use it so will have to fix it if nobody else did
bita has joined #ste||ar
<K-ballo> I had tried but, not me.. you said you knew what to fix, I did not
<jbjnr_> ok. didn't want to spend time on it if someone else has already fixed it
<K-ballo> I'm eagerly awaiting to understand the underlying causes
<jbjnr_> the cause is that the tuple isn't unwrapped when it arrives at my executor
<jbjnr_> or the other way around. I forget now
<jbjnr_> I'll fix it soon.
<K-ballo> I'm hoping for something "more underlying" than that
<K-ballo> I don't understand the design
<jbjnr_> aha
<jbjnr_> that's above my pay grade
<K-ballo> and I don't see how it could be entangled to dataflow's internals
<jbjnr_> which design do you not understand?
<K-ballo> whichever one is responsible for coupling the executor with dataflow implementation details, if I understood things correctly
<K-ballo> it would suggest the guided executor cannot be implemented non-intrusively
<jbjnr_> the guided executor uses dataflow to hold on to arguments until the futures are ready, then it fetches their contents to query the memory placement and does a 'dynamic' schedule instead of a static one. I don't recall the details now - but there's a quirk in the executor design where the dataflow frame is passed in at some point with tuples, so I had to add a nasty overload to intercept the stuff
<jbjnr_> once I fix the immediate problem, maybe you can see a better fix
<jbjnr_> "more underlying"
<zao> Heh, built without deadlock detection, test never completes :D
<K-ballo> I'm hoping it can be implemented in a way that is independent of dataflow's internals, or failing that I want to understand and document which internals will affect it so we know what we can and cannot change
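A rough sketch of the pattern jbjnr_ describes, written against plain std::future/std::async rather than HPX's dataflow and executors (all names here are invented for illustration, this is not the guided_pool_executor code): hold on to the arguments until the futures are ready, inspect the ready values to derive a placement hint, and only then launch the actual work.

```cpp
#include <future>
#include <iostream>
#include <utility>

// hypothetical hint type: which pool / NUMA domain to run on
struct scheduling_hint { int domain; };

// hypothetical policy: derive a hint from the ready argument value
scheduling_hint placement_for(int const& value)
{
    return scheduling_hint{value % 2};   // toy decision based on the data
}

template <typename F>
auto guided_async(F f, std::future<int> arg)
{
    // the dataflow-like step: defer everything until the argument is ready
    return std::async(std::launch::async,
        [f = std::move(f), arg = std::move(arg)]() mutable {
            int value = arg.get();                        // argument now available
            scheduling_hint hint = placement_for(value);  // query the data
            // a real executor would bind the task to hint.domain here;
            // this sketch just reports the decision
            std::cout << "scheduling on domain " << hint.domain << "\n";
            return f(value);
        });
}

int main()
{
    std::future<int> arg = std::async(std::launch::async, [] { return 41; });
    auto result = guided_async([](int v) { return v + 1; }, std::move(arg));
    std::cout << result.get() << "\n";   // 42
}
```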
<jbjnr_> "skipping 32 instantiation contexts, use -ftemplate-backtrace-limit=0 to disable" - Gotta love C++ and HPX.
<jbjnr_> <sigh>
<K-ballo> zao: you don't say? :P
aserio has joined #ste||ar
akheir has quit [Quit: Konversation terminated!]
akheir has joined #ste||ar
jaafar has joined #ste||ar
eschnett has quit [Read error: Connection reset by peer]
eschnett has joined #ste||ar
<jbjnr_> /me shakes his fist at K-ballo and swears he never wants to look at the guided pool executor code again
<zao> `1/1 Test #225: tests.unit.components.distributed.mpi.migrate_component ...***Timeout 1500.03 sec`
<jbjnr_> zao save yourself some pain and set the timeout to 60s
<jbjnr_> waiting 1500s is too much dedication
<zao> Yeah, I always forget when hand-running.
<zao> I managed to generate (with deadlock detection) something like 5800 runs overnight.
<zao> A lot of fun distinct failures.
<jbjnr_> the great news is that when you find the problem, it'll be heller's fault :)
<zao> Going to be fun to see what I get out of the runs that don't have deadlock detection or if it's all side effects.
eschnett has quit [Read error: Connection reset by peer]
eschnett has joined #ste||ar
eschnett has quit [Read error: Connection reset by peer]
eschnett has joined #ste||ar
jaafar has quit [Ping timeout: 252 seconds]
hkaiser has joined #ste||ar
aserio has quit [Ping timeout: 252 seconds]
<K-ballo> what was the SC demo that was doing object classification at the LSU booth?
aserio has joined #ste||ar
aserio1 has joined #ste||ar
aserio has quit [Ping timeout: 252 seconds]
aserio1 is now known as aserio
<parsa_> K-ballo: it was TensorFlow running on an NVIDIA Jetson
parsa_ is now known as parsa
bibek has joined #ste||ar
<K-ballo> parsa: is there a link? more info?
<parsa> K-ballo: i don't think there's a link, but i know the guy. i can put you in contact with him if you want
<K-ballo> no, that's ok, that'd require human interaction
<K-ballo> thanks
<parsa> K-ballo: found the paper: https://ieeexplore.ieee.org/abstract/document/8416390
jaafar has joined #ste||ar
david_pfander has quit [Ping timeout: 245 seconds]
<zao> Still failing on all the fun things when running without deadlock detection \o/
aserio has quit [Ping timeout: 252 seconds]
eschnett has quit [Quit: eschnett]
aserio has joined #ste||ar
eschnett_ has joined #ste||ar
eschnett_ has quit [Read error: Connection reset by peer]
eschnett has joined #ste||ar
aserio has quit [Ping timeout: 252 seconds]
K-ballo has quit [Quit: K-ballo]
K-ballo has joined #ste||ar
aserio has joined #ste||ar
eschnett has quit [Quit: eschnett]
eschnett has joined #ste||ar
eschnett has quit [Client Quit]
eschnett has joined #ste||ar
akheir has quit [Quit: Konversation terminated!]
akheir has joined #ste||ar
hkaiser has quit [Quit: bye]
eschnett has quit [Quit: eschnett]
hkaiser has joined #ste||ar
<aserio> hkaiser: quick question
<hkaiser> aserio: sure
<aserio> hkaiser: why did you pass a function object to dataflow instead of a function pointer
<aserio> Dataflow takes both right?
<hkaiser> yes
<hkaiser> the function would have to be a template
<hkaiser> those are a pain if you need their address
<hkaiser> the function object leaves the argument deduction to the compiler
<aserio> I see
<K-ballo> function pointers are function objects too
<aserio> K-ballo: but I assume that the compiler needs more info to produce the object from the function pointer?
Vir has joined #ste||ar
aserio has quit [Quit: aserio]
bita has quit [Read error: Connection reset by peer]
<hkaiser> K-ballo: aserio was referring to a PFO vs. template function
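To illustrate the point (a generic example, not the actual code aserio and hkaiser were discussing): with a function template you must name an instantiation before you can take its address, whereas a function object with a templated call operator leaves the argument deduction to the compiler at the point of the call.

```cpp
#include <iostream>
#include <utility>

// a function template: taking its address requires spelling out the arguments
template <typename T>
T twice(T value) { return value + value; }

// a function object with a templated call operator
struct twice_t
{
    template <typename T>
    T operator()(T value) const { return value + value; }
};

// stand-in for something like dataflow: just invokes the callable it is given
template <typename F, typename T>
auto invoke_later(F&& f, T&& value)
{
    return std::forward<F>(f)(std::forward<T>(value));
}

int main()
{
    // with the function template you must commit to an instantiation yourself:
    std::cout << invoke_later(&twice<int>, 21) << "\n";   // 42

    // with the function object, deduction happens inside invoke_later:
    std::cout << invoke_later(twice_t{}, 21) << "\n";     // 42
    std::cout << invoke_later(twice_t{}, 1.5) << "\n";    // 3
}
```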
bibek has quit [Quit: Konversation terminated!]