hkaiser changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/
K-ballo has quit [Quit: K-ballo]
K-ballo has joined #ste||ar
bita has joined #ste||ar
bita_ has quit [Ping timeout: 252 seconds]
<zao> heller_: I ran the test repeatedly for six hours, no crash. I then put some intermittent stress-test load on 14-16 cores (of the Ryzen) and the `tests.unit.components.distributed.mpi.migrate_component` test failed in a matter of minutes.
<zao> Hrm, deadlock detection?
<zao> Unloaded test runtime is 6-7s, loaded test runs were around 9-10s, this one took 13.5s.
<K-ballo> I've seen that possible deadlock thing repeatedly, something coming from our spinlock IIRC, it spun more than a few billion times
hkaiser has quit [Quit: bye]
jaafar has quit [Ping timeout: 250 seconds]
<zao> Not quite the same look as the last crash, but that one might've been provoked by running CP2K builds.
<zao> K-ballo: Sounds like a lovely trap.
<zao> I guess that on clusters you don't have much contention for compute on the nodes, unless you've got other threads in your process or some background maintenance happens.
<zao> Ooh, this failure seems more interesting and seems similar to the one before: https://gist.github.com/zao/53f006ddd21d6377da7599335cec38db
<zao> `{what}: assertion 'it != migrating_objects_.end() && get<0>(it->second)' failed: HPX(assertion_failure)`
<zao> This is all on that `reduce_iterators` commit, btw.
<zao> `{what}: assertion 'naming::detail::is_migratable(gid_)' failed: HPX(assertion_failure)`
<zao> `test 't1.get_data() == 42' failed in function 'test_migrate_busy_component2(hpx::naming::id_type, hpx::naming::id_type)::<lambda()>': '0' != '42'`
<zao> Seems reasonably easy to provoke it into being sad if you `stress -c 16` this machine.
<zao> Crashes every other minute or so.
<zao> Now sleep :D
<zao> Ooh, hpx/runtime/components/server/migration_support.hpp:79: void hpx::components::migration_support<BaseComponent, Mutex>::pin() [with BaseComponent = hpx::components::component_base<test_server>; Mutex = hpx::lcos::local::spinlock]: Assertion `pin_count_ != ~0x0u' failed.
<K-ballo> that one I do remember
eschnett has joined #ste||ar
<Yorlik> playing with atomics for concurrency control. Contention really takes its toll :D: Thread 15: 35414 / 1000000 success = 3.5414 %
<Yorlik> Threads: 16, Count: 416514, Runs Max: 16000000, runs real: 416514
<Yorlik> success means the thread could acquire the local lock on a struct using the atomic
<Yorlik> After this simple stupid test my respect for people writing proper, performant concurrent code and structures went up an order of magnitude.
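For context, a minimal sketch (hypothetical code, not Yorlik's actual test) of the kind of experiment described above: each thread repeatedly tries to take a per-struct lock built from a single atomic and counts how often the try-lock succeeds, which collapses quickly under contention.

```cpp
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

// A struct whose "local lock" is a single atomic bool.
struct guarded_data
{
    std::atomic<bool> locked{false};
    long value = 0;
};

int main()
{
    constexpr int num_threads = 16;
    constexpr long attempts_per_thread = 1000000;

    guarded_data data;
    std::vector<long> successes(num_threads, 0);
    std::vector<std::thread> threads;

    for (int t = 0; t != num_threads; ++t)
    {
        threads.emplace_back([&, t] {
            for (long i = 0; i != attempts_per_thread; ++i)
            {
                // try-lock: succeeds only if no other thread holds the lock
                bool expected = false;
                if (data.locked.compare_exchange_strong(
                        expected, true, std::memory_order_acquire))
                {
                    ++data.value;    // critical section
                    data.locked.store(false, std::memory_order_release);
                    ++successes[t];
                }
            }
        });
    }
    for (auto& th : threads)
        th.join();

    for (int t = 0; t != num_threads; ++t)
        std::printf("Thread %d: %ld / %ld success = %.4f %%\n", t, successes[t],
            attempts_per_thread, 100.0 * successes[t] / attempts_per_thread);
}
```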
hkaiser has joined #ste||ar
eschnett has quit [Quit: eschnett]
eschnett has joined #ste||ar
eschnett has quit [Quit: eschnett]
K-ballo has quit [Quit: K-ballo]
bita has quit [Ping timeout: 252 seconds]
hkaiser has quit [Quit: bye]
parsa_ has quit [Remote host closed the connection]
parsa has joined #ste||ar
parsa is now known as parsa_
jaafar has joined #ste||ar
david_pfander has joined #ste||ar
jaafar has quit [Ping timeout: 252 seconds]
K-ballo has joined #ste||ar
<jbjnr_> K-ballo: quick question?
<K-ballo> here
<jbjnr_> what's the canonical way of making this do the right thing
<jbjnr_> const Binder & /*unused*
<jbjnr_> grrr..
<jbjnr_> sorry. wrong paste
<jbjnr_> template <typename Binder>
<jbjnr_> std::shared_ptr<Binder> get_binding_helper_cast() {
<jbjnr_> return std::dynamic_pointer_cast<Binder>(binding_helper_);
<jbjnr_> }
<jbjnr_> I need to supply an empty param to make it pick the right type for Binder. But is there an easier way?
<jbjnr_> (where the param type depends on Binder type)
<K-ballo> not sure I understand.. as written you'd explicitly specify the template parameter Binder, get_binding_helper_cast<X>()
<jbjnr_> the compiler won't accept that.
<K-ballo> it would, for the snippet as shown
<jbjnr_> error: expected primary-expression before ‘>’ token
<jbjnr_> get_binding_helper_cast<binder_type_2D>()->leading_dim_;
<zao> K-ballo: Would it be possible to disable the deadlock detection somehow, or is it so far into the realm of high counts that we risk wrapping?
<jbjnr_> It's fine if I add a dummy param of type binder_type_2D, or a tag type, but I don't like it
<jbjnr_> zao: can't you just disable HPX_HAVE_MINIMAL_DEADLOCK_DETECTION?
<K-ballo> there's some extra context we are not seeing.. is get_binding_helper_cast a member of a template class? is binder_type_2D a type?
<jbjnr_> yes and yes
<K-ballo> do you need `this->template foo<X>()`?
<K-ballo> no, or you'd have already needed the `this->` in there
<K-ballo> zao: HPX_WITH_SPINLOCK_DEADLOCK_DETECTION
<K-ballo> the error suggests the compiler is interpreting `get_binding_helper_cast` as a variable rather than a template, but something else is missing
<jbjnr_> the binder is templated on another T
<jbjnr_> and the other class is templated on another T too
<K-ballo> how does the real line look like?
<jbjnr_> template
<jbjnr_> grrr.
<jbjnr_> a.localMatrix().get_allocator().get_binding_helper_cast<binder_type_2D>()->leading_dim_;
<jbjnr_> I tried adding template in various places
<K-ballo> do you need
<K-ballo> a.localMatrix().get_allocator().template get_binding_helper_cast<binder_type_2D>()->leading_dim_;
<K-ballo> ?
<jbjnr_> using binder_type_2D = matrix_numa_binder<T>;
<jbjnr_> ah shit. the template keyword does work. but I misread another error
<jbjnr_> I thought it gave a new error, but is actually ok
<jbjnr_> there is a const error I need to fix too. Thanks K-ballo, that sorts me out
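For reference, a self-contained illustration of the fix (the types here are invented stand-ins, not the real allocator/binder code): inside another template the allocator expression has a dependent type, so a member function template can only be called through it with the `template` disambiguator; without it the `<` is parsed as less-than and produces the "expected primary-expression before '>' token" error seen above.

```cpp
#include <memory>

// Hypothetical stand-ins for the real binding-helper types:
struct binding_helper_base
{
    virtual ~binding_helper_base() = default;
};

struct binder_2D : binding_helper_base
{
    int leading_dim_ = 128;
};

template <typename T>
struct allocator
{
    template <typename Binder>
    std::shared_ptr<Binder> get_binding_helper_cast()
    {
        return std::dynamic_pointer_cast<Binder>(binding_helper_);
    }

    std::shared_ptr<binding_helper_base> binding_helper_ =
        std::make_shared<binder_2D>();
};

template <typename T>
int query_leading_dim(allocator<T>& alloc)
{
    // `alloc` has a dependent type here, so the `template` keyword is required
    // to tell the parser that get_binding_helper_cast names a member template.
    return alloc.template get_binding_helper_cast<binder_2D>()->leading_dim_;
}

int main()
{
    allocator<double> a;
    return query_leading_dim(a) == 128 ? 0 : 1;
}
```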
<zao> Ah, will try after fika
hkaiser has joined #ste||ar
eschnett has joined #ste||ar
hkaiser has quit [Quit: bye]
<jbjnr_> Did anyone fix the guided_pool_executor yet? I need to use it so will have to fix it if nobody else did
bita has joined #ste||ar
<K-ballo> I had tried but, not me.. you said you knew what to fix, I did not
<jbjnr_> ok. didn't want to spend time on it if someone else has already fixed it
<K-ballo> I'm eagerly awaiting to understand the underlying causes
<jbjnr_> the cause is that the tuple isn't unwrapped when it arrives at my executor
<jbjnr_> or the other way around. I forget now
<jbjnr_> I'll fix it soon.
<K-ballo> I'm hoping for something "more underlying" than that
<K-ballo> I don't understand the design
<jbjnr_> aha
<jbjnr_> that's above my pay grade
<K-ballo> and I don't see how it could be entangled to dataflow's internals
<jbjnr_> which design do you not understand?
<K-ballo> whichever one is responsible for coupling the executor with dataflow implementation details, if I understood things correctly
<K-ballo> it would suggest the guided executor cannot be implemented non-intrusively
<jbjnr_> the guided executor uses dataflow to hold on to arguments until the futures are ready, then it fetches their contents to query the memory placement and does a 'dynamic' schedule instead of a static one. I don't recall the details now - but there's a quirk in the executor design where the dataflow frame is passed in at some point with tuples, so I had to add a nasty overload to intercept the stuff
<jbjnr_> once I fix the immediate problem, maybe you can see a better fix
<jbjnr_> "more underlying"
<zao> Heh, built without deadlock detection, test never completes :D
<K-ballo> I'm hoping it can be implemented in a way that is independent of dataflow's internals, or failing that I want to understand and document which internals will affect it so we know what we can and cannot change
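A rough sketch of the pattern jbjnr_ describes, written against plain std::future/std::async rather than HPX's dataflow and executors (all names here are invented for illustration, this is not the guided_pool_executor code): hold on to the arguments until the futures are ready, inspect the ready values to derive a placement hint, and only then launch the actual work.

```cpp
#include <future>
#include <iostream>
#include <utility>

// hypothetical hint type: which pool / NUMA domain to run on
struct scheduling_hint { int domain; };

// hypothetical policy: derive a hint from the ready argument value
scheduling_hint placement_for(int const& value)
{
    return scheduling_hint{value % 2};   // toy decision based on the data
}

template <typename F>
auto guided_async(F f, std::future<int> arg)
{
    // the dataflow-like step: defer everything until the argument is ready
    return std::async(std::launch::async,
        [f = std::move(f), arg = std::move(arg)]() mutable {
            int value = arg.get();                        // argument now available
            scheduling_hint hint = placement_for(value);  // query the data
            // a real executor would bind the task to hint.domain here;
            // this sketch just reports the decision
            std::cout << "scheduling on domain " << hint.domain << "\n";
            return f(value);
        });
}

int main()
{
    std::future<int> arg = std::async(std::launch::async, [] { return 41; });
    auto result = guided_async([](int v) { return v + 1; }, std::move(arg));
    std::cout << result.get() << "\n";   // 42
}
```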
<jbjnr_> "skipping 32 instantiation contexts, use -ftemplate-backtrace-limit=0 to disable" - Gotta love C++ and HPX.
<jbjnr_> <sigh>
<K-ballo> zao: you don't say? :P
aserio has joined #ste||ar
akheir has quit [Quit: Konversation terminated!]
akheir has joined #ste||ar
jaafar has joined #ste||ar
eschnett has quit [Read error: Connection reset by peer]
eschnett has joined #ste||ar
<jbjnr_> /me shakes his fist at K-ballo and swears he never wants to look at the guided pool executor code again
<zao> `1/1 Test #225: tests.unit.components.distributed.mpi.migrate_component ...***Timeout 1500.03 sec`
<jbjnr_> zao save yourself some pain and set the timeout to 60s
<jbjnr_> waiting 1500s is too much dedication
<zao> Yeah, I always forget when hand-running.
<zao> I managed to generate (with deadlock detection) something like 5800 runs overnight.
<zao> A lot of fun distinct failures.
<jbjnr_> the great news is that when you find the problem, it'll be heller's fault :)
<zao> Going to be fun to see what I get out of the runs that don't have deadlock detection or if it's all side effects.
eschnett has quit [Read error: Connection reset by peer]
eschnett has joined #ste||ar
eschnett has quit [Read error: Connection reset by peer]
eschnett has joined #ste||ar
jaafar has quit [Ping timeout: 252 seconds]
hkaiser has joined #ste||ar
aserio has quit [Ping timeout: 252 seconds]
<K-ballo> what was the SC demo that was doing object classification at the LSU booth?
aserio has joined #ste||ar
aserio1 has joined #ste||ar
aserio has quit [Ping timeout: 252 seconds]
aserio1 is now known as aserio
<parsa_> K-ballo: it was TensorFlow running on an NVIDIA Jetson
parsa_ is now known as parsa
bibek has joined #ste||ar
<K-ballo> parsa: is there a link? more info?
<parsa> K-ballo: i don't think there's a link, but i know the guy. i can put you in contact with him if you want
<K-ballo> no, that's ok, that'd require human interaction
<K-ballo> thanks
<parsa> K-ballo: found the paper: https://ieeexplore.ieee.org/abstract/document/8416390
jaafar has joined #ste||ar
david_pfander has quit [Ping timeout: 245 seconds]
<zao> Still failing on all the fun things when running without deadlock detection \o/
aserio has quit [Ping timeout: 252 seconds]
eschnett has quit [Quit: eschnett]
aserio has joined #ste||ar
eschnett_ has joined #ste||ar
eschnett_ has quit [Read error: Connection reset by peer]
eschnett has joined #ste||ar
aserio has quit [Ping timeout: 252 seconds]
K-ballo has quit [Quit: K-ballo]
K-ballo has joined #ste||ar
aserio has joined #ste||ar
eschnett has quit [Quit: eschnett]
eschnett has joined #ste||ar
eschnett has quit [Client Quit]
eschnett has joined #ste||ar
akheir has quit [Quit: Konversation terminated!]
akheir has joined #ste||ar
hkaiser has quit [Quit: bye]
eschnett has quit [Quit: eschnett]
hkaiser has joined #ste||ar
<aserio> hkaiser: quick question
<hkaiser> aserio: sure
<aserio> hkaiser: why did you pass a function object to dataflow instead of a function pointer
<aserio> Dataflow takes both right?
<hkaiser> yes
<hkaiser> the function would have to be a template
<hkaiser> those are a pain if you need their address
<hkaiser> the function object leaves the argument deduction to the compiler
<aserio> I see
<K-ballo> function pointers are function objects too
<aserio> K-ballo: but I assume that the compiler needs more info to produce the object from the function pointer?
Vir has joined #ste||ar
aserio has quit [Quit: aserio]
bita has quit [Read error: Connection reset by peer]
<hkaiser> K-ballo: aserio was referring to a PFO vs. template function
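To illustrate the point (a generic example, not the actual code aserio and hkaiser were discussing): with a function template you must name an instantiation before you can take its address, whereas a function object with a templated call operator leaves the argument deduction to the compiler at the point of the call.

```cpp
#include <iostream>
#include <utility>

// a function template: taking its address requires spelling out the arguments
template <typename T>
T twice(T value) { return value + value; }

// a function object with a templated call operator
struct twice_t
{
    template <typename T>
    T operator()(T value) const { return value + value; }
};

// stand-in for something like dataflow: just invokes the callable it is given
template <typename F, typename T>
auto invoke_later(F&& f, T&& value)
{
    return std::forward<F>(f)(std::forward<T>(value));
}

int main()
{
    // with the function template you must commit to an instantiation yourself:
    std::cout << invoke_later(&twice<int>, 21) << "\n";   // 42

    // with the function object, deduction happens inside invoke_later:
    std::cout << invoke_later(twice_t{}, 21) << "\n";     // 42
    std::cout << invoke_later(twice_t{}, 1.5) << "\n";    // 3
}
```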
bibek has quit [Quit: Konversation terminated!]