hkaiser changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/ | GSoC: https://github.com/STEllAR-GROUP/hpx/wiki/Google-Summer-of-Code-%28GSoC%29-2020
nikunj has quit [Remote host closed the connection]
nikunj has joined #ste||ar
hkaiser has quit [Quit: bye]
kale_ has joined #ste||ar
kale_ has quit [Quit: Konversation terminated!]
parsa has quit [Remote host closed the connection]
parsa has joined #ste||ar
mcopik has joined #ste||ar
mcopik has quit [Client Quit]
<Yorlik> Seems I just found an uncaught exception in HPX which caused my lockups: Scroll to end of https://gist.github.com/McKillroy/5f92e41f5c851d28408ca447e7dc8f09
nikunj has quit [Read error: Connection reset by peer]
nikunj has joined #ste||ar
nikunj has quit [Read error: Connection reset by peer]
<Yorlik> Seems it's not caught, causing my parallel loop to hang and never exit.
nikunj has joined #ste||ar
<Yorlik> ms[m] ^^
gonidelis has joined #ste||ar
gonidelis has quit [Remote host closed the connection]
<heller1> Yorlik: so you eventually throw in one of your element functions?
<heller1> Did you try catching the exception yourself and see if that lockup persists?
hkaiser has joined #ste||ar
<ms[m]> Yorlik: what heller said + the question is how do you launch your parallel for loop? with par(task)? do you ever get() the value from the future?
<Yorlik> Alright - back. Had to take a nap for beauty and sanity :)
<Yorlik> I can add the callsite to the report ofc - I'll do that right away.
<Yorlik> And yes - there are occasions where I throw (standard exceptions like std::runtime_error) and catch the exception through the future. All futures are put into a list and checked for exceptions.
<Yorlik> If I didn't produce a bug, there shouldn't be a single unchecked future. It has worked in the past.
<Yorlik> OFC I can't catch anything at the loop callsite, since nothing ever arrives there.
<Yorlik> heller1, ms[m], hkaiser ^^
<hkaiser> Yorlik: we'll need a small reproducing case
<hkaiser> Yorlik: our parallel algorithms all have tests for exception handling, so you must be doing something differently
<Yorlik> What could I have done wrong? Forgetting to check a future and never calling future.get()? Or not catching when calling .get()?
<hkaiser> I'm not saying you did something wrong
<hkaiser> I said you do something in a way we do not test
<Yorlik> Probably. What I do is std::move the futures between lists for delayed checking at the end of a frame.
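A minimal sketch of the pattern described above - throwing from the element function, moving the returned futures into a per-frame list, and checking them later. Names like frame_futures, update_entity, and run_frame are illustrative only, and the headers/namespaces follow HPX 1.4-era conventions (newer releases expose hpx::for_loop and hpx::execution::par directly):

```cpp
#include <hpx/include/parallel_for_loop.hpp>
#include <hpx/include/future.hpp>

#include <cstddef>
#include <iostream>
#include <stdexcept>
#include <utility>
#include <vector>

std::vector<hpx::future<void>> frame_futures;   // illustrative frame-end list

void update_entity(std::size_t i)
{
    if (i == 42)
        throw std::runtime_error("update failed");   // thrown in the loop body
}

void run_frame(std::size_t count)
{
    using namespace hpx::parallel;

    // par(task): the call returns immediately with a future
    auto f = for_loop(execution::par(execution::task),
        std::size_t(0), count, [](std::size_t i) { update_entity(i); });

    // defer the check: move the future into the per-frame list
    frame_futures.push_back(std::move(f));
}

void end_of_frame()
{
    for (auto& f : frame_futures)
    {
        try
        {
            f.get();   // rethrows any exception from the loop body
        }
        catch (std::exception const& e)
        {
            std::cerr << "loop failed: " << e.what() << '\n';
        }
    }
    frame_futures.clear();
}
```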
<hkaiser> Yorlik: why did you close the ticket?
<Yorlik> Did I? Then it was an accident
<Yorlik> Sorry for that.
<hkaiser> ok, I'll reopen it
<Yorlik> Probably when trying to get that code sample in
<hkaiser> np
<hkaiser> now please tell me, where is that exception thrown? in update_entity<>?
<Yorlik> Must be
<Yorlik> But probably not directly in there
<Yorlik> Could also be inside the Lua call stack
<hkaiser> so it's thrown in the loop body?
<hkaiser> what exception is it?
<Yorlik> I don't know - the endpoint on the HPX side has an error list and the debugger crashed each time I tried to open and inspect it
<Yorlik> I'll retry and see what's actually there
<Yorlik> But I don't have many sites where I throw
<hkaiser> Yorlik: I looked now - we don't test exception handling for for_loop :/
<Yorlik> Sorry for that ;)
<hkaiser> that could be causing your problem
<Yorlik> It wasn't my intention .... :o
<Yorlik> So - do you think there's an obvious fix?
<hkaiser> let's see
<Yorlik> Debugger crashing again - it doesn't want me to see the errors
<Yorlik> errors length was 1
<Yorlik> hkaiser: Updated - I commented with the e.what() output
<Yorlik> I gave up on asking the debugger and just rethrew and printed. Take that debugger !!! :)
<Yorlik> If I read this correctly it didn't like me holding a spinlock while creating a new lua state
<Yorlik> The first thing I do in "agns::luaengine::lua_engine::init" is to acquire a lock_guard with a spinlock as lockable.
<Yorlik> And it seems I yielded while holding that lock, which is apparently not allowed.
<hkaiser> Yorlik: now it makes sense why it fails in debug only
<Yorlik> You solved the riddle?
<hkaiser> we don't check for held locks in release
<Yorlik> IC. I guess it's a protection mechanism to avoid deadlocks?
<hkaiser> yes
<hkaiser> you call yield while holding a lock
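A minimal sketch of the problem being diagnosed here, with hypothetical names standing in for agns::luaengine::lua_engine::init (header paths vary between HPX versions): a debug build of HPX registers held locks, so suspending the HPX thread while the spinlock is held trips the check.

```cpp
#include <hpx/hpx.hpp>
#include <hpx/lcos/local/spinlock.hpp>   // hpx::lcos::local::spinlock (path differs in newer HPX)

#include <mutex>

hpx::lcos::local::spinlock init_mtx;   // illustrative name

void init()
{
    std::lock_guard<hpx::lcos::local::spinlock> l(init_mtx);

    // anything here that suspends the HPX thread (yield, .get() on a
    // future, ...) happens while `l` is still held -> the debug-only
    // lock check raises the error seen in the gist
    hpx::this_thread::yield();
}
```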
<Yorlik> And init() is a pretty large function with a ton of possible output
<hkaiser> can you unlock the lock while yielding?
<hkaiser> you could use util::scoped_unlock<> to handle that
<Yorlik> The main reason I protected it was that, when creating a lua_engine, I needed readable, ungarbled output from the Lua side, because part of the initialization requires me to run Lua scripts
<hkaiser> no, it's called unlock_guard
<Yorlik> I could try to just remove it entirely
<hkaiser> hpx::cout should ungarble output
<Yorlik> Not sure if it was really required
<Yorlik> The output comes from Lua print
<hkaiser> you can have the lock, just unlock it while yielding
<hkaiser> (if possible)
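A sketch of the unlock_guard approach suggested above: keep the lock for the parts of init() that need it, but release it across the suspending call. hpx::util::unlock_guard unlocks in its ctor and re-locks in its dtor; the header location and namespace differ between HPX versions, and the names below are illustrative.

```cpp
#include <hpx/hpx.hpp>
#include <hpx/lcos/local/spinlock.hpp>
#include <hpx/util/unlock_guard.hpp>   // hpx::unlock_guard in newer releases

#include <mutex>

hpx::lcos::local::spinlock init_mtx;   // illustrative name

void init()
{
    std::unique_lock<hpx::lcos::local::spinlock> l(init_mtx);

    // ... work that genuinely needs the lock ...

    {
        hpx::util::unlock_guard<decltype(l)> ul(l);   // releases the lock
        hpx::this_thread::yield();                    // safe to suspend now
    }   // lock re-acquired when `ul` goes out of scope

    // ... more work under the lock ...
}
```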
<Yorlik> I think I need to find a way to get synchronized output from Lua
<hkaiser> if you have to hold on to the lock while yielding, use ignore_lock to tell HPX not to bother checking
<Yorlik> The problem is, even with synchronized output, the output from the different Lua engines being created would still get interleaved
<hkaiser> Yorlik: whatever, that's not the problem we're trying to solve here
<Yorlik> Could i use ignore lock locally for a single case?
<hkaiser> yes
<Yorlik> So it's hpx::ignore_lock(true/false) ??
<hkaiser> no
<hkaiser> it's an object that disables checking in its ctor and re-enables it in its dtor
<Yorlik> Oh that's nice
<Yorlik> So I just create one and it aute reenables when out of scope?
<hkaiser> simply create a scope befre calling yield with this variable inside: util::ignore_all_while_checking ignore_lock;
<Yorlik> Nice! Thanks !
<hkaiser> { util::ignore_all_while_checking ignore_lock; yield_while(...); }
<Yorlik> Makes sense - just like a lock_guard
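hkaiser's one-liner, expanded into a self-contained sketch. The scope works like a lock_guard: checking is disabled in the ctor of ignore_lock and re-enabled in its dtor. engine_ready and the surrounding function are made up for illustration; the exact header providing ignore_all_while_checking varies across HPX versions.

```cpp
#include <hpx/hpx.hpp>

#include <atomic>

std::atomic<bool> engine_ready{false};   // illustrative flag

void wait_for_engine()   // called while the caller still holds its lock
{
    {
        hpx::util::ignore_all_while_checking ignore_lock;
        hpx::util::yield_while([]() { return !engine_ready.load(); });
    }   // lock checking re-enabled here
}
```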
<hkaiser> I'll look into for_loop exception handling
<Yorlik> Alright
<Yorlik> Seems the last few days are finally steering towards a result :)
<Yorlik> It's kinda nice HPX has so much header-only code; it was easy for me to poke into it and generate all the output I needed without recompiling everything.
<hkaiser> Yorlik: btw, exception handling for for_loop seems to be ok - would be interesting to see why it failed for you
<Yorlik> So you'd have expected me to get an exception at the loop callsite?
<Yorlik> hkaiser ^^
<hkaiser> yes
<hkaiser> IFF you call .get() on the returned future - that will rethrow the exception
<hkaiser> but I'll add the test as it is missing
<Yorlik> I have wrapped that in try catch ofc
<hkaiser> Yorlik: so let me ask again
<hkaiser> is the exception thrown in the loop body?
<Yorlik> Yes
<Yorlik> On occasion I have to create a new Lua_engine
<hkaiser> k
<Yorlik> So it gets initialized and there is that lock
<Yorlik> Sometimes a task spawns new lua engines, when I'm going out of lua and back into lua
<Yorlik> So it mostly happens either in update() or in the Lua scripts called by update
<hkaiser> Yorlik: do you launch hpx tasks using apply()?
<hkaiser> or always using async?
<Yorlik> I don't think I have any apply left - I'd have to scan my code, but I doubt it
<hkaiser> please have a look
<Yorlik> After you explained the exception mechanics I decided to check all responses
<Yorlik> OK
<Yorlik> There are three left, but not in the simulator - they live in the code launching commands on the server controller - only used when issuing administrative commands coming from the admin client.
<Yorlik> The simulation is completely apply free
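Background for the apply()/async() question, as a hedged sketch: hpx::apply is fire-and-forget and returns no future, so an exception thrown by the task cannot be retrieved via .get(), whereas hpx::async returns a future that transports the exception to the caller. The function names are illustrative, and the calls are assumed to run on an HPX thread (e.g. from hpx_main).

```cpp
#include <hpx/include/apply.hpp>
#include <hpx/include/async.hpp>
#include <hpx/include/future.hpp>

#include <stdexcept>

void work() { throw std::runtime_error("boom"); }

void launch_examples()
{
    hpx::apply(work);                        // fire-and-forget: no future to check

    hpx::future<void> f = hpx::async(work);  // exception travels with the future
    try
    {
        f.get();                             // rethrows here
    }
    catch (std::exception const&) { /* handle it */ }
}
```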
<hkaiser> Yorlik: here is the test I promised: https://github.com/STEllAR-GROUP/hpx/pull/4590
<hkaiser> well, we need to investigate what's happening, then
<Yorlik> Will it throw at the loop callsite if something goes wrong?
<Yorlik> I'm currently rebuilding boost and hpx debug versions
<Yorlik> Should I build from your PR for testing?
<hkaiser> Yorlik: no, the PR just adds a test, no changes
<Yorlik> OK
<Yorlik> I'll continue building latest stable then. I made sure it's new - deleted all the old stuff
<hkaiser> Yorlik: if you use par, the loop will throw, if you use par(task), then the return future will rethrow when .get() is called
<Yorlik> Makes sense
<Yorlik> I assume the loop will create tasks internally in any case, right?
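A hedged sketch of the two propagation paths just described, with made-up loop bounds and body (HPX 1.4-era namespaces; newer releases spell the policy hpx::execution::par):

```cpp
#include <hpx/include/parallel_for_loop.hpp>
#include <hpx/include/future.hpp>

#include <cstddef>
#include <iostream>
#include <stdexcept>

void body(std::size_t i)
{
    if (i == 7)
        throw std::runtime_error("boom");
}

void synchronous_case(std::size_t n)
{
    using namespace hpx::parallel;
    try
    {
        // par: the exception (re)surfaces at the loop call site
        for_loop(execution::par, std::size_t(0), n, body);
    }
    catch (std::exception const& e)
    {
        std::cerr << "caught at call site: " << e.what() << '\n';
    }
}

void task_case(std::size_t n)
{
    using namespace hpx::parallel;

    // par(task): the call site does not throw; the exception travels
    // with the returned future
    hpx::future<void> f =
        for_loop(execution::par(execution::task), std::size_t(0), n, body);

    try
    {
        f.get();   // rethrows here
    }
    catch (std::exception const& e)
    {
        std::cerr << "caught at get(): " << e.what() << '\n';
    }
}
```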
<Yorlik> BTW: Is there a list of headers in hpx I should use when using program_options or fmt through HPX to reuse dependencies?
* Yorlik is browsing PC Builder sites with Threadripper machines while boost is building ....
<hkaiser> Yorlik: go through the hpx/libs folder for all modules
<Yorlik> ?
<hkaiser> program_options is in hpx/libs/program_options, format is in hpx/libs/format ;-)
<Yorlik> Ahh - I see!
<hkaiser> but your app gets those through cmake already
<Yorlik> I'm building with external project - I'd have to point to the cmake config files
<Yorlik> Just using the master HPX config will give me all the dependency targets?
<hkaiser> yes
<hkaiser> use the target HPX::hpx
<hkaiser> we don't officially expose the individual modules at this point
<hkaiser> might in the future, though
<Yorlik> Nice. So I can just start using the headers
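A minimal sketch of consuming HPX from an external project as discussed above; my_app is a placeholder, and the only assumption is that HPX's package config is on the CMake search path. Linking the HPX::hpx target brings in the bundled modules (program_options, format, ...) transitively:

```cmake
cmake_minimum_required(VERSION 3.13)
project(my_app CXX)

# point CMAKE_PREFIX_PATH or HPX_DIR at the HPX install/build tree
find_package(HPX REQUIRED)

add_executable(my_app main.cpp)
target_link_libraries(my_app PRIVATE HPX::hpx)
```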
nikunj has quit [Ping timeout: 260 seconds]
nikunj has joined #ste||ar
Abhishek09 has joined #ste||ar
<Abhishek09> hkaiser: When will the GSoC projects be announced?
<Abhishek09> diehlpk_work
Abhishek09 has quit [Remote host closed the connection]
<heller1> what does the timeline say?
<heller1> tomorrow
bita_ has joined #ste||ar
diehlpk_work has quit [Remote host closed the connection]
Yorlik has quit [Read error: Connection reset by peer]
Yorlik has joined #ste||ar
bita_ has quit [Ping timeout: 240 seconds]