hkaiser changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/ | GSoC: https://github.com/STEllAR-GROUP/hpx/wiki/Google-Summer-of-Code-%28GSoC%29-2020
nikunj has quit [Remote host closed the connection]
nikunj has joined #ste||ar
hkaiser has quit [Quit: bye]
kale_ has joined #ste||ar
kale_ has quit [Quit: Konversation terminated!]
parsa has quit [Remote host closed the connection]
parsa has joined #ste||ar
mcopik has joined #ste||ar
mcopik has quit [Client Quit]
<Yorlik> Seems I just found an uncaught exception in HPX which caused my lockups: Scroll to end of https://gist.github.com/McKillroy/5f92e41f5c851d28408ca447e7dc8f09
nikunj has quit [Read error: Connection reset by peer]
nikunj has joined #ste||ar
nikunj has quit [Read error: Connection reset by peer]
<Yorlik> Seems it's not caught, causing my parallel loop to hang and never exit.
nikunj has joined #ste||ar
<Yorlik> ms[m] ^^
gonidelis has joined #ste||ar
gonidelis has quit [Remote host closed the connection]
<heller1> Yorlik: so you eventually throw in one of your element functions?
<heller1> Did you try catching the exception yourself and see if that lockup persists?
hkaiser has joined #ste||ar
<ms[m]> Yorlik: what heller said + the question is how do you launch your parallel for loop? with par(task)? do you ever get() the value from the future?
<Yorlik> Alright - back. Had to take a nap for beauty and sanity :)
<Yorlik> I can add the callsite to the report ofc - I'll do that right away.
<Yorlik> And yes - there are occasions where I throw (standard exceptions like std::runtime_error) and catch the exception through the future. All futures are put into a list and checked for exceptions.
<Yorlik> If I didn't produce a bug, there shouldn't be a single unchecked future. It has worked in the past.
<Yorlik> OFC I can't catch anything at the loop callsite, since nothing ever arrives there.
<Yorlik> heller1, ms[m], hkaiser ^^
<hkaiser> Yorlik: we'll need a small reproducing case
<hkaiser> Yorlik: our parallel algorithms all have tests for exception handling, so you must be doing something differently
<Yorlik> What could I have done wrong? Forgetting to check a future and never calling future.get()? Or not catching when calling .get()?
<hkaiser> I'm not saying you did something wrong
<hkaiser> I said you do something in a way we do not test
<Yorlik> Probably. What I do is std::move the futures between lists for delayed checking at the end of a frame.
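A minimal sketch of the pattern described above - throwing from the element function, moving the returned futures into a per-frame list, and checking them later. Names like frame_futures, update_entity, and run_frame are illustrative only, and the headers/namespaces follow HPX 1.4-era conventions (newer releases expose hpx::for_loop and hpx::execution::par directly):

```cpp
#include <hpx/include/parallel_for_loop.hpp>
#include <hpx/include/future.hpp>

#include <cstddef>
#include <iostream>
#include <stdexcept>
#include <utility>
#include <vector>

std::vector<hpx::future<void>> frame_futures;   // illustrative frame-end list

void update_entity(std::size_t i)
{
    if (i == 42)
        throw std::runtime_error("update failed");   // thrown in the loop body
}

void run_frame(std::size_t count)
{
    using namespace hpx::parallel;

    // par(task): the call returns immediately with a future
    auto f = for_loop(execution::par(execution::task),
        std::size_t(0), count, [](std::size_t i) { update_entity(i); });

    // defer the check: move the future into the per-frame list
    frame_futures.push_back(std::move(f));
}

void end_of_frame()
{
    for (auto& f : frame_futures)
    {
        try
        {
            f.get();   // rethrows any exception from the loop body
        }
        catch (std::exception const& e)
        {
            std::cerr << "loop failed: " << e.what() << '\n';
        }
    }
    frame_futures.clear();
}
```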
<hkaiser> Yorlik: why did you close the ticket?
<Yorlik> Did I? Then it was an accident
<Yorlik> Sorry for that.
<hkaiser> ok, I'll reopen it
<Yorlik> Probably when trying to get that code sample in
<hkaiser> np
<hkaiser> now please tell me, where is that exception thrown? in update_entity<>?
<Yorlik> Must be
<Yorlik> But probably not directly in there
<Yorlik> Could also be inside the Lua call stack
<hkaiser> so it's thrown in the loop body?
<hkaiser> what exception is it?
<Yorlik> I don't know - the endpoint on the HPX side has an error list and the debugger crashed each time I tried to open and inspect it
<Yorlik> I'll retry and see what's actually there
<Yorlik> But I don't have many sites where I throw
<hkaiser> Yorlik: I looked now - we don't test exception handling for for_loop :/
<Yorlik> Sorry for that ;)
<hkaiser> that could be causing your problem
<Yorlik> It wasn't my intention .... :o
<Yorlik> So - do you think there's an obvious fix?
<hkaiser> let's see
<Yorlik> Debugger crashing again - it doesn't want me to see the errors
<Yorlik> errors length was 1
<Yorlik> hkaiser: Updated - I commented with the e.what() output
<Yorlik> I gave up on asking the debugger and just rethrew and printed. Take that debugger !!! :)
<Yorlik> If I read this correctly it didn't like me holding a spinlock while creating a new lua state
<Yorlik> The first thing I do in "agns::luaengine::lua_engine::init" is to acquire a lock_guard with a spinlock as lockable.
<Yorlik> And it seems I yielded while holding that lock, which is apparently not allowed.
<hkaiser> Yorlik: now it makes sense why it fails in debug only
<Yorlik> You solved the riddle?
<hkaiser> we don't check for held locks in release
<Yorlik> IC. I guess it's a protection mechanism to avoid deadlocks?
<hkaiser> yes
<hkaiser> you call yield while holding a lock
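A minimal sketch of the problem being diagnosed here, with hypothetical names standing in for agns::luaengine::lua_engine::init (header paths vary between HPX versions): a debug build of HPX registers held locks, so suspending the HPX thread while the spinlock is held trips the check.

```cpp
#include <hpx/hpx.hpp>
#include <hpx/lcos/local/spinlock.hpp>   // hpx::lcos::local::spinlock (path differs in newer HPX)

#include <mutex>

hpx::lcos::local::spinlock init_mtx;   // illustrative name

void init()
{
    std::lock_guard<hpx::lcos::local::spinlock> l(init_mtx);

    // anything here that suspends the HPX thread (yield, .get() on a
    // future, ...) happens while `l` is still held -> the debug-only
    // lock check raises the error seen in the gist
    hpx::this_thread::yield();
}
```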
<Yorlik> And init() is a pretty large function with a ton of possible output
<hkaiser> can you unlock the lock while yielding?
<hkaiser> you could use util::scoped_unlock<> to handle that
<Yorlik> The main reason I protected it was that, when creating a lua_engine, I needed readable, ungarbled output from the Lua side, because part of the initialization requires me to run Lua scripts
<hkaiser> no, it's called unlock_guard
<Yorlik> I could try to just remove it entirely
<hkaiser> hpx::cout should ungarble output
<Yorlik> Not sure if it was really required
<Yorlik> The output comes from Lua print
<hkaiser> you can have the lock, just unlock it while yielding
<hkaiser> (if possible)
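A sketch of the unlock_guard approach suggested above: keep the lock for the parts of init() that need it, but release it across the suspending call. hpx::util::unlock_guard unlocks in its ctor and re-locks in its dtor; the header location and namespace differ between HPX versions, and the names below are illustrative.

```cpp
#include <hpx/hpx.hpp>
#include <hpx/lcos/local/spinlock.hpp>
#include <hpx/util/unlock_guard.hpp>   // hpx::unlock_guard in newer releases

#include <mutex>

hpx::lcos::local::spinlock init_mtx;   // illustrative name

void init()
{
    std::unique_lock<hpx::lcos::local::spinlock> l(init_mtx);

    // ... work that genuinely needs the lock ...

    {
        hpx::util::unlock_guard<decltype(l)> ul(l);   // releases the lock
        hpx::this_thread::yield();                    // safe to suspend now
    }   // lock re-acquired when `ul` goes out of scope

    // ... more work under the lock ...
}
```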
<Yorlik> I think I need to find a way to get synchronized output from Lua
<hkaiser> if you have to hold on to the lock while yielding, use ignore_lock to tell HPX not to bother checking
<Yorlik> The problem is, even with synchronized output, the output from the different Lua engines being created would still get interleaved
<hkaiser> Yorlik: whatever, that's not the problem we're trying to solve here
<Yorlik> Could i use ignore lock locally for a single case?
<hkaiser> yes
<Yorlik> So it's hpx::ignore_lock(true/false) ??
<hkaiser> no
<hkaiser> it's an object that disables checking in its ctor and re-enables it in its dtor
<Yorlik> Oh that's nice
<Yorlik> So I just create one and it aute reenables when out of scope?
<hkaiser> simply create a scope befre calling yield with this variable inside: util::ignore_all_while_checking ignore_lock;
<Yorlik> Nice! Thanks !
<hkaiser> { util::ignore_all_while_checking ignore_lock; yield_while(...); }
<Yorlik> Makes sense - just like a lock_guard
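hkaiser's one-liner, expanded into a self-contained sketch. The scope works like a lock_guard: checking is disabled in the ctor of ignore_lock and re-enabled in its dtor. engine_ready and the surrounding function are made up for illustration; the exact header providing ignore_all_while_checking varies across HPX versions.

```cpp
#include <hpx/hpx.hpp>

#include <atomic>

std::atomic<bool> engine_ready{false};   // illustrative flag

void wait_for_engine()   // called while the caller still holds its lock
{
    {
        hpx::util::ignore_all_while_checking ignore_lock;
        hpx::util::yield_while([]() { return !engine_ready.load(); });
    }   // lock checking re-enabled here
}
```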
<hkaiser> I'll look into for_loop exception handling
<Yorlik> Alright
<Yorlik> Seems the last few days are finally steering towards a result :)
<Yorlik> It's kinda nice HPX has so much header-only code; it was easy for me to poke into it and generate all the output I needed without recompiling everything.
<hkaiser> Yorlik: btw, exception handling for for_loop seems to be ok - would be interesting to see why it failed for you
<Yorlik> So you'd have expected me to get an exception at the loop callsite?
<Yorlik> hkaiser ^^
<hkaiser> yes
<hkaiser> IFF you call .get() on the returned future - that will rethrow the exception
<hkaiser> but I'll add the test as it is missing
<Yorlik> I have wrapped that in try catch ofc
<hkaiser> Yorlik: so let me ask again
<hkaiser> is the exception thrown in the loop body?
<Yorlik> Yes
<Yorlik> On occasion I have to create a new Lua_engine
<hkaiser> k
<Yorlik> So it gets initialized and there is that lock
<Yorlik> Sometimes a task spawns new lua engines, when I'm going out of lua and back into lua
<Yorlik> So it mostly happens either in update() or in the Lua scripts called by update
<hkaiser> Yorlik: do you launch hpx tasks using apply()?
<hkaiser> or always using async?
<Yorlik> I don't think I have any apply left - I'd have to scan my code, but I doubt it
<hkaiser> please have a look
<Yorlik> After you explained the exception mechanics I decided to check all responses
<Yorlik> OK
<Yorlik> There are three left, but not in the simulator - they live in the code launching commands on the server controller - only used when issuing administrative commands coming from the admin client.
<Yorlik> The simulation is completely apply free
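Background for the apply()/async() question, as a hedged sketch: hpx::apply is fire-and-forget and returns no future, so an exception thrown by the task cannot be retrieved via .get(), whereas hpx::async returns a future that transports the exception to the caller. The function names are illustrative, and the calls are assumed to run on an HPX thread (e.g. from hpx_main).

```cpp
#include <hpx/include/apply.hpp>
#include <hpx/include/async.hpp>
#include <hpx/include/future.hpp>

#include <stdexcept>

void work() { throw std::runtime_error("boom"); }

void launch_examples()
{
    hpx::apply(work);                        // fire-and-forget: no future to check

    hpx::future<void> f = hpx::async(work);  // exception travels with the future
    try
    {
        f.get();                             // rethrows here
    }
    catch (std::exception const&) { /* handle it */ }
}
```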
<hkaiser> Yorlik: here is the test I promised: https://github.com/STEllAR-GROUP/hpx/pull/4590
<hkaiser> well, we need to investigate what's happening, then
<Yorlik> Will it throw at the loop callsite if something goes wrong?
<Yorlik> I'm currently rebuilding boost and hpx debug versions
<Yorlik> Should I build from your PR for testing?
<hkaiser> Yorlik: no, the PR just adds a test, no changes
<Yorlik> OK
<Yorlik> I'll continue building latest stable then. I made sure it's new - deleted all the old stuff
<hkaiser> Yorlik: if you use par, the loop will throw, if you use par(task), then the return future will rethrow when .get() is called
<Yorlik> Makes sense
<Yorlik> I assume the loop will create tasks internally in any case, right?
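A hedged sketch of the two propagation paths just described, with made-up loop bounds and body (HPX 1.4-era namespaces; newer releases spell the policy hpx::execution::par):

```cpp
#include <hpx/include/parallel_for_loop.hpp>
#include <hpx/include/future.hpp>

#include <cstddef>
#include <iostream>
#include <stdexcept>

void body(std::size_t i)
{
    if (i == 7)
        throw std::runtime_error("boom");
}

void synchronous_case(std::size_t n)
{
    using namespace hpx::parallel;
    try
    {
        // par: the exception (re)surfaces at the loop call site
        for_loop(execution::par, std::size_t(0), n, body);
    }
    catch (std::exception const& e)
    {
        std::cerr << "caught at call site: " << e.what() << '\n';
    }
}

void task_case(std::size_t n)
{
    using namespace hpx::parallel;

    // par(task): the call site does not throw; the exception travels
    // with the returned future
    hpx::future<void> f =
        for_loop(execution::par(execution::task), std::size_t(0), n, body);

    try
    {
        f.get();   // rethrows here
    }
    catch (std::exception const& e)
    {
        std::cerr << "caught at get(): " << e.what() << '\n';
    }
}
```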
<Yorlik> BTW: Is there a list of headers in hpx I should use when using program_options or fmt through HPX to reuse dependencies?
* Yorlik is browsing PC Builder sites with Threadripper machines while boost is building ....
<hkaiser> Yorlik: go through the hpx/libs folder for all modules
<Yorlik> ?
<hkaiser> program_options is in hpx/libs/program_options, format is in hpx/libs/format ;-)
<Yorlik> Ahh - I see!
<hkaiser> but your app gets those through cmake already
<Yorlik> I'm building with external project - I'd have to point to the cmake config files
<Yorlik> Just using the master HPX config will give me all the dependency targets?
<hkaiser> yes
<hkaiser> use the target HPX::hpx
<hkaiser> we don't officially expose the individual modules at this point
<hkaiser> might in the future, though
<Yorlik> Nice. So I can just start using the headers
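A minimal sketch of consuming HPX from an external project as discussed above; my_app is a placeholder, and the only assumption is that HPX's package config is on the CMake search path. Linking the HPX::hpx target brings in the bundled modules (program_options, format, ...) transitively:

```cmake
cmake_minimum_required(VERSION 3.13)
project(my_app CXX)

# point CMAKE_PREFIX_PATH or HPX_DIR at the HPX install/build tree
find_package(HPX REQUIRED)

add_executable(my_app main.cpp)
target_link_libraries(my_app PRIVATE HPX::hpx)
```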
nikunj has quit [Ping timeout: 260 seconds]
nikunj has joined #ste||ar
Abhishek09 has joined #ste||ar
<Abhishek09> hkaiser: When will the GSoC projects be announced?
<Abhishek09> diehlpk_work
Abhishek09 has quit [Remote host closed the connection]
<heller1> what does the timeline say?
<heller1> tomorrow
bita_ has joined #ste||ar
diehlpk_work has quit [Remote host closed the connection]
Yorlik has quit [Read error: Connection reset by peer]
Yorlik has joined #ste||ar
bita_ has quit [Ping timeout: 240 seconds]