hkaiser changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/ | GSoC: https://github.com/STEllAR-GROUP/hpx/wiki/Google-Summer-of-Code-%28GSoC%29-2020
<weilewei> hkaiser trying to build my application, got this linker error: [100%] Linking CXX executable main_dca; /usr/bin/ld: cannot find -lhpx_wrap
<weilewei> not sure why it is looking for hpx_wrap in /usr/bin/ld
<hkaiser> weilewei: on that branch?
<weilewei> Yup hkaiser
<hkaiser> yah, I think this is a different bug that was already fixed ;-)
<hkaiser> sec
<weilewei> k
<hkaiser> but it isn't :/
<hkaiser> could you create a ticket, please?
<hkaiser> that's a new one
<weilewei> ok sure hkaiser
<hkaiser> you should be able to work around the problem for now
<hkaiser> give me a sec
<hkaiser> try using -DHPX_WITH_DYNAMIC_HPX_MAIN=OFF with cmake
<hkaiser> weilewei: ^^
<weilewei> hkaiser will try later, thanks
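For reference, the suggested workaround as a full configure invocation; the source path and other options are placeholders, only the HPX_WITH_DYNAMIC_HPX_MAIN flag comes from the conversation:

```
cmake -DHPX_WITH_DYNAMIC_HPX_MAIN=OFF <other options> /path/to/source
```

With this flag off, HPX should not try to intercept main() via the hpx_wrap library, so the build no longer needs to link -lhpx_wrap.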
<zao> weilewei: /usr/bin/ld is the linker.
nikunj97 has quit [Ping timeout: 260 seconds]
hkaiser has quit [Quit: bye]
<weilewei> zao I see
bita has quit [Ping timeout: 256 seconds]
Pranavug has joined #ste||ar
Pranavug has quit [Client Quit]
karame_ has quit [Quit: Ping timeout (120 seconds)]
weilewei has quit [Ping timeout: 240 seconds]
<Yorlik> I'm getting hpx exceptions when trying to acquire a lock with an hpx spinlock, like: "lock_guard lock( lua_engine_lockable_ );" where lua_engine_lockable_ is an hpx::lcos::local::spinlock. The exception info says: "vector deleting destructor". I'm a bit clueless where to go from here.
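A minimal sketch of the pattern described here, assuming HPX 1.4-era headers; the lua_engine wrapper and member names are placeholders:

```cpp
#include <hpx/lcos/local/spinlock.hpp>

#include <mutex>

// hypothetical wrapper guarding a non-thread-safe Lua engine
struct lua_engine
{
    hpx::lcos::local::spinlock lua_engine_lockable_;

    void run_script()
    {
        // the HPX spinlock models Lockable, so std::lock_guard works;
        // it releases the lock when the scope is left, exceptions included
        std::lock_guard<hpx::lcos::local::spinlock> lock(lua_engine_lockable_);
        // ... call into Lua here ...
    }
};
```

"vector deleting destructor" is an MSVC-internal helper for delete[], so the message usually hints at an object lifetime or heap problem rather than the lock itself.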
mdiers_ has quit [Ping timeout: 264 seconds]
nikunj97 has joined #ste||ar
nikunj97 has quit [Ping timeout: 250 seconds]
hkaiser has joined #ste||ar
<Yorlik> hkaiser: yt?
<hkaiser> here
<Yorlik> I'm hitting a wall with a very strange lockup.
<Yorlik> I know sort of what is triggering it, but I have no clue why.
<Yorlik> I tried to find if there is any sort of shared resource, but couldn't find one
<Yorlik> The situation is like this:
<Yorlik> I have a central run() function, which contains a while loop inside of which there is an update() function.
<Yorlik> That's my central object update loop
<Yorlik> It runs updates, and if there is time left at the end of the frame it does a busy wait
<Yorlik> It works nicely and without problems so far.
<Yorlik> Then I recently introduced a second path of updates.
<Yorlik> These updates do not have a wait - so the framerate is unbounded.
<Yorlik> I call them in run(), before entering the while loop
<Yorlik> I receive a future from these, since they are started async.
<Yorlik> They keep running until I give a signal through an atomic to halt the simulation
<Yorlik> When this happens, this type of update exits, as does the while loop which does the bounded framerate updates.
<Yorlik> after the while loop, still within run(), I collect the future of the unbounded updates.
<Yorlik> The futures from the bounded updates are collected after each frame inside the while loop.
<Yorlik> I can run either of these two paths
<Yorlik> The decision which path is used for which type of object depends on the components.
<Yorlik> Basically I am looping through all templates, but the updaters which do not meet the respective conditional are left empty using an if constexpr()
<Yorlik> This way I can decide which objects are updated where.
<Yorlik> I tested it and it seems to work nicely.
<Yorlik> Now, when running both paths - even if one is never really doing updates, because I do not create objects - the unbounded path locks up.
<Yorlik> Looking at the worker threads it seems only the path running the bounded updates is active and mostly sitting in its busy wait
<Yorlik> The other worker threads seem to idle and sit in their schedulers waiting for work.
<Yorlik> The lockup appears to happen inside the parallel loop after finishing a chunk of work.
<Yorlik> My updater functions have quit, the executor has been destroyed, but nothing is happening.
<Yorlik> It's like it's stuck in the for loop, not scheduling any more chunks.
<Yorlik> It also is not my on_enter or on_exit lambdas
<Yorlik> They have quit and are not running when it hangs
<Yorlik> I also had another bug - which is probably unrelated and I could not reproduce anymore - an exception when trying to acquire a lock with a spinlock
<Yorlik> From my perspective it looks like some weirdness deep inside hpx, hidden from me - but maybe I'm doing something obviously wrong and just don't see it.
<Yorlik> So - that's pretty much it.
<Yorlik> So hkaiser: that's the wall of text ^^ :)
<Yorlik> I wish I understood better what these seemingly idling threads are actually doing and how to understand the state they are in when it hangs.
<hkaiser> do you see any problems in debug? any memory issues? objects going out of scope prematurely?
<Yorlik> I don't really know what to look for further.
<hkaiser> the seemingly idling threads wait for new tasks to be created, nothing else
<hkaiser> the 'lock-up' could be caused by a future you're waiting on, but that never gets ready
<Yorlik> The two code branches work nicely independently from each other. That's the only hint I currently have
<Yorlik> I let the server run for two hours without any problem
<hkaiser> hmm
<Yorlik> But as soon as I activate the bounded updates again it stops. It might be that, for some reason, it doesn't schedule the unbounded updates after some point, because no one asks for the future at the entry point
<hkaiser> difficult to tell from here
<Yorlik> Between the creation of the future and the checking of it lies the while loop for the bounded updates.
<Yorlik> But it starts and then stops
<Yorlik> As if the system wanted to tell me: If you don't ask for the future I'm not going to do anything anymore.
<Yorlik> It runs for a short while and stops after running some chunks from the parallel loop.
<Yorlik> BTW: The unbounded callsite is like this: auto frameless_fut = hpx::async( &controller::update_frameless, this );
<hkaiser> nah
<hkaiser> the problem can happen if you call get() on a future that never becomes ready
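A minimal sketch of that failure mode, using a local promise whose future never becomes ready; all names are illustrative:

```cpp
#include <hpx/include/lcos.hpp>

// hypothetical: caller suspends forever because set_value() never runs
int wait_forever()
{
    hpx::lcos::local::promise<int> p;
    hpx::future<int> f = p.get_future();

    // nothing ever calls p.set_value(...), so this get() suspends the
    // calling HPX thread indefinitely - the 'lock-up' described above
    return f.get();
}
```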
<Yorlik> The while loop never stops running - just the loop inside the unbounded call.
<Yorlik> Again - if I switch off the while loop it works
<Yorlik> So it reaches the future.get and waits for it
<Yorlik> After the bounded update while loop: frameless_fut.get( );
<hkaiser> the unbounded thread runs all the time ? or is it created for each frame?
<Yorlik> It runs the updaters for each type async, collects the futures and starts over
<hkaiser> what does that mean 'it starts over'?
<Yorlik> so the while loop keeps spinning - I checked it
<hkaiser> ok
<Yorlik> The while loop runs one frame after the other
<hkaiser> how does it make the future ready?
<Yorlik> the while loop has its own set of futures
<Yorlik> the bounded path inside that while loop and the unbounded path are independently synchronized
<Yorlik> They have their own futures to manage what's async
<hkaiser> you said the unbounded runs until an atomic is set
<Yorlik> Fundamentally both paths are exactly the same - just different functions
<Yorlik> The unbounded one has no busy loop after a frame - that's all
<hkaiser> where is that atomic set?
<Yorlik> inside the unbounded one and at the top of the while loop for the bounded one are checks for the atomic
<Yorlik> The atomic is set from the outside - but nothing writes to it during normal operation
<Yorlik> Only if I use the admin client to pause the simulation
<Yorlik> Or if I shut down the server
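A minimal sketch of the signalling described here, assuming a std::atomic<bool> flag; all names are placeholders:

```cpp
#include <atomic>

std::atomic<bool> halt_requested{false};  // hypothetical shared flag

void update_frameless()                   // unbounded path
{
    while (!halt_requested.load(std::memory_order_acquire)) {
        // run one update frame, then immediately start over
    }
}

void bounded_loop()                       // while loop inside run()
{
    while (!halt_requested.load(std::memory_order_acquire)) {
        // run one frame, then busy-wait out the rest of the frame time
    }
}

void admin_pause_or_shutdown()            // the only writer
{
    halt_requested.store(true, std::memory_order_release);
}
```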
<hkaiser> ok, let's recap
<hkaiser> you launch the unbounded one, get back a future, then do the bounded work, and then wait for the future returned from the unbounded one
<Yorlik> The wait for the unbounded future is only ever reached, if the bounded work has finished.
<hkaiser> is that correct?
<Yorlik> Basically yes
<hkaiser> so you relaunch unbounded work on each frame?
<Yorlik> Just the while loop runs continuously
<Yorlik> Yes - it's done inside the while loop
<Yorlik> The while loop never exits as long as the simulation is running
<hkaiser> you get a new unbounded future for each frame?
<Yorlik> I only ever ask for the unbounded future when the simulation is halted
<Yorlik> it has its own while loop inside
<hkaiser> you lost me
<Yorlik> So there are two while loops
<hkaiser> on two tasks?
<Yorlik> Yes
<hkaiser> ok
<Yorlik> One is inside the async I posted
<Yorlik> the other is in run()
<hkaiser> run does the bounded work?
<Yorlik> So run spawns the unbounded task before it enters the bounded while
<hkaiser> ok
<Yorlik> after the bounded while loop, once the pause state is entered, the unbounded future is collected
<Yorlik> and then run() exits
<hkaiser> what does that mean?
<hkaiser> 'future is collected'?
<Yorlik> I am not exiting run() before all work is finished
<Yorlik> It's just for synchronization
<hkaiser> so you call get() on the unbounded future
<Yorlik> exactly
<hkaiser> ok
<Yorlik> its value is discarded. I think it's a char right now
<hkaiser> sure, np
<Yorlik> Could also be void - no one ever cares
<hkaiser> so where does it lock up?
<Yorlik> It's like a sandwich: start unbounded - start bounded - collect unbounded
<hkaiser> ok
<Yorlik> It locks in the middle of running
<Yorlik> So - no server pause is triggered
<hkaiser> 'middle of running' ?
<Yorlik> In theory the central while loop should keep running the bounded updates and the task for the unbounded updates should do its thing
<hkaiser> which task locks up? the bounded or the unbounded one?
<Yorlik> The unbounded one
<Yorlik> The bounded one keeps ticking nicely
<hkaiser> how can it lock up if it doesn't do any synchronization?
<Yorlik> It hangs inside the parallel loop after exiting a chunk
<Yorlik> It looks as if no more chunks are launched
<hkaiser> so the unbounded task runs a parallel for?
<Yorlik> Despite not being finished
<Yorlik> Yes - both do
<hkaiser> and it never exits the loop?
<Yorlik> Not until the server is halted
<hkaiser> how many chunks does it run?
<Yorlik> Again - just by switching off the middle while loop of the bounded updates it starts working correctly
<Yorlik> 2-3 chunks or so
<hkaiser> ok
<hkaiser> let's go back again
<Yorlik> both paths use different specializations of the par loop
<hkaiser> you launch unbounded work for each frame?
<Yorlik> I was wondering if something is shared, but there isn't anything
<Yorlik> essentially the unbounded updates run in frames too, but without a frequency limit
<hkaiser> so you launch unbounded work for each frame?
<Yorlik> after each frame it starts over instead of trying to sync with a framerate
<hkaiser> a new async for each frame?
<Yorlik> no
<Yorlik> I did both
<hkaiser> you lost me again
<Yorlik> async and not async
<Yorlik> it doesn't make a difference
<Yorlik> I do not spawn additional tasks inside the unbounded updates, except what the parloop does.
<hkaiser> ok
<hkaiser> one bounded async per frame?
<Yorlik> One per entity type
<hkaiser> I'm lost, sorry
<Yorlik> The entities are different types - they are distributed between the two update methods.
<Yorlik> Each entity type has its own parloop
<Yorlik> So inside the managing loop which checks if the server is running, one function specialization for each entity type is started to do the updates
<hkaiser> that does not matter at all
<Yorlik> Indeed.
<Yorlik> Just saying - that's the structure
<Yorlik> In the bounded updates these functions are asyncs
<hkaiser> pls create a 10 liner that reproduces the execution structure and relation between tasks
<Yorlik> I'll do some pseudocode
<Yorlik> This is not showing what's inside the updater, but it reflects this basic structure
<Yorlik> Inside the updater the parloops for both branches are running on the respective specializations of the update_array<> function.
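A hedged reconstruction of that structure from the conversation so far; controller, update_array<>, the entity types, and the frame-timing helper are placeholders, not the actual code:

```cpp
#include <hpx/include/async.hpp>

#include <atomic>

struct frameless_entity {};            // placeholder entity types
struct framed_entity {};

template <typename Entity>
void update_array() { /* par-loop over all entities of this type */ }

struct controller
{
    std::atomic<bool> halt_{false};    // set externally to pause/shut down

    char update_frameless()            // unbounded path: own task, own loop
    {
        while (!halt_.load()) {
            update_array<frameless_entity>();
        }
        return 0;                      // value is discarded by the caller
    }

    void run()
    {
        // 1) "start unbounded": spawn the frameless task
        auto frameless_fut = hpx::async(&controller::update_frameless, this);

        // 2) "start bounded": frame loop with a frequency limit
        while (!halt_.load()) {
            auto fut = hpx::async([] { update_array<framed_entity>(); });
            fut.get();                 // bounded futures collected per frame
            // busy-wait for the rest of the frame would go here
        }

        // 3) "collect unbounded": only reached once the atomic is set
        frameless_fut.get();
    }
};
```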
nikunj97 has joined #ste||ar
<hkaiser> Yorlik: so far things look sane
<Yorlik> Thanks for looking. That's why I asked - I'm hitting a road block here.
<Yorlik> The output clearly shows that a chunk from the parloop finishes with destruction of the executor and then it just hangs.
<Yorlik> It doesn't even exit the function which has the parloop.
<Yorlik> It's hanging inside the parloop.
<Yorlik> Switching on and off the middle while loop inside run() changes it
<hkaiser> what executor do you use?
<Yorlik> The one you wrote with the hooks
<Yorlik> it exits its destructor correctly
<hkaiser> is it an asynchronous loop?
<Yorlik> Last line on screen before it hangs: EXITING executor_with_thread_hooks::~on_exit
<hkaiser> i.e. par(task)?
Hashmi has joined #ste||ar
<Yorlik> hpx::parallel::for_loop(hpx::parallel::execution::par.on( exec ).with( auto_chunk_size( 500us ) ), ....
<hkaiser> ok, so it's not an async loop
<Yorlik> Nope. It's pretty simple.
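Filled out, the quoted callsite shape looks roughly like this; the plain parallel_executor stands in for the hooked executor, and the loop body is illustrative:

```cpp
#include <hpx/hpx_main.hpp>
#include <hpx/include/parallel_for_loop.hpp>
#include <hpx/include/parallel_executors.hpp>

#include <chrono>
#include <cstddef>
#include <vector>

int main()
{
    using namespace std::chrono_literals;
    namespace execution = hpx::parallel::execution;

    std::vector<int> entities(10000);
    execution::parallel_executor exec;  // stand-in for the hooked executor

    // auto_chunk_size(500us) asks the runtime to size chunks so that
    // each one runs for roughly 500 microseconds
    hpx::parallel::for_loop(
        execution::par.on(exec).with(execution::auto_chunk_size(500us)),
        std::size_t(0), entities.size(),
        [&](std::size_t i) { entities[i] += 1; });

    return 0;
}
```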
<hkaiser> Yorlik: so the par-loop is inside update_entity_array_advanced<>()?
<Yorlik> Exactly
<hkaiser> so you have many of those?
<hkaiser> all using the same executor instance?
<Yorlik> The executor is local inside the function.
<hkaiser> k
<Yorlik> So each specialization has its own copy
<Yorlik> Same with the lambdas
<hkaiser> does it always lock-up in the same par-loop (same entity instance)?
<Yorlik> I am currently testing only with one entity type
<hkaiser> k
<hkaiser> where does the par-loop gets its work from?
<Yorlik> From the function which updates a single entity
<Yorlik> It exits properly
<hkaiser> so far - from what I see - the unbounded task runs only one instance of a par-loop, is that correct?
<Yorlik> I spammed output everywhere - it's not even re-entering this function.
<hkaiser> should it?
<Yorlik> I was checking exactly where it is hanging
<Yorlik> I wanted to exclude it's hanging inside my code somewhere.
<Yorlik> As I see it, it correctly finishes a chunk and destroys the executor
<Yorlik> And then lights out
<Yorlik> An output immediately below the parloop never shows up again
<Yorlik> So it's inside the parallel for
<hkaiser> why does it destroy the executor before exiting the par-loop?
<Yorlik> I think I might add an output in the executor ctor
<Yorlik> Well - the on_exit to be exact
<Yorlik> Not the entire executor
<hkaiser> you shouldn't destroy the executor before the par-loop is done
<Yorlik> my bad
<Yorlik> It's the nested on_exit struct you created inside it - I was imprecise here.
<hkaiser> ok, that one calls into your on_exit function, what happens there?
<Yorlik> it exits correctly.
<hkaiser> what exits correctly?
<Yorlik> I am just seeing that the operator() of the hook_wrapper might need some output
<Yorlik> My lambda exits and the nested on_exit struct
<Yorlik> gets destroyed
<Yorlik> hkaiser: Something is strange: the hook_wrapper operator() never exits
<Yorlik> I added output before and after invoke
<Yorlik> the post-invoke output never shows up
<hkaiser> so it hangs in your loop body?
<Yorlik> It seems
<Yorlik> I did this:
<Yorlik> decltype( auto ) operator( )( Ts&&... ts ) {
<Yorlik> on_exit _ { exec_ };
<Yorlik> return hpx::util::invoke( f_, std::forward<Ts>( ts )... );
<Yorlik> S_OUT( "hook_wrapper() - ENTER - " << std::endl );
<Yorlik> S_OUT( "hook_wrapper() - EXIT - " << std::endl );
<Yorlik> }
<hkaiser> well, the output after the return will never be executed
<Yorlik> I only see a couple of ENTERs
* Yorlik bangs head on table
<Yorlik> lol
<Yorlik> Actually I had that today at another spot already.
<hkaiser> Yorlik: the only way to execute things after a return is to run destructors for local objects
<Yorlik> Interesting idea - I wonder when you'd ever use that
<hkaiser> like the ~on_exit above
<Yorlik> I split up the return
<zao> Does C++ have a decent scope_exit thing yet?
<Yorlik> So it's immediately before return result;
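The split Yorlik describes, sketched; S_OUT and the members are from the paste above, and it assumes f_ returns a non-void value:

```cpp
decltype( auto ) operator( )( Ts&&... ts ) {
    S_OUT( "hook_wrapper() - ENTER - " << std::endl );
    on_exit _ { exec_ };
    // split the return so the EXIT trace actually runs before it
    decltype( auto ) result = hpx::util::invoke( f_, std::forward<Ts>( ts )... );
    S_OUT( "hook_wrapper() - EXIT - " << std::endl );
    return result;
}
```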
<hkaiser> zao: any destructor?
<zao> I mean, run arbitrary lambdas/blocks easily without having to hand-craft something.
<hkaiser> ahh, the facility itself - well that's trivial, isn't it?
<hkaiser> template <typename F> struct on_exit { F f; ~on_exit(f()); };
<hkaiser> template <typename F> struct on_exit { F f; ~on_exit() {f();} };, that is
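hkaiser's corrected one-liner, expanded into a compilable form with a C++17 deduction guide; the usage below is illustrative:

```cpp
#include <iostream>

template <typename F>
struct on_exit
{
    F f;
    ~on_exit() { f(); }   // runs the callable when the scope is left
};

// deduction guide so the lambda type is inferred at the call site
template <typename F> on_exit(F) -> on_exit<F>;

int main()
{
    on_exit guard{[] { std::cout << "scope left\n"; }};
    std::cout << "doing work\n";
}   // prints "doing work", then "scope left"
```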
<Yorlik> So - how do we get to the bottom of this?
<hkaiser> Yorlik: we could do a screen-share later today
<Yorlik> That would be awesome - do you have teamviewer? imo it has the best quality
<hkaiser> visual studio has a nice live-share facility nowadays
<Yorlik> Oh nice - I have it installed - we can use that
<hkaiser> ok
<hkaiser> I've never used it before - just heard good things
<Yorlik> Take some drugs before looking at my code so it becomes bearable ;)
<Yorlik> It's a sausage in the making.
<hkaiser> Yorlik: come on - no worries - we share all of our sausage with you as well...
<Yorlik> Indeed. I think it's this insecurity because of my inconsistency. I sometimes am struggling with ridiculously trivial stuff.
<Yorlik> The fate of a self-taught person /methinks.
<hkaiser> Yorlik: we're all self-taught
<Yorlik> True - that's what academia is supposed to teach us: Dive into unknown waters without fear ...
weilewei has joined #ste||ar
Hashmi has quit [Quit: Connection closed for inactivity]
gonidelis has joined #ste||ar
gonidelis has quit [Remote host closed the connection]
weilewei has quit [Ping timeout: 240 seconds]
Hashmi has joined #ste||ar
kale_ has joined #ste||ar
kale_ has quit [Client Quit]
<hkaiser> Yorlik: yt?
<Yorlik> Ya
<Yorlik> hkaiser - I'm there
<hkaiser> Yorlik: let me know whenever you have time to look at your problem
<Yorlik> I have time now
<hkaiser> I can setup the zoom
<hkaiser> give me 5 minutes, ok?
<Yorlik> Alright
Hashmi has quit [Quit: Connection closed for inactivity]