#ste||ar on 2019-09-27 — irc logs at irclog.cct.lsu.edu

2019-06-17 20:46 hkaiser changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/ | GSoD: https://developers.google.com/season-of-docs/

01:23 K-ballo has quit [Quit: K-ballo]

03:25 hkaiser has quit [Ping timeout: 245 seconds]

06:15 <heller> https://www.osti.gov/servlets/purl/1559921

06:17 <heller> https://bolt-omp.org

07:20 rori has joined #ste||ar

08:45 Amy2 has quit [Ping timeout: 252 seconds]

10:45 rori has quit [Quit: bye!]

10:56 K-ballo has joined #ste||ar

11:48 hkaiser has joined #ste||ar

12:08 mdiers_ has quit [Remote host closed the connection]

12:09 mdiers_ has joined #ste||ar

12:52 <hkaiser> heller: I really would like for #4108 or similar to go in asap

12:52 <hkaiser> the current situation is not sustainable anymore

12:53 <heller> what's the underlying issue?

12:53 <heller> or how does it manifest?

12:53 <heller> build failures on non c++17 compilers?

12:54 <heller> hkaiser: in an ideal world, we shouldn't really set the C++ mode, but have the user decide...

12:54 <hkaiser> heller: the issue is that if hpx was configured with -std=c++17, then any dependent projects _must_ use c++17 as well

12:54 <heller> and then propagate it through the HPX target...

12:54 <hkaiser> not always possible

12:54 <heller> well, yes

12:54 <heller> ABI problems?

12:54 <hkaiser> phylanx depends on pybind11 which must be compiled with c++14

12:55 <heller> then you should configure HPX with the version you need

12:55 <hkaiser> well, circle uses c++17

12:55 <heller> wow, is there such a breaking change that it doesn't work when compiled in C++17 mode?

12:55 weilewei has joined #ste||ar

12:56 <hkaiser> no it has a bug preventing it from working with aligned memory new which will be visible with c++17

12:57 <hkaiser> normally there are no ABI problems between 17 and 14 (on mainstream compilers), so #4108 enables HPX features only if a) they are enabled while compiling HPX and b) the dependent library supports those as well
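
The gating rule hkaiser describes for #4108 could be sketched with the preprocessor roughly as follows. This is only an illustration and not the code in that pull request: the macro names MYLIB_BUILT_WITH_ALIGNED_NEW and MYLIB_USE_ALIGNED_NEW and the function uses_aligned_new are hypothetical, while __cpp_aligned_new is the standard feature-test macro for C++17 aligned new.

```cpp
// Hypothetical: a value a library could write into a generated config header
// at build time, recording that the library itself was compiled with
// aligned new enabled.
#define MYLIB_BUILT_WITH_ALIGNED_NEW 1

// Enable the feature for a dependent translation unit only if
//   a) the library was built with it, and
//   b) the compiler mode of the dependent project supports it as well.
#if MYLIB_BUILT_WITH_ALIGNED_NEW && defined(__cpp_aligned_new)
#define MYLIB_USE_ALIGNED_NEW 1
#else
#define MYLIB_USE_ALIGNED_NEW 0
#endif

// Reports whether this translation unit ended up with the feature enabled.
constexpr bool uses_aligned_new()
{
    return MYLIB_USE_ALIGNED_NEW != 0;
}
```

Under this scheme a dependent project compiled as C++14 sees the feature switched off even though the library was built as C++17, which is the situation described for phylanx and pybind11.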

12:59 <heller> ok

13:00 <weilewei> hkaiser Do I need to remove CXXFLAGS="-std=c++14" from my build script then?

13:02 <weilewei> I also use HPX_WITH_CXX14=ON in my cmake, so two flags at the same time

13:15 <hkaiser> weilewei: as long as those are consistent you're fine

13:15 <weilewei> hkaiser ok

13:22 aserio has joined #ste||ar

13:46 hkaiser has quit [Ping timeout: 265 seconds]

13:54 Yorlik has quit [Read error: Connection reset by peer]

14:09 hkaiser has joined #ste||ar

14:16 K-ballo has quit [Quit: K-ballo]

14:17 K-ballo has joined #ste||ar

14:22 <aserio> hkaiser: is this true about --hpx:bind?

14:22 <aserio> the detailed affinity description for the OS threads, see More details about HPX command line options for a detailed description of possible values. Do not use with --hpx:pu-step, --hpx:pu-offset or --hpx:affinity options. Setting this option additionally set --hpx:numa-sensitive unless --hpx:bind=none. In this case bind disables the thread affinities.

14:43 K-ballo has quit [Quit: K-ballo]

14:44 K-ballo has joined #ste||ar

14:50 jbjnr has joined #ste||ar

15:02 <jbjnr> hkaiser: yt?

15:04 <hkaiser> jbjnr: here

15:05 <jbjnr> hi. I posted a question yesterday and am looking again at it now.

15:05 <jbjnr> hold on ...

15:05 <jbjnr> I have a problem - a lock-up caused by my scheduler. When load components action runs at program start, it looks like the shared state is not getting set to ready when it completes, the task runs as expected, but when it returns, the return load_components_async(gid).get(); never returns, so the program sits in an infinite loop of looking for new tasks, but never gets any. I can't imagine why the scheduler

15:05 <jbjnr> change I made should cause this. What might make the future not get set?

15:05 <jbjnr> ^reposted

15:28 <hkaiser> jbjnr: no idea, frankly

15:29 <hkaiser> I have seen lockups at startup, but can't remember what was causing those...

15:29 aserio has quit [Ping timeout: 250 seconds]

15:31 <jbjnr> hkaiser: when an action completes, where does the future shared state get made ready? I put breakpoints in several promise places etc, but they are not triggered. Which class should I look at?

15:32 <hkaiser> sec

15:35 <hkaiser> jbjnr: here: https://github.com/STEllAR-GROUP/hpx/blob/master/hpx/lcos/base_lco_with_value.hpp#L139

15:35 <jbjnr> thanks

15:36 Yorlik has joined #ste||ar

15:39 K-ballo has quit [Quit: K-ballo]

15:42 K-ballo has joined #ste||ar

15:55 aserio has joined #ste||ar

16:02 Yorlik has quit [Read error: Connection timed out]

16:23 akheir has joined #ste||ar

16:30 maxwellr96 has quit [Ping timeout: 246 seconds]

16:44 aserio has quit [Ping timeout: 245 seconds]

16:48 nikunj has joined #ste||ar

16:58 <weilewei> hkaiser for the future in hpx, I know that future.get() is not a blocking call. However, if I would like to have a forced blocking call to that future, what member function should I use?

17:00 <weilewei> I think I've hit a timing issue in DCA++: the future has not been computed yet, but the application tries to assert that the number of all futurized tasks == the number of expected tasks. This does not happen when they use std::future::get(), as that guarantees the task is fully computed

17:05 <zao> HPX source documentation seems to imply that future's get should wait(), or am I looking in the wrong place?

17:06 <weilewei> zao thanks, let me try

17:07 <zao> what kind of scenario are you using it in, and what kind of behaviour are you seeing?

17:09 <hkaiser> weilewei: what is a 'forced blocking call'?

17:09 <hkaiser> weilewei: future::get() does block the current HPX thread but not the underlying OS-thread, so I'm not sure what you need
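
The pattern under discussion (check a completion counter only after every future has been consumed) can be illustrated with standard futures. This is a minimal sketch that uses std::async and std::future so it builds without HPX; hpx::future is modelled on the same interface, but whether the DCA++ code has this shape is an assumption. The names run_walkers and finished are made up for the example.

```cpp
#include <atomic>
#include <cassert>
#include <future>
#include <vector>

// Hypothetical stand-in for launching n walker tasks and counting completions.
int run_walkers(int n)
{
    std::atomic<int> finished{0};
    std::vector<std::future<void>> futures;

    for (int i = 0; i < n; ++i)
    {
        futures.push_back(std::async(std::launch::async, [&finished] {
            ++finished;    // each task reports completion exactly once
        }));
    }

    // get() returns only after the corresponding task has completed, so the
    // counter is safe to inspect once every future has been consumed.
    for (auto& f : futures)
        f.get();

    assert(finished.load() == n);
    return finished.load();
}
```

Calling run_walkers(7) returns 7. If the counter were compared before the loop over get(), the check could run while some tasks are still in flight.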

17:15 <weilewei> so in this link: https://github.com/weilewei/DCA/blob/5e6fb4b5e55666815737a967cd1506bd441a8215/include/dca/phys/dca_step/cluster_solver/stdthread_qmci/stdthread_qmci_cluster_solver.hpp#L181, parameters_.get_walkers() is set to 7, while

17:15 <weilewei> (gdb) p walk_finished_ $1 = {<std::__atomic_base<int>> = {static _S_alignment = 4, _M_i = 0}, <No data fields>}

17:17 <hkaiser> so it's set to 4 and not to 7

17:17 <hkaiser> ahh no, set to zero

17:17 <weilewei> starting from line 169, that is the place where all hpx futures start. However, I suspect that those launched futures have not finished

17:17 <hkaiser> it's set to zero

17:17 <hkaiser> yes

17:18 <hkaiser> are you sure you're not sitting in line 196?

17:18 <weilewei> In their previous implementation, every time they launch a std::future, several lines later they call future.get(), which makes sure 1 worker has finished

17:19 <weilewei> hmm. it breaks at line 181

17:19 <hkaiser> it could be called from line 196, no?

17:20 <weilewei> https://gist.github.com/weilewei/02e157ce93209f52af41818d7a7b2ea9, this is the backtrace

17:20 <zao> Note frame #5's line number.

17:21 <weilewei> Ohh

17:24 <weilewei> So that's the starting point that program breaks? zao

17:25 <zao> As you're in a catch block there, something has thrown in the associated try block.

17:25 <zao> Which would be the get() function of a future, or somehow the iteration over a vector, which I doubt can throw.

17:25 <zao> Inspecting the exception may help.

17:27 <zao> Figuring out what can make a future throw might be educational, maybe it's forwarding a failure from the task. Maybe it's inconsistent in some way.
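
zao's point that get() can forward a failure from the task can be shown in a few lines. Again this is a sketch with std::future instead of hpx::future so that it builds on its own; failing_task and message_from_get are hypothetical names, and the real task in DCA++ may fail for an entirely different reason.

```cpp
#include <cassert>
#include <future>
#include <stdexcept>
#include <string>

// Hypothetical task that fails; stands in for whatever the real task does.
int failing_task()
{
    throw std::runtime_error("walker failed");
}

// Returns the message of the exception that get() rethrows, or an empty
// string if the task completed normally.
std::string message_from_get()
{
    std::future<int> f = std::async(std::launch::async, failing_task);
    assert(f.valid());    // a future returned by std::async refers to shared state

    try
    {
        f.get();    // the exception stored by the task is rethrown here
    }
    catch (std::exception const& e)
    {
        return e.what();
    }
    return std::string();
}
```

Printing e.what() in the existing catch block is one way of inspecting the exception as suggested.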

17:28 <weilewei> I am wondering how can a lower line 196 call a higher line 181

17:29 <zao> print_metadata is an anonymous function, a lambda.

17:29 <zao> When control flows past the definition, it only initializes the variable, it doesn't run the body.

17:29 <zao> It's being invoked at one of two places later on, line 196 and 200.
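
The control flow zao describes can be reproduced in isolation. In this sketch the lambda is named print_metadata as in the discussion, but everything else (trace_lambda_call, the events vector, the thrown error) is made up for illustration.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <vector>

// Records the order of events so the control flow can be inspected.
std::vector<std::string> trace_lambda_call()
{
    std::vector<std::string> events;

    // Defining the lambda only initializes the variable; the body does not run.
    auto print_metadata = [&events] { events.push_back("lambda body"); };
    events.push_back("after definition");

    try
    {
        throw std::runtime_error("task failed");    // hypothetical failure
    }
    catch (std::exception const&)
    {
        events.push_back("in catch");
        print_metadata();    // the body, written earlier, runs only now
    }

    assert(events.size() == 3);
    return events;
}
```

The recorded order is "after definition", "in catch", "lambda body", which is why a debugger can stop on an earlier source line while the calling frame is on a later one.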

17:31 <weilewei> Oh, I see what you mean, I got it

17:31 <zao> :D

17:33 <weilewei> Ok, I think then I should look into why that future throws and what is the failure inside that future

17:40 weilewei has quit [Remote host closed the connection]

17:44 weilewei has joined #ste||ar

17:47 <heller> weilewei: when running in gdb, type catch throw before you run

17:47 <heller> Then you break when the exception is being thrown

17:53 <weilewei> Ok, let me try that heller

17:56 aserio has joined #ste||ar

18:03 <hkaiser> heller: I have all self-contained serialization tests running again (the ones not depending on hpx)

18:06 <jbjnr> hmmmm. so it looks like pre-main executes load_components_action, but then doesn't get resumed after

18:07 <hkaiser> so the task does not get rescheduled

18:21 <jbjnr> hkaiser:got a moment to look at a log

18:22 <hkaiser> sure

18:22 <jbjnr> 1 min

18:23 <jbjnr> https://gist.githubusercontent.com/biddisco/de7f8930db4ff9c0910ba1e6ece64f26/raw/232e557343e909aebb8fcea95eb62f8a7a18e3d0/local

18:23 <jbjnr> that's a good log using the local scheduler

18:24 <jbjnr> https://gist.githubusercontent.com/biddisco/391a7ad670c4507e984d87131c0249f9/raw/7dc7e6e27df67f63236eeac7742c2f3f51f2137a/gistfile1.txt

18:25 <jbjnr> that's using the share-p scheduler. The second log stops there. the first log I truncated just after the same point

18:25 <jbjnr> if you look at them alternately, they're the same up to the last bit

18:25 <jbjnr> when the local carries on, but the shared stops

18:26 <jbjnr> any idea why the pre-main isn't restarted?

18:26 <jbjnr> I've no idea what I changed to make it stop working. it was running fine, then it wasn't

18:26 <jbjnr> and I can't see anything that's wrong.

18:27 <jbjnr> and all tasks in queues etc tally up properly

18:30 <heller> hkaiser: nice

18:31 <hkaiser> jbjnr: well, the log says that pre_main got rescheduled, no?

18:32 <hkaiser> last line: new state(pending), old state(suspended)

18:32 <jbjnr> yes

18:32 <jbjnr> but then it 'disappears'

18:32 <hkaiser> so your scheduler got the task to run

18:32 <hkaiser> hmmm

18:33 <jbjnr> after that last log line, then the scheduler just polls for tasks and never gets any

18:33 <jbjnr> when a task resumes, does it enter the scheduler differently - perhaps I removed an important line or something

18:33 <jbjnr> it's still in the thread map btw

18:35 <jbjnr> the thread map has 1 entry, but the new tasks and work items are empty, so where did it go? it must have not entered the queues - but I am puzzled

18:35 <hkaiser> jbjnr: look here: https://github.com/STEllAR-GROUP/hpx/blob/master/hpx/runtime/threads/detail/set_thread_state.hpp#L258

18:35 <hkaiser> that's where the thread is handed to the scheduler after being resumed

18:37 <hkaiser> here is the log message we saw: https://github.com/STEllAR-GROUP/hpx/blob/master/hpx/runtime/threads/detail/set_thread_state.hpp#L196

18:37 <jbjnr> thanks. I'll put a breakpoint there and see if the thread is lost en-route

18:37 <heller> hkaiser: might not have been the best idea to mix the modularization with the new feature

18:39 <hkaiser> heller: yah, no other way, I think

18:40 <heller> Same path as we went with the collectives, then decoupling

18:40 <hkaiser> the change is so pervasive because of the include changes, you can just ignore those

18:40 <heller> Yeah

18:40 <hkaiser> I'm not going back at this point

18:40 <heller> Sure, I understand ;)

18:41 <heller> Just might take a little to review

18:42 <hkaiser> heller: I need more time before review

18:42 <heller> K

18:42 <hkaiser> it's usable as a separate library now, but hpx itself doesn't compile yet

18:44 <heller> Hmm

18:44 <heller> That's very nice already

18:44 <heller> Looking forward to that

18:45 <heller> I guess what I need in the near future are futures and serialization

18:45 <hkaiser> right

18:45 <heller> That'll keep me busy for a few months

18:49 <heller> First I need the official blessing though ;)

18:58 <jbjnr> hkaiser:thanks - found the problem. I was looking in the wrong place entirely

18:59 <jbjnr> but tracing from where you linked tracked it for me

19:01 <jbjnr> incidentally hkaiser what is the "allow_fallback" flag passed in when scheduling a thread?

19:04 <hkaiser> jbjnr: uhh

19:04 <hkaiser> where?

19:05 <heller> hkaiser: https://circleci.com/gh/STEllAR-GROUP/hpx/130254?utm_campaign=workflow-failed&utm_medium=email&utm_source=notification

19:09 <hkaiser> heller: thanks, I screwed that one up :/

19:09 <hkaiser> will fix

19:09 <hkaiser> just replace this with a return 0

19:13 <jbjnr> hkaiser:sorry here for example https://github.com/STEllAR-GROUP/hpx/blob/8f9678cae0b4e5902582e4f88855d69634beeb57/hpx/runtime/threads/policies/local_queue_scheduler.hpp#L469

19:14 <jbjnr> seems to be related to suspending threads, I didn't really pay attention to it

19:14 <jbjnr> doesn't matter, ignore me

19:29 weilewei has quit [Remote host closed the connection]

19:31 <zao> What's a typical reason for "get_runtime_ptr() != nullptr"? Forgetting to set up HPX at all?

19:32 <zao> (messing around with wei's fork of DCA, I've got no idea what it's doing, but it seems to be quite a bit into main by then)

19:36 <hkaiser> no idea

19:36 <hkaiser> this will be nullptr only if hpx is not running at all

19:41 aserio has quit [Ping timeout: 250 seconds]

19:41 nikunj97 has joined #ste||ar

19:45 nikunj has quit [Ping timeout: 240 seconds]

19:58 nikunj has joined #ste||ar

19:59 weilewei has joined #ste||ar

20:00 <zao> Interestingly enough, <hpx/hpx_main.hpp> does not take in main_dca.cpp, not sure if there's something I missed building HPX, but it's silently doing nothing.

20:01 nikunj97 has quit [Ping timeout: 240 seconds]

20:01 <zao> hkaiser: Ended up doing it the explicit way of including <hpx/hpx_init.hpp> and going main -> hpx::init -> hpx_main junk, and the run seems to churn on nicely.

20:01 <hkaiser> nod, it needs to be included in the file that defines main()

20:02 <zao> I did.

20:02 <hkaiser> hmm, then I don't know

20:02 <hkaiser> what platform?

20:02 <zao> Ubuntu 18.04, GCC 8.3, Boost 1.71

20:02 <hkaiser> heh, that should work :/

20:02 <zao> No idea how DCA uses HPX.

20:02 <hkaiser> me neither ;-)

20:03 <hkaiser> ask wei

20:03 <weilewei> I remember I had this problem before, but I forget how I solved it. zao

20:03 <hkaiser> but explicit is better than implicit anyways

20:03 <zao> Ah, you got back, only saw the leave :)

20:04 <weilewei> I saw the log, so, did you pass -DHPX_DIR=$PATH_TO_HPX_INSTALL in your cmake?

20:05 <zao> I used CMAKE_PREFIX_PATH

20:06 <weilewei> and also, did you pass -DDCA_HAVE_HPX=ON

20:06 <zao> https://gist.githubusercontent.com/zao/b455f6d4730140b64aad38ecdb76d496/raw/d6fb3f82ac2199a98d35c7b92699074d9c8e60b7/gistfile1.txt

20:08 <weilewei> I am clueless, hmmm

20:09 <zao> What branch are you working on, and have you ensured that the HPX runtime is actually running?

20:09 hkaiser has quit [Ping timeout: 264 seconds]

20:10 <zao> Did you figure out what was throwing?

20:10 <weilewei> I am working on rostam_test branch

20:10 aserio has joined #ste||ar

20:10 <weilewei> I have not, still debugging

20:13 <weilewei> zao which test/executable are you running?

20:13 <zao> main_dca with some JSON generated by cooldown.py

20:14 <weilewei> I am running this too, but it does not complain about the HPX runtime

20:20 <jbjnr> weilewei:are you using any of my stuff?

20:22 <weilewei> jbjnr not really, I tried, but I cannot get it to run. so I basically replace their thread pool implementation with hpx::async, future, etc.

20:22 <weilewei> Yours is in hpx-2018 branch, right? That's the place I found you have some threading implementation

20:25 <jbjnr> why not copy the thread pool replacement from my branch rather than redo it?

20:26 <jbjnr> the branch to use would be the hpx-master from aug 2019

20:27 <jbjnr> in the dca repo

20:28 <weilewei> jbjnr I did not redo things much, here is my try: https://github.com/weilewei/DCA/blob/rostam_test/include/dca/parallel/hpx/hpxthread.hpp

20:28 <weilewei> This is their implementation: https://github.com/weilewei/DCA/blob/master/include/dca/parallel/stdthread/stdthread.hpp

20:29 <jbjnr> can't look now (bed time here), and next week is hackathon, then hpx course <sigh> but I'll try to find some time to look at dca

20:30 <weilewei> jbjnr can you show me the link to your branch/repo? I have not found any

20:30 <weilewei> yea, sure, when you are free

20:30 <jbjnr> on my branch there is a complete replacement for all the threading in dca with hpx, probably around end of 2018 all was working, but stuff might have been broken since, but it should be fixable

20:30 <jbjnr> anyway, have a look at it to get ideas

20:30 <jbjnr> gtg

20:31 jbjnr has quit [Quit: WeeChat 2.5]

20:31 <weilewei> so on DCA public repo, I have not seen any hpx branch

20:32 <zao> weilewei: In biddisco/DCA, I would reckon.

20:32 <weilewei> jbjnr btw, the changes are suggested by Giovannie..

20:33 <weilewei> zao Oh, now I see

20:46 weilewei has quit [Remote host closed the connection]

21:10 hkaiser has joined #ste||ar

21:30 aserio has quit [Quit: aserio]

21:32 weilewei has joined #ste||ar

21:52 Yorlik has joined #ste||ar

21:56 jaafar has quit [Ping timeout: 245 seconds]

21:59 nikunj has quit [Ping timeout: 265 seconds]

23:21 akheir has quit [Quit: Leaving]

23:45 weilewei has quit [Remote host closed the connection]