hkaiser changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/ | GSoD: https://developers.google.com/season-of-docs/
K-ballo has quit [Quit: K-ballo]
hkaiser has quit [Ping timeout: 245 seconds]
rori has joined #ste||ar
Amy2 has quit [Ping timeout: 252 seconds]
rori has quit [Quit: bye!]
K-ballo has joined #ste||ar
hkaiser has joined #ste||ar
mdiers_ has quit [Remote host closed the connection]
mdiers_ has joined #ste||ar
<hkaiser> heller: I really would like for #4108 or similar to go in asap
<hkaiser> the current situation is not sustainable anymore
<heller> what's the underlying issue?
<heller> or how does it manifest?
<heller> build failures on non c++17 compilers?
<heller> hkaiser: in an ideal world, we shouldn't really set the C++ mode, but have the user decide...
<hkaiser> heller: the issue is that if hpx was configured with -std=c++17, then any dependent projects _must_ use c++17 as well
<heller> and then propagate it through the HPX target...
<hkaiser> not always possible
<heller> well, yes
<heller> ABI problems?
<hkaiser> phylanx depends on pybind11 which must be compiled with c++14
<heller> then you should configure HPX with the version you need
<hkaiser> well, circle uses c++17
<heller> wow, is there such a breaking change that it doesn't work when compiled in C++17 mode?
weilewei has joined #ste||ar
<hkaiser> no, it has a bug preventing it from working with aligned-memory new, which becomes visible with c++17
<hkaiser> normally there are no ABI problems between c++17 and c++14 (on mainstream compilers), so #4108 enables HPX features only if a) they were enabled while compiling HPX and b) the dependent library supports them as well
<heller> ok
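The gist of the approach described above can be sketched as a double guard: a feature is exposed to a dependent project only when HPX itself was built with it and the consumer's language mode supports it. A minimal sketch follows; the macro names are hypothetical, not the actual macros introduced by #4108.

```cpp
// Sketch only: HPX_EXAMPLE_HAVE_CXX17_FEATURE stands in for a config macro
// set when HPX was built with the feature enabled; __cplusplus checks the
// language mode of the project that consumes the installed headers.
#if defined(HPX_EXAMPLE_HAVE_CXX17_FEATURE) && __cplusplus >= 201703L
#  define HPX_EXAMPLE_CXX17_API_ENABLED 1
#else
#  define HPX_EXAMPLE_CXX17_API_ENABLED 0
#endif

#if HPX_EXAMPLE_CXX17_API_ENABLED
// ... declarations that rely on C++17 go here ...
#endif
```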
<weilewei> hkaiser Do I need to remove CXXFLAGS="-std=c++14" from my build script then?
<weilewei> I also use HPX_WITH_CXX14=ON in my cmake, so two flags at the same time
<hkaiser> weilewei: as long as those are consistent you're fine
<weilewei> hkaiser ok
aserio has joined #ste||ar
hkaiser has quit [Ping timeout: 265 seconds]
Yorlik has quit [Read error: Connection reset by peer]
hkaiser has joined #ste||ar
K-ballo has quit [Quit: K-ballo]
K-ballo has joined #ste||ar
<aserio> hkaiser: is this true about --hpx:bind
<aserio> the detailed affinity description for the OS threads, see More details about HPX command line options for a detailed description of possible values. Do not use with the --hpx:pu-step, --hpx:pu-offset, or --hpx:affinity options. Setting this option additionally sets --hpx:numa-sensitive unless --hpx:bind=none is given; in that case --hpx:bind disables the thread affinities.
K-ballo has quit [Quit: K-ballo]
K-ballo has joined #ste||ar
jbjnr has joined #ste||ar
<jbjnr> hkaiser: yt?
<hkaiser> jbjnr: here
<jbjnr> hi. I posted a question yesterday and am looking at it again now.
<jbjnr> hold on ...
<jbjnr> I have a problem - a lock-up caused by my scheduler. When the load-components action runs at program start, it looks like the shared state is not getting set to ready when it completes: the task runs as expected, but when it returns, the call load_components_async(gid).get(); never returns, so the program sits in an infinite loop looking for new tasks but never gets any. I can't imagine why the scheduler
<jbjnr> change I made should cause this. What might cause the future not to get set?
<jbjnr> ^reposted
<hkaiser> jbjnr: no idea, frankly
<hkaiser> I have seen lockups at startup, but can't remember what was causing those...
aserio has quit [Ping timeout: 250 seconds]
<jbjnr> hkaiser: when an action completes, where does the future's shared state get made ready? I put breakpoints in several promise places etc., but they are not triggered. Which class should I look at?
<hkaiser> sec
<jbjnr> thanks
Yorlik has joined #ste||ar
K-ballo has quit [Quit: K-ballo]
K-ballo has joined #ste||ar
aserio has joined #ste||ar
Yorlik has quit [Read error: Connection timed out]
akheir has joined #ste||ar
maxwellr96 has quit [Ping timeout: 246 seconds]
aserio has quit [Ping timeout: 245 seconds]
nikunj has joined #ste||ar
<weilewei> hkaiser for the future in hpx, I know that future.get() is not a blocking call. However, if I would like to have a forced blocking call to that future, what member function should I use?
<weilewei> I think I'm hitting a timing issue in DCA++: the future has not been computed yet, but the application tries to assert that the number of all futurized tasks == the number of expected tasks. This situation does not happen when they use std::future::get(), as that guarantees the result is fully computed
<zao> HPX source documentation seems to imply that future's get should wait(), or am I looking in the wrong place?
<weilewei> zao thanks, let me try
<zao> what kind of scenario are you using it in, and what kind of behaviour are you seeing?
<hkaiser> weilewei: what is a 'forced blocking call'?
<hkaiser> weilewei: future::get() does block the current HPX thread but not the underlying OS-thread, so I'm not sure what you need
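To illustrate the point hkaiser makes above, here is a minimal sketch (assuming a running HPX runtime) of hpx::future's wait() and get(): both suspend the calling HPX thread until the result is ready, while the underlying OS thread remains free to run other HPX tasks.

```cpp
#include <hpx/include/async.hpp>
#include <hpx/include/lcos.hpp>

int compute() { return 42; }

void example()
{
    hpx::future<int> f = hpx::async(compute);

    // wait() suspends the calling HPX thread until the shared state is
    // ready; the underlying OS thread keeps running other HPX tasks.
    f.wait();

    // get() waits the same way if the result is not ready yet, then
    // returns the value (and invalidates the future).
    int result = f.get();
    (void) result;
}
```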
<weilewei> (gdb) p walk_finished_ => $1 = {<std::__atomic_base<int>> = {static _S_alignment = 4, _M_i = 0}, <No data fields>}
<hkaiser> so it's set to 4 and not to 7
<hkaiser> ahh no, set to zero
<weilewei> starting from line 169, that is the place where all hpx futures start. However, I suspect that those launched futures have not finished
<hkaiser> it's set to zero
<hkaiser> yes
<hkaiser> are you sure you're not sitting in line 196?
<weilewei> In their previous implementation, every time they launch a std::future, several lines later they call future.get(), which makes sure 1 worker finishes
<weilewei> hmm. it breaks at line 181
<hkaiser> it could be called from line 196, no?
<zao> Note frame #5's line number.
<weilewei> Ohh
<weilewei> So that's the starting point that program breaks? zao
<zao> As you're in a catch block there, something has thrown in the associated try block.
<zao> Which would be the get() function of a future, or somehow the iteration over a vector, which I doubt can throw.
<zao> Inspecting the exception may help.
<zao> Figuring out what can make a future throw might be educational, maybe it's forwarding a failure from the task. Maybe it's inconsistent in some way.
<weilewei> I am wondering how a lower line 196 can call a higher line 181
<zao> print_metadata is an anonymous function, a lambda.
<zao> When control flows past the definition, it only initializes the variable, it doesn't run the body.
<zao> It's being invoked at one of two places later on, line 196 and 200.
<weilewei> Oh, I see what you mean, I got it
<zao> :D
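zao's point about lambdas is easy to reproduce in isolation; a minimal standalone sketch (not the DCA code itself) showing why a backtrace can appear to call "backwards" into an earlier line:

```cpp
#include <iostream>
#include <vector>

int main()
{
    // Defining the lambda here only initializes the variable;
    // the body does NOT run at this point.
    auto print_metadata = [](int id) {
        std::cout << "task " << id << '\n';  // a crash or breakpoint in the
                                             // body reports this earlier line
    };

    std::vector<int> ids = {1, 2, 3};

    // Only here, further down, is the body actually executed, so a frame at
    // this later line legitimately calls into the earlier definition line.
    for (int id : ids)
        print_metadata(id);
}
```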
<weilewei> Ok, I think then I should look into why that future throws and what is the failure inside that future
weilewei has quit [Remote host closed the connection]
weilewei has joined #ste||ar
<heller> weilewei: when running in gdb, type catch throw before you run
<heller> Then you break when the exception is being thrown
<weilewei> Ok, let me try that heller
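For context on why the catch block around get() fires: an exception escaping a task launched with hpx::async is stored in the future's shared state and rethrown at get(). A minimal sketch (assuming a running HPX runtime); under gdb, `catch throw` would stop at the original throw site inside the task rather than at the get() call.

```cpp
#include <hpx/include/async.hpp>
#include <iostream>
#include <stdexcept>

int worker()
{
    // An exception escaping the task is captured in the future's shared state.
    throw std::runtime_error("failure inside the task");
}

void example()
{
    hpx::future<int> f = hpx::async(worker);
    try
    {
        f.get();    // the stored exception is rethrown here, in the caller
    }
    catch (std::exception const& e)
    {
        std::cerr << "task failed: " << e.what() << '\n';
    }
}
```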
aserio has joined #ste||ar
<hkaiser> heller: I have all self-contained serialization tests running again (the ones not depending on hpx)
<jbjnr> hmmmm. so it looks like pre-main executes load_components_action, but then doesn't get resumed after
<hkaiser> so the task does not get rescheduled
<jbjnr> hkaiser: got a moment to look at a log?
<hkaiser> sure
<jbjnr> 1 min
<jbjnr> that's a good log using the local scheduler
<jbjnr> that's using the shared-p scheduler. The second log stops there. The first log I truncated just after the same point
<jbjnr> if you look at them alternately, they're the same up to the last bit
<jbjnr> when the local carries on, but the shared stops
<jbjnr> any idea why the pre-main isn't restarted?
<jbjnr> I've no idea what I changed to make it stop working. it was running fine, then it wasn't
<jbjnr> and I can't see anything that's wrong.
<jbjnr> and all tasks in queues etc tally up properly
<heller> hkaiser: nice
<hkaiser> jbjnr: well, the log says that pre_main got rescheduled, no?
<hkaiser> last line: new state(pending), old state(suspended)
<jbjnr> yes
<jbjnr> but then it 'disappears'
<hkaiser> so your scheduler got the task to run
<hkaiser> hmmm
<jbjnr> after that last log line, then the scheduler just polls for tasks and never gets any
<jbjnr> when a task resumes, does it enter the scheduler differently - perhaps I removed an important line or something
<jbjnr> its still in the thread map btw
<jbjnr> the thread map has 1 entry, but the new tasks and work items are empty, so where did it go? it must not have entered the queues - but I am puzzled
<hkaiser> that's where the thread is handed to the scheduler after being resumed
<jbjnr> thanks. I'll put a breakpoint there and see if the thread is lost en-route
<heller> hkaiser: might not have been the best idea to mix the modularization with the new feature
<hkaiser> heller: yah, no other way, I think
<heller> Same path as we went with the collectives, then decoupling
<hkaiser> the change is so pervasive because of the include changes, you can just ignore those
<heller> Yeah
<hkaiser> I'm not going back at this point
<heller> Sure, I understand ;)
<heller> Just might take a little to review
<hkaiser> heller: I need more time before review
<heller> K
<hkaiser> it's usable as a separate library now, but hpx itself doesn't compile yet
<heller> Hmm
<heller> That's very nice already
<heller> Looking forward to that
<heller> I guess what I need in the near future are futures and serialization
<hkaiser> right
<heller> That'll keep me busy for a few months
<heller> First I need the official blessing though ;)
<jbjnr> hkaiser: thanks - found the problem. I was looking in the wrong place entirely
<jbjnr> but tracing from where you linked tracked it for me
<jbjnr> incidentally hkaiser what is the "allow_fallback" flag passed in when scheduling a thread?
<hkaiser> jbjnr: uhh
<hkaiser> where?
<hkaiser> heller: thanks, I screwed that one up :/
<hkaiser> will fix
<hkaiser> just replace this with a return 0
<jbjnr> seems to be related to suspending threads, I didn't really pay attention to it
<jbjnr> doesn't matter, ignore me
weilewei has quit [Remote host closed the connection]
<zao> What's a typical reason for "get_runtime_ptr() != nullptr"? Forgetting to set up HPX at all?
<zao> (messing around with wei's fork of DCA, I've got no idea what it's doing, but it seems to be quite a bit into main by then)
<hkaiser> no idea
<hkaiser> this will be nullptr only if hpx is not running at all
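A quick way to check what hkaiser describes: hpx::get_runtime_ptr() returns nullptr when no HPX runtime is available in the calling context (typically because hpx::init / hpx_main never ran). A minimal sketch:

```cpp
#include <hpx/include/runtime.hpp>
#include <iostream>

void check_runtime()
{
    // nullptr here means no HPX runtime is active in this context, so most
    // HPX APIs (and their internal asserts on get_runtime_ptr()) will fail.
    if (hpx::get_runtime_ptr() == nullptr)
        std::cerr << "HPX runtime is not running\n";
}
```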
aserio has quit [Ping timeout: 250 seconds]
nikunj97 has joined #ste||ar
nikunj has quit [Ping timeout: 240 seconds]
nikunj has joined #ste||ar
weilewei has joined #ste||ar
<zao> Interestingly enough, <hpx/hpx_main.hpp> does not take effect in main_dca.cpp, not sure if there's something I missed building HPX, but it's silently doing nothing.
nikunj97 has quit [Ping timeout: 240 seconds]
<zao> hkaiser: Ended up doing it the explicit way of including <hpx/hpx_init.hpp> and going main -> hpx::init -> hpx_main junk, and the run seems to churn on nicely.
<hkaiser> nod, it needs to be included in the file that defines main()
<zao> I did.
<hkaiser> hmm, then I don't know
<hkaiser> what platform?
<zao> Ubuntu 18.04, GCC 8.3, Boost 1.71
<hkaiser> heh, that should work :/
<zao> No idea how DCA uses HPX.
<hkaiser> me neither ;-)
<hkaiser> ask wei
<weilewei> I remember I had this problem before, but I forget how I solved it. zao
<hkaiser> but explicit is better than implicit anyways
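For reference, a minimal sketch of the explicit pattern zao describes (main -> hpx::init -> hpx_main), as opposed to including <hpx/hpx_main.hpp>, which only works when it is included in the translation unit that defines main():

```cpp
// Explicit initialization instead of <hpx/hpx_main.hpp>:
#include <hpx/hpx_init.hpp>

int hpx_main(int argc, char* argv[])
{
    // ... application code runs on the HPX runtime here ...
    return hpx::finalize();
}

int main(int argc, char* argv[])
{
    // Starts the HPX runtime, runs hpx_main, and shuts it down again.
    return hpx::init(argc, argv);
}
```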
<zao> Ah, you got back, only saw the leave :)
<weilewei> I saw the log, so, did you pass -DHPX_DIR=$PATH_TO_HPX_INSTALL in your cmake?
<zao> I used CMAKE_PREFIX_PATH
<weilewei> and also, did you pass -DDCA_HAVE_HPX=ON
<weilewei> I am clueless, hmmm
<zao> What branch are you working on, and have you ensured that the HPX runtime is actually running?
hkaiser has quit [Ping timeout: 264 seconds]
<zao> Did you figure out what was throwing?
<weilewei> I am working on rostam_test branch
aserio has joined #ste||ar
<weilewei> I have not, still debugging
<weilewei> zao which test/executable are you running?
<zao> main_dca with some JSON generated by cooldown.py
<weilewei> I am running this too, but it does not complain about the HPX runtime
<jbjnr> weilewei:are you using any of my stuff?
<weilewei> jbjnr not really, I tried, but I couldn't get it to run, so I basically replaced their thread pool implementation with hpx::async, futures, etc.
<weilewei> Yours is in the hpx-2018 branch, right? That's where I found you have some threading implementation
<jbjnr> why not copy the thread pool replacement from my branch rather than redo it?
<jbjnr> the branch to use would be the hpx-master from aug 2019
<jbjnr> in the dca repo
<weilewei> jbjnr I did not redo things much, here is my try: https://github.com/weilewei/DCA/blob/rostam_test/include/dca/parallel/hpx/hpxthread.hpp
<jbjnr> can't look now (bed time here), and next week is hackathon, then hpx course <sigh> but I'll try to find some time to look at dca
<weilewei> jbjnr can you show me the link to your branch/repo? I have not found any
<weilewei> yea, sure, when you are free
<jbjnr> on my branch there is a complete replacement for all the threading in dca with hpx; around the end of 2018 it was all working, but stuff might have broken since then - it should be fixable though
<jbjnr> anyway, have a look at it to get ideas
<jbjnr> gtg
jbjnr has quit [Quit: WeeChat 2.5]
<weilewei> so on DCA public repo, I have not seen any hpx branch
<zao> weilewei: In biddisco/DCA, I would reckon.
<weilewei> jbjnr btw, the changes were suggested by Giovannie..
<weilewei> zao Oh, now I see
weilewei has quit [Remote host closed the connection]
hkaiser has joined #ste||ar
aserio has quit [Quit: aserio]
weilewei has joined #ste||ar
Yorlik has joined #ste||ar
jaafar has quit [Ping timeout: 245 seconds]
nikunj has quit [Ping timeout: 265 seconds]
akheir has quit [Quit: Leaving]
weilewei has quit [Remote host closed the connection]