hkaiser changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/ | GSoD: https://developers.google.com/season-of-docs/
diehlpk has joined #ste||ar
K-ballo has quit [Quit: K-ballo]
RostamLog has joined #ste||ar
<jbjnr_> heller: Yorlik - the thing about docs is that if I can write docs/comments in the file where I am writing the code, I'm happy to do it. If I have to write docs in a separate file in some other docs location, then it never gets done and it rapidly goes out of date. I'm even happy to put a doc file in the same location as the header/class/other if I must. My ideal solution is something like doxygen where you document the code 'in place' but get nice output
<jbjnr_> like sphinx.
<Yorlik> Yup. I tend to use doxygen more and more these days, as much as I like sphinx, especially with sphinx-autobuild. Since I learned how to use Doxygen with markdown files I'm using it more now.
<jbjnr_> can you do extended help/docs "in place"?
K-ballo has joined #ste||ar
<zao> Too bad you can’t do rustdoc for C++ :)
<Yorlik> jbjnr_: It's possible - I just tried and it worked. Essentially .md files are found like any source file and put into the docs.
<jbjnr_> good to know, thanks
<jbjnr_> after simbergm spent ages redoing the docs, I doubt he'd want to go back to doxygen...
<Yorlik> Sphinx clearly looks better for normal text, but imo the automated API docs are not good - from a functionality and from a rendering point of view. Browsing and searching with Doxygen is much better.
<Yorlik> I think there are arguments for each of Sphinx and Doxygen. But having two tools instead of just one is kinda cumbersome - for writers and users too.
<Yorlik> When I use sphinx, I do it in a separate IDE - VS Code - and run sphinx-autobuild in a terminal, so every edit gets detected, the files are re-rendered and the browser is auto-updated. I'm not sure there is something comparable on the Doxygen side.
<Yorlik> For me that would actually be a big plus for writing more longish docs.
<Yorlik> I'm kinda torn between the two.
diehlpk has quit [Ping timeout: 240 seconds]
aserio has joined #ste||ar
aserio has quit [Ping timeout: 250 seconds]
aserio has joined #ste||ar
hkaiser has quit [Ping timeout: 268 seconds]
nikunj has joined #ste||ar
aserio has quit [Ping timeout: 245 seconds]
<simbergm> jbjnr_ and the rest: I don't mind better solutions for the api docs at all
<simbergm> there's just so much non-api documentation that we shouldn't lose and I don't know how easy that is to integrate with a pure doxygen approach
aserio has joined #ste||ar
nikunj has quit [Ping timeout: 276 seconds]
rori has joined #ste||ar
hkaiser has joined #ste||ar
aserio has quit [Ping timeout: 264 seconds]
aserio has joined #ste||ar
K-ballo1 has joined #ste||ar
K-ballo has quit [Ping timeout: 245 seconds]
K-ballo1 is now known as K-ballo
<Yorlik> simbergm: yt?
<heller> hkaiser: Holding lock while suspending
<heller> hkaiser: why do you ask?
<heller> Don't remember if it was the allocator or the thread data ctor
nikunj has joined #ste||ar
<simbergm> Yorlik: on and off, but ask away, I'll reply later
<Yorlik> I am working on a PR for some docs, just wondering where I should put stuff that is not in-source.
aserio has quit [Ping timeout: 245 seconds]
rori has quit [Quit: bye]
<simbergm> Yorlik: where would you like to see it? ;) depends what it is... if it's about the action/component stuff you've been looking at maybe here: https://github.com/STEllAR-GROUP/hpx/blob/master/docs/sphinx/manual/writing_distributed_hpx_applications.rst? if it's more of a tutorial/cookbook-type thing we could probably add a new top-level section for it
<heller> Yorlik: the docs dir is a good place to start
<hkaiser> heller: just wondering
<Yorlik> I have a writeup which explains the general mechanics of component and action declaration, definition/registration and what they do.
<hkaiser> we see a strange segfault on power in the thread_queue while creating new threads, so I was trying to understand things
<heller> Hmmm
<heller> Well, remove it, ignore locks and try again
<heller> Stack overflow maybe?
<Yorlik> simbergm: I want to put macro specific explanation into the source files as Doxygen comments, but an introductory writeup to get rid of this "black magic" feeling in the general docs
<Yorlik> We could also do it so that I first finish it and then ask you for a review before creating a PR.
<heller> Yorlik: which macros?
<Yorlik> The general Action/Component registration/declaration/definition thingies
<Yorlik> There are gaps in the doxy comments
<heller> Ok, we already have a section for it
<heller> For the general action component foo
<simbergm> Yorlik: it sounds like that file I linked to above could be a good place, but it depends on what specifically you have
<simbergm> also, we'd love to know what you think... if you had looked for that information in the documentation, where would you have expected to find it?
<Yorlik> I'd suggest we maybe do a voice session about the docs once you have time - there's so much to say and discuss.
<heller> hkaiser: do you have a backtrace?
nikunj has quit [Ping timeout: 240 seconds]
<hkaiser> heller: sec
<hkaiser> heller: stack-overflow: unlikely, it's in the scheduler, i.e. on a normal stack
nikunj has joined #ste||ar
<heller> hkaiser: looks like no thread has been created
<hkaiser> well, it's some non-sensical pointer value
<heller> hkaiser: can you ask him to go to frame 9 and print thrd
<hkaiser> not sure if it's the new thread or any other thread in the unordered map
<hkaiser> nod
<heller> Easy to find out...
<hkaiser> right
<heller> But yeah, looks like a race
<hkaiser> I asked him to come here
<hkaiser> nod
<hkaiser> that's what I think as well
<hkaiser> heller: I touched up on your tuple changes, could you have a look, pls
weilewei has joined #ste||ar
<hkaiser> hey weilewei, there you are
<hkaiser> weilewei: we were discussing your segfault
<heller> hkaiser: thanks for that, guess that's good to go now
<weilewei> Hi, I am here now
<heller> Hi weilewei
<hkaiser> heller: pls ask your questions ;-)
<heller> Is the segfault easily reproducible?
<weilewei> I am not sure how to reproduce, though
<heller> weilewei: so, please go to frame 9 and print thrd
<weilewei> how should I go to frame 9?
<heller> up 9, iirc
<heller> Or hit up 9 times
<weilewei> you mean get call stacks? I am not familiar with what you are suggesting
<heller> You should end up in thread_queue.hpp line 228
<heller> In your gdb prompt
<hkaiser> weilewei: run in gdb until it stops at the segfault
<hkaiser> issue commands: up 9 and p trhd
<heller> Type up and hit enter until you get to the mentioned line
<weilewei> ok, let me try
weilewei64 has joined #ste||ar
<weilewei64> [Inferior 1 (process 108443) exited with code 01]
<weilewei64> Missing separate debuginfos, use: debuginfo-install glibc-2.17-260.el7_6.6.ppc64le libxml2-2.9.1-6.el7_2.3.ppc64le xz-libs-5.2.2-1.el7.ppc64le
<weilewei64> (gdb) up 9
<weilewei64> No stack.
<weilewei64> (gdb) p trhd
<weilewei64> No symbol "trhd" in current context.
<heller> Did you hit the segfault?
<weilewei64> so after i type up 9, it says no stack
<heller> It says the process exited...
<heller> What was printed before?
<weilewei64> let me try it again, I forget to source things
<weilewei64> (gdb) up 9
<weilewei64> #9 0x000020000110632c in hpx::threads::policies::thread_queue<std::mutex, hpx::threads::policies::lockfree_fifo, hpx::threads::policies::lockfree_fifo, hpx::threads::policies::lockfree_lifo>::add_new (this=0x20001a0c0000, add_count=997, addfrom=0x20001a0c0000, lk=..., steal=false) at /gpfs/alpine/proj-shared/cph102/weile/dev/src/hpx/hpx/runtime/threads/policies/thread_queue.hpp:228
<weilewei64> 228 thread_map_.insert(thrd);
<weilewei64> (gdb) p thrd
<weilewei64> $1 = {thrd_ = 0x20001a063900}
<hkaiser> so the new thread got created just fine
<heller> weilewei64: can you print the entire backtrace again please?
<heller> And paste it
<heller> weilewei64: can you do a p &thrd please?
<weilewei64> (gdb) p &thrd
<weilewei64> $2 = (hpx::threads::thread_id_type *) 0x200019bfc1a8
<heller> Hmmm
<heller> Does this only happen with your application?
<weilewei64> I am not sure about other applications, as I am only running this application
<weilewei64> the DCA++ project on Summit
<heller> How certain are you that you built hpx and your application against the same version of gcc as the one you have sourced right now?
<heller> Do you know in which phase of your program this segfault happens?
<weilewei64> They all use gcc/8.1.1 when I am building my application and hpx, otherwise, hpx will report that the compilers are not consistent
<weilewei64> how many phases do I have in a program? My understanding is that the segfault happens somewhere in the middle of the program
<heller> Ok
<heller> Does it also happen when you say --hpx:threads=1 when starting the program?
<weilewei64> Yes, that's my command --hpx:threads=1
<weilewei64> when getting this segfault
<heller> Ok, so it can't be a race
<Yorlik> I didn't find any "doc" targets in the HPX CMake- or VS- targets after building the CMake cache (using Ninja and the VS Generator). Do they just not exist? OFC - I can just build the docs manually - just wanna know.
<weilewei64> my whole command is jsrun -a 1 -n 1 -c 8 --smpiargs=none gdb --args ./dca_sp_DCA+_thread_test --hpx:threads=1
<heller> Interesting
<heller> Gotta run soon...
<weilewei64> Or if you have any other small applications that I can try, I can run them to see if I have the same segfault
<heller> weilewei64: can you run with a larger stack size?
<hkaiser> heller: this is on the system stack
<weilewei64> what command and number should I use?
<weilewei64> hpx.stacks.small_size=?
<hkaiser> weilewei64: I don't think this would change the stack in this context
<heller> hkaiser: what if a hpx task with a corrupted stack corrupted the thread_map?
<hkaiser> ahh, ok - could be
<heller> Anyways, the only explanation I have is that hpx and the application were built against a different gcc/stdlib than what you currently have in your environment
<heller> I'm on the run now, will check back later
<weilewei64> thanks! I am doing double check
aserio has joined #ste||ar
<heller> weilewei64: can you show the entire output of the segfault when not running in gdb?
<weilewei64> Ok, I have to hit lots of times Enter, let me try
<weilewei64> So, when I am building hpx, I turn off the network and without mpi, the compiler is gcc/8.1.1
<weilewei64> when I am building DCA++, it involves spectrum-mpi
<weilewei64> //CXX compiler
<weilewei64> CMAKE_CXX_COMPILER:FILEPATH=/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/gcc-8.1.1/spectrum-mpi-10.3.0.1-20190611-55ygkz53evhcwy3txeis32gc3kzu7wy6/bin/mpicxx
<weilewei64> //A wrapper around 'ar' adding the appropriate '--plugin' option
<weilewei64> // for the GCC compiler
<weilewei64> CMAKE_CXX_COMPILER_AR:FILEPATH=/sw/summit/gcc/8.1.1/bin/gcc-ar
<weilewei64> this is the CMakeCache output of DCA++
<weilewei64> on my current interactive environment, I have gcc/8.1.1 and spectrum-mpi/10.3.0.1-2019061
<weilewei64> I am not sure if that's what you mean that I have different compilers?
<weilewei64> Or let me know how should I examine/check if the compilers are different
K-ballo has quit [Quit: K-ballo]
<hkaiser> could it be the same compiler using different ABIs (C++11 vs. C++17 or somesuch)?
<hkaiser> hpx defaults to C++17 nowadays
<hkaiser> not sure what DCA++ does
<weilewei64> When I am building hpx, I set it to 14, but I do not remember if I set DCA++ to 14
<hkaiser> I think it defaults to c++14
<hkaiser> iirc
<weilewei64> I can try to set (CMAKE_CXX_STANDARD 14) for both
<hkaiser> weilewei64: for HPX add -DHPX_WITH_CXX14=On
<hkaiser> dca++ is hardcoded to use -std=c++14
<weilewei64> Ok, let me try to rebuild hpx and dca++
<weilewei64> hkaiser thanks
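Since the exchange above turns on whether HPX and DCA++ were compiled with the same C++ standard and standard-library ABI, one way to check is to compile the same small helper into both builds and compare what it reports. This is a minimal sketch, assuming GCC or Clang with libstdc++. The names `cxx_standard_year` and `build_fingerprint` are hypothetical and not part of HPX or DCA++.

```cpp
#include <string>

// Map a __cplusplus value to the short year of the standard it denotes.
constexpr int cxx_standard_year(long v)
{
    return v >= 202002L ? 20
         : v >= 201703L ? 17
         : v >= 201402L ? 14
         : v >= 201103L ? 11
         : 98;
}

// Describe how the current translation unit was compiled.
// Hypothetical helper: build it into both projects and diff the strings.
inline std::string build_fingerprint()
{
    std::string s = "c++" + std::to_string(cxx_standard_year(__cplusplus));
#if defined(_GLIBCXX_USE_CXX11_ABI)
    // libstdc++ only: 1 selects the C++11 std::string/std::list ABI, 0 the old one.
    s += " glibcxx_cxx11_abi=" + std::to_string(_GLIBCXX_USE_CXX11_ABI);
#endif
#if defined(__VERSION__)
    // Compiler version string as provided by GCC and Clang.
    s += " compiler=" __VERSION__;
#endif
    return s;
}
```

If the two strings differ between the HPX build and the application build, that points at the kind of mismatch suspected here. Identical strings do not prove the builds are ABI-compatible.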
weilewei64 has quit [Remote host closed the connection]
<heller> hkaiser: isn't phylanx running properly on power?
<hkaiser> it does on power8, no idea about power9
<heller> weilewei: another thing, is the machine you compile on the same as the host you run on?
<hkaiser> heller: it's Summit
<heller> Sure, I just don't remember the details :p
<hkaiser> I think it's a homogeneous machine, but it's worth verifying, indeed
weilewei12 has joined #ste||ar
<hkaiser> weilewei12: do you compile things on the headnode or on one of the compute nodes?
<weilewei12> Oh, I am now compiling hpx on a compute node
<weilewei12> should I do it on the headnode
<hkaiser> that's good
<hkaiser> no, compile on the same/similar node as you run
<weilewei12> Ok
aserio1 has joined #ste||ar
<heller> Just checked, same architecture
<hkaiser> heller: ok
aserio has quit [Ping timeout: 250 seconds]
aserio1 is now known as aserio
<hkaiser> strange
<hkaiser> weilewei12: this segfault happens when using openblas only, right?
<hkaiser> doesn't happen with a different blas library, does it?
<weilewei12> no, it does not happen to another one, which is netlib-lapack
<weilewei12> only to openblas
<hkaiser> what about essl?
<weilewei12> essl is another problem
<hkaiser> essl was failing with cuda enabled, yah, what about not using cuda?
<weilewei12> iirc, that's some arguments not valid? I am not sure
<hkaiser> the cuda issue is for giovanni to resolve
<weilewei12> ok, I can try the version: essl + netlib-lapack + no cuda later
<hkaiser> sure
<hkaiser> also, let's get back to the hpx tests to make sure everything is ok there
<weilewei12> Sure, yesterday I tried hpx future overhead test, using --hpx:threads=42, the error is core dump
<weilewei12> with 1 thread, hpx is fine
<weilewei12> up to less than 16, I guess
<hkaiser> weilewei: try reducing the number of threads to a minimum reproducing the issue, then run in gdb and show us the backtrace
K-ballo has joined #ste||ar
<weilewei12> still failed, after using same c++ version heller hkaiser
<simbergm> not sure I'm being helpful but the future overhead has some problems, the sliding semaphore one is not quite right
<hkaiser> weilewei12: ok, thanks - just making sure...
<simbergm> Yorlik we could try to talk next week
<simbergm> Maybe easier once you have some text
<hkaiser> simbergm: ok, thanks for the heads up
<Yorlik> I already have some - still polishing.
<Yorlik> I finally managed to build the docs
<weilewei12> simbergm how about the stream test? It also failed when using more threads
<hkaiser> weilewei12: so the future-overheads test has problems anyways, let's not use that
<Yorlik> So I can try if my doxy comment additions are OK
<heller> weilewei12: alright, next step, increasing the stack size...
<weilewei12> heller what command and parameters should I use?
<simbergm> On the other hand it might be better to talk before you write too much...
<simbergm> Next week earliest in any case
<heller> weilewei12: --hpx:ini=hpx.stacks.small_size=0x0200000 --hpx:ini=hpx.stacks.medium_size=0x0200000
<heller> weilewei12: so it only happens when using a specific version of a blas library?
<weilewei12> heller it happens to a specific implementation of a blas library, openblas segfault, netlib-lapack OK
<weilewei12> heller should I use two args at the same time?
<heller> weilewei12: yes please
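The two `--hpx:ini` values suggested above are hexadecimal byte counts. As a quick check of what `0x0200000` means, here is a minimal sketch in plain C++ that does not need HPX. The helper names `parse_hex_size` and `to_mib` are hypothetical.

```cpp
#include <cstddef>
#include <string>

// The value passed for both small_size and medium_size in the chat.
constexpr std::size_t requested_stack_bytes = 0x0200000;

// 0x0200000 bytes is exactly 2 MiB (2 * 1024 * 1024 = 2097152).
static_assert(requested_stack_bytes == 2u * 1024u * 1024u,
    "0x0200000 should equal 2 MiB");

// Parse a hexadecimal size such as "0x0200000" into a byte count.
// std::stoull with base 16 accepts an optional "0x" prefix.
inline std::size_t parse_hex_size(const std::string& text)
{
    return static_cast<std::size_t>(std::stoull(text, nullptr, 16));
}

// Convert a byte count to whole mebibytes (rounding down).
constexpr std::size_t to_mib(std::size_t bytes)
{
    return bytes / (1024u * 1024u);
}
```

So both options ask for 2 MiB stacks for small and medium HPX threads.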
<heller> can you comment out the blas calls in DCA++ and try again?
<weilewei12> also, the hpx I used here has this line commented out: https://github.com/STEllAR-GROUP/hpx/blob/master/hpx/runtime/threads/policies/thread_queue.hpp#L181
<weilewei12> hmmm, there will be lots of blas calls here and there in DCA++
<heller> do you have your gdb still open?
<heller> please do a up 9 and then p thread_map_ please
Vir has quit [Ping timeout: 264 seconds]
<heller> weilewei12: p thread_map_count_ please
<weilewei12> $2 = {<std::__atomic_base<long>> = {static _S_alignment = 8, _M_i = 33}, <No data fields>}
<heller> hmpf. that's consistent :/
<weilewei12> consistent with what? the number of threads?
<heller> _M_i = 33
<heller> and the number of elements in the map
<weilewei12> ok
<heller> weilewei12: can you do a ldd on your binary please?
<Yorlik> Does HPX_REGISTER_ACTION_DECLARATION have to use the same action name as HPX_DEFINE_PLAIN_ACTION? Or can I assign an arbitrary new action name for remote calls, as long as it is compatible with the serialization identifiers?
<Yorlik> Especially: Would this example be correct?:
<Yorlik> namespace app
<Yorlik> {
<Yorlik>     void some_global_function(double d)
<Yorlik>     {
<Yorlik>         cout << d;
<Yorlik>     }
<Yorlik>     // This will define the action type 'app::some_global_action' which
<Yorlik>     // represents the function 'app::some_global_function'.
<Yorlik>     HPX_DEFINE_PLAIN_ACTION(some_global_function, some_global_action);
<Yorlik> }
<Yorlik> // Outside the namespace (!) :
<Yorlik> // Assigning a name to the action which is usable in serialization for
<Yorlik> // remote calls. The namespace double colons would pose a problem otherwise.
<Yorlik> // It's a little bit like a typedef.
<Yorlik> HPX_REGISTER_ACTION_DECLARATION(app::some_global_action, app_some_global_function)
<Yorlik> ----
<weilewei12> hkaiser so, I am testing dca_sp_DCA+_thread_test, version 1: with essl, no cuda, with netlib-lapack, result: the program hangs forever if I run it under gdb, or segfaults if I do not use gdb. Any suggestion for how to get call stacks when it hangs?
<weilewei12> hkaiser version 2: without essl, no cuda, with netlib-lapack, result: passed
<weilewei12> hkaiser for hpx stream_test, when I use --hpx:threads>=17, only the computed result fails; the whole program finishes its run (no segfault)
<weilewei12> hkaiser https://gist.github.com/weilewei/f94b89262188f22bb25761a1d3e02851#gistcomment-3037528. corrected: --hpx:threads=17 works fine, however, when >=18 failed
<weilewei12> sorry if I am posing too many questions at once.
aserio has quit [Ping timeout: 245 seconds]
aserio has joined #ste||ar
<heller> Yorlik: alright
<Yorlik> Legal?
<heller> oh, didn't look ;)
<Yorlik> What's alright then?
* Yorlik is puzzled
<heller> I have something to show ..
<Yorlik> DOCs?
<heller> in approx. 5 minutes or so
<heller> yes
<Yorlik> Cool!
<heller> so yes, we generally recommend to use the same name
<Yorlik> Wouldn't you have to define the name which is registered outside the namespace?
<heller> it's not strictly required right now, but we want to spell it like that
<heller> that is, the name is ignored in the HPX_DEFINE_PLAIN_ACTION at the moment
<Yorlik> Actually I am asking myself what HPX_DEFINE_PLAIN_ACTION is really good for.
<Yorlik> IC.
<heller> it's just a typedef
<Yorlik> After all you define an action to use it remotely, otherwise you could just call the function more efficiently locally.
<Yorlik> So - I don't really get the meaning of inside namespace definitions
<heller> well, where do you like to have your name some_global_action, in your namespace or outside of it?
<heller> the other one is different, as it needs to add some specializations to the hpx namespace
<Yorlik> Does it do anything useful in there?
<heller> it's your name
<heller> it's your structure
<Yorlik> Or: Would you ever define without registering?
<heller> yes
<Yorlik> For which use case?
<heller> when you don't care about platform independence
<Yorlik> I'd disallow that by default tbh.
<Yorlik> It's error prone
<Yorlik> It also would save you from having a single macro which is allowed to live inside a namespace
<heller> have you seen this?
<heller> ahh, that's where you got it from ;)
<Yorlik> If you want to use it remotely, you still have to register it
<heller> d'oh
<heller> not necessarily
<Yorlik> How that?
<Yorlik> Inside a namespace
<heller> there is some auto registration magic going on...
<Yorlik> Any namespaced action name would bust
<heller> behind the scenes
<Yorlik> I would strongly suggest to entirely get rid of that Macro
<Yorlik> it's confusing, not very helpful and error prone.
<Yorlik> At the moment I think it does more harm than it actually helps.
<Yorlik> I'd suggest to mark it as deprecated.
<Yorlik> Or explicitly require a define to allow using it
<heller> so, how do you like that?
<Yorlik> That looks nice !
<heller> which macro are we talking about? the one that has to be invoked in the global namespace?
<Yorlik> The css still is horrible, but that can be fixed later
<Yorlik> No. the other one
<heller> that's needed to have the action type in the first place :P
<Yorlik> HPX_DEFINE_PLAIN_ACTION
<Yorlik> No
<Yorlik> You can just use HPX_PLAIN_ACTION
<heller> yes.
<heller> no.
<heller> that defines a new action type and registers it at the same time
<Yorlik> I thought HPX_PLAIN_ACTION alone is enough in the header
<Yorlik> OFC you'd still need the source part
<heller> no ;)
<heller> HPX_PLAIN_ACTION defines the action type and registers it
<Yorlik> Yes
<Yorlik> So - you won't need HPX_DEFINE_PLAIN_ACTION anymore
<heller> right
<heller> BUT
<heller> HPX_PLAIN_ACTION *needs* to appear exactly once, preferably in a source file
<heller> if you put it in the header, and include the header in two different TUs, you get an ODR violation
<Yorlik> HPX_DEFINE_PLAIN_ACTION without HPX_REGISTER_ACTION_DECLARATION usually is pointless
<heller> no
<Yorlik> FFS - this is a mess - lol
<heller> HPX_REGISTER_ACTION_DECLARATION without HPX_REGISTER_ACTION_DEFINITION is pointless
<heller> so, there are three things going on here
<heller> 1) Defining the action type
<Yorlik> I demand an appear.in session. lol
<heller> ok
<heller> hkaiser: feel free to join ;)
<heller> need to reboot real quick
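To make the "register exactly once, under a name usable for serialization" point from the exchange above concrete, here is a toy registry in plain C++. It only illustrates the idea and is not how HPX implements action registration. `toy_action_registry`, `register_action`, and `invoke` are hypothetical names, and the ban on `::` in names is an assumption taken from Yorlik's remark about double colons.

```cpp
#include <functional>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>

// Toy stand-in for a name -> action table. Not HPX code.
class toy_action_registry
{
public:
    // Register a handler under a flat serialization name.
    // Assumption for this sketch: names must not contain "::".
    // A second registration under the same name is rejected, loosely
    // mirroring the "exactly once" requirement discussed in the chat.
    void register_action(const std::string& name,
        std::function<void(double)> handler)
    {
        if (name.find("::") != std::string::npos)
            throw std::invalid_argument("name must not contain '::': " + name);
        if (!table_.emplace(name, std::move(handler)).second)
            throw std::logic_error("action registered twice: " + name);
    }

    // Look the handler up by name, as a receiving side might do after
    // reading the name from a message. Returns false if the name is unknown.
    bool invoke(const std::string& name, double arg) const
    {
        auto it = table_.find(name);
        if (it == table_.end())
            return false;
        it->second(arg);
        return true;
    }

private:
    std::map<std::string, std::function<void(double)>> table_;
};
```

In this toy model, registering the same name from two places fails loudly at run time. With the real macros the analogous mistake shows up as the ODR violation heller mentions.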
weilewei12 has quit [Remote host closed the connection]
weilewei has quit [Ping timeout: 245 seconds]
K-ballo has quit [Quit: K-ballo]
weilewei46 has joined #ste||ar
K-ballo has joined #ste||ar
weilewei has joined #ste||ar
nikunj has quit [Quit: Bye]
nikunj has joined #ste||ar
<nikunj> @hkaiser: yt?
aserio has quit [Ping timeout: 250 seconds]
K-ballo has quit [Quit: K-ballo]
K-ballo has joined #ste||ar
weilewei46 has quit [Remote host closed the connection]
weilewei has quit [Ping timeout: 240 seconds]
hkaiser has quit [Ping timeout: 264 seconds]
aserio has joined #ste||ar
aserio has quit [Quit: aserio]
hkaiser has joined #ste||ar
<hkaiser> nikunj: here now
<nikunj> hkaiser, hi!
<hkaiser> hey
<nikunj> are you free for a call tomorrow?
<nikunj> I have made some progress with FleCSI
<hkaiser> nikunj: yes, definitely - sorry for not responding
<nikunj> I learnt MPI and finally understand how it works
<hkaiser> good, we should include Rod, then
<nikunj> I also wrote some code to gain confidence
<hkaiser> ok
<nikunj> yes we should also include rod
<nikunj> as for the FleCSI code, I couldn't quite understand how they're using MPI
<hkaiser> could you send an email to him coordinating a meeting, pls?
<nikunj> that's why I needed your help with their MPI architecture
<nikunj> sure, I can do that
<hkaiser> thanks
<nikunj> that was the update that I had to give
<nikunj> it isn't much in terms of sheer work, but I think I can catch up on things quickly now
weilewei has joined #ste||ar
<hkaiser> nikunj: ok, cool
<hkaiser> would be nice if you helped with moving things ahead
<nikunj> yes, I will try to contribute to the code once I understand the MPI architecture
<hkaiser> might be more important to understand FleCSI, not the MPI stuff
<hkaiser> but that's what Rod can explain
<nikunj> ok