K-ballo changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/
weilewei has joined #ste||ar
sestro[m]1 has joined #ste||ar
gdaiss[m]1 has joined #ste||ar
mariella[m]2 has joined #ste||ar
hkaiser has quit [Quit: bye]
khuck has quit [*.net *.split]
mariella[m] has quit [*.net *.split]
sestro[m] has quit [*.net *.split]
gdaiss[m] has quit [*.net *.split]
spring[m] has quit [*.net *.split]
khuck has joined #ste||ar
spring[m] has joined #ste||ar
bita has quit [Ping timeout: 244 seconds]
nanmiao11 has quit [Remote host closed the connection]
weilewei has quit [Remote host closed the connection]
bita has joined #ste||ar
shahrzad has quit [Quit: Leaving]
akheir has quit [Quit: Leaving]
bita has quit [Ping timeout: 260 seconds]
hkaiser has joined #ste||ar
akheir has joined #ste||ar
nanmiao11 has joined #ste||ar
weilewei has joined #ste||ar
<weilewei> can someone take a look at issue #4878?
<hkaiser> weilewei: how can I reproduce this?
<weilewei> hkaiser not sure how, I am trying to use gdb to debug 1 rank. But when I set a breakpoint with b MPI_Finalize, the application does not stop there
<hkaiser> weilewei: is MPI_INIT called at all?
bita has joined #ste||ar
weilewei has quit [Remote host closed the connection]
weilewei has joined #ste||ar
<weilewei> hkaiser yes, MPI_Init is called inside HPX mpi first
<weilewei> see the backtrace info in last two comments: https://github.com/STEllAR-GROUP/hpx/issues/4878
<weilewei> I believe inside HPX mpi, MPI_Finalize is called first, then when some MPI functions inside DCA are called after that, it crashes the program
<weilewei> cc ms[m]
<hkaiser> weilewei: how is hpx initialized in dca?
<weilewei> hkaiser include hpx_init.hpp
<hkaiser> that doesn't initialize hpx
<hkaiser> do you call hpx_init explicitly?
<weilewei> @hk
<weilewei> yes
<hkaiser> ok
<hkaiser> ok, so hpx_init ultimately calls mpi_init
<weilewei> ok
<weilewei> maybe we don't want to call MPI_Finalize in hpx, and let DCA call MPI_Finalize
<hkaiser> just call mpi_init before calling hpx_init and mpi_finalize after hpx_init has returned
<hkaiser> that should do the trick
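A minimal sketch of the ordering suggested here, assuming HPX skips MPI setup/teardown when MPI is already initialized (as stated later in the discussion): initialize MPI in main() before hpx::init and finalize it only after hpx::init returns.

```cpp
// Sketch, not DCA's actual code: the caller owns MPI init/finalize,
// and HPX runs in between.
#include <hpx/hpx_init.hpp>
#include <mpi.h>

int hpx_main(int argc, char* argv[])
{
    // ... application work using HPX (and MPI) ...
    return hpx::finalize();
}

int main(int argc, char* argv[])
{
    MPI_Init(&argc, &argv);               // caller initializes MPI first
    int const ret = hpx::init(argc, argv);
    MPI_Finalize();                       // finalize only after HPX has shut down
    return ret;
}
```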
<khuck> hey all - is there a way at runtime to disable the HPX error handling? I am getting a segmentation violation but the program doesn't drop a core because the HPX error handler is calling "exit"
<khuck> and the handler isn't giving me a stack backtrace
<hkaiser> khuck: don't think so
<khuck> hrm
shahrzad has joined #ste||ar
<khuck> is there a way to disable the signal handler at configuration/build time?
<weilewei> hkaiser here is my code: https://gist.github.com/weilewei/7edaddccd0e088865d6e4cc96720545e the MPI_Finalize is wrapped into the Concurrency object I believe
<hkaiser> khuck: the signal handlers are set here: sigact
<khuck> thanks
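A generic POSIX workaround sketch, not an HPX facility: re-install the default SIGSEGV disposition after HPX has set its own handlers, so a segfault dumps a core file instead of going through HPX's termination path.

```cpp
#include <csignal>

// Call this after HPX startup to undo the SIGSEGV handler installation
// and fall back to the default action (terminate and dump core).
void restore_default_segv_handler()
{
    struct sigaction act {};
    act.sa_handler = SIG_DFL;    // default disposition
    sigemptyset(&act.sa_mask);
    act.sa_flags = 0;
    sigaction(SIGSEGV, &act, nullptr);
}
```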
<weilewei> hkaiser how should I move mpi_finalize to after hpx_init has returned?
<hkaiser> is dca calling mpi_init and mpi_finalize somewhere?
<weilewei> Yes
<hkaiser> does it make sure not to call finalize if it didn't call init (or if mpi_init failed)?
<weilewei> The MPI_init is protected but finalize is not protected: https://gist.github.com/weilewei/8c492c97999231302c71c35dd3d8e03d
<hkaiser> weilewei: it could be that it is calling mpi_init after hpx has already called it, which would result in its own call failing, in which case it shouldn't call mpi_finalize either?
<hkaiser> weilewei: well, there you go
<weilewei> I added MPI_Init protection, saying that if somewhere else initializes MPI, then don't initialize
<hkaiser> sure, but then you shouldn't call finalize either
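A minimal sketch of the guard being discussed, using MPI_Initialized/MPI_Finalized; the class and member names are illustrative, not DCA's actual code.

```cpp
#include <mpi.h>

// Only initialize MPI if nobody else did, and only finalize what we initialized.
class GuardedMpi
{
public:
    GuardedMpi(int& argc, char**& argv)
    {
        int already_initialized = 0;
        MPI_Initialized(&already_initialized);
        if (!already_initialized)
        {
            MPI_Init(&argc, &argv);
            owns_mpi_ = true;      // remember that we were the ones to init
        }
    }

    ~GuardedMpi()
    {
        int finalized = 0;
        MPI_Finalized(&finalized);
        if (owns_mpi_ && !finalized)
            MPI_Finalize();        // skip finalize if init was done elsewhere
    }

private:
    bool owns_mpi_ = false;
};
```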
<weilewei> But the error message is "The MPI_Comm_free() function was called after MPI_FINALIZE was invoked", so shall I protect the MPI_Comm_free() function first?
<hkaiser> no
<hkaiser> who is calling that comm_free?
<weilewei> DCA is calling comm_free
<weilewei> and HPX calls MPI_FINALIZE before DCA calls comm_free
<hkaiser> well, I don't know - you have to make sure that mpi is finalized by the same code that has initialized it
<hkaiser> hpx uses the mpi_environment (which you have already found), it's the only place where we do that
<hkaiser> for dca, I don't know what's going on
<weilewei> so I think comm_free should be protected. In DCA, when comm_free is called, DCA's finalize has not been called yet
<hkaiser> no don't do that
<hkaiser> don't work around the issue, fix it
<hkaiser> find out who is calling init and finalize and make sure it happens at the correct times
<weilewei> HPX calls finalize way earlier than the correct time
<hkaiser> well, find out why and prevent it from happening
<hkaiser> alternatively, let dca handle the mpi initialization
<hkaiser> (before/after hpx is active)
<weilewei> how? I think I want DCA to handle finalize
<hkaiser> this will prevent hpx from calling init and finalize
<hkaiser> as I said, call init before hpx_init() and finalize after hpx_init has returned, i.e. use your MPIInitializer class in main()
<weilewei> hmm let me see how I can do it
<hkaiser> also, make sure not to call finalize there if you didn't call init
<weilewei> so hpx::init will actually call hpx_main?
jaafar has quit [Remote host closed the connection]
<weilewei> If I create a DCA concurrency object (which calls mpi init) before hpx::init, how can I pass it to hpx_main(int argc, char** argv)
<weilewei> hkaiser ^^
jaafar has joined #ste||ar
nanmiao11 has quit [Remote host closed the connection]
<hkaiser> weilewei: why do you have to?
<hkaiser> but you can certainly bind arguments to be passed through to hpx_main
<weilewei> yes, how can I do so?
<hkaiser> hpx_init(f, argc, argv) will call 'f' as hpx_main
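A sketch of passing extra state through to the HPX entry point, assuming an hpx::init overload that accepts a callable with the int(int, char**) signature (as described above). 'Concurrency' is a stand-in for DCA's object, not a real type name here.

```cpp
#include <hpx/hpx_init.hpp>

// Hypothetical stand-in for DCA's concurrency type.
struct Concurrency
{
    Concurrency(int&, char**&) {}
};

int main(int argc, char* argv[])
{
    Concurrency concurrency(argc, argv);   // created before HPX starts

    // 'f' runs as hpx_main; 'concurrency' is captured by reference.
    auto f = [&concurrency](int ac, char** av) -> int {
        // ... use 'concurrency' here ...
        return hpx::finalize();
    };

    return hpx::init(f, argc, argv);
}
```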
<hkaiser> weilewei: why do you need to create a dca concurrency object in main?
<hkaiser> wouldn't the MPIInitializer suffice?
<weilewei> Because that's how DCA initializes MPI
parsa has quit [Quit: Free ZNC ~ Powered by LunarBNC: https://LunarBNC.net]
<weilewei> ahhh, I finally got it, I just call MPI_Init before hpx::init; MPI is then already initialized, so constructing a Concurrency object inside main will not call MPI_Init again (it is protected)
parsa has joined #ste||ar
<weilewei> now it runs fine, hkaiser thanks!
<hkaiser> right
<hkaiser> just fix it not to call finalize either
<weilewei> Sure
<hkaiser> or just use the MPIInitializer directly
<hkaiser> no need to call init explicitly
<weilewei> right, let me clean up my code
parsa has quit [Quit: Free ZNC ~ Powered by LunarBNC: https://LunarBNC.net]
parsa has joined #ste||ar
<weilewei> MPIInitializer has a protected constructor
<weilewei> I can't call it directly
<hkaiser> well, change that ;-)
<hkaiser> or derive a class that has a public constructor
<weilewei> Not very doable if I construct an MPIInitializer before hpx::init, because I get this error: The MPI_Comm_dup() function was called after MPI_FINALIZE was invoked.
<weilewei> The first MPI_FINALIZE is being called in the destructor of MPIInitializer
<weilewei> It seems the object gets destroyed before calling hpx::init
<weilewei> hkaiser ^^
<hkaiser> weilewei: sure
<hkaiser> you didn't create an object, just a temporary
<hkaiser> weilewei: write dca::parallel::MPIInitializer init(argc, argv); instead
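A sketch of the lifetime difference being pointed out, assuming the MPIInitializer constructor has been made accessible (or a derived class with a public constructor is used, as suggested earlier); hpx_main is assumed to be defined elsewhere.

```cpp
#include <hpx/hpx_init.hpp>
// plus the DCA header that declares dca::parallel::MPIInitializer

int main(int argc, char* argv[])
{
    // Temporary: destroyed at the end of this statement, so its destructor
    // (which calls MPI_Finalize) runs before hpx::init is even reached.
    // dca::parallel::MPIInitializer(argc, argv);

    // Named object: destroyed only when main() returns, i.e. after
    // hpx::init has come back.
    dca::parallel::MPIInitializer init(argc, argv);

    return hpx::init(argc, argv);
}
```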
<khuck> hkaiser: you'll be happy to know there are actually 6 signal handlers in HPX, and it took me this long to track down which one was getting triggered
<hkaiser> khuck: heh
<hkaiser> I was not aware of that :/
<weilewei> hkaiser it works! But I got another error: The MPI_Barrier() function was called after MPI_FINALIZE was invoked.
<weilewei> not sure who calls MPI_Barrier(), I did not find it in DCA or HPX
<hkaiser> weilewei: urgs
<hkaiser> now finalize should be called only in the destructor of your init object
parsa has quit [Quit: Free ZNC ~ Powered by LunarBNC: https://LunarBNC.net]
<hkaiser> khuck: ahh, yes! the stack overflow handler
parsa has joined #ste||ar
<khuck> weilewei: apex calls MPI_Barrier in the OTF2 finalization, which is called from apex::finalize. But that also includes a wrapper around MPI_Finalize to make sure that apex::finalize happens before MPI terminates
<khuck> so that's probably not it
<hkaiser> khuck: apex::finalize should get called before hpx returns from hpx_init
<khuck> also true
<weilewei> hkaiser apex is not used in this case though
<hkaiser> weilewei: did you protect you mpi_finalize calls now?
<hkaiser> your*
<weilewei> hkaiser oh! I forgot that, now after protecting, everything runs fine
<weilewei> Thanks so much
<diehlpk_work> Anyone aware that Phylanx has some initialization error, like src/tcmalloc.cc:332] Attempt to free invalid pointer 0x55b1f6c79920?
<diehlpk_work> If I run the Python code without "from phylanx import Phylanx", everything works. However, with the import the code crashes
nanmiao11 has joined #ste||ar
<K-ballo> this is interesting.. including hpx/async_combinators/split_future.hpp triggers a static assert
parsa has quit [Quit: Free ZNC ~ Powered by LunarBNC: https://LunarBNC.net]
parsa has joined #ste||ar
<K-ballo> looks like future is missing some include for wherever the actual future_then_result implementations live
<K-ballo> hpx/execution/detail/future_exec.hpp
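A possible workaround sketch based on the missing-include diagnosis above; whether this is the header that should ultimately be added to future.hpp itself is part of the actual fix, not this snippet.

```cpp
// Pull in the header that provides the future_then_result machinery
// before including split_future.hpp, to avoid the static assert.
#include <hpx/execution/detail/future_exec.hpp>
#include <hpx/async_combinators/split_future.hpp>
```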
<khuck> hkaiser: I found the crash... there's an error in shutdown when a vector of localities is constructed by iterating over the partitions_ map here: https://github.com/STEllAR-GROUP/hpx/blob/eecbd49ac839ef91b88bfaae29ba2fdf2991caae/src/runtime/agas/server/locality_namespace_server.cpp#L394-L398
<khuck> the crash happens on the ++it call of the for loop, which suggests the map was modified while it was being iterated over.
<khuck> but... there's a lock guard a few lines above.
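A generic sketch of the pattern described here (not the actual code at the linked locality_namespace_server.cpp lines): iterating a std::map under a lock guard; a crash on ++it would mean the iterator was invalidated, e.g. by a modification that bypassed the lock.

```cpp
#include <cstdint>
#include <map>
#include <mutex>
#include <vector>

std::mutex mtx;
std::map<std::uint32_t, int> partitions;   // key: locality id (illustrative)

std::vector<std::uint32_t> localities()
{
    std::lock_guard<std::mutex> l(mtx);    // protects this reader only
    std::vector<std::uint32_t> result;
    for (auto it = partitions.begin(); it != partitions.end(); ++it)
        result.push_back(it->first);       // a crash on ++it means 'it' was invalidated
    return result;
}
```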
<hkaiser> khuck: are you sure that *this is still valid?
<khuck> it's a random crash that only happens in mpiexec runs, so I am running it in a loop until it crashes. Then I inspect the core file with gdb... the partitions_ map is valid, but I didn't check *this
<khuck> doing that now
<khuck> (it takes a while to load into gdb)
<khuck> *this seems fine
<hkaiser> khuck: if the map is valid, then the *this should be valid as well
<hkaiser> I don't see a way for the map to be modified during iteration; there is no reason any code should do that
<hkaiser> it's initialized once and then never changes
<khuck> that's what I figured
<hkaiser> could be only that the memory gets trashed somehow
<khuck> maybe... or GDB is taking me to the wrong thread
parsa has quit [Quit: Free ZNC ~ Powered by LunarBNC: https://LunarBNC.net]
parsa has joined #ste||ar
parsa has quit [Client Quit]
parsa has joined #ste||ar
parsa has quit [Client Quit]
parsa has joined #ste||ar
wash[m] has quit [Ping timeout: 260 seconds]
zao has quit [Ping timeout: 244 seconds]
wash[m] has joined #ste||ar
zao has joined #ste||ar
parsa has quit [Quit: Free ZNC ~ Powered by LunarBNC: https://LunarBNC.net]
parsa has joined #ste||ar
parsa has quit [Client Quit]
parsa has joined #ste||ar
parsa has quit [Quit: Free ZNC ~ Powered by LunarBNC: https://LunarBNC.net]
parsa has joined #ste||ar
weilewei has quit [Remote host closed the connection]
wash[m] has quit [Read error: Connection reset by peer]
zao has quit [Read error: Connection reset by peer]
zao has joined #ste||ar
wash[m] has joined #ste||ar
parsa has quit [Quit: Free ZNC ~ Powered by LunarBNC: https://LunarBNC.net]
parsa has joined #ste||ar
parsa has quit [Quit: Free ZNC ~ Powered by LunarBNC: https://LunarBNC.net]
parsa has joined #ste||ar