K-ballo changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/
weilewei has joined #ste||ar
sestro[m]1 has joined #ste||ar
gdaiss[m]1 has joined #ste||ar
mariella[m]2 has joined #ste||ar
hkaiser has quit [Quit: bye]
khuck has quit [*.net *.split]
mariella[m] has quit [*.net *.split]
sestro[m] has quit [*.net *.split]
gdaiss[m] has quit [*.net *.split]
spring[m] has quit [*.net *.split]
khuck has joined #ste||ar
spring[m] has joined #ste||ar
bita has quit [Ping timeout: 244 seconds]
nanmiao11 has quit [Remote host closed the connection]
weilewei has quit [Remote host closed the connection]
bita has joined #ste||ar
shahrzad has quit [Quit: Leaving]
akheir has quit [Quit: Leaving]
bita has quit [Ping timeout: 260 seconds]
hkaiser has joined #ste||ar
akheir has joined #ste||ar
nanmiao11 has joined #ste||ar
weilewei has joined #ste||ar
<weilewei>
can someone take a look at issue #4878?
<hkaiser>
weilewei: how can I reproduce this?
<weilewei>
hkaiser not sure how, I am trying to use gdb to debug 1 rank. But when I set a breakpoint on MPI_FINALIZE (b MPI_FINALIZE), the application does not stop there
<hkaiser>
weilewei: is MPI_INIT called at all?
bita has joined #ste||ar
weilewei has quit [Remote host closed the connection]
weilewei has joined #ste||ar
<weilewei>
hkaiser yes, MPI_Init is called inside HPX mpi first
<weilewei>
I believe MPI_Finalize is called first inside HPX's MPI code; then, when some MPI functions inside DCA are called after that, the program crashes
<hkaiser>
weilewei: how is hpx initialized in dca?
<weilewei>
hkaiser include hpx_init.hpp
<hkaiser>
that doesn't initialize hpx
<hkaiser>
do you call hpx_init explicitly?
<weilewei>
@hk
<weilewei>
yes
<hkaiser>
ok
<hkaiser>
ok, so hpx_init ultimately calls mpi_init
<weilewei>
ok
<weilewei>
maybe we don't want to call MPI_Finalize in hpx, and instead let DCA call MPI_Finalize
<hkaiser>
just call mpi_init before calling hpx_init and mpi_finalize after hpx_init has returned
<hkaiser>
that should do the trick
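(A minimal sketch of that ordering, assuming the common hpx::init(argc, argv) / hpx_main entry points; this is illustrative, not DCA's actual code:)

    #include <hpx/hpx_init.hpp>
    #include <mpi.h>

    // hpx_main runs on the HPX runtime once hpx::init has started it
    int hpx_main(int argc, char** argv)
    {
        // ... application work ...
        return hpx::finalize();
    }

    int main(int argc, char** argv)
    {
        MPI_Init(&argc, &argv);            // this code owns MPI
        int ret = hpx::init(argc, argv);   // HPX should detect MPI is already up
        MPI_Finalize();                    // finalize only after hpx::init has returned
        return ret;
    }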
<khuck>
hey all - is there a way at runtime to disable the HPX error handling? I am getting a segmentation violation but the program doesn't drop a core because the HPX error handler is calling "exit"
<khuck>
and the handler isn't giving me a stack backtrace
<hkaiser>
khuck: don't think so
<khuck>
hrm
shahrzad has joined #ste||ar
<khuck>
is there a way to disable the signal handler at configuration/build time?
<hkaiser>
weilewei: it could be that it is calling mpi_init after hpx has already called it, which would result in its own call failing, in which case it shouldn't call mpi_finalize either?
<hkaiser>
weilewei: well, there you go
<weilewei>
I added MPI_Init protection: if somewhere else has already initialized MPI, then don't initialize it again
<hkaiser>
sure, but then you shouldn't call finalize either
<weilewei>
But the error message is "The MPI_Comm_free() function was called after MPI_FINALIZE was invoked", so shall I protect the MPI_Comm_free() call first?
<hkaiser>
no
<hkaiser>
who is calling that comm_free?
<weilewei>
DCA is calling comm_free
<weilewei>
and HPX calls MPI_FINALIZE before DCA calls comm_free
<hkaiser>
well, I don't know - you have to make sure that mpi is finalized by the same code that has initialized it
<hkaiser>
hpx uses the mpi_environment (which you have already found), it's the only place where we do that
<hkaiser>
for dca, I don't know what's going on
<weilewei>
so I think comm_free should be protected. In DCA, when comm_free is called, DCA's finalize has not been called yet
<hkaiser>
no don't do that
<hkaiser>
don't work around the issue, fix it
<hkaiser>
find out who is calling init and finalize and make sure it happens at the correct times
<weilewei>
HPX calls finalize way earlier than the correct time
<hkaiser>
well, find out why and prevent it from happening
<hkaiser>
alternatively, let dca handle the mpi initialization
<hkaiser>
(before/after hpx is active)
<weilewei>
how? I think I want DCA to handle finalize
<hkaiser>
this will prevent hpx from calling init and finalize
<hkaiser>
as I said, call init before hpx_init() and finalize after hpx_init has returned, i.e. use your MPIInitializer class in main()
<weilewei>
hmm let me see how I can do it
<hkaiser>
also, make sure not to call finalize there if you didn't call init
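(A hedged sketch of that symmetry; mpi_guard is a made-up name, not the real MPIInitializer: initialize only if MPI is not yet up, and finalize only if this code did the initializing.)

    #include <mpi.h>

    struct mpi_guard
    {
        bool we_initialized = false;

        mpi_guard(int& argc, char**& argv)
        {
            int already_initialized = 0;
            MPI_Initialized(&already_initialized);
            if (!already_initialized)
            {
                MPI_Init(&argc, &argv);
                we_initialized = true;
            }
        }

        ~mpi_guard()
        {
            int already_finalized = 0;
            MPI_Finalized(&already_finalized);
            // never finalize MPI that someone else initialized
            if (we_initialized && !already_finalized)
                MPI_Finalize();
        }
    };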
<weilewei>
so hpx::init will actually call hpx_main?
jaafar has quit [Remote host closed the connection]
<weilewei>
If I create a DCA concurrency object (which calls MPI init) before hpx::init, how can I pass it to hpx_main(int argc, char** argv)?
<weilewei>
hkaiser ^^
jaafar has joined #ste||ar
nanmiao11 has quit [Remote host closed the connection]
<hkaiser>
weilewei: why do you have to?
<hkaiser>
but you can certainly bind arguments to be passed through to hpx_main
<weilewei>
yes, how can I do so?
<hkaiser>
hpx_init(f, argc, argv) will call 'f' as hpx_main
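(Roughly what binding extra state looks like, assuming the hpx::init overload that accepts an int(int, char**) callable; the exact overload set differs between HPX versions, and Concurrency / my_hpx_main here are stand-ins, not DCA's real names:)

    #include <hpx/hpx_init.hpp>

    struct Concurrency;  // stand-in for DCA's concurrency type

    int my_hpx_main(Concurrency* conc, int argc, char** argv)
    {
        // ... use *conc while the HPX runtime is active ...
        return hpx::finalize();
    }

    int main(int argc, char** argv)
    {
        Concurrency* conc = nullptr;  // would be created after MPI_Init, before hpx::init
        auto f = [conc](int argc, char** argv) {
            return my_hpx_main(conc, argc, argv);
        };
        return hpx::init(f, argc, argv);  // f is called in place of hpx_main
    }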
<hkaiser>
weilewei: why do you need to create a dca concurrency object in main?
<weilewei>
ahhh, I finally got it: I just call MPI_Init before hpx::init, so MPI is already initialized and constructing a Concurrency object inside main will not call MPI_Init again (it is protected)
<weilewei>
MPIInitializer has a protected constructor
<weilewei>
I can't call it directly
<hkaiser>
well, change that ;-)
<hkaiser>
or derive a class that has a public constructor
<weilewei>
Not very doable if I construct an MPIInitializer before hpx::init, because I got this error: The MPI_Comm_dup() function was called after MPI_FINALIZE was invoked.
<weilewei>
The first MPI_FINALIZE is being called in the destructor of MPIInitializer
<weilewei>
It seems the object gets destroyed before calling hpx::init
<khuck>
hkaiser: you'll be happy to know there are actually 6 signal handlers in HPX, and it took me this long to track down which one was getting triggered
<hkaiser>
khuck: heh
<hkaiser>
I was not aware of that :/
<weilewei>
hkaiser it works! But I got another error: The MPI_Barrier() function was called after MPI_FINALIZE was invoked.
<weilewei>
not sure who calls MPI_Barrier(), I did not find it in DCA or HPX
<hkaiser>
khuck: ahh, yes! the stack overflow handler
parsa has joined #ste||ar
<khuck>
weilewei: apex calls MPI_Barrier in the OTF2 finalization, which is called from apex::finalize. But that also includes a wrapper around MPI_Finalize to make sure that apex::finalize happens before MPI terminates
<khuck>
so that's probably not it
<hkaiser>
khuck: apex::finalize should get called before hpx returns from hpx_init
<khuck>
also true
<weilewei>
hkaiser apex is not used in this case though
<hkaiser>
weilewei: did you protect your mpi_finalize calls now?
<weilewei>
hkaiser oh! I forgot that, now after protecting, everything runs fine
<weilewei>
Thanks so much
<diehlpk_work>
Anyone aware that Phylanx has some initialization error, like src/tcmalloc.cc:332] Attempt to free invalid pointer 0x55b1f6c79920?
<diehlpk_work>
If I run the Python code without from phylanx import Phylanx, everything works. However, with the import the code crashes
nanmiao11 has joined #ste||ar
<K-ballo>
this is interesting.. including hpx/async_combinators/split_future.hpp triggers a static assert
<khuck>
the crash happens on the ++it call of the for loop, which suggests the map was modified while iterating over it.
<khuck>
but... there's a lock guard a few lines above.
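(For context, the pattern under discussion looks roughly like this; the names are generic, not the actual code, and the key point is that the lock only helps if every code path that modifies the map locks the same mutex:)

    #include <map>
    #include <mutex>

    struct partition {};

    struct registry
    {
        std::mutex mtx_;
        std::map<int, partition> partitions_;

        void visit_all()
        {
            std::lock_guard<std::mutex> lock(mtx_);
            for (auto it = partitions_.begin(); it != partitions_.end(); ++it)
            {
                // ... work on it->second; a concurrent erase/insert that
                // bypasses mtx_ would invalidate 'it' and crash on ++it ...
            }
        }
    };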
<hkaiser>
khuck: are you sure that *this is still valid?
<khuck>
it's a random crash that only happens in mpiexec runs, so I am running it in a loop until it crashes. Then I inspect the core file with gdb... the partitions_ map is valid, but I didn't check *this
<khuck>
doing that now
<khuck>
(it takes a while to load into gdb)
<khuck>
*this seems fine
<hkaiser>
khuck: if the map is valid, then the *this should be valid as well
<hkaiser>
I don't see a way for the map to be modified during iteration, there is no reason any code should do that
<hkaiser>
it's initialized once and then never changes
<khuck>
that's what I figured
<hkaiser>
it could only be that the memory gets trashed somehow
<khuck>
maybe... or GDB is taking me to the wrong thread