hkaiser changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/ | GSoC: https://github.com/STEllAR-GROUP/hpx/wiki/Google-Summer-of-Code-%28GSoC%29-2020
<weilewei> hkaiser I think my code is facing double de-allocation again, see error log here: https://gist.github.com/weilewei/1949941f8d63c51f39cba25f97640ada. The overall logic is to copy the G_ array (a.k.a. G2) into sendbuff_G_: the copy first deallocates sendbuff_G_, then reallocates it, and finally does a memcpy (all on the GPU). However, when the program
<weilewei> deallocates, it finds sendbuff_G_ has already been deallocated, which triggers the error.
<hkaiser> use c++ managed pointers
<hkaiser> unique_ptr or shared_ptr depending on the situation
<hkaiser> so this will not happen
<weilewei> hkaiser for example how?
<hkaiser> unique_ptr automatically deallocates at destruction, no double deallocation can happen
<hkaiser> have you not listened to what your mom is telling you? ;-)
<hkaiser> NO RAW POINTERS!
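A minimal sketch of the RAII approach suggested here, applied to a GPU buffer; the CUDA runtime calls and the make_device_buffer helper are illustrative assumptions, not code from either project:

    #include <cuda_runtime.h>
    #include <cstddef>
    #include <memory>

    // Deleter that releases device memory exactly once, at destruction.
    struct cuda_deleter {
        void operator()(void* p) const noexcept { cudaFree(p); }
    };

    using device_buffer = std::unique_ptr<void, cuda_deleter>;

    device_buffer make_device_buffer(std::size_t bytes) {
        void* p = nullptr;
        cudaMalloc(&p, bytes);
        return device_buffer(p);
    }

    // Reassigning frees the previous allocation exactly once, so a
    // deallocate/reallocate/memcpy sequence cannot double-free:
    //   sendbuff = make_device_buffer(new_size);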
<weilewei> but I don't have destruction in my code
<hkaiser> who is deallocating then, if not your code?
<weilewei> I am not sure who else in the program is deallocating that sendbuffer
<hkaiser> find out
<weilewei> unless some asynchronous operation happens in this section of the code; however, I feel like each step is synchronous: https://github.com/STEllAR-GROUP/DCA/blob/distG4_pr/include/dca/phys/dca_step/cluster_solver/shared_tools/accumulation/tp/tp_accumulator_gpu.hpp#L564-L588
<weilewei> updateG4 is an async kernel call; however, it does not touch sendbuff
<hkaiser> weilewei: well, somebody has to deallocate things for them to get deallocated twice
<weilewei> hkaiser right, in this case, how to track that thief down?
<hkaiser> set a break point on free() and wait for the pointer to come by
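In gdb that could look like the sketch below; the pointer value is hypothetical, and for device allocations the breakpoint would go on cudaFree rather than free:

    (gdb) break cudaFree
    (gdb) condition 1 $rdi == 0x7fffdeadbeef   # x86-64: first argument arrives in $rdi; address is hypothetical
    (gdb) run
    (gdb) backtrace                            # when it fires, shows who is freeing that pointer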
bita has joined #ste||ar
<weilewei> well... this double-deallocation error only happens with multiple threads and multiple ranks, and when the iteration count is large enough. I need to think about that
<hkaiser> that's tough, then
shahrzad has joined #ste||ar
bita has quit [Quit: Leaving]
shahrzad has quit [Ping timeout: 240 seconds]
shahrzad has joined #ste||ar
<hkaiser> weilewei: I know how you feel
hkaiser has quit [Quit: bye]
<weilewei> hkaiser thanks!
shahrzad has quit [Ping timeout: 240 seconds]
nan11 has quit [Remote host closed the connection]
shahrzad has joined #ste||ar
shahrzad has quit [Quit: Leaving]
weilewei has quit [Remote host closed the connection]
<zao> Hrm, colleagues report that `nproc --all` output has changed recently, possibly after last night's kernel update.
<zao> If an 8C/16T machine boots with SMT on, `nproc --all` says 16; after turning off SMT via `smt/control` for the CPU in `/sys`, `/proc/cpuinfo` reports 8 cores but `nproc --all` still says 16.
nikunj97 has joined #ste||ar
<Yorlik> What methods do you use to get to the bottom of memory leaks in HPX applications on Windows?
<Yorlik> I tried using the "#include <crtdbg.h>" with "_CrtDumpMemoryLeaks();" method,
<Yorlik> but when adding "#define _CRTDBG_MAP_ALLOC" to get detailed information, a ton of compile errors
<Yorlik> pop up all over the place.
<Yorlik> Not sure if that is because I'm using jemalloc, just thought I'd ask.
<Yorlik> I'm also interested in using jemalloc exclusively and using it for memory debugging,
<Yorlik> but the configuration process on Windows is a bit different than on Linux.
<Yorlik> Ideas?
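For reference, a minimal sketch of the CRT debug-heap setup being described (MSVC only); the catch is that _CRTDBG_MAP_ALLOC must be defined before the CRT headers in every translation unit:

    #define _CRTDBG_MAP_ALLOC   // must precede the CRT headers
    #include <stdlib.h>
    #include <crtdbg.h>

    int main() {
        // report any unfreed blocks automatically at process exit
        _CrtSetDbgFlag(_CRTDBG_ALLOC_MEM_DF | _CRTDBG_LEAK_CHECK_DF);
        // ... application code ...
        return 0;
    }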
hkaiser has joined #ste||ar
<Yorlik> hkaiser: YT?
jbjnr has left #ste||ar ["User left"]
<hkaiser> Yorlik: hey
<hkaiser> g'morning
<Yorlik> Morning!
<Yorlik> I had just a quick question about finding memory leaks in Visual Studio
<Yorlik> the default crtdbg method fails with a ton of compile errors
<Yorlik> At least if I want to enable _CRTDBG_MAP_ALLOC
<Yorlik> jemalloc config on Windows is ~special
<Yorlik> So - I'm in search of a reliable method to pinpoint it
<hkaiser> use crtdebug without jemalloc
<Yorlik> I kinda know where it is - probably I'm using our Lua bindings incorrectly
<Yorlik> OK
<Yorlik> Makes sense
<Yorlik> Thanks!
<Yorlik> I'll check it out - thanks for the link!
<zao> I guess this is only tangentially HPX-related, but have any of you fine people looked at Conan for dependencies, and how bad is it? :P
<hkaiser> zao: we've had some discussions with the conan people a while back, but nothing has materialized so far (nobody felt the need to investigate)
<zao> I see there are some attempts at conanfiles out there; the most up-to-date one targets 1.3.0
nikunj has quit [Ping timeout: 244 seconds]
nikunj has joined #ste||ar
<hkaiser> right, as said - it was a while back
<hkaiser> I think the conan guys did that at that time
<K-ballo> I'm using conan for dependencies in a project, self-hosted repository, we produce recipes for all our dependencies... works ok
<zao> Getting VSCode remotes with a shared codebase to interact well with module systems is turning out to be all sorts of "fun".
<zao> Rust has spoiled me :P
<hkaiser> ms[m]: I can't say anything about #4564, please go ahead as you see fit
<ms[m]> hkaiser: ok, thanks
<Yorlik> hkaiser: Does #define _CRTDBG_MAP_ALLOC work for you? I can't get it to work with HPX - even when jemalloc is off
<zao> Are you building a _DEBUG build too?
<Yorlik> Yes
<zao> I'd kind of expect that you'd need to build dependencies with it too.
<zao> A core problem of it is that it turns `malloc` into a macro, which reportedly is ... unhealthy for some code.
<Yorlik> I made an HPX debug build without jemalloc for that purpose - cleaned all dirs to really have a blank slate
<Yorlik> It seems to even touch all the ::free functions I have in my object pools
<Yorlik> I think I'll abandon this method - it looks way too messy to me.
<hkaiser> Yorlik: try using the vld library
<Yorlik> vld is outdated - but you say it still works?
<Yorlik> They kinda stopped 2 years ago or so
<hkaiser> I have used it before with good results, it's been a while, however
<Yorlik> I'll give it a shot. The default crtdbg method is broken for us
<hkaiser> Yorlik: on Windows jemalloc does not replace malloc/free - we use it explicitly through C++ allocators
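A minimal sketch of that explicit-allocator approach, assuming a jemalloc build with the je_ prefix (as in the allocator pasted later in this log); an illustration, not HPX's actual allocator:

    #include <jemalloc/jemalloc.h>
    #include <cstddef>
    #include <new>

    template <typename T>
    struct je_allocator {
        using value_type = T;
        je_allocator() = default;
        template <typename U>
        je_allocator(je_allocator<U> const&) noexcept {}

        T* allocate(std::size_t n) {
            if (void* p = je_malloc(n * sizeof(T)))
                return static_cast<T*>(p);
            throw std::bad_alloc();
        }
        void deallocate(T* p, std::size_t) noexcept { je_free(p); }
    };

    template <typename T, typename U>
    bool operator==(je_allocator<T> const&, je_allocator<U> const&) noexcept { return true; }
    template <typename T, typename U>
    bool operator!=(je_allocator<T> const&, je_allocator<U> const&) noexcept { return false; }

    // usage: std::vector<double, je_allocator<double>> v;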
<Yorlik> But I already know the source of my leaks: it happens deep inside Lua when using custom userdata objects, which seem not to get cleaned up properly.
<hkaiser> mimalloc is fully automatic, not sure if it can track leaks, though
<Yorlik> I might have to re-visit it
<Yorlik> jemalloc works nicely as explicit lua allocator - even on windows
<Yorlik> I'm just giving this function to Lua:
<Yorlik> extern "C" static void* custom_l_alloc( void* ud, void* ptr, size_t osize, size_t nsize ) {
<Yorlik> (void)ud;
<Yorlik> je_free( ptr );
<Yorlik> (void)osize; /* not used */
<Yorlik> if ( nsize == 0 ) {
<Yorlik> return nullptr;
<Yorlik> }
<Yorlik> else
<Yorlik> return je_realloc( ptr, nsize );
<Yorlik> }
<Yorlik> Didn't use it for my main application.
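For context, such an allocator is handed to Lua at state creation through the standard C API; a sketch (custom_l_alloc is the function above):

    #include <lua.hpp>

    void create_state() {
        // Lua routes every internal allocation through custom_l_alloc
        lua_State* L = lua_newstate(custom_l_alloc, nullptr /* ud */);
        // ... open libraries, run scripts ...
        lua_close(L);   // releases everything the state still owns
    }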
<hkaiser> are lua objects reference counted?
<Yorlik> Yes
<hkaiser> so that might be your issue
<hkaiser> how do you manage the reference counts?
<Yorlik> I'm relying on our Lua Bindings
<Yorlik> It might be the case there's an issue, or I'm doing something wrong
<hkaiser> c++ bindings?
<Yorlik> Yes. We use Sol3
<hkaiser> ok - they should have gotten things right
<Yorlik> sol is pretty good.
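A minimal sol3 sketch of the pattern under discussion: a C++ type registered as a usertype, with instances created from Lua owned and collected by the Lua GC (the Counter type is hypothetical):

    #include <sol/sol.hpp>

    struct Counter { int value = 0; };

    int main() {
        sol::state lua;
        lua.open_libraries(sol::lib::base);
        // userdata created from Lua is owned by the Lua GC, not by C++
        lua.new_usertype<Counter>("Counter", "value", &Counter::value);
        lua.script("local c = Counter.new() c.value = 42");
        lua.collect_garbage();   // Counter's destructor runs here, not earlier
        return 0;
    }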
<Yorlik> hkaiser: Got 5244 leaks reported for my app, 6 for hpx
<hkaiser> the hpx ones are most probably globals that get free'd after the memory tracing has ended
<hkaiser> Yorlik: but pls feel free to give us the traces, we'll have a look
<Yorlik> Sure
<Yorlik> As gist?
weilewei has joined #ste||ar
karame_ has joined #ste||ar
<hkaiser> Yorlik: some of those are caused by your code
<Yorlik> In Leak 4 my code shows up, indeed
<hkaiser> 5 and 6 as well
<Yorlik> Also 5 and 6
<hkaiser> leak 1 I don't understand, Leak 2 and 3 are globals that eventually get released
<Yorlik> I haven't done any thorough analysis yet.
<hkaiser> but thanks
<Yorlik> It might be that the core problem is something in the destruction of the LuaEngines
<hkaiser> I feel vindicated ;-)
<Yorlik> :(
bita has joined #ste||ar
nan11 has joined #ste||ar
<weilewei> hkaiser the hpx mpi async test runs fine on Summit, no double mpi init occurs, thanks
<hkaiser> weilewei: ok - that one explicitly calls MPI_Init before starting HPX
<weilewei> hkaiser right, that one
<hkaiser> that's the same as for dca++, I guess
<hkaiser> not sure what's different for you, however
<weilewei> I will try running it now
<Yorlik> hkaiser: All my "leaks" are gone if I destroy my Lua states before exiting - seems I need to call the GC more often .... :)
<hkaiser> good
<Yorlik> Still, this stuff connected with hpx is there - I'll figure it out
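For reference, forcing a full collection cycle so pending finalizers run is one call in the C API (sol3's lua.collect_garbage() wraps the same thing):

    #include <lua.hpp>

    // run a complete garbage-collection cycle, including finalizers
    void run_full_gc(lua_State* L) {
        lua_gc(L, LUA_GCCOLLECT, 0);
    }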
<weilewei> Sources say SC20 might be held as normal, with some degree of confidence :)
nikunj97 has quit [Ping timeout: 256 seconds]
nikunj97 has joined #ste||ar
<hkaiser> weilewei: that's surprising
<K-ballo> would people still go..?
<weilewei> hkaiser but it is also a developing situation, so I personally think no one can guarantee anything
<hkaiser> right
<hkaiser> ms[m]: yt?
nikunj97 has quit [Ping timeout: 244 seconds]
<weilewei> Bryce is giving Cuda C++ lib talk tonight via Zoom: https://www.meetup.com/ACCU-Bay-Area/events/269904471/
<Yorlik> How complicated would it be to start experimenting with Kokkos to compute on my local graphics card?
<hkaiser> Yorlik: download it and use it
<Yorlik> Would I need anything additional, like CUDA stuff?
<hkaiser> weilewei: yah, they ported the clang libc++ to the device
<hkaiser> Yorlik: you most likely will need cuda (if you have an nvidia gpu)
<Yorlik> OK
<hkaiser> Yorlik: not sure if it works on windows, though
<weilewei> hkaiser oh, that's nice; I will watch Bryce's talk then to understand it better
<Yorlik> Aw
<hkaiser> codewise it might, but the buildsystem will not know anything about msvc
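A minimal Kokkos sketch, assuming a Kokkos build configured with the CUDA backend (and, per the caveat above, a non-MSVC toolchain):

    #include <Kokkos_Core.hpp>
    #include <cstdio>

    int main(int argc, char* argv[]) {
        Kokkos::initialize(argc, argv);
        {
            // runs on the default execution space - the GPU when CUDA is enabled
            Kokkos::parallel_for("hello", 16, KOKKOS_LAMBDA(int i) {
                printf("hello from iteration %d\n", i);
            });
        }
        Kokkos::finalize();
        return 0;
    }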
rtohid has joined #ste||ar
akheir has joined #ste||ar
<hkaiser> bita: yt?
nikunj has quit [Read error: Connection reset by peer]
nikunj has joined #ste||ar
<weilewei> Am I missing something? /gpfs/alpine/proj-shared/cph102/weile/dev/src/Ring_example_MPI_CUDA/gpuDirect_hpx.cpp:30:11: error: 'enable_user_polling' is not a member of 'hpx::mpi' hpx::mpi::enable_user_polling enable_polling;
<hkaiser> mpi::experimental
<weilewei> IC... sorry about that
nikunj has quit [Ping timeout: 240 seconds]
nikunj has joined #ste||ar
<hkaiser> weilewei: I don't think you still need that
<hkaiser> look at the tests to see how it's done
<hkaiser> it's much simpler now
<weilewei> hkaiser right, in hpx tests, it is hpx::mpi::experimental::enable_user_polling enable_polling;
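A minimal sketch around that guard; only the enable_user_polling name is taken from the test cited above, and the include path is an assumption (it has moved between HPX versions):

    #include <hpx/hpx_init.hpp>
    #include <hpx/modules/async_mpi.hpp>   // assumed header; varies across HPX versions

    int hpx_main(int argc, char* argv[])
    {
        {
            // RAII guard: MPI polling is enabled on the HPX scheduler
            // for the lifetime of this scope
            hpx::mpi::experimental::enable_user_polling enable_polling;
            // ... launch MPI operations as futures here ...
        }   // polling is disabled again here
        return hpx::finalize();
    }

    int main(int argc, char* argv[])
    {
        return hpx::init(argc, argv);
    }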
nikunj has quit [Ping timeout: 244 seconds]
nikunj has joined #ste||ar
<weilewei> hkaiser in hpx libs, I have no problem running mpi_ring_async_executor_test (no double mpi init), but for my program here: https://github.com/weilewei/Ring_example_MPI_CUDA/blob/hpx_mpi_async/G2_ring_hpx.cpp, it still complains: Open MPI has detected that this process has attempted to initialize MPI (via MPI_INIT or MPI_INIT_THREAD) more than once.
<weilewei> This is erroneous
<weilewei> The only difference I can think of is that it uses hpx_main, not hpx_init
<hkaiser> can you set a break point on MPI_Init[_thread] and wait until it comes by to get a stack-backtrace?
<weilewei> let me try
<bita> hkaiser, yes
<bita> sorry I missed your ping
<hkaiser> bita: nvm, found it - thanks!
<bita> :) :+1
<hkaiser> weilewei: I'm not sure I understand this
<hkaiser> does it come by the MPI_Init twice?
<weilewei> hkaiser my impression is that the program hits MPI_Init and then the next step crashes
<hkaiser> how's that?
<hkaiser> does it call MPI_Init_thread instead?
<weilewei> I don't know actually...
<hkaiser> did you set a breakpoint on MPI_Init_thread?
<weilewei> I set it on MPI_Init, because I did not use MPI_Init_thread
<hkaiser> weilewei: the mpi::experimental stuff uses MPI_Init_thread
<hkaiser> also, since everything is multi-threaded you should use the threaded version
<weilewei> hkaiser so if hpx uses MPI_Init_thread and the application also calls MPI_Init_thread, that leads to the double call to MPI_Init_thread? Is that correct?
<weilewei> hkaiser but I remember an earlier version of the hpx mpi future stuff might not have used MPI_Init_thread; that's what worked in my previous sample code.
<hkaiser> weilewei: just set the breakpoint on both functions
<weilewei> hold on I should set one more breakpoint at MPI_Init
<weilewei> (gdb) b MPI_Init -> Function "MPI_Init" not defined. Since I replaced MPI_Init with MPI_Init_thread, gdb can't place a breakpoint on MPI_Init
<hkaiser> ok
<hkaiser> so where does the MPI_Init_thread call come from?
<hkaiser> the second one, that is?
<hkaiser> look up the stack and try to find out
<weilewei> hpx::util::mpi_environment::init
<hkaiser> weilewei: ^^
<weilewei> hkaiser IC
<hkaiser> does it happen there?
<weilewei> let me verify a bit more
<weilewei> the second one is the correct link
<hkaiser> weilewei: I think HPX is linked against a different MPI version than the application
<weilewei> hkaiser they are the same
nan11 has quit [Remote host closed the connection]
<hkaiser> they are not, the addresses of MPI_Init_thread are different in both break points
nan11 has joined #ste||ar
<weilewei> hkaiser but I compile hpx and my application with the same spectrum-mpi version...
<weilewei> Also, it seems MPI_Init_thread is hit three times: two come from hpx and one comes from the application
<hkaiser> but why?
<hkaiser> try stepping through the code there
<hkaiser> all MPI_Init calls are protected by MPI_Initialized(), so it shouldn't be called more than once
<weilewei> hkaiser I switched from hpx_main to hpx::init, and now the double mpi init issue goes away
<hkaiser> interesting
<hkaiser> but that does not explain what is wrong in the previous code
<hkaiser> ahh, I know what's up
<weilewei> Ah, why?
<hkaiser> weilewei: do you protect the MPI_Init in main with MPI_Initialized?
<weilewei> no, I did not put MPI_Initialized in my application
<weilewei> Do I need to?
<hkaiser> using hpx_main.hpp will cause HPX to be initialized before main() is executed
<hkaiser> so your MPI_Init is the second one
<weilewei> Right, that was my guess at the beginning, so I should check MPI_Initialized first - skip my MPI_Init if it's already initialized, otherwise do MPI_Init - something like this
nan11 has quit [Remote host closed the connection]
<hkaiser> something like that, yes
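For the record, MPI_Initialized reports through an out-parameter rather than a return value, so the guard looks roughly like this (the init_mpi_once wrapper is a hypothetical helper name):

    #include <mpi.h>

    void init_mpi_once() {
        int initialized = 0;
        MPI_Initialized(&initialized);
        if (!initialized) {
            int provided = 0;
            // threaded variant, since HPX runs multi-threaded
            MPI_Init_thread(nullptr, nullptr, MPI_THREAD_MULTIPLE, &provided);
        }
    }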
<weilewei> hkaiser ok, dca with hpx mpi futures seems to be running now after this trick
<weilewei> now it is time to try to break MPI_Wait using hpx mpi futures
nan11 has joined #ste||ar
rtohid has quit [Remote host closed the connection]
rtohid has joined #ste||ar
akheir has quit [Quit: Leaving]
karame_ has quit [Remote host closed the connection]
<hkaiser> weilewei: \o/
weilewei has quit [Remote host closed the connection]
rtohid has left #ste||ar [#ste||ar]