hkaiser changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/ | GSoC: https://github.com/STEllAR-GROUP/hpx/wiki/Google-Summer-of-Code-%28GSoC%29-2020
diehlpk has joined #ste||ar
diehlpk has quit [Changing host]
diehlpk has joined #ste||ar
<diehlpk> hkaiser, I think I was able to fix the HPX issue on Cori
<diehlpk> Level 11 on 512 nodes finished
<diehlpk> Let us wait for 256 and 1024
<hkaiser> diehlpk: what did you change?
<diehlpk> Dominic merged the radiation code, and I used a commit from before that merge
<diehlpk> We will see what happens to the remaining jobs
<hkaiser> interesting, so it's caused by octotiger?
<diehlpk> I am not sure, since a simple hello world fails on QB
<hkaiser> right
<diehlpk> Maybe the different version does not have the race condition
<diehlpk> We have to debug this after the SC paper
<hkaiser> right
<diehlpk> Right now, I am happy that things seem to work again
hkaiser has quit [Quit: bye]
diehlpk has quit [Ping timeout: 240 seconds]
weilewei has quit [Remote host closed the connection]
nikunj has joined #ste||ar
nikunj has quit [Remote host closed the connection]
nikunj has joined #ste||ar
jaafar has quit [Ping timeout: 256 seconds]
nikunj97 has joined #ste||ar
nikunj97 has quit [Read error: Connection reset by peer]
<heller1> So there's a proper bridge now?
Hashmi has joined #ste||ar
nikunj has quit [Remote host closed the connection]
nikunj has joined #ste||ar
hkaiser has joined #ste||ar
nikunj has quit [Ping timeout: 240 seconds]
nikunj has joined #ste||ar
ronneigandhi has joined #ste||ar
ronneigandhi has left #ste||ar [#ste||ar]
nikunj has quit [Read error: Connection reset by peer]
nikunj has joined #ste||ar
K-ballo has quit [Remote host closed the connection]
K-ballo has joined #ste||ar
hkaiser has quit [Quit: bye]
hkaiser has joined #ste||ar
ct-clmsn has joined #ste||ar
hkaiser has quit [Ping timeout: 240 seconds]
hkaiser has joined #ste||ar
diehlpk has joined #ste||ar
diehlpk has quit [Changing host]
diehlpk has joined #ste||ar
ct-clmsn has quit [Ping timeout: 268 seconds]
<Yorlik> hkaiser: yt?
<Yorlik> The solution to the last problem was easy in the end: just store all instances of LuaState pools in a static member of the class, so I can access everything without knowing which threads are actually using them. After stopping the server, this now allows me to safely kill all LuaStates and reload the scripts.
ct-clmsn has joined #ste||ar
ct-clmsn is now known as Guest32456
Guest32456 has quit [Quit: Leaving]
weilewei has joined #ste||ar
weilewei has quit [Remote host closed the connection]
<hkaiser> Yorlik: sure, but this requires synchronization, doesn't it?
<Yorlik> Yes - the operation can only be performed when the server is paused and no Lua states are being used. But that's easy to ensure.
<hkaiser> what I meant is that you have to grab a lock or something in order to create a new state engine
<Yorlik> I am just leaving the update loop, which guarantees that.
<hkaiser> or to get one
<Yorlik> The pools have a lock
<Yorlik> But when an engine is handed out, it has left the pool, and there is no concurrency until it is returned. The sections that need locking are extremely short, one-liner code sequences.
<Yorlik> It's just a pointer being pushed back into the vector.
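A minimal sketch of the pool scheme described above, using hypothetical names (LuaState, LuaStatePool) rather than Yorlik's actual code: a static registry of all pools lets a maintenance pass reach every state once the update loop has stopped, and the only locked operations are the one-line pop and push on the vector.

    // Hypothetical sketch, not the actual code: a pool of Lua states
    // with a static registry so that a paused server can reach and
    // reset every state without knowing which threads used them.
    #include <memory>
    #include <mutex>
    #include <vector>

    struct LuaState { /* wraps a lua_State*, loaded scripts, etc. */ };

    class LuaStatePool {
    public:
        LuaStatePool() { all_pools_.push_back(this); }  // assumed to run at startup

        // Handing out an engine: the lock covers only the one-line pop.
        std::unique_ptr<LuaState> acquire() {
            std::lock_guard<std::mutex> lk(mtx_);
            if (states_.empty())
                return std::make_unique<LuaState>();
            auto s = std::move(states_.back());
            states_.pop_back();
            return s;  // no concurrency on s until it is returned
        }

        // Returning it: just a pointer pushed back into the vector.
        void release(std::unique_ptr<LuaState> s) {
            std::lock_guard<std::mutex> lk(mtx_);
            states_.push_back(std::move(s));
        }

        // Only safe while the server is paused and no state is handed out.
        static void reset_all() {
            for (LuaStatePool* pool : all_pools_) {
                std::lock_guard<std::mutex> lk(pool->mtx_);
                pool->states_.clear();  // kill all LuaStates; scripts reload afterwards
            }
        }

    private:
        std::mutex mtx_;
        std::vector<std::unique_ptr<LuaState>> states_;
        static std::vector<LuaStatePool*> all_pools_;  // the static in the class
    };

    std::vector<LuaStatePool*> LuaStatePool::all_pools_;

Handing the state out via unique_ptr makes the "no concurrency while checked out" property explicit in the ownership.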
rtohid has quit [Quit: Konversation terminated!]
gonidelis has joined #ste||ar
jaafar has joined #ste||ar
gonidelis has quit [Ping timeout: 240 seconds]
hkaiser has quit [Quit: bye]
Abhishek09 has joined #ste||ar
Abhishek09 has quit [Remote host closed the connection]
diehlpk has quit [Ping timeout: 256 seconds]
weilewei has joined #ste||ar
hkaiser has joined #ste||ar
<weilewei> hkaiser if I get the following bug when using GPUDirect, what would you suggest? https://gist.github.com/weilewei/d1998501e4be86a72b1bf3307ed470e1
<hkaiser> looks like a double delete
<hkaiser> weilewei: ^^
<hkaiser> or an attempt to delete a pointer that was not returned by that allocator
<weilewei> ok... I see, hmmm not sure why it happens
<weilewei> hkaiser
<hkaiser> weilewei: could be anything, really - the result of a wild write corrupting the allocator's internal memory somewhere
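For reference, the two failure modes hkaiser names, as a deliberately minimal sketch (the offending lines are commented out because both are undefined behavior; none of this is the actual code from the gist):

    // Two ways to trigger an abort like the one in the gist.
    int main() {
        int* p = new int(42);
        delete p;
        // delete p;     // (1) double delete: p was already freed

        int local = 0;
        int* q = &local;
        // delete q;     // (2) deleting a pointer the allocator never returned

        return 0;
    }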
<weilewei> The debugger doesn't work; the Arm Forge debugger cannot read the debugging info. Not sure how to use gdb to debug a 2-MPI-rank run
<weilewei> If I run this program with 1 MPI rank, it is fine
<hkaiser> weilewei: is it reproducible?
<weilewei> only reproducible on Summit, I guess, which has NVLink and CUDA GPUs
<hkaiser> using gdb for a 2-rank run: mpirun -n 2 gdb --args ./your-executable <your command line args>
<hkaiser> wait for the loop to hit, attach gdb in two new terminal windows, set the variable to 1 and continue
<weilewei> hkaiser when I debug with gdb, I get a hang: https://gist.github.com/weilewei/d1998501e4be86a72b1bf3307ed470e1#gistcomment-3211575
<hkaiser> weilewei: well, quit one of the instances
<hkaiser> sorry, I meant: you quit one of the instances
<weilewei> hkaiser one of the instances quits at the time I start the debugger
<hkaiser> attaching gdb might be the better solution
<hkaiser> (see above)
<weilewei> hkaiser not sure how to attach the gdb debugger exactly; I need to learn a bit more
<hkaiser> well I tried to explain above
Amy1 has quit [Ping timeout: 272 seconds]
<hkaiser> add that code at startup
<hkaiser> then wait for the loop to hit on both instances
<hkaiser> it will print the pid
<weilewei> hkaiser ok, I see, let me try
<hkaiser> then start gdb -p <pid> twice in two terminal windows with the two pids
Amy1 has joined #ste||ar
<hkaiser> both will likely sit inside the sleep, so go up one stack frame and set the variable to 1
<weilewei> hkaiser that means I need two terminal windows on the same interactive compute node
<hkaiser> with 'set var i = 1'
<weilewei> I doubt Summit allows me to do so
<hkaiser> then 'continue'
<hkaiser> and you will run both instances in gdb
<hkaiser> weilewei: yes
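The startup code hkaiser refers to is the usual MPI attach loop; a sketch under that assumption (not taken from the actual application source):

    // Add near the top of main(): each rank prints its pid, then spins
    // until a debugger attaches and clears i.
    #include <cstdio>
    #include <unistd.h>

    static void wait_for_debugger() {
        volatile int i = 0;
        char host[256];
        gethostname(host, sizeof(host));
        std::printf("pid %d on %s waiting for debugger\n", (int) getpid(), host);
        std::fflush(stdout);
        while (i == 0)
            sleep(5);  // in gdb: up; set var i = 1; continue
    }

Attaching is then 'gdb -p <pid>' in each terminal with the two printed pids, as in the steps above.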
<hkaiser> can't you simply launch a second terminal from your main one?
<weilewei> No, I don't think so. Every time I open a new terminal, I need to log in to Summit again. But let me check if there is any way
<weilewei> hkaiser thanks, let me read through and try it out
<weilewei> "SSH multiplexing is disabled on all of the OLCF’s user-facing systems."
<weilewei> I also tried a Stack Overflow solution; it seems I cannot get two terminals together