hkaiser changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/ | GSoC: https://github.com/STEllAR-GROUP/hpx/wiki/Google-Summer-of-Code-%28GSoC%29-2020
diehlpk has joined #ste||ar
diehlpk has quit [Changing host]
diehlpk has joined #ste||ar
<diehlpk> hkaiser, I think I was able to fix the HPX issue on Cori
<diehlpk> Level 11 on 512 nodes finished
<diehlpk> Let us wait for 256 and 1024
<hkaiser> diehlpk: what did you change?
<diehlpk> Dominic merged the radiation code, and I used a commit from before that merge
<diehlpk> We will see what happens to the remaining jobs
<hkaiser> interesting, so it's caused by octotiger?
<diehlpk> I am not sure, since a simple hello world fails on QB
<hkaiser> right
<diehlpk> Maybe the different version does not have the race condition
<diehlpk> We have to debug this after the SC paper
<hkaiser> right
<diehlpk> Right now, I am happy that things seem to work again
hkaiser has quit [Quit: bye]
diehlpk has quit [Ping timeout: 240 seconds]
weilewei has quit [Remote host closed the connection]
nikunj has joined #ste||ar
nikunj has quit [Remote host closed the connection]
nikunj has joined #ste||ar
jaafar has quit [Ping timeout: 256 seconds]
nikunj97 has joined #ste||ar
nikunj97 has quit [Read error: Connection reset by peer]
<heller1> So there's a proper bridge now?
Hashmi has joined #ste||ar
nikunj has quit [Remote host closed the connection]
nikunj has joined #ste||ar
hkaiser has joined #ste||ar
nikunj has quit [Ping timeout: 240 seconds]
nikunj has joined #ste||ar
ronneigandhi has joined #ste||ar
ronneigandhi has left #ste||ar [#ste||ar]
nikunj has quit [Read error: Connection reset by peer]
nikunj has joined #ste||ar
K-ballo has quit [Remote host closed the connection]
K-ballo has joined #ste||ar
hkaiser has quit [Quit: bye]
hkaiser has joined #ste||ar
ct-clmsn has joined #ste||ar
hkaiser has quit [Ping timeout: 240 seconds]
hkaiser has joined #ste||ar
diehlpk has joined #ste||ar
diehlpk has quit [Changing host]
diehlpk has joined #ste||ar
ct-clmsn has quit [Ping timeout: 268 seconds]
<Yorlik> hkaiser: yt?
<Yorlik> The solution to the last problem was easy in the end: just store all instances of LuaState pools in a static member of the class, so I can access everything without knowing which threads are actually using them. After stopping the server, this now allows me to safely kill all LuaStates and reload the scripts.
ct-clmsn has joined #ste||ar
ct-clmsn is now known as Guest32456
Guest32456 has quit [Quit: Leaving]
weilewei has joined #ste||ar
weilewei has quit [Remote host closed the connection]
<hkaiser> Yorlik: sure, but this requires synchronization, doesn't it?
<Yorlik> Yes - the operation can only be performed when the server is paused and no Lua states are being used. But that's easy to ensure.
<hkaiser> what I meant is that you have to grab a lock or something in order to create a new state engine
<Yorlik> I am just leaving the update loop, which guarantees that.
<hkaiser> or to get one
<Yorlik> The pools have a lock
<Yorlik> But when an engine is handed out, it has left the pool, and there is no concurrency until it is returned. The sections that need locking are extremely short, one-liner code sequences.
<Yorlik> It's just a pointer being pushed back into the vector.
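A minimal sketch of the pool scheme described above, using hypothetical names (LuaState, LuaStatePool) rather than Yorlik's actual code: a static registry of all pools lets a maintenance pass reach every state once the update loop has stopped, and the only locked operations are the one-line pop and push on the vector.

    // Hypothetical sketch, not the actual code: a pool of Lua states
    // with a static registry so that a paused server can reach and
    // reset every state without knowing which threads used them.
    #include <memory>
    #include <mutex>
    #include <vector>

    struct LuaState { /* wraps a lua_State*, loaded scripts, etc. */ };

    class LuaStatePool {
    public:
        LuaStatePool() { all_pools_.push_back(this); }  // assumed to run at startup

        // Handing out an engine: the lock covers only the one-line pop.
        std::unique_ptr<LuaState> acquire() {
            std::lock_guard<std::mutex> lk(mtx_);
            if (states_.empty())
                return std::make_unique<LuaState>();
            auto s = std::move(states_.back());
            states_.pop_back();
            return s;  // no concurrency on s until it is returned
        }

        // Returning it: just a pointer pushed back into the vector.
        void release(std::unique_ptr<LuaState> s) {
            std::lock_guard<std::mutex> lk(mtx_);
            states_.push_back(std::move(s));
        }

        // Only safe while the server is paused and no state is handed out.
        static void reset_all() {
            for (LuaStatePool* pool : all_pools_) {
                std::lock_guard<std::mutex> lk(pool->mtx_);
                pool->states_.clear();  // kill all LuaStates; scripts reload afterwards
            }
        }

    private:
        std::mutex mtx_;
        std::vector<std::unique_ptr<LuaState>> states_;
        static std::vector<LuaStatePool*> all_pools_;  // the static in the class
    };

    std::vector<LuaStatePool*> LuaStatePool::all_pools_;

Handing the state out via unique_ptr makes the "no concurrency while checked out" property explicit in the ownership.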
rtohid has quit [Quit: Konversation terminated!]
gonidelis has joined #ste||ar
jaafar has joined #ste||ar
gonidelis has quit [Ping timeout: 240 seconds]
hkaiser has quit [Quit: bye]
Abhishek09 has joined #ste||ar
Abhishek09 has quit [Remote host closed the connection]
diehlpk has quit [Ping timeout: 256 seconds]
weilewei has joined #ste||ar
hkaiser has joined #ste||ar
<weilewei> hkaiser if I get the following bug when using GPUDirect, what would you suggest? https://gist.github.com/weilewei/d1998501e4be86a72b1bf3307ed470e1
<hkaiser> looks like a double delete
<hkaiser> weilewei: ^^
<hkaiser> or an attempt to delete a pointer that was not returned by that allocator
<weilewei> ok... I see, hmmm not sure why it happens
<weilewei> hkaiser
<hkaiser> weilewei: could be anything, really - the result of a wild write corrupting the allocator's internal memory somewhere
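For reference, the two failure modes hkaiser names, as a deliberately minimal sketch (the offending lines are commented out because both are undefined behavior; none of this is the actual code from the gist):

    // Two ways to trigger an abort like the one in the gist.
    int main() {
        int* p = new int(42);
        delete p;
        // delete p;     // (1) double delete: p was already freed

        int local = 0;
        int* q = &local;
        // delete q;     // (2) deleting a pointer the allocator never returned

        return 0;
    }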
<weilewei> The debugger doesn't work; the Arm Forge debugger cannot read the debugging info. Not sure how to use gdb to debug a 2-MPI-rank run
<weilewei> If I run this program with 1 MPI rank, it is fine
<hkaiser> weilewei: is it reproducible?
<weilewei> only reproducible on Summit, I guess, which has NVLink and CUDA GPUs
<hkaiser> using gdb for a 2-rank run: mpirun -n 2 gdb --args ./your-executable <your command line args>
<hkaiser> wait for the loop to hit, attach gdb in two new terminal windows, set the variable to 1 and continue
<weilewei> hkaiser when I debug with gdb, I get a hang: https://gist.github.com/weilewei/d1998501e4be86a72b1bf3307ed470e1#gistcomment-3211575
<hkaiser> weilewei: well, quit one of the instances
<hkaiser> sorry, I meant: you quit one of the instances
<weilewei> hkaiser one of the instances quits at the time I start the debugger
<hkaiser> attaching gdb might be the better solution
<hkaiser> (see above)
<weilewei> hkaiser not sure how to attach the gdb debugger exactly; I need to learn a bit more
<hkaiser> well I tried to explain above
Amy1 has quit [Ping timeout: 272 seconds]
<hkaiser> add that code at startup
<hkaiser> then wait for the loop to hit on both instances
<hkaiser> it will print the pid
<weilewei> hkaiser ok, I see, let me try
<hkaiser> then start gdb -p <pid> twice in two terminal windows with the two pids
Amy1 has joined #ste||ar
<hkaiser> both will likely sit inside the sleep, so go up one stack frame and set the variable to 1
<weilewei> hkaiser that means I need two terminal windows on the same interactive compute node
<hkaiser> with 'set var i = 1'
<weilewei> I doubt Summit allows me to do so
<hkaiser> then 'continue'
<hkaiser> and you will run both instances in gdb
<hkaiser> weilewei: yes
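The startup code hkaiser refers to is the usual MPI attach loop; a sketch under that assumption (not taken from the actual application source):

    // Add near the top of main(): each rank prints its pid, then spins
    // until a debugger attaches and clears i.
    #include <cstdio>
    #include <unistd.h>

    static void wait_for_debugger() {
        volatile int i = 0;
        char host[256];
        gethostname(host, sizeof(host));
        std::printf("pid %d on %s waiting for debugger\n", (int) getpid(), host);
        std::fflush(stdout);
        while (i == 0)
            sleep(5);  // in gdb: up; set var i = 1; continue
    }

Attaching is then 'gdb -p <pid>' in each terminal with the two printed pids, as in the steps above.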
<hkaiser> can't you simply launch a second terminal from your main one?
<weilewei> No, I don't think so. Every time I open a new terminal, I need to log in to Summit again. But let me check if there is any way
<weilewei> hkaiser thanks, let me read through and try it out
<weilewei> "SSH multiplexing is disabled on all of the OLCF’s user-facing systems."
<weilewei> I also tried a Stack Overflow solution; it seems I cannot get two terminals together