#ste||ar on 2019-03-25 — irc logs at irclog.cct.lsu.edu

2018-08-26 23:03 hkaiser changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/

02:25 hkaiser has quit [Quit: bye]

05:48 <heller> jbjnr__: Aha, that makes sense, sounds solvable as well

08:07 david_pfander has joined #ste||ar

08:32 nikunj has quit [Quit: Leaving]

08:39 Yorlik has quit [Read error: Connection reset by peer]

08:48 nikunj has joined #ste||ar

09:16 nikunj has quit [Ping timeout: 246 seconds]

09:19 nikunj has joined #ste||ar

09:41 Yorlik has joined #ste||ar

10:00 daissgr has quit [Quit: WeeChat 1.9.1]

10:14 nikunj has quit [Quit: Leaving]

11:18 hkaiser has joined #ste||ar

12:11 <hkaiser> simbergm: what's going on with the doc build, it fails on master?

12:25 K-ballo has quit [Quit: K-ballo]

12:30 K-ballo has joined #ste||ar

13:16 hkaiser has quit [Quit: bye]

13:18 bibek has joined #ste||ar

13:26 aserio has joined #ste||ar

13:56 eschnett has joined #ste||ar

14:12 <diehlpk_work> jbjnr__, I did a scaling run with CPU only and it looks to not too bad

14:12 <diehlpk_work> problem size was just too small for 16 and 32 nodes

14:17 nikunj has joined #ste||ar

14:20 <jbjnr__> diehlpk_work: how many nodes have you gone up to? my 1024-4096 node libfabric tests have been sitting in the queue for 24 hours or so

14:21 <jbjnr__> I got one run that finished on 4096 nodes and had 4TB/s transfer :)

14:21 <diehlpk_work> jbjnr__, I only was going up to 32 nodes since my input was too small

14:21 <diehlpk_work> Dominic is preparing larger files and I will increase nodes

14:22 <jbjnr__> heller: ^^ that's a new record btw. 4TB/s - previous best was 2 years ago when I got 2TB/s

14:22 <diehlpk_work> But 4TB/s is awesome and could help for the next operation bell

14:23 <jbjnr__> still got 74 jobs in the queue, so my graph is not complete yet.

14:23 <jbjnr__> ^graphs

14:23 <heller> jbjnr__: weeh, congrats!

14:23 <heller> jbjnr__: all deadlocks resolved now?

14:24 <heller> can you share what you got so far?

14:25 <jbjnr__> I had some deadlocks last night. today, I removed a bunch more yield_while snippets that might get called from a background task and I'm hopeful that now the resubmitted jobs will complete. The fact that I had 2 4096 runs finish without deadlocks gives me hope

14:25 <jbjnr__> that this time it might be fixed

14:25 <jbjnr__> until the jobs run, I just don't know

14:25 <jbjnr__> there's only so far you can test on a laptop :)

14:27 parsa_ is now known as parsa

14:27 <jbjnr__> heller: https://pasteboard.co/I746xvB.png is what I have so far

14:28 eschnett has quit [Quit: eschnett]

14:29 <jbjnr__> can't remember what the bisection bandwidth is on daint, but we must be getting close

14:30 <jbjnr__> hmmm. says "Peak Network Bisection Bandwidth

14:30 <jbjnr__> 33 TB/s" not sure I believe that.

14:31 akheir has quit [Quit: Konversation terminated!]

14:34 <heller> peak bisection bandwidth is still something different to random read/writes to nodes

14:35 akheir has joined #ste||ar

14:35 <heller> the constant dynamic routing that's happening there etc

14:39 <jbjnr__> yup

14:51 <diehlpk_work> jbjnr__, Your plots look promising for the paper

15:11 <jbjnr__> diehlpk_work: if the deadlock is fixed, then I'll need a big test to run with MPI on say 256/512 nodes and then run with LF to see if there's a differnce. If there is and it is significant, then I'll setup a load of tests to mimic any scaling runs you do with MPI.

15:14 <diehlpk_work> jbjnr__, We are working on this right now

15:15 <diehlpk_work> Dominic should have different level of refinements soon and I will play around to find some sufficient amount of sub grids per node so we have enough work to feed the GPUs

15:46 nikunj has quit [Quit: Leaving]

15:57 daissgr has joined #ste||ar

16:25 hkaiser has joined #ste||ar

16:56 eschnett has joined #ste||ar

17:16 david_pfander has quit [Ping timeout: 250 seconds]

17:33 aserio has quit [Ping timeout: 250 seconds]

17:35 diehlpk_work has quit [Remote host closed the connection]

17:39 <simbergm> hkaiser: yes :/ it's YARF (yet another random failure), I don't know why yet

17:40 <simbergm> maybe I should disable it again until I have time to look at it properly

17:42 diehlpk_work has joined #ste||ar

17:49 eschnett has quit [Quit: eschnett]

17:52 <diehlpk_work> jbjnr__, Any nes about the downtime of daint?

17:52 <diehlpk_work> *news

18:28 <jbjnr__> diehlpk_work: seems like no planned maintenace until next month. still no announcments of downtime etc

18:28 <diehlpk_work> Ok, good

18:28 hkaiser has quit [Quit: bye]

18:39 aserio has joined #ste||ar

18:57 david_pfander has joined #ste||ar

19:18 akheir has quit [Remote host closed the connection]

20:26 parsa is now known as parsa_

20:47 hkaiser has joined #ste||ar

21:11 Vir has quit [Ping timeout: 250 seconds]

21:46 <diehlpk_work> hkaiser, They finish the paper work and will send the loan agreement to LSU tomorrow

21:47 <diehlpk_work> Depending how fast LSU will be, they can ship the PI cluster

21:47 <diehlpk_work> And did Michael respond to you and might forgot the cc Adrian and me

22:00 aserio has quit [Ping timeout: 250 seconds]

22:03 <hkaiser> diehlpk_work: no, I have not received any response from Michael yet

22:12 aserio has joined #ste||ar

22:33 aserio has quit [Quit: aserio]

22:40 eschnett has joined #ste||ar

22:58 eschnett has quit [Quit: eschnett]

23:02 daissgr1 has joined #ste||ar

23:24 eschnett has joined #ste||ar

23:29 eschnett has quit [Quit: eschnett]

23:42 eschnett has joined #ste||ar