hkaiser changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/
hkaiser has quit [Quit: bye]
<heller> jbjnr__: Aha, that makes sense, sounds solvable as well
david_pfander has joined #ste||ar
nikunj has quit [Quit: Leaving]
Yorlik has quit [Read error: Connection reset by peer]
nikunj has joined #ste||ar
nikunj has quit [Ping timeout: 246 seconds]
nikunj has joined #ste||ar
Yorlik has joined #ste||ar
daissgr has quit [Quit: WeeChat 1.9.1]
nikunj has quit [Quit: Leaving]
hkaiser has joined #ste||ar
<hkaiser> simbergm: what's going on with the doc build, it fails on master?
K-ballo has quit [Quit: K-ballo]
K-ballo has joined #ste||ar
hkaiser has quit [Quit: bye]
bibek has joined #ste||ar
aserio has joined #ste||ar
eschnett has joined #ste||ar
<diehlpk_work> jbjnr__, I did a scaling run with CPU only and it looks not too bad
<diehlpk_work> problem size was just too small for 16 and 32 nodes
nikunj has joined #ste||ar
<jbjnr__> diehlpk_work: how many nodes have you gone up to? my 1024-4096 node libfabric tests have been sitting in the queue for 24 hours or so
<jbjnr__> I got one run that finished on 4096 nodes and had 4TB/s transfer :)
<diehlpk_work> jbjnr__, I only went up to 32 nodes since my input was too small
<diehlpk_work> Dominic is preparing larger files and I will increase nodes
<jbjnr__> heller: ^^ that's a new record btw. 4TB/s - previous best was 2 years ago when I got 2TB/s
<diehlpk_work> But 4TB/s is awesome and could help for the next operation bell
<jbjnr__> still got 74 jobs in the queue, so my graph is not complete yet.
<jbjnr__> ^graphs
<heller> jbjnr__: weeh, congrats!
<heller> jbjnr__: all deadlocks resolved now?
<heller> can you share what you got so far?
<jbjnr__> I had some deadlocks last night. today, I removed a bunch more yield_while snippets that might get called from a background task and I'm hopeful that now the resubmitted jobs will complete. The fact that I had two 4096-node runs finish without deadlocks gives me hope
<jbjnr__> that this time it might be fixed
<jbjnr__> until the jobs run, I just don't know
<jbjnr__> there's only so far you can test on a laptop :)
parsa_ is now known as parsa
<jbjnr__> heller: https://pasteboard.co/I746xvB.png is what I have so far
eschnett has quit [Quit: eschnett]
<jbjnr__> can't remember what the bisection bandwidth is on daint, but we must be getting close
<jbjnr__> hmmm. says "Peak Network Bisection Bandwidth
<jbjnr__> 33 TB/s" not sure I believe that.
akheir has quit [Quit: Konversation terminated!]
<heller> peak bisection bandwidth is still something different to random read/writes to nodes
akheir has joined #ste||ar
<heller> the constant dynamic routing that's happening there etc
<jbjnr__> yup
<diehlpk_work> jbjnr__, Your plots look promising for the paper
<jbjnr__> diehlpk_work: if the deadlock is fixed, then I'll need a big test to run with MPI on say 256/512 nodes and then run with LF to see if there's a difference. If there is and it is significant, then I'll set up a load of tests to mimic any scaling runs you do with MPI.
<diehlpk_work> jbjnr__, We are working on this right now
<diehlpk_work> Dominic should have different levels of refinement soon and I will play around to find a sufficient number of sub-grids per node so we have enough work to feed the GPUs
nikunj has quit [Quit: Leaving]
daissgr has joined #ste||ar
hkaiser has joined #ste||ar
eschnett has joined #ste||ar
david_pfander has quit [Ping timeout: 250 seconds]
aserio has quit [Ping timeout: 250 seconds]
diehlpk_work has quit [Remote host closed the connection]
<simbergm> hkaiser: yes :/ it's YARF (yet another random failure), I don't know why yet
<simbergm> maybe I should disable it again until I have time to look at it properly
diehlpk_work has joined #ste||ar
eschnett has quit [Quit: eschnett]
<diehlpk_work> jbjnr__, Any nes about the downtime of daint?
<diehlpk_work> *news
<jbjnr__> diehlpk_work: seems like no planned maintenance until next month. still no announcements of downtime etc
<diehlpk_work> Ok, good
hkaiser has quit [Quit: bye]
aserio has joined #ste||ar
david_pfander has joined #ste||ar
akheir has quit [Remote host closed the connection]
parsa is now known as parsa_
hkaiser has joined #ste||ar
Vir has quit [Ping timeout: 250 seconds]
<diehlpk_work> hkaiser, They are finishing the paperwork and will send the loan agreement to LSU tomorrow
<diehlpk_work> Depending on how fast LSU is, they can ship the PI cluster
<diehlpk_work> And did Michael respond to you and maybe forget to cc Adrian and me?
aserio has quit [Ping timeout: 250 seconds]
<hkaiser> diehlpk_work: no, I have not received any response from Michael yet
aserio has joined #ste||ar
aserio has quit [Quit: aserio]
eschnett has joined #ste||ar
eschnett has quit [Quit: eschnett]
daissgr1 has joined #ste||ar
eschnett has joined #ste||ar
eschnett has quit [Quit: eschnett]
eschnett has joined #ste||ar