hkaiser changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/
hkaiser has quit [Quit: bye]
<heller>
jbjnr__: Aha, that makes sense, sounds solvable as well
david_pfander has joined #ste||ar
nikunj has quit [Quit: Leaving]
Yorlik has quit [Read error: Connection reset by peer]
nikunj has joined #ste||ar
nikunj has quit [Ping timeout: 246 seconds]
nikunj has joined #ste||ar
Yorlik has joined #ste||ar
daissgr has quit [Quit: WeeChat 1.9.1]
nikunj has quit [Quit: Leaving]
hkaiser has joined #ste||ar
<hkaiser>
simbergm: what's going on with the doc build? it's failing on master
K-ballo has quit [Quit: K-ballo]
K-ballo has joined #ste||ar
hkaiser has quit [Quit: bye]
bibek has joined #ste||ar
aserio has joined #ste||ar
eschnett has joined #ste||ar
<diehlpk_work>
jbjnr__, I did a CPU-only scaling run and it doesn't look too bad
<diehlpk_work>
problem size was just too small for 16 and 32 nodes
nikunj has joined #ste||ar
<jbjnr__>
diehlpk_work: how many nodes have you gone up to? my 1024-4096 node libfabric tests have been sitting in the queue for 24 hours or so
<jbjnr__>
I got one run that finished on 4096 nodes and had 4TB/s transfer :)
<diehlpk_work>
jbjnr__, I only went up to 32 nodes since my input was too small
<diehlpk_work>
Dominic is preparing larger files and I will increase nodes
<jbjnr__>
heller: ^^ that's a new record btw. 4TB/s - previous best was 2 years ago when I got 2TB/s
<diehlpk_work>
But 4TB/s is awesome and could help for the next operation bell
<jbjnr__>
still got 74 jobs in the queue, so my graphs are not complete yet.
<heller>
jbjnr__: weeh, congrats!
<heller>
jbjnr__: all deadlocks resolved now?
<heller>
can you share what you got so far?
<jbjnr__>
I had some deadlocks last night. Today I removed a bunch more yield_while snippets that might get called from a background task, and I'm hopeful that the resubmitted jobs will now complete. The fact that I had two 4096-node runs finish without deadlocks gives me hope
<jbjnr__>
that this time it might be fixed
<jbjnr__>
until the jobs run, I just don't know
<jbjnr__>
there's only so far you can test on a laptop :)
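A minimal sketch (assuming hpx::util::yield_while semantics and hypothetical names, not code from this log) of the pattern being removed above: a spin-wait reachable from background work can stall the very worker that would make its predicate flip, which is one way these deadlocks arise at scale.

    // Sketch only; names are hypothetical.
    // hpx::util::yield_while yields the current HPX thread while the predicate
    // returns true; the header path has varied across HPX releases.
    #include <hpx/util/yield_while.hpp>
    #include <atomic>

    std::atomic<bool> data_ready{false};   // hypothetical flag set by another task

    // Problematic shape: if the worker running this background (e.g. parcel
    // handling) task is also the one that must process the message that sets
    // data_ready, the predicate never becomes false and the runtime can deadlock.
    void background_task()
    {
        hpx::util::yield_while([] { return !data_ready.load(); });
    }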
<heller>
peak bisection bandwidth is still something different from random reads/writes to nodes
akheir has joined #ste||ar
<heller>
the constant dynamic routing that's happening there etc
<jbjnr__>
yup
<diehlpk_work>
jbjnr__, Your plots look promising for the paper
<jbjnr__>
diehlpk_work: if the deadlock is fixed, then I'll need a big test to run with MPI on, say, 256/512 nodes and then run with LF to see if there's a difference. If there is and it's significant, then I'll set up a load of tests to mimic any scaling runs you do with MPI.
<diehlpk_work>
jbjnr__, We are working on this right now
<diehlpk_work>
Dominic should have different levels of refinement soon and I will play around to find a sufficient number of sub-grids per node so we have enough work to feed the GPUs
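A minimal sketch (not from the log) of pinning a run to a single parcelport for the MPI vs. libfabric comparison discussed above; the hpx.parcel.*.enable keys and the hpx::init overload taking a cfg vector are assumptions about the HPX build and version in use, and they only take effect if HPX was compiled with the corresponding parcelport.

    // Sketch only: force one transport per run when comparing MPI and libfabric.
    // Roughly equivalent on the command line: --hpx:ini=hpx.parcel.mpi.enable=0
    #include <hpx/hpx_init.hpp>
    #include <string>
    #include <vector>

    int hpx_main(int argc, char* argv[])
    {
        // ... application work under test ...
        return hpx::finalize();
    }

    int main(int argc, char* argv[])
    {
        std::vector<std::string> cfg = {
            "hpx.parcel.libfabric.enable=1",   // swap with the mpi key for MPI runs
            "hpx.parcel.mpi.enable=0",
        };
        return hpx::init(argc, argv, cfg);
    }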
nikunj has quit [Quit: Leaving]
daissgr has joined #ste||ar
hkaiser has joined #ste||ar
eschnett has joined #ste||ar
david_pfander has quit [Ping timeout: 250 seconds]
aserio has quit [Ping timeout: 250 seconds]
diehlpk_work has quit [Remote host closed the connection]
<simbergm>
hkaiser: yes :/ it's YARF (yet another random failure), I don't know why yet
<simbergm>
maybe I should disable it again until I have time to look at it properly
diehlpk_work has joined #ste||ar
eschnett has quit [Quit: eschnett]
<diehlpk_work>
jbjnr__, Any news about the downtime of daint?
<jbjnr__>
diehlpk_work: seems like no planned maintenance until next month. still no announcements of downtime etc
<diehlpk_work>
Ok, good
hkaiser has quit [Quit: bye]
aserio has joined #ste||ar
david_pfander has joined #ste||ar
akheir has quit [Remote host closed the connection]
parsa is now known as parsa_
hkaiser has joined #ste||ar
Vir has quit [Ping timeout: 250 seconds]
<diehlpk_work>
hkaiser, They finished the paperwork and will send the loan agreement to LSU tomorrow
<diehlpk_work>
Depending on how fast LSU is, they can ship the PI cluster
<diehlpk_work>
And did Michael respond to you and maybe forgot to cc Adrian and me?
aserio has quit [Ping timeout: 250 seconds]
<hkaiser>
diehlpk_work: no, I have not received any response from Michael yet