hkaiser changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/
jaafar has joined #ste||ar
jaafar has quit [Client Quit]
jaafar has joined #ste||ar
jaafar has quit [Client Quit]
jaafar has joined #ste||ar
K-ballo has quit [Quit: K-ballo]
hkaiser has quit [Quit: bye]
<jbjnr__> heller: simbergm diehlpk_work results for nodes N=2 up to N=1024 came out almost perfect on latest libfabric test run. Still waiting for N=2048 and N=4096, but expecting some lockups still on those runs. Fingers crossed https://pasteboard.co/I7tEENd.png
david_pfander has joined #ste||ar
<jbjnr__> turns out I had the wrong settings on those passing tests and they were very short ones. They probably won't pass when I rerun them with longer, more intensive tests :(
<heller> hmmm
hkaiser has joined #ste||ar
K-ballo has joined #ste||ar
hkaiser has quit [Quit: bye]
<parsa> is there a PDF or single-page HTML version of the new HPX docs available?
eschnett has joined #ste||ar
aserio has joined #ste||ar
<zao> Sphinx itself has a "singlehtml" builder, and can apparently emit LaTeX for some sort of PDF generation.
aserio has quit [Ping timeout: 250 seconds]
<simbergm> those are the up-to-date ones (except that the documentation fails to build sometimes... bah!)
<diehlpk_work> jbjnr__, I collected all scaling results for up to 1028 nodes
<diehlpk_work> Everything worked, and now it's hanging
<jbjnr__> great. plots?
<jbjnr__> oh :(
<diehlpk_work> I have to check why 2048 had issues
<diehlpk_work> I will work on the plots
<diehlpk_work> At least we will have scaling with respect to time
<diehlpk_work> we found a bug for the gpu counters and need to rerun for the flops
<parsa> simbergm: brilliant! many thanks!
hkaiser has joined #ste||ar
<parsa> simbergm: should they not be on http://stellar.cct.lsu.edu/docs/ ?
<parsa> i mean listed there
<simbergm> parsa: they should, and I *just* logged in to fix that ;)
<zao> 1028, that's an interesting node count ;P
aserio has joined #ste||ar
<diehlpk_work> jbjnr__, hkaiser https://pasteboard.co/I7wBDJ7.png Total Time for Level 14 and Level 15 up to 1024 nodes
<diehlpk_work> and speedup for the two levels. The ideal speedup is only relevant for the blue line https://pasteboard.co/I7wC8AU.png
<diehlpk_work> Was too lazy to make a third plot
hkaiser has quit [Quit: bye]
RostamLog has joined #ste||ar
<jbjnr__> heller: yt?
<diehlpk_work> zao, 1024
aserio has quit [Ping timeout: 264 seconds]
hkaiser has joined #ste||ar
<jbjnr__> hkaiser: yt?
<hkaiser> jbjnr__: here
<jbjnr__> hkaiser: I've got a very occasional case where the parcelport has ranks 0,1,2,3...N assigned to nodes, but the main hpx program has the ranks assigned differently. Where does hpx get its rank information from? I must have missed somewhere in the code where I handle the locality setup
<jbjnr__> the libfabric PP uses an address vector where the addresses are indexed by rank - which I assign during bootup, but I am not sure where the rest of hpx gets its rank assignment from
<hkaiser> jbjnr__: hpx assigns the rank in the order the localities register
<jbjnr__> ok, found code using this stuff void addressing_service::set_local_locality(naming::gid_type const& g)
<jbjnr__> I will investigate
<hkaiser> happens during bbb registration
<hkaiser> hold on, I'll show you
<jbjnr__> k
<diehlpk_work> jbjnr__, Can I submit a full system run in the large queue or do I need to do something special?
<diehlpk_work> I would need 1.5 hours for the full system
<jbjnr__> diehlpk_work: for a full machine run you need to contact Maria Grazia I think
<hkaiser> this assigns a 'prefix' i.e. rank
<jbjnr__> hkaiser: thanks. I will check it
<diehlpk_work> Ok, if we have good results for 2048 and 4096 I will ask her
<hkaiser> jbjnr__: that calls into the AGAS locality namespace: https://github.com/STEllAR-GROUP/hpx/blob/master/src/runtime/agas/addressing_service.cpp#L318
<hkaiser> simbergm: yt?
<hkaiser> simbergm: please have a look at CircleCI tests for #3759, that has a strange assertion related to the scheduler masks
<jbjnr__> hkaiser: he's away this week at meetings in the USA. might not be logged in much
<hkaiser> ahh ok
<hkaiser> is he at Sandia?
<jbjnr__> I think it's colorado or somewhere. I can't remember
<hkaiser> k
eschnett_ has joined #ste||ar
eschnett has quit [Ping timeout: 246 seconds]
eschnett_ is now known as eschnett
aserio has joined #ste||ar
<simbergm> hkaiser yep saw it and will take a look
aserio has quit [Ping timeout: 250 seconds]
<jbjnr__> hkaiser: related question - in the mpi parcelport, are the mpi ranks always the same as the hpx locality id?
<diehlpk_work> DAINT Usage: 4,799 NODE HOURS (NH) Quota: 9,000 NH 53.3%
<diehlpk_work> We fired 53% of our node hours
<hkaiser> jbjnr__ good question
<hkaiser> we either don't care or do special things
<hkaiser> I believe that is what the preferred_prefix is for, i.e. the locality tells the root please use this number as my rank
<hkaiser> sec, let me try to understand what we do
<hkaiser> jbjnr__ I almost think we don't care
diehlpk has joined #ste||ar
<diehlpk> jbjnr__, Meeting?
<jbjnr__> hkaiser: yes. I'm thinking it doesn't actually matter if the ordering is different as long as the locality object for the PP is valid
<jbjnr__> diehlpk: meeting? Oh no, sorry. Daughter's birthday today, forgot about it
<jbjnr__> diehlpk: node hours will be reset on 1st august
<jbjnr__> it's a quarterly quota
diehlpk has quit [Remote host closed the connection]
daissgr has joined #ste||ar
diehlpk has joined #ste||ar
<diehlpk> jbjnr__, Could we get another 9000 node hours?
<jbjnr__> on 1st april
<diehlpk> Is it for sure?
<diehlpk> So we should burn all of the remaining hours now
<parsa> daissgr: see if this directory still exists: home/users/khuck/src/operation-gordon-bell
daissgr has quit [Quit: WeeChat 1.9.1]
david_pfander has quit [Ping timeout: 250 seconds]
diehlpk has quit [Ping timeout: 250 seconds]
daissgr has joined #ste||ar
daissgr1 has joined #ste||ar
<daissgr> parsa: The folder does not exist anymore unfortunately
daissgr1 has quit [Client Quit]
aserio has joined #ste||ar
<jbjnr__> diehlpk_work: yes. we have 36000 node hours, but 9000 per quarter. We are a development project https://www.cscs.ch/user-lab/allocation-schemes/development-projects/
<diehlpk_work> Cool, so I can burn more
<jbjnr__> so we should use all the remaining node hours over the next couple of days
aserio has quit [Ping timeout: 250 seconds]
hkaiser has quit [Quit: bye]
aserio has joined #ste||ar
aserio1 has joined #ste||ar
aserio has quit [Ping timeout: 245 seconds]
aserio1 is now known as aserio
<diehlpk_work> jbjnr__, This is not a problem. My 2048 and 4096 run will eat them once they are finished
hkaiser has joined #ste||ar
aserio has quit [Quit: aserio]
eschnett has quit [Quit: eschnett]