hkaiser changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/
jaafar has joined #ste||ar
jaafar has quit [Client Quit]
jaafar has joined #ste||ar
jaafar has quit [Client Quit]
jaafar has joined #ste||ar
K-ballo has quit [Quit: K-ballo]
hkaiser has quit [Quit: bye]
<jbjnr__> heller: simbergm diehlpk_work results for nodes N=2 up to N=1024 came out almost perfect on latest libfabric test run. Still waiting for N=2048 and N=4096, but expecting some lockups still on those runs. Fingers crossed https://pasteboard.co/I7tEENd.png
david_pfander has joined #ste||ar
<jbjnr__> turns out I had the wrong settings on those passing tests and they were very short ones. They probably won't pass when I rerun them with longer, more intensive tests :(
<heller> hmmm
hkaiser has joined #ste||ar
K-ballo has joined #ste||ar
hkaiser has quit [Quit: bye]
<parsa> is there a PDF or single-page HTML version of the new HPX docs available?
eschnett has joined #ste||ar
aserio has joined #ste||ar
<zao> Sphinx itself has a "singlehtml" builder, and can apparently emit LaTeX for some sort of PDF generation.
aserio has quit [Ping timeout: 250 seconds]
<simbergm> those are the up-to-date ones (except that the documentation fails to build sometimes... bah!)
<diehlpk_work> jbjnr__, I collected all scaling results for up to 1028 nodes
<diehlpk_work> Everything worked, and now it's hanging
<jbjnr__> great. plots?
<jbjnr__> oh :(
<diehlpk_work> I have to check why 2048 had issues
<diehlpk_work> I will work on the plots
<diehlpk_work> At least we will have scaling with respect to time
<diehlpk_work> we found a bug for the gpu counters and need to rerun for the flops
<parsa> simbergm: brilliant! many thanks!
hkaiser has joined #ste||ar
<parsa> simbergm: should they not be on http://stellar.cct.lsu.edu/docs/ ?
<parsa> i mean listed there
<simbergm> parsa: they should, and I *just* logged in to fix that ;)
<zao> 1028, that's an interesting node count ;P
aserio has joined #ste||ar
<diehlpk_work> jbjnr__, hkaiser https://pasteboard.co/I7wBDJ7.png Total Time for Level 14 and Level 15 up to 1024 nodes
<diehlpk_work> and speedup for the two levels. The ideal speedup is only relevant for the blue line https://pasteboard.co/I7wC8AU.png
<diehlpk_work> Was too lazy to make a third plot
hkaiser has quit [Quit: bye]
RostamLog has joined #ste||ar
<jbjnr__> heller: yt?
<diehlpk_work> zao, 1024
aserio has quit [Ping timeout: 264 seconds]
hkaiser has joined #ste||ar
<jbjnr__> hkaiser: yt?
<hkaiser> jbjnr__: here
<jbjnr__> hkaiser: I've got a very occasional case where the parcelport has ranks 0,1,2,3...N assigned to nodes, but the main hpx program has the ranks assigned differently. Where does hpx get its rank information from? I must have missed somewhere in the code where I handle the locality setup
<jbjnr__> the libfabric PP uses an address vector where the addresses are indexed by rank - which I assign during bootup, but I am not sure where the rest of hpx gets its rank assignment from
<hkaiser> jbjnr__: hpx assigns the rank in the order the localities register
<jbjnr__> ok, found code using this stuff void addressing_service::set_local_locality(naming::gid_type const& g)
<jbjnr__> I will investigate
<hkaiser> happens during bbb registration
<hkaiser> hold on, I'll show you
<jbjnr__> k
<diehlpk_work> jbjnr__, Can I submit a full system run in the large queue or do I need to do something special?
<diehlpk_work> I would need 1.5 hours for the full system
<jbjnr__> diehlpk_work: for a full machine run you need to contact Maria Grazia I think
<hkaiser> this assigns a 'prefix' i.e. rank
<jbjnr__> hkaiser: thanks. I will check it
<diehlpk_work> Ok, if we have good results for 2048 and 4096 I will ask her
<hkaiser> jbjnr__: that calls into the AGAS locality namespace: https://github.com/STEllAR-GROUP/hpx/blob/master/src/runtime/agas/addressing_service.cpp#L318
<hkaiser> simbergm: yt?
<hkaiser> simbergm: please have a look at CircleCI tests for #3759, that has a strange assertion related to the scheduler masks
<jbjnr__> hkaiser: he's away this week at meetings in the USA. might not be logged in much
<hkaiser> ahh ok
<hkaiser> is he at Sandia?
<jbjnr__> I think it's colorado or somewhere. I can't remember
<hkaiser> k
eschnett_ has joined #ste||ar
eschnett has quit [Ping timeout: 246 seconds]
eschnett_ is now known as eschnett
aserio has joined #ste||ar
<simbergm> hkaiser yep saw it and will take a look
aserio has quit [Ping timeout: 250 seconds]
<jbjnr__> hkaiser: related question - in the mpi parcelport, are the mpi ranks always the same as the hpx locality id?
<diehlpk_work> DAINT Usage: 4,799 NODE HOURS (NH) Quota: 9,000 NH 53.3%
<diehlpk_work> We fired 53% of our node hours
<hkaiser> jbjnr__ good question
<hkaiser> we either don't care or do special things
<hkaiser> I believe that is what the preferred_prefix is for, i.e. the locality tells the root please use this number as my rank
<hkaiser> sec, let me try to understand what we do
<hkaiser> jbjnr__ I almost think we don't care
diehlpk has joined #ste||ar
<diehlpk> jbjnr__, Meeting?
<jbjnr__> hkaiser: yes. I'm thinking it doesn't actually matter if the ordering is different as long as the locality object for the PP is valid
<jbjnr__> diehlpk: meeting? Oh no, sorry. Daughter's birthday today, forgot about it
<jbjnr__> diehlpk: node hours will be reset on 1st august
<jbjnr__> it's a quarterly quota
diehlpk has quit [Remote host closed the connection]
daissgr has joined #ste||ar
diehlpk has joined #ste||ar
<diehlpk> jbjnr__, Could we get another 9000 node hours?
<jbjnr__> on 1st april
<diehlpk> Is it for sure?
<diehlpk> So we should burn all of the remaining hours now
<parsa> daissgr: see if this directory still exists: home/users/khuck/src/operation-gordon-bell
daissgr has quit [Quit: WeeChat 1.9.1]
david_pfander has quit [Ping timeout: 250 seconds]
diehlpk has quit [Ping timeout: 250 seconds]
daissgr has joined #ste||ar
daissgr1 has joined #ste||ar
<daissgr> parsa: The folder does not exist anymore unfortunately
daissgr1 has quit [Client Quit]
aserio has joined #ste||ar
<jbjnr__> diehlpk_work: yes. we have 36000 node hours, but 9000 per quarter. We are a development project https://www.cscs.ch/user-lab/allocation-schemes/development-projects/
<diehlpk_work> Cool, so I can burn more
<jbjnr__> so we should use all the remaining node hours over the next couple of days
aserio has quit [Ping timeout: 250 seconds]
hkaiser has quit [Quit: bye]
aserio has joined #ste||ar
aserio1 has joined #ste||ar
aserio has quit [Ping timeout: 245 seconds]
aserio1 is now known as aserio
<diehlpk_work> jbjnr__, This is not a problem. My 2048 and 4096 run will eat them once they are finished
hkaiser has joined #ste||ar
aserio has quit [Quit: aserio]
eschnett has quit [Quit: eschnett]