<parsa>
hkaiser: sorry, apparently i was editing the same spot as you. i'll hold off
<hkaiser>
parsa: no, please go ahead
<hkaiser>
I'm sorry - didn't mean to be disruptive
<hkaiser>
parsa: I'm done with fig 3, please let me know if you want me to change things
bita has joined #ste||ar
<hkaiser>
parsa: fig 2 looks great!
<parsa>
hkaiser: thanks a lot!
<hkaiser>
parsa: I hope fig 3 is what you wanted
<parsa>
i wanted to specifically explain the three configurations of the experiment. what you made is more generic and better. no need to change it
<parsa>
it's more interesting in this form
<hkaiser>
parsa: ok
<parsa>
hkaiser: but does the description of the experiment in 4.1 make sense to you?
<hkaiser>
parsa: I know what you mean, I'd like to go over the description and make it a little less terse, though
<hkaiser>
would you mind me doing that?
<parsa>
not at all
<parsa>
i haven't touched that bit because i ran out of ideas on how to explain them accurately without overloading the reader
<hkaiser>
ok, will do that later tonight
<hkaiser>
parsa: I also owe you a paragraph explaining fig 3
<parsa>
thanks a lot. i'm still doing the runs
<hkaiser>
ok
<hkaiser>
do the results look ok?
<parsa>
don't know yet. i'm still checking whether things are actually running on one node and whether mpirun is working as it used to
<parsa>
yeah, it is working
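A minimal sketch of the kind of sanity check described above, not code from the actual runs: it just prints how many localities an HPX job sees and where it runs, assuming the usual hpx_main.hpp setup; header and function names follow current HPX and are worth double-checking.

    #include <hpx/hpx.hpp>
    #include <hpx/hpx_main.hpp>

    #include <iostream>
    #include <vector>

    int main()
    {
        // one id per locality the runtime knows about; if mpirun only
        // started a single rank this prints 1
        std::vector<hpx::id_type> localities = hpx::find_all_localities();
        std::cout << "number of localities: " << localities.size() << "\n";

        // host name of the locality this code runs on
        std::cout << "running on: " << hpx::get_locality_name() << "\n";
        return 0;
    }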
<parsa>
i think jenkins has died on one of the medusa nodes. the node has been occupied for several hours now
<parsa>
squeue has been showing its state as CG this whole time
<hkaiser>
parsa: ask alireza to reboot the node
<parsa>
waiting on him
<hkaiser>
parsa: does that plot show weak scaling results?
<parsa>
it's strong scaling
<hkaiser>
why isn't it getting faster?
<hkaiser>
the baseline at least
<parsa>
it's stencil_8 with --np=1000 --nx=100
<parsa>
i don't remember ever seeing it do well with strong scaling
<hkaiser>
parsa: yah, the problem is too small to actually scale
<hkaiser>
parsa: then you might want to plot relative performance compared to the baseline instead
<parsa>
i have been attempting to go higher this whole past year but don't want to deal with the mpi crashes
<hkaiser>
sure
<parsa>
higher -> make the problem size go above 512 MB
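Back-of-the-envelope context, assuming stencil_8 stores one double per grid point like the other HPX 1d_stencil examples: --np=1000 --nx=100 is 1000 partitions x 100 points x 8 bytes, roughly 800 KB of grid data per time step, i.e. on the order of 100 KB per node on 8 nodes, so there is almost no computation to hide communication behind. Getting above the 512 MB mentioned here would take on the order of 67 million points in total (e.g. --nx around 67000 with --np=1000).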
<hkaiser>
just plot slowdown rel to baseline
<parsa>
okay
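A hypothetical helper, not the actual plotting pipeline, just to pin down the transform being asked for: divide each configuration's time by the baseline time at the same node count, so the plot shows slowdown factors instead of absolute execution times.

    #include <cassert>
    #include <cstddef>
    #include <vector>

    // times are indexed by node count; a result of 1.0 means no slowdown
    // relative to the baseline, 1.3 means 30% slower, etc.
    std::vector<double> relative_slowdown(
        std::vector<double> const& baseline, std::vector<double> const& variant)
    {
        assert(baseline.size() == variant.size());
        std::vector<double> result;
        result.reserve(variant.size());
        for (std::size_t i = 0; i != variant.size(); ++i)
            result.push_back(variant[i] / baseline[i]);
        return result;
    }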
<hkaiser>
this is not a paper about demonstrating scaling
<parsa>
you're right... that's why i didn't even pay attention to the horrific strong scaling behavior. relative performance it is
<parsa>
--np=1000 --nx=1000 seems to work and shows a marginal increase in speedup (e.g. 7% between 10 and 14 nodes). would that work instead of relative times?
<parsa>
hkaiser: ^ and maybe some reviewer would not like relative slowdown since it may look like we're trying to hide the actual exec times
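To put the 7% in perspective (simple arithmetic, not new data): perfect strong scaling from 10 to 14 nodes would be a 14/10 = 1.4x speedup, i.e. a 40% gain, so the observed 7% is less than a fifth of the ideal gain over that range.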
<hkaiser>
we're not making a point about scaling
<parsa>
hkaiser: updated plot 1 to relative slowdown, added the missing data
<hkaiser>
ok
<hkaiser>
interesting, the outlier is weird
<parsa>
it may look strange, but it really is what happens
<parsa>
it's not an anomaly
<hkaiser>
are those using blocking?
<parsa>
no these are the overlapped ones
<hkaiser>
on 7 nodes? or is it 8?
<parsa>
ugh. i don't know why the axis is off… it's 8
<hkaiser>
nod
<parsa>
it's off by one everywhere
<hkaiser>
still, why that outlier? any idea?
<hkaiser>
this plot shows times for migrating each of the 1000 partitions once, correct?
<parsa>
don't know why yet, but i've run this experiment enough times to know it's consistent
<parsa>
the impaired case is migrating 1000/(n-1) partitions
<parsa>
per locality
<parsa>
the shifted case migrates all, yes
<parsa>
i mean in the impaired case the migration from 0->0 won't do anything
<hkaiser>
right
<hkaiser>
still it is slower
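A hedged sketch of the migration step being measured here, not the paper's benchmark code: each partition is moved with HPX's migrate facility, which requires the component to opt into migration via hpx::components::migration_support<>; per the point above, asking locality 0 to migrate a partition to itself effectively does nothing.

    #include <hpx/hpx.hpp>

    // hypothetical helper: Component stands in for the partition component
    // type used in the experiment
    template <typename Component>
    hpx::future<hpx::id_type> move_partition(
        hpx::id_type const& partition, hpx::id_type const& target_locality)
    {
        // asynchronously moves the object's data to the target locality;
        // the returned future becomes ready once the object is usable
        // there under the same global id
        return hpx::components::migrate<Component>(partition, target_locality);
    }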
<hkaiser>
well, I'd say collect all the data, and if we have time collect some perf counters; it would be interesting to understand the outlier
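For that perf-counter run: HPX can dump counters at shutdown via --hpx:print-counter, e.g. by appending something like --hpx:print-counter=/parcels/count/sent --hpx:print-counter=/data/count/sent to the stencil_8 invocation; the exact counter names here are from memory and should be checked against the HPX counter docs.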
<hkaiser>
we might want to try a larger problem size after all, as the little work we have doesn't allow us to hide things
<parsa>
how large would make sense, assuming things work?
<hkaiser>
let's collect the data for this plot first
<hkaiser>
then you can try increasing the problem
<parsa>
aside from the anomaly, my take from this plot is that most of the slowdown comes from having network communication at all; its volume doesn't have a massive impact at our problem size yet