#ste||ar on 2019-04-05 — irc logs at irclog.cct.lsu.edu

2018-08-26 23:03 hkaiser changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/

00:12 <K-ballo> simbergm: Boost 1.70 is going to break FindBoost.cmake *hard*, let's make sure 1.3 can handle it, let's block on it if we need to

01:11 K-ballo has quit [Quit: K-ballo]

02:18 hkaiser has quit [Quit: bye]

03:37 Guest70467 has quit [Quit: WeeChat 2.2]

03:37 Amy1 has joined #ste||ar

06:31 chinz07[m] has quit [Remote host closed the connection]

06:31 simbergm has quit [Write error: Connection reset by peer]

06:40 chinz07[m] has joined #ste||ar

06:40 simbergm has joined #ste||ar

06:40 <simbergm> K-ballo thanks for the heads up

06:41 <simbergm> More details?

06:52 nikunj has joined #ste||ar

07:36 jbjnr has joined #ste||ar

07:36 <jbjnr> What's wrong with exception handling. It might be that most of the deadlocks I've been seeing are caused by other nodes having an exception and just hanging instead of terminating/other

07:40 <zao> simbergm: indeed sounds exciting, gonna suck for us in EasyBuild unless upstream CMake fixes it

07:41 <zao> We already have problems from separately compiling the Boost.Python builds, as FindBoost assumes a single Boost tree :)

09:53 chinz07[m] has quit [Remote host closed the connection]

09:53 simbergm has quit [Read error: Connection reset by peer]

10:01 chinz07[m] has joined #ste||ar

10:14 simbergm has joined #ste||ar

10:30 <daissgr> @jbjnr: Out of curiosity (I have never really worked with the internals of HPX): Could the exception handling problem be related to the HPX cmake flag: -DHPX_WITH_DISABLED_SIGNAL_EXCEPTION_HANDLERS=ON ? I have always wondered where that argument came from..

10:46 K-ballo has joined #ste||ar

10:58 <zao> K-ballo: What fun has 1.70 come up with now?

11:14 <jbjnr> daissgr: yes, that is one var that controls it, but I think it's broken at the moment

11:43 <K-ballo> zao: they deploy cmake package config files now, with imported targets, etc

11:44 <K-ballo> for some bizarre reason the first thing FindBoost.cmake do is look for package config files, something to do with the boost-cmake efforts from ~2010

11:45 <K-ballo> so now FindBoost will delegate to BoostConfig, which is wildly different from the find module.. for instance, it will not define any of the BOOST_ vars

11:58 <zao> Ooh, sounds awesome.

11:58 <zao> I wonder how much this will break our Boost.Python stuff.

11:59 <zao> Well, every single piece of software we build and install.

11:59 <K-ballo> everything using anything but imported targets will fail.. some things using imported targets will change slightly..

12:00 <zao> A bit optimistic of them to think that everyone has moved to imported targets already.

12:00 <K-ballo> the FindBoost.cmake maintainer is supposedly going to address some of that, by synthezising the BOOST_ vars when delegating to the module, but that wont help older cmakes

12:01 <K-ballo> up until rather recently the imported targets in FindBoost were discouraged

12:01 <K-ballo> also the imported targets don't work for scenarios that link against both static and shared boost

13:00 hkaiser has joined #ste||ar

13:30 aserio has joined #ste||ar

13:54 hkaiser has quit [Quit: bye]

14:17 <diehlpk_work> jbjnr, I just arrived at CCT and doing the plots right now

14:17 <jbjnr> cool. I have to leave very soon, but I'll check back tonight when I arrive

14:17 <jbjnr> just recompiled octotiger with no libfabric and will rerun mpi tests for sanity

14:18 <diehlpk_work> I will send them to the mailing list

14:18 <jbjnr> would be great to have some >15 steps runs to do better comparisons of PP

14:18 <diehlpk_work> I think it will not be possible with the current version of octotiger

14:19 <diehlpk_work> we could do one trick, but I am not sure if the simulations makes sense anymore

14:19 <diehlpk_work> We could prevent octotiger to do the regriding and not adding nee sub grids

14:19 <diehlpk_work> *new

14:20 <diehlpk_work> The issue is that after each regrid we need too much new memory and after 30 or 45 steps we run out of memory

14:20 <jbjnr> surely not. regridding means sending blocks around - this is communication. we want that

14:21 <jbjnr> anyway. I'll have to leave soon. we've at least got some data

14:21 <jbjnr> starting pur mpi runs now

14:21 <daissgr> that was the ngrids parameter, right? Also, what exactly was the cfl parameter? I was looking through the code and this parameter basically defines the frequency in which we need to regrid

14:21 <jbjnr> ^pure

14:21 <diehlpk_work> Cool, thanks again. I will look into the results and errors

14:22 <diehlpk_work> daissgr, ngrids are the initial amount of grids and we can say that we do not want more

14:24 <daissgr> yeah, I remember that! So fixing that could help us work around the regridding problem, right? Still, I am curious about the cfl parameter. Documentation about that one is rather sparse in the options.cpp

14:25 <daissgr> Regridding or not - we send a lot of stuff around during computation since we always need the neighboring subgrids. So it could help us the see a larger difference with the libfabric simply by running more timesteps

14:28 <diehlpk_work> daissgr, I will ask Dominic about the cfl parameter soon

14:30 <diehlpk_work> yes, avoid regrid could help to run longer and possible improve the libfabric, but on the other hand if we do not send more data it will be not as significant as producing more data with regriding

14:30 <diehlpk_work> let us look into the data

14:38 <diehlpk_work> daissgr, cfl influences the time step size which a signal needs to go trough one cell

14:39 <diehlpk_work> Dominic recommended not to play with this parameter, since it will influence the numeric

14:39 <daissgr> so: Changing that makes the signals slower - so we would do more time_steps to reach the same "time/status" of the simulation? Hence the difference in the refinment frequency

14:39 <diehlpk_work> daissgr, level 17 is not possible, since the regrid takes 3 hours

14:40 <diehlpk_work> daissgr, You only can make this value larger

14:40 <diehlpk_work> Lower could result in nan

14:41 <diehlpk_work> Dominic was not recommending to play with this parameter since we do not know if we still can do correct numeric stuff

14:41 quaz0r has quit [Ping timeout: 255 seconds]

14:43 <daissgr> Okay, then let's see where we stand: Basically the only way to make the simulation larger is to increase the level?

14:50 <diehlpk_work> Total: 15 3.021760e+02 1.962003e-03 2.650207e+01 2.469685e+01 0.000000e+00 0.000000e+00 8.173002e-02 0.000000e+00

14:50 <diehlpk_work> 262.982

14:50 <diehlpk_work> Is the last one the total time as before your change

14:51 <diehlpk_work> daissgr, Yes, I asked Dominic to run level 17 for me on queenb to validate how long it will take to load the file and refine. it took around 3 hours before the computation started

14:52 <daissgr> it is! the output of octotiger sometimes glitches (always has) and the output of the last timestep will be printed inbetween "Total:" and the actual time (262.982)

14:53 <daissgr> so the best we can run is level 16?

14:54 <daissgr> by the way: for the scaling graphs in the paper: We definitely should mention that we base the speedup upon the computation time and not the total time

14:55 <diehlpk_work> Yes, I think so, unless Dominic can do the distributed IO, at least reading by this evening

14:56 <diehlpk_work> No, I used the total time for my plots

14:56 <diehlpk_work> Computation + Regrid

14:56 <diehlpk_work> Compare analytics is close to zero and Find localities is zero

14:57 <daissgr> was is extracted by the total timer? I changed that a couple of days ago to include the loading of the initial dataset

14:57 hkaiser has joined #ste||ar

14:57 <daissgr> so the 3 hours regrid: that's the one getting from the initial level 13 scenario to level 17 - not the usual regridding that would happen afterwards every 15 timesteps?

14:58 <diehlpk_work> yes, from level 13 to 17

14:58 <diehlpk_work> 15 to 17 is not much faster as the reading of the file is longer and the refinement work is the same, we just do not refine the cheap levels

14:59 <diehlpk_work> Ok, it seems that the the loading time was not in the previous total counter

15:00 <diehlpk_work> On one node libfrabic is 60 seconds slower

15:00 <diehlpk_work> So we only can use the computation time since John and I used different timeers

15:01 <diehlpk_work> *timers

15:01 <daissgr> yeah! unfortunately getting the load time into the total timer was sort of necessary to get any meaningful times to use with the perf fractions

15:02 <daissgr> didn't we want to run both scenarios again anyway? mpi and libfabric

15:02 <daissgr> also we wanted to focus on compute time anyway, right?

15:02 <daissgr> to get your times we can probably just add computation time and regrid time together to get the old total_time

15:06 <hkaiser> daissgr: when should that happen?

15:06 <hkaiser> that rerun of the mpi, that is?

15:06 <hkaiser> and who should do that?

15:08 <diehlpk_work> Ok, I will add the regrid and computation time

15:08 <daissgr> Didn't John already start some MPI runs? Also, I thought we were focussing on the computation time anyway since the loading of the scenario takes forever? We can't just claim that it's the total time without the input time?

15:09 <diehlpk_work> Yes, but Johns MPU runs are hanging

15:09 <hkaiser> daissgr: sure, but he does not have it running yet, and he will not be around over the weekend

15:09 <hkaiser> daissgr: you may want to start thinking about how to make the measurements comparable

15:09 <diehlpk_work> I will uses John's libfabric results and my mpi runs, since we run out of time

15:10 <daissgr> What do you mean? The are as compatible as before? Beforehand we used the old total time that was just computation time + regrid time

15:10 <hkaiser> diehlpk_work: nod, but how do you want to compare the results? apples == bananas?

15:10 <daissgr> now we are using the the computational time and the regrid time

15:10 <daissgr> it is as comparable as beforehand?

15:10 <hkaiser> you tell me

15:11 <diehlpk_work> hkaiser, in the mpi run, the total time is the sum of computation time + regrid + compare analytics + find localities

15:11 <diehlpk_work> In the new version Gregor added the input time

15:11 <hkaiser> Gregor, this is _your_ paper, you have to make sure everything is consistent etc.

15:12 <hkaiser> everybody else just helps and does what you say

15:12 <diehlpk_work> Since find localities and compare analytic is always zero, I think adding computation time and regried is a fair approximation

15:12 <hkaiser> and if you don't give directions, everybody will do what they deem to be right, but bviously will lead to inconsistencies

15:14 <daissgr> Well I am sorry but beforehand it was just plain wrong data without the input time. My last status was that we will do runs over the weekend with both parcelport and mpi. IF we cannot do the mpi run we can still use our existing data

15:14 <daissgr> neither the computational time nor the regrid time have changed

15:14 <daissgr> I actually thought we were using them for the scaling graphs anyway

15:16 <daissgr> Since Patrick and I noticed that the IO would be a problem we wanted to do the runs without it. One that means disable_output is always on. Secondly that means using the computation time because that is what's actually scaling

15:18 <daissgr> So the direction is (and was for the last weeks) to use the computation time for the scaling. The current scaling graph we have reflects that as well

15:23 <diehlpk_work> Ok, I extracted the data for level 14 for libfabric

15:26 <diehlpk_work> level 14 from 1 to 32 nodes and I computed time mpi - time libfabric: -10.84927-7.15121000000001-5.06981-1.582061.16153.67019

15:26 <diehlpk_work> I have to go to a meeting and will extract level 15 later

15:27 <daissgr> alright thanks

15:27 <diehlpk_work> I will keep you updated and send around the plots for the results we have

15:28 <diehlpk_work> I know we have results up to 512 nodes

15:28 <daissgr> are they in your scratch folder?

15:38 <diehlpk_work> No, /scratch/snx3000/biddisco/octoresults

15:47 <diehlpk_work> daissgr, level 15: time mpi - time libfabric: Level 153.12735.31912-2.274986.010616.099074

16:05 aserio has quit [Ping timeout: 240 seconds]

16:06 david_pfander has quit [Ping timeout: 250 seconds]

16:06 <daissgr> huh - the negative time with level 15: Is it the one for 128 nodes?

16:07 <diehlpk_work> It seems so, let me push the plots

16:07 <diehlpk_work> daissgr, We also have a teaser image now

16:08 <daissgr> cool

16:10 <daissgr> also I just looked up the two files in the scratch folder - looks to me like both the computation time just as the regrid time goes done for 128

16:10 <K-ballo> hkaiser: yt?

16:10 <daissgr> that's our only outlier so far - other than that the libfabric seems to work rather well - especially considering the short computation times

16:12 <daissgr> goes down*

16:29 aserio has joined #ste||ar

16:56 <daissgr> diehlpk_work: Can you take another look at the outlier on level 15?

16:59 <Amy1> why some memory write will slow down program than just memory read?

16:59 <Amy1> slowly down very much

17:10 aserio has quit [Ping timeout: 250 seconds]

17:17 <hkaiser> K-ballo: here

17:17 <hkaiser> now

17:18 <K-ballo> hkaiser: see pm

17:21 hkaiser has quit [Read error: Connection reset by peer]

17:22 hkaiser has joined #ste||ar

17:22 <hkaiser> Amy1: you'd have to give us some code to understand your proble

17:40 quaz0r has joined #ste||ar

17:44 <diehlpk_work> daissgr, Sure, I think the results re better now, I had a bug in my script with the new data

17:51 eschnett has joined #ste||ar

17:53 <diehlpk_work> daissgr, Should I compare sub grids per second or speed up between MPI and libfabric

17:56 aserio has joined #ste||ar

18:01 aserio1 has joined #ste||ar

18:04 aserio has quit [Ping timeout: 245 seconds]

18:04 aserio1 is now known as aserio

18:05 <daissgr> diehlpk_work: The speedup comparison would be really interesting I think

18:05 <diehlpk_work> Ok, I habe results for level 16 up to 2048 nodes

18:06 <diehlpk_work> I will add them first and do the comparison plot after

18:06 <diehlpk_work> I also have results on Cori for level 14

18:06 <diehlpk_work> up to 16 nodes

18:09 <daissgr> what was the extra_regrid you use for level 14? 1 oder 2?

18:16 <diehlpk_work> 2

18:17 <diehlpk_work> 14-13 +1

18:17 <diehlpk_work> daissgr, I have results for libfabric up to 2048 nodes

18:17 <diehlpk_work> I added them to the paper

18:34 <parsa> diehlpk_work: ping

18:34 <diehlpk_work> Yes

18:34 <parsa> diehlpk_work: can you commit the Cori log?

18:34 <diehlpk_work> Sure

18:38 <diehlpk_work> parsa, added to the github repo

18:39 <parsa> thanks!

18:52 aserio1 has joined #ste||ar

18:56 aserio has quit [Ping timeout: 264 seconds]

18:56 aserio1 is now known as aserio

19:01 daissgr1 has joined #ste||ar

19:01 <daissgr1> Is the webex not open yet?

19:03 hkaiser has quit [Quit: bye]

19:07 <daissgr1> we get "It's not yet time to join this meeting"

19:09 <jbjnr> diehlpk_work: daissgr1 I'm online briefly - what have I missed

19:12 <aserio> https://lsucct.webex.com/lsucct/j.php?MTID=m9a5aa19671b55e14234604d4a0154890

19:13 <aserio> daissgr1: ^^

19:14 <jbjnr> aserio: when meeting

19:14 <aserio> now

19:14 <jbjnr> nobody there on that link

19:15 <jbjnr> are the LF results good?

19:16 <aserio> jbjnr, daissgr1: https://lsucct.webex.com/lsucct/j.php?MTID=m751c8f38ce13256be59082c21410230b

19:16 <aserio> sorry

19:16 <aserio> try this link instead

20:13 khuck_ has joined #ste||ar

20:14 khuck has quit []

20:14 khuck_ has quit [Client Quit]

20:14 khuck has joined #ste||ar

20:26 aserio has quit [Ping timeout: 246 seconds]

20:26 daissgr1 has quit [Ping timeout: 240 seconds]

20:29 aserio has joined #ste||ar

20:33 daissgr has quit [Quit: WeeChat 1.9.1]

20:34 eschnett has quit [Remote host closed the connection]

20:35 eschnett has joined #ste||ar

20:48 khuck has quit [Remote host closed the connection]

21:01 eschnett has quit [Quit: eschnett]

21:03 khuck has joined #ste||ar

21:31 maxwellr96 has joined #ste||ar

21:31 <maxwellr96> Is it a requirement that an hpx::lcos::barrier include the root locality as one of the threads?

21:31 <khuck> diehlpk_work: I've got all the level 14 runs, and some of the level 15. The filesystem is having problems on the compute nodes, so jobs are failing once in a while. More frequently with larger jobs.

21:46 hkaiser has joined #ste||ar

21:53 aserio has quit [Quit: aserio]

22:01 maxwellr96 has quit [Ping timeout: 246 seconds]

22:08 <khuck> diehlpk_work: and I added the initial values to the paper

22:57 eschnett has joined #ste||ar

23:07 khuck has quit []

23:15 eschnett has quit [Quit: eschnett]

23:21 eschnett has joined #ste||ar

23:44 eschnett has quit [Quit: eschnett]