hkaiser changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/
<K-ballo>
simbergm: Boost 1.70 is going to break FindBoost.cmake *hard*, let's make sure 1.3 can handle it, let's block on it if we need to
K-ballo has quit [Quit: K-ballo]
hkaiser has quit [Quit: bye]
Guest70467 has quit [Quit: WeeChat 2.2]
Amy1 has joined #ste||ar
chinz07[m] has quit [Remote host closed the connection]
simbergm has quit [Write error: Connection reset by peer]
chinz07[m] has joined #ste||ar
simbergm has joined #ste||ar
<simbergm>
K-ballo thanks for the heads up
<simbergm>
More details?
nikunj has joined #ste||ar
jbjnr has joined #ste||ar
<jbjnr>
What's wrong with exception handling? It might be that most of the deadlocks I've been seeing are caused by other nodes having an exception and just hanging instead of terminating/other
<zao>
simbergm: indeed sounds exciting, gonna suck for us in EasyBuild unless upstream CMake fixes it
<zao>
We already have problems from separately compiling the Boost.Python builds, as FindBoost assumes a single Boost tree :)
chinz07[m] has quit [Remote host closed the connection]
simbergm has quit [Read error: Connection reset by peer]
chinz07[m] has joined #ste||ar
simbergm has joined #ste||ar
<daissgr>
@jbjnr: Out of curiosity (I have never really worked with the internals of HPX): Could the exception handling problem be related to the HPX cmake flag: -DHPX_WITH_DISABLED_SIGNAL_EXCEPTION_HANDLERS=ON ? I have always wondered where that argument came from..
K-ballo has joined #ste||ar
<zao>
K-ballo: What fun has 1.70 come up with now?
<jbjnr>
daissgr: yes, that is one var that controls it, but I think it's broken at the moment
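(For reference, a minimal sketch of how that option would be passed when configuring HPX; the option name is taken from the discussion above, and the paths and build type are placeholders:)

  # configure HPX with the signal/exception handlers disabled (illustrative paths)
  cmake -DHPX_WITH_DISABLED_SIGNAL_EXCEPTION_HANDLERS=ON \
        -DCMAKE_BUILD_TYPE=Release \
        /path/to/hpx-source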
<K-ballo>
zao: they deploy cmake package config files now, with imported targets, etc
<K-ballo>
for some bizarre reason the first thing FindBoost.cmake does is look for package config files, something to do with the boost-cmake efforts from ~2010
<K-ballo>
so now FindBoost will delegate to BoostConfig, which is wildly different from the find module.. for instance, it will not define any of the BOOST_ vars
<zao>
Ooh, sounds awesome.
<zao>
I wonder how much this will break our Boost.Python stuff.
<zao>
Well, every single piece of software we build and install.
<K-ballo>
everything using anything but imported targets will fail.. some things using imported targets will change slightly..
<zao>
A bit optimistic of them to think that everyone has moved to imported targets already.
<K-ballo>
the FindBoost.cmake maintainer is supposedly going to address some of that, by synthesizing the BOOST_ vars when delegating to the package config, but that won't help older cmakes
<K-ballo>
up until rather recently the imported targets in FindBoost were discouraged
<K-ballo>
also the imported targets don't work for scenarios that link against both static and shared boost
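(To illustrate the difference K-ballo describes, a rough CMake sketch; the target name myapp and the component system are illustrative:)

  # Variable-based style: relies on FindBoost.cmake defining Boost_INCLUDE_DIRS /
  # Boost_LIBRARIES, which the BoostConfig shipped with Boost 1.70 does not provide.
  find_package(Boost 1.70 REQUIRED COMPONENTS system)
  include_directories(${Boost_INCLUDE_DIRS})
  target_link_libraries(myapp ${Boost_LIBRARIES})

  # Imported-target style: works with both the find module and the new package config.
  find_package(Boost 1.70 REQUIRED COMPONENTS system)
  target_link_libraries(myapp Boost::system)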
hkaiser has joined #ste||ar
aserio has joined #ste||ar
hkaiser has quit [Quit: bye]
<diehlpk_work>
jbjnr, I just arrived at CCT and am doing the plots right now
<jbjnr>
cool. I have to leave very soon, but I'll check back tonight when I arrive
<jbjnr>
just recompiled octotiger with no libfabric and will rerun mpi tests for sanity
<diehlpk_work>
I will send them to the mailing list
<jbjnr>
would be great to have some >15 steps runs to do better comparisons of PP
<diehlpk_work>
I think it will not be possible with the current version of octotiger
<diehlpk_work>
we could do one trick, but I am not sure if the simulation makes sense anymore
<diehlpk_work>
We could prevent octotiger from doing the regridding and from adding new sub grids
<diehlpk_work>
The issue is that after each regrid we need too much new memory and after 30 or 45 steps we run out of memory
<jbjnr>
surely not. regridding means sending blocks around - this is communication. we want that
<jbjnr>
anyway. I'll have to leave soon. we've at least got some data
<jbjnr>
starting pure mpi runs now
<daissgr>
that was the ngrids parameter, right? Also, what exactly was the cfl parameter? I was looking through the code and this parameter basically defines the frequency in which we need to regrid
<diehlpk_work>
Cool, thanks again. I will look into the results and errors
<diehlpk_work>
daissgr, ngrids is the initial number of grids, and we can say that we do not want more
<daissgr>
yeah, I remember that! So fixing that could help us work around the regridding problem, right? Still, I am curious about the cfl parameter. Documentation about that one is rather sparse in the options.cpp
<daissgr>
Regridding or not - we send a lot of stuff around during computation since we always need the neighboring subgrids. So it could help us see a larger difference with libfabric simply by running more timesteps
<diehlpk_work>
daissgr, I will ask Dominic about the cfl parameter soon
<diehlpk_work>
yes, avoiding regrid could help us run longer and possibly improve the libfabric numbers, but on the other hand, if we do not send more data it will not be as significant as producing more data with regridding
<diehlpk_work>
let us look into the data
<diehlpk_work>
daissgr, cfl influences the time step size, i.e. the time a signal needs to go through one cell
<diehlpk_work>
Dominic recommended not to play with this parameter, since it will influence the numerics
<daissgr>
so: changing that makes the signals slower, meaning we would do more time_steps to reach the same "time/status" of the simulation? Hence the difference in the refinement frequency
<diehlpk_work>
daissgr, level 17 is not possible, since the regrid takes 3 hours
<diehlpk_work>
daissgr, You only can make this value larger
<diehlpk_work>
Lower could result in nan
<diehlpk_work>
Dominic did not recommend playing with this parameter since we do not know if we can still do the numerics correctly
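(For context, a generic textbook form of the CFL condition; this is a sketch, not necessarily the exact formula octotiger uses:)

  \Delta t \;=\; C_{\mathrm{cfl}} \, \frac{\Delta x}{v_{\mathrm{signal}}}

  i.e. a smaller cfl value gives smaller time steps, so more steps are needed to reach the same simulated time.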
quaz0r has quit [Ping timeout: 255 seconds]
<daissgr>
Okay, then let's see where we stand: Basically the only way to make the simulation larger is to increase the level?
<diehlpk_work>
Is the last one the total time, as before your change?
<diehlpk_work>
daissgr, Yes, I asked Dominic to run level 17 for me on queenb to validate how long it will take to load the file and refine. it took around 3 hours before the computation started
<daissgr>
it is! the output of octotiger sometimes glitches (always has) and the output of the last timestep will be printed in between "Total:" and the actual time (262.982)
<daissgr>
so the best we can run is level 16?
<daissgr>
by the way: for the scaling graphs in the paper: We definitely should mention that we base the speedup upon the computation time and not the total time
<diehlpk_work>
Yes, I think so, unless Dominic can do the distributed IO, at least reading by this evening
<diehlpk_work>
No, I used the total time for my plots
<diehlpk_work>
Computation + Regrid
<diehlpk_work>
Compare analytics is close to zero and Find localities is zero
<daissgr>
Was it extracted from the total timer? I changed that a couple of days ago to include the loading of the initial dataset
hkaiser has joined #ste||ar
<daissgr>
so the 3 hours regrid: that's the one getting from the initial level 13 scenario to level 17 - not the usual regridding that would happen afterwards every 15 timesteps?
<diehlpk_work>
yes, from level 13 to 17
<diehlpk_work>
15 to 17 is not much faster as the reading of the file is longer and the refinement work is the same, we just do not refine the cheap levels
<diehlpk_work>
Ok, it seems that the loading time was not in the previous total counter
<diehlpk_work>
On one node libfabric is 60 seconds slower
<diehlpk_work>
So we can only use the computation time since John and I used different timers
<daissgr>
yeah! unfortunately getting the load time into the total timer was sort of necessary to get any meaningful times to use with the perf fractions
<daissgr>
didn't we want to run both scenarios again anyway? mpi and libfabric
<daissgr>
also we wanted to focus on compute time anyway, right?
<daissgr>
to get your times we can probably just add computation time and regrid time together to get the old total_time
<hkaiser>
daissgr: when should that happen?
<hkaiser>
that rerun of the mpi, that is?
<hkaiser>
and who should do that?
<diehlpk_work>
Ok, I will add the regrid and computation time
<daissgr>
Didn't John already start some MPI runs? Also, I thought we were focussing on the computation time anyway since the loading of the scenario takes forever? We can't just claim that it's the total time without the input time?
<diehlpk_work>
Yes, but John's MPI runs are hanging
<hkaiser>
daissgr: sure, but he does not have it running yet, and he will not be around over the weekend
<hkaiser>
daissgr: you may want to start thinking about how to make the measurements comparable
<diehlpk_work>
I will use John's libfabric results and my mpi runs, since we are running out of time
<daissgr>
What do you mean? They are as comparable as before? Beforehand we used the old total time, which was just computation time + regrid time
<hkaiser>
diehlpk_work: nod, but how do you want to compare the results? apples == bananas?
<daissgr>
now we are using the computation time and the regrid time
<daissgr>
it is as comparable as beforehand?
<hkaiser>
you tell me
<diehlpk_work>
hkaiser, in the mpi run, the total time is the sum of computation time + regrid + compare analytics + find localities
<diehlpk_work>
In the new version Gregor added the input time
<hkaiser>
Gregor, this is _your_ paper, you have to make sure everything is consistent etc.
<hkaiser>
everybody else just helps and does what you say
<diehlpk_work>
Since find localities and compare analytics are always zero, I think adding computation time and regrid is a fair approximation
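(In other words, using the timer names from the discussion, the proposed approximation is:)

  T_{\mathrm{total}} = T_{\mathrm{computation}} + T_{\mathrm{regrid}} + T_{\mathrm{compare\,analytics}} + T_{\mathrm{find\,localities}} \;\approx\; T_{\mathrm{computation}} + T_{\mathrm{regrid}}

  since the last two terms are reported as (close to) zero; the newer total timer additionally includes the input/loading time.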
<hkaiser>
and if you don't give directions, everybody will do what they deem to be right, which obviously will lead to inconsistencies
<daissgr>
Well I am sorry but beforehand it was just plain wrong data without the input time. My last status was that we will do runs over the weekend with both parcelport and mpi. IF we cannot do the mpi run we can still use our existing data
<daissgr>
neither the computational time nor the regrid time have changed
<daissgr>
I actually thought we were using them for the scaling graphs anyway
<daissgr>
Since Patrick and I noticed that the IO would be a problem, we wanted to do the runs without it. First, that means disable_output is always on. Second, that means using the computation time, because that is what's actually scaling
<daissgr>
So the direction is (and was for the last weeks) to use the computation time for the scaling. The current scaling graph we have reflects that as well
<diehlpk_work>
Ok, I extracted the data for level 14 for libfabric
<diehlpk_work>
level 14 from 1 to 32 nodes and I computed time mpi - time libfabric: -10.84927, -7.15121000000001, -5.06981, -1.58206, 1.1615, 3.67019
<diehlpk_work>
I have to go to a meeting and will extract level 15 later
<daissgr>
alright thanks
<diehlpk_work>
I will keep you updated and send around the plots for the results we have
<diehlpk_work>
I know we have results up to 512 nodes
<daissgr>
are they in your scratch folder?
<diehlpk_work>
No, /scratch/snx3000/biddisco/octoresults
<diehlpk_work>
daissgr, level 15: time mpi - time libfabric: 3.1273, 5.31912, -2.27498, 6.01061, 6.099074
aserio has quit [Ping timeout: 240 seconds]
david_pfander has quit [Ping timeout: 250 seconds]
<daissgr>
huh - the negative time with level 15: Is it the one for 128 nodes?
<diehlpk_work>
It seems so, let me push the plots
<diehlpk_work>
daissgr, We also have a teaser image now
<daissgr>
cool
<daissgr>
also I just looked up the two files in the scratch folder - looks to me like both the computation time and the regrid time go down for 128
<K-ballo>
hkaiser: yt?
<daissgr>
that's our only outlier so far - other than that the libfabric seems to work rather well - especially considering the short computation times
aserio has joined #ste||ar
<daissgr>
diehlpk_work: Can you take another look at the outlier on level 15?
<Amy1>
why do some memory writes slow down a program more than just memory reads?
<Amy1>
they slow it down very much
aserio has quit [Ping timeout: 250 seconds]
<hkaiser>
K-ballo: here
<hkaiser>
now
<K-ballo>
hkaiser: see pm
hkaiser has quit [Read error: Connection reset by peer]
hkaiser has joined #ste||ar
<hkaiser>
Amy1: you'd have to give us some code to understand your problem
quaz0r has joined #ste||ar
<diehlpk_work>
daissgr, Sure, I think the results are better now, I had a bug in my script with the new data
eschnett has joined #ste||ar
<diehlpk_work>
daissgr, Should I compare sub grids per second or speedup between MPI and libfabric?
aserio has joined #ste||ar
aserio1 has joined #ste||ar
aserio has quit [Ping timeout: 245 seconds]
aserio1 is now known as aserio
<daissgr>
diehlpk_work: The speedup comparison would be really interesting I think
<diehlpk_work>
Ok, I have results for level 16 up to 2048 nodes
<diehlpk_work>
I will add them first and do the comparison plot after
<diehlpk_work>
I also have results on Cori for level 14
<diehlpk_work>
up to 16 nodes
<daissgr>
what was the extra_regrid you used for level 14? 1 or 2?
<diehlpk_work>
2
<diehlpk_work>
14-13 +1
<diehlpk_work>
daissgr, I have results for libfabric up to 2048 nodes
<diehlpk_work>
I added them to the paper
<parsa>
diehlpk_work: ping
<diehlpk_work>
Yes
<parsa>
diehlpk_work: can you commit the Cori log?
<diehlpk_work>
Sure
<diehlpk_work>
parsa, added to the github repo
<parsa>
thanks!
aserio1 has joined #ste||ar
aserio has quit [Ping timeout: 264 seconds]
aserio1 is now known as aserio
daissgr1 has joined #ste||ar
<daissgr1>
Is the webex not open yet?
hkaiser has quit [Quit: bye]
<daissgr1>
we get "It's not yet time to join this meeting"
<jbjnr>
diehlpk_work: daissgr1 I'm online briefly - what have I missed
eschnett has quit [Remote host closed the connection]
eschnett has joined #ste||ar
khuck has quit [Remote host closed the connection]
eschnett has quit [Quit: eschnett]
khuck has joined #ste||ar
maxwellr96 has joined #ste||ar
<maxwellr96>
Is it a requirement that an hpx::lcos::barrier include the root locality as one of the threads?
<khuck>
diehlpk_work: I've got all the level 14 runs, and some of the level 15. The filesystem is having problems on the compute nodes, so jobs are failing once in a while. More frequently with larger jobs.
hkaiser has joined #ste||ar
aserio has quit [Quit: aserio]
maxwellr96 has quit [Ping timeout: 246 seconds]
<khuck>
diehlpk_work: and I added the initial values to the paper