aserio changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/
Smasher has quit [Quit: Connection reset by beer]
EverYoung has joined #ste||ar
Smasher has joined #ste||ar
EverYoun_ has joined #ste||ar
EverYoun_ has quit [Remote host closed the connection]
EverYoun_ has joined #ste||ar
EverYoung has quit [Ping timeout: 240 seconds]
jaafar_ has joined #ste||ar
jaafar has quit [Ping timeout: 252 seconds]
EverYoun_ has quit [Remote host closed the connection]
EverYoung has joined #ste||ar
EverYoung has quit [Remote host closed the connection]
<github>
hpx/release 7077bc0 Mikael Simberg: Small formatting fix in release_procedure.rst
<heller_>
marco: one other thing, std::vector<std::vector<float>> is not exactly good for performance ;)
<heller_>
especially for stencil computations like this
<heller_>
you should linearize that
<heller_>
this should give you better overall performance
<marco>
I mean single-value access: only accessing the central element of the stencil operator, not the neighbouring elements.
<heller_>
ah ok
<heller_>
yeah, that might hint at false sharing
<heller_>
try the suggestion with the linearized 2d vector
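A minimal sketch of the linearization suggested above, assuming a row-major grid; the type and names (grid2d, width, height) are illustrative, not taken from marco's code. One flat std::vector<float> indexed as y * width + x keeps each row contiguous, avoids the per-row allocations and pointer chasing of std::vector<std::vector<float>>, and makes false sharing between threads less likely.

    #include <cstddef>
    #include <vector>

    // Illustrative flat 2D grid replacing std::vector<std::vector<float>>.
    struct grid2d
    {
        std::size_t width, height;
        std::vector<float> data;   // one contiguous allocation instead of `height` separate rows

        grid2d(std::size_t w, std::size_t h)
          : width(w), height(h), data(w * h, 0.0f) {}

        float& operator()(std::size_t x, std::size_t y)
        {
            return data[y * width + x];   // row-major: x-neighbours are adjacent in memory
        }
    };

    // usage (hypothetical): grid2d u(nx, ny); then read u(x, y) and its neighbours u(x-1, y), u(x+1, y), ...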
<marco>
yes, I know that VTune works well with -O3. -O3 is the standard for our release builds, but our compile environment sets -O2 for profiling. I need to change that, ...
<zao>
Always fun.
<marco>
ok, I'll test it with 2/4 cores and linearize it.
<marco>
thanks for the initial tips, I'll get back to you later
<heller_>
sure
<heller_>
no problem, always here to help
zbyerly_ has joined #ste||ar
zbyerly__ has quit [Ping timeout: 276 seconds]
Smasher has quit [Quit: Connection reset by beer]
Smasher has joined #ste||ar
hkaiser has joined #ste||ar
<hkaiser>
simbergm: so what's the merge procedure for the release now? do we merge to master or to release?
<simbergm>
hkaiser: I don't have a strong opinion, as I don't know yet which one works better in practice, but my feeling is that we should keep merging to master and I'll pick from there to the release branch
<hkaiser>
nod, let's remove it, but leave a note in the what's new section
<hkaiser>
simbergm: ok - heller_ will disagree ;-)
<simbergm>
this way we can keep merging things to master that don't need to go into the release
<simbergm>
if heller_ has good arguments for making PRs to release I'm more than okay with that as well :)
<heller_>
I don't like cherry-picking from master
<heller_>
since it makes it hard to reintegrate release back to master
<simbergm>
okay, that's fair
<heller_>
in a perfect world, we would just branch off of master, and call it a release
<simbergm>
agreed
<hkaiser>
simbergm: last release we decided to merge to release and do one merge back to master after the release - worked quite well, actually
<heller_>
*nod*
<heller_>
the only problem we had was that testing was a bit of a mess, IIRC
<hkaiser>
yes
<simbergm>
meaning cherry-picking from master before the release, and then merging to master after the release?
<hkaiser>
no
<heller_>
that is, things that landed on master during the release weren't properly tested etc.
<hkaiser>
during the release period, merge all PRs to release, leaving master alone
<simbergm>
hkaiser: sorry misread
<simbergm>
I see now
<simbergm>
okay, so let's do that now as well
<simbergm>
already open PRs against master can be picked to release
<hkaiser>
simbergm: requires changes to testing infrastructure, tests should run off of release
<simbergm>
hkaiser: yes
<hkaiser>
existing PRs can be merged manually to release after being merged to master
<simbergm>
I didn't want to do it yet since there's no rc
<simbergm>
yeah
<hkaiser>
let's avoid cherry picking
<simbergm>
okay, I missed the distinction between merging and cherry-picking
<simbergm>
I see
<hkaiser>
cherry picking creates a new independent commit, merging does not
<simbergm>
so at the moment I think master is in a pretty good state (i.e. not too many failing tests, and all failures are occasional except for the stacksize test)
<simbergm>
in your opinion is this a good state for an rc?
<simbergm>
what has buildbot looked like usually during an rc?
<heller_>
same or even worse :P
<hkaiser>
an rc means that there will be no new features or refactorings, only bug fixes before the release
<heller_>
I'd like to get the stacksize problem fixed though
<hkaiser>
sure, that's a bug
<heller_>
let's also try to keep the timeframe for doing the release as short as possible
<heller_>
last time, we had lots of trouble keeping everything in sync...
<simbergm>
but then I'd still stick to master for now and try to keep fixing as much as possible
<hkaiser>
heller_: it wasn't too bad
<simbergm>
I guess the only feature still going in is my suspension PR
<simbergm>
do you have anything planned?
<simbergm>
anything else, I mean
<hkaiser>
simbergm: and the thread scheduler changes heller_ has in the pipeline
<heller_>
should they go into the release?
<simbergm>
they change APIs?
<heller_>
a little, yes
<hkaiser>
heller_: I thought we delayed the release for those
<simbergm>
ok
<heller_>
I was aware that we delayed the release for them
<heller_>
and jbjnr reported no real speedup in his application
<hkaiser>
heller_: so why do we do those, then?
<heller_>
so I guess they need more work
<heller_>
first of all, he didn't test the full set
<heller_>
second of all, the full set needs more work
<heller_>
probably another week or so
<hkaiser>
heller_: so you're contradicting yourself here ;)
<heller_>
for my micro benchmarks, they showed better performance
<heller_>
that's all I said
<hkaiser>
sure
<simbergm>
in my opinion we're not really behind schedule, as I set the rc date quite conservatively (and optimistically), so let's keep working on master until e.g. Wednesday next week and see again where we are
<heller_>
sounds good
<heller_>
night shifts ahead!
<heller_>
;)
<heller_>
we are not doing the tutorial in March anyway
<hkaiser>
nice
<simbergm>
yeah, that's good
<jbjnr>
heller_: my best results were about 985 GFlops before, but this week I got 1010 at peak, so there's been a 2% or so improvement as a result of some general cleanup from your continuations and the profiling fixes etc. gtg.
<heller_>
ok
<heller_>
I wasn't aware that the 2% was due to my changes
<hkaiser>
that's a nice result as well
<heller_>
and since the grain size is relatively large for your application, I am not sure they'll matter
<heller_>
(except for the allocations etc.)
<heller_>
I am working on trying to reduce binary size right now...
<heller_>
which I am hoping helps with partitioned_vector tests
<hkaiser>
heller_: let's finish the other stuff first
<hkaiser>
we've been living with the partitioned_vector things for a while, no need to work on them now
<heller_>
well ... it's a low-risk change; I was in "let's improve the code as it is without changing functionality" mode
<hkaiser>
heller_: we have enough half-way done things
<heller_>
we do
<hkaiser>
#3031 has inspect problems, #3036 is still open
<hkaiser>
I meant #3130
<heller_>
#3036 can't be closed until we fix the partitioned_vector compile problems
<hkaiser>
heller_: ok - was not aware of that
<heller_>
well, we can merge it
<heller_>
but then we'll have to live with failing tests until the compile/link problems are fixed
<simbergm>
jbjnr: for dca++ you're going to need cuda support, no? have you compiled hpx (with cuda) successfully with anything other than what's on rostam?
<simbergm>
heller_: #3131 seems to have broken reduce_by_key, any guesses why? I'll try to look tomorrow
<hkaiser>
simbergm: heh
<mbremer>
@hkaiser: Also, do you have a BibTeX entry for the GB paper?
<mbremer>
Alternatively, I suppose the scaling results are also mentioned in the OpenSuCo paper
<jbjnr>
simbergm: yes. I am running dca++ with cuda on my laptop and on daint, using (hpx + cuda) + (dca++ + cuda)
<jbjnr>
no problems now.
<jbjnr>
heller_: I ran the cholesky several times and discovered that the map change was not really making any difference - it's within the noise, and the variance between runs is quite high - but everything is "just a bit" faster than it used to be, so I'm assuming the string cleanup and your continuation fixes are the main thing
aserio has quit [Ping timeout: 252 seconds]
vamatya has joined #ste||ar
jaafar_ has joined #ste||ar
david_pfander has quit [Ping timeout: 265 seconds]
daissgr has quit [Ping timeout: 252 seconds]
<heller_>
jbjnr: great!
<heller_>
simbergm: I'll have a look as well
<heller_>
simbergm: the reduce_by_key failure might be related to the scan_partitioner changes after all, meaning it's not fixed yet
aserio has joined #ste||ar
<heller_>
simbergm: same problem as before, I guess
daissgr has joined #ste||ar
aserio1 has joined #ste||ar
EverYoung has quit [Remote host closed the connection]
EverYoung has joined #ste||ar
aserio has quit [Ping timeout: 265 seconds]
aserio1 is now known as aserio
twwright_ has joined #ste||ar
twwright has quit [Read error: Connection reset by peer]
twwright_ is now known as twwright
daissgr has quit [Ping timeout: 252 seconds]
<heller_>
simbergm: since it's just reduce_by_key, my guess would be some kind of race in the algorithm itself, or the partitioner, or wrong usage of it
EverYoung has quit [Remote host closed the connection]
EverYoung has joined #ste||ar
aserio has quit [Ping timeout: 276 seconds]
aserio has joined #ste||ar
daissgr has joined #ste||ar
<heller_>
simbergm: nope. It's a bug in the scan partitioner/executors ignoring the sync policy
<heller_>
jbjnr: did I read the code correctly that all algorithms should execute sequentially for reduce_by_key?
<jbjnr>
the scan part is not sequential, but there was one sequential bit in there. I can't remember without looking and I'm in a meeting right now.
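For readers following along, a sequential reference sketch of reduce_by_key's semantics (this is not HPX's implementation, and the helper name reduce_by_key_seq is hypothetical): each run of equal adjacent keys collapses into one output key with its values combined, which is the result any parallel version, racy or not, has to reproduce.

    #include <cassert>
    #include <cstddef>
    #include <functional>
    #include <utility>
    #include <vector>

    // Sequential reference for reduce_by_key semantics (illustration only, not HPX code):
    // each run of equal adjacent keys is reduced to a single (key, combined value) pair.
    template <typename Key, typename Value, typename Op = std::plus<Value>>
    std::pair<std::vector<Key>, std::vector<Value>>
    reduce_by_key_seq(std::vector<Key> const& keys, std::vector<Value> const& values, Op op = Op())
    {
        assert(keys.size() == values.size());
        std::pair<std::vector<Key>, std::vector<Value>> out;
        for (std::size_t i = 0; i != keys.size(); ++i)
        {
            if (!out.first.empty() && out.first.back() == keys[i])
                out.second.back() = op(out.second.back(), values[i]);   // extend the current run
            else
            {
                out.first.push_back(keys[i]);                           // start a new run
                out.second.push_back(values[i]);
            }
        }
        return out;
    }

    // e.g. keys {1,1,2,2,2}, values {1,2,3,4,5} -> keys {1,2}, values {3,12}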
Smasher has quit [Ping timeout: 240 seconds]
Smasher has joined #ste||ar
Smasher is now known as Smashor
hkaiser has quit [Quit: bye]
aserio has quit [Ping timeout: 240 seconds]
aserio has joined #ste||ar
Smashor has quit [Remote host closed the connection]
aserio has quit [Quit: aserio]
hkaiser has joined #ste||ar
<github>
[hpx] hkaiser deleted coroutine_cleanup at 9e1648c: https://git.io/vNbl9
EverYoung has quit [Read error: Connection reset by peer]