aserio changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/
Smasher has quit [Quit: Connection reset by beer]
EverYoung has joined #ste||ar
Smasher has joined #ste||ar
EverYoun_ has joined #ste||ar
EverYoun_ has quit [Remote host closed the connection]
EverYoun_ has joined #ste||ar
EverYoung has quit [Ping timeout: 240 seconds]
jaafar_ has joined #ste||ar
jaafar has quit [Ping timeout: 252 seconds]
EverYoun_ has quit [Remote host closed the connection]
EverYoung has joined #ste||ar
EverYoung has quit [Remote host closed the connection]
EverYoung has joined #ste||ar
EverYoun_ has joined #ste||ar
EverYoung has quit [Ping timeout: 240 seconds]
EverYoun_ has quit [Ping timeout: 240 seconds]
EverYoung has joined #ste||ar
EverYoung has quit [Ping timeout: 240 seconds]
vamatya has quit [Ping timeout: 248 seconds]
<github> [hpx] hkaiser pushed 1 new commit to release: https://git.io/vNdOf
<github> hpx/release ce3c62a Hartmut Kaiser: Use nullptr instead of '0' in windows specific code (was not checked by clang-tidy)
vamatya has joined #ste||ar
EverYoung has joined #ste||ar
eschnett has joined #ste||ar
EverYoung has quit [Ping timeout: 276 seconds]
hkaiser has quit [Quit: bye]
ct-clmsn has joined #ste||ar
ct-clmsn has quit [Quit: Leaving]
EverYoung has joined #ste||ar
vamatya has quit [Ping timeout: 240 seconds]
nanashi55 has quit [Ping timeout: 248 seconds]
nanashi55 has joined #ste||ar
Smasher has quit [Quit: Connection reset by beer]
gedaj has quit [Quit: Konversation terminated!]
Smasher has joined #ste||ar
vamatya has joined #ste||ar
jaafar_ has quit [Ping timeout: 240 seconds]
vamatya has quit [Ping timeout: 268 seconds]
marco has joined #ste||ar
<marco> Hello
<github> [hpx] msimberg pushed 1 new commit to master: https://git.io/vNdgd
<github> hpx/master 4df531c Mikael Simberg: Merge pull request #3126 from STEllAR-GROUP/coroutine_cleanup...
<simbergm> marco: hi, what brings you here?
<zao> Sucked in by the gravitational pull of massive resources needed to build HPX? :)
<simbergm> zao: sounds like a very good guess!
<github> [hpx] msimberg pushed 1 new commit to release: https://git.io/vNd2R
<github> hpx/release 98ae1f9 Mikael Simberg: Update instructions on changing the logo in the release procedures
<github> [hpx] msimberg pushed 1 new commit to release: https://git.io/vNdah
<github> hpx/release d8a4bd4 Mikael Simberg: Change to non-draft HPX logo in documentation
<marco> Hi, I'm taking my first steps with hpx and have an issue with a fixed 1d stencil operator. Should I paste the code here to discuss it (<100 lines)?
<zao> marco: Please link a pastebin page or a github gist.
david_pfander has joined #ste||ar
<marco> Here is the code: https://pastebin.com/6Fk3DJE8
<marco> My issue is the performance: my output: - nx : 2000 - nz : 1000 - padsize : 4 - count : 1000 Serial : 4.27075 OpenMP : 1.68201 HPX : 4.59588
<zao> Quite often the problem with crappy HPX perf is the granularity of your tasks.
<zao> Are you running with a suitable number of threads?
<zao> HPX has a bunch of perf counters one can use to get a feeling for how things run, but I have no idea how to use those :)
<marco> I use a quadcore machine with hyperthreading and hpx:threads=all --> 8 threads
<zao> Is this a released version or master?
<zao> (I'm not overly familiar with actual HPX functionality, but I expect someone competent will drop in later)
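(A minimal sketch of the granularity tuning zao is hinting at, assuming an HPX version from around the date of this log; header and namespace names have moved between releases, and the container size and chunk size are illustrative only.)

    #include <hpx/hpx_main.hpp>
    #include <hpx/include/parallel_executor_parameters.hpp>
    #include <hpx/include/parallel_for_each.hpp>

    #include <vector>

    int main()
    {
        std::vector<float> v(2000 * 1000, 1.0f);

        // when the per-element work is tiny, the default chunking can leave
        // tasks so small that scheduling overhead dominates; forcing larger
        // chunks amortizes that overhead
        hpx::parallel::execution::static_chunk_size chunk(4096);

        hpx::parallel::for_each(
            hpx::parallel::execution::par.with(chunk),
            v.begin(), v.end(),
            [](float& x) { x = 2.0f * x + 1.0f; });

        return 0;
    }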
<heller_> marco: how did you build HPX and your application?
<heller_> other than that ... I can't spot any real problems
<marco> It is the version from master (10.01.2018), built on centos7 with devtoolset-6
<heller_> ok
<heller_> debug/release? how did you build your benchmark?
<heller_> if you are somewhere near erlangen, you can also drop by my office ;)
<marco> hpx is built as release, the test program is built with O2 and separate debug info (to analyze with intel vtune amplifier)
<heller_> using cmake or a manually crafted makefile (or similar)?
<heller_> be sure to pass -DHPX_DISABLE_ASSERTS and -DNDEBUG to your compiler
<marco> If I change the value access of the stencil function to single value access (+-N to +-0), there is no performance issue
<marco> Sorry, I'm near Hannover
<marco> I only have problems with this type of loop. Various other loops have perfect performance.
<heller_> ok
<heller_> let me check again
<heller_> so
<heller_> I don't understand the change of the single value access
<zao> False sharing or something?
<heller_> NB: vtune is fine with O3 as well, as long as you generate the debug symbols (with -g)
<zao> How does it run for 1 or 2 cores?
<heller_> yeah, try with 4 cores as well
<heller_> marco: so you are saying that for_each gives you bad performance, but changing it to for_loop works better?
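(For reference, a hedged sketch of the index-based hpx::parallel::for_loop form heller_ is asking about, assuming the API of that era; the grid size and stencil body are made up and not marco's actual code.)

    #include <hpx/hpx_main.hpp>
    #include <hpx/include/parallel_for_loop.hpp>

    #include <cstddef>
    #include <vector>

    int main()
    {
        std::size_t const n = 2000;
        std::vector<float> in(n, 1.0f), out(n, 0.0f);

        // iterating over indices (rather than elements, as for_each does) makes
        // the +-1 neighbour accesses of a stencil straightforward to express
        hpx::parallel::for_loop(
            hpx::parallel::execution::par, std::size_t(1), n - 1,
            [&](std::size_t i)
            {
                out[i] = 0.25f * in[i - 1] + 0.5f * in[i] + 0.25f * in[i + 1];
            });

        return 0;
    }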
<github> [hpx] msimberg pushed 1 new commit to release: https://git.io/vNd1W
<github> hpx/release 7077bc0 Mikael Simberg: Small formatting fix in release_procedure.rst
<heller_> marco: one other thing, std::vector<std::vector<float>> is not exactly good for performance ;)
<heller_> especially with such stencil things
<heller_> you should linearize that
<heller_> this should give you an overall better performance result
<marco> By single value access I mean only an access to the central element of the stencil operator, without the neighbouring elements.
<heller_> ah ok
<heller_> yeah, that might hint to false sharing
<heller_> try the suggestion with the linearized 2d vector
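(A minimal sketch of the linearization heller_ suggests, assuming a row-major layout; the names grid2d, nx and nz are illustrative and not taken from marco's code.)

    #include <cstddef>
    #include <vector>

    // one contiguous allocation instead of std::vector<std::vector<float>>;
    // neighbouring rows stay close together in memory, which stencil sweeps
    // and the hardware prefetcher both like
    struct grid2d
    {
        std::size_t nx, nz;
        std::vector<float> data;

        grid2d(std::size_t nx_, std::size_t nz_)
          : nx(nx_), nz(nz_), data(nx_ * nz_, 0.0f)
        {
        }

        // (ix, iz) maps to a single offset; no second pointer chase per row
        float& operator()(std::size_t ix, std::size_t iz)
        {
            return data[ix * nz + iz];
        }
        float const& operator()(std::size_t ix, std::size_t iz) const
        {
            return data[ix * nz + iz];
        }
    };

An access like g(ix, iz - 1) then touches memory adjacent to g(ix, iz) instead of a separate heap block per row.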
<marco> yes, I know that vtune works well with O3. O3 is the standard for our release versions, but our compile environment sets O2 for profiling. I'll have to change that, ...
<zao> Always fun.
<marco> ok, I'll test it with 2/4 cores and linearize it.
<marco> thanks for the first tips, I will contact you later
<heller_> sure
<heller_> no problem, always here to help
zbyerly_ has joined #ste||ar
zbyerly__ has quit [Ping timeout: 276 seconds]
Smasher has quit [Quit: Connection reset by beer]
Smasher has joined #ste||ar
hkaiser has joined #ste||ar
<hkaiser> simbergm: so what's the merge procedure for the release now? do we merge to master or to release?
<heller_> should we remove hpx_fwd.hpp?
<heller_> It's been deprecated for a while now
<hkaiser> when did we deprecate it?
<heller_> 2 years ago
<simbergm> hkaiser: I don't have a strong opinion as I don't know yet which one works better in practice, but my feeling is that we should keep merging to master and I will pick from there to the release branch
<hkaiser> nod, let's remove it, but leave a note in the what's new section
<hkaiser> simbergm: ok - heller_ will disagree ;-)
<simbergm> like this we can keep merging things to master that need not go in the release
<simbergm> if heller_ has good arguments for making PRs to release I'm more than okay with that as well :)
<heller_> I don't like cherry-picking from master
<heller_> since it makes it hard to reintegrate release back to master
<simbergm> okay, that's fair
<heller_> in a perfect world, we would just branch off of master, and call it a release
<simbergm> agreed
<hkaiser> simbergm: last release we decided to merge to release and do one merge back to master after the release - worked quite well, actually
<heller_> *nod*
<heller_> the only problem we had was that testing was a bit of a mess, IIRC
<hkaiser> yes
<simbergm> meaning cherry picking from master before release, and then merging to master after release?
<hkaiser> no
<heller_> that is, things that landed on master during the release weren't properly tested etc.
<hkaiser> during the release time merge all PRs to release, leaving master alone
<simbergm> hkaiser: sorry misread
<simbergm> I see now
<simbergm> okay, so let's do that now as well
<simbergm> already open PRs against master can be picked to release
<hkaiser> simbergm: requires changes to testing infrastructure, tests should run off of release
<simbergm> hkaiser: yes
<hkaiser> existing PRs can be merged manually to release after being merged to master
<simbergm> I didn't want to do it yet since there's no rc
<simbergm> yeah
<hkaiser> let's avoid cherry picking
<simbergm> okay, missed the distinction between merging and cherry picking
<simbergm> I see
<hkaiser> cherry picking creates a new independent commit, merging does not
<simbergm> so at the moment I think master is in a pretty good state (i.e. not too many failing tests, and all failures are occasional except for the stacksize test)
<simbergm> in your opinion is this a good state for an rc?
<simbergm> what has buildbot looked like usually during an rc?
<heller_> same or even worse :P
<hkaiser> rc means that there will be no new features or refactorings, only bug fixes before the release
<heller_> I'd like to get the stacksize problem fixed though
<hkaiser> sure, that's a bug
<heller_> let's also try to keep the timeframe for doing the release as minimal as possible
<heller_> last time, we had lots of trouble keeping everything in sync...
<simbergm> but then I would stick to master still for now and try to keep fixing as much as possible
<hkaiser> heller_: it wasn't too bad
<simbergm> I guess the only feature still going in is my suspension PR
<simbergm> do you have anything planned?
<simbergm> anything else
<hkaiser> simbergm: and the thread scheduler changes heller_ has in the pipeline
<heller_> should they go into the release?
<simbergm> they change APIs?
<heller_> a little, yes
<hkaiser> heller_: I thought we delayed the release for those
<simbergm> ok
<heller_> I was aware that we delayed the release for them
<heller_> and jbjnr reported no real speedup in his application
<hkaiser> heller_: so why do we do those, then?
<heller_> so I guess they need more work
<heller_> first of all, he didn't test the full set
<heller_> second of all, the full set needs more work
<heller_> probably another week or so
<hkaiser> heller_: so you're contradicting yourself here ;)
<heller_> for my micro benchmarks, they showed better performance
<heller_> that's all I said
<hkaiser> sure
<simbergm> in my opinion we're not really behind schedule as I set the rc date quite conservatively and optimistically, so let's keep working on master still until e.g. wednesday next week and see again where we are
<heller_> sounds good
<heller_> night shifts ahead!
<heller_> ;)
<heller_> we are not doing the tutorial in march anyway
<hkaiser> nice
<simbergm> yeah, that's good
<jbjnr> heller_: my best results were about 985 GFlops before, but this week I got 1010 peak, so there's been a 2% or so improvement as a result of some general cleanup from your continuations and the profiling fixes etc. gtg.
<heller_> ok
<heller_> I wasn't aware that the 2% were due to my changes
<hkaiser> that's a nice result as well
<heller_> and since the grain size is relatively large for your application, I am not sure they'll matter
<heller_> (except for the allocations etc.)
<heller_> I am working on trying to reduce binary size right now...
<heller_> which I am hoping helps with partitioned_vector tests
<hkaiser> heller_: let's finish the other stuff first
<hkaiser> we've been living with the partitioned_vector things for a while, no need to work on it now
<heller_> well ... low risk change, I was in "let's improve the code as it is without changing functionality" mode
<hkaiser> heller_: we have enough half-way done things
<heller_> we do
<hkaiser> #3031 has inspect problems, #3036 is still open
<hkaiser> I meant #3130
<heller_> #3036 can't be closed until we fix the partitioned_vector compile problems
<hkaiser> heller_: ok - was not aware of that
<heller_> well, we can merge it
<heller_> but then we'll have to live with failing tests until the compile/link problems are fixed
<simbergm> jbjnr: for dca++ you're going to need cuda support, no? have you compiled hpx (with cuda) successfully using something other than what's on rostam?
daissgr has joined #ste||ar
<K-ballo> heller_: what's griwes sfinae thing?
<heller_> K-ballo: https://godbolt.org/g/Kp4DhH
<heller_> K-ballo: omitting it from the name
<heller_> the SFINAE type that is
<K-ballo> that looks like a pack expansion of void non-type template arguments, what is it?
<heller_> yeah
<heller_> I am not sure why it doesn't show up in the name of the function though
Vir has quit [Ping timeout: 240 seconds]
<K-ballo> I don't understand why it even works, but it seems related
<K-ballo> if it did show up it would form a `void...` in the non-sfinae case, which would be ill-formed
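(For context, one common shape of such a trick, offered as an assumption only since the actual snippet is behind the godbolt link above: the SFINAE constraint sits in a trailing non-type template parameter pack, so a satisfied constraint leaves the pack empty and adds nothing visible to the call. The names requires_ and twice are made up.)

    #include <type_traits>

    // yields void when B is true, otherwise substitution fails (SFINAE)
    template <bool B>
    using requires_ = typename std::enable_if<B>::type;

    // the trailing pack of requires_<...>* parameters stays empty when the
    // constraint holds, so it contributes no extra template argument
    template <typename T, requires_<std::is_integral<T>::value>*...>
    T twice(T t)
    {
        return t + t;
    }

    int main()
    {
        return twice(21) - 42;    // compiles; twice(1.5) would be rejected
    }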
<heller_> great. time totally wasted
<heller_> full debug build, griwes trick: 31.09GB, current master: 31.21 GB, so no real gain there. It's even slower to compile
hkaiser has quit [Quit: bye]
daissgr has quit [Ping timeout: 255 seconds]
daissgr has joined #ste||ar
Vir has joined #ste||ar
<github> [hpx] msimberg closed pull request #3131: Fixing #2325 (master...fixing_2325) https://git.io/vN9hx
eschnett has quit [Quit: eschnett]
hkaiser has joined #ste||ar
aserio has joined #ste||ar
eschnett has joined #ste||ar
aserio has quit [Ping timeout: 276 seconds]
eschnett has quit [Quit: eschnett]
aserio has joined #ste||ar
eschnett has joined #ste||ar
mbremer has joined #ste||ar
<mbremer> @hkaiser: yt?
<hkaiser> mbremer: here
<mbremer> Any updates on the paper?
<hkaiser> mbremer: I have not done anything :/
<simbergm> heller_: #3131 seems to have broken reduce_by_key, any guesses why? I'll try to look tomorrow
<hkaiser> simbergm: heh
<mbremer> @hkaiser: Also do you have a bibtex entry for the GB paper?
<mbremer> Alternatively, I suppose the scaling results are also mentioned in the OpenSuCo paper
<jbjnr> simbergm: yes. I am running dca++ with cuda on my laptop and on daint using (hpx+cuda)+(dca+++cuda)
<jbjnr> no problems now.
<jbjnr> heller_: I ran the cholesky several times and discovered that the map change was not really making any difference, it is within the noise - but variance of runs is quite high - however, everything is 'just a bit' faster than it used to be - so I'm assuming the string cleanup and your continuation fixes are the main thing
aserio has quit [Ping timeout: 252 seconds]
vamatya has joined #ste||ar
jaafar_ has joined #ste||ar
david_pfander has quit [Ping timeout: 265 seconds]
daissgr has quit [Ping timeout: 252 seconds]
<heller_> jbjnr: great!
<heller_> simbergm: I'll have a look as well
<heller_> simbergm: reduce_by_key might be related to the scan_partitioner changes after all, meaning it's not fixed yet
aserio has joined #ste||ar
<heller_> simbergm: same problem as before, I guess
daissgr has joined #ste||ar
aserio1 has joined #ste||ar
EverYoung has quit [Remote host closed the connection]
EverYoung has joined #ste||ar
aserio has quit [Ping timeout: 265 seconds]
aserio1 is now known as aserio
twwright_ has joined #ste||ar
twwright has quit [Read error: Connection reset by peer]
twwright_ is now known as twwright
daissgr has quit [Ping timeout: 252 seconds]
<heller_> simbergm: since it's just reduce_by_key, my guess would be some kind of race in the algorithm itself, or the partitioner, or wrong usage of it
EverYoung has quit [Remote host closed the connection]
EverYoung has joined #ste||ar
aserio has quit [Ping timeout: 276 seconds]
aserio has joined #ste||ar
daissgr has joined #ste||ar
<heller_> simbergm: nope. It's a bug in the scan partitioner/executors ignoring the sync policy
<heller_> jbjnr: did I read the code correctly that all algorithms should execute sequentially for reduce_by_key?
<jbjnr> the scan part is not sequential, but there was one sequential bit in there. I can't remember without looking and I'm in a meeting right now.
Smasher has quit [Ping timeout: 240 seconds]
Smasher has joined #ste||ar
Smasher is now known as Smashor
hkaiser has quit [Quit: bye]
aserio has quit [Ping timeout: 240 seconds]
aserio has joined #ste||ar
Smashor has quit [Remote host closed the connection]
aserio has quit [Quit: aserio]
hkaiser has joined #ste||ar
<github> [hpx] hkaiser deleted coroutine_cleanup at 9e1648c: https://git.io/vNbl9
EverYoung has quit [Read error: Connection reset by peer]