#ste||ar on 2019-06-26 — irc logs at irclog.cct.lsu.edu

2019-06-17 20:46 hkaiser changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/ | GSoD: https://developers.google.com/season-of-docs/

00:20 Coldblackice has quit [Remote host closed the connection]

00:20 Coldblackice has joined #ste||ar

00:23 Coldblackice has quit [Remote host closed the connection]

00:24 Coldblackice has joined #ste||ar

00:24 Coldblackice has quit [Remote host closed the connection]

00:24 Coldblackice has joined #ste||ar

00:27 Coldblackice has quit [Remote host closed the connection]

00:27 Coldblackice has joined #ste||ar

00:29 Coldblackice has quit [Remote host closed the connection]

00:29 Coldblackice has joined #ste||ar

00:30 Coldblackice has quit [Remote host closed the connection]

00:30 Coldblackice has joined #ste||ar

00:32 Coldblackice has quit [Remote host closed the connection]

00:32 Coldblackice has joined #ste||ar

00:33 Coldblackice has quit [Remote host closed the connection]

00:33 Coldblackice has joined #ste||ar

00:33 Coldblackice has quit [Client Quit]

00:55 K-ballo has quit [Quit: K-ballo]

01:56 nikunj has joined #ste||ar

02:15 hkaiser has quit [Quit: bye]

05:56 nikunj has quit [Remote host closed the connection]

08:13 <mdiers_> heller: getting closer to the problem: I had tested with sanitize=leak, but the sanitizer adjustments in hpx are only made with sanitize=address. Now continuing with sanitize=address, I get an undeclared identifier asan_fake_stack in context_base.hpp:221. Should I use lx::x86_linux_context_impl_base::asan_fake_stack instead of asan_fake_stack, or is there something missing?

09:20 rori has joined #ste||ar

09:32 JClave has joined #ste||ar

09:44 JClave has quit [Remote host closed the connection]

09:45 JClave has joined #ste||ar

10:53 <simbergm> jbjnr: yt? cdash submissions have been missing for a while and it looks like it started happening after the cdash upgrade

10:53 <JClave> does anyone know of any commercial software projects using HPX?

10:53 <simbergm> do you know if something changed in the format or submission url?

10:54 <simbergm> JClave: I don't think there are any, at least none who would publicly say so

10:56 <JClave> because of security reasons?

10:56 <mdiers_> we have one in development, but nothing public yet

11:11 <simbergm> JClave: not necessarily, just that there might be commercial projects using HPX but they just haven't told us

11:12 <simbergm> academic projects are something else, there are at least a few

11:13 <JClave> would you mind naming some please?

11:13 <jbjnr> simbergm: I'll take a look at it. It seemed to be working when it was upgraded, but must have stopped with new results

11:14 <jbjnr> mdiers_: anything you can share with us about your application?

11:16 <simbergm> JClave: octotiger is the most prominent one I can think of, hpxMP is a small reimplementation of OpenMP with HPX, flecsi apparently has some sort of HPX backend, here at CSCS we're working on a cholesky decomposition with HPX (not public)

11:16 <simbergm> hopefully others can fill in the gaps

11:34 <tarzeau> simbergm: did it work at all? were you able to test anything?

11:34 <simbergm> tarzeau: no time yet, sorry

11:35 <tarzeau> i like the stellar group logo, who created it?

11:39 <heller> CCT

11:47 K-ballo has joined #ste||ar

11:52 hkaiser has joined #ste||ar

11:54 <mdiers_> jbjnr: in short: an application for processing seismic data. a small overview will be posted on our website soon.

11:55 <jbjnr> very nice. an HPC related theme then.

11:55 <jbjnr> Do make sure you send an announcement to the HPX users' list when you write about it, as we'll all be interested in knowing about it

11:59 <Yorlik> hkaiser: Got yesterday's mess cleaned up. It was a typical newbie-doesn't-know-what-he-does thing. Had to clean it up myself. However, the input here still helped me, since it changed the way I was looking at things. Thanks K-ballo and heller too. :)

11:59 <Yorlik> + zao :)

11:59 <mdiers_> jbjnr: Yes, but it is also HTC related. I will try to think about the user's list.

12:00 <zao> <3

12:01 <Yorlik> <:3 )~~

12:03 <hkaiser> JClave: we work on a fairly large machine-learning project that uses HPX: Phylanx (github)

12:04 <hkaiser> JClave: also, Yorlik here develops a MMO game using it

12:13 <JClave> Thanks! Keen to start contributing soon, good to verify that this project is key to a lot of production-ready software.

12:14 <hkaiser> JClave: welcome on board, then!

12:14 <hkaiser> JClave: what's your interest?

12:17 <jbjnr> I've got nothing in my calendar for HPX meeting this afternoon, so if anyone has a link to click at the right time, please send it to me (webex or appear.in ?)

12:17 <JClave> anything involving multithreading and synchronisation primitives. Was going to find something that people want done in HPX

12:17 <hkaiser> jbjnr: we'll probably do appear.in

12:17 <hkaiser> JClave: cool

12:18 <hkaiser> JClave: parallel algorithms?

12:18 <JClave> yeah i was looking for some work related to that

12:20 <hkaiser> JClave: here are a couple of related tickets: #1141, #1338, #2235, #1836, #1668

12:20 <hkaiser> there might be more, just look around

12:22 <JClave> hkaiser: thanks! will have a look and comment on ones i wish to pick up. just managed to run HPX examples successfully on windows today so will spend a bit more time getting comfortable first :)

12:22 <hkaiser> :D

12:23 <jbjnr> hkaiser: ta

12:31 JClave has quit [Quit: Going offline, see ya! (www.adiirc.com)]

12:32 JClave has joined #ste||ar

13:20 hkaiser has quit [Quit: bye]

13:39 <diehlpk_work> jbjnr, Where can I find the libfabric branch?

13:41 <jbjnr> diehlpk_work: https://github.com/biddisco/hpx/tree/rdma_object

13:42 <jbjnr> or https://github.com/STEllAR-GROUP/hpx/tree/rdma_object

13:45 <diehlpk_work> jbjnr, Thanks

13:45 <diehlpk_work> Bryce is asking around at Nvidia to get support for our next attempt

13:46 <jbjnr> we haven't got our paper into SC yet.

13:48 <diehlpk_work> Yes, but how does this relate to the next attempt?

13:48 <jbjnr> crawl, walk, run

13:49 hkaiser has joined #ste||ar

13:59 <hkaiser> heller, simbergm, jbjnr: appear.in?

14:01 <simbergm> hkaiser: yep, sec

14:37 Karame has joined #ste||ar

14:50 rori has quit [Ping timeout: 245 seconds]

15:06 <Yorlik> Any suggestions on what to read about strategies for when and how to use huge / large memory pages to relieve pressure on the TLB, and especially how to measure whether it makes sense in the first place?

15:11 Karame has quit [Ping timeout: 252 seconds]

15:12 <heller> hkaiser: simbergm: damn, totally forgot :/

15:12 <heller> I need irc at work...

15:30 JClave has quit [Remote host closed the connection]

15:48 <simbergm> heller: you have time tomorrow or friday?

16:25 Yorlik has quit [Read error: Connection reset by peer]

16:49 <heller> simbergm: tomorrow between 9 and 12 would be good

16:50 <simbergm> heller: fine by me

16:50 <simbergm> jbjnr: good for you?

16:51 <heller> What time do you prefer?

17:37 <simbergm> heller: any time is fine

18:22 hkaiser has quit [Quit: bye]

19:16 hkaiser has joined #ste||ar

20:00 <nikunj97> hkaiser: yt?

20:00 <hkaiser> here

20:00 <nikunj97> I was running Jackson's code

20:00 <nikunj97> and something is fishy

20:00 <hkaiser> k

20:00 <hkaiser> why am I not surprised?

20:00 <nikunj97> xD

20:01 <nikunj97> so the thing is, with 128 tiles, 16000 doubles/tile and 8192 iterations with 128 steps/iteration they report times of around 5s

20:01 <nikunj97> they -> GaTech

20:02 <hkaiser> k

20:02 <nikunj97> and in their description they say it's Jackson's idea

20:02 <nikunj97> the same code that we run won't finish anywhere close to 5s

20:03 <nikunj97> is it coz the many shared futures are bottlenecking the performance?

20:03 <hkaiser> well, let's see

20:03 <hkaiser> how many futures do we create for this?

20:04 <nikunj97> let me check

20:04 <nikunj97> 128 shared futures per iteration

20:05 <nikunj97> and 8192 iteration in total

20:06 <hkaiser> and they do that without any futures?

20:06 <nikunj97> well, if they use the same code then they do it using promises and futures

20:06 <hkaiser> that's ~1Mio futures for us, i.e. about 1-2s overhead from them
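
The ~1 million figure follows from the numbers quoted above: 128 shared futures per iteration times 8192 iterations. A minimal sketch of that arithmetic; the per-future cost of 1 to 2 microseconds is an assumption back-derived from the "1-2s overhead" estimate, not a measured value, and the names are illustrative only:

```cpp
#include <cassert>
#include <cstdint>

// Numbers quoted in the discussion.
constexpr std::int64_t futures_per_iteration = 128;
constexpr std::int64_t iterations = 8192;

// 128 * 8192 = 2^7 * 2^13 = 2^20 = 1,048,576 futures in total.
constexpr std::int64_t total_futures = futures_per_iteration * iterations;

// Estimated overhead in seconds for an ASSUMED cost per future, given in
// microseconds. The cost itself is a guess, not a benchmark result.
inline double future_overhead_seconds(double microseconds_per_future)
{
    assert(microseconds_per_future >= 0.0);
    return static_cast<double>(total_futures) * microseconds_per_future * 1e-6;
}
```

With an assumed 1 microsecond per future this gives about 1.05 s, and with 2 microseconds about 2.1 s, which is far from the 30+ minutes reported later, so future overhead alone would not explain the slowdown.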

20:07 <hkaiser> what do you mean by 128 steps/iteration?

20:07 <diehlpk_work> hkaiser, I asked for the compiler matches, because this is an issue for the Fedora packages.

20:07 <nikunj97> so they copy the left and right tiles

20:07 <nikunj97> so that they can do more time steps per iteration

20:08 <hkaiser> and each step requires a future?

20:08 Vir has joined #ste||ar

20:08 <hkaiser> diehlpk_work: nod, thought so - can you define the flag?

20:08 <hkaiser> or would that be in the user's responsibility?

20:09 <diehlpk_work> There are two sides to the coin

20:09 <nikunj97> hkaiser: Don't think so

20:09 <hkaiser> nikunj97: do we have a future per timestep or a future per iteration?

20:09 <diehlpk_work> First, if the user will use the fedora package and compile his own code, it is his responsibility

20:09 <nikunj97> future per iteration

20:09 <nikunj97> not time step

20:10 <hkaiser> nikunj97: ok

20:10 <diehlpk_work> Second, if one uses our fedora package on their build system, I do not know

20:10 <hkaiser> how long does one iteration take?

20:10 <nikunj97> I didn't check

20:10 <nikunj97> but 30 min in with the parameters and it was still running

20:10 <nikunj97> it should not take that long

20:10 <hkaiser> diehlpk_work: we can make that check optional to begin with, or limit it to the major version as you suggested

20:11 <hkaiser> nikunj97: so it just hangs?

20:11 <hkaiser> does it make progress at all?

20:11 <nikunj97> that's what I think

20:11 <nikunj97> It's surely making progress, but it's taking too long

20:11 <hkaiser> ok

20:11 <diehlpk_work> What about checking the major version, and if the major version matches, we allow compiling but emit a warning that the minor version does not match and recommend making them match

20:11 <diehlpk_work> if the major version does not match we throw an error
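
The proposed policy can be written down as a small decision function. This is only a sketch of the logic in C++ for illustration; the actual check lives in the CMake template linked later in the log, and the type and function names below are hypothetical:

```cpp
#include <cassert>

// Outcome of comparing the compiler that built the library with the
// compiler that is building user code.
enum class compiler_check
{
    ok,
    warn_minor_mismatch,
    error_major_mismatch
};

// Hypothetical version record; only major and minor are considered.
struct compiler_version
{
    int major_number;
    int minor_number;
};

inline compiler_check check_compiler(
    compiler_version library, compiler_version user)
{
    assert(library.major_number >= 0 && user.major_number >= 0);

    // Different major versions: refuse to build.
    if (library.major_number != user.major_number)
        return compiler_check::error_major_mismatch;

    // Same major, different minor: build, but recommend matching versions.
    if (library.minor_number != user.minor_number)
        return compiler_check::warn_minor_mismatch;

    return compiler_check::ok;
}
```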

20:12 <nikunj97> so doing 4096 as subdomain width, 1024 time steps, and 3 subdomains itself is taking 28s to run

20:12 <hkaiser> ok, do you care enough to have a look into this?

20:12 <hkaiser> nikunj97: in release?

20:13 <nikunj97> yes

20:13 <hkaiser> ;-)

20:13 <hkaiser> diehlpk_work: ^^

20:13 <nikunj97> hkaiser: everything is explicitly release now xD

20:13 <hkaiser> ok

20:13 <diehlpk_work> hkaiser, Yes, I will have a look

20:13 <hkaiser> diehlpk_work: the code is here: https://github.com/STEllAR-GROUP/hpx/blob/master/cmake/templates/HPXMacros.cmake.in#L12-L37

20:14 <nikunj97> hkaiser: could you please take a look at the code? https://github.com/STEllAR-GROUP/hpxr/blob/master/benchmarks/replay/dataflow_replay.cpp

20:14 <hkaiser> nikunj97: let's have a look at some perf-counters and/or vtune

20:14 <nikunj97> I don't understand stencils well, so I must be missing something

20:14 <diehlpk_work> I will have a look later this week and if it can be done in one hour I will do it

20:14 <nikunj97> let me analyse it in vtune

20:14 <diehlpk_work> If not I will add the flag to fedora

20:15 <hkaiser> nikunj97: I think we overwhelm the system with all those futures, what is the memory footprint of the application?

20:15 <hkaiser> diehlpk_work: I can also have a look

20:15 <nikunj97> hkaiser: I didn't check that

20:16 <nikunj97> I didn't analyse the application as of now, just reporting fishy behavior

20:16 <hkaiser> nikunj97: we create the whole tree in one go

20:16 <nikunj97> yes, that we do

20:16 <hkaiser> inserting a sliding semaphore might help as it would limit the depth of the tree dynamically

20:17 <hkaiser> nikunj97: also parallelizing this loop may help: https://github.com/STEllAR-GROUP/hpxr/blob/master/benchmarks/replay/dataflow_replay.cpp#L164

20:18 <nikunj97> you want me to do a par_for?

20:18 <diehlpk_work> hkaiser, sure, I will not have time to do it before Thursday, I'd like to finish the course project and put it on the web page first.

20:19 <hkaiser> sure

20:19 <hkaiser> nikunj97: as one of the things, yes - but not first priority

20:19 <diehlpk_work> let me give it a try this Friday and I will assume I will need your help anyway

20:20 <hkaiser> nikunj97: the sliding_semaphore would be more important: https://github.com/STEllAR-GROUP/hpx/blob/master/examples/1d_stencil/1d_stencil_8.cpp#L556

20:21 <hkaiser> and here: sem

20:21 <hkaiser> https://github.com/STEllAR-GROUP/hpx/blob/master/examples/1d_stencil/1d_stencil_8.cpp#L605-L619

20:22 <nikunj97> adding sliding semaphore should help, but I'm still not sure if it'll finish everything in ~5-10s

20:23 <hkaiser> nikunj97: please look at some perf-counters: idle-rate (enabled at build-time), average thread duration, number of created threads

20:23 <hkaiser> nikunj97: one step at a time

20:23 <nikunj97> ok let me see what I can do :)

20:24 <hkaiser> nikunj97: what error rates do you use?

20:24 <nikunj97> it was without injecting errors

20:24 <nikunj97> btw should I just use the 1d stencil code instead?

20:25 <nikunj97> the one in hpx examples

20:25 <hkaiser> also, this allocation will kill perf (https://github.com/STEllAR-GROUP/hpxr/blob/master/benchmarks/replay/dataflow_replay.cpp#L100)

20:25 <hkaiser> nikunj97: worth a try, however the local stencil1d does not use the sliding semaphore, I think

20:26 <nikunj97> yeah that's true, they're copying left and right tile

20:26 <nikunj97> yes but 1d_stencil_4 does have limit for depth

20:26 <nikunj97> I ran it, took 3.2s to run

20:27 <hkaiser> does it?

20:27 <hkaiser> nod

20:27 <nikunj97> ./1d_stencil_4 --nx=16000 --nt=8192 --nd=10 --np=128 --k=0.5

20:27 <nikunj97> took 3.28161533

20:27 <hkaiser> I'm not sure the code has been looked at from the standpoint of perf at all

20:28 <hkaiser> ahh, it uses sliding semaphore after all

20:28 <nikunj97> it was Jackson's code, on top of which adrian made things right, and I simply added a function to inject errors

20:29 <hkaiser> nikunj97: sure - but now it calculates a checksum and throws an exception without looking at it

20:30 <nikunj97> yes coz I made it like that

20:30 <hkaiser> and it allocates a buffer for each timestep and partition, etc.

20:30 <nikunj97> checksums will always give right results

20:30 <nikunj97> so you'll have to inject errors artificially

20:30 <hkaiser> nikunj97: if you can retrofit Jackson's kernel into stencil1d_4, sure - have a try

20:31 <hkaiser> you could have overwritten some of the calculated values and let the checksum tell you whether to fail

20:31 <hkaiser> (but this is irrelevant for perf)
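
The alternative described here, corrupting a computed value and letting the checksum decide whether the attempt failed, can be sketched together with a retry loop. Everything below is a hypothetical illustration in plain C++: `replay`, `checksum`, and `checked_task` are made-up names, the retry helper is not the hpxr `dataflow_replay` implementation, and a fixed number of faulty attempts stands in for a random error rate so the behaviour is deterministic:

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <stdexcept>
#include <vector>

inline double checksum(std::vector<double> const& values)
{
    return std::accumulate(values.begin(), values.end(), 0.0);
}

// Call `f` until it succeeds, at most `max_attempts` times. The exception
// from the final failed attempt is rethrown to the caller.
template <typename F>
auto replay(std::size_t max_attempts, F&& f) -> decltype(f())
{
    assert(max_attempts > 0);
    for (std::size_t attempt = 1;; ++attempt)
    {
        try
        {
            return f();
        }
        catch (...)
        {
            if (attempt >= max_attempts)
                throw;
        }
    }
}

// Hypothetical task: produces a tile, corrupts it on the first `faults`
// attempts, and lets the checksum decide whether the attempt failed.
struct checked_task
{
    explicit checked_task(int faults_to_inject)
      : faults(faults_to_inject)
    {
    }

    std::vector<double> operator()()
    {
        ++attempts;
        std::vector<double> tile(8, 1.0);
        double const expected = checksum(tile);

        if (attempts <= faults)
            tile[3] = 42.0;    // injected error

        if (checksum(tile) != expected)
            throw std::runtime_error("checksum mismatch");
        return tile;
    }

    int faults;
    int attempts = 0;
};
```

Passing the task as an lvalue, as in `checked_task task(2); auto tile = replay(3, task);`, keeps the attempt counter visible to the caller. As noted in the log, which of the two injection styles is used does not matter for the performance numbers.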

20:32 <nikunj97> that's what I was going to do, but then adrian said it's fine either way since it's benchmarking and we need to inject errors

20:33 <hkaiser> nikunj97: sure, changing stencil1d_4 to simply use dataflow_replay with error injection (and the existing kernel) would give us some numbers as well

20:33 <nikunj97> yes

20:33 <nikunj97> and will save me the hassle to analyse and optimize Jackson's code

20:34 <nikunj97> it's much easier for me to add error injections into stencil1d_4

20:39 <nikunj97> hkaiser: just to let you know, the actual example lw_1d_replay throws error for the given parameters

20:39 <nikunj97> I tried running it rn

21:29 <nikunj97> hkaiser: just took a deep look into it. The init function itself does some 128 allocations for 16001 sized vector. Furthermore it is allocating ~25000 times a vector of 48000 elements. No wonder it's taking much longer than usual
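
If those elements are 8-byte doubles (an assumption), ~25000 allocations of 48000 elements amount to roughly 9.6 GB of cumulative allocation traffic. One common way to avoid this kind of cost is to allocate a scratch buffer once and swap it with the current buffer after every step. A minimal sketch under stated assumptions: the update is a generic explicit 1D heat step with fixed boundary values, and `heat_step` and `run_steps` are illustrative names, not functions from the benchmark:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One explicit heat-equation update. Interior points are recomputed, the two
// boundary values are copied unchanged. `out` must already have the right
// size, so no allocation happens here.
inline void heat_step(
    std::vector<double> const& in, std::vector<double>& out, double k)
{
    assert(in.size() == out.size());
    if (in.empty())
        return;

    out.front() = in.front();
    out.back() = in.back();
    for (std::size_t i = 1; i + 1 < in.size(); ++i)
        out[i] = in[i] + k * (in[i - 1] - 2.0 * in[i] + in[i + 1]);
}

// Run `steps` updates with a single scratch buffer that is reused throughout.
inline std::vector<double> run_steps(
    std::vector<double> current, std::size_t steps, double k)
{
    std::vector<double> scratch(current.size());    // allocated once
    for (std::size_t s = 0; s != steps; ++s)
    {
        heat_step(current, scratch, k);
        current.swap(scratch);    // O(1), no copy and no allocation
    }
    return current;
}
```

Whether this restructuring fits the benchmark depends on how its tiles are shared between tasks, so it is only an indication of the direction, not a drop-in fix.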

21:40 <hkaiser> right

21:53 <nikunj97> hkaiser: should I ask Keita if we can use the existing 1d_stencil that we have?

21:53 <nikunj97> instead of trying to reduce the allocations and optimizing things here and there

22:04 Yorlik has joined #ste||ar

22:37 <hkaiser> nikunj97: nah, just do it