hkaiser changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/ | GSoD: https://developers.google.com/season-of-docs/
Coldblackice has quit [Remote host closed the connection]
Coldblackice has joined #ste||ar
Coldblackice has quit [Remote host closed the connection]
Coldblackice has joined #ste||ar
Coldblackice has quit [Remote host closed the connection]
Coldblackice has joined #ste||ar
Coldblackice has quit [Remote host closed the connection]
Coldblackice has joined #ste||ar
Coldblackice has quit [Remote host closed the connection]
Coldblackice has joined #ste||ar
Coldblackice has quit [Remote host closed the connection]
Coldblackice has joined #ste||ar
Coldblackice has quit [Remote host closed the connection]
Coldblackice has joined #ste||ar
Coldblackice has quit [Remote host closed the connection]
Coldblackice has joined #ste||ar
Coldblackice has quit [Client Quit]
K-ballo has quit [Quit: K-ballo]
nikunj has joined #ste||ar
hkaiser has quit [Quit: bye]
nikunj has quit [Remote host closed the connection]
<mdiers_> heller: getting closer to the problem: i had tested with sanitize=leak, but the sanitize adjustments in hpx are only done with sanitize=address. now continuing with sanitize=address. now i get an undeclared identifier asan_fake_stack in context_base.hpp:221. should i use lx::x86_linux_context_impl_base::asan_fake_stack instead of asan_fake_stack? or is there something missing?
rori has joined #ste||ar
JClave has joined #ste||ar
JClave has quit [Remote host closed the connection]
JClave has joined #ste||ar
<simbergm> jbjnr: yt? cdash submissions have been missing for a while and it looks like it started happening after the cdash upgrade
<JClave> does anyone know of any commercial software projects using HPX?
<simbergm> do you know if something changed in the format or submission url?
<simbergm> JClave: I don't think there are any who would at least publicly say so
<JClave> because of security reasons?
<mdiers_> we have one in development, but nothing public yet
<simbergm> JClave: not necessarily, just that there might be commercial projects using HPX but they just haven't told us
<simbergm> academic projects is something else, there are at least a few
<JClave> would you mind naming some please?
<jbjnr> simbergm: I'll take a look at it. It seemed to be working when it was upgraded, but must have stopped with new results
<jbjnr> mdiers_: anything you can share with us about your application?
<simbergm> JClave: octotiger is the most prominent one I can think of, hpxMP is a small reimplementation of OpenMP with HPX, flecsi apparently has some sort of HPX backend, here at CSCS we're working on a cholesky decomposition with HPX (not public)
<simbergm> hopefully others can fill in the gaps
<tarzeau> simbergm: did it work at all? were you able to test anything?
<simbergm> tarzeau: no time yet, sorry
<tarzeau> i like the stellar group logo, who created it?
<heller> CCT
K-ballo has joined #ste||ar
hkaiser has joined #ste||ar
<mdiers_> jbjnr: in short: an application for processing seismic data. a small overview will be posted on our website soon.
<jbjnr> very nice. an HPC related theme then.
<jbjnr> Do make sure you send an announcement to the HPX users' list when you write about it, as we'll all be interested in knowing about it
<Yorlik> hkaiser: Got yesterday's mess cleaned up. It was a typical newbie-doesn't-know-what-he-does thing. Had to clean it up myself. However the input here still helped me, since it changed the way I was looking at things. Thanks K-ballo and heller too. :)
<Yorlik> + zao :)
<mdiers_> jbjnr: Yes, but it is also HTC related. I will try to think about the user's list.
<zao> <3
<Yorlik> <:3 )~~
<hkaiser> JClave: we work on a fairly large machine-learning project that uses HPX: Phylanx (github)
<hkaiser> JClave: also, Yorlik here develops a MMO game using it
<JClave> Thanks! Keen to start contributing soon, good to verify that this project is key in many production-ready software projects.
<hkaiser> JClave: welcome on board, then!
<hkaiser> JClave: what's your interest?
<jbjnr> I've got nothing in my calendar for HPX meeting this afternoon, so if anyone has a link to click at the right time, please send it to me (webex or appear.in ?)
<JClave> anything involving multithreading and synchronisation primitives. Was going to find something that people want done in HPX
<hkaiser> jbjnr: we'll probably do appear.in
<hkaiser> JClave: cool
<hkaiser> JClave: parallel algorithms?
<JClave> yeah i was looking for some work related to that
<hkaiser> jaafar: here are a couple of related tickets: #1141, #1338, #2235, #1836, #1668
<hkaiser> there might be more, just look around
<JClave> hkaiser: thanks! will have a look and comment on ones i wish to pick up. just managed to run HPX examples successfully on windows today so will spend a bit more time getting comfortable first :)
<hkaiser> :D
<jbjnr> hkaiser: ta
JClave has quit [Quit: Going offline, see ya! (www.adiirc.com)]
JClave has joined #ste||ar
hkaiser has quit [Quit: bye]
<diehlpk_work> jbjnr, Where can I find the libfabric branch?
<diehlpk_work> jbjnr, Thanks
<diehlpk_work> Bryce is asking around at Nvidia to get support for our next attempt
<jbjnr> we haven't got our paper into SC yet.
<diehlpk_work> Yes, but how does this relate to the next attempt?
<jbjnr> crawl, walk, run
hkaiser has joined #ste||ar
<hkaiser> heller, simbergm, jbjnr: appear.in?
<simbergm> hkaiser: yep, sec
Karame has joined #ste||ar
rori has quit [Ping timeout: 245 seconds]
<Yorlik> Any suggestions on what to read about strategies for when and how to use huge / large memory tables to relieve pressure on the TLB, and especially how to measure if it makes sense in the first place?
Karame has quit [Ping timeout: 252 seconds]
<heller> hkaiser: simbergm: damn, totally forgot :/
<heller> I need irc at work...
JClave has quit [Remote host closed the connection]
<simbergm> heller: you have time tomorrow or friday?
Yorlik has quit [Read error: Connection reset by peer]
<heller> simbergm: tomorrow between 9 and 12 would be good
<simbergm> heller: fine by me
<simbergm> jbjnr: good for you?
<heller> What time do you prefer?
<simbergm> heller: any time is fine
hkaiser has quit [Quit: bye]
hkaiser has joined #ste||ar
<nikunj97> hkaiser: yt?
<hkaiser> here
<nikunj97> I was running Jackson's code
<nikunj97> and something is fishy
<hkaiser> k
<hkaiser> why am I not surprised?
<nikunj97> xD
<nikunj97> so the thing is, with 128 tiles, 16000 doubles/tile and 8192 iterations with 128 steps/iteration they report times of around 5s
<nikunj97> they -> GaTech
<hkaiser> k
<nikunj97> and in their description they say it's Jackson's idea
<nikunj97> the same code that we run won't finish anywhere close to 5s
<nikunj97> is it coz many shared futures are bottlenecking the performance?
<hkaiser> well, let's see
<hkaiser> how many futures do we create for this?
<nikunj97> let me check
<nikunj97> 128 shared futures per iteration
<nikunj97> and 8192 iteration in total
<hkaiser> and they do that without any futures?
<nikunj97> well, if they use the same code then they do it using promises and futures
<hkaiser> that's ~1Mio futures for us, i.e. about 1-2s overhead from them
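The estimate above can be checked with a few lines of arithmetic. This is only a sketch: the counts (128 shared futures per iteration, 8192 iterations) come from the discussion, while the per-future cost of 1000 to 2000 ns is an assumption back-derived from the quoted "1-2s" figure, not a measurement. The names `total_futures` and `overhead_seconds` are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Figures quoted in the discussion above.
constexpr std::uint64_t futures_per_iteration = 128;
constexpr std::uint64_t iterations = 8192;

// Total number of shared futures created over the whole run.
constexpr std::uint64_t total_futures = futures_per_iteration * iterations;
static_assert(total_futures == 1048576, "128 * 8192 = 2^20, roughly one million");

// ASSUMPTION: a fixed cost per future, given in nanoseconds. A value of
// 1000-2000 ns is what "1-2s for ~1Mio futures" implies; it is not measured.
constexpr double overhead_seconds(std::uint64_t ns_per_future) {
    return static_cast<double>(total_futures) *
           static_cast<double>(ns_per_future) * 1e-9;
}
```

With 1000 ns per future this gives about 1.05 s, and with 2000 ns about 2.1 s, which matches the range mentioned in the chat.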
<hkaiser> what do you mean by 128 steps/iteration?
<diehlpk_work> hkaiser, I asked for the compiler matches, because this is an issue for the Fedora packages.
<nikunj97> so they copy the left and right tiles
<nikunj97> so that they can do more time steps per iteration
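A minimal sketch of the tile-copy idea just described, in plain C++ without HPX. It assumes an explicit three-point heat stencil (the actual kernel in Jackson's code may differ), and `step` and `advance_tile` are hypothetical names. The point it shows: with copies of both neighbours, the middle tile can be advanced several time steps without further communication, as long as the number of steps does not exceed the neighbour tile size.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One explicit update: u[i] += k * (u[i-1] - 2*u[i] + u[i+1]).
// Only interior cells are updated; the two end cells keep their old value.
inline std::vector<double> step(const std::vector<double>& u, double k) {
    std::vector<double> next(u);
    for (std::size_t i = 1; i + 1 < u.size(); ++i)
        next[i] = u[i] + k * (u[i - 1] - 2.0 * u[i] + u[i + 1]);
    return next;
}

// Advance the middle tile by `steps` time steps using copies of its neighbours.
// Stale values creep inward by one cell per step from each end of the combined
// buffer, so the middle tile stays exact while steps <= neighbour size.
inline std::vector<double> advance_tile(const std::vector<double>& left,
                                        const std::vector<double>& mid,
                                        const std::vector<double>& right,
                                        std::size_t steps, double k) {
    assert(steps <= left.size() && steps <= right.size());

    std::vector<double> buf;
    buf.reserve(left.size() + mid.size() + right.size());
    buf.insert(buf.end(), left.begin(), left.end());
    buf.insert(buf.end(), mid.begin(), mid.end());
    buf.insert(buf.end(), right.begin(), right.end());

    for (std::size_t s = 0; s < steps; ++s)
        buf = step(buf, k);

    // Return only the middle tile; the neighbour copies are discarded.
    const auto first = buf.begin() + static_cast<std::ptrdiff_t>(left.size());
    return std::vector<double>(first, first + static_cast<std::ptrdiff_t>(mid.size()));
}
```

Note that this sketch allocates a fresh buffer per call and per step, which keeps it short but is exactly the kind of allocation pattern that becomes expensive at scale.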
<hkaiser> and each step requires a future?
Vir has joined #ste||ar
<hkaiser> diehlpk_work: nod, thought so - can you define the flag?
<hkaiser> or would that be in the user's responsibility?
<diehlpk_work> There are two sides to the coin
<nikunj97> hkaiser: Don't think so
<hkaiser> nikunj97: do we have a future per timestep or a future per iteration?
<diehlpk_work> First, if the user will use the fedora package and compile his own code, it is his responsibility
<nikunj97> future per iteration
<nikunj97> not time step
<hkaiser> nikunj97: ok
<diehlpk_work> Second, if one uses our fedora package on their build system, I do not know
<hkaiser> how long does one iteration take?
<nikunj97> I didn't check
<nikunj97> but 30 min in with the parameters and it was still running
<nikunj97> it should not take that long
<hkaiser> diehlpk_work: we can make that check optional to begin with, or limit it to the major version as you suggested
<hkaiser> nikunj97: so it just hangs?
<hkaiser> does it make progress at all?
<nikunj97> that's what I think
<nikunj97> It's surely making progress, but it's taking too long
<hkaiser> ok
<diehlpk_work> What about checking the major version: if the major version matches, we allow compiling, but warn that the minor version does not match and recommend making them match
<diehlpk_work> if the major version does not match, we throw an error
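The proposed policy can be written down as a small decision function. This is a sketch of the rule as stated in the chat, not HPX's actual compiler check; `compiler_version`, `check_result`, and `check_compiler` are hypothetical names.

```cpp
#include <cassert>

// Hypothetical types; HPX's real check is not shown in this log.
struct compiler_version {
    int major_version;
    int minor_version;
};

enum class check_result { ok, warn_minor_mismatch, error_major_mismatch };

// Compare the compiler HPX was built with against the one compiling user code.
inline check_result check_compiler(compiler_version built_with,
                                   compiler_version used_now) {
    // A different major version is treated as a hard error.
    if (built_with.major_version != used_now.major_version)
        return check_result::error_major_mismatch;
    // Same major but different minor: allow the build, but warn.
    if (built_with.minor_version != used_now.minor_version)
        return check_result::warn_minor_mismatch;
    return check_result::ok;
}
```

Returning an enum rather than printing directly leaves it to the caller to decide how the warning or error is surfaced, for example at configure time or through a preprocessor diagnostic.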
<nikunj97> so doing 4096 as subdomain width, 1024 time steps, and 3 subdomains itself is taking 28s to run
<hkaiser> ok, do you care enough to have a look into this?
<hkaiser> nikunj97: in release?
<nikunj97> yes
<hkaiser> ;-)
<hkaiser> diehlpk_work: ^^
<nikunj97> hkaiser: everything is explicitly release now xD
<hkaiser> ok
<diehlpk_work> hkaiser, Yes, I will have a look
<nikunj97> hkaiser: could you please take a look at the code? https://github.com/STEllAR-GROUP/hpxr/blob/master/benchmarks/replay/dataflow_replay.cpp
<hkaiser> nikunj97: let's have a look at some perf-counters and/or vtune
<nikunj97> I don't understand stencils well, so I must be missing something
<diehlpk_work> I will have a look later this week and if it can be done in one hour I will do it
<nikunj97> let me analyse it in vtune
<diehlpk_work> If not I will add the flag to fedora
<hkaiser> nikunj97: I think we overwhelm the system with all those futures, what is the memory footprint of the application?
<hkaiser> diehlpk_work: I can also have a look
<nikunj97> hkaiser: I didn't check that
<nikunj97> I didn't analyse the application as of now, just reporting fishy behavior
<hkaiser> nikunj97: we create the whole tree in one go
<nikunj97> yes, that we do
<hkaiser> inserting a sliding semaphore might help as it would limit the depth of the tree dynamically
<nikunj97> you want me to do a par_for?
<diehlpk_work> hkaiser, sure, I will not have time to do it before Thursday, I like to finish the course project and put it to the web page first.
<hkaiser> sure
<hkaiser> nikunj97: as one of the things, yes - but not first priority
<diehlpk_work> let me give it a try this Friday and I will assume I will need your help anyway
<hkaiser> nikunj97: the sliding_semaphore would be more important: https://github.com/STEllAR-GROUP/hpx/blob/master/examples/1d_stencil/1d_stencil_8.cpp#L556
<hkaiser> and here: sem
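To illustrate how a sliding semaphore limits how far task creation can run ahead of completed work, here is a sketch built only on the C++ standard library. It is not HPX's implementation, and the exact blocking condition is an assumption: `wait(upper)` blocks while `upper` is more than `max_difference` ahead of the last signaled value. The class name `sliding_semaphore_sketch` and the `try_wait` helper are made up for illustration.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <mutex>

// Standard-library sketch of a sliding semaphore; NOT HPX's implementation.
class sliding_semaphore_sketch {
public:
    explicit sliding_semaphore_sketch(std::int64_t max_difference,
                                      std::int64_t lower = 0)
      : max_difference_(max_difference), lower_(lower) {}

    // Block until `upper` is at most max_difference ahead of the last
    // signaled value.
    void wait(std::int64_t upper) {
        std::unique_lock<std::mutex> lock(mtx_);
        cond_.wait(lock, [&] { return upper - max_difference_ <= lower_; });
    }

    // Non-blocking check: true if wait(upper) would return immediately.
    bool try_wait(std::int64_t upper) {
        std::lock_guard<std::mutex> lock(mtx_);
        return upper - max_difference_ <= lower_;
    }

    // Report that work up to `lower` has completed and wake any waiters.
    void signal(std::int64_t lower) {
        {
            std::lock_guard<std::mutex> lock(mtx_);
            if (lower > lower_) lower_ = lower;
        }
        cond_.notify_all();
    }

private:
    std::mutex mtx_;
    std::condition_variable cond_;
    std::int64_t max_difference_;
    std::int64_t lower_;
};
```

In a time-stepping loop, the producer would call `wait(t)` before creating the tasks for step `t`, and a continuation attached to a step's result would call `signal(t)`. The dependency tree is then never more than `max_difference` steps deep instead of being created in one go.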
<nikunj97> adding sliding semaphore should help, but I'm still not sure if it'll finish everything in ~5-10s
<hkaiser> nikunj97: please look at some perf-counters: idle-rate (enabled at build-time), average thread duration, number of created threads
<hkaiser> nikunj97: one step at a time
<nikunj97> ok let me see what I can do :)
<hkaiser> nikunj97: what error rates do you use?
<nikunj97> it was without injecting errors
<nikunj97> btw should I just use the 1d stencil code instead?
<nikunj97> the one in hpx examples
<hkaiser> nikunj97: worth a try, however the local stencil1d does not use the sliding semaphore, I think
<nikunj97> yeah that's true, they're copying left and right tile
<nikunj97> yes but 1d_stencil_4 does have limit for depth
<nikunj97> I ran it, took 3.2s to run
<hkaiser> does it?
<hkaiser> nod
<nikunj97> ./1d_stencil_4 --nx=16000 --nt=8192 --nd=10 --np=128 --k=0.5
<nikunj97> took 3.28161533
<hkaiser> I'm not sure the code has been looked at from the standpoint of perf at all
<hkaiser> ahh, it uses sliding semaphore after all
<nikunj97> the code above was Jackson's, which adrian fixed up, and I simply added a function to inject errors
<hkaiser> nikunj97: sure - but now it calculates a checksum and throws an exception without looking at it
<nikunj97> yes coz I made it like that
<hkaiser> and it allocates a buffer for each timestep and partition, etc.
<nikunj97> checksums will always give right results
<nikunj97> so you'll have to inject errors artificially
<hkaiser> nikunj97: if you can retrofit Jackson's kernel into stencil1d_4, sure - have a try
<hkaiser> you could have overwritten some of the calculated values and let the checksum tell you whether to fail
<hkaiser> (but this is irrelevant for perf)
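The alternative suggested above, overwriting a computed value and letting the checksum decide whether to fail, could look roughly like this. It is a sketch with hypothetical helper names (`checksum`, `inject_error`, `validate`) and a plain sum as the checksum; the benchmark's real checksum and injection logic are not shown in this log.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <stdexcept>
#include <vector>

// Plain sum used as a stand-in checksum (an assumption for this sketch).
inline double checksum(const std::vector<double>& v) {
    return std::accumulate(v.begin(), v.end(), 0.0);
}

// Hypothetical helper: corrupt one value so the checksum no longer matches.
// Assumes values of moderate magnitude, so that adding 1.0 changes the sum.
inline void inject_error(std::vector<double>& v, std::size_t index) {
    v.at(index) += 1.0;
}

// Throw only if the recomputed checksum disagrees with the expected one,
// instead of throwing unconditionally.
inline void validate(const std::vector<double>& v, double expected) {
    if (checksum(v) != expected)
        throw std::runtime_error("checksum mismatch");
}
```

A replay wrapper would then catch the exception and rerun the task, so the failure path is driven by the data rather than by a separate random decision.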
<nikunj97> that's what I was going to do, but then adrian said it's fine either way since it's benchmarking and we need to inject errors
<hkaiser> nikunj97: sure, changing stencil1d_4 to simply use dataflow_replay with error injection (and the existing kernel) would give us some numbers as well
<nikunj97> yes
<nikunj97> and will save me the hassle to analyse and optimize Jackson's code
<nikunj97> it's much easier for me to add error injections into stencil1d_4
<nikunj97> hkaiser: just to let you know, the actual example lw_1d_replay throws error for the given parameters
<nikunj97> I tried running it rn
<nikunj97> hkaiser: just took a deep look into it. The init function itself does some 128 allocations of a 16001-sized vector. Furthermore it is allocating a vector of 48000 elements ~25000 times. No wonder it's taking much longer than usual
<hkaiser> right
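To put the allocation counts above into bytes, a quick calculation helps. This sketch assumes 8-byte doubles as the element type for both vectors (the log only says "doubles/tile" for the tiles) and treats "~25000" as exactly 25000; the constant names are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// ASSUMPTION: elements are 8-byte doubles.
constexpr std::uint64_t bytes_per_element = 8;

// init: 128 allocations of a 16001-element vector.
constexpr std::uint64_t init_bytes = 128 * 16001 * bytes_per_element;
static_assert(init_bytes == 16385024, "about 16 MB in total");

// run: roughly 25000 allocations of a 48000-element vector
// (48000 = 3 * 16000, i.e. left + middle + right tile).
constexpr std::uint64_t run_allocations = 25000;
constexpr std::uint64_t run_bytes = run_allocations * 48000 * bytes_per_element;
static_assert(run_bytes == 9600000000ULL, "about 9.6 GB allocated over the run");
```

The init allocations are negligible at about 16 MB, but the per-task buffers add up to roughly 9.6 GB of allocated (and presumably copied) memory over the run, which is consistent with the slowdown being reported.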
<nikunj97> hkaiser: should I ask Keita if we can use the existing 1d_stencil that we have?
<nikunj97> instead of trying to reduce the allocations and optimizing things here and there
Yorlik has joined #ste||ar
<hkaiser> nikunj97: nah, just do it