aserio changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/
kisaacs has quit [Ping timeout: 256 seconds]
kisaacs has joined #ste||ar
kisaacs has quit [Ping timeout: 260 seconds]
kisaacs has joined #ste||ar
kisaacs has quit [Ping timeout: 240 seconds]
daissgr has quit [Ping timeout: 265 seconds]
eschnett has joined #ste||ar
eschnett has quit [Quit: eschnett]
eschnett has joined #ste||ar
hkaiser has quit [Quit: bye]
kisaacs has joined #ste||ar
daissgr has joined #ste||ar
nanashi55 has quit [Ping timeout: 246 seconds]
nanashi55 has joined #ste||ar
jaafar_ has quit [Remote host closed the connection]
kisaacs has quit [Ping timeout: 240 seconds]
mcopik has joined #ste||ar
mcopik has quit [Ping timeout: 256 seconds]
kisaacs has joined #ste||ar
kisaacs has quit [Ping timeout: 256 seconds]
daissgr has quit [Ping timeout: 240 seconds]
<heller_> jbjnr: you are running with HPX_NETWORKING=Off, right?
david_pfander has joined #ste||ar
<jbjnr> heller_: yes. when testing cholesky
<heller_> ok
<jbjnr> heller_: How's it going?
<heller_> still trying to get some data
<jbjnr> I have not run any papi tests yet. No time just now. Leaving for USA tomorrow and need to prepare everything for that
<jbjnr> sorry.
<jbjnr> However, I do not think the L1 cache data for cholesky will help you at all. Just way too much going on.
<jbjnr> (or rather, it won't help with scheduling decisions in the stuff you're looking at)
<heller_> jbjnr: the thing I am looking into right now is avoidable L1 misses due to atomic instructions
<heller_> what I am seeing right now, is massive thrashing
<heller_> for my task spawning benchmark
<jbjnr> ok. The reason I don't think cholesky will help is because this signal will be totally swamped by the movement of data for the matrices
<jbjnr> tiny simple tasks like you're using will be more useful
<heller_> right
<heller_> ok, i'll play around more
<heller_> I'll do the 'allocation' improvement for the future today
<heller_> jbjnr: so what I am seeing is lots of false sharing going on
<heller_> jbjnr: the memory allocation for the completion handlers might be more relevant for the cholesky usecase though
<jbjnr> ok
<jbjnr> I tried rebasing your branch onto master, but gave up
<jbjnr> so I'm not using your fixes at the moment
<jbjnr> (except the thread_id cleanup on master)
<heller_> ok
<heller_> yeah, my fixes are a little messed up
<heller_> so i'll try to get the future completion handler stuff based off of master
<heller_> this will hopefully get you some boost
<jbjnr> thanks
<heller_> jbjnr: the false sharing might not hit you at all due to the relative large granularity
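As an aside, a minimal sketch of the kind of false sharing heller_ is describing (the struct and counter names are made up for illustration): two atomics touched by different worker threads that share a cache line invalidate each other on every update, and padding each one to its own line (64 bytes is a typical line size) is the usual fix.

    #include <atomic>
    #include <cstdint>

    // If these two counters are updated by different worker threads and land on
    // the same cache line, every increment forces the other core to reload the
    // line (false sharing / cache-line ping-pong).
    struct counters_shared
    {
        std::atomic<std::uint64_t> stolen{0};
        std::atomic<std::uint64_t> executed{0};
    };

    // Aligning each counter to its own cache line removes the interference at
    // the cost of some wasted space.
    struct counters_padded
    {
        alignas(64) std::atomic<std::uint64_t> stolen{0};
        alignas(64) std::atomic<std::uint64_t> executed{0};
    };

    int main() { counters_shared a; counters_padded b; (void)a; (void)b; }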
<jbjnr> I'd like to retest my numa aware scheduler - with your fixes, some of these effects are going to compound each other - and until I try everything....
<heller_> sure
<heller_> jbjnr: right now I think the biggest issue with the cholesky code is the memory allocation
<jbjnr> (I'll test as soon as it's ready).
<jbjnr> you're using small_vector?
<heller_> I will, yes
<heller_> jbjnr: how much stuff do you capture in the continuations?
<jbjnr> real code starts after line 108 etc
<jbjnr> main loop line 165
<heller_> jbjnr: taskname is a std::string?
<jbjnr> yup, I can remove it if you are worried by the alloc
<heller_> I am mostly worried because it increases the size of the lambda ;)
<jbjnr> the tasks use the indexing in the name so we can see the iterations/substeps in the task view in Vampir. Next benchmark, I'll completely remove all profiling
<jbjnr> I'll create a special task_name type that becomes void when profiling disabled (or something like that)
<heller_> yeah
<heller_> std::string is something like 32 bytes
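A quick sketch of heller_'s point about the capture size (the numbers are for a typical 64-bit libstdc++ build and will vary elsewhere; the task name is made up):

    #include <cstdio>
    #include <string>

    int main()
    {
        int i = 0, j = 0, k = 0;
        std::string taskname = "potrf_0_0_0";

        auto with_name    = [i, j, k, taskname] { return i + j + k + int(taskname.size()); };
        auto without_name = [i, j, k]           { return i + j + k; };

        // Prints something like 48 and 12: the 32-byte std::string alone pushes
        // the callable well past any small-buffer threshold.
        std::printf("%zu %zu\n", sizeof(with_name), sizeof(without_name));
    }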
<jbjnr> k
<jbjnr> I'm keen to see what happens when tasks don't go through N staging/pending queues etc and don't have to wait for stack allocations in batches.
<jbjnr> my suspicion is that that's where the problem lies
<heller_> I have more
<heller_> panel_sf
<heller_> that's currently a std::map
<heller_> but I see that it will hold n - k - nb entries, contiguously
<jbjnr> I'll try unordered map
<heller_> you should even be able to do a std::vector there
<jbjnr> yes. I'll check
<heller_> not sure about block_ft
<heller_> but should be similar
<jbjnr> I actually hadn't paid any attention to those maps. I will have a play and make sure that they don't need to be maps
<jbjnr> should gain a few cycles there
<jbjnr> thanks
<heller_> yup
<heller_> should actually give you lots
<heller_> jbjnr: so, as a first measure, you should try to get rid of the maps (go for unordered_map if it is really sparse), the next step should then be to get rid of the taskname for benchmarking, at least
<heller_> that is, turn it into a char or so ;)
<jbjnr> already working on it
<heller_> what percentage of performance are we missing right now?
<jbjnr> can't use char cos the name gets overwritten on each iteration/substep
<heller_> yeah
<jbjnr> so i,j,k indices change and name needs to be captured by value
<jbjnr> (could use char*, but would make wrapper code nasty)
<heller_> you could only generate the name in the lambdas
<jbjnr> will disable it completely in benchmarking mode
<heller_> right
hkaiser has joined #ste||ar
<jbjnr> heller_: when profiling is disabled, we now use only inline task_name_type createName(std::string s) { return nullptr; }
<jbjnr> and task_name_type is nullptr_t
<jbjnr> so everything compiles, but we don't allocate any strings
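In other words, something along these lines (a sketch; the preprocessor switch is hypothetical, only createName, task_name_type and the nullptr_t fallback come from the chat):

    #include <cstddef>
    #include <string>

    #if defined(TASK_PROFILING)                       // hypothetical switch
    using task_name_type = std::string;
    inline task_name_type createName(std::string s) { return s; }
    #else
    // Profiling disabled: the "name" collapses to std::nullptr_t, so every call
    // site still compiles but no std::string is allocated or captured by value.
    using task_name_type = std::nullptr_t;
    inline task_name_type createName(std::string) { return nullptr; }
    #endif

    int main()
    {
        task_name_type name = createName("potrf_0_0_0");
        auto task = [name] { (void)name; };
        task();
    }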
<heller_> ok
<heller_> it's not about the string allocation, it's about the string increasing the size of the capture ;)
<jbjnr> now it's just a nullptr_t
<jbjnr> so only 1 pointer size
<heller_> right
<heller_> so now, the continuation should hit the SBO of our function implementation
<jbjnr> what's the difference anyway? increasing the lambda size is just another allocation anyway?
<heller_> yes
<heller_> but now, we got rid of 3 allocations
<heller_> 1) the string, 2) the copy of the string inside the lambda, 3) the function
<jbjnr> 3) ?
<jbjnr> it will still be created/allocated (does it have SBO for sizes < N?)
<heller_> if the callable is smaller than 24 bytes, there will be no allocation
<jbjnr> thanks
<jbjnr> I'll add some printf(sizeof()) to see if anything can be reduced.
<heller_> looks like it is fine in general without the std::string
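For the printf(sizeof()) check, something like this would do; the 24-byte limit is the figure heller_ quotes for HPX's function wrapper and the exact layout is implementation-dependent:

    #include <cstddef>
    #include <cstdio>

    int main()
    {
        int i = 0, j = 0, k = 0;
        std::nullptr_t taskname = nullptr;      // profiling disabled

        auto continuation = [i, j, k, taskname] { (void)taskname; return i + j + k; };

        // heller_'s figure: callables of up to ~24 bytes fit the small-buffer
        // storage of the function wrapper and need no extra heap allocation.
        std::printf("continuation is %zu bytes, %s\n", sizeof(continuation),
            sizeof(continuation) <= 24 ? "fits the small buffer"
                                       : "needs a heap allocation");
    }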
<jbjnr> when disabled, the taskname just becomes a nullptr_t, but I'm getting a compiler warning that taskname is set but not used
<jbjnr> what would be a nice way of making those warnings disappear?
<heller_> (void)taskname
<jbjnr> but then when enabled it won't be a string type
<heller_> inside the lambda, just add a line (void)taskname;
<heller_> this will make it appear as if it was used
<jbjnr> sorry, yes, misunderstood
<heller_> jbjnr: you might also want to test out this patch: https://gist.github.com/sithhell/bf32d7239f5e1c0344bc154181df560d
<jbjnr> it'd be nice if it could not be passed in at all. I wonder if the compiler optimizes it away
<heller_> jbjnr: yeah, not passing it in at all would be best
<hkaiser> heller_: has to be 'auto const&'
<heller_> jbjnr: IIUC, you could generate it directly in the lambda
<heller_> hkaiser: yeah ... and might not have an effect anyway...
<heller_> since get_shared_state returns by value
<jbjnr> I can do that, but then I must pass in extra indices
<heller_> right
<jbjnr> didn't want to pollute the lambdas just for profiling if I could avoid it
<heller_> ok
<jbjnr> it'll be ok with nullptr_t for now
<heller_> well, check if it helped first
<jbjnr> yup
<heller_> hkaiser: ahh, it returns by const&, excellent :D
<jbjnr> in c++17 will we have a void_t that can be passed around?
<jbjnr> I seem to recall a proposal for void types
<hkaiser> jbjnr: create your own: struct void_t {};
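A sketch of that suggestion; unlike void, an empty tag type can be stored, passed and returned like any value, and it is smaller than std::nullptr_t (all names below are illustrative):

    #include <cstdio>

    struct void_t {};                                 // user-defined empty "void"

    void_t createName(char const*) { return {}; }     // hypothetical profiling-off stub

    int main()
    {
        void_t name = createName("potrf_0_0_0");
        auto task = [name] { (void)name; };

        // An empty class has size 1 (vs. typically 8 for std::nullptr_t), so the
        // capture stays as small as it can get.
        std::printf("%zu %zu\n", sizeof(void_t), sizeof(task));
        task();
    }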
<jbjnr> hmm
<heller_> see if it helped first, then get rid of std::map ;)
<heller_> hkaiser: auto& state is fine: https://wandbox.org/permlink/YkDf41UjspufJiQ7
<heller_> auto const& is certainly clearer and avoids problems once the code changes
<hkaiser> heller_: sure, if the functions return a ref themselves
<heller_> which they luckily do
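For reference, a toy version of the point just made (shared_state and future_data are stand-ins, not HPX's real types): because the getter returns by const&, both auto& and auto const& bind without copying the shared_ptr, whereas taking it by value bumps the reference count.

    #include <cassert>
    #include <memory>

    struct shared_state {};                           // stand-in for the future's state

    struct future_data
    {
        std::shared_ptr<shared_state> state_ = std::make_shared<shared_state>();

        // Returning by const& lets callers avoid the shared_ptr copy and its
        // atomic reference-count traffic.
        std::shared_ptr<shared_state> const& get_shared_state() const { return state_; }
    };

    int main()
    {
        future_data fd;

        auto const& s1 = fd.get_shared_state();       // no copy
        auto&       s2 = fd.get_shared_state();       // also fine: auto deduces const
        assert(s1.use_count() == 1 && s2.use_count() == 1);

        auto s3 = fd.get_shared_state();              // by value: copies, count goes up
        assert(s3.use_count() == 2);
    }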
<heller_> anyway...
<jbjnr> heller_: jobs submitted with all string allocs removed from profiling
<heller_> jbjnr: about those maps...
<heller_> don't you know the offsets and size etc. beforehand? such that you can store those blocks in a contiguous array?
<heller_> or vector
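To make that concrete, a hedged guess at what heller_ means (panel_sf's real key layout isn't visible here; block, the indices and the storage type below are assumptions): if one panel's keys are the contiguous range [k + nb, n), a vector plus an offset gives the same lookup with a single contiguous allocation instead of one tree node per entry.

    #include <future>
    #include <vector>

    struct block {};                                  // stand-in for a matrix tile

    // Replaces std::map<int, std::shared_future<block>> for one panel, assuming
    // the keys are exactly the rows k + nb .. n - 1 (n - k - nb entries).
    struct panel_storage
    {
        int first_row;                                // == k + nb
        std::vector<std::shared_future<block>> sf;

        std::shared_future<block>& operator[](int row) { return sf[row - first_row]; }
    };

    int main()
    {
        int n = 16, k = 4, nb = 2;
        panel_storage panel{k + nb,
            std::vector<std::shared_future<block>>(n - k - nb)};

        panel[k + nb] = std::async(std::launch::deferred, [] { return block{}; }).share();
        panel[k + nb].get();
    }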
<simbergm> hkaiser: the lock guard is to make sure that the notify does not happen before reaching the cond.wait (unlikely but possible)
<simbergm> at least that's what I intended to do (I hope I didn't mess it up)
<simbergm> it also avoids holding the lock when notify is called, so that when cond.wait() returns it can take the lock again immediately
<hkaiser> simbergm: you acquire the lock just to immediately release it
<simbergm> hkaiser: yes
<hkaiser> doesn't make sense to me, frankly
<simbergm> do you know of another way of ensuring notify does not get called before wait?
<hkaiser> I don't think your code ensures that
<simbergm> hmm, possible
<hkaiser> you'll need a separate bool for that, like in the code you refer to above
<simbergm> what do you see as the failure case for my version? condition_variable::wait unlocks the lock once it's waiting
<hkaiser> locking/unlocking a mutex does not buy you anything, except for unneeded overhead
<simbergm> I'm happy to change it back to be on the safe side, but it is slightly slower
<simbergm> but the mutex can't be locked until wait has been called
<hkaiser> nod, I see
<hkaiser> I now understand what you're doing
<hkaiser> adding a comment might be helpful - this is an unusual way of achieving what you want
<simbergm> hkaiser: yeah, I can add a comment
<simbergm> I was trying to find a standard way of doing this, and ended up with the one in runtime_impl, and then realized one doesn't need the extra variable
<hkaiser> nod
<simbergm> couldn't find anything useful elsewhere
<hkaiser> not in your case as everything is in the same scope
<simbergm> yeah
<simbergm> hkaiser: thanks for checking it though
<hkaiser> thanks for explaining what you had in mind
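For the record, a compact sketch of the pattern just discussed; it depends on the waiting thread already holding the mutex before the notifier can possibly run, otherwise the textbook version with an extra bool/predicate is still the safe choice.

    #include <condition_variable>
    #include <mutex>
    #include <thread>

    std::mutex mtx;
    std::condition_variable cond;

    void notifier()
    {
        {
            // This lock can only be taken once the waiter has released mtx inside
            // cond.wait(), so the notification cannot arrive before the wait.
            std::lock_guard<std::mutex> lk(mtx);
        }
        // Notifying outside the lock lets the woken thread re-acquire mtx at once.
        cond.notify_one();
    }

    int main()
    {
        std::unique_lock<std::mutex> lk(mtx);         // held before the notifier
                                                      // thread can exist
        std::thread t(notifier);
        cond.wait(lk);                                // atomically releases mtx
        lk.unlock();
        t.join();
    }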
<simbergm> hkaiser: do you know if there's anything new about rostam? I guess the conclusion was the segfaults are not caused by hpx?
<hkaiser> simbergm: still investigating... we suspect one of the patches for Meltdown is bad
<hkaiser> disabling papi makes things work, though - we know that much
<simbergm> hkaiser: okay, good to hear that it's ongoing
<simbergm> the release candidate was scheduled for Wednesday, but I'm unsure if it's a good idea to do it as long as buildbot looks like it does
<simbergm> jbjnr: any news on the tutorial? still taking place?
<heller_> simbergm: nope
<simbergm> heller_: which question are you answering? ;)
<heller_> simbergm: the tutorial ;)
<simbergm> heller_: okay, good
<simbergm> then we can be *a bit* more relaxed about pushing the release, but I'd still like to see it done as soon as we have things in place
<heller_> yeah...
<heller_> jbjnr: is there any chance I could run your cholesky myself?
<github> [hpx] sithhell created completion_handlers (+2 new commits): https://git.io/vNyS2
<github> hpx/completion_handlers 9e199a2 Thomas Heller: Avoid taking the shared state by value
<github> hpx/completion_handlers bb3e3a6 Thomas Heller: Removing compose_cb...
<heller_> please do test
<hkaiser> heller_: looks reasonable to me
<hkaiser> heller_: what do your measurements say?
<heller_> hkaiser: I don't have a benchmark for this
<heller_> waiting on jbjnr to test it with the cholesky code ;)
<heller_> i'd love to do it myself
<hkaiser> heller_: jbjnr is not allowed to give it away (yet), at least that's my impression
<heller_> yeah
Smasher has quit [Ping timeout: 240 seconds]
<github> [hpx] sithhell force-pushed completion_handlers from bb3e3a6 to 56f195a: https://git.io/vNy5F
<github> hpx/completion_handlers 56f195a Thomas Heller: Removing compose_cb...
kisaacs has joined #ste||ar
<hkaiser> heller_: I know it's an old problem, but this will cause the reference to become invalid: https://github.com/STEllAR-GROUP/hpx/blob/56f195ab05920606a4aad3033dbe3edf49d4b2e3/src/lcos/detail/future_data.cpp#L167
<hkaiser> would you mind trying to fix it while you're at it?
<hkaiser> heller_: nvm, I'm too dense today
kisaacs has quit [Ping timeout: 256 seconds]
<heller_> hkaiser: yeah ... no idea how to fix it
<heller_> should probably be taken on by another PR
<hkaiser> heller_: it's not a problem, I was wrong
<heller_> ok
<heller_> jbjnr: tumbleweed?
<jbjnr> sorry.
<jbjnr> removing all strings made no difference. Just about to test the stellar/completion_handlers branch
<jbjnr> completion_handler branch - no real difference. possibly a small improvement, but in the noise region at this stage.
<jbjnr> something is a trifle fishy though: all the tests I've run today are a little bit slower than in November, so maybe I screwed up some setting somewhere.
<jbjnr> or perhaps HPX has regressed a bit overall
<jbjnr> Nov - 512 block size = 980GFlops
<jbjnr> Now 512 Block size = 945GFlops
<jbjnr> 5% ish
<K-ballo> we used to have a performance benchmark being run periodically, updating some nice plot, is that still on?
<jbjnr> I only have my ctest generated times for the subset of tests that I added output to
kisaacs has joined #ste||ar
<jbjnr> SortByKeyTime is the graph to plot
hkaiser has quit [Quit: bye]
mbremer has quit [Quit: Page closed]
<simbergm> jbjnr: is pycicle not running now while you're doing your tests?
<jbjnr> oops. I closed the terminal it was running in
<jbjnr> hold on
eschnett has quit [Quit: eschnett]
<jbjnr> restarted. should start producing results. Sorry about that
Vir has quit [Ping timeout: 240 seconds]
<jbjnr> note to self : cron job and check if not already running ...
<simbergm> no problem, thanks for restarting
Vir has joined #ste||ar
<jbjnr> yay - got my 980GFlops back by enabling my fancy block-aligned allocator!
<jbjnr> awesome
<jbjnr> jesus. It made a whopping difference
heller_ has quit [Quit: http://quassel-irc.org - Chat comfortably. Anywhere.]
heller_ has joined #ste||ar
mbremer has joined #ste||ar
Vir has quit [Ping timeout: 240 seconds]
<heller_> jbjnr: so what's the status now?
Vir has joined #ste||ar
<heller_> jbjnr: where is the parsec code wrt performance?
<jbjnr> 512 block good. 256 still too slow. I'll need to plot some new graphs. Juggling too many fixes and branches at the moment. Need to step back a bit and reassess. Supposed to be working on DCA for the meeting in the USA - leaving tomorrow
<heller_> jbjnr: I could take it from here if you give me the code ;)
<jbjnr> daint:/scratch/snx3000/biddisco/src/hvtkm/linear_algebra
<jbjnr> total mess atm
<heller_> I don't care much ;)
<heller_> do I need a special HPX or is master fine?
<jbjnr> I'm using guided_pool_executor branch - just pushed it, ymmv
Vir has quit [Ping timeout: 265 seconds]
Vir has joined #ste||ar
Vir has quit [Changing host]
Vir has joined #ste||ar
eschnett has joined #ste||ar
<heller_> salloc: error: Project cannot use constraint=MC
Vir has quit [Ping timeout: 240 seconds]
<heller_> jbjnr: meaning, I can only test on those 12 core nodes
aserio has joined #ste||ar
<heller_> on daint, at least ...
Vir has joined #ste||ar
<heller_> jbjnr: is this code with or without CUDA?
<jbjnr> yee haw! 512 block size just went over 1TFlop for the first time ever!
<jbjnr> heller_: no cuda for now
<jbjnr> check_cholesky_d is the only binary you want to compile, the rest are broken
<jbjnr> CSVData, nodes, 1, threads, 36, matrixsize, 40960, blocksize, 512, rows, 1, cols, 1, time, 22.73719, GFlop/s, 1007.446, pool, 1, scheduler, 1, queues, 1, executor, 1, allocator, 2,
<heller_> nice
<heller_> jbjnr: what's the reference performance we want to achieve?
<jbjnr> 1TFlop, or thereabouts.
<heller_> for a CPU based implementation?
<jbjnr> 256 block size is giving about 900 today. Best I ever had was 920
<heller_> hmmm
<jbjnr> on daint mc nodes using 36 cores out of the 72
<jbjnr> gpu version is broken
<heller_> and parsec is doing 1k for the 256 block size as well?
<jbjnr> 1T yes
<jbjnr> (or very close)
<heller_> can't access the mc nodes :/
<heller_> did you change the map to an unordered_map already?
<jbjnr> yes
<heller_> try changing it to a vector
hkaiser has joined #ste||ar
<heller_> jbjnr: your USA trip, is this the one to the SOS conference?
<jbjnr> no. SOS is at the end of March (we have some time still). This is DCA++ - quantum Monte Carlo using HPX for scheduling CPU/GPU
<jbjnr> hopefully the first project that will use the full Summit machine for HPX etc ....
Vir has quit [Ping timeout: 252 seconds]
Vir has joined #ste||ar
<jbjnr> has anything been done with octotiger recently?
<jbjnr> hkaiser: ^
<jbjnr> or anyone else
<hkaiser> jbjnr: Gregor is working on improving the kernels
<jbjnr> and dominic?
<hkaiser> no idea
<jbjnr> heller_: you are my hero. looks like changing map to unordered_map is giving a measurable boost. Raffaele tells me that the map is very sparse, so a vector might be a problem. (don't want to default-construct a ton of unused futures)
<heller_> jbjnr: great!
rtohid has joined #ste||ar
<heller_> jbjnr: a proper CRS container might be more suitable then
<jbjnr> seems a bit odd to me. the map access should be tiny compared to all the other crap going on. I'll check with/without again to be sure
<jbjnr> mostly, enabling the fancy allocator is making a big difference :)
<hkaiser> unordered_map might do fewer allocations
<heller_> map is also a node-based container
<hkaiser> that's what I meant, yes
<heller_> And logN vs constant access
<hkaiser> well...
<heller_> It's not just the allocations, it's also the non-linear access
<heller_> And it's "just" the 10% performance difference we are after...
<heller_> So it has to be something in the whole orchestration code
<heller_> vtune should be able to tell us where to look next
<heller_> Or even gprof
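Summing up the container point above in code (key and value types are placeholders): the node-per-element layout and O(log N) pointer-chasing of std::map is what the unordered_map swap, ideally with a reserve(), sidesteps.

    #include <map>
    #include <unordered_map>

    struct tile_future {};                            // stand-in for hpx::shared_future<block>

    int main()
    {
        // std::map: a red-black tree, one heap node per entry, O(log N) lookups
        // that chase pointers across the heap.
        std::map<int, tile_future> tree;

        // std::unordered_map: O(1) average lookups; reserving up front avoids
        // rehashing, although each element is still a separately allocated node.
        std::unordered_map<int, tile_future> hash;
        hash.reserve(1024);

        for (int i = 0; i != 1024; ++i)
        {
            tree.emplace(i, tile_future{});
            hash.emplace(i, tile_future{});
        }
    }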
<aserio> heller_: hey
<aserio> see pm
<heller_> Hey aserio
<heller_> jbjnr: for a single node run, I'd assume that the map is fully populated?
daissgr has joined #ste||ar
kisaacs has quit [Ping timeout: 256 seconds]
daissgr has quit [Ping timeout: 256 seconds]
Vir has quit [Read error: Connection reset by peer]
Vir has joined #ste||ar
david_pfander has quit [Ping timeout: 268 seconds]
daissgr has joined #ste||ar
Smasher has joined #ste||ar
EverYoung has joined #ste||ar
kisaacs has joined #ste||ar
Vir has quit [Ping timeout: 240 seconds]
Vir has joined #ste||ar
Vir has quit [Read error: Connection reset by peer]
daissgr has quit [Ping timeout: 256 seconds]
Guest72807 has joined #ste||ar
EverYoung has quit [Remote host closed the connection]
EverYoung has joined #ste||ar
kisaacs has quit [Ping timeout: 255 seconds]
Guest72807 has quit [Ping timeout: 240 seconds]
Vir- has joined #ste||ar
EverYoung has quit [Remote host closed the connection]
EverYoung has joined #ste||ar
EverYoung has quit [Remote host closed the connection]
Vir- has quit [Ping timeout: 240 seconds]
EverYoung has joined #ste||ar
EverYoung has quit [Remote host closed the connection]
EverYoung has joined #ste||ar
aserio has quit [Ping timeout: 252 seconds]
Vir- has joined #ste||ar
Vir- has quit [Ping timeout: 265 seconds]
EverYoung has quit [Ping timeout: 246 seconds]
EverYoung has joined #ste||ar
EverYoung has quit [Remote host closed the connection]
EverYoung has joined #ste||ar
aserio has joined #ste||ar
<K-ballo> hkaiser: openCppCoverage, does it work?
kisaacs has joined #ste||ar
Vir- has joined #ste||ar
zombieleet has joined #ste||ar
K-ballo1 has joined #ste||ar
K-ballo has quit [Ping timeout: 256 seconds]
K-ballo1 is now known as K-ballo
eschnett has quit [Quit: eschnett]
daissgr has joined #ste||ar
Vir- has quit [Ping timeout: 265 seconds]
Vir has joined #ste||ar
zombieleet has quit [Ping timeout: 248 seconds]
daissgr has quit [Ping timeout: 240 seconds]
zombieleet has joined #ste||ar
daissgr has joined #ste||ar
EverYoun_ has joined #ste||ar
EverYoung has quit [Ping timeout: 240 seconds]
zombieleet has quit [Ping timeout: 256 seconds]
EverYoun_ has quit [Remote host closed the connection]
EverYoung has joined #ste||ar
EverYoung has quit [Ping timeout: 246 seconds]
hkaiser has quit [Quit: bye]
EverYoung has joined #ste||ar
jaafar_ has joined #ste||ar
EverYoung has quit [Ping timeout: 246 seconds]
EverYoung has joined #ste||ar
EverYoung has quit [Remote host closed the connection]
EverYoung has joined #ste||ar
jaafar_ has quit [Ping timeout: 240 seconds]
jaafar_ has joined #ste||ar
kisaacs has quit [Ping timeout: 256 seconds]
kisaacs has joined #ste||ar
hkaiser has joined #ste||ar
akheir has joined #ste||ar
aserio has quit [Quit: aserio]
akheir has quit [Remote host closed the connection]
zao has quit [Quit: Up, up, and away!]
zao_ has joined #ste||ar
zao_ is now known as zao
rtohid has left #ste||ar [#ste||ar]