aserio changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/
kisaacs has quit [Ping timeout: 256 seconds]
kisaacs has joined #ste||ar
kisaacs has quit [Ping timeout: 260 seconds]
kisaacs has joined #ste||ar
kisaacs has quit [Ping timeout: 240 seconds]
daissgr has quit [Ping timeout: 265 seconds]
eschnett has joined #ste||ar
eschnett has quit [Quit: eschnett]
eschnett has joined #ste||ar
hkaiser has quit [Quit: bye]
kisaacs has joined #ste||ar
daissgr has joined #ste||ar
nanashi55 has quit [Ping timeout: 246 seconds]
nanashi55 has joined #ste||ar
jaafar_ has quit [Remote host closed the connection]
kisaacs has quit [Ping timeout: 240 seconds]
mcopik has joined #ste||ar
mcopik has quit [Ping timeout: 256 seconds]
kisaacs has joined #ste||ar
kisaacs has quit [Ping timeout: 256 seconds]
daissgr has quit [Ping timeout: 240 seconds]
<heller_>
jbjnr: you are running with HPX_NETWORKING=Off, right?
david_pfander has joined #ste||ar
<jbjnr>
heller_: yes. when testing cholesky
<heller_>
ok
<jbjnr>
heller_: How's it going?
<heller_>
still trying to get some data
<jbjnr>
I have not run any papi tests yet. No time just now. Leaving for USA tomorrow and need to prepare everything for that
<jbjnr>
sorry.
<jbjnr>
However, I do not think the L1 cache for cholesky will help you at all. Just way too much going on.
<jbjnr>
(or rather won't help with scheduling decisions in the stuff you're looking at)
<heller_>
jbjnr: the thing I am looking into right now is avoidable L1 misses due to atomic instructions
<heller_>
what I am seeing right now, is massive thrashing
<heller_>
for my task spawning benchmark
<jbjnr>
ok. The reason I don't think cholesky will help is because this signal will be totally swamped by the movement of data for the matrices
<jbjnr>
tiny simple tasks like you're using will be more useful
<heller_>
right
<heller_>
ok, i'll play around more
<heller_>
I'll do the 'allocation' improvement for the future today
<heller_>
jbjnr: so what I am seeing is lots of false sharing going on
<heller_>
jbjnr: the memory allocation for the completion handlers might be more relevant for the cholesky usecase though
<jbjnr>
ok
<jbjnr>
I tried rebasing your branch onto master, but gave up
<jbjnr>
so I'm not using your fixes at the moment
<jbjnr>
(except the thread_id cleanup on master)
<heller_>
ok
<heller_>
yeah, my fixes are a little messed up
<heller_>
so i'll try to get the future completion handler stuff based off of master
<heller_>
this will hopefully get you some boost
<jbjnr>
thanks
<heller_>
jbjnr: the false sharing might not hit you at all due to the relatively large granularity
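A minimal, self-contained sketch of the false-sharing effect under discussion (illustrative only, not HPX code; the 64-byte cache-line size is an assumption): two atomic counters that land on the same cache line make that line bounce between cores on every update, which shows up as exactly the kind of avoidable L1 misses mentioned above.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <thread>

// Both counters very likely share one 64-byte cache line.
struct packed_counters
{
    std::atomic<std::uint64_t> a{0};
    std::atomic<std::uint64_t> b{0};
};

// Each counter gets its own cache line (64 bytes assumed as the line size).
struct padded_counters
{
    alignas(64) std::atomic<std::uint64_t> a{0};
    alignas(64) std::atomic<std::uint64_t> b{0};
};

template <typename Counters>
void hammer(Counters& c, std::size_t n)
{
    // Two threads update logically independent counters; with packed_counters
    // the shared cache line still ping-pongs between the cores on every update.
    std::thread t1([&] {
        for (std::size_t i = 0; i != n; ++i)
            c.a.fetch_add(1, std::memory_order_relaxed);
    });
    std::thread t2([&] {
        for (std::size_t i = 0; i != n; ++i)
            c.b.fetch_add(1, std::memory_order_relaxed);
    });
    t1.join();
    t2.join();
}

int main()
{
    packed_counters packed;
    padded_counters padded;
    hammer(packed, 10'000'000);   // expect far more coherence/L1 misses here
    hammer(padded, 10'000'000);
    return 0;
}
```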
<jbjnr>
I'd like to retest my NUMA-aware scheduler - with your fixes, some of these effects are going to compound each other - and until I try everything....
<heller_>
sure
<heller_>
jbjnr: right now I think the biggest issue with the cholesky code is the memory allocation
<jbjnr>
(I'll test as soon as it's ready).
<jbjnr>
you're using small_vector?
<heller_>
I will, yes
<heller_>
jbjnr: how much stuff do you capture in the continuations?
<jbjnr>
yup, I can remove it if you are worried by the alloc
<heller_>
I am mostly worried because it increases the size of the lambda ;)
<jbjnr>
the tasks use the indexing in the name so we can see the iterations/substeps in the task view in Vampir. Next benchmark, I'll completely remove all profiling
<jbjnr>
I'll create a special task_name type that becomes void when profiling disabled (or something like that)
<heller_>
yeah
<heller_>
std::string has something like 32 bytes
<jbjnr>
k
<jbjnr>
I'm keen to see what happens when tasks don't go through N staging/pending queues etc and don't have to wait for stack allocations in batches.
<jbjnr>
my suspicion is that that's where the problem lies
<heller_>
I have more
<heller_>
panel_sf
<heller_>
that's currently a std::map
<heller_>
but I see that it will hold n - k - nb entries, contiguously
<jbjnr>
I'll try unordered map
<heller_>
you should even be able to do a std::vector there
<jbjnr>
yes. I'll check
<heller_>
not sure about block_ft
<heller_>
but should be similar
<jbjnr>
I actually hadn't paid any attention to those maps. I will have a play and make sure that they don't need to be maps
<jbjnr>
should gain a few cycles there
<jbjnr>
thanks
<heller_>
yup
<heller_>
should actually give you lots
<heller_>
jbjnr: so, as a first measure, you should try to get rid of the maps (go for unordered_map if it is really sparse); the next step should then be to get rid of the task name for benchmarking, at least
<heller_>
that is, turn it into a char or so ;)
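For illustration, a hedged sketch of the container change being suggested; the key and value types of panel_sf are not shown in the log, so a generic Future parameter stands in for whatever the Cholesky code actually stores.

```cpp
#include <cstddef>
#include <unordered_map>
#include <vector>

// The future type used by the real code is not in the log; any movable type
// works for the purpose of this sketch.
template <typename Future>
struct panel_store
{
    int offset = 0;                  // smallest key in use (e.g. k)
    std::vector<Future> entries;     // entries[key - offset]: flat storage, no per-entry allocation

    Future& operator[](int key)
    {
        return entries[static_cast<std::size_t>(key - offset)];
    }
};

// If the keys turn out to be sparse rather than contiguous, an unordered_map
// keeps amortized O(1) lookups while dropping std::map's per-node allocations
// and tree traversal:
//   std::unordered_map<int, Future> panel_sf;
```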
<jbjnr>
already working on it
<heller_>
what percentage of performance are we missing right now?
<jbjnr>
can't use char because the name gets overwritten on each iteration/substep
<heller_>
yeah
<jbjnr>
so the i,j,k indices change and the name needs to be captured by value
<jbjnr>
(could use char*, but would make wrapper code nasty)
<heller_>
you could generate the name only inside the lambdas
<jbjnr>
will disable it completely in benchmarking mode
<heller_>
right
hkaiser has joined #ste||ar
<jbjnr>
heller_: when profiling is disabled, we now use only inline task_name_type createName(std::string s) { return nullptr; }
<jbjnr>
and task_name_type is nullptr_t
<jbjnr>
so everything compiles, but we don't allocate any strings
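Read back as a compilable sketch, the scheme described here could look roughly like this (the build macro is hypothetical; only createName and task_name_type are taken from the chat):

```cpp
#include <cstddef>
#include <string>
#include <utility>

#if defined(TASK_PROFILING)                     // hypothetical build switch
using task_name_type = std::string;

inline task_name_type createName(std::string s) { return std::move(s); }
#else
// Profiling disabled: the name collapses to std::nullptr_t, so capturing it by
// value adds only one pointer-sized member and no string is ever stored.
using task_name_type = std::nullptr_t;

inline task_name_type createName(std::string) { return nullptr; }
#endif

// Usage stays identical in both builds (hypothetical call site):
//   auto name = createName("potrf_" + std::to_string(k));
//   auto task = [name] { /* ... */ };
```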
<heller_>
ok
<heller_>
it's not about the string allocation, it's about the string increasing the size of the capture ;)
<jbjnr>
now it's just a nullptr_t
<jbjnr>
so only 1 pointer size
<heller_>
right
<heller_>
so now, the continuation should hit the SBO of our function implementation
<jbjnr>
what's the difference anyway? increasing the lambda size just means another allocation anyway?
<heller_>
yes
<heller_>
but now, we got rid of 3 allocations
<heller_>
1) the string, 2) the copy of the string inside the lambda, 3) the function
<jbjnr>
3) ?
<jbjnr>
it will still be created/allocated (does it have SBO for sizes < N?)
<heller_>
if the callable is smaller than 24 bytes, there will be no allocation
<jbjnr>
thanks
<jbjnr>
I'll add some printf(sizeof()) to see if anything can be reduced.
<heller_>
looks like it is fine in general without the std::string
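A throwaway check along the lines of the printf(sizeof()) idea above (hypothetical captures; the 24-byte SBO threshold and the ~32-byte std::string size are the figures quoted in the chat and are implementation-dependent):

```cpp
#include <cstdio>
#include <string>

int main()
{
    int i = 0, j = 0, k = 0;
    std::string name = "potrf";

    // hypothetical stand-ins for the real continuations
    auto with_name    = [i, j, k, name] { /* ... */ };
    auto without_name = [i, j, k] { /* ... */ };

    std::printf("std::string        : %zu bytes\n", sizeof(std::string));   // ~32 on common library implementations
    std::printf("lambda with name   : %zu bytes\n", sizeof(with_name));     // likely above the 24-byte SBO -> heap allocation
    std::printf("lambda without it  : %zu bytes\n", sizeof(without_name));  // ~12 bytes -> fits the small buffer
    return 0;
}
```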
<simbergm>
at least that's what I intended to do (I hope I didn't mess it up)
<simbergm>
it also avoids holding the lock when notify is called, so that when cond.wait() returns it can take the lock again immediately
<hkaiser>
simbergm: you acquire the lock just to immediately release it
<simbergm>
hkaiser: yes
<hkaiser>
doesn't make sense to me, frankly
<simbergm>
do you know of another way of ensuring notify does not get called before wait?
<hkaiser>
I don't think your code ensures that
<simbergm>
hmm, possible
<hkaiser>
you'll need a separate bool for that, like in the code you refer to above
<simbergm>
what do you see as the failure case for my version? condition_variable::wait unlocks the lock once it's waiting
<hkaiser>
locking/unlocking a mutex does not buy you anything, except for unneeded overhead
<simbergm>
I'm happy to change it back to be on the safe side, but it is slightly slower
<simbergm>
but the mutex can't be locked until wait has been called
<hkaiser>
nod, I see
<hkaiser>
I now understand what you're doing
<hkaiser>
adding a comment might be helpful - this is an unusual way of achieving what you want
<simbergm>
hkaiser: yeah, I can add a comment
<simbergm>
I was trying to find a standard way of doing this, and ended up with the one in runtime_impl, and then realized one doesn't need the extra variable
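A minimal sketch of the pattern as described in this exchange (not the actual HPX code): the waiting side holds the mutex before the notifying side can reach it, so the notifier's acquire-and-immediately-release cannot complete until the waiter has entered wait() and atomically dropped the lock.

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>

int main()
{
    std::mutex mtx;
    std::condition_variable cond;

    // The waiting side takes the lock before the notifier thread exists.
    std::unique_lock<std::mutex> lk(mtx);

    std::thread notifier([&] {
        {
            // Cannot succeed until the main thread is inside wait() and has
            // released mtx, so notify can never fire before the wait begins.
            std::unique_lock<std::mutex> l(mtx);
        }   // released immediately; only the ordering mattered
        cond.notify_one();
    });

    cond.wait(lk);   // atomically releases mtx while blocked
    // (a production version would normally still guard against spurious
    // wakeups with a predicate or flag)

    notifier.join();
    return 0;
}
```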
<heller_>
should probably be taken on by another PR
<hkaiser>
heller_: it's not a problem, I was wrong
<heller_>
ok
<heller_>
jbjnr: tumbleweed?
<jbjnr>
sorry.
<jbjnr>
removing all strings made no difference. Just about to test the stellar/completion_handlers branch
<jbjnr>
completion_handler branch - no real difference. possibly a small improvement, but in the noise region at this stage.
<jbjnr>
something is a trifle fishy though - all the tests I've run today are a little bit slower than in November, so maybe I screwed up some setting somewhere.
<jbjnr>
or perhaps HPX has regressed a bit overall
<jbjnr>
Nov - 512 block size = 980 GFlops
<jbjnr>
Now - 512 block size = 945 GFlops
<jbjnr>
5% ish
<K-ballo>
we once had a performance benchmark run periodically, updating some nice plot - is that still on?
<jbjnr>
I only have my ctest-generated times for the subset of tests that I added output to
<heller_>
jbjnr: where is the parsec code wrt performance?
<jbjnr>
512 block good. 256 still too slow. I'll need to plot some new graphs. Juggling too many fixes and branches at the moment. Need to step back a bit and reassess. Supposed to be working on DCA for a meeting in the USA - leaving tomorrow
<heller_>
jbjnr: I could take it from here if you give me the code ;)
<heller_>
jbjnr: what's the reference performance we want to achieve?
<jbjnr>
1 TFlop
<jbjnr>
or thereabouts.
<heller_>
for a CPU-based implementation?
<jbjnr>
256 block size is giving about 900 today. Best I ever had was 920
<heller_>
hmmm
<jbjnr>
on Daint mc nodes, using 36 cores out of the 72
<jbjnr>
gpu version is broken
<heller_>
and parsec is doing 1k for the 256 block size as well?
<jbjnr>
1T yes
<jbjnr>
(or very close)
<heller_>
can't access the mc nodes :/
<heller_>
did you change the map to an unordered_map already?
<jbjnr>
yes
<heller_>
try changing it to a vector
hkaiser has joined #ste||ar
<heller_>
jbjnr: your USA trip, is this the one to the SOS conference?
<jbjnr>
no. SOS is at the end of March (we have some time still). This is DCA++ - quantum Monte Carlo using HPX for scheduling CPU/GPU
<jbjnr>
hopefully the first project that will use the full Summit machine for HPX etc. ...
Vir has quit [Ping timeout: 252 seconds]
Vir has joined #ste||ar
<jbjnr>
has anything been done with octotiger recently?
<jbjnr>
hkaiser: ^
<jbjnr>
or anyone else
<hkaiser>
jbjnr: Gregor is working on improving the kernels
<jbjnr>
and dominic?
<hkaiser>
no idea
<jbjnr>
heller_: you are my hero. looks like changing the map to an unordered_map is giving a measurable boost. Raffaele tells me that the map is very sparse, so a vector might be a problem (don't want to default-construct a ton of unused futures)
<heller_>
jbjnr: great!
rtohid has joined #ste||ar
<heller_>
jbjnr: a proper CRS container might be more suitable then
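A rough sketch of the "CRS-like" direction suggested here (names and the future type are assumptions, not the real code): store only the populated entries contiguously alongside a sorted key array, so nothing is default-constructed for unused slots while lookups stay cache-friendly.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

template <typename Future>
struct sparse_panel
{
    std::vector<int>    keys;     // sorted; only the keys that actually exist
    std::vector<Future> values;   // values[i] belongs to keys[i]

    void insert(int key, Future f)
    {
        auto it = std::lower_bound(keys.begin(), keys.end(), key);
        values.insert(values.begin() + (it - keys.begin()), std::move(f));
        keys.insert(it, key);
    }

    Future& at(int key)
    {
        auto it = std::lower_bound(keys.begin(), keys.end(), key);
        assert(it != keys.end() && *it == key);
        return values[static_cast<std::size_t>(it - keys.begin())];
    }
};
```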
<jbjnr>
seems a bit odd to me. the map access should be tiny compared to all the other crap going on. I'll check with/without again to be sure
<jbjnr>
mostly, enabling the fancy allocator is making a big difference :)
<hkaiser>
unordered_map might do fewer allocations
<heller_>
map is also a node-based container
<hkaiser>
that's what I meant, yes
<heller_>
And log N vs constant-time access
<hkaiser>
well...
<heller_>
It's not just the allocations, it's also the non-linear access
<heller_>
And it's "just" the 10% performance difference we are after...
<heller_>
So it has to be something in the whole orchestration code
<heller_>
vtune should be able to tell us where to look next
<heller_>
Or even gprof
<aserio>
heller_: hey
<aserio>
see pm
<heller_>
Hey aserio
<heller_>
jbjnr: for a single-node run, I'd assume that the map is fully populated?