aserio changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/
<K-ballo>
hkaiser: how does one use the tagged_tuple ?
<hkaiser>
then you can access the elements of the tagged_tuple using the tag names: tt = make_tagged_tuple<t1, t2>(v1, v2); tt.t1() -> first element, tt.t2() -> second element
<K-ballo>
there's no tag/type relation?
<hkaiser>
no, types are deduced from the arguments
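A standalone sketch of the idea (not HPX's actual tagged_tuple implementation; the tag types and the get(tag) accessor below are simplifications for a fixed two-element case, whereas HPX spells the accessors tt.t1() / tt.t2()): tags are independent marker types that only name the slots, while the element types are deduced from the arguments, matching the usage described above.

```cpp
#include <cassert>
#include <string>
#include <utility>

struct first_tag {};     // hypothetical tag types; any empty struct will do
struct second_tag {};

// Two-element tagged tuple: tags name the slots, T0/T1 hold the values.
template <typename Tag0, typename Tag1, typename T0, typename T1>
struct tagged_tuple
{
    T0 v0;
    T1 v1;

    T0 const& get(Tag0) const { return v0; }
    T1 const& get(Tag1) const { return v1; }
};

// Tags are given explicitly, element types are deduced from the arguments.
template <typename Tag0, typename Tag1, typename T0, typename T1>
tagged_tuple<Tag0, Tag1, T0, T1> make_tagged_tuple(T0 v0, T1 v1)
{
    return {std::move(v0), std::move(v1)};
}

int main()
{
    // element types (int, std::string) come from the values, not the tags
    auto tt = make_tagged_tuple<first_tag, second_tag>(42, std::string("hi"));
    assert(tt.get(first_tag{}) == 42);
    assert(tt.get(second_tag{}) == "hi");
}
```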
<github>
[hpx] hkaiser created rolling_statistics_counters (+1 new commit): https://git.io/vHwXg
<github>
hpx/rolling_statistics_counters 340aca3 Hartmut Kaiser: Adding new statistics performance counters:...
<github>
[hpx] hkaiser force-pushed fixing_future_serialization from 19972e8 to 33ba6b1: https://git.io/vHzFI
<github>
hpx/fixing_future_serialization 33ba6b1 Hartmut Kaiser: Turning assertions into exceptions...
hkaiser has quit [Quit: bye]
K-ballo has quit [Quit: K-ballo]
eschnett has quit [Quit: eschnett]
eschnett has joined #ste||ar
jbjnr_ has joined #ste||ar
jbjnr has quit [Ping timeout: 246 seconds]
jbjnr_ is now known as jbjnr
shoshijak has quit [Ping timeout: 255 seconds]
pree has joined #ste||ar
jaafar has quit [Ping timeout: 240 seconds]
<github>
[hpx] biddisco created fixing_2679 (+1 new commit): https://git.io/vHwFv
<github>
hpx/fixing_2679 b94b8cf John Biddiscombe: Fix bad size and optimization flags during archive creation...
<github>
[hpx] biddisco opened pull request #2680: Fix bad size and optimization flags during archive creation (master...fixing_2679) https://git.io/vHwFJ
<github>
[hpx] biddisco force-pushed fixing_2679 from b94b8cf to d65d18b: https://git.io/vHwFY
<github>
hpx/fixing_2679 d65d18b John Biddiscombe: Fix bad size during archive creation...
VeXocide has quit [Read error: Connection reset by peer]
VeXocide has joined #ste||ar
shoshijak has joined #ste||ar
<taeguk>
Excuse me, is there anyone here who has experience benchmarking the memory bandwidth of parallel algorithms?
<taeguk>
I want to experiment to see whether memory bandwidth is a bottleneck for parallel algorithm performance.
<taeguk>
I have no experience with that, so I'd like to ask about tools, materials, or any tips.
<jbjnr>
taeguk: one thing you can do is compute the number of memory accesses the algorithm makes (it should be some factor of N, the array size), and then do a simple memory BW calculation: I did M reads and N writes in X seconds. Then look at the specs of your processor to see what its mem BW is
<jbjnr>
on some modern processors, the mem BW is a few hundred GB/s, on older ones, much less.
<jbjnr>
Look at the stream benchmark in hpx
<jbjnr>
to give yourself an idea - and also to get an estimate of the mem BW of your machine
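A minimal, standalone sketch of the back-of-the-envelope measurement jbjnr describes (this is not the HPX stream benchmark itself; the array size and the steady_clock timing are illustrative assumptions): time a simple copy over N doubles, count the bytes read and written, and compare the resulting rate against the processor's documented memory bandwidth.

```cpp
#include <chrono>
#include <cstddef>
#include <iostream>
#include <vector>

int main()
{
    std::size_t const N = 1 << 26;                     // ~67M doubles per array
    std::vector<double> a(N, 1.0), b(N, 0.0);

    auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i != N; ++i)
        b[i] = a[i];                                   // N reads + N writes
    auto t1 = std::chrono::steady_clock::now();

    double seconds = std::chrono::duration<double>(t1 - t0).count();
    double bytes = 2.0 * N * sizeof(double);           // read a[], write b[]
    std::cout << "effective bandwidth: "
              << bytes / seconds / 1e9 << " GB/s\n";
    std::cout << b[N - 1] << "\n";                     // keep the copy alive
}
```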
david_pfander has joined #ste||ar
<taeguk>
jbjnr: Thank you for your advice. But I'm worried about the differences between theory and the real world.
<taeguk>
If you take the cache into account, the approach you presented might not be accurate. And there may be a difference between the number of memory accesses I can estimate from the C++ code and what actually happens in the compiled assembly.
<jbjnr>
yes. The difference between theory and practice is huge
<jbjnr>
First you start with the theoretical peak performance of the algorithm - assuming that the memory BW is fully used
<jbjnr>
then you look at the cache hit/miss etc and see how this affects your algorithm
<jbjnr>
measuring cache misses requires special profiling tools like PAPI
<jbjnr>
an algorithm like is_heap will be memory bound - how do we know? - because all it does is iterate over a list and make a simple comparison. it accesses memory continuously and does almost no calculation on the data.
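A hedged sketch of how one might check that claim for is_heap (std::is_heap is used here; timing HPX's parallel is_heap with a par execution policy would work the same way - the data size and clock choice are assumptions): the algorithm reads each element roughly once, so N * sizeof(T) bytes divided by the elapsed time gives an effective bandwidth to compare against the machine's peak.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <iostream>
#include <vector>

int main()
{
    std::size_t const N = 1 << 26;
    std::vector<int> v(N, 1);          // all-equal keys form a valid max-heap

    auto t0 = std::chrono::steady_clock::now();
    bool ok = std::is_heap(v.begin(), v.end());
    auto t1 = std::chrono::steady_clock::now();

    double seconds = std::chrono::duration<double>(t1 - t0).count();
    double gbs = double(N) * sizeof(int) / seconds / 1e9;  // bytes read once
    std::cout << "is_heap: " << std::boolalpha << ok
              << ", effective bandwidth: " << gbs << " GB/s\n";
}
```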
<taeguk>
jbjnr: Yes, I think so too. Is there a solution to that problem?
<jbjnr>
not really. it's a common problem
<jbjnr>
most of the algorithms we develop are memory bound, few are compute bound it seems - at least the computer science ones (as opposed to the physics solvers)
<jbjnr>
do the sums anyway and see how close to peak memory bandwidth you are getting ....
<taeguk>
Right.
<taeguk>
But in the case of is_heap, the cache behavior is bad.
<taeguk>
I have no idea how to solve that cache problem.
<jbjnr>
it's a problem for sure.
<jbjnr>
if you think of a way to solve that cache miss problem by developing a better algorithm, then you are a star! (I haven't researched is_heap, so I'm really not sure how many ways there are of doing it, but I guess not many)
<jbjnr>
heller_: I'll have a play with the LF PR, but the MAX_TERMINATED_THREADS one I suspect hartmut wants to make a user option instead of just reducing it
<heller_>
jbjnr: ok, just asking if I can reduce your load today
<heller_>
rehearsal is boring
<jbjnr>
rehearsal?
<heller_>
rehearsal for the review tomorrow
<jbjnr>
I don't think we ever did rehearsals for the smaller projects. Always got top marks and it was easy :)
<jbjnr>
do good work, and all will be fine
<jbjnr>
my rma stuff is now officially awesome
<heller_>
great!
<heller_>
jbjnr: osu latency in the native MPI ballpark now?
<jbjnr>
no of course not. just the design is lovely
<jbjnr>
just about to run first full osu test after more tweaking
<heller_>
woohoo! :D
<jbjnr>
will have results soon. won't be much better than the current implementation, but might show improvement on smaller thread counts, and uses less memory
<heller_>
k
<jbjnr>
on higher thread counts the memory registration costs are hidden by the overlapping sends
<jbjnr>
so I hope to see better perf on a single thread compared to the non-RMA version
<heller_>
that'll be great already
<heller_>
that's more or less exactly what we need
<jbjnr>
fingers crossed
bikineev has joined #ste||ar
pree has quit [Ping timeout: 255 seconds]
bikineev has quit [Remote host closed the connection]
Matombo has joined #ste||ar
Remko has joined #ste||ar
pree has joined #ste||ar
Remko has quit [Remote host closed the connection]
<diehlpk_work>
hkaiser: it depends on heller_, because he did not reply
<hkaiser>
yah
hkaiser has quit [Read error: Connection reset by peer]
hkaiser has joined #ste||ar
<hkaiser>
diehlpk_work: for me it would work only after 2pm which is most likely too late for heller_ :/
<github>
[hpx] hkaiser force-pushed rolling_statistics_counters from 5a0bbdb to fca4aaa: https://git.io/vHr99
<github>
hpx/rolling_statistics_counters fca4aaa Hartmut Kaiser: Adding new statistics performance counters:...
<diehlpk_work>
Ok, so maybe Thursday or Friday
<hkaiser>
diehlpk_work: sure, depends on heller_
<zbyerly>
hkaiser, you might need to use -DWITH_SCOTCH=false
<zbyerly>
building libgeodecomp
<hkaiser>
zbyerly: ok
aserio has joined #ste||ar
<heller_>
hkaiser: diehlpk_work: I can't do it today. I'm fully booked this week. Sorry
bikineev has joined #ste||ar
<jbjnr>
HPX_REGISTER_BASE_LCO_WITH_VALUE_DECLARATION - do we need this?
<jbjnr>
what does it do? I've asked before, but forgotten
<hkaiser>
jbjnr: only for heterogeneous settings
<jbjnr>
aha. thanks
<jbjnr>
should put that in the name really
<hkaiser>
also, it avoids instantiating a base_lco_with_value<T> for each T only once across the whole application
<hkaiser>
more than once* even
<jbjnr>
ok
aserio has quit [Ping timeout: 260 seconds]
aserio has joined #ste||ar
hkaiser has quit [Quit: bye]
aserio has quit [Ping timeout: 246 seconds]
bikineev has quit [Ping timeout: 240 seconds]
hkaiser has joined #ste||ar
aserio has joined #ste||ar
<K-ballo>
I've implemented a variation of Peter's exception_info proposal
<K-ballo>
I'll try to integrate it into HPX and see if it can replace Boost.Exception
<hkaiser>
K-ballo: ohh cool!
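A self-contained conceptual sketch of the exception_info idea being discussed (this is neither Peter's proposal text nor the code K-ballo integrated into HPX; the tag types, the throw_with_info helper, and the C++17 std::any usage are all assumptions for illustration): arbitrary tagged data is attached to an exception at the throw site and recovered at the catch site without the exception type knowing about the tags.

```cpp
#include <any>
#include <iostream>
#include <map>
#include <stdexcept>
#include <string>
#include <typeindex>
#include <utility>

// Holds tag -> value pairs; each tag names its value type via Tag::type.
struct exception_info
{
    std::map<std::type_index, std::any> data;

    template <typename Tag>
    void set(typename Tag::type value)
    {
        data[typeid(Tag)] = std::move(value);
    }

    template <typename Tag>
    typename Tag::type const* get() const
    {
        auto it = data.find(typeid(Tag));
        return it == data.end() ?
            nullptr : std::any_cast<typename Tag::type>(&it->second);
    }
};

// hypothetical info tags
struct file_name { using type = std::string; };
struct line_number { using type = int; };

// The thrown object derives from both the original exception and the info.
template <typename E>
struct with_info : E, exception_info
{
    with_info(E e, exception_info i)
      : E(std::move(e)), exception_info(std::move(i)) {}
};

template <typename E>
[[noreturn]] void throw_with_info(E e, exception_info i)
{
    throw with_info<E>(std::move(e), std::move(i));
}

int main()
{
    try
    {
        exception_info i;
        i.set<file_name>("example.cpp");
        i.set<line_number>(42);
        throw_with_info(std::runtime_error("boom"), std::move(i));
    }
    catch (std::runtime_error const& e)
    {
        // a side-cast recovers the attached info at the catch site
        if (auto const* xi = dynamic_cast<exception_info const*>(&e))
        {
            if (auto const* f = xi->get<file_name>())
                std::cout << "thrown from " << *f;
            if (auto const* l = xi->get<line_number>())
                std::cout << ":" << *l;
            std::cout << "\n";
        }
        std::cout << e.what() << "\n";
    }
}
```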
akheir has joined #ste||ar
<github>
[hpx] hkaiser force-pushed rolling_statistics_counters from fca4aaa to 65c8ec1: https://git.io/vHr99
<github>
hpx/rolling_statistics_counters 65c8ec1 Hartmut Kaiser: Adding new statistics performance counters:...
hkaiser has quit [Quit: bye]
david_pf_ has joined #ste||ar
eschnett has quit [Quit: eschnett]
hkaiser has joined #ste||ar
<github>
[hpx] Naios opened pull request #2685: Add support of std::array to hpx::util::tuple_size and tuple_element (master...size_std_array) https://git.io/vHoKz
<K-ballo>
ha! ^ interesting... I was considering doing that, and dropping boost::array support in the process
<hkaiser>
we should move away from boost::integral_constant et al. as well
<hkaiser>
that's another painful thing (lots of work)
<K-ballo>
yep, yep yep yep
<K-ballo>
and util::decay
<hkaiser>
right
<K-ballo>
but a massive overall PR would be disruptive
<hkaiser>
nod
<K-ballo>
besides painful :P
<hkaiser>
one thing at a time...
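For context on what the std::array support in PR #2685 amounts to, here is a minimal sketch (not the actual HPX code; the namespace and trait definitions below are hypothetical): a tuple_size/tuple_element-style trait pair whose primary templates are left undefined, with partial specializations added per supported type - the std::array specializations are the analogue of what the PR adds.

```cpp
#include <array>
#include <cstddef>
#include <tuple>
#include <type_traits>

namespace util_sketch
{
    // primary templates, left undefined for unsupported types
    template <typename T>
    struct tuple_size;

    template <std::size_t I, typename T>
    struct tuple_element;

    // std::tuple support
    template <typename... Ts>
    struct tuple_size<std::tuple<Ts...>>
      : std::integral_constant<std::size_t, sizeof...(Ts)>
    {};

    template <std::size_t I, typename... Ts>
    struct tuple_element<I, std::tuple<Ts...>>
      : std::tuple_element<I, std::tuple<Ts...>>
    {};

    // std::array support, analogous to what the PR adds
    template <typename T, std::size_t N>
    struct tuple_size<std::array<T, N>>
      : std::integral_constant<std::size_t, N>
    {};

    template <std::size_t I, typename T, std::size_t N>
    struct tuple_element<I, std::array<T, N>>
    {
        using type = T;
    };
}

static_assert(util_sketch::tuple_size<std::array<int, 3>>::value == 3,
    "tuple_size sees std::array");
static_assert(std::is_same<
        util_sketch::tuple_element<0, std::array<int, 3>>::type, int>::value,
    "tuple_element sees std::array");
```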
<heller_>
hkaiser: are there still any problems with #2619?
<jbjnr>
what you see there is the rma vector (rma) versus the serialize_buffer (ser)
<jbjnr>
when the number of threads is smallish (2 in this case) and the number of messages in flight (window size) is smallish, then the cost of rma memory registration can be seen
<jbjnr>
the rma version has preregistered buffers on both ends and outperforms the serialize buffer quite nicely on the large message sizes. for <4096 they are both using copy mode, but above that they switch to rendezvous protocol and the rma kicks in.
<jbjnr>
for 524288 the difference is 14 GB/s vs 10 GB/s - lovely.
<jbjnr>
I will plot some graphs and submit a paper on Friday.
<jbjnr>
there's an odd bump at 4096 I need to look into, but apart from that all is well.
<jbjnr>
see ya tomorrow
david_pf_ has quit [Quit: david_pf_]
<github>
[hpx] K-ballo created throw_with_info (+1 new commit): https://git.io/vHo9A