K-ballo changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/
hkaiser has quit [Quit: bye]
parsa| has joined #ste||ar
srinivasyadav224 has joined #ste||ar
wash[m]_ has joined #ste||ar
zao_ has joined #ste||ar
mdiers[m]1 has joined #ste||ar
rainmaker6[m]1 has joined #ste||ar
bering[m]1 has joined #ste||ar
wash[m] has quit [*.net *.split]
zao has quit [*.net *.split]
mdiers[m] has quit [*.net *.split]
rainmaker6[m] has quit [*.net *.split]
rgoswami has quit [*.net *.split]
jedi18[m] has quit [*.net *.split]
Deepak1411[m] has quit [*.net *.split]
parsa has quit [*.net *.split]
srinivasyadav227 has quit [*.net *.split]
bering[m] has quit [*.net *.split]
ms[m] has quit [*.net *.split]
parsa| is now known as parsa
wash[m]_ is now known as wash[m]
zao_ is now known as zao
rgoswami has joined #ste||ar
jedi18[m] has joined #ste||ar
ms[m] has joined #ste||ar
Deepak1411[m] has joined #ste||ar
parsa has quit [Quit: Free ZNC ~ Powered by LunarBNC: https://LunarBNC.net]
parsa has joined #ste||ar
hkaiser has joined #ste||ar
chuanqiu has joined #ste||ar
chuanqiu has quit [Client Quit]
<srinivasyadav224> gnikunj: yt?
<gnikunj[m]> srinivasyadav227: here
<srinivasyadav224> gnikunj: I am back to normal :), I started working on the roofline model. Can you please check this link https://docs.google.com/document/d/1ULUW4ZibZK9hDBd8TuCaZgT8k8eh_JLsSE0MRpAQMhs/edit?usp=sharing ?
<gnikunj[m]> the peak P calculated here is the single-core peak performance
<srinivasyadav224> yes
<gnikunj[m]> looks good as a starting point
<gnikunj[m]> try plotting it into a graph now
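For the plot itself, the roofline is just attainable GFLOP/s = min(peak, bandwidth x arithmetic intensity). Below is a minimal sketch that prints CSV points for any plotting tool; the peak and bandwidth values are placeholders to be replaced with the single-core figures measured on the actual machine.

    // Minimal roofline sketch: attainable GFLOP/s = min(P_peak, BW * AI).
    // peak_gflops and bw_gbytes are placeholders -- substitute the measured
    // single-core compute peak and single-core STREAM bandwidth.
    #include <algorithm>
    #include <cstdio>

    int main()
    {
        double const peak_gflops = 35.0;   // placeholder compute peak (GFLOP/s)
        double const bw_gbytes   = 10.0;   // placeholder memory bandwidth (GB/s)

        std::puts("arithmetic_intensity(FLOP/byte),attainable_GFLOPs");
        for (double ai = 0.0625; ai <= 16.0; ai *= 2.0)
        {
            std::printf("%g,%g\n", ai, std::min(peak_gflops, bw_gbytes * ai));
        }
    }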
<srinivasyadav224> yeah okay :)
<srinivasyadav224> in the table it showed it got to 20 GFLOPS. I think that's when the data transfer cost is minimum, i.e. from the L1 cache, right?
<gnikunj[m]> which table?
<srinivasyadav224> the output figure, last column, i.e. avg SIMD FLOPS, on the last page
<gnikunj[m]> yes, that seems like it's running out of the L1 cache
<gnikunj[m]> that's what we concluded in the last meet as well, I believe. Your machine had 640KB or so, which meant the array would easily fit into the L1 cache.
<srinivasyadav224> yeah, so if we increase the number of elements beyond what would fit in the cache, that would give us the DRAM roofline or peak, right?
<gnikunj[m]> if you increase the number such that it doesn't fit into the caches, yes
<gnikunj[m]> I'm looking into your numbers and things don't make sense to me
<gnikunj[m]> it seems that the memory bandwidth calculated is for the whole system and not a single thread
<srinivasyadav224> yeah, the STREAM test is using all the threads (40)?
<gnikunj[m]> yeah, we don't want that
<gnikunj[m]> could you use hwloc-bind to restrict it to a single core?
<gnikunj[m]> do: hwloc-bind core:0.PU:0 ./stream
<srinivasyadav224> actually I lost all the links that you posted in the Google Meet once we exited the meeting :(, sorry
<gnikunj[m]> this would ensure that only the single-core memory bandwidth is calculated
<gnikunj[m]> that's fine, you've been holding up just fine ;)
<gnikunj[m]> essentially we want to calculate the single-core memory bandwidth. This is because a single core can't saturate the memory bandwidth completely. Hence your single-core figures will never reach that 35 GFLOPS peak you showed.
<srinivasyadav224> <gnikunj[m] "do: hwloc-bind core:0.PU:0 ./str"> thanks :), this gave me 10 GB/s
<gnikunj[m]> if you calculate the memory bandwidth for a single core (using the above command), it should come out pretty low, at 10 GB/s or so. This would change the peak and the numbers will look more aligned to it.
<srinivasyadav224> what?? how did you know that?
<srinivasyadav224> I mean, how did you tell it's 10 GB/s?
<srinivasyadav224> just curious
<srinivasyadav224> it's almost exact 😲
<gnikunj[m]> and we can explain the 20 GFLOPS in two ways: 1) vector intrinsics saturate the memory bandwidth more -> more performance for a single core (you won't see a similar difference for 40 cores though), and 2) the vector fits into the L1 cache, so caching increases the performance
<gnikunj[m]> <srinivasyadav224 "I mean, how did you tell it's 10"> I have worked with stream results enough to predict the performance :P
<srinivasyadav224> <gnikunj[m] "I have worked with stream result"> that left my brain blank :)
<srinivasyadav224> wait, if the bandwidth is 10 GB/s then P will be 10 GFLOPS, right?
<gnikunj[m]> yes
<gnikunj[m]> point 1 explains why you see results beyond that point for simd intrinsics
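As a cross-check on the STREAM number, a rough single-threaded triad timing along these lines should land in the same ballpark when run under hwloc-bind as above. This is only a sketch, not a replacement for STREAM; the array size is illustrative.

    // Rough single-core bandwidth sanity check: time a triad-style pass over
    // arrays far larger than the caches and report GB/s moved.
    #include <chrono>
    #include <cstdio>
    #include <vector>

    int main()
    {
        std::size_t const n = std::size_t(1) << 24;   // ~128 MB per array of doubles
        std::vector<double> a(n, 1.0), b(n, 2.0), c(n, 0.5);
        double const scalar = 3.0;

        // warm-up pass so first-touch page faults don't skew the timing
        for (std::size_t i = 0; i != n; ++i)
            a[i] = b[i] + scalar * c[i];

        auto t0 = std::chrono::steady_clock::now();
        for (std::size_t i = 0; i != n; ++i)
            a[i] = b[i] + scalar * c[i];
        auto t1 = std::chrono::steady_clock::now();

        double const seconds = std::chrono::duration<double>(t1 - t0).count();
        double const gbytes  = 3.0 * n * sizeof(double) / 1e9;   // 2 loads + 1 store
        std::printf("triad bandwidth: %.2f GB/s (a[0] = %f)\n", gbytes / seconds, a[0]);
    }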
<srinivasyadav224> with 33554432 elements, i.e. 128MB of data, I get 8.3 GFLOPS. My L3 cache is 5MB, so it wouldn't fit in any of the caches, right?
<gnikunj[m]> sounds right
<srinivasyadav224> okay, so the peak is 10 GFLOPS and this app is doing 8.3 GFLOPS? Is that good?
<gnikunj[m]> yes, this is decent first-pass performance
<gnikunj[m]> we can try to improve it further, and we also know the maximum we can legitimately get
<gnikunj[m]> that's why we asked you to plot the roofline model on this
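To put numbers on that conclusion: assuming 4-byte elements and an arithmetic intensity of roughly 1 FLOP/byte (which is what makes the 10 GB/s bandwidth correspond to a ~10 GFLOP/s ceiling), the working set is far outside the L3 and the measured figure sits at about 83% of the memory-bound roof.

    // Back-of-the-envelope check of the numbers quoted above.
    #include <cstdio>

    int main()
    {
        double const elements = 33554432.0;                  // element count from the chat
        double const mbytes   = elements * 4.0 / (1 << 20);  // assuming 4-byte floats
        double const ceiling  = 10.0;                        // GFLOP/s memory-bound roof
        double const measured = 8.3;                         // GFLOP/s measured

        std::printf("working set: %.0f MB (L3: 5 MB) -> DRAM-bound\n", mbytes);
        std::printf("fraction of memory-bound roof: %.0f%%\n", 100.0 * measured / ceiling);
    }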
<srinivasyadav224> <gnikunj[m] "that's why we asked you to plot "> yeah, I get it now :), I just started this afternoon, was down for about 2 days
<gnikunj[m]> it was bad for me too. I'm still sort of recovering but I'm able to work now I guess.
<gnikunj[m]> I'm glad you're back to health :)
<srinivasyadav224> I am back to 100% now, I will try to finish things up in the next few days
<hkaiser> srinivasyadav224: sounds great!
<srinivasyadav224> :)
<hkaiser> jedi18[m]: just a heads-up: I fixed the tests on the minmax PR in your repository
<jedi18[m]> hkaiser: Thanks a lot! What was the issue?
<hkaiser> for one, you forgot to change the segmented algorithms to tag_dispatch
<jedi18[m]> Oh right I did, sorry about that
<hkaiser> np
<jedi18[m]> Btw regarding my comment here https://github.com/STEllAR-GROUP/hpx/pull/5371
<hkaiser> the tests themselves run ok for me, no idea why they failed on the CI, let's see
<jedi18[m]> What can I do about that unused variable?
<jedi18[m]> Oh ok sure
<hkaiser> jedi18[m]: I added a suggestion on how to change to the PR
<jedi18[m]> Ah, that was a simple fix :D, thanks!
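The actual suggestion on the PR is not quoted in the log; for reference, the standard ways to quiet an unused-variable (or unused-parameter) warning in plain C++ look like this:

    // Common ways to silence an unused-variable warning in standard C++.
    void example([[maybe_unused]] int parameter)   // C++17 attribute
    {
        int unused_local = 42;
        (void) unused_local;                       // explicit discard, pre-C++17 style
    }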
<jedi18[m]> Btw, for overloads that return void, the parallel ones would return util::detail::algorithm_result<ExPolicy>, right?
<jedi18[m]> Is there something special about the algorithm_result<ExPolicy> returned from the base implementation, or is it the same as util::detail::algorithm_result<ExPolicy>::get(hpx::util::unused_type)?
<hkaiser> jedi18[m]: yes
<jedi18[m]> Oh ok thanks
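For context on the algorithm_result question: the point of the wrapper is that the same void-returning implementation can yield either void (synchronous policies) or future<void> (task policies). The following is a simplified, self-contained model of that pattern, not HPX's actual implementation, just an illustration of the shape the chat refers to.

    // Toy model of the algorithm_result pattern -- NOT HPX's real code.
    #include <future>
    #include <iostream>

    struct sequenced_policy {};   // stand-ins for HPX execution policies
    struct task_policy {};

    template <typename ExPolicy>
    struct algorithm_result                // synchronous policies: plain void
    {
        using type = void;
        static void get() {}
    };

    template <>
    struct algorithm_result<task_policy>   // task policies: future<void>
    {
        using type = std::future<void>;
        static std::future<void> get()
        {
            std::promise<void> p;
            p.set_value();
            return p.get_future();
        }
    };

    template <typename ExPolicy>
    typename algorithm_result<ExPolicy>::type run_void_algorithm(ExPolicy)
    {
        // ... the actual algorithm work would happen here ...
        return algorithm_result<ExPolicy>::get();
    }

    int main()
    {
        run_void_algorithm(sequenced_policy{});        // returns void
        auto f = run_void_algorithm(task_policy{});    // returns std::future<void>
        f.get();
        std::cout << "both overloads completed\n";
    }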
zao has quit [K-Lined]
wash[m] has quit [K-Lined]
zao has joined #ste||ar
<zao> Well, I'm apparently going to participate way less in this incarnation of this channel now.
<zao> Kind of hard when the popular and usable IRC service that I and others are using is not allowed on the network.
<hkaiser> zao: that's unfortunate
<hkaiser> if freenode goes down we'll need to look for alternatives anyways
<hkaiser> I have secured the #ste||ar* channels on Libera.Chat, btw
<zao> Excellent.
<zao> Gonna keep this caveman irssi running if people need to reach me :D
<hkaiser> great, thanks
<hkaiser> I'd suspect that people will start discussing matrix or other platforms again
<zao> I sent a longer email with my thoughts to ms[m] earlier tonight.
<hkaiser> could you cc me on that one as well, pls?
<zao> Forwarded.
<hkaiser> thanks
hkaiser has quit [Quit: bye]
hkaiser has joined #ste||ar
hkaiser has quit [Quit: bye]
hkaiser has joined #ste||ar
hkaiser has quit [Client Quit]
hkaiser has joined #ste||ar
zao has quit [*.net *.split]
zao has joined #ste||ar
parsa has quit [Quit: Free ZNC ~ Powered by LunarBNC: https://LunarBNC.net]
parsa has joined #ste||ar