hkaiser changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/ | GSoC: https://github.com/STEllAR-GROUP/hpx/wiki/Google-Summer-of-Code-%28GSoC%29-2020
gonidelis has quit [Ping timeout: 240 seconds]
karame_ has quit [Ping timeout: 240 seconds]
<diehlpk_work> weilewei, I think you can take the course as some external course
<weilewei> diehlpk_work ok, I need to think about it, as I also have some requirements for taking external courses. I need to check
bita_ has quit [Quit: Leaving]
shahrzad has quit [Ping timeout: 240 seconds]
shahrzad has joined #ste||ar
shahrzad has quit [Ping timeout: 252 seconds]
wate123_Jun has quit [Remote host closed the connection]
wate123_Jun has joined #ste||ar
shahrzad has joined #ste||ar
diehlpk_work has quit [Remote host closed the connection]
hkaiser has quit [Quit: bye]
shahrzad has quit [Remote host closed the connection]
shahrzad has joined #ste||ar
shahrzad has quit [Remote host closed the connection]
shahrzad has joined #ste||ar
nan11 has quit [Remote host closed the connection]
wate123_Jun has quit [Remote host closed the connection]
Vir has quit [Ping timeout: 256 seconds]
Vir has joined #ste||ar
Vir has joined #ste||ar
Vir has quit [Changing host]
wate123_Jun has joined #ste||ar
wate123_Jun has quit [Ping timeout: 256 seconds]
weilewei has quit [Remote host closed the connection]
shahrzad has quit [Ping timeout: 240 seconds]
wate123_Jun has joined #ste||ar
wate123_Jun has quit [Ping timeout: 240 seconds]
shahrzad has joined #ste||ar
shahrzad has quit [Quit: Leaving]
nikunj97 has joined #ste||ar
jaafar has quit [Remote host closed the connection]
jaafar has joined #ste||ar
<nikunj97> heller1, on some testing with hwloc-bind I can see that indeed E5 consistently performs at about 2700-2800MLUPS (float) and 1100-1300MLUPS (double)
<nikunj97> that's when I use hwloc-bind to bind equal number of cores from each package
<nikunj97> it starts off low and then saturates to that number no matter how many cores I put in
<nikunj97> which is the behavior we're looking at
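For reference, a binding like the one described might look as follows with hwloc-bind (the binary name and core ranges are placeholders; hwloc combines multiple location arguments into the union of their cpusets):

```shell
# Bind the process to an equal number of cores from each package (illustrative):
hwloc-bind package:0.core:0-4 package:1.core:0-4 -- ./stencil_benchmark
```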
<simbergm> everyone interested in having a say in whether the pragma once pr goes in or not, please respond to the email on hpx-devel
<simbergm> we're doing a vote!
nikunj97 has quit [Ping timeout: 240 seconds]
nikunj has quit [Read error: Connection reset by peer]
nikunj has joined #ste||ar
nikunj has quit [Ping timeout: 264 seconds]
nikunj has joined #ste||ar
nikunj97 has joined #ste||ar
<heller1> ms: weeeh!
<heller1> nikunj97: that's expected behavior
<nikunj97> heller1, yes it is, and I'm happy about it
<heller1> good!
<heller1> now, you will be able to see the effect of cache blocking
<nikunj97> heller1, btw I couldn't understand why performance dropped when we started performing on both CPUs
<nikunj97> both on Xeon and Hi1616
<heller1> nikunj97: not always, just in some configurations
<nikunj97> for ex: the performance dropped when going from 10 cores to 20 cores (1 CPU and 2 CPU respectively)
<nikunj97> in xeon e5
<heller1> right
<nikunj97> while the memory bandwidth doubled, which I think is expected since each CPU is attached to quad-channel RAM
<heller1> yes
<heller1> in theory
<heller1> here is the catch
<heller1> you had more than half (I think it was even more than 2/3) of the threads bound to one NUMA domain and the remaining few to another
<heller1> now
<heller1> a single core is not designed to max out the available bandwidth of its attached memory controller
<heller1> however, you usually see a saturation of the bandwidth when using about 2/3 of the cores attached to a single memory controller
<heller1> what you then see, is that 2/3 of your program is operating at maximum speed, while the remaining 1/3 is operating way slower, because they don't get as much memory bandwidth
<heller1> as such, slowing down the entire algorithm, since we are basically operating in a lock step fashion, where the slowest path dominates the overall runtime
<nikunj97> aah, that makes sense! but why the drop going from one complete CPU to 2 complete CPUs?
<heller1> that's another story
<heller1> compare the idle rates in your runs
<nikunj97> using a profiler?
<heller1> the number of tasks executed, and the average runtime of those tasks
<heller1> the HPX performance counter should suffice here
<nikunj97> you think that the stencil dimensions are not large enough to keep all CPUs busy all the time?
<nikunj97> also, if we make it into a dataflow style, we should not observe a sharp drop in cases of non uniform memory bandwidth usage
<heller1> figure it out
<heller1> we still do
<nikunj97> why's that?
<heller1> there's still the slowest path
<nikunj97> aah, alright. we're not doing anything to speed up the slowest path
<heller1> well, you will mitigate the effect to some degree
<nikunj97> thanks for the lead, I'll try to figure idle rate now!
<heller1> because potentially, you can hide some latencies
<nikunj97> btw, I also found some improvements with loop unrolling, about ~100-150MLUPS
<heller1> the catch here however is that we don't have asynchronous memory operations, so we can't schedule other useful work in between
<heller1> the dataflow approach as such will only be really beneficial once going distributed
<nikunj97> so is it beneficial to convert the code to a dataflow approach at this point?
<heller1> and only if the "communication surface" is small enough to be hidden by computation
<heller1> depends
<heller1> for this static problem: unlikely. for a more dynamic one: there are lots of benefits
<nikunj97> aah, I think I got some of what you said. But it is making sense to me
<nikunj97> I also think that I took a wrong example for simd performance
<nikunj97> I should've chosen something on the compute bound side
<heller1> :P
<heller1> get those cache optimizations in, and you'll be compute bound
<nikunj97> am trying. I want to get as much performance as possible :D
tiagofg[m] has joined #ste||ar
nikunj97 has quit [Quit: Leaving]
kale_ has joined #ste||ar
nikunj97 has joined #ste||ar
nikunj97 has quit [Remote host closed the connection]
kale_ has quit [Ping timeout: 258 seconds]
nikunj97 has joined #ste||ar
gonidelis has joined #ste||ar
nikunj97 has quit [Ping timeout: 240 seconds]
nikunj97 has joined #ste||ar
parsa has quit [Quit: Free ZNC ~ Powered by LunarBNC: https://LunarBNC.net]
<nikunj97> how do I call a templated function from hpx::async?
<nikunj97> found it: do an async lambda and call the function from the lambda
<heller1> or spell out the type
parsa has joined #ste||ar
<nikunj97> spelling out the type still doesn't work
<nikunj97> foo<type>, args... doesn't seem to work
<nikunj97> pure virtual method called
<nikunj97> terminate called without an active exception
<nikunj97> cool error message from HPX ^^
diehlpk_work has joined #ste||ar
Hashmi has joined #ste||ar
hkaiser has joined #ste||ar
wate123_Jun has joined #ste||ar
shahrzad has joined #ste||ar
weilewei has joined #ste||ar
hkaiser_ has joined #ste||ar
hkaiser has quit [Ping timeout: 240 seconds]
<nikunj97> heller1, you were right about CPU idling. The execution time is significantly higher than task execution time * num tasks.
<nikunj97> let me try profiling to identify where things are going wrong
karame_ has joined #ste||ar
shahrzad has quit [Ping timeout: 240 seconds]
<hkaiser_> nikunj97: idle-rate!
<nikunj97> hkaiser_, how do I use that?
<nikunj97> I am using Valgrind to check for timings and stuff
<hkaiser_> --hpx:print-counter=/threads/idle-rate ?
<nikunj97> ohh, didn't know it existed. wait let me try
<hkaiser_> nikunj97: needs to be enabled at compile time, however
<nikunj97> ohh with compile time option?
<nikunj97> you mean compiling HPX right?
<hkaiser_> -DHPX_WITH_THREAD_IDLE_RATES=On while compiling HPX
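Putting the two steps together (the application name is a placeholder):

```shell
# Configure HPX with idle-rate counters enabled (adds some measurement overhead):
cmake -DHPX_WITH_THREAD_IDLE_RATES=On <hpx-source-dir>

# Then request the counter when running the application:
./my_app --hpx:print-counter=/threads/idle-rate
```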
<nikunj97> why did we decide to keep it off by default?
<hkaiser_> simbergm: is there a way to add an anchor in the HPX docs for each perf counter in the table?
<hkaiser_> it adds some (possibly minor) overhead
<nikunj97> simbergm, that will be of great help!
<nikunj97> hkaiser_, aah! I see
<nikunj97> hkaiser_, is there any other compile time option that I'd want to use?
<hkaiser_> nikunj97: I think not
<nikunj97> alright, let me compile it again and use that
nan11 has joined #ste||ar
<simbergm> hkaiser_: maybe... it looks like it's possible to create a target, but you'd have to manually know the link (i.e. there'd be no button like with headings)
<simbergm> I'll look around
<hkaiser_> simbergm: thanks, I'd volunteer to add those
K-ballo has quit [Remote host closed the connection]
K-ballo has joined #ste||ar
<nikunj97> hkaiser_, /threads{locality#0/pool#default/worker-thread#19}/idle-rate,1,5.473154,[s],6652,[0.01%]
<nikunj97> this shows a different story somehow
shahrzad has joined #ste||ar
bita has joined #ste||ar
K-ballo has quit [Remote host closed the connection]
K-ballo has joined #ste||ar
<hkaiser_> nikunj97: numa 0 has ~33% and numa 1 has ~66% idle-rate
<nikunj97> how did you calculate that?
<nikunj97> I thought 0.01% mentioned is the idle rate
<hkaiser_> well, your counters show that
<hkaiser_> no, 0.01% is the unit of measure
<nikunj97> what are all the values then?
<nikunj97> comma separated values I mean
<hkaiser_> RTFM
<hkaiser_> nikunj97: HPX docs <quote>These lines have 6 fields, the counter name, the sequence number of the counter invocation, the time stamp at which this information has been sampled, the unit of measure for the time stamp, the actual counter value, and an optional unit of measure for the counter value.</quote>
<nikunj97> aah so 3596 -> 35.96% idle rate
<hkaiser_> right
<nikunj97> what should I do to decrease the idle rate? more tasks?
<hkaiser_> more parallelism
<hkaiser_> ask yourself why numa domain 1 is more idle than numa domain 0
rtohid has joined #ste||ar
<nikunj97> good question. Because they're allocated on numa 0 and migrated to numa 1?
<nikunj97> causing a delay and idling cpus
<hkaiser_> that's what you can improve, then
<nikunj97> how many tasks does parallel_for create?
<hkaiser_> as many as it creates chunks
<nikunj97> I found that using plain futures improved performance
<heller1> nikunj97: you are doing strong scaling right now. Try the same with scaling the number of elements with the number of CPUs
<hkaiser_> depends on the chunk sizes, I guess
<hkaiser_> also, try using the parallel_aggregated executor, that is more efficient than the default one
<nikunj97> heller1, isn't the current dimension of 8192x131072 enough for strong scaling?
<nikunj97> hkaiser_, I'm using hpx::compute::host::block_executor<>
<hkaiser_> ok
<hkaiser_> that's fine, I guess
<nan11> hkaiser_ will we have meeting today?
<hkaiser_> nan11: if you would like to meet, I'm ready
shahrzad has quit [Ping timeout: 256 seconds]
<nan11> Yes, could we have a short meeting
<hkaiser_> sure
Hashmi has quit [Quit: Connection closed for inactivity]
shahrzad has joined #ste||ar
<nikunj97> hkaiser_, what happened to executor_traits? I can't see them in the documentation anymore
<nikunj97> did we change the syntax for `exec.async_apply`?
shahrzad has quit [Ping timeout: 246 seconds]
<hkaiser_> no
<hkaiser_> nikunj97: but you shouldn't ever use exec.async_execute or similar
<hkaiser_> always use the customization point async_execute(exec, ...)
<nikunj97> what does async_execute return?
<nikunj97> a future?
<hkaiser_> the same as executor.async_execute returns
<hkaiser_> simbergm: yt?
<hkaiser_> simbergm: I'm trying to fix the issues with #4487
<hkaiser_> but I'm running into errors in the shared_priority_queue_scheduler that are unrelated
<nikunj97> hkaiser_, got it. thanks!
<simbergm> hkaiser_: half here
<hkaiser_> simbergm: I added some asserts to the scheduler that fire now on that PR
<simbergm> not sure I can help you but what errors do you get? jbjnr would be the one to ask about shared_priority_queue_scheduler
<hkaiser_> not sure who can look at that scheduler, I don't know anything about it
<simbergm> I guess the errors are in the pr status? I'll have a look
<hkaiser_> basically the local_nums used for the threads end up being -1
<hkaiser_> and then are used as array indices
<simbergm> the cross_pool_injection test also still has problems and it's using the shared priority queue scheduler
<simbergm> hmm, ok
<hkaiser_> yes, all of those fire the assert now
<simbergm> that could be me
<simbergm> related to the thread num refactoring
<hkaiser_> nod, right
<simbergm> but you say only in the shared priority queue scheduler? that would be the one that actually uses the local numbers, the others I'm not sure they even set the local numbers (only global)
<hkaiser_> I just pushed the changes, you'll see it in the CI runs
<hkaiser_> yes
<simbergm> in any case, I'll have a look latest monday
<hkaiser_> have not seen it for others
<hkaiser_> sure, no rush
<hkaiser_> I have it off my desk now ;-)
<simbergm> as in I'll have a look at fixing it
<simbergm> if I get any ideas I'll let you know over the weekend
<simbergm> :P
<hkaiser_> ok, thanks - try to relax over the free days
<simbergm> but you got around the other failing tests? the actual exception handling? or is this in the way of that?
<hkaiser_> yes
<hkaiser_> that's fixed, I believe
hkaiser_ has quit [Read error: Connection reset by peer]
Yorlik has quit [Read error: Connection reset by peer]
hkaiser has joined #ste||ar
Yorlik has joined #ste||ar
<nikunj97> heller1, hkaiser using `hpx::compute::host::block_executor` was leading to thread idling. Replacing it with `hpx::parallel::execution::parallel_executor_aggregated` fixed most of the thread idling. Now floats idle about 1.5-2.5% and doubles at 14%
<nikunj97> also in case of doubles, it's numa domain 1 that's causing the idling. Is there a way through which I could tell `hpx::parallel::execution::parallel_executor_aggregated` about the numa domains, just like there was with block_executor?
<nikunj97> hpx::compute::host::numa_domains() doesn't seem to work with parallel_executor_aggregated
<hkaiser> nikunj97: that's complicated and you are hitting a gap in our functionality
<nikunj97> is it too complicated?
<nikunj97> if not, I can try to add that to it
<heller1> Hmmm, interesting
<heller1> That's definitely a performance regression
<simbergm> nikunj97: master or 1.4.1?
<nikunj97> this is with master
<simbergm> from when exactly? or which commit?
<nikunj97> simbergm, e89bb10
<nikunj97> it's about 1.5 months old now
<heller1> So the idle rate is better, what about the actual performance though?
<nikunj97> it's at almost peak now
<nikunj97> so it gets to the peak expected and then saturates
<nikunj97> more like what we expected
<nikunj97> but couldn't see in the graph. Now I can reproduce what we expected
<simbergm> you might want to try latest master, the block executor changed to use a different executor in its implementation
<simbergm> also, are you scheduling individual tasks or doing parallel fors/bulk_async_execute?
<nikunj97> I'm doing a parallel_for
<nikunj97> without any chunk based input
<simbergm> ok, good
<nikunj97> let me try it with latest master then
<heller1> Can I see a graph please?
<simbergm> in that case I would compare parallel_executor (the default), parallel_executor_aggregate, thread_pool_executor, and block_executor
<nikunj97> heller1, I haven't plotted yet. I just ran it a bunch of times with the parameters I used in graph to see how it was behaving
<simbergm> you might also want to try `--hpx:queuing=shared-priority` (a different scheduler that handles numa better than the default one)
<nikunj97> simbergm, I'll try it as well
<nikunj97> sure, I can do a comparison about executors as well
<simbergm> and `--hpx:numa-sensitive`...
<simbergm> so many combinations you could try :P
<nikunj97> heller1, let me plot them tomorrow
<nikunj97> simbergm, I'm using numa-sensitive already
<simbergm> ok, good
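One way to sweep the combinations suggested above (binary name is a placeholder):

```shell
for flags in "" \
             "--hpx:queuing=shared-priority" \
             "--hpx:queuing=shared-priority --hpx:numa-sensitive"; do
  ./stencil_benchmark $flags --hpx:print-counter=/threads/idle-rate
done
```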
shahrzad has joined #ste||ar
shahrzad has quit [Remote host closed the connection]
<heller1> It should look a bit like this. Just noticed that I have that drop there as well ;)
<heller1> If the graph is flattening in a uniform way, you're doing it wrong. Adding a second numa domain should double the bandwidth
<nikunj97> heller1, let me run a script tonight and see what the results are
<nikunj97> and then we can discuss over it tomorrow
<nikunj97> should I run them with hwloc-bind just to make sure?
<nikunj97> also about arm, you said running n cores from each numa domain. Should I write a script that does that?
<nikunj97> heller1, fig 6.13, 6.14 pg-102. Why are HPX stream results significantly different from OpenMP ones?
<heller1> Sure, do it
rtohid has left #ste||ar [#ste||ar]
<heller1> Grain size
<nikunj97> simbergm, just tried with the new block_executor. It's much better now. My results finally got to about 6700MLUPS on E5 at 20 core limit
<nikunj97> that's a 2300MLUP jump from my previous result
<simbergm> nikunj97: very good to hear!
<nikunj97> let me take a closer look at the idle rates now
gonidelis has quit [Remote host closed the connection]
<diehlpk_work> hkaiser, Do you know if Ali is around?
<diehlpk_work> I have issues with the boost modules
<nikunj97> diehlpk_work, you too?
<nikunj97> is it related to boost not found?
<nikunj97> even when you load the module?
<hkaiser> diehlpk_work: send him mail
Hashmi has joined #ste||ar
<diehlpk_work> nikunj97, Using my own build of boost works
<diehlpk_work> The preloaded module files does not
<nikunj97> yes that's what I'm facing as well
<nikunj97> I thought I messed up with my configuration. Looks like it's happening to others as well
<Yorlik> o/
<nikunj97> heller1, I see a lot of idling on Hi1616
<nikunj97> even with single numanode
<nikunj97> the application is not scaling on arm basically
wate123_Jun has quit [Remote host closed the connection]
wate123_Jun has joined #ste||ar
<nikunj97> hkaiser, why does --hpx:numa-sensitive not have an option for 4 numa domains?
<nikunj97> my Hi1616 have 4 numa nodes and I'm lacking performance
<nikunj97> is there a way around it?
<weilewei> hkaiser the mpi_isend approach you introduced works on my code. However, there seems to be some physics knowledge prevent me to do so. So I use a simpler version and waiting for physicist to get back my questions to me. Thanks!
<hkaiser> weilewei: sure
<hkaiser> nikunj97: --hpx:numa-sensitive has essentially no effect on the execution, iirc - also it had never a way to specify the number of numa domains
<nikunj97> what is it used for then?
<nikunj97> also, `hpx::init: std::exception caught: Invalid argument value for --hpx:numa-sensitive. Allowed values are 0, 1, or 2`
<hkaiser> right, don't think about using it, I doubt it has any effect whatsoever
<nikunj97> alright
<nikunj97> I still have some 40% idle rate on ARM to tackle
<nikunj97> and about 30
<nikunj97> on x86
<nikunj97> but with x86 it's on full node utilisation while on ARM it starts as early as 16 cores
<nikunj97> increasing the problem size by 10x seems to make cpu utilisation a bit better. But it still is pretty significant idle time
<nikunj97> hkaiser, anything from hpx to improve performance?
<hkaiser> nikunj97: high idle-rates almost always point towards too little parallelism
<nikunj97> hkaiser, got it to work :D
<nikunj97> 11k MLUPS on ARM
<nikunj97> from 3K MLUPS
<hkaiser> nice
<nikunj97> I had to use basic futures to see the grain-size
<nikunj97> now it's idling at 10%
<nikunj97> I'm trying to fine tune and once I find the right grain-size, I'll use static_chunk_size to define the number of splits
<nikunj97> I just hit 41k MLUPS on ARM
<hkaiser> nikunj97: you could use the auto-chunker and print the value it has come up with
<hkaiser> and then use that value
<nikunj97> doesn't auto_chunker require a constructor value?
<hkaiser> have you looked at it?
<nikunj97> you mean this right?
<hkaiser> you _can_ specify both, the time to use as the minimal chunk duration and the number of iterations to measure
<hkaiser> yes
<hkaiser> but a bit newer than the one you linked
<nikunj97> how does it know if it's num iterations or the duration?
<nikunj97> hkaiser, you might want to look at this: https://gist.github.com/NK-Nikunj/2f95affe08743873471658116af1b2bd
<nikunj97> too good for my eye
<nikunj97> took me some time to figure it out, but these are some mind boggling results for me
<nikunj97> I feel like checking for correctness of my algorithm just to make sure if these are even right numbers
<hkaiser> nikunj97: idle-rates are excellent
<nikunj97> yes! they're finally under control
<nikunj97> but seriously 80GLUPS is a lot really
<nikunj97> I feel like checking for correctness now, let me try printing a smaller matrix for correctness
<nikunj97> hkaiser, I can confirm that the algorithm is indeed correct!
<nikunj97> the print results are accurate
<nikunj97> damn, I never thought that I was doing 28x slower than achievable
<nikunj97> also, I have no clue why I have these high performance numbers. will have to ask heller1 about this
wate123_Jun has quit [Remote host closed the connection]
Hashmi has quit [Quit: Connection closed for inactivity]
wate123_Jun has joined #ste||ar
<nikunj97> ohh, they were apparently cache effects. so many cores. So much L2 cache...
<Yorlik> :D
<Yorlik> The cache is your best friend :)
<nikunj97> Yorlik, :D
<nikunj97> my application is crashing, something is wrong with my code :/
<nikunj97> found the mistake
<nikunj97> the performance was all a hoax :/
<nikunj97> wrong placement of hpx::wait_all
<nikunj97> I had my hopes so high
<nikunj97> hkaiser, they're not real numbers, my bad :/
<hkaiser> lol
<hkaiser> that's how you write papers! you finally figured it out ;-)
<nikunj97> xD
<nikunj97> I will get there... eventually
<nikunj97> will plot some graphs comparing with roofline to compare with the peak performance available at that arithmetic intensity
<nikunj97> for now I'll go sleep
nikunj97 has quit [Read error: Connection reset by peer]
bita has quit [Ping timeout: 252 seconds]
<Yorlik> Any idea to improve this? Speed? Something from <algorithm>? https://godbolt.org/z/GQ8pdW
nan11 has quit [Remote host closed the connection]