hkaiser changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar-group.org | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | This channel is logged: irclog.cct.lsu.edu
K-ballo has quit [Quit: K-ballo]
hkaiser has quit [Quit: Bye!]
diehlpk_work_ has quit [Remote host closed the connection]
<ms[m]> circleci is back (was already yesterday)
<ms[m]> apparently some automated system "erroneously flagged" us...
K-ballo has joined #ste||ar
hkaiser has joined #ste||ar
<K-ballo> hkaiser: have you been able to successfully use VerySleepy on windows?
<K-ballo> it stopped working for me when I switched to win7
<hkaiser> K-ballo: I have not tried
<hkaiser> I'm usually using vtune for this
<K-ballo> lately i've been using the performance tools that come with VS
<hkaiser> ok
<hkaiser> do they work well? I never even tried
<K-ballo> they're nowhere near vtune, but most of the time they're good enough
<K-ballo> sometimes though they magically stop capturing events, and it takes a rebuild and/or restart to get it going again
<hkaiser> ok
<jedi18[m]> I can't use vtune on my AMD laptop right? Any idea how the alternative, AMD μProf, compares to vtune?
<hkaiser> jedi18[m]: vtune should work on AMD machines, although it might have reduced functionality
<hkaiser> I have never tried using AMDs tools, however
<jedi18[m]> Oh ok cool, I'll try using it on the ranges stuff
<zao> VTune has greatly reduced profiling functionality on non-Intel chips, I think it even rules out any form of hardware-assisted instrumentation. Nowhere in the development specs for the software did the phrase "maybe look at what analogous functionality AMD offers for their CPUs" exist :D
<zao> Not saying it's intentionally worse than it should be, but it's intentionally worse than it should be.
<hkaiser> zao: +1
<hkaiser> zao: Intel is known for crippling their tools on AMDs architectures
<jedi18[m]> That's rude of them xD, but I guess it's expected since they are their primary competitor
<jedi18[m]> Let's hope AMD μProf is good then
<hkaiser> jedi18[m]: for a first assessment, software based analysis techniques are good enough, usually - the hardware assisted stuff is needed only if you already know where your bottlenecks are
<jedi18[m]> Oh ok yeah I don't have any immediate use for it, just wanted to try it out
<hkaiser> sure
<gonidelis[m]> jedi18: been struggling with that quite a lot. vtune hates my ryzen. I was able to do minimal analysis like cpu usage and stuff though
<gonidelis[m]> AMD's uprof is the worst program in the world. not just compared to other profilers, but in general
<gonidelis[m]> bottom line, i would opt for vtune even if they want to treat us like second-class citizens. it's that good.
<jedi18[m]> Oh ok thanks, won't bother trying out uprof then, vtune it is
<gonidelis[m]> jedi18: please don't
<gonidelis[m]> K-ballo: how come this works? better yet, how come this does not work when I uncomment line #16? worse even, I would expect `views::for_each(vv, lambda)` to work. https://wandbox.org/permlink/GBcpLZxsWolqz2cZ
<K-ballo> why wouldn't it work?
<gonidelis[m]> seems like the problem is this
<gonidelis[m]> but according to this https://github.com/ericniebler/range-v3/blob/83783f578e0e6666d68a3bf17b0038a80e62530e/include/range/v3/view/for_each.hpp#L52 I would expect it to be able to accept the rng argument
<K-ballo> can't match what you are saying to the wandbox snippet... tell me with words, why wouldn't that work?
<gonidelis[m]> that's the least important of my three questions because I can understand that it is just a view closure copy assignment, yet I do not see its usage
<K-ballo> ?
<gonidelis[m]> ok ok. bottom line is why that wouldn't work https://wandbox.org/permlink/Mm3kWbjnFqHV6AJu?
<gonidelis[m]> i just took it a step back
<gonidelis[m]> what's your question?
<K-ballo> when you ask why something works or doesn't work you need to say what your expectation actually is
<gonidelis[m]> i expect the second snippet to work
<gonidelis[m]> i expect it to lazily multiply each element of vv by 2
<K-ballo> are you describing the transform view?
<gonidelis[m]> ...
<gonidelis[m]> the for_each vs transform beef dictates that their main difference is that one is done in place
<gonidelis[m]> they sound similar
<K-ballo> no, they are views
<K-ballo> I see neither docs nor tests for the for_each view, but from the implementation it seems to be a join over a transform
<gonidelis[m]> aha
<gonidelis[m]> "Lazily applies an unary function to each element in the source range that returns another range (possibly empty), flattening the result. "
<gonidelis[m]> "Given a source range and a unary function, return a new range where each result element is the result of applying the unary function to a source element."
<gonidelis[m]> excluding the "flattening results part", what's the difference?
<gonidelis[m]> K-ballo: so it means they do approximately the same thing. almost.
<K-ballo> except for the part they are different, they do the same.. is that what you are saying? you are entirely correct
<gonidelis[m]> so the only difference is the flattening result thingy?
<gonidelis[m]> what does flattening the result even mean?
<K-ballo> flattening means going from range of range of T to range of U
<K-ballo> flattening without transformation would go from range of range of T to range of T
<gonidelis[m]> ahh thanks for that. wow
<gonidelis[m]> i see!
<K-ballo> flattened {"a", "bc", "d"} is "abcd"
<gonidelis[m]> with all these things we are saying it sounds like for_each(vv, lambda) should work. aha. got it. that's nice actually
<gonidelis[m]> K-ballo: it's the functor!
<K-ballo> function object
<gonidelis[m]> no, it's views::for_each's accepted functor
<gonidelis[m]> from SO: "You misunderstand what view::for_each() is, it's totally different from std::for_each", oh really? 😅
<K-ballo> views::for_each actually takes a callable
<gonidelis[m]> you talkin about the first or the second arg?
<hkaiser> yes
<gonidelis[m]> which is then specifically cast to an rvalue ref
<hkaiser> yah, they circumvent using forward to save compile time
<hkaiser> that cast is doing the same as std::forward
<gonidelis[m]> huh.... ok that's nice
<hkaiser> gnikunj[m]: yt?
jehelset has joined #ste||ar
<gnikunj[m]> hkaiser: forgot to set an alarm :/
<gnikunj[m]> Ofc it had to happen again
<hkaiser> never happened before - and yet again ;-)
<srinivasyadav227> Or gnikunj c
<gnikunj[m]> When am I getting that alarm clock you talked about?
<gnikunj[m]> I think I'm in desperate need of one ;)
<hkaiser> gnikunj[m]: what if you used your cell phone, it can wake you up as well
<gonidelis[m]> gnikunj: uiuc should be giving them out for free, given how much they exhaust you over there ;p
<gnikunj[m]> Hahahaha true
<hkaiser> gonidelis[m]: nah, he's paying for being punished
<gnikunj[m]> I'll pester the CS dept here for one
<gonidelis[m]> hahaha
<gnikunj[m]> hkaiser: good thing the pay is small ;)
tufei has quit [Remote host closed the connection]
tufei has joined #ste||ar
<hkaiser> gnikunj[m]: most likely mdspan will go into C++23, so no excuses for not looking into striding for vectorization anymore
<gnikunj[m]> I did go through the implementation
<hkaiser> any insights?
<gnikunj[m]> Striding is important!
<gnikunj[m]> I don't think any other runtime supports striding. So I want us to be first!
<gonidelis[m]> what's striding?
<gnikunj[m]> A stride of n is considering elements in order i, i+n,..., i+k*n,...
<gonidelis[m]> what's the proposal then?
<gnikunj[m]> We're trying to get vector pack of strides
<gnikunj[m]> Vector pack in general is applied to contiguous data elements
<gnikunj[m]> So if the user wants stride, the user needs to change the data structure used to have a behavior similar to stride
<gonidelis[m]> wow!
<gonidelis[m]> talkin about convenience
<gnikunj[m]> Yes, having stride makes our vector implementation very general
<hkaiser> gnikunj[m]: but possibly inefficient, so let's try it out!
<gnikunj[m]> Yes, it will be inefficient until we figure out the data locality party
<gnikunj[m]> s/party/part/
<hkaiser> freudian typo ;-)
<gonidelis[m]> hahahahahahahhhahahahaha
<gnikunj[m]> Shhhh no one saw that 🤫
<pedro_barbosa[m]> Hey, I was doing an example with HPXCL and CUDA, and at some point I wanted to replace some values in an array I pass as an argument to the kernel with the values of a smaller array, but I keep getting an error. If I replace the argument array's values with fixed numbers it works fine, but when I try to replace them with a value from another array I get the following error:
<pedro_barbosa[m]> ```
<pedro_barbosa[m]> what(): CudaError: an illegal memory access was encountered at buffer::~buffer Error during synchronization of stream
<pedro_barbosa[m]> ```
<hkaiser> pedro_barbosa[m]: well, it's difficult to know what's wrong without seeing the code
<pedro_barbosa[m]> float* newPos is passed as an argument
<pedro_barbosa[m]> in this example I'm trying to access deviceOffset+index but if I try to access 0 the error persists
<hkaiser> what's newPos?
<hkaiser> where does that come from?
<pedro_barbosa[m]> it's a float* declared on the host and passed as an argument to the kernel, I can send both files if it is easier
<hkaiser> so the code snippet runs on the device?
<pedro_barbosa[m]> Yes
<hkaiser> is the buffer newPos points to (you said it's a host pointer) somehow transferred to the device before executing the code snippet?
<pedro_barbosa[m]> Not sure I understand what you're asking
<hkaiser> you can't just use a host pointer on the device
<pedro_barbosa[m]> yeah I know that, I believe I'm doing it correctly, unless I'm missing something obvious
<pedro_barbosa[m]> These are the source files
<pedro_barbosa[m]> line 27 and 45 on the cpp file has the declaration of both the host array and then the buffer I use to copy to the device
<hkaiser> not sure, I see cudaMallocHost calls only
<pedro_barbosa[m]> line 45
<hkaiser> sorry, I don't understand the code
<hkaiser> I still don't see what's wrong - where is newPos?
<pedro_barbosa[m]> in line 45 I declare the buffer that I'm going to use to copy the newPos to the device, on line 46 I do the copy, line 124 I add it to the argument list and then on line 140 I run the kernel with that argument list
<pedro_barbosa[m]> then if you go to the kernel file on line 72 you can see the function that's being executed with newPos being the 1st argument
<hkaiser> ok
diehlpk has joined #ste||ar
diehlpk has quit [Quit: Leaving.]
diehlpk has joined #ste||ar
RostamLog has joined #ste||ar
<gonidelis[m]> hkaiser: pm
diehlpk has quit [Quit: Leaving.]
diehlpk has joined #ste||ar
akheir has quit [Ping timeout: 264 seconds]
jehelset has quit [Remote host closed the connection]
diehlpk has quit [Quit: Leaving.]