hkaiser changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar-group.org | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | This channel is logged: irclog.cct.lsu.edu
K-ballo has quit [Quit: K-ballo]
hkaiser has quit [Quit: Bye!]
diehlpk_work_ has quit [Remote host closed the connection]
<ms[m]> circleci is back (was already yesterday)
<ms[m]> apparently some automated system "erroneously flagged" us...
K-ballo has joined #ste||ar
hkaiser has joined #ste||ar
<K-ballo> hkaiser: have you been able to successfully use VerySleepy on windows?
<K-ballo> it stopped working for me when I switched to win7
<hkaiser> K-ballo: I have not tried
<hkaiser> I'm usually using vtune for this
<K-ballo> lately i've been using the performance tools that come with VS
<hkaiser> ok
<hkaiser> do they work well? I never even tried
<K-ballo> they're nowhere near vtune, but most of the time they're good enough
<K-ballo> sometimes though they magically stop capturing events, and it takes a rebuild and/or restart to get it going again
<hkaiser> ok
<jedi18[m]> I can't use vtune on my AMD laptop right? Any idea how the alternative, AMD μProf, compares to vtune?
<hkaiser> jedi18[m]: vtune should work on AMD machines, although it might have reduced functionality
<hkaiser> I have never tried using AMDs tools, however
<jedi18[m]> Oh ok cool, I'll try using it on the ranges stuff
<zao> VTune has greatly reduced profiling functionality on non-Intel chips, I think it even rules out any form of hardware-assisted instrumentation. Nowhere in the development specs for the software did the phrase "maybe look at what analogous functionality AMD offers for their CPUs" exist :D
<zao> Not saying it's intentionally worse than it should be, but it's intentionally worse than it should be.
<hkaiser> zao: +1
<hkaiser> zao: Intel is known for crippling their tools on AMDs architectures
<jedi18[m]> That's rude of them xD, but I guess it's expected since they are their primary competitor
<jedi18[m]> Let's hope AMD μProf is good then
<hkaiser> jedi18[m]: for a first assessment, software based analysis techniques are good enough, usually - the hardware assisted stuff is needed only if you already know where your bottlenecks are
<jedi18[m]> Oh ok yeah I don't have any immediate use for it, just wanted to try it out
<hkaiser> sure
<gonidelis[m]> jedi18: been struggling with that quite a lot. vtune hates my ryzen. I was able to do minimal analysis like cpu usage and stuff though
<gonidelis[m]> AMD's uprof is the worst program in the world. not just compared to other profilers, but in general
<gonidelis[m]> bottom line, i would opt for vtune even if they want to treat us like second-class citizens. it's that good.
<jedi18[m]> Oh ok thanks, won't bother trying out uprof then, vtune it is
<gonidelis[m]> jedi18: please don't
<gonidelis[m]> K-ballo: how come this works? better yet, how come this does not work when I uncomment line #16? worse even, I would expect `views::for_each(vv, lambda)` to work. https://wandbox.org/permlink/GBcpLZxsWolqz2cZ
<K-ballo> why wouldn't it work?
<gonidelis[m]> seems like the problem is this
<gonidelis[m]> but according to this https://github.com/ericniebler/range-v3/blob/83783f578e0e6666d68a3bf17b0038a80e62530e/include/range/v3/view/for_each.hpp#L52 I would expect it to be able to accept the rng argument
<K-ballo> can't match what you are saying to the wandbox snippet... tell me with words, why wouldn't that work?
<gonidelis[m]> that's the least important of my three questions because I can understand that it is just a view closure copy assignment, yet I do not see its usage
<K-ballo> ?
<gonidelis[m]> ok ok. bottom line is why that wouldn't work https://wandbox.org/permlink/Mm3kWbjnFqHV6AJu?
<gonidelis[m]> i just took it a step back
<gonidelis[m]> what's your question?
<K-ballo> when you ask why something works or doesn't work you need to say what your expectation actually is
<gonidelis[m]> i expect the second snippet to work
<gonidelis[m]> i expect it to lazily multiply each element of vv by 2
<K-ballo> are you describing the transform view?
<gonidelis[m]> ...
<gonidelis[m]> the for_each vs transform beef dictates that their main difference is that one is done in place
<gonidelis[m]> they sound similar
<K-ballo> no, they are views
<K-ballo> I see neither docs nor tests for the for_each view, but from the implementation it seems to be a join over a transform
<gonidelis[m]> aha
<gonidelis[m]> "Lazily applies an unary function to each element in the source range that returns another range (possibly empty), flattening the result. "
<gonidelis[m]> "Given a source range and a unary function, return a new range where each result element is the result of applying the unary function to a source element."
<gonidelis[m]> excluding the "flattening results part", what's the difference?
<gonidelis[m]> K-ballo: so it means they do approximately the same thing. almost.
<K-ballo> except for the part they are different, they do the same.. is that what you are saying? you are entirely correct
<gonidelis[m]> so the only difference is the flattening result thingy?
<gonidelis[m]> what does flattening the result even mean?
<K-ballo> flattening means going from range of range of T to range of U
<K-ballo> flattening without transformation would go from range of range of T to range of T
<gonidelis[m]> ahh thanks for that. wow
<gonidelis[m]> i see!
<K-ballo> flattened {"a", "bc", "d"} is "abcd"
<gonidelis[m]> with all these things we are saying it sounds like for_each(vv, lambda) should work. aha. got it. that's nice actually
<gonidelis[m]> K-ballo: it's the functor!
<K-ballo> function object
<gonidelis[m]> no, it's views::for_each's accepted functor
<gonidelis[m]> from SO: "You misunderstand what view::for_each() is, it's totally different from std::for_each", oh really? 😅
<K-ballo> views::for_each actually takes a callable
<gonidelis[m]> you talkin about the first or the second arg?
<hkaiser> yes
<gonidelis[m]> which is then specifically cast to an rvalue ref
<hkaiser> yah, they circumvent using forward to save compile time
<hkaiser> that cast is doing the same as std::forward
<gonidelis[m]> huh.... ok that's nice
<hkaiser> gnikunj[m]: yt?
jehelset has joined #ste||ar
<gnikunj[m]> hkaiser: forgot to set an alarm :/
<gnikunj[m]> Ofc it had to happen again
<hkaiser> never happened before - and yet again ;-)
<srinivasyadav227> Or gnikunj c
<gnikunj[m]> When am I getting that alarm clock you talked about?
<gnikunj[m]> I think I'm in desperate need of one ;)
<hkaiser> gnikunj[m]: what if you used your cell phone, it can wake you up as well
<gonidelis[m]> gnikunj: uiuc should be giving them out for free, given how much they exhaust you over there ;p
<gnikunj[m]> Hahahaha true
<hkaiser> gonidelis[m]: nah, he's paying for being punished
<gnikunj[m]> I'll pester the CS dept here for one
<gonidelis[m]> hahaha
<gnikunj[m]> hkaiser: good thing the pay is small ;)
tufei has quit [Remote host closed the connection]
tufei has joined #ste||ar
<hkaiser> gnikunj[m]: most likely mdspan will go into C++23, so no excuses for not looking into striding for vectorization anymore
<gnikunj[m]> I did go through the implementation
<hkaiser> any insights?
<gnikunj[m]> Striding is important!
<gnikunj[m]> I don't think any other runtime supports striding. So I want us to be first!
<gonidelis[m]> what's striding?
<gnikunj[m]> A stride of n is considering elements in order i, i+n,..., i+k*n,...
<gonidelis[m]> what's the proposal then?
<gnikunj[m]> We're trying to get vector pack of strides
<gnikunj[m]> Vector pack in general is applied to contiguous data elements
<gnikunj[m]> So if the user wants stride, the user needs to change the data structure used to have a behavior similar to stride
<gonidelis[m]> wow!
<gonidelis[m]> talkin about convenience
<gnikunj[m]> Yes, having stride makes our vector implementation very general
<hkaiser> gnikunj[m]: but possibly inefficient, so let's try it out!
<gnikunj[m]> Yes, it will be inefficient until we figure out the data locality party
<gnikunj[m]> s/party/part/
<hkaiser> freudian typo ;-)
<gonidelis[m]> hahahahahahahhhahahahaha
<gnikunj[m]> Shhhh no one saw that 🤫
<pedro_barbosa[m]> Hey, I was doing an example with HPXCL and CUDA, and at some point I wanted to replace some values in an array I pass as an argument to the kernel with the values of a smaller array, but I keep getting an error. If I replace the argument array's values with fixed numbers it works fine, but when I try to replace them with a value from another array I get the following error:
<pedro_barbosa[m]> ```
<pedro_barbosa[m]> what(): CudaError: an illegal memory access was encountered at buffer::~buffer Error during synchronization of stream
<pedro_barbosa[m]> ```
<hkaiser> pedro_barbosa[m]: well, it's difficult to know what's wrong without seeing the code
<pedro_barbosa[m]> float* newPos is passed as an argument
<pedro_barbosa[m]> in this example I'm trying to access deviceOffset+index but if I try to access 0 the error persists
<hkaiser> what's newPos?
<hkaiser> where does that come from?
<pedro_barbosa[m]> it's a float* declared on the host and passed as an argument to the kernel, I can send both files if it is easier
<hkaiser> so the code snippet runs on the device?
<pedro_barbosa[m]> Yes
<hkaiser> is the buffer newPos points to (you said it's a host pointer) somehow transferred to the device before executing the code snippet?
<pedro_barbosa[m]> Not sure I understand what you're asking
<hkaiser> you can't just use a host pointer on the device
<pedro_barbosa[m]> yeah I know that, I believe I'm doing it correctly, unless I'm missing something obvious
<pedro_barbosa[m]> These are the source files
<pedro_barbosa[m]> line 27 and 45 on the cpp file has the declaration of both the host array and then the buffer I use to copy to the device
<hkaiser> not sure, I see cudaMallocHost calls only
<pedro_barbosa[m]> line 45
<hkaiser> sorry, I don't understand the code
<hkaiser> I still don't see what's wrong - where is newPos?
<pedro_barbosa[m]> in line 45 I declare the buffer that I'm going to use to copy the newPos to the device, on line 46 I do the copy, line 124 I add it to the argument list and then on line 140 I run the kernel with that argument list
<pedro_barbosa[m]> then if you go to the kernel file on line 72 you can see the function that's being executed with newPos being the 1st argument
<hkaiser> ok
diehlpk has joined #ste||ar
diehlpk has quit [Quit: Leaving.]
diehlpk has joined #ste||ar
RostamLog has joined #ste||ar
<gonidelis[m]> hkaiser: pm
diehlpk has quit [Quit: Leaving.]
diehlpk has joined #ste||ar
akheir has quit [Ping timeout: 264 seconds]
jehelset has quit [Remote host closed the connection]
diehlpk has quit [Quit: Leaving.]