hkaiser changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar-group.org | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | This channel is logged: irclog.cct.lsu.edu | Everybody: please respond to the documentation survey: https://forms.gle/aCpNkhjSfW55isXGA
K-ballo has quit [Quit: K-ballo]
Yorlik_ has joined #ste||ar
Yorlik has quit [Ping timeout: 255 seconds]
hkaiser has quit [Quit: Bye!]
diehlpk_work has quit [Remote host closed the connection]
<gnikunj[m]> ms: why do MPI functions like MPI_Comm_size and MPI_Comm_rank take int* as an argument while all of their statuses are positive? Why can't they simply take unsigned int* instead?
<ms[m]> gnikunj: um, not sure what you expect me to say... :P it's a very old C API and the signedness probably wasn't the biggest concern when they standardized it; they might even have left it as an int to give themselves the option of returning negative values in the future, though who knows for what
<ms[m]> that's the longer version of: I've no clue
<gnikunj[m]> I think I'll ask it somewhere on the MPI forum then :D
<ms[m]> please upvote btw if you haven't done so already: https://www.reddit.com/r/cpp/comments/okbndq/hpx_170_released_the_stear_group/
<gnikunj[m]> ms: done!
<zao> Signed values are more portable, like for Java. int is also quite the vocabulary type in C
<gnikunj[m]> so they're doing it so that porting to other languages becomes easier?
<zao> The use as a return value also allows for error returns with -1 or -error
<zao> I'm speculating based on experience in general.
<gnikunj[m]> right, but all their statuses are positive valued!
<gnikunj[m]> that's what I was thinking. It would make sense if they returned negative values, but they're all positive anyway.
<zao> I would blame FORTRAN somehow ;)
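For reference, a minimal sketch of the API shape being discussed (standard MPI calls, nothing HPX-specific): the rank and size come back through int* out-parameters even though they are never negative, and the signed int return value carries the error code.

    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char* argv[])
    {
        MPI_Init(&argc, &argv);

        // Out-parameters are plain int* even though ranks/sizes are never negative;
        // the int return value is reserved for error codes (MPI_SUCCESS on success).
        int rank = 0, size = 0;
        if (MPI_Comm_rank(MPI_COMM_WORLD, &rank) != MPI_SUCCESS ||
            MPI_Comm_size(MPI_COMM_WORLD, &size) != MPI_SUCCESS)
        {
            std::fprintf(stderr, "MPI query failed\n");
            MPI_Abort(MPI_COMM_WORLD, 1);
        }

        std::printf("rank %d of %d\n", rank, size);
        MPI_Finalize();
        return 0;
    }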
<srinivasyadav227> gnikunj: hey, i think the find_end parallel version is not implemented properly, and neither are the test cases; here is a gist https://gist.github.com/srinivasyadav18/15c30ff7b48f1b7616d37f6f0a11efe1, please have a look. i went through the implementation again: it's just checking if the first element is matched, not whether the whole sequence (from first2 --> last2) is matched; the expected output from std::find_end(par, ...) is in the gist
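A minimal sketch of the kind of check the gist is getting at, with the hpx::find_end(par, ...) and hpx::local::init spellings assumed rather than taken from the gist: the parallel result should agree with std::find_end, i.e. the whole subsequence has to match, not just its first element.

    #include <hpx/local/algorithm.hpp>
    #include <hpx/local/init.hpp>

    #include <algorithm>
    #include <cassert>
    #include <vector>

    int hpx_main(int, char**)
    {
        // The full pattern {3, 4, 5} occurs once (at index 6); the lone 3 at index 2
        // would fool an implementation that only compares the first element.
        std::vector<int> v = {1, 2, 3, 9, 5, 6, 3, 4, 5, 7};
        std::vector<int> sub = {3, 4, 5};

        auto expected = std::find_end(v.begin(), v.end(), sub.begin(), sub.end());
        auto actual = hpx::find_end(
            hpx::execution::par, v.begin(), v.end(), sub.begin(), sub.end());

        assert(actual == expected);    // both should point at the full match at index 6
        return hpx::local::finalize();
    }

    int main(int argc, char* argv[])
    {
        return hpx::local::init(hpx_main, argc, argv);
    }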
hkaiser has joined #ste||ar
<hkaiser> ms[m]: hey, good morning
<ms[m]> hkaiser: hey
<hkaiser> do you have a post-release version update lined up already, or would you want me to do that?
<ms[m]> uh, lined up, but got distracted by other things
<ms[m]> it's in progress
<hkaiser> +1, no worries
<ms[m]> btw, saw your message from yesterday about the sanitizer build messing up things for senders...
<ms[m]> hopefully it's just temporary then
<ms[m]> sanitizer support is still pretty new in msvc, no?
<hkaiser> yes, I was a bit worried there ;-)
K-ballo has joined #ste||ar
<gnikunj[m]> ms: I'm trying the fork_join_executor and I'm observing wildly inconsistent results. Sometimes it takes 10us to execute while other times it takes 1ms. Is there something I need to make sure of before I instantiate the executor? (The total grain-size is 25us so I wanted to see how your executor would perform)
<ms[m]> it has a timeout after which it yields the "worker threads" (plain hpx threads, but specifically for the executor)
<ms[m]> if there's enough work in between it might end up yielding
<ms[m]> and cause latency for the next parallel region
<gnikunj[m]> aah, is there a way to control timeout?
<ms[m]> if you launch multiple parallel regions quickly after each other it should do them all without yielding in between
<ms[m]> yeah, it's "yield_timeout"
<ms[m]> last argument of the constructor
<ms[m]> it's 1 ms by default
<gnikunj[m]> the parallel region is pretty small then
<ms[m]> but then your example only has a for loop of parallel for loops, so it might not be that
<gnikunj[m]> the total grain size there is 25us and 250ns per vertex
<ms[m]> it's 1 ms waiting from the end of the previous parallel region until the next one starts
<ms[m]> not 1 ms in total for the parallel region
<gnikunj[m]> could you elaborate please?
<ms[m]> each "fork-join executor worker thread" busy waits for 1 ms once it's done for the next parallel region to start
<ms[m]> and if there's nothing the worker threads yield
<ms[m]> but that should not be the case here
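A hedged sketch of what that looks like in code, based only on this conversation: the yield timeout is assumed to be the last constructor argument (1 ms by default), and the other parameter names and types shown here are assumptions, not verified against the actual fork_join_executor API.

    #include <hpx/local/algorithm.hpp>
    #include <hpx/local/execution.hpp>

    #include <chrono>
    #include <vector>

    // Assumed to be called from an HPX thread (e.g. inside hpx_main).
    void fork_join_sketch(std::vector<double>& data)
    {
        namespace ex = hpx::execution::experimental;

        // Last argument: how long the worker threads busy-wait for the next parallel
        // region before yielding (1 ms by default, per the discussion). The earlier
        // arguments (priority, stacksize, schedule) are assumptions left at defaults.
        ex::fork_join_executor exec(
            hpx::threads::thread_priority::default_,
            hpx::threads::thread_stacksize::default_,
            ex::fork_join_executor::loop_schedule::static_,
            std::chrono::milliseconds(1));

        // Back-to-back parallel regions arrive well within the timeout, so the
        // workers keep busy-waiting and no region pays the wake-up latency.
        for (int iter = 0; iter != 10; ++iter)
        {
            hpx::for_each(hpx::execution::par.on(exec), data.begin(), data.end(),
                [](double& x) { x *= 2.0; });
        }
    }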
<gnikunj[m]> right, it does get its fair share of work
<ms[m]> ah, actually...
<ms[m]> the chunking might affect things here
<ms[m]> (powers of two chunk sizes)
<ms[m]> not sure
<gnikunj[m]> ah, so the last bit of chunks may have to wait 1ms before the other threads yield?
<ms[m]> but play around with setting a nice chunk size for it, that evenly distributes work
<gnikunj[m]> got it, thanks!
<ms[m]> no, but some worker threads may end up with no work (though I think that should be unlikely)
<ms[m]> how many threads are you using?
<ms[m]> os threads, that is
<gnikunj[m]> 48
<gnikunj[m]> 4 NUMA nodes each with 12 cores (no hyperthreading)
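One way to play with the chunking, sketched under the assumption that a static_chunk_size attached to the policy with .with(...) is respected here as with the other executors: pick a chunk size of roughly range/num_worker_threads so the work is spread evenly instead of leaving some of the 48 workers idle.

    #include <hpx/local/algorithm.hpp>
    #include <hpx/local/execution.hpp>
    #include <hpx/local/runtime.hpp>

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Assumed to be called from an HPX thread (e.g. inside hpx_main).
    template <typename Executor>
    void evenly_chunked_for_each(Executor& exec, std::vector<double>& data)
    {
        // With 48 OS threads and ~100 items, default (power-of-two) chunking can
        // distribute work unevenly; one chunk per worker thread spreads it out.
        std::size_t const num_threads = hpx::get_num_worker_threads();
        std::size_t const chunk = (std::max)(std::size_t(1),
            (data.size() + num_threads - 1) / num_threads);

        hpx::for_each(
            hpx::execution::par.on(exec).with(hpx::execution::static_chunk_size(chunk)),
            data.begin(), data.end(),
            [](double& x) { x += 1.0; });
    }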
<hkaiser> mdiers[m]: btw, #5441 should be fine now
<hkaiser> thanks for taking the time to look
<hkaiser> ms[m]: ^^
<hkaiser> darn autocomplete :/
<ms[m]> :P
<ms[m]> hkaiser: that means "fine now for taking a look" or you fixed it already? :P
<hkaiser> I did fix it
<hkaiser> nah, I did prepare it for you to look ;-)
<ms[m]> thanks ;)
<hkaiser> most welcome
<ms[m]> it's next up then
<ms[m]> gnikunj: you could also run stream just to sanity check things: https://github.com/STEllAR-GROUP/hpx/blob/master/tests/performance/local/stream.cpp
<ms[m]> that's what I was using to check the fork-join executor
<ms[m]> and it should at least be faster than the other executors
<ms[m]> but you may be hitting the next limit with only 100 items
<gnikunj[m]> right, I remember the PR where you shared those graphs
<ms[m]> that would anyway just be 2 elements per worker thread
<gnikunj[m]> let me try running stream, thanks!
<ms[m]> the high variance is something that I wouldn't expect to see though, so if you find something out let me know
<ms[m]> gnikunj: are the work items roughly the same amount of work?
<gnikunj[m]> yes, they're precisely the same amount of work
<gnikunj[m]> ms: ^^
<ms[m]> gnikunj: 👍
<ms[m]> manually checking the chunking (and distribution to worker threads) might be a good thing to do as well
<gnikunj[m]> https://github.com/STEllAR-GROUP/hpx/blob/master/tests/performance/local/stream.cpp#L21 local/algorithm.hpp is a header that came after 1.6?
<ms[m]> gnikunj: yes
<ms[m]> does not include the segmented overloads
<gnikunj[m]> ms: got it. So are we doing hpx/local and hpx/distributed categorizations then?
<ms[m]> not hpx/distributed/x.hpp since hpx/x.hpp includes both, but hpx/local yes
<gnikunj[m]> yeah, that makes sense
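In include form, the split discussed above looks roughly like this (a sketch of the convention, not an exhaustive list):

    // hpx/algorithm.hpp pulls in both the local and the distributed (segmented)
    // overloads; the hpx/local/ variant deliberately leaves the segmented ones out.
    #include <hpx/local/algorithm.hpp>    // local-only algorithm overloads
    // #include <hpx/algorithm.hpp>       // local + segmented/distributed overloads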
<gnikunj[m]> I ran for the default case. If you want specific commandline options, please let me know.
<ms[m]> --executor 3 is the fork-join executor
<ms[m]> 2 is the default parallel_executor
<gnikunj[m]> the results improved significantly!
<ms[m]> I'd also increase the number of iterations (it's 10 by default)
<ms[m]> yeah, that looks sane
<ms[m]> I don't think it'll go much faster with that tiny array
<ms[m]> the main point is that it should be significantly faster than the default, but I don't know how much room there is to still improve it
<ms[m]> at least down to a few k elements it was still on par with openmp (kokkos openmp backend) which I took as "good enough" at the time
<ms[m]> in your benchmark it might be a good idea to just print the timings of the individual iterations to see if there's variation from iteration to iteration or if it's just e.g. the first iteration that's slow
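A small sketch of that measurement pattern (plain std::chrono, nothing HPX-specific): do one untimed warmup pass, then print each iteration separately so a slow first iteration or occasional outliers show up instead of being averaged away.

    #include <chrono>
    #include <cstdio>

    template <typename F>
    void time_iterations(F&& run_parallel_region, int iterations = 10)
    {
        run_parallel_region();    // warmup: thread wakeup, first touch, caches, ...

        for (int i = 0; i != iterations; ++i)
        {
            auto const t0 = std::chrono::steady_clock::now();
            run_parallel_region();
            auto const t1 = std::chrono::steady_clock::now();

            std::printf("iteration %d: %.3f us\n", i,
                std::chrono::duration<double, std::micro>(t1 - t0).count());
        }
    }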
<gnikunj[m]> I see. Yes, if it did 10us for me every time, I would want to use it too :P
<ms[m]> and to do a warmup iteration
<gnikunj[m]> it is a warmup iteration
<gnikunj[m]> there's another one that runs before this portion of the code
<gnikunj[m]> can that interfere somehow? I ensure that all of the threads are joinable before I move to this part of the code.
<ms[m]> is it a fork-join executor? if yes, it can
<ms[m]> and does it live over bm_hpx_small_tasks?
<gnikunj[m]> no, it's parallel_aggregated
<ms[m]> ok, then you should be good, but do a warmup with the fork_join_executor in that function
<ms[m]> though again, if I remember correctly what I did it should be ready to go as soon as the constructor returns
<gnikunj[m]> <ms[m] "and does it live over bm_hpx_sma"> no, that's called separately from main. Within that micro-benchmark I launch new threads and benchmark following which I wait for all threads to complete execution and then return. Then I proceed on calling that small code.
<ms[m]> ok
<gnikunj[m]> let me try playing with chunk size to see if that has something to do with this
<ms[m]> creating multiple fork_join_executors is currently not a good idea for the same reason it's not a good idea to mix e.g. hpx and openmp
<ms[m]> there will be multiple hpx worker threads busy waiting and starving each others resources
<gnikunj[m]> right, I saw my code deadlock (or so it seemed) when I used fork_join_executor with parallel_aggregated
<ms[m]> yep, it should not fully deadlock because of the yielding after some time, but it might significantly slow down
<ms[m]> actually, it could deadlock because the threads are run with high priority...
<gnikunj[m]> ah, yeah
<gnikunj[m]> so a fork_join_executor should only be used in its own scope, while also ensuring that no other hpx thread can interfere?
<ms[m]> more or less
<ms[m]> the priority is another option that can be changed, and with normal priority it's a bit less critical to do that
<ms[m]> but yeah, it's fast because it assumes that it can busy wait and it's hopefully not keeping too much other work from executing
hkaiser has quit [Quit: Bye!]
<gnikunj[m]> ms: another question: what happens if I give a static_chunk of 100 to a parallel for_loop that goes from 0 to 100? Does this mean only the first thread gets work and all the others don't?
<ms[m]> gnikunj: yes
<gnikunj[m]> got it. So the other threads, once done executing, can be swapped by the scheduler to do something else?
<gnikunj[m]> as in load any other HPX thread that was previously waiting on a lock or a new HPX thread that is created by the user after the parallel for
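In code, the situation being asked about looks roughly like this (for_loop and static_chunk_size spellings assumed): one chunk spans the whole range, so the loop becomes a single task on one worker thread, and the remaining threads get nothing from this particular loop.

    #include <hpx/local/algorithm.hpp>
    #include <hpx/local/execution.hpp>

    // Assumed to be called from an HPX thread (e.g. inside hpx_main).
    void single_chunk_loop()
    {
        // Chunk size 100 over the range [0, 100): everything ends up in one chunk,
        // i.e. one task on one worker thread; the other workers are free to pick up
        // whatever other HPX work is pending.
        hpx::for_loop(
            hpx::execution::par.with(hpx::execution::static_chunk_size(100)),
            0, 100,
            [](int /*i*/) { /* ~250ns of work per element */ });
    }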
<ms[m]> gnikunj: why?
<gnikunj[m]> because you have small and minimal both defined as small
<ms[m]> small is the smallest
<ms[m]> if it makes you happier you can think of "minimal" as "smallest"
<gnikunj[m]> yes, then don't define minimal if small is the smallest :P
<gnikunj[m]> minimal sounds like the smallest of them all :P
hkaiser has joined #ste||ar
<ms[m]> it is though
<ms[m]> except for no stack
<ms[m]> it's the smallest one we have
<ms[m]> and if we ever decide to add a tiny stacksize which is smaller than small, then minimal would be tiny
<gnikunj[m]> so it's all in the name of future proofing ;)
<ms[m]> not so much that, but also that
<ms[m]> it's just a "relative stacksize" in some sense, just like the default refers to some actual stacksize and maximal refers to the largest stacksize
<ms[m]> rachitt_shah, hkaiser, or gnikunj, could you share the gsod meeting link? our webmail (and thus calendar) is down at the moment...
<rachitt_shah[m]> Sure
<hkaiser> ms[m]: I have removed the ref-count from hpx::thread::id, this should fix the issue you investigated - thanks again
<ms[m]> hkaiser: np, and thanks
<ms[m]> I suppose we should be fine with that change since it's the same semantics as on master... (the test would be easy to change as well)
<hkaiser> right
tufei has joined #ste||ar
Yorlik_ has quit [Ping timeout: 245 seconds]
tufei has quit [Quit: Leaving]
hkaiser has quit [Quit: Bye!]
diehlpk has joined #ste||ar
<diehlpk> ms[m]: 1.7.0 is available in Fedora
hkaiser has joined #ste||ar
diehlpk1 has joined #ste||ar
diehlpk has quit [Ping timeout: 255 seconds]
diehlpk has joined #ste||ar
diehlpk1 has quit [Ping timeout: 240 seconds]
diehlpk has quit [Client Quit]
hkaiser has quit [Read error: Connection reset by peer]
hkaiser has joined #ste||ar
<srinivasyadav227> hkaiser: any reason why some algorithms in test cases use std algorithms to compare results and some do not?
<gonidelis[m]> srinivasyadav227: what do those who don't use std algorithms use?
<gonidelis[m]> srinivasyadav227: what's this algo doing?
<hkaiser> srinivasyadav227: no reason ;-)
<hkaiser> different people wrote different tests
<srinivasyadav227> gonidelis: it just finds the last occurrence of the subsequence [first2, last2) in the main sequence [first1, last1), if it exists, and returns an iterator to where that occurrence starts, otherwise it returns last1
<gonidelis[m]> srinivasyadav227: exactly. so what do we need to test for validity?
<srinivasyadav227> hkaiser: okay cool ;-)
<srinivasyadav227> gonidelis: yea, just use std::find_end ;-) xD to compare the result
<gonidelis[m]> lol
<gonidelis[m]> srinivasyadav227: i like the way you are thinking
<srinivasyadav227> i am working on it ;-),
<gonidelis[m]> but the writer knew the answer already, since they planted the subsequence themselves
<gonidelis[m]> isn't it more "efficient" for the test to just check that specific index directly (which the author already knows)?
<srinivasyadav227> aah, yes that also works, since he has only 2 elements in the sequence
<gonidelis[m]> it's not because it's 2 elements
<gonidelis[m]> srinivasyadav227: it could be a thousand elements. we already know where those are, in the middle of the sequence. we put them there
<gonidelis[m]> using std::find_end with a thousand elements would be even worse
<srinivasyadav227> but we need to check if the whole subsequence exists, right?
<hkaiser> I think we used std algorithms wherever we generate random sequences
<hkaiser> gdaiss[m]: it's not sufficient to look at the index
<hkaiser> you have to make sure that it doesn't find the wrong instance of the given subsequence
<hkaiser> gonidelis[m]: ^^
<gonidelis[m]> hkaiser: sure. the test has the potential to provide a false positive.
<gonidelis[m]> it's pretty trivial though
<srinivasyadav227> yeah
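A sketch of the concern being discussed, using plain std:: algorithms so it stands alone: plant the pattern twice plus a first-element-only decoy, and require the result to be exactly the last planted position, so both a wrong-instance result and a first-element-only match would fail the test.

    #include <algorithm>
    #include <cassert>
    #include <cstddef>
    #include <vector>

    int main()
    {
        std::vector<int> data(1000, 0);
        std::vector<int> const pattern = {7, 8, 9};

        std::size_t const first_pos = 100, last_pos = 600;
        std::copy(pattern.begin(), pattern.end(), data.begin() + first_pos);
        std::copy(pattern.begin(), pattern.end(), data.begin() + last_pos);
        data[900] = pattern[0];    // decoy: matches the first element only

        // Stand-in for the algorithm under test (a parallel find_end would go here).
        auto const it = std::find_end(
            data.begin(), data.end(), pattern.begin(), pattern.end());

        // Checking against the exact planted position catches both failure modes:
        // returning the earlier occurrence, or matching only the first element.
        assert(it == data.begin() + static_cast<std::ptrdiff_t>(last_pos));
        return 0;
    }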