hkaiser changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar-group.org | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | This channel is logged: irclog.cct.lsu.edu | Everybody: please respond to the documentation survey: https://forms.gle/aCpNkhjSfW55isXGA
K-ballo has quit [Quit: K-ballo]
Yorlik_ has joined #ste||ar
Yorlik has quit [Ping timeout: 255 seconds]
hkaiser has quit [Quit: Bye!]
diehlpk_work has quit [Remote host closed the connection]
<gnikunj[m]> ms: why do MPI functions like MPI_Comm_size and MPI_Comm_rank take int* as an argument while all of their statuses are positive? Why can't they simply take unsigned int* instead?
<ms[m]> gnikunj: um, not sure what you expect me to say... :P it's a very old C API and the signedness probably wasn't the biggest concern when they standardized it; they might even have left it as an int to give themselves the option of returning negative values in the future, though who knows for what
<ms[m]> that's the longer version of: I've no clue
<gnikunj[m]> I think I'll ask it somewhere on the MPI forum then :D
<ms[m]> please upvote btw if you haven't done so already: https://www.reddit.com/r/cpp/comments/okbndq/hpx_170_released_the_stear_group/
<gnikunj[m]> ms: done!
<zao> Signed values are more portable, like for Java. int is also quite the vocabulary type in C
<gnikunj[m]> so they're doing it so that porting to other languages becomes easier?
<zao> The use as a return value also allows for error returns with -1 or -error
<zao> I'm speculating based on experience in general.
<gnikunj[m]> right, but all their statuses are positive valued!
<gnikunj[m]> that's what I was thinking. It would make sense if they returned negative values, but they're all positive anyway.
<zao> I would blame FORTRAN somehow ;)
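For reference, a minimal sketch of the API shape being discussed (standard MPI calls, nothing HPX-specific): the rank and size come back through int* out-parameters even though they are never negative, and the signed int return value carries the error code.

    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char* argv[])
    {
        MPI_Init(&argc, &argv);

        // Out-parameters are plain int* even though ranks/sizes are never negative;
        // the int return value is reserved for error codes (MPI_SUCCESS on success).
        int rank = 0, size = 0;
        if (MPI_Comm_rank(MPI_COMM_WORLD, &rank) != MPI_SUCCESS ||
            MPI_Comm_size(MPI_COMM_WORLD, &size) != MPI_SUCCESS)
        {
            std::fprintf(stderr, "MPI query failed\n");
            MPI_Abort(MPI_COMM_WORLD, 1);
        }

        std::printf("rank %d of %d\n", rank, size);
        MPI_Finalize();
        return 0;
    }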
<srinivasyadav227> gnikunj: hey, i think the find_end parallel version is not implemented properly, and neither are the test cases; here is a gist https://gist.github.com/srinivasyadav18/15c30ff7b48f1b7616d37f6f0a11efe1, please have a look. i went through the implementation again: it's just checking if the first element is matched, not whether the whole sequence (from first2 --> last2) is matched; the expected output from std::find_end(par, ...) is in the gist
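A minimal sketch of the kind of check the gist is getting at, with the hpx::find_end(par, ...) and hpx::local::init spellings assumed rather than taken from the gist: the parallel result should agree with std::find_end, i.e. the whole subsequence has to match, not just its first element.

    #include <hpx/local/algorithm.hpp>
    #include <hpx/local/init.hpp>

    #include <algorithm>
    #include <cassert>
    #include <vector>

    int hpx_main(int, char**)
    {
        // The full pattern {3, 4, 5} occurs once (at index 6); the lone 3 at index 2
        // would fool an implementation that only compares the first element.
        std::vector<int> v = {1, 2, 3, 9, 5, 6, 3, 4, 5, 7};
        std::vector<int> sub = {3, 4, 5};

        auto expected = std::find_end(v.begin(), v.end(), sub.begin(), sub.end());
        auto actual = hpx::find_end(
            hpx::execution::par, v.begin(), v.end(), sub.begin(), sub.end());

        assert(actual == expected);    // both should point at the full match at index 6
        return hpx::local::finalize();
    }

    int main(int argc, char* argv[])
    {
        return hpx::local::init(hpx_main, argc, argv);
    }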
hkaiser has joined #ste||ar
<hkaiser> ms[m]: hey, good morning
<ms[m]> hkaiser: hey
<hkaiser> do you have a post-release version update lined up already, or would you want me to do that?
<ms[m]> uh, lined up, but got distracted by other things
<ms[m]> it's in progress
<hkaiser> +1, no worries
<ms[m]> btw, saw your message from yesterday about the sanitizer build messing up things for senders...
<ms[m]> hopefully it's just temporary then
<ms[m]> sanitizer support is still pretty new in msvc, no?
<hkaiser> yes, I was a bit worried there ;-)
K-ballo has joined #ste||ar
<gnikunj[m]> ms: I'm trying the fork_join_executor and I'm observing wildly inconsistent results. Sometimes it takes 10us to execute while other times it takes 1ms. Is there something I need to make sure of before I instantiate the executor? (The total grain-size is 25us so I wanted to see how your executor would perform)
<ms[m]> it has a timeout after which it yields the "worker threads" (plain hpx threads, but specifically for the executor)
<ms[m]> if there's enough work in between it might end up yielding
<ms[m]> and cause latency for the next parallel region
<gnikunj[m]> aah, is there a way to control timeout?
<ms[m]> if you launch multiple parallel regions quickly after each other it should do them all without yielding in between
<ms[m]> yeah, it's "yield_timeout"
<ms[m]> last argument of the constructor
<ms[m]> it's 1 ms by default
<gnikunj[m]> the parallel region is pretty small then
<ms[m]> but then your example only has a for loop of parallel for loops, so it might not be that
<gnikunj[m]> the total grain size there is 25us and 250ns per vertex
<ms[m]> it's 1 ms waiting from the end of the previous parallel region until the next one starts
<ms[m]> not 1 ms in total for the parallel region
<gnikunj[m]> could you elaborate please?
<ms[m]> each "fork-join executor worker thread" busy waits for 1 ms once it's done for the next parallel region to start
<ms[m]> and if there's nothing the worker threads yield
<ms[m]> but that should not be the case here
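A hedged sketch of what that looks like in code, based only on this conversation: the yield timeout is assumed to be the last constructor argument (1 ms by default), and the other parameter names and types shown here are assumptions, not verified against the actual fork_join_executor API.

    #include <hpx/local/algorithm.hpp>
    #include <hpx/local/execution.hpp>

    #include <chrono>
    #include <vector>

    // Assumed to be called from an HPX thread (e.g. inside hpx_main).
    void fork_join_sketch(std::vector<double>& data)
    {
        namespace ex = hpx::execution::experimental;

        // Last argument: how long the worker threads busy-wait for the next parallel
        // region before yielding (1 ms by default, per the discussion). The earlier
        // arguments (priority, stacksize, schedule) are assumptions left at defaults.
        ex::fork_join_executor exec(
            hpx::threads::thread_priority::default_,
            hpx::threads::thread_stacksize::default_,
            ex::fork_join_executor::loop_schedule::static_,
            std::chrono::milliseconds(1));

        // Back-to-back parallel regions arrive well within the timeout, so the
        // workers keep busy-waiting and no region pays the wake-up latency.
        for (int iter = 0; iter != 10; ++iter)
        {
            hpx::for_each(hpx::execution::par.on(exec), data.begin(), data.end(),
                [](double& x) { x *= 2.0; });
        }
    }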
<gnikunj[m]> right, it does get its fair share of work
<ms[m]> ah, actually...
<ms[m]> the chunking might affect things here
<ms[m]> (powers of two chunk sizes)
<ms[m]> not sure
<gnikunj[m]> ah, so the last bit of chunks may have to wait 1ms before the other threads yield?
<ms[m]> but play around with setting a nice chunk size for it, that evenly distributes work
<gnikunj[m]> got it, thanks!
<ms[m]> no, but some worker threads may end up with no work (though I think that should be unlikely)
<ms[m]> how many threads are you using?
<ms[m]> os threads, that is
<gnikunj[m]> 48
<gnikunj[m]> 4 NUMA nodes each with 12 cores (no hyperthreading)
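One way to play with the chunking, sketched under the assumption that a static_chunk_size attached to the policy with .with(...) is respected here as with the other executors: pick a chunk size of roughly range/num_worker_threads so the work is spread evenly instead of leaving some of the 48 workers idle.

    #include <hpx/local/algorithm.hpp>
    #include <hpx/local/execution.hpp>
    #include <hpx/local/runtime.hpp>

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Assumed to be called from an HPX thread (e.g. inside hpx_main).
    template <typename Executor>
    void evenly_chunked_for_each(Executor& exec, std::vector<double>& data)
    {
        // With 48 OS threads and ~100 items, default (power-of-two) chunking can
        // distribute work unevenly; one chunk per worker thread spreads it out.
        std::size_t const num_threads = hpx::get_num_worker_threads();
        std::size_t const chunk = (std::max)(std::size_t(1),
            (data.size() + num_threads - 1) / num_threads);

        hpx::for_each(
            hpx::execution::par.on(exec).with(hpx::execution::static_chunk_size(chunk)),
            data.begin(), data.end(),
            [](double& x) { x += 1.0; });
    }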
<hkaiser> mdiers[m]: btw, #5441 should be fine now
<hkaiser> thanks for taking the time to look
<hkaiser> ms[m]: ^^
<hkaiser> darn autocomplete :/
<ms[m]> :P
<ms[m]> hkaiser: that means "fine now for taking a look" or you fixed it already? :P
<hkaiser> I did fix it
<hkaiser> nah, I did prepare it for you to look ;-)
<ms[m]> thanks ;)
<hkaiser> most welcome
<ms[m]> it's next up then
<ms[m]> gnikunj: you could also run stream just to sanity check things: https://github.com/STEllAR-GROUP/hpx/blob/master/tests/performance/local/stream.cpp
<ms[m]> that's what I was using to check the fork-join executor
<ms[m]> and it should at least be faster than the other executors
<ms[m]> but you may be hitting the next limit with only 100 items
<gnikunj[m]> right, I remember the PR where you shared those graphs
<ms[m]> that would anyway just be 2 elements per worker thread
<gnikunj[m]> let me try running stream, thanks!
<ms[m]> the high variance is something that I wouldn't expect to see though, so if you find something out let me know
<ms[m]> gnikunj: are the work items roughly the same amount of work?
<gnikunj[m]> yes, they're precisely the same amount of work
<gnikunj[m]> ms: ^^
<ms[m]> gnikunj: 👍
<ms[m]> manually checking the chunking (and distribution to worker threads) might be a good thing to do as well
<gnikunj[m]> https://github.com/STEllAR-GROUP/hpx/blob/master/tests/performance/local/stream.cpp#L21 local/algorithm.hpp is a header that came after 1.6?
<ms[m]> gnikunj: yes
<ms[m]> does not include the segmented overloads
<gnikunj[m]> ms: got it. So are we doing hpx/local and hpx/distributed categorizations then?
<ms[m]> not hpx/distributed/x.hpp since hpx/x.hpp includes both, but hpx/local yes
<gnikunj[m]> yeah, that makes sense
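In include form, the split discussed above looks roughly like this (a sketch of the convention, not an exhaustive list):

    // hpx/algorithm.hpp pulls in both the local and the distributed (segmented)
    // overloads; the hpx/local/ variant deliberately leaves the segmented ones out.
    #include <hpx/local/algorithm.hpp>    // local-only algorithm overloads
    // #include <hpx/algorithm.hpp>       // local + segmented/distributed overloads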
<gnikunj[m]> I ran for the default case. If you want specific commandline options, please let me know.
<ms[m]> --executor 3 is the fork-join executor
<ms[m]> 2 is the default parallel_executor
<gnikunj[m]> the results improved significantly!
<ms[m]> I'd also increase the number of iterations (it's 10 by default)
<ms[m]> yeah, that looks sane
<ms[m]> I don't think it'll go much faster with that tiny array
<ms[m]> the main point is that it should be significantly faster than the default, but I don't know how much room there is to still improve it
<ms[m]> at least down to a few k elements it was still on par with openmp (kokkos openmp backend) which I took as "good enough" at the time
<ms[m]> in your benchmark it might be a good idea to just print the timings of the individual iterations to see if there's variation from iteration to iteration or if it's just e.g. the first iteration that's slow
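A small sketch of that measurement pattern (plain std::chrono, nothing HPX-specific): do one untimed warmup pass, then print each iteration separately so a slow first iteration or occasional outliers show up instead of being averaged away.

    #include <chrono>
    #include <cstdio>

    template <typename F>
    void time_iterations(F&& run_parallel_region, int iterations = 10)
    {
        run_parallel_region();    // warmup: thread wakeup, first touch, caches, ...

        for (int i = 0; i != iterations; ++i)
        {
            auto const t0 = std::chrono::steady_clock::now();
            run_parallel_region();
            auto const t1 = std::chrono::steady_clock::now();

            std::printf("iteration %d: %.3f us\n", i,
                std::chrono::duration<double, std::micro>(t1 - t0).count());
        }
    }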
<gnikunj[m]> I see. Yes, if it did 10us for me every time, I would want to use it too :P
<ms[m]> and to do a warmup iteration
<gnikunj[m]> it is a warmup iteration
<gnikunj[m]> there's another one that runs before this portion of the code
<gnikunj[m]> can that interfere somehow? I ensure that all of the threads are joinable before I move to this part of the code.
<ms[m]> is it a fork-join executor? if yes, it can
<ms[m]> and does it live over bm_hpx_small_tasks?
<gnikunj[m]> no, it's parallel_aggregated
<ms[m]> ok, then you should be good, but do a warmup with the fork_join_executor in that function
<ms[m]> though again, if I remember correctly what I did it should be ready to go as soon as the constructor returns
<gnikunj[m]> <ms[m] "and does it live over bm_hpx_sma"> no, that's called separately from main. Within that micro-benchmark I launch new threads and benchmark following which I wait for all threads to complete execution and then return. Then I proceed on calling that small code.
<ms[m]> ok
<gnikunj[m]> let me try playing with chunk size to see if that has something to do with this
<ms[m]> creating multiple fork_join_executors is currently not a good idea for the same reason it's not a good idea to mix e.g. hpx and openmp
<ms[m]> there will be multiple hpx worker threads busy waiting and starving each others resources
<gnikunj[m]> right, I saw my code deadlock (or so it seemed) when I used fork_join_executor with parallel_aggregated
<ms[m]> yep, it should not fully deadlock because of the yielding after some time, but it might significantly slow down
<ms[m]> actually, it could deadlock because the threads are run with high priority...
<gnikunj[m]> ah, yeah
<gnikunj[m]> so a fork_join_executor should only be used in its own scope, while also ensuring that no other hpx thread can interfere?
<ms[m]> more or less
<ms[m]> the priority is another option that can be changed, and with normal priority it's a bit less critical to do that
<ms[m]> but yeah, it's fast because it assumes that it can busy wait and it's hopefully not keeping too much other work from executing
hkaiser has quit [Quit: Bye!]
<gnikunj[m]> ms: another question: what happens if I give a static_chunk of 100 to a parallel for_loop that goes from 0 to 100? Does this mean only the first thread gets work and all the others don't?
<ms[m]> gnikunj: yes
<gnikunj[m]> got it. So the other threads, once done executing, can be swapped by the scheduler to do something else?
<gnikunj[m]> as in load any other HPX thread that was previously waiting on a lock or a new HPX thread that is created by the user after the parallel for
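In code, the situation being asked about looks roughly like this (for_loop and static_chunk_size spellings assumed): one chunk spans the whole range, so the loop becomes a single task on one worker thread, and the remaining threads get nothing from this particular loop.

    #include <hpx/local/algorithm.hpp>
    #include <hpx/local/execution.hpp>

    // Assumed to be called from an HPX thread (e.g. inside hpx_main).
    void single_chunk_loop()
    {
        // Chunk size 100 over the range [0, 100): everything ends up in one chunk,
        // i.e. one task on one worker thread; the other workers are free to pick up
        // whatever other HPX work is pending.
        hpx::for_loop(
            hpx::execution::par.with(hpx::execution::static_chunk_size(100)),
            0, 100,
            [](int /*i*/) { /* ~250ns of work per element */ });
    }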
<ms[m]> gnikunj: why?
<gnikunj[m]> because you have small and minimal both defined as small
<ms[m]> small is the smallest
<ms[m]> if it makes you happier you can think of "minimal" as "smallest"
<gnikunj[m]> yes, then don't define minimal if small is the smallest :P
<gnikunj[m]> minimal sounds like the smallest of them all :P
hkaiser has joined #ste||ar
<ms[m]> it is though
<ms[m]> except for no stack
<ms[m]> it's the smallest one we have
<ms[m]> and if we ever decide to add a tiny stacksize which is smaller than small, then minimal would be tiny
<gnikunj[m]> so it's all in the name of future proofing ;)
<ms[m]> not so much that, but also that
<ms[m]> it's just a "relative stacksize" in some sense, just like the default refers to some actual stacksize and maximal refers to the largest stacksize
<ms[m]> rachitt_shah, hkaiser, or gnikunj, could you share the gsod meeting link? our webmail (and thus calendar) is down at the moment...
<rachitt_shah[m]> Sure
<hkaiser> ms[m]: I have removed the ref-count from hpx::thread::id, this should fix the issue you investigated - thanks again
<ms[m]> hkaiser: np, and thanks
<ms[m]> I suppose we should be fine with that change since it's the same semantics as on master... (the test would be easy to change as well)
<hkaiser> right
tufei has joined #ste||ar
Yorlik_ has quit [Ping timeout: 245 seconds]
tufei has quit [Quit: Leaving]
hkaiser has quit [Quit: Bye!]
diehlpk has joined #ste||ar
<diehlpk> ms[m]: 1.7.0 is available in Fedora
hkaiser has joined #ste||ar
diehlpk1 has joined #ste||ar
diehlpk has quit [Ping timeout: 255 seconds]
diehlpk has joined #ste||ar
diehlpk1 has quit [Ping timeout: 240 seconds]
diehlpk has quit [Client Quit]
hkaiser has quit [Read error: Connection reset by peer]
hkaiser has joined #ste||ar
<srinivasyadav227> hkaiser: any reason why some algorithms in test cases use std algorithms to compare results and some do not?
<gonidelis[m]> srinivasyadav227: what do those who don't use std algorithms use?
<gonidelis[m]> srinivasyadav227: what's this algo doing?
<hkaiser> srinivasyadav227: no reason ;-)
<hkaiser> different people wrote different tests
<srinivasyadav227> gonidelis: it just finds the last occurrence of the subsequence [first2, last2) in the main sequence [first1, last1), if it exists, and returns an iterator to where that occurrence starts, otherwise it returns last1
<gonidelis[m]> srinivasyadav227: exactly. so what do we need to test for validity?
<srinivasyadav227> hkaiser: okay cool ;-)
<srinivasyadav227> gonidelis: yea, just use std::find_end ;-) xD to compare the result
<gonidelis[m]> lol
<gonidelis[m]> srinivasyadav227: i like the way you are thinking
<srinivasyadav227> i am working on it ;-),
<gonidelis[m]> but the writer knew the answer already, since they planted the subsequence themselves
<gonidelis[m]> isn't it more "efficient" for the test to just check that specific index directly (which the author already knows)?
<srinivasyadav227> aah, yes that also works, since he has only 2 elements in the sequence
<gonidelis[m]> it's not because it's 2 elements
<gonidelis[m]> srinivasyadav227: it could be a thousand elements. we already know where those are, in the middle of the sequence. we put them there
<gonidelis[m]> using std::find_end with a thousand elements would be even worse
<srinivasyadav227> but we need to check if the whole subsequence exists, right?
<hkaiser> I think we used std algorithms wherever we generate random sequences
<hkaiser> gdaiss[m]: it's not sufficient to look at the index
<hkaiser> you have to make sure that it doesn't find the wrong instance of the given subsequence
<hkaiser> gonidelis[m]: ^^
<gonidelis[m]> hkaiser: sure. the test has the potential to provide a false positive.
<gonidelis[m]> it's pretty trivial though
<srinivasyadav227> yeah
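A sketch of the concern being discussed, using plain std:: algorithms so it stands alone: plant the pattern twice plus a first-element-only decoy, and require the result to be exactly the last planted position, so both a wrong-instance result and a first-element-only match would fail the test.

    #include <algorithm>
    #include <cassert>
    #include <cstddef>
    #include <vector>

    int main()
    {
        std::vector<int> data(1000, 0);
        std::vector<int> const pattern = {7, 8, 9};

        std::size_t const first_pos = 100, last_pos = 600;
        std::copy(pattern.begin(), pattern.end(), data.begin() + first_pos);
        std::copy(pattern.begin(), pattern.end(), data.begin() + last_pos);
        data[900] = pattern[0];    // decoy: matches the first element only

        // Stand-in for the algorithm under test (a parallel find_end would go here).
        auto const it = std::find_end(
            data.begin(), data.end(), pattern.begin(), pattern.end());

        // Checking against the exact planted position catches both failure modes:
        // returning the earlier occurrence, or matching only the first element.
        assert(it == data.begin() + static_cast<std::ptrdiff_t>(last_pos));
        return 0;
    }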