hkaiser changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar-group.org | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | This channel is logged: irclog.cct.lsu.edu | Everybody: please respond to the documentation survey: https://forms.gle/aCpNkhjSfW55isXGA
K-ballo has quit [Quit: K-ballo]
Yorlik_ has joined #ste||ar
Yorlik has quit [Ping timeout: 255 seconds]
hkaiser has quit [Quit: Bye!]
diehlpk_work has quit [Remote host closed the connection]
<gnikunj[m]>
ms: why do MPI functions like MPI_Comm_size and MPI_Comm_rank take int* as an argument when the values they report are always non-negative? Why can't they simply take unsigned int* instead?
<ms[m]>
gnikunj: um, not sure what you expect me to say... :P it's a very old c api and the signedness probably wasn't the biggest concern when they standardized it, they might even have left it as an int to give them the option of returning negative values in the future, though who knows for what
<ms[m]>
that's the longer version of: I've no clue
<gnikunj[m]>
I think I'll ask it somewhere on the MPI forum then :D
<zao>
Signed values are more portable, like for Java. int is also quite the vocabulary type in C
<gnikunj[m]>
so they're doing it so that porting to other languages becomes easier?
<zao>
The use as a return value also allows for error returns with -1 or -error
<zao>
I'm speculating based on experience in general.
<gnikunj[m]>
right, but all their statuses are positive valued!
<gnikunj[m]>
that's what I was thinking. It would make sense if they returned negative values, but they're all positive anyway.
<zao>
I would blame FORTRAN somehow ;)
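For reference, the calls under discussion look roughly like this in a minimal MPI program: both report their result through an int* out-parameter even though rank and size are never negative, and the int return value carries an error code (MPI_SUCCESS on success).

    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char* argv[])
    {
        MPI_Init(&argc, &argv);

        // Out-parameters are plain int*, the return value is the error code.
        int rank = 0, size = 0;
        if (MPI_Comm_rank(MPI_COMM_WORLD, &rank) != MPI_SUCCESS ||
            MPI_Comm_size(MPI_COMM_WORLD, &size) != MPI_SUCCESS)
        {
            MPI_Abort(MPI_COMM_WORLD, 1);
        }

        std::printf("rank %d of %d\n", rank, size);
        MPI_Finalize();
        return 0;
    }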
<srinivasyadav227>
gnikunj: hey, i think the find_end parallel version is not implemented properly, and neither are the test cases; here is a gist https://gist.github.com/srinivasyadav18/15c30ff7b48f1b7616d37f6f0a11efe1, please have a look. i went through the implementation again: it's only checking if the first element matches, it's not checking whether the whole sequence (from first2 --> last2) matches. the expected output from std::find_end(par, ...) is in the gist
<hkaiser>
do you have a post-release version update lined up already, or would you want me to do that?
<ms[m]>
uh, lined up, but got distracted by other things
<ms[m]>
it's in progress
<hkaiser>
+1, no worries
<ms[m]>
btw, saw your message from yesterday about the sanitizer build messing up things for senders...
<ms[m]>
hopefully it's just temporary then
<ms[m]>
sanitizer support is still pretty new in msvc, no?
<hkaiser>
yes, I was a bit worried there ;-)
K-ballo has joined #ste||ar
<gnikunj[m]>
ms: I'm trying the fork_join_executor and I'm observing erratic results. Sometimes it takes 10us to execute while other times it takes 1ms. Is there something I need to make sure of before I instantiate the executor? (The total grain-size is 25us so I wanted to see how your executor would perform)
<ms[m]>
I'd also increase the number of iterations (it's 10 by default)
<ms[m]>
yeah, that looks sane
<ms[m]>
I don't think it'll go much faster with that tiny array
<ms[m]>
the main point is that it should be significantly faster than the default, but I don't know how much room there is to still improve it
<ms[m]>
at least down to a few k elements it was still on par with openmp (kokkos openmp backend) which I took as "good enough" at the time
<ms[m]>
in your benchmark it might be a good idea to just print the timings of the individual iterations to see if there's variation from iteration to iteration or if it's just e.g. the first iteration that's slow
<gnikunj[m]>
I see. Yes, if it did 10us for me every time, I would want to use it too :P
<ms[m]>
and to do a warmup iteration
<gnikunj[m]>
it is a warmup iteration
<gnikunj[m]>
there's another one that runs before this portion of the code
<gnikunj[m]>
can that interfere somehow? I ensure that all of the threads have been joined before I move to this part of the code.
<ms[m]>
is it a fork-join executor? if yes, it can
<ms[m]>
and does it live over bm_hpx_small_tasks?
<gnikunj[m]>
no, it's parallel_aggregated
<ms[m]>
ok, then you should be good, but do a warmup with the fork_join_executor in that function
<ms[m]>
though again, if I remember correctly what I did it should be ready to go as soon as the constructor returns
<gnikunj[m]>
<ms[m] "and does it live over bm_hpx_sma"> no, that's called separately from main. Within that micro-benchmark I launch new threads and benchmark following which I wait for all threads to complete execution and then return. Then I proceed on calling that small code.
<ms[m]>
ok
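A minimal sketch of the measurement ms[m] suggests, assuming the executor lives in hpx::execution::experimental and the top-level hpx/algorithm.hpp, hpx/execution.hpp and hpx/init.hpp headers (names as in recent HPX releases, worth double-checking): one executor instance is reused, a warmup pass runs untimed, and each iteration is timed separately so outliers stand out.

    #include <hpx/algorithm.hpp>
    #include <hpx/execution.hpp>
    #include <hpx/init.hpp>

    #include <chrono>
    #include <cstdio>
    #include <vector>

    int hpx_main()
    {
        std::vector<double> data(100000, 1.0);

        // One fork_join_executor reused for every iteration; its worker
        // threads busy-wait, so avoid creating several of these at once.
        hpx::execution::experimental::fork_join_executor exec;

        auto run = [&] {
            hpx::for_each(hpx::execution::par.on(exec), data.begin(), data.end(),
                [](double& x) { x *= 2.0; });
        };

        run();    // warmup iteration, not timed

        for (int i = 0; i != 10; ++i)
        {
            auto const start = std::chrono::steady_clock::now();
            run();
            auto const stop = std::chrono::steady_clock::now();
            // print each iteration separately to see whether the variation is
            // systematic or only e.g. the first iteration is slow
            std::printf("iteration %d: %lld us\n", i,
                static_cast<long long>(std::chrono::duration_cast<
                    std::chrono::microseconds>(stop - start).count()));
        }

        return hpx::finalize();
    }

    int main(int argc, char* argv[])
    {
        return hpx::init(argc, argv);
    }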
<gnikunj[m]>
let me try playing with chunk size to see if that has something to do with this
<ms[m]>
creating multiple fork_join_executors is currently not a good idea for the same reason it's not a good idea to mix e.g. hpx and openmp
<ms[m]>
there will be multiple hpx worker threads busy waiting and starving each other's resources
<gnikunj[m]>
right, I saw my code deadlock (or so it seemed) when I used fork_join_executor with parallel_aggregated
<ms[m]>
yep, it should not fully deadlock because of the yielding after some time, but it might significantly slow down
<ms[m]>
actually, it could deadlock because the threads are run with high priority...
<gnikunj[m]>
ah, yeah
<gnikunj[m]>
so a fork_join_executor should only be used in its own scope, while also ensuring that no other hpx thread can interfere?
<ms[m]>
more or less
<ms[m]>
the priority is another option that can be changed, and with normal priority it's a bit less critical to do that
<ms[m]>
but yeah, it's fast because it assumes that it can busy wait and it's hopefully not keeping too much other work from executing
hkaiser has quit [Quit: Bye!]
<gnikunj[m]>
ms: another question: what happens if I give a static_chunk of 100 to a parallel for_loop that goes from 0 to 100? Does this mean only the first thread gets work and all others don't?
<ms[m]>
gnikunj: yes
<gnikunj[m]>
got it. So the other threads, once done executing, can be swapped by the scheduler to do something else?
<gnikunj[m]>
as in load any other HPX thread that was previously waiting on a lock or a new HPX thread that is created by the user after the parallel for
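A sketch of the situation just described, assuming hpx::experimental::for_loop and hpx::execution::static_chunk_size (names as in recent HPX releases): with a chunk size equal to the size of the iteration space, the whole range forms a single chunk and only one worker thread gets work.

    #include <hpx/algorithm.hpp>
    #include <hpx/execution.hpp>
    #include <hpx/init.hpp>

    int hpx_main()
    {
        // chunk size 100 over the range [0, 100): everything fits into one
        // chunk, so a single worker thread executes all iterations and the
        // remaining threads get nothing from this loop.
        hpx::experimental::for_loop(
            hpx::execution::par.with(hpx::execution::static_chunk_size(100)),
            0, 100, [](int i) { /* work on element i */ });

        return hpx::finalize();
    }

    int main(int argc, char* argv[])
    {
        return hpx::init(argc, argv);
    }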
<gnikunj[m]>
because you have small and minimal both defined as small
<ms[m]>
small is the smallest
<ms[m]>
if it makes you happier you can think of "minimal" as "smallest"
<gnikunj[m]>
yes, then don't define minimal if small is the smallest :P
<gnikunj[m]>
minimal sounds like the smallest of them all :P
hkaiser has joined #ste||ar
<ms[m]>
it is though
<ms[m]>
except for no stack
<ms[m]>
it's the smallest one we have
<ms[m]>
and if we ever decide to add a tiny stacksize which is smaller than small then minimal would be tiny
<gnikunj[m]>
so it's all in the name of future proofing ;)
<ms[m]>
not so much that, but also that
<ms[m]>
it's just a "relative stacksize" in some sense, just like the default refers to some actual stacksize and maximal refers to the largest stacksize
<ms[m]>
rachitt_shah, hkaiser, or gnikunj, could you share the gsod meeting link? our webmail (and thus calendar) is down at the moment...
<gonidelis[m]>
srinivasyadav227: what's this algo doing?
<hkaiser>
srinivasyadav227: no reason ;-)
<hkaiser>
different people wrote different tests
<srinivasyadav227>
gonidelis: it just finds the subsequence if it exists in the main sequence (i.e. if [first2, last2) exists in [first, last)) and returns an iterator to where it was found, else it returns last
<gonidelis[m]>
srinivasyadav227: exactly. so what do we need to test for validity?
<srinivasyadav227>
hkaiser: okay cool ;-)
<srinivasyadav227>
gonidelis: yea, just use std::find_end ;-) xD to compare the result
<gonidelis[m]>
lol
<gonidelis[m]>
srinivasyadav227: i like the way you are thinking
<srinivasyadav227>
i am working on it ;-),
<gonidelis[m]>
but the writer knew the answer already, since they planted the subsequence themselves
<gonidelis[m]>
isn't it more "efficinent" for the test to look straightforwardly to that specific index (which the author already knows) ?
<srinivasyadav227>
aah, yes that also works, since he has only 2 elements in the sequence
<gonidelis[m]>
it's not because it's 2 elements
<gonidelis[m]>
srinivasyadav227: it could be a thousand elements. we already know where those are, in the middle of the sequence. we put them there
<gonidelis[m]>
using std::find with a thousand elements would be even worse
<srinivasyadav227>
but we need to check if the whole subsequence exists, right?
<hkaiser>
I think we used std algorithms wherever we generate random sequences
<hkaiser>
gdaiss[m]: it's not sufficient to look at the index
<hkaiser>
you have to make sure that it doesn't find the wrong instance of the given subsequence
<hkaiser>
gonidelis[m]: ^^
<gonidelis[m]>
hkaiser: sure. the test has the potential to provide a false positive.
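A sketch of the kind of check being discussed (the container contents and positions below are made up for illustration): plant the pattern at a known position, then verify both the returned position and that the full subsequence matches there, so an implementation that only compares the first element, or that returns a different plausible-looking occurrence, fails the test. The same assertions would apply to the iterator returned by the parallel find_end overload under discussion.

    #include <algorithm>
    #include <cassert>
    #include <cstddef>
    #include <vector>

    int main()
    {
        std::vector<int> haystack(10007, 0);
        std::vector<int> const needle{1, 2, 3};

        // plant the subsequence at a known position
        std::size_t const pos = haystack.size() / 2;
        std::copy(needle.begin(), needle.end(),
            haystack.begin() + static_cast<std::ptrdiff_t>(pos));

        // std::find_end returns the start of the last occurrence of the
        // pattern, or haystack.end() if it does not occur at all
        auto const it = std::find_end(haystack.begin(), haystack.end(),
            needle.begin(), needle.end());

        // check the position *and* that the whole sequence matches there
        assert(it == haystack.begin() + static_cast<std::ptrdiff_t>(pos));
        assert(std::equal(needle.begin(), needle.end(), it));
        return 0;
    }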