hkaiser changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar-group.org | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | This channel is logged: irclog.cct.lsu.edu
Yorlik_ has joined #ste||ar
Yorlik has quit [Ping timeout: 246 seconds]
hkaiser has quit [*.net *.split]
Kalium has quit [*.net *.split]
K-ballo has quit [Ping timeout: 240 seconds]
K-ballo has joined #ste||ar
Kalium has joined #ste||ar
K-ballo1 has joined #ste||ar
K-ballo has quit [Ping timeout: 240 seconds]
K-ballo1 is now known as K-ballo
hkaiser has joined #ste||ar
<pansysk75[m]> hkaiser: I was not able to reproduce the failed range_sort tests on a configuration identical to the jenkins one (as far as I can tell)
<pansysk75[m]> Just to confirm, I should be working on a "cuda" node (not on a "jenkins-cuda" one), correct?
<hkaiser> what's the difference?
<pansysk75[m]> the jenkins nodes are similar to the others, just "reserved" for running the CI?
<hkaiser> no, jenkins just runs on rostam, nothing special - different builders do use different slurm partitions, though
<pansysk75[m]> oh, I was looking at the "partition" naming when typing sinfo
<pansysk75[m]> the node names are the ones on the right ("geev" for example)
<hkaiser> the slurm partition a builder uses is defined here, e.g.: https://github.com/STEllAR-GROUP/hpx/blob/master/.jenkins/lsu/slurm-configuration-gcc-10-cuda-11.sh
<hkaiser> pansysk75[m]: I believe, however, that the test error is ephemeral and not caused by your changes
<hkaiser> could be an issue in the test itself, even - not the algorithm (even more so as the range algorithms simply dispatch to the iterator-based ones)
<hkaiser> and those are fine
<pansysk75[m]> yes, makes sense
hkaiser has quit [Quit: Bye!]
hkaiser has joined #ste||ar
karamemp[m] has quit [Quit: You have been kicked for being idle]
<pansysk75[m]> I noticed that a handful of algorithms behave differently when manually setting the chunk size.... (full message at <https://libera.ems.host/_matrix/media/v3/download/libera.chat/5821f7e6c8118d812403ed2a00a811cd8ce06174>)
<pansysk75[m]> This seems like it's not conforming to the approach of the rest of the algorithms
<pansysk75[m]> Check the pics below, where I set the number of chunks on parallel "transform" and "sort" (using static_chunk_size), and I get a maybe unexpected result in the 2nd case
<pansysk75[m]> I mean, it behaves well, but as a user I would expect that decision to be up to me and not the hpx::sort impl
<pansysk75[m]> * Check the pics below, where I set the number of chunks on parallel "transform" and "sort", and I get a maybe unexpected result in the 2nd case
<pansysk75[m]> * Check the pics below, where I set the number of chunks on parallel "transform" and "sort" (using static_chunk_size), and I get a maybe unexpected result in the 2nd case
<hkaiser> pansysk75[m]: yes, I know that sort is different
<hkaiser> how does the changed sort compare against the original version?
<pansysk75[m]> by "changed" you mean?
<hkaiser> the one that uses the chunking interface
<hkaiser> I might have misunderstood what you said
<hkaiser> did you actually change sort ?
<pansysk75[m]> nope, but I'll get back to you with that
<hkaiser> ahh
<hkaiser> ok - cool
<pansysk75[m]> I'm not that concerned about performance, more about uniformity (if that word exists)
<hkaiser> both are important ;-)
<pansysk75[m]> because setting a minimum chunk size is probably a good approach for all parallel algorithms (that's why static_chunk_size probably exists)
<hkaiser> I agree
<pansysk75[m]> will play around a bit and i'll get back to you
<hkaiser> sort was contributed by somebody as a 'one off' contribution to HPX, we never got around to fully integrating it with the scheduling property facilities
<hkaiser> also, sort should use projections (at least the range based version, IIRC), not sure if we actually support that ATM
<hkaiser> pansysk75[m]: actually ... I think sort does use the chunk-size calculation...
<hkaiser> after all - I forgot that we did implement that
<pansysk75[m]> It does do a calculation, and then takes the maximum with the magic number a few lines after that
<pansysk75[m]> The majority of the algorithms call get_bulk_iteration_shape, which also does the same calculation
<hkaiser> well, we don't need the iterator range (the shape) in this case
<pansysk75[m]> So my concern is about
<pansysk75[m]> 1. Capping the chunk_size at an arbitrary number
<pansysk75[m]> 2. Repeating ourselves
<pansysk75[m]> hkaiser: Agree on that
<hkaiser> pansysk75[m]: perf gets really bad if the chunks are becoming too small
<pansysk75[m]> <hkaiser> "pansysk75: perf gets really..." <- I agree
<pansysk75[m]> That's the case with other algorithms as well, but we let the user take the hit when they make the mistake of calling a par impl on very small work, correct? (Chunks = 4*n_cores and all that jazz)
<hkaiser> dkaratza[m]: I committed a minor change to your PR, should be fine now
<hkaiser> pansysk75[m]: try it
hkaiser has quit [Quit: Bye!]
hkaiser has joined #ste||ar
Yorlik_ has quit [Ping timeout: 255 seconds]
K-ballo1 has joined #ste||ar
K-ballo has quit [Ping timeout: 250 seconds]
K-ballo1 is now known as K-ballo
tufei_ has joined #ste||ar
diehlpk_work has quit [Remote host closed the connection]