hkaiser changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/ | GSoC: https://github.com/STEllAR-GROUP/hpx/wiki/Google-Summer-of-Code-%28GSoC%29-2020
nan111 has quit [Remote host closed the connection]
hkaiser has joined #ste||ar
Amy1 has joined #ste||ar
hkaiser has quit [Quit: bye]
Nikunj__ has quit [Read error: Connection reset by peer]
Yorlik has quit [Ping timeout: 264 seconds]
bita_ has quit [Read error: Connection reset by peer]
Amy2 has joined #ste||ar
Amy1 has quit [Ping timeout: 260 seconds]
<tarzeau> where could i ask something about amd mi50 usage in clusters?
mcopik has joined #ste||ar
mcopik has quit [Client Quit]
<mdiers[m]> I use vega and rx570 on workstations, could that help?
gonidelis has joined #ste||ar
<jbjnr> is hkaiser online?
gonidelis has quit [Ping timeout: 245 seconds]
<tarzeau> mdiers[m]: for pytorch and tensorflow (not using cpu but that gpu)?
<mdiers[m]> Interesting, I'm also working on tensorflow right now. There is a tensorflow-rocm docker container. I got it running with singularity and ran some first tests.
Yorlik has joined #ste||ar
<tarzeau> mdiers[m]: tf 1.x or 2.2.0?
<tarzeau> (built yourself using bazel, or pypi binaries)?
nikunj has quit [Ping timeout: 260 seconds]
nikunj has joined #ste||ar
<mdiers[m]> I think it was still 1.15. My problem was getting ROCm via RPM running on the system without affecting other things (VNC/Mesa)
<mdiers[m]> <tarzeau "(built yourself using bazel, or "> dockerhub
gonidelis has joined #ste||ar
<gonidelis> jbjnr it's probably like 05:00 in the morning in Louisiana ;p . He will probably be logged in in about ~2 hours
<mdiers[m]> I got it running, but I haven't gotten any further at the moment, because almost only NVIDIA is available and the priority is a C++/Python interface.
hkaiser has joined #ste||ar
<Yorlik> hkaiser: YT ?
<Yorlik> heller1: My attempts from yesterday (still wonky in many ways): https://i.imgur.com/w1XIHmw.png
<Yorlik> And the inverse (FPS): https://imgur.com/a/ptD19Y5
<Yorlik> The data at smallish object counts is quite chaotic - not sure it's meaningful - I might have to improve the measurements here
<gonidelis> As I am reading past PRs I can see that there is a directory called `hpx/parallel/segmented_algorithms`. What was that about? What is its present name?
<hkaiser> Yorlik: here
<Yorlik> Hello!
<Yorlik> Did you see the image of the measurements I made yesterday?
<hkaiser> Yorlik: 5000ms/frame?
<Yorlik> See the object count
<hkaiser> 5s/frame?
<Yorlik> I need to understand better what happened here and surely there might be errors
<hkaiser> doesn't sound right
<Yorlik> 5 seconds single-threaded for 200k objects?
<gonidelis> hkaiser is there a reason to have segmented_algorithms since ranges have been introduced?
<Yorlik> That's 200k messages sent and processed, plus the corresponding calls into Lua
<hkaiser> gonidelis: segmented algorithms operate on segmented (possibly distributed) data partitions, that's different
<hkaiser> Yorlik: so this is ok?
<gonidelis> hkaiser oh ok... thanks
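For context, a minimal sketch of what the segmented algorithms operate on, assuming an hpx::partitioned_vector split into several segments (names per HPX of that era; details may differ):

```cpp
#include <hpx/include/partitioned_vector.hpp>
#include <hpx/include/parallel_for_each.hpp>

// register the int specialization so it can be distributed
HPX_REGISTER_PARTITIONED_VECTOR(int);

void segmented_example()
{
    // the data lives in 4 segments, possibly on different localities
    hpx::partitioned_vector<int> v(1000, hpx::container_layout(4));

    // the segmented overload of for_each runs on each segment where it lives
    hpx::parallel::for_each(hpx::parallel::execution::par,
        v.begin(), v.end(), [](int& x) { ++x; });
}
```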
<hkaiser> Yorlik: so 25 us/object
<hkaiser> not too bad, true
<Yorlik> Yes
<Yorlik> Including a call into a Lua State and running a script there.
<hkaiser> nod
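A quick worked check of the per-object figure quoted above:

$$\frac{5\,\mathrm{s}}{200{,}000\ \text{objects}} = 25\,\mu\mathrm{s}\ \text{per object}$$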
<Yorlik> I was more interested in what it tells us about our scalability
<hkaiser> ms[m]: sorry for spamming you with review comments
<ms[m]> hkaiser: no worries, sorry and thanks for looking through
<Yorlik> And OFC the measurements have a lot of weaknesses - this is more a rough exploration of the situation than something compliant with scientific standards.
<ms[m]> I didn't really test anything in the PR yet so that was expected...
<hkaiser> Yorlik: if you want scaling plots, then plot something like objects/s or objects/frame
<hkaiser> that should go up linearly, ideally
<Yorlik> Did you see the inverse plot?
<Yorlik> That's the framerate
<Yorlik> The numbers at the lower end for low object counts are bonkers
<hkaiser> I'd plot objects/frame instead
<hkaiser> because, that's what you're interested in, no?
<Yorlik> Yes
<Yorlik> There's a bunch of stuff I could do.
<Yorlik> Maybe that graph, yes
<hkaiser> the fps plot doesn't tell you anything as you might idle
<Yorlik> And then fix some unhandled exceptions I encountered and improve the measurement
<Yorlik> It's the unbounded updater - it never idles
<hkaiser> then it doesn't make sense that you level off when going to higher core numbers
<Yorlik> I think I have measurement errors when the frame time is too low
<hkaiser> fps should theoretically go up linearly with number of cores
<Yorlik> The curve for the higher object counts makes sense
<Yorlik> And FPS is log scale on Y
<hkaiser> doesn't make sense anyways
<hkaiser> why is fps getting worse when running on more cores?
<Yorlik> I think it's an artifact on the low object numbers
<Yorlik> Might even be rounding errors
<hkaiser> 100 objects on 12 cores, that is ~8 objects per core
<Yorlik> == a lot of overhead
<hkaiser> that means the update should take about 200 us per cycle
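The estimate combines the ~25 µs/object figure from above with the per-core share of the objects:

$$\frac{100\ \text{objects}}{12\ \text{cores}} \approx 8\ \text{objects/core}, \qquad 8 \times 25\,\mu\mathrm{s} \approx 200\,\mu\mathrm{s}\ \text{per cycle}$$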
<Yorlik> I used the default chunker
<hkaiser> so you should see scaling (not perfect scaling mind you)
<Yorlik> I think I'll repeat the measurement with the autochunker
<hkaiser> shrug
<hkaiser> something is off with your measurements
<Yorlik> The default chunker splits it up, even if it doesn't make sense at very low object counts
<Yorlik> So it gets inefficient in this extreme edge case
<Yorlik> OFC splitting up 8 objects across 8 cores with short update times doesn't make sense, right?
<Yorlik> I think that is part of the artifact
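The chunking switch mentioned above might look roughly like this (a sketch assuming the parallel algorithms API of that era; the real update loop is more involved):

```cpp
#include <hpx/include/parallel_for_each.hpp>
#include <hpx/include/parallel_executor_parameters.hpp>
#include <vector>

void update_all(std::vector<int>& objects)   // stand-in for the real objects
{
    // default chunker: splits the range across cores even for tiny inputs,
    // which adds overhead at very low object counts
    hpx::parallel::for_each(hpx::parallel::execution::par,
        objects.begin(), objects.end(), [](int& o) { ++o; });

    // auto_chunk_size: times a few iterations first and picks chunk sizes
    // accordingly, which should behave better in the small-count edge case
    hpx::parallel::execution::auto_chunk_size acs;
    hpx::parallel::for_each(hpx::parallel::execution::par.with(acs),
        objects.begin(), objects.end(), [](int& o) { ++o; });
}
```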
<Yorlik> I'll think of a way to automate the measurement, so I don't have to do it all manually (every data point is a manual run and processing of log data)
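A minimal sketch of such an automated sweep (timer usage and output format are assumptions; a real harness would drive the actual update loop and log data):

```cpp
#include <hpx/include/parallel_for_each.hpp>
#include <hpx/timing/high_resolution_timer.hpp>
#include <cstdio>
#include <vector>

void sweep(std::vector<std::size_t> const& object_counts)
{
    for (std::size_t n : object_counts)
    {
        std::vector<int> objects(n);    // stand-in for the real object type
        hpx::util::high_resolution_timer t;

        hpx::parallel::for_each(hpx::parallel::execution::par,
            objects.begin(), objects.end(), [](int& o) { ++o; });

        double frame_time = t.elapsed();    // seconds for this "frame"
        // objects/s is the scaling metric suggested above
        std::printf("%zu objects: %.0f objects/s\n", n, n / frame_time);
    }
}
```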
<gonidelis> how can I find the target for compiling just `/algorithms` ?
<hkaiser> gonidelis: make help | grep algorithms ?
<gonidelis> hkaiser thank you!
<gonidelis> hkaiser why do you use fwiterB and fwiterE in the iterator adaptation? I mean, what do these letters stand for?
<hkaiser> forward iterator begin/end
<gonidelis> oh great! I was searching for something like A,B or 1,2 but that makes more sense ;D =D
<gonidelis> hkaiser I can see that in `for_each.hpp`, `HPX_CONCEPT_REQUIRES_` is used in the parameters of the template declaration, while in `reduce.hpp` (which is the newer + better version of the iterator-based algos) there is `std::enable_if` outside the template parameters. It is actually placed as the return type (??? correct me if I am wrong) of `reduce()`.
<gonidelis> I remember you saying that we use the latter inside the macros to achieve the effect. So do we choose between `enable_if` and `HPX_CONCEPT_REQUIRES_` according to the case, or do we just go with `enable_if` from now on as a more modern solution?
<hkaiser> gonidelis: I don't remember why it's done one way here and another way there
<hkaiser> the macro expands to enable_if anyways, so I think the reduce is older and has not been changed to use the macros
<gonidelis> hkaiser ok i totally get it. I shall prefer going with the MACRO then... (do you think that we should gradually try to turn the `enable_if`s into MACROs?)
<hkaiser> gonidelis: we can do that, the macros help especially if you have more than one condition
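To make the contrast concrete, a simplified sketch of the two styles being compared (the function names are hypothetical; the real constraints in for_each.hpp/reduce.hpp are more elaborate):

```cpp
#include <hpx/concepts/concepts.hpp>
#include <hpx/iterator_support/traits/is_iterator.hpp>
#include <type_traits>

// style seen in reduce.hpp: enable_if spelled out, here in the return type
template <typename InIter, typename T>
typename std::enable_if<hpx::traits::is_iterator<InIter>::value, T>::type
my_reduce(InIter first, InIter last, T init);

// style seen in for_each.hpp: HPX_CONCEPT_REQUIRES_ in the template
// parameter list; the macro expands to an enable_if under the hood
template <typename InIter, typename F,
    HPX_CONCEPT_REQUIRES_(hpx::traits::is_iterator<InIter>::value)>
InIter my_for_each(InIter first, InIter last, F&& f);
```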
<gonidelis> hkaiser ok I will keep it in mind as soon as I manage to adapt `for_each`. Just one last quite important question (sorry for the spam). We know that the `begin` iterator type may differ from the `end` iterator type. What should the iterator type in `algorithm_result<ExPolicy, Iter>` be in the function's result type? I guess it's `IterB`, right?
<jbjnr> hkaiser: I have a memory somewhere that you recently committed an executor wrapper of some kind. I'd like to see it, but I can't remember what it was called. Is it in master or a branch anywhere?
<hkaiser> gonidelis: look at the spec (standard); I think it should be the begin iterator
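So the adapted signature would look roughly like this, with possibly distinct begin/end iterator types and the begin type used in the result (a sketch following the fwiterB/fwiterE convention; exact traits are assumptions):

```cpp
#include <hpx/parallel/util/detail/algorithm_result.hpp>

template <typename ExPolicy, typename FwdIterB, typename FwdIterE, typename F>
typename hpx::parallel::util::detail::algorithm_result<ExPolicy, FwdIterB>::type
my_for_each(ExPolicy&& policy, FwdIterB first, FwdIterE last, F&& f);
```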
<hkaiser> jbjnr: examples/quickstart/executor_with_thread_hooks.cpp
<jbjnr> thanks
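For reference, the rough shape of such a wrapper, running a hook around each task and forwarding everything else to the underlying executor (heavily simplified relative to the example file; the hook is a placeholder):

```cpp
#include <hpx/include/parallel_executors.hpp>
#include <utility>

inline void on_thread_start() {}    // placeholder hook

template <typename BaseExecutor>
struct hooked_executor
{
    BaseExecutor& exec_;

    template <typename F, typename... Ts>
    decltype(auto) async_execute(F&& f, Ts&&... ts)
    {
        // wrap the task so the hook runs first, then forward the wrapped
        // callable and all arguments unchanged to the underlying executor
        auto wrapped = [f = std::forward<F>(f)](auto&&... vs) mutable {
            on_thread_start();
            return f(std::forward<decltype(vs)>(vs)...);
        };
        return hpx::parallel::execution::async_execute(
            exec_, std::move(wrapped), std::forward<Ts>(ts)...);
    }
};
```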
<hkaiser> ms[m], jbjnr, heller1: I sent a mail wrt sponsoring yesterday - care to respond?
<heller1> Yorlik: so you're happy with the performance so far?
<Yorlik> All in all yes - but I feel I need to understand more
<ms[m]> hkaiser: where to? cscs.ch address?
<hkaiser> hpx-pmc ml
<Yorlik> The machine ofc: awesome. But I'd like to automate and improve the measurements
<heller1> hkaiser: awesome, thanks!
<Yorlik> heller: what's interesting is that on certain configurations of cores and workload I triggered exceptions - possibly races around a lock I removed - I needed to reinstate it - will have to try to make it more fine-grained.
<ms[m]> hkaiser: thanks for pinging me, I found a bunch of pmc emails in my spam (sorry if there were some old ones you expected a reply on...)
<hkaiser> Yorlik: that's expected - races tend to show up with higher core counts
<hkaiser> mdiers[m]: any time
<Yorlik> I'll have to investigate more - but first I want to fix some things and automate measuring. 98 manual data points tonight was a bit crazy
<Yorlik> It's also error-prone, ofc.
<jbjnr> all my pmc email goes to spam too ms[m]
<hkaiser> darn, ms[m]: any time
<hkaiser> jbjnr: that's where it belongs ;-)
<jbjnr> and gsoc mostly :(
<jbjnr> hkaiser: I will replace some of my limiting executor with cut'n'paste from your executor wrapper. I like yours better.
<jbjnr> Mine was not forwarding properly
<hkaiser> jbjnr: ok
<heller1> hkaiser: i really like the idea of sponsorship and the general direction
<hkaiser> heller1: great! just send a +1, then (if you don't mind)
<heller1> Didn't I?
<hkaiser> as an email?
<hkaiser> haven't seen it (yet)
<hkaiser> ahh got it now
<ms[m]> hkaiser: just replied, very good initiative!
<hkaiser> thanks!
<mdiers[m]> hkaiser: short?
<hkaiser> mdiers[m]: yah, sorry
<heller1> How do I join the open collective?
<hkaiser> register on their website and give me your nick, I'll add you to the hpx project
<hkaiser> ms[m], jbjnr: same for you ^^
<mdiers[m]> hkaiser: so go ahead
<hkaiser> mdiers[m]: I mistyped your nick and accidentally highlighted your name, sorry
<hkaiser> meant to talk to ms[m]
<hkaiser> added
<hkaiser> you have to approve it, though
<hkaiser> heller1: added
<mdiers[m]> hkaiser: Oh no problem, then I can go on with my project-x the bookshelf wall ;-)
<hkaiser> mdiers[m]: absolutely!
weilewei has quit [Ping timeout: 245 seconds]
weilewei has joined #ste||ar
bita_ has joined #ste||ar
nan111 has joined #ste||ar
<jbjnr> something is fishy. sync execute gives a different task thread_id than the parent task. It used to give the same thread_id
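A sketch of the check behind that observation (executor type and exact API spelling are assumptions):

```cpp
#include <hpx/include/parallel_executors.hpp>
#include <hpx/include/threads.hpp>

void check_sync_execute_thread_id()
{
    hpx::threads::thread_id_type parent = hpx::threads::get_self_id();

    hpx::parallel::execution::parallel_executor exec;
    hpx::parallel::execution::sync_execute(exec, [parent]() {
        // historically sync_execute ran inline on the calling HPX thread,
        // so the two ids matched; the report above says they now differ
        bool same = hpx::threads::get_self_id() == parent;
        (void) same;
    });
}
```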
rtohid has joined #ste||ar
<bita_> hkaiser, do you have a minute?
<hkaiser> bita_: I'm in a meeting right now, could we talk a bit later?
<bita_> of course
<hkaiser> bita_: I have 10 minutes now
<bita_> In this file, https://github.com/STEllAR-GROUP/phylanx/blob/8faae0bed73cdece78f3bdacabb52f40f4e8b5a0/tests/unit/plugins/dist_matrixops/dist_slice_2_loc.cpp, test_slice_column_0 works but test_slice_column_1 doesn't. test_slice_column_1 results in an empty array in one of the localities
<bita_> I think annotation-wise it is Okay, but I am not sure how to make an empty primitive
<hkaiser> ok, what can I do?
<hkaiser> you mean how to return an empty partition?
<bita_> I get the error {what}: Invalid array of elements: HPX(unhandled_exception), followed by: invalid state: thread pool is not running: HPX(invalid_status)
<bita_> yes
<hkaiser> what do you return now?
<hkaiser> a null-sized vector? nil?
<bita_> On locality 1 I return annotate_d([], "array_1_sliced/1", list("tile", list("columns", 0, 0)))
<hkaiser> well, I'd need to run the code to see what's wrong
<hkaiser> what branch should I look at?
<bita_> dist_slice
<hkaiser> ok
<hkaiser> will try later today
<bita_> thank you
Nikunj__ has joined #ste||ar
<nan111> hkaiser, will we have a meeting today?
<hkaiser> nan111: ahh yes, sorry
<hkaiser> nan111: I'm in
nikunj97 has joined #ste||ar
Nikunj__ has quit [Ping timeout: 260 seconds]
<ms[m]> hkaiser: heller just fyi, daint is still unlikely to come back up this week...
<hkaiser> ms[m]: thanks for letting us know
<heller1> hkaiser, gonidelis: FYI, 19:00 Fridays is very bad for me
<gonidelis> heller1 we could change that then...
gonidelis63 has joined #ste||ar
<hkaiser> heller1: what time would work for you?
gonidelis has quit [Ping timeout: 245 seconds]
gonidelis63 is now known as gonidelis
gonidelis has quit [Remote host closed the connection]
gonidelis has joined #ste||ar
<gonidelis> oh I am so sorry. I had a problem with my connection and missed your `base_iterator` messages =( =( if you could please repeat them I would appreciate it
<gonidelis> rori
<rori> sure
weilewei has quit [Remote host closed the connection]
diehlpk_work_ has quit [Remote host closed the connection]
weilewei has joined #ste||ar
diehlpk_work_ has joined #ste||ar
nan111 has quit [Remote host closed the connection]
nan111 has joined #ste||ar
karame_ has quit [Remote host closed the connection]
gonidelis has quit [Ping timeout: 245 seconds]
<heller1> hkaiser, gonidelis: around 4 would be better
<hkaiser> heller1: I can do Fridays 9am/4pm
<hkaiser> rori: how about you?
<rori> perfect for me
<hkaiser> gonidelis: what time would that be for you? 6pm?
karame_ has joined #ste||ar
rtohid has quit [Remote host closed the connection]
<hkaiser> heller1, rori: let's decide when he's back
<rori> 5pm for him I believe
<hkaiser> k
rtohid has joined #ste||ar
akheir has joined #ste||ar
<bita_> hkaiser, using primitive_argument_type(ast::nil{true}, attached_annotation) and representing the result with nil works for my problem. I will ask you if there is a better method in the personal meeting, so debugging that is not a priority - thanks for the offer though
<hkaiser> bita_: nod, thought so
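Based on the constructor bita_ names, the empty partition would be built roughly like this (the helper and the annotation parameter are hypothetical; only the primitive_argument_type constructor is from the log):

```cpp
#include <phylanx/phylanx.hpp>
#include <utility>

// hypothetical helper: represent an empty slice as nil while keeping
// the tiling annotation attached
template <typename Annotation>
phylanx::execution_tree::primitive_argument_type make_empty_partition(
    Annotation&& attached_annotation)
{
    return phylanx::execution_tree::primitive_argument_type(
        phylanx::ast::nil{true}, std::forward<Annotation>(attached_annotation));
}
```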
nan111 has quit [Remote host closed the connection]
weilewei has quit [Remote host closed the connection]
nan111 has joined #ste||ar
rtohid has quit [Remote host closed the connection]
rtohid has joined #ste||ar
nan111 has quit [Remote host closed the connection]
rtohid has quit [Remote host closed the connection]
karame_ has quit [Remote host closed the connection]
akheir1 has joined #ste||ar
akheir has quit [Ping timeout: 240 seconds]
nikunj97 has quit [Quit: Leaving]
nikunj has quit [Read error: Connection reset by peer]
nikunj has joined #ste||ar
nan11 has joined #ste||ar
nikunj has quit [Ping timeout: 265 seconds]
nikunj has joined #ste||ar
nan11 has quit [Remote host closed the connection]
weilewei has joined #ste||ar
nan11 has joined #ste||ar
<K-ballo> we are getting github sponsors now?
nikunj97 has joined #ste||ar
rtohid has joined #ste||ar
<weilewei> K-ballo who?
<K-ballo> STE||AR
<weilewei> or you mean the Acknowledgements part in hpx github?
<K-ballo> maybe I was not supposed to say anything..? https://github.com/sponsors
bita_ has quit [Quit: Leaving]
<jbjnr> K-ballo: It's not that we are getting sponsors - only that we are registering ourselves so that we can receive them one day (if anyone wants to sponsor us)
rtohid has left #ste||ar [#ste||ar]
nan11 has quit [Remote host closed the connection]