hkaiser changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/ | GSoC: https://github.com/STEllAR-GROUP/hpx/wiki/Google-Summer-of-Code-%28GSoC%29-2020
diehlpk has joined #ste||ar
diehlpk has quit [Changing host]
diehlpk has joined #ste||ar
rtohid has left #ste||ar [#ste||ar]
hkaiser has quit [Quit: bye]
weilewei has quit [Remote host closed the connection]
diehlpk has quit [Ping timeout: 246 seconds]
bita has quit [Read error: Connection reset by peer]
bita has joined #ste||ar
jaafar has quit [Ping timeout: 268 seconds]
nikunj97 has joined #ste||ar
<simbergm> jbjnr: this is an automated reminder to look at https://github.com/STEllAR-GROUP/governance/pull/2
<kordejong> Hi all. I have a simple benchmark that multiplies and divides partitioned arrays for a number of iterations (single node, 4 cores). When I look at a plot of the thread idle rates I see low values (<10%) with spikes (40, 60, even 100%) at regular intervals. A trace loaded in Vampir suggests the spikes in idle-rate correspond with calls to `hpx::agas::server::primary_namespace::decrement_credit_action`. At those moments, the
<kordejong> other three cores show much less activity, while around those calls, all cores are busy. Is this behavior to be expected or can I maybe change/tweak something to prevent these high idle rates?
<heller1> interesting
<heller1> this is because the partitioned vector is using global IDs
<heller1> however, no credit splitting should happen in a non-distributed setting. nevertheless, once the id_type instances go out of scope, there's at least one decrement happening indeed
<heller1> those regular intervals should then correlate with the ends of your iterations, where you let the partitioned vectors go out of scope.
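A minimal sketch of the lifetime mechanism heller1 describes, assuming a trivial component (`partition_server` is a hypothetical stand-in for an array partition; header names may vary between HPX versions):

```cpp
#include <hpx/hpx_main.hpp>
#include <hpx/include/components.hpp>

// Hypothetical component standing in for one array partition.
struct partition_server
  : hpx::components::component_base<partition_server>
{
};

using partition_component = hpx::components::component<partition_server>;
HPX_REGISTER_COMPONENT(partition_component, partition_component);

int main()
{
    {
        // hpx::new_ returns a future<hpx::id_type>; the id_type is a
        // reference-counted global ID managed by AGAS.
        hpx::id_type p = hpx::new_<partition_server>(hpx::find_here()).get();
        // ... use the partition ...
    }   // p goes out of scope here: the reference count drops, which is what
        // shows up in the trace as primary_namespace::decrement_credit_action
    return 0;
}
```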
<kordejong> I am not using the built-in partitioned vector BTW, but my own nD partitioned array. But possibly what you say also holds for my case.
<heller1> ah ok
<heller1> then, I have no idea what you are doing ;)
<jbjnr> Kor de Jong: can you upload your plot somewhere - or vampir data - I could have a look. Doubt I can help, but might be able to spot something unusual
<kordejong> I have put a zip of the trace here: https://kordejong.stackstorage.com/s/zdNxQlxDQaRW0VW Thanks in advance!
<kordejong> Some additional info: the number of iterations is 50, while I count about 20 spikes in idle-rate / calls to `decrement_credit_action`. Also, my array partitions are HPX components, which are referenced by partitioned array instances. In my current case the partitions are all located on a single desktop machine, but on a cluster they get distributed over all available NUMA nodes.
gonidelis has joined #ste||ar
<gonidelis> Where could I find out how the tests work? (in the case of all_of.cpp, for example) What is their purpose, and when and how are they executed?
<heller1> gonidelis: their purpose is to give us some way of telling whether the implementation works or if we broke anything
<heller1> they can be run manually, but are also run at every pull request and merge to master
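For context, an HPX algorithm test is typically a small standalone program built as a CTest target, so something like `ctest -R tests.unit` runs the unit tests manually. A simplified sketch of the usual shape (not the actual contents of all_of.cpp; header names may differ across HPX versions):

```cpp
#include <hpx/hpx_main.hpp>
#include <hpx/include/parallel_algorithm.hpp>
#include <hpx/testing.hpp>

#include <vector>

int main()
{
    std::vector<int> v(100, 1);

    // Run the algorithm under test.
    bool result = hpx::parallel::all_of(hpx::parallel::execution::par,
        v.begin(), v.end(), [](int i) { return i == 1; });

    // Verify the result; HPX_TEST records a failure instead of aborting.
    HPX_TEST(result);

    // Returns nonzero if any HPX_TEST above failed.
    return hpx::util::report_errors();
}
```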
nikunj97 has quit [Remote host closed the connection]
nikunj97 has joined #ste||ar
<jbjnr> Kor de Jong: interesting. I see the pattern you refer to
<jbjnr> is this 4 threads on 1 node, or 4 nodes?
<kordejong> 4 threads on 1 node
<jbjnr> I do not use actions much, so I'm not sure what might be going on, but it'd be interesting to see the code as well, if it's on github or anything.
<jbjnr> ok. Actions are most useful for remote invocations. If you're doing things node-local, maybe you could just call the functions directly and bypass the action stuff?
<jbjnr> (unless you're developing this locally with the plan to go remote later, in which case ignore my comment)
<jbjnr> The problem with actions is that they go through the network layer, which adds some overheads. I suspect you're seeing some bottleneck where the network temporarily stops processing stuff. Hard to say without hands-on. (I'm not volunteering.)
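A sketch of the distinction jbjnr is drawing, assuming a plain action (`square` here is illustrative):

```cpp
#include <hpx/hpx_main.hpp>
#include <hpx/include/actions.hpp>
#include <hpx/include/async.hpp>

int square(int x) { return x * x; }
HPX_PLAIN_ACTION(square, square_action);

int main()
{
    // Remote-capable path: goes through the action machinery (serialization,
    // parcel layer) even when the target locality happens to be local.
    hpx::future<int> f1 = hpx::async(square_action{}, hpx::find_here(), 4);

    // Node-local path: a plain task, bypassing the action machinery entirely.
    hpx::future<int> f2 = hpx::async(square, 4);

    return (f1.get() == 16 && f2.get() == 16) ? 0 : 1;
}
```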
<kordejong> 😉 I would like this to work well in the distributed case, but also in the non-distributed case. With a single implementation.
<jbjnr> ok
<jbjnr> you're using the MPI backend for networking, I assume.
<jbjnr> do the multiply/divides that are delayed 'wait' on results from another core?
<heller1> well, reference counting always implies some overhead
<heller1> the question, however, is whether the decrease in parallelism happens only in non-critical sections of your code
<kordejong> When building on my local Desktop I turn networking off
<kordejong> On the cluster I build HPX with `HPX_WITH_PARCELPORT_MPI=ON`
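(For reference, these two setups correspond to HPX's standard CMake options: configuring with `-DHPX_WITH_NETWORKING=OFF` for the single-node desktop build, and `-DHPX_WITH_PARCELPORT_MPI=ON` for the cluster build.)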
<jbjnr> The credit thing in the plot above takes a very long time. This is disturbing and is no doubt highly indicative of some underlying issue.
<heller1> sure it is
<heller1> nevertheless, I hope this is not a debug build?
<kordejong> There is no waiting in the implementation, except after the iteration has finished. My goal is to make this as asynchronous as possible, and the Vampir plots suggest that things are going well most of the time. Calculations of a next iteration start for some partitions while calculations of a previous iteration are still ongoing for other partitions. One aspect of the implementation is that calculations on partitioned
<kordejong> arrays always return new partitioned arrays. Non of the array partitions are updated.
<heller1> jbjnr: FWIW, I only identified credit counting as an issue on large-scale runs, when a lot of distributed credit counting is happening
<kordejong> DebInfoWithRelease build
<heller1> ReleaseWithDebInfo
<kordejong> Ah yes ;-)
<heller1> so the partitioned arrays are temporary objects most of the time?
<jbjnr> Sorry. I didn't mean waiting in the literal sense, but more like `when(remote).then(multiply)`, implying that the multiply only happens when some remote calculation has completed.
<jbjnr> if those remote things are being delayed ...
<heller1> yeah
<kordejong> <heller1 "so the partitioned arrays are te"> indeed
<heller1> the decref actions are only supposed to execute when an id_type goes out of scope
<heller1> Kor de Jong: so ask yourself when you need them to be distributed, and whether you couldn't separate the actual data from the component
<jbjnr> ^this
<jbjnr> gtg
gonidelis has quit [Remote host closed the connection]
<kordejong> <heller1 "Kor de Jong: so ask yourself whe"> I have trouble understanding your point, but it seems important. A bit more clarification on my part: The array partition component servers are located on different localities and stay there until they go out of scope. The corresponding client instances are used in the implementation of the partitioned array class and are located on the root locality. When performing a
<kordejong> calculation `output_array = f(input_array)` tasks implementing the calculation for individual partitions are being sent to their respective localities, resulting in output partitions on these same localities. Partitioned arrays contain array partitions that may or may not be ready. Algorithms attach continuations to input partitions and result in output partitions. Does this make sense?
<kordejong> Maybe this approach is not ideal in the case of a single locality
<heller1> yes, makes sense
<heller1> think about whether you can implement `output_array = f(input_array)` without components
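One way to read heller1's suggestion is sketched below: keep the numeric data in a plain type passed through futures, and only wrap it in a component where distribution is actually needed. `partition_data` and `multiply` are hypothetical stand-ins for kordejong's types; header names may vary between HPX versions.

```cpp
#include <hpx/hpx_main.hpp>
#include <hpx/include/lcos.hpp>
#include <hpx/include/util.hpp>

#include <cstddef>
#include <utility>
#include <vector>

// Plain data type for one partition: no global ID, hence no AGAS traffic.
using partition_data = std::vector<double>;

partition_data multiply(partition_data const& p, double f)
{
    partition_data r(p.size());
    for (std::size_t i = 0; i != p.size(); ++i)
        r[i] = p[i] * f;
    return r;
}

int main()
{
    // An "array" of not-necessarily-ready partitions.
    std::vector<hpx::future<partition_data>> input;
    input.push_back(hpx::make_ready_future(partition_data(1000, 1.0)));

    // output_array = f(input_array): attach a continuation per partition;
    // each output partition becomes ready as soon as its input does.
    std::vector<hpx::future<partition_data>> output;
    for (auto& in : input)
    {
        output.push_back(hpx::dataflow(
            hpx::util::unwrapping(
                [](partition_data const& p) { return multiply(p, 2.0); }),
            std::move(in)));
    }

    return 0;
}
```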
hkaiser has joined #ste||ar
nikunj97 has quit [Quit: Leaving]
Abhishek09 has joined #ste||ar
Abhishek09 has quit [Remote host closed the connection]
diehlpk has joined #ste||ar
diehlpk has quit [Changing host]
diehlpk has joined #ste||ar
mdiers_ has quit [Ping timeout: 264 seconds]
Pranavug has joined #ste||ar
<hkaiser> simbergm: g'morning
<hkaiser> simbergm: quick q: how do I suppress a clang-tidy error?
<simbergm> hkaiser: sec, I'll give you an example
Vir has quit [Ping timeout: 256 seconds]
<simbergm> hkaiser: // NOLINTNEXTLINE(bugprone-branch-clone)
<simbergm> replace `bugprone-branch-clone` with the actual warning
Pranavug has quit [Client Quit]
<simbergm> I think `// NOLINT(blabla)` on the same line works as well
<hkaiser> thanks! that's exactly what I need
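For the record, the two suppression forms look like this (the warning names are just examples):

```cpp
// Suppress a clang-tidy warning on the next line only:
// NOLINTNEXTLINE(bugprone-branch-clone)
int f(bool cond) { return cond ? 1 : 1; }

// Or suppress on the same line:
int g() { return 42; }    // NOLINT(readability-magic-numbers)
```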
diehlpk has quit [Ping timeout: 246 seconds]
mdiers_ has joined #ste||ar
Vir has joined #ste||ar
Vir has quit [Changing host]
Vir has joined #ste||ar
weilewei has joined #ste||ar
kale_ has joined #ste||ar
Pranavug has joined #ste||ar
Pranavug has quit [Quit: Leaving]
nikunj97 has joined #ste||ar
<weilewei> diehlpk_mobile[m]: INCITE proposal writing webinar: https://www.olcf.ornl.gov/calendar/2021-incite-call-for-proposals-webinar-2/
diehlpk_work has joined #ste||ar
<kale_> Hey, I'm Mahesh Kale. I'm a junior in computer science at IIT Roorkee. I'm currently learning how to make pip packages of projects with binaries. I came across the project of making a pip package for Phylanx. I've already built Phylanx and all its dependencies on my machine. I think the project will be a great learning experience for me. Can you guide me further on how I can proceed with the project?
<zao> Hi there!
rtohid has joined #ste||ar
<nikunj97> kale_, you may want to talk further about the project with rtohid or diehlpk_work
<diehlpk_work> In the end, you have to prepare a proposal by the end of this month
<diehlpk_work> And outline how you would solve the project
<diehlpk_work> The main challenge here is that you have to pack all dependencies within the pip package
<diehlpk_work> Also, someone mentioned that https://docs.conda.io/en/latest/ could be an alternative
<kale_> diehlpk_work, People in the ML field use conda more often than pip. And making a conda package would be easier and more efficient than a pip package.
* zao jiggles
<kale_> diehlpk_work, I'll take a look at how conda packages differ from pip packages so I can get a better understanding of which one would be better.
<heller1> heh
<diehlpk_work> Hashmi, sounds good
karame78 has joined #ste||ar
shahrzad has joined #ste||ar
<kale_> diehlpk_work, I think there can be packages on both conda and pip so that the end user gets a choice while installing
<diehlpk_work> kale_, A good first step would be to compare the tools and send us which tool is better for what we want to do
<nikunj97> kale_, btw conda is not available on clusters, at least I've yet to see a module for it
<nikunj97> so a pip package looks necessary in our case
K-ballo has quit [Remote host closed the connection]
K-ballo has joined #ste||ar
<kale_> diehlpk_work, Thanks for the lead, I'll research for it.
<nikunj97> zao, how's the coronavirus situation at your place btw?
<kale_> nikunj97, I'll research further into the pros and cons of a pip package and the possibility of a conda package on clusters.
<nikunj97> kale_, sounds good
<zao> nikunj97: Most of the HPC site staff are working from home, all higher education at the uni has moved online, and buildings are closed for students. The country in general is still reasonably lax.
<nikunj97> zao, I see. Even we have our universities closed these days. Minimal outings, trying our best not to spread it further
nan222 has joined #ste||ar
RoryH has joined #ste||ar
Abhishek09 has joined #ste||ar
<Abhishek09> rtohid: very happy to see you after a long time
diehlpk has joined #ste||ar
diehlpk has quit [Changing host]
diehlpk has joined #ste||ar
<Abhishek09> diehlpk: Is rtohid here?
shahrzad has quit [Read error: Connection reset by peer]
shahrzad_ has joined #ste||ar
<nikunj97> Abhishek09, they have a meeting right now
<Abhishek09> nikunj97: When will they be free?
<nikunj97> usually goes for like 40min. Don't know how long it'll take today
<Abhishek09> nikunj97: Are you a mentor this year?
<nikunj97> Abhishek09, yes
<Abhishek09> Which project?
<nikunj97> Blaze iterative, concurrent data structures, pip package
<nikunj97> mainly mentoring for Blaze iterative and concurrent data structures though
<nikunj97> for the pip package, rtohid is handling most of the things. I'm here to help with general problems that arise with HPX and Phylanx
<Abhishek09> Which is more beneficial in GSoC, as a student or as a mentor?
<Abhishek09> nikunj97
<nikunj97> This is the first time I'm being a mentor. I don't know the perks of mentors.
<Abhishek09> I was also once a GSoCer myself
<nikunj97> I already had an internship and didn't want to work twice as hard, so I opted for mentorship instead
<Abhishek09> I didn't find ste||ar on LinkedIn
<nikunj97> STE||AR is a name we gave ourselves as an organization. CCT didn't raise an eyebrow, so we're using it officially as an open source org.
<nikunj97> that's why you don't find it on linkedin
<nikunj97> most likely, no one created it there
<nikunj97> zao, would you like a linkedin profile for STE||AR?
<Abhishek09> nikunj97: that means you are in your 2nd or 3rd year
<nikunj97> simbergm, jbjnr your thoughts too on this ^^
shahrzad_ has quit [Ping timeout: 250 seconds]
<nikunj97> Abhishek09, yes, I'm in my 3rd year. I did my gsoc back in my 1st year with STE||AR when I used to have a lot of spare time
<Abhishek09> nikunj97: Does this org need contributions for selection?
<Abhishek09> or is a proposal enough
<nikunj97> Abhishek09, contributions are a way to assess your understanding. So I believe you can prove your case if you have contributions
<nikunj97> contributions can be as simple as a documentation update or as complex as fixing a difficult-to-handle bug
<nikunj97> but it shows your interaction with the community and your understanding of the library
<nikunj97> both of which are crucial for a great proposal
<zao> The GSoC wiki page suggests having something to show the understanding, even if it's a toy futurised matrix-matrix multiply or so.
<zao> nikunj97: I don't use linkedin actively at all, but if there's some sort of affiliation or liking one can do of ste||ar as an organization, I'd do that.
<Abhishek09> I also thought of mentoring this year but I dropped that idea
<nikunj97> zao, well yea. for example, I use STE||AR GROUP as an experience on my linkedin. But I have to manually add the name and there's nothing to search. So if we had a page, that'll be a great idea
<nikunj97> Abhishek09, your reason being?
<nikunj97> zao, that's why I use LSU on the internship I did last year. It made more sense that way.
<Abhishek09> Student participation will make a profile more impressive, rather than mentoring
<Abhishek09> I thought so
<nikunj97> I see. I wanted to give back to the community. Hence, I became a mentor.
<nikunj97> besides, another reason being the lack of time to complete another gsoc project
<zao> Mentorship is kind of indicative of one's capability to lead and collaborate as well.
<Abhishek09> nikunj97: You can also do gsoc along with an internship
<jbjnr> Abhishek09: did you say you have done a project with stellar before? if so, which one please? Thanks
<nikunj97> Abhishek09, ik. I don't want to overburden myself with another gsoc project, when I already have a lot of things going on ;)
<Abhishek09> one senior, Harkirat Singh, has done it (an IIT Roorkee alumnus)
<Abhishek09> jbjnr: not with ste||ar
<nikunj97> Abhishek09, ik people who have done it as well. Again, I don't want to overburden myself
<nikunj97> btw you're from iitr?
<Abhishek09> No
<Abhishek09> you?
<nikunj97> yea, I'm from iitr
<Abhishek09> you know Harkirat?
<nikunj97> yup, he was in his final year when I was a freshman
<jbjnr> Abhishek09: which project was it?
<Abhishek09> Aboutcode organisation
<jbjnr> ok. I had a gsoc student on a different project a few years ago called Abhishek - but I guess it is a common name in India (?)
<nikunj97> jbjnr, haha yea!
<nikunj97> it's a pretty common name to have. Even mine is a common one.
<Abhishek09> Yes , jbjnr
<nikunj97> I know at least 5 other nikunj in my friend circle alone
<Abhishek09> jbjnr: You can see my project details there
<Abhishek09> jbjnr: Have you seen all the details?
<jbjnr> I'm not on LinkedIn so I can't see anything. Do not worry, I didn't want to read it :)
mdiers_ has quit [Remote host closed the connection]
<nikunj97> jbjnr, btw what's your take on having STE||AR as an org on linkedin?
<Abhishek09> jbjnr :)
<simbergm> hkaiser: yt? just for tomorrow's meeting, can you host it? I think we'll be limited to 45 (or 30) minutes if I host it
<simbergm> and if yes, can you just send an email to hpx-devel with the details?
mdiers_ has joined #ste||ar
<hkaiser> simbergm: sure, I'll create a zoom meeting
<hkaiser> simbergm: who should be on ?
<Abhishek09> hkaiser: Is rtohid free now?
<hkaiser> Abhishek09: he's in a meeting right now, I'll tell him to get on here
<hkaiser> ohh, he is on
<nikunj97> hkaiser, about the kokkos integration. Do we have a working hpx backend for kokkos now?
<Abhishek09> hkaiser: Thanks, but no reply from him
<hkaiser> Abhishek09: give him a minute or two
<hkaiser> nikunj97: simbergm is the local expert for this
<nikunj97> hkaiser, ohh, alright.
kale_ has quit [Quit: Leaving]
diehlpk has quit [Ping timeout: 246 seconds]
Abhishek09 has quit [Quit: Ping timeout (120 seconds)]
RoryH has quit [Ping timeout: 240 seconds]
Abhishek09 has joined #ste||ar
Abhishek09 has quit [Client Quit]
<diehlpk_work> hkaiser, Got the approval and submitted the expense report
jaafar has joined #ste||ar
<hkaiser> diehlpk_work: ok, will approve right away
nikunj97 has quit [Ping timeout: 256 seconds]
nikunj97 has joined #ste||ar
<simbergm> nikunj97: yeah
<nikunj97> simbergm, it's only on-node parallelism, right?
<simbergm> Feel free to join the meeting tomorrow if you're interested
<simbergm> Yep
<nikunj97> when's it?
<nikunj97> I'd love to join in too
<simbergm> 3 pm cet
<nikunj97> cet is GMT+1 right?
<simbergm> hkaiser: I think everyone interested in the kokkos meeting is here so you can put the details here as well?
<simbergm> Yep
<nikunj97> alright, I'll join too!
<weilewei> Oh, there is a Kokkos meeting, can I join as well?
<hkaiser> simbergm: https://lsu.zoom.us/j/3340410194, tomorrow 9amCDT/15:00CET
<hkaiser> weilewei: sure, see link above
<weilewei> hkaiser thanks, will be there.
<hkaiser> nikunj97: I believe it's GMT+2
<nikunj97> hkaiser, ohh ok. So 6:30PM IST
<hkaiser> no, you're right, gmt+1
<nikunj97> alright. 7:30PM it is
simbergm has left #ste||ar ["User left"]
akheir has joined #ste||ar
shahrzad_ has joined #ste||ar
shahrzad_ has quit [Ping timeout: 246 seconds]
ahkeir1 has joined #ste||ar
akheir has quit [Ping timeout: 256 seconds]
RoryH has joined #ste||ar
rtohid has quit [Remote host closed the connection]
RoryH has quit [Remote host closed the connection]
shahrzad_ has joined #ste||ar
RoryH has joined #ste||ar
nan222 has quit [Ping timeout: 240 seconds]
ahkeir1 has quit [Read error: Connection reset by peer]
ahkeir1 has joined #ste||ar
shahrzad_ has quit [Ping timeout: 246 seconds]
RoryH has quit [Remote host closed the connection]
weilewei has quit [Remote host closed the connection]
gonidelis has joined #ste||ar
rtohid has joined #ste||ar
<diehlpk_work> hkaiser, Meeting?
<hkaiser> diehlpk_work: sec
shahrzad_ has joined #ste||ar
shahrzad_ has quit [Read error: Connection reset by peer]
shahrzad has joined #ste||ar
shahrzad has quit [Ping timeout: 246 seconds]
shahrzad has joined #ste||ar
shahrzad has quit [Ping timeout: 246 seconds]
rtohid has left #ste||ar [#ste||ar]
<bita> Do we have something like: HPX_TEST_NOT_EQ
<hkaiser> bita: HPX_TEST_NEQ, I believe
<bita> Great, thanks
<hkaiser> bita: ^^
<bita> I usually have a hard time finding macros and their meaning
<hkaiser> it's... nicely underdocumented
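A quick usage sketch of the testing macros mentioned above (the header name here follows HPX's testing module and may differ between versions):

```cpp
#include <hpx/testing.hpp>

int main()
{
    HPX_TEST(1 + 1 == 2);      // passes if the expression is true
    HPX_TEST_EQ(2 + 2, 4);     // passes if both sides compare equal
    HPX_TEST_NEQ(2 + 2, 5);    // passes if both sides compare unequal

    // Returns nonzero if any of the checks above failed.
    return hpx::util::report_errors();
}
```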
gonidelis has quit [Ping timeout: 240 seconds]
folshost has joined #ste||ar
bita_ has joined #ste||ar
diehlpk_work_ has joined #ste||ar
bita has quit [Ping timeout: 246 seconds]
maxwellr96 has quit [Ping timeout: 246 seconds]
diehlpk_work has quit [Ping timeout: 256 seconds]
nikunj97 has quit [Ping timeout: 246 seconds]
diehlpk_work_ has quit [Remote host closed the connection]
K-ballo has quit [Quit: K-ballo]
ahkeir1 has quit [Read error: Connection reset by peer]
K-ballo has joined #ste||ar
ahkeir1 has joined #ste||ar