#ste||ar on 2020-04-14 — irc logs at irclog.cct.lsu.edu

2020-02-24 20:46 hkaiser changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/ | GSoC: https://github.com/STEllAR-GROUP/hpx/wiki/Google-Summer-of-Code-%28GSoC%29-2020

00:17 Nikunj__ has joined #ste||ar

00:17 nikunj has quit [Ping timeout: 260 seconds]

00:18 nikunj has joined #ste||ar

00:18 nikunj97 has quit [Ping timeout: 260 seconds]

00:21 <hkaiser> nikunj: here

00:21 <Nikunj__> hkaiser, I was wondering if you will allow me to work on the distributed resiliency project

00:22 <Nikunj__> I want to extend that work and get a conference paper out of it

00:22 <hkaiser> Nikunj__: do we have one?

00:22 <Nikunj__> did the project with Sandia guys not extend?

00:22 <hkaiser> we currently try to integrate it with Kokkos to have resiliency of device data

00:23 <Nikunj__> ohh so the project was to test HPX capabilities before integrating it with kokkos?

00:23 <hkaiser> no plans to extend it to distributed at this point

00:24 <Nikunj__> it will be nice to have a distributed model I guess

00:24 <hkaiser> but if you're interested you can work on it anyways ;-)

00:24 <Nikunj__> yes, I believe that our paper was rejected last time coz of the Habenaro guys

00:25 <Nikunj__> if we have a distributed version that performs nicely, we can get a paper out of it

00:25 <hkaiser> nod

00:25 <Nikunj__> and I was already thinking to add even better facilities wrt resiliency so that we can work with node failures

00:26 <Nikunj__> btw can the runtime know if a node fails? or if there's a way in HPX to know that?

00:26 <hkaiser> Nikunj__: at some point heller1's collaborators have work on this

00:26 <Nikunj__> distributed resiliency?

00:27 <hkaiser> Nikunj__: well, timeouts

00:27 <hkaiser> no handling node failures

00:28 <Nikunj__> hkaiser, well yes. But relying on timeouts just make it slow to respond

00:28 <Nikunj__> how does current applications identify node failures for backup and restore?

00:28 <hkaiser> Nikunj__: you could use some watchdog mechanism

00:28 <hkaiser> no idea

00:28 <Nikunj__> hkaiser, yes, that's how I was thinking

00:29 bita has joined #ste||ar

00:29 <Nikunj__> hkaiser, any resources you'd want me to read on this? I think we can work on resiliency side of things on a greater extent

00:29 <hkaiser> Nikunj__: I don't know anything about this

00:30 bita_ has quit [Ping timeout: 260 seconds]

00:30 <Nikunj__> damn, do you know anyone who can help me with the project?

00:30 <hkaiser> Keita?

00:31 <Nikunj__> ohh yes, but will Keita help me with this sort of independent project?

00:31 <hkaiser> he can give some guidance on what to read

00:32 <Nikunj__> alright. I'll email him about this and CC you

00:32 <hkaiser> nod, pls do

00:32 <Nikunj__> is it novel enough though?

00:32 <Nikunj__> I don't want another reviewer accusing of the work being "not novel enough"

00:33 <hkaiser> hah

00:33 <hkaiser> not many production systems exist that do that, if at all

00:34 <Nikunj__> I also want to integrate our system with backup and restore

00:34 <hkaiser> checkpointing

00:34 <Nikunj__> yes checkpointing

00:34 <hkaiser> Nikunj__: Yorlik wants to hav ethat for his game

00:35 <Yorlik> What?

00:35 <Nikunj__> if we can have a mechanism of checkpointing with distributed resiliency, it will be really nice to have

00:35 <hkaiser> indeed

00:35 <hkaiser> checkpoint agas itself

00:35 <Yorlik> Persistent id_types?

00:35 <hkaiser> right

00:35 <Nikunj__> Yorlik, yes

00:35 <Yorlik> Need it for real !!!!

00:36 <Yorlik> It also would have economic implications.

00:36 <Yorlik> Imagine a datacenter rentingh out compute capacity

00:36 <Yorlik> They could have high prio and low prio jobs

00:36 <Yorlik> Just persisting a job, running another and resuming the previous later.

00:37 <Yorlik> So people with less dire computing needs could use spare capacity for cheap :)

00:38 <Yorlik> But ofc, you could do that on a VM level too :(

00:38 <Nikunj__> hkaiser, it's done then. I think this is a cool summer project that is worth exploring

00:39 <hkaiser> Nikunj__: nice!

00:39 <Nikunj__> besides I'll be done with the current JSC by mid May, so this will be nice to explore post that

00:40 <zao> Reminds me of a story of one of our researchers, running on the cluster of the Norwegian weather service many years ago. That machine had the ability to page jobs out to disk when important weather forecast jobs were scheduled.

00:40 <zao> The researcher used the machine so efficiently that it took the system too long to page his job out, consuming the whole forecast slot just to spool it out to disk, at which point it promptly spooled it back in again.

00:41 <hkaiser> zao: you issue a command: kill job:12345, and the system responds with: insert tape

00:41 <Yorlik> I have seen robotic tabe archives loooong ago ..

00:41 <Yorlik> tape

00:41 <zao> Our tape robot is quite alive and well :D

00:41 <Nikunj__> I have never seen tape sadly

00:42 <Yorlik> zao: Is tape still considered to most reliable long term storage?

00:42 <zao> Can't beat tape for bulk storage that still has some sort of retrieval need.

00:43 <zao> Tape is reasonably resilient, and tape libraries do run audits to find tapes that need to be rewritten or replaced.

00:43 <zao> You can also build some redundancy on top by duplication, even to tape libraries on other geographical sites.

00:44 <zao> We use ours for backups and LHC experiment data.

00:45 <Yorlik> LHC data amounts must be massive. You're working at CERN, zao?

00:46 hkaiser_ has joined #ste||ar

00:46 <zao> We're part of the Nordic Tier 1, so we carry and distribute part of the experiment data, as well as provide GRID compute.

00:46 <Yorlik> IC

00:46 <zao> I'm not directly involved, some of my cow-orkers work with them.

00:47 <zao> I migrated some data between tape generations a bunch of years ago, so got to poke around quite a bit at the system.

00:47 <bita> hkaiser, I have been testing that issue. Here, https://gist.github.com/taless474/0621f08e03d13e16d7bad1dd655fb155, the commented one does not pass while the other passes

00:48 <hkaiser_> bita: is that what's causing the '<unknown>' error?

00:48 hkaiser has quit [Ping timeout: 260 seconds]

00:48 <bita> yes

00:48 <hkaiser_> ok, so it's not build-system realted

00:48 <hkaiser_> related*

00:48 <bita> I think the problem is with my retile

00:48 <hkaiser_> I'll try to debug that

00:49 <hkaiser_> tomorrow

00:49 <bita> the problem is I cannot debug it

00:49 <bita> But this last example gave me some ideas about what's wrong

00:49 <zao> Hrm, it's Tuesday tomorrow, should probably hit the sack. Might get some time to HPX:ify some more code.

00:50 <hkaiser_> bita: I hope to have some time tomorrow later in the eveing

00:50 <bita> right, thank you

01:05 nikunj97 has joined #ste||ar

01:06 Nikunj__ has quit [Ping timeout: 240 seconds]

01:24 akheir1 has quit [Quit: Leaving]

01:32 nan11 has quit [Remote host closed the connection]

01:36 <Yorlik> Do we have anyone really good with intrinsics and bit manipulation?

02:01 hkaiser_ has quit [Ping timeout: 260 seconds]

02:22 Nikunj__ has joined #ste||ar

02:26 nikunj97 has quit [Ping timeout: 260 seconds]

02:33 diehlpk_work has quit [Remote host closed the connection]

02:38 bita has quit [Quit: Leaving]

03:20 Nikunj__ is now known as nikunj97

03:33 weilewei has quit [Remote host closed the connection]

04:08 nikunj has quit [Read error: Connection reset by peer]

04:08 nikunj has joined #ste||ar

08:39 mdiers[m] has quit [Quit: killed]

08:39 simbergm has quit [Quit: killed]

08:39 gdaiss[m] has quit [Quit: killed]

08:39 pfluegdk[m] has quit [Quit: killed]

08:39 Guest20891 has quit [Quit: killed]

08:39 diehlpk_mobile[m has quit [Quit: killed]

08:39 kordejong has quit [Quit: killed]

08:39 jbjnr has quit [Quit: killed]

08:39 heller1 has quit [Quit: killed]

08:39 freifrau_von_ble has quit [Quit: killed]

08:40 rori has quit [Quit: killed]

08:40 tiagofg[m] has quit [Quit: killed]

08:50 simbergm has joined #ste||ar

09:01 kordejong has joined #ste||ar

09:02 heller1 has joined #ste||ar

09:02 mdiers[m] has joined #ste||ar

09:02 gdaiss[m] has joined #ste||ar

09:02 rori has joined #ste||ar

09:02 tiagofg[m] has joined #ste||ar

09:02 diehlpk_mobile[m has joined #ste||ar

09:02 freifrau_von_ble has joined #ste||ar

09:02 pfluegdk[m] has joined #ste||ar

09:16 jbjnr has joined #ste||ar

09:16 parsa[m] has joined #ste||ar

09:17 parsa[m] is now known as Guest38238

09:29 <heller1> hkaiser_: good morning. had a look over #4512

09:29 <heller1> hkaiser_: I disagree with your conclusion about requiring shared ownership of the condition variable state

09:37 nikunj has quit [Read error: Connection reset by peer]

09:38 nikunj has joined #ste||ar

09:41 nikunj97 has quit [Ping timeout: 260 seconds]

09:57 nikunj has quit [Ping timeout: 265 seconds]

09:58 nikunj has joined #ste||ar

10:05 nikunj97 has joined #ste||ar

10:13 nikunj97 has quit [Ping timeout: 240 seconds]

12:04 hkaiser has joined #ste||ar

12:35 hkaiser_ has joined #ste||ar

12:37 hkaiser has quit [Ping timeout: 252 seconds]

12:45 hkaiser has joined #ste||ar

12:49 hkaiser_ has quit [Ping timeout: 252 seconds]

12:57 <hkaiser> simbergm: ?

12:59 <heller1> hkaiser: hey

12:59 <hkaiser> hey

12:59 <heller1> i looked at #4512

12:59 <hkaiser> nod, thanks

12:59 <heller1> I am not sure I agree with the changes to condition_variable

12:59 <hkaiser> ok

13:00 <hkaiser> care to explain?

13:00 <heller1> the comment says, the user needs to ensure that there is no wait before the destructor runs

13:01 <heller1> 1) the wait implementation didn't really change

13:01 <hkaiser> wait did change

13:02 <hkaiser> for instance: https://github.com/STEllAR-GROUP/hpx/blob/adding_jthread/libs/synchronization/include/hpx/synchronization/condition_variable.hpp#L93

13:02 <heller1> 2) by adding the shared ownership, it's not the user's responsibility anymore, but the implementation takes care of synchronizing between waits and dtor calls

13:02 <hkaiser> yes, that's the point

13:02 <heller1> yes, that changed

13:03 nikunj97 has joined #ste||ar

13:03 <heller1> the reason why I disagree here is that everything worked before, with those changes, we basically add a pessimization in there

13:05 <hkaiser> I'm not sure that everything worked before, we've seen strange occasional segfaults which could have been caused by this

13:05 <heller1> so my argument is: it's a cornercase, and the spec clearly states that it's the users reponsibility, not that of the implementation, so why let everyone pay for it?

13:06 <hkaiser> also, see libc++: https://github.com/llvm-mirror/libcxx/blob/master/include/condition_variable#L170

13:06 <hkaiser> heller1: correctness first, remember?

13:09 <hkaiser> heller1: not sure if we could detect this case and report it, alternatively

13:09 <heller1> let me think about it once more

13:09 <hkaiser> sure

13:10 <hkaiser> I thought about it for a while (well knowing you would comment on this ;-) but decided to go for it anyways

13:10 <heller1> so, the synchronization point is the unblocking of the wait

13:10 <heller1> :P

13:11 <hkaiser> heller1: also, this would affect only code that is using cv directly, internally we don't do that

13:11 <heller1> which is very interesting ;)

13:11 <hkaiser> right... so it couldn't have caused our issues - didn't think of this

13:12 <hkaiser> if use detail::cv directly everywhere (in future, etc.) and handle the mutex outside of it - we might have similar issues then

13:13 <hkaiser> s/if/we/

13:15 <hkaiser> simbergm: there are still cases where the shared_priority_scheduler uses -1 as an index into arrays

13:15 <heller1> on a side note, it looks like only condition_variable_any is using the shared ownership

13:16 <hkaiser> I have added asserts to #4311

13:16 <hkaiser> does it now?

13:17 <hkaiser> heller1: https://github.com/STEllAR-GROUP/hpx/blob/adding_jthread/libs/synchronization/include/hpx/synchronization/condition_variable.hpp#L166?

13:18 <simbergm> hkaiser: right, I haven't looked at the other cases yet...

13:18 <heller1> so, if a wait is unblocked, and then the dtor of the condition_variable is run before the wait returns, what happens

13:18 <hkaiser> simbergm: sure, no rush - just a heads up

13:18 <heller1> this is the problematic case I believe

13:18 <simbergm> is that on the apex pr? john said he would be looking at it today, but let's see

13:18 <hkaiser> heller1: yes, then the mutex gets destructed before it can be re-acquired

13:19 <hkaiser> simbergm: #4311

13:19 <heller1> hkaiser: I was talking about libcxx, sorry for the confusion

13:19 <hkaiser> ahh, ok - not sure, haven't looked

13:20 <heller1> hkaiser: the mutex is the one that needs to get unblocked, isn't it?

13:20 <hkaiser> but the note in the standard applies to both destructors

13:21 <simbergm> hkaiser: 👍️ if john doesn't look at it today I'll have a look tomorrow

13:21 <heller1> hkaiser: it's also detail::condition_variable::queue_ that's subject to the race

13:21 <hkaiser> simbergm: thanks - and as I said, no rush

13:21 <hkaiser> just keeping the ball in the air

13:22 <simbergm> yep, no worries

13:22 <simbergm> thanks for pushing it!

13:22 <hkaiser> heller1: yah, I have them both in the shared state

13:22 <hkaiser> simbergm: also, is teonik at CSCS?

13:22 <heller1> hkaiser: yes, the standard applies to both destructors, my problem is, that the standard doesn't say that the implementation needs to fix this, the note reads to me as something that a user needs to look out for.

13:23 <simbergm> yeah, he is

13:23 <hkaiser> (no idea what the name is)

13:23 <simbergm> he's working on linear algebra

13:23 <simbergm> teodor

13:23 <hkaiser> ahh, ok

13:23 <simbergm> nikolov

13:23 <hkaiser> thanks

13:23 <simbergm> he's been john's guinea pig for the mpi stuff

13:23 <hkaiser> heller1: why did Howard decide to have it in libc++, then?

13:24 <heller1> hkaiser: I haven't talked to Howard about it ;)

13:24 <nikunj97> isn't HPX supposed to figure out CPU stuff on its own? hpx::init: hpx::exception caught: Currently, HPX_HAVE_MAX_CPU_COUNT is set to 64 while your system has 256 processing units. Please reconfigure HPX with -DHPX_WITH_MAX_CPU_COUNT=256 (or higher) to increase the maximal CPU count supported by HPX.: HPX(invalid_status)

13:25 <heller1> nikunj97: optimization at compile time

13:25 <hkaiser> heller1: I came across it here: https://github.com/josuttis/jthread/blob/master/source/condition_variable_any2.hpp#L184-L202

13:25 <nikunj97> also how is HPX_HAVE_MAX_CPU_COUNT set to 64!

13:25 <hkaiser> nikunj97: it's the default

13:25 <nikunj97> hkaiser, aah! so I compile it with 256 basically then

13:26 <hkaiser> you have to specify it, yes

13:26 <nikunj97> thanks!

13:29 <heller1> hkaiser: is this a change made to C++20?

13:29 <hkaiser> heller1: not sure, let me look

13:30 <hkaiser> heller1: no, the note was in c++11

13:31 <heller1> `Requires: There shall be no thread blocked on *this.`

13:31 <hkaiser> heller1: I trust Howard more than myself ;-)

13:31 <heller1> :P

13:31 <hkaiser> K-ballo: what's your take on this?

13:32 <heller1> https://en.cppreference.com/w/cpp/thread/condition_variable/~condition_variable

13:33 <heller1> this has the note as well

13:33 <hkaiser> yah

13:34 <hkaiser> the change is easy enough to take out, it's a single commit

13:35 <heller1> haven't gotten to the jthread stuff yet...

13:37 <K-ballo> if the thread was blocked on *this then it would never ever unblock

13:37 <hkaiser> karame_: the problem is if the destructor starts running while another thread sits in wait

13:37 <hkaiser> K-ballo: ^^

13:38 <K-ballo> I don't have the rest of the context

13:38 <hkaiser> K-ballo: https://github.com/josuttis/jthread/blob/master/source/condition_variable_any2.hpp#L184-L202

13:38 <K-ballo> if the thread has been signaled already but hasn't yet left the "body" of wait then that's fine

13:38 <K-ballo> there's magic sync on cv destructor

13:39 <K-ballo> but if the thread is still locked when the cv is destroyed, then nobody will ever be able to resume it

13:40 <K-ballo> yeah, those notes are right, and it's what causes cv_any to hold the mutex by pointer iirc

13:40 <hkaiser> we assume that the waiting thread was signalled before the cv is destructed

13:40 <hkaiser> yes

13:40 <hkaiser> that's why I added that to our cv as well, it holds the mutex etc in a shared state now

13:41 <K-ballo> the non-any one?

13:41 <hkaiser> K-ballo: I added it to both

13:41 <hkaiser> is this a problem for _any only?

13:41 <K-ballo> doesn't sound good

13:41 <K-ballo> yep

13:41 <hkaiser> ok, that I can fix

13:41 <K-ballo> the _any variant has to handle 2 locks, I'm not sure what ours does

13:41 <K-ballo> the _any variant combines the user lock with the cv internal lock

13:41 <hkaiser> yah, it has two locks

13:41 <K-ballo> the user lock is kept on the side

13:42 <hkaiser> but I think our cv (non-any) has two locks as well

13:42 <heller1> we have a similar issue, since we have to lock our internal cv

13:42 <K-ballo> odd, should only have the one

13:42 <hkaiser> let me have a look

13:43 <K-ballo> I guess we don't have the mechanics to combine the lock and the wait or something?

13:43 <heller1> hkaiser: I think that change should be seperated

13:43 <heller1> hkaiser: it requires a bit more thinking, I think

13:43 <K-ballo> which change is that?

13:44 <heller1> #4512

13:45 <heller1> so our detail::condition_variable is the rough eqivalent to pthread condition variables, right?

13:46 <K-ballo> ah, we have three cvs, ok

13:47 <heller1> that's where the main confusion comes from, I think

13:51 <heller1> and our decision to almost always default to spinlock as the mutex type

13:55 <K-ballo> what's the related confusion?

13:56 <heller1> about when to use a shared state and when not

13:57 <hkaiser> heller1: ok, I can separate it into a PR

13:57 <heller1> and that our hpx::condition_variable uses two locks (one that is provided externally and one that protects the condition variables itnernal state)

13:58 <heller1> hpx::detail::cv requires a spinlock to protect its state, hpx::cv a hpx::mutex

13:59 <heller1> that has the effect that we'd need the shared ownership for cv and cv_any

14:00 <heller1> which is leading to a nice pessimization for everyone

14:00 <K-ballo> sounds like our cv is effectively a cv_any with a hardcoded mutext ype

14:00 <K-ballo> and our detail::cv is the real cv?

14:00 <heller1> indeed

14:01 <heller1> but with the wrong mutex type

14:04 <K-ballo> or is our mutex the wrong mutex type :P

14:04 <heller1> we use spinlock as the detail::cv

14:20 diehlpk_work has joined #ste||ar

14:23 hkaiser has quit [Ping timeout: 240 seconds]

14:30 hkaiser has joined #ste||ar

14:33 Rory has joined #ste||ar

14:37 weilewei has joined #ste||ar

14:57 Rory has quit [Ping timeout: 240 seconds]

15:01 <nikunj97> hkaiser, see pm pls

15:05 <hkaiser> karame_: yt?

15:06 stmatengss has joined #ste||ar

15:06 stmatengss has left #ste||ar [#ste||ar]

15:11 rtohid has joined #ste||ar

15:12 nan11 has joined #ste||ar

15:12 K-ballo has quit [Remote host closed the connection]

15:13 K-ballo has joined #ste||ar

15:17 karame_ has quit [Remote host closed the connection]

15:18 Rory has joined #ste||ar

15:18 <diehlpk_work> hkaiser, Will you join the octotiger meeting this morning?

15:20 <hkaiser> diehlpk_mobile: yes, I want to

15:20 <diehlpk_work> Good, we have to discuss new reusults

15:20 <diehlpk_work> and keep writing the paper, since we only have one week left

15:28 * rori sent a long message: < https://matrix.org/_matrix/media/r0/download/matrix.org/GqOdcNfAcxPdRXhFvllhhFgk >

15:29 * rori sent a long message: < https://matrix.org/_matrix/media/r0/download/matrix.org/TQjOBLlXeDNClFDtuhvYUHdm >

15:31 akheir has joined #ste||ar

15:32 <nikunj97> diehlpk_work, what conference are you aiming for?

15:32 <diehlpk_work> SC

15:32 <nikunj97> Is SC not canceled yet?

15:33 <nikunj97> ohh wait, that was ISC

15:33 <diehlpk_work> ISC is because this one is much earlier

15:33 <nikunj97> IC

15:36 <diehlpk_work> IMPORTANT NOTICE DUE TO Covid-19 OUTBREAK

15:36 <diehlpk_work> WCCM-ECCMAS 2020 congress is cancelled in July. Alternative options are currently being explored by the organizers along with the IACM and ECCOMAS

15:36 <diehlpk_work> Next conference is canceled

15:37 <diehlpk_work> nikunj97, IC will publish all accepted papers, but there are no talks

15:37 bita has joined #ste||ar

15:37 <diehlpk_work> I assume the same might happen to SC if things are not getting better in November

15:37 <nikunj97> makes sense

15:38 Rory has quit [Remote host closed the connection]

15:49 RostamLog has joined #ste||ar

15:49 akheir has quit [Read error: Connection reset by peer]

15:50 akheir has joined #ste||ar

15:55 Rory has joined #ste||ar

16:01 <weilewei> pwd

16:01 <weilewei> sorry, mistake...

16:15 akheir has quit [Read error: Connection reset by peer]

16:15 akheir1 has joined #ste||ar

17:07 Rory has quit [Ping timeout: 240 seconds]

17:31 wate123_Jun has joined #ste||ar

17:50 akheir1 has quit [Remote host closed the connection]

17:50 akheir1 has joined #ste||ar

18:23 <tiagofg[m]> hello,

18:23 <tiagofg[m]> I noticed that in HPX documentation, in the section "2.5.6 Writing single-node HPX applications", in "channels", the example that is there has a mistake. When I do "cout << c.get();" will fail, because get() return a future.

18:25 <tiagofg[m]> to obtain the value 42, I think the solution may be "cout << c.get().get();" or "cout << c.get(hpx::launch::sync);"

18:28 <hkaiser> tiagofg[m]: yes, correct

18:28 <hkaiser> thanks for noticing

18:28 <hkaiser> rori: yt?

18:28 <hkaiser> nvm

18:29 <rori> yep

18:30 <hkaiser> rori: sorry, nvm

18:30 <tiagofg[m]> no problem

18:36 nan11 has quit [Remote host closed the connection]

18:37 <hkaiser> tiagofg[m]: see #4519

18:42 <tiagofg[m]> ok nice

18:45 rtohid has quit [Remote host closed the connection]

18:46 Nikunj__ has joined #ste||ar

18:48 nan11 has joined #ste||ar

18:49 nikunj_ has joined #ste||ar

18:49 <nikunj_> heller1, marvell thunderx2 (https://en.wikichip.org/wiki/cavium/thunderx2) just hate working on floats

18:49 nikunj97 has quit [Ping timeout: 256 seconds]

18:49 <hkaiser> nikunj_: I poked Keita, he will send papers

18:50 <nikunj_> hkaiser, thanks a lot!

18:51 <nikunj_> heller1, double performs about 2 times better than floats on thunderx2 and simd floats are giving me numbers way above expected at 12GLUPS which is about 1.2x simd double performance

18:52 Nikunj__ has quit [Ping timeout: 264 seconds]

18:52 <nikunj_> heller1, let me plot the graphs and show you the results tomorrow

18:53 <heller1> nikunj_: sounds cool!

18:53 <heller1> hkaiser: I am gonna have a take on the condition variable issue

18:54 <diehlpk_work> nikunj_, Our contact for the access to the new arm arch is not working there anymore

18:54 <diehlpk_work> I sent him an email today and his address bounced

18:54 <nikunj_> diehlpk_work, ohh crap that's why we didn't get access

18:55 <nikunj_> we should talk to someone at Fujitsu for getting access to A64FX

18:55 <diehlpk_work> I contacted the person who signed the form with us and ask whom I should contact instead

18:56 <nikunj_> diehlpk_work, so we can expect access to the processors soon?

19:04 <hkaiser> heller1: I have removed the shared state for cv from the PR

19:04 <heller1> hkaiser: ok, let me have a look then

19:06 <heller1> hkaiser: btw, there is no reason why the polymorphic_executor can't have a templated variadic execute function and friends. You can implement it in such a way that it creates a nullary function object returning void under the hood and passes it to the underlying executor

19:11 <heller1> hkaiser: on a different note, it would have been great to review the stop_token independently from jthread ;)

19:11 <hkaiser> heller1: ok, what about the return type?

19:12 <hkaiser> heller1: those things belong together

19:12 <hkaiser> stop_token and jthread

19:12 <heller1> ok

19:12 <hkaiser> besides, jthread is trivial

19:13 <heller1> stop_token is making our cancelation_token obsolete, doesn't it?

19:15 <hkaiser> could be, yah

19:15 <hkaiser> din't think about that yet

19:15 <hkaiser> we've never had one for real, did we?

19:15 <heller1> well, we have the cancellation points

19:16 <hkaiser> yah, those are intrusive

19:16 <heller1> and friends

19:16 <hkaiser> stop_token is non-intrusive

19:16 <heller1> and the thing to cancel actions?

19:20 <heller1> hkaiser: did you come up with the reference counting for the stop_state or is there a reference implementation?

19:21 <heller1> hkaiser: and: can there only always be one source or more than one?

19:21 <hkaiser> ahh!

19:23 <heller1> the whole implementation of the reference counting and locking looks fishy

19:25 <hkaiser> heller1: be more specific, pls

19:25 <hkaiser> it's normal reference counting, additionally it counts the stop_source instances

19:26 <hkaiser> but this has no implications for the memory management

19:27 <hkaiser> heller1: there can be as many stop_sources as you like

19:27 <heller1> not quite ;)

19:27 <hkaiser> well, 2^31

19:28 <heller1> in any case: you use a bit operation to set the locked flag, but subtract the mask

19:28 <hkaiser> that's the limit for any ref counting we use

19:28 <hkaiser> this never subtracts the mask, it uses the mask for masking only (I hope)

19:29 <heller1> https://github.com/STEllAR-GROUP/hpx/pull/4512/files#diff-481863f5fbb3c1f0684eb8fd8d0be74eR169

19:29 <hkaiser> show me what you mean, pls

19:29 <heller1> unlock

19:29 <heller1> and lock here: https://github.com/STEllAR-GROUP/hpx/pull/4512/files#diff-2038504291428a21aeda6435f19140c9R69

19:30 <hkaiser> ok, what's wrong with the fetch_sub?

19:31 <hkaiser> I could add an assert to make sure it's actually locked

19:31 <heller1> is it equivalent to just unsetting the locked flag in all cases?

19:32 <hkaiser> sure, it's a single bit on an unsigned

19:45 <K-ballo> it could assert the bit was previously set/not-set, but our asserts are pretty heavy

19:47 <heller1> since it is an implementation detail, i trust it being correct. my bit manipulation foo is just a tiny bit rusty

19:59 wate123_Jun has quit [Remote host closed the connection]

20:00 <diehlpk_work> nikunj_, hkaiser I got the new person who is responsible for us

20:00 <diehlpk_work> We will see what he can do for us

20:00 <nikunj_> diehlpk_work great work!

20:03 <diehlpk_work> nikunj, Always here to support HPX

20:03 <nikunj_> \o/

20:03 rtohid has joined #ste||ar

20:03 <heller1> hkaiser: ok, left one comment

20:05 <bita> hkaiser, I have made a retile_3loc example that fails with the same error, so it would be easier to debug

20:07 <nan11> hkaiser, could you please take a look at https://github.com/STEllAR-GROUP/phylanx/pull/1146. It considers three general tiling cases with any number of tiles.

20:23 <weilewei> with lots of hard coded things, distributed G4 seems can run now on two ranks ... it requires tons of array boundary checks... I need to make it more general

20:25 weilewei has quit [Remote host closed the connection]

20:26 weilewei has joined #ste||ar

20:27 <hkaiser> bita: nice, thanks!

20:27 <hkaiser> nan11: will do

20:27 <nan11> Thanks

20:29 K-ballo has quit [Remote host closed the connection]

20:30 K-ballo has joined #ste||ar

20:38 weilewei has quit [Remote host closed the connection]

20:46 weilewei has joined #ste||ar

21:32 nikunj_ has quit [Ping timeout: 252 seconds]

23:36 rtohid has left #ste||ar [#ste||ar]