hkaiser changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/ | GSoC: https://github.com/STEllAR-GROUP/hpx/wiki/Google-Summer-of-Code-%28GSoC%29-2020
Nikunj__ has joined #ste||ar
nikunj has quit [Ping timeout: 260 seconds]
nikunj has joined #ste||ar
nikunj97 has quit [Ping timeout: 260 seconds]
<hkaiser> nikunj: here
<Nikunj__> hkaiser, I was wondering if you will allow me to work on the distributed resiliency project
<Nikunj__> I want to extend that work and get a conference paper out of it
<hkaiser> Nikunj__: do we have one?
<Nikunj__> did the project with Sandia guys not extend?
<hkaiser> we're currently trying to integrate it with Kokkos to get resiliency for device data
<Nikunj__> ohh so the project was to test HPX capabilities before integrating it with kokkos?
<hkaiser> no plans to extend it to distributed at this point
<Nikunj__> it will be nice to have a distributed model I guess
<hkaiser> but if you're interested you can work on it anyways ;-)
<Nikunj__> yes, I believe that our paper was rejected last time coz of the Habanero guys
<Nikunj__> if we have a distributed version that performs nicely, we can get a paper out of it
<hkaiser> nod
<Nikunj__> and I was already thinking of adding even better facilities wrt resiliency so that we can work with node failures
<Nikunj__> btw can the runtime know if a node fails? or is there a way in HPX to know that?
<hkaiser> Nikunj__: at some point heller1's collaborators have worked on this
<Nikunj__> distributed resiliency?
<hkaiser> Nikunj__: well, timeouts
<hkaiser> no handling of node failures
<Nikunj__> hkaiser, well yes. But relying on timeouts just makes it slow to respond
<Nikunj__> how do current applications identify node failures for backup and restore?
<hkaiser> Nikunj__: you could use some watchdog mechanism
<hkaiser> no idea
<Nikunj__> hkaiser, yes, that's how I was thinking
bita has joined #ste||ar
<Nikunj__> hkaiser, any resources you'd want me to read on this? I think we can work on the resiliency side of things to a greater extent
<hkaiser> Nikunj__: I don't know anything about this
bita_ has quit [Ping timeout: 260 seconds]
<Nikunj__> damn, do you know anyone who can help me with the project?
<hkaiser> Keita?
<Nikunj__> ohh yes, but will Keita help me with this sort of independent project?
<hkaiser> he can give some guidance on what to read
<Nikunj__> alright. I'll email him about this and CC you
<hkaiser> nod, pls do
<Nikunj__> is it novel enough though?
<Nikunj__> I don't want another reviewer accusing the work of being "not novel enough"
<hkaiser> hah
<hkaiser> not many production systems exist that do that, if at all
<Nikunj__> I also want to integrate our system with backup and restore
<hkaiser> checkpointing
<Nikunj__> yes checkpointing
<hkaiser> Nikunj__: Yorlik wants to have that for his game
<Yorlik> What?
<Nikunj__> if we can have a mechanism of checkpointing with distributed resiliency, it will be really nice to have
<hkaiser> indeed
<hkaiser> checkpoint AGAS itself
<Yorlik> Persistent id_types?
<hkaiser> right
<Nikunj__> Yorlik, yes
<Yorlik> Need it for real !!!!
<Yorlik> It also would have economic implications.
<Yorlik> Imagine a datacenter renting out compute capacity
<Yorlik> They could have high prio and low prio jobs
<Yorlik> Just persisting a job, running another and resuming the previous later.
<Yorlik> So people with less dire computing needs could use spare capacity for cheap :)
<Yorlik> But ofc, you could do that on a VM level too :(
<Nikunj__> hkaiser, it's done then. I think this is a cool summer project that is worth exploring
<hkaiser> Nikunj__: nice!
<Nikunj__> besides I'll be done with the current JSC by mid May, so this will be nice to explore post that
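For reference, a minimal sketch of the checkpointing idea discussed above, using HPX's documented checkpoint facilities (hpx::util::checkpoint, save_checkpoint, restore_checkpoint); the header path and the data being saved are assumptions and may differ between HPX versions.

    // Minimal sketch only: serialize some state into a checkpoint and restore it.
    // The header path below is the 1.4-era one and may differ in newer releases.
    #include <hpx/hpx_main.hpp>
    #include <hpx/util/checkpoint.hpp>

    #include <iostream>
    #include <string>
    #include <vector>

    int main()
    {
        std::vector<int> data{1, 2, 3};
        std::string name = "run-42";

        // save_checkpoint returns a future to a checkpoint object
        hpx::util::checkpoint cp = hpx::util::save_checkpoint(data, name).get();

        // ... cp could now be written to disk or shipped to another locality ...

        std::vector<int> restored_data;
        std::string restored_name;
        hpx::util::restore_checkpoint(cp, restored_data, restored_name);

        std::cout << restored_name << ": " << restored_data.size() << "\n";
        return 0;
    }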
<zao> Reminds me of a story of one of our researchers, running on the cluster of the Norwegian weather service many years ago. That machine had the ability to page jobs out to disk when important weather forecast jobs were scheduled.
<zao> The researcher used the machine so efficiently that it took the system too long to page his job out, consuming the whole forecast slot just to spool it out to disk, at which point it promptly spooled it back in again.
<hkaiser> zao: you issue a command: kill job:12345, and the system responds with: insert tape
<Yorlik> I have seen robotic tape archives loooong ago ..
<zao> Our tape robot is quite alive and well :D
<Nikunj__> I have never seen tape sadly
<Yorlik> zao: Is tape still considered the most reliable long term storage?
<zao> Can't beat tape for bulk storage that still has some sort of retrieval need.
<zao> Tape is reasonably resilient, and tape libraries do run audits to find tapes that need to be rewritten or replaced.
<zao> You can also build some redundancy on top by duplication, even to tape libraries on other geographical sites.
<zao> We use ours for backups and LHC experiment data.
<Yorlik> LHC data amounts must be massive. You're working at CERN, zao?
hkaiser_ has joined #ste||ar
<zao> We're part of the Nordic Tier 1, so we carry and distribute part of the experiment data, as well as provide GRID compute.
<Yorlik> IC
<zao> I'm not directly involved, some of my cow-orkers work with them.
<zao> I migrated some data between tape generations a bunch of years ago, so got to poke around quite a bit at the system.
<bita> hkaiser, I have been testing that issue. Here, https://gist.github.com/taless474/0621f08e03d13e16d7bad1dd655fb155, the commented one does not pass while the other passes
<hkaiser_> bita: is that what's causing the '<unknown>' error?
hkaiser has quit [Ping timeout: 260 seconds]
<bita> yes
<hkaiser_> ok, so it's not build-system related
<bita> I think the problem is with my retile
<hkaiser_> I'll try to debug that
<hkaiser_> tomorrow
<bita> the problem is I cannot debug it
<bita> But this last example gave me some ideas about what's wrong
<zao> Hrm, it's Tuesday tomorrow, should probably hit the sack. Might get some time to HPX:ify some more code.
<hkaiser_> bita: I hope to have some time tomorrow later in the evening
<bita> right, thank you
nikunj97 has joined #ste||ar
Nikunj__ has quit [Ping timeout: 240 seconds]
akheir1 has quit [Quit: Leaving]
nan11 has quit [Remote host closed the connection]
<Yorlik> Do we have anyone really good with intrinsics and bit manipulation?
hkaiser_ has quit [Ping timeout: 260 seconds]
Nikunj__ has joined #ste||ar
nikunj97 has quit [Ping timeout: 260 seconds]
diehlpk_work has quit [Remote host closed the connection]
bita has quit [Quit: Leaving]
Nikunj__ is now known as nikunj97
weilewei has quit [Remote host closed the connection]
nikunj has quit [Read error: Connection reset by peer]
nikunj has joined #ste||ar
mdiers[m] has quit [Quit: killed]
simbergm has quit [Quit: killed]
gdaiss[m] has quit [Quit: killed]
pfluegdk[m] has quit [Quit: killed]
Guest20891 has quit [Quit: killed]
diehlpk_mobile[m has quit [Quit: killed]
kordejong has quit [Quit: killed]
jbjnr has quit [Quit: killed]
heller1 has quit [Quit: killed]
freifrau_von_ble has quit [Quit: killed]
rori has quit [Quit: killed]
tiagofg[m] has quit [Quit: killed]
simbergm has joined #ste||ar
kordejong has joined #ste||ar
heller1 has joined #ste||ar
mdiers[m] has joined #ste||ar
gdaiss[m] has joined #ste||ar
rori has joined #ste||ar
tiagofg[m] has joined #ste||ar
diehlpk_mobile[m has joined #ste||ar
freifrau_von_ble has joined #ste||ar
pfluegdk[m] has joined #ste||ar
jbjnr has joined #ste||ar
parsa[m] has joined #ste||ar
parsa[m] is now known as Guest38238
<heller1> hkaiser_: good morning. had a look over #4512
<heller1> hkaiser_: I disagree with your conclusion about requiring shared ownership of the condition variable state
nikunj has quit [Read error: Connection reset by peer]
nikunj has joined #ste||ar
nikunj97 has quit [Ping timeout: 260 seconds]
nikunj has quit [Ping timeout: 265 seconds]
nikunj has joined #ste||ar
nikunj97 has joined #ste||ar
nikunj97 has quit [Ping timeout: 240 seconds]
hkaiser has joined #ste||ar
hkaiser_ has joined #ste||ar
hkaiser has quit [Ping timeout: 252 seconds]
hkaiser has joined #ste||ar
hkaiser_ has quit [Ping timeout: 252 seconds]
<hkaiser> simbergm: ?
<heller1> hkaiser: hey
<hkaiser> hey
<heller1> i looked at #4512
<hkaiser> nod, thanks
<heller1> I am not sure I agree with the changes to condition_variable
<hkaiser> ok
<hkaiser> care to explain?
<heller1> the comment says, the user needs to ensure that there is no wait before the destructor runs
<heller1> 1) the wait implementation didn't really change
<hkaiser> wait did change
<heller1> 2) by adding the shared ownership, it's not the user's responsibility anymore, but the implementation takes care of synchronizing between waits and dtor calls
<hkaiser> yes, that's the point
<heller1> yes, that changed
nikunj97 has joined #ste||ar
<heller1> the reason why I disagree here is that everything worked before; with those changes, we basically add a pessimization in there
<hkaiser> I'm not sure that everything worked before, we've seen strange occasional segfaults which could have been caused by this
<heller1> so my argument is: it's a corner case, and the spec clearly states that it's the user's responsibility, not that of the implementation, so why let everyone pay for it?
<hkaiser> heller1: correctness first, remember?
<hkaiser> heller1: not sure if we could detect this case and report it, alternatively
<heller1> let me think about it once more
<hkaiser> sure
<hkaiser> I thought about it for a while (well knowing you would comment on this ;-) but decided to go for it anyways
<heller1> so, the synchronization point is the unblocking of the wait
<heller1> :P
<hkaiser> heller1: also, this would affect only code that is using cv directly, internally we don't do that
<heller1> which is very interesting ;)
<hkaiser> right... so it couldn't have caused our issues - didn't think of this
<hkaiser> we use detail::cv directly everywhere (in future, etc.) and handle the mutex outside of it - we might have similar issues then
<hkaiser> simbergm: there are still cases where the shared_priority_scheduler uses -1 as an index into arrays
<heller1> on a side note, it looks like only condition_variable_any is using the shared ownership
<hkaiser> I have added asserts to #4311
<hkaiser> does it now?
<simbergm> hkaiser: right, I haven't looked at the other cases yet...
<heller1> so, if a wait is unblocked, and then the dtor of the condition_variable is run before the wait returns, what happens
<hkaiser> simbergm: sure, no rush - just a heads up
<heller1> this is the problematic case I believe
<simbergm> is that on the apex pr? john said he would be looking at it today, but let's see
<hkaiser> heller1: yes, then the mutex gets destructed before it can be re-acquired
<hkaiser> simbergm: #4311
<heller1> hkaiser: I was talking about libcxx, sorry for the confusion
<hkaiser> ahh, ok - not sure, haven't looked
<heller1> hkaiser: the mutex is the one that needs to get unblocked, isn't it?
<hkaiser> but the note in the standard applies to both destructors
<simbergm> hkaiser: 👍️ if john doesn't look at it today I'll have a look tomorrow
<heller1> hkaiser: it's also detail::condition_variable::queue_ that's subject to the race
<hkaiser> simbergm: thanks - and as I said, no rush
<hkaiser> just keeping the ball in the air
<simbergm> yep, no worries
<simbergm> thanks for pushing it!
<hkaiser> heller1: yah, I have them both in the shared state
<hkaiser> simbergm: also, is teonik at CSCS?
<heller1> hkaiser: yes, the standard applies to both destructors; my problem is that the standard doesn't say the implementation needs to fix this - the note reads to me as something that a user needs to look out for.
<simbergm> yeah, he is
<hkaiser> (no idea what the name is)
<simbergm> he's working on linear algebra
<simbergm> teodor
<hkaiser> ahh, ok
<simbergm> nikolov
<hkaiser> thanks
<simbergm> he's been john's guinea pig for the mpi stuff
<hkaiser> heller1: why did Howard decide to have it in libc++, then?
<heller1> hkaiser: I haven't talked to Howard about it ;)
<nikunj97> isn't HPX supposed to figure out CPU stuff on its own? hpx::init: hpx::exception caught: Currently, HPX_HAVE_MAX_CPU_COUNT is set to 64 while your system has 256 processing units. Please reconfigure HPX with -DHPX_WITH_MAX_CPU_COUNT=256 (or higher) to increase the maximal CPU count supported by HPX.: HPX(invalid_status)
<heller1> nikunj97: optimization at compile time
<nikunj97> also how is HPX_HAVE_MAX_CPU_COUNT set to 64!
<hkaiser> nikunj97: it's the default
<nikunj97> hkaiser, aah! so I compile it with 256 basically then
<hkaiser> you have to specify it, yes
<nikunj97> thanks!
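For reference, the reconfiguration the error message asks for is a one-flag change at CMake configure time (paths and other options below are placeholders):

    cd hpx-build
    cmake -DHPX_WITH_MAX_CPU_COUNT=256 /path/to/hpx-source    # plus your usual options
    cmake --build . --target install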
<heller1> hkaiser: is this a change made to C++20?
<hkaiser> heller1: not sure, let me look
<hkaiser> heller1: no, the note was in c++11
<heller1> `Requires: There shall be no thread blocked on *this.`
<hkaiser> heller1: I trust Howard more than myself ;-)
<heller1> :P
<hkaiser> K-ballo: what's your take on this?
<heller1> this has the note as well
<hkaiser> yah
<hkaiser> the change is easy enough to take out, it's a single commit
<heller1> haven't gotten to the jthread stuff yet...
<K-ballo> if the thread was blocked on *this then it would never ever unblock
<hkaiser> karame_: the problem is if the destructor starts running while another thread sits in wait
<hkaiser> K-ballo: ^^
<K-ballo> I don't have the rest of the context
<K-ballo> if the thread has been signaled already but hasn't yet left the "body" of wait then that's fine
<K-ballo> there's magic sync on cv destructor
<K-ballo> but if the thread is still locked when the cv is destroyed, then nobody will ever be able to resume it
<K-ballo> yeah, those notes are right, and it's what causes cv_any to hold the mutex by pointer iirc
<hkaiser> we assume that the waiting thread was signalled before the cv is destructed
<hkaiser> yes
<hkaiser> that's why I added that to our cv as well, it holds the mutex etc in a shared state now
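For reference, a generic illustration of the shared-state technique described here (as in condition_variable_any holding its mutex by pointer); this is not HPX's actual code, just the pattern: the waiter keeps a shared_ptr copy of the internal state so that a signalled wait can finish unblocking even if the condition variable object is destroyed in the meantime.

    #include <condition_variable>
    #include <memory>
    #include <mutex>

    class cv_with_shared_state
    {
        struct state
        {
            std::mutex mtx;
            std::condition_variable cond;
        };
        std::shared_ptr<state> state_ = std::make_shared<state>();

    public:
        void notify_one()
        {
            state_->cond.notify_one();
        }

        template <typename Predicate>
        void wait(Predicate pred)
        {
            // keep the internal state alive across the blocking call
            std::shared_ptr<state> keep_alive = state_;
            std::unique_lock<std::mutex> l(keep_alive->mtx);
            keep_alive->cond.wait(l, pred);
            // even if *this was destroyed while we were signalled,
            // keep_alive still owns the mutex being re-acquired here
        }
    };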
<K-ballo> the non-any one?
<hkaiser> K-ballo: I added it to both
<hkaiser> is this a problem for _any only?
<K-ballo> doesn't sound good
<K-ballo> yep
<hkaiser> ok, that I can fix
<K-ballo> the _any variant has to handle 2 locks, I'm not sure what ours does
<K-ballo> the _any variant combines the user lock with the cv internal lock
<hkaiser> yah, it has two locks
<K-ballo> the user lock is kept on the side
<hkaiser> but I think our cv (non-any) has two locks as well
<heller1> we have a similar issue, since we have to lock our internal cv
<K-ballo> odd, should only have the one
<hkaiser> let me have a look
<K-ballo> I guess we don't have the mechanics to combine the lock and the wait or something?
<heller1> hkaiser: I think that change should be separated
<heller1> hkaiser: it requires a bit more thinking, I think
<K-ballo> which change is that?
<heller1> #4512
<heller1> so our detail::condition_variable is the rough equivalent to pthread condition variables, right?
<K-ballo> ah, we have three cvs, ok
<heller1> that's where the main confusion comes from, I think
<heller1> and our decision to almost always default to spinlock as the mutex type
<K-ballo> what's the related confusion?
<heller1> about when to use a shared state and when not
<hkaiser> heller1: ok, I can separate it into a PR
<heller1> and that our hpx::condition_variable uses two locks (one that is provided externally and one that protects the condition variable's internal state)
<heller1> hpx::detail::cv requires a spinlock to protect its state, hpx::cv a hpx::mutex
<heller1> that has the effect that we'd need the shared ownership for cv and cv_any
<heller1> which is leading to a nice pessimization for everyone
<K-ballo> sounds like our cv is effectively a cv_any with a hardcoded mutex type
<K-ballo> and our detail::cv is the real cv?
<heller1> indeed
<heller1> but with the wrong mutex type
<K-ballo> or is our mutex the wrong mutex type :P
<heller1> we use spinlock as the mutex type for detail::cv
diehlpk_work has joined #ste||ar
hkaiser has quit [Ping timeout: 240 seconds]
hkaiser has joined #ste||ar
Rory has joined #ste||ar
weilewei has joined #ste||ar
Rory has quit [Ping timeout: 240 seconds]
<nikunj97> hkaiser, see pm pls
<hkaiser> karame_: yt?
stmatengss has joined #ste||ar
stmatengss has left #ste||ar [#ste||ar]
rtohid has joined #ste||ar
nan11 has joined #ste||ar
K-ballo has quit [Remote host closed the connection]
K-ballo has joined #ste||ar
karame_ has quit [Remote host closed the connection]
Rory has joined #ste||ar
<diehlpk_work> hkaiser, Will you join the octotiger meeting this morning?
<hkaiser> diehlpk_mobile: yes, I want to
<diehlpk_work> Good, we have to discuss new results
<diehlpk_work> and keep writing the paper, since we only have one week left
akheir has joined #ste||ar
<nikunj97> diehlpk_work, what conference are you aiming for?
<diehlpk_work> SC
<nikunj97> Is SC not canceled yet?
<nikunj97> ohh wait, that was ISC
<diehlpk_work> ISC is, because that one is much earlier
<nikunj97> IC
<diehlpk_work> IMPORTANT NOTICE DUE TO Covid-19 OUTBREAK
<diehlpk_work> WCCM-ECCMAS 2020 congress is cancelled in July. Alternative options are currently being explored by the organizers along with the IACM and ECCOMAS
<diehlpk_work> Next conference is canceled
<diehlpk_work> nikunj97, IC will publish all accepted papers, but there are no talks
bita has joined #ste||ar
<diehlpk_work> I assume the same might happen to SC if things are not getting better in November
<nikunj97> makes sense
Rory has quit [Remote host closed the connection]
RostamLog has joined #ste||ar
akheir has quit [Read error: Connection reset by peer]
akheir has joined #ste||ar
Rory has joined #ste||ar
<weilewei> pwd
<weilewei> sorry, mistake...
akheir has quit [Read error: Connection reset by peer]
akheir1 has joined #ste||ar
Rory has quit [Ping timeout: 240 seconds]
wate123_Jun has joined #ste||ar
akheir1 has quit [Remote host closed the connection]
akheir1 has joined #ste||ar
<tiagofg[m]> hello,
<tiagofg[m]> I noticed that in the HPX documentation, in the section "2.5.6 Writing single-node HPX applications", under "channels", the example there has a mistake. "cout << c.get();" will fail, because get() returns a future.
<tiagofg[m]> to obtain the value 42, I think the solution may be "cout << c.get().get();" or "cout << c.get(hpx::launch::sync);"
<hkaiser> tiagofg[m]: yes, correct
<hkaiser> thanks for noticing
<hkaiser> rori: yt?
<hkaiser> nvm
<rori> yep
<hkaiser> rori: sorry, nvm
<tiagofg[m]> no problem
nan11 has quit [Remote host closed the connection]
<hkaiser> tiagofg[m]: see #4519
<tiagofg[m]> ok nice
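For reference, a sketch of the corrected snippet (assuming the local channel example from that docs section, with a channel<int> named c); c.get() returns a future, so the value has to be pulled out of it:

    #include <hpx/hpx_main.hpp>
    #include <hpx/include/lcos.hpp>

    #include <iostream>

    int main()
    {
        hpx::lcos::local::channel<int> c;

        c.set(42);
        std::cout << c.get().get() << std::endl;            // wait on the returned future

        c.set(42);
        std::cout << c.get(hpx::launch::sync) << std::endl; // or ask for a synchronous get

        return 0;
    }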
rtohid has quit [Remote host closed the connection]
Nikunj__ has joined #ste||ar
nan11 has joined #ste||ar
nikunj_ has joined #ste||ar
<nikunj_> heller1, marvell thunderx2 (https://en.wikichip.org/wiki/cavium/thunderx2) just hates working on floats
nikunj97 has quit [Ping timeout: 256 seconds]
<hkaiser> nikunj_: I poked Keita, he will send papers
<nikunj_> hkaiser, thanks a lot!
<nikunj_> heller1, double performs about 2x better than float on thunderx2, and simd floats are giving me numbers way above expectations, at 12 GLUPS, which is about 1.2x the simd double performance
Nikunj__ has quit [Ping timeout: 264 seconds]
<nikunj_> heller1, let me plot the graphs and show you the results tomorrow
<heller1> nikunj_: sounds cool!
<heller1> hkaiser: I am gonna have a take on the condition variable issue
<diehlpk_work> nikunj_, Our contact for the access to the new arm arch is not working there anymore
<diehlpk_work> I sent him an email today and his address bounced
<nikunj_> diehlpk_work, ohh crap that's why we didn't get access
<nikunj_> we should talk to someone at Fujitsu for getting access to A64FX
<diehlpk_work> I contacted the person who signed the form with us and asked whom I should contact instead
<nikunj_> diehlpk_work, so we can expect access to the processors soon?
<hkaiser> heller1: I have removed the shared state for cv from the PR
<heller1> hkaiser: ok, let me have a look then
<heller1> hkaiser: btw, there is no reason why the polymorphic_executor can't have a templated variadic execute function and friends. You can implement it in such a way that it creates a nullary function object returning void under the hood and passes it to the underlying executor
<heller1> hkaiser: on a different note, it would have been great to review the stop_token independently from jthread ;)
<hkaiser> heller1: ok, what about the return type?
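For reference, a generic sketch of the type erasure heller1 suggests (not the actual HPX polymorphic_executor): the templated, variadic front end packs the callable and its arguments into a nullary void() work item before crossing the non-template interface; returning a value would additionally need something like a promise/future pair set up before the erasure, which is the return-type question above.

    #include <functional>
    #include <utility>

    // hypothetical type-erased interface: only nullary void() work items cross it
    struct erased_executor
    {
        virtual ~erased_executor() = default;
        virtual void post_nullary(std::function<void()> work) = 0;
    };

    struct polymorphic_executor_sketch
    {
        erased_executor* impl;    // wraps some concrete executor elsewhere

        // templated, variadic front end; any return value of f is discarded
        template <typename F, typename... Ts>
        void post(F&& f, Ts&&... ts)
        {
            impl->post_nullary(
                std::bind(std::forward<F>(f), std::forward<Ts>(ts)...));
        }
    };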
<hkaiser> heller1: those things belong together
<hkaiser> stop_token and jthread
<heller1> ok
<hkaiser> besides, jthread is trivial
<heller1> stop_token makes our cancelation_token obsolete, doesn't it?
<hkaiser> could be, yah
<hkaiser> didn't think about that yet
<hkaiser> we've never had one for real, have we?
<heller1> well, we have the cancellation points
<hkaiser> yah, those are intrusive
<heller1> and friends
<hkaiser> stop_token is non-intrusive
<heller1> and the thing to cancel actions?
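For reference, the non-intrusive cancellation model being discussed, shown with the C++20 std:: names (std::jthread / std::stop_token); the work loop only polls the token, no dedicated cancellation points are needed:

    #include <chrono>
    #include <iostream>
    #include <stop_token>
    #include <thread>

    int main()
    {
        std::jthread worker([](std::stop_token st) {
            while (!st.stop_requested())
            {
                // ... do a unit of work ...
                std::this_thread::sleep_for(std::chrono::milliseconds(10));
            }
            std::cout << "stop requested, cleaning up\n";
        });

        std::this_thread::sleep_for(std::chrono::milliseconds(100));
        worker.request_stop();    // cooperative, non-intrusive cancellation
        return 0;                 // jthread joins automatically on destruction
    }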
<heller1> hkaiser: did you come up with the reference counting for the stop_state or is there a reference implementation?
<heller1> hkaiser: and: can there only always be one source or more than one?
<hkaiser> ahh!
<heller1> the whole implementation of the reference counting and locking looks fishy
<hkaiser> heller1: be more specific, pls
<hkaiser> it's normal reference counting, additionally it counts the stop_source instances
<hkaiser> but this has no implications for the memory management
<hkaiser> heller1: there can be as many stop_sources as you like
<heller1> not quite ;)
<hkaiser> well, 2^31
<heller1> in any case: you use a bit operation to set the locked flag, but subtract the mask
<hkaiser> that's the limit for any ref counting we use
<hkaiser> this never subtracts the mask, it uses the mask for masking only (I hope)
<hkaiser> show me what you mean, pls
<heller1> unlock
<hkaiser> ok, what's wrong with the fetch_sub?
<hkaiser> I could add an assert to make sure it's actually locked
<heller1> is it equivalent to just unsetting the locked flag in all cases?
<hkaiser> sure, it's a single bit on an unsigned
<K-ballo> it could assert the bit was previously set/not-set, but our asserts are pretty heavy
<heller1> since it is an implementation detail, i trust it being correct. my bit manipulation foo is just a tiny bit rusty
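For reference, a generic sketch of the pattern under discussion (not the actual stop_state code): one atomic word packs a locked flag bit next to a reference count, and since the flag is known to be set on unlock, fetch_sub of the mask clears exactly that bit (equivalent to fetch_and with the inverted mask):

    #include <atomic>
    #include <cassert>
    #include <cstdint>

    struct packed_state
    {
        static constexpr std::uint64_t locked_flag = 1ull << 63;

        std::atomic<std::uint64_t> state_{0};

        void lock()
        {
            // spin until the flag bit transitions from clear to set
            std::uint64_t expected = state_.load(std::memory_order_relaxed);
            do
            {
                expected &= ~locked_flag;
            } while (!state_.compare_exchange_weak(expected,
                expected | locked_flag, std::memory_order_acquire));
        }

        void unlock()
        {
            // the flag is set here, so subtracting the mask clears exactly
            // that bit and leaves the counts in the low bits untouched
            std::uint64_t old = state_.fetch_sub(
                locked_flag, std::memory_order_release);
            assert(old & locked_flag);
            (void) old;
        }

        void add_ref()
        {
            state_.fetch_add(1, std::memory_order_relaxed);
        }
    };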
wate123_Jun has quit [Remote host closed the connection]
<diehlpk_work> nikunj_, hkaiser I got the new person who is responsible for us
<diehlpk_work> We will see what he can do for us
<nikunj_> diehlpk_work great work!
<diehlpk_work> nikunj, Always here to support HPX
<nikunj_> \o/
rtohid has joined #ste||ar
<heller1> hkaiser: ok, left one comment
<bita> hkaiser, I have made a retile_3loc example that fails with the same error, so it would be easier to debug
<nan11> hkaiser, could you please take a look at https://github.com/STEllAR-GROUP/phylanx/pull/1146. It considers three general tiling cases with any number of tiles.
<weilewei> with lots of hard coded things, distributed G4 seems to run now on two ranks ... it requires tons of array boundary checks... I need to make it more general
weilewei has quit [Remote host closed the connection]
weilewei has joined #ste||ar
<hkaiser> bita: nice, thanks!
<hkaiser> nan11: will do
<nan11> Thanks
K-ballo has quit [Remote host closed the connection]
K-ballo has joined #ste||ar
weilewei has quit [Remote host closed the connection]
weilewei has joined #ste||ar
nikunj_ has quit [Ping timeout: 252 seconds]
rtohid has left #ste||ar [#ste||ar]