hkaiser changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/ | GSoC2018: https://wp.me/p4pxJf-k1
<github> [hpx] K-ballo opened pull request #3382: Fix usage of HPX_CAPTURE together with default value capture [=] (master...fix-hpx-capture-default) https://git.io/fNmp8
jaafar has quit [Ping timeout: 244 seconds]
diehlpk has joined #ste||ar
hkaiser has quit [Quit: bye]
K-ballo has quit [Quit: K-ballo]
nanashi55 has quit [Ping timeout: 264 seconds]
nanashi55 has joined #ste||ar
diehlpk has quit [Ping timeout: 240 seconds]
quaz0r has quit [Ping timeout: 248 seconds]
quaz0r has joined #ste||ar
jaafar has joined #ste||ar
jaafar has quit [Ping timeout: 264 seconds]
nikunj has joined #ste||ar
nikunj has quit [Read error: Connection reset by peer]
nikunj has joined #ste||ar
<github> [hpx] StellarBot pushed 1 new commit to gh-pages: https://git.io/fNYk6
<github> hpx/gh-pages 9c6a9ea StellarBot: Updating docs
hkaiser has joined #ste||ar
<github> [hpx] hkaiser pushed 1 new commit to master: https://git.io/fNYOt
<github> hpx/master a4d7485 Hartmut Kaiser: Merge pull request #3379 from STEllAR-GROUP/fixing_3378...
K-ballo has joined #ste||ar
nikunj has quit [Quit: Leaving]
hkaiser has quit [Quit: bye]
bobakk3r has joined #ste||ar
bobakk3r has quit [Client Quit]
hkaiser has joined #ste||ar
jaafar has joined #ste||ar
hkaiser has quit [Quit: bye]
hkaiser has joined #ste||ar
diehlpk has joined #ste||ar
biddisco has joined #ste||ar
<biddisco> hkaiser: cancellation token
<biddisco> the destructor of this https://github.com/STEllAR-GROUP/hpx/blob/master/hpx/parallel/algorithms/find.hpp#L67 is taking place after the memory has gone
<K-ballo> doesn't cancellation_token have shared pointer semantics?
<biddisco> it has an internal shared pointer
<biddisco> #4 0x00000000100ce030 in __libcpp_atomic_refcount_decrement<long> (__t=@0x1f: <error reading variable: Cannot access memory at address 0x1f>) at /users/biddisco/apps/llvm/include/c++/v1/memory:3472
<biddisco> #5 __release_shared (this=<optimized out>) at /users/biddisco/apps/llvm/include/c++/v1/memory:3525
<biddisco> #6 __release_shared (this=0x17) at /users/biddisco/apps/llvm/include/c++/v1/memory:3568
<biddisco> #7 ~shared_ptr (this=<optimized out>) at /users/biddisco/apps/llvm/include/c++/v1/memory:4504
<biddisco> #8 ~cancellation_token (this=<optimized out>) at /users/biddisco/src/hpx/hpx/parallel/util/cancellation_token.hpp:29
<biddisco> #9 ~ (this=<optimized out>) at /users/biddisco/src/hpx/hpx/parallel/algorithms/find.hpp:83
<biddisco> but the shared pointer destructor is hitting bad memory
<K-ballo> !
<biddisco> indeed
<K-ballo> the one in line 83 is its own copy
<biddisco> yes, the lambda captures are confusing
<K-ballo> and the lambda is mutable for some reason
<hkaiser> biddisco: nice find
<biddisco> hkaiser: I'm not sure what/how to fix
<biddisco> I don't understand what's going on
<hkaiser> biddisco, give me a bit of time, pls
<hkaiser> K-ballo: what do you mean by 'the one in line 83 is its own copy'?
<K-ballo> the backtrace points at line 83 at #9, which is introducing a capture by copy of the cancellation token at line 67 linked above
<biddisco> the one on line 72 is as well though
diehlpk has quit [Ping timeout: 240 seconds]
<hkaiser> K-ballo: yes it does
<hkaiser> both lambdas have to capture the token
<hkaiser> why is that bad?
<K-ballo> it shouldn't be, since it has shared ptr semantics
<biddisco> (just fyi - you knew it already, but the code does not crash with 1 thread)
<hkaiser> could the shared_ptr on ppc be buggy?
<biddisco> why would a very specific subset of tests fail consistently, but not others?
<biddisco> if shared_pointer was buggy - would we not see more fails?
<hkaiser> nod, sure - cheap shot on my end
<biddisco> from my brief examination, it looks like most of the find algorithms are failing, they are all using this cancellation token
<biddisco> and is_heap etc
<hkaiser> right
<hkaiser> biddisco: did you change the memory_order in the cancellation token?
<K-ballo> cancel is relaxed? odd
<biddisco> is_sorted, search etc
<K-ballo> woa, everything is relaxed
<hkaiser> I mean when you tried making things sequentially consistent
<biddisco> yes, I changed it
<hkaiser> K-ballo: yah, that could be the reason
<biddisco> where is relaxed?
<hkaiser> the atomics in cancellation token
<biddisco> I removed all the relaxed
<biddisco> let me check
<biddisco> they are all .load() now
<biddisco> without the relaxed
<hkaiser> those memory_order_relaxed are wrong anyways
<hkaiser> the compare_exchange as well?
<biddisco> all the relaxed are gone
<K-ballo> presumably it has been replaced with sequential_consistency?
<hkaiser> K-ballo: isn't that the default?
<biddisco> I just removed all memory orders on the assumption that it defaults to sequentially consistent
<hkaiser> right, it does
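The point about defaults can be illustrated with a minimal sketch. This is not the HPX cancellation_token; the class name simple_cancellation_token and its members are hypothetical, chosen only to show that calling load() and store() on a std::atomic without an explicit memory order uses std::memory_order_seq_cst, and that copies of the token share one flag through a std::shared_ptr.

```cpp
#include <atomic>
#include <cassert>
#include <memory>

// Hypothetical minimal token for illustration only; it is NOT the class in
// hpx/parallel/util/cancellation_token.hpp.
class simple_cancellation_token
{
    // All copies of the token share the same flag (shared pointer semantics).
    std::shared_ptr<std::atomic<bool>> flag_;

public:
    simple_cancellation_token()
      : flag_(std::make_shared<std::atomic<bool>>(false))
    {
    }

    bool was_cancelled() const noexcept
    {
        // No memory order argument: defaults to std::memory_order_seq_cst.
        return flag_->load();
    }

    void cancel() noexcept
    {
        // Likewise sequentially consistent by default.
        flag_->store(true);
    }
};
```

A copy made before cancel() is called observes the cancellation afterwards, because both instances point at the same atomic flag.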
<hkaiser> biddisco: well, then the logic in cancellation token is wrong
<biddisco> ok
<biddisco> it's not code I've ever looked at until now
<hkaiser> looks sane to me :/
<hkaiser> ans the relaxed, that is
<hkaiser> sans*
<biddisco> it's the destructor that is segfaulting. this means the lambdas are doing something fishy
<biddisco> I'm wondering if clang is optimizing something strangely
<biddisco> debug mode is ok
<hkaiser> yah
<hkaiser> biddisco: try adding 'tok' as an explicit capture to the second lambda as well
<biddisco> I think I tried that already earlier
<hkaiser> ok
<biddisco> let me try again
<hkaiser> K-ballo: is destruction of shared_ptr thread safe?
<K-ballo> only for that instance
<hkaiser> I seem to remember that some operations were not
<K-ballo> that's not what I meant
<hkaiser> K-ballo: for that data instance or that shared_ptr instance
<K-ballo> destroying the shared_ptr while using that some instance is bad
<K-ballo> destroying the shared_ptr while some other instance points to the same shared data is fine
<K-ballo> some -> same
<hkaiser> ok, then we should be fine
<K-ballo> yes, we should
<hkaiser> each lambda holds a copy
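The rule discussed above (distinct shared_ptr instances that share one control block may be destroyed concurrently) can be sketched as follows. The names tracked and destroy_copies_concurrently are hypothetical and exist only for this illustration; the code assumes C++14 or later for the init capture and a platform where std::thread is available.

```cpp
#include <atomic>
#include <cassert>
#include <memory>
#include <thread>
#include <vector>

// Hypothetical helper type that counts how often its destructor runs.
struct tracked
{
    static std::atomic<int> destroyed;
    ~tracked() { ++destroyed; }
};
std::atomic<int> tracked::destroyed{0};

// Each worker thread owns its own shared_ptr instance and releases it
// concurrently with the others. The reference count in the shared control
// block is updated atomically, so the object is destroyed exactly once.
int destroy_copies_concurrently(int num_threads)
{
    tracked::destroyed = 0;

    auto original = std::make_shared<tracked>();
    std::vector<std::thread> workers;
    for (int i = 0; i < num_threads; ++i)
    {
        // Capture by copy: every lambda holds a separate shared_ptr instance.
        workers.emplace_back([copy = original]() mutable { copy.reset(); });
    }

    // Releasing this instance while the workers release theirs is allowed,
    // because no single shared_ptr instance is touched by two threads.
    original.reset();

    for (auto& w : workers)
        w.join();

    return tracked::destroyed.load();
}
```

What this sketch does not cover is two threads reading and modifying the same shared_ptr instance without synchronization, which is a data race.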
<K-ballo> somehow the control block is getting corrupted though
<biddisco> did not fix it using tok, first, last, count instead of =
<hkaiser> biddisco: nod
<hkaiser> biddisco: another test would be to replace the shared_ptr in the token by a plain pointer and let the memory leak
<biddisco> ok
<biddisco> hkaiser: you win $10
<biddisco> using a flat pointer makes the segfault go away
<hkaiser> interesting
<hkaiser> so something messes up the lifetime of the shared_ptr
<hkaiser> biddisco: just to be on the safe side and to rule out std::shared_ptr being a problem on that platform - could you replace it with a boost::shared_ptr, pls?
<biddisco> ok
<biddisco> hkaiser: no boost::make_shared in my 1.67 any more
<biddisco> hold on
<hkaiser> use shared_ptr(new T()) instead
<biddisco> wrong #include
<biddisco> crashes with boost make_shared and boost::shared_ptr. they probably #ifdef use std anyway
<K-ballo> no, they wouldn't
<K-ballo> they ship a number of features not in std
<hkaiser> ok, so it's something with our code
<hkaiser> surprise
<hkaiser> :/
<zao> I would not be surprised if Boost didn't give any hoots about PowerPC or other "alternative platforms".
<zao> (long-running grudge from my side, of course)
<K-ballo> we are corrupting the capture somehow
<biddisco> does cancellation_token need a copy ctor?
<hkaiser> the default one should be fine
<hkaiser> K-ballo: the control block is not in the capture
<K-ballo> as far as I can tell we are corrupting this guy: https://github.com/llvm-mirror/libcxx/blob/master/include/memory#L3732
<K-ballo> that's calling the control block: __release_shared (this=0x17)
<hkaiser> nod
<K-ballo> uh, line 4504 is
<hkaiser> so this pointer is somehow overwritten
<hkaiser> biddisco: another idea is to replace the shared_ptr with an intrusive one, but that is more involved
<biddisco> we need to understand why this fails before making random fixes really.
<biddisco> (IMHO)
<hkaiser> I'm still not sure if this is really our problem
<hkaiser> biddisco: but I agree
<biddisco> might be a compiler problem
<biddisco> but how to be sure
<hkaiser> with a plain pointer all is well (at least it doesn't exhibit the same behavior), that's the reason why an intrusive pointer might be good as well, as it's one pointer compared to two pointers in shared_ptr
<hkaiser> assumptions
<K-ballo> could you try a data breakpoint, to see what writes to that location?
<biddisco> which location do you want
<K-ballo> the one corresponding to the captured shared pointer
<hkaiser> the second pointer in the shared_ptr that is captured in the second lambda
<biddisco> ok. I try
<biddisco> might not have it in my relwithdebinfo build
<hkaiser> K-ballo: those lambdas could have been moved...
<hkaiser> that would change the address for the data breakpoint
<biddisco> hmmm
<biddisco> value has been optimized out
<biddisco> K-ballo: when I use debug mode, the bug does not happen
<biddisco> might well be a compiler issue
<biddisco> will try again tomorrow. goodnight all
biddisco has quit [Ping timeout: 265 seconds]
diehlpk has joined #ste||ar
diehlpk has quit [Ping timeout: 244 seconds]
jakub_golinowski has joined #ste||ar
diehlpk has joined #ste||ar