hkaiser changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar-group.org | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | This channel is logged: irclog.cct.lsu.edu
<K-ballo>
Vineeth not working how? it loads
<Vineeth>
This site can’t provide a secure connection. mail.cct.lsu.edu uses an unsupported protocol.
<Vineeth>
ERR_SSL_VERSION_OR_CIPHER_MISMATCH
<K-ballo>
yes, it uses TLS1
ahmed_ has quit [Quit: Connection closed for inactivity]
aacirino has quit [Remote host closed the connection]
ahmed_ has joined #ste||ar
hkaiser has joined #ste||ar
K-ballo has quit [Quit: K-ballo]
hkaiser has quit [Quit: Bye!]
ahmed_ has quit [Quit: Connection closed for inactivity]
Vineeth has quit [Quit: Client closed]
jbalint has quit [Quit: Bye!]
jbalint has joined #ste||ar
jbalint has quit [Quit: Bye!]
jbalint has joined #ste||ar
Yorlik_ has joined #ste||ar
Guest95 has joined #ste||ar
Guest95 has quit [Client Quit]
Guest95 has joined #ste||ar
Guest95 has quit [Client Quit]
Vineeth_Pulipati has quit [Quit: Client closed]
hkaiser has joined #ste||ar
<gonidelis[m]>
hkaiser: pm plz :)
<dkaratza[m]>
gonidelis: i want you to tell me your exact email and department you want me to add
ahmed_ has joined #ste||ar
K-ballo has joined #ste||ar
Yorlik_ has quit [Read error: Connection reset by peer]
diehlpk has joined #ste||ar
<diehlpk>
hkaiser: I fixed the cmake issue on Fedora's build system
<diehlpk>
Build is running now
<diehlpk>
gonidelis[m]:
<diehlpk>
We get a ton of warnings using gcc 12
<hkaiser>
diehlpk: show us the log once it's done, I'll try to fix the warnings
aacirino has joined #ste||ar
<diehlpk>
hkaiser: hpx without networking finished on x86
<jbjnr[m]>
It seems unlikely to me that we have a bug in something as fundamental as CV. Is it used much in the main hpx code? (if the answer is "no", then a deadlock problem going unnoticed is possible, but if it is used a lot, then we surely would have noticed it ...)
hkaiser_ has joined #ste||ar
<hkaiser_>
jbjnr[m]: CVs are used everywhere
<hkaiser_>
jbjnr[m]: but I wouldn't put it past us to have a problem, we might just use it in ways that don't expose it
hkaiser has quit [Ping timeout: 260 seconds]
<hkaiser_>
that could also explain the barrier issue after all
<jbjnr[m]>
What I'm seeing is that N threads wait on the CV, but when a thread tries to notify the CV - it can't get the lock because when the CV.wait is entered - the lock is not being released as it should
<jbjnr[m]>
I mean, one of the threads that is waiting on the CV must still be holding the lock somehow, even though they should not be
<hkaiser_>
before releasing the outer lock (the one passed in) it tries to acquire another, inner lock
<hkaiser_>
is it that it's hanging while doing that?
<jbjnr[m]>
probably, though I do not see a thread spinning there
<hkaiser_>
or, when leaving the cv::wait it has to reacquire the outer lock
<jbjnr[m]>
What I see is that the threads that wait, appear to be suspended, but somehow the lock was not released
<hkaiser_>
uhh, that's strange indeed
<jbjnr[m]>
(the outer lock)
<jbjnr[m]>
(the user lock not released, not the inner one)
<hkaiser_>
then any thread trying to exit wait() will not be able to do so
<jbjnr[m]>
I had a quick try at debugging, but thought I'd ask first before going deeper
<hkaiser_>
sure, but I don't know what could be wrong - sorry
<jbjnr[m]>
yes. I'm a bit puzzled at what's going on - I'd expect a thread to be stuck in there somewhere, but I do not see that in the thread backtraces
<jbjnr[m]>
AFAICT all threads enter the CV wait, do the right thing, but the notify cannot get the lock to wake any of them.
<diehlpk_work>
hkaiser_, gonidelis[m] I fixed the cmake hiccups on Fedora, and x86 and i686 compiled
<diehlpk_work>
power, s390x, and aarch64 are still pending
<jbjnr[m]>
anyway I will look into it.
<diehlpk_work>
let me know when I should start building rc-2
<hkaiser_>
jbjnr[m]: notify needs to acquire the inner lock only, iirc
<jbjnr[m]>
I cannot acquire the user lock
<hkaiser_>
jbjnr[m]: ok, so let's recap
<hkaiser_>
cv::wait first acquires the inner lock, then releases the outer one
<hkaiser_>
on exit it first releases the inner one before reacquiring the outer lock
<hkaiser_>
so the inner and outer locks are not 'nested'
<jbjnr[m]>
indeed. The outer lock should be unlocked. But when I try to acquire the outer lock in my code prior to calling notify_one - I cannot get it because it is still locked.
<hkaiser_>
releasing the inner one before re-acquiring the outer lock actually fixed #3608
<jbjnr[m]>
ooh - hold on. My reply was to your first comment not the link
<hkaiser_>
you don't need to do that, do you?
<hkaiser_>
you don't need to acquire the outer lock before calling notify
<jbjnr[m]>
"need" - to modify the shared variable that the CV predicate tests, per the standard, yes, I should hold the lock
<hkaiser_>
could holding the outer lock while calling notify cause the deadlock?
<jbjnr[m]>
if that's the case, then fine.
<jbjnr[m]>
I can forget about taking the lock and move on.
<hkaiser_>
not sure what you mean with 'so we don't follow the steps in this page ...'
<jbjnr[m]>
steps 1,2,3 acquire lock, modify var, call notify
<hkaiser_>
try unlocking the outer lock before calling notify
<jbjnr[m]>
I already tried that - I cannot acquire the outer lock though
<jbjnr[m]>
it's locked already
<hkaiser_>
do you know where the lock was acquired?
<jbjnr[m]>
I just don't know how the CV can do a wait and not release it first
<jbjnr[m]>
that's what I'm trying to fix
<hkaiser_>
yah
<jbjnr[m]>
presumably when my threads wait on the cv
<hkaiser_>
ok
<hkaiser_>
what type is your outer lock of?
<hkaiser_>
spinlock? mutex?
<jbjnr[m]>
spinlock - it only happens very rarely. I will do more debugging and report back.
<hkaiser_>
spinlock should allow you to put a breakpoint on the spin to see what thread is 'hanging'
<jbjnr[m]>
there is no thread hanging
<jbjnr[m]>
that's the problem. All threads are trying to call notify, but they cannot get the lock, because a suspended thread took the lock and is waiting on the CV but somehow didn't unlock before suspending
<jbjnr[m]>
(I think)
<jbjnr[m]>
but maybe the thread never suspended?
<hkaiser_>
in order to suspend, a thread has to go through the constructor of the unlock_guard
<hkaiser_>
also, if it was suspending with a held lock you'd see an exception
<jbjnr[m]>
I've looked at the code, been through the obvious logic. Come to a dead end - then I connected here to see if you had any thoughts. I will leave now and debug some more
<hkaiser_>
ok, no ideas here
<hkaiser_>
another thought - if you believe it to be a problem with our CV, try using a std::cv instead
<hkaiser_>
not nice, but functionally correct
<jbjnr[m]>
is that safe to suspend our threads with?
<hkaiser_>
(perhaps) ;-)
<hkaiser_>
you will block all progress on that core
<jbjnr[m]>
std::condition_variable works only with std::unique_lock<std::mutex>; this restriction allows for maximal efficiency on some platforms. std::condition_variable_any provides a condition variable that works with any BasicLockable object, such as std::shared_lock.
<jbjnr[m]>
so I'd better use cv_any
<jbjnr[m]>
std::
<hkaiser_>
nod
<hkaiser_>
just to verify that it's not our CV
<jbjnr[m]>
If I block the os thread, I'll get deadlocks for sure.
<jbjnr[m]>
not really an option
<hkaiser_>
yah, you might
<hkaiser_>
was just a thought
<jbjnr[m]>
anyway. I will dig deeper
<jbjnr[m]>
thanks anyway
<diehlpk_work>
hkaiser_, s390x finished as well
<jbjnr[m]>
hkaiser_: new information. There are 89 threads waiting on the CV in my current hung test. I will look again tomorrow and sleep on this. Not sure 89 is important, but they are all from exactly the same place in my code. (I'm dumping backtraces of suspended threads to try to understand the problem).
ahmed_ has joined #ste||ar
ahmed_ has quit [Quit: Connection closed for inactivity]
aacirino has left #ste||ar [#ste||ar]
diehlpk_work has quit [Remote host closed the connection]