hkaiser changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar-group.org | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | This channel is logged: irclog.cct.lsu.edu
<K-ballo>
Vineeth not working how? it loads
<Vineeth>
This site can’t provide a secure connection. mail.cct.lsu.edu uses an unsupported protocol.
<Vineeth>
ERR_SSL_VERSION_OR_CIPHER_MISMATCH
<K-ballo>
yes, it uses TLS1
ahmed_ has quit [Quit: Connection closed for inactivity]
aacirino has quit [Remote host closed the connection]
ahmed_ has joined #ste||ar
hkaiser has joined #ste||ar
K-ballo has quit [Quit: K-ballo]
hkaiser has quit [Quit: Bye!]
ahmed_ has quit [Quit: Connection closed for inactivity]
Vineeth has quit [Quit: Client closed]
jbalint has quit [Quit: Bye!]
jbalint has joined #ste||ar
jbalint has quit [Quit: Bye!]
jbalint has joined #ste||ar
Yorlik_ has joined #ste||ar
Guest95 has joined #ste||ar
Guest95 has quit [Client Quit]
Guest95 has joined #ste||ar
Guest95 has quit [Client Quit]
Vineeth_Pulipati has quit [Quit: Client closed]
hkaiser has joined #ste||ar
<gonidelis[m]>
hkaiser: pm plz :)
<dkaratza[m]>
gonidelis: i want you to tell me your exact email and department you want me to add
ahmed_ has joined #ste||ar
K-ballo has joined #ste||ar
Yorlik_ has quit [Read error: Connection reset by peer]
diehlpk has joined #ste||ar
<diehlpk>
hkaiser: I fixed the cmake issue on Fedora's build system
<diehlpk>
Build is running now
<diehlpk>
gonidelis[m]:
<diehlpk>
We get a ton of warnings using gcc 12
<hkaiser>
diehlpk: show us the log once it's done, I'll try to fix the warnings
aacirino has joined #ste||ar
<diehlpk>
hkaiser: hpx without networking finished on x86
<jbjnr[m]>
It seems unlikely to me that we have a bug in something as fundamental as CV. Is it used much in the main hpx code? (if the answer is "no", then a deadlock problem going unnoticed is possible, but if it is used a lot, then we surely would have noticed it ...)
hkaiser_ has joined #ste||ar
<hkaiser_>
jbjnr[m]: CVs are used everywhere
<hkaiser_>
jbjnr[m]: but I wouldn't put it past us to have a problem, we might just use it in ways that don't expose it
hkaiser has quit [Ping timeout: 260 seconds]
<hkaiser_>
that could also explain the barrier issue after all
<jbjnr[m]>
What I'm seeing is that N threads wait on the CV, but when a thread tries to notify the CV - it can't get the lock because when the CV.wait is entered - the lock is not being released as it should
<jbjnr[m]>
I mean, one of the threads that is waiting on the CV must still be holding the lock somehow, even though they should not be
<hkaiser_>
before releasing the outer lock (the one passed in) it tries to acquire another, inner lock
<hkaiser_>
is it that it's hanging while doing that?
<jbjnr[m]>
probably, though I do not see a thread spinning there
<hkaiser_>
or, when leaving the cv::wait it has to reacquire the outer lock
<jbjnr[m]>
What I see is that the threads that wait, appear to be suspended, but somehow the lock was not released
<hkaiser_>
uhh, that's strange indeed
<jbjnr[m]>
(the outer lock)
<jbjnr[m]>
(the user lock not released, not the inner one)
<hkaiser_>
then any thread trying to exit wait() will not be able to do so
<jbjnr[m]>
I had a quick try at debugging, but thought I'd ask first before going deeper
<hkaiser_>
sure, but I don't know what could be wrong - sorry
<jbjnr[m]>
yes. I'm a bit puzzled at what's going on - I'd expect a thread to be stuck in there somewhere, but I do not see that in the thread backtraces
<jbjnr[m]>
AFAICT all threads enter the CV wait, do the right thing, but the notify cannot get the lock to wake any of them.
<diehlpk_work>
hkaiser_, gonidelis[m] I fixed the cmake hiccups on Fedora, and x86 and i686 compiled
<diehlpk_work>
power, s390x, and aarch64 are still pending
<jbjnr[m]>
anyway I will look into it.
<diehlpk_work>
let me know when I should start building rc-2
<hkaiser_>
jbjnr[m]: notify needs to acquire the inner lock only, iirc
<jbjnr[m]>
I cannot acquire the user lock
<hkaiser_>
jbjnr[m]: ok, so let's recap
<hkaiser_>
cv::wait first acquires the inner lock, then releases the outer one
<hkaiser_>
on exit it first releases the inner one before reacquiring the outer lock
<hkaiser_>
so the inner and outer locks are not 'nested'
<jbjnr[m]>
indeed. The outer lock should be unlocked. But when I try to acquire the outer lock in my code prior to calling notify_one - I cannot get it because it is still locked.
<hkaiser_>
releasing the inner one before re-acquiring the outer lock actually fixed #3608
<jbjnr[m]>
ooh - hold on. My reply was to your first comment not the link
<hkaiser_>
you don't need to do that, do you?
<hkaiser_>
you don't need to acquire the outer lock before calling notify
<jbjnr[m]>
"need" - to modify the shared variable that the CV predicate tests, per the standard, yes, I should hold the lock
<hkaiser_>
could holding the outer lock while calling notify cause the deadlock?
<jbjnr[m]>
if that's the case, then fine.
<jbjnr[m]>
I can forget about taking the lock and move on.
<hkaiser_>
not sure what you mean with 'so we don't follow the steps in this page ...'
<jbjnr[m]>
steps 1,2,3 acquire lock, modify var, call notify
<hkaiser_>
try unlocking the outer lock before calling notify
<jbjnr[m]>
I already tried that - I cannot acquire the outer lock though
<jbjnr[m]>
it's locked already
<hkaiser_>
do you know where the lock was acquired?
<jbjnr[m]>
I just don't know how the CV can do a wait and not release it first
<jbjnr[m]>
that's what I'm trying to fix
<hkaiser_>
yah
<jbjnr[m]>
presumably when my threads wait on the cv
<hkaiser_>
ok
<hkaiser_>
what type is your outer lock of?
<hkaiser_>
spinlock? mutex?
<jbjnr[m]>
spinlock - it only happens very rarely. I will do more debugging and report back.
<hkaiser_>
spinlock should allow you to put a breakpoint on the spin to see what thread is 'hanging'
<jbjnr[m]>
there is no thread hanging
<jbjnr[m]>
that's the problem. All threads are trying to call notify, but they cannot get the lock, because a suspended thread took the lock and is waiting on the CV but somehow didn't unlock before suspending
<jbjnr[m]>
(I think)
<jbjnr[m]>
but maybe the thread never suspended?
<hkaiser_>
in order to suspend, a thread has to go through the constructor of the unlock_guard
<hkaiser_>
also, if it was suspending with a held lock you'd see an exception
<jbjnr[m]>
I've looked at the code, been through the obvious logic. Come to a dead end - then I connected here to see if you had any thoughts. I will leave now and debug some more
<hkaiser_>
ok, no ideas here
<hkaiser_>
another thought - if you believe it to be a problem with our CV, try using a std::cv instead
<hkaiser_>
not nice, but functionally correct
<jbjnr[m]>
is that safe to suspend our threads with?
<hkaiser_>
(perhaps) ;-)
<hkaiser_>
you will block all progress on that core
<jbjnr[m]>
std::condition_variable works only with std::unique_lock<std::mutex>; this restriction allows for maximal efficiency on some platforms. std::condition_variable_any provides a condition variable that works with any BasicLockable object, such as std::shared_lock.
<jbjnr[m]>
so I'd better use cv_any
<jbjnr[m]>
std::
<hkaiser_>
nod
<hkaiser_>
just to verify that it's not our CV
<jbjnr[m]>
If I block the os thread, I'll get deadlocks for sure.
<jbjnr[m]>
not really an option
<hkaiser_>
yah, you might
<hkaiser_>
was just a thought
<jbjnr[m]>
anyway. I will dig deeper
<jbjnr[m]>
thanks anyway
<diehlpk_work>
hkaiser_, s390x finished as well
<jbjnr[m]>
hkaiser_: new information. There are 89 threads waiting on the CV in my current hung test. I will look again tomorrow and sleep on this. Not sure 89 is important, but they are all from exactly the same place in my code. (I'm dumping backtraces of suspended threads to try to understand the problem).
ahmed_ has joined #ste||ar
ahmed_ has quit [Quit: Connection closed for inactivity]
aacirino has left #ste||ar [#ste||ar]
diehlpk_work has quit [Remote host closed the connection]