hkaiser changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/ | GSoD: https://developers.google.com/season-of-docs/
weilewei has quit [Remote host closed the connection]
K-ballo has quit [Quit: K-ballo]
hkaiser has quit [Ping timeout: 264 seconds]
Guest65867 has quit [Ping timeout: 240 seconds]
Guest65867 has joined #ste||ar
Guest65867 has quit [Ping timeout: 265 seconds]
Guest65867 has joined #ste||ar
Coldblackice_ has joined #ste||ar
Coldblackice has quit [Ping timeout: 240 seconds]
Guest65867 has quit [Ping timeout: 252 seconds]
Guest65867 has joined #ste||ar
<simbergm> hkaiser, jbjnr, heller I may not be able to join the call today but please go ahead in any case, I'll try to join as soon as I'm free
rori has joined #ste||ar
Coldblackice has joined #ste||ar
Coldblackice_ has quit [Ping timeout: 240 seconds]
K-ballo has joined #ste||ar
hkaiser has joined #ste||ar
Guest65867 has quit [Ping timeout: 276 seconds]
Guest65867 has joined #ste||ar
Guest65867 has quit [Ping timeout: 240 seconds]
Guest65867 has joined #ste||ar
hkaiser has quit [Quit: bye]
weilewei has joined #ste||ar
hkaiser has joined #ste||ar
aserio has joined #ste||ar
nikunj has quit [Remote host closed the connection]
nikunj has joined #ste||ar
aserio has quit [Ping timeout: 245 seconds]
jaafar has quit [Ping timeout: 264 seconds]
aserio has joined #ste||ar
<weilewei> hkaiser without --hpx:threads=1, I got the following error
<zao> Ah, gist has a comment with pictures. Clever.
<weilewei> zao right, I just found it today
aserio has quit [Ping timeout: 264 seconds]
aserio has joined #ste||ar
aserio has quit [Ping timeout: 250 seconds]
<hkaiser> weilewei: that's not too helpful :/
<hkaiser> apparently happens inside CUDA
<weilewei> yea, I know, I cannot find any helpful info
<hkaiser> I tend to think that this is not our problem
<weilewei> Agreed...
rori has quit [Quit: WeeChat 1.9.1]
<weilewei> But I am not sure how to reproduce the problem with a smaller example either
<hkaiser> weilewei: I'd try to get in contact with the nvidia guys at your place
<weilewei> Ok, I will talk to Ronnie, and he will start the conversation among us
<weilewei> I will meet Ronnie this afternoon
<hkaiser> k
aserio has joined #ste||ar
weilewei has quit [Remote host closed the connection]
aserio has quit [Ping timeout: 264 seconds]
aserio has joined #ste||ar
weilewei has joined #ste||ar
<heller> weilewei: what's the actual error you are getting?
<weilewei> the assertion failed, number of expected finish walkers does not match with number of actual workers
<heller> and does it happen in the exceptional code path or in the normal one?
<weilewei> exceptional code path
<heller> where is the exception being thrown, and which exception is being thrown?
<heller> no
<weilewei> This line is being called, then the following assert failed too
<heller> no
<heller> that's not what I meant
<heller> this is happening because one of the tasks threw an exception
<weilewei> what do you mean exactly? Sorry for misunderstanding
<weilewei> correct
<heller> which task threw the exception?
<heller> Where was that exception thrown?
<weilewei> hmm, I need to run it again and check which task
weilewei has quit [Remote host closed the connection]
<heller> that's what we discussed a while back: figure out where that exception came from. This will bring you closer to a possible solution
<heller> Back then I said: "I bet this is a lock held during suspension error"
weilewei has joined #ste||ar
<weilewei> heller while I am running, each async task is associated with a thread id, so do you want to know which async id causes the problem?
<heller> no. I want to know at which file at which line the exception was thrown
<heller> or rather, you should want to know that
<weilewei> ok, let me check it when it hits the assertion failure
<heller> no
<heller> we already know where it hits that
<heller> find the location which throws the exception in the task that makes future::get throw the error
<weilewei> Yea, I am thinking that when I hit the assertion failure, I can find out where the failure comes from
<heller> nope
<heller> that's too late
<weilewei> oh... so I should put a breakpoint on the future::get?
<heller> no
<heller> "catch throw"
<weilewei> this line?
<heller> no
<heller> make the debugger stop whenever an exception is being thrown
<heller> in gdb, you do this by typing "catch throw"
<weilewei> oh, how about arm-forge, let me search a bit
<heller> ddt most likely has a similar option in its gui somewhere
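A minimal sketch of why "catch throw" is the right tool here. It uses std::async and std::future as stand-ins for the HPX equivalents (an assumption; the code under discussion uses HPX tasks), and the names failing_task and observe_at_get are hypothetical. The exception is thrown inside the task but only becomes visible to the caller at future::get, so a breakpoint at get or at the failing assertion is too late to see where it originated:

```cpp
#include <future>
#include <stdexcept>
#include <string>

// Hypothetical task that fails; stands in for the real task in the application.
int failing_task() {
    // With gdb's "catch throw" active, execution stops while this is thrown.
    throw std::runtime_error("thrown inside the task");
}

// Returns the message that the caller observes at future::get().
std::string observe_at_get() {
    std::future<int> f = std::async(std::launch::async, failing_task);
    try {
        f.get();  // the stored exception is rethrown here, far from its origin
    } catch (std::runtime_error const& e) {
        return e.what();
    }
    return "no exception";
}
```

Calling observe_at_get() under gdb with "catch throw" set before "run" should stop inside `__cxa_throw`, and "bt" at that point lists failing_task a frame or so further up, which is the location heller is asking for.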
<weilewei> I found it! There are two: stop at catch, stop at throw
<weilewei> which one should I choose
<weilewei> I can select multiple choices, so I just ticked them both
<weilewei> let me run the program again
<weilewei> heller sorry for misunderstanding, I was not aware that gdb has catch throw so I did not understand correctly
<heller> weilewei: I recommended that on October 30th already ;)
<weilewei> oops, it stops at the very beginning when I start the program
<heller> hit continue
<heller> or inspect the backtrace
<weilewei> a lot of continue
<weilewei> it seems that I need to run to the line before hpx async starts, otherwise it stops at every exception
<weilewei> heller what do you mean by saying inspect the backtrace
<heller> in your gist above, you only see `__cxa_throw` which is coming from somewhere in your C++ standard library. What you want to see however is the line in *your* code
<heller> each function call generates what is usually called a stack frame
<weilewei> Right
<heller> that is, you can see the entire call chain by inspecting the trace of function calls
<heller> aka backtrace
<heller> by inspecting I mean: Let it display and look at it.
<heller> and see if it is of any interest to you
<weilewei> ok
<weilewei> got it
<heller> sometimes it is also called a stack trace
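The idea of stack frames and a backtrace can be sketched with a toy model. This is only an illustration under stated assumptions: a real debugger walks the actual machine stack, whereas here each function registers itself by hand. All names (call_stack, frame_guard, backtrace_snapshot, innermost, middle, outermost) are hypothetical:

```cpp
#include <string>
#include <utility>
#include <vector>

// Toy call stack: one entry per currently active function call.
inline std::vector<std::string>& call_stack() {
    static thread_local std::vector<std::string> frames;
    return frames;
}

// RAII guard: pushes a frame on function entry and pops it on exit.
struct frame_guard {
    explicit frame_guard(std::string name) { call_stack().push_back(std::move(name)); }
    ~frame_guard() { call_stack().pop_back(); }
    frame_guard(frame_guard const&) = delete;
    frame_guard& operator=(frame_guard const&) = delete;
};

// Snapshot of the call chain, innermost frame first, in the order "bt" prints it.
std::vector<std::string> backtrace_snapshot() {
    auto const& frames = call_stack();
    return std::vector<std::string>(frames.rbegin(), frames.rend());
}

std::vector<std::string> innermost() {
    frame_guard g("innermost");
    return backtrace_snapshot();  // taken while all three frames are still active
}

std::vector<std::string> middle() {
    frame_guard g("middle");
    return innermost();
}

std::vector<std::string> outermost() {
    frame_guard g("outermost");
    return middle();
}
```

Calling outermost() returns the chain from the innermost call outwards. In a real backtrace the top frames are often library internals such as `__cxa_throw`; the frames of interest are the first ones further down that belong to your own code.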
<weilewei> is this an interesting error?
<weilewei> sorry, need to run to a meeting
<heller> how should I know?
<heller> please understand and apply what I said above
<weilewei> let me inspect later
<weilewei> sure
<heller> NB: lots of threads are throwing an exception at this point in time
aserio has quit [Ping timeout: 245 seconds]
aserio has joined #ste||ar
weilewei has quit [Remote host closed the connection]
jaafar has joined #ste||ar
aserio1 has joined #ste||ar
aserio has quit [Ping timeout: 250 seconds]
aserio1 is now known as aserio
weilewei has joined #ste||ar
weilewei has quit [Remote host closed the connection]
Coldblackice has quit []
Coldblackice has joined #ste||ar
aserio has quit [Quit: aserio]
<simbergm> weilewei: https://user-images.githubusercontent.com/22727965/68337125-bef64280-00d7-11ea-9cec-a34956e50998.png is definitely locks held while suspending like heller said
<simbergm> it also looks an awful lot like https://github.com/STEllAR-GROUP/hpx/pull/4171
<simbergm> what commit of hpx are you running?
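A hedged sketch of the general pattern behind "locks held while suspending", written with std::mutex and std::future as stand-ins for the HPX types (an assumption; this is not the code from the screenshot or from the linked pull request). The names state_mutex, shared_counter and update_then_wait are hypothetical. The point is to release the lock before any call that may block or suspend, such as future::get:

```cpp
#include <future>
#include <mutex>

std::mutex state_mutex;   // hypothetical lock protecting shared_counter
int shared_counter = 0;   // hypothetical shared state

// Updates shared state under the lock, then waits with no lock held.
int update_then_wait(std::future<int> pending) {
    {
        std::lock_guard<std::mutex> lock(state_mutex);
        ++shared_counter;
    }  // lock is released here, before the potentially blocking wait

    // Possible suspension point: calling get() inside the locked scope above
    // would be the "lock held while suspending" shape being discussed.
    return pending.get();
}
```

Keeping the locked region as a small inner scope makes it easy to check by eye that nothing inside it can wait on another task.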
jaafar has quit [Ping timeout: 245 seconds]
hkaiser has quit [Quit: bye]
weilewei has joined #ste||ar
<weilewei> wei2303030426
<weilewei> oops, wrong message
<weilewei> simbergm I used this one: 414380e50e55ed1f4ebfde57f3bda7018d6d1cf0