hkaiser changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/ | GSoC2018: https://wp.me/p4pxJf-k1
K-ballo has quit [Quit: K-ballo]
hkaiser has quit [Quit: bye]
nanashi55 has quit [Ping timeout: 264 seconds]
nanashi55 has joined #ste||ar
anushi has quit [Ping timeout: 248 seconds]
anushi has joined #ste||ar
anushi has quit [Remote host closed the connection]
anushi has joined #ste||ar
anushi has quit [Remote host closed the connection]
anushi has joined #ste||ar
anushi has quit [Remote host closed the connection]
anushi has joined #ste||ar
anushi has quit [Remote host closed the connection]
anushi has joined #ste||ar
jaafar has quit [Ping timeout: 256 seconds]
zura has joined #ste||ar
jakub_golinowski has joined #ste||ar
david_pfander has joined #ste||ar
anushi has quit [Ping timeout: 240 seconds]
anushi has joined #ste||ar
anushi has quit [Remote host closed the connection]
anushi has joined #ste||ar
Anushi1998 has joined #ste||ar
nikunj97 has joined #ste||ar
mcopik has joined #ste||ar
anushi has quit [Ping timeout: 240 seconds]
anushi has joined #ste||ar
anushi has quit [Read error: Connection reset by peer]
anushi has joined #ste||ar
anushi has quit [Ping timeout: 264 seconds]
anushi has joined #ste||ar
anushi has quit [Ping timeout: 240 seconds]
hkaiser has joined #ste||ar
Anushi1998 has quit [Ping timeout: 245 seconds]
K-ballo has joined #ste||ar
jakub_golinowski has left #ste||ar ["Ex-Chat"]
jakub_golinowski has joined #ste||ar
<jakub_golinowski>
M-ms, yt?
hkaiser has quit [Quit: bye]
<M-ms>
jakub_golinowski: shortly
<jakub_golinowski>
M-ms, so I ran the tests on both the hpx and start_stop backends
<jakub_golinowski>
The hpx backend is expected to fail every test that calls parallel_for
<jakub_golinowski>
from the log file this is 10'240 tests that failed
<jakub_golinowski>
As for the hpx start_stop backend, it passes some tests that the hpx backend fails, but then hangs on the KMeans test
<jakub_golinowski>
this is HPX hanging on the stop() call
<jakub_golinowski>
is it possible that there is a race condition there or something like that? The backend is started and stopped multiple times in very short intervals before it hangs
<M-ms>
That's possible, yes
<M-ms>
Did everything pass with the existing backends?
hkaiser has joined #ste||ar
<jakub_golinowski>
with pthreads and tbb nearly all passed
<M-ms>
Does it hang every time on the same test?
<jakub_golinowski>
M-ms, yes it hangs always on the KMeans test
<M-ms>
There may also be problems with either nested parallelism or multiple opencv threads calling parallel_for (but that sounds unlikely)
<M-ms>
Did you try including hpx_main?
<jakub_golinowski>
and for the pthreads and tbb only 2 tests fail
<jakub_golinowski>
I am just mentioning it because it happened just now
<M-ms>
You could try to check with the debugger if any thread is obviously stuck somewhere
<jakub_golinowski>
M-ms, so this is what I am doing, I run it in the debugger
<jakub_golinowski>
(the hanging startstop backend)
<jakub_golinowski>
and it hangs on t.join() inside hpx
<jakub_golinowski>
I can reproduce it
<jakub_golinowski>
but from the stack trace it seemed like a normal execution path
<M-ms>
In remove_thread (or something like that)?
<jakub_golinowski>
and no exception was thrown, just infinitely waiting in t.join()
<M-ms>
Yeah, that's most likely where it's waiting for the scheduler and thread to finish
<M-ms>
It also happens every time?
<jakub_golinowski>
I ran it 2 or 3 times and it always happened
<M-ms>
I'm clueless right now, some hpx thread is not returning/is still waiting for some other thread
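[editor's sketch] A join() that never returns usually means the joined thread is itself still blocked waiting on something. The sketch below uses only the C++ standard library, not HPX; the class `toy_backend`, its members, and `run_cycles` are hypothetical names made up for illustration. It mirrors the start()/stop()/join() shape discussed above and shows why stop() has to wake the worker before joining it.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical stand-in for a backend that is started and stopped repeatedly.
// This is NOT HPX code; it only mirrors the start()/stop()/join() shape.
class toy_backend {
public:
    void start() {
        assert(!worker_.joinable());  // must not be started twice
        stop_requested_ = false;      // safe: no worker thread exists yet
        worker_ = std::thread([this] {
            std::unique_lock<std::mutex> lock(mutex_);
            // Sleep until stop() asks the worker to exit.
            cv_.wait(lock, [this] { return stop_requested_; });
        });
    }

    void stop() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stop_requested_ = true;
        }
        cv_.notify_one();  // without this wake-up, join() below may never return
        worker_.join();
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    bool stop_requested_ = false;
    std::thread worker_;
};

// Start and stop the toy backend many times in quick succession
// and report how many cycles completed.
inline int run_cycles(int n) {
    toy_backend backend;
    int completed = 0;
    for (int i = 0; i < n; ++i) {
        backend.start();
        backend.stop();
        ++completed;
    }
    return completed;
}
```

Calling `run_cycles(200)` exercises the rapid start/stop pattern; whether the real hang has this cause is not established in the discussion.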
<jakub_golinowski>
first I ran it through ctest and then through opencv_test_core with the debugger
<M-ms>
Ok
<jakub_golinowski>
In both cases it hung on the KMeans test, but now I have a little more knowledge about it
<M-ms>
Can you check that hpx::finalize gets called?
<jakub_golinowski>
Ok
<jakub_golinowski>
btw with hpx_main the test KMeans passed
<M-ms>
Well, it should've been called if you get to the point that you're joining threads... But do check in any case
<M-ms>
Hmm, promising
<M-ms>
Can you tell if the kmeans test is structured somehow differently than the ones that pass?
<jakub_golinowski>
It seems it just calls kmeans numerous times with some different parameters
<M-ms>
I have to go now though, I will try to think about what could be going on but I suspect I'll have to look at the code to be able to say anything
<M-ms>
One thing to try with these things is to see if you can remove parts of the tests and see if it still fails (binary search style)
<M-ms>
Then check if it fails with specific parameters
<jakub_golinowski>
but you mean that previous tests might put the program in such a state that it fails at the KMeans test
<jakub_golinowski>
?
<jakub_golinowski>
Also I am not sure I can include hpx_main.hpp at a finer granularity than the whole module
<jakub_golinowski>
Because the test binaries are per module and this is the place where main() is located
anushi has quit [Ping timeout: 255 seconds]
anushi has joined #ste||ar
eschnett has joined #ste||ar
nanashi55 has quit [Ping timeout: 240 seconds]
nanashi55 has joined #ste||ar
rtohid has joined #ste||ar
diehlpk has joined #ste||ar
<M-ms>
jakub_golinowski: ah, no, that was badly worded. I was thinking of that test in particular, i.e. remove parts of that test and see if it still fails, but that might be somewhat time consuming. Easier would be to first check if the KMeans test is stuck at the same point all the time.
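[editor's sketch] The "binary search style" narrowing suggested above could look roughly like this. It is a minimal sketch under a strong assumption: exactly one case (for example one parameter set of the test) triggers the failure, and the hypothetical predicate `fails` returns true exactly when the subset it is given contains that case. `bisect_failure` is an illustrative name, not an OpenCV or HPX function.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Narrow a failing set of cases down to a single culprit by halving.
// Assumption: fails(subset) is true exactly when subset contains the culprit,
// and the full input is known to fail.
template <typename T, typename Pred>
T bisect_failure(std::vector<T> cases, Pred fails) {
    assert(!cases.empty());
    while (cases.size() > 1) {
        const auto mid =
            cases.begin() + static_cast<std::ptrdiff_t>(cases.size() / 2);
        std::vector<T> first(cases.begin(), mid);
        if (fails(first)) {
            cases = std::move(first);          // culprit is in the first half
        } else {
            cases.erase(cases.begin(), mid);   // otherwise it is in the second half
        }
    }
    return cases.front();
}
```

Each round reruns the test on half of the remaining cases, so the number of reruns grows with the logarithm of the number of cases. If the hang depends on a combination of cases or on timing, this assumption breaks and the result is not meaningful.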
diehlpk has quit [Ping timeout: 268 seconds]
anushi has quit [Remote host closed the connection]
mbremer has joined #ste||ar
<jakub_golinowski>
M-ms, so I am pretty convinced now it stops at the same point each time so I will proceed to looking into the test itself
galabc has joined #ste||ar
galabc has quit [Quit: Leaving]
jakub_golinowski has quit [Ping timeout: 240 seconds]
<diehlpk_work>
hkaiser, see pm
<nikunj97>
how can we know the reason for a failing test?
<nikunj97>
when we run: make test
mcopik has quit [Ping timeout: 240 seconds]
<hkaiser>
nikunj97: run the test that fails specifically, possibly using a debugger
<nikunj97>
hkaiser, how can I do that?
<nikunj97>
hkaiser, I tried my implementation on unit/actions and return_future is failing (rest are passing)
<hkaiser>
just directly run the failing test executable
<nikunj97>
ohk
david_pfander has quit [Quit: david_pfander]
<K-ballo>
there should be some output pointing to the check that failed: the expression, the file, the line, etc.
<K-ballo>
if using ctest then try ctest --output-on-failure to see output for failed tests
<nikunj97>
K-ballo, thanks
<nikunj97>
I'm getting the following error: hpx::resource::get_partitioner() can be called only after the resource partitioner has been allowed to parse the command line options.
<nikunj97>
Seems like a config error instead of runtime error
<nikunj97>
*runtime system error
galabc has joined #ste||ar
galabc has quit [Client Quit]
diehlpk has joined #ste||ar
jakub_golinowski has joined #ste||ar
diehlpk has quit [Ping timeout: 248 seconds]
jaafar has joined #ste||ar
jbjnr has joined #ste||ar
diehlpk_work has quit [Quit: Leaving]
zura has quit [Quit: Leaving]
<nikunj97>
can anyone explain what the error means: the component is disabled for this locality (component_invalid[-1]): HPX(bad_request)
eschnett has quit [Quit: eschnett]
nanashi55 has quit [Ping timeout: 260 seconds]
nanashi55 has joined #ste||ar
hkaiser has quit [Quit: bye]
akheir has quit [Quit: Leaving]
hkaiser has joined #ste||ar
eschnett has joined #ste||ar
<nikunj97>
hkaiser, yt?
<nikunj97>
there seem to be initialization sequencing problems occurring with my implementation (which runs for global objects)
galabc has joined #ste||ar
<hkaiser>
nikunj97: ok
<nikunj97>
hkaiser, I'll try to correct them
<nikunj97>
hkaiser, things are working fine without including global object implementation though
<hkaiser>
nikunj97: nice
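[editor's sketch] One general reason global objects cause sequencing problems is that they are constructed before main() starts, so anything set up from main() is not yet available to their constructors. The sketch below is standard C++ only and makes no claim about how HPX or nikunj97's implementation works; `runtime_ready`, `init_runtime`, `needs_runtime` and `global_probe` are hypothetical names.

```cpp
#include <cassert>

// Hypothetical flag standing in for "the runtime has been initialized".
// In this sketch it is only set by init_runtime(), which would be called from main().
inline bool& runtime_ready() {
    static bool ready = false;  // function-local static: initialized on first use
    return ready;
}

inline void init_runtime() { runtime_ready() = true; }

// A type whose constructor needs the runtime; it records what it observed.
struct needs_runtime {
    bool saw_runtime_ready;
    needs_runtime() : saw_runtime_ready(runtime_ready()) {}
};

// Constructed during static initialization, i.e. before main() starts,
// so it runs before any init_runtime() call made from main().
static needs_runtime global_probe;
```

An object of the same type created after init_runtime() has run observes the flag as set, while `global_probe` does not; deferring the work until first use is one common way around this.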
jakub_golinowski has quit [Ping timeout: 248 seconds]