<weilewei>
99% tests passed, 3 tests failed out of 700

Total Test time (real) = 861.81 sec

The following tests FAILED:
    551 - tests.unit.threads.thread_stacksize (Failed)
    554 - tests.unit.topology.numa_allocator (Failed)
    634 - tests.examples.quickstart.1d_wave_equation (Failed)
<weilewei>
Is it something serious? Or if needed, how can I fix it. I mostly use hpx async, thread, future those facilities on DCA++ project for now
<hkaiser>
weilewei: the wave equation example shouldn't be problematic, the others - I don't know
<hkaiser>
but I'd assume that HPX is functional for now
<weilewei>
Ok, that's nice to know
<weilewei>
I will try to build DCA++ with newly built HPX and see how it works
<hkaiser>
weilewei: I will try to look into the failures
<hkaiser>
the numa-allocator worries me a bit, but this could be a problem in the test itself
<hkaiser>
weilewei: jbjnr would be the one to know actually
<weilewei>
hkaiser thanks!
<weilewei>
if you would like to test it, you can look into my directory: /gpfs/alpine/proj-shared/cph102/weile/dev/src/hpx/build_hwloc_Debug/
<diehlpk_work>
weilewei, The 1d_wave could be related to the python issue
<weilewei>
diehlpk_work ok, good to know
<diehlpk_work>
I will look into the python issues next week
<diehlpk_work>
Distributed failed because cmake could not find mpiexec
<diehlpk_work>
I will look into this as well next week
<heller>
The numa allocator stuff should not affect functionality, maybe just performance issues
<diehlpk_work>
For the second one, I think the issue is with my scripts and how we export things
<heller>
The stack size failure might be because we use too much stack space upfront (i.e. more overhead than on x86)
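(If the default stack really is tight on this platform, the hpx.stacks.* configuration keys are the usual knob. A minimal sketch, assuming the hpx::init overload that accepts extra configuration entries; the 0x20000 value is only illustrative, and the same key can also be passed on the command line as --hpx:ini=hpx.stacks.small_size=...)

    // Minimal sketch: raise the default (small) stack size for HPX threads.
    // The 0x20000 value is an illustrative guess, not a tuned number.
    #include <hpx/hpx_init.hpp>

    #include <string>
    #include <vector>

    int hpx_main(int argc, char** argv)
    {
        // ... application code ...
        return hpx::finalize();
    }

    int main(int argc, char** argv)
    {
        std::vector<std::string> const cfg = {
            "hpx.stacks.small_size=0x20000"    // bytes, as a hex string
        };
        return hpx::init(argc, argv, cfg);
    }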
<heller>
What is the python issue?
<diehlpk_work>
python can not find some system lib
<heller>
And how is this related to the failing example?
<diehlpk_work>
Starting hpx applications with the python script fails, because one import fails
<heller>
Then all tests would be marked as failed
<heller>
All tests are started through the wrapper
<diehlpk_work>
Ok, I have seen python errors for some of the tests
<heller>
Which import fails? Which python version are you using? Did you try to load a newer one or tried to do a pip install?
<diehlpk_work>
At least the distributed tests failed for me because mpiexec was not found, and the tcp ones failed with a python error
<heller>
If an import fails, it can't be just some
<heller>
Ok, what's the error?
<diehlpk_work>
heller, python 2.7 and I need to look into it in more detail
<heller>
What's your MPI implementation? How would you start an MPI program on the machine you're on?
<diehlpk_work>
I just got access to this power9 system on Tuesday
<heller>
The ibm job scheduler/MPI implementation does not work with hpxrun.py
<heller>
That's for sure
<diehlpk_work>
It is openmpi
<heller>
Then you should have an mpiexec/mpirun *somewhere*
<heller>
What's the batch system?
<diehlpk_work>
Yes, I think that I just forgot to export the path to mpiexec
<weilewei>
Well, not sure if I understand your discussion correctly, but on Summit they use jsrun, not mpiexec/mpirun
<diehlpk_work>
This all needs more investigation, but weilewei just needed hpx without networking
<weilewei>
Yea, I do not need networking for now
<heller>
I'm just saying what to expect...
<diehlpk_work>
For now we just wanted to compile hpx without networking on a different power9 system to see if we get the same segfault weilewei got on Summit
<heller>
Didn't we conclude that the problem was with a specific blas implementation?
<hkaiser>
heller: we did not
<heller>
Also, does the segfault happen as well on John's implementation?
<hkaiser>
shrug
<heller>
But it does work when changing the blas implementation, right?
<weilewei>
heller I have not found a solution for either yet
<weilewei>
yes, hpx works well with netlib-lapack on Summit, but not with essl (IBM's blas implementation; DCA++ tested it, and essl is faster than the other blas implementations)
<weilewei>
So, eventually, I still need HPX to work with essl
<weilewei>
I created a ticket with OLCF (the Summit help desk), and they found a similar issue from this May where someone used hpx with essl on Summit and ran into problems, but I am not sure if it is relevant or not
<weilewei>
So, the next step could be: investigate std::thread + essl (which is DCA++'s original version) vs. hpx + essl and see what's wrong.
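(A stripped-down reproducer along these lines might help separate the HPX runtime from the library; call_essl_kernel() is a hypothetical placeholder for whichever essl routine actually crashes inside DCA++, not a real API.)

    // Sketch: run the same BLAS-style call once from a std::thread and once
    // from an HPX task, to see whether only the HPX path segfaults.
    // call_essl_kernel() is a hypothetical stand-in for the real essl call.
    #include <hpx/hpx_main.hpp>
    #include <hpx/include/async.hpp>

    #include <iostream>
    #include <thread>

    void call_essl_kernel()
    {
        // placeholder: invoke the essl routine with the same arguments
        // that segfault inside DCA++
    }

    int main()
    {
        std::thread t(call_essl_kernel);      // baseline: plain OS thread
        t.join();
        std::cout << "std::thread path ok\n";

        hpx::async(call_essl_kernel).get();   // same call on an HPX thread
        std::cout << "hpx::async path ok\n";

        return 0;
    }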
<heller>
weilewei: there is no reason why HPX should not work with essl ... it is dca++ implemented on top of hpx that does not work
<heller>
does dca++ use any thread local storage?
<weilewei>
heller yea, I am trying to figure out why
<heller>
or essl?
<heller>
did you try running your stuff with the address or undefined sanitizer turned on?
<weilewei>
I am not sure; they even have a non-threading version, which works, heller
<weilewei>
heller I am not sure what that is, how do I turn it on and off in HPX?
<heller>
just build HPX and your application with the extra CMAKE_CXX_FLAGS
<weilewei>
Ok, I can try. With these flags, will I get extra information?
<heller>
those instrument your code with extra checks to detect undefined behavior or other memory related problems (like valgrind does), just faster and more accurate
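(In practice this means adding something like -fsanitize=address,undefined -fno-omit-frame-pointer -g to CMAKE_CXX_FLAGS for both HPX and the application; the exact flag set is an assumption, adjust it to the compiler in use. A tiny sketch of the class of bug ASan then reports at runtime:)

    // Reading one element past a heap allocation: ASan reports this as a
    // "heap-buffer-overflow" with a full stack trace, instead of letting it
    // silently corrupt memory or segfault somewhere else later.
    int main()
    {
        double* p = new double[8];
        double x = p[8];        // one past the end
        delete[] p;
        return static_cast<int>(x);
    }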
<weilewei>
ok, I can try; what I saw and discussed with Dr. Kaiser before is that all input args to the essl function call are valid, but it still segfaults
<weilewei>
for essl, a commercial library, I do not have a debug version or access to its function call stacks.
<weilewei>
heller I will try your suggestions now and let you know
<heller>
weilewei: which version of ESSL are you linking against?
<weilewei>
I tried both the serial and smp versions, same error at the same place, heller
<jaafar>
Is there a good resource that explains how dataflow gets scheduled? As in, how they are chosen when more than one are "ready"...
* jaafar
is trying to understand the scan partitioner
<heller>
could you choose between 32 bit integer, 64 bit pointer environment and 64 bit integer, 64 bit pointer environment?
<heller>
jaafar: they are chosen by the scheduler, they don't get any special treatment
<weilewei>
heller yes I can try all versions, should I try all of them?
<heller>
you have to choose the one that fits your environment
<heller>
this should be exactly *one*
<heller>
what does the summit user guide have to say about this?
<jaafar>
heller: OK, but what if two are ready... is it the order they were created in?
<jaafar>
and where can I find that code :)
<heller>
jaafar: not necessarily, if there is another core stealing the second...
<heller>
jaafar: it's the scheduler implementation ;)
<jaafar>
great! What file should I look in for that?
<heller>
there are multiple..
<heller>
one sec
<heller>
jaafar: what are you trying to figure out?
<jaafar>
why exclusive_scan is 20-25% slower in parallel
<jaafar>
Right now I'm looking at the scheduling... seems like it could be improved to reduce cache thrashing
<heller>
I wouldn't think it has anything to do with the scheduling decisions
<heller>
hmm
<heller>
as you mentioned in your ticket
<jaafar>
there are two phases that operate on the same data
<jaafar>
these tend to get separated with other work put in between
<heller>
if the working set is correctly chosen, and the algorithm itself is cache friendly, the scheduling decision should be irrelevant
<jaafar>
which (I expect) would cause the whole set to get reloaded
<heller>
could be indeed
<jaafar>
I believe the working set is also suboptimal
<heller>
however: you probably won't figure out a way to fix that in the current scheduling implementation
<heller>
I'd start there
<jaafar>
heller: at least I will know
<weilewei>
heller well, the summit user guide does not say much
<heller>
what did the people who figured out that essl is the fastest option for dca++ use?
<weilewei>
In DCA's implementation, they link against libessl.so
<weilewei>
probably they ran some performance tests?
<heller>
jaafar: that's probably where you'll end up
<heller>
weilewei: I don't care much about the performance numbers ... you are saying it ain't working for you. I am trying to figure out why
<heller>
and obviously it didn't work for another person
<heller>
so I guess you use libessl.so too? I assume it is a symlink to one of the variants?
<jaafar>
OK I will read it. I'm hoping to end up with some way of tweaking the execution so that "phase 2" for a single chunk is chosen over "phase 1" for a different chunk if both are ready
<jaafar>
thus reducing the separation between use of the same data
<heller>
ok, that can be done, in principle
<heller>
if, say, you want to execute the continuation passed into dataflow as soon as all inputs are ready
<weilewei>
heller yes, I also linked to libesslsmp.so, same error and failed at the same place
<heller>
jaafar: you should use the fork policy for that, IIRC, that should put the continuation as the next thread to schedule
<jaafar>
interesting OK
<jaafar>
the scan_partitioner presently uses sync
<heller>
sync means it is executed directly afterwards, without even starting a new task
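(A minimal sketch of the difference, with made-up futures and lambdas, assuming hpx::dataflow accepts the usual launch policies:)

    // sync runs the continuation inline when the last input becomes ready;
    // fork schedules it as the next task on the current worker, which tends
    // to keep the inputs' data hot in cache. Inputs here are illustrative.
    #include <hpx/hpx_main.hpp>
    #include <hpx/include/async.hpp>
    #include <hpx/include/lcos.hpp>

    #include <utility>

    int main()
    {
        hpx::future<int> f1 = hpx::async([] { return 1; });
        hpx::future<int> f2 = hpx::async([] { return 2; });
        hpx::future<int> sum_sync = hpx::dataflow(hpx::launch::sync,
            [](hpx::future<int> a, hpx::future<int> b) { return a.get() + b.get(); },
            std::move(f1), std::move(f2));

        hpx::future<int> g1 = hpx::async([] { return 3; });
        hpx::future<int> g2 = hpx::async([] { return 4; });
        hpx::future<int> sum_fork = hpx::dataflow(hpx::launch::fork,
            [](hpx::future<int> a, hpx::future<int> b) { return a.get() + b.get(); },
            std::move(g1), std::move(g2));

        return (sum_sync.get() + sum_fork.get() == 10) ? 0 : 1;
    }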
<heller>
weilewei: you probably don't want the smp variants
<weilewei>
Yes, the sysadmin told me that the smp variant is implemented on top of openmp under the hood, which might interfere with hpx threads
<heller>
next thing to do would be to try to figure out which call is causing trouble
<heller>
that is, comment out everything, then comment stuff back in step by step until it crashes
<heller>
eventually, you'll find the spot
<weilewei>
hmm, the call stack goes from a boost context switch all the way down to an essl function call, that's what I have in mind
<weilewei>
heller sure I will try that
<heller>
the one that you showed me the other day?
<hkaiser>
weilewei: from what I see just a ton of memory leaks in their code
<hkaiser>
did this run crash?
<heller>
Yeah, just leaks
<weilewei>
The test failed, but I do not see this run outputting anything from the test
<weilewei>
It should output some scientific results
<hkaiser>
so it just crashed ...
<hkaiser>
beautiful
<weilewei>
yea, beautiful...
<weilewei>
what should I do next?
<heller>
well, fix the leaks ;)
<weilewei>
wait... what... so many leaks
<heller>
ok, here is another suggestion
<heller>
export ASAN_OPTIONS=detect_leaks=0
<heller>
weilewei: ^^
<weilewei>
then run again?
<heller>
Sure
<weilewei>
it always complains: ==51069==ASan runtime does not come first in initial library list; you should either link runtime to your application or manually preload it with LD_PRELOAD.
hkaiser has quit [Ping timeout: 250 seconds]
<weilewei>
But I have done export LD_PRELOAD=$OLCF_GCC_ROOT/lib64/libasan.so
<weilewei>
which is valid
<weilewei>
heller
<heller>
Sorry, no idea
weilewei has quit [Remote host closed the connection]
hkaiser has joined #ste||ar
<jaafar>
Is there any difference between how work created by "async_execute" or "dataflow" is scheduled?
<jaafar>
Besides, I guess, the fact that async_execute doesn't have preconditions AFAICT
<jaafar>
like, imagine there was a "dataflow" whose inputs were all available, vs something I created with async_execute
<hkaiser>
jaafar: async_execute is lower-level
<jaafar>
Is there any difference in how that work gets scheduled?
<hkaiser>
dataflow uses executors to do its job
<hkaiser>
so dataflow uses async_execute anyways
<jaafar>
so I guess there is no difference in how it gets scheduled?
* jaafar
is looking at partition_scan
<jaafar>
make that scan_partitioner
<jaafar>
the first phase is all created with async_execute
<jaafar>
the second and third are all dataflow
<jaafar>
It seems that work created with async_execute is generally preferred to dataflow by the scheduler
<jaafar>
Not consistently, but enough to cause some cache thrashing as data is moved out and then back in again
<jaafar>
thus my interest in what gets scheduled :)
<jaafar>
Is there any way to influence what work gets run first?
<hkaiser>
jaafar: create additional dependencies between the futures
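(A hedged sketch of that idea, with made-up phase1/phase2 helpers standing in for the real scan phases: making phase 2 of a chunk consume the future produced by phase 1 of the same chunk, and launching it with the fork policy, keeps the pair together instead of letting unrelated phase-1 work run in between.)

    // phase1/phase2 and the int "chunks" are illustrative only; the point is
    // the explicit future dependency per chunk plus the fork launch policy.
    #include <hpx/hpx_main.hpp>
    #include <hpx/include/async.hpp>
    #include <hpx/include/lcos.hpp>

    #include <utility>
    #include <vector>

    int phase1(int chunk)   { return chunk * 2; }   // stand-in for real work
    int phase2(int partial) { return partial + 1; } // stand-in for real work

    int main()
    {
        std::vector<hpx::future<int>> results;
        for (int chunk = 0; chunk != 8; ++chunk)
        {
            hpx::future<int> p1 = hpx::async(phase1, chunk);

            // phase2 of this chunk depends only on phase1 of the same chunk,
            // so the scheduler is nudged to run them back to back
            results.push_back(hpx::dataflow(hpx::launch::fork,
                [](hpx::future<int> f) { return phase2(f.get()); },
                std::move(p1)));
        }

        hpx::wait_all(results);
        return 0;
    }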