hkaiser changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/ | GSoD: https://developers.google.com/season-of-docs/
<jaafar> hkaiser: I have that MWE now. Shall I make a gist? File an issue?
<hkaiser> jaafar: does it relate to that existing ticket?
<hkaiser> if yes, pls add it there, otherwise pls create a new ticket
<jaafar> peripherally... but it's really a separate thing
<jaafar> OK I'll make a new one
<jaafar> hope I'm not just using it wrong :)
nikunj has quit [Ping timeout: 245 seconds]
<jaafar> There I go, #4118
nikunj has joined #ste||ar
<hkaiser> thanks jaafar, will look tomorrow
K-ballo has quit [Quit: K-ballo]
nikunj97 has joined #ste||ar
nikunj has quit [Read error: Connection reset by peer]
diehlpk has joined #ste||ar
nikunj97 has quit [Remote host closed the connection]
hkaiser has quit [Ping timeout: 245 seconds]
diehlpk has quit [Ping timeout: 245 seconds]
weilewei has joined #ste||ar
rori has joined #ste||ar
nikunj has joined #ste||ar
lsl88 has quit [Quit: Leaving.]
nikunj has quit [Ping timeout: 240 seconds]
K-ballo has joined #ste||ar
lsl88 has joined #ste||ar
diehlpk has joined #ste||ar
diehlpk has quit [Ping timeout: 245 seconds]
hkaiser has joined #ste||ar
<hkaiser> heller: serialization branch is back in good shape, compiles again
<hkaiser> the problem would have disappeared for you however, once you reimplemented the extra data
<heller> hkaiser: sure, that's what I did. Just wanted to have a point of comparison
<heller> And the movable_any would have been defective in any case
<heller> BTW, can we name it unique_any?
<K-ballo> mofun, moany
<hkaiser> heller: feel free to bikeshed and to rename as much as you like
<hkaiser> gtg
<K-ballo> -1 on movable_ from me too
<K-ballo> the time machine option would be to call `any` the one that only requires Destructible, and `any_copyable` the one that requires Copyable
<K-ballo> failing that, unique_ is not ideal but should be easily understandable given unique_function
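To make the distinction under discussion concrete, here is a minimal, illustrative sketch of a move-only "unique_any": the class name and interface below are hypothetical, not HPX's actual implementation; the point is that it only requires the stored type to be Destructible (plus MoveConstructible to put it in), whereas std::any requires CopyConstructible.

    #include <memory>
    #include <typeinfo>
    #include <utility>

    class unique_any
    {
        struct base
        {
            virtual ~base() = default;
            virtual std::type_info const& type() const = 0;
        };

        template <typename T>
        struct holder : base
        {
            explicit holder(T value) : value_(std::move(value)) {}
            std::type_info const& type() const override { return typeid(T); }
            T value_;
        };

        std::unique_ptr<base> ptr_;    // move-only storage; copying is deleted

    public:
        unique_any() = default;
        unique_any(unique_any&&) = default;
        unique_any& operator=(unique_any&&) = default;

        template <typename T>
        unique_any(T value) : ptr_(new holder<T>(std::move(value))) {}

        bool has_value() const { return ptr_ != nullptr; }

        template <typename T>
        T& cast()    // unchecked cast, for brevity only
        {
            return static_cast<holder<T>&>(*ptr_).value_;
        }
    };

    int main()
    {
        // std::any cannot hold a move-only type like unique_ptr; this can.
        unique_any a(std::make_unique<int>(42));
        unique_any b(std::move(a));
        return *b.cast<std::unique_ptr<int>>() == 42 ? 0 : 1;
    }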
hkaiser has quit [Ping timeout: 250 seconds]
aserio has joined #ste||ar
weilewei has quit [Remote host closed the connection]
<zao> I still have failures on the s11n branch on my GCC 8.3 and libstdc++.
<zao> heller ^
<heller> zao: hmm, so it ain't fixed yet?
<heller> Bummer
tianyi93 has joined #ste||ar
lsl88 has quit [Quit: Leaving.]
weilewei has joined #ste||ar
K-ballo has quit [Quit: K-ballo]
aserio has quit [Ping timeout: 250 seconds]
quaz0r has quit [Ping timeout: 276 seconds]
rori has quit [Quit: WeeChat 1.9.1]
quaz0r has joined #ste||ar
hkaiser has joined #ste||ar
hkaiser has quit [Client Quit]
hkaiser has joined #ste||ar
weilewei has quit [Remote host closed the connection]
<diehlpk_work> hkaiser, You should get access to the Power9 machine within the next few days
nikunj has joined #ste||ar
weilewei has joined #ste||ar
<weilewei> I ran hpx tests on Summit and found 3 failed tests; report here: https://gist.github.com/weilewei/33907c41b1393da3f7296586be9d1258
<weilewei> summary:
<weilewei> 99% tests passed, 3 tests failed out of 700
<weilewei> Total Test time (real) = 861.81 sec
<weilewei> The following tests FAILED:
<weilewei> 551 - tests.unit.threads.thread_stacksize (Failed)
<weilewei> 554 - tests.unit.topology.numa_allocator (Failed)
<weilewei> 634 - tests.examples.quickstart.1d_wave_equation (Failed)
<weilewei> Is it something serious? Or, if needed, how can I fix it? I mostly use the hpx async, thread, and future facilities in the DCA++ project for now
<hkaiser> weilewei: the wave equation example shouldn't be problematic, the others - I don't know
<hkaiser> but I'd assume that HPX is functional for now
<weilewei> Ok, that's nice to know
<weilewei> I will try to build DCA++ with newly built HPX and see how it works
<hkaiser> weilewei: I will try to look into the failures
<hkaiser> the numa-allocator worries me a bit, but this could be a problem in the test itself
<hkaiser> weilewei: jbjnr would be the one to know actually
<weilewei> hkaiser thanks!
<weilewei> if you would like to test it, you can look into my directory: /gpfs/alpine/proj-shared/cph102/weile/dev/src/hpx/build_hwloc_Debug/
<diehlpk_work> weilewei, The 1d_wave could be related to the python issue
<weilewei> diehlpk_work ok, good to know
<diehlpk_work> I will look into the python issues next week
<diehlpk_work> Distributed failed because cmake could not find mpiexec
<diehlpk_work> I will look into this as well next week
<heller> The numa allocator stuff should not affect functionality, maybe just performance issues
<diehlpk_work> For the second one, I think the issue is with my scripts and how we export things
<heller> The stack size might be because we use too much stack space upfront (aka more overhead than on x86)
<heller> What is the python issue?
<diehlpk_work> python can not find some system lib
<heller> And how is this related to the failing example?
<diehlpk_work> Starting hpx applications with the python script fail, because one import fails
<heller> Then all tests would be marked as failed
<heller> All tests are started through the wrapper
<diehlpk_work> Ok, I have seen python errors for some of the tests
<heller> Which import fails? Which python version are you using? Did you try to load a newer one or tried to do a pip install?
<diehlpk_work> At least the distributed test failed for me because it could not find mpiexec, and the tcp one failed with a python error
<heller> If an import fails, it can't be just some of the tests
<heller> Ok, what's the error?
<diehlpk_work> heller, python 2.7 and I need to look into it in more detail
<heller> What's your MPI implementation? How would you start an MPI program on the machine you're on?
<diehlpk_work> I just got access to this power9 system on Tuesday
<heller> The ibm job scheduler/MPI implementation does not work with hpxrun.py
<heller> That's for sure
<diehlpk_work> It is openmpi
<heller> Then you should have an mpiexec/mpirun *somewhere*
<heller> What's the batch system?
<diehlpk_work> Yes, I think that I just forgot to export the path to mpiexec
<weilewei> Well, I'm not sure if I understand your discussion correctly; on Summit they use jsrun, not mpiexec/mpirun
<diehlpk_work> This all needs more investigation, but weilewei just needed hpx without networking
<weilewei> Yea, I do not need networking for now
<heller> I'm just saying what to expect...
<diehlpk_work> For now we just wanted to compile hpx without networking on a different power9 system to see if we get the same segfault as weilewei got on Summit
<heller> Didn't we conclude that the problem was with a specific blas implementation?
<hkaiser> heller: we did not
<heller> Also, does the segfault happen as well on John's implementation?
<hkaiser> shrug
<heller> But it does work when changing the blas implementation, right?
<weilewei> heller I have not found solutions for either yet
<weilewei> yes, hpx works well with netlib-lapack on Summit, but not with essl (the IBM-specific blas implementation; DCA++ tested it, and essl is faster than the other blas implementations)
<weilewei> So, eventually, I still need HPX to work with essl
<weilewei> I created a ticket with OLCF (the Summit help desk), and they found a similar issue from this May where someone used hpx with essl and ran into problems on Summit, but I'm not sure if it is relevant or not
<weilewei> So, the next step could be: investigate std thread + essl (which is DCA++'s original version) vs. hpx + essl and see what's wrong.
<heller> weilewei: there is no reason why HPX should not work with essl ... it is dca++ implemented on top of hpx that does not work
<heller> does dca++ use any thread local storage?
<weilewei> heller yea, I am trying to figure out why
<heller> or essl?
<heller> did you try running your stuff with the address or undefined sanitizer turned on?
<weilewei> I am not sure; they even have a non-threading version, which works, heller
<weilewei> heller I am not sure what that is, how do I turn it on and off in HPX?
<heller> -DHPX_WITH_SANITIZERS=On -DCMAKE_CXX_FLAGS="-fsanitize=address -fsanitize=undefined"
<heller> use that to configure HPX
<heller> and your application just with the extra CMAKE_CXX_FLAGS
<weilewei> Ok, I can try. With these flags, will I get extra information?
<heller> those instrument your code with special stuff to see if you run into undefined behavior or other memory related problems (like valgrind), just faster and more accurate
<weilewei> ok, I can try, because what I saw and discussed with Dr. Kaiser before is that all input args to the essl function call are valid, but it still causes a segfault
<weilewei> for essl, a commercial library, I do not have a debug version or access to its function call stacks.
<weilewei> heller I will try your suggestions now and let you know
<heller> weilewei: which version of the ESSL are you linking against?
<weilewei> I tried both the serial and smp versions, same errors in the same place heller
<jaafar> Is there a good resource that explains how dataflow gets scheduled? As in, how they are chosen when more than one are "ready"...
* jaafar is trying to understand the scan partitioner
<heller> could you choose between a 32-bit integer / 64-bit pointer environment and a 64-bit integer / 64-bit pointer environment?
<heller> jaafar: they are chosen by the scheduler, they don't get any special treatment
<weilewei> heller yes I can try all versions, should I try all of them?
<heller> you have to choose the one that fits your environment
<heller> this should be exactly *one*
<heller> what does the summit user guide have to say about this?
<jaafar> heller: OK but what if there are two... order created?
<jaafar> and where can I find that code :)
<heller> jaafar: not necessarily, if there is another core stealing the second...
<heller> jaafar: it's the scheduler implementation ;)
<jaafar> great! What file should I look in for that?
<heller> there are multiple..
<heller> one sec
<heller> jaafar: what are you trying to figure out?
<jaafar> why exclusive_scan is 20-25% slower in parallel
<jaafar> Right now I'm looking at the scheduling... seems like it could be improved to reduce cache thrashing
<heller> I wouldn't think it has anything to do with the scheduling decisions
<heller> hmm
<heller> as you mentioned in your ticket
<jaafar> there are two phases that operate on the same data
<jaafar> these tend to get separated with other work put in between
<heller> if the working set is correctly chosen, and the algorithm itself is cache friendly, the scheduling decision should be irrelevant
<jaafar> which (I expect) would cause the whole set to get reloaded
<heller> could be indeed
<jaafar> I believe the working set is also suboptimal
<heller> however: you probably won't figure out a way to fix that in the current scheduling implementation
<heller> I'd start there
<jaafar> heller: at least I will know
<weilewei> heller well, the summit user guide does not say much
<heller> what did the people who figured out that essl is the fastest option for dca++ use?
<weilewei> In DCA's implementation, they link against libessl.so
<weilewei> probably they run some performance test?
<jaafar> thanks!
<heller> jaafar: that's where the tasks get popped from the scheduler
<weilewei> @heller essl is an IBM product, so I guess essl is highly optimized for IBM machines? I am not sure what the exact performance numbers are
<heller> jaafar: that's probably where you'll end up
<heller> weilewei: I don't care much about the performance numbers ... you are saying it ain't working for you. I am trying to figure out why
<heller> and obviously it didn't work for another person
<heller> so I guess you use libessl.so too? I assume it is a symlink to one of the variants?
<jaafar> OK I will read it. I'm hoping to end up with some way of tweaking the execution so that "phase 2" for a single chunk is chosen over "phase 1" for a different chunk if both are ready
<jaafar> thus reducing the separation between use of the same data
<heller> ok, that can be done, in principle
<heller> if say, you want to execute the continuation passed into dataflow, once all inputs are ready
<weilewei> heller yes, I also linked to libesslsmp.so, same error and failed at the same place
<heller> jaafar: you should use the fork policy for that, IIRC, that should put the continuation as the next thread to schedule
<jaafar> interesting OK
<jaafar> the scan_partitioner presently uses sync
<heller> sync means it is directly executed afterwards, without even starting a new task
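For reference, a minimal sketch of what is being discussed here: passing a launch policy to hpx::dataflow. hpx::launch::sync runs the continuation inline once its inputs are ready, while hpx::launch::fork, per heller's recollection above, hints that the continuation should be scheduled right away. The two-phase structure below is only illustrative, not the actual scan_partitioner code.

    #include <hpx/hpx.hpp>
    #include <hpx/hpx_main.hpp>

    #include <numeric>
    #include <utility>
    #include <vector>

    int main()
    {
        std::vector<int> chunk(1 << 16, 1);

        // "phase 1": produce a partial result for the chunk
        hpx::future<long> phase1 = hpx::async([&chunk] {
            return std::accumulate(chunk.begin(), chunk.end(), 0L);
        });

        // "phase 2": consumes the same chunk; the launch policy passed to
        // dataflow controls how the continuation runs once phase1 is ready
        hpx::future<long> phase2 = hpx::dataflow(
            hpx::launch::fork,    // or hpx::launch::sync, as the scan_partitioner uses
            [&chunk](hpx::future<long> partial) {
                return partial.get() + static_cast<long>(chunk.size());
            },
            std::move(phase1));

        return phase2.get() == 2L * (1 << 16) ? 0 : 1;
    }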
<heller> weilewei: you probably don't want the smp variants
<weilewei> Yes, the system admin told me that smp, under the hood, is implemented with openmp, which might interfere with hpx threads
<heller> the next thing to do would be to try to figure out which call is causing trouble
<heller> that is, comment out everything, then comment stuff back in step by step until it crashes
<heller> eventually, you'll find the spot
<weilewei> hmm, it starts from the boost context switch and goes all the way down to an essl function call, that's what I have in mind
<weilewei> heller sure I will try that
<heller> the one that you showed me the other day?
<heller> that was in the scheduler ..
<heller> no essl whatsoever
K-ballo has joined #ste||ar
<heller> so the other problem within the scheduler is fixed now?
<weilewei> I could not recall the scheduler issue...
<weilewei> But the link I just sent you is the bug I am facing for weeks
<weilewei> Another solution is to compare all function calls and variables between std thread (their version) and hpx thread
<heller> oO
<heller> well. run with the sanitizers
<weilewei> yeah, I am building hpx now with sanitizers enabled
<heller> weilewei: https://gist.github.com/weilewei/f94b89262188f22bb25761a1d3e02851#gistcomment-3037496 <-- I am talking about that one, which we discussed 8 days ago
<weilewei> heller without essl, it passed, but with essl, it failed
<heller> aha
<heller> those are two different issues, with two different stack traces...
<heller> anyway, I am out...
<weilewei> heller yea, thanks for all the suggestions
<weilewei> the run results with sanitizer
<hkaiser> weilewei: from what I see just a ton of memory leaks in their code
<hkaiser> did this run crash?
<heller> Yeah, just leaks
<weilewei> The test failed, but I do not see this run output anything from the test
<weilewei> It should output some scientific results
<hkaiser> so it just crashed ...
<hkaiser> beautiful
<weilewei> yea, beautiful...
<weilewei> what should I do next?
<heller> well, fix the leaks ;)
<weilewei> wait... what... so many leaks
<heller> ok, here is another suggestion
<heller> export ASAN_OPTIONS=detect_leaks=0
<heller> weilewei: ^^
<weilewei> then run again?
<heller> Sure
<weilewei> it always complains about this: ==51069==ASan runtime does not come first in initial library list; you should either link runtime to your application or manually preload it with LD_PRELOAD.
hkaiser has quit [Ping timeout: 250 seconds]
<weilewei> But I have done export LD_PRELOAD=$OLCF_GCC_ROOT/lib64/libasan.so
<weilewei> which is valid
<weilewei> heller
<heller> Sorry, no idea
weilewei has quit [Remote host closed the connection]
aserio has joined #ste||ar
aserio1 has joined #ste||ar
aserio has quit [Ping timeout: 245 seconds]
aserio1 is now known as aserio
Vir has quit [Ping timeout: 265 seconds]
Vir has joined #ste||ar
aserio has quit [Quit: aserio]
hkaiser has joined #ste||ar
<jaafar> Is there any difference between how work created by "async_execute" or "dataflow" are scheduled?
<jaafar> Besides, I guess, the fact that async_execute doesn't have preconditions AFAICT
<jaafar> like, imagine there was a "dataflow" whose inputs were all available, vs something I created with async_execute
<hkaiser> jaafar: async_execute is lower-level
<jaafar> Is there any difference in how that work gets scheduled?
<hkaiser> dataflow uses executors to do its job
<hkaiser> so dataflow uses async_execute anyways
<jaafar> so I guess there is no difference in how it gets scheduled?
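A minimal sketch of hkaiser's point, assuming HPX's parallel_executor and the execution::async_execute customization point: both calls below end up going through the executor's async_execute, dataflow just waits for its inputs first.

    #include <hpx/hpx.hpp>
    #include <hpx/hpx_main.hpp>

    #include <utility>

    int main()
    {
        hpx::parallel::execution::parallel_executor exec;

        // schedule work on the executor directly via the customization point
        hpx::future<int> f1 =
            hpx::parallel::execution::async_execute(exec, [] { return 1; });

        // dataflow with the same executor: once f1 is ready, the continuation
        // is handed to the executor, i.e. it goes through async_execute too
        hpx::future<int> f2 = hpx::dataflow(
            exec, [](hpx::future<int> r) { return r.get() + 1; },
            std::move(f1));

        return f2.get() == 2 ? 0 : 1;
    }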
* jaafar is looking at partition_scan
<jaafar> make that scan_partitioner
<jaafar> the first phase is all created with async_execute
<jaafar> the second and third are all dataflow
<jaafar> It seems that work created with async_execute is generally preferred to dataflow by the scheduler
<jaafar> Not consistently, but enough to cause some cache thrashing as data is moved out and then back in again
<jaafar> thus my interest in what gets scheduled :)
<jaafar> Is there any way to influence what work gets run first?
<hkaiser> jaafar: create additional dependencies between the futures
<hkaiser> jaafar: gtg now, sorry
<jaafar> OK :) see you later
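A minimal sketch of hkaiser's suggestion to create additional dependencies between the futures: here phase 1 of chunk i+1 is artificially made to wait on phase 2 of chunk i, so the two phases touching the same chunk run close together. This trades some parallelism for locality and is only an illustration, not the scan_partitioner's actual structure.

    #include <hpx/hpx.hpp>
    #include <hpx/hpx_main.hpp>

    #include <cstddef>
    #include <numeric>
    #include <utility>
    #include <vector>

    int main()
    {
        std::size_t const num_chunks = 8;
        std::size_t const chunk_size = 1 << 16;
        std::vector<std::vector<int>> chunks(
            num_chunks, std::vector<int>(chunk_size, 1));

        std::vector<hpx::shared_future<long>> phase2_results;
        hpx::shared_future<long> previous = hpx::make_ready_future(0L).share();

        for (std::size_t i = 0; i != num_chunks; ++i)
        {
            auto& chunk = chunks[i];

            // phase 1 of chunk i: the extra dependency on 'previous' keeps it
            // from starting before phase 2 of chunk i-1 has finished
            hpx::future<long> phase1 = hpx::dataflow(hpx::launch::async,
                [&chunk](hpx::shared_future<long>) {
                    return std::accumulate(chunk.begin(), chunk.end(), 0L);
                },
                previous);

            // phase 2 of chunk i: consumes the same chunk again, ideally while
            // it is still in cache
            hpx::shared_future<long> phase2 =
                hpx::dataflow(hpx::launch::async,
                    [&chunk](hpx::future<long> partial) {
                        long sum = partial.get();
                        for (int& x : chunk)
                            x += static_cast<int>(sum);
                        return sum;
                    },
                    std::move(phase1))
                    .share();

            phase2_results.push_back(phase2);
            previous = phase2;    // the added dependency for the next chunk
        }

        hpx::wait_all(phase2_results);
        return 0;
    }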