hkaiser changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/ | GSoD: https://developers.google.com/season-of-docs/
<jaafar> hkaiser: I have that MWE now. Shall I make a gist? File an issue?
<hkaiser> jaafar: does it relate to that existing ticket?
<hkaiser> if yes, pls add it there, otherwise pls create a new ticket
<jaafar> peripherally... but it's really a separate thing
<jaafar> OK I'll make a new one
<jaafar> hope I'm not just using it wrong :)
nikunj has quit [Ping timeout: 245 seconds]
<jaafar> There I go, #4118
nikunj has joined #ste||ar
<hkaiser> thanks jaafar, will look tomorrow
K-ballo has quit [Quit: K-ballo]
nikunj97 has joined #ste||ar
nikunj has quit [Read error: Connection reset by peer]
diehlpk has joined #ste||ar
nikunj97 has quit [Remote host closed the connection]
hkaiser has quit [Ping timeout: 245 seconds]
diehlpk has quit [Ping timeout: 245 seconds]
weilewei has joined #ste||ar
rori has joined #ste||ar
nikunj has joined #ste||ar
lsl88 has quit [Quit: Leaving.]
nikunj has quit [Ping timeout: 240 seconds]
K-ballo has joined #ste||ar
lsl88 has joined #ste||ar
diehlpk has joined #ste||ar
diehlpk has quit [Ping timeout: 245 seconds]
hkaiser has joined #ste||ar
<hkaiser> heller: serialization branch is back in good shape, compiles again
<hkaiser> the problem would have disappeared for you however, once you reimplemented the extra data
<heller> hkaiser: sure, that's what I did. Just wanted to have a point of comparison
<heller> And the movable_any would have been defective in any case
<heller> BTW, can we name it unique_any?
<K-ballo> mofun, moany
<hkaiser> heller: feel free to bikeshed and to rename as much as you like
<hkaiser> gtg
<K-ballo> -1 on movable_ from me too
<K-ballo> the time machine option would be to call `any` the one that only requires Destructible, and `any_copyable` the one that requires Copyable
<K-ballo> failing that, unique_ is not ideal but should be easily understandable given unique_function
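To make the distinction under discussion concrete, here is a minimal, illustrative sketch of a move-only "unique_any": the class name and interface below are hypothetical, not HPX's actual implementation; the point is that it only requires the stored type to be Destructible (plus MoveConstructible to put it in), whereas std::any requires CopyConstructible.

    #include <memory>
    #include <typeinfo>
    #include <utility>

    class unique_any
    {
        struct base
        {
            virtual ~base() = default;
            virtual std::type_info const& type() const = 0;
        };

        template <typename T>
        struct holder : base
        {
            explicit holder(T value) : value_(std::move(value)) {}
            std::type_info const& type() const override { return typeid(T); }
            T value_;
        };

        std::unique_ptr<base> ptr_;    // move-only storage; copying is deleted

    public:
        unique_any() = default;
        unique_any(unique_any&&) = default;
        unique_any& operator=(unique_any&&) = default;

        template <typename T>
        unique_any(T value) : ptr_(new holder<T>(std::move(value))) {}

        bool has_value() const { return ptr_ != nullptr; }

        template <typename T>
        T& cast()    // unchecked cast, for brevity only
        {
            return static_cast<holder<T>&>(*ptr_).value_;
        }
    };

    int main()
    {
        // std::any cannot hold a move-only type like unique_ptr; this can.
        unique_any a(std::make_unique<int>(42));
        unique_any b(std::move(a));
        return *b.cast<std::unique_ptr<int>>() == 42 ? 0 : 1;
    }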
hkaiser has quit [Ping timeout: 250 seconds]
aserio has joined #ste||ar
weilewei has quit [Remote host closed the connection]
<zao> I still have failures on the s11n branch on my GCC 8.3 and libstdc++.
<zao> heller ^
<heller> zao: hmm, so it ain't fixed yet?
<heller> Bummer
tianyi93 has joined #ste||ar
lsl88 has quit [Quit: Leaving.]
weilewei has joined #ste||ar
K-ballo has quit [Quit: K-ballo]
aserio has quit [Ping timeout: 250 seconds]
quaz0r has quit [Ping timeout: 276 seconds]
rori has quit [Quit: WeeChat 1.9.1]
quaz0r has joined #ste||ar
hkaiser has joined #ste||ar
hkaiser has quit [Client Quit]
hkaiser has joined #ste||ar
weilewei has quit [Remote host closed the connection]
<diehlpk_work> hkaiser, You should get access to the Power9 machine within the next few days
nikunj has joined #ste||ar
weilewei has joined #ste||ar
<weilewei> I ran hpx tests on Summit and found 3 failed tests; report here: https://gist.github.com/weilewei/33907c41b1393da3f7296586be9d1258
<weilewei> summary:
<weilewei> 99% tests passed, 3 tests failed out of 700
<weilewei> Total Test time (real) = 861.81 sec
<weilewei> The following tests FAILED:
<weilewei> 551 - tests.unit.threads.thread_stacksize (Failed)
<weilewei> 554 - tests.unit.topology.numa_allocator (Failed)
<weilewei> 634 - tests.examples.quickstart.1d_wave_equation (Failed)
<weilewei> Is it something serious? Or, if needed, how can I fix it? I mostly use the hpx async, thread, and future facilities in the DCA++ project for now
<hkaiser> weilewei: the wave equation example shouldn't be problematic, the others - I don't know
<hkaiser> but I'd assume that HPX is functional for now
<weilewei> Ok, that's nice to know
<weilewei> I will try to build DCA++ with newly built HPX and see how it works
<hkaiser> weilewei: I will try to look into the failures
<hkaiser> the numa-allocator worries me a bit, but this could be a problem in the test itself
<hkaiser> weilewei: jbjnr would be the one to know actually
<weilewei> hkaiser thanks!
<weilewei> if you would like to test it, you can look into my directory: /gpfs/alpine/proj-shared/cph102/weile/dev/src/hpx/build_hwloc_Debug/
<diehlpk_work> weilewei, The 1d_wave could be related to the python issue
<weilewei> diehlpk_work ok, good to know
<diehlpk_work> I will look into the python issues next week
<diehlpk_work> Distributed failed because cmake could not find mpiexec
<diehlpk_work> I will look into this as well next week
<heller> The numa allocator stuff should not affect functionality, maybe just performance issues
<diehlpk_work> For the second one, I think the issue is with my scripts and how we export things
<heller> The stack size might be because we use too much stack space upfront (aka more overhead than on x86)
<heller> What is the python issue?
<diehlpk_work> python can not find some system lib
<heller> And how is this related to the failing example?
<diehlpk_work> Starting hpx applications with the python script fail, because one import fails
<heller> Then all tests would be marked as failed
<heller> All tests are started through the wrapper
<diehlpk_work> Ok, I have seen python errors for some of the tests
<heller> Which import fails? Which python version are you using? Did you try to load a newer one or tried to do a pip install?
<diehlpk_work> At least the distributed test failed for me because it could not find mpiexec, and the tcp one failed with a python error
<heller> If an import fails, it can't be just some of the tests
<heller> Ok, what's the error?
<diehlpk_work> heller, python 2.7 and I need to look into it in more detail
<heller> What's your MPI implementation? How would you start an MPI program on the machine you're on?
<diehlpk_work> I just got access to this power9 system on Tuesday
<heller> The ibm job scheduler/MPI implementation does not work with hpxrun.py
<heller> That's for sure
<diehlpk_work> It is openmpi
<heller> Then you should have an mpiexec/mpirun *somewhere*
<heller> What's the batch system?
<diehlpk_work> Yes, I think that I just forgot to export the path to mpiexec
<weilewei> Well, I'm not sure if I understand your discussion correctly; on Summit they use jsrun, not mpiexec/mpirun
<diehlpk_work> This all needs more investigation, but weilewei just needed hpx without networking
<weilewei> Yea, I do not need networking for now
<heller> I'm just saying what to expect...
<diehlpk_work> For now we just wanted to compile hpx without networking on a different power9 system to see if we get the same segfault as weilewei got on Summit
<heller> Didn't we conclude that the problem was with a specific blas implementation?
<hkaiser> heller: we did not
<heller> Also, does the segfault happen as well on John's implementation?
<hkaiser> shrug
<heller> But it does work when changing the blas implementation, right?
<weilewei> heller I have not found solutions for either yet
<weilewei> yes, hpx works well with netlib-lapack on Summit, but not with essl (the IBM-specific blas implementation; DCA++ tested it, and essl is faster than the other blas implementations)
<weilewei> So, eventually, I still need HPX to work with essl
<weilewei> I created a ticket with OLCF (the Summit help desk), and they found a similar issue from this May where someone used hpx with essl and ran into problems on Summit, but I'm not sure if it is relevant or not
<weilewei> So, the next step could be: investigate std thread + essl (which is DCA++'s original version) vs. hpx + essl and see what's wrong.
<heller> weilewei: there is no reason why HPX should not work with essl ... it is dca++ implemented on top of hpx that does not work
<heller> does dca++ use any thread local storage?
<weilewei> heller yea, I am trying to figure out why
<heller> or essl?
<heller> did you try running your stuff with the address or undefined sanitizer turned on?
<weilewei> I am not sure; they even have a non-threading version, which works, heller
<weilewei> heller I am not sure what that is, how do I turn it on and off in HPX?
<heller> -DHPX_WITH_SANITIZERS=On -DCMAKE_CXX_FLAGS="-fsanitize=address -fsanitize=undefined"
<heller> use that to configure HPX
<heller> and your application just with the extra CMAKE_CXX_FLAGS
<weilewei> Ok, I can try. With these flags, will I get extra information?
<heller> those instrument your code with special stuff to see if you run into undefined behavior or other memory related problems (like valgrind), just faster and more accurate
<weilewei> ok, I can try, because what I saw and discussed with Dr. Kaiser before is that all input args to the essl function call are valid, but it still causes a segfault
<weilewei> for essl, a commercial library, I do not have a debug version or access to its function call stacks.
<weilewei> heller I will try your suggestions now and let you know
<heller> weilewei: which version of the ESSL are you linking against?
<weilewei> I tried both the serial and smp versions, same errors in the same place heller
<jaafar> Is there a good resource that explains how dataflow gets scheduled? As in, how they are chosen when more than one are "ready"...
* jaafar is trying to understand the scan partitioner
<heller> could you choose between a 32-bit integer / 64-bit pointer environment and a 64-bit integer / 64-bit pointer environment?
<heller> jaafar: they are chosen by the scheduler, they don't get any special treatment
<weilewei> heller yes I can try all versions, should I try all of them?
<heller> you have to choose the one that fits your environment
<heller> this should be exactly *one*
<heller> what does the summit user guide have to say about this?
<jaafar> heller: OK but what if there are two... order created?
<jaafar> and where can I find that code :)
<heller> jaafar: not necessarily, if there is another core stealing the second...
<heller> jaafar: it's the scheduler implementation ;)
<jaafar> great! What file should I look in for that?
<heller> there are multiple..
<heller> one sec
<heller> jaafar: what are you trying to figure out?
<jaafar> why exclusive_scan is 20-25% slower in parallel
<jaafar> Right now I'm looking at the scheduling... seems like it could be improved to reduce cache thrashing
<heller> I wouldn't think it has anything to do with the scheduling decisions
<heller> hmm
<heller> as you mentioned in your ticket
<jaafar> there are two phases that operate on the same data
<jaafar> these tend to get separated with other work put in between
<heller> if the working set is correctly chosen, and the algorithm itself is cache friendly, the scheduling decision should be irrelevant
<jaafar> which (I expect) would cause the whole set to get reloaded
<heller> could be indeed
<jaafar> I believe the working set is also suboptimal
<heller> however: you probably won't figure out a way to fix that in the current scheduling implementation
<heller> I'd start there
<jaafar> heller: at least I will know
<weilewei> heller well, the summit user guide does not say much
<heller> what did the people who figured out that essl is the fastest option for dca++ use?
<weilewei> In DCA's implementation, they link against libessl.so
<weilewei> probably they run some performance test?
<jaafar> thanks!
<heller> jaafar: that's where the tasks get popped from the scheduler
<weilewei> @heller essl is an IBM product, so I guess essl is highly optimized for IBM machines? I am not sure what the exact performance numbers are
<heller> jaafar: that's probably where you'll end up
<heller> weilewei: I don't care much about the performance numbers ... you are saying it ain't working for you. I am trying to figure out why
<heller> and obviously it didn't work for another person
<heller> so I guess you use libessl.so too? I assume it is a symlink to one of the variants?
<jaafar> OK I will read it. I'm hoping to end up with some way of tweaking the execution so that "phase 2" for a single chunk is chosen over "phase 1" for a different chunk if both are ready
<jaafar> thus reducing the separation between use of the same data
<heller> ok, that can be done, in principle
<heller> if say, you want to execute the continuation passed into dataflow, once all inputs are ready
<weilewei> heller yes, I also linked to libesslsmp.so, same error and failed at the same place
<heller> jaafar: you should use the fork policy for that, IIRC, that should put the continuation as the next thread to schedule
<jaafar> interesting OK
<jaafar> the scan_partitioner presently uses sync
<heller> sync means it is directly executed afterwards, without even starting a new task
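For reference, a minimal sketch of what is being discussed here: passing a launch policy to hpx::dataflow. hpx::launch::sync runs the continuation inline once its inputs are ready, while hpx::launch::fork, per heller's recollection above, hints that the continuation should be scheduled right away. The two-phase structure below is only illustrative, not the actual scan_partitioner code.

    #include <hpx/hpx.hpp>
    #include <hpx/hpx_main.hpp>

    #include <numeric>
    #include <utility>
    #include <vector>

    int main()
    {
        std::vector<int> chunk(1 << 16, 1);

        // "phase 1": produce a partial result for the chunk
        hpx::future<long> phase1 = hpx::async([&chunk] {
            return std::accumulate(chunk.begin(), chunk.end(), 0L);
        });

        // "phase 2": consumes the same chunk; the launch policy passed to
        // dataflow controls how the continuation runs once phase1 is ready
        hpx::future<long> phase2 = hpx::dataflow(
            hpx::launch::fork,    // or hpx::launch::sync, as the scan_partitioner uses
            [&chunk](hpx::future<long> partial) {
                return partial.get() + static_cast<long>(chunk.size());
            },
            std::move(phase1));

        return phase2.get() == 2L * (1 << 16) ? 0 : 1;
    }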
<heller> weilewei: you probably don't want the smp variants
<weilewei> Yes, the system admin told me that smp, under the hood, is implemented with openmp, which might interfere with hpx threads
<heller> the next thing to do would be to try to figure out which call is causing trouble
<heller> that is, comment out everything, then comment stuff back in step by step until it crashes
<heller> eventually, you'll find the spot
<weilewei> hmm, it starts from the boost context switch and goes all the way down to an essl function call, that's what I have in mind
<weilewei> heller sure I will try that
<heller> the one that you showed me the other day?
<heller> that was in the scheduler ..
<heller> no essl whatsoever
K-ballo has joined #ste||ar
<heller> so the other problem within the scheduler is fixed now?
<weilewei> I could not recall the scheduler issue...
<weilewei> But the link I just sent you is the bug I am facing for weeks
<weilewei> Another solution is to compare all function calls and variables between std thread (their version) and hpx thread
<heller> oO
<heller> well. run with the sanitizers
<weilewei> yeah, I am building hpx now with sanitizers enabled
<heller> weilewei: https://gist.github.com/weilewei/f94b89262188f22bb25761a1d3e02851#gistcomment-3037496 <-- I am talking about that one, which we discussed 8 days ago
<weilewei> heller without essl, it passed, but with essl, it failed
<heller> aha
<heller> those are two different issues, with two different stack traces...
<heller> anyway, I am out...
<weilewei> heller yea, thanks for all the suggestions
<weilewei> the run results with sanitizer
<hkaiser> weilewei: from what I see just a ton of memory leaks in their code
<hkaiser> did this run crash?
<heller> Yeah, just leaks
<weilewei> The test failed, but I do not see this run output anything from the test
<weilewei> It should output some scientific results
<hkaiser> so it just crashed ...
<hkaiser> beautiful
<weilewei> yea, beautiful...
<weilewei> what should I do next?
<heller> well, fix the leaks ;)
<weilewei> wait... what... so many leaks
<heller> ok, here is another suggestion
<heller> export ASAN_OPTIONS=detect_leaks=0
<heller> weilewei: ^^
<weilewei> then run again?
<heller> Sure
<weilewei> it always complains about this: ==51069==ASan runtime does not come first in initial library list; you should either link runtime to your application or manually preload it with LD_PRELOAD.
hkaiser has quit [Ping timeout: 250 seconds]
<weilewei> But I have done export LD_PRELOAD=$OLCF_GCC_ROOT/lib64/libasan.so
<weilewei> which is valid
<weilewei> heller
<heller> Sorry, no idea
weilewei has quit [Remote host closed the connection]
aserio has joined #ste||ar
aserio1 has joined #ste||ar
aserio has quit [Ping timeout: 245 seconds]
aserio1 is now known as aserio
Vir has quit [Ping timeout: 265 seconds]
Vir has joined #ste||ar
aserio has quit [Quit: aserio]
hkaiser has joined #ste||ar
<jaafar> Is there any difference between how work created by "async_execute" or "dataflow" are scheduled?
<jaafar> Besides, I guess, the fact that async_execute doesn't have preconditions AFAICT
<jaafar> like, imagine there was a "dataflow" whose inputs were all available, vs something I created with async_execute
<hkaiser> jaafar: async_execute is lower-level
<jaafar> Is there any difference in how that work gets scheduled?
<hkaiser> dataflow uses executors to do its job
<hkaiser> so dataflow uses async_execute anyways
<jaafar> so I guess there is no difference in how it gets scheduled?
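A minimal sketch of hkaiser's point, assuming HPX's parallel_executor and the execution::async_execute customization point: both calls below end up going through the executor's async_execute, dataflow just waits for its inputs first.

    #include <hpx/hpx.hpp>
    #include <hpx/hpx_main.hpp>

    #include <utility>

    int main()
    {
        hpx::parallel::execution::parallel_executor exec;

        // schedule work on the executor directly via the customization point
        hpx::future<int> f1 =
            hpx::parallel::execution::async_execute(exec, [] { return 1; });

        // dataflow with the same executor: once f1 is ready, the continuation
        // is handed to the executor, i.e. it goes through async_execute too
        hpx::future<int> f2 = hpx::dataflow(
            exec, [](hpx::future<int> r) { return r.get() + 1; },
            std::move(f1));

        return f2.get() == 2 ? 0 : 1;
    }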
* jaafar is looking at partition_scan
<jaafar> make that scan_partitioner
<jaafar> the first phase is all created with async_execute
<jaafar> the second and third are all dataflow
<jaafar> It seems that work created with async_execute is generally preferred to dataflow by the scheduler
<jaafar> Not consistently, but enough to cause some cache thrashing as data is moved out and then back in again
<jaafar> thus my interest in what gets scheduled :)
<jaafar> Is there any way to influence what work gets run first?
<hkaiser> jaafar: create additional dependencies between the futures
<hkaiser> jaafar: gtg now, sorry
<jaafar> OK :) see you later
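A minimal sketch of hkaiser's suggestion to create additional dependencies between the futures: here phase 1 of chunk i+1 is artificially made to wait on phase 2 of chunk i, so the two phases touching the same chunk run close together. This trades some parallelism for locality and is only an illustration, not the scan_partitioner's actual structure.

    #include <hpx/hpx.hpp>
    #include <hpx/hpx_main.hpp>

    #include <cstddef>
    #include <numeric>
    #include <utility>
    #include <vector>

    int main()
    {
        std::size_t const num_chunks = 8;
        std::size_t const chunk_size = 1 << 16;
        std::vector<std::vector<int>> chunks(
            num_chunks, std::vector<int>(chunk_size, 1));

        std::vector<hpx::shared_future<long>> phase2_results;
        hpx::shared_future<long> previous = hpx::make_ready_future(0L).share();

        for (std::size_t i = 0; i != num_chunks; ++i)
        {
            auto& chunk = chunks[i];

            // phase 1 of chunk i: the extra dependency on 'previous' keeps it
            // from starting before phase 2 of chunk i-1 has finished
            hpx::future<long> phase1 = hpx::dataflow(hpx::launch::async,
                [&chunk](hpx::shared_future<long>) {
                    return std::accumulate(chunk.begin(), chunk.end(), 0L);
                },
                previous);

            // phase 2 of chunk i: consumes the same chunk again, ideally while
            // it is still in cache
            hpx::shared_future<long> phase2 =
                hpx::dataflow(hpx::launch::async,
                    [&chunk](hpx::future<long> partial) {
                        long sum = partial.get();
                        for (int& x : chunk)
                            x += static_cast<int>(sum);
                        return sum;
                    },
                    std::move(phase1))
                    .share();

            phase2_results.push_back(phase2);
            previous = phase2;    // the added dependency for the next chunk
        }

        hpx::wait_all(phase2_results);
        return 0;
    }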