#ste||ar on 2018-07-02 — irc logs at irclog.cct.lsu.edu

2018-04-23 16:40 hkaiser changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/ | GSoC2018: https://wp.me/p4pxJf-k1

00:00 diehlpk has joined #ste||ar

00:00 diehlpk has quit [Read error: Connection reset by peer]

01:16 Vir has quit [Ping timeout: 265 seconds]

01:21 Vir has joined #ste||ar

01:22 diehlpk has joined #ste||ar

01:26 Vir has quit [Ping timeout: 240 seconds]

01:26 Vir has joined #ste||ar

01:35 mcopik has quit [Ping timeout: 268 seconds]

01:46 diehlpk has quit [Remote host closed the connection]

02:08 K-ballo has quit [Quit: K-ballo]

02:59 hkaiser has quit [Quit: bye]

06:24 <jbjnr> is anyone else back at work and raring to go?

06:34 <zao> I'm on vacation and being about as unproductive as usual :D

06:37 <jbjnr> vacation - anywhere nice?

06:39 <jbjnr> I believe St. Petersburg could be nice for you this week :)

06:52 <zao> Nah, just staycation.

06:52 <zao> Went back home around midsummer, probably going again in a few weeks.

06:52 <zao> (split my vacation days up in two sections this year, so two weeks now, then three weeks later)

07:04 jbjnr has quit [Ping timeout: 245 seconds]

07:05 jbjnr has joined #ste||ar

07:10 jaafar has quit [Ping timeout: 268 seconds]

07:11 <jbjnr> grrrr. my windows machine is just terrible these days. blue screen of death and reboots all the time.

07:14 <zao> How bothersome.

07:14 <zao> Were you running pycicle infra on it?

07:14 <jbjnr> yes

07:15 <zao> I was building one of the SoC PRs the other day, my CI stuff apparently still has trouble with a test :(

07:15 <jbjnr> not sure what 'infra' means. but I have two pycicle instances spawngin work on the cray. they are just python loops polling github. Not building here.

07:15 <zao> I wonder if the container setup interferes with it.

07:15 <zao> Infrastructure.

07:17 <zao> Anyway, welcome back :D

07:37 <heller> jbjnr: welcome back!

07:47 jbjnr has quit [Ping timeout: 240 seconds]

07:49 jbjnr has joined #ste||ar

07:49 <jbjnr> rebooted again!

07:50 <jbjnr> heller: hi. Have you finished the kokkos intregration yet! :)

07:50 <heller> jbjnr: no, the kokkos and HPX model are interestingly very different and not really compatible

07:51 <heller> what i'd like instead is a thorough comparison between the two

07:51 <jbjnr> I believe we must make our stuff compatible if we are to get peak performance on a node

07:52 <heller> that's what I'm still wondering

07:52 <heller> Kokkos doesn't come for free either

07:53 <jbjnr> I've already made a lot of progress with my abaility to provide hints to the scheduler about where to put tasks, but we need to go much further.

07:53 <heller> we get nice performance for the stream benchmark, for example. Something the kokkos model is supposed to be perfect for, for example

07:53 <heller> right

07:53 <heller> I am not arguing that what we have is perfect

07:54 <jbjnr> the stream benchmark is not really a good example though as it does not use the 'standard' api that the rest of hpx uses

07:55 <jbjnr> did you reach any conclusions about N-ary tasks?

07:55 <heller> I am not even sure what that standard API is ...

07:58 <heller> N-ary tasks: that's just a by product of their model. I don't think it buys us anything

07:58 <heller> but yes, the stream benchmark needs to be streamlined again

07:59 <heller> the point is: It is able to deliver

08:00 <jbjnr> N-ary : I like the idea of creating 1 task instead of 32 (or some other number) and decrementing the ranges used.

08:01 <heller> yes, I guess that's one point where we need to optimize

08:01 <heller> instead of calucating the partitions upfront, each thread should do it on its own based on some index

08:01 <heller> or s

08:01 <heller> o

08:05 <heller> also: you are saying that we aren't able to reach peak on a single node with what we have today. On what ground are you making that statement? Do you have a comparison of your cholesky stuff using Kokkos?

08:09 <heller> and more importantly: I must get out of this overwhelming, productivity killing thesis swamp that keeps on draining my energy for too long

09:19 david_pfander1 has joined #ste||ar

09:24 <github> [hpx] StellarBot pushed 1 new commit to gh-pages: https://git.io/f1Wgg

09:24 <github> hpx/gh-pages 63eed00 StellarBot: Updating docs

10:21 david_pfander1 has quit [Ping timeout: 245 seconds]

10:51 hkaiser has joined #ste||ar

11:06 jakub_golinowski has joined #ste||ar

11:22 jakub_golinowski has quit [Quit: Ex-Chat]

11:26 jakub_golinowski has joined #ste||ar

11:45 K-ballo has joined #ste||ar

11:56 mcopik has joined #ste||ar

12:23 nikunj has joined #ste||ar

13:06 <nikunj> hkaiser: yt?

13:06 <hkaiser> here

13:07 <nikunj> So I just tried integrating my apple implementation into HPX. Things are working fine as of now (examples are running well). I'm onto running tests now

13:07 <hkaiser> nice!

13:07 <hkaiser> good job!

13:08 <nikunj> Could you please review my pr so that I can add another pr for apple integration as well (it adds onto hpx_wrap.cpp and I do not want to combine Linux and Mac OS integration into same pr)

13:09 <hkaiser> nikunj: will try to get to it today

13:09 <nikunj> thanks, I'll add another pr as soon as it is reviewed

13:18 hkaiser has quit [Quit: bye]

13:32 anushi has joined #ste||ar

13:34 Anushi1998 has joined #ste||ar

13:36 <Anushi1998> nikunj: Why don't u add a branch and make a second PR? Is there any problem in that or the second PR can only be make if first one is merged?

13:36 <nikunj> Anushi1998: the second pr cannot be worked on until the first one is not merged

13:37 <Anushi1998> okay

13:37 <nikunj> It involves additional code in the file of my first pr.

13:46 mcopik has quit [Ping timeout: 245 seconds]

13:57 aserio has joined #ste||ar

13:58 mcopik has joined #ste||ar

13:59 <jakub_golinowski> M-ms, the build in release mode has linking errors as before in a clean dir

14:00 <M-ms> jakub_golinowski: ok, thanks

14:00 <M-ms> still rebuilding here

14:07 <jbjnr> M-ms: are you in zurich or basel?

14:07 <M-ms> jbjnr: basel

14:08 <jbjnr> ok. see you tomorrow. Is the conf. centre small enough that I'll find everyone easily?

14:08 <jbjnr> I probably won't arrive until lunchtime

14:08 <M-ms> I see you're coming here as well...

14:08 <jbjnr> yup. meeting

14:10 hkaiser has joined #ste||ar

14:10 <M-ms> yep, it's reasonably small

14:10 <M-ms> coffee breaks in one hall, otherwise write on slack

14:10 <hkaiser> jbjnr: I'm ready whenever you are

14:11 <jbjnr> hkaiser: https://appear.in/stellar-group

14:17 akheir has joined #ste||ar

14:17 <M-ms> jakub_golinowski: getting the linker errors now on my work laptop, must have something different on my personal one... but now I can at least start looking into it

14:19 <nikunj> hkaiser: can we reschedule our skype meet to Wednesday or Thursday? I had to talk mainly about my implementation of Linux and Mac OS. Now that they are done (almost), I can work on Windows. I think I can get some visible leads until Wednesday to discuss it with you.

14:30 nikunj97 has joined #ste||ar

14:32 nikunj has quit [Ping timeout: 276 seconds]

14:35 <aserio> heller: yt?

14:35 <heller> aserio: hey

14:35 <aserio> heller: welcome to the team

14:35 <heller> aserio: he, thanks ;)

14:36 <heller> aserio: see pm please ;)

14:47 <hkaiser> nikunj97: sure, works for me (Thursday)

14:47 <hkaiser> let's rather do Friday

14:47 <nikunj97> hkaiser: ok

14:47 <nikunj97> I'll research ways to get things done on windows till then

15:08 nikunj1997 has joined #ste||ar

15:11 nikunj97 has quit [Ping timeout: 264 seconds]

15:34 anushi has quit [Read error: Connection reset by peer]

15:34 anushi has joined #ste||ar

15:34 anushi has quit [Remote host closed the connection]

15:54 Anushi1998 has quit [Quit: Bye]

15:59 <jakub_golinowski> M-ms, I realized that 6 CEST is now :D do you have time to look at the gdoc?

15:59 aserio1 has joined #ste||ar

16:01 aserio has quit [Ping timeout: 265 seconds]

16:01 aserio1 is now known as aserio

16:05 <M-ms> jakub_golinowski: yep, thanks

16:07 nikunj1997 has quit [Ping timeout: 240 seconds]

16:07 <github> [hpx] hkaiser created destroy_parcel (+1 new commit): https://git.io/fySM9

16:07 <github> hpx/destroy_parcel a79d051 Hartmut Kaiser: Making sure all parcels get destroyed on an HPX thread (TCP pp)

16:07 anushi has joined #ste||ar

16:09 anushi has quit [Remote host closed the connection]

16:09 <github> [hpx] hkaiser force-pushed destroy_parcel from a79d051 to 0d9a425: https://git.io/fyH4L

16:09 <github> hpx/destroy_parcel 0d9a425 Hartmut Kaiser: Making sure all parcels get destroyed on an HPX thread (TCP pp)

16:35 anushi has joined #ste||ar

16:40 <github> [hpx] hkaiser force-pushed destroy_parcel from 0d9a425 to 8e2d7c1: https://git.io/fyH4L

16:40 <github> hpx/destroy_parcel 8e2d7c1 Hartmut Kaiser: Making sure all parcels get destroyed on an HPX thread (TCP pp)...

16:42 aserio has quit [Ping timeout: 255 seconds]

17:03 Anushi1998 has joined #ste||ar

17:16 jakub_golinowski has quit [Ping timeout: 276 seconds]

17:17 <Guest87328> [hpx] hkaiser opened pull request #3361: Making sure all parcels get destroyed on an HPX thread (TCP pp) (master...destroy_parcel) https://git.io/fSs66

17:18 jaafar has joined #ste||ar

17:21 mcopik has quit [Ping timeout: 248 seconds]

17:34 mcopik has joined #ste||ar

17:39 mcopik has quit [Ping timeout: 276 seconds]

17:41 jakub_golinowski has joined #ste||ar

17:44 diehlpk_mobile has joined #ste||ar

18:08 <hkaiser> jbjnr: could you give me the link to the nvidia gpu layering workshop announcement, please

18:10 jakub_golinowski has quit [Ping timeout: 276 seconds]

18:18 aserio has joined #ste||ar

18:30 hkaiser has quit [Quit: bye]

18:46 jbjnr has quit [Remote host closed the connection]

18:54 hkaiser has joined #ste||ar

18:59 aserio1 has joined #ste||ar

19:00 <github> [hpx] khuck pushed 1 new commit to apex_fixing_null_wrapper: https://git.io/fSsQa

19:00 <github> hpx/apex_fixing_null_wrapper e63fcf6 Kevin Huck: Trying to make circleci happy

19:02 aserio has quit [Ping timeout: 240 seconds]

19:03 aserio1 has quit [Ping timeout: 240 seconds]

19:39 aserio has joined #ste||ar

19:55 jakub_golinowski has joined #ste||ar

19:58 <parsa[w]> is it possible to determine if we're on locality#0 after hpx::finalize()?

19:58 <parsa[w]> hkaiser: ^

19:58 aserio has quit [Ping timeout: 240 seconds]

19:58 hkaiser has quit [Read error: Connection reset by peer]

19:59 hkaiser has joined #ste||ar

19:59 <parsa[w]> hkaiser: is it possible to determine if we're on locality#0 after hpx::finalize()?

19:59 <hkaiser> parsa[w]: not sure what you mean

19:59 aserio has joined #ste||ar

20:03 <parsa[w]> hpx_main finishes execution, and i expect some string to be printed, which happens on locality 0. i want to check for that string when i'm on locality 0

20:03 <parsa[w]> hkaiser: ^

20:03 <hkaiser> from phylanx?

20:03 <hkaiser> I mean from physl?

20:04 <parsa[w]> any hpx application

20:04 <parsa[w]> in main()

20:05 <hkaiser> parsa[w]: is this what you need? https://github.com/STEllAR-GROUP/phylanx/blob/master/tests/regressions/execution_tree/cout_prints_twice_155.cpp

20:06 <parsa[w]> hkaiser: no. i want https://github.com/STEllAR-GROUP/phylanx/blob/master/tests/regressions/execution_tree/cout_prints_twice_155.cpp#L37 only when when it's on locality 0

20:06 <hkaiser> hpx::cout always goes to locality 0

20:06 <hkaiser> debug() as well

20:06 <parsa[w]> yes, but main is run on every locality

20:06 <K-ballo> new info on that action template on the slack channel

20:06 <parsa[w]> which means the other process fails

20:07 <hkaiser> you can't check after finalize whether you are on locality 0

20:07 <hkaiser> only way is to store the locality in a variable before finalize so you can use it afterwards

20:11 <hkaiser> parsa[w]: I just merged #3361

20:11 <github> [hpx] hkaiser pushed 1 new commit to master: https://git.io/fSsxO

20:11 <github> hpx/master 6bda03a Hartmut Kaiser: Merge pull request #3361 from STEllAR-GROUP/destroy_parcel...

20:11 <parsa[w]> thanks

20:11 <github> [hpx] hkaiser deleted destroy_parcel at 8e2d7c1: https://git.io/fSsx3

20:11 <heller> hkaiser: what's your stance on kicking the asio based PP in favor of a libfabric only solution?

20:12 <hkaiser> heller: if we can get it to work on any platform that supports sockets, sure - we won't fully get rid of asio this way though

20:12 <hkaiser> heller: what would be the rationale of doing this?

20:15 <heller> Simplifying the whole parcelhandler code by just using libfabric for the communication. This way, we can fully utilize the network. I have a prototype implementation that's on par in terms of latency and bandwidth with MPI. For window size of 1 and single thread

20:15 <heller> For the OSU test

20:16 <hkaiser> what about bootstrapping?

20:16 <heller> Solved

20:16 <hkaiser> nice

20:16 <heller> Even without PMI

20:16 <hkaiser> well, as a first step I'd say - let's add it as an additional pp

20:16 <heller> Full zero copy capable ;)

20:17 <heller> Hmm

20:18 <heller> Not sure if that's going to work out though

20:18 <hkaiser> why?

20:19 <K-ballo> PMI?

20:19 <K-ballo> MPI

20:19 <K-ballo> (I read PMI in passing and thought of Phillip Morris International)

20:19 <hkaiser> K-ballo: http://www.mcs.anl.gov/papers/P1760.pdf

20:20 <heller> I changed the serialization stuff. Mainly to have easier preprocessing and rdma reads on demand

20:20 <K-ballo> oh, it's a thing

20:20 <heller> K-ballo: process management interface

20:20 <hkaiser> heller: what about the mpi pp?

20:21 <heller> I started bottom up. As said, it's just a prototype so far and not yet fully integrated

20:21 <hkaiser> heller: will that make the mpi pp obsolete as well?

20:21 <heller> The mpi pp has no need to exist anymore :p

20:22 <heller> Yes

20:22 <heller> That's the goal

20:22 <hkaiser> ok

20:22 <hkaiser> this needs some discussion

20:23 <heller> It will certainly be a disruptive step since I expect some bugs

20:23 <heller> Sure, that's why I'm bringing it up

20:23 <hkaiser> I'm not in favor of throwing away everything we have in terms of networking and replace it with something new in one bid sweep

20:23 <hkaiser> big*

20:23 <heller> I understand. The two things could happily coexist

20:24 <heller> They in fact do at the moment

20:25 <hkaiser> so what's the problem with leaving the existing tcp pp in place for a while?

20:27 <heller> No problem at all. This new code would make the current parcel handling obsolete.

20:27 <hkaiser> I understand

20:27 <heller> Having a plan on when to remove would be good

20:27 <hkaiser> but as said, I think this change should be done in steps over at least 2 releases

20:27 <heller> Ok

20:28 <heller> No problem.

20:28 <hkaiser> one release have the new stuff in but not as the default, and the next release have it on by default, leaving the old stuff in on demand

20:29 <hkaiser> third release - remove things

20:29 <hkaiser> now, the quicker you do the releases, the quicker the stuff gets in ;-)

20:30 <heller> The risk is: bugs, changed cmake step (need to point to a libfabric install) and a potential problem when not using slurm/pbs/alps for distributed applications. Libfabric might get discontinued and we ended up with a pretty coupled code base and need to invest there

20:32 <heller> The gain: significantly faster distributed applications

20:32 <heller> And making John happy with the rdma transfers

20:33 <hkaiser> heller: sure, I'm behind this - just a bit cautious

20:34 <heller> Good

20:35 <heller> I hope that it works reasonably on Windows and osx

20:37 <heller> They claim it does...

20:38 <hkaiser> heller: sure, if not we can create some pressure through Chris

20:53 diehlpk_mobile has quit [Read error: Connection reset by peer]

21:24 jakub_golinowski has quit [Ping timeout: 256 seconds]

21:31 diehlpk_mobile has joined #ste||ar

21:32 <Anushi1998> https://github.com/STEllAR-GROUP/hpx/blob/master/src/runtime/naming/name.cpp#L282-L284

21:32 <Anushi1998> Why we need to add new split credits? Since we have acquired the lock the credits will be replinshed and again whenever it is split it would be simply divided.

21:34 <Anushi1998> ahttps://github.com/STEllAR-GROUP/hpx/blob/master/src/runtime/naming/name.cpp#L288 Why is it not 1 bcoz we are replenishing only when we are exhausted and 2 can be divided further?

21:34 jakub_golinowski has joined #ste||ar

21:34 <Anushi1998> As here https://github.com/STEllAR-GROUP/hpx/blob/master/src/runtime/naming/name.cpp#L98-L99

21:41 <hkaiser> Anushi1998: we replenish only once the credit has been exhausted

21:42 aserio has quit [Quit: aserio]

21:42 <Anushi1998> but it should be 1 :/ https://github.com/STEllAR-GROUP/hpx/blob/master/src/runtime/naming/name.cpp#L290

21:42 <Anushi1998> hkaiser: ^^

21:42 <hkaiser> why?

21:43 <Anushi1998> becoz when we have 1 credit and we still want to split then we should replenish both

21:43 <hkaiser> creadits are always stored as log(credit)

21:43 <hkaiser> so the minimal useful credit is 2^^1

21:43 <Anushi1998> okay

21:44 <Anushi1998> hkaiser: https://github.com/STEllAR-GROUP/hpx/blob/master/src/runtime/naming/name.cpp#L282-L284 Also please explain this

21:44 hkaiser has quit [Read error: Connection reset by peer]

21:45 hkaiser has joined #ste||ar

21:45 <Anushi1998> hkaiser: https://github.com/STEllAR-GROUP/hpx/blob/master/src/runtime/naming/name.cpp#L282-L284 Also please explain this

21:46 <hkaiser> we do not hold a lock during credit splitting

21:47 <hkaiser> that means that concurrently two or more of those operations could happen

21:47 <hkaiser> we need to account for those

21:48 <Anushi1998> but whenever we are splitting it will reduce the credit or replinish completely (which in case of addition will give overflow)

21:48 <hkaiser> Anushi1998: yah, but it might replenish more than once

21:49 <Anushi1998> so u mean to say if I have 2 credit and it's splitted twice concurrently it will 2*HPX_GLOBALCREDIT_INITIAL ?

21:50 <hkaiser> yes

21:52 <Anushi1998> Why we have chosen such design, it can lead to large no. of credits?

21:53 <hkaiser> Anushi1998: that's to avoid holding a lock while a possibly remote operation is underway

21:53 <Anushi1998> hkaiser: Okay, thanks :)

21:55 <hkaiser> K-ballo: is there anything preventing me from using std::monostate not as the first member of a variant?

21:57 Anushi1998 has quit [Quit: Bye]

22:10 <jakub_golinowski> M-ms, I was trying to build the MartyCam app but got this errors https://pastebin.com/GsiabTz5

22:11 <jakub_golinowski> I tried rebuilding opencv with the options suggested in the install instructions of MartyCam but it still did not help. Now my guess is that I am using recent master and this might be the issue. In the mean time I am reading the source code of the app

22:14 <K-ballo> hkaiser: nope

22:15 jakub_golinowski has quit [Ping timeout: 260 seconds]

22:19 nikunj1997 has joined #ste||ar

22:27 <github> [hpx] khuck pushed 1 new commit to apex_fixing_null_wrapper: https://git.io/fSGte

22:27 <github> hpx/apex_fixing_null_wrapper a68ef88 Kevin Huck: Merge branch 'master' into apex_fixing_null_wrapper

22:28 <github> [hpx] khuck pushed 1 new commit to apex_fixing_null_wrapper: https://git.io/fSGtJ

22:28 <github> hpx/apex_fixing_null_wrapper ee55d5d Kevin Huck: Merge branch 'master' into apex_fixing_null_wrapper

22:29 <nikunj1997> hkaiser: 4 tests failed in my Mac OS test (2 of them timed out). 1 tests passed later when I reran it. So overall 99% tests passed. The reason for timed out tests could be due to RAM shortage (I'm running it on VM).

22:31 <github> [hpx] khuck opened pull request #3363: Apex fixing null wrapper (master...apex_fixing_null_wrapper) https://git.io/fSGtG

22:42 <hkaiser> nikunj1997: sounds promising

22:43 <nikunj1997> hkaiser: do you think implementing mainCRTStartup again would be a good idea?

22:43 <nikunj1997> Referencing to Billy's mail

22:44 <hkaiser> an experimental implementation doesn't sound too bad

22:44 <nikunj1997> that's what I think as well. We can always update them as msvc updates it's versions

22:45 <hkaiser> right

22:45 <nikunj1997> actually mainCRTStarup will provide all the flexibility we require for implementing it

22:45 <nikunj1997> Given init_seg cannot force itself to run after all global objects

22:46 diehlpk_mobile has quit [Read error: Connection reset by peer]

23:58 <K-ballo> hkaiser: what had you in mind?

23:58 <K-ballo> that doesn't read well...

23:59 <hkaiser> K-ballo: I justed thought I would need some meaningles empty-state for the variant

23:59 <hkaiser> just*