hkaiser changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/
hello has joined #ste||ar
hello has quit [Ping timeout: 244 seconds]
hello has joined #ste||ar
hello has quit [Ping timeout: 246 seconds]
hello has joined #ste||ar
hello has quit [Ping timeout: 240 seconds]
hkaiser has quit [Quit: bye]
hello has joined #ste||ar
hello has quit [Ping timeout: 250 seconds]
hello has joined #ste||ar
hello has quit [Ping timeout: 272 seconds]
time_ has joined #ste||ar
time_ has quit [Ping timeout: 250 seconds]
time_ has joined #ste||ar
time_ is now known as hello
time_ has joined #ste||ar
hello has quit [Remote host closed the connection]
time_ is now known as hello
hello has quit [Ping timeout: 240 seconds]
jbjnr has joined #ste||ar
hello has joined #ste||ar
hello has quit [Ping timeout: 246 seconds]
hello has joined #ste||ar
hello has quit [Remote host closed the connection]
hello has joined #ste||ar
nikunj has quit [Quit: Leaving]
nikunj has joined #ste||ar
david_pfander has joined #ste||ar
david_pfander has quit [Quit: david_pfander]
david_pfander has joined #ste||ar
<heller_> simbergm: looks like the cherry pick went haywire
<simbergm> heller_: yeah, just saw that
<heller_> simbergm: how about we just fix everything on master and call it 1.3?
<heller_> how much time do we have for f30?
<simbergm> heller_: yes, I'd like that, I'm just worried about the timing
<heller_> what's your concern?
<simbergm> I think it was something like 2 weeks
<simbergm> specifically, fedora
<simbergm> and I'm not sure if we can get 1.3.0 on the older fedoras
<simbergm> or if they only do patch releases
<simbergm> but if we can just do 1.3.0 we should do that
<heller_> ok
<heller_> or well, roll the cmake cherry picks back, flag the cmake stuff as known problem
<simbergm> 14:55 <diehlpk_work> 2019-02-19 Branch Fedora 30 from Rawhide (Rawhide becomes future F31)
<simbergm> heller_: yeah, I'll check quickly if the two PRs I couldn't apply on Friday are needed (I don't remember anymore how they hang together)
<simbergm> but otherwise let's check with diehlpk_work when he gets on if just 1.3.0 would be an option
<heller_> oh, that's next week
<heller_> impossible
<heller_> fixing memory leaks right now..
<simbergm> yes, but then:
<simbergm> 14:56 <diehlpk_work> Not at all, 2019-03-26 Beta Release (Preferred Target)
<simbergm> 14:56 <diehlpk_work> This is the deadline to have hpx in fedora 30
<heller_> enabling sanitizers is a black hole...
<heller_> ahh
<heller_> that'd be doable
<simbergm> not sure what date we actually need to stick to
<simbergm> yep
<simbergm> yeah, sanitizers
<simbergm> but even if you don't get everything fixed we should still merge the ones you do find
<heller_> yeah, I try to separate the commits out as much as I can
<heller_> but I really want to have them run with every commit...
<heller_> so we need a clean start ;)
<heller_> that's still up
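A hedged sketch of what "enabling sanitizers" typically looks like for an HPX-style CMake build (plain compiler flags shown as the portable route; any dedicated HPX CMake option for this is an assumption, not verified here):

    cmake -DCMAKE_BUILD_TYPE=Debug \
          -DCMAKE_CXX_FLAGS="-fsanitize=address -fno-omit-frame-pointer" \
          -DCMAKE_EXE_LINKER_FLAGS="-fsanitize=address" \
          /path/to/hpx

Running the test suite under such a build is what surfaces the leaks heller_ mentions.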
<jbjnr_> what is f30?
<heller_> fedora 30
hello has quit [Ping timeout: 268 seconds]
hello has joined #ste||ar
hello has quit [Ping timeout: 250 seconds]
<K-ballo> that unique_ptr in exception_info seems impossible
hello has joined #ste||ar
hello has quit [Remote host closed the connection]
<heller_> K-ballo: are exceptions ever copied?
hello has joined #ste||ar
<K-ballo> yes
<K-ballo> and it must be a nothrow copy
<heller_> ok, wonder why it worked then...
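A minimal sketch of the problem K-ballo is pointing at: a std::unique_ptr member deletes the implicitly generated copy constructor, while exception payloads must be copyable (and, per the requirement above, nothrow copyable). Types here are illustrative, not the actual HPX code:

    #include <memory>
    #include <type_traits>

    struct exception_info_like
    {
        std::unique_ptr<int> data;   // deletes the copy constructor
    };

    // copying such a type is impossible, let alone nothrow-copyable:
    static_assert(!std::is_copy_constructible<exception_info_like>::value,
                  "unique_ptr member makes the type non-copyable");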
hkaiser has joined #ste||ar
<hkaiser> heller_: I found one issue with migration and I'm closing in on the actual problem
time_ has joined #ste||ar
hello has quit [Remote host closed the connection]
time_ is now known as hello
<heller_> hkaiser: excellent
<heller_> hkaiser: we have tons of memory leaks ;)
<hkaiser> heller_: I think it's a race between the flag in the local agas where the object lives and the one in the remote agas managing the object
<heller_> ok
<hkaiser> heller_: we shouldn't have any ;-)
<heller_> indeed
<heller_> in the process of fixing them
<hkaiser> cool
<heller_> this sanitizer business is a rabbit hole deeper than Alice's
<hkaiser> lol
<heller_> just found one related to this: https://github.com/STEllAR-GROUP/hpx/issues/3449
<hkaiser> nod
<hkaiser> figures
<heller_> hkaiser: this is interesting: https://stellar-group.gitlab.io/-/hpx/-/jobs/159443858/artifacts/tests.unit.html#775cc3d3-3e99-462e-91fa-e374cdeb7c28
<hkaiser> never seen this before
<hkaiser> heller_: I think that can be explained by the problem I fixed yesterday
<hkaiser> it is unrelated to the hang, I think
<heller_> it's showing up in basic_action-simplify
<hkaiser> interesting
<K-ballo> mh?
<heller_> K-ballo: the only explanation I have here is that your simplification somehow missed out on the pinned_ptr part
<heller_> let me check the PR
<K-ballo> context?
<heller_> we are seeing hangs in migrate_component
<heller_> hkaiser is trying to fix it
<heller_> now, on your branch, I see that failure
<hkaiser> this attached the continuation to the outer future, instead of the inner one - that causes end_migration to be executed too early
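An illustrative sketch of the bug hkaiser describes (names hypothetical, not the actual migration code): for an action returning a future, a continuation attached to the outer future fires as soon as the inner future has been produced, not once it is ready.

    hpx::future<hpx::future<void>> outer = invoke_action_returning_future();

    // wrong: fires once the inner future merely exists,
    // so end_migration runs too early
    outer.then([](auto&&) { end_migration(); });

    // right: unwrap first, then attach the continuation to the inner future
    hpx::future<void> inner(std::move(outer));   // unwrapping constructor
    inner.then([](auto&&) { end_migration(); });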
<heller_> aha
<heller_> that's unrelated i think though
<hkaiser> well, probably
<hkaiser> actions should always pin the object
<heller_> right
<heller_> which is probably just missing somewhere
<hkaiser> most likely
<heller_> got it
<heller_> or maybe not
<heller_> ok, this actually makes tons of sense now...
<heller_> or maybe not...
* heller_ just shuts up now
bibek has joined #ste||ar
<heller_> hkaiser: I think there might be another race between decoding the parcel and actually pinning the component
<hkaiser> is there now?
<hkaiser> it's pinned right away
<heller_> hkaiser: we pin it during decoding, but it is left unpinned until the actual thread is scheduled
<heller_> unless I miss something obvious
<hkaiser> heller_: parcel::load_schedule is executed while pinned and schedules the thread which pins it again
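A hedged paraphrase of the protocol hkaiser describes, in pseudocode (names illustrative, not the actual HPX signatures): the decoding pin stays alive until the scheduled thread has taken its own pin, so the component is never observably unpinned in between.

    void load_schedule(parcel const& p)
    {
        pinned_ptr decode_pin = pin_component(p.target());   // pin #1: decoding

        // the scheduled thread captures its own pin before pin #1 goes away
        schedule_thread([run_pin = pin_component(p.target())]() mutable {
            execute_action();
        });
    }   // pin #1 released here; pin #2 is still held by the scheduled thread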
<heller_> hkaiser: ok, then I miss something obvious...
<heller_> ahh, decorate_action...
<heller_> doesn't explain why the object isn't pinned in K-ballo's branch...
<heller_> only happens for the actions returning a future
<K-ballo> I touched some code that involved pinning and was specialized for returning futures
<K-ballo> I'll check it out later
<heller_> I went through it, but couldn't find where the missing piece is
<K-ballo> I'd say don't worry about it, if the PR breaks it we'll fix it, I'd worry if master is broken
<heller_> master is broken in different ways
<K-ballo> lol, fix master and we'll make sure the PR does not regress it
<simbergm> diehlpk_work: yt?
<heller_> K-ballo: the failure on master is 100% different than the failure on basic_action-simplify. On master we see occasional hangs, on basic_action-simplify, we see that the object isn't pinned at all when executing an action returning a future
<K-ballo> ok, if the PR fixes the issue from master it would be entirely by chance
<K-ballo> adding more bugs is much easier than removing them
<hkaiser> heller_: the pin count is wrong only for the lazy cases, that means the returned future doesn't keep the pinned_ptr alive anymore
<heller_> yes
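A hedged sketch of the fix idea for the lazy case (pinned_ptr usage assumed from the discussion above, not the actual HPX code): move the pin into the continuation so the returned future owns it until the result is actually ready.

    template <typename T>
    hpx::future<T> keep_pinned(hpx::future<T> f, pinned_ptr pin)
    {
        return f.then([p = std::move(pin)](hpx::future<T>&& g) mutable {
            return g.get();   // 'p' keeps the component pinned until here
        });
    }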
hkaiser has quit [Quit: bye]
aserio has joined #ste||ar
<diehlpk_work> simbergm, Yes
<diehlpk_work> Just arrived in the office
<diehlpk_work> simbergm, I have an idea for the cmake issue for the 1.2.1 release
<diehlpk_work> it is not nice, but it will work, we can hardcode the cmake module path for fedora
<diehlpk_work> So for the 1.2.1 release we would have this workaround patch and remove it for the 1.3 release
<simbergm> diehlpk_work: right, is that definitely the cause of those problems?
<simbergm> so since 1.2.1 is starting to become a bit of an annoyance, is it possible for us to just release 1.3.0 and get that into fedora after the fact or can we only update those packages to 1.2.X?
<diehlpk_work> simbergm, We would need at least the patch for boost 1.69 to have hpx in fedora 30
<diehlpk_work> I would recommend to use this fix and disable the devel package
<diehlpk_work> and the patch for s390x
<diehlpk_work> simbergm, What about adding these two patches as a minimum
<diehlpk_work> So we would have hpx 1.2.1 in fedora 30
<simbergm> well, imo the hpx-devel package is the more important of the two
<simbergm> so it wasn't quite clear to me: is the deadline to have something in fedora 30 February 19th or March 26th?
hkaiser has joined #ste||ar
<diehlpk_work> We can always push it later to fedora 30, but if someone uses our package and it is not available at the release, they cannot upgrade to fedora 30
<diehlpk_work> They would need to remove hpx first
<diehlpk_work> xfce was too late for fedora 29 and I could not update my machine until their packages were available
<diehlpk_work> 2019-04-16 Final Freeze (*) - 10 days would be good to have hpx 1.2.1 in fedora 30
<K-ballo> what were the rules for using hpx_main.hpp on windows? I'm getting missing entry point
<simbergm> ok, that means we can also update the package on f28 and 29 after the fact?
<diehlpk_work> Yes
<simbergm> so what is the actual deadline for f30?
<diehlpk_work> 2019-04-16 Final Freeze (*)
<simbergm> K-ballo: just define int main?
<K-ballo> there is one (trying to run some test as standalone app)
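For reference, the hpx_main.hpp route simbergm suggests looks like this: including the header wraps the plain main below so the HPX runtime is started first (this much is documented HPX behavior).

    #include <hpx/hpx_main.hpp>
    #include <iostream>

    int main(int argc, char* argv[])
    {
        std::cout << "main now runs as an HPX thread\n";
        return 0;
    }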
<diehlpk_work> simbergm, I would prefer to have minimal patch for fedora 30
<diehlpk_work> let us add boost 1.69 and s390x
<diehlpk_work> and fix the devel package
<diehlpk_work> simbergm, I would suggest doing an hpx 1.2.1 release which contains only the boost 1.69 fix
<diehlpk_work> So people could use hpx with boost 1.69 which I think is important
<diehlpk_work> So we have a small release
<diehlpk_work> I will add the patch for s390x to the Fedora package
<diehlpk_work> and we will work on a patch for the devel package
<diehlpk_work> So we would downstream this minor patch to f28 and f29 and we are done
<diehlpk_work> For Fedora 31 we can work on the hpx 1.3 release
<simbergm> diehlpk_work: let me have a look at the release branch, I've applied the cmake patches that I thought would fix those problems but I've clearly applied them wrongly
<simbergm> if I can get those working we're done
<simbergm> otherwise the point is that if the deadline for f30 is in april we'll just release 1.3.0 because it should already have all those fixes and we'll save this hassle of juggling a patch release
<simbergm> that would be plenty of time to make a release
adityaRakhecha has joined #ste||ar
<diehlpk_work> simbergm, Ok, but normally there is no major version upgrade within Fedora. So updating f28 and f29 to 1.3 would be strange
<simbergm> diehlpk_work: ok, that answers my question
<diehlpk_work> simbergm, What we could do is keep 1.2.0 for f28 and f29 and just apply the patch for the devel package there
<diehlpk_work> So we could do hpx 1.3 for f30
<diehlpk_work> I think this would be the easiest way
<diehlpk_work> because on f28 and f29 we do not need to deal with the boost patch
mbremer has joined #ste||ar
<diehlpk_work> So the s390x and boost 1.69 fixes will be in hpx 1.3
<simbergm> diehlpk_work: give me a sec, my cherry-picking definitely went wrong
aserio1 has joined #ste||ar
<simbergm> if that works we won't need to worry about different patches for different fedoras etc
aserio has quit [Ping timeout: 268 seconds]
aserio1 is now known as aserio
<K-ballo> heller_: is the migrate_component test failing consistently for the PR?
<diehlpk_work> simbergm, Sure, take your time.
<simbergm> I've just updated the release branch, this time hopefully correctly
<simbergm> diehlpk_work: ^
<simbergm> let's see if rostam agrees
<simbergm> diehlpk_work: it might be a good idea to set up a circleci/whatever test which would build your fedora recipe, install it and try creating a hello world based on that
<simbergm> also #3686 should help catch these kinds of problems
<heller_> K-ballo: I'll check, but I'd assume so
<heller_> K-ballo: It failed three times with the same problems, so I'd say yes:
<K-ballo> it's running on a single locality here for some reason
mbremer has quit [Quit: Leaving.]
<K-ballo> I suppose it needs some sort of test runner wrapper
<diehlpk_work> simbergm, We could download the rpm from the fedora server, install it on 32-bit and 64-bit docker images, and compile hello world
hello has quit [Ping timeout: 250 seconds]
<diehlpk_work> or we look into their testing system
<diehlpk_work> They have a build and testing system
<diehlpk_work> I asked in the fedora-devel channel; on the build system you can only access the installation path and run tests there
<diehlpk_work> The rpm package is generated as the last step
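A hedged sketch of the CI check diehlpk_work proposes (the image tag, package names, and pkg-config module name are assumptions, not verified against the actual Fedora packaging):

    docker run --rm fedora:30 /bin/sh -c '
      dnf install -y hpx-devel gcc-c++ pkgconf &&
      printf "#include <hpx/hpx_main.hpp>\nint main() { return 0; }\n" > hello.cpp &&
      c++ hello.cpp $(pkg-config --cflags --libs hpx_application) -o hello &&
      ./hello'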
eschnett has joined #ste||ar
jbjnr has quit [Ping timeout: 250 seconds]
<simbergm> diehlpk_work: whatever works
<diehlpk_work> Ok, I will work on a solution
<K-ballo> I notice a number of technical differences with the PR that should not be affecting behavior, one (or more) of them might be
quaz0r has quit [Ping timeout: 244 seconds]
david_pfander has quit [Ping timeout: 246 seconds]
<adityaRakhecha> HPX is participating in GSOC this year. Right?
detan has joined #ste||ar
<hkaiser> adityaRakhecha: yes
<hkaiser> if we get accepted by google
<K-ballo> how did one run multiple localities? -l2 -0 ?
<hkaiser> K-ballo: one locality --hpx:localities=2 --hpx:node=0, the other one --hpx:node=1
<hkaiser> or -l2 -0 and -1
<K-ballo> -l2 -0 and -1 is what I have... pretty good, considering
<K-ballo> -l2, -0, and -1 are being rejected.. the full form works
quaz0r has joined #ste||ar
<hkaiser> depends on whether your app uses hpx_main.hpp or not
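Spelled out, the full form hkaiser gives looks like this (./my_app is a placeholder; each invocation is one "locality" of the same application):

    ./my_app --hpx:localities=2 --hpx:node=0 &   # first locality
    ./my_app --hpx:localities=2 --hpx:node=1     # second locality

The -l2/-0/-1 shortcuts are the abbreviated equivalents, available only when the command line shortcuts are enabled, which per hkaiser depends on whether the application uses hpx_main.hpp.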
<detan> not sure if this is the right place to get help compiling hpx?
<parsa_> detan: it is
<detan> Brilliant.. getting an error on Ubuntu 18.04.1/gcc-6/boost 1.69
<K-ballo> I'm using hpx_main, but I do not trust it
<K-ballo> detan: error_code constructor is explicit, known issue, was introduced by the boost release
<hkaiser> we should generally disable the shortcuts for the command line and use them only if somebody explicitly enables them
<parsa_> detan: is your hpx the current HPX master HEAD?
<detan> @K-ballo ok, i should use 1.68 instead?
<detan> @parsa_ 1.2 release
<K-ballo> 1.68 will work, 1.69 was released after hpx 1.2
<K-ballo> we are working on a 1.2.1 release, but it is experiencing some delays
<K-ballo> otherwise you could apply a simple patch to fix the issue, if you want to stick to 1.69
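An illustrative reduction of the kind of breakage K-ballo describes (toy types, not Boost's actual ones): code relying on an implicit conversion stops compiling once the constructor becomes explicit.

    struct code_168 { code_168(int) {} };            // implicit: OK pre-1.69
    struct code_169 { explicit code_169(int) {} };   // explicit as of 1.69

    code_168 a = 0;    // compiles
    // code_169 b = 0; // error: chosen constructor is explicit
    code_169 c{0};     // the patch: construct explicitly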
<detan> sorry, new to freenode... kinda lost here
<detan> @K-ballo what is the most recommended version for boost? The most tested...
<detan> ah.. ok I will try 1.68...
<detan> thx
<K-ballo> no idea... for hpx 1.2 I guess that would be 1.68, or maybe 1.67? we are testing with 1.58.0 and up
<detan> great.. let me try that...
<K-ballo> it's just that boost 1.69 introduced a breaking change *after* we released hpx 1.2
<diehlpk_work> adityaRakhecha, February 26 12:00 UTC List of accepted mentoring organizations published
<detan> @K-ballo totally understand...
<detan> thanks
<detan> Should I use gcc-6 if I intend to use hpx+cuda?
<diehlpk_work> detan, 6 or 7 should be fine
<diehlpk_work> And Cuda 9.x
<K-ballo> heller_: I'm getting consistent segfaults in the lock detection logic, must be doing something wrong here
<detan> @diehlpk_work thanks!
hello has joined #ste||ar
jbjnr has joined #ste||ar
hello has quit [Remote host closed the connection]
hello has joined #ste||ar
<K-ballo> it's attempting to concurrently erase from the same map
<hkaiser> K-ballo: it should be locked
<hkaiser> next to the map is a mutex
<K-ballo> yeah, something else is off...
<K-ballo> those are different instances of the map
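A minimal sketch of the intended pattern hkaiser describes: the mutex sits next to the map and every erase happens under it. As K-ballo observes, this only helps if all threads go through the same instance.

    #include <map>
    #include <mutex>

    struct lock_register
    {
        std::mutex mtx;                    // "next to the map is a mutex"
        std::map<void const*, int> held;

        void remove(void const* lock)
        {
            std::lock_guard<std::mutex> l(mtx);   // serialize concurrent erases
            held.erase(lock);
        }
    };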
hello has quit [Ping timeout: 268 seconds]
<K-ballo> heller_: one of the failures come from pinning a component using a const-qualified type
aserio has quit [Ping timeout: 250 seconds]
hkaiser has quit [Read error: Connection reset by peer]
hkaiser has joined #ste||ar
<hkaiser> K-ballo: pinning should be mutable
<K-ballo> ok
<K-ballo> and the other failure is just the same failure as I forgot to update the exe for the other "locality"
<K-ballo> lol, now my run has deadlocked while grabbing locks for destroying iterators for the lock register map
parsa_ is now known as parsa
<K-ballo> this debug iterator stuff is crazy
<heller_> K-ballo: as long as this pin.count() != 0 failure is gone, I'm happy
<heller_> K-ballo: perfect! thanks
<heller_> did you notice any decrease in compile time and/or binary size?
hello has joined #ste||ar
<K-ballo> never ever in the history of the universe have I noticed a decrease in compile time :P
<K-ballo> some debug bloat is gone, a fraction of that would have affected binary size
<heller_> yeah ... the second law is killing us all
<heller_> eventually
<K-ballo> that reminds me I was working on some reports on memory usage during compilation for 1.2
hello has quit [Ping timeout: 272 seconds]
aserio has joined #ste||ar
aserio1 has joined #ste||ar
aserio has quit [Ping timeout: 268 seconds]
aserio1 is now known as aserio
<detan> not related to hpx but you guys might know: anyone struggled with __host__ and __device__ attributes for nested lambdas in cuda?
hello has joined #ste||ar
<detan> wondering if anyone here has tested it with clang -> http://lists.llvm.org/pipermail/llvm-bugs/2016-September/051197.html
<diehlpk_work> detan, are you talking about constexpr?
hello has quit [Ping timeout: 246 seconds]
<detan> no... it's just that when you define a [=] __device__ () lambda... and you nest another lambda inside, it must be a [=] __device__ () lambda as well...
<detan> but if the second lambda is defined inside my library, I need to infer which attribute the user used... and have call_device and call_host variants...
<detan> it all seems so clumsy
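A hedged reduction of the nesting issue detan describes (nvcc with --expt-extended-lambda assumed; the behavior is as reported in the discussion, not verified here):

    void make_kernels()   // extended lambdas must appear inside a function
    {
        auto outer = [=] __device__ (int i) {
            // per detan's report, the nested lambda also needs its own
            // __device__ annotation, even though it only runs in device code:
            auto inner = [=] __device__ (int j) { return i + j; };
            return inner(i);
        };
        (void) outer;
    }

A library receiving a user lambda cannot see which annotation it carries, hence detan's clumsy call_host/call_device variants.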
<detan> looks like the link that I sent is related to that... but for clang. So, is hpx+cuda+clang operational?
<detan> oh... there is a branch for clang+cuda...
detan has quit [Ping timeout: 256 seconds]
hkaiser has quit [Quit: bye]
hello has joined #ste||ar
hello has quit [Ping timeout: 246 seconds]
quaz0r has quit [Ping timeout: 246 seconds]
quaz0r has joined #ste||ar
eschnett has quit [Quit: eschnett]
hello has joined #ste||ar
hello has quit [Remote host closed the connection]
hello has joined #ste||ar
hello has quit [Client Quit]
hkaiser has joined #ste||ar
jbjnr has quit [Ping timeout: 264 seconds]
aserio has quit [Quit: aserio]
<K-ballo> we accidentally ended up with two latch.cpp files after one of my changes, does that still confuse msvc?
<K-ballo> or rather msbuild I suppose
<hkaiser> K-ballo: doesn't confuse, just prevents concurrent compilation in VS