#ste||ar on 2018-01-24 — irc logs at irclog.cct.lsu.edu

2017-05-17 13:54 aserio changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/

00:11 EverYoun_ has quit [Remote host closed the connection]

00:16 EverYoun_ has joined #ste||ar

00:34 EverYoun_ has quit [Remote host closed the connection]

01:32 vamatya has quit [Ping timeout: 246 seconds]

02:15 daissgr has joined #ste||ar

02:37 K-ballo has quit [Quit: K-ballo]

02:37 hkaiser has quit [Quit: bye]

03:04 EverYoung has joined #ste||ar

03:09 EverYoung has quit [Ping timeout: 276 seconds]

03:47 EverYoung has joined #ste||ar

03:50 EverYoung has quit [Remote host closed the connection]

03:58 vamatya has joined #ste||ar

04:16 hkaiser has joined #ste||ar

04:16 hkaiser has quit [Client Quit]

05:00 nanashi55 has quit [Ping timeout: 248 seconds]

05:03 nanashi55 has joined #ste||ar

05:07 daissgr has quit [Ping timeout: 276 seconds]

05:51 EverYoung has joined #ste||ar

06:23 vamatya has quit [Ping timeout: 264 seconds]

06:42 parsa has joined #ste||ar

06:58 parsa has quit [Quit: Zzzzzzzzzzzz]

07:34 jaafar has quit [Ping timeout: 246 seconds]

08:17 <github> [hpx] sithhell pushed 1 new commit to master: https://git.io/vNo21

08:17 <github> hpx/master a220e5b Thomas Heller: Merge pull request #3122 from STEllAR-GROUP/fixing_3121...

08:18 <github> [hpx] sithhell deleted fixing_3121 at 4c7c746: https://git.io/vNo2S

08:23 david_pfander has joined #ste||ar

08:26 <zao> https://github.com/easybuilders/easybuild-easyconfigs/issues/5716

08:26 <zao> This is nifty... seems like a flag we've built our OpenMPI with for a long while has some rather massive perf shenanigans.

08:56 Guest2733 is now known as Vir

09:13 <heller_> lol

10:59 <heller_> jbjnr_: yt?

11:00 <jbjnr_> here

11:00 <jbjnr_> no

11:00 <heller_> jbjnr_: I want to add a pycicle tester, how do I verify that my setup works? As in how would I start a test run of a build?

11:01 <jbjnr_> on local laptop/machine or over ssh

11:02 <jbjnr_> once you've setup a config file like "my-laptop.cmake" then "python ./pycicle -m my-laptop"

11:02 <jbjnr_> but I usually try -p 3118 and --debug

11:03 <jbjnr_> to force just one PR to be checked and also don't trigger any builds (over ssh to test job luanching etc)

11:03 <jbjnr_> --help gives a quick summary of options

11:05 <heller_> thanks

11:05 <jbjnr_> use -p option when running locally, because if N branches need rebuilding, you only have one machine, so triggering just one branch is a good idea

11:05 <jbjnr_> -f to force a rebuild even if it doesn't need it

11:06 <jbjnr_> python pycicle -f -p 3118

11:06 <jbjnr_> --debug (-d) prints out comand without ssh sending them

11:06 <github> [hpx] sithhell pushed 1 new commit to master: https://git.io/vNoSy

11:06 <github> hpx/master 261fe3a Thomas Heller: Merge pull request #3125 from STEllAR-GROUP/after_3120...

11:06 <jbjnr_> I might have broken local build recently, only been using it over ssh

11:07 <jbjnr_> if it doesn't work, tell me and I'll fix it

11:07 <heller_> i want to test a ssh build anyway

11:07 <jbjnr_> since it "works for me" but nobody else has tried it, there are probably several things broken that I jus assume people know even though they couldn't possibly know

11:09 <heller_> sure

11:09 <heller_> I think I got a hang on it now

11:09 <heller_> thanks

11:10 <jbjnr_> my daint setup is just "python ./pycicle.py -m daint"

11:10 <heller_> where would I set the build type?

11:11 <jbjnr_> and I leave it running in a terminal and can see stuff scroll whenever you do a merge (like just now)

11:11 <jbjnr_> then it settles into a quiet state and just prints out a check - time since last 86s

11:11 <jbjnr_> watch out that in the python there is a hardocded randon clang/gcc option you might want to disable

11:12 <jbjnr_> I need to turn those into proper config settings that can be changed on a per project/setup basis

11:12 <heller_> random clang/gcc?

11:12 <heller_> that sounds odd

11:13 <heller_> it's just "if machine==daint"

11:13 <heller_> ;)

11:13 <jbjnr_> https://github.com/biddisco/pycicle/blob/master/pycicle.py#L174

11:14 <jbjnr_> oh yes. I forogt there was a dint check there too

11:14 <jbjnr_> ^daint

11:14 <jbjnr_> (what we need is an options matrix with N to choose from and then have randomly generated configs. Just gcc/clang so far)

11:14 <jbjnr_> anyway ...

11:15 <jbjnr_> boost = 'x.xx.x' :)

11:17 <heller_> why random? Don't we want to check all options in the matrix each time?

11:17 <jbjnr_> well, for one, you complained that there was too much information

11:17 <jbjnr_> so multiple all builds by all options

11:17 <jbjnr_> gtg bbiab

11:17 <heller_> well sure

11:20 K-ballo has joined #ste||ar

11:24 <heller_> jbjnr_: Failed to connect to github. Network down?

11:28 <heller_> hmm, bad credentials ... but I generated my user token?

11:47 <heller_> jbjnr_: you got your first user!

12:13 <jbjnr_> working?

12:25 <simbergm> does anyone know what's up with rostam?

12:25 <simbergm> jbjnr_: I think you got a question on github from someone who wants to use pycicle

12:27 <heller_> jbjnr_: slowly...

12:27 <jbjnr_> simbergm: ooh. looking now

12:28 <heller_> simbergm: dunno ... some system messup

12:36 hkaiser has joined #ste||ar

12:41 <heller_> hkaiser: rostam seems to be completely borked now :/

12:41 <hkaiser> heller_: I have not seen Al yesterday

12:43 <heller_> hkaiser: shall I have a look?

12:44 <hkaiser> pls do, but leave Al in the loop, pls

12:46 <heller_> ok, i undrained the ariel nodes

12:47 <heller_> this should get rid of the exception

12:48 <github> [hpx] sithhell deleted fix_hello at baacf09: https://git.io/vNoxs

12:48 <heller_> merged a new PR ... let's see

12:53 * jbjnr_ is waiting with baited breath

13:08 <heller_> so ... just using, for example papi_cost, I can't reproduce the segfault

13:09 <heller_> so it is either the generic context coroutines (who knows why those are activated on rostam) or something else

13:09 <heller_> I'll try to reproduce this locally here

13:22 <hkaiser> heller_: how do you know those are enabled?

13:25 <heller_> hkaiser: from the stacktrace

13:25 <heller_> hkaiser: red herring ... they are only enabled for the gcc 6.3 builds

13:26 <hkaiser> I now remember we enabled those deliberately to have it tested on at least one or two platforms

13:26 <heller_> *nod*

13:27 <K-ballo> good thinking

13:27 <jbjnr_> We will have to start running HPX on summit (ARM)

13:27 <hkaiser> indeed

13:27 <jbjnr_> so expect plenty of testing of differnt architectures soon

13:27 <hkaiser> perfect

13:27 <jbjnr_> ARM + 2 GPU's per node

13:28 <heller_> arm?

13:28 <heller_> I thought it was power9?

13:28 <hkaiser> isn't summit power9?

13:28 <heller_> I have a regular power tester

13:29 <jbjnr_> so sorry. power 9

13:29 <jbjnr_> I'm losing my mind

13:29 <jbjnr_> risc-ish anyway :)

13:29 <jbjnr_> the hpx+cuda situation is a total mess, and this is going to be a problem for us moving forward with our next big project on summit.

13:30 <jbjnr_> ooh - cdash has a FAU machine incoming

13:30 <heller_> hkaiser: regarding our discussion about two threads trying to resume one suspended one ... and I am having hard time to figure out what the semantics of this should be ... I am even thinking that this is UB in general, and the two "resumer" threads need to synchronize in a different way, for example what our condition variable is doing. Am I totally off there?

13:31 <heller_> jbjnr_: I agree ... there would have been an easy way out of this mess...

13:31 <jbjnr_> well. I blame you for that!

13:31 <hkaiser> heller_: stop worrying about this ;-)

13:31 <hkaiser> just leave it alone

13:31 <hkaiser> jbjnr_: having a mess is only a 'problem' if nobody does anything about it

13:31 parsa has joined #ste||ar

13:32 <jbjnr_> it'll be my next problem, so I'll have to do something about it

13:33 <hkaiser> good!

13:33 <hkaiser> should we clean up the existing half-way solved 'problems' first?

13:34 <jbjnr_> cat hpx::compute > /dev/nul

13:34 <jbjnr_> which "we" are we talking about?

13:34 <hkaiser> if you have a better solution, all the better

13:34 <hkaiser> I always use the the royal 'we' ;) I'm a Kaiser

13:35 <heller_> lol

13:36 <jbjnr_> <sigh>

13:36 Smasher has quit [Remote host closed the connection]

13:37 Smasher has joined #ste||ar

13:42 mcopik has quit [Ping timeout: 255 seconds]

13:44 <heller_> hkaiser: the suspend/resume or the other thing?

13:45 <hkaiser> what other thing?

13:45 <heller_> the easy way out of the hpx+cuda mess ;)

13:45 <hkaiser> ahh

13:45 <hkaiser> the suspend/resume

13:46 <heller_> I won't, that really bugs me ;)

13:46 <hkaiser> heh, why did I know ...

13:46 <hkaiser> it's very high risk with questionaly outcomes

13:46 <hkaiser> questionable*

13:47 <hkaiser> the changes you work on currently are noticable in real applications for very high contention already - this change would not be measuarable at all - so why bother

13:47 <hkaiser> ?

13:48 <heller_> I am getting those changes in first

13:48 <heller_> and I don't think they won't be measurable

13:49 <hkaiser> the current changes will have some effect for sure

13:50 <heller_> absolutely

13:54 <jbjnr_> still trying to fix my bugs, so I can test your stuff heller_

13:55 <heller_> jbjnr_: gradually getting it into master

13:55 <jbjnr_> :)

13:55 <heller_> jbjnr_: once I am completely done, we'll take care of your stuff

13:55 <heller_> just takes an insane amount time, since I really want to ensure everything still works

13:58 <heller_> hkaiser: FWIW, there might be a low hanging fruit that doesn't require a high risk change

13:59 <heller_> but I won't tell any details until I have more than just an educated guess, because I know what you'll going to say

14:06 <simbergm> currently hpx::start can return before the runtime is in state_running, is this a feature or bug? I'd like to suspend the runtime as soon possible after it's running

14:08 <simbergm> and second, starting the runtime without a main function does not seem to be possible right now, or am I missing some overload?

14:18 <hkaiser> heller_: ok

14:19 <hkaiser> simbergm: I was not aware of start returning before the runtime actually 'runs'

14:20 <hkaiser> but start is ment to signal to the runtime to start - so I guess it could happen, yah

14:20 <simbergm> yeah, I don't think it's necessarily wrong, just wondering if I should do the checking separately in that case

14:20 <hkaiser> simbergm: start(argc, argv) does not need a main-function?

14:21 <simbergm> hkaiser: but that will then call hpx_main?

14:21 <hkaiser> only on locality 0 (if not otherwise configured)

14:21 <hkaiser> and only if the locality is not connecting late

14:21 <simbergm> okay, I'm only dealing with one locality

14:22 <simbergm> I'm just trying to streamline starting the runtime and getting it suspended

14:22 <hkaiser> what would be the point of starting the runtime without running any functions?

14:22 <simbergm> it's not a big deal to add an empty hpx_main though

14:22 <hkaiser> nod

14:23 <simbergm> it should just be initialized, I can then later resume; hpx::async(); suspend

14:23 <simbergm> ideally faster than start/stop

14:23 <hkaiser> makes sense

14:23 <hkaiser> that was not part of the initial design ;)

14:24 <hkaiser> I think passing nullptr as hpx_main will do the trick

14:24 <hkaiser> or a default constructed function<...>{}

14:24 <simbergm> mmh, I'll try that

14:24 <simbergm> an empty function_nonser did not work when I tried it though

14:25 <hkaiser> hold on

14:25 <K-ballo> an empty one might throw, unless it is special cased to be ignored

14:25 <simbergm> I think it gets wrapped along the way

14:25 <simbergm> yeah, it throws

14:25 <hkaiser> here: https://github.com/STEllAR-GROUP/hpx/blob/master/src/runtime_impl.cpp#L295

14:26 <simbergm> hkaiser: my guess is that the bind in hpx_init.cpp wraps it so that it actually tries to call it

14:26 <simbergm> but I'm not certain

14:27 <hkaiser> I though func was the function_nonser passed in from the user verbatim - but I might be wrong

14:27 parsa has quit [Quit: Zzzzzzzzzzzz]

14:28 <hkaiser> simbergm: feel free to add an overload for init()/start() taking a nullptr_t instead of the function, causing nothing to be run

14:29 <simbergm> hkaiser: yeah, I might be wrong too

14:29 <simbergm> ok, I might just do that

14:30 <hkaiser> simbergm: you'd need 3 overloads: https://github.com/STEllAR-GROUP/hpx/blob/master/hpx/hpx_init_impl.hpp#L280-L326

14:31 <hkaiser> simbergm: and you're right, func is bound separately for certain cases: https://github.com/STEllAR-GROUP/hpx/blob/master/hpx/hpx_init_impl.hpp#L288-L289

14:32 <simbergm> hkaiser: thanks

14:32 <simbergm> that's the one I meant

14:32 <simbergm> I'll add some overloads

14:32 <hkaiser> the nullptr_t overloads should take care of that by constructing an empty function_nonser

14:34 <K-ballo> so many overloads

14:34 <simbergm> nullptr_t seems nice, never used it

14:35 <hkaiser> K-ballo: so much fun

14:39 hkaiser has quit [Quit: bye]

14:54 akheir has joined #ste||ar

15:12 aserio has joined #ste||ar

15:24 <akheir> heller_: yt? I heard you ran into trouble with rostam yesterday

15:24 <heller_> akheir: buildbot does

15:24 <heller_> akheir: look at the segfaults

15:24 <heller_> akheir: something to do with papi

15:25 <heller_> akheir: https://gist.github.com/sithhell/6187d69c5dfcfdeae2e8174f14a2b6a7

15:25 <heller_> here is the stacktrace

15:25 <heller_> I can't reproduce it anywhere else and it is not a stack overflow

15:26 <akheir> heller_: I'll take a look, I patched the server for meltdown thing last week, I've heard It may cause trouble with papi but I didn't expect this

15:29 <heller_> you don't trust your users :)?

15:29 <heller_> when I run some papi utilities, it seems to work fine

15:29 <heller_> it's just within HPX

15:29 <heller_> maybe you need to recompile PAPI against the new kernel?

15:30 <akheir> heller_: this papi comes with upstream I will compile a new one and put it in buildbot

15:31 <akheir> heller_: I don't know half of the users ;-)

15:31 <heller_> thanks

15:31 <heller_> let's hope this'll fix it

15:32 <heller_> akheir: another note, I undrained the ariel nodes this morning

15:32 <akheir> heller_: thanks, Patrick told me about it but I wasn't on my desk anymore to fix it

15:38 rtohid has joined #ste||ar

15:45 daissgr has joined #ste||ar

16:04 daissgr has quit [Ping timeout: 276 seconds]

16:16 daissgr has joined #ste||ar

16:24 <jbjnr_> heller_: FAU build didn't ever complete by the looks of it :(

16:25 <heller_> jbjnr_: no i cancelled it

16:25 <jbjnr_> ok

16:25 <heller_> because idiot me included the header tests

16:25 <jbjnr_> Still can't remember what they are/do

16:25 <heller_> they check that all headers are self consistent

16:25 <heller_> self contained, sorry

16:39 EverYoung has quit [Ping timeout: 276 seconds]

16:43 <github> [hpx] hkaiser closed pull request #3118: Adding performance_counter::reinit to allow for dynamically changing counter sets (master...reinit_counters) https://git.io/vN2Is

16:47 aserio has quit [Ping timeout: 276 seconds]

16:58 eschnett has joined #ste||ar

16:58 aserio has joined #ste||ar

16:58 <aserio> wash[m]: Will you be joining us today?

17:03 parsa has joined #ste||ar

17:13 EverYoung has joined #ste||ar

17:19 EverYoung has quit [Remote host closed the connection]

17:20 EverYoung has joined #ste||ar

17:25 EverYoung has quit [Remote host closed the connection]

17:25 EverYoung has joined #ste||ar

17:27 akheir_ has joined #ste||ar

17:30 vamatya has joined #ste||ar

17:35 akheir_ has quit [Remote host closed the connection]

17:35 parsa has quit [Quit: Zzzzzzzzzzzz]

17:38 aserio has quit [Quit: aserio]

17:39 aserio has joined #ste||ar

17:44 aserio has quit [Ping timeout: 276 seconds]

18:04 jaafar has joined #ste||ar

18:08 EverYoun_ has joined #ste||ar

18:10 EverYoung has quit [Ping timeout: 276 seconds]

18:19 daissgr has quit [Read error: Connection reset by peer]

18:39 david_pfander has quit [Ping timeout: 240 seconds]

18:50 jaafar_ has joined #ste||ar

18:54 jaafar has quit [Ping timeout: 252 seconds]

19:04 jaafar_ is now known as jaafar

19:04 jaafar has quit [Remote host closed the connection]

19:05 jaafar_ has joined #ste||ar

19:05 jaafar_ is now known as jaafar

19:28 aserio has joined #ste||ar

19:30 jaafar has quit [Remote host closed the connection]

19:30 jaafar has joined #ste||ar

20:37 aserio1 has joined #ste||ar

20:41 aserio has quit [Ping timeout: 252 seconds]

20:41 aserio1 is now known as aserio

20:55 daissgr has joined #ste||ar

20:59 aserio has quit [Ping timeout: 252 seconds]

21:00 aserio has joined #ste||ar

21:20 jaafar has quit [Remote host closed the connection]

21:21 jaafar has joined #ste||ar

21:27 jaafar has quit [Ping timeout: 252 seconds]

21:28 jaafar has joined #ste||ar

21:28 <aserio> twwright: ty?

21:32 <twwright> aserio, yes

21:33 <aserio> twwright: see pm please

21:46 aserio has quit [Quit: aserio]

21:46 aserio has joined #ste||ar

21:51 aserio1 has joined #ste||ar

21:52 jaafar_ has joined #ste||ar

21:53 jaafar has quit [Ping timeout: 276 seconds]

21:54 aserio1 has quit [Remote host closed the connection]

21:54 daissgr has quit [Quit: WeeChat 1.4]

21:54 aserio has quit [Ping timeout: 252 seconds]

21:56 daissgr has joined #ste||ar

22:02 eschnett has quit [Quit: eschnett]

22:03 aserio has joined #ste||ar

22:09 hkaiser has joined #ste||ar

22:15 <jbjnr_> heller_: good news, bad news - good news - I fixed my problem and can run tests again - bad news, no noticable speedup using your fix overheads branch (yet)

22:16 kisaacs has joined #ste||ar

22:17 <hkaiser> jbjnr_: you most likely won't be able to see a measurable improvement in overall runtime, but you should be able to reduce your thread-granularity which then might improve runtime

22:17 <aserio> hkaiser: would you forward me Jiangua's email

22:17 <hkaiser> done

22:19 <jbjnr_> that's exactly what I'm testing. no noticable speedup for the smaller block sizes

22:20 <hkaiser> jbjnr_: ok, good to know

22:38 aserio has quit [Ping timeout: 252 seconds]

22:40 Smasher has quit [Remote host closed the connection]

22:40 Smasher has joined #ste||ar

22:41 akheir has quit [Remote host closed the connection]

22:55 rtohid has left #ste||ar [#ste||ar]

23:00 mcopik has joined #ste||ar

23:07 kisaacs has quit [Ping timeout: 260 seconds]

23:07 aserio has joined #ste||ar

23:08 aserio has quit [Client Quit]

23:29 jaafar_ is now known as jaafar

23:41 kisaacs has joined #ste||ar

23:58 EverYoung has joined #ste||ar