aserio changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/
EverYoun_ has joined #ste||ar
EverYoung has quit [Ping timeout: 246 seconds]
EverYoun_ has quit [Ping timeout: 255 seconds]
EverYoung has joined #ste||ar
EverYoung has quit [Remote host closed the connection]
EverYoung has joined #ste||ar
EverYoung has quit [Remote host closed the connection]
<heller_>
why is it building with generic context coroutines?
parsa has joined #ste||ar
eschnett has quit [Quit: eschnett]
eschnett has joined #ste||ar
EverYoung has joined #ste||ar
<github>
[hpx] sithhell force-pushed thread_data_refcount from 9559988 to ed9396c: https://git.io/vNacO
<github>
hpx/thread_data_refcount ed9396c Thomas Heller: Don't use boost::intrusive_ptr for thread_id_type...
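For context on that commit: boost::intrusive_ptr performs an atomic reference-count update on every copy, which a plain pointer-like id avoids. A minimal sketch of the overhead being removed, with illustrative names rather than the actual HPX types:

    #include <atomic>

    struct thread_data
    {
        std::atomic<long> count{0};
    };

    // boost::intrusive_ptr<thread_data> calls these hooks: every copy of
    // the id is an atomic increment, every destruction an atomic decrement.
    inline void intrusive_ptr_add_ref(thread_data* p) { ++p->count; }
    inline void intrusive_ptr_release(thread_data* p)
    {
        if (--p->count == 0)
            delete p;
    }
    // A raw thread_data* (or a thin wrapper around one) copies for free,
    // which matters on hot paths that pass thread ids around constantly.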
parsa has quit [Quit: Zzzzzzzzzzzz]
nanashi55 has quit [Ping timeout: 264 seconds]
nanashi55 has joined #ste||ar
parsa has joined #ste||ar
parsa has quit [Client Quit]
parsa has joined #ste||ar
parsa has quit [Ping timeout: 264 seconds]
jaafar_ has joined #ste||ar
jaafar_ has quit [Remote host closed the connection]
jaafar_ has joined #ste||ar
vamatya has joined #ste||ar
daissgr has quit [Ping timeout: 246 seconds]
jaafar_ is now known as jaafar
jaafar_ has joined #ste||ar
jaafar_ has quit [Client Quit]
jaafar has quit [Ping timeout: 255 seconds]
jaafar has joined #ste||ar
EverYoun_ has joined #ste||ar
EverYoung has quit [Ping timeout: 246 seconds]
<github>
[hpx] sithhell force-pushed thread_data_refcount from ed9396c to 3f3143e: https://git.io/vNacO
<github>
hpx/thread_data_refcount 3f3143e Thomas Heller: Don't use boost::intrusive_ptr for thread_id_type...
mcopik has quit [Ping timeout: 240 seconds]
david_pfander has joined #ste||ar
vamatya has quit [Ping timeout: 256 seconds]
<jbjnr_>
heller_: what's the status of your branch/branches?
<jbjnr_>
I've got problems with master and I feel like I'm wasting time fixing things if I'm going to use your branch anyway, where things like wait_or_add_new no longer exist
<jbjnr_>
home/biddisco/src/hpx/src/runtime/parcelset/parcel.cpp:456:78: error: invalid user-defined conversion from ‘hpx::threads::thread_id_type’ to ‘uint64_t {aka long unsigned int}’ [-fpermissive]
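The error above is the compiler rejecting an implicit user-defined conversion. A toy reproduction (not HPX code) of the failure and the kind of explicit cast that resolves it:

    #include <cstdint>

    struct toy_thread_id            // stand-in for hpx::threads::thread_id_type
    {
        void* ptr = nullptr;
        explicit operator void*() const { return ptr; }
    };

    std::uint64_t as_integer(toy_thread_id id)
    {
        // std::uint64_t v = id;    // error: invalid user-defined conversion
        return reinterpret_cast<std::uint64_t>(static_cast<void*>(id));
    }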
gedaj has quit [Ping timeout: 260 seconds]
gedaj has joined #ste||ar
<jbjnr_>
ok. Everything is broken if I use the thread_overheads branch.
<heller_>
why is it building with generic context coroutines, btw?
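HPX selects its coroutine backend at configure time, so the answer usually lives in the CMake cache. Assuming the build in question (which the log doesn't show), the relevant switch would be checked or forced like this:

    # in the build directory; HPX_WITH_GENERIC_CONTEXT_COROUTINES is the
    # CMake option controlling the generic (Boost.Context) backend
    grep GENERIC_CONTEXT CMakeCache.txt
    cmake -DHPX_WITH_GENERIC_CONTEXT_COROUTINES:BOOL=OFF .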
<heller_>
yeah, something with the PAPI rapl counters
<hkaiser>
shouldn't
<heller_>
it does though ;)
<hkaiser>
ok, so it tests this at least ;)
<hkaiser>
I'll talk to Al today, he might have changed something lately
<heller_>
maybe some kernel updates related to meltdown?
parsa has quit [Quit: Zzzzzzzzzzzz]
parsa has joined #ste||ar
<heller_>
jbjnr_: are you compiling with the idle backoff thingy?
<jbjnr_>
I've got everything to compile now
<jbjnr_>
doesn't work, but ...
<jbjnr_>
going to go back to master anyway
<heller_>
so, one thing I discovered just now is that the idle backoff mechanism might be one of the reasons for the remaining 10% of performance you are looking for
<jbjnr_>
ok, keep talking
<heller_>
especially in your use case, where the queues probably drain quite often
<jbjnr_>
no
<jbjnr_>
queues are always super full
<heller_>
what we do is a timed wait on a condition variable
<heller_>
oh, ok
<heller_>
then nevermind
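A minimal sketch of the timed wait heller_ describes, with illustrative names and constants rather than the actual HPX scheduler internals:

    #include <algorithm>
    #include <chrono>
    #include <condition_variable>
    #include <mutex>

    std::mutex mtx;
    std::condition_variable cv;     // notified when new work arrives

    void idle_backoff(std::chrono::microseconds& wait)
    {
        std::unique_lock<std::mutex> lk(mtx);
        // A worker that keeps finding its queues empty sleeps here; the
        // wake-up latency is the cost heller_ suspects, which only bites
        // when the queues actually drain.
        cv.wait_for(lk, wait);
        wait = std::min(wait * 2, std::chrono::microseconds(1000));
    }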
<jbjnr_>
(or at least they should be)
<jbjnr_>
I will disable the idle backoff anyway. I think I usually do, but I'll check
<jbjnr_>
my custom scheduler is broken somehow and I don't know why
<jbjnr_>
I'm using this -DHPX_WITH_THREAD_MANAGER_IDLE_BACKOFF:BOOL=OFF \
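For context, that flag sits in a configure line roughly like the following (the path and build type are placeholders, not taken from the log):

    cmake -DCMAKE_BUILD_TYPE=Release \
          -DHPX_WITH_THREAD_MANAGER_IDLE_BACKOFF:BOOL=OFF \
          /path/to/hpx-source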
<heller_>
with the thread_overhead branch?
<heller_>
ah ok, that's good then
<jbjnr_>
no. my scheduler seems to be broken with master, so I thought rather than fixing it there, I'd try your branch cos then the thread queues are simpler, but it isn't any better
<heller_>
hmm
<jbjnr_>
it was faulting in wait_or_add_new, so I figured as you removed that ...
<heller_>
what symptoms do you see?
<jbjnr_>
and never terminating
<heller_>
tell me more...
<jbjnr_>
entering destroy_thread millions of times, but never terminating
<jbjnr_>
and eventually locking the macie up
<jbjnr_>
machine^
<heller_>
ok
<heller_>
with my branch, right?
<jbjnr_>
with your branch it's worse. with master it was flakey
<jbjnr_>
it used to work fine, but I didn't touch it for 2 months
<heller_>
ok
<heller_>
I think I know what's going on
<jbjnr_>
and rebasing it onto latest master has taken me 2 days and it still doesn't work.
<heller_>
can I see your updated scheduler against my branch please?
<jbjnr_>
the second one calls the first, so I only need it in the second one
<heller_>
jbjnr_: thread_queue::destroy_thread now always returns true
<jbjnr_>
what if i call it with a thread that's in another queue?
<heller_>
thread_id_type::get_queue returns a void pointer to the thread_queue that allocated it
<jbjnr_>
aha. ok then I can simplify everything
<heller_>
*nod*
<jbjnr_>
the first link you showed is now obsolete then
hkaiser has quit [Quit: bye]
<jbjnr_>
cos that loops over queues
<jbjnr_>
(and so does the first, I have holders of queues)
<heller_>
yeah, those loops should be obsolete now
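A sketch of the simplification being discussed, using stand-in types; get_queue mirrors the accessor heller_ describes, everything else is illustrative:

    struct thread_queue;

    struct thread_id_type
    {
        void* queue = nullptr;      // set by the thread_queue that allocated it
        void* get_queue() const { return queue; }
    };

    struct thread_queue
    {
        // on heller_'s branch this now always returns true
        bool destroy_thread(thread_id_type) { return true; }
    };

    // Before: loop over every queue (or holder of queues) until one accepts
    // the thread.  After: dispatch straight to the owning queue.
    bool destroy(thread_id_type id)
    {
        return static_cast<thread_queue*>(id.get_queue())->destroy_thread(id);
    }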
<jbjnr_>
hmmm. still calling destroy_thread millions of times and looping forever
<heller_>
hmpf
<jbjnr_>
no. worked that time.
<jbjnr_>
must have run the wrong binary first time maybe....
<heller_>
good
<jbjnr_>
(I think I ran the binary from the terminal before the IDE had finished writing out the new compiled version)
<jbjnr_>
seems to be working now. thanks
<jbjnr_>
I can begin testing again (if I can fix the other unrelated bug)
<jbjnr_>
might be able to make use of that queue member elsewhere to remove some other loops
<jbjnr_>
heller_: hard to tell on the laptop, but things do look a bit better for smaller blocks. I will try to get binaries ready for some big tests this evening on daint
<jbjnr_>
rats. other bug still there. locks up on N>1 numa domains.
hkaiser has joined #ste||ar
hkaiser has quit [Client Quit]
hkaiser has joined #ste||ar
<heller_>
jbjnr_: do you know where the lockups are?
<heller_>
At shutdown or somewhere else?
rtohid has joined #ste||ar
<jbjnr_>
my lockups are nearer to startup than shutdown
<jbjnr_>
when I make my single-socket laptop appear to have 2 numa domains for testing
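One way to make a single-socket machine report two NUMA domains is hwloc's synthetic topology support; whether this is how jbjnr_ does it is an assumption, and the description syntax varies between hwloc versions:

    # hwloc builds a fake topology from HWLOC_SYNTHETIC and HPX sees it;
    # my_hpx_app is a placeholder binary name
    HWLOC_SYNTHETIC="node:2 core:2 pu:2" ./my_hpx_app --hpx:threads=4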