<gonidelis[m]>
Can't really get what `disable_sized_sentinel_for` does...
<gonidelis[m]>
`The variable template disable_sized_sentinel_for provides a mechanism for iterators and sentinels that can be subtracted but do not meet the semantic requirements of sized_sentinel_for to opt out of the concept by specializing the variable template to have the value true.`
<K-ballo>
it disables the type as a sized sentinel, it makes the concept check fail
Yorlik has joined #ste||ar
<gonidelis[m]>
K-ballo: why disable the type though?
<K-ballo>
because it doesn't model the semantic requirements of sized sentinel
<gonidelis[m]>
So if `remove_cv_t` can be applied to S and I then `!disable_sized_sentinel_for` is true, otherwise it fails?
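(For context, this is the C++20 definition being paraphrased here: the trait is checked on the cv-stripped types, and opting out makes the whole concept false.)

```cpp
#include <concepts>
#include <iterator>
#include <type_traits>

// C++20's sized_sentinel_for, reproduced from [iterator.concept.sizedsentinel]:
template <class S, class I>
concept sized_sentinel_for =
    std::sentinel_for<S, I> &&
    !std::disable_sized_sentinel_for<std::remove_cv_t<S>, std::remove_cv_t<I>> &&
    requires(const I& i, const S& s) {
        { s - i } -> std::same_as<std::iter_difference_t<I>>;
        { i - s } -> std::same_as<std::iter_difference_t<I>>;
    };
```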
<hkaiser>
sounds like a lot, I don't know why it is that large
<hkaiser>
we have never particularly cared about memory requirements
<Yorlik>
I have to - sending many many messages
<hkaiser>
look at the shared state in a debugger to see what members it has
<Yorlik>
OK
<gonidelis[m]>
hkaiser: you there?
<hkaiser>
gonidelis[m]: here
<gonidelis[m]>
I could use some advice with disable_sized_sentinel_for
<gonidelis[m]>
I think I need to create it myself, right?
<hkaiser>
c++20 will have std::disable_sized_sentinel_for
<gonidelis[m]>
But since we don't use c++20 yet, we have to create it from scratch, right?
<hkaiser>
we should implement our own hpx::traits::disable_sized_sentinel_for which defaults to 'false' for C++ < 20 and defaults to the std version otherwise
<hkaiser>
correction: it should default to std::disable_sized_sentinel_for if available
<hkaiser>
otherwise it should default to false
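(A minimal sketch of the trait hkaiser describes; the feature-test check and layout are assumptions, not HPX's actual implementation.)

```cpp
#include <iterator>

// Sketch: defer to the std trait when available, default to false otherwise.
// Requires C++17 for inline variable templates.
namespace hpx { namespace traits {

#if defined(__cpp_lib_ranges)
    // C++20: forward to the std trait so user specializations are honored
    template <typename S, typename I>
    inline constexpr bool disable_sized_sentinel_for =
        std::disable_sized_sentinel_for<S, I>;
#else
    // pre-C++20: nothing is disabled by default; users would specialize
    // this variable template instead
    template <typename S, typename I>
    inline constexpr bool disable_sized_sentinel_for = false;
#endif
}}    // namespace hpx::traits
```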
<gonidelis[m]>
ok, so this `hpx::traits::disable_sized_sentinel_for`. Should I write the code inside the `is_sentinel_for.hpp` file?
<hkaiser>
Yorlik: ok, that's your continuation you attached to the future
<Yorlik>
The action?
<hkaiser>
no
<hkaiser>
do you attach continuations (using .then())?
<Yorlik>
The future is created like this: auto fut = hpx::async<gameobject::send_message_action<M>>( recipient, std::move( msg ) );
<Yorlik>
recipient is an id_type
<hkaiser>
what do you do with the future 'fut' afterwards?
<Yorlik>
I store it and check for exceptions after some time. if it is ready it's discarded
<Yorlik>
Like this: sender.echo_list.push_back( std::move( fut ) );
<Yorlik>
the echo_list is checked
<hkaiser>
ahh, I see now
<Yorlik>
Bad?
<hkaiser>
this is a boost::container::small_vector<util::unique_function<>, 3>
<Yorlik>
Yep
<hkaiser>
unique_function has at least 3 * sizeof(void*), so this small_vector has at least 9 * sizeof(void*)
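(Back-of-the-envelope check of that arithmetic; the element size is the lower bound quoted above, not the exact HPX layout.)

```cpp
#include <cstddef>
#include <cstdio>

int main()
{
    // One unique_function is at least 3 pointers, so 3 inline elements are
    // at least 9 pointers, before the vector's own begin/size/capacity
    // bookkeeping is added on top.
    constexpr std::size_t fn_size = 3 * sizeof(void*);   // 24 bytes on 64-bit
    constexpr std::size_t inline_elems = 3 * fn_size;    // 72 bytes
    std::printf("%zu bytes of inline element storage\n", inline_elems);
}
```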
<Yorlik>
BloatyMcBloatface?
<hkaiser>
no, perf-optimizations for up to 3 continuations
<hkaiser>
probably a bit over the top ;-)
<Yorlik>
And if I don't use these continuations I'm hosed?
<Yorlik>
Time for another overload?
<hkaiser>
not hosed, just wasting memory
<Yorlik>
Which is a problem in a messaging system.
<hkaiser>
that shared state is used everywhere
<Yorlik>
If I ever reach my goal of 100ms/frame that would be 260 MB / second on 100k objects/messages
<Yorlik>
I need to rethink the messaging system or get some help from you here, I think.
<Yorlik>
I have already reduced many dynamic allocations in my code, like messages, mailboxes, small vectors for parameters etc
<Yorlik>
And I'm redirecting them to mimalloc
<hkaiser>
Yorlik: I think we can safely reduce this
<Yorlik>
What would you suggest?
<hkaiser>
small_vector<> is not particularly memory friendly
<Yorlik>
It could be a user option
<Yorlik>
Like what properties the futures should have
<Yorlik>
More templating, I guess
<hkaiser>
futures seldom have more than one continuation attached, so we were trying to optimize the one-continuation case while still having the option of attaching more than one
<hkaiser>
one continuation should not require an additional allocation, but more than one could
<Yorlik>
With low-frequency actions and long-running remote actions that's a non-issue. I guess I'm introducing a different use case here.
<Yorlik>
High Frequency small actions
<gonidelis[m]>
hkaiser: So for my cxx20_disable_sized_sentinel_for.cpp dummy test, do you know what header I should include?
<Yorlik>
Need to get some food - BRB in ~20 minutes
<hkaiser>
we might need a custom container that a) does not allocate for one element, b) allows for adding more elements, and c) never shrinks
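(A hypothetical sketch of a container meeting those three constraints; the name and details are made up, not HPX code.)

```cpp
#include <cstddef>
#include <new>
#include <utility>
#include <vector>

// Sketch: the first element lives inline (a: no allocation for one element),
// later elements spill into a std::vector (b: can grow), and nothing ever
// gives memory back (c: never shrinks). Copy/move and synchronization omitted.
template <typename T>
class grow_only_storage
{
    alignas(T) unsigned char first_[sizeof(T)];  // inline slot for element 0
    bool has_first_ = false;
    std::vector<T> rest_;                        // elements 1..n-1

public:
    grow_only_storage() = default;
    grow_only_storage(grow_only_storage const&) = delete;
    grow_only_storage& operator=(grow_only_storage const&) = delete;

    ~grow_only_storage()
    {
        if (has_first_)
            reinterpret_cast<T*>(first_)->~T();
    }

    void push_back(T value)
    {
        if (!has_first_)
        {
            // common case: the one and only continuation, no allocation
            ::new (static_cast<void*>(first_)) T(std::move(value));
            has_first_ = true;
        }
        else
        {
            rest_.push_back(std::move(value));   // rare case: allocates
        }
    }

    std::size_t size() const { return (has_first_ ? 1 : 0) + rest_.size(); }
};
```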
<nikunj97>
can confirm it works for gcc 10.1 on another cluster that I have
<hkaiser>
weilewei: just the first of those - look for get_thread_data()/set_thread_data() for guidance
<hkaiser>
nikunj97: this is a strange one
<nikunj97>
could it be something relating to cmake?
<hkaiser>
shrug
<nikunj97>
hkaiser, let me explore it a bit. will update you on my findings
<hkaiser>
thanks
<weilewei>
hkaiser in the first file, get_thread_data()/set_thread_data() are virtual functions though
<hkaiser>
right
jaafar_ has joined #ste||ar
<hkaiser>
hey jaafar_
<weilewei>
hmm then there should be some inheritance of get_thread_data()/set_thread_data() defined somewhere else
<hkaiser>
jaafar_: it's completely beyond me how expanding a function-style macro could be different from expanding a non-function-style macro :/
<hkaiser>
weilewei: did you do a grep for get_thread_data?
<hkaiser>
you will find it in several files, almost all of which need to be looked at
<weilewei>
right, there is a lot of get_thread_data in different places
<weilewei>
sure, I will add new features everywhere appropriate
jaafar has quit [Ping timeout: 265 seconds]
kale[m] has quit [Ping timeout: 258 seconds]
kale[m] has joined #ste||ar
<Yorlik>
hkaiser: Back. So - is there anything I can do?
<hkaiser>
Yorlik: create such a custom data structure ;-)
<Yorlik>
How would I use it?
<hkaiser>
that could replace the small_vector in the shared state
<hkaiser>
Yorlik: for now we could replace it with std::vector<> but this would require an allocation even for the first continuation
<Yorlik>
I think we need none of this but a reduced future
<Yorlik>
I don't want to pay for what I don't use
<hkaiser>
Yorlik: not sure how we could accommodate this request
<Yorlik>
C++? ;)
<hkaiser>
go ahead
<Yorlik>
ping-pong ...
kale[m] has quit [Ping timeout: 258 seconds]
karame_ has joined #ste||ar
<nikunj97>
ok my distributed 1d stencil is both seg faulting and not scaling :/
kale[m] has joined #ste||ar
nan11 has joined #ste||ar
<hkaiser>
ms[m]: yt?
<K-ballo>
ms[m]: moar conflicts?
<ms[m]>
K-ballo and hkaiser yes
<hkaiser>
ms[m]: I created #4758
<ms[m]>
sorry K-ballo, that should be the last module renaming pr for this release
<hkaiser>
interesting insights
<ms[m]>
hkaiser: the cmake profiling results look very interesting
<hkaiser>
yes
<ms[m]>
yeah, I bet looping through all our cache variables isn't the most efficient
<ms[m]>
plus generating files is probably not for free either
<hkaiser>
yah, that's the issue - even more as we do it for each and every module
<ms[m]>
let's look at this properly after 1.5.0
<hkaiser>
yes, agreed
<ms[m]>
we should be able to figure out what went wrong with using object libraries with that as well
<ms[m]>
the output looks very useful
<K-ballo>
looping through all cache variables :|
<K-ballo>
hkaiser: how did you make that trace?
<ms[m]>
K-ballo: it seems cmake 3.18 has learned to profile cmake code
<K-ballo>
--profiling-output and --profiling-format, found them
rtohid has joined #ste||ar
nan1110 has joined #ste||ar
nan1110 has quit [Remote host closed the connection]
nan222 has joined #ste||ar
nan11 has quit [Ping timeout: 245 seconds]
nanm has joined #ste||ar
nan222 has quit [Ping timeout: 245 seconds]
<Yorlik>
hkaiser: YT?
mdiers[m]1 has left #ste||ar ["Kicked by @appservice-irc:matrix.org : Idle for 30+ days"]
<hkaiser>
Yorlik: hey
<Yorlik>
Hello!
zao[m]1 has left #ste||ar ["Kicked by @appservice-irc:matrix.org : Idle for 30+ days"]
<Yorlik>
I'd like to discuss what could be done about this messaging problem, possibly in voice if you can afford the time (could be later or another day ofc)
<hkaiser>
Yorlik: you identified the main perpetrator causing the large memory requirements for the shared state
<hkaiser>
thanks for that
<hkaiser>
in order to fix this, we need to reduce the memory required by this particular data item
<hkaiser>
this data item stores continuations attached to a future
<Yorlik>
I see a problem in the baked in possibility for continuations in every single future
<Yorlik>
At least at this size
<hkaiser>
as I said, the requirements from the HPX side are: a) attaching one continuation should not require additional allocations; b) additional continuations are very rare and may require allocations
<Yorlik>
But I guess making this generic would be a crapload of work.
<hkaiser>
also we can assume that this container never shrinks (which might simplify its implementation)
<Yorlik>
realloc ftw
<hkaiser>
Yorlik: I don't think that we want to remove the ability to attach continuations
<Yorlik>
Me neither.
<Yorlik>
But could they be made an opt-in thing?
<hkaiser>
so we're back to a) and b) listed above
<Yorlik>
Or opt-out
<hkaiser>
no way
<Yorlik>
So - what exactly would be the interface requirements for this vector?
<Yorlik>
Like - can use std::allocator interface?
<Yorlik>
Etc ...
<Yorlik>
Could it be a pointer type like std::unique_ptr<some_vector<T>> ?
<Yorlik>
Would that kill locality too much?
<Yorlik>
And - what could I realistically do? Not sure that my tackling such a vector would have a good outcome.
<hkaiser>
Yorlik: using a unique_ptr would require an allocation even for one element
<Yorlik>
even an empty one?
<hkaiser>
it could be something like a variant<callback_type, std::vector<callback_type>>
<Yorlik>
I'm thinking of a system which would use indirection with pointers, but with custom allocators putting it all together, preferably in the same cache line.
<hkaiser>
no, an empty continuation wouldn't require allocation
gdaiss[m] has left #ste||ar ["Kicked by @appservice-irc:matrix.org : Idle for 30+ days"]
<hkaiser>
Yorlik: let's not optimize before we even know how to do things
<Yorlik>
:D
<Yorlik>
However - if you do something I'll be glad to test
<hkaiser>
we've been using a different scheme before, we could try to get back to that
<Yorlik>
The current system creates a lot of bloat for me.
<hkaiser>
sure, as said, nobody has cared for memory consumption so far
* Yorlik
joins the Guineapig Union
<Yorlik>
BTW: Is there a way in C++ to evict a piece of data from the cache, to keep other data from being evicted?
<hkaiser>
no
<Yorlik>
Damned
<Yorlik>
After every object update I could explicitly evict the entity that was updated
<Yorlik>
I know there are machine instructions / intrinsics which can bypass the cache, but that's for very special cases only, I think
<hkaiser>
Yorlik: you have no idea what data you want to cache or not - not without measurements
<hkaiser>
so don't even think about it
<Yorlik>
However - message memory consumption must go down a lot.
<Yorlik>
264 bytes is crazy
<hkaiser>
how large is your message?
<Yorlik>
The default minimal messages are 32 bytes
<Yorlik>
The variant of them 40
<hkaiser>
what's the return type stored in the future?
<Yorlik>
void
<hkaiser>
ok
<Yorlik>
The shared state should have nothing from me
<hkaiser>
well, it needs things for itself
<Yorlik>
264 bytes ?
<hkaiser>
as said, half of that is from the continuation storage
* Yorlik
shudders thinking about what these structures might be hiding in their basement.
<hkaiser>
shrug - feel free to help
<Yorlik>
Thats what I wanted to discuss.
<Yorlik>
Not sure what I'd actually be capable of.
<hkaiser>
and I told you what a possible first step could be
<Yorlik>
Writing that vector?
<Yorlik>
I'd need exact specs.
<hkaiser>
experimenting on how large a variant<unique_function<>, vector<unique_function>> would be as a start
<hkaiser>
we can hand-roll a slimmer but equivalent version of that without pulling in variant
<Yorlik>
What is a unique_function? hpx or std:: type?
<hkaiser>
(variant itself is not the most compile-time friendly)
<Yorlik>
So - are these things like std::function?
<hkaiser>
compare that to the same when using std::function<void()>
<K-ballo>
hpx::unique_function is analogous to std::any_invocable
<Yorlik>
Seems like right now you are not the Cheshire Cat but the White Rabbit himself ;)
<hkaiser>
Yorlik: hpx function and unique_function have a default size of 5 * sizeof(void*), 3 of those are reserved for the internal small-object optimization storage
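(The size experiment hkaiser suggests, sketched with std::function<void()> as a stand-in since hpx::util::unique_function is not in scope here; exact numbers vary by implementation.)

```cpp
#include <cstdio>
#include <functional>
#include <variant>
#include <vector>

// Compare the footprint of a single callback, a vector of callbacks, and
// the proposed variant of the two.
using callback = std::function<void()>;

int main()
{
    std::printf("callback:                %zu\n", sizeof(callback));
    std::printf("vector<callback>:        %zu\n", sizeof(std::vector<callback>));
    std::printf("variant<cb, vector<cb>>: %zu\n",
        sizeof(std::variant<callback, std::vector<callback>>));
}
```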
<Yorlik>
Oh man ...
<Yorlik>
So many bytes
<Yorlik>
That system was really optimized for something else.
<K-ballo>
how do we hold the callbacks? a vector?
<hkaiser>
K-ballo: currently boost::small_vector<..., 3>
<hkaiser>
a bit of overkill
<K-ballo>
indeed
<hkaiser>
small_vector<..., 1> would be sufficient, I think
<K-ballo>
I remember the old approach, composed callables, not very nice either
<hkaiser>
indeed
<K-ballo>
was there an intrusive linked list at some point?
<hkaiser>
small_vector itself has quite some footprint (as it turns out)
<K-ballo>
intrusive *singly* linked list
<hkaiser>
K-ballo: don't remember, frankly - all I know is that adding a single continuation should not trigger allocation
<Yorlik>
std::unique_ptr<T>
<K-ballo>
unique_ptr requires allocation
<hkaiser>
Yorlik: requires allocation
<K-ballo>
I was thinking of a singly linked list with an embedded head node
<Yorlik>
An empty one?
<K-ballo>
when you add a continuation it is no longer empty
<hkaiser>
Yorlik: it would require an allocation for even one element
<Yorlik>
I mean a default-constructed unique_ptr is just a pointer, isn't it?
<hkaiser>
yes
<K-ballo>
" adding a single continuation should not trigger allocation"
<Yorlik>
So why not allow that?
<K-ballo>
quoted from a few lines above
<hkaiser>
K-ballo: an embedded head node sounds doable, but again requires an additional pointer - Yorlik will not like that
<K-ballo>
I doubt he'll get anything better
<hkaiser>
indeed
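(Roughly what K-ballo describes, as a hypothetical sketch; the `next` pointer in the embedded head is the extra word hkaiser objects to. Cleanup and synchronization are omitted.)

```cpp
// Intrusive singly linked list with an embedded head node: the first
// continuation occupies storage inside the shared state itself, so only
// further continuations allocate nodes.
struct continuation_node
{
    void (*run)(void*) = nullptr;       // stand-in for the stored callable
    void* data = nullptr;
    continuation_node* next = nullptr;
};

struct continuation_list
{
    continuation_node head;             // embedded: first continuation is free

    void add(continuation_node cb)
    {
        cb.next = nullptr;
        if (head.run == nullptr)
        {
            head = cb;                  // first continuation: no allocation
            return;
        }
        auto* n = new continuation_node(cb);   // later ones allocate a node
        continuation_node* tail = &head;       // append to preserve order
        while (tail->next != nullptr)
            tail = tail->next;
        tail->next = n;
    }
};
```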
<hkaiser>
Yorlik: I can offer shaving off 80 bytes right away without too much effort
<Yorlik>
If a simple future<void> requires an allocation of 264 bytes somewhere .. seriously?
<hkaiser>
Yorlik: stop complaining
<hkaiser>
we've heard your message
<Yorlik>
I mean - this is supposed to be a function call/return, nothing more.
<Yorlik>
OK
<Yorlik>
80bytes would be a third already
<hkaiser>
right
<Yorlik>
That's ~8 MB per frame
<K-ballo>
how big is small_vector<T, 1>? sizeof(T) + 24, that's expected for a vector-like thing
<Yorlik>
Yup
<Yorlik>
The vector in my messages is why they are 32 bytes in the end
<Yorlik>
24+8
bita_ has joined #ste||ar
<hkaiser>
K-ballo: if I understand this correctly, the object member in basic_function is always pointing to the internal storage
<hkaiser>
it's used just for the purpose of knowing whether the object is empty or not
<K-ballo>
doesn't sound right.. not all callables fit in the internal storage
<hkaiser>
but then the internal storage is used to point to the allocated memory
<Yorlik>
Do you see that too, that the VS memory debugger likes to crash a lot?
<K-ballo>
that sounds wrong, it should point directly to the object
<hkaiser>
K-ballo: just trying to figure out whether we could remove that
<K-ballo>
remove the object pointer? no
<bita_>
hkaiser, I think #1178 is ready for review
<hkaiser>
bita_: ok, I'll have a look
<bita_>
thanks
<hkaiser>
K-ballo: object is used either a) to point to the internal storage or b) to point to the allocated data
<hkaiser>
couldn't we store the pointer to the allocated data in the internal storage instead?
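(The layout being debated, reduced to a toy sketch; this is not HPX's actual basic_function, just enough to show where the word goes.)

```cpp
// Two alternative layouts for a type-erased function with small-object
// optimization (SBO).
struct function_storage_a
{
    // Current scheme as read here: 'object' points either into the inline
    // buffer or at heap memory, and doubles as the empty/non-empty flag.
    void* object;        // == nullptr means empty
    void* storage[3];    // small-object optimization buffer
};

struct function_storage_b
{
    // hkaiser's proposal: drop 'object' and, when the callable doesn't fit
    // inline, keep the heap pointer inside the buffer itself; emptiness then
    // has to be encoded elsewhere (e.g. a vtable flag).
    void* storage[3];    // holds the callable or a pointer to it
};

static_assert(sizeof(function_storage_b) < sizeof(function_storage_a),
    "dropping the object pointer saves one word");
```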
<Yorlik>
hkaiser: Size is down exactly by 80 bytes from 264 to 184
<hkaiser>
as promised ;-)
<Yorlik>
Yup
<Yorlik>
Step by step cutting down memory usage and cache thrashing :)
<Yorlik>
8 MB less per frame, 80MB/sec (if we meet our goal of 100ms/frame)
<Yorlik>
currently ~650-750 ms
<Yorlik>
100k objects
<Yorlik>
hkaiser: Would setting it to 0 crash anything or just sub-optimize stuff?
<hkaiser>
Yorlik: that would be something not acceptable, I think
<Yorlik>
Would it cause crashes or just slow down the system?
<Yorlik>
I just wonder if that could be a build setting
<Yorlik>
Like a Macro
<hkaiser>
should just slow down things as even adding the first continuation would require an allocation
<Yorlik>
Is this used by HPX internally a lot?
<K-ballo>
hkaiser: I suppose we could
<K-ballo>
that's how it was initially, before we expanded the embedded storage to 3 pointers
<hkaiser>
K-ballo: yes, we changed it also to have a flag for the empty state
<hkaiser>
before we depended on the empty_vtable which didn't work well across shared libraries
<hkaiser>
Yorlik: the thing is that we certainly could create a special (minimal) shared state but then we need a mechanism to instruct async (and all the others) to use that for the returned future
<hkaiser>
Yorlik: I have no idea how that could be done without duplicating everything
<Yorlik>
I see it's a hard problem.
<K-ballo>
Yorlik: on the other hand, you could just not use async and the others, use your own
RostamLog has joined #ste||ar
<Yorlik>
And with custom new/delete a lot can be optimized
<Yorlik>
At the moment, when the system has reached its steady state after creating and initializing all objects and is just busy with its update loop, I'm spending 58.87% of the time inside the executor's operator(), which means there's still a lot of overhead (~40%) if I read the numbers correctly.
rtohid has left #ste||ar [#ste||ar]
karame_ has quit [Ping timeout: 245 seconds]
kale[m] has quit [Ping timeout: 246 seconds]
kale[m] has joined #ste||ar
rtohid has joined #ste||ar
<Yorlik>
hkaiser: Would --hpx:numa-sensitive just affect strictly NUMA domains? What about level3 cache domains?
<Yorlik>
But only in this scenario with huge tasks
<hkaiser>
well, there is a lot happening in that function - essentially all of the task stealing
<hkaiser>
Yorlik: what you're seeing is that Amdahl is whacking you over the head
<Yorlik>
I think the overhead coming from me chopping the tasks into smaller parts is a problem. I doubt it's HPX; it's probably something in my overall architecture.
tiagofg[m] has joined #ste||ar
<Yorlik>
The crazy thing is my frametimes go down when I make the tasks really large
<Yorlik>
I must have some sort of overhead I don't currently see or am not aware of
gonidelis[m] has joined #ste||ar
joe[m]1 has joined #ste||ar
jbjnr has joined #ste||ar
diehlpk_mobile[m has joined #ste||ar
carola[m]1 has joined #ste||ar
oleg[m]2 has joined #ste||ar
<Yorlik>
hkaiser: When using the auto-chunker, set to 400 µs, the percentage of time in that function goes down a lot.
<Yorlik>
But my frametimes go up from ~700ms to 1700 ms
<Yorlik>
Using one task per core I'm getting an idle rate of ~17% and a frametime of ~650 ms
<Yorlik>
That doesn't look right
<Yorlik>
Even when using healthy task lengths the overall performance goes down horribly.
<Yorlik>
Using 8 tasks per core I get an idle rate < 10%, but the frametime goes up to 720 ms
<Yorlik>
After all, each message is an async that runs on the system, and having more chunks allows them to get mixed into the gaps between the large chunks.
<Yorlik>
So utilization is better - but I don't understand why the frametimes become so much worse
<Yorlik>
Probably my overhead per chunk created is just way too high.
<hkaiser>
measure, measure, measure
<Yorlik>
Doing that
<Yorlik>
In the next Milestone I will spam perfcounters
<Yorlik>
Instrumentation is a large part in the next milestone
weilewei has joined #ste||ar
<Yorlik>
hkaiser: Which threads are running the actions? Also the workers?
<hkaiser>
depends
<Yorlik>
Like my send message action
<hkaiser>
direct actions are run on the parcelport threads, non-direct actions are run by the workers
<Yorlik>
what is direct or indirect in this context?
<hkaiser>
all actions are non-direct by default
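(For reference, HPX exposes the distinction through separate registration macros; a sketch under the assumption that HPX_PLAIN_DIRECT_ACTION is the right spelling - verify against the HPX headers.)

```cpp
#include <hpx/include/actions.hpp>

void cheap_handler() {}    // short, non-suspending work only
void normal_handler() {}

// Non-direct (the default): the action is scheduled as an HPX task and run
// by a worker thread.
HPX_PLAIN_ACTION(normal_handler, normal_handler_action)

// Direct: run immediately on the thread that decodes the parcel, skipping
// the scheduling round-trip; only suitable for short, non-blocking handlers.
HPX_PLAIN_DIRECT_ACTION(cheap_handler, cheap_handler_action)
```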
<Yorlik>
OK
<Yorlik>
That would explain why my task times are still very low even when only using one task per core
<Yorlik>
the messages make it short
<hkaiser>
well, I thought your messages are handled in chunks as well
<Yorlik>
The action is called in the chunk, but it's an async
<hkaiser>
nod
<Yorlik>
So it could run anywhere HPX puts it
<Yorlik>
So right now the scheduling loop is down to ~2%
<Yorlik>
I removed the limit on task creation
<Yorlik>
The only limit in place is the amount of Lua States in use
<Yorlik>
I think it allows better scheduling for the small messages