<gonidelis[m]>
Can't really get what `disable_sized_sentinel_for` does...
<gonidelis[m]>
`The variable template disable_sized_sentinel_for provides a mechanism for iterators and sentinels that can be subtracted but do not meet the semantic requirements of sized_sentinel_for to opt out of the concept by specializing the variable template to have the value true.`
<K-ballo>
it disables the type as a sized sentinel, it makes the concept check fail
Yorlik has joined #ste||ar
<gonidelis[m]>
K-ballo: why disable the type though?
<K-ballo>
because it doesn't model the semantic requirements of sized sentinel
<gonidelis[m]>
So if `remove_cv_t` can be applied to S and I then `!disable_sized_sentinel_for` is true, otherwise it fails?
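(For context, this is the C++20 definition being paraphrased here: the trait is checked on the cv-stripped types, and opting out makes the whole concept false.)

```cpp
#include <concepts>
#include <iterator>
#include <type_traits>

// C++20's sized_sentinel_for, reproduced from [iterator.concept.sizedsentinel]:
template <class S, class I>
concept sized_sentinel_for =
    std::sentinel_for<S, I> &&
    !std::disable_sized_sentinel_for<std::remove_cv_t<S>, std::remove_cv_t<I>> &&
    requires(const I& i, const S& s) {
        { s - i } -> std::same_as<std::iter_difference_t<I>>;
        { i - s } -> std::same_as<std::iter_difference_t<I>>;
    };
```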
<hkaiser>
sounds like a lot, I don't know why it is that large
<hkaiser>
we have never particularly cared about memory requirements
<Yorlik>
I have to - sending many many messages
<hkaiser>
look at the shared state in a debugger to see what members it has
<Yorlik>
OK
<gonidelis[m]>
hkaiser: you there?
<hkaiser>
gonidelis[m]: here
<gonidelis[m]>
I could use some advice with disable_sized_sentinel_for
<gonidelis[m]>
I think I need to create it myself, right?
<hkaiser>
c++20 will have std::disable_sized_sentinel_for
<gonidelis[m]>
But since we don't use c++20 yet, we have to create it from scratch, right?
<hkaiser>
we should implement our own hpx::traits::disable_sized_sentinel_for which defaults to 'false' for C++ < 20 and defaults to the std version otherwise
<hkaiser>
correction: it should default to std::disable_sized_sentinel_for if available
<hkaiser>
otherwise it should default to false
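(A minimal sketch of the trait hkaiser describes; the feature-test check and layout are assumptions, not HPX's actual implementation.)

```cpp
#include <iterator>

// Sketch: defer to the std trait when available, default to false otherwise.
// Requires C++17 for inline variable templates.
namespace hpx { namespace traits {

#if defined(__cpp_lib_ranges)
    // C++20: forward to the std trait so user specializations are honored
    template <typename S, typename I>
    inline constexpr bool disable_sized_sentinel_for =
        std::disable_sized_sentinel_for<S, I>;
#else
    // pre-C++20: nothing is disabled by default; users would specialize
    // this variable template instead
    template <typename S, typename I>
    inline constexpr bool disable_sized_sentinel_for = false;
#endif
}}    // namespace hpx::traits
```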
<gonidelis[m]>
ok, so this `hpx::traits::disable_sized_sentinel_for`. Should I write the code inside the `is_sentinel_for.hpp` file?
<hkaiser>
Yorlik: ok, that's your continuation you attached to the future
<Yorlik>
The action?
<hkaiser>
no
<hkaiser>
do you attach continuations (using .then())?
<Yorlik>
The future is created like this: auto fut = hpx::async<gameobject::send_message_action<M>>( recipient, std::move( msg ) );
<Yorlik>
recipient is an id_type
<hkaiser>
what do you do with the future 'fut' afterwards?
<Yorlik>
I store it and check for exceptions after some time. if it is ready it's discarded
<Yorlik>
Like this: sender.echo_list.push_back( std::move( fut ) );
<Yorlik>
the echo_list is checked
<hkaiser>
ahh, I see now
<Yorlik>
Bad?
<hkaiser>
this is a boost::container::small_vector<util::unique_function<>, 3>
<Yorlik>
Yep
<hkaiser>
unique_function has at least 3 * sizeof(void*), so this small_vector has at least 9 * sizeof(void*)
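(Back-of-the-envelope check of that arithmetic; the element size is the lower bound quoted above, not the exact HPX layout.)

```cpp
#include <cstddef>
#include <cstdio>

int main()
{
    // One unique_function is at least 3 pointers, so 3 inline elements are
    // at least 9 pointers, before the vector's own begin/size/capacity
    // bookkeeping is added on top.
    constexpr std::size_t fn_size = 3 * sizeof(void*);   // 24 bytes on 64-bit
    constexpr std::size_t inline_elems = 3 * fn_size;    // 72 bytes
    std::printf("%zu bytes of inline element storage\n", inline_elems);
}
```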
<Yorlik>
BloatyMcBloatface?
<hkaiser>
no, perf-optimizations for up to 3 continuations
<hkaiser>
probably a bit over the top ;-)
<Yorlik>
And if I don't use these continuations I'm hosed?
<Yorlik>
Time for another overload?
<hkaiser>
not hosed, just wasting memory
<Yorlik>
Which is a problem in a messaging system.
<hkaiser>
that shared state is used everywhere
<Yorlik>
If I ever reach my goal of 100ms/frame that would be 260 MB / second on 100k objects/messages
<Yorlik>
I need to rethink the messaging system or get some help from you here, I think.
<Yorlik>
I have already reduced many dynamic allocations in my code, like messages, mailboxes, small vectors for parameters etc
<Yorlik>
And I'm redirecting them to mimalloc
<hkaiser>
Yorlik: I think we can safely reduce this
<Yorlik>
What would you suggest?
<hkaiser>
small_vector<> is not particularly memory friendly
<Yorlik>
It could be a user option
<Yorlik>
Like what properties the futures should have
<Yorlik>
More templating, I guess
<hkaiser>
futures seldom have more than one continuation attached, so we were trying to optimize the one-continuation case while still having the option of attaching more than one
<hkaiser>
one continuation should not require an additional allocation, but more than one could
<Yorlik>
With low-frequency actions and long-running remote actions that's a non-issue. I guess I'm introducing a different use case here.
<Yorlik>
High Frequency small actions
<gonidelis[m]>
hkaiser: So for my cxx20_disable_sized_sentinel_for.cpp dummy test, do you know what header I should include?
<Yorlik>
Need to get some food - BRB in ~20 minutes
<hkaiser>
we might need a custom container that a) does not allocate for one element, b) allows for adding more elements, and c) never shrinks
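(A hypothetical sketch of a container meeting those three constraints; the name and details are made up, not HPX code.)

```cpp
#include <cstddef>
#include <new>
#include <utility>
#include <vector>

// Sketch: the first element lives inline (a: no allocation for one element),
// later elements spill into a std::vector (b: can grow), and nothing ever
// gives memory back (c: never shrinks). Copy/move and synchronization omitted.
template <typename T>
class grow_only_storage
{
    alignas(T) unsigned char first_[sizeof(T)];  // inline slot for element 0
    bool has_first_ = false;
    std::vector<T> rest_;                        // elements 1..n-1

public:
    grow_only_storage() = default;
    grow_only_storage(grow_only_storage const&) = delete;
    grow_only_storage& operator=(grow_only_storage const&) = delete;

    ~grow_only_storage()
    {
        if (has_first_)
            reinterpret_cast<T*>(first_)->~T();
    }

    void push_back(T value)
    {
        if (!has_first_)
        {
            // common case: the one and only continuation, no allocation
            ::new (static_cast<void*>(first_)) T(std::move(value));
            has_first_ = true;
        }
        else
        {
            rest_.push_back(std::move(value));   // rare case: allocates
        }
    }

    std::size_t size() const { return (has_first_ ? 1 : 0) + rest_.size(); }
};
```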
<nikunj97>
can confirm it works for gcc 10.1 on another cluster that I have
<hkaiser>
weilewei: just the first of those - look for get_thread_data()/set_thread_data() for guidance
<hkaiser>
nikunj97: this is a strange one
<nikunj97>
could it be something relating to cmake?
<hkaiser>
shrug
<nikunj97>
hkaiser, let me explore it a bit. will update you on my findings
<hkaiser>
thanks
<weilewei>
hkaiser in the first file, get_thread_data()/set_thread_data() are virtual functions though
<hkaiser>
right
jaafar_ has joined #ste||ar
<hkaiser>
hey jaafar_
<weilewei>
hmm then there should be some inheritance of get_thread_data()/set_thread_data() defined somewhere else
<hkaiser>
jaafar_: it's completely beyond me how expanding a function-style macro could be different from expanding a non-function-style macro :/
<hkaiser>
weilewei: did you do a grep for get_thread_data?
<hkaiser>
you will find it in several files, almost all of which need to be looked at
<weilewei>
right, there is a lot of get_thread_data in different places
<weilewei>
sure, I will add new features everywhere appropriate
jaafar has quit [Ping timeout: 265 seconds]
kale[m] has quit [Ping timeout: 258 seconds]
kale[m] has joined #ste||ar
<Yorlik>
hkaiser: Back. So - is there anything I can do?
<hkaiser>
Yorlik: create such a custom data structure ;-)
<Yorlik>
How would I use it?
<hkaiser>
that could replace the small_vector in the shared state
<hkaiser>
Yorlik: for now we could replace it with std::vector<> but this would require an allocation even for the first continuation
<Yorlik>
I think we need none of this but a reduced future
<Yorlik>
I don't want to pay for what I don't use
<hkaiser>
Yorlik: not sure how we could accommodate this request
<Yorlik>
C++? ;)
<hkaiser>
go ahead
<Yorlik>
ping-pong ...
kale[m] has quit [Ping timeout: 258 seconds]
karame_ has joined #ste||ar
<nikunj97>
ok my distributed 1d stencil is both seg faulting and not scaling :/
kale[m] has joined #ste||ar
nan11 has joined #ste||ar
<hkaiser>
ms[m]: yt?
<K-ballo>
ms[m]: moar conflicts?
<ms[m]>
K-ballo and hkaiser yes
<hkaiser>
ms[m]: I created #4758
<ms[m]>
sorry K-ballo, that should be the last module renaming pr for this release
<hkaiser>
interesting insights
<ms[m]>
hkaiser: the cmake profiling results look very interesting
<hkaiser>
yes
<ms[m]>
yeah, I bet looping through all our cache variables isn't the most efficient
<ms[m]>
plus generating files is probably not for free either
<hkaiser>
yah, that's the issue - even more as we do it for each and every module
<ms[m]>
let's look at this properly after 1.5.0
<hkaiser>
yes, agreed
<ms[m]>
we should be able to figure out what went wrong with using object libraries with that as well
<ms[m]>
the output looks very useful
<K-ballo>
looping through all cache variables :|
<K-ballo>
hkaiser: how did you make that trace?
<ms[m]>
K-ballo: it seems cmake 3.18 has learned to profile cmake code
<K-ballo>
--profiling-output and --profiling-format, found them
rtohid has joined #ste||ar
nan1110 has joined #ste||ar
nan1110 has quit [Remote host closed the connection]
nan222 has joined #ste||ar
nan11 has quit [Ping timeout: 245 seconds]
nanm has joined #ste||ar
nan222 has quit [Ping timeout: 245 seconds]
<Yorlik>
hkaiser: YT?
mdiers[m]1 has left #ste||ar ["Kicked by @appservice-irc:matrix.org : Idle for 30+ days"]
<hkaiser>
Yorlik: hey
<Yorlik>
Hello!
zao[m]1 has left #ste||ar ["Kicked by @appservice-irc:matrix.org : Idle for 30+ days"]
<Yorlik>
I'd like to discuss what could be done about this messaging problem, possibly in voice if you can afford the time (could be later or another day ofc)
<hkaiser>
Yorlik: you identified the main perpetrator causing the large memory requirements for the shared state
<hkaiser>
thanks for that
<hkaiser>
in order to fix this, we need to reduce the memory required by this particular data item
<hkaiser>
this data item stores continuations attached to a future
<Yorlik>
I see a problem in the baked in possibility for continuations in every single future
<Yorlik>
At least at this size
<hkaiser>
as I said, the requirements from the HPX side are: a) attaching one continuation should not require additional allocations; b) additional continuations are very rare and may require allocations
<Yorlik>
But I guess making this generic would be a crapload of work.
<hkaiser>
also we can assume that this container never shrinks (which might simplify its implementation)
<Yorlik>
realloc ftw
<hkaiser>
Yorlik: I don't think that we want to remove the ability to attach continuations
<Yorlik>
Me neither.
<Yorlik>
But could they be made an opt-in thing?
<hkaiser>
so we're back to a) and b) listed above
<Yorlik>
Or opt-out
<hkaiser>
no way
<Yorlik>
So - what exactly would be the interface requirements for this vector?
<Yorlik>
Like - can use std::allocator interface?
<Yorlik>
Etc ...
<Yorlik>
Could it be a pointer type like std::unique_ptr<some_vector<T>> ?
<Yorlik>
Would that kill locality too much?
<Yorlik>
And - what could I realistically do? Not sure that my tackling such a vector would have a good outcome.
<hkaiser>
Yorlik: using a unique_ptr would require an allocation even for one element
<Yorlik>
even an empty one?
<hkaiser>
it could be something like a variant<callback_type, std::vector<callback_type>>
<Yorlik>
I'm thinking of a system which would use indirection with pointers, but with custom allocators putting it all together, preferably in the same cache line.
<hkaiser>
no, an empty continuation wouldn't require allocation
gdaiss[m] has left #ste||ar ["Kicked by @appservice-irc:matrix.org : Idle for 30+ days"]
<hkaiser>
Yorlik: let's not optimize before we even know how to do things
<Yorlik>
:D
<Yorlik>
However - if you do something I'll be glad to test
<hkaiser>
we've been using a different scheme before, we could try to get back to that
<Yorlik>
The current system creates a lot of bloat for me.
<hkaiser>
sure, as said, nobody has cared for memory consumption so far
* Yorlik
joins the Guineapig Union
<Yorlik>
BTW: Is there a way in C++ to evict a piece of data from the cache, to keep other data from being evicted?
<hkaiser>
no
<Yorlik>
Damned
<Yorlik>
After every object update I could explicitly evict the entity that was updated
<Yorlik>
I know there are machine instructions / intrinsics which can bypass the cache, but that's for very special cases only, I think
<hkaiser>
Yorlik: you have no idea what data you want to cache or not - not without measurements
<hkaiser>
so don't even think about it
<Yorlik>
However - message memory consumption must go down a lot.
<Yorlik>
264 bytes is crazy
<hkaiser>
how large is your message?
<Yorlik>
The default minimal messages are 32 bytes
<Yorlik>
The variant of them 40
<hkaiser>
what's the return type stored in the future?
<Yorlik>
void
<hkaiser>
ok
<Yorlik>
The shared state should have nothing from me
<hkaiser>
well, it needs things for itself
<Yorlik>
264 bytes ?
<hkaiser>
as said, half of that is from the continuation storage
* Yorlik
shudders thinking about what these structures might be hiding in their basement.
<hkaiser>
shrug - feel free to help
<Yorlik>
Thats what I wanted to discuss.
<Yorlik>
Not sure what I'd actually be capable of.
<hkaiser>
and I told you what a possible first step could be
<Yorlik>
Writing that vector?
<Yorlik>
I'd need exact specs.
<hkaiser>
experimenting on how large a variant<unique_function<>, vector<unique_function>> would be as a start
<hkaiser>
we can hand-roll a slimmer but equivalent version of that without pulling in variant
<Yorlik>
What is a unique_function? hpx or std:: type?
<hkaiser>
(variant itself is not the most compile-time friendly)
<Yorlik>
So - are these things like std::function?
<hkaiser>
compare that to the same when using std::function<void()>
<K-ballo>
hpx::unique_function is analogous to std::any_invocable
<Yorlik>
Seems like right now you are not the Cheshire Cat but the White Rabbit himself ;)
<hkaiser>
Yorlik: hpx function and unique_function have a default size of 5 * sizeof(void*), 3 of those are reserved for the internal small-object optimization storage
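(The size experiment hkaiser suggests, sketched with std::function<void()> as a stand-in since hpx::util::unique_function is not in scope here; exact numbers vary by implementation.)

```cpp
#include <cstdio>
#include <functional>
#include <variant>
#include <vector>

// Compare the footprint of a single callback, a vector of callbacks, and
// the proposed variant of the two.
using callback = std::function<void()>;

int main()
{
    std::printf("callback:                %zu\n", sizeof(callback));
    std::printf("vector<callback>:        %zu\n", sizeof(std::vector<callback>));
    std::printf("variant<cb, vector<cb>>: %zu\n",
        sizeof(std::variant<callback, std::vector<callback>>));
}
```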
<Yorlik>
Oh man ...
<Yorlik>
So many bytes
<Yorlik>
That system was really optimized for something else.
<K-ballo>
how do we hold the callbacks? a vector?
<hkaiser>
K-ballo: currently boost::small_vector<..., 3>
<hkaiser>
a bit of overkill
<K-ballo>
indeed
<hkaiser>
small_vector<..., 1> would be sufficient, I think
<K-ballo>
I remember the old approach, composed callables, not very nice either
<hkaiser>
indeed
<K-ballo>
was there an intrusive linked list at some point?
<hkaiser>
small_vector itself has quite some footprint (as it turns out)
<K-ballo>
intrusive *singly* linked list
<hkaiser>
K-ballo: don't remember, frankly - all I know is that adding a single continuation should not trigger allocation
<Yorlik>
std::unique_ptr<T>
<K-ballo>
unique_ptr requires allocation
<hkaiser>
Yorlik: requires allocation
<K-ballo>
I was thinking of a singly linked list with an embedded head node
<Yorlik>
An empty one?
<K-ballo>
when you add a continuation it is no longer empty
<hkaiser>
Yorlik: it would require an allocation for even one element
<Yorlik>
I mean a default-constructed unique_ptr is just a pointer, isn't it?
<hkaiser>
yes
<K-ballo>
" adding a single continuation should not trigger allocation"
<Yorlik>
So why not allow that?
<K-ballo>
quoted from a few lines above
<hkaiser>
K-ballo: an embedded head node sounds doable, but again requires an additional pointer - Yorlik will not like that
<K-ballo>
I doubt he'll get anything better
<hkaiser>
indeed
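(Roughly what K-ballo describes, as a hypothetical sketch; the `next` pointer in the embedded head is the extra word hkaiser objects to. Cleanup and synchronization are omitted.)

```cpp
// Intrusive singly linked list with an embedded head node: the first
// continuation occupies storage inside the shared state itself, so only
// further continuations allocate nodes.
struct continuation_node
{
    void (*run)(void*) = nullptr;       // stand-in for the stored callable
    void* data = nullptr;
    continuation_node* next = nullptr;
};

struct continuation_list
{
    continuation_node head;             // embedded: first continuation is free

    void add(continuation_node cb)
    {
        cb.next = nullptr;
        if (head.run == nullptr)
        {
            head = cb;                  // first continuation: no allocation
            return;
        }
        auto* n = new continuation_node(cb);   // later ones allocate a node
        continuation_node* tail = &head;       // append to preserve order
        while (tail->next != nullptr)
            tail = tail->next;
        tail->next = n;
    }
};
```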
<hkaiser>
Yorlik: I can offer shaving off 80 bytes right away without too much effort
<Yorlik>
If a simple future<void> requires an allocation of 264 bytes somewhere .. seriously?
<hkaiser>
Yorlik: stop complaining
<hkaiser>
we've heard your message
<Yorlik>
I mean - this is supposed to be a function call/return, nothing more.
<Yorlik>
OK
<Yorlik>
80bytes would be a third already
<hkaiser>
right
<Yorlik>
That's ~8 MB per frame
<K-ballo>
how big is small_vector<T, 1>? sizeof(T) + 24, that's expected for a vector-like thing
<Yorlik>
Yup
<Yorlik>
The vector in my messages is why they are 32 bytes in the end
<Yorlik>
24+8
bita_ has joined #ste||ar
<hkaiser>
K-ballo: if I understand this correctly, the object member in basic_function is always pointing to the internal storage
<hkaiser>
it's used just for the purpose of knowing whether the object is empty or not
<K-ballo>
doesn't sound right.. not all callables fit in the internal storage
<hkaiser>
but then the internal storage is used to point to the allocated memory
<Yorlik>
Do you see that too, that the VS memory debugger likes to crash a lot?
<K-ballo>
that sounds wrong, it should point directly to the object
<hkaiser>
K-ballo: just trying to figure out whether we could remove that
<K-ballo>
remove the object pointer? no
<bita_>
hkaiser, I think #1178 is ready for review
<hkaiser>
bita_: ok, I'll have a look
<bita_>
thanks
<hkaiser>
K-ballo: object is used either a) to point to the internal storage or b) to point to the allocated data
<hkaiser>
couldn't we store the pointer to the allocated data in the internal storage instead?
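(The layout being debated, reduced to a toy sketch; this is not HPX's actual basic_function, just enough to show where the word goes.)

```cpp
// Two alternative layouts for a type-erased function with small-object
// optimization (SBO).
struct function_storage_a
{
    // Current scheme as read here: 'object' points either into the inline
    // buffer or at heap memory, and doubles as the empty/non-empty flag.
    void* object;        // == nullptr means empty
    void* storage[3];    // small-object optimization buffer
};

struct function_storage_b
{
    // hkaiser's proposal: drop 'object' and, when the callable doesn't fit
    // inline, keep the heap pointer inside the buffer itself; emptiness then
    // has to be encoded elsewhere (e.g. a vtable flag).
    void* storage[3];    // holds the callable or a pointer to it
};

static_assert(sizeof(function_storage_b) < sizeof(function_storage_a),
    "dropping the object pointer saves one word");
```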
<Yorlik>
hkaiser: Size is down exactly by 80 bytes from 264 to 184
<hkaiser>
as promised ;-)
<Yorlik>
Yup
<Yorlik>
Step by step cutting down memory usage and cache thrashing :)
<Yorlik>
8 MB less per frame, 80MB/sec (if we meet our goal of 100ms/frame)
<Yorlik>
currently ~650-750 ms
<Yorlik>
100k objects
<Yorlik>
hkaiser: Would setting it to 0 crash anything or just sub-optimize stuff?
<hkaiser>
Yorlik: that would be something not acceptable, I think
<Yorlik>
Would it cause crashes or just slow down the system?
<Yorlik>
I just wonder if that could be a build setting
<Yorlik>
Like a Macro
<hkaiser>
should just slow down things as even adding the first continuation would require an allocation
<Yorlik>
Is this used by HPX internally a lot?
<K-ballo>
hkaiser: I suppose we could
<K-ballo>
that's how it was initially, before we expanded the embedded storage to 3 pointers
<hkaiser>
K-ballo: yes, we changed it also to have a flag for the empty state
<hkaiser>
before we depended on the empty_vtable which didn't work well across shared libraries
<hkaiser>
Yorlik: the thing is that we certainly could create a special (minimal) shared state but then we need a mechanism to instruct async (and all the others) to use that for the returned future
<hkaiser>
Yorlik: I have no idea how that could be done without duplicating everything
<Yorlik>
I see it's a hard problem.
<K-ballo>
Yorlik: on the other hand, you could just not use async and the others, use your own
RostamLog has joined #ste||ar
<Yorlik>
And with custom new/delete a lot can be optimized
<Yorlik>
At the moment, when the system has reached its steady state after creating and initializing all objects and is just busy with its update loop, I'm spending 58.87% of the time inside the executor's operator(), which means there's still a lot of overhead (~40%) if I read the numbers correctly.
rtohid has left #ste||ar [#ste||ar]
karame_ has quit [Ping timeout: 245 seconds]
kale[m] has quit [Ping timeout: 246 seconds]
kale[m] has joined #ste||ar
rtohid has joined #ste||ar
<Yorlik>
hkaiser: Would --hpx:numa-sensitive just affect strictly NUMA domains? What about level3 cache domains?
<Yorlik>
But only in this scenario with huge tasks
<hkaiser>
well, there is a lot happening in that function - essentially all of the task stealing
<hkaiser>
Yorlik: what you're seeing is that Amdahl is whacking you over the head
<Yorlik>
I think the overhead coming from me chopping the tasks into smaller parts is a problem. I doubt it's HPX; it's probably something in my overall architecture.
tiagofg[m] has joined #ste||ar
<Yorlik>
The crazy thing is my frametimes go down when I make the tasks really large
<Yorlik>
I must have some sort of overhead I don't currently see or am not aware of
gonidelis[m] has joined #ste||ar
joe[m]1 has joined #ste||ar
jbjnr has joined #ste||ar
diehlpk_mobile[m has joined #ste||ar
carola[m]1 has joined #ste||ar
oleg[m]2 has joined #ste||ar
<Yorlik>
hkaiser: When using the auto-chunker, set to 400 µs, the percentage of time in that function goes down a lot.
<Yorlik>
But my frametimes go up from ~700ms to 1700 ms
<Yorlik>
Using one task per core I'm getting an idle rate of ~17% and a frametime of ~650 ms
<Yorlik>
That doesn't look right
<Yorlik>
Even when using healthy task lengths the overall performance goes down horribly.
<Yorlik>
Using 8 tasks per core I get an idle rate < 10%, but the frametime goes up to 720 ms
<Yorlik>
After all, each message is an async that runs on the system, and having more chunks allows them to get mixed into the gaps between the large chunks.
<Yorlik>
So utilization is better - but I don't understand why the frametimes become so much worse
<Yorlik>
Probably my overhead per chunk created is just way too high.
<hkaiser>
measure, measure, measure
<Yorlik>
Doing that
<Yorlik>
In the next Milestone I will spam perfcounters
<Yorlik>
Instrumentation is a large part in the next milestone
weilewei has joined #ste||ar
<Yorlik>
hkaiser: Which threads are running the actions? Also the workers?
<hkaiser>
depends
<Yorlik>
Like my send message action
<hkaiser>
direct actions are run on the parcelport threads, non-direct actions are run by the workers
<Yorlik>
what is direct or indirect in this context?
<hkaiser>
all actions are non-direct by default
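(For reference, HPX exposes the distinction through separate registration macros; a sketch under the assumption that HPX_PLAIN_DIRECT_ACTION is the right spelling - verify against the HPX headers.)

```cpp
#include <hpx/include/actions.hpp>

void cheap_handler() {}    // short, non-suspending work only
void normal_handler() {}

// Non-direct (the default): the action is scheduled as an HPX task and run
// by a worker thread.
HPX_PLAIN_ACTION(normal_handler, normal_handler_action)

// Direct: run immediately on the thread that decodes the parcel, skipping
// the scheduling round-trip; only suitable for short, non-blocking handlers.
HPX_PLAIN_DIRECT_ACTION(cheap_handler, cheap_handler_action)
```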
<Yorlik>
OK
<Yorlik>
That would explain why my task times are still very low even when only using one task per core
<Yorlik>
the messages make it short
<hkaiser>
well, I thought your messages are handled in chunks as well
<Yorlik>
The action is called in the chunk, but it's an async
<hkaiser>
nod
<Yorlik>
So it could run anywhere HPX puts it
<Yorlik>
So right now the scheduling loop is down to ~2%
<Yorlik>
I removed the limit on task creation
<Yorlik>
The only limit in place is the amount of Lua States in use
<Yorlik>
I think it allows better scheduling for the small messages