hkaiser changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/ | GSoC: https://github.com/STEllAR-GROUP/hpx/wiki/Google-Summer-of-Code-%28GSoC%29-2020
hkaiser has joined #ste||ar
RostamLog has joined #ste||ar
RostamLog has joined #ste||ar
nan11 has quit [Remote host closed the connection]
RostamLog has joined #ste||ar
sayefsakin has quit [Ping timeout: 260 seconds]
hkaiser has quit [Quit: bye]
sayefsakin has joined #ste||ar
nikunj97 has joined #ste||ar
Nikunj__ has joined #ste||ar
nikunj97 has quit [Ping timeout: 256 seconds]
bita_ has quit [Ping timeout: 260 seconds]
Nikunj__ is now known as nikunj97
<nikunj97> essentially what does the take init mode mean
diehlpk_work has joined #ste||ar
<nikunj97> ms[m], never mind, found it
diehlpk_work_ has quit [Ping timeout: 260 seconds]
kale[m] has joined #ste||ar
sayefsakin has quit [Quit: Leaving]
kale[m] has quit [Ping timeout: 264 seconds]
kale[m] has joined #ste||ar
<gonidelis[m]> https://github.com/STEllAR-GROUP/hpx/pull/4745/checks?check_run_id=773505338 Any ideas why the tests.unit.component test is failing ?
kale[m] has quit [Ping timeout: 260 seconds]
kale[m] has joined #ste||ar
<gonidelis[m]> Can't really get what `disable_sized_sentinel_for` does...
<gonidelis[m]> `The variable template disable_sized_sentinel provides a mechanism for iterators and sentinels that can be subtracted but do not meet the semantic requirements of SizedSentinel to opt out of the concept by specializing the variable template to have the value true. `
<K-ballo> it disables the type as a sized sentinel, it makes the concept check fail
Yorlik has joined #ste||ar
<gonidelis[m]> K-ballo: why disable the type though?
<K-ballo> because it doesn't model the semantic requirements of sized sentinel
<gonidelis[m]> So if `remove_cv_t` can be applied to S and I then `!disable_sized_sentinel_for` is true, otherwise it fails?
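A minimal sketch of the opt-out mechanism being discussed here, assuming C++20 and hypothetical iterator/sentinel types; specializing the variable template makes std::sized_sentinel_for fail for the pair even though a difference operator may exist:
```cpp
#include <iterator>

// hypothetical types: the sentinel can be subtracted from the iterator,
// but not in constant time, so it must not model sized_sentinel_for
struct my_iter {};
struct my_sentinel {};

// opting out: this specialization makes the concept check
// std::sized_sentinel_for<my_sentinel, my_iter> evaluate to false
template <>
inline constexpr bool std::disable_sized_sentinel_for<my_sentinel, my_iter> = true;
```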
nikunj has quit [Read error: Connection reset by peer]
nikunj has joined #ste||ar
nikunj has quit [Ping timeout: 256 seconds]
nikunj has joined #ste||ar
<gonidelis[m]> Do you think that's a proper test ? https://gist.github.com/gonidelis/e0ee0756ca6abc70acfb51880b7f2c38
<gonidelis[m]> hkaiser: ^^
hkaiser has joined #ste||ar
<Yorlik> hkaiser: yt?
<hkaiser> Yorlik: here
<Yorlik> Hello!
<hkaiser> hey there
<Yorlik> I'm running into an allocation issue
<Yorlik> I see the data created alongside the messages I create having a size of 264 bytes.
<Yorlik> It results in ~26 MB of data every frame, a lot of it obviously bloat.
<Yorlik> That is ~3x the size of the messages I'm sending.
<Yorlik> Is it normal that promises are that large?
<Yorlik> From the memory debugger (calculating a difference between 2 frames, 100k messages per frame):
<Yorlik> sim.exe!hpx::lcos::detail::promise_data_allocator<void,std::allocator<int> > count: 98.763 size: 26.073.432
<hkaiser> promises?
<Yorlik> The allocator
<hkaiser> what allocator?
<Yorlik> It's the type shown above
<hkaiser> sec
<Yorlik> K - back in a sec
<hkaiser> it's a promise created for an action, right?
<Yorlik> Yes
<Yorlik> I sort of deduced that here
<hkaiser> I have never checked how much memory it occupies
<Yorlik> The count is pretty close to the message count
<Yorlik> And the numbers are slightly changing, because it's a delayed message (1 ms)
<Yorlik> However this type must be associated with the messages sent
<hkaiser> well this is the shared_state created for the future that is returned by the async(action(), ...)
<Yorlik> The messages are not returning anything except void
<Yorlik> and exceptions
<Yorlik> I wonder if there is space reserved for exceptions
<hkaiser> yes, it is - but that's in a union with the actual data
<Yorlik> How many bytes?
<hkaiser> no idea
<hkaiser> probably less than your message
<Yorlik> The size of the type above is 264 bytes - that's a lot
<Yorlik> Maybe I'm misusing something - totally possible.
<hkaiser> sounds a lot, I don't know why that is that large
<hkaiser> we have never particularly cared about memory requirements
<Yorlik> I have to - sending many many messages
<hkaiser> look at the shared state in a debugger to see what members it has
<Yorlik> OK
<gonidelis[m]> hkaiser: you there?
<hkaiser> gonidelis[m]: here
<gonidelis[m]> I could use some advice with disable_sized_sentinel_for
<gonidelis[m]> I think I need to create it myself, right?
<hkaiser> c++20 will have std::disable_sized_sentinel_for
<gonidelis[m]> But since we don't use c++20 yet, we have to create it from scratch, right?
<hkaiser> we should implement our own hpx::traits::disable_sized_sentinel_for which defaults to 'false' for C++ < 20 and defaults to the std version otherwise
<hkaiser> correction: it should default to std::disable_sized_sentinel_for if available
<hkaiser> otherwise it should default to false
<gonidelis[m]> ok so this `hpx::traits::disable_sized_sentinel_for` . Should I write the code inside the `is_sentinel_for.hpp` file?
<hkaiser> yes
<hkaiser> gonidelis[m]: something along the lines of https://gist.github.com/hkaiser/babf1a39325a339b15420267408d727d
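A minimal sketch (not the contents of hkaiser's gist) of the trait described above, assuming C++17 inline variable templates and the __cpp_lib_ranges feature-test macro as the availability check:
```cpp
#include <iterator>

namespace hpx { namespace traits {

#if defined(__cpp_lib_ranges)
    // C++20: forward to the standard customization point
    template <typename Sent, typename Iter>
    inline constexpr bool disable_sized_sentinel_for =
        std::disable_sized_sentinel_for<Sent, Iter>;
#else
    // pre-C++20: nothing is ever opted out
    template <typename Sent, typename Iter>
    inline constexpr bool disable_sized_sentinel_for = false;
#endif
}}    // namespace hpx::traits
```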
<Yorlik> hkaiser: sizeof(fut.shared_state_.px->on_completed_) = 144 (digging deeper)
<hkaiser> Yorlik: ok, that's your continuation you attached to the future
<Yorlik> The action?
<hkaiser> no
<hkaiser> do you attach continuations (using .then())?
<Yorlik> The future is created like this: auto fut = hpx::async<gameobject::send_message_action<M>>( recipient, std::move( msg ) );
<Yorlik> recipient is an id_type
<hkaiser> what do you do with the future 'fut' afterwards?
<Yorlik> I store it and check for exceptions after some time. If it is ready it's discarded
<Yorlik> Like this: sender.echo_list.push_back( std::move( fut ) );
<Yorlik> the echo_list is checked
<hkaiser> ahh, I see now
<Yorlik> Bad?
<hkaiser> this is a boost::container::small_vector<util::unique_function<>, 3>
<Yorlik> Yep
<hkaiser> unique_function has at least 3 * sizeof(void*), so this small_vector has at least 9 * sizeof(void*)
<Yorlik> BloatyMcBloatface?
<hkaiser> no, perf-optimizations for up to 3 continuations
<hkaiser> probably a bit over the top ;-)
<Yorlik> And if I don't use these continuations I'm hosed?
<Yorlik> Time for another overload?
<hkaiser> not hosed, just wasting memory
<Yorlik> Which is a problem in a messaging system.
<hkaiser> that shared state is used everywhere
<Yorlik> If I ever reach my goal of 100ms/frame that would be 260 MB / second on 100k objects/messages
<Yorlik> I need to rethink the messaging system or get some help from you here, I think.
<Yorlik> I have already reduced many dynamic allocations in my code, like messages, mailboxes, small vectors for parameters etc
<Yorlik> And I'm redirecting them to mimalloc
<hkaiser> Yorlik: I think we can safely reduce this
<Yorlik> What would you suggest?
<hkaiser> small_vector<> is not particularly memory friendly
<Yorlik> It could be a user option
<Yorlik> Like what properties the futures should have
<Yorlik> More templating, I guess
<hkaiser> futures have seldom more than one continuation attached, so we were trying to optimize the one-continuation case while still having the option of having more than one
<hkaiser> one continuation should not require an additional allocation, but more than one could
<Yorlik> With low frequency actions and long running remote actions that's a non-issue. I guess I'm introducing a different use case here.
<Yorlik> High Frequency small actions
<gonidelis[m]> hkaiser: So for my cxx20_disable_sized_sentinel_for.cpp dummy test, do you know which header I should include?
<Yorlik> Need to get some food - BRB in ~20 minutes
<hkaiser> we might need a custom container that a) does not allocate for one element, b) allows for adding more elements, and c) never shrinks
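A minimal sketch, purely illustrative and not HPX code, of a container meeting the three requirements just listed: the first element lives inline (no allocation), further elements go to a heap-backed overflow vector, and nothing ever shrinks:
```cpp
#include <optional>
#include <utility>
#include <vector>

template <typename F>
class continuation_storage
{
    std::optional<F> first_;     // inline slot: the single-continuation case is allocation-free
    std::vector<F> overflow_;    // used only from the second continuation onwards

public:
    void add(F f)
    {
        if (!first_)
            first_.emplace(std::move(f));
        else
            overflow_.push_back(std::move(f));   // the only place that allocates
    }

    template <typename... Args>
    void invoke_all(Args const&... args)
    {
        if (first_)
            (*first_)(args...);
        for (auto& f : overflow_)
            f(args...);
    }
};
```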
<hkaiser> gonidelis[m]: <type_traits>
<hkaiser> gonidelis[m]: no, <iterator>
<nikunj97> anyone faced this while compiling hpx? https://gist.github.com/NK-Nikunj/7c32e549cff6ded14af34295a7017f93
<nikunj97> this is on hpx master
<hkaiser> uhhh
<hkaiser> look at the generated preprocessed file, something is off right before that line
<nikunj97> alright
<hkaiser> I think the #if is not recognized as a preprocessor directive, but I could be wrong
<hkaiser> nikunj97: what's your cmake options?
<nikunj97> just a CMAKE_INSTALL_PREFIX
<hkaiser> what compiler?
<nikunj97> gcc 9.3
<nikunj97> can confirm it works for gcc 10.1 on another cluster that I have
<hkaiser> weilewei: just the first of those - look for get_thread_data()/set_thread_data() for guidance
<hkaiser> nikunj97: this is a strange one
<nikunj97> could it be something relating to cmake?
<hkaiser> shrug
<nikunj97> hkaiser, let me explore it a bit. will update you on my findings
<hkaiser> thanks
<weilewei> hkaiser in the first file, get_thread_data()/set_thread_data() are virtual functions though
<hkaiser> right
jaafar_ has joined #ste||ar
<hkaiser> hey jaafar_
<weilewei> hmm then there should be some inheritance of get_thread_data()/set_thread_data() defined somewhere else
<hkaiser> jaafar_: it's completely beyond me how expanding a function-style macro could be different from expanding a non-function-style macro :/
<hkaiser> weilewei: did you do a grep for get_thread_data?
<hkaiser> you will find it in several files, (almost) all of which need to be looked at
<weilewei> right, there is a lot of get_thread_data in different places
<weilewei> sure, I will add new features everywhere appropriate
jaafar has quit [Ping timeout: 265 seconds]
kale[m] has quit [Ping timeout: 258 seconds]
kale[m] has joined #ste||ar
<Yorlik> hkaiser: Back. So - is there anything I can do?
<hkaiser> Yorlik: create such a custom data structure ;-)
<Yorlik> How would I use it?
<hkaiser> that could replace the small_vector in the shared state
<hkaiser> Yorlik: for now we could replace it with std::vector<> but this would require an allocation even for the first continuation
<Yorlik> I think we need none of this but a reduced future
<Yorlik> I don't want to pay for what I don't use
<hkaiser> Yorlik: not sure how we could accommodate this request
<Yorlik> C++? ;)
<hkaiser> go ahead
<Yorlik> ping-pong ...
kale[m] has quit [Ping timeout: 258 seconds]
karame_ has joined #ste||ar
<nikunj97> ok my distributed 1d stencil is both seg faulting and not scaling :/
kale[m] has joined #ste||ar
nan11 has joined #ste||ar
<hkaiser> ms[m]: yt?
<K-ballo> ms[m]: moar conflicts?
<ms[m]> K-ballo and hkaiser yes
<hkaiser> ms[m]: I created #4758
<ms[m]> sorry K-ballo, that should be the last module renaming pr for this release
<hkaiser> interesting insights
<ms[m]> hkaiser: the cmake profiling results look very interesting
<hkaiser> yes
<ms[m]> yeah, I bet looping through all our cache variables isn't the most efficient
<ms[m]> plus generating files is probably not for free either
<hkaiser> yah, that's the issue - even more as we do it for each and every module
<ms[m]> let's look at this properly after 1.5.0
<hkaiser> yes, agreed
<ms[m]> we should be able to figure out what went wrong with using object libraries with that as well
<ms[m]> the output looks very useful
<K-ballo> looping through all cache variables :|
<K-ballo> hkaiser: how did you make that trace?
<ms[m]> K-ballo: it seems cmake 3.18 has learned to profile cmake code
<K-ballo> --profiling-output and --profiling-format, found them
rtohid has joined #ste||ar
nan1110 has joined #ste||ar
nan1110 has quit [Remote host closed the connection]
nan222 has joined #ste||ar
nan11 has quit [Ping timeout: 245 seconds]
nanm has joined #ste||ar
nan222 has quit [Ping timeout: 245 seconds]
<Yorlik> hkaiser: YT?
mdiers[m]1 has left #ste||ar ["Kicked by @appservice-irc:matrix.org : Idle for 30+ days"]
<hkaiser> Yorlik: hey
<Yorlik> Hello!
zao[m]1 has left #ste||ar ["Kicked by @appservice-irc:matrix.org : Idle for 30+ days"]
<Yorlik> I'd like to discuss what could be done with this messaging problem, possibly in voice if you can afford the time (could be later or another day ofc)
<hkaiser> Yorlik: you identified the main perpetrator causing the large memory requirements for the shared state
<hkaiser> thanks for that
<hkaiser> in order to fix this, we need to reduce the memory required by this particular data item
<hkaiser> this data item stores continuations attached to a future
<Yorlik> I see a problem in the baked in possibility for continuations in every single future
<Yorlik> At least at this size
<hkaiser> as I said, the requirements from HPX side are: a) attaching one continuation should not require additional allocations, and b) additional continuations are very seldom and could require allocations
<Yorlik> But I guess making this generic would be a crapload of work.
<hkaiser> also we can assume that this container does not shrink (which might simplify its implementation)
<Yorlik> realloc ftw
<hkaiser> Yorlik: I don't think that we want to remove the ability to attach continuations
<Yorlik> Me neither.
<Yorlik> But could they be made an opt-in thing?
<hkaiser> so we're back to a) and b) listed above
<Yorlik> Or opt-out
<hkaiser> no way
<Yorlik> So - what exactly would be the interface requirements for this vector?
<Yorlik> Like - can use std::allocator interface?
<Yorlik> Etc ...
<Yorlik> Could it be a pointer type like std::unique_ptr<some_vector<T>> ?
<Yorlik> Would that kill locality too much?
<Yorlik> And - what could I realistically do? Not sure if my tackling such a vector would have a good outcome.
<hkaiser> Yorlik: using a unique_ptr would require an allocation even for one element
<Yorlik> even an empty one?
<hkaiser> it could be something like a variant<callback_type, std::vector<callback_type>>
<Yorlik> I'm thinking of a system which would use indirection with pointers, but with custom allocators putting it all together, preferably in the same cache line.
<hkaiser> no, an empty continuation wouldn't require allocation
gdaiss[m] has left #ste||ar ["Kicked by @appservice-irc:matrix.org : Idle for 30+ days"]
<hkaiser> Yorlik: let's not optimize before we even know how to do things
<Yorlik> :D
<Yorlik> However - if you do something I'll be glad to test
<hkaiser> we've been using a different scheme before, we could try to get back to that
<Yorlik> The current system creates a lot of bloat for me.
<hkaiser> sure, as said, nobody has cared for memory consumption so far
* Yorlik joins the Guineapig Union
<Yorlik> BTW: Is there a way in C++ to evict a piece of data from the cache, to prevent other data from being evicted?
<hkaiser> no
<Yorlik> Damned
<Yorlik> After every object update I could explicitly evict the entity being updated
<Yorlik> I know there are machine code / intrinsics which can bypass the cache, but that's for very special cases only, I think
<hkaiser> Yorlik: you have no idea what data you want to cache or not - not without measurements
<hkaiser> so don't even think about it
<Yorlik> However - message memory consumption must go down a lot.
<Yorlik> 264 bytes is crazy
<hkaiser> how large is your message?
<Yorlik> The default minimal messages are 32 bytes
<Yorlik> The variant of them 40
<hkaiser> what's the return type stored in the future?
<Yorlik> void
<hkaiser> ok
<Yorlik> The shared state should have nothing from me
<hkaiser> well, it needs things for itself
<Yorlik> 264 bytes ?
<hkaiser> as said, half of that is from the continuation storage
* Yorlik shudders thinking about what these structures might be hiding in their basement.
<hkaiser> shrug - feel free to help
<Yorlik> Thats what I wanted to discuss.
<Yorlik> Not sure what I'd actually be capable of.
<hkaiser> and I told you what a possible first step could be
<Yorlik> Writing that vector?
<Yorlik> I'd need exact specs.
<hkaiser> experimenting on how large a variant<unique_function<>, vector<unique_function>> would be as a start
<hkaiser> we can hand-roll a slimmer but equivalent version of that without pulling in variant
<Yorlik> What is a unique_function? hpx or std:: type?
<hkaiser> (variant itself is not the most compile-time friendly)
<hkaiser> hpx::util::unique_function<...>
<hkaiser> probably hpx::util::unique_function<void(void)>
<hkaiser> need to look
<hkaiser> or even unique_function_nonser<void()>
<Yorlik> So - are these things like std::function?
<hkaiser> compare that to the same when using std::function<void()>
<K-ballo> hpx::unique_function is analogous to std::any_invocable
<Yorlik> Seems like right now you are not the Cheshire Cat but the White Rabbit himself ;)
<hkaiser> Yorlik: hpx function and unique_function have a default size of 5 * sizeof(void*), 3 of those are reserved for the internal small-object optimization storage
<Yorlik> Oh man ...
<Yorlik> So many bytes
<Yorlik> That system was really optimized for something else.
<K-ballo> how do we hold the callbacks? a vector?
<hkaiser> K-ballo: currently boost::small_vector<..., 3>
<hkaiser> a bit of overkill
<K-ballo> indeed
<hkaiser> small_vector<..., 1> would be sufficient, I think
<K-ballo> I remember the old approach, composed callables, not very nice either
<hkaiser> indeed
<K-ballo> was there an intrusive linked list at some point?
<hkaiser> small_vector itself has quite some footprint (as it turns out)
<K-ballo> intrusive *singly* linked list
<hkaiser> K-ballo: don't remember, frankly - all I know is that adding a single continuation should not trigger allocation
<Yorlik> std::unique_ptr<T>
<K-ballo> unique_ptr requires allocation
<hkaiser> Yorlik: requires allocation
<K-ballo> I was thinking a single linked list with an embedded head node
<Yorlik> An empty one?
<K-ballo> when you add a continuation it is no longer empty
<hkaiser> Yorlik: it would require an allocation as soon as there is at least one element
<Yorlik> I mean a default constructed unique_ptr is just a pointer, isn't it?
<hkaiser> yes
<K-ballo> " adding a single continuation should not trigger allocation"
<Yorlik> So why not allow that?
<K-ballo> quoted from a few lines above
<hkaiser> K-ballo: embedded head node sounds doable, but again requires an additional pointer - Yorlik will not like that
<K-ballo> I doubt he'll get anything better
<hkaiser> indeed
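A minimal sketch of the embedded-head-node idea: the first continuation is stored in the shared state itself, every additional one costs one node allocation plus the extra pointer hkaiser mentions. Illustrative only; extra continuations end up in reverse order here:
```cpp
#include <memory>
#include <optional>
#include <utility>

template <typename F>
class continuation_chain
{
    struct node
    {
        F callback;
        std::unique_ptr<node> next;
    };

    std::optional<F> head_;         // the embedded "head node": no allocation for one continuation
    std::unique_ptr<node> rest_;    // second and later continuations, one allocation each

public:
    void add(F f)
    {
        if (!head_)
        {
            head_.emplace(std::move(f));
            return;
        }
        rest_ = std::make_unique<node>(node{std::move(f), std::move(rest_)});
    }
};
```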
<hkaiser> Yorlik: I can offer shaving off 80 bytes right away without too much effort
<Yorlik> If a simple future<void> requires an allocation of 264 bytes somewhere .. seriously?
<hkaiser> Yorlik: stop complaining
<hkaiser> we've heard your message
<Yorlik> I mean - this is supposed to be a function call/return, nothing more.
<Yorlik> OK
<Yorlik> 80bytes would be a third already
<hkaiser> right
<Yorlik> That's ~8 MB per frame
<K-ballo> how big is small_vector<T, 1>? sizeof(T) +
<Yorlik> 8MB less
<Yorlik> I'll give it a shot - brb
<K-ballo> 24, that's expected for a vector-like thing
<Yorlik> Yup
<Yorlik> The vector in my messages is why they are 32 bytes in the end
<Yorlik> 24+8
bita_ has joined #ste||ar
<hkaiser> K-ballo: if I understand this correctly, the object member in basic_function is always pointing to the internal storage
<hkaiser> it's used just for the purpose of knowing whether the object is empty or not
<K-ballo> doesn't sound right.. not all callables fit in the internal storage
<hkaiser> but then the internal storage is used to point to the allocated memory
<Yorlik> Do you have that too, that the VS memory debugger likes to crash a lot?
<K-ballo> that sounds wrong, it should point directly to the object
<hkaiser> K-ballo: just trying to figure out whether we could remove that
<K-ballo> remove the object pointer? no
<bita_> hkaiser, I think #1178 is ready for review
<hkaiser> bita_: ok, I'll have a look
<bita_> thanks
<hkaiser> K-ballo: object is used either a) to point to the internal storage or b) to point to the allocated data
<hkaiser> couldn't we store the pointer to the allocated data in the internal storage instead?
<Yorlik> hkaiser: Size is down exactly by 80 bytes from 264 to 184
<hkaiser> as promised ;-)
<Yorlik> Yup
<Yorlik> Step by step cutting down memory usage and cache thrashing :)
<Yorlik> 8 MB less per frame, 80MB/sec (if we meet our goal of 100ms/frame)
<Yorlik> currently ~650-750 ms
<Yorlik> 100k objects
<Yorlik> hkaiser: Would setting it to 0 crash anything or just sub-optimize stuff?
<hkaiser> Yorlik: that would be something not acceptable, I think
<Yorlik> Would it cause crashes or just slow down the system?
<Yorlik> I just wonder if that could be a build setting
<Yorlik> Like a Macro
<hkaiser> should just slow down things as even adding the first continuation would require an allocation
<Yorlik> Is this used by HPX internally a lot?
<K-ballo> hkaiser: I suppose we could
<K-ballo> that's how it was initially, before we expanded the embedded storage to 3 pointers
<hkaiser> K-ballo: yes, we changed it also to have a flag for the empty state
<hkaiser> before we depended on the empty_vtable which didn't work well across shared libraries
<hkaiser> Yorlik: the thing is that we certainly could create a special (minimal) shared state but then we need a mechanism to instruct async (and all the others) to use that for the returned future
<hkaiser> Yorlik: I have no idea how that could be done without duplicating everything
<Yorlik> I see it's a hard problem.
<K-ballo> Yorlik: on the other hand, you could just not use async and the others, use your own
RostamLog has joined #ste||ar
<Yorlik> And with custom new/delete a lot can be optimized
<Yorlik> At the moment, when the system has reached its steady state after creating and initializing all objects and is just busy with its update loop, I'm spending 58.87% of the time inside the executor's operator(), which means there's still a lot of overhead (~40%) if I read the numbers correctly.
rtohid has left #ste||ar [#ste||ar]
karame_ has quit [Ping timeout: 245 seconds]
kale[m] has quit [Ping timeout: 246 seconds]
kale[m] has joined #ste||ar
rtohid has joined #ste||ar
<Yorlik> hkaiser: Would --hpx:numa-sensitive just affect strictly NUMA domains? What about level3 cache domains?
<Yorlik> Is there any way to tweak that ?
<Yorlik> Is there a way to completely disable work stealing? My frametimes get better the longer the tasks become, and it's fastest when dividing my parallel loop into exactly core-count chunks. Work stealing doesn't make sense in this scenario and the scheduling loop suddenly eats ~14% of the CPU time exactly here: https://github.com/STEllAR-GROUP/hpx/blob/master/libs/thread_pools/include/hpx/thread_pools/scheduling_loop.hpp#L634
<Yorlik> That's what the profiler says: hpx::threads::detail::scheduling_loop<hpx::threads::policies::local_priority_queue_scheduler<std::mutex,hpx::threads::policies::lockfree_fifo,hpx::threads::policies::lockfree_fifo,hpx::threads::policies::lockfree_lifo> > 418590 (28,79 %) 241095 (16,58 %)
<Yorlik> When cutting my chunks into smaller pieces the percentage of that line of code gets much lower, but my frametimes still go up
<Yorlik> hkaiser ^^ ??
<hkaiser> Yorlik: use your own chunker that does exactly that - create as many chunks as you have cores
<Yorlik> I'm doing that right now
<Yorlik> But for some weird reason it appears as if the scheduling loop doesn't like that
<Yorlik> Currently I'm doing this: ....with( static_chunk_size( m_e_type::endindex.load( ) / ( hpx::get_num_worker_threads( ) * 1 ) ) ), // default
<Yorlik> //.with( auto_chunk_size( autochunker_target_us * 1us ) ), // > 200
<Yorlik> Woops -- copied more than I wanted
<Yorlik> Long story short: I'm giving the core count to the static chunker
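A sketch of the one-chunk-per-core approach being described, assuming the HPX 1.4-era API (hpx::parallel::for_each, execution::par.with(...), static_chunk_size); exact headers and namespaces vary between versions, and game_object/update() are stand-ins for Yorlik's types:
```cpp
#include <hpx/include/parallel_for_each.hpp>   // headers approximate, adjust per HPX version
#include <hpx/include/parallel_executors.hpp>

#include <cstddef>
#include <vector>

struct game_object { void update() { /* per-object frame work */ } };

void update_frame(std::vector<game_object>& objects)
{
    namespace ex = hpx::parallel::execution;

    // one chunk per worker thread: minimal scheduling overhead,
    // but nothing left over for the scheduler to balance or steal
    std::size_t const chunk_size =
        (objects.size() + hpx::get_num_worker_threads() - 1) /
        hpx::get_num_worker_threads();

    hpx::parallel::for_each(
        ex::par.with(ex::static_chunk_size(chunk_size)),
        objects.begin(), objects.end(),
        [](game_object& o) { o.update(); });
}
```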
<Yorlik> hkaiser: The line of code I reported above eats like 14% of the CPU time in that scenario if the profiler reports correctly
<Yorlik> 28% of the time is spent inside that function, but 16% it takes all by itself.
<Yorlik> 14 % just that single line
<Yorlik> But only under the mentioned scenario
kale[m] has quit [Ping timeout: 260 seconds]
kale[m] has joined #ste||ar
parsa| has joined #ste||ar
joe[m]1 has quit [*.net *.split]
jbjnr has quit [*.net *.split]
parsa has quit [Read error: Connection reset by peer]
parsa| is now known as parsa
tiagofg[m] has quit [*.net *.split]
gonidelis[m] has quit [*.net *.split]
tiagofg[m] has joined #ste||ar
gonidelis[m] has joined #ste||ar
joe[m]1 has joined #ste||ar
jbjnr has joined #ste||ar
nikunj has quit [Ping timeout: 256 seconds]
nikunj has joined #ste||ar
joe[m]1 has quit [*.net *.split]
jbjnr has quit [*.net *.split]
tiagofg[m] has quit [*.net *.split]
gonidelis[m] has quit [*.net *.split]
rtohid has left #ste||ar [#ste||ar]
neill[m] has quit [*.net *.split]
diehlpk_mobile[m has quit [*.net *.split]
nikunj has quit [*.net *.split]
carola[m]1 has quit [*.net *.split]
richard[m]1 has quit [*.net *.split]
rori has quit [*.net *.split]
tarzeau has quit [*.net *.split]
ralph[m] has quit [*.net *.split]
nanm has quit [*.net *.split]
weilewei has quit [*.net *.split]
gretax[m] has quit [*.net *.split]
noise[m] has quit [*.net *.split]
smith[m] has quit [*.net *.split]
jaafar_ has quit [*.net *.split]
hkaiser has quit [*.net *.split]
Yorlik has quit [*.net *.split]
oleg[m]2 has quit [*.net *.split]
kordejong has quit [*.net *.split]
heller1 has quit [*.net *.split]
ms[m] has quit [*.net *.split]
diehlpk_work has quit [*.net *.split]
K-ballo has quit [*.net *.split]
wash[m] has quit [*.net *.split]
Vir has quit [*.net *.split]
parsa has quit [*.net *.split]
bita_ has quit [*.net *.split]
Guest21318 has quit [*.net *.split]
kale[m] has quit [*.net *.split]
Amy1 has quit [*.net *.split]
bobakk3r has quit [*.net *.split]
oleg[m]2 has joined #ste||ar
kordejong has joined #ste||ar
ms[m] has joined #ste||ar
heller1 has joined #ste||ar
diehlpk_work has joined #ste||ar
K-ballo has joined #ste||ar
Yorlik has joined #ste||ar
gretax[m] has joined #ste||ar
noise[m] has joined #ste||ar
smith[m] has joined #ste||ar
ralph[m] has joined #ste||ar
diehlpk_mobile[m has joined #ste||ar
Amy1 has joined #ste||ar
kale[m] has joined #ste||ar
carola[m]1 has joined #ste||ar
rori has joined #ste||ar
nikunj has joined #ste||ar
tarzeau has joined #ste||ar
parsa has joined #ste||ar
Guest21318 has joined #ste||ar
bita_ has joined #ste||ar
wash[m] has joined #ste||ar
Vir has joined #ste||ar
nanm has joined #ste||ar
jaafar_ has joined #ste||ar
hkaiser has joined #ste||ar
carola[m]1 has quit [Ping timeout: 240 seconds]
diehlpk_mobile[m has quit [Ping timeout: 256 seconds]
oleg[m]2 has quit [Ping timeout: 244 seconds]
bobakk3r has joined #ste||ar
richard[m]1 has joined #ste||ar
<Yorlik> hkaiser: How can I fix this: https://i.imgur.com/gkfQpYh.png ?
<hkaiser> you can't
<Yorlik> Is it an artifact or for real?
<Yorlik> I mean - it looks crazy
<hkaiser> what's your idle-rate?
<Yorlik> Usually below 20 or even 10 %
<Yorlik> Lemme run with the current settings
<hkaiser> there you go
<hkaiser> in that figure it's 17.98%
<Yorlik> So that loop reflects the idle rate?
<hkaiser> I think so, yes
<hkaiser> what function is that?
<hkaiser> (can't see it)
<hkaiser> is it wait_and_add_new?
<Yorlik> I posted a github link a bit above
<hkaiser> so it's get_next_thread?
<Yorlik> This line specifically
neill[m] has joined #ste||ar
<Yorlik> Took like 14 %
<Yorlik> But only in this scenario with huge tasks
<hkaiser> well, there is a lot happening in that function - essentially all of the task stealing
<hkaiser> Yorlik: what you're seeing is that Amdahl is wacking you over your head
<Yorlik> I think the overhead coming with me chopping the tasks in smaller parts is a problem. I doubt it's hpx, but something in my overall architecture.
tiagofg[m] has joined #ste||ar
<Yorlik> The crazy thing is my frametimes go down when I make the tasks really large
<Yorlik> I must have some sort of overhead I don't currently see or am not aware of
gonidelis[m] has joined #ste||ar
joe[m]1 has joined #ste||ar
jbjnr has joined #ste||ar
diehlpk_mobile[m has joined #ste||ar
carola[m]1 has joined #ste||ar
oleg[m]2 has joined #ste||ar
<Yorlik> hkaiser: When using the autochunker, setting it to 400 µs the percentage of that function goes down a lot.
<Yorlik> But my frametimes go up from ~700ms to 1700 ms
<Yorlik> Using one task per core I'm getting an idle rate of ~17% and a frametime of ~650 ms
<Yorlik> That doesn't look right
<Yorlik> Even when using healthy task lengths the overall performance goes down horribly.
<Yorlik> Using 8 tasks per core I get an idle rate < 10%, but the frametime goes up to 720 ms
<Yorlik> After all, each message is an async that runs on the system, and having more chunks allows them to get mixed into the gaps between the large chunks.
<Yorlik> So utilization is better - but I don't understand why the frametimes become so much worse
<Yorlik> Probably my overhead per chunk created is just way too bad.
<hkaiser> measure, measure, measure
<Yorlik> Doing that
<Yorlik> In the next Milestone I will spam perfcounters
<Yorlik> Instrumentation is a large part in the next milestone
weilewei has joined #ste||ar
<Yorlik> hkaiser: Which threads are running the actions? Also the workers?
<hkaiser> depends
<Yorlik> Like my send message action
<hkaiser> direct actions are run on the parcelport threads, non-direct actions are run by the workers
<Yorlik> what is direct or indirect in this context?
<hkaiser> all actions are non-direct by default
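A sketch of the distinction hkaiser describes, assuming HPX's plain-action macros (component actions have analogous *_DIRECT_* variants such as HPX_DEFINE_COMPONENT_DIRECT_ACTION):
```cpp
#include <hpx/include/actions.hpp>

void handle_message() { /* regular work */ }
void log_message()    { /* tiny, non-blocking work */ }

// non-direct (the default): invoking the action schedules a new HPX thread,
// so the body runs on one of the worker threads
HPX_PLAIN_ACTION(handle_message, handle_message_action)

// direct: the body is executed immediately on the thread that decodes the
// incoming parcel (a parcelport thread), bypassing the scheduler
HPX_PLAIN_DIRECT_ACTION(log_message, log_message_direct_action)
```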
<Yorlik> OK
<Yorlik> That would explain why my task times are still very low even when only using one per core
<Yorlik> the messages make it short
<hkaiser> well, I thought your messages are handled in chunks as well
<Yorlik> The action is called in the chunk, but it's an async
<hkaiser> nod
<Yorlik> So it could run anywhere HPX puts it
<Yorlik> So right now the scheduling loop is down to ~2%
<Yorlik> I removed the limit on task creation
<Yorlik> The only limit in place is the amount of Lua States in use
<Yorlik> I think it allows better scheduling for the small messages
<hkaiser> ok