<Yorlik>
Most time-consuming functions are now: luaS_new (string creation, especially strcmp inside it) with 9.3%, _free_base with 5.4%, _malloc_base with 4.8%, spinlocks with 3%, SwitchToFiber with 2.8%, VirtualQuery with 2.6%, then the loop with 2.48%, then realloc, sweeplist (Lua GC), boost::lockfree::deque::alloc_node, and more Lua stuff
<Yorlik>
So a lot is in the allocation/deallocation game
<Yorlik>
luaV_execute and its calls have 57% - that's really business logic running
<Yorlik>
But inside are these string functions ofc
kale[m] has quit [Ping timeout: 240 seconds]
<Yorlik>
mimalloc is really underused - a lot still goes through ucrtbase.dll
<Yorlik>
hkaiser: Do you see a way to use mimalloc globally for new and delete?
<Yorlik>
That header which blew up before?
<hkaiser>
well, it should get used, no?
<Yorlik>
No
<Yorlik>
It caused an exception
<hkaiser>
I thought they patched the binary to intercept all allocation calls
<Yorlik>
So - I can use mi_malloc and mi_free manually
<Yorlik>
The profiler tells another story
<hkaiser>
I need to investigate this, no idea what went wrong
<Yorlik>
OK
<Yorlik>
mimalloc has two headers
<Yorlik>
mimalloc.h - that one is uncomplicated
<Yorlik>
And then mimalloc-new-delete.h
<Yorlik>
That one explodes
<Yorlik>
It does the new and delete overrides
<Yorlik>
Not sure about global malloc
<Yorlik>
I still see the ucrtbase malloc in use
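A minimal sketch of the two integration levels being discussed, assuming a stock mimalloc install; this is not a fix for the exception mentioned above:

```cpp
#include <mimalloc.h>

// Including mimalloc-new-delete.h in exactly ONE translation unit is what
// provides the global operator new/delete overrides -- the header that
// "explodes" in the discussion above:
// #include <mimalloc-new-delete.h>

int main()
{
    // The explicit C API from mimalloc.h always works, independent of any
    // new/delete override:
    void* p = mi_malloc(64);
    mi_free(p);

    // With mimalloc-new-delete.h active, plain new/delete would also route
    // through mimalloc; without it (or an override/patched binary on Windows),
    // they keep going through ucrtbase's malloc.
    int* q = new int(42);
    delete q;
    return 0;
}
```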
<Yorlik>
So - it looks like at least 25% of CPU time is wasted on allocations and deallocations
<Yorlik>
However - it's getting very late here - time to go sleep.
<Yorlik>
G'Night.
* Yorlik
waves and fades.
kale[m] has joined #ste||ar
Yorlik has quit [Ping timeout: 246 seconds]
kale[m] has quit [Ping timeout: 256 seconds]
kale[m] has joined #ste||ar
sayefsakin has joined #ste||ar
weilewei has quit [Remote host closed the connection]
hkaiser has quit [Quit: bye]
sayef_ has joined #ste||ar
sayefsakin has quit [Ping timeout: 240 seconds]
sayefsakin has joined #ste||ar
sayef_ has quit [Read error: Connection reset by peer]
nanm has quit [Remote host closed the connection]
Yorlik has joined #ste||ar
nikunj97 has joined #ste||ar
jaafar_ is now known as jaafar
Nikunj__ has joined #ste||ar
nikunj97 has quit [Ping timeout: 260 seconds]
kale[m] has quit [Ping timeout: 265 seconds]
Nikunj__ has quit [Read error: Connection reset by peer]
Nikunj__ has joined #ste||ar
nikunj97 has joined #ste||ar
Nikunj__ has quit [Ping timeout: 260 seconds]
bita_ has quit [Ping timeout: 260 seconds]
Yorlik has quit [Read error: Connection reset by peer]
<nikunj97>
I'm getting serialization errors from the code
<zao>
Compiler/linker/runtime?
<nikunj97>
zao, compiler error
<zao>
Oh, there was a log above.
<nikunj97>
I followed the boost serialization tutorial to add the serialization function.
<zao>
nikunj97: Your serialize function is const.
<zao>
Going to be hard to deserialize into a const object.
<nikunj97>
ohh crap, yes! silly me
<zao>
(squinting at the errors and seeing `/usr/local/include/hpx/serialization/access.hpp:36:22: error: no matching function for call to ‘serialize(hpx::serialization::input_archive&, const std::vector<double>&, int)’`
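For reference, a hedged sketch of the fix zao is pointing at, assuming the modularized `hpx/serialization.hpp` header and a hypothetical `data` struct: the intrusive `serialize()` member must not be const, otherwise the input archive cannot write into the members (hence the `const std::vector<double>&` in the error above):

```cpp
#include <hpx/serialization.hpp>
#include <vector>

struct data
{
    std::vector<double> values;

    // only needed if serialize() is private
    friend class hpx::serialization::access;

    template <typename Archive>
    void serialize(Archive& ar, unsigned int /*version*/)   // non-const!
    {
        ar & values;
    }
};
```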
<nikunj97>
zao, thanks a lot
<nikunj97>
it compiles now
<zao>
\o/
sayefsakin has quit [Ping timeout: 260 seconds]
kale[m] has joined #ste||ar
sayefsakin has joined #ste||ar
jaafar has quit [Ping timeout: 260 seconds]
kale[m] has quit [Ping timeout: 246 seconds]
kale[m] has joined #ste||ar
jaafar has joined #ste||ar
jaafar has quit [Ping timeout: 260 seconds]
jaafar has joined #ste||ar
sayefsakin has quit [Read error: Connection reset by peer]
sayefsakin has joined #ste||ar
sayefsakin has quit [Ping timeout: 260 seconds]
<nikunj97>
ms[m], good news. I was able to write an optimized version of the available 1d stencil 8. I'm benchmarking now, but for the usual case you can expect a minimum of a 4x speedup (from my initial results)
<nikunj97>
also the benchmark is now scaling on distributed as well
<gonidelis[m]>
ahhh.... build-and-test hates me ;p
<gonidelis[m]>
but that does not concern the PR files. What should I do?
<zao>
Maybe make a small PR that addresses the things and get that through, continuing your work after that's in?
<zao>
Maybe look at where the change came from and how it slipped through, either by existing before the tooling was in place, or by being hack-committed to master?
<zao>
The most important thing whenever something's wrong is to know who to blame :P
<nikunj97>
zao, :D
<gonidelis[m]>
haha
<zao>
gonidelis[m]: If you rebase on current master, it's fixed already.
<ms[m]>
gonidelis: yeah, I know... but feel free to keep asking here, usually if you're confused about an error it's most likely unrelated to your changes
<gonidelis[m]>
ok great thanks a lot :D
sayefsakin has joined #ste||ar
sayefsakin has quit [Ping timeout: 260 seconds]
hkaiser has joined #ste||ar
<nikunj97>
ms[m], ok, it won't scale any further than 4 nodes :/
Yorlik has joined #ste||ar
<Yorlik>
hkaiser: YT
<Yorlik>
I made another test yesterday: Using 1 million objects/messages used to take ~30 seconds/frame and I'm down to ~5-6 seconds. Concerning scaling that is even better than with 100k objects, where I end up at ~6-7 ms per frame. I guess the numerous little small-object optimizations I did, together with mimalloc, indeed helped a lot.
<Yorlik>
However - the frametimes are still too long and I need to optimize further.
<hkaiser>
nice
<Yorlik>
hkaiser: I made the lambdas and the executor function static - but that didn't change anything
<hkaiser>
sure, why should it
<Yorlik>
no recreation of objects?
<Yorlik>
So their construction time is insignificant
K-ballo has quit [Quit: K-ballo]
<hkaiser>
Yorlik: if your lambda has no captures then it's equivalent to a function pointer anyways
<Yorlik>
Only references, no copies.
<Yorlik>
so - probably just like a mini namespace
<hkaiser>
then it has captures, but is still trivially constructible
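A small standalone illustration of the point, unrelated to the actual executor code: a capture-less lambda converts to a plain function pointer, and a by-reference capture only adds trivially constructed state:

```cpp
int main()
{
    auto stateless = [](int x) { return x * 2; };
    int (*fp)(int) = stateless;            // no captures: converts, nothing to build

    int frame = 0;
    auto by_ref = [&frame] { ++frame; };   // one reference captured, trivially cheap
    by_ref();
    return fp(frame);                      // returns 2
}
```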
<Yorlik>
Makes sense
<Yorlik>
I'm trying everything
<hkaiser>
Yorlik: measure, measure, measure
<hkaiser>
everything else is just conjecture
<Yorlik>
At some point I gave the autochunker a ridiculously high target time of ~20000µsec and saw an unexpectedly huge gain in time savings
<Yorlik>
So - trying mad things is good, as long as the system isn't really fully understood
<Yorlik>
At the moment I'm measuring the effect of the chunk count per core on the times
<Yorlik>
like: How much overhead do I get from more tasks
<Yorlik>
And if I see something interesting I'll go profile it
<Yorlik>
But that's all prelude to what's going to happen in Milestone 2
<Yorlik>
I want scripted, automated experiments and solid instrumentation / graphing
K-ballo has joined #ste||ar
kale[m] has quit [Ping timeout: 256 seconds]
kale[m] has joined #ste||ar
K-ballo has quit [Quit: K-ballo]
K-ballo has joined #ste||ar
K-ballo has quit [Client Quit]
K-ballo has joined #ste||ar
kale[m] has quit [Ping timeout: 240 seconds]
<Yorlik>
hkaiser: {what}: Couldn't add new thread to the map of threads: HPX(out_of_memory)
<Yorlik>
22/64 GB RAM left ....
<Yorlik>
hkaiser ^^
<Yorlik>
err used
<Yorlik>
42 left
<hkaiser>
no idea
<hkaiser>
if it says OOM, then I'd believe it
<Yorlik>
Maybe some internal structure with limited data?
<Yorlik>
I had this on occasion 2-3 times when running with 1000000 objects
<Yorlik>
or could this be an exception propagating from my memory management?
<Yorlik>
because I throw above 1000000
Amy1 has quit [Ping timeout: 256 seconds]
Amy1 has joined #ste||ar
rtohid has joined #ste||ar
karame_ has joined #ste||ar
weilewei has joined #ste||ar
<hkaiser>
ms[m]: yt?
<ms[m]>
hkaiser: yeah
<hkaiser>
ms[m]: nvm, I just saw #4761 (and merged it) - thanks!
<ms[m]>
thank you
<ms[m]>
and also thanks for coordinating the kokkos discussions!
<ms[m]>
it's nice that there are so many people interested
<ms[m]>
I actually made decent progress on the kokkos executor today, it's kind of working even with a few algorithms
<ms[m]>
the real work is going to be avoiding kokkos fences in their views...
<hkaiser>
yes
<hkaiser>
also the integration of continuations
<hkaiser>
but that's for another day, I think
<ms[m]>
regarding gpu continuations, I don't know of a way to avoid synchronizing via the cpu with our current futures
<ms[m]>
they're kind of set up to go via the cpu with the future being set to ready on the cpu side
<hkaiser>
in any case, thanks for working on this, it will be a game changer in terms of project visibility (at least amongst the labs)
<ms[m]>
yes, definitely for another day...
<ms[m]>
with big enough kernels it's again not a problem
<hkaiser>
ms[m]: yes, I know - but all we need to keep is the hpx::future interface class, nobody said we wouldn't be able to create special shared states that do what's needed
<ms[m]>
eventually someone will want tiny kernels and then it's the same problem we have on the cpu
<ms[m]>
right, that might be possible, I hadn't thought about that at all
<hkaiser>
the shared state would just carry the context information (the stream), allowing continuations to be submitted to the same stream
<ms[m]>
yeah, that could work
<ms[m]>
it may be overconstraining, but could work as a start
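A purely hypothetical sketch of the idea: keep the `hpx::future` interface but let a special shared state carry the device context (a stream handle) so a continuation is enqueued on the same stream instead of synchronizing via the CPU. All names below (`device_stream`, `stream_shared_state`, `then`) are illustrative, not HPX API:

```cpp
#include <utility>

struct device_stream { int id; };   // stand-in for a CUDA/HIP stream handle

template <typename T>
struct stream_shared_state
{
    device_stream stream;           // context the value is produced on

    // A real implementation would enqueue `f` on `stream`; here we only show
    // that the continuation inherits the same context.
    template <typename F>
    auto then(F&& f) -> stream_shared_state<decltype(f(std::declval<T>()))>
    {
        return { stream };          // continuation stays on the same stream
    }
};

int main()
{
    stream_shared_state<int> s{ device_stream{ 7 } };
    auto next = s.then([](int x) { return x * 2.0; });
    return next.stream.id == s.stream.id ? 0 : 1;   // 0: same stream
}
```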
<ms[m]>
hkaiser: on an unrelated topic, and sorry for bringing this up again, but what's the status of ci on rostam? we would really need that for a release and I'd be happy to help out there
<hkaiser>
ms[m]: nothing has happened there (yet) mostly because we ran out of disk space and acquired new storage (120TB) that is currently being configured and made available
<hkaiser>
once that's in place we should be able to proceed with the ci
<ms[m]>
:P
<ms[m]>
ok, well, if I can be of help I'm available
<hkaiser>
ms[m]: thanks, I'll keep you in the loop - will ask today what's the state
<ms[m]>
we might get fancier ci running on daint soonish (that harmen guy is from cscs), but we probably still won't be able to cover enough configurations with that
<hkaiser>
sure, I'm all in favor of using rostam for this
<hkaiser>
sorry for the delay
<ms[m]>
hkaiser: no worries, I'm just being impatient
<ms[m]>
I really appreciate you guys setting that up
bita_ has joined #ste||ar
nan77 has joined #ste||ar
Yorlik has quit [Ping timeout: 265 seconds]
kale[m] has joined #ste||ar
karame_ has quit [Remote host closed the connection]
<hkaiser>
nikunj97: yt?
<nikunj97>
hkaiser, here
<hkaiser>
see pm, pls
mcopik has joined #ste||ar
mcopik has quit [Client Quit]
<hkaiser>
nikunj97: yt
<nikunj97>
hkaiser, here
<hkaiser>
pm pls (again)
<nikunj97>
:D
<nikunj97>
hkaiser, ms[m]: early scaling up to 8 nodes (512 cores) is showing promising results for my optimized 1d stencil 8. Time of execution decreases from 750s on 1 node to 180s for 8 nodes. For shorter problem sizes, scaling up to 2 nodes (128 cores) can be expected.
<nikunj97>
this is with 2400 partitions, 100000 points per partition iterating for a total of 4096 iterations
jaafar has quit [Quit: Konversation terminated!]
<nikunj97>
this is wrt total time which includes the time in initialization as well. I think kernel performance will be even better
<gonidelis[m]>
hkaiser: Please tell me if there is anything else to fix in #4745, as I can't tell from the failed tests which are important and which are not
<K-ballo>
they are all important
<gonidelis[m]>
K-ballo: Earlier in the day mikel advised me that there are some fails that I shouldn't be worried about...
<K-ballo>
that's a different question
<K-ballo>
they may not all be your fault/responsibility, but they are nevertheless important
<gonidelis[m]>
ok you got me there
<gonidelis[m]>
So I am asking if there is sth that needs to be fixed in order for the particular PR to be ready
karame_ has joined #ste||ar
<hkaiser>
gonidelis[m]: we'll merge it as it is right now
<hkaiser>
might take another day or two
<hkaiser>
ms[m]: Alireza told me that the storage is in place now
weilewei has quit [Remote host closed the connection]
<hkaiser>
on rostam, that is
weilewei has joined #ste||ar
<gonidelis[m]>
hkaiser: ok great :D
jaafar has joined #ste||ar
kale[m] has quit [Ping timeout: 264 seconds]
kale[m] has joined #ste||ar
<gonidelis[m]>
hkaiser: I have been working on that feature check that we were talking about the other day and I have changed some .cmake files. Will the changes be checked if I just `make -j` in my build dir? Or do I need to `cmake` first all over again?
<K-ballo>
make will do cmake first when needed
<gonidelis[m]>
K-ballo: ok great. Thanks a lot!
<ms[m]>
hkaiser: thanks for the update, sounds good
<gonidelis[m]>
hkaiser: K-ballo What's your opinion on my disable_sized_sentinel_for implementation?
<hkaiser>
gonidelis[m]: If you created a PR I would comment on it
<hkaiser>
some minor issues, but overall looks good
<gonidelis[m]>
+ sized_sentinel_for
<K-ballo>
gonidelis[m]: feature tests for library stuff have 'std' in the name; also, your feature test always fails to compile (no inline for block-scope variables)
<K-ballo>
maybe silence the warning about unused `b` too? not sure how we usually handle those
<hkaiser>
K-ballo: we usually don't care, but a (void) b; wouldn't hurt
<hkaiser>
tests are missing
<hkaiser>
pls create a PR, I have a couple of comments
<gonidelis[m]>
Yeah. Was asking mainly for the test part. ;) Ok. I will make the changes on `distance.hpp` right now and I will make the PR right away
<gonidelis[m]>
K-ballo: why does the test always fail?
<K-ballo>
(no inline for block scope variables)
<gonidelis[m]>
hkaiser: I will add (void) b; and tests asap
<gonidelis[m]>
K-ballo: ahhh why is that? I don't really know why `inline` is needed there (maybe it is not needed after all) but I put it because that is the return type of `std::disable_sized_sentinel_for`
<gonidelis[m]>
`inline constexpr bool`
<K-ballo>
it's not the return type, it's a variable not a function
<hkaiser>
the definition of disable_sized_sentinel_for has to be inline
<K-ballo>
inline variables only exist at namespace scope, yours is at block scope... so the test always fails to compile
<gonidelis[m]>
ok got it. I will remove the `inline` keyword
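A compact illustration of the point being made, unrelated to the actual feature-test file: `inline` is only valid on namespace-scope variables, which is why a test snippet declaring one inside `main()` can never compile; the std trait itself is a namespace-scope inline variable template:

```cpp
namespace sketch {
    // fine: namespace scope, mirrors the shape of std::disable_sized_sentinel_for
    template <typename S, typename I>
    inline constexpr bool disable_sized_sentinel_for = false;
}

int main()
{
    // inline constexpr bool b = true;  // error: inline at block scope
    constexpr bool b = sketch::disable_sized_sentinel_for<int*, int*>;
    (void) b;                           // silence the unused-variable warning
    return 0;
}
```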
Yorlik has joined #ste||ar
<K-ballo>
once the test is ready copy it to the wandbox and check that it compiles as expected and fails as expected
<K-ballo>
latest gcc should have sized_sentinel_for already
<K-ballo>
what does knowing that a type is a sized sentinel give you?
<gonidelis[m]>
That I can calculate S - I in constant time (?)
<gonidelis[m]>
like a random access iterator
<gonidelis[m]>
ahh... Is it that we make a generalization?
<gonidelis[m]>
That is to say, we make that dispatching happen not only in the case of a random access iterator, but in any case where S - I can be calculated in constant time!
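A minimal sketch of that generalization (not HPX's actual `detail::distance`), assuming C++20 `std::sized_sentinel_for` is available: use `last - first` whenever the sentinel/iterator pair is sized, not only for random access iterators, and walk linearly otherwise:

```cpp
#include <iterator>

namespace sketch {

    template <typename Iter, typename Sent>
    constexpr typename std::iterator_traits<Iter>::difference_type
    distance(Iter first, Sent last)
    {
        if constexpr (std::sized_sentinel_for<Sent, Iter>)
        {
            return last - first;                      // O(1): pair is "sized"
        }
        else
        {
            typename std::iterator_traits<Iter>::difference_type d = 0;
            for (; first != last; ++first)            // O(n): walk to the sentinel
                ++d;
            return d;
        }
    }
}
```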
sayef_ has joined #ste||ar
wash[m]_ has joined #ste||ar
V|r has joined #ste||ar
<gonidelis[m]>
K-ballo: ^^
sayefsakin has quit [*.net *.split]
wash[m] has quit [*.net *.split]
Vir has quit [*.net *.split]
wash[m]_ is now known as wash[m]
<gonidelis[m]>
Any ideas on how to check my distance.hpp implementation? Is there any test for the previous implementation?
<gonidelis[m]>
hkaiser: ^^
weilewei has quit [Remote host closed the connection]
<hkaiser>
gonidelis[m]: we don't have a test for this
nan77 has quit [Remote host closed the connection]
<hkaiser>
gonidelis[m]: "Is it that we make a generalization?" that's excatly what we want to do
karame_ has quit [Remote host closed the connection]
<gonidelis[m]>
yeah great
<gonidelis[m]>
So in order to check my distance.hpp do I have to create a test of my own? Also, where do you use your custom distance? I can only find std::distance used in the algos
nan9 has joined #ste||ar
<hkaiser>
well, that's the point, you'll need to change the algos you touch to use your new distance implementation
rtohid has left #ste||ar [#ste||ar]
weilewei has joined #ste||ar
<hkaiser>
gonidelis[m]: ^^
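As a hypothetical usage example of what "change the algos you touch" could look like (`my_fill` and `sentinel_distance` are made-up names, not HPX code): the algorithm takes separate iterator and sentinel types and calls a sentinel-aware distance, whereas `std::distance` requires both arguments to be the same iterator type:

```cpp
#include <cstddef>

// stand-in for the sentinel-aware distance sketched earlier
template <typename Iter, typename Sent>
constexpr std::ptrdiff_t sentinel_distance(Iter first, Sent last)
{
    std::ptrdiff_t d = 0;
    for (; first != last; ++first)
        ++d;
    return d;
}

template <typename Iter, typename Sent, typename T>
void my_fill(Iter first, Sent last, T const& value)
{
    auto n = sentinel_distance(first, last);   // was: std::distance(first, last)
    for (std::ptrdiff_t i = 0; i != n; ++i, ++first)
        *first = value;
}
```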
<K-ballo>
gonidelis[m]: yes, you were right
kale[m] has quit [Ping timeout: 256 seconds]
kale[m] has joined #ste||ar
<gonidelis[m]>
hkaiser: Yeah right. I do get that. But what was the reason for having the previous implementation in the first place if you didn't use it?
<gonidelis[m]>
Also, am I allowed to create a test of my own (at `libs/algorithms/tests/unit`) for `distance.hpp` just to check if it works properly?
<K-ballo>
where's the previous implementation?
<K-ballo>
gonidelis[m]:
<hkaiser>
gonidelis[m]: we did use it in the reduce algorithm
<hkaiser>
gonidelis[m]: sure, pls add tests if you think it would be useful to have them
<gonidelis[m]>
hkaiser: ahh great! Any way I could have found that on my own? (through some github tool maybe)
<K-ballo>
InIterB, InIterE mmmmh
<hkaiser>
grep for util::distance?
<hkaiser>
K-ballo: yah, naming is always hard
<hkaiser>
gonidelis[m]: or detail::distance for that matter
<gonidelis[m]>
what's the problem with the naming??
<hkaiser>
gonidelis[m]: it's hard ;-)
<gonidelis[m]>
the naming or the problem ? ;p
<hkaiser>
'InIterE' might be better off being 'Sentinel' or something similar
<gonidelis[m]>
Ok... we could discuss that as soon as I post the PR
<gonidelis[m]>
hkaiser: Apologies for not having posted it yet, but I'm encountering minor problems with the dispatching. I'm sure I'll figure it out.
<hkaiser>
sure, no worries
<gonidelis[m]>
As for that grep thing. Should I just `cd hpx_sized_sentiel_for/libs/algorithms/include/hpx/parallel/algorithms` and then ` grep "util::distance"
<gonidelis[m]>
`? Or is there another way around? (it seems to take some time)
<gonidelis[m]>
hkaiser: ^^
<gonidelis[m]>
`detail::distance` ^^
<hkaiser>
well grep -R does it recursively, I believe