hkaiser changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/ | GSoC: https://github.com/STEllAR-GROUP/hpx/wiki/Google-Summer-of-Code-%28GSoC%29-2020
<Yorlik> Most time-consuming functions are now: luaS_new (string creation, especially strcmp inside it) with 9.3%, _free_base with 5.4%, _malloc_base with 4.8%, spinlocks with 3%, SwitchToFiber with 2.8%, VirtualQuery with 2.6%, then the loop with 2.48%, then realloc, sweeplist (Lua GC), boost::lockfree::deque::alloc_node, and more Lua stuff
<Yorlik> So a lot is in the allocation/deallocation game
<Yorlik> luaV_execute and its calls have 57% - that's really business logic running
<Yorlik> But inside are these string functions ofc
kale[m] has quit [Ping timeout: 240 seconds]
<Yorlik> mimalloc is really underused - a lot still goes through ucrtbase.dll
<Yorlik> hkaiser: Do you see a way to use mimalloc globally for new and delete?
<Yorlik> That header which busted before?
<hkaiser> well, it should get used, no?
<Yorlik> No
<Yorlik> It caused an exception
<hkaiser> I thought they patched the binary to intercept all allocation calls
<Yorlik> So - I can use mi_malloc and mi_free manually
<Yorlik> The profiler tells another story
<hkaiser> I need to investigate this, no idea what went wrong
<Yorlik> OK
<Yorlik> mimalloc has two headers
<Yorlik> mimalloc.h - that one is uncomplicated
<Yorlik> And then mimalloc-new-delete.h
<Yorlik> That one explodes
<Yorlik> It does the new and delete overrides
<Yorlik> Not sure about global malloc
<Yorlik> I still see the ucrtbase malloc in use
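For reference, a minimal sketch (assuming mimalloc's documented usage, not Yorlik's actual build) of how the two headers are meant to be combined: mimalloc.h exposes the explicit mi_* API, while mimalloc-new-delete.h, included in exactly one translation unit, overrides the global operator new/delete:
```cpp
// Sketch only: follows mimalloc's documented usage, not the project's code.
#include <mimalloc.h>            // explicit mi_malloc / mi_free API
#include <mimalloc-new-delete.h> // overrides global operator new/delete
                                 // (include in exactly one translation unit)
#include <cstdio>

int main()
{
    void* p = mi_malloc(64);     // always goes through mimalloc
    mi_free(p);

    int* q = new int(42);        // should now also route through mimalloc
    std::printf("%d\n", *q);
    delete q;
    return 0;
}
```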
<Yorlik> So - it looks like at least 25% of CPU time is wasted on allocations and deallocations
<Yorlik> However - it's getting very late here - time to go sleep.
<Yorlik> G'Night.
* Yorlik waves and fades.
kale[m] has joined #ste||ar
Yorlik has quit [Ping timeout: 246 seconds]
kale[m] has quit [Ping timeout: 256 seconds]
kale[m] has joined #ste||ar
sayefsakin has joined #ste||ar
weilewei has quit [Remote host closed the connection]
hkaiser has quit [Quit: bye]
sayef_ has joined #ste||ar
sayefsakin has quit [Ping timeout: 240 seconds]
sayefsakin has joined #ste||ar
sayef_ has quit [Read error: Connection reset by peer]
nanm has quit [Remote host closed the connection]
Yorlik has joined #ste||ar
nikunj97 has joined #ste||ar
jaafar_ is now known as jaafar
Nikunj__ has joined #ste||ar
nikunj97 has quit [Ping timeout: 260 seconds]
kale[m] has quit [Ping timeout: 265 seconds]
Nikunj__ has quit [Read error: Connection reset by peer]
Nikunj__ has joined #ste||ar
nikunj97 has joined #ste||ar
Nikunj__ has quit [Ping timeout: 260 seconds]
bita_ has quit [Ping timeout: 260 seconds]
Yorlik has quit [Read error: Connection reset by peer]
<nikunj97> I'm getting serialization errors from the code
<zao> Compiler/linker/runtime?
<nikunj97> zao, compiler error
<zao> Oh, there was a log above.
<nikunj97> I followed the boost serialization tutorial to add the serialization function.
<zao> nikunj97: Your serialize function is const.
<zao> Going to be hard to deserialize into a const object.
<nikunj97> ohh crap, yes! silly me
<zao> (squinting at the errors and seeing `/usr/local/include/hpx/serialization/access.hpp:36:22: error: no matching function for call to ‘serialize(hpx::serialization::input_archive&, const std::vector<double>&, int)’`)
<nikunj97> zao, thanks a lot
<nikunj97> it compiles now
<zao> \o/
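A minimal sketch of the fix zao is pointing at, assuming the usual HPX serialization pattern; the type and member names here are hypothetical, not nikunj97's actual code. The serialize member must not be const, because deserialization writes into the object:
```cpp
// Illustrative only: hypothetical type, header path assumed for recent HPX.
#include <hpx/serialization.hpp>
#include <vector>

struct partition_data
{
    std::vector<double> values;

private:
    // let HPX's serialization machinery call the private member
    friend class hpx::serialization::access;

    template <typename Archive>
    void serialize(Archive& ar, unsigned int /*version*/)   // note: not const
    {
        ar & values;
    }
};
```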
sayefsakin has quit [Ping timeout: 260 seconds]
kale[m] has joined #ste||ar
sayefsakin has joined #ste||ar
jaafar has quit [Ping timeout: 260 seconds]
kale[m] has quit [Ping timeout: 246 seconds]
kale[m] has joined #ste||ar
jaafar has joined #ste||ar
jaafar has quit [Ping timeout: 260 seconds]
jaafar has joined #ste||ar
sayefsakin has quit [Read error: Connection reset by peer]
sayefsakin has joined #ste||ar
sayefsakin has quit [Ping timeout: 260 seconds]
<nikunj97> ms[m], good news. I was able to write an optimized version of the available 1d_stencil_8. I'm benchmarking now, but for the usual case you can expect a minimum of 4x speed-up (from my initial results)
<nikunj97> also the benchmark is now scaling on distributed as well
<gonidelis[m]> ahhh.... build-and-test hates me ;p
<gonidelis[m]> but that does not concern the PR files. What should I do?
<zao> Maybe make a small PR that addresses the things and get that through, continuing your work after that's in?
<zao> Maybe look at where the change came from and how it slipped through, either by existing before the tooling was in place, or by being hack-committed to master?
<zao> The most important thing whenever something's wrong is to know who to blame :P
<nikunj97> zao, :D
<gonidelis[m]> haha
<zao> gonidelis[m]: If you rebase on current master, it's fixed already.
<zao> (the typo has been there for years)
<ms[m]> gonidelis: please ignore that
<ms[m]> nikunj97: nice! PR? ;) thanks for working on that!
<nikunj97> ms[m], working on it. will add once I can confirm performance benefits
<ms[m]> sure, no worries
<gonidelis[m]> ms[m]: the thing is I don't know which tests to ignore and which not
<gonidelis[m]> zao: thnx!
<ms[m]> gonidelis: yeah, I know... but feel free to keep asking here, usually if you're confused about an error it's most likely unrelated to your changes
<gonidelis[m]> ok great thanks a lot :D
sayefsakin has joined #ste||ar
sayefsakin has quit [Ping timeout: 260 seconds]
hkaiser has joined #ste||ar
<nikunj97> ms[m], ok, it won't scale any further than 4 nodes :/
Yorlik has joined #ste||ar
<Yorlik> hkaiser: YT
<Yorlik> I ran another test yesterday: with 1 million objects/messages it used to take ~30 seconds/frame and I'm down to ~5-6 seconds. Concerning scaling that is even better than with 100k objects, where I end up at ~6-7 ms per frame. I guess the numerous little small-object optimizations I did, together with mimalloc, indeed helped a lot.
<Yorlik> However - the frametimes are still too long and I need to optimize further.
<hkaiser> nice
<Yorlik> hkaiser: I made the lambdas and the executor function static - but that didn't change anything
<hkaiser> sure, why should it
<Yorlik> no recreation of objects?
<Yorlik> So their construction time is insignificant
K-ballo has quit [Quit: K-ballo]
<hkaiser> Yorlik: if your lambda has no captures then it's equivalent to a function pointer anyways
<Yorlik> Only references, no copies.
<Yorlik> so - probably just like a mini namespace
<hkaiser> then it has captures, but it's still trivially constructible
<Yorlik> Makes sense
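A tiny, generic C++ illustration of hkaiser's point (not project code): a capture-less lambda converts to a plain function pointer, and a lambda capturing only by reference just stores that reference, so constructing it is trivially cheap:
```cpp
#include <cassert>

int main()
{
    auto f = [](int x) { return x + 1; };
    int (*fp)(int) = f;                        // capture-less lambda -> function pointer
    assert(fp(1) == 2);

    int state = 0;
    auto g = [&state](int x) { state += x; };  // only stores a reference
    g(5);
    assert(state == 5);
    return 0;
}
```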
<Yorlik> I'm trying everything
<hkaiser> Yorlik: measure, measure, measure
<hkaiser> everything else is just conjecture
<Yorlik> At some point I gave the autochunker a ridiculously high target time of ~20000µsec and saw an unexpectedly huge gain in time savings
<Yorlik> So - trying mad things is good, as long as the system isn't really fully understood
<Yorlik> At the moment I'm measuring the effect of the chunk count per core on the timings
<Yorlik> like: How much overhead do I get from more tasks
<Yorlik> And if I see something interesting I'll go profile it
<Yorlik> But that's all prelude to what's going to happen in Milestone 2
<Yorlik> I want scripted, automated experiments and solid instrumentation / graphing
K-ballo has joined #ste||ar
kale[m] has quit [Ping timeout: 256 seconds]
kale[m] has joined #ste||ar
K-ballo has quit [Quit: K-ballo]
K-ballo has joined #ste||ar
K-ballo has quit [Client Quit]
K-ballo has joined #ste||ar
kale[m] has quit [Ping timeout: 240 seconds]
<Yorlik> hkaiser: {what}: Couldn't add new thread to the map of threads: HPX(out_of_memory)
<Yorlik> 22/64 GB RAM used, 42 GB left ....
<hkaiser> no idea
<hkaiser> if it says OOM, then I'd believe it
<Yorlik> Maybe some internal structure with limited data?
<Yorlik> I had this on occasion 2-3 times when running with 1000000 objects
<Yorlik> or could this be a propagation of my memory management?
<Yorlik> because I throw above 1000000
Amy1 has quit [Ping timeout: 256 seconds]
Amy1 has joined #ste||ar
rtohid has joined #ste||ar
karame_ has joined #ste||ar
weilewei has joined #ste||ar
<hkaiser> ms[m]: yt?
<ms[m]> hkaiser: yeah
<hkaiser> ms[m]: nvm, I just saw #4761 (and merged it) - thanks!
<ms[m]> thank you
<ms[m]> and also thanks for coordinating the kokkos discussions!
<ms[m]> it's nice that there are so many people interested
<ms[m]> I actually made decent progress on the kokkos executor today, it's kind of working even with a few algorithms
<ms[m]> the real work is going to be avoiding kokkos fences in their views...
<hkaiser> yes
<hkaiser> also the integration of continuations
<hkaiser> but that's for another day, I think
<ms[m]> regarding gpu continuations, I don't know of a way to avoid synchronizing via the cpu with our current futures
<ms[m]> they're kind of set up to go via the cpu with the future being set to ready on the cpu side
<hkaiser> in any case, thanks for working on this, it will be a game changer in terms of project visibility (at least amongst the labs)
<ms[m]> yes, definitely for another day...
<ms[m]> with big enough kernels it's again not a problem
<hkaiser> ms[m]: yes, I know - but all we need to keep is the hpx::future interface class, nobody said we wouldn't be able to create special shared states that do what's needed
<ms[m]> eventually someone will want tiny kernels and then it's the same problem we have on the cpu
<ms[m]> right, that might be possible, I hadn't thought about that at all
<hkaiser> the shared state would just carry the context information (stream), allowing continuations to be submitted to the same stream
<ms[m]> yeah, that could work
<ms[m]> it may be overconstraining, but could work as a start
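A purely illustrative sketch of the idea being discussed; all names are hypothetical and this is not HPX code. The point is just that the shared state could carry the device context (a stream handle) so continuations can be enqueued on the same stream rather than synchronizing via the CPU:
```cpp
#include <functional>

using stream_handle = void*;   // stands in for e.g. a cudaStream_t

struct device_shared_state
{
    stream_handle stream = nullptr;   // context carried with the shared state

    // hypothetical hook: a real implementation would enqueue the continuation
    // on `stream`; here it is simply invoked to show the information flow
    void attach_continuation(std::function<void(stream_handle)> const& f)
    {
        f(stream);
    }
};

int main()
{
    device_shared_state state;
    state.attach_continuation([](stream_handle s) { (void) s; /* launch on s */ });
    return 0;
}
```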
<ms[m]> hkaiser: on an unrelated topic, and sorry for bringing this up again, but what's the status of ci on rostam? we would really need that for a release and I'd be happy to help out there
<hkaiser> ms[m]: nothing has happened there (yet) mostly because we ran out of disk space and acquired new storage (120TB) that is currently being configured and made available
<hkaiser> once that's in place we should be able to proceed with the ci
<ms[m]> :P
<ms[m]> ok, well, if I can be of help I'm available
<hkaiser> ms[m]: thanks, I'll keep you in the loop - will ask today what's the state
<ms[m]> we might get fancier ci running on daint soonish (that harmen guy is from cscs), but we probably still won't be able to cover enough configurations with that
<hkaiser> sure, I'm all in favor of using rostam for this
<hkaiser> sorry for the delay
<ms[m]> hkaiser: no worries, I'm just being impatient
<ms[m]> I really appreciate you guys setting that up
bita_ has joined #ste||ar
nan77 has joined #ste||ar
Yorlik has quit [Ping timeout: 265 seconds]
kale[m] has joined #ste||ar
karame_ has quit [Remote host closed the connection]
<hkaiser> nikunj97: yt?
<nikunj97> hkaiser, here
<hkaiser> see pm, pls
mcopik has joined #ste||ar
mcopik has quit [Client Quit]
<hkaiser> nikunj97: yt
<nikunj97> hkaiser, here
<hkaiser> pm pls (again)
<nikunj97> :D
<nikunj97> hkaiser, ms[m]: early scaling up to 8 nodes (512 cores) is showing promising results for my optimized 1d_stencil_8. Execution time decreases from 750s on 1 node to 180s on 8 nodes. For smaller problem sizes, scaling up to 2 nodes (128 cores) can be expected.
<nikunj97> this is with 2400 partitions, 100000 points per partition iterating for a total of 4096 iterations
jaafar has quit [Quit: Konversation terminated!]
<nikunj97> this is wrt total time, which includes the initialization time as well. I think kernel performance will be even better
<gonidelis[m]> hkaiser: Please tell me if there is anything else to fix in #4745, as I can't tell from the failed tests which are important and which are not
<K-ballo> they are all important
<gonidelis[m]> K-ballo: Earlier in the day mikel advised me that there are some fails that I shouldn't be worried about...
<K-ballo> that's a different question
<K-ballo> they may not all be your fault/responsibility, but they are nevertheless important
<gonidelis[m]> ok you got me there
<gonidelis[m]> So I am asking if there is something that needs to be fixed in order for this particular PR to be ready
karame_ has joined #ste||ar
<hkaiser> gonidelis[m]: we'll merge it as it is right now
<hkaiser> might take another day or two
<hkaiser> ms[m]: Alireza told me that the storage is in place now
weilewei has quit [Remote host closed the connection]
<hkaiser> on rostam, that is
weilewei has joined #ste||ar
<gonidelis[m]> hkaiser: ok great :D
jaafar has joined #ste||ar
kale[m] has quit [Ping timeout: 264 seconds]
kale[m] has joined #ste||ar
<gonidelis[m]> hkaiser: I have been working on that feature check that we were talking about the other day and I have changed some .cmake files. Will the changes be picked up if I just `make -j` in my build dir? Or do I need to run `cmake` first all over again?
<K-ballo> make will do cmake first when needed
<gonidelis[m]> K-ballo: ok great. Thanks a lot!
<ms[m]> hkaiser: thanks for the update, sounds good
<gonidelis[m]> hkaiser: K-ballo What's your opinion on my disable_sized_sentinel_for implementation?
<hkaiser> gonidelis[m]: If you created a PR I would comment on it
<hkaiser> some minor issues, but overall looks good
<gonidelis[m]> + sized_sentinel_for
<K-ballo> gonidelis[m]: feature tests for library stuff have 'std' in the name; also, your feature test always fails (no inline for block-scope variables)
<K-ballo> maybe silence the warning about unused `b` too? not sure how we usually handle those
<hkaiser> K-ballo: we usually don't care, but a (void) b; wouldn't hurt
<hkaiser> tests are missing
<hkaiser> pls create a PR, I have a couple of comments
<gonidelis[m]> Yeah. Was asking mainly for the test part. ;) Ok. I will make the changes on `distance.hpp` right now and I will make the PR right away
<gonidelis[m]> K-ballo: why does the test always fail?
<K-ballo> (no inline for block scope variables)
<gonidelis[m]> hkaiser: I will add (void) b; and tests asap
<gonidelis[m]> K-ballo: ahhh why is that? I don't really know why `inline` is needed there (maybe it is not needed after all) but I put it because that is the return type of `std::disable_sized_sentinel_for`
<gonidelis[m]> `inline constexpr bool`
<K-ballo> it's not the return type, it's a variable not a function
<hkaiser> the definition of disable_sized_sentinel_for has to be inline
<K-ballo> inline variables only exist at namespace scope, yours is at block scope... so the test always fails to compile
<gonidelis[m]> ok got it. I will remove the `inline` keyword
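To make that concrete, a hedged sketch of what the body of such a feature test could look like (illustrative only, not the actual content of the .cmake check; the 'std'-in-the-name convention above concerns the CMake-level name of the check, not this body):
```cpp
// Sketch of a possible feature-test body (assumes a C++20 standard library);
// not the actual check in HPX's cmake files.
#include <iterator>

int main()
{
    // block-scope variables cannot be declared inline, hence no `inline` here
    constexpr bool b = std::disable_sized_sentinel_for<int*, int*>;
    (void) b;   // silence the unused-variable warning
    return 0;
}
```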
Yorlik has joined #ste||ar
<K-ballo> once the test is ready copy it to the wandbox and check that it compiles as expected and fails as expected
<K-ballo> latest gcc should have sized_sentinel_for already
<gonidelis[m]> wandbox?
<hkaiser> gonidelis[m]: also, inline constexpr variables are C++17, use HPX_INLINE_CONSTEXPR_VARIABLE instead of 'inline constexpr'
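And for the HPX-side definition itself (as opposed to the feature test), a hedged sketch of what using that macro might look like; the header, namespace, and placement are assumptions, only the macro name comes from the conversation:
```cpp
// Illustrative sketch only; namespace and placement are assumptions.
#include <hpx/config.hpp>   // assumed to provide HPX_INLINE_CONSTEXPR_VARIABLE

namespace hpx { namespace traits {

    // HPX_INLINE_CONSTEXPR_VARIABLE expands to `inline constexpr` where
    // C++17 inline variables are available
    template <typename Sent, typename Iter>
    HPX_INLINE_CONSTEXPR_VARIABLE bool disable_sized_sentinel_for = false;
}}   // namespace hpx::traits
```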
<K-ballo> better open a PR next time, to capture all the comments
<hkaiser> that's what I asked as well
<hkaiser> there are more things that need adaptation
<gonidelis[m]> ok I am sorry. I thought that opening a PR for unfinished work would be stupid
<hkaiser> gonidelis[m]: why? electrons are cheap
<gonidelis[m]> hkaiser: I'll make sure you won't have to repeat that quote ever again ;D
<K-ballo> when you open a PR you can choose to create a Draft PR
<K-ballo> to signal that your work is unfinished
<gonidelis[m]> ahhh ok... great! I'll do that. BTW that wandbox thing is just what I needed. Thanks!
nikunj97 has quit [Read error: Connection reset by peer]
sayefsakin has joined #ste||ar
<diehlpk_work> Please find attached the link to the JOSS review
<diehlpk_work> Please look into the questions and help to answer them
<gonidelis[m]> hkaiser: Could you please remind me what change I have to make to `distance`? https://github.com/gonidelis/hpx/blob/master/libs/algorithms/include/hpx/parallel/algorithms/detail/distance.hpp . I reckon it is that we should use `sized_sentinel_for` instead of the `iterator_tag` feature, in order to do the type dispatching, but why?
<K-ballo> what does knowing that a type is a sized sentinel give you?
<gonidelis[m]> That I can calculate S - I in constant time (?)
<gonidelis[m]> like a random access iterator
<gonidelis[m]> ahh... Is it that we make a generalization?
<gonidelis[m]> That is to say, we do that dispatching not only in the case of a random access iterator, but in any case where S - I can be calculated in constant time!
sayef_ has joined #ste||ar
wash[m]_ has joined #ste||ar
V|r has joined #ste||ar
<gonidelis[m]> K-ballo: ^^
sayefsakin has quit [*.net *.split]
wash[m] has quit [*.net *.split]
Vir has quit [*.net *.split]
wash[m]_ is now known as wash[m]
<gonidelis[m]> Any ideas on how to check my distance.hpp implementation? Is there any test for the previous implementation?
<gonidelis[m]> hkaiser: ^^
weilewei has quit [Remote host closed the connection]
<hkaiser> gonidelis[m]: we don't have a test for this
nan77 has quit [Remote host closed the connection]
<hkaiser> gonidelis[m]: "Is it that we make a generalization?" that's excatly what we want to do
karame_ has quit [Remote host closed the connection]
<gonidelis[m]> yeah great
<gonidelis[m]> So in order to check my distance.hpp do I have to create a test of my own? Also, where do you use your custom distance? I can only find std::distance used in the algos
nan9 has joined #ste||ar
<hkaiser> well, that's the point, you'll need to change the algos you touch to use your new distance implementation
rtohid has left #ste||ar [#ste||ar]
weilewei has joined #ste||ar
<hkaiser> gonidelis[m]: ^^
<K-ballo> gonidelis[m]: yes, you were right
kale[m] has quit [Ping timeout: 256 seconds]
kale[m] has joined #ste||ar
<gonidelis[m]> hkaiser: Yeah right. I do get that. But what was the reason for having the previous implementation in the first place if you didn't use it ?
<gonidelis[m]> Also, am I allowed to create a test of my own (at `libs/algorithms/tests/unit`) for `distance.hpp` just to check if it works properly?
<K-ballo> where's the previous implementation?
<K-ballo> gonidelis[m]:
<hkaiser> gonidelis[m]: we did use it in the reduce algorithm
<hkaiser> gonidelis[m]: sure, pls add tests if you think it would be useful to have them
<gonidelis[m]> hkaiser: ahh great! Is there any way I could have found that on my own? (through some github tool maybe)
<K-ballo> InIterB, InIterE mmmmh
<hkaiser> grep for util::distance?
<hkaiser> K-ballo: yah, naming is always hard
<hkaiser> gonidelis[m]: or detail::distance for that matter
<gonidelis[m]> what's the problem with the naming??
<hkaiser> gonidelis[m]: it's hard ;-)
<gonidelis[m]> the naming or the problem ? ;p
<hkaiser> 'InIterE' might be better off being 'Sentinel' or something similar
<gonidelis[m]> Ok... we could discuss that as soon as I post the PR
<gonidelis[m]> hkaiser: Apologies for not having posted it yet, but I'm encountering minor problems with the dispatching. I am sure I'll figure it out.
<hkaiser> sure, no worries
<gonidelis[m]> As for that grep thing. Should I just `cd hpx_sized_sentiel_for/libs/algorithms/include/hpx/parallel/algorithms` and then `grep "util::distance"`? Or is there another way around? (it seems to take some time)
<gonidelis[m]> hkaiser: ^^
<gonidelis[m]> `detail::distance` ^^
<hkaiser> well grep -R does it recursively, I believe
<gonidelis[m]> ahh okkk.... That was the trick
<gonidelis[m]> Thank you very much!!!