weilewei18 has quit [Remote host closed the connection]
<weilewei>
hkaiser jbjnr ms[m] so I found out how to break the cyclic dependency with libcds. When libcds is built inside hpx, we can pass a flag, LIBCDS_INSIDE_HPX, to say "hi, I am inside the hpx build", and then inside libcds link the libcds core to hpx_core (instead of HPX::hpx when built outside of hpx) and link the libcds tests to hpx_wrap (instead of HPX::hpx).
<weilewei>
next step might be adding the libcds thread data structure inside hpx thread data... we will see how it goes; otherwise, we might need to let hpx set_thread_data set a pair of 64-bit values or something similar
<weilewei>
well, we will discuss more at the Monday meeting
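A minimal sketch of the set_thread_data idea mentioned above, assuming the existing hpx::threads::set_thread_data()/get_thread_data() helpers (which carry a single std::size_t per HPX thread); the libcds_tls struct, its fields, and its ownership are purely hypothetical:

    #include <hpx/include/threads.hpp>

    #include <cstddef>
    #include <cstdint>

    struct libcds_tls    // hypothetical per-thread state libcds would need
    {
        std::uint64_t hazard_ptr_record;
        std::uint64_t reclamation_epoch;
    };

    void attach_libcds_data()
    {
        // Stash a pointer to the pair of 64-bit values in the single
        // std::size_t slot HPX keeps per thread (owned/freed elsewhere in a
        // real design; leaked here for brevity).
        auto* tls = new libcds_tls{};
        hpx::threads::set_thread_data(
            hpx::threads::get_self_id(), reinterpret_cast<std::size_t>(tls));
    }

    libcds_tls* libcds_data()
    {
        return reinterpret_cast<libcds_tls*>(
            hpx::threads::get_thread_data(hpx::threads::get_self_id()));
    }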
weilewei has quit [Remote host closed the connection]
hkaiser has quit [Quit: bye]
bita_ has quit [Read error: Connection reset by peer]
nikunj97 has joined #ste||ar
nikunj97 has quit [Remote host closed the connection]
<gonidelis[m]>
what about the 'pycicle daint-clang-oldest' tests and the 'pycicle daint-gcc-cuda' one
<gonidelis[m]>
?
Nikunj__ has quit [Read error: Connection reset by peer]
diehlpk_work_ has quit [Remote host closed the connection]
diehlpk_work_ has joined #ste||ar
weilewei has joined #ste||ar
nikunj97 has joined #ste||ar
<nikunj97>
is 1d stencil 8 not optimized for distributed? I was running with 1M points per partition and 100 partitions over 100 iterations and the code isn't scaling. It starts at 85s execution time for single node and increases to 110s by the time I test for 10 nodes.
<ms[m]>
gonidelis: ignore the clang-oldest failure, that's unrelated
<ms[m]>
I'll try to fix it in the next few days
<nikunj97>
I'm using: srun -N <node amount> -p marvin ./1d_stencil_8 --nx=1000000 --np=1000 --nt=100 to test
<ms[m]>
nikunj97: haven't looked at it closely, but it's most likely not very optimized (or it's regressed)
<ms[m]>
you might have better luck with a bigger problem, but it sounds like you already have quite a big problem
<nikunj97>
is there any distributed benchmark available?
<nikunj97>
I wanted to test scalability of a distributed network
<nikunj97>
ms[m], if I increase the problem size, execution times will increase well beyond a few minutes per run. I believe the current problem is large enough
<ms[m]>
afaik that's supposed to be our best distributed benchmark... heller jbjnr hkaiser might have other ideas
<hkaiser>
ms[m]: there is the 2d stencil from the tutorials
<Yorlik>
hkaiser: Do you have plans to fix the mimalloc integration? I like its general concept of sharded freelists and the C++ integration it has. Also it's very fast. (Faster than jemalloc and tcmalloc - at least they claim that and I didn't see that denied anywhere.)
<hkaiser>
Yorlik: sure, we plan to fix it
<Yorlik>
Nice. Just saw the additional posts on github after I posted here.
kale[m] has joined #ste||ar
nan11 has joined #ste||ar
<ms[m]>
hkaiser jbjnr any news on the rostam CI? I think having that up is a must before we can do another release
<ms[m]>
I'd like to help if I can
<kale[m]>
I'm getting this warning while building phylanx: #warning "The header hpx/runtime/get_os_thread_count.hpp is deprecated, please include hpx/runtime_local/get_os_thread_count.hpp instead" [-Wcpp]. How can I resolve this issue?
<Yorlik>
kale[m] Just replace the deprecated header with the new one where it's used. I used Sublime full text search to find all locations in my project.
<Yorlik>
ms[m] I'll give the fix a shot - on it right now.
<ms[m]>
Yorlik: thanks
<ms[m]>
kale[m]: if you're on latest master I recommend you replace it with `hpx/runtime.hpp` instead; if you want it to work with older HPX versions, include `hpx/include/runtime.hpp` instead
<ms[m]>
i.e. ignore the header it's recommending, but don't ignore the actual message
<ms[m]>
we're fixing those up before the next release
<ms[m]>
ah, but you said you're building phylanx... in that case it might already be fixed on latest master
<ms[m]>
if not it'll be fixed sooner or later
<kale[m]>
Yorlik: Ah, ok. I thought the problem was my build options. Thanks
<ms[m]>
the warning is mostly harmless, it's just a nudge to change the include path
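A minimal sketch of the header swap ms[m] suggests, assuming hpx::get_os_thread_count() is the function being pulled in; pick the include that matches your HPX version:

    // #include <hpx/runtime/get_os_thread_count.hpp>   // deprecated spelling that triggers the warning
    #include <hpx/runtime.hpp>              // latest master
    // #include <hpx/include/runtime.hpp>   // older HPX releases

    #include <cstddef>

    std::size_t worker_threads()
    {
        return hpx::get_os_thread_count();
    }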
rtohid has joined #ste||ar
<Yorlik>
ms[m] Everything compiles nicely now. I took the library out of my TARGET_LINK_LIBRARIES, the headers were found, all peachy. But at startup I get a "write access violation" exception in mimalloc.dll.
<Yorlik>
Beginning of the call stack is sim.exe!`dynamic initializer for 'hpx::serialization::detail::register_class_name<hpx::actions::manage_object_action_base,void>::instance''() Line 74 (C++)
<Yorlik>
Right after sim.exe!__scrt_common_main_seh() Line 258 (C++)
<Yorlik>
And then it goes all the way down into mimalloc
<ms[m]>
Yorlik: ok, I'm afraid I can't help you further than this
<ms[m]>
at least you're linking with mimalloc
<Yorlik>
I posted the complete stack - I'll dig and see if I can find anything. Thanks a lot so far.
<ms[m]>
hkaiser's eyes might be good to have on the PR, but afaict it's supposed to do exactly the same as before, except restrict who gets the `/INCLUDE:mimalloc_version` flag
<Yorlik>
Maybe I was calling the mimalloc version in the wrong place (outside hpx)
<ms[m]>
scrt_common_main_seh sounds like some c++ runtime initialization function
<ms[m]>
this is not my area though, I'm not on windows...
<Yorlik>
Yes. I'm afraid it's not my fault this time.
<ms[m]>
it could be another static initialization order issue
<ms[m]>
something in hpx may be getting initialized before mimalloc's internal data structures, or something like that
<Yorlik>
Removing #include <mimalloc-new-delete.h> made it run
<Yorlik>
So - global new and delete overriding is tricky here.
<K-ballo>
structured exception handler
<nikunj97>
hkaiser, thanks for the call. I'll start with what you said.
<hkaiser>
nikunj97: any time
<hkaiser>
yes, that looks like an initialization sequencing issue
<hkaiser>
mimalloc needs to be initialized first, before any of our global objects do any allocation
<hkaiser>
no idea how to ensure that, however (without major refactorings)
<K-ballo>
do we have those kind of objects, outside of function registration?
<hkaiser>
serialization registration
<hkaiser>
and action registration
<Yorlik>
hkaiser: Removing #include <mimalloc-new-delete.h> made it run
<hkaiser>
nod
<hkaiser>
but then, is mimalloc still used?
<K-ballo>
don't we use well-known ids for core actions?
<hkaiser>
do they rewire the binary to use it anyways?
<Yorlik>
Not for the global new and delete overrides; for malloc, yes, I think
<hkaiser>
K-ballo: could be it's not an issue - I'd need to look closely
<hkaiser>
K-ballo: Yorlik's problem comes out of serialization
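A contrived, self-contained sketch of the hazard being described, not HPX's actual code: a namespace-scope registration object allocates in its constructor through the (replaced) global operator new, and when the replacement allocator lives in another module there is no guarantee its own start-up has run first:

    #include <map>
    #include <string>

    // Stand-in for something like register_class_name<...>::instance from the
    // serialization layer: its dynamic initializer runs at program start and
    // allocates via ::operator new (replaced by <mimalloc-new-delete.h>).
    struct register_class_name
    {
        explicit register_class_name(std::string name)
        {
            registry().emplace(std::move(name), true);    // allocates
        }
        static std::map<std::string, bool>& registry()
        {
            static std::map<std::string, bool> instance;  // safe: built on first use
            return instance;
        }
    };

    // Dynamic initializer in the executable; whether the allocator DLL has
    // finished initializing its internal data structures by this point is not
    // under the program's control.
    register_class_name const manage_object_registration{
        "hpx::actions::manage_object_action_base"};

    int main() {}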
<Yorlik>
82.2 µsec per object update right now. That includes loading the object and its mailbox, a call into Lua, a small script running there and exiting. To me it still looks too slow.
<Yorlik>
the timing accounts for the thread count already and assumes perfect parallel efficiency
<Yorlik>
Single threaded I'm at 47.7 µsec
<Yorlik>
So - my 12 cores give me not even x2 :(
weilewei has quit [Remote host closed the connection]
<hkaiser>
Yorlik: are you sure mimalloc gets loaded at all?
<Yorlik>
At least it prints the version of mimalloc
<Yorlik>
So - there must be some code loaded, yes
<hkaiser>
what loads mimalloc? hpx or your app?
<Yorlik>
HPX
<Yorlik>
I don't even have it in my target_link_libraries
<Yorlik>
Still - header got found and all.
<hkaiser>
mimalloc should get initialized at load time and if hpx depends on it it should happen before any hpx global objects are being created
<Yorlik>
So the re-export from HPX worked pretty nicely, except for that little problem above.
<hkaiser>
frankly, I don't understand what the problem is caused by
<Yorlik>
Something in the mimalloc header which overrides new and delete triggered the exception
<hkaiser>
it's a global constructor in the executable, which shouldn't get invoked before all global constructors from all dependent libraries have been called
<Yorlik>
The standard header is no problem
<hkaiser>
that means that at the point of the constructor call mimalloc was already initialized...
<Yorlik>
I am not sure what I could do.
<hkaiser>
me neither
<Yorlik>
hkaiser: Is it normal or common that my time/(objcount*worker_count) goes from 47 to 82 µs when going from single-threaded to 12 workers? Overall it is faster, of course, but the relative speed goes down the gully.
<Yorlik>
like per core
<hkaiser>
contention, false sharing, cross numa-domain traffic could cause this, yes
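To illustrate the false-sharing part of that answer (a standalone sketch, not Yorlik's code): per-worker counters packed onto one cache line degrade badly once several cores update them, while padding each counter to its own 64-byte line avoids the cross-core invalidations:

    #include <atomic>
    #include <cstddef>
    #include <thread>
    #include <vector>

    struct padded_counter
    {
        alignas(64) std::atomic<std::size_t> value{0};   // one cache line per counter
    };

    int main()
    {
        constexpr std::size_t workers = 12;
        constexpr std::size_t iterations = 10'000'000;

        // With plain std::atomic<std::size_t> counters[12] the counters would
        // share cache lines and every increment would invalidate the
        // neighbours' copies of that line.
        std::vector<padded_counter> counters(workers);

        std::vector<std::thread> pool;
        for (std::size_t w = 0; w != workers; ++w)
            pool.emplace_back([&counters, w] {
                for (std::size_t i = 0; i != iterations; ++i)
                    counters[w].value.fetch_add(1, std::memory_order_relaxed);
            });
        for (auto& t : pool)
            t.join();
    }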
<Yorlik>
idle rate is ~< 1%
<hkaiser>
how does it behave if you stay on the same numa domain?
<Yorlik>
You mean using only one level 3 cache?
<hkaiser>
yes
<Yorlik>
I could do that by using only the first 3 cores
<hkaiser>
and no cross-numa domain memory traffic
<hkaiser>
6 cores, I presume
<Yorlik>
I have only one real numa domain, but one lvl3 cache per 3 cores
<Yorlik>
12 cores
<Yorlik>
3 share a lvl3
<hkaiser>
one numa-domain?
<Yorlik>
That's what hwloc tells me
<hkaiser>
interesting
<Yorlik>
But 4 level3 caches
<Yorlik>
AMD 3900x
<Yorlik>
12 cores
<Yorlik>
4 lvl3 caches
nan11 has quit [Remote host closed the connection]
<hkaiser>
apparently cross CCX traffic is much slower, so the caches are important
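For reference, a small sketch of querying that topology directly with the hwloc C API (assuming hwloc >= 2.0, where HWLOC_OBJ_L3CACHE is available); on a 3900X this should report one NUMA node but four L3 caches, one per CCX:

    #include <hwloc.h>
    #include <cstdio>

    int main()
    {
        hwloc_topology_t topo;
        hwloc_topology_init(&topo);
        hwloc_topology_load(topo);

        int numa = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_NUMANODE);
        int l3   = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_L3CACHE);
        int pus  = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_PU);
        std::printf("NUMA nodes: %d, L3 caches: %d, hardware threads: %d\n",
            numa, l3, pus);

        hwloc_topology_destroy(topo);
    }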
<gonidelis[m]>
hkaiser: just fixed the spelling and include issues that the `build-and-test` check indicated
<hkaiser>
thanks
<gonidelis[m]>
The PR should be ok after that
<hkaiser>
Yorlik: could be that you're hitting the memory bandwidth limit and that starts limiting overall perf
<hkaiser>
all cores go through the same IO bus after all
<Yorlik>
hkaiser: The memory bandwidth is abysmal at the moment. I calculated object size*count/time
<hkaiser>
there is a lot more than just your objects being moved from/to memory
<Yorlik>
Sure
<Yorlik>
messages, lua states, futures ...
<Yorlik>
Probably the best approach is to shrink stuff, indeed.
<Yorlik>
process usage is ~6 GB
<Yorlik>
the majority of that gets moved around
nan11 has joined #ste||ar
<hkaiser>
Yorlik: what about measuring first?
<hkaiser>
to understand what's causing the effects you see?
<Yorlik>
How would you measure total memory usage?
<hkaiser>
ask the system
<Yorlik>
I see the total memory used by the app
<hkaiser>
there are also tools like Intel Amplifier (VTune) that can help with assessing things
<Yorlik>
And i see how it jumps up when the messages are being processed
<hkaiser>
AMD has their own tools, I'm sure
<Yorlik>
AMD tools are broken for me at the moment
<hkaiser>
it's easy enough to create one for std::disable_sized_sentinel_for, I can do that
<K-ballo>
gonidelis[m]: is this your first time with C++?
<Yorlik>
“But I don’t want to go among mad people," Alice remarked. "Oh, you can’t help that," said the Cat: "we’re all mad here. I’m mad. You’re mad." "How do you know I’m mad?" said Alice. "You must be," said the Cat, "or you wouldn’t have come here.” ― Lewis Carroll, Alice in Wonderland
<jbjnr>
and your point is?
<Yorlik>
Oh - it just was like gonidelis in wonderland and hkaiser and K-ballo playing the Cheshire cat :)
<hkaiser>
lol
nikunj97 has quit [Read error: Connection reset by peer]
<gonidelis[m]>
K-ballo: yes. why?
<K-ballo>
gonidelis[m]: much like the alice reference, I was suggesting everything is a rabbit hole with C++
<gonidelis[m]>
Yorlik: 😅😅😅😅😅
<Yorlik>
:D
<gonidelis[m]>
hahahahhaha
<Yorlik>
Welcome to the rabbithole :)
<gonidelis[m]>
ok I get it.
<gonidelis[m]>
It's fun tough.
<gonidelis[m]>
though.... *
<K-ballo>
tough too
<gonidelis[m]>
To be honest the "rabbithole" reference is stolen from a quote of Mr.Kaiser at one of our meetings...
<gonidelis[m]>
hkaiser: How could I run those cmake configure tests?
<hkaiser>
the HPX_WITH_... from b) will be defined at configure time (inside cmake), while the HPX_HAVE_... from c) will be defined as a preprocessor constant at compile time, allowing code to react to whether things are available or not
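To make the b)/c) distinction concrete, here is a sketch (not HPX's actual test file) of the kind of translation unit such a CMake configure check tries to compile; if it builds, CMake records the corresponding HPX_WITH_... setting at configure time and exposes it to code as an HPX_HAVE_... preprocessor define. The type names here are made up; the probed feature is the C++20 std::disable_sized_sentinel_for customization point mentioned earlier:

    #include <iterator>

    struct my_iterator {};
    struct my_sentinel {};

    // Specializing the variable template is the feature being probed.
    template <>
    inline constexpr bool
        std::disable_sized_sentinel_for<my_sentinel, my_iterator> = true;

    int main()
    {
        static_assert(std::disable_sized_sentinel_for<my_sentinel, my_iterator>);
        return 0;
    }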
<bita_>
hkaiser, thanks for working on blaze_tensor #59. I worked around that issue in #1192, should I change that back (the changes in dist_transpose 3d)?
<gonidelis[m]>
hkaiser: could you resend c)? It seems your link was cut