weilewei18 has quit [Remote host closed the connection]
<weilewei>
hkaiser jbjnr ms[m] so I found out how to break the cyclic dependency with libcds. When libcds is built inside hpx, we can pass a flag, LIBCDS_INSIDE_HPX, to say "hi, I am inside the hpx build", and then inside libcds link the libcds core to hpx_core (instead of HPX::hpx when built outside of hpx) and link the libcds tests to hpx_wrap (instead of HPX::hpx).
<weilewei>
next step might be adding the libcds thread data structure inside hpx thread data... we will see how it goes; otherwise, we might need to let hpx set_thread_data set a pair of 64-bit values or something similar
<weilewei>
well, we will discuss more at the Monday meeting
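A minimal sketch of the set_thread_data idea mentioned above, assuming the existing hpx::threads::set_thread_data()/get_thread_data() helpers (which carry a single std::size_t per HPX thread); the libcds_tls struct, its fields, and its ownership are purely hypothetical:

    #include <hpx/include/threads.hpp>

    #include <cstddef>
    #include <cstdint>

    struct libcds_tls    // hypothetical per-thread state libcds would need
    {
        std::uint64_t hazard_ptr_record;
        std::uint64_t reclamation_epoch;
    };

    void attach_libcds_data()
    {
        // Stash a pointer to the pair of 64-bit values in the single
        // std::size_t slot HPX keeps per thread (owned/freed elsewhere in a
        // real design; leaked here for brevity).
        auto* tls = new libcds_tls{};
        hpx::threads::set_thread_data(
            hpx::threads::get_self_id(), reinterpret_cast<std::size_t>(tls));
    }

    libcds_tls* libcds_data()
    {
        return reinterpret_cast<libcds_tls*>(
            hpx::threads::get_thread_data(hpx::threads::get_self_id()));
    }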
weilewei has quit [Remote host closed the connection]
hkaiser has quit [Quit: bye]
bita_ has quit [Read error: Connection reset by peer]
nikunj97 has joined #ste||ar
nikunj97 has quit [Remote host closed the connection]
<gonidelis[m]>
what about the 'pycicle daint-clang-oldest' tests and the 'pycicle daint-gcc-cuda' one
<gonidelis[m]>
?
Nikunj__ has quit [Read error: Connection reset by peer]
diehlpk_work_ has quit [Remote host closed the connection]
diehlpk_work_ has joined #ste||ar
weilewei has joined #ste||ar
nikunj97 has joined #ste||ar
<nikunj97>
is 1d stencil 8 not optimized for distributed? I was running with 1M points per partition and 100 partitions over 100 iterations and the code isn't scaling. It starts at 85s execution time for single node and increases to 110s by the time I test for 10 nodes.
<ms[m]>
gonidelis: ignore the clang-oldest failure, that's unrelated
<ms[m]>
I'll try to fix it in the next few days
<nikunj97>
I'm using: srun -N <node amount> -p marvin ./1d_stencil_8 --nx=1000000 --np=1000 --nt=100 to test
<ms[m]>
nikunj97: haven't looked at it closely, but it's most likely not very optimized (or it's regressed)
<ms[m]>
you might have better luck with a bigger problem, but it sounds like you already have quite a big problem
<nikunj97>
is there any distributed benchmark available?
<nikunj97>
I wanted to test scalability of a distributed network
<nikunj97>
ms[m], if I increase the problem size, execution times will increase well beyond a few minutes per run. I believe the current problem is large enough
<ms[m]>
afaik that's supposed to be our best distributed benchmark... heller jbjnr hkaiser might have other ideas
<hkaiser>
ms[m]: there is the 2d stencil from the tutorials
<Yorlik>
hkaiser: Do you have plans to fix the mimalloc integration? I like its general concept of sharded freelists and the C++ integration it has. Also it's very fast. (Faster than jemalloc and tcmalloc - at least they claim that and I didn't see that denied anywhere.)
<hkaiser>
Yorlik: sure, we plan to fix it
<Yorlik>
Nice. Just saw the additional posts on github after I posted here.
kale[m] has joined #ste||ar
nan11 has joined #ste||ar
<ms[m]>
hkaiser jbjnr any news on the rostam CI? I think having that up is a must before we can do another release
<ms[m]>
I'd like to help if I can
<kale[m]>
I'm getting this warning while building phylanx: #warning "The header hpx/runtime/get_os_thread_count.hpp is deprecated, please include hpx/runtime_local/get_os_thread_count.hpp instead" [-Wcpp]. How can I resolve this issue?
<Yorlik>
kale[m] Just replace the deprecated header with the new one where it's used. I used Sublime full text search to find all locations in my project.
<Yorlik>
ms[m] I'll give the fix a shot - on it right now.
<ms[m]>
Yorlik: thanks
<ms[m]>
kale[m]: if you're on latest master I recommend you replace it with `hpx/runtime.hpp` instead; if you want it to work with older HPX versions, include `hpx/include/runtime.hpp` instead
<ms[m]>
i.e. ignore the header it's recommending, but don't ignore the actual message
<ms[m]>
we're fixing those up before the next release
<ms[m]>
ah, but you said you're building phylanx... in that case it might already be fixed on latest master
<ms[m]>
if not it'll be fixed sooner or later
<kale[m]>
Yorlik: Ah, ok. I thought the problem was my build options. Thanks
<ms[m]>
the warning is mostly harmless, it's just a nudge to change the include path
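A minimal sketch of the header swap ms[m] suggests, assuming hpx::get_os_thread_count() is the function being pulled in; pick the include that matches your HPX version:

    // #include <hpx/runtime/get_os_thread_count.hpp>   // deprecated spelling that triggers the warning
    #include <hpx/runtime.hpp>              // latest master
    // #include <hpx/include/runtime.hpp>   // older HPX releases

    #include <cstddef>

    std::size_t worker_threads()
    {
        return hpx::get_os_thread_count();
    }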
rtohid has joined #ste||ar
<Yorlik>
ms[m] Everything compiles nicely now. I took the library out of my TARGET_LINK_LIBRARIES, the headers were found, all peachy. But at startup I get a "write access violation" exception in mimalloc.dll.
<Yorlik>
Beginning of the call stack is sim.exe!`dynamic initializer for 'hpx::serialization::detail::register_class_name<hpx::actions::manage_object_action_base,void>::instance''() Line 74 (C++)
<Yorlik>
Right after sim.exe!__scrt_common_main_seh() Line 258 (C++)
<Yorlik>
And then it goes all the way down into mimalloc
<ms[m]>
Yorlik: ok, I'm afraid I can't help you further than this
<ms[m]>
at least you're linking with mimalloc
<Yorlik>
I posted the complete stack - I'll dig and see if I can find anything. Thanks a lot so far.
<ms[m]>
hkaiser's eyes might be good to have on the PR, but afaict it's supposed to do exactly the same as before, except restrict who gets the `/INCLUDE:mimalloc_version` flag
<Yorlik>
Maybe I was calling the mimalloc version in the wrong place (outside hpx)
<ms[m]>
scrt_common_main_seh sounds like some c++ runtime initialization function
<ms[m]>
this is not my area though, I'm not on windows...
<Yorlik>
Yes. I'm afraid it's not my fault this time.
<ms[m]>
it could be another static initialization order issue
<ms[m]>
something in hpx may be getting initialized before mimalloc's internal data structures, or something like that
<Yorlik>
Removing #include <mimalloc-new-delete.h> made it run
<Yorlik>
So - global new and delete overriding is tricky here.
<K-ballo>
structured exception handler
<nikunj97>
hkaiser, thanks for the call. I'll start with what you said.
<hkaiser>
nikunj97: any time
<hkaiser>
yes, that looks like an initialization sequencing issue
<hkaiser>
mimalloc needs to be initialized first, before any of our global objects do any allocation
<hkaiser>
no idea how to ensure that, however (without major refactorings)
<K-ballo>
do we have those kind of objects, outside of function registration?
<hkaiser>
serialization registration
<hkaiser>
and action registration
<Yorlik>
hkaiser: Removing #include <mimalloc-new-delete.h> made it run
<hkaiser>
nod
<hkaiser>
but then, is mimalloc still used?
<K-ballo>
don't we use well-known ids for core actions?
<hkaiser>
do they rewire the binary to use it anyways?
<Yorlik>
Not for the global new and delete overrides; for malloc, yes, I think
<hkaiser>
K-ballo: could be it's not an issue - I'd need to look closely
<hkaiser>
K-ballo: Yorlik's problem comes out of serialization
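A contrived, self-contained sketch of the hazard being described, not HPX's actual code: a namespace-scope registration object allocates in its constructor through the (replaced) global operator new, and when the replacement allocator lives in another module there is no guarantee its own start-up has run first:

    #include <map>
    #include <string>

    // Stand-in for something like register_class_name<...>::instance from the
    // serialization layer: its dynamic initializer runs at program start and
    // allocates via ::operator new (replaced by <mimalloc-new-delete.h>).
    struct register_class_name
    {
        explicit register_class_name(std::string name)
        {
            registry().emplace(std::move(name), true);    // allocates
        }
        static std::map<std::string, bool>& registry()
        {
            static std::map<std::string, bool> instance;  // safe: built on first use
            return instance;
        }
    };

    // Dynamic initializer in the executable; whether the allocator DLL has
    // finished initializing its internal data structures by this point is not
    // under the program's control.
    register_class_name const manage_object_registration{
        "hpx::actions::manage_object_action_base"};

    int main() {}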
<Yorlik>
82.2 µsec per object update right now. That includes loading the object and its mailbox, a call into Lua, a small script running there and exiting. To me it still looks too slow.
<Yorlik>
the timing accounts for the thread count already and assumes perfect parallel efficiency
<Yorlik>
Single threaded I'm at 47.7 µsec
<Yorlik>
So - my 12 cores give me not even x2 :(
weilewei has quit [Remote host closed the connection]
<hkaiser>
Yorlik: are you sure mimalloc gets loaded at all?
<Yorlik>
At least it prints the version of mimalloc
<Yorlik>
So - there must be some code loaded, yes
<hkaiser>
what loads mimalloc? hpx or your app?
<Yorlik>
HPX
<Yorlik>
I don't even have it in my target_link_libraries
<Yorlik>
Still - header got found and all.
<hkaiser>
mimalloc should get initialized at load time and if hpx depends on it it should happen before any hpx global objects are being created
<Yorlik>
So the re-export from HPX worked pretty nicely, except for that little problem above.
<hkaiser>
frankly, I don't understand what the problem is caused by
<Yorlik>
Something in the mimalloc header which overrides new and delete triggered the exception
<hkaiser>
it's a global constructor in the executable, which shouldn't get invoked before all global constructors from all dependent libraries have been called
<Yorlik>
The standard header is no problem
<hkaiser>
that means that at the point of the constructor call mimalloc was already initialized...
<Yorlik>
I am not sure what I could do.
<hkaiser>
me neither
<Yorlik>
hkaiser: Is it normal or common that my time/(objcount*worker_count) goes from 47 to 82 µs when going from single-threaded to 12 workers? Overall it is faster, of course, but the relative speed goes down the gully.
<Yorlik>
like per core
<hkaiser>
contention, false sharing, cross numa-domain traffic could cause this, yes
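To illustrate the false-sharing part of that answer (a standalone sketch, not Yorlik's code): per-worker counters packed onto one cache line degrade badly once several cores update them, while padding each counter to its own 64-byte line avoids the cross-core invalidations:

    #include <atomic>
    #include <cstddef>
    #include <thread>
    #include <vector>

    struct padded_counter
    {
        alignas(64) std::atomic<std::size_t> value{0};   // one cache line per counter
    };

    int main()
    {
        constexpr std::size_t workers = 12;
        constexpr std::size_t iterations = 10'000'000;

        // With plain std::atomic<std::size_t> counters[12] the counters would
        // share cache lines and every increment would invalidate the
        // neighbours' copies of that line.
        std::vector<padded_counter> counters(workers);

        std::vector<std::thread> pool;
        for (std::size_t w = 0; w != workers; ++w)
            pool.emplace_back([&counters, w] {
                for (std::size_t i = 0; i != iterations; ++i)
                    counters[w].value.fetch_add(1, std::memory_order_relaxed);
            });
        for (auto& t : pool)
            t.join();
    }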
<Yorlik>
idle rate is ~< 1%
<hkaiser>
how does it behave if you stay on the same numa domain?
<Yorlik>
You mean using only one level 3 cache?
<hkaiser>
yes
<Yorlik>
I could do that by using only the first 3 cores
<hkaiser>
and no cross-numa domain memory traffic
<hkaiser>
6 cores, I presume
<Yorlik>
I have only one real numa domain, but one lvl3 cache per 3 cores
<Yorlik>
12 cores
<Yorlik>
3 share a lvl3
<hkaiser>
one numa-domain?
<Yorlik>
That's what hwloc tells me
<hkaiser>
interesting
<Yorlik>
But 4 level3 caches
<Yorlik>
AMD 3900x
<Yorlik>
12 cores
<Yorlik>
4 lvl3 caches
nan11 has quit [Remote host closed the connection]
<hkaiser>
apparently cross CCX traffic is much slower, so the caches are important
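For reference, a small sketch of querying that topology directly with the hwloc C API (assuming hwloc >= 2.0, where HWLOC_OBJ_L3CACHE is available); on a 3900X this should report one NUMA node but four L3 caches, one per CCX:

    #include <hwloc.h>
    #include <cstdio>

    int main()
    {
        hwloc_topology_t topo;
        hwloc_topology_init(&topo);
        hwloc_topology_load(topo);

        int numa = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_NUMANODE);
        int l3   = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_L3CACHE);
        int pus  = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_PU);
        std::printf("NUMA nodes: %d, L3 caches: %d, hardware threads: %d\n",
            numa, l3, pus);

        hwloc_topology_destroy(topo);
    }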
<gonidelis[m]>
hkaiser: just fixed the spelling and include issues that the `build-and-test` check indicated
<hkaiser>
thanks
<gonidelis[m]>
The PR should be ok after that
<hkaiser>
Yorlik: could be that you're hitting the memory bandwidth limit and that starts limiting overall perf
<hkaiser>
all cores go through the same IO bus after all
<Yorlik>
hkaiser: The memory bandwidth is abysmal at the moment. I calculated object size*count/time
<hkaiser>
there is a lot more than just your objects being moved from/to memory
<Yorlik>
Sure
<Yorlik>
messages, lua states, futures ...
<Yorlik>
Probably the best approach is to shrink stuff, indeed.
<Yorlik>
process usage is ~6 GB
<Yorlik>
the majority of that gets moved around
nan11 has joined #ste||ar
<hkaiser>
Yorlik: what about measuring first?
<hkaiser>
to understand what's causing the effects you see?
<Yorlik>
How would you measure total memory usage?
<hkaiser>
ask the system
<Yorlik>
I see the total memory used by the app
<hkaiser>
there are also tools like Intel Amplifier (VTune) that can help with assessing things
<Yorlik>
And i see how it jumps up when the messages are being processed
<hkaiser>
AMD has their own tools, I'm sure
<Yorlik>
AMD tools are broken for me at the moment
<hkaiser>
it's easy enough to create one for std::disable_sized_sentinel_for, I can do that
<K-ballo>
gonidelis[m]: is this your first time with C++?
<Yorlik>
“But I don’t want to go among mad people," Alice remarked. "Oh, you can’t help that," said the Cat: "we’re all mad here. I’m mad. You’re mad." "How do you know I’m mad?" said Alice. "You must be," said the Cat, "or you wouldn’t have come here.” ― Lewis Carroll, Alice in Wonderland
<jbjnr>
and your point is?
<Yorlik>
Oh - it just was like gonidelis in wonderland and hkaiser and K-ballo playing the Cheshire cat :)
<hkaiser>
lol
nikunj97 has quit [Read error: Connection reset by peer]
<gonidelis[m]>
K-ballo: yes. why?
<K-ballo>
gonidelis[m]: much like the alice reference, I was suggesting everything is a rabbit hole with C++
<gonidelis[m]>
Yorlik: 😅😅😅😅😅
<Yorlik>
:D
<gonidelis[m]>
hahahahhaha
<Yorlik>
Welcome to the rabbithole :)
<gonidelis[m]>
ok I get it.
<gonidelis[m]>
It's fun tough.
<gonidelis[m]>
though.... *
<K-ballo>
tough too
<gonidelis[m]>
To be honest the "rabbithole" reference is stolen from a quote of Mr.Kaiser at one of our meetings...
<gonidelis[m]>
hkaiser: How could I run those cmake configure tests?
<hkaiser>
the HPX_WITH_... from b) will be defined at configure time (inside cmake), while the HPX_HAVE_... from c) will be defined as a preprocessor constant at compile time, allowing code to react to whether things are available or not
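To make the b)/c) distinction concrete, here is a sketch (not HPX's actual test file) of the kind of translation unit such a CMake configure check tries to compile; if it builds, CMake records the corresponding HPX_WITH_... setting at configure time and exposes it to code as an HPX_HAVE_... preprocessor define. The type names here are made up; the probed feature is the C++20 std::disable_sized_sentinel_for customization point mentioned earlier:

    #include <iterator>

    struct my_iterator {};
    struct my_sentinel {};

    // Specializing the variable template is the feature being probed.
    template <>
    inline constexpr bool
        std::disable_sized_sentinel_for<my_sentinel, my_iterator> = true;

    int main()
    {
        static_assert(std::disable_sized_sentinel_for<my_sentinel, my_iterator>);
        return 0;
    }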
<bita_>
hkaiser, thanks for working on blaze_tensor #59. I worked around that issue in #1192, should I change that back (the changes in dist_transpose 3d)?
<gonidelis[m]>
hkaiser: could you resend c)? It seems your link was cut