also I feel that my yesterday's result may be skewed
so first, draw a graph with the roofline
start with the basics first
also, draw the roofline with different peaks
ok, let me try it
different max bandwidth (main memory, cache levels), different max compute (no vectorization, vectorization, FMA, single threaded, all threads, etc)
since the stencil's metric is MLUP/S, you should convert GFLOP/S to MLUP/S
so for a 5 point stencil, mlups = gflops/5 ?
you need to calculate the arithmetic intensity
as a first step
well, keep the gflops at the first step
i'll explain the conversion later
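A hedged note on the units in the question above: 1 GFLOP/s is 10^9 floating point operations per second and 1 MLUP/s is 10^6 lattice updates per second, so dividing GFLOP/s by the flops per update yields GLUP/s, and a factor of 1000 is still needed to reach MLUP/s. The function name is hypothetical, and the count of 5 flops per update is only the figure from the question, not a confirmed number for this benchmark.

```python
def gflops_to_mlups(gflops, flops_per_update):
    """Convert GFLOP/s to MLUP/s (million lattice updates per second)."""
    if flops_per_update <= 0:
        raise ValueError("flops_per_update must be positive")
    # GFLOP/s divided by FLOP per update gives GLUP/s; 1 GLUP/s = 1000 MLUP/s
    return gflops * 1000.0 / flops_per_update


# Assumed example: 10 GFLOP/s at 5 flops per lattice update
print(gflops_to_mlups(10.0, 5))
```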
setup the roofline for the ARM64FX2 first
nikunj97: which of the stencil examples are you improving? It sounds like you are doing something very useful and worthwhile.
jbjnr, I'm working with heller1's 2d stencil benchmark from one of his lectures
is it part of the tutorials?
(in the tutorials repo)
ok great.
we have a plan to redo the tutorial material for the next course and it would be lovely to have a simd version of the stencil code to add to the material.
jbjnr, I'm trying my best :)
nikunj97: yeah, would be nice if you could write a few pages about the performance modelling ;)
lessons learnt, optimization etc
well this is all for a lab based project at my university. As I told you, it's a collaboration between iitr and jsc, so they won't let me off without a good 10-15 page report ;)
we have a gsoc project to add simd stuff, couldn't you do that as well and get paid for it too?
jsc = julich?
I have an internship this summer. Also, I'm a mentor this gsoc so I won't be able to apply as a student.
and yes, jsc is julich supercomputing center
but I can look into the project when I'm free and add stuff there
nikunj has quit [Read error: Connection reset by peer]
nikunj has joined #ste||ar
how do I get interactive access to a node in rostam. I don't see screen anymore :/
karame78 has quit [Remote host closed the connection]
kale has joined #ste||ar
kale_ has joined #ste||ar
kale has quit [Ping timeout: 250 seconds]
hkaiser has joined #ste||ar
Hashmi has joined #ste||ar
Abhishek09 has joined #ste||ar
Abhishek09 has quit [Remote host closed the connection]
kale_ has quit [Quit: Leaving]
kale_ has joined #ste||ar
heller1, how do I do roofline analysis with PAPI?
you don't
I went through the link you sent. It looks like I need to first find the theoretical maximum and then add some macros to your code to find gflops in your application. After this, you compare
nikunj97: step #1: determine the peak bandwidth of your system. step #2: determine the peak flops performance of your system
well, for a stencil that's simple enough
anyways, let's get the roofline first
once you've done step #1 and step #2, you can plot the roofline as the function `min(AI * peak_bw, peak_flops)` where AI stands for arithmetic intensity
which is the unit on your x-axis
arithmetic intensity is a metric that tells you how many flops per byte your system can achieve
sorry, misformulated it
the arithmetic intensity is the unit that characterizes your workload
why is that on the y-axis then?
ohh that's a min there
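The function quoted above can be written down directly. This is a minimal sketch; the bandwidth and compute peaks used in the loop are placeholder values, not measurements of any particular machine.

```python
def roofline(ai, peak_bw, peak_flops):
    """Attainable GFLOP/s at arithmetic intensity `ai` (FLOP/byte).

    peak_bw is in GB/s, peak_flops in GFLOP/s.
    """
    return min(ai * peak_bw, peak_flops)


# Placeholder peaks: 40 GB/s memory bandwidth, 41.6 GFLOP/s compute
for ai in (0.125, 1.0, 4.0):
    print(f"AI = {ai:5.3f} FLOP/B -> {roofline(ai, 40.0, 41.6):.1f} GFLOP/s")
```

Low intensities land on the sloped (bandwidth) part of the curve, high intensities on the flat (compute) part.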
if you have a bandwidth bound problem, you have a low arithmetic intensity, you need to transfer more memory to your ALUs
How do I transfer more memory to ALU?
anyway, let me try step #1 and #2 before I ask any more doubts
for example, the following calculation (assuming all floats): `a = b + c;` requires two loads and one store, an equivalent of 12 bytes, for 1 floating point operation, that means that your arithmetic intensity is 1/12
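The arithmetic in that example can be checked with a small helper. This is a sketch with a hypothetical function name; it counts every load and store as main memory traffic and ignores caches and write-allocate effects, which is an assumption.

```python
def arithmetic_intensity(flops, loads, stores, bytes_per_element):
    """FLOP per byte, counting each load and store as memory traffic."""
    bytes_moved = (loads + stores) * bytes_per_element
    return flops / bytes_moved


# a = b + c with 4-byte floats: 2 loads, 1 store, 1 floating point add
ai = arithmetic_intensity(flops=1, loads=2, stores=1, bytes_per_element=4)
print(ai)  # 1/12 FLOP per byte
```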
you don't transfer more memory to your ALU
this is a fixed unit (determined by the peak bandwidth)
what you need to do is to increase the number of floating point operations per memory load/store
this can be done by exploiting your cache organization, intelligent prefetching, or a different algorithm
so, first complete step #1 and #2
for step #2, you can have different peak lines, as mentioned yesterday. One with scalar only operations, one with vectorization, one with vectorized FMA
and one additional one using all cores and vectorization
which module do I need to load to test this?
so I run an equivalent of nersc$ srun -n 4 -c 6 sde -knl -d -iform 1 -omix my_mix.out -i -global_region -start_ssc_mark 111:repeat -stop_ssc_mark 222:repeat -- foo.exe?
I have no idea what this command means
lol, I think I'm confusing myself and you
Google sent an email with updated time line for GSoC
kale_ has quit [Quit: Leaving]
kale_ has joined #ste||ar
Student proposal review period is now March 31-April 20.
And we got one additional week to review the proposals
nikunj97: for the peak flop performance, it is enough to consider the max frequency and the number of operations you can get through per cycle, no need to measure it
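As a rough sketch of that rule: peak performance is the clock frequency times the floating point operations retired per cycle, times the number of cores if all cores are counted. The function name and the numbers in the example are illustrative assumptions, not values for a specific CPU.

```python
def peak_gflops(freq_ghz, flops_per_cycle, cores=1):
    """Theoretical peak in GFLOP/s from frequency and per-cycle throughput."""
    return freq_ghz * flops_per_cycle * cores


# Assumed example: 2.0 GHz, 8 FLOP per cycle per core
print(peak_gflops(2.0, 8))           # single core
print(peak_gflops(2.0, 8, cores=4))  # four cores
```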
heller1, I was currently reading on stream benchmarks. You want me to run `gcc -fopenmp -D_OPENMP stream.c -o stream` on the source code, right?
and `-O3`
then take the value you got from the stream TRIAD and multiply it by 1.5, then you have a realistic number
diehlpk_mobile[m, I am currently working on my second draft. I have found better ways to implement the pip package and I am actively looking for me. Since you are busy with your work. You can directly read my second draft tomorrow instead of going through both.
hisilicon seems to be about twice as powerful wrt e5 in triad
akheir has quit [Read error: Connection reset by peer]
nikunj97: they make sense
any more information on this system?
the gist is all I got from running stream
is there anything I'm missing?
"The Hi1616 supports up to 512 GiB of quad-channel DDR4-2400 memory. This chip supports up to 2-way SMP with two ports supporting 96 Gb/s each."
nikunj97: you might need to watch out for CPU frequency throttling. Recall that modern CPUs/etc can slow themselves down when they get hot. This can mess up benchmarks!
jbjnr, aah! that makes sense
make sure nobody else is using the node you're on, and make sure you're not running your tests on a login node
I'm allocating myself a separate node
and running a script that runs the stream executable 10 times
and stores the triad result to a file
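A script along those lines could look like the sketch below. It assumes the standard STREAM text output, where the `Triad:` row lists the best rate in MB/s as its first number; the executable path, output file name and sample numbers are hypothetical.

```python
import re
import statistics
import subprocess

TRIAD_RE = re.compile(r"^Triad:\s+([0-9.]+)", re.MULTILINE)

# Hypothetical excerpt of STREAM output, used only to exercise the parser
SAMPLE_OUTPUT = """Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           41000.0     0.003950     0.003902     0.004100
Triad:          39874.2     0.006100     0.006019     0.006300
"""


def parse_triad_mbs(stream_output):
    """Return the Triad best rate (MB/s) from STREAM's text output."""
    match = TRIAD_RE.search(stream_output)
    if match is None:
        raise ValueError("no Triad line found in STREAM output")
    return float(match.group(1))


def run_stream(executable="./stream", repeats=10, out_file="triad.txt"):
    """Run STREAM `repeats` times, store each Triad rate, return the mean."""
    rates = []
    for _ in range(repeats):
        result = subprocess.run([executable], capture_output=True,
                                text=True, check=True)
        rates.append(parse_triad_mbs(result.stdout))
    with open(out_file, "w") as f:
        for rate in rates:
            f.write(f"{rate}\n")
    return statistics.mean(rates)


print(parse_triad_mbs(SAMPLE_OUTPUT))
```

Calling `run_stream()` on a dedicated compute node then gives the averaged Triad bandwidth; dividing by 1000 converts MB/s to GB/s.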
nikunj97: ahh, the arm64fx is the arm node that I gave you?
heller1, yes!
it's not related to the project I'm doing with, but I decided to write my benchmark such that we can reuse it on a64fx as well for our own project
our -> ste||ar
nikunj97: it is _not_ a arm64fx, the arm64fx is the one with SVE, the one that is going to get put into post-k (aka Fugaku)
what is it then?
entirely different machines
yes, but it is NOT a ARM64FX
why's the telegram group ARMFX64 then? shrug
I always forget ... you should have the information in the email thread where I gave you access...
the telegram group is about getting access to riken's machines, which use the armfx64
"give me a shout if you need access to a larger aarch64 machine"
should've read that right
diehlpk_mobile[m told me that our proposal was accepted by fujitsu
and that we should get access to it
aarch64 is the generic term for arm 64 bit architectures
the Hi1616 is one as well
yes. I was misled by the telegram group name
I should've seen /proc/cpuinfo
ct-clmsn has joined #ste||ar
let me remove those stream benchmarks claiming to be arm64fx then
you don't have to remove them, just give them their true name ;)
so it's a qualcomm falkor
something like that, yes
so now, I have the peak triad bandwidth
I multiply it by 1.5 to get a realistic number
what is the next step?
and now plot the roofline
or was it 1.5?
let me check the code again ;)
diehlpk has joined #ste||ar
heller1, they don't multiply by 1.5 anywhere
they report: avgtime[j] = avgtime[j]/(double)(NTIMES-1);
diehlpk, yt?
yeah, that's fine, leave out the multiplication
nikunj97, yes
so I report the average triad as bandwidth
diehlpk, did we not get our proposal accepted with fujitsu?
I thought they accepted our proposal and we were getting access to the a64fx machines
I assumed the same but they never came back to us
so no a64fx :/
I do not know, just sent him one more time a reminder
so you don't have a SVE capable CPU?
nan has joined #ste||ar
nan is now known as Guest9057
Guest9057 has quit [Remote host closed the connection]
nan1 has joined #ste||ar
bita has joined #ste||ar
kale has joined #ste||ar
kale_ has quit [Read error: Connection reset by peer]
gonidelis has joined #ste||ar
shahrzad has joined #ste||ar
shahrzad has quit [Remote host closed the connection]
shahrzad has joined #ste||ar
ahkeir1 has quit [Quit: Leaving]
ahkeir1 has joined #ste||ar
ahkeir1 has quit [Client Quit]
akheir has joined #ste||ar
gonidelis48 has joined #ste||ar
gonidelis has quit [Remote host closed the connection]
(T00000000/----------------.----/----------------) P--------/----------------.---- 16:17.53.391 [0000000000000003] <fatal> [ERR] thread_func: default thread_num:0 : caught boost::system::system_error: The paging file is too small for this operation to complete, aborted thread execution
run with --hpx:attach-debugger=exception
gonidelis, PRs related to documentation are very much appreciated. hkaiser would concur ;)
everybody: today is the 12'th birthday of HPX, btw
Happy Birthday HPX
hkaiser: woohoo! Awesome job! Congrats to you!
Let's start a new one from scratch :P
zao: way to go!
wow! 12 years of development...
zao: let's use javascript so it runs everywhere
WASM, you say?
tells me that we should publish the survey results asap
"12'th Birthday - What do People think?"
akheir1 has joined #ste||ar
what's still missing?
stmatengss has joined #ste||ar
akheir has quit [Ping timeout: 265 seconds]
woop, happy birthday HPX!
hkaiser: sounds like a good idea
simbergm: I still need to fix the images, didn't have time yet :/
hkaiser: you mean the labels?
doesn't have to be exactly on the birthday ;)
hkaiser: you need help with the images? I can probably paste something together quite quickly
it's just that one image, no?
I think so, yes
simbergm: I'm just running out of time, so if you could look into the labels I'd appreciate it very much
hkaiser: yep, I can take care of it
simbergm: thanks!
Hashmi has joined #ste||ar
Abhishek09 has joined #ste||ar
rtohid : we will install hpx by dnf or cmake in manylinux docker?
simbergm If you would like you could check if the changes are proper. I have completely removed :lines: and replaced them with :start-after: :end-before:
are these little patches merged with master directly? or are they gathered and merged in a large newer-version-like pull request?
gonidelis: nothing goes directly to master, ever
everything goes through PRs
hkaiser: With the need to keep a lua state around for a task in flight I now see how many tasks actually can be "in flight": I'm around 1000 Lua states in the moment which reflects exactly that. At that level in the moment it stabilizes and doesn't grow the pool of Lua States. I more and more have a feeling that we'd need a new kind of scripting language to handle this programming environment. it's neither Lua or
everyones fault - to me it looks rather as a new situation which would require that.
anyone - not everyones.
heh, running out of ideas, do you?
Not really
As long as it stabilizes it's not really a big problem. You just need the memory to keep around these Lua States.
how much memory does a lua state consume?
After creation in the moment ~500kb - 1MB
It depends how large your script base is
nod, understand
I'm fantasizing about a lua version tailored for HPX tbh.
akheir1 has quit [Read error: Connection reset by peer]
E.G. with our programming paradigm there is no reason why the static parts couldn't be shared.
akheir1 has joined #ste||ar
It's just crazy to have all the scripts around in 1000ish copies
Yorlik: yah
I don't know if it is possible for Lua States to have more shared data
it's the variables that need separation
We just need one per OS thread
that should be possible somehow, talk to the lua guys
Otherwise we'd have data sharing
Abhishek09 has quit [Remote host closed the connection]
I'll dig into that
one per os thread would assume the hpx threads don't move around
It's more to avoid false sharing
You don't want two threads trying to read the same ram
Even if it's const
nah, reading is not an issue
Then we 'd need only one copy :)
I'm thinking about sharing memory pages virtually
gonidelis has quit [Remote host closed the connection]
But that would probably explode because of addresses stored
Yorlik: premature optimization again
NP having 3-4 GB worth of Lua states around.
the lua scripting will overshadow everything else anyways
In the moment I'm processing ~150,000 messages on 10,000 objects per second in Lua. Not good enough still, imo.
gonidelis has joined #ste||ar
smaller memory footprint might help there
I might be able to optimize stuff later, when we have our systems prototyped.
In the moment every message has a vector of variants as argument pack.
Thats not exactly efficient
Over time I can make specialized, smaller messages.
depends on how large the variant is
40 bytes
A message variant over message types is 32 bytes
So - adding a stupid int adds 40 bytes ..
We have 40 byte bools ! lol
The id_types I put into the variant blew it up together with the strings
id_types are intrusive_ptr's essentially, so not more than a single pointer
I could precompile all strings and use a string hash instead.
These messages also go over the wire.
So I can't really make the id_types smaller
that shouldn't affect things
What is the minimum part of an id_type I really need to uniquely identify an object?
I'm pretty sure there's a ton of optimization possible with having this many lua states around. I just didn't have time for that yet. Too much to do in other areas and it's not a big problem yet. But it might become one.
The good news is, that most likely the current amount I use is an upper limit
does it scale linearly with the number of states?
The states do not cause issues except memory usage
When an object gets updated I grab a state from the pool and call the update function in Lua giving it the object and the mailbox.
The state sticks with the object until the updater exits. Then the mailbox is dumped and the state returned to the pool.
Since a task is a batch of objects, there is a limit to the amount of tasks in flight. We most likely can live with this memory consumption for quite a while.
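The grab-and-return pattern described above can be sketched as a small pool whose size only grows when more states are needed at the same time than are currently free. This is an illustration in Python rather than the actual C++/Lua code; the class name is hypothetical, a plain `dict` stands in for a Lua state, and the sketch is not thread-safe.

```python
from contextlib import contextmanager


class StatePool:
    """Hands out reusable states; creates a new one only when none is free."""

    def __init__(self, factory):
        self._factory = factory
        self._free = []
        self.created = 0  # total states ever created (the pool's high-water mark)

    @contextmanager
    def acquire(self):
        if self._free:
            state = self._free.pop()
        else:
            state = self._factory()
            self.created += 1
        try:
            yield state
        finally:
            # Return the state to the pool when the updater exits
            self._free.append(state)


pool = StatePool(factory=dict)  # dict is a stand-in for a Lua state
for _ in range(3):
    with pool.acquire() as state:
        state["updated"] = True
print(pool.created)  # sequential updates reuse a single state
```

The number of states created tracks the maximum number of updates in flight at once, which matches the observation that the pool stabilizes once the workload does.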
are those merely living on the server?
The states? They are just local, sure.
akheir1 has quit [Read error: Connection reset by peer]
Since they are essentially const and the same all around the cluster
akheir1 has joined #ste||ar
Object migration is planned for the stage after this milestone, when a local, single node scripted simulation is stable enough.
nan1 has joined #ste||ar
Just measured: killing 989 Lua States gave me 2.0 MB per state memory back.( measured in Debugger )
hkaiser the non-mpi version of HPX module on Summit is installed and well tested. So Summit now has distributed and serial HPX 1.4.1.
weilewei, nice
:) Yea!
nan1 has quit [Ping timeout: 240 seconds]
Abhishek09 here!
rtohid: we will prefer to install hpx by cmake or dnf package ? dnf package will require same gcc version to compile
what's the best technique for running the clang lint tool provided with phylanx?
i've ancient code that's blocked on my poor code formatting
cmake, boost by binary tar, pybind, blaze, blaze tensor by cmake & git, libjemalloc by sudo,
Abhishek09 CMake
we just need to follow build instructions on Phylanx's Wiki
diehlpk has quit [Remote host closed the connection]
diehlpk has joined #ste||ar
diehlpk has joined #ste||ar
diehlpk has quit [Changing host]
That means dnf is not involved in our project rtohid diehlpk
diehlpk has quit [Remote host closed the connection]
diehlpk has joined #ste||ar
rtohid nikunj97 manylinux is supported on travis/Appveyor ci but not circle ci but cibuildwheel does
stmatengss has left #ste||ar [#ste||ar]
Abhishek09 not to build Phylanx itself.
Abhishek09, The final solution can not involve the dnf package
Abhishek09, if you use manylinux, you will have dnf. But it'll be an older version with old libraries. So you can't rely on dnf
rtohid `not to build Phylanx itself` means?
if you're going the manylinux route, you will have to build everything from scratch
including a gnu gcc compiler
nikunj97 yes centos 8 supports dnf but 5 doesn't
once you have the gnu compiler toolchain, you will have to build dependencies followed by phylanx
so in short, dnf is not allowed. The final solution must not have dnf
dnf is not allowed nikunj97 Why?
old gcc?
older version of dnf implies older libraries
akheir1 has quit [Remote host closed the connection]
Abhishek09, As I told you before, I would start by using the dnf package of hpx and try to build phylanx and its dependencies within a whl file
akheir1 has joined #ste||ar
avah has quit [Remote host closed the connection]
Once you have done this as a step towards the main goal, you can remove the dnf package and take care to compile hpx
First step could be: use hpx's dnf package, build all dependencies, and phylanx itself
use the build package on a fresh docker file to test things
nikunj97 manylinux doesnt support circle ci
In the second step, once we have a working solution for this. You will add hpx to the build chain of the pip package
cibuildwheel does nikunj97
I believe once you figured out to build and ship the dependencies and phylanx, it will be easy to do the same for hpx
it's worth investigating cibuildwheel as well then
To summarize, the final package can not use dnf or apt-get at all
A first proof of concept could use dnf to install hpx
akheir1 has quit [Read error: Connection reset by peer]
cibuilds is a later step in the project imo
akheir1 has joined #ste||ar
Yes , i know it uses the same docker as manylinux
yes, I agree. I would not over-engineer the solution
Abhishek09, you should focus on the first few steps in order to move to the later ones
it's always easier to formulate later steps when you have some initial proof of concept
Abhishek09, yes, I agree. Just having a pip package compiled in one docker image and copy it to a fresh one and everything works, would be a huge success
that's why everyone is suggesting you to first figure out the part of building a pip package. Once you succeed in that, we can explore the ideas of ci integration
but that's when you've successfully made a pip package which compiles everything on manylinux and finally creates a wheel using auditwheel (or the likes)
nikunj97: but rtohid said to follow build instructions on Phylanx's Wiki
I don't see any confusions
hkaiser: It's kinda interesting how the creation of Lua States depends a lot on the properties of the workload and the corresponding number of tasks in flight. I added a small busy loop to object creation to slow it down a bit and thus allow intermediary created Lua States to finish, so they could be reused. That drastically reduced the number of created Lua States. I think there's a lot to learn here.
Also it seems my artificial tests do not really reflect a realistic workload. I'll have to do a lot of experimenting.
It seems there are phases of bursts where new states get created and then it stabilizes again before it comes to a stable situation overall.
When all objects were created and message creation was in a steady state, after purging all engines it ran with just 4 - one per thread.
diehlpk: Why we use dnf package rather than using make install?
by cmake
he's not forcing you to use dnf
he's just telling you, if you want to use dnf package initially, you may do so
but the final product should not make use of dnf
that means i can apply any option dnf or cmake
Abhishek09, I was thinking that getting HPX to work will take a long time.
So you could use dnf to install hpx and just compile all dependencies of Phylanx and itself
So you could deliver a first package where you use HPX from dnf
and compile only phylanx and its dependencies. So we have a working package sooner
I think if you figured out how to compile and ship for example pybind11, it will be easier to compile HPX using pip
After we have this package, you can remove dnf install
and build hpx
I believe that doing baby steps is the way to go
It is just my opinion and you can propose whatever you want
I just want to say doing incremental steps is better
However, the final solution should not use dnf in your proposal
Yorlik: you might request a lua state only once the thread actually starts running, not at task creation
How could I do that ? I request the states immediately before using them.
diehlpk: using dnf for binaries is not cross platform and makes the pip almost useless
nan1 has joined #ste||ar
hkaiser: One problem are Callbacks which use a Lua state from a c++ function which is called from Lua. Unfortunately I cannot just pass around the LuaState, since it's in actions which might run remotely.
hkaiser, yes, I know and that is exactly what I want as a first step
Yorlik: if you associate the lua state with the hpx thread this shouldn't be a problem
I think we should start with the easiest pip package, which might be useless in itself, but shows that things work
diehlpk: ok
hkaiser: I had that and it exploded, when the task migrated to another thread. The states are bound to the tasks or I get access errors
I think if we can use dnf install, build phylanx and its dependencies in one Fedora docker container and generate a pip package
Yorlik: hpx threads have a 64 bit value you can use for your purposes
Copy this pip package to a fresh Fedora docker container and install it and run any phylanx example
hpx::threads::set_state(std::size_t) or something similar (and size_t get_state())
This would be a major step
Like using a lua state with two tasks at the same time? That might be possible.
so you attach your lua state to the hpx thread which will carry it with it
next step would be remove dnf install and compile hpx
Oh - I see.
no, just one task
Once we have done this, we could think on how to use tools to make it more useful
this way the c++ code called from lua will use the same lua state as the surrounding hpx thread
In the moment I am using several states per task, one per object update - that could be fixed.
hkaiser, I just said it is not a good way to use the build system for official pip packages from the beginning on
diehlpk: ok
We should do baby steps in this direction
So I'd use the same state for every update run in the batch of the parallel loop
hkaiser: Is there something like task local storage?
sure, set_state/get_state
OK - I'll look that up
diehlpk: Instructions unclear, I removed dnf :P
hkaiser: That function only accepts the id_type and the size_t - What I'd need to do is store a unique ptr that gets automagically destroyed when the task finishes - I can't see how I would do that here. I'd have no way to figure out how long the task lives.
Yorlik: it's thread_id_type, not id_type
ahh, but good point
I need to create an example for this
Yes, but however - What would that size_t help me?
The data needs to be destroyed at the end of the task.
So the deleter of the unique_ptr to the lua state can kick in
and give it back to the pool
that's what I need to create an example for
OK. Thanks a ton !
bita: after the retiling you'd need to create a new annotation which needs to have all of the new tiles in it
gonidelis has quit [Remote host closed the connection]
hkaiser, annotate_d does not need to show all of the meta data. I don't get your point :?
gonidelis has joined #ste||ar
annotate_d produces a annotation that has information for all the tiles,
hkaiser I was asking about merging these patches in some milestone version or sth or if they are just getting merged directly to master...
diehlpk: do you plan to join Maxwells defense dryrun now?
gonidelis: do you think this is needed?
diehlpk_mobile[m: ^^
no i dont. just asking how things work in your team ;)
heller1, about roofline analysis. Peak performance of a single core will be (2.6)*(16) GFLOPs/s where 2.6GHz is cpu frequency and 16 instructions are executed per cycle
and from stream benchmarks we know that 4GB/s is the memory bandwidth
the triad operation is a[j] = b[j]+scalar*c[j];
so it will require 2 loads i.e. b[j] and c[j] and load and store for a[j]
More like 40, no?
was it 40, let me check
ohh yeah 39.xxGB/s
my bad
16 is vectorized fma?
CPUs have 16 instructions per cycle (as E5-2600v3 series CPUs have AVX2.0 and FMA instruction sets that at their theoretical maximum are two times larger than that of E5-2600v1 and E5-2600v2)
so peak core performance is about 41.6GFLOPs/s
and processor's peak performance is 1664GFLOPs/s
so simply AVX2, no FMA will be 20.8GFLOPs/s
Ok, as mentioned earlier, it would be good to have all those separate "roofs"
yes, I wanted to confirm this with you
so these will be horizontal lines on plotted graph
1. Scalar instructions, single core 2. Vector instructions, no fma, single core 3. Vector FMA instructions, single core 4. The number of 3. multiplied with the number of cores
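The four ceilings in that list can be tabulated from the numbers mentioned in this discussion (2.6 GHz, 16 FLOP per cycle with vector FMA, half of that without FMA). Two values below are assumptions and not stated in the chat: 4 double precision lanes per AVX2 register, used to derive the scalar roof, and a core count of 40, implied by 1664 / 41.6.

```python
FREQ_GHZ = 2.6             # from the discussion
FMA_FLOPS_PER_CYCLE = 16   # vectorized FMA, from the discussion
SIMD_LANES = 4             # assumption: 4 doubles per 256-bit AVX2 register
CORES = 40                 # assumption: implied by 1664 / 41.6


def ceilings(freq_ghz, fma_flops_per_cycle, simd_lanes, cores):
    """Return the compute roofs in GFLOP/s, keyed by a descriptive label."""
    vector_fma = freq_ghz * fma_flops_per_cycle
    vector = vector_fma / 2        # no FMA: one flop per lane instead of two
    scalar = vector / simd_lanes   # assumption: scalar uses a single lane
    return {
        "scalar, single core": scalar,
        "vector, no FMA, single core": vector,
        "vector FMA, single core": vector_fma,
        "vector FMA, all cores": vector_fma * cores,
    }


roofs = ceilings(FREQ_GHZ, FMA_FLOPS_PER_CYCLE, SIMD_LANES, CORES)
for label, gflops in roofs.items():
    print(f"{label:30s} {gflops:8.1f} GFLOP/s")
```

Each value becomes one horizontal line on the plot; the ratio between neighbouring lines is the best-case gain from that architectural feature.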
ok, got it. How do I get the intersecting line (memory bandwidth one)?
how do I get to know where it'll intersect (y or x axis)?
So draw three graphs with those different rooflines. One graph per machine to test
but what about the memory bandwidth line?
Hashmi has quit [Quit: Connection closed for inactivity]
Yes, sounds reasonable
Yes, try to plot a proper roofline
Each of the horizontal lines will hint you at the benefit of each architectural improvement
What do you think?
f(x) = min(x*peak_bw, peak_flops), where x is the arithmetic intensity
yes, I think I've also figured out the memory bandwidth line. I take 2 operations from stream benchmarks with their arithmetic intensity
and then I have their memory bandwidth as well
so I multiply it with the corresponding memory bandwidth and compare with processor's peak
sorry, riot seems to be completely borked right now
is it not the same as your function?
I think it is. If I know 2 corresponding points of a line, I can plot the line itself. I have both arithmetic intensity and its corresponding peak bw
Use pen and paper and try to figure it out yourself
ok, let me try
x is the unknown
You have two functions that intersect, f(x) = a*x and g(x)= b
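For reference, setting a*x = b gives x = b/a, the arithmetic intensity at which the bandwidth line meets a compute roof. A minimal sketch, with a hypothetical function name and placeholder peaks:

```python
def ridge_point(peak_bw, peak_flops):
    """Arithmetic intensity (FLOP/byte) where peak_bw * x equals peak_flops."""
    if peak_bw <= 0:
        raise ValueError("peak_bw must be positive")
    return peak_flops / peak_bw


# Placeholder peaks: 40 GB/s bandwidth, 20 GFLOP/s compute roof
print(ridge_point(40.0, 20.0))
```

Workloads left of this point are limited by bandwidth, workloads right of it by compute. Each compute roof has its own ridge point.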
gonidelis: we do a release every couple of months that contains all changes that have accumulated since the last release
this would mean that the memory bandwidth line passes through the origin
Basic math rules apply
It absolutely does
nikunj97: zero cores use zero memory bandwidth ;-)
Why shouldn't it?
hold on, x axis has arithmetic intensity, right?
Abhishek09 has quit [Remote host closed the connection]
nothing, I was confused by your analogy. I got what you're saying now
always remember, we use math as a way to express our models ;)
HPX is not magic, it's C++, roofline is not witchcraft, it's math ;)
anyways, zero could either mean that you have no flops or that your bytes go to infinity, both lead to the number of flops/s to be zero
weilewei has quit [Ping timeout: 240 seconds]
so there's no reason why it should not start at zero
heller1, we use stream triad benchmark
it does 2FLOP/iter and moves 24B/iter
that gives us arithmetic intensity of 1/12 FLOP/B
we used that to determine the maximum bandwidth
but sure, you can use those values to validate your graph
bonus point: where will the triad benchmark result sit at?
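One way to check the bonus question numerically, as a sketch: evaluate the roofline at the triad's arithmetic intensity. The bandwidth is rounded to 40 GB/s from the 39.xx GB/s mentioned earlier, and 41.6 GFLOP/s is the single core vector FMA peak from this discussion; treat both as approximate inputs.

```python
FLOPS_PER_ITER = 2    # one multiply and one add per triad iteration
BYTES_PER_ITER = 24   # as stated above: 24 bytes moved per iteration

peak_bw = 40.0        # GB/s, rounded from the measured 39.xx GB/s
peak_flops = 41.6     # GFLOP/s, single core vector FMA peak

ai = FLOPS_PER_ITER / BYTES_PER_ITER
bandwidth_limit = ai * peak_bw
attainable = min(bandwidth_limit, peak_flops)
bound = "memory" if bandwidth_limit < peak_flops else "compute"

print(f"AI = {ai:.4f} FLOP/B, attainable = {attainable:.2f} GFLOP/s ({bound} bound)")
```

Under these inputs the triad lands on the sloped bandwidth line, far left of the ridge point, which is expected since the bandwidth figure was derived from that same benchmark.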