<nikunj97>
also I feel that yesterday's results may be skewed
<heller1>
so first, draw a graph with the roofline
<heller1>
shrug
<heller1>
start with the basics first
<nikunj97>
ok
<heller1>
also, draw the roofline with different peaks
<nikunj97>
ok, let me try it
<heller1>
different max bandwidth (main memory, cache levels), different max compute (no vectorization, vectorization, FMA, single threaded, all threads, etc)
<heller1>
since the stencil's metric is MLUP/S, you should convert GFLOP/S to MLUP/S
<nikunj97>
so for a 5 point stencil, mlups = gflops/5 ?
<heller1>
no
<heller1>
you need to calculate the arithmetic intensity
<heller1>
as a first step
<heller1>
well, keep the gflops at the first step
<heller1>
i'll explain the conversion later
<heller1>
setup the roofline for the ARM64FX2 first
<nikunj97>
ok
<jbjnr>
nikunj97: which of the stencil examples are you improving? It sounds like you are doing something very useful and worthwhile.
<nikunj97>
jbjnr, I'm working with heller1's 2d stencil benchmark from one of his lectures
<jbjnr>
is it part of the tutorials?
<jbjnr>
(in the tutorials repo)
<nikunj97>
yes
<jbjnr>
ok great.
<jbjnr>
we have a plan to redo the tutorial material for the next course and it would be lovely to have a simd version of the stencil code to add to the material.
<nikunj97>
jbjnr, I'm trying my best :)
<heller1>
nikunj97: yeah, would be nice if you could write a few pages about the performance modelling ;)
<heller1>
lessons learnt, optimization etc
<nikunj97>
well this is all for a lab based project at my university. As I told you, it's a collaboration between iitr and jsc, so they won't let me off without a good 10-15 page report ;)
<jbjnr>
we have a gsoc project to add simd stuff, couldn't you do that as well and get paid for it too?
<jbjnr>
jsc = julich?
<nikunj97>
I have an internship this summer. Also, I'm a mentor this gsoc so I won't be able to apply as a student.
<nikunj97>
and yes, jsc is julich supercomputing center
<jbjnr>
k
<nikunj97>
but I can look into the project when I'm free and add stuff there
nikunj has quit [Read error: Connection reset by peer]
nikunj has joined #ste||ar
<nikunj97>
how do I get interactive access to a node on rostam? I don't see screen anymore :/
karame78 has quit [Remote host closed the connection]
kale has joined #ste||ar
kale_ has joined #ste||ar
kale has quit [Ping timeout: 250 seconds]
hkaiser has joined #ste||ar
Hashmi has joined #ste||ar
Abhishek09 has joined #ste||ar
Abhishek09 has quit [Remote host closed the connection]
kale_ has quit [Quit: Leaving]
kale_ has joined #ste||ar
<nikunj97>
heller1, how do I do roofline analysis with PAPI?
<heller1>
you don't
<nikunj97>
I went through the link you sent. It looks like I need to first find the theoretical maximum and then add some macros to your code to find the gflops of the application. After this, you compare the two
<heller1>
nikunj97: step #1: determine the peak bandwidth of your system. step #2: determine the peak flops performance of your system
<heller1>
well, for a stencil that's simple enough
<heller1>
anyways, let's get the roofline first
<nikunj97>
yes
<heller1>
once you've done step #1 and step #2, you can plot the roofline as the function `min(AI * peak_bw, peak_flops)` where AI stands for arithmetic intensity
<heller1>
which is the unit on your x-axis
<heller1>
arithmetic intensity is a metric that tells you how many flops per byte your system can achieve
<heller1>
sorry, misformulated it
<heller1>
the arithmetic intensity is the unit that characterizes your workload
<nikunj97>
why is that on the y-axis then?
<heller1>
FLOPS
<nikunj97>
ohh that's a min there
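For reference, the function heller1 gave, written out; the symbol names here are illustrative, not from the chat:

```latex
% Roofline: attainable performance P against arithmetic intensity AI.
% x-axis: AI [FLOP/byte], y-axis: P [FLOP/s].
P(\mathrm{AI}) = \min\bigl(\mathrm{AI} \cdot BW_{\mathrm{peak}},\; P_{\mathrm{peak}}\bigr)
```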
<heller1>
if you have a bandwidth-bound problem, you have a low arithmetic intensity; you need to transfer more memory to your ALUs
<nikunj97>
How do I transfer more memory to ALU?
<nikunj97>
anyway, let me try step #1 and #2 before I ask any more doubts
<heller1>
for example, the following calculation (assuming all floats): `a = b + c;` requires two loads and one store, an equivalent of 12 bytes, for 1 floating point operation, that means that your arithmetic intensity is 1/12
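The same example as a worked equation (assuming 4-byte floats, as heller1 states):

```latex
% a = b + c: two loads + one store = 3 floats = 12 bytes moved per 1 FLOP
\mathrm{AI} = \frac{1\ \mathrm{FLOP}}{3 \times 4\ \mathrm{B}} = \frac{1}{12}\ \mathrm{FLOP/B}
```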
<heller1>
you don't transfer more memory to your ALU
<heller1>
this is a fixed unit (determined by the peak bandwidth)
<nikunj97>
gotcha
<heller1>
what you need to do is to increase the number of floating point operations per memory load/store
<heller1>
this can be done by exploiting your cache organization, intelligent prefetching, or a different algorithm
<heller1>
so, first complete step #1 and #2
<heller1>
for step #2, you can have different peak lines, as mentioned yesterday. One with scalar only operations, one with vectorization, one with vectorized FMA
<heller1>
and one additional one using all cores and vectorization
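A minimal sketch of how these rooflines could be tabulated for plotting (e.g. with gnuplot); all peak numbers are placeholders echoing the E5 figures discussed later in this log, not measurements:

```cpp
#include <algorithm>
#include <cstdio>

int main()
{
    // Placeholder peaks in GFLOP/s; substitute measured values per machine.
    struct Roof { const char* name; double peak_gflops; };
    Roof const roofs[] = {
        {"scalar, single core", 2.6},          // assuming 1 FLOP/cycle at 2.6 GHz
        {"vectorized, single core", 20.8},     // AVX2, no FMA
        {"vectorized FMA, single core", 41.6},
        {"vectorized FMA, all cores", 1664.0},
    };
    double const peak_bw = 40.0;  // GB/s, e.g. from the STREAM TRIAD result

    // Emit a gnuplot-friendly table: one row per arithmetic intensity,
    // one column per roof, each entry min(AI * peak_bw, peak_flops).
    std::printf("AI");
    for (auto const& roof : roofs)
        std::printf("\t%s", roof.name);
    std::printf("\n");

    for (double ai = 1.0 / 64; ai <= 64.0; ai *= 2.0)
    {
        std::printf("%g", ai);
        for (auto const& roof : roofs)
            std::printf("\t%g", std::min(ai * peak_bw, roof.peak_gflops));
        std::printf("\n");
    }
}
```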
<nikunj97>
which module do I need to load to test this?
<heller1>
none?
<nikunj97>
so I run an equivalent of nersc$ srun -n 4 -c 6 sde -knl -d -iform 1 -omix my_mix.out -i -global_region -start_ssc_mark 111:repeat -stop_ssc_mark 222:repeat -- foo.exe?
<heller1>
I have no idea what this command means
<nikunj97>
lol, I think I'm confusing myself and you
<diehlpk_mobile[m>
<diehlpk_mobile[m "> <@diehlpk:matrix.org> In your "> Google sent an email with updated time line for GSoC
kale_ has quit [Quit: Leaving]
kale_ has joined #ste||ar
<diehlpk_mobile[m>
Student proposal review period is now March 31-April 20.
<diehlpk_mobile[m>
And we got one additional week to review the proposals
<heller1>
yay
<heller1>
nikunj97: for the peak flop performance, it is enough to consider the max frequency and the number of operations you can get through per cycle, no need to measure it
<nikunj97>
heller1, I'm currently reading up on the stream benchmarks. You want me to run `gcc -fopenmp -D_OPENMP stream.c -o stream` on the source code, right?
<heller1>
and `-O3`
<nikunj97>
right!
<heller1>
then take the value you got from the stream TRIAD and multiply it by 1.5, then you have a realistic number
<kale_>
diehlpk_mobile[m, I am currently working on my second draft. I have found better ways to implement the pip package and I am actively looking into them. Since you are busy with your work, you can directly read my second draft tomorrow instead of going through both.
<nikunj97>
the hisilicon seems to be about twice as fast as the e5 in triad
akheir has quit [Read error: Connection reset by peer]
<heller1>
nikunj97: they make sense
<heller1>
any more information on this system?
<nikunj97>
the gist is all I got from running stream
<nikunj97>
is there anything I'm missing?
<heller1>
"The Hi1616 supports up to 512 GiB of quad-channel DDR4-2400 memory. This chip supports up to 2-way SMP with two ports supporting 96 Gb/s each."
<jbjnr>
nikunj97: you might need to watch out for CPU frequency throttling. Recall that modern CPUs/etc can slow themselves down when they get hot. This can mess up benchmarks!
<nikunj97>
jbjnr, aah! that makes sense
<jbjnr>
make sure nobody else is using the node you're on, and make sure you're not running your tests on a login node
<nikunj97>
I'm allocating myself a separate node
<nikunj97>
and running a script that runs the stream executable 10 times
<nikunj97>
and stores the triad result to a file
<heller1>
nikunj97: ahh, the arm64fx is the arm node that I gave you?
<nikunj97>
heller1, yes!
<nikunj97>
it's not related to the project I'm working on, but I decided to write my benchmark such that we can reuse it on a64fx as well for our own project
<nikunj97>
our -> ste||ar
<heller1>
nikunj97: it is _not_ an arm64fx, the arm64fx is the one with SVE, the one that is going to be put into post-k (aka Fugaku)
<nikunj97>
what is it then?
<heller1>
entirely different machines
<heller1>
yes, but it is NOT a ARM64FX
<nikunj97>
why's the telegram group ARMFX64 then? shrug
<heller1>
I always forget ... you should have the information in the email thread where I gave you access...
<heller1>
the telegram group is about getting access to riken's machines, which use the armfx64
<nikunj97>
"give me a shout if you need access to a larger aarch64 machine"
<nikunj97>
should've read that right
<nikunj97>
diehlpk_mobile[m told me that our proposal was accepted by fujitsu
<nikunj97>
and that we should get access to it
<heller1>
aarch64 is the generic term for arm 64 bit architectures
<heller1>
the Hi1616 is one as well
<nikunj97>
yes. I was misled by the telegram group name
<nikunj97>
I should've seen /proc/cpuinfo
ct-clmsn has joined #ste||ar
<nikunj97>
let me remove those stream benchmarks claiming to be arm64fx then
<heller1>
you don't have to remove them, just give them their true name ;)
<nikunj97>
so it's a qualcomm falkor
<heller1>
something like that, yes
<nikunj97>
so now, I have the peak triad bandwidth
<nikunj97>
I multiply it by 1.5 to get a realistic number
<nikunj97>
what is the next step?
<heller1>
and now plot the roofline
<heller1>
or was it 1.5?
<heller1>
let me check the code again ;)
diehlpk has joined #ste||ar
<nikunj97>
heller1, they don't multiply by 1.5 anywhere
<nikunj97>
they report: avgtime[j] = avgtime[j]/(double)(NTIMES-1);
<nikunj97>
diehlpk, yt?
<heller1>
yeah, that's fine, leave out the multiplication
<diehlpk>
nikunj97, yes
<nikunj97>
so I report the average triad as bandwidth
<nikunj97>
diehlpk, did we not get our proposal accepted with fujitsu?
<nikunj97>
I thought they accepted our proposal and we were getting access to the a64fx machines
<diehlpk>
I assumed the same but they never came back to us
<nikunj97>
so no a64fx :/
<diehlpk>
I do not know, I just sent him a reminder one more time
<heller1>
so you don't have a SVE capable CPU?
nan has joined #ste||ar
nan is now known as Guest9057
Guest9057 has quit [Remote host closed the connection]
nan1 has joined #ste||ar
bita has joined #ste||ar
kale has joined #ste||ar
kale_ has quit [Read error: Connection reset by peer]
gonidelis has joined #ste||ar
shahrzad has joined #ste||ar
shahrzad has quit [Remote host closed the connection]
shahrzad has joined #ste||ar
ahkeir1 has quit [Quit: Leaving]
ahkeir1 has joined #ste||ar
ahkeir1 has quit [Client Quit]
akheir has joined #ste||ar
gonidelis48 has joined #ste||ar
gonidelis has quit [Remote host closed the connection]
<Yorlik>
(T00000000/----------------.----/----------------) P--------/----------------.---- 16:17.53.391 [0000000000000003] <fatal> [ERR] thread_func: default thread_num:0 : caught boost::system::system_error: The paging file is too small for this operation to complete, aborted thread execution
<Yorlik>
<unknown>
<heller1>
heh
<heller1>
run with --hpx:attach-debugger=exception
<nikunj97>
gonidelis, PRs related to documentation are very much appreciated. hkaiser would concur ;)
<hkaiser>
absolutely!
<hkaiser>
everybody: today is the 12th birthday of HPX, btw
<bita>
Happy Birthday HPX
<heller1>
hkaiser: woohoo! Awesome job! Congrats to you!
<zao>
Yay!
<zao>
Let's start a new one from scratch :P
<hkaiser>
zao: way to go!
<gonidelis>
wow! 12 years of development...
<K-ballo>
zao: let's use javascript so it runs everywhere
<zao>
WASM, you say?
<hkaiser>
this tells me that we should publish the survey results asap
<hkaiser>
"12'th Birthday - What do People think?"
akheir1 has joined #ste||ar
<heller1>
indeed
<heller1>
what's still missing?
stmatengss has joined #ste||ar
akheir has quit [Ping timeout: 265 seconds]
<simbergm>
woop, happy birthday HPX!
<simbergm>
hkaiser: sounds like a good idea
<hkaiser>
simbergm: I still need to fix the images, didn't have time yet :/
<simbergm>
hkaiser: you mean the labels?
<hkaiser>
yah
<simbergm>
doesn't have to be exactly on the birthday ;)
<simbergm>
hkaiser: you need help with the images? I can probably paste something together quite quickly
<simbergm>
it's just that one image, no?
<heller1>
I think so, yes
<hkaiser>
simbergm: I'm just running out of time, so if you could look into the labels I'd appreciate it very much
<simbergm>
hkaiser: yep, I can take care of it
<hkaiser>
simbergm: thanks!
Hashmi has joined #ste||ar
Abhishek09 has joined #ste||ar
<Abhishek09>
rtohid : will we install hpx by dnf or by cmake in the manylinux docker?
<gonidelis>
simbergm If you would like you could check if the changes are proper. I have completely removed :lines: and replaced them with :start-after: :end-before:
<gonidelis>
are these little patches merged with master directly? or are they gathered and merged in a large newer-verison-like pull request?
<hkaiser>
gonidelis: nothing goes directly to master, ever
<hkaiser>
everything goes through PRs
<Yorlik>
hkaiser: With the need to keep a lua state around for a task in flight I now see how many tasks actually can be "in flight": I'm at around 1000 Lua states at the moment, which reflects exactly that. At that level it stabilizes and doesn't grow the pool of Lua States. I more and more have a feeling that we'd need a new kind of scripting language to handle this programming environment. It's neither Lua's nor
<Yorlik>
everyone's fault - to me it looks rather like a new situation which would require that.
<Yorlik>
anyone's - not everyone's.
<hkaiser>
heh, running out of ideas, do you?
<Yorlik>
Not really
<Yorlik>
As long as it stabilizes it's not really a big problem. You just need the memory to keep these Lua States around.
<hkaiser>
how much memory does a lua state consume?
<Yorlik>
After creation, ~500kb - 1MB at the moment
<Yorlik>
It depends on how large your script base is
<hkaiser>
nod, understand
<Yorlik>
I'm fantasizing about a lua version tailored for HPX tbh.
akheir1 has quit [Read error: Connection reset by peer]
<Yorlik>
E.G. with our programming paradigm there is no reason why the static parts couldn't be shared.
akheir1 has joined #ste||ar
<Yorlik>
It's just crazy to have all the scripts around in 1000ish copies
<hkaiser>
Yorlik: yah
<Yorlik>
I don't know if it is possible for Lua States to have more shared data
<hkaiser>
it's the variables that need separation
<Yorlik>
Yes
<Yorlik>
We just need one per OS thread
<hkaiser>
that should be possible somehow, talk to the lua guys
<Yorlik>
Otherwise we'd have data sharing
Abhishek09 has quit [Remote host closed the connection]
<Yorlik>
I'll dig into that
<hkaiser>
one per os thread would assume the hpx threads don't move around
<Yorlik>
Yes
<Yorlik>
It's more to avoid false sharing
<Yorlik>
You don't want two threads trying to read the same ram
<Yorlik>
Even if it's const
<hkaiser>
nah, reading is not an issue
<Yorlik>
Then we 'd need only one copy :)
<Yorlik>
I'm thinking about sharing memory pages virtually
gonidelis has quit [Remote host closed the connection]
<Yorlik>
But that would probably explode because of addresses stored
<hkaiser>
Yorlik: premature optimization again
<Yorlik>
lol
<Yorlik>
NP having 3-4 GB worth of Lua states around.
<hkaiser>
the lua scripting will overshadow everything else anyways
<Yorlik>
Yup
<Yorlik>
At the moment I'm processing ~150,000 messages on 10,000 objects per second in Lua. Still not good enough, imo.
gonidelis has joined #ste||ar
<hkaiser>
smaller memory footprint might help there
<Yorlik>
I might be able to optimize stuff later, when we have our systems prototyped.
<Yorlik>
At the moment every message has a vector of variants as its argument pack.
<Yorlik>
That's not exactly efficient
<Yorlik>
Over time I can make specialized, smaller messages.
<hkaiser>
depends on how large the variant is
<Yorlik>
40 bytes
<Yorlik>
A message variant over message types is 32 bytes
<Yorlik>
So - adding a stupid int adds 40 bytes ..
<Yorlik>
We have 40 byte bools ! lol
<Yorlik>
The id_types I put into the variant blew it up together with the strings
<hkaiser>
id_types are intrusive_ptr's essentially, so not more than a single pointer
<Yorlik>
I could precompile all strings and use a string hash instead.
<Yorlik>
These messages also go over the wire.
<Yorlik>
So I can't really make the id_types smaller
<hkaiser>
that shouldn't affect things
<Yorlik>
What is the minimum part of an id_type I really need to uniquely identify an object?
<Yorlik>
I'm pretty sure there's a ton of optimization possible with having this many lua states around. I just didn't have time for that yet. Too much to do in other areas and it's not a big problem yet. But it might become one.
<Yorlik>
The good news is, that most likely the current amount I use is an upper limit
<heller1>
does it scale linearly with the number of states?
<Yorlik>
The states do not cause issues except memory usage
<Yorlik>
When an object gets updated I grab a state from the pool and call the update function in Lua giving it the object and the mailbox.
<Yorlik>
The state sticks with the object until the updater exits. Then the mailbox is dumped and the state returned to the pool.
<Yorlik>
Since a task is a batch of objects, there is a limit to the number of tasks in flight. We can most likely live with this memory consumption for quite a while.
<heller1>
are those merely living on the server?
<Yorlik>
The states? They are just local, sure.
akheir1 has quit [Read error: Connection reset by peer]
<Yorlik>
Since they are essentially const and the same all around the cluster
akheir1 has joined #ste||ar
<Yorlik>
Object migration is planned for the stage after this milestone, when a local, single node scripted simulation is stable enough.
nan1 has joined #ste||ar
<Yorlik>
Just measured: killing 989 Lua States gave me 2.0 MB of memory back per state. (measured in debugger)
<weilewei>
hkaiser the non-mpi version of HPX module on Summit is installed and well tested. So Summit now has distributed and serial HPX 1.4.1.
<ct-clmsn>
weilewei, nice
<hkaiser>
perfect!
<weilewei>
:) Yea!
nan1 has quit [Ping timeout: 240 seconds]
<rtohid>
Abhishek09 here!
<Abhishek09>
rtohid: will we prefer to install hpx by cmake or by dnf package? the dnf package will require the same gcc version to compile
<ct-clmsn>
what's the best technique for running the clang lint tool provided with phylanx?
<ct-clmsn>
i've ancient code that's blocked on my poor code formatting
<Abhishek09>
cmake & boost by binary tar; pybind, blaze & blaze tensor by cmake; git & libjemalloc by sudo
<Abhishek09>
rtohid
<rtohid>
Abhishek09 CMake
<rtohid>
we just need to follow build instructions on Phylanx's Wiki
diehlpk has quit [Remote host closed the connection]
diehlpk has joined #ste||ar
diehlpk has joined #ste||ar
diehlpk has quit [Changing host]
<Abhishek09>
That means dnf is not involved in our project rtohid diehlpk
diehlpk has quit [Remote host closed the connection]
diehlpk has joined #ste||ar
<Abhishek09>
rtohid nikunj97 manylinux is supported on travis/Appveyor ci but not circle ci, but cibuildwheel does support it
stmatengss has left #ste||ar [#ste||ar]
<rtohid>
Abhishek09 not to build Phylanx itself.
<diehlpk>
Abhishek09, The final solution can not involve the dnf package
<nikunj97>
Abhishek09, if you use manylinux, you will have dnf. But it'll be an older version with old libraries. So you can't rely on dnf
<Abhishek09>
rtohid `not to build Phylanx itself` means?
<nikunj97>
if you're going the manylinux route, you will have to build everything from scratch
<nikunj97>
including a gnu gcc compiler
<Abhishek09>
nikunj97 yes, centos 8 supports dnf but 5 does not
<nikunj97>
once you have the gnu compiler toolchain, you will have to build dependencies followed by phylanx
<nikunj97>
so in short, dnf is not allowed. The final solution must not have dnf
<Abhishek09>
dnf is not allowed nikunj97 Why?
<Abhishek09>
old gcc?
<nikunj97>
older version of dnf implies older libraries
akheir1 has quit [Remote host closed the connection]
<diehlpk>
Abhishek09, As I told you before, I would start by using the dnf package of hpx and try to build phylanx and its dependencies into a whl file
akheir1 has joined #ste||ar
avah has quit [Remote host closed the connection]
<diehlpk>
Once you have done this as a step towards the main goal, you can remove the dnf package and take care of compiling hpx
<diehlpk>
First step could be: use hpx's dnf package, build all dependencies, and phylanx itself
<diehlpk>
use the built package on a fresh docker image to test things
<Abhishek09>
nikunj97 manylinux doesn't support circle ci
<diehlpk>
In the second step, once we have a working solution for this. You will add hpx to the build chain of the pip package
<Abhishek09>
cibuildwheel does nikunj97
<diehlpk>
I believe once you figured out to build and ship the dependencies and phylanx, it will be easy to do the same for hpx
<nikunj97>
it's worth investigating cibuildwheel as well then
<diehlpk>
To summarize, the final package can not use dnf or apt-get at all
<diehlpk>
A first proof of concept could use dnf to install hpx
akheir1 has quit [Read error: Connection reset by peer]
<nikunj97>
cibuilds is a later step in the project imo
akheir1 has joined #ste||ar
<Abhishek09>
Yes, I know it uses the same docker image as manylinux
<Abhishek09>
nikunj97
<diehlpk>
yes, I agree. I would not over-engineer the solution
<nikunj97>
Abhishek09, you should focus on the first few steps in order to move to the later ones
<nikunj97>
it's always easier to formulate later steps when you have some initial proof of concept
<diehlpk>
Abhishek09, yes, I agree. Just having a pip package compiled in one docker image, copied to a fresh one, with everything working, would be a huge success
<nikunj97>
that's why everyone is suggesting you to first figure out the part of building a pip package. Once you succeed in that, we can explore the ideas of ci integration
<nikunj97>
but that's when you've successfully made a pip package which compiles everything on manylinux and finally creates a wheel using auditwheel (or the likes)
<Abhishek09>
nikunj97: but rtohid said to follow build instructions on Phylanx's Wiki
<nikunj97>
sure
<nikunj97>
I don't see any confusions
<Yorlik>
hkaiser: It's kinda interesting how the creation of Lua States depends a lot on the properties of the workload and the corresponding number of tasks in flight. I added a small busy loop to object creation to slow it down a bit and thus allow intermediately created Lua States to finish, so they could be reused. That drastically reduced the number of created LuaStates. I think there's a lot to learn here.
<Yorlik>
Also it seems my artificial tests do not really reflect a realistic workload. I'll have to do a lot of experimenting.
<Yorlik>
It seems there are phases of bursts where new states get created and then it stabilizes again before it comes to a stable situation overall.
<Yorlik>
When all objects were created and message creation was in a steady state, after purging all engines it ran with just 4 - one per thread.
<Abhishek09>
diehlpk: Why would we use the dnf package rather than make install?
<Abhishek09>
by cmake
<nikunj97>
he's not forcing you to use dnf
<nikunj97>
he's just telling you, if you want to use dnf package initially, you may do so
<nikunj97>
but the final product should not make use of dnf
<Abhishek09>
that means I can apply either option, dnf or cmake
<diehlpk>
Abhishek09, I was thinking that getting HPX to work will take a long time.
<diehlpk>
So you could use dnf to install hpx and just compile all dependencies of Phylanx and itself
<diehlpk>
So you could deliver a first package where you use HPX from dnf
<diehlpk>
and compile only phylanx and its dependencies. So we have a working package sooner
<diehlpk>
I think if you figured out how to compile and ship for example pybind11, it will be easier to compile HPX using pip
<diehlpk>
After we have this package, you can remove dnf install
<diehlpk>
and build hpx
<diehlpk>
I believe that doing baby steps is the way to go
<diehlpk>
It is just my opinion and you can propose whatever you want
<diehlpk>
I just want to say doing incremental steps is better
<diehlpk>
However, the final solution should not use dnf in your proposal
<hkaiser>
Yorlik: you might request a lua state only once the thread actually starts running, not at task creation
<Yorlik>
How could I do that? I request the states immediately before using them.
<hkaiser>
diehlpk: using dnf for binaries is not cross platform and makes the pip package almost useless
nan1 has joined #ste||ar
<Yorlik>
hkaiser: One problem is callbacks which use a Lua state from a c++ function which is called from Lua. Unfortunately I cannot just pass around the LuaState, since it's in actions which might run remotely.
<diehlpk>
hkaiser, yes, I know and that is exactly what I want as a first step
<hkaiser>
Yorlik: if you associate the lua state with the hpx thread this shouldn't be a problem
<diehlpk>
I think we should start with the easiest pip package, which might be useless, but shows that things work
<hkaiser>
diehlpk: ok
<Yorlik>
hkaiser: I had that and it exploded when the task migrated to another thread. The states are bound to the tasks or I get access errors
<diehlpk>
I think if we can use dnf install, build phylanx and its dependencies in one Fedora docker container and generate a pip package
<hkaiser>
Yorlik: hpx threads have a 64 bit value you can use for your purposes
<diehlpk>
Copy this pip package to a fresh Fedora docker container and install it and run any phylanx example
<hkaiser>
hpx::threads::set_state(std::size_t) or something similar (and size_t get_state())
<diehlpk>
This would be a major step
<Yorlik>
Like using a lua state with two tasks at the same time? That might be possible.
<hkaiser>
so you attach your lua state to the hpx thread which will carry it along
<diehlpk>
next step would be remove dnf install and compile hpx
<Yorlik>
Oh - I see.
<hkaiser>
no, just one task
<diehlpk>
Once we have done this, we could think about how to use tools to make it more useful
<hkaiser>
this way the c++ code called from lua will use the same lua state as the surrounding hpx thread
<Yorlik>
At the moment I am using several states per task, one per object update - that could be fixed.
<diehlpk>
hkaiser, I just said it is not a good way to use that build system for official pip packages from the beginning
<hkaiser>
diehlpk: ok
<diehlpk>
We should do baby steps in this direction
<Yorlik>
So I'd use the same state for every update run in the batch of the parallel loop
<hkaiser>
agreed
<Yorlik>
hkaiser: Is there something like task local storage?
<hkaiser>
sure, set_state/get_state
<Yorlik>
OK - I'll look that up
<zao>
diehlpk: Instructions unclear, I removed dnf :P
<Yorlik>
hkaiser: That function only accepts the id_type and the size_t - What I'd need to do is store a unique_ptr that gets automagically destroyed when the task finishes - I can't see how I would do that here. I'd have no way to figure out how long the task lives.
<hkaiser>
Yorlik: it's thread_id_type, not id_type
<hkaiser>
ahh, but good point
<hkaiser>
I need to create an example for this
<Yorlik>
Yes, but how would that size_t help me?
<Yorlik>
The data needs to be destroyed at the end of the task.
<Yorlik>
So the deleter of the unique_ptr to the lua state can kick in
<Yorlik>
and give it back to the pool
<hkaiser>
that's what I need to create an example for
<Yorlik>
OK. Thanks a ton !
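A sketch of what such an example might look like; the pool functions are hypothetical stand-ins for Yorlik's state pool, and the per-thread 64-bit slot is assumed to be exposed as hpx::threads::set_thread_data/get_thread_data (hkaiser quotes the names from memory as set_state/get_state, so treat them as assumptions):

```cpp
#include <hpx/include/threads.hpp>

#include <cstddef>

struct lua_State;                 // opaque, from the Lua headers
lua_State* acquire_from_pool();   // hypothetical: Yorlik's Lua state pool
void return_to_pool(lua_State*);  // hypothetical

// RAII guard living on the task's stack: attaches a pooled Lua state to the
// current HPX thread and returns it to the pool when the task finishes.
struct scoped_lua_state
{
    scoped_lua_state() : state_(acquire_from_pool())
    {
        // the per-thread slot is a 64-bit integer, so stash the pointer in it
        hpx::threads::set_thread_data(hpx::threads::get_self_id(),
            reinterpret_cast<std::size_t>(state_));
    }
    ~scoped_lua_state()
    {
        hpx::threads::set_thread_data(hpx::threads::get_self_id(), 0);
        return_to_pool(state_);
    }
    lua_State* state_;
};

// A C++ callback invoked from Lua recovers the state without it being
// passed down the call chain:
lua_State* current_lua_state()
{
    return reinterpret_cast<lua_State*>(
        hpx::threads::get_thread_data(hpx::threads::get_self_id()));
}
```

Since the guard sits on the HPX thread's stack, its destructor runs when the task ends, which is the unique_ptr-deleter behaviour Yorlik is after.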
<hkaiser>
bita: after the retiling you'd need to create a new annotation which needs to have all of the new tiles in it
gonidelis has quit [Remote host closed the connection]
<bita>
hkaiser, annotate_d does not need to show all of the meta data. I don't get your point :?
gonidelis has joined #ste||ar
<hkaiser>
annotate_d produces an annotation that has information for all the tiles,
<gonidelis>
hkaiser I was asking about merging these patches in some milestone version or sth, or if they just get merged directly to master...
<hkaiser>
diehlpk: do you plan to join Maxwells defense dryrun now?
<hkaiser>
gonidelis: do you think this is needed?
<hkaiser>
diehlpk_mobile[m: ^^
<gonidelis>
no i dont. just asking how things work in your team ;)
<nikunj97>
heller1, about roofline analysis. Peak performance of a single core will be (2.6)*(16) GFLOPs/s where 2.6 GHz is the cpu frequency and 16 instructions are executed per cycle
<nikunj97>
and from stream benchmarks we know that 4GB/s is the memory bandwidth
<nikunj97>
the triad operation is a[j] = b[j]+scalar*c[j];
<nikunj97>
so it will require 2 loads, i.e. b[j] and c[j], and a load and store for a[j]
<heller1>
More like 40, no?
<nikunj97>
was it 40, let me check
<nikunj97>
ohh yeah 39.xxGB/s
<nikunj97>
my bad
<heller1>
16 is vectorized fma?
<nikunj97>
yes
<nikunj97>
CPUs have 16 instructions per cycle (as E5-2600v3 series CPUs have AVX2.0 and FMA instruction sets that at their theoretical maximum are two times larger than that of E5-2600v1 and E5-2600v2)
<nikunj97>
so peak core performance is about 41.6 GFLOPs/s
<nikunj97>
and the processor's peak performance is 1664 GFLOPs/s
<nikunj97>
so simple AVX2, no FMA, will be 20.8 GFLOPs/s
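nikunj97's numbers written out; the 40-core count is inferred from the quoted total, not stated in the chat:

```latex
\begin{aligned}
P_{\mathrm{FMA,\,core}}  &= 2.6\ \mathrm{GHz} \times 16\ \mathrm{FLOP/cycle} = 41.6\ \mathrm{GFLOP/s}\\
P_{\mathrm{AVX2,\,core}} &= 2.6\ \mathrm{GHz} \times 8\ \mathrm{FLOP/cycle} = 20.8\ \mathrm{GFLOP/s}\\
N_{\mathrm{cores}} &= 1664 / 41.6 = 40
\end{aligned}
```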
<heller1>
Ok, as mentioned earlier, it would be good to have all those separate "roofs"
<nikunj97>
yes, I wanted to confirm this with you
<nikunj97>
so these will be horizontal lines on plotted graph
<heller1>
1. Scalar instructions, single core 2. Vector instructions, no fma, single core 3. Vector FMA instructions, single core 4. The number of 3. multiplied with the number of cores
<nikunj97>
ok, got it. How do I get the intersecting line (memory bandwidth one)?
<nikunj97>
how do I get to know where it'll intersect (y or x axis)?
<heller1>
So draw three graphs with those different rooflines. One graph per machine to test
<nikunj97>
but what about the memory bandwidth line?
Hashmi has quit [Quit: Connection closed for inactivity]
<heller1>
Yes, sounds reasonable
<heller1>
Yes, try to plot a proper roofline
<heller1>
Each of the horizontal lines will hint you at the benefit of each architectural improvement
<heller1>
What do you think?
<heller1>
f(x) = min(x*peak_bw, peak_flops), where x is the arithmetic intensity
<nikunj97>
yes, I think I've also figured out the memory bandwidth line. I take 2 operations from stream benchmarks with their arithmetic intensity
<nikunj97>
and then I have their memory bandwidth as well
<nikunj97>
so I multiply it with the corresponding memory bandwidth and compare with processor's peak
<heller1>
No
<heller1>
sorry, riot seems to be completely borked right now
<nikunj97>
is it not the same as your function?
<nikunj97>
I think it is. If I know 2 corresponding points of a line, I can plot the line itself. I have both the arithmetic intensity and its corresponding peak bw
<heller1>
Use pen and paper and try to figure it out yourself
<nikunj97>
ok, let me try
<heller1>
;)
<heller1>
x is the unknown
<heller1>
You have two functions that intersect, f(x) = a*x and g(x)= b
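Setting the two functions equal gives the ridge point where the roofline bends, which is the pen-and-paper step being hinted at:

```latex
% f(x) = a x (bandwidth roof) meets g(x) = b (compute roof) where
a x = b \;\Longrightarrow\; x^{*} = \frac{b}{a} = \frac{P_{\mathrm{peak}}}{BW_{\mathrm{peak}}}
% below x* a kernel is bandwidth-bound, above it compute-bound
```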
<hkaiser>
gonidelis: we do a release every couple of months that contains all changes that have accumulated since the last release
<nikunj97>
this would mean that the memory bandwidth line passes through the origin
<heller1>
Basic math rules apply
<heller1>
It absolutely does
<hkaiser>
nikunj97: zero cores use zero memory bandwidth ;-)
<heller1>
Why shouldn't it?
<nikunj97>
hold on, x axis has arithmetic intensity, right?
<heller1>
Yes
Abhishek09 has quit [Remote host closed the connection]
<heller1>
Why shouldn't it?
<nikunj97>
nothing, I was confused by your analogy. I got what you're saying now
<heller1>
always remember, we use math as a way to express our models ;)
<heller1>
HPX is not magic, it's C++, roofline is not witchcraft, it's math ;)
<heller1>
anyways, zero could either mean that you have no flops or that your bytes go to infinity; both lead to the number of flops/s being zero
weilewei has quit [Ping timeout: 240 seconds]
<nikunj97>
yes
<heller1>
so there's no reason why it should not start at zero
<nikunj97>
heller1, we use stream triad benchmark
<nikunj97>
we use its 2 FLOP/iter and 24 B/iter
<nikunj97>
that gives us arithmetic intensity of 1/12 FLOP/B
<heller1>
we used that to determine the maximum bandwidth
<heller1>
but sure, you can use those values to validate your graph
<heller1>
bonus point: where will the triad benchmark result sit?
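Working the bonus question through with the numbers already in the log (assuming, as for the machines discussed, that the ridge point lies far to the right of 1/12):

```latex
% TRIAD: 2 FLOP and 24 B per iteration => AI = 2/24 = 1/12 FLOP/B,
% well left of the ridge point, so TRIAD sits on the bandwidth slope:
P\!\left(\tfrac{1}{12}\right) = \frac{BW_{\mathrm{peak}}}{12}
\approx \frac{40\ \mathrm{GB/s}}{12} \approx 3.3\ \mathrm{GFLOP/s}
```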