hkaiser changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/ | GSoC: https://github.com/STEllAR-GROUP/hpx/wiki/Google-Summer-of-Code-%28GSoC%29-2020
RostamLog has joined #ste||ar
ahkeir1 has quit [Quit: Leaving]
nikunj97 has quit [Ping timeout: 260 seconds]
bita has joined #ste||ar
hkaiser has quit [Quit: bye]
shahrzad has quit [Ping timeout: 246 seconds]
SSolei1 has joined #ste||ar
rtohid has left #ste||ar [#ste||ar]
SSolei1 has quit [Remote host closed the connection]
shahrzad has joined #ste||ar
weilewei has quit [Ping timeout: 240 seconds]
bita has quit [Quit: Leaving]
shahrzad has quit [Ping timeout: 246 seconds]
shahrzad has joined #ste||ar
kale has joined #ste||ar
kale has quit [Client Quit]
shahrzad has quit [Ping timeout: 260 seconds]
mdiers_ has quit [Quit: mdiers_]
mdiers_ has joined #ste||ar
nikunj97 has joined #ste||ar
<heller1> nikunj97: is PAPI available on those? Or other mean to read the hardware performance counters?
<nikunj97> heller1, I do not know. Let me check
<nikunj97> heller1, yes PAPI is available
<heller1> ok, good
<heller1> do you have a roofline model already?
<nikunj97> heller1, I don't have one
<nikunj97> how do I create a roofline model?
<heller1> then create one ;)
<heller1> one sec
<nikunj97> aah! I'll try that!
<nikunj97> also I feel that my yesterday's result may be skewed
<heller1> so first, draw a graph with the roofline
<heller1> shrug
<heller1> start with the basics first
<nikunj97> ok
<heller1> also, draw the roofline with different peaks
<nikunj97> ok, let me try it
<heller1> different max bandwidth (main memory, cache levels), different max compute (no vectorization, vectorization, FMA, single threaded, all threads, etc)
<heller1> since the stencil's metric is MLUP/S, you should convert GFLOP/S to MLUP/S
<nikunj97> so for a 5 point stencil, mlups = gflops/5 ?
<heller1> no
<heller1> you need to calculate the arithmetic intensity
<heller1> as a first step
<heller1> well, keep the gflops at the first step
<heller1> i'll explain the conversion later
<heller1> setup the roofline for the ARM64FX2 first
<nikunj97> ok
<jbjnr> nikunj97: which of the stencil examples are you improving. It sounds like you are ding something very useful and worthwhile.
<nikunj97> jbjnr, I'm working with heller1's 2d stencil benchmark from one of his lectures
<jbjnr> is it part of the tutorials?
<jbjnr> (in the tutorials repo)
<nikunj97> yes
<jbjnr> ok great.
<jbjnr> we have a plan to redo the tutorial material for the next course and it would be lovely to have a simd version of the stencil code to add to the material.
<nikunj97> jbjnr, I'm trying my best :)
<heller1> nikunj97: yeah, would be nice if you could write a few pages about the performance modelling ;)
<heller1> lessons learnt, optimization etc
<nikunj97> well this is all for a lab based project at my university. As I told you, it's a collaboration between iitr and jsc, so they won't let me off without a good 10-15 page report ;)
<jbjnr> we have a gsoc project to add simd stuff, couldn't you do that as well and get paid for it too?
<jbjnr> jsc = julich?
<nikunj97> I have an internship this summer. Also, I'm a mentor this gsoc so I won't be able to apply as a student.
<nikunj97> and yes, jsc is julich supercomputing center
<jbjnr> k
<nikunj97> but I can look into the project when I'm free and add stuff there
nikunj has quit [Read error: Connection reset by peer]
nikunj has joined #ste||ar
<nikunj97> how do I get interactive access to a node on rostam? I don't see screen anymore :/
karame78 has quit [Remote host closed the connection]
kale has joined #ste||ar
kale_ has joined #ste||ar
kale has quit [Ping timeout: 250 seconds]
hkaiser has joined #ste||ar
Hashmi has joined #ste||ar
Abhishek09 has joined #ste||ar
Abhishek09 has quit [Remote host closed the connection]
kale_ has quit [Quit: Leaving]
kale_ has joined #ste||ar
<nikunj97> heller1, how do I do roofline analysis with PAPI?
<heller1> you don't
<nikunj97> I went through the link you sent. It looks like I need to first find the theoretical maximum and then add some macros to your code to find the gflops in your application. After that you compare
<heller1> nikunj97: step #1: determine the peak bandwidth of your system. step #2: determine the peak flops performance of your system
<heller1> well, for a stencil that's simple enough
<heller1> anyways, let's get the roofline first
<nikunj97> yes
<heller1> once you've done step #1 and step #2, you can plot the roofline as the function `min(AI * peak_bw, peak_flops)` where AI stands for arithmetic intensity
<heller1> which is the unit on your x-axis
<heller1> arithmetic intensity is a metric that tells you how many flops per byte your system can achieve
<heller1> sorry, misformulated it
<heller1> the arithmetic intensity is the unit that characterizes your workload
<nikunj97> why is that on the y-axis then?
<heller1> FLOPS
<nikunj97> ohh that's a min there
<heller1> if you have a bandwidth bound problem, you have a low arithmetic intensity, you need to transfer more memory to your ALUs
<nikunj97> How do I transfer more memory to ALU?
<nikunj97> anyway, let me try step #1 and #2 before I ask any more doubts
<heller1> for example, the following calculation (assuming all floats): `a = b + c;` requires two loads and one store, an equivalent of 12 bytes, for 1 floating point operation, that means that your arithmetic intensity is 1/12
<heller1> you don't transfer more memory to your ALU
<heller1> this is a fixed unit (determined by the peak bandwidth)
<nikunj97> gotcha
<heller1> what you need to do is to increase the number of floating point operations per memory load/store
<heller1> this can be done by exploiting your cache organization, intelligent prefetching, or a different algorithm
<heller1> so, first complete step #1 and #2
<heller1> for step #2, you can have different peak lines, as mentioned yesterday. One with scalar only operations, one with vectorization, one with vectorized FMA
<heller1> and one additional one using all cores and vectorization
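A minimal C++ sketch of the roofline function described above, with one curve per "roof": the attainable performance at arithmetic intensity AI is min(AI * peak_bw, peak_flops). All constants below are placeholders, not measurements; fill them in from steps #1 and #2 for the machine under test.

    #include <algorithm>
    #include <cstdio>

    // attainable GFLOP/s at arithmetic intensity `ai` (FLOP/byte)
    double roofline(double ai, double peak_bw_gbs, double peak_gflops)
    {
        return std::min(ai * peak_bw_gbs, peak_gflops);
    }

    int main()
    {
        double const peak_bw = 40.0;   // GB/s, placeholder: use the measured STREAM TRIAD value
        // placeholder compute roofs: scalar, vectorized, vectorized FMA, all cores
        double const roofs[] = { 5.0, 20.0, 40.0, 1600.0 };

        for (double ai = 1.0 / 64; ai <= 64.0; ai *= 2)   // log-spaced x axis
        {
            std::printf("%g", ai);
            for (double peak_flops : roofs)
                std::printf("\t%g", roofline(ai, peak_bw, peak_flops));
            std::printf("\n");   // tab-separated table, e.g. for gnuplot
        }
        return 0;
    }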
<nikunj97> which module do I need to load to test this?
<heller1> none?
<nikunj97> so I run an equivalent of nersc$ srun -n 4 -c 6 sde -knl -d -iform 1 -omix my_mix.out -i -global_region -start_ssc_mark 111:repeat -stop_ssc_mark 222:repeat -- foo.exe?
<heller1> I have no idea what this command means
<nikunj97> lol, I think I'm confusing myself and you
<heller1> to get the peak memory bandwidth, use this: https://www.cs.virginia.edu/stream/FTP/Code/
<nikunj97> ya that's what I wanted to know
<heller1> make sure you compile it with openmp enabled
kale_ is now known as kale
<diehlpk_mobile[m> In your class Extended Community Bonding period to 4 weeks (from 3 weeks)
<diehlpk_mobile[m> would
kale is now known as kale__
kale__ is now known as kale_
<diehlpk_mobile[m> Google sent an email with an updated timeline for GSoC
kale_ has quit [Quit: Leaving]
kale_ has joined #ste||ar
<diehlpk_mobile[m> Student proposal review period is now March 31-April 20.
<diehlpk_mobile[m> And we got one additional week to review the proposals
<heller1> yay
<heller1> nikunj97: for the peak flop performance, it is enough to consider the max frequency and the number of operations you can get through per cycle, no need to measure it
<nikunj97> heller1, I was currently reading on stream benchmarks. You want me to run `gcc -fopenmp -D_OPENMP stream.c -o stream` on the source code, right?
<heller1> and `-O3`
<nikunj97> right!
<heller1> then take the value you got from the stream TRIAD and multiply it by 1.5, then you have a realistic number
<kale_> diehlpk_mobile[m, I am currently working on my second draft. I have found better ways to implement the pip package and am actively looking into them. Since you are busy with your work, you can directly read my second draft tomorrow instead of going through both.
akheir has joined #ste||ar
Abhishek09 has joined #ste||ar
ahkeir1 has joined #ste||ar
<nikunj97> heller1, here are the results of running stream on hisilicon1616 and haswell x86 e5: https://gist.github.com/NK-Nikunj/fb61448647bc4bcc8c7ce0d0833c1036
<nikunj97> hisilicon seems to be about twice as powerful wrt e5 in triad
akheir has quit [Read error: Connection reset by peer]
<heller1> nikunj97: they make sense
<heller1> any more information on this system?
<nikunj97> the gist is all I got from running stream
<nikunj97> is there anything I'm missing?
<heller1> "The Hi1616 supports up to 512 GiB of quad-channel DDR4-2400 memory. This chip supports up to 2-way SMP with two ports supporting 96 Gb/s each."
<heller1> so there you go
<nikunj97> aah! it definitely is powerful then
<heller1> no, looks good
<nikunj97> should I also run it on the a64fx that we have?
<heller1> sure thing
<nikunj97> alright!
<heller1> make a case study ;)
<nikunj97> Wish I knew enough about processors to make that happen
<heller1> trust me, what you have so far is plenty
<nikunj97> heller1, btw I don't see any slurm setup on a64fx
<nikunj97> would it be fine to simply run the benchmark?
<heller1> i guess
<heller1> check if noone else is on the system
<heller1> and make sure to repeat the measurement a few times
<nikunj97> alright, so run them a good 10 times
<nikunj97> and take the average
<heller1> but in any case, it would be good to be aware of the general procedure for doing benchmarks on such a shared system
<heller1> or the max
<nikunj97> this is the first time, I'm doing this 😅. I will definitely remember this in the future!
weilewei has joined #ste||ar
karame78 has joined #ste||ar
Hashmi has quit [Quit: Connection closed for inactivity]
<nikunj97> heller1, set of 10 runs sure has a lot of variety in values: https://gist.github.com/NK-Nikunj/fb61448647bc4bcc8c7ce0d0833c1036#file-x86-set-of-10-runs
<nikunj97> is that expected?
<jbjnr> nikunj97: you might need to watch out for CPU frequency throttling. Recall that modern CPUs/etc can slow themselves down when they get hot. This can mess up benchmarks!
<nikunj97> jbjnr, aah! that makes sense
<jbjnr> make sure nobody else is using the node you're on, and make sure you're not running your tests on a login node
<nikunj97> I'm allocating myself a separate node
<nikunj97> and running a script that runs the stream executable 10 times
<nikunj97> and stores the triad result to a file
<heller1> nikunj97: ahh, the arm64fx is the arm node that I gave you?
<nikunj97> heller1, yes!
<nikunj97> it's not related to the project I'm doing, but I decided to write my benchmark such that we can reuse it on a64fx as well for our own project
<nikunj97> our -> ste||ar
<heller1> nikunj97: it is _not_ a arm64fx, the arm64fx is the one with SVE, the one that is going to get put into post-k (aka Fugaku)
<nikunj97> what is it then?
<heller1> entirely different machines
<heller1> yes, but it is NOT a ARM64FX
<nikunj97> why's the telegram group ARMFX64 then? shrug
<heller1> I always forget ... you should have the information in the email thread where I gave you access...
<heller1> the telegram group is about getting access to riken's machines, which use the armfx64
<nikunj97> "give me a shout if you need access to a larger aarch64 machine"
<nikunj97> should've read that right
<nikunj97> diehlpk_mobile[m told me that our proposal was accepted by fujitsu
<nikunj97> and that we should get access to it
<heller1> aarch64 is the generic term for arm 64 bit architectures
<heller1> the Hi1616 is one as well
<nikunj97> yes. I was misled by the telegram group name
<nikunj97> I should've seen /proc/cpuinfo
ct-clmsn has joined #ste||ar
<nikunj97> let me remove those stream benchmarks claiming to be arm64fx then
<heller1> you don't have to remove them, just give them their true name ;)
<nikunj97> so it's a qualcomm falkor
<heller1> something like that, yes
<nikunj97> so now, I have the peak triad bandwidth
<nikunj97> I multiply it by 1.5 to get a realistic number
<nikunj97> what is the next step?
<heller1> and now plot the roofline
<heller1> or was it 1.5?
<heller1> let me check the code again ;)
diehlpk has joined #ste||ar
<nikunj97> heller1, they don't multiply by 1.5 anywhere
<nikunj97> they report: avgtime[j] = avgtime[j]/(double)(NTIMES-1);
<nikunj97> diehlpk, yt?
<heller1> yeah, that's fine, leave out the multiplication
<diehlpk> nikunj97, yes
<nikunj97> so I report the average triad as bandwidth
<nikunj97> diehlpk, did we not get our proposal accepted with fujitsu?
<nikunj97> I thought they accepted our proposal and we were getting access to the a64fx machines
<diehlpk> I assumed the same but they never came back to us
<nikunj97> so no a64fx :/
<diehlpk> I do not know, just sent him one more time a reminder
<heller1> so you don't have a SVE capable CPU?
nan has joined #ste||ar
nan is now known as Guest9057
Guest9057 has quit [Remote host closed the connection]
nan1 has joined #ste||ar
bita has joined #ste||ar
kale has joined #ste||ar
kale_ has quit [Read error: Connection reset by peer]
gonidelis has joined #ste||ar
shahrzad has joined #ste||ar
shahrzad has quit [Remote host closed the connection]
shahrzad has joined #ste||ar
ahkeir1 has quit [Quit: Leaving]
ahkeir1 has joined #ste||ar
ahkeir1 has quit [Client Quit]
akheir has joined #ste||ar
gonidelis48 has joined #ste||ar
gonidelis has quit [Remote host closed the connection]
gonidelis48 is now known as gonidelis
<Yorlik> WTF???(T00000000/----------------.----/----------------) P--------/----------------.---- 16:17.53.385 [0000000000000001] <fatal> [ERR] thread_func: default thread_num:2 : caught boost::system::system_error: Unknown error (1455), aborted thread execution
<Yorlik> <unknown>
<Yorlik> (T00000000/----------------.----/----------------) P--------/----------------.---- 16:17.53.391 [0000000000000003] <fatal> [ERR] thread_func: default thread_num:0 : caught boost::system::system_error: The paging file is too small for this operation to complete, aborted thread execution
<Yorlik> <unknown>
<heller1> he
<heller1> run with --hpx:attach-debugger=exception
<heller1> then you'll get to it
<Yorlik> Thanks! And Hello heller1 ! :)
parsa has quit [Quit: Free ZNC ~ Powered by LunarBNC: https://LunarBNC.net]
<diehlpk> simbergm, jbjnr Anyone interested to prepare the GSoD application?
parsa has joined #ste||ar
<heller1> Yorlik: and hi ;)
<hkaiser> Yorlik: looks like a OOM error
<Yorlik> I had triggered a purge of the lua engines, which is a loop over a vector and deletetion of members
<Yorlik> I was doing some memory debugging to check allocations and stuff.
rtohid has joined #ste||ar
kale has quit [Quit: Leaving]
Abhishek09 has quit [Remote host closed the connection]
<heller1> does HPX run on folding@home?
<heller1> that would be quite a feat ;)
<gonidelis> hey, just made a very small PR #4454 on documentation. Hope it's ok; it's just a little detail, but I think it's a crucial one, as newcomers who try to read the manual https://stellar-group.github.io/hpx/docs/sphinx/latest/html/manual/creating_hpx_projects.html would not be able to catch the example :)
<gonidelis> any comments accepted
<nikunj97> gonidelis, PRs related to documentation are very much appreciated. hkaiser would concur ;)
<hkaiser> absolutely!
<hkaiser> everybody: today is the 12th birthday of HPX, btw
<bita> Happy Birthday HPX
<heller1> hkaiser: woohoo! Awesome job! Congrats to you!
<zao> Yay!
<zao> Let's start a new one from scratch :P
<hkaiser> zao: way to go!
<gonidelis> wow! 12 years of development...
<K-ballo> zao: let's use javascript so it runs everywhere
<zao> WASM, you say?
<hkaiser> tells me that we should publish the survey results asap
<hkaiser> "12'th Birthday - What do People think?"
akheir1 has joined #ste||ar
<heller1> indeed
<heller1> what's still missing?
stmatengss has joined #ste||ar
akheir has quit [Ping timeout: 265 seconds]
<simbergm> woop, happy birthday HPX!
<simbergm> hkaiser: sounds like a good idea
<hkaiser> simbergm: I still need to fix the images, didn't have time yet :/
<simbergm> hkaiser: you mean the labels?
<hkaiser> yah
<simbergm> doesn't have to be exactly on the birthday ;)
<simbergm> hkaiser: you need help with the images? I can probably paste something together quite quickly
<simbergm> it's just that one image, no?
<heller1> I think so, yes
<hkaiser> simbergm: I'm just running out of time, so if you could look into the labels I'd appreciate it very much
<simbergm> hkaiser: yep, I can take care of it
<hkaiser> simbergm: thanks!
Hashmi has joined #ste||ar
Abhishek09 has joined #ste||ar
<Abhishek09> rtohid : will we install hpx by dnf or cmake in the manylinux docker?
<gonidelis> simbergm If you would like you could check if the changes are proper. I have completely removed :lines: and replaced them with :start-after: :end-before:
<gonidelis> are these little patches merged with master directly? or are they gathered and merged in a large newer-verison-like pull request?
<hkaiser> gonidelis: nothing goes directly to master, ever
<hkaiser> everything goes through PRs
<Yorlik> hkaiser: With the need to keep a Lua state around for a task in flight I now see how many tasks actually can be "in flight": I'm at around 1000 Lua states at the moment, which reflects exactly that. At that level it stabilizes and the pool of Lua states stops growing. I more and more have the feeling that we'd need a new kind of scripting language to handle this
<Yorlik> programming environment. It's neither Lua's nor anyone's fault - to me it looks rather like a new situation which would require that.
<hkaiser> heh, running out of ideas, do you?
<Yorlik> Not really
<Yorlik> As long as it stabilizes it's not really a big problem. You just need the memory to keep around these Lua States.
<hkaiser> how much memory does a lua state consume?
<Yorlik> After creation in the moment ~500kb - 1MB
<Yorlik> It depends on how large your script base is
<hkaiser> nod, understand
<Yorlik> I'm fantasizing about a Lua version tailored for HPX tbh.
akheir1 has quit [Read error: Connection reset by peer]
<Yorlik> E.G. with our programming paradigm there is no reason why the static parts couldn't be shared.
akheir1 has joined #ste||ar
<Yorlik> It's just crazy to have all the scripts around in 1000ish copies
<hkaiser> Yorlik: yah
<Yorlik> I don't know if it is possible for Lua STates to have more shared data
<hkaiser> it's the variables that need separation
<Yorlik> Yes
<Yorlik> We just need one per OS thread
<hkaiser> that should be possible somehow, talk to the lua guys
<Yorlik> Otherwise we'd have data sharing
Abhishek09 has quit [Remote host closed the connection]
<Yorlik> I'll dig into that
<hkaiser> one per os thread would assume the hpx threads don't move around
<Yorlik> Yes
<Yorlik> It's more to avoid false sharing
<Yorlik> You don't want two threads trying to read the same ram
<Yorlik> Even if it's const
<hkaiser> nah, reading is not an issue
<Yorlik> Then we 'd need only one copy :)
<Yorlik> I'm thinking about sharing memory pages virtually
gonidelis has quit [Remote host closed the connection]
<Yorlik> But that would probably explode because of addresses stored
<hkaiser> Yorlik: premature optimization again
<Yorlik> lol
<Yorlik> NP having 3-4 GB worth of Lua states around.
<hkaiser> the lua scripting will overshadow everything else anyways
<Yorlik> Yup
<Yorlik> At the moment I'm processing ~150,000 messages on 10,000 objects per second in Lua. Still not good enough, imo.
gonidelis has joined #ste||ar
<hkaiser> smaller memory footprint might help there
<Yorlik> I might be able to optimize stuff later, when we have our systems prototyped.
<Yorlik> At the moment every message has a vector of variants as its argument pack.
<Yorlik> That's not exactly efficient
<Yorlik> Over time I can make specialized, smaller messages.
<hkaiser> depends on how large the variant is
<Yorlik> 40 bytes
<Yorlik> A message variant over message types is 32 bytes
<Yorlik> So - adding a stupid int adds 40 bytes ..
<Yorlik> We have 40 byte bools ! lol
<Yorlik> The id_types I put into the variant blew it up together with the strings
<hkaiser> id_types are intrusive_ptr's essentially, so not more than a single pointer
<Yorlik> I could precompile all strings and use a string hash instead.
<Yorlik> These messages also go over the wire.
<Yorlik> So I can't really make the id_types smaller
<hkaiser> that shouldn't affect things
<Yorlik> What is the minimum part of an id_type I really need to uniquely identify an object?
<Yorlik> It's cluster wide messaging
<hkaiser> id_type is 8 bytes, is that too much?
<Yorlik> In the moment I use the raw hpx::id_type
<Yorlik> Then it's the strings
<Yorlik> I think I'll optimize there
<Yorlik> 32 byte + variant index = 40
* Yorlik blames the strings
<hkaiser> std::string could be larger because of small object optimizations
<Yorlik> I think I'll come up with a global string hashing system inside Lua and use that.
<hkaiser> they usually store 16-64 bytes directly in the object before starting to allocate
<hkaiser> or use your own string type
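For reference, a small standalone sketch of the size arithmetic being discussed; the alternative types here are made up for illustration (not the actual message types), and the exact numbers depend on the standard library in use (libstdc++'s std::string is 32 bytes, MSVC's differs).

    #include <cstdint>
    #include <iostream>
    #include <string>
    #include <variant>

    // hypothetical argument variant resembling the one described above
    using arg_t = std::variant<bool, std::int64_t, double, std::string>;

    int main()
    {
        // the largest alternative plus the discriminator (and padding) dictates
        // the size, so even a bool alternative costs the full footprint
        std::cout << sizeof(std::string) << '\n';   // 32 with libstdc++ on x86-64
        std::cout << sizeof(arg_t) << '\n';         // typically 40 on x86-64
        return 0;
    }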
<Yorlik> I have to talk with our scripter. We might just require to use sh("my string thing") with sh being a lua side memoized hasher
<Yorlik> Lua side memoized hashing is crazily fast
akheir1 has quit [Read error: Connection reset by peer]
akheir1 has joined #ste||ar
<Yorlik> Only a chat message, like between players, would be larger, but that could be solved differently anyway.
nan1 has quit [Ping timeout: 240 seconds]
shahrzad has quit [Ping timeout: 260 seconds]
<heller1> is there only one lua implementation?
<Yorlik> There are several dialects and variants around. We stick with vanilla for various reasons.
<hkaiser> sensible choice
<Yorlik> It's a pity we had to rule out LuaJIT
<heller1> yeah
<heller1> is luajit maybe an option?
<Yorlik> Nope
<Yorlik> No support, bit rot, general dead end.
<Yorlik> If there were an established group supporting and understanding LuaJit it would be different, but they're not yet there.
<heller1> ?
<heller1> I see
<Yorlik> Reading this link ^^
<Yorlik> I'm afraid any other LuaVersion would bust the Lua C++ bindings we use
<heller1> nod
<Yorlik> I'm pretty sure there's a ton of optimization possible with having this many lua states around. I just didn't have time for that yet. Too much to do in other areas and it's not a big problem yet. But it might become one.
<Yorlik> The good news is, that most likely the current amount I use is an upper limit
<heller1> does it scale linearly with the number of states?
<Yorlik> The states do not cause issues except memory usage
<Yorlik> When an object gets updated I grab a state from the pool and call the update function in Lua giving it the object and the mailbox.
<Yorlik> The state sticks with the object until the updater exits. Then the mailbox is dumped and the state returned to the pool.
<Yorlik> Since a task is a batch of objects, there is a limit to the number of tasks in flight. We most likely can live with this memory consumption for quite a while.
<heller1> are those merely living on the server?
<Yorlik> The states? They are just local, sure.
akheir1 has quit [Read error: Connection reset by peer]
<Yorlik> Since they are essentially const and the same all around the cluster
akheir1 has joined #ste||ar
<Yorlik> Object migration is planned for the stage after this milestone, when a local, single node scripted simulation is stable enough.
nan1 has joined #ste||ar
<Yorlik> Just measured: killing 989 Lua States gave me back 2.0 MB of memory per state (measured in the debugger).
<Yorlik> So that's including garbage and all.
<bita> I think we need to update https://github.com/STEllAR-GROUP/phylanx/wiki/Build-Instructions . As far as I know blaze_tensor is a dependency which is not mentioned. Also both links to build hpx do not work
<ct-clmsn> bita, does steve have notes on how his script builds the containers?
Abhishek09 has joined #ste||ar
<bita> I don't know about notes. Nan was using https://github.com/STEllAR-GROUP/phylanx/wiki/Phylanx-in-Containers (steve's singularity on Rostam) and I think its phylanx is not updated (she got an iso_component error)
<bita> nan1^^
avah has joined #ste||ar
<Abhishek09> rtohid?
<ct-clmsn> bita, gotcha
<weilewei> hkaiser the non-mpi version of HPX module on Summit is installed and well tested. So Summit now has distributed and serial HPX 1.4.1.
<ct-clmsn> weilewei, nice
<hkaiser> perfect!
<weilewei> :) Yea!
nan1 has quit [Ping timeout: 240 seconds]
<rtohid> Abhishek09 here!
<Abhishek09> rtohid: will we prefer to install hpx by cmake or by the dnf package? The dnf package will require the same gcc version to compile
<ct-clmsn> what's the best technique for running the clang lint tool provided with phylanx?
<ct-clmsn> i've ancient code that's blocked on my poor code formatting
<Abhishek09> cmake; boost by binary tar; pybind, blaze, blaze_tensor by cmake & git; libjemalloc by sudo
<Abhishek09> rtohid
<rtohid> Abhishek09 CMake
<rtohid> we just need to follow build instructions on Phylanx's Wiki
diehlpk has quit [Remote host closed the connection]
diehlpk has joined #ste||ar
diehlpk has joined #ste||ar
diehlpk has quit [Changing host]
<Abhishek09> That means dnf is not involved in our project rtohid diehlpk
diehlpk has quit [Remote host closed the connection]
diehlpk has joined #ste||ar
<Abhishek09> rtohid nikunj97 manylinux is supported on Travis/AppVeyor CI but not on Circle CI; cibuildwheel supports it, though
stmatengss has left #ste||ar [#ste||ar]
<rtohid> Abhishek09 not to build Phylanx itself.
<diehlpk> Abhishek09, The final solution can not involve the dnf package
<nikunj97> Abhishek09, if you use manylinux, you will have dnf. But it'll be an older version with old libraries. So you can't rely on dnf
<Abhishek09> rtohid `not to build Phylanx itself` means?
<nikunj97> if you're going the manylinux route, you will have to build everything from scratch
<nikunj97> including a gnu gcc compiler
<Abhishek09> nikunj97 yes, centos 8 supports dnf but 5 doesn't
<nikunj97> once you have the gnu compiler toolchain, you will have to build dependencies followed by phylanx
<nikunj97> so in short, dnf is not allowed. The final solution must not have dnf
<Abhishek09> dnf is not allowed nikunj97 Why?
<Abhishek09> old gcc?
<nikunj97> older version of dnf implies older libraries
akheir1 has quit [Remote host closed the connection]
<diehlpk> Abhishek09, As I told you before, I would start by using the dnf package of hpx and try to build phylanx and its dependencies into a whl file
akheir1 has joined #ste||ar
avah has quit [Remote host closed the connection]
<diehlpk> Once you have done this as a step toward the main goal, you can remove the dnf package and take care of compiling hpx
<diehlpk> First step could be: use hpx's dnf package, build all dependencies, and phylanx itself
<diehlpk> use the built package on a fresh docker image to test things
<Abhishek09> nikunj97 manylinux doesn't support circle ci
<diehlpk> In the second step, once we have a working solution for this. You will add hpx to the build chain of the pip package
<Abhishek09> cibuildwheel does nikunj97
<diehlpk> I believe once you figured out to build and ship the dependencies and phylanx, it will be easy to do the same for hpx
<nikunj97> it's worth investigating cibuildwheel as well then
<diehlpk> To summarize, the final package can not use dnf or apt-get at all
<diehlpk> A first proof of concept could use dnf to install hpx
<nikunj97> Abhishek09, https://github.com/joerick/cibuildwheel cibuildwheels can build manylinux as well
akheir1 has quit [Read error: Connection reset by peer]
<nikunj97> cibuilds is a later step in the project imo
akheir1 has joined #ste||ar
<Abhishek09> Yes , i know it uses the same docker as manylinux
<Abhishek09> nikunj97
<diehlpk> yes, I agree. I would not over-engineer the solution
<nikunj97> Abhishek09, you should focus on the first few steps in order to move to the later ones
<nikunj97> it's always easier to formulate later steps when you have some initial proof of concept
<diehlpk> Abhishek09, yes, I agree. Just having a pip package compiled in one docker image, copied to a fresh one, and everything working would be a huge success
<nikunj97> that's why everyone is suggesting you to first figure out the part of building a pip package. Once you succeed in that, we can explore the ideas of ci integration
<nikunj97> but that's when you've successfully made a pip package which compiles everything on manylinux and finally creates a wheel using auditwheel (or the likes)
<Abhishek09> nikunj97: but rtohid said to follow build instructions on Phylanx's Wiki
<nikunj97> sure
<nikunj97> I don't see any confusions
<Yorlik> hkaiser: It's kinda interesting how the creation of Lua States depends a lot on the properties of the workload and the corresponding number of tasks in flight. I added a small busy loop to object creation to slow it down a bit and thus allow intermediately created Lua States to finish, so they could be reused. That drastically reduced the number of created Lua States. I think there's a lot to learn here.
<Yorlik> Also it seems my artificial tests do not really reflect a realistic workload. I'll have to do a lot of experimenting.
<Yorlik> It seems there are phases of bursts where new states get created and then it stabilizes again before it comes to a stable situation overall.
<Yorlik> When all objects were created and message creation was in a steady state, after purging all engines it ran with just 4 - one per thread.
<Abhishek09> diehlpk: Why would we use the dnf package rather than make install?
<Abhishek09> by cmake
<nikunj97> he's not forcing you to use dnf
<nikunj97> he's just telling you, if you want to use dnf package initially, you may do so
<nikunj97> but the final product should not make use of dnf
<Abhishek09> that means i can apply any option dnf or cmake
<diehlpk> Abhishek09, I was thinking that getting HPX to work will take a long time.
<diehlpk> So you could use dnf to install hpx and just compile all dependencies of Phylanx and itself
<diehlpk> So you could deliver a first package where you use HPX from dnf
<diehlpk> and compile only phylanx and its dependencies. So we have a working package sooner
<diehlpk> I think if you figured out how to compile and ship for example pybind11, it will be easier to compile HPX using pip
<diehlpk> After we have this package, you can remove dnf install
<diehlpk> and build hpx
<diehlpk> I believe that doing baby steps is the way to go
<diehlpk> It is just my opinion and you can propose whatever you want
<diehlpk> I just want to say doing incremental steps is better
<diehlpk> However, the final solution should not use dnf in your proposal
<hkaiser> Yorlik: you might request a lua state only once the thread actually starts running, not at task creation
<Yorlik> How could I do that ? I request the states immediately before using them.
<hkaiser> diehlpk: using dnf for binaries is not cross platform and makes the pip package almost useless
nan1 has joined #ste||ar
<Yorlik> hkaiser: One problem is callbacks which use a Lua state from a C++ function which is called from Lua. Unfortunately I cannot just pass around the LuaState, since it's in actions which might run remotely.
<diehlpk> hkaiser, yes, I know and that is exactly what I want as a first step
<hkaiser> Yorlik: if you associate the lua state with the hpx thread this shouldn't be a problem
<diehlpk> I think we should start with the simplest pip package, which might be almost useless, but shows that things work
<hkaiser> diehlpk: ok
<Yorlik> hkaiser: I had that and it exploded, when the task migrated to another thread. The states are bound to the tasks or I get access errors
<diehlpk> I think if we can use dnf install, build phylanx and its dependencies in one Fedora docker container and generate a pip package
<hkaiser> Yorlik: hpx threads have a 64 bit value you can use for your purposes
<diehlpk> Copy this pip package to a fresh Fedora docker container and install it and run any phylanx example
<hkaiser> hpx::threads::set_state(std::size_t) or something similar (and size_t get_state())
<diehlpk> This would be a major step
<Yorlik> Like using a lua state with two tasks at the same time? That might be possible.
<hkaiser> so you attach you lua state to the hpx thread which will carry it with it
<diehlpk> next step would be remove dnf install and compile hpx
<Yorlik> Oh - I see.
<hkaiser> no, just one task
<diehlpk> Once we have done this, we could think about how to use tools to make it more useful
<hkaiser> this way the c++ code called from lua will use the same lua state as the surrounding hpx thread
<Yorlik> At the moment I am using several states per task, one per object update - that could be fixed.
<diehlpk> hkaiser, I just said it is not a good way to use the build system for official pip packages from the beginning
<hkaiser> diehlpk: ok
<diehlpk> We should do baby steps in this direction
<Yorlik> So I'd use the same state for every update run in the batch of the parallel loop
<hkaiser> agreed
<Yorlik> hkaiser: Is there something like task local storage?
<hkaiser> sure, set_state/get_state
<Yorlik> OK - I'll look that up
<zao> diehlpk: Instructions unclear, I removed dnf :P
<Yorlik> hkaiser: Thanks!
<diehlpk> zao, This is just how I would do it
<hkaiser> Yorlik: to get the id of the running thread you can use get_self_id()
<Yorlik> ok.
<Yorlik> Thanks !
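A hedged sketch of the idea hkaiser describes above: stash a pointer to the task's Lua state in the per-HPX-thread 64-bit slot. The function names below (hpx::threads::get_thread_data / set_thread_data) are an assumption about what that slot is called; hkaiser himself is unsure of the exact names, so check the actual HPX headers. The pool helpers and the wrapper type are hypothetical.

    #include <hpx/include/threads.hpp>

    #include <cstddef>

    struct lua_state_wrapper;                    // hypothetical wrapper around a lua_State*
    lua_state_wrapper* checkout_from_pool();     // hypothetical pool accessors
    void return_to_pool(lua_state_wrapper*);

    // returns the Lua state associated with the currently running HPX thread,
    // checking one out of the pool on first use (assumes the slot starts at zero)
    lua_state_wrapper* current_lua_state()
    {
        hpx::threads::thread_id_type id = hpx::threads::get_self_id();
        std::size_t data = hpx::threads::get_thread_data(id);
        if (data == 0)
        {
            lua_state_wrapper* s = checkout_from_pool();
            hpx::threads::set_thread_data(id, reinterpret_cast<std::size_t>(s));
            return s;
        }
        return reinterpret_cast<lua_state_wrapper*>(data);
    }

    // The open question raised below remains: something still has to call
    // return_to_pool() and clear the slot when the task finishes.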
<Yorlik> hkaiser: That function only accepts the id_type and the size_t - What I'd need to do is store a unique_ptr that gets automagically destroyed when the task finishes - I can't see how I would do that here. I'd have no way to figure out how long the task lives.
<bita> hkaiser, a quick question: does [this](https://github.com/STEllAR-GROUP/phylanx/blob/add_retiling/tests/unit/plugins/dist_matrixops/retile_2_loc.cpp#L50) make sense to you? should we fetch the data from other localities when we retile?
<hkaiser> bita: sec
<hkaiser> Yorlik: it's thread_id_type, not id_type
<hkaiser> ahh, but good point
<hkaiser> I need to create an example for this
<Yorlik> Yes, but however - What would that size_t help me?
<Yorlik> The data needs to be destroyed at the end of the task.
<Yorlik> So the deleter of the unique_ptr to the lua state can kick in
<Yorlik> and give it back to the pool
<hkaiser> that's what I need to create an example for
<Yorlik> OK. Thanks a ton !
<hkaiser> bita: after the retiling you'd need to create a new annotation which needs to have all of the new tiles in it
gonidelis has quit [Remote host closed the connection]
<bita> hkaiser, annotate_d does not need to show all of the meta data. I don't get your point :?
gonidelis has joined #ste||ar
<hkaiser> annotate_d produces an annotation that has information for all the tiles,
<gonidelis> hkaiser I was asking about merging these patches in some milestone version or sth, or if they just get merged directly to master...
<hkaiser> diehlpk: do you plan to join Maxwells defense dryrun now?
<hkaiser> gonidelis: do you think this is needed?
<hkaiser> diehlpk_mobile[m: ^^
<gonidelis> no i dont. just asking how things work in your team ;)
<nikunj97> heller1, about roofline analysis. Peak performance of a single core will be (2.6)*(16) GFLOPs/s where 2.6GHz is cpu frequency and 16 instructions are executed per cycle
<nikunj97> and from stream benchmarks we know that 4GB/s is the memory bandwidth
<nikunj97> the triad operation is a[j] = b[j]+scalar*c[j];
<nikunj97> so it will require 2 loads, i.e. b[j] and c[j], and a store for a[j]
<heller1> More like 40, no?
<nikunj97> was it 40, let me check
<nikunj97> ohh yeah 39.xxGB/s
<nikunj97> my bad
<heller1> 16 is vectorized fma?
<nikunj97> yes
<nikunj97> CPUs have 16 instructions per cycle (as E5-2600v3 series CPUs have AVX2.0 and FMA instruction sets that at their theoretical maximum are two times larger than that of E5-2600v1 and E5-2600v2)
<nikunj97> so peak core performance is about 41.6 GFLOP/s
<nikunj97> and the processor's peak performance is 1664 GFLOP/s
<nikunj97> so simply AVX2, no FMA will be 20.8 GFLOP/s
<heller1> And where do you get the 16 from?
<heller1> Ok, as mentioned earlier, it would be good to have all those separate "roofs"
<nikunj97> yes, I wanted to confirm this with you
<nikunj97> so these will be horizontal lines on plotted graph
<heller1> 1. Scalar instructions, single core 2. Vector instructions, no fma, single core 3. Vector FMA instructions, single core 4. The number of 3. multiplied with the number of cores
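A quick sanity check on the figures quoted above: 2.6 GHz x 16 FLOP/cycle = 41.6 GFLOP/s per core for roof 3, and halving that for AVX2 without FMA gives the 20.8 GFLOP/s of roof 2. The 1664 GFLOP/s whole-machine figure corresponds to 40 x 41.6, i.e. it assumes 40 cores; for roof 4 it is worth double-checking that this counts physical cores rather than hardware threads.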
<nikunj97> ok, got it. How do I get the intersecting line (memory bandwidth one)?
<nikunj97> how do I get to know where it'll intersect (y or x axis)?
<heller1> So draw three graphs with those different rooflines. One graph per machine to test
<nikunj97> but what about the memory bandwidth line?
Hashmi has quit [Quit: Connection closed for inactivity]
<heller1> Yes, sounds reasonable
<heller1> Yes, try to plot a proper roofline
<heller1> Each of the horizontal lines will hint you at the benefit of each architectural improvement
<heller1> What do you think?
<heller1> f(x) = min(x*peak_bw, peak_flops), where x is the arithmetic intensity
<nikunj97> yes, I think I've also figured out the memory bandwidth line. I take 2 operations from stream benchmarks with their arithmetic intensity
<nikunj97> and then I have their memory bandwidth as well
<nikunj97> so I multiply it with the corresponding memory bandwidth and compare with processor's peak
<heller1> No
<heller1> sorry, riot seems to be completely borked right now
<nikunj97> is it not the same as your function?
<nikunj97> I think it is. If I know 2 corresponding points of a line, I can plot the line itself. I have both the arithmetic intensity and its corresponding peak bw
<heller1> Use pen and paper and try to figure it out yourself
<nikunj97> ok, let me try
<heller1> ;)
<heller1> x is the unknown
<heller1> You have two functions that intersect, f(x) = a*x and g(x)= b
<hkaiser> gonidelis: we do a release every couple of month that contains all changes that have accumulated since the last release
<nikunj97> this would mean that the memory bandwidth line passes through the origin
<heller1> Basic math rules apply
<heller1> It absolutely does
<hkaiser> nikunj97: zero cores use zero memory bandwidth ;-)
<heller1> Why shouldn't it?
<nikunj97> hold on, x axis has arithmetic intensity, right?
<heller1> Yes
Abhishek09 has quit [Remote host closed the connection]
<heller1> Why shouldn't it?
<nikunj97> nothing, I was confused by your analogy. I got what you're saying now
<heller1> always remember, we use math as a way to express our models ;)
<heller1> HPX is not magic, it's C++, roofline is not witchcraft, it's math ;)
<heller1> anyways, zero could either mean that you have no flops or that your bytes got to infinity, both lead to the number of flops/s to be zero
weilewei has quit [Ping timeout: 240 seconds]
<nikunj97> yes
<heller1> so there's no reason why it should not start at zero
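For completeness, the intersection of the two pieces follows from basic algebra: peak_bw * x = peak_flops at the "ridge point" x = peak_flops / peak_bw. Workloads whose arithmetic intensity lies below that point are bandwidth-bound, those above it are compute-bound; the actual numbers come from steps #1 and #2.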
<nikunj97> heller1, we use stream triad benchmark
<nikunj97> we use its 2 FLOP/iter and 24 B/iter
<nikunj97> that gives us arithmetic intensity of 1/12 FLOP/B
<heller1> we used that to determine the maximum bandwidth
<heller1> but sure, you can use those values to validate your graph
<heller1> bonus point: where will the triad benchmark result sit at?
<nikunj97> at the intersection
<nikunj97> ?
<heller1> actually, it is 3 FLOPS
<heller1> well, try and see
<heller1> I am signing out now, have fun
<nikunj97> alright! I'll try to figure everything out in the meantime
<heller1> how about you show me your graphs tomorrow ;)?
<nikunj97> sure :D
<heller1> damn, yes, it is two
<heller1> my bad
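Working through the corrected numbers quoted in the log (2 FLOP and 24 B per triad iteration, roughly 39 GB/s measured TRIAD bandwidth): AI = 2/24 ≈ 0.083 FLOP/B, so the triad result should land on the sloped, bandwidth-bound part of the curve at about 0.083 x 39 ≈ 3.3 GFLOP/s, well to the left of the ridge point and well below the vectorized roofs (20.8 and 41.6 GFLOP/s). Treat these figures as a rough sanity check only, since the bandwidth was only quoted approximately.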
gonidelis has quit [Ping timeout: 240 seconds]
<hkaiser> Yorlik: yt?
<Yorlik> Ya
ct-clmsn has quit [Quit: Leaving]
<Yorlik> hkaiser: You extended /examples ? :)
<hkaiser> Yorlik: whatever feature you ask for, that requires me to fix something ;-)
<Yorlik> lol
<Yorlik> I am actually using your stuff - obviously in an unusual - non-scientific computing way. After all I'm just making a lousy gameserver. :)
<hkaiser> Yorlik: nah, this feature was broken at some point during our refactorings
<Yorlik> I'm happy to help :)
* Yorlik starts liking variant: https://godbolt.org/z/n55XYn
<hkaiser> (I turned it into a test as I needed it anyways)
<Yorlik> Now we need this + variant + filesystem in a release :)
<hkaiser> Yorlik: this will be in the next release
<Yorlik> Awesome !
<hkaiser> also, I'd expect you discover more of those...
<Yorlik> This is really fun. I mess up stuff, you fix it and I can use it. :)
<Yorlik> Feels like being a child again
<hkaiser> I'm not a candy shop however ;-)
<Yorlik> lol, no.
<Yorlik> Oh I see the test.
<Yorlik> Now one question - how would I set the task exit callback just once without testing it every time in my parloop?
<Yorlik> hkaiser: ^^
<Yorlik> Since these tasks get created automagically.
<hkaiser> right
<hkaiser> nice one
<Yorlik> Maybe parloop needs an extension.
<hkaiser> is that for_loop(par, ...)?
<Yorlik> Yes
<hkaiser> sec
<Yorlik> Don't say you can already do that?
<hkaiser> Yorlik: create an executor that handles the lua state
<Yorlik> Next release is 1.5?
<Yorlik> I'll work on that, once variant + this fix is together in a release
<Yorlik> Just reduced my vardata from 40 to 16 bytes by ditching strings
<hkaiser> well, I think we could have a special executor that allows to pass in a start and a exit function for each created thread
<hkaiser> you could use that one for your needs
<Yorlik> That would be incredibly useful
<hkaiser> yorliks_executor exec([](){ "start"; }, []() { "stop"; }); for_loop(exec, ...);
<Yorlik> I still have to learn about executors, but I guess it's not witchcraft. :)
<hkaiser> Yorlik: also since for_loop runs several iterations on a single HPX thread, that would reduce the number of lua states
<Yorlik> Absolutely
<Yorlik> I would just request the task-local lua state instead of a new one from the pool.
<hkaiser> I can try sketching that as an example ;-)
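A hedged sketch of the interim approach discussed above, before a dedicated executor exists: reuse one Lua state per underlying HPX thread inside the parallel loop body, leaning on the current_lua_state() helper sketched earlier. The namespaces follow the HPX 1.4-era API (hpx::parallel::for_loop, hpx::parallel::execution::par); adjust for the HPX version in use. game_object and run_update_script are hypothetical placeholders.

    #include <hpx/include/parallel_for_loop.hpp>

    #include <cstddef>
    #include <vector>

    struct game_object { int id; };                            // hypothetical placeholder
    struct lua_state_wrapper;
    lua_state_wrapper* current_lua_state();                    // from the earlier sketch
    void run_update_script(lua_state_wrapper*, game_object&);  // hypothetical

    void update_all(std::vector<game_object>& objects)
    {
        hpx::parallel::for_loop(hpx::parallel::execution::par,
            std::size_t(0), objects.size(),
            [&](std::size_t i)
            {
                // for_loop runs several consecutive iterations on the same HPX
                // thread, so the pool checkout is amortized over the whole chunk
                lua_state_wrapper* L = current_lua_state();
                run_update_script(L, objects[i]);
            });
    }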
nk__ has joined #ste||ar
<hkaiser> let's see what I will uncover this time
<Yorlik> :)
<Yorlik> Since I started to work on the Lua scripting API a whole new bunch of difficulties came up. Killed 2 races yesterday.
<Yorlik> It feels like a minefield and I have to move very carefully and slowly.
nikunj97 has quit [Ping timeout: 260 seconds]
diehlpk has quit [Remote host closed the connection]
diehlpk has joined #ste||ar
diehlpk has quit [Changing host]
diehlpk has joined #ste||ar
diehlpk has quit [Ping timeout: 246 seconds]
nikunj97 has joined #ste||ar
nk__ has quit [Ping timeout: 240 seconds]
diehlpk has joined #ste||ar
nk__ has joined #ste||ar
nikunj97 has quit [Ping timeout: 246 seconds]
diehlpk has quit [Ping timeout: 240 seconds]
karame78 has quit [Remote host closed the connection]
nan1 has quit [Ping timeout: 240 seconds]
diehlpk has joined #ste||ar
bita has quit [Ping timeout: 246 seconds]
diehlpk has quit [Ping timeout: 260 seconds]
shahrzad has joined #ste||ar
rtohid has left #ste||ar [#ste||ar]
shahrzad has quit [Ping timeout: 246 seconds]