hkaiser changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | | HPX: A cure for performance impaired parallel applications | | Buildbot: | Log: | GSoC:
nan11 has joined #ste||ar
<diehlpk_mobile[m> The student submission period is now over - and the numbers were much higher than ever before -- over 51,000 students registered for the program this year (a 65% increase from the previous high)! 6,335 students submitted their final proposals and applications for you all to review over these next couple of weeks (that's a 13% increase over last year).
<diehlpk_mobile[m> Wow
<hkaiser> diehlpk_mobile[m: do you know how many organizations they have?
<diehlpk_mobile[m> Hkaiser 200 and 30 novel ones
akheir1 has quit [Read error: Connection reset by peer]
<diehlpk_mobile[m> 200 in total and 30 out of them are novel
akheir1 has joined #ste||ar
<wate123_Jun> wow, but also not surprising because of this pandemic.
hkaiser has quit [Read error: Connection reset by peer]
weilewei has quit [Remote host closed the connection]
hkaiser has joined #ste||ar
<Yorlik> hkaiser: I think I have run into my first deadlocks
<Yorlik> It seems to depend on the upper and lower limit of the task limiting algorithm.
<Yorlik> What happens is, that all workers are in the yield_while
<Yorlik> Still investigating.
<hkaiser> nod, bad one
<Yorlik> I'm doing a bit of reading on the topic. Do you have any recommendation what to look for?
<Yorlik> I am thinking about modifying the yield_while predicate I'm using and have a chance to continue or add a time based release
<Yorlik> But I don't want to kill performance ofc.
<Yorlik> You know - I always want my free lunch and eat it too :)
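The time-based release Yorlik mentions can be sketched in plain Python as below. This is not HPX's `yield_while`: the function name `yield_while_with_timeout` and its parameters are hypothetical, and `time.sleep` only stands in for yielding to a scheduler.

```python
import time


def yield_while_with_timeout(predicate, timeout_s=None, pause_s=0.0):
    """Wait while predicate() is true, with an optional time-based release.

    Returns True if the predicate became false, False if the timeout
    released the waiter first. Hypothetical helper, not an HPX API.
    """
    deadline = None if timeout_s is None else time.monotonic() + timeout_s
    while predicate():
        if deadline is not None and time.monotonic() >= deadline:
            return False  # released by the timer; caller decides what to do
        time.sleep(pause_s)  # stand-in for yielding to other work
    return True
```

The return value lets the caller tell a normal wake-up from a forced release, so a forced release can be logged or counted while hunting for the deadlock.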
nan11 has quit [Ping timeout: 240 seconds]
wate123_Jun has quit [Remote host closed the connection]
wate123_Jun has joined #ste||ar
wate123_Jun has quit [Ping timeout: 246 seconds]
wate123_Jun has joined #ste||ar
wate123_Jun has quit [Remote host closed the connection]
wate123_Jun has joined #ste||ar
Vir has quit [Ping timeout: 256 seconds]
Vir has joined #ste||ar
Vir has quit [Changing host]
Vir has joined #ste||ar
weilewei has joined #ste||ar
<weilewei> do mentors need to review proposal from other organizations?
wate123_Jun has quit [Ping timeout: 240 seconds]
<hkaiser> weilewei: no
<weilewei> hkaiser ok, got it
diehlpk_work has quit [Remote host closed the connection]
wate123_Jun has joined #ste||ar
akheir1 has quit [Read error: Connection reset by peer]
akheir1 has joined #ste||ar
wate123_Jun has quit [Ping timeout: 252 seconds]
hkaiser has quit [Quit: bye]
akheir1 has quit [Read error: Connection reset by peer]
akheir1 has joined #ste||ar
wate123_Jun has joined #ste||ar
wate123_Jun has quit [Ping timeout: 252 seconds]
akheir1 has quit [Read error: Connection reset by peer]
akheir1 has joined #ste||ar
parsa has quit [Ping timeout: 252 seconds]
weilewei has quit [Remote host closed the connection]
wate123_Jun has joined #ste||ar
wate123_Jun has quit [Ping timeout: 252 seconds]
akheir1 has quit [Remote host closed the connection]
parsa has joined #ste||ar
wate123_Jun has joined #ste||ar
wate123_Jun has quit [Ping timeout: 240 seconds]
wate123_Jun has joined #ste||ar
wate123_Jun has quit [Ping timeout: 252 seconds]
wate123_Jun has joined #ste||ar
wate123_Jun has quit [Ping timeout: 252 seconds]
nikunj97 has joined #ste||ar
wate123_Jun has joined #ste||ar
wate123_Jun has quit [Ping timeout: 252 seconds]
wate123_Jun has joined #ste||ar
wate123_Jun has quit [Ping timeout: 252 seconds]
kale_ has joined #ste||ar
kale_ has quit [Ping timeout: 265 seconds]
kale_ has joined #ste||ar
kale_ has quit [Client Quit]
wate123_Jun has joined #ste||ar
wate123_Jun has quit [Ping timeout: 252 seconds]
<nikunj97> heller1, just confirming. If expected peak of double is x, then expected peak of float will be 2x. Right?
<heller1> depends on the architecture
<nikunj97> I see. My expected performance for HiSilicon1616 is way off the observed results
<nikunj97> heller1, I've sent you the drive link to the initial benchmark results I've obtained
<nikunj97> not the best looking graph, but I wanted to know your opinions on it
<nikunj97> I'm considering 2 loads and 1 store for calculating peak expected performance
<heller1> Where is the roofline in those graphs?
wate123_Jun has joined #ste||ar
<nikunj97> roofline is the expected peak
<nikunj97> Since the problem is memory bound, I calculated the peak expected value from the values obtained from roofline and plotted that as expected peak performance
<nikunj97> would that not suffice?
<heller1> sorry, I have no idea what I am looking at
<nikunj97> aah, they're that bad damn. Let me try to beautify them for you
<heller1> no, the 'expected' looks very wrong
<nikunj97> how so?
<heller1> I think
<heller1> for the hisilicon, what's that drop at 24 cores?
wate123_Jun has quit [Ping timeout: 240 seconds]
<nikunj97> I calculated expected as follows: 1. Calculate peak bandwidth (from stream triad) for that many cores, 2. Calculate the peak performance considering 2 loads and 1 store i.e. 1 MLUP per 24 Byte
<heller1> it should be a line that's monotonically increasing and eventually converging to the value of AI*peak_bw
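The expected curve described here can be sketched as follows. It assumes one lattice update moves 24 bytes (2 loads and 1 store of 8-byte doubles, as stated above). The function names and any numbers passed in are illustrative, not measurements from this discussion.

```python
def expected_mlups(bandwidth_gb_s, bytes_per_update=24.0):
    """Memory-bound ceiling in MLUP/s for a given bandwidth in GB/s.

    bytes_per_update=24 assumes 2 loads + 1 store of doubles per update.
    """
    updates_per_second = bandwidth_gb_s * 1e9 / bytes_per_update
    return updates_per_second / 1e6


def roofline_gflops(arithmetic_intensity, peak_gflops, peak_bw_gb_s):
    """Classic roofline: the lower of the compute and the memory ceiling.

    arithmetic_intensity is in FLOP/byte, so AI * peak_bw is in GFLOP/s.
    """
    return min(peak_gflops, arithmetic_intensity * peak_bw_gb_s)
```

If the measured bandwidth per core count is monotonically increasing, feeding it into `expected_mlups` yields a monotonically increasing expected curve as well; a dip in the expected curve points to a dip in the bandwidth measurement.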
<nikunj97> HiSilicon had a drop in memory bandwidth going from 16 to 24 cores
<heller1> that's expected
<heller1> look again at its architecture
<nikunj97> yes, so you see a drop in performance across the board as well
<heller1> no, you just measured wrong
<heller1> or interpreted the result wrongly
<nikunj97> also the cores go up till 64 cores (with hyperthreading)
<heller1> remember that I told you to watch out for SMT and NUMA stuff?
<nikunj97> hartmut asked me to measure only till 32, without the hyperthreading stuff
<heller1> 2 threads on a single core (with hyperthreading) isn't the same as 2 threads on two different cores
<nikunj97> yes, I remember that
<nikunj97> should I use something like numactl then?
<heller1> did you pin the openmp threads for the stream benchmark?
<nikunj97> yes, I did
<nikunj97> you mean this: export OMP_NUM_THREADS=4, right?
<heller1> no
<nikunj97> crap
<nikunj97> so I use numactl to pin cores and then run everything again?
<heller1> HPX does what you want here. OpenMP not so much, at least not the gnu implementation
<heller1> or hwloc-bind
<heller1> which is a tad easier, since you can visualize the topology nicely
<simbergm> `OMP_PROC_BIND=true` (possibly together with `OMP_PLACES=cores`) might be what you're looking for
<nikunj97> stream used that in FAQ, so I thought, you meant that by thread pinning
<nikunj97> simbergm, thanks!
<nikunj97> heller1, how should I proceed then?
<nikunj97> what would you suggest?
<heller1> redo your measurements with the correct pinning ;)
Abhishek09 has joined #ste||ar
<nikunj97> using hwloc-bind, alright :/
<simbergm> nikunj97: if you're using slurm to set the cores, you can get the same bindings in openmp and hpx by using `--hpx:use-process-mask` (openmp will automatically use the mask)
<heller1> choose whichever method you think fits best
<heller1> ms[m]: even the GNU implementation?
<simbergm> heller: it might not
<simbergm> not sure about the differences between the implementations, some will at least
<heller1> yeah ...
Hashmi has joined #ste||ar
<simbergm> doing it manually is the surest way of course
<nikunj97> btw what if I do srun -c <num_cores> --threads-per-core=1?
<nikunj97> will that also be similar to thread pinning?
<heller1> depends on 1) How slurm is configured 2) What your OpenMP implementation does
<nikunj97> could you elaborate?
<heller1> slurm and OpenMP do interact sometimes, sometimes not
<nikunj97> will it work for HPX?
<heller1> huge option space there. Best consult the docs of your cluster
<nikunj97> if I run my benchmarks with that, I mean
<heller1> yes, HPX always does the right thing (tm)
<heller1> well
<heller1> check it
<heller1> --hpx:print-bind is your friend
<heller1> lstopo is your friend
<heller1> the manual of your OpenMP implementation is your friend, as well as the manual of your cluster ;)
<nikunj97> will have to look at the manual somewhere, all I have right now is a specification of the cluster
<heller1> what you have to ensure is that you always measure the same thing.
<heller1> and be sure to know what you measure
<nikunj97> simbergm, if I do slurm ... -c 20 ./hpx_executable --hpx:user-process-mask. Will it make sure to bind only to slurm allocated cores? Is that what you meant?
<heller1> nikunj97: check with `--hpx:print-bind`
<simbergm> nikunj97: yeah, that's what it's supposed to do (it's *use*-process-mask, not user btw)
<simbergm> and you can always check with print-bind like heller suggested
<simbergm> it might depend on how slurm is configured as well, but as far as I know it'll always set a mask for the process
<nikunj97> simbergm, heller1 thanks! I'll try doing it this way and check
wate123_Jun has joined #ste||ar
wate123_Jun has quit [Ping timeout: 252 seconds]
<simbergm> mdiers I forgot to mention yesterday: 4306 was cherry picked into the 1.4.1 release, so those changes should already be there
<simbergm> in any case I've just updated the pr so hopefully we can get it in this week
wate123_Jun has joined #ste||ar
<mdiers[m]> <simbergm " mdiers I forgot to mention yes"> the cherry picks are already in master?
<simbergm> mdiers: good question, apparently not
<simbergm> they are in 1.4.1 though
<simbergm> normally we'd merge release branches back into master, but I think 1.4.1 might not have gotten that treatment
<simbergm> in that case you'll still have to wait for the PR... sorry for the confusion!
<mdiers[m]> ms: all right, no problem ;-)
wate123_Jun has quit [Ping timeout: 252 seconds]
nikunj97 has quit [Remote host closed the connection]
nikunj97 has joined #ste||ar
<nikunj97> heller1, just realized that hisilicon one is incomplete. That's why you don't see it hit the peak
<nikunj97> it has 64 cores and 64 threads and I plotted only till 32 cores
Abhishek09 has quit [Remote host closed the connection]
<nikunj97> also with xeon, afaik --hpx:threads=x uses the first x cores of the machine. So the xeon graphs should've used the correct cores as well (since the hyperthreads are paired as 0,21; 1,22; and so on)
wate123_Jun has joined #ste||ar
<nikunj97> *0,20;1,21; and so on
wate123_Jun has quit [Ping timeout: 252 seconds]
<heller1> yes, as said, HPX should be fine. I am concerned about your OMP comparisons
<nikunj97> yea, I'll do those peak bandwidth calculations again
<nikunj97> with numactl
<heller1> never used that
<nikunj97> it's pretty easy, `numactl --localalloc --physcpubind=0-20` would mean that cores 0 to 20 are used and each of them is bound to their specific node
<nikunj97> that's what JSC people asked me to use for thread pinning
wate123_Jun has joined #ste||ar
wate123_Jun has quit [Ping timeout: 252 seconds]
Hashmi has quit [Quit: Connection closed for inactivity]
Hashmi has joined #ste||ar
wate123_Jun has joined #ste||ar
wate123_Jun has quit [Ping timeout: 252 seconds]
wate123_Jun has joined #ste||ar
wate123_Jun has quit [Ping timeout: 240 seconds]
hkaiser has joined #ste||ar
wate123_Jun has joined #ste||ar
wate123_Jun has quit [Ping timeout: 252 seconds]
<nikunj97> heller1, what did you think was wrong with xeon e5 graphs?
<nikunj97> I have tried with both numactl and hwloc-bind and the updated results look skewed. My runs are running better than the expected peak.
<nikunj97> other core performance for hisilicon are still running. Will let you know the final results when they're done
<nikunj97> heller1, I've updated the e5 graphs with the new ones. There's not much of a difference, except for the benchmark running faster than the 5 core peak
wate123_Jun has joined #ste||ar
wate123_Jun has quit [Ping timeout: 252 seconds]
<nikunj97> heller1, I also added the rooflines for hisilicon1616 and xeon e5 as well (for double precision peak performance for CPU)
<nikunj97> I think arithmetic intensity of 1/8 (assuming 3 loads and 1 store) is not apt for the calculations. I don't like seeing values above peak expected performance
wate123_Jun has joined #ste||ar
hkaiser has quit [Quit: bye]
wate123_Jun has quit [Ping timeout: 252 seconds]
weilewei has joined #ste||ar
Hashmi has quit [Quit: Connection closed for inactivity]
diehlpk_work has joined #ste||ar
wate123_Jun has joined #ste||ar
wate123_Jun has quit [Ping timeout: 240 seconds]
wate123_Jun has joined #ste||ar
akheir has joined #ste||ar
<nikunj97> heller1, one thing that I noticed from your paper. You decompose the stencil to blocks, while I'm decomposing into lines
<nikunj97> I'm thinking of taking this route as well. This way I can work on evenly sized stencils (i.e. x and y dimensions do not differ by a large factor)
nikunj97 has quit [Ping timeout: 252 seconds]
nikunj97 has joined #ste||ar
Hashmi has joined #ste||ar
akheir1 has joined #ste||ar
<heller1> Yes, blocking is the way to go. Just makes it significantly harder in distributed
hkaiser has joined #ste||ar
akheir has quit [Read error: Connection reset by peer]
<heller1> Well, did you plot the values in the normal roofline?
<Yorlik> o/
<heller1> I'm just very confused about the presentation of your expected performance
<Yorlik> hkaiser: I ended up writing my own primitive version of a yield_while:
<Yorlik> It's like a lightweight selfmade fake scheduler with limited scope (parloop chunks)
<Yorlik> But so far no more deadlocks.
nikunj97 has quit [Ping timeout: 260 seconds]
<hkaiser> Yorlik: sure, even if you use HPX that doesn't prevent you from writing your own facilities
<Yorlik> It all comes over time.
<Yorlik> Can't learn everything at once. :)
<hkaiser> simbergm: ?
nan11 has joined #ste||ar
<simbergm> hkaiser: ??
<hkaiser> hey
<hkaiser> #4485
<hkaiser> you propose to move the function serialization to the functional module
<simbergm> yeah
<simbergm> yes, not sure about that yet... hence the draft
<hkaiser> I'm confused - when creating the functional module we intentionally left that out, didn't we?
<simbergm> however, since you made serialization more lightweight it wouldn't be too bad
<hkaiser> simbergm: I don't remember what we discussed
<hkaiser> did we want to make essentially all modules depend on serialization or make serialization depend on all modules
<simbergm> we have made other things depend directly on serialization now as well, instead of having the serialization separately
<hkaiser> iirc, we wanted for serialization to provide the infrastructure needed and serialization support for ambient data types (from the std library, etc.)
<simbergm> that was also before you had made the serialization module
<simbergm> it would still be nice to have it separate, but there are too many other things that still have serialization implemented intrusively that it doesn't make sense to have them in a separate module at the moment
<hkaiser> simbergm: I'm not saying what you propose is wrong, I'm trying to discuss to find a uniform solution
<hkaiser> so I think we can agree that the serialization module itself should have serialization support for ambient types
<simbergm> hkaiser: yeah, didn't take it that way
<simbergm> yep, agreed on that
<simbergm> and that's pretty much the way it is right now
<hkaiser> std:: and possibly boost::
<simbergm> yep
<simbergm> external dependencies essentially
<hkaiser> and serialization support for our own types goes into their respective module
<simbergm> we have a lot of functionality that still has intrusive serialization, those would have to be separated out as well for this whole concept of separate serialization to make sense
<hkaiser> I'm not sure I'd like that
<simbergm> in that case having function serialization separate never made sense either :P
<hkaiser> so yes, I think moving the function serialization into the functional module is a correct move
<simbergm> the original reason for separating function serialization (it was the first one I did it for) was that serialization was a heavy dependency
<hkaiser> yah, not sure anymore why we did this (same with any, btw)
<simbergm> since it's lightweight now it makes much less sense to have it in a separate module
<hkaiser> right
<hkaiser> ok, count me in, I think this is sensible
<simbergm> well, the benefit of having serialization completely separate is that it would allow building hpx with serialization, but consumers not having to pull in any serialization if they don't want to
<simbergm> but that is a lot of effort for not too much gain at the moment
<simbergm> there are more important things before that
<simbergm> so I think having it in the functional module makes most sense at least right now
<hkaiser> nod, I agree
nikunj97 has joined #ste||ar
<simbergm> did you see my reply on the async modules? does what I wrote make sense?
<simbergm> actually, jbjnr yt? do you think it would be feasible to not use dataflow in the guided_pool_executor?
<simbergm> we can of course both get rid of dataflow in that executor and have some sort of base module for async/apply/dataflow
bita has joined #ste||ar
<nikunj97> heller1, I did not plot it. Let me do that now and see how it looks
<hkaiser> simbergm: in a meeting now, will get back to you
<simbergm> hkaiser: np
<hkaiser> simbergm: ok, I'm back
<hkaiser> simbergm: I added a comment to the ticket
<hkaiser> we can break the circular dependencies by separating the executor related specializations of the dataflow implementation
<hkaiser> similar to what you've done for hpx::future
gonidelis has joined #ste||ar
<hkaiser> simbergm: hmm, I was sure I did add a comment on the ticket, but apparently I didn't :/
rtohid has joined #ste||ar
<bita> hkaiser, I set two localities for this, and I get the "thread pool is not running" error
<hkaiser> bita: ok, I'll have a look, thanks
<hkaiser> bita: running on 2 localities?
<bita> This test should fail throwing another exception, because cannon_product does not work on two localities.
<bita> yes
<hkaiser> ok
<bita> If you like to run it, I can make another test and create an issue. should I do it?
<hkaiser> this test is fine for now, thanks
<bita> got it :)
<hkaiser> but please feel free to create an issue
<bita> Okay :)
Hashmi has quit [Quit: Connection closed for inactivity]
<bita> hkaiser, I made the example using dot_d and it worked. So, I think the problem is setting 2 localities for cannon. I tested with cannon on 4 localities and it worked too
<hkaiser> ok
<bita> so the problem is we always get the "thread pool is not running" exception instead of the real exception. The good news is everything else is working
<bita> I will create an issue if you think it makes sense to you. however, fixing this should not be the priority
<hkaiser> I'd like to find out what's wrong - it should work
<bita> of course
Hashmi has joined #ste||ar
<hkaiser> bita: btw, do you think it would be sensible to make the last two arguments to random_d and constant_d optional (the current locality and the number of localities)?
<hkaiser> those could be initialized from the current HPX locality, if needed
<bita> I am confused, I think the user should decide how many localities she needs for her generated new array (of constants or randoms)
<hkaiser> bita: yes, of course
<hkaiser> bita: but most of the time the number of localities to use is the same as the number of localities the HPX application runs on
<hkaiser> so it could be derived at runtime as default values
<bita> Okay, I get it
<bita> I will work on that after getting perftests done
<nikunj97> heller1, I updated the x86 roofline with 2d stencil performance recorded
<nikunj97> you can check it in the drive link
<nikunj97> 15 and 20 core performance is not so great compared to 5 and 10 core ones
<nikunj97> btw the files are named as x86_64_<data_type>.png, where data_type denotes the performance of the stencil with that type
weilewei has quit [Remote host closed the connection]
<nikunj97> heller1, just updated the hisilicon runs as well
<simbergm> hkaiser: yep, I think how to get rid of the dependency is clear (let's see tomorrow...)
<simbergm> I was more thinking about how to structure the rest of the files
<simbergm> I don't think it makes much sense to only move the dataflow executor specializations to the execution module, I would in that case also move them for async, apply, and sync
<simbergm> which doesn't leave much in the local_async module...
<simbergm> but I would perhaps add the slightly silly async_base module with the bare minimum, move the specializations to the execution module
<simbergm> and maybe still have a local_async module that just pulls in all of that, but not sure about this yet
gonidelis has quit [Remote host closed the connection]
<simbergm> if I get rid of the local_async module a user can also include just say hpx/{executors,execution}/async.hpp for the local only versions
<hkaiser> simbergm: there is much more than the executor specializations for async and friends in local_async
<simbergm> I guess the only reason I don't like that naming is because it breaks the symmetry with the async module... but then that could be called distributed_execution
<hkaiser> but yah, all of this applies to async, apply, dataflow, future etc
<simbergm> hkaiser: there's the launch policy and executor specializations
<hkaiser> launch policy can stay in local_async, no?
gonidelis has joined #ste||ar
<simbergm> yeah, it can
<simbergm> wait, no, it can't
nan11 has quit [Remote host closed the connection]
<simbergm> that's the problem
<hkaiser> and the base templates for the specializations too? or should those be somewhere else?
<simbergm> at least the way dataflow is used currently in guided_pool_executor
<simbergm> guided_pool_executor uses the launch policy specializations
ibalampanis has joined #ste||ar
<simbergm> but maybe it can just specify an explicit executor instead
<hkaiser> well that executor could directly dispatch to something like async_launch_policy_dispatch
<hkaiser> that one was created for similar reasons, to break the dependency of the parallel executor on async
<simbergm> yep, that's what I would like it to do
<simbergm> I mean, we managed to get rid of dataflow elsewhere
<hkaiser> just dataflow_dispatch or similar
<ibalampanis> Hello to everyone! Have a nice day!
<simbergm> right
<simbergm> hi ibalampanis!
<simbergm> I'll try to talk to jbjnr tomorrow (latest at the meeting)
<ibalampanis> hkaiser: I 'd like to ask you, how many proposals for GSoC have you received?
<simbergm> I think if it can be implemented without dataflow that'd be nice
<simbergm> but otherwise I'll do some reshuffling
<simbergm> I'd move the base templates into execution then as well, or into a completely separate module
<simbergm> ibalampanis: about 15 I think
<ibalampanis> Interesting! Good luck to everyone!
<hkaiser> simbergm: nah, let's extract the dataflow engine
<simbergm> hkaiser: what do you mean by that exactly?
<hkaiser> or hide it in a template specialization such that we need to expose an unimplemented base template only, same as for future
<simbergm> expose to what? the guided_pool_executor will need the specialization as well
<simbergm> we might be thinking about the same thing, I'm just not communicating it well (I hope...)
<simbergm> the base templates can be in execution or in a dependency of execution
<simbergm> the executor specializations can be in execution
<hkaiser> nod, I see what you mean
<hkaiser> but local_async and friends could depend on the specializations that do not depend on executors
<hkaiser> so the executors can depend on local_async
<hkaiser> I'd consider local_async to be fairly low level
<simbergm> ok, we might have different ideas about what local_async should contain
<simbergm> I consider it a user facing module
<simbergm> but the only other specialization is the launch policy specialization and that one depends even on a concrete executor, not just execution traits
nan11 has joined #ste||ar
akheir1_ has joined #ste||ar
<simbergm> I was just calling it async_base :P
<hkaiser> simbergm: sure
<hkaiser> sounds like a plan
<hkaiser> async_base is fine as it would hold all base templates for async, dataflow, even future
<simbergm> I'll put something together tomorrow and then we can discuss at the meeting
<hkaiser> bita: the exception is thrown because of bad exception handling for exceptions that escape hpx_main()
akheir1 has quit [Ping timeout: 256 seconds]
<hkaiser> the reason is that cannon requires a perfect square of tiles, but your example has only 1*2 tiles
<hkaiser> that also explains why it works for 4 localities
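The perfect-square requirement can be illustrated with a small check. This is a sketch only, not the actual validation code in cannon_product; the helper name `is_perfect_square` is hypothetical.

```python
import math


def is_perfect_square(num_tiles):
    """True if num_tiles can be arranged as an n x n grid of tiles."""
    if num_tiles < 1:
        return False
    root = math.isqrt(num_tiles)  # integer square root, Python 3.8+
    return root * root == num_tiles
```

With one tile per locality, 2 localities fail this check while 4 localities pass it, which matches the behaviour reported above.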
<simbergm> hkaiser: exactly, didn't even think about future_base, but that could make sense as well
<simbergm> I think we might go for a separate future_base module as there's a bit more code there, but it could go either way
<hkaiser> sure
<hkaiser> but the ideas should be the same
<hkaiser> bita: see #4487 (HPX)
<bita> hkaiser, I thought I would get "All tiles in the tile row/column do not have equal height/width" but I was mistaken. We don't have an exception for not being a perfect square
<hkaiser> bita: we do, it was just not properly propagated to you
<hkaiser> #4487 fixes that
<bita> got it, thank you
rtohid has left #ste||ar [#ste||ar]
nan11 has quit [Remote host closed the connection]
nan11 has joined #ste||ar
ibalampanis has quit [Remote host closed the connection]
weilewei has joined #ste||ar
rtohid has joined #ste||ar
rtohid has quit [Remote host closed the connection]
rtohid has joined #ste||ar
nk__ has joined #ste||ar
nikunj97 has quit [Ping timeout: 256 seconds]
akheir1 has joined #ste||ar
akheir1_ has quit [Ping timeout: 240 seconds]
nk__ has quit [Read error: Connection reset by peer]
nikunj97 has joined #ste||ar
nan11 has quit [Ping timeout: 240 seconds]
<nikunj97> heller1, interesting way to use a proxy vector<shared_future<void>> for dataflow!
<heller1> ?
<nikunj97> in your paper, you use vector of shared future
<nikunj97> and create dependencies with it, while working on the actual grid
<nikunj97> I find that a creative solution. I was thinking of ways to create a dataflow but couldn't think of anything other than vector<shared_future<vector> >
<heller1> Yeah
<heller1> Cool results
<nikunj97> you understood the graphs?
<heller1> Well hkaiser followed a different philosophy with the 1d stencil examples in the hpx repo
<nikunj97> he uses vector < shared_future < partition_data > > in 1d stencil
<heller1> Yes
<heller1> I like the "traditional" roofline ones most
<nikunj97> and partition_data is a vector again
<heller1> Yes, which is what you said
<nikunj97> yes
<nikunj97> gotcha, will send you plots wrt rooflines next time
<heller1> Yes
<heller1> Anyways, the x86 graphs look good as well
<nikunj97> what do you think of the results btw? I don't understand why it would reduce post a certain core count instead of keeping consistent?
<hkaiser> nikunj97: where can I see those graphs?
<nikunj97> hkaiser, wait let me send you the drive link
<nikunj97> hkaiser, see pm for the link
<heller1> And as expected
<hkaiser> got it, thanks
<heller1> The arm ones still look odd with those drops in them, those shouldn't be there
<heller1> Could be a bug in the binding code
<nikunj97> could be
<heller1> Or with how the os reports the topology
<heller1> Can you share the output of lstopo on one of those arm nodes please?
<nikunj97> their arm node doesn't have hyperthreading though. So that's the performance you see
<nikunj97> sure, a sec
<nikunj97> heller1, you'd like a photo or console output?
<heller1> nikunj97: doesn't matter
<heller1> nikunj97: also, the output when you run the application with 64 cores and --hpx:print-bind
<heller1> nikunj97: it's especially the measurement at 24 cores which puts me off
<nikunj97> the 24 core example is running. hpx:print-bind prints only PUs 0-23
<heller1> nikunj97: regarding the quality of the results: Those are as expected. you nicely observe that you hit the memory bottleneck (when the curve flattens), once that's happening, you mostly observe the effect of too little parallelism
<heller1> nikunj97: ugh, this thing has 4 numa nodes?
<heller1> that sucks
<heller1> and explains everything ;)
<nikunj97> ohh did I not tell you that?
<nikunj97> I should've mentioned
<heller1> nope
<nikunj97> that was the reason why we were seeing peaks at 16 and 32
<nikunj97> and a dip at 24
<heller1> can you confirm that this is the one that you are using?
<heller1> yeah, more or less
<heller1> you should see another dip at 48 cores
<heller1> so what you need to do is to fill one NUMA domain first, and then just fill the next ones entirely
<heller1> or go 1 core per domain, 2 per domain etc.
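The two fill orders suggested here can be sketched as core lists. This assumes 4 NUMA domains of 16 cores each with contiguous core numbering per domain. That layout is an assumption and should be checked against `lstopo` on the actual node. All function names are hypothetical.

```python
from collections import Counter


def fill_domains_first(n_cores):
    """Cores 0..n-1: fills domain 0 completely, then domain 1, and so on
    (only true if cores are numbered contiguously per domain)."""
    return list(range(n_cores))


def round_robin_domains(n_cores, n_domains=4, cores_per_domain=16):
    """1 core per domain, then 2 per domain, etc."""
    cores = []
    for i in range(n_cores):
        domain = i % n_domains   # which domain gets the next core
        slot = i // n_domains    # position of that core inside the domain
        cores.append(domain * cores_per_domain + slot)
    return cores


def cores_used_per_domain(cores, cores_per_domain=16):
    """Count how many of the selected cores fall into each domain."""
    return dict(Counter(core // cores_per_domain for core in cores))
```

Under these assumptions, `cores_used_per_domain(fill_domains_first(24))` gives 16 cores in domain 0 and 8 in domain 1, so one domain is full and the next only half used. Multiples of 16 use every touched domain completely.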
<nikunj97> I see, I finally know nicely how to use hwloc-bind. I can use that
wate123_Jun has quit [Remote host closed the connection]
<heller1> yes
wate123_Jun has joined #ste||ar
<heller1> for HPX, you need something similar, but a tad different
<heller1> but I am sure you'll figure it out
<nikunj97> I see
<nikunj97> I'll try to figure it out
<heller1> so your task now is to explain why that drop is happening and why there's a peak at those specific core counts
<nikunj97> btw hwloc-bind with --hpx:use-process-mask works flawlessly
<heller1> ok
<nikunj97> so that's one solution for HPX
<heller1> whatever floats your boat ;)
<nikunj97> :D
wate123_Jun has quit [Ping timeout: 240 seconds]
<nikunj97> heller1, cross NUMA traffic is all I can think of
<nikunj97> basically 1 NUMA is completely filled and other is only half filled
<heller1> well, convince me with data
<nikunj97> heller1, sure :)
<nikunj97> I'm working on a futurized version now
<nikunj97> I feel that may provide minor optimization
<heller1> yes
<heller1> spoiler alert: it is not cross numa traffic.
gonidelis has quit [Remote host closed the connection]
<heller1> has all the hints you need, do the measurements I suggested, maybe even find a tool that allows you to read out some hardware performance counters that might give you a hint here
<nikunj97> I wouldn't believe memory bandwidth coz that was either consistent or increasing (or very minor drops)
<heller1> well, try and see
<nikunj97> will do!
<heller1> wee, nice number of proposals
<heller1> hkaiser: diehlpk_mobile jbjnr ms are we having a call about the selection of the gsoc proposals next week?
nikunj97 has quit [Quit: Leaving]
<hkaiser> heller1: might be a good idea
<hkaiser> heller1: also we have the pmc call tomorrow
<heller1> yeah, I know
<heller1> I am not sure I can make the PMC call tomorrow
<heller1> I am on forced vacation until the 8th. and my wife is working tomorrow afternoon
<hkaiser> heller1: ok
<heller1> I'll try though
wate123_Jun has joined #ste||ar
wate123_Jun has quit [Ping timeout: 252 seconds]
wate123_Jun has joined #ste||ar
weilewei has quit [Remote host closed the connection]
rtohid has quit [Remote host closed the connection]
rtohid has joined #ste||ar
weilewei has joined #ste||ar
wate123_Jun has quit [Remote host closed the connection]
wate123_Jun has joined #ste||ar
bita has quit [Read error: Connection reset by peer]
rtohid has quit [Remote host closed the connection]
akheir1 has quit [Quit: Leaving]
bita has joined #ste||ar
Hashmi has quit [Quit: Connection closed for inactivity]
wate123_Jun has quit [Remote host closed the connection]
wate123_Jun has joined #ste||ar
wate123_Jun has quit [Ping timeout: 240 seconds]
bita has quit [Ping timeout: 240 seconds]
wate123_Jun has joined #ste||ar
nikunj has quit [Ping timeout: 252 seconds]
nikunj has joined #ste||ar
nan11 has joined #ste||ar
wate123_Jun has quit [Ping timeout: 252 seconds]
wate123_Jun has joined #ste||ar
wate123_Jun has quit [Ping timeout: 252 seconds]