<diehlpk_mobile[m>
The student submission period is now over, and the numbers were much higher than ever before: over 51,000 students registered for the program this year (a 65% increase from the previous high)! 6,335 students submitted their final proposals and applications for you all to review over these next couple of weeks (that's a 13% increase over last year).
<diehlpk_mobile[m>
Wow
<hkaiser>
diehlpk_mobile[m: do you know how many organizations they have?
<diehlpk_mobile[m>
Hkaiser 200 and 30 novel ones
akheir1 has quit [Read error: Connection reset by peer]
<diehlpk_mobile[m>
200 in total and 30 out of them are novel
akheir1 has joined #ste||ar
<wate123_Jun>
wow, but also not surprising because of this pandemic.
<Yorlik>
hkaiser: I think I have run into my first deadlocks
<Yorlik>
It seems to depend on the upper and lower limit of the task limiting algorithm.
<Yorlik>
What happens is that all the workers are stuck in the yield_while
<Yorlik>
Still investigating.
<hkaiser>
nod, bad one
<Yorlik>
I'm doing a bit of reading on the topic. Do you have any recommendation what to look for?
<Yorlik>
I am thinking about modifying the yield_while predicate I'm using so it gets a chance to continue, or adding a time-based release
<Yorlik>
But I don't want to kill performance ofc.
<Yorlik>
You know - I always want my free lunch and eat it too :)
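As an aside, a minimal sketch of the kind of time-based release Yorlik mentions, assuming hpx::this_thread::yield() and a caller-supplied predicate (the helper and its name are illustrative, not Yorlik's actual code):

```cpp
#include <hpx/thread.hpp>    // for hpx::this_thread::yield(); header location varies between HPX versions

#include <chrono>

// Spin on a predicate the way yield_while does, but give up after a deadline so
// that a predicate which never becomes false cannot park all workers forever.
// Returns true if the predicate cleared, false if the timeout was hit.
template <typename Predicate>
bool yield_while_with_timeout(Predicate&& pred, std::chrono::milliseconds timeout)
{
    auto const deadline = std::chrono::steady_clock::now() + timeout;
    while (pred())
    {
        if (std::chrono::steady_clock::now() >= deadline)
            return false;               // time-based release: stop waiting
        hpx::this_thread::yield();      // let other HPX tasks run on this worker
    }
    return true;
}
```

Whether the caller retries, falls back to a blocking wait, or reports an error on timeout is then a separate policy decision.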
<weilewei>
do mentors need to review proposal from other organizations?
wate123_Jun has quit [Ping timeout: 240 seconds]
<hkaiser>
weilewei: no
<weilewei>
hkaiser ok, got it
<nikunj97>
heller1, just confirming. If expected peak of double is x, then expected peak of float will be 2x. Right?
<heller1>
depends on the architecture
<nikunj97>
I see. My expected performance numbers for the HiSilicon 1616 are way off from the observed results
<nikunj97>
heller1, I've sent you the drive link to the initial benchmark results I've obtained
<nikunj97>
not the best looking graph, but I wanted to know your opinions on it
<nikunj97>
I'm considering 2 loads and 1 store for calculating peak expected performance
<heller1>
Where is the roofline in those graphs?
wate123_Jun has joined #ste||ar
<nikunj97>
roofline is the expected peak
<nikunj97>
Since the problem is memory-bound, I calculated the expected peak value from the roofline and plotted that as the expected peak performance
<nikunj97>
would that not suffice?
<heller1>
sorry, I have no idea what I am looking at
<nikunj97>
aah, damn, they're that bad. Let me try to beautify them for you
<heller1>
no, the 'expected' looks very wrong
<nikunj97>
how so?
<heller1>
I think
<heller1>
for the hisilicon, what's that drop at 24 cores?
wate123_Jun has quit [Ping timeout: 240 seconds]
<nikunj97>
I calculated the expected peak as follows: 1. calculate the peak bandwidth (from the STREAM triad) for that many cores, 2. calculate the peak performance assuming 2 loads and 1 store, i.e. 1 LUP per 24 bytes
<heller1>
it should be a line that's monotonically increasing and eventually converging to the value of AI*peak_bw
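For reference, the bound heller1 describes can be written out directly; a small sketch with placeholder numbers (the bandwidth, FLOP count, and FP peak below are made up, not nikunj97's measurements):

```cpp
#include <algorithm>
#include <cstdio>

int main()
{
    // Memory-bound estimate for a double-precision stencil update:
    // 2 loads + 1 store of 8-byte doubles = 24 bytes per lattice update (LUP).
    double const bytes_per_lup = 3.0 * sizeof(double);   // 24 bytes
    double const stream_bw_gbs = 100.0;                  // placeholder: measured STREAM triad bandwidth [GB/s]
    double const flops_per_lup = 4.0;                    // placeholder: FLOPs per lattice update
    double const peak_gflops   = 500.0;                  // placeholder: machine FP peak [GFLOP/s]

    // Classic roofline: P = min(peak, AI * BW), with AI in FLOP/byte.
    double const ai = flops_per_lup / bytes_per_lup;
    double const roofline_gflops = std::min(peak_gflops, ai * stream_bw_gbs);

    // Expressed in lattice updates per second (what the graphs plot): if the
    // kernel is purely bandwidth-bound, the curve converges to BW / 24 bytes.
    double const expected_glups = stream_bw_gbs / bytes_per_lup;

    std::printf("AI = %.3f FLOP/byte, roofline = %.1f GFLOP/s, expected %.2f GLUP/s\n",
        ai, roofline_gflops, expected_glups);
}
```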
<nikunj97>
HiSilicon had a drop in memory bandwidth going from 16 to 24 cores
<heller1>
that's expected
<heller1>
look again at its architecture
<nikunj97>
yes, so you see a drop in performance across the board as well
<heller1>
no, you just measured wrong
<heller1>
or interpreted the result wrongly
<nikunj97>
also, the core count goes up to 64 (with hyperthreading)
<heller1>
remember that I told you to watch out for SMT and NUMA stuff?
<nikunj97>
Hartmut asked me to measure up to 32 if you don't count the hyperthreading stuff
<heller1>
2 threads on a single core (with hyperthreading) isn't the same as 2 threads on two different cores
<nikunj97>
yes, I remember that
<nikunj97>
should I use something like numactl then?
<heller1>
did you pin the openmp threads for the stream benchmark?
<nikunj97>
yes, I did
<nikunj97>
you mean this: export OMP_NUM_THREADS=4, right?
<heller1>
no
<nikunj97>
crap
<nikunj97>
so I use numactl to pin cores and then run everything again?
<heller1>
HPX does what you want here. OpenMP not so much, at least not the gnu implementation
<heller1>
or hwloc-bind
<heller1>
which is a tad easier, since you can visualize the topology nicely
<simbergm>
`OMP_PROC_BIND=thread` might be what you're looking for
<nikunj97>
the STREAM FAQ used that, so I thought that's what you meant by thread pinning
<nikunj97>
simbergm, thanks!
<nikunj97>
heller1, how should I proceed then?
<nikunj97>
what would you suggest?
<heller1>
redo your measurements with the correct pinning ;)
Abhishek09 has joined #ste||ar
<nikunj97>
using hwloc-bind, alright :/
<simbergm>
nikunj97: if you're using slurm to set the cores, you can get the same bindings in openmp and hpx by using `--hpx:use-process-mask` (openmp will automatically use the mask)
<heller1>
choose whichever method you think fits best
<heller1>
ms[m]: even the GNU implementation?
<simbergm>
heller: it might not
<simbergm>
not sure about the differences between the implementations, some will at least
<heller1>
yeah ...
Hashmi has joined #ste||ar
<simbergm>
doing it manually is the surest way of course
<nikunj97>
btw what if I do srun -c <num_cores> --threads-per-core=1?
<nikunj97>
will that also be similar to thread pinning?
<heller1>
depends on 1) How slurm is configured 2) What your OpenMP implementation does
<nikunj97>
could you elaborate?
<heller1>
slurm and OpenMP do interact sometimes, sometimes not
<nikunj97>
will it work for HPX?
<heller1>
huge option space there. Best consult the docs of your cluster
<nikunj97>
if I run my benchmarks with that, I mean
<heller1>
yes, HPX always does the right thing (tm)
<heller1>
well
<heller1>
check it
<heller1>
--hpx:print-bind is your friend
<heller1>
lstopo is your friend
<heller1>
the manual of your OpenMP implementation is your friend, as well as the manual of your cluster ;)
<nikunj97>
will have to look at the manual somewhere, all I have right now is a specification of the cluster
<heller1>
what you have to ensure is that you always measure the same thing.
<heller1>
and be sure to know what you measure
<nikunj97>
simbergm, if I do slurm ... -c 20 ./hpx_executable --hpx:user-process-mask. Will it make sure to bind only to slurm allocated cores? Is that what you meant?
<heller1>
nikunj97: check with `--hpx:print-bind`
<simbergm>
nikunj97: yeah, that's what it's supposed to do (it's *use*-process-mask, not user btw)
<simbergm>
and you can always check with print-bind like heller suggested
<simbergm>
it might depend on how slurm is configured as well, but as far as I know it'll always set a mask for the process
<nikunj97>
simbergm, heller1 thanks! I'll try doing it this way and check
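To make the pinning discussion above concrete: the bandwidth number from a triad-style kernel like the one below only means something once the OpenMP threads are pinned (e.g. OMP_PROC_BIND=close or =thread plus OMP_PLACES=cores for GNU OpenMP, or externally via hwloc-bind/numactl). This is an illustrative sketch, not the actual STREAM benchmark:

```cpp
#include <cstdio>
#include <vector>

#include <omp.h>

int main()
{
    long long const n = 1LL << 25;   // ~32M doubles per array, well beyond the caches
    std::vector<double> a(n, 1.0), b(n, 2.0), c(n, 0.0);
    double const scalar = 3.0;

    double const t0 = omp_get_wtime();
    // Triad: 2 loads + 1 store per element, i.e. 24 bytes of traffic per update.
    #pragma omp parallel for
    for (long long i = 0; i < n; ++i)
        c[i] = a[i] + scalar * b[i];
    double const t1 = omp_get_wtime();

    // Without pinning (e.g. OMP_PROC_BIND left unset with GNU OpenMP), threads may
    // migrate across NUMA nodes or end up on SMT siblings, and the result is noise.
    double const gbytes = 3.0 * n * sizeof(double) / 1e9;
    std::printf("triad bandwidth: %.2f GB/s\n", gbytes / (t1 - t0));
}
```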
wate123_Jun has joined #ste||ar
wate123_Jun has quit [Ping timeout: 252 seconds]
<simbergm>
mdiers: I forgot to mention yesterday: #4306 was cherry-picked into the 1.4.1 release, so those changes should already be there
<simbergm>
in any case I've just updated the pr so hopefully we can get it in this week
wate123_Jun has joined #ste||ar
<mdiers[m]>
<simbergm " mdiers I forgot to mention yes"> the cherry picks are already in master?
<simbergm>
mdiers: good question, apparently not
<simbergm>
they are in 1.4.1 though
<simbergm>
normally we'd merge release branches back into master, but I think 1.4.1 might not have gotten that treatment
<simbergm>
in that case you'll still have to wait for the PR... sorry for the confusion!
<mdiers[m]>
ms: all right, no problem ;-)
<nikunj97>
heller1, just realized that the HiSilicon one is incomplete. That's why you don't see it hit the peak
<nikunj97>
it has 64 cores and 64 threads, and I plotted only up to 32 cores
Abhishek09 has quit [Remote host closed the connection]
<nikunj97>
also with the Xeon, afaik --hpx:threads=x uses the first x cores of the machine. So the Xeon graphs should've used the correct cores as well (since hyperthreads are paired 0,21; 1,22; and so on)
wate123_Jun has joined #ste||ar
<nikunj97>
*0,20;1,21; and so on
wate123_Jun has quit [Ping timeout: 252 seconds]
<heller1>
yes, as said, HPX should be fine. I am concerned about your OMP comparisons
<nikunj97>
yea, I'll do those peak bandwidth calculations again
<nikunj97>
with numactl
<heller1>
never used that
<nikunj97>
it's pretty easy, `numactl --localalloc --physcpubind=0-20` means that cores 0 to 20 are used and that memory is allocated on the local node of the allocating core
<nikunj97>
that's what JSC people asked me to use for thread pinning
<nikunj97>
heller1, what did you think was wrong with xeon e5 graphs?
<nikunj97>
I have tried with both numactl and hwloc-bind and the updated results look skewed. My runs are performing better than the expected peak.
<nikunj97>
the runs for the other core counts on the HiSilicon are still going. Will let you know the final results when they're done
<nikunj97>
heller1, I've updated the E5 graphs with the new ones. There's not much of a difference, except for the benchmark running faster than the 5-core peak
wate123_Jun has joined #ste||ar
wate123_Jun has quit [Ping timeout: 252 seconds]
<nikunj97>
heller1, I also added the rooflines for the HiSilicon 1616 and the Xeon E5 (for double-precision peak CPU performance)
<nikunj97>
I think an arithmetic intensity of 1/8 (assuming 3 loads and 1 store) is not right for these calculations. I don't like seeing values above the expected peak performance
<nikunj97>
heller1, one thing I noticed from your paper: you decompose the stencil into blocks, while I'm decomposing it into lines
<nikunj97>
I'm thinking of taking this route as well. That way I can work on evenly sized stencils (i.e. the x and y dimensions don't differ by a large factor)
<heller1>
Yes, blocking is the way to go. Just makes it significantly harder in distributed
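A rough sketch of what such a block decomposition could look like: mapping a partition index onto a px-by-py grid of nearly square blocks (illustrative only; the paper's actual scheme and the distributed halo exchange are more involved):

```cpp
#include <cstddef>
#include <utility>

// The sub-block of an nx-by-ny grid owned by partition `rank` when the domain
// is split into px-by-py blocks instead of whole lines.
struct block
{
    std::size_t x_begin, x_end;   // [x_begin, x_end)
    std::size_t y_begin, y_end;   // [y_begin, y_end)
};

block make_block(std::size_t rank, std::size_t px, std::size_t py,
    std::size_t nx, std::size_t ny)
{
    std::size_t const bx = rank % px;   // block coordinates in the partition grid
    std::size_t const by = rank / px;

    // Split n points into `parts` pieces whose sizes differ by at most one.
    auto split = [](std::size_t n, std::size_t parts, std::size_t idx) {
        std::size_t const base = n / parts, rem = n % parts;
        std::size_t const begin = idx * base + (idx < rem ? idx : rem);
        return std::pair<std::size_t, std::size_t>{
            begin, begin + base + (idx < rem ? 1 : 0)};
    };

    auto const [xb, xe] = split(nx, px, bx);
    auto const [yb, ye] = split(ny, py, by);
    return block{xb, xe, yb, ye};
}
```

Keeping px and py close to sqrt(number of partitions) keeps the blocks close to square, which is the 'evenly sized' property mentioned above.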
hkaiser has joined #ste||ar
akheir has quit [Read error: Connection reset by peer]
<heller1>
Well, did you plot the values in the normal roofline?
<Yorlik>
o/
<heller1>
I'm just very confused by the presentation of your expected performance
<Yorlik>
It's like a lightweight, self-made fake scheduler with limited scope (parloop chunks)
<Yorlik>
But so far no more deadlocks.
nikunj97 has quit [Ping timeout: 260 seconds]
<hkaiser>
Yorlik: sure, even if you use HPX that doesn't prevent you from writing your own facilities
<Yorlik>
It all comes over time.
<Yorlik>
Can't learn everything at once. :)
<hkaiser>
simbergm: ?
nan11 has joined #ste||ar
<simbergm>
hkaiser: ??
<hkaiser>
hey
<hkaiser>
#4485
<hkaiser>
you propose to move the function serialization to the functional module
<simbergm>
yeah
<simbergm>
yes, not sure about that yet... hence the draft
<hkaiser>
I'm confused - when creating the functional module we intentionally left that out, didn't we?
<simbergm>
however, since you made serialization more lightweight it wouldn't be too bad
<hkaiser>
simbergm: I don't remember what we discussed
<hkaiser>
did we want to make essentially all modules depend on serialization, or make serialization depend on all modules?
<simbergm>
we have made other things depend directly on serialization now as well, instead of having the serialization separately
<hkaiser>
iirc, we wanted for serialization to provide the infrastructure needed and serialization support for ambient data types (from the std library, etc.)
<simbergm>
that was also before you had made the serialization module
<simbergm>
it would still be nice to have it separate, but there are too many other things that still have serialization implemented intrusively, so it doesn't make sense to have them in a separate module at the moment
<hkaiser>
simbergm: I'm not saying what you propose is wrong, I'm trying to discuss to find a uniform solution
<hkaiser>
so I think we can agree that the serialization module itself should have serialization support for ambient types
<simbergm>
hkaiser: yeah, didn't take it that way
<simbergm>
yep, agreed on that
<simbergm>
and that's pretty much the way it is right now
<hkaiser>
std:: and possibly boost::
<simbergm>
yep
<simbergm>
external dependencies essentially
<hkaiser>
and serialization support for our own types goes into the respective modules
<simbergm>
we have a lot of functionality that still has intrusive serialization; those would have to be separated out as well for this whole concept of separate serialization to make sense
<hkaiser>
I'm not sure I'd like that
<simbergm>
in that case having function serialization separate never made sense either :P
<hkaiser>
so yes, I think moving the function serialization into the functional module is a correct move
<simbergm>
the original reason for separating function serialization (it was the first one I did it for) was that serialization was a heavy dependency
<hkaiser>
yah, not sure anymore why we did this (same with any, btw)
<simbergm>
since it's lightweight now it makes much less sense to have it in a separate module
<hkaiser>
right
<hkaiser>
ok, count me in, I think this is sensible
<simbergm>
well, the benefit of having serialization completely separate is that it would allow building HPX with serialization while consumers don't have to pull in any serialization if they don't want to
<simbergm>
but that is a lot of effort for not too much gain at the moment
<simbergm>
there are more important things before that
<simbergm>
so I think having it in the functional module makes most sense at least right now
<hkaiser>
nod, I agree
nikunj97 has joined #ste||ar
<simbergm>
did you see my reply on the async modules? does what I wrote make sense?
<simbergm>
actually, jbjnr yt? do you think it would be feasible to not use dataflow in the guided_pool_executor?
<bita>
This test should fail throwing another exception, because cannon_product does not work on two localities.
<bita>
yes
<hkaiser>
ok
<bita>
If you'd like to run it, I can make another test and create an issue. Should I do it?
<hkaiser>
this test is fine for now, thanks
<bita>
got it :)
<hkaiser>
but please feel free to create an issue
<bita>
Okay :)
Hashmi has quit [Quit: Connection closed for inactivity]
<bita>
hkaiser, I made the example using dot_d and it worked. So, I think the problem is setting 2 localities for cannon. I tested with cannon on 4 localities and it worked too
<hkaiser>
ok
<bita>
so the problem is we always get the "thread pool is not running" exception instead of the real exception. The good news is everything else is working
<bita>
I will create an issue if that makes sense to you. However, fixing this should not be the priority
<hkaiser>
I'd like to find out what's wrong - it should work
<bita>
of course
Hashmi has joined #ste||ar
<hkaiser>
bita: btw, do you think it would be sensible to make the last two arguments to random_d and constant_d optional (the current locality and the number of localities)?
<hkaiser>
those could be initialized from the current HPX locality, if needed
<bita>
I am confused, I think the user should decide how many localities she needs for her generated new array (of constants or randoms)
<hkaiser>
bita: yes, of course
<hkaiser>
bita: but most of the time the number of localities to use is the same as the number of localities the HPX application runs on
<hkaiser>
so it could be derived at runtime as default values
<bita>
Okay, I get it
<bita>
I will work on that after getting perftests done
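For illustration only (random_d and constant_d are presumably Phylanx primitives whose real signatures may differ), the runtime defaults hkaiser suggests could look roughly like this, using hpx::get_locality_id() and hpx::get_num_localities():

```cpp
#include <hpx/hpx.hpp>   // header layout varies across HPX versions

#include <cstdint>

// Hypothetical wrapper: if the caller does not pass the tiling information,
// fall back to the locality this code runs on and the total number of
// localities the HPX application was started with.
void constant_d_example(double value,
    std::uint32_t this_locality = hpx::get_locality_id(),
    std::uint32_t num_localities = hpx::get_num_localities(hpx::launch::sync))
{
    // ... construct tile `this_locality` out of `num_localities` tiles,
    //     filled with `value` ...
    (void) value;
    (void) this_locality;
    (void) num_localities;
}
```

The user can still pass both arguments explicitly when the array should live on a different subset of localities.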
<nikunj97>
heller1, I updated the x86 roofline with the recorded 2D stencil performance
<nikunj97>
you can check it in the drive link
<nikunj97>
the 15- and 20-core performance is not so great compared to the 5- and 10-core runs
<nikunj97>
btw the files are named x86_64_<data_type>.png, where data_type denotes the data type the stencil performance was measured with
weilewei has quit [Remote host closed the connection]
<nikunj97>
heller1, just updated the hisilicon runs as well
<simbergm>
hkaiser: yep, I think how to get rid of the dependency is clear (let's see tomorrow...)
<simbergm>
I was more thinking about how to structure the rest of the files
<simbergm>
I don't think it makes much sense to only move the dataflow executor specializations to the execution module, I would in that case also move them for async, apply, and sync
<simbergm>
which doesn't leave much in the local_async module...
<simbergm>
but I would perhaps add the slightly silly async_base module with the bare minimum, move the specializations to the execution module
<simbergm>
and maybe still have a local_async module that just pulls in all of that, but not sure about this yet
gonidelis has quit [Remote host closed the connection]
<simbergm>
if I get rid of the local_async module a user can also just include, say, hpx/{executors,execution}/async.hpp for the local-only versions
<hkaiser>
simbergm: there is much more than the executor specializations for async and friends in local_async
<simbergm>
I guess the only reason I don't like that naming is because it breaks the symmetry with the async module... but then that could be called distributed_execution
<hkaiser>
but yah, all of this applies to async, apply, dataflow, future etc
<simbergm>
hkaiser: there's the launch policy and executor specializations
<hkaiser>
launch policy can stay in local_async, no?
gonidelis has joined #ste||ar
<simbergm>
yeah, it can
<simbergm>
wait, no, it can't
nan11 has quit [Remote host closed the connection]
<simbergm>
that's the problem
<hkaiser>
and the base templates for the specializations too? or should those be somewhere else?
<simbergm>
at least the way dataflow is used currently in guided_pool_executor
<simbergm>
guided_pool_executor uses the launch policy specializations
ibalampanis has joined #ste||ar
<simbergm>
but maybe it can just specify an explicit executor instead
<hkaiser>
well, that executor could directly dispatch to something like async_launch_policy_dispatch
<hkaiser>
that one was created for similar reasons, to break the dependency of the parallel executor on async
<simbergm>
yep, that's what I would like it to do
<simbergm>
I mean, we managed to get rid of dataflow elsewhere
<hkaiser>
just dataflow_dispatch or similar
<ibalampanis>
Hello to everyone! Have a nice day!
<simbergm>
right
<simbergm>
hi ibalampanis!
<simbergm>
I'll try to talk to jbjnr tomorrow (at the meeting at the latest)
<ibalampanis>
hkaiser: I'd like to ask you: how many proposals for GSoC have you received?
<simbergm>
I think if it can be implemented without dataflow that'd be nice
<simbergm>
but otherwise I'll do some reshuffling
<simbergm>
I'd move the base templates into execution then as well, or into a completely separate module
<simbergm>
ibalampanis: about 15 I think
<ibalampanis>
Interesting! Good luck to everyone!
<hkaiser>
simbergm: nah, let's extract the dataflow engine
<simbergm>
hkaiser: what do you mean by that exactly?
<hkaiser>
or hide it in a template specialization such that we need to expose an unimplemented base template only, same as for future
<simbergm>
expose to what? the guided_pool_executor will need the specialization as well
<simbergm>
we might be thinking about the same thing, I'm just not communicating it well (I hope...)
<simbergm>
the base templates can be in execution or in a dependency of execution
<simbergm>
the executor specializations can be in execution
<hkaiser>
nod, I see what you mean
<hkaiser>
but local_async and friends could depend on the specializations that do not depend on executors
<hkaiser>
so the executors can depend on local_async
<hkaiser>
I'd consider local_async to be fairly low level
<simbergm>
ok, we might have different ideas about what local_async should contain
<simbergm>
I consider it a user facing module
<simbergm>
but the only other specialization is the launch policy specialization and that one depends even on a concrete executor, not just execution traits
<hkaiser>
async_base is fine as it would hold all base templates for async, dataflow, even future
<simbergm>
I'll put something together tomorrow and then we can discuss at the meeting
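The 'unimplemented base template' idea is a plain C++ pattern rather than anything HPX-specific; a stripped-down illustration with invented names (not the actual HPX code):

```cpp
// A lightweight header only declares the customization point; the base
// template is intentionally never defined, so using it without the proper
// specialization in scope is a compile error instead of a heavy include.
template <typename T>
struct serialize_impl;   // declared, not defined

// A heavier header, included only where serialization is actually needed,
// provides the specialization with the real implementation, e.g. for a
// function-like wrapper type.
template <typename R, typename... Ts>
struct serialize_impl<R(Ts...)>
{
    template <typename Archive, typename F>
    static void call(Archive& ar, F& f)
    {
        // ... the real (de)serialization logic would live here ...
        (void) ar;
        (void) f;
    }
};
```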
<hkaiser>
bita: the exception is thrown because of bad exception handling for exceptions that escape hpx_main()
akheir1 has quit [Ping timeout: 256 seconds]
<hkaiser>
the reason is that cannon requires a perfect square number of tiles, but your example has only 1*2 tiles
<hkaiser>
that also explains why it works for 4 localities
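The constraint itself is cheap to verify up front; a hedged sketch of the kind of guard that would surface a clearer error (not the actual Phylanx/HPX code):

```cpp
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <string>

// Cannon's algorithm arranges the tiles on a sqrt(N) x sqrt(N) grid, so the
// number of tiles (localities) has to be a perfect square: 1, 4, 9, 16, ...
void require_perfect_square(std::size_t num_tiles)
{
    auto const root =
        static_cast<std::size_t>(std::sqrt(static_cast<double>(num_tiles)));
    if (root * root != num_tiles)
        throw std::invalid_argument("cannon_product: number of tiles (" +
            std::to_string(num_tiles) + ") must be a perfect square");
}
```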
<simbergm>
hkaiser: exactly, didn't even think about future_base, but that could make sense as well
<simbergm>
I think we might go for a separate future_base module as there's a bit more code there, but it could go either way
<hkaiser>
sure
<hkaiser>
but the ideas should be the same
<hkaiser>
bita: see #4487 (HPX)
<bita>
hkaiser, I thought I would get "All tiles in the tile row/column do not have equal height/width" but I was mistaken. We don't have an exception for not being a perfect square
<hkaiser>
bita: we do, it was just not properly propagated to you
<hkaiser>
#4487 fixes that
<bita>
got it, thank you
<nikunj97>
heller1, interesting way to use a proxy vector<shared_future<void>> for dataflow!
<heller1>
?
<nikunj97>
in your paper, you use a vector of shared futures
<nikunj97>
and create dependencies with it, while working on the actual grid
<nikunj97>
I find that a creative solution. I was thinking of ways to create a dataflow but couldn't think of anything other than vector<shared_future<vector> >
<heller1>
Yeah
<heller1>
Cool results
<nikunj97>
you understood the graphs?
<heller1>
Well hkaiser followed a different philosophy with the 1d stencil examples in the hpx repo
<nikunj97>
he uses vector < shared_future < partition_data > > in 1d stencil
<heller1>
Yes
<heller1>
I like the "traditional" roofline ones most
<nikunj97>
and partition_data is a vector again
<heller1>
Yes, which is what you said
<nikunj97>
yes
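For readers following along, a condensed sketch of the 'proxy dependency' pattern being discussed: the grid data stays in plain vectors, and a separate vector of shared_future<void> only encodes the ordering between partition updates via hpx::dataflow (illustrative code, not taken from heller1's paper or the hpx repo's 1d_stencil examples):

```cpp
#include <hpx/hpx.hpp>   // hpx::dataflow, hpx::shared_future; header layout varies by HPX version

#include <cstddef>
#include <vector>

// One time step over np partitions: next[i] becomes ready once partition i has
// been updated. The futures carry no data; they only express the neighbour
// dependencies, while the values themselves live in `grid`.
std::vector<hpx::shared_future<void>> do_step(
    std::vector<std::vector<double>>& grid,
    std::vector<hpx::shared_future<void>> const& deps)
{
    std::size_t const np = deps.size();
    std::vector<hpx::shared_future<void>> next(np);

    for (std::size_t i = 0; i != np; ++i)
    {
        std::size_t const left = (i == 0) ? np - 1 : i - 1;
        std::size_t const right = (i == np - 1) ? 0 : i + 1;

        // Launch the update once this partition and its two neighbours are done.
        next[i] = hpx::dataflow(hpx::launch::async,
            [&grid, i](auto&&...) {
                // ... update grid[i] using its left and right neighbours ...
            },
            deps[left], deps[i], deps[right]).share();
    }
    return next;
}
```

The dependency vector for the first step can simply be seeded with ready futures, e.g. `std::vector<hpx::shared_future<void>>(np, hpx::make_ready_future().share())`.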
<nikunj97>
gotcha, will send you plots wrt rooflines next time
<heller1>
Yes
<heller1>
Anyways, the x86 graphs look good as well
<nikunj97>
what do you think of the results btw? I don't understand why the performance would drop past a certain core count instead of staying consistent
<hkaiser>
nikunj97: where can I see those graphs?
<nikunj97>
hkaiser, wait let me send you the drive link
<nikunj97>
hkaiser, see pm for the link
<heller1>
And as expected
<hkaiser>
got it, thanks
<heller1>
The ARM ones still look odd with those drops in them; those shouldn't be there
<heller1>
Could be a bug in the binding code
<nikunj97>
could be
<heller1>
Or with how the os reports the topology
<heller1>
Can you share the output of lstopo on one of those ARM nodes please?
<nikunj97>
their arm node doesn't have hyperthreading though. So that's the performance you see
<nikunj97>
sure, a sec
<nikunj97>
heller1, would you like a photo or console output?
<heller1>
nikunj97: doesn't matter
<heller1>
nikunj97: also, the output when you run the application with 64 cores and --hpx:print-bind
<heller1>
nikunj97: it's especially the measurement at 24 cores which puts me off
<nikunj97>
the 24-core example is running. --hpx:print-bind prints only PUs 0-23
<heller1>
nikunj97: regarding the quality of the results: those are as expected. You nicely observe that you hit the memory bottleneck (where the curve flattens); once that happens, you mostly observe the effect of too little parallelism
gonidelis has quit [Remote host closed the connection]
<heller1>
has all the hints you need; do the measurements I suggested, maybe even find a tool that allows you to read out some hardware performance counters that might give you a hint here
<nikunj97>
I don't think it's the memory bandwidth, because that was either consistent or increasing (or had only very minor drops)
<heller1>
well, try and see
<nikunj97>
will do!
<heller1>
wee, nice number of proposals
<heller1>
hkaiser: diehlpk_mobile jbjnr ms are we having a call about the selection of the gsoc proposals next week?
nikunj97 has quit [Quit: Leaving]
<hkaiser>
heller1: might be a good idea
<hkaiser>
heller1: also we have the pmc call tomorrow
<heller1>
yeah, I know
<heller1>
I am not sure I can make the PMC call tomorrow
<heller1>
I am on forced vacation until the 8th, and my wife is working tomorrow afternoon
<hkaiser>
heller1: ok
<heller1>
I'll try though