hkaiser changed the topic of #ste||ar to: The topic is 'STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/'
<primef1>
hkaiser: last message I got from you is "primef: (#3829)"
<hkaiser>
primef1: well you can do both
<primef1>
Strangely, the IRC logs seem off, as they are missing a couple of messages
<hkaiser>
PRs are tested by the testing infrastructure, but I'd make sure locally that it does what it should before creating the PR
<hkaiser>
but I never run the full test suite locally, only the relevant parts
<hkaiser>
primef: also, there is already a PR that implemented #3646 for two algorithms, but that one fails the testing - so it might be a good start to figure out what's wrong with it
<primef1>
Ok, yes, I saw those, sorry.
<primef1>
On those I replied: hkaiser: alright, I'll look into #3829. Saw it yesterday.
<primef1>
About the testing, any hint on how to run the test suite? Just compile the single .cpp files and run them?
<hkaiser>
ok
<primef1>
Not sure they went out
<hkaiser>
primef1: well, make tests.* will do the trick, where tests.* is the target name of the test to run
<hkaiser>
I think make can autocomplete the target names
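A minimal sketch of that workflow, assuming a CMake build tree; the target name here is hypothetical (the real ones tab-complete under make, as noted above):

    cd build
    # build a single test target instead of the whole suite
    make tests.unit.parallel_algorithms
    # then run it via ctest, filtered by the same name
    ctest -R tests.unit.parallel_algorithms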
<primef1>
alright, then I'll experiment on that. Thank you!
<primef1>
Now I'll go offline though as it's sleep-time :-)
<primef1>
See you tomorrow, have a good day!
mdiers_1 has joined #ste||ar
mdiers_ has quit [Ping timeout: 260 seconds]
mdiers_1 is now known as mdiers_
primef1 has quit [Ping timeout: 268 seconds]
K-ballo1 has joined #ste||ar
K-ballo has quit [Quit: K-ballo]
K-ballo1 is now known as K-ballo
hkaiser has quit [Quit: bye]
rori has joined #ste||ar
rori has quit [Ping timeout: 260 seconds]
jbjnr has quit [Ping timeout: 265 seconds]
jbjnr has joined #ste||ar
<simbergm>
hkaiser, heller, jbjnr, you should have been invited to the survey
<simbergm>
let me know if you didn't get it or it's the wrong email
<heller2>
ms: yup, already looked at it. not bad
<heller2>
I wonder if I was the unfriendly German guy on IRC :/
<K-ballo>
oh no, who let niall participate?
<simbergm>
niall who?
<K-ballo>
douglas, an old antagonist of the "german hpc posse"
<heller2>
yeah ...
jbjnr1 has joined #ste||ar
<jbjnr1>
This is the real jbjnr trying a test from matrix
<jbjnr>
This is the old one using the windows machine
<jbjnr1>
Connecting to here from matrix was much harder than it should have been. Thanks heller
Yorlik has quit [Ping timeout: 268 seconds]
<heller2>
yup, the whole registration business is a pain
<jbjnr>
This is another test whilst the matrix thingy is shut down, to see if it still appears when I restart it
<jbjnr1>
that's great. The matrix thingy tracks the messages I missed :)
<simbergm>
for users that might be wondering what this is about, we're testing matrix as an alternative to irc
<K-ballo>
maybe all the users are in that matrix thing
<jbjnr>
they're not!
hkaiser has joined #ste||ar
<simbergm>
btw, hkaiser, heller, K-ballo feel free to fill in the survey as well if you haven't already
<heller2>
I did ;)
<simbergm>
good :)
<hkaiser>
I did too
<hkaiser>
(I think)
<hkaiser>
simbergm: thanks for the governance comments
<simbergm>
hkaiser: I'll pass the thanks on to joost ;)
<hkaiser>
also, how do I see the survey results?
primef1 has joined #ste||ar
<simbergm>
hkaiser: did you get some sort of email about it?
<hkaiser>
yes, but that just opens the survey itself for me
<simbergm>
assuming you have access to it, there should be a responses tab next to the questions tab
<K-ballo>
why is the doc rating question mandatory?
<hkaiser>
ahh, got it
<simbergm>
K-ballo: mistake, shouldn't be mandatory anymore
rori has joined #ste||ar
hkaiser has quit [Quit: bye]
<simbergm>
hkaiser, heller would it be possible to move the meeting tomorrow to one hour later? i.e. 17:00 CET/10:00 CT? (still need to confirm if it's needed)
<heller2>
should be fine
hkaiser has joined #ste||ar
<primef1>
simbergm: do you have any more suggestions about flags to activate to increase performance in distributed/numa-aware applications?
<jbjnr>
primef1: what is your problem/task?
<primef1>
Moreover, in one of your examples you use "hpx.numa_sensitive=2". What does the 2 stand for? And how is this different from setting --hpx:numa-sensitive on the CLI?
<jbjnr>
That setting is obsolete really.
<primef1>
I have to reproduce multiple algorithms and optimize those. The algorithms are: reduce, transpose, and lastly a 3d stencil.
<jbjnr>
it says that a scheduler can steal work from another numa domain, but only the adjacent core can do it
<primef1>
jbjnr:
<jbjnr>
in reality it is not supported any more
<simbergm>
primef1: jbjnr is our numa expert, if he has time I'll let him handle this :)
<primef1>
jbjnr: good to know, then I'll remove it from my code. I took it from the official examples.
<jbjnr>
I don't want to take on a whole distributed matrix transpose problem right now, but maybe something smaller
<primef1>
simbergm: thanks!
<jbjnr>
simbergm: wtf is all this spellcheck fail stuff on the dashboard/emails
<jbjnr>
if we have to give our vars real names from now on, then shoot me now
<simbergm>
jbjnr: it should ignore most nonsense variable names
<primef1>
jbjnr: Sure, I can understand that; actually the code we wrote is finished. We are not achieving the results we want, but time is running out and we don't have that much time to dedicate to more optimizations. We tried to implement numa distribution using the numa_allocator, but honestly it doesn't seem to perform that well.
<simbergm>
if it turns out to give too many false positives we'll remove it
<primef1>
jbjnr: so instead of using the "hpx.numa_sensitive=2" option, should I set the CLI argument for numa sensitivity?
<simbergm>
or I'll limit it to docs (but I'd like to have it check comments in code at least)
<jbjnr>
primef1: can you perhaps tell me what the code does?
<jbjnr>
or better still let me see it.
<primef1>
Sure! Give me a sec.
<jbjnr>
in general we have a numa allocator (2 actually, an old one and a new one); they allow you to allocate data on a particular numa domain, or striped in some way. However, the schedulers need to know where the data is in order to allocate tasks to cores on that numa node
<jbjnr>
the shared_priority_scheduler has some new support for this kind of thing
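A rough sketch of what allocating across numa domains can look like, modeled on the compute host allocator used in HPX's transpose examples (exact headers and names may differ between HPX versions; the size here is arbitrary):

    #include <hpx/hpx_main.hpp>
    #include <hpx/include/compute.hpp>

    int main()
    {
        // one target per numa domain, derived from this process' binding
        auto numa_domains = hpx::compute::host::numa_domains();

        // block_allocator stripes an allocation across the given domains,
        // first-touching each block from a thread on the owning domain
        using allocator_type = hpx::compute::host::block_allocator<double>;
        allocator_type alloc(numa_domains);

        // the vector's pages now live in large per-domain blocks
        hpx::compute::vector<double, allocator_type> data(
            1024 * 1024, 0.0, alloc);

        return 0;
    }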
<primef1>
Ok, good to know that. The code I sent is an adapted version of the examples you use.
<jbjnr>
is this the transpose code?
<primef1>
Yes it is. I understood you don't want to look into it. But it is the only one where we actually use numa right now.
<jbjnr>
what do you mean by "time is running out" - what deadline must you meet?
<primef1>
It's a university project. Deadline is next week
<jbjnr>
monday next week, or friday next week?
<jbjnr>
and how bad is your performance?
<jbjnr>
...and does it need to work distributed - or just on one node with N numa domains
<primef1>
thursday next week. performance actually gets worse compared to a version without numa. The system is using Xeon Phi
<jbjnr>
how much trouble will you be in if you can't improve the performance before thursday? (meaning, how important is this project)
<primef1>
the more the better. But our primary goal was 1 node with N numa domains, and introducing node distribution is a step that looked too far for us.
<primef1>
I'll put it this way: I guess we won't get in trouble. It will just affect our grade. So nothing we need urgent support on. I mean, if you look over it once and discover stuff we shouldn't do, or ways we can improve it, that would be great. But it's nothing you have to invest much of your time in.
<primef1>
thank you so much for asking, though, and for helping us
<jbjnr>
this shows how to use the new numa allocator to allocate memory using different patterns
<jbjnr>
then the guided_pool_executor - or a schedule hint - can tell tasks where to run (which numa node)
<jbjnr>
(a schedule hint with a normal executor, I mean) - but this stuff is a bit advanced usage and I still have problems with it
<jbjnr>
so what you've got might be the best for the time being
<jbjnr>
How close to peak memory bandwidth do you get?
<jbjnr>
I'm interested in your transpose example, because it is exactly this kind of thing we need to make the numa API simple and easy enough to allow guys like you to use it and write decent code without being an expert.
<jbjnr>
so I'll try to look into it
<jbjnr>
(but I already have 2 other projects that have deadlines)
<primef1>
Alright, this all sounds like good advice to me. Thanks a lot! I'll look into the pages you sent and try to wrap my head around it.
<primef1>
To be honest, I don't know what peak memory bandwidth we get. But it's a good point to check. Sorry for that, we are also quite new to high performance computing stuff
<jbjnr>
the numa test I linked to allocates a big array and then binds different pages to different numa nodes - then dumps out the binding pattern
<jbjnr>
on a KNL with 4 numa nodes you should see a pattern like 0000111122223333 etc etc for pages of memory
<primef1>
But for sure, the proposal to make the numa concept more newbie-friendly sounds amazing. If I might be of any help on that, please let me know.
<jbjnr>
now if you know which page of memory is bound to which numa node, you can tell the scheduler to run tasks that use that memory on that node
<simbergm>
primef1: another thing to test is to forget about all the explicit numa management and launch one hpx locality per numa node
<jbjnr>
tip - find my C++ Italia talk on YouTube and skip to the part where the Cholesky and guided executor are mentioned
<simbergm>
you don't have as much control but it might get you pretty close to the same effect
<primef1>
ohh alright, understood the concept. Thing is, for simple arrays/vectors this doesn't sound difficult. What I wonder, however, is how this behaves in combination with custom structs and futures.
<jbjnr>
yes. what simbergm said is also a good (better) way
<jbjnr>
to keep it simple
<jbjnr>
then the operating system manages the memory for you and mpi handles the comms
<jbjnr>
gtg
<primef1>
jbjnr: alright, will look into the video.
<primef1>
simbergm: so you mean start an hpx_main on each numa node? How is that achieved?
<simbergm>
primef1: depends a bit on how you launch the application
<simbergm>
with mpirun I think you can set the number of processes/ranks per node explicitly
<simbergm>
if you use that together with --hpx:use-process-mask you should get correct thread bindings automatically
<primef1>
If that is what you mean, we launch it using srun (slurm)
<simbergm>
even better
<primef1>
Ok, but now a naive question for sure. How do I divide my transpose problem if the application is split even before that?
<simbergm>
just use the parameters that srun takes for multiple ranks per node (-n or -N, man srun will tell you)
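For instance, to get one locality per numa domain on nodes with 4 domains each, something like this should work (slurm flag spellings vary a bit between versions; the binary name is a placeholder):

    # 4 ranks per node, each bound to one numa locality domain
    srun -N 2 --ntasks-per-node=4 --cpu-bind=ldoms \
        ./transpose --hpx:use-process-mask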
<simbergm>
not sure I understand the question... for distributing to multiple nodes you'll also have to distribute your matrix somehow
<simbergm>
now how that is distributed is another question
<simbergm>
I think some sort of round robin distribution is typical, but I don't know what's best for this use case
<primef1>
ok, yes that is what my question was.
<simbergm>
whether you have one or multiple ranks per node is independent of how you distribute your matrix
<primef1>
so inside the application I say: if the application is currently on thread x, do this subblock of the matrix
<primef1>
and so on
<simbergm>
more or less
<simbergm>
you'd most likely do it per rank and make sure you oversubscribe (have "too much work") for each rank
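A minimal sketch of that per-rank split, assuming a round-robin distribution of blocks (transpose_block and the block count are placeholders; pick enough blocks to oversubscribe each rank):

    #include <hpx/hpx_main.hpp>
    #include <hpx/include/runtime.hpp>

    #include <cstddef>
    #include <cstdint>

    int main()
    {
        std::uint32_t const rank = hpx::get_locality_id();
        std::uint32_t const nranks = hpx::get_num_localities(hpx::launch::sync);

        std::size_t const num_blocks = 64;    // "too much work" per rank

        // round-robin: rank r owns blocks r, r + nranks, r + 2 * nranks, ...
        for (std::size_t b = rank; b < num_blocks; b += nranks)
        {
            // transpose_block(b);    // per-block work goes here
        }

        return 0;
    }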
<primef1>
alright. Lots of input I can work on and lots of new concepts for me to understand. Thank you so much for your help, simbergm and jbjnr. Amazing to get such constructive help so quickly.
<primef1>
simbergm: one more question. Yesterday you were talking about using the MPI parcelport. What dependencies are required for it? Is OpenMPI correct?
<primef1>
Or is it MPICH?
<simbergm>
primef1: OpenMPI will do just fine in most cases I think
<simbergm>
if you run it on a cluster there'll usually be an mpi that's optimized for that system and in those cases you just use that
<simbergm>
mpich and openmpi should both work
<primef1>
the system has a couple of openmpi packages available through spack, but none of them was compiled with gcc 9, so I will have to build openmpi.
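The two steps primef1 describes could look roughly like this, assuming spack already knows about a gcc 9 compiler (the exact version is hypothetical):

    # build an openmpi against gcc 9 via spack
    spack install openmpi %gcc@9.1.0
    # then configure HPX with the MPI parcelport enabled
    cmake -DHPX_WITH_PARCELPORT_MPI=ON <hpx-source-dir>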
<K-ballo>
it finally happened, we are "C++14-only" now
<hkaiser>
yah, now we're not allowed to use anything before C++14 anymore - no raw loops!
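The "no raw loops" quip refers to preferring algorithms over hand-written loops; in HPX spelling the contrast looks something like this (a toy example, not code from any PR mentioned here):

    #include <hpx/hpx_main.hpp>
    #include <hpx/include/parallel_for_each.hpp>

    #include <cstddef>
    #include <vector>

    int main()
    {
        std::vector<int> v(100, 1);

        // the raw loop we are "not allowed" to write anymore
        for (std::size_t i = 0; i != v.size(); ++i)
            v[i] *= 2;

        // the algorithm form, parallelized by HPX
        hpx::parallel::for_each(hpx::parallel::execution::par,
            v.begin(), v.end(), [](int& x) { x *= 2; });

        return 0;
    }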
<simbergm>
Guys...
<simbergm>
hkaiser I like your thinking, I'll put together an inspect check for that
<hkaiser>
heh
<hkaiser>
simbergm: I have a couple of minor fixes for this for msvc, however
<simbergm>
hkaiser I feared as much... Hope it's nothing too bad
primef1 has quit [Ping timeout: 265 seconds]
<hkaiser>
nah, no worries - minor issues, like printing the used standard twice and such
<K-ballo>
I tried the msvc build last week and it was ok (core only)