aserio changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/
mcopik has quit [Ping timeout: 248 seconds]
parsa has quit [Quit: Zzzzzzzzzzzz]
diehlpk has quit [Ping timeout: 240 seconds]
EverYoung has quit [Ping timeout: 276 seconds]
K-ballo has quit [Quit: K-ballo]
vamatya has joined #ste||ar
diehlpk has joined #ste||ar
eschnett has quit [Quit: eschnett]
vamatya has quit [Ping timeout: 260 seconds]
eschnett has joined #ste||ar
diehlpk has quit [Remote host closed the connection]
patg has joined #ste||ar
hkaiser has quit [Quit: bye]
parsa has joined #ste||ar
eschnett has quit [Quit: eschnett]
vamatya has joined #ste||ar
parsa has quit [Quit: Zzzzzzzzzzzz]
parsa has joined #ste||ar
parsa has quit [Client Quit]
parsa has joined #ste||ar
parsa has quit [Client Quit]
parsa has joined #ste||ar
parsa has quit [Client Quit]
parsa has joined #ste||ar
parsa has quit [Client Quit]
parsa has joined #ste||ar
parsa has quit [Client Quit]
patg has quit [Quit: See you later]
Matombo has joined #ste||ar
Matombo has quit [Ping timeout: 268 seconds]
<github>
[hpx] biddisco pushed 1 new commit to terminated_threads: https://git.io/v7qAr
<github>
hpx/terminated_threads 435aa82 John Biddiscombe: Only boolean config options use HPX_WITH_XXX and HPX_HAVE_XXX prefixes
<taeguk>
Because 'block' and 'block_manager' are such common names, I'm worried about name clashes.
<taeguk>
So, I put 'struct block' and 'struct block_manager' into an unnamed namespace.
<K-ballo>
unnamed namespaces don't affect naming, just linkage
<taeguk>
K-ballo: Yes, you're right. Because of that, I used an unnamed namespace for 'struct block' and 'struct block_manager'.
<K-ballo>
taeguk: you'll have to explain what the intention is, because it makes no sense to me.. an unnamed namespace won't help with name clashes when those are the problem
<diehlpk_work>
thundergroudon[m], taeguk, ABresting: your evaluations are still missing
<diehlpk_work>
Please submit them by today
<zao>
If two HPX headers use the same name in two separate unnamed namespaces, you'll still clash.
<K-ballo>
yeah, and if there's a struct foo {}; at the outer namespace, it will either hide the nested ones or be a conflict, depending on include order
<taeguk>
K-ballo: zao: I misunderstood unnamed namespaces.
<taeguk>
My usage is incorrect :(
<zao>
Upside of things, you've learned something :)
<zao>
(I had to verify my understanding of them just the other day, w.r.t. visibility of names)
<taeguk>
zao: K-ballo: thank you very much :)
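A minimal sketch of the point K-ballo and zao make above, assuming a single translation unit. The namespace `demo` and the function `default_block_size` are hypothetical names, not HPX code. It shows that an unnamed namespace changes linkage only, while the name stays visible in the enclosing namespace:

```cpp
namespace demo {    // hypothetical namespace, not an HPX one
    namespace {
        // Internal linkage: each translation unit that contains this code
        // gets its own private 'block', so the linker never sees a clash.
        struct block {
            int size = 64;
        };
    }

    // The unnamed namespace does NOT hide the name: 'block' is found here
    // as if it had been declared directly in 'demo'.
    int default_block_size() {
        return block{}.size;
    }

    // Declaring another 'block' directly in 'demo' (for instance through a
    // second header) would make the unqualified use above ambiguous:
    // struct block {};
}
```

Code outside `demo` can still write `demo::block`, which is why the unnamed namespace does not protect common names. A nested named namespace is the more usual way to keep such names apart.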
<taeguk>
diehlpk_work: Okay, I'll do 2nd evaluation soon.
akheir has joined #ste||ar
hkaiser has joined #ste||ar
aserio has joined #ste||ar
<jbjnr>
heller: did you see the libfabric email about standardization. Interesting indeed.
<heller>
jbjnr: yes
<jbjnr>
this could be my chance to poush rma_objects!
<heller>
:D
<jbjnr>
^push
<heller>
I am fighting scalable endpoints right now
<heller>
pain in the fucking ass
<jbjnr>
good for you!
<jbjnr>
(the fighting bit, not the pita bit)
<heller>
and guess what ... the test ain't working
<jbjnr>
Glad it's you and not me for once
<jbjnr>
:)
<jbjnr>
(sorry, I believe that is some kind of schadenfreude or something)
<heller>
;)
<heller>
I mean the stuff that's included in fabtests
<hkaiser>
heller: we'll get a small power 8 node (8 cores) here and will add it to rostam
Matombo has quit [Remote host closed the connection]
Kiril_ has joined #ste||ar
<Kiril_>
Hello. I was playing with the pingpong example in examples/quickstart today, and I can't seem to make it work for distributed runs. It works for single-node runs though. Can anyone give me some help?
<jbjnr>
Kiril_: what command line are you using to start the binaries on each node?
<Kiril_>
So I have Slurm set up for a small 2-node setting
<jbjnr>
does hello_world run on 2 nodes?
<Kiril_>
when I do a single-node run with "srun -n 2 bin/pingpong", it runs fine
<jbjnr>
ok
<jbjnr>
(I'm surprised though, cos 2 binaries on the same node usually fail!)
<jbjnr>
did you compile with the MPI parcelport, or only tcp?
<Kiril_>
okay, hello_world also hangs ... it seems something is not right with my setup
<Kiril_>
only TCP
<Kiril_>
I try to avoid MPI (not important here why)
<Kiril_>
do I need to pass some flag for only TCP settings?
Matombo has joined #ste||ar
<jbjnr>
tcp should be ok, but the MPI parcelport 'knows' about slurm and gets settings from it, I can't remember if the tcp one does.
<Kiril_>
No -- let me try. It has nothing to do with pingpong then, something with my setup
<hkaiser>
jbjnr: the tcp one does as well
<jbjnr>
if you can ssh into the two nodes, try launching the binaries by hand, using --hpx:console on one and --hpx:worker on the other,
<jbjnr>
and pass the --hpx:agas and --hpx:hpx IP addresses
<Kiril_>
I see a hanging pingpong process
<jbjnr>
kill any hanging jobs too. they hold onto the port and stop the next one working
<Kiril_>
let me read through that setup ...
<jbjnr>
yup
eschnett has quit [Quit: eschnett]
eschnett has joined #ste||ar
aserio has quit [Ping timeout: 246 seconds]
<Kiril_>
I can't seem to make it work. I managed to pass the "hpx:hpx" and "hpx:agas" options with the heartbeat/heartbeat_console example and that worked. For pingpong, I also specify "hpx:localities=2" in addition to the previous options, otherwise the pingpong runs in shared memory and finishes. But that hangs
hkaiser has quit [Read error: Connection reset by peer]
hkaiser has joined #ste||ar
<Kiril_>
I run on one node "bin/pingpong --hpx:localities=2 --hpx:agas=192.168.137.2:7910 --hpx:hpx=192.168.137.1:7910" and on the other "bin/pingpong --hpx:localities=2 --hpx:agas=192.168.137.2:7910 --hpx:hpx=192.168.137.2:7910"
<jbjnr>
add --hpx:console to one, and --hpx:worker to the other
<jbjnr>
(just to play safe)
<hkaiser>
K-ballo: pls read the SO post carefully
<jbjnr>
he means Kiril_ ^
<K-ballo>
...
<hkaiser>
K-ballo: sorry ;)
<hkaiser>
the first command line misses --hpx:worker
<Kiril_>
okay -- that worked
<Kiril_>
thank you
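The working setup above can be summarised in a small sketch. `launch_args` is a hypothetical helper, not an HPX function. It only assembles the options named in this conversation, assuming the console is the process whose own address is also the `--hpx:agas` address:

```cpp
#include <string>
#include <vector>

// Hypothetical helper, not part of HPX: collects the options discussed above
// for one process of a manually launched run.
std::vector<std::string> launch_args(bool is_console,
                                     const std::string& agas_address,
                                     const std::string& own_address,
                                     int localities) {
    std::vector<std::string> args;
    // Exactly one process is the console; all others are workers.
    args.push_back(is_console ? "--hpx:console" : "--hpx:worker");
    args.push_back("--hpx:localities=" + std::to_string(localities));
    // Same AGAS address on every process.
    args.push_back("--hpx:agas=" + agas_address);
    // Address this process itself listens on.
    args.push_back("--hpx:hpx=" + own_address);
    return args;
}
```

With the addresses from the log, `launch_args(false, "192.168.137.2:7910", "192.168.137.1:7910", 2)` gives the worker's options and `launch_args(true, "192.168.137.2:7910", "192.168.137.2:7910", 2)` gives the console's.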
<jbjnr>
\o/
<Kiril_>
can I ask one more question related to another example?
<jbjnr>
no
<Kiril_>
please?
<jbjnr>
of course you can. That was a joke^
<Kiril_>
so I was looking at the heartbeat
<jbjnr>
(note that launching the jobs "by hand" now works, but we must determine why the slurm launch failed)
<Kiril_>
all I understand is that performance counters are used, in particular one for some kind of queue -- but I fail to see what this has to do with a heartbeat
<Kiril_>
and I could not find any documentation of how these counters can be used for a heartbeat
<Kiril_>
does anyone have a few sentences for me explaining this?
<jbjnr>
the heartbeat refers only to the ping of one node to another every second or so to see if it is still doing something
<jbjnr>
Can't quite remember what the heartbeat example does. I have a modified version of it somewhere that needs cleaning up and contributing
<hkaiser>
Kiril_: heartbeat demonstrates two things
<hkaiser>
Kiril_: a) launch a new locality after startup and let it connect back to the main set of localities (in this case just locality 0)
<hkaiser>
and b) query arbitrary perf counters, in this case from inside the newly attached locality
<Kiril_>
I can't figure out how locality 0 (I assume this is the heartbeat_console process) communicates at all with joining localities
<hkaiser>
Kiril_: it tries to create a perf counter
<Kiril_>
uhm ... does the perf counter of locality 1 then reside on locality 0?
<hkaiser>
no
<jbjnr>
no, the perf counter of 1 lies on 1, but when 0 requests the counter for 1, it is doing a remote query
<Kiril_>
ah, okay
<hkaiser>
Kiril_: just looking at the heartbeat_console
<Kiril_>
I had a hard time figuring out where the data flows from and to
<hkaiser>
it's not doing any perf counter queries
<hkaiser>
I misremembered
<jbjnr>
oops. ignore me then
<jbjnr>
I spent ages on that example as well and have completely forgotten what it does. going for tea so I don't send any more false messages for a few minutes
<hkaiser>
it's the attached heartbeat locality which is querying the perf counter
<Kiril_>
:)
<Kiril_>
but I assume the perf counter is then at locality 0?
<Kiril_>
if the attached locality is reading its own counter, it would not be any heartbeat check
<hkaiser>
Kiril_: whatever you specify on the command line, by default '/threadqueue{locality#0/total}/length', which is on locality 0, yes
<Kiril_>
okay, that makes sense, I just wanted to make sure that at some stage one locality (e.g. 1) is querying someone across the wire (in this case, locality 0)
<Kiril_>
which, I guess, is the heartbeat then
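To make the counter name from hkaiser's message easier to read, here is a minimal sketch that splits a name of the shape `/object{locality#N/instance}/counter` into its parts. The struct `counter_name` and the function `parse_counter_name` are hypothetical and written only for this illustration. HPX has its own facilities for counter paths, and this sketch does not cover the full naming scheme:

```cpp
#include <cstddef>
#include <string>

// Hypothetical type for illustration only.
struct counter_name {
    std::string object;     // e.g. "threadqueue"
    int locality = -1;      // e.g. 0
    std::string instance;   // e.g. "total"
    std::string counter;    // e.g. "length"
    bool ok = false;        // true if the name had the expected shape
};

// Parses names of the shape /object{locality#N/instance}/counter.
counter_name parse_counter_name(const std::string& name) {
    counter_name out;
    const std::string tag = "{locality#";
    const std::size_t open = name.find(tag);
    const std::size_t close = name.find("}/");
    if (name.empty() || name[0] != '/' || open == std::string::npos ||
        close == std::string::npos || close < open)
        return out;

    // The slash between the locality index and the instance name.
    const std::size_t slash = name.find('/', open);
    const std::size_t digits_begin = open + tag.size();
    if (slash == std::string::npos || slash > close || digits_begin >= slash)
        return out;

    int index = 0;
    for (std::size_t i = digits_begin; i < slash; ++i) {
        if (name[i] < '0' || name[i] > '9')
            return out;
        index = index * 10 + (name[i] - '0');
    }

    out.object = name.substr(1, open - 1);
    out.locality = index;
    out.instance = name.substr(slash + 1, close - slash - 1);
    out.counter = name.substr(close + 2);
    out.ok = true;
    return out;
}
```

For `/threadqueue{locality#0/total}/length` this yields object `threadqueue`, locality `0`, instance `total` and counter `length`. The locality index is what tells the attached locality that the query has to go across the wire to locality 0.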
aserio has joined #ste||ar
<jbjnr>
happy late birthday for yesterday aserio :)
<aserio>
jbjnr: thanks!
Kiril_ has quit [Quit: Page closed]
EverYoung has joined #ste||ar
EverYoung has quit [Remote host closed the connection]
EverYoung has joined #ste||ar
parsa has joined #ste||ar
parsa has quit [Client Quit]
Matombo has quit [Ping timeout: 260 seconds]
Matombo has joined #ste||ar
<hkaiser>
ajaivgeorge: two of the new tests from your PR are failing
<jbjnr>
hkaiser: great ^^^ removing some of those (now unnecessary) extra classes is on our todo list. was on the list I mean. :)
ajaivgeorge__ has joined #ste||ar
ajaivgeorge__ has left #ste||ar [#ste||ar]
pree has quit [Remote host closed the connection]
ajaivgeorge__ has joined #ste||ar
vamatya has quit [Ping timeout: 260 seconds]
<hkaiser>
jbjnr: :D
<hkaiser>
I'm glad you approve
<jbjnr>
remove more!
<hkaiser>
jbjnr: most of the work there was done already
pree has joined #ste||ar
<hkaiser>
working on it
<hkaiser>
perf counters first
eschnett has quit [Ping timeout: 260 seconds]
<jbjnr>
shoshana ran out of time and we decided to concentrate on the main features we need rather than making it clean and nice
pree has quit [Read error: Connection reset by peer]
<hkaiser>
jbjnr: sure
<jbjnr>
clean and nice is your job :) thanks!
eschnett has joined #ste||ar
<ajaivgeorge>
hkaiser: I am looking at the failing tests. I will fix it in my branch. I had run the tests on rostam before submitting. Not sure why it is failing now.
<ABresting>
diehlpk_work: working on my PR and report, will do 2nd eval. form
bikineev has quit [Remote host closed the connection]
mars0000 has joined #ste||ar
mars0000 has quit [Client Quit]
bikineev has joined #ste||ar
patg[[w]] has joined #ste||ar
denis_blank has joined #ste||ar
pree_ has quit [Ping timeout: 240 seconds]
EverYoung has quit [Remote host closed the connection]
EverYoung has joined #ste||ar
bikineev has quit [Remote host closed the connection]
pree_ has joined #ste||ar
aserio has quit [Ping timeout: 248 seconds]
david_pfander has quit [Ping timeout: 255 seconds]
taeguk has quit [Ping timeout: 260 seconds]
pree_ has quit [Ping timeout: 260 seconds]
aserio has joined #ste||ar
<hkaiser>
jbjnr: yt?
pree_ has joined #ste||ar
<denis_blank>
hkaiser: Did you receive my PM?
bikineev has joined #ste||ar
<hkaiser>
denis_blank: yah, let's talk now
<denis_blank>
hkaiser: ok
patg[[w]] has quit [Quit: Leaving]
Matombo has quit [Remote host closed the connection]
mars0000 has joined #ste||ar
parsa has joined #ste||ar
bikineev has quit [Remote host closed the connection]
Matombo has joined #ste||ar
jgoncal has quit [Ping timeout: 240 seconds]
aserio has quit [Ping timeout: 246 seconds]
bikineev has joined #ste||ar
hkaiser has quit [Quit: bye]
aserio has joined #ste||ar
Matombo has quit [Remote host closed the connection]
jgoncal has joined #ste||ar
vamatya has joined #ste||ar
hkaiser has joined #ste||ar
aserio has quit [Quit: aserio]
parsa has quit [Quit: Zzzzzzzzzzzz]
denis_blank has quit [Quit: denis_blank]
eschnett has quit [Quit: eschnett]
jgoncal has quit [Ping timeout: 246 seconds]
EverYoun_ has joined #ste||ar
jgoncal has joined #ste||ar
EverYoung has quit [Ping timeout: 246 seconds]
eschnett has joined #ste||ar
EverYoun_ has quit [Remote host closed the connection]
EverYoung has joined #ste||ar
eschnett has quit [Quit: eschnett]
eschnett has joined #ste||ar
mars0000 has quit [Quit: mars0000]
diehlpk_work has quit [Quit: Leaving]
EverYoung has quit [Remote host closed the connection]
EverYoung has joined #ste||ar
jgoncal has quit [Ping timeout: 240 seconds]
eschnett has quit [Quit: eschnett]
<K-ballo>
so, what's the proper thing to link to in order to use std::atomic on linux? and get the entire support, not just the builtins
<zao>
Is there a way to not get it?
<zao>
Assuming you have a non-horrible stdlib?
<K-ballo>
somehow... you get extra lockfree-ness by linking
<zao>
*sigh*
<K-ballo>
otherwise you get just the built ins + mutex based for the rest
<zao>
I guess I shouldn't be surprised.
<K-ballo>
or something like that... I haven't actually seen it, a.williams said something to that effect some time ago
<K-ballo>
# Sometimes linking against libatomic is required for atomic ops, if
<K-ballo>
# the platform doesn't support lock-free atomics.
<hkaiser>
K-ballo: linking with -latomic is probably not needed for the feature test, then; rather we should add it to the normal build, shouldn't we?
<K-ballo>
hkaiser: I don't know about the feature test, but yeah we should add it to the normal build.. that's why I'm asking, because I'm not sure when/how
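A minimal sketch of the situation being discussed, assuming a GCC-style toolchain. The type `wide` and both function names are hypothetical. Word-sized atomics are normally compiled to inline instructions, while oversized ones may be routed through a runtime library:

```cpp
#include <atomic>
#include <cstdint>

// Word-sized atomics are normally lowered to inline instructions, so this
// function links without any extra library on mainstream platforms.
inline int bump_twice() {
    std::atomic<int> counter{0};
    counter.fetch_add(1);
    counter.fetch_add(1);
    return counter.load();
}

// A 16-byte payload (hypothetical type, for illustration only).
struct wide {
    std::uint64_t lo;
    std::uint64_t hi;
};

// For oversized types the compiler may emit calls into a runtime library
// (libatomic with GCC). If this function is used and the link step reports
// undefined __atomic_* symbols, adding -latomic is the usual fix.
inline bool wide_is_lock_free() {
    std::atomic<wide> w{wide{1, 2}};
    return w.is_lock_free();
}
```

Whether `wide_is_lock_free()` returns true, and whether it links without `-latomic`, depends on the platform and compiler flags, which is why a link-time check in the build system is the safer place to decide.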
eschnett has joined #ste||ar
bikineev has quit [Remote host closed the connection]
hkaiser_ has joined #ste||ar
hkaiser has quit [Read error: Connection reset by peer]
eschnett has quit [Quit: eschnett]
jgoncal has joined #ste||ar
akheir has quit [Remote host closed the connection]
EverYoun_ has joined #ste||ar
EverYoun_ has quit [Ping timeout: 246 seconds]
EverYoung has quit [Ping timeout: 276 seconds]
pree_ has quit [Quit: AaBbCc]
EverYoung has joined #ste||ar
EverYoung has quit [Remote host closed the connection]