aserio changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/
StefanLSU has joined #ste||ar
StefanLSU has quit [Quit: StefanLSU]
StefanLSU has joined #ste||ar
denis_blank has quit [Quit: denis_blank]
StefanLSU has quit [Quit: StefanLSU]
hkaiser has quit [Quit: bye]
jbjnr_ has joined #ste||ar
patg has joined #ste||ar
patg is now known as Guest44047
jbjnr has quit [Ping timeout: 246 seconds]
jbjnr_ is now known as jbjnr
Guest44047 has quit [Read error: Connection reset by peer]
patg_ has joined #ste||ar
K-ballo has quit [Quit: K-ballo]
StefanLSU has joined #ste||ar
StefanLSU has quit [Client Quit]
rod_t has joined #ste||ar
taeguk has joined #ste||ar
<taeguk>
congratulations on 700 stars! :)
rod_t has quit [Quit: My MacBook has gone to sleep. ZZZzzz…]
taeguk has quit [Quit: Page closed]
rod_t has joined #ste||ar
<jbjnr>
yay \o\ 700!
<jbjnr>
oops, that was a \o/
bikineev has joined #ste||ar
rod_t has quit [Quit: My MacBook has gone to sleep. ZZZzzz…]
<github>
[hpx] biddisco created numa_balanced (+1 new commit): https://git.io/v51dj
<github>
hpx/numa_balanced 62f72e3 John Biddiscombe: Add numa-balanced mode to hpx::bind, spread cores over numa domains
<github>
[hpx] biddisco merged numa_balanced into throttle_cores: https://git.io/v51Ff
<jbjnr>
oops
<github>
[hpx] biddisco force-pushed throttle_cores from 62f72e3 to bb9f490: https://git.io/v5wWj
<github>
hpx/throttle_cores bb9f490 John Biddiscombe: Add numa-balanced mode to hpx::bind, spread cores over numa domains
<jbjnr>
force-pushed a fix to that last commit because I renamed something without checking. sorry
<github>
[hpx] biddisco force-pushed throttle_cores from bb9f490 to 2924fda: https://git.io/v5wWj
<github>
hpx/throttle_cores 2924fda John Biddiscombe: Add numa-balanced mode to hpx::bind, spread cores over numa domains
AnujSharma has joined #ste||ar
<heller>
jbjnr: hijacking my PR, hm?
<jbjnr>
not hijacking, but enhancing. I will remove the commits if you don't like them
<heller>
np
<jbjnr>
I will write a test today to keep hartmut happy
<heller>
thanks
<heller>
I think it would be wise to keep the balanced-numa distribution separate from the other fixes
<jbjnr>
in my calendar it says skype call this afternoon - did we agree to that, or was I being presumptuous?
<heller>
i think we agreed
<jbjnr>
I renamed it numa-balanced because otherwise the partit thingy chokes
<heller>
ok
<heller>
I am fine with the name
<jbjnr>
ok. I will move the commit back to my branch and remove it from the other PR
<heller>
thanks, smaller PRs are easier to handle ;)
<github>
[hpx] biddisco force-pushed throttle_cores from 2924fda to de6c7d7: https://git.io/v5wWj
<github>
[hpx] biddisco created numa_balanced (+1 new commit): https://git.io/v51NW
<github>
hpx/numa_balanced a71bee0 John Biddiscombe: Add numa-balanced mode to hpx::bind, spread cores over numa domains
<heller>
I want to have the SLURM PR in first though ...
<heller>
MBGA
<jbjnr>
?
<heller>
Make Buildbot Green Again
<github>
[hpx] biddisco opened pull request #2900: Add numa-balanced mode to hpx::bind, spread cores over numa domains (master...numa_balanced) https://git.io/v51Nu
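For reference: assuming the new mode is selected like the existing --hpx:bind modes (balanced, scatter, compact), usage would presumably look like the following; the application name and thread count are illustrative only:

    ./my_hpx_app --hpx:threads=8 --hpx:bind=numa-balanced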
<mcopik>
looks like the coroutine passes a nullptr to set_self, which does not make sense
<mcopik>
I checked for memory problems but I have run out of ideas
<mcopik>
any ideas what could cause such problems?
bikineev has joined #ste||ar
<heller>
mcopik: never saw this before
<heller>
mcopik: how would I reproduce it?
<mcopik>
heller: it's happening in my sycl stuff and the primary cause seems to be OpenCL callbacks
<heller>
hmm
<mcopik>
it used to work fine, but this problem started appearing after merging with HPX master and updating the SYCL compiler
<heller>
could be stack overflows
<heller>
does it also happen in debug builds?
<mcopik>
I was 100% sure that the callbacks were working correctly, and it looks like the future_data is updated and destroyed correctly
<mcopik>
heller: I'm running with larger values for --hpx:ini=hpx.stacks.small_size, no help
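For reference, such an override is passed as a command-line ini setting; the byte value below is an arbitrary example, not the one mcopik used:

    ./my_hpx_app --hpx:ini=hpx.stacks.small_size=0x40000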
<mcopik>
I'll try debug build
<mcopik>
and custom malloc
<mcopik>
heller: funny thing: I also started getting random deadlocks on shutdown. for some reason, Intel OpenCL throws errors randomly, even if the only action is creating a context and a device, and the stacktrace obtained after interrupting suggests that SYCL issues termination, HPX's termination_handler is called, and then there is a deadlock in some function called by the termination_handler. but it should not call anything except std::abort
<mcopik>
I wonder if these problems could be connected
<heller>
#8 0x00007ffff5845c27 in hpx::detail::throws_if (ec=..., errcode=hpx::invalid_status, msg="this function can be called from an HPX thread only", func="hpx::finalize",
<heller>
mcopik: next thing you should try is to comment out the sycl pieces bit by bit
<mcopik>
heller: yes, Thomas, I've done it
<mcopik>
heller: it works without futures
<heller>
so which one leads to the error?
<mcopik>
heller: creating a future after enqueue of OpenCL kernel
<mcopik>
obtaining a future before always works
<heller>
ok
<mcopik>
what I noticed is that when the future is created after the kernel enqueue
<mcopik>
the callback is always called immediately, and it's called before the future is created
<mcopik>
however, I'm 100% sure that the future data is not destroyed as long as it does not go out of scope
<heller>
it looks like there is some strange buffer overflow or similar going on then
<heller>
get_self_ptr() is stored inside a TLS segment
<heller>
you said it worked before?
<heller>
and using target.synchronize() is working as expected?
<mcopik>
heller: yes to first question
<mcopik>
heller: yes, pure synchronize works. I ran it hundreds of times and no segfaults have appeared
<heller>
hmmm
<heller>
what do you do inside get_future?
<heller>
can you upload the code somewhere?
<mcopik>
heller: should I perhaps build without native TLS? I recall Hartmut mentioning it might cause problems back when we were working on the AMD stuff
<hkaiser>
jkleinh: this is in heavy flux right now
<hkaiser>
we have not implemented any of this yet
bikineev has joined #ste||ar
<hkaiser>
the traits are functional, but will go away - sorry, you caught us in the middle of a major change here
<jkleinh>
ok, no problem. So you'd suggest writing against executor_traits and then adapting once the new interface has stabilized?
<jkleinh>
Also, is there an implementation of an executor that can distribute work over multiple localities somewhere in hpx, or are all the implemented executors strictly local?
<hkaiser>
jkleinh: we have the distribution_policy_executor
<hkaiser>
you give it a distribution policy and it will distribute the work using that
<hkaiser>
not tested too well, though - so any feedback is appreciated
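A minimal sketch of the usage hkaiser describes, assuming the HPX names of that era (hpx::parallel::make_distribution_policy_executor, hpx::components::default_layout) and a hypothetical plain action; exact headers and signatures may differ:

    #include <hpx/hpx_main.hpp>
    #include <hpx/include/actions.hpp>
    #include <hpx/include/parallel_executors.hpp>
    #include <hpx/include/runtime.hpp>

    int square(int x) { return x * x; }
    HPX_PLAIN_ACTION(square, square_action);   // example action, for illustration

    int main()
    {
        // wrap the default distribution policy over all localities in an executor
        auto exec = hpx::parallel::make_distribution_policy_executor(
            hpx::components::default_layout(hpx::find_all_localities()));

        // work submitted through the executor runs on a locality chosen by the policy
        hpx::future<int> f = exec.async_execute(square_action(), 7);
        return f.get() == 49 ? 0 : 1;
    }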
pree has quit [Read error: Connection reset by peer]
<hkaiser>
jkleinh: this is where we could use input and real use cases. nothing is set in stone
<github>
[hpx] sithhell created fix_service_executor (+1 new commit): https://git.io/v5MVZ
<github>
hpx/fix_service_executor 2c6e61f Thomas Heller: Fixing service_executor...
<jkleinh>
cool, that looks like the right thing. We are currently porting a fairly large quantum Monte Carlo code to hpx. I'll definitely let you know about any stumbling blocks we run into.
<heller>
can't reproduce the ignore_while_locked_1485 hang yet :/
hkaiser has quit [Quit: bye]
<diehlpk>
heller, when should we skype today?
<diehlpk>
zack and I contributed to the paper.
aserio has joined #ste||ar
rod_t has joined #ste||ar
pree has joined #ste||ar
aserio has quit [Read error: Connection reset by peer]
rod_t has quit [Client Quit]
aserio has joined #ste||ar
<heller>
diehlpk: can we move it to tomorrow please?
<diehlpk>
Ok, remember that the deadline is this Friday
<diehlpk>
And I do not have time to work on the paper tomorrow or Thursday.
<heller>
ok
<heller>
I'll polish it tomorrow
<jbjnr>
diehlpk do you still need input/help?
<diehlpk>
jbjnr, Yes, you could proofread the introduction and edit it
<jbjnr>
ok. I'll look at it this evening
<diehlpk>
Section 4 is finished too.
<jbjnr>
heller: sorry, completely forgot the other skype call. it can wait.
<diehlpk>
zbyerly, Will finish section 3 today.
<diehlpk>
I will finish the conclusion and outlook today.
<diehlpk>
jbjnr, Would be great if you could read the introduction and section 4
<jbjnr>
no problem
pree has quit [Read error: Connection reset by peer]
<heller>
jbjnr: no problem, I forgot as well
<heller>
jbjnr: let's try tomorrow
hkaiser has joined #ste||ar
diehlpk has quit [Ping timeout: 264 seconds]
rod_t has joined #ste||ar
pree has joined #ste||ar
rod_t has quit [Quit: My MacBook has gone to sleep. ZZZzzz…]
rod_t has joined #ste||ar
<jbjnr>
can anyone think of a reason why boost log/locale/graph/regex are linking to cuda on our cray, but other libs are not? (pulled in by the cray wrappers, but I'm not sure why)
pree has quit [Read error: Connection reset by peer]
<mcopik>
heller: same error with TLS disabled
<mcopik>
but now I see something really, really strange: two target futures are created, but the callback is executed three times
AnujSharma has quit [Ping timeout: 248 seconds]
pree has joined #ste||ar
<mcopik>
heller: no, it's different. when the callback is executed between the creation of the future_state and the corresponding future, the function passed to hpx::applier::register_thread_nullary is not executed at all?
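For context, this is roughly how an external (non-HPX) thread hands work to the runtime via the call mcopik names; the callback name, lambda body, and description string are placeholders, and the trailing defaulted parameters are omitted:

    #include <hpx/include/runtime.hpp>

    // invoked on a driver thread that is not managed by HPX
    void on_kernel_done()   // hypothetical callback
    {
        hpx::applier::register_thread_nullary(
            [] { /* mark the future data ready, etc. */ },
            "sycl_set_future_data");   // placeholder description
    }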
aserio has quit [Quit: aserio]
aserio has joined #ste||ar
pree has quit [Read error: Connection reset by peer]
hkaiser has quit [Read error: Connection reset by peer]
rod_t has quit [Quit: My MacBook has gone to sleep. ZZZzzz…]
zbyerly_ has joined #ste||ar
rod_t has joined #ste||ar
pree has joined #ste||ar
pree has quit [Read error: Connection reset by peer]
zbyerly_ has quit [Ping timeout: 240 seconds]
david_pfander has quit [Ping timeout: 248 seconds]
pree has joined #ste||ar
EverYoung has joined #ste||ar
pree has quit [Remote host closed the connection]
pree has joined #ste||ar
bibek_desktop has joined #ste||ar
<jkleinh>
is the distribution_policy concept defined somewhere?
<jkleinh>
default_distribution_policy has an async member function, which is used by distribution_policy_executor, but this method is missing from binpacking_distribution_policy
<jkleinh>
also the get_next_target method used by default_distribution_policy always returns the first locality
pree has quit [Read error: Connection reset by peer]
<jkleinh>
I'm not sure if this is desired behavior. Based on how the create method works, I would expect the async method of default_distribution_policy to iterate over localities cyclically
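A toy illustration of the cyclic selection jkleinh expects, as opposed to always handing back the first entry; this is not HPX code, just the round-robin idea:

    #include <hpx/include/naming.hpp>
    #include <atomic>
    #include <cstddef>
    #include <vector>

    struct round_robin_targets   // hypothetical helper, not an HPX type
    {
        std::vector<hpx::id_type> localities_;
        mutable std::atomic<std::size_t> next_{0};

        hpx::id_type get_next_target() const
        {
            // wrap around the locality list instead of pinning to localities_[0]
            return localities_[next_++ % localities_.size()];
        }
    };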
pree has joined #ste||ar
<zbyerly>
diehlpk_work, i'm almost done
<zbyerly>
diehlpk_work, do you mind if i proofread everything for grammar/spelling?
mbremer has joined #ste||ar
rod_t has quit [Quit: My MacBook has gone to sleep. ZZZzzz…]
pree has quit [Remote host closed the connection]
rod_t has joined #ste||ar
rod_t has quit [Client Quit]
pree has joined #ste||ar
hkaiser has joined #ste||ar
rod_t has joined #ste||ar
hkaiser has quit [Read error: Connection reset by peer]
<diehlpk_work>
zbyerly, Thanks. Sure go for it
pree has quit [Read error: Connection reset by peer]
<diehlpk_work>
I will extend the conclusion and outlook soon
Matombo has joined #ste||ar
pree has joined #ste||ar
EverYoung has quit [Ping timeout: 246 seconds]
rod_t has quit [Quit: My MacBook has gone to sleep. ZZZzzz…]
<zbyerly>
diehlpk_work, are we going to use any plots?
<diehlpk_work>
I assume that heller will provide some figures for section 2.
<heller>
You assume correctly
rod_t has joined #ste||ar
aserio has quit [Ping timeout: 264 seconds]
EverYoung has joined #ste||ar
rod_t has quit [Client Quit]
hkaiser has joined #ste||ar
EverYoun_ has joined #ste||ar
EverYoung has quit [Ping timeout: 260 seconds]
rod_t has joined #ste||ar
jkleinh has quit [Quit: Page closed]
eschnett has quit [Quit: eschnett]
StefanLSU has joined #ste||ar
EverYoun_ has quit [Remote host closed the connection]
Matombo has quit [Ping timeout: 240 seconds]
EverYoung has joined #ste||ar
<zbyerly>
diehlpk_work, i added another 2 paragraphs; right now I am at about 1.25 pages, and I have two more things to add
<diehlpk_work>
Ok, we still can shorten things later
<zbyerly>
diehlpk_work, yes, I was about to say I will most likely go over that limit, but we can trim it down later
jkleinh has joined #ste||ar
aserio has joined #ste||ar
pree has quit [Quit: AaBbCc]
Matombo has joined #ste||ar
StefanLSU has quit [Quit: StefanLSU]
Matombo has quit [Remote host closed the connection]
<hkaiser>
mcopik: the lock used can be safely acquired from both an hpx thread and a non-hpx-thread
akheir has joined #ste||ar
Matombo has joined #ste||ar
<heller>
mcopik: does this change anything?
<jbjnr>
jesus christ - who wrote that introduction? it's very strange!
<heller>
hkaiser: will you be in Berkeley next week?
<jbjnr>
zbyerly: diehlpk_work what is the page limit?
<zbyerly>
10 i think
<zbyerly>
jbjnr, jesus christ did not write any of the sections AFAIK
akheir has quit [Remote host closed the connection]
<jbjnr>
zbyerly: indeed
<jbjnr>
why are you worried about space? there are only 5 pages in there so far
eschnett has joined #ste||ar
Matombo has quit [Remote host closed the connection]
<jbjnr>
can I edit it freely, or will I conflict with others?
Matombo has joined #ste||ar
<mcopik>
heller: I'm asking because I'm not really knowledgeable about what HPX and non-HPX threads are allowed to do
Matombo has quit [Read error: Connection reset by peer]
<hkaiser>
heller: no
<zbyerly>
jbjnr, i think t. heller is going to bring the figures
Matombo has joined #ste||ar
<hkaiser>
jbjnr: go ahead
<hkaiser>
mcopik: sure - the old code was not really necessary, so we removed it
<hkaiser>
heller: will you come by for a day or two afterwards?
<mcopik>
heller: when running as an HPX thread, I can't confirm that the lambda passed to register_thread_nullary is ever executed in the case where the callback runs before the hpx::future is created
<zao>
« On the third day He returned from his \write18 and emitted a \section. »
<hkaiser>
ROFL
<hkaiser>
I hope that didn't offend anybody's religious feelings
<mcopik>
heller: and I get the failed assertion when executing get() on the future (non-HPX thread). and in the case where the future state is modified from a non-HPX thread, f.get() somehow succeeds, but I get the failed assertion in the coroutine's set_self
<zbyerly>
hkaiser, i don't think it's offensive to imply that Jesus would use LaTeX if the Bible were written today
<hkaiser>
yah, you can't call get() on a non-hpx-thread
<hkaiser>
get might suspend inside the future, set will not
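A minimal sketch of the asymmetry hkaiser describes, assuming the local promise type of that era: any OS thread may fulfil the shared state, but only an HPX thread should call get(), since get() may suspend:

    #include <hpx/include/lcos.hpp>
    #include <hpx/include/local_lcos.hpp>
    #include <thread>

    int example()   // assumed to run on an HPX thread
    {
        hpx::lcos::local::promise<int> p;
        hpx::future<int> f = p.get_future();

        // set_value does not suspend, so a plain OS thread may call it
        std::thread t([&p] { p.set_value(42); });

        int v = f.get();   // get() may suspend -> HPX threads only
        t.join();
        return v;
    }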
<heller>
mcopik: that makes perfect sense indeed
<heller>
hkaiser: hmm, I could do that
<hkaiser>
heller: I'd enjoy that!
<heller>
hkaiser: talk is Thursday, I could fly to br, and we could talk Friday.
<hkaiser>
nice
<heller>
Let's do it then
<hkaiser>
you won't get to br from SF in the afternoon/evening, though
<hkaiser>
except overnight - but I wouldn't suggest doing this
<heller>
This will be a 3k trip for the days, nice.
<heller>
Hmm
<zbyerly>
heller, FYI there are non-stop flights from SF to NOLA
<hkaiser>
ahh yes, I could pick you up there
<heller>
I was going to be back on Saturday, since I need to be in Stockholm on Monday
<heller>
zbyerly: good to know!
<hkaiser>
so come before the talk?
<hkaiser>
come monday, fly to SF Wed
<hkaiser>
heller: otoh, don't sweat it - np if it doesn't work out
<mcopik>
heller: I can't reproduce the issue when hpx::register_thread and unregister_thread are not called
<mcopik>
perhaps OpenCL setCallback fires the callback immediately, within the HPX thread which called setCallback?
<hkaiser>
mcopik: those calls shouldn't hurt, they can only prevent problems, not add them
<mcopik>
and then, when the callback is finished, this HPX thread calls unregister_thread
<mcopik>
couldn't that lead to my problems if an HPX thread tries to unregister itself?
<hkaiser>
yah, hpx threads shouldn't call [un]register_thread, that will blow things up
<hkaiser>
note to self - I should check that those are not called on hpx threads
<mcopik>
hkaiser: and I always assumed that the callback can only be called by a foreign thread coming from the OpenCL library
<mcopik>
which might not be true
<hkaiser>
nod
<mcopik>
shit
<mcopik>
two days of debugging
<mcopik>
I just hope that's the true cause
<hkaiser>
mcopik: check the result of hpx::threads::get_self_ptr(), it will be nullptr only on a non-hpx thread
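The check hkaiser suggests, as a minimal sketch; the callback name is hypothetical:

    #include <hpx/include/threads.hpp>
    #include <iostream>

    void on_event()   // may arrive on either kind of thread
    {
        if (hpx::threads::get_self_ptr() == nullptr)
            std::cout << "plain OS thread: may need registering first\n";
        else
            std::cout << "HPX thread: must not call [un]register_thread\n";
    }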
<heller>
hkaiser: I'll sleep over it
<mcopik>
hkaiser: yes, the ptr is not null
<mcopik>
heller: many thanks for helping me today, I think I solved it
<heller>
Great!
<heller>
I didn't do anything ;)
<mcopik>
now I only have to solve the deadlocks on Intel OpenCL
<heller>
hkaiser: I'll see what flight options I'll get
<heller>
You're free either way?
<hkaiser>
heller: cool
<hkaiser>
yes, I'll make it happen
<heller>
That is Wednesday or Friday?
<hkaiser>
heller: yes
<heller>
Great
<hkaiser>
Tuesday or Friday, I guess
<hkaiser>
not sure what flights there are over night
<heller>
Yeah, let's see
bikineev has joined #ste||ar
<mbremer>
@hkaiser: I finally have some profiling data. Would you have some time this week to sit down and talk me through it?
<hkaiser>
mbremer: sure, absolutely
<mbremer>
How would Wednesday or Thursday work?
<mbremer>
Also, do you have a tacc account or can I tar up these results and send them to you?
<mbremer>
The results were run using vtune17 update 4.
EverYoun_ has joined #ste||ar
<hkaiser>
mbremer: I might have a tacc account, but sending the files should work too
<hkaiser>
mbremer: Wed/Thu should work yah - gtg now, though
<hkaiser>
ttyl
hkaiser has quit [Quit: bye]
EverYoun_ has quit [Remote host closed the connection]
EverYoung has quit [Ping timeout: 246 seconds]
EverYoung has joined #ste||ar
Matombo has quit [Remote host closed the connection]
Matombo has joined #ste||ar
jaafar has joined #ste||ar
eschnett has quit [Quit: eschnett]
rod_t has quit [Quit: My MacBook has gone to sleep. ZZZzzz…]