aserio changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/
zbyerly_ has quit [Remote host closed the connection]
zbyerly_ has joined #ste||ar
EverYoung has quit [Ping timeout: 246 seconds]
vamatya has joined #ste||ar
zbyerly_ has quit [Remote host closed the connection]
zbyerly_ has joined #ste||ar
eschnett has joined #ste||ar
hkaiser has quit [Quit: bye]
vamatya has quit [Ping timeout: 240 seconds]
eschnett_ has joined #ste||ar
eschnett has quit [Ping timeout: 240 seconds]
eschnett_ is now known as eschnett
K-ballo has quit [Quit: K-ballo]
zbyerly_ has quit [Remote host closed the connection]
zbyerly_ has joined #ste||ar
patg has joined #ste||ar
patg is now known as Guest15479
zbyerly_ has quit [Remote host closed the connection]
zbyerly_ has joined #ste||ar
Guest15479 has quit [Read error: Connection reset by peer]
patg has joined #ste||ar
patg is now known as Guest25792
zbyerly_ has quit [Remote host closed the connection]
zbyerly_ has joined #ste||ar
taeguk has joined #ste||ar
zbyerly_ has quit [Remote host closed the connection]
zbyerly_ has joined #ste||ar
<heller> wash: sure
<heller> That's the plan ;)
taeguk has quit [Quit: Page closed]
zbyerly_ has quit [Remote host closed the connection]
zbyerly_ has joined #ste||ar
zbyerly_ has quit [Remote host closed the connection]
zbyerly_ has joined #ste||ar
bikineev has joined #ste||ar
bikineev has quit [Remote host closed the connection]
david_pfander has joined #ste||ar
zbyerly_ has quit [Remote host closed the connection]
zbyerly_ has joined #ste||ar
bikineev has joined #ste||ar
Matombo has joined #ste||ar
bikineev has quit [Ping timeout: 240 seconds]
<github> [hpx] StellarBot pushed 1 new commit to gh-pages: https://git.io/vQ7Yc
<github> hpx/gh-pages 2ad5d87 StellarBot: Updating docs
david_pfander1 has joined #ste||ar
david_pfander has quit [Ping timeout: 260 seconds]
david_pfander1 is now known as david_pfander
zbyerly_ has quit [Remote host closed the connection]
zbyerly_ has joined #ste||ar
bikineev has joined #ste||ar
<heller> david_pfander: see mail
<heller> i've identified some more "good" commits in the meantime
<heller> david_pfander: see here, all marked green are good.
<heller> the last commit that finished retesting is "6aec8ce"
<david_pfander> heller: saw your mail, I guess we should discuss that after it is more clear what the future of octotiger will be
<heller> yes
<heller> david_pfander: well, it doesn't hurt to take responsibility, this will help in future projects as well ;)
<david_pfander> heller: what do you mean with that?
<heller> it's very frustrating to dump hours and hours into debugging problems which should have been seen months ago...
<david_pfander> heller: tell me about it...
<heller> well, the unit tests have been failing for months now
<heller> "unit tests"
<heller> without a real attempt to fix them
<heller> the general opinion was: "This has to be the fault of HPX, FIX IT!"
hkaiser has joined #ste||ar
zbyerly_ has quit [Remote host closed the connection]
zbyerly_ has joined #ste||ar
<heller> hkaiser: good morning
<heller> see PM please
david_pfander has quit [Quit: david_pfander]
david_pfander has joined #ste||ar
mcopik has quit [Ping timeout: 240 seconds]
jbjnr has joined #ste||ar
zbyerly_ has quit [Remote host closed the connection]
zbyerly_ has joined #ste||ar
<heller> jbjnr: hey! how did your presentation go?
<jbjnr> fine. I gave an internal presentation of the zero-copy serialization work to our group
<jbjnr> I have some questions:
<jbjnr> if I disable the timer pool and the io-pool - what will stop working?
<hkaiser> timed suspension
<hkaiser> and all code relying on the io-pool, like octotiger
<hkaiser> jbjnr: why do you want to disable that? those threads are suspended most of the time
<jbjnr> Raffaele collected the task timings for all the tasks that do a dgemm on a single tile of the matrix and got a plot that looks like a two-humped camel. One hump corresponds to the time that it takes to do a dgemm on a tile of the matrix. The second hump is what we would get if a thread was being suspended by the OS for a timeslice. Running the matrix code on fewer threads improves the situation and improves the runtime, so I wanted to try disabling the timer pool and io-pool to see if it made any difference.
<jbjnr> are there any other threads that can run?
<jbjnr> (other codes that do not use hpx do not show this multi-hump timing)
<jbjnr> I've checked the affinity binding and can confirm that worker threads are pinned to OS cores uniquely (which I suspected might have been a possibility, but it turns out no)
<jbjnr> (I mean if two worker threads were bound to the same core, that would hose things in the manner we see)
ajaivgeorge has joined #ste||ar
<heller> jbjnr: interesting
eschnett has quit [Quit: eschnett]
<heller> jbjnr: that would mean that the timer and/or IO threads are periodically scheduled for some reason
<heller> jbjnr: I don't see a reason why you shouldn't be able to (temporarily) disable those two helper threads to see if it makes a difference
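A minimal sketch of one way to try this, assuming the hpx.threadpools.* configuration keys and the hpx::init overload that accepts extra configuration entries behave as expected in this HPX version (shrinking the pools to one thread, since fully disabling them may not be supported):

```cpp
// Sketch only: the hpx.threadpools.* values below are assumptions.
#include <hpx/hpx_init.hpp>

#include <string>
#include <vector>

int hpx_main(int argc, char* argv[])
{
    // ... run the tiled dgemm benchmark here ...
    return hpx::finalize();
}

int main(int argc, char* argv[])
{
    // Equivalent to passing --hpx:ini=hpx.threadpools.io_pool_size=1 etc.
    std::vector<std::string> const cfg = {
        "hpx.threadpools.io_pool_size=1",
        "hpx.threadpools.timer_pool_size=1"
    };
    return hpx::init(argc, argv, cfg);
}
```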
<jbjnr> I have removed io pool and timer pool, but there are still extra threads running
david_pfander has quit [Ping timeout: 268 seconds]
<heller> jbjnr: there is the main thread, which is just waiting on the runtime to shut down again
<jbjnr> and 4 more
<heller> jbjnr: there is currently no way to disable that
<heller> that would be the TCP parcelport threads?
<jbjnr> HPX_HAVE_NETWORKING=OFF
<heller> hmmm
<jbjnr> but tcp might still be active.
<jbjnr> tracking that down now
<heller> ok
<heller> maybe some CUDA related threads?
<jbjnr> does agas have any dedicated threads at all?
<heller> no, AGAS runs purely as HPX threads
<hkaiser> jbjnr: kill the main thread (the one running main())!
<hkaiser> jbjnr: FWIW hpx launches the following threads: main() (launched by the OS), 2 threads each per IO, timer, and TCP, and one thread per --hpx:threads=N
<hkaiser> so either kill main or the ones run by the scheduler - voilà, look ma, no threads
<hkaiser> jbjnr: may I see those humps, pls?
denis_blank has joined #ste||ar
<jbjnr> the red line is the mean. It tells us that we're only getting 60%-ish of peak. The matrix should be solved in 8 ms if we ran at peak constantly
<hkaiser> jbjnr: does this show results from many runs?
<jbjnr> this is one run
<hkaiser> ok, what do you measure then? many dgemm's?
<jbjnr> there are thousands of matrix multiplies in each run and each one is timed
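For reference, a sketch of the kind of per-task measurement described here; dgemm_tile() and record_sample() are hypothetical stand-ins for the real benchmark code:

```cpp
#include <hpx/util/high_resolution_timer.hpp>

#include <cstddef>

// Hypothetical hooks standing in for the actual tile multiply and the
// histogram sink used in the benchmark.
void dgemm_tile(double const* A, double const* B, double* C, std::size_t n);
void record_sample(double seconds);

void timed_tile_dgemm(double const* A, double const* B, double* C, std::size_t n)
{
    hpx::util::high_resolution_timer t;   // starts timing on construction
    dgemm_tile(A, B, C, n);               // one 512x512 tile multiply (assumed size)
    record_sample(t.elapsed());           // elapsed seconds feed the histogram
}
```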
<hkaiser> ok
<hkaiser> I don't think this is caused by stray threads being scheduled
<hkaiser> this is most likely caused by bad NUMA placement
<hkaiser> jbjnr: have you looked at cross NUMA memory traffic?
<jbjnr> I'd like to test the numa theory. I am worried that we do not schedule child tasks on the right threads
<jbjnr> ^no
<hkaiser> hpx does not do anything by default to run tasks on the numa domain where the data is located
<jbjnr> I know
<hkaiser> use executors and hpx::compute allocators
<jbjnr> I need to try fixing that, but I have to eliminate other options first
<hkaiser> NUMA is the most obvious and hurtful option
<hkaiser> I doubt the thread scheduling hurts you that much
<hkaiser> even more so as those threads sleep almost all the time
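A rough sketch of what is being suggested, assuming the host-side hpx::compute API of this HPX version (names and headers may differ): allocate the data NUMA-aware and run the work through an executor bound to the same domains.

```cpp
#include <hpx/include/compute.hpp>
#include <hpx/include/parallel_for_each.hpp>

#include <cstddef>

void numa_aware_touch(std::size_t n)
{
    // one target per NUMA domain of this node
    auto targets = hpx::compute::host::numa_domains();

    using allocator_type = hpx::compute::host::block_allocator<double>;
    using executor_type  = hpx::compute::host::block_executor<>;

    allocator_type alloc(targets);   // memory is placed per NUMA domain
    executor_type  exec(targets);    // tasks run on the domain owning the data

    hpx::compute::vector<double, allocator_type> v(n, 0.0, alloc);

    hpx::parallel::for_each(
        hpx::parallel::execution::par.on(exec),
        v.begin(), v.end(),
        [](double& x) { x = 1.0; });
}
```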
<jbjnr> the two peaks are too far apart in timing to be just a numa problem. the tile size is 512x512 double and fits into cache
<jbjnr> so there should only be the overhead of a single block fetch
<hkaiser> how do you know?
<hkaiser> well, reading stuff from the other numa domain is almost twice as slow as reading from your own memory
<jbjnr> well, I may be wrong. If many blocks are being fetched, then the memory BW will be a bottleneck and could cause this.
<hkaiser> the next (non-obvious) problem may be caused by the tasking nature of HPX
K-ballo has joined #ste||ar
<hkaiser> this may cause excessive TLB reloading which can hurt you massively
<jbjnr> but 2MB at 100GB/s should be only 1 or 2 ms delay
<jbjnr> we've got 10ms delay here
<heller> jbjnr: so this is a histogram of the timing of the different dgemm kernels?
bikineev has quit [Ping timeout: 260 seconds]
<jbjnr> yes essentially
hkaiser has quit [Read error: Connection reset by peer]
hkaiser_ has joined #ste||ar
<hkaiser_> jbjnr: all I'm saying is that I doubt the threads are the culprit
<heller> are they potentially suspending somehow?
<heller> jbjnr: I am taking sides with hkaiser_ ;)
zbyerly_ has quit [Remote host closed the connection]
zbyerly_ has joined #ste||ar
<jbjnr> question: when a task completes - and triggers a continuation - do we know which worker thread the completing task is running on? It would be trivial in my scheduler to use the same queue for the continuation as a first choice to run on
<jbjnr> instead of the round robin we use currently
<heller> yes
<jbjnr> this would help numa placement
<hkaiser_> on the thread which makes the future ready
<heller> jbjnr: good idea!
<heller> not when it has async launch policy
<hkaiser_> well, then it's scheduled on the current thread, I believe
<heller> not sure about that right now
<hkaiser_> let's have a look
<jbjnr> I'm talking about the queues in the thread pool. When a task is added to the queue, we use curr_queue++ and the tasks are added round robin, but if I know where the current one is ....
<hkaiser_> not always
<jbjnr> correct
<heller> as far as I can see, there is no thread specified
<hkaiser_> nod, I agree
<hkaiser_> right
<heller> jbjnr: this defaults to num_thread == -1, and only then our current schedulers do the round robin thing
<jbjnr> yes, there we do not pass the current queue number into create_thread
<jbjnr> yes, that's what I want, to replace the -1 with the one we are on now
<heller> so you can either change the thread placement policy locally (in that code) or in your scheduler globally
<jbjnr> I could get the thread num_tss and use that?
<heller> you have to figure out if the code on line 353 or on line 321 is called though ;)
<heller> one sec
<heller> ok, can't find the place where I did that last time...
<heller> where is the actual thread scheduling happening though?
<heller> jbjnr: btw, what is the median timing for the other solutions?
<jbjnr> ?
<jbjnr> don't understand the question
<jbjnr> so the thread init data comes from here and is just default initialized https://github.com/STEllAR-GROUP/hpx/blob/bb7ab04e938727f8f49354991ca520069083cfb2/hpx/runtime/applier/apply.hpp#L195
<jbjnr> I can just get the current OS thread from thread_num_tss, convert it to a pool index, and then make sure it goes into that queue first.
<heller> yes
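A sketch of the idea, under the assumption that the current worker number can simply replace the -1 default before it reaches the scheduler; hpx::get_worker_thread_num() is existing API, the plumbing into create_thread() is left out:

```cpp
#include <hpx/include/runtime.hpp>

#include <cstddef>

std::size_t preferred_queue_for_continuation()
{
    // Index of the OS worker thread the current HPX thread runs on; returns
    // std::size_t(-1) when not called from an HPX worker, which is exactly
    // the schedulers' existing "pick a queue round-robin" value.
    return hpx::get_worker_thread_num();
}
```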
<jbjnr> but is that apply function only called for a continuation, or ...
<heller> FWIW, the round robin scheduling is the most plausible explanation
<heller> you potentially have cross-domain atomic instructions to schedule those new threads
<heller> which leads to cache line invalidations etc.
<heller> which would very well explain this intense delay and the two humps
<jbjnr> yes, but we're 10ms slow. A cache line fail is tiny compared to that
<heller> not if it accumulates
<jbjnr> there is not that much contention
<jbjnr> these tasks take 8ms each
<jbjnr> and are running 10ms slow
<jbjnr> milli, not micro
<hkaiser_> jbjnr: look at cross numa-domain memory traffic at least once - that will answer your question - no need to second-guess
<heller> the contention happens when all want to schedule a new task, no?
<jbjnr> no
<heller> sure
<heller> curr_queue_ is a shared atomic, for example, the queues are implemented using atomics
<jbjnr> we see a lovely task trace, with all the tasks being scheduled beautifully, but a significant number of them just take twice as long as they should
<heller> which explains it
<heller> you don't see this delay in the task trace
<jbjnr> yes we do
<jbjnr> the contention appears as a staggering of the tasks
<heller> you see it because some tasks take longer, sure
<jbjnr> this is just tasks taking longer
<heller> but you don't see this effect as holes in the task trace
<heller> because scheduling a new task inside of the task still counts towards the task's time
<heller> there is no suspension or anything going on. the instructions just take longer
<jbjnr> you see them as holes because Raffaele's task profiling doesn't get created until the task is running
<heller> so the task traces are purely within Raffaele's user code, and they don't include, for example, the attaching of continuations?
<jbjnr> essentially yes
<heller> ok
<heller> then this theory is bullshit
<jbjnr> we can tell the difference between tasks starting late and tasks taking longer.
<heller> did you figure out what those remaining 4 OS threads are?
<jbjnr> there's a boost::asio::service thread that I thought I had disabled, and possibly an APEX thread
<jbjnr> I got distracted.
<jbjnr> if HPX_HAVE_NETWORKING is OFF, should the tcp threads still be created, or skipped altogether?
<zao> From Raymond Chen himself - »There's also std::future, but its lack of composability makes it a poor choice for asynchronous programming.«
<zao> (about the PPL, but still amusing to see)
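To illustrate the composability point: an hpx::future can be combined and continued without blocking, which a plain std::future (lacking .then/when_all) cannot do. A small sketch:

```cpp
#include <hpx/include/async.hpp>
#include <hpx/include/lcos.hpp>
#include <hpx/util/tuple.hpp>

#include <utility>

hpx::future<int> composed_sum()
{
    hpx::future<int> a = hpx::async([] { return 21; });
    hpx::future<int> b = hpx::async([] { return 21; });

    using both_type = hpx::util::tuple<hpx::future<int>, hpx::future<int>>;

    // Compose the two results asynchronously; with std::future the only
    // option would be to block in get().
    return hpx::when_all(std::move(a), std::move(b))
        .then([](hpx::future<both_type> both) {
            both_type futures = both.get();
            return hpx::util::get<0>(futures).get()
                 + hpx::util::get<1>(futures).get();
        });
}
```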
<heller> jbjnr: they shouldn't run if you disabled networking
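A small sketch of the compile-time switch being discussed; the HPX_HAVE_NETWORKING spelling follows the log, and the guarded behaviour is the expectation rather than a guarantee:

```cpp
#include <hpx/config.hpp>

bool networking_compiled_in()
{
#if defined(HPX_HAVE_NETWORKING)
    return true;    // parcelports (and their service threads) exist
#else
    return false;   // no parcelport threads should be created at all
#endif
}
```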
<heller> zao: source?
diehlpk_work has joined #ste||ar
<diehlpk_work> Next week I will be at a conference trying to advertise HPX in the engineering community, so I will not be available via IRC
mcopik has joined #ste||ar
Guest25792 has quit [Quit: This computer has gone to sleep]
zbyerly_ has quit [Remote host closed the connection]
vamatya has joined #ste||ar
<jbjnr> hkaiser_: do you know if apex allows collection of papi type data yet?
<jbjnr> (HPX5 git repo is not getting much activity these days)
hkaiser_ has quit [Quit: bye]
<K-ballo> I love the new state of the branches btw.. can we drop 5 or 6 more? or is it asking for too much?
vamatya has quit [Ping timeout: 260 seconds]
<jbjnr> hmm. I thought I deleted all mine, but some are still there
<diehlpk_work> O'Reilly is not interested in publishing the HPX book
<K-ballo> lol, I was just kidding, didn't mean yours specifically jbjnr
<jbjnr> K-ballo: well if I do a count of branches in my local tree, including all remote clone branches, it's 647, which frankly is a trifle large, so a bit of a cleanup is not a bad thing
<K-ballo> wow :|
<K-ballo> I assume you've never done a prune on sync?
<jbjnr> and HPX is just one of several projects that I have too many branches in :(
<jbjnr> I don't prune often. Too afraid of losing something that "might be useful one day"
<K-ballo> then I can see getting to 647 for HPX rather quickly
<jbjnr> indeed
<K-ballo> I have like 39.. counting by hand, not sure how to measure properly
<jbjnr> I have a similar number for paraview/vtk/vtkm etc
<K-ballo> I prune pretty much every time I fetch from upstream
<jbjnr> 39 sounds about right for local branches. I also counted the remote copies etc etc. Say 100 each for 5 or 6 repos in all.
<jbjnr> so most are copies
<K-ballo> I count 9 local refs, 9 in origin, 21 in upstream
<K-ballo> local + origin = 15 unique
aserio has joined #ste||ar
bikineev has joined #ste||ar
hkaiser has joined #ste||ar
zbyerly_ has joined #ste||ar
<K-ballo> why is StellarBot blocked on github?
<heller> what do you mean?
bikineev has quit [Ping timeout: 258 seconds]
hkaiser has quit [Quit: bye]
ajaivgeorge has quit [Quit: ajaivgeorge]
<wash> aserio: ping
hkaiser has joined #ste||ar
<aserio> wash: here (but in a meeting)
aserio1 has joined #ste||ar
aserio has quit [Ping timeout: 246 seconds]
aserio1 is now known as aserio
<diehlpk_work> In 9 days the second evaluation starts: denis_blank ABresting thundergroudon[m] taeguk[m]
bikineev has joined #ste||ar
<denis_blank> diehlpk_work: Ok, thanks for noticing
aserio has quit [Ping timeout: 246 seconds]
bikineev has quit [Ping timeout: 260 seconds]
mars0000 has joined #ste||ar
aserio has joined #ste||ar
bikineev has joined #ste||ar
EverYoung has joined #ste||ar
aserio has quit [Ping timeout: 240 seconds]
aserio has joined #ste||ar
denis_blank has quit [Quit: denis_blank]
zbyerly_ has quit [Ping timeout: 276 seconds]
bikineev has quit [Ping timeout: 240 seconds]
mars0000 has quit [Quit: mars0000]
bikineev has joined #ste||ar
bikineev has quit [Remote host closed the connection]
<ABresting> diehlpk_work: +1
vamatya has joined #ste||ar
aserio has quit [Ping timeout: 240 seconds]
EverYoung has quit [Ping timeout: 246 seconds]
EverYoung has joined #ste||ar
<wash> aserio: We should touch base re the paper sometime next week
EverYoun_ has joined #ste||ar
EverYoun_ has quit [Remote host closed the connection]
EverYoung has quit [Ping timeout: 276 seconds]
EverYoung has joined #ste||ar
<github> [hpx] hkaiser force-pushed simple_base_lco from 6430975 to e4a4c2c: https://git.io/vQ5mv
<github> hpx/simple_base_lco e4a4c2c Hartmut Kaiser: Leave managed base_lco_with_value the default...
aserio has joined #ste||ar
ajaivgeorge has joined #ste||ar
ajaivgeorge has quit [Remote host closed the connection]
ajaivgeorge has joined #ste||ar
ajaivgeorge has quit [Read error: Connection reset by peer]
ajaivgeorge has joined #ste||ar
antoinet has joined #ste||ar
antoinet has quit [Remote host closed the connection]
K-ballo has quit [Read error: Connection reset by peer]
K-ballo has joined #ste||ar
ajaivgeorge_ has joined #ste||ar
ajaivgeorge has quit [Read error: Connection reset by peer]
ajaivgeorge has joined #ste||ar
ajaivgeorge_ has quit [Client Quit]
mars0000 has joined #ste||ar
<diehlpk_work> Does anyone know why my code compiles and links with gcc, but with clang I get this:
<diehlpk_work> In function `void output<problem::Quasistatic<material::Elastic> >(IO::deck::PD*, problem::Quasistatic<material::Elastic>*)':
ajaivgeorge has quit [Client Quit]
ajaivgeorge has joined #ste||ar
<diehlpk_work> undefined reference
mars0000 has quit [Quit: mars0000]
<heller> diehlpk_work: I don't see an undefined reference.
<heller> And please, use a paste site
<heller> Looking at my crystal ball, you probably link against a library which uses an ABI-incompatible stdlib
<heller> That is, at least one of your dependencies was compiled with gcc
<heller> And/or a different flavor of C++
<diehlpk_work> No, the lib is compiled with clang too
<heller> You didn't export the symbol?
<diehlpk_work> Will check after the meeting, but why does it work with gcc then?
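One common cause of an undefined reference like this (a guess, not a diagnosis; symbol visibility or ABI differences between the toolchains are other candidates): the function template output<> is defined in a .cpp file, so no instantiation for problem::Quasistatic<material::Elastic> ends up in any object file. A sketch using the names from the log, with a hypothetical file layout:

```cpp
// output.cpp (hypothetical layout; only pointers are used, so forward
// declarations of the project types are enough to make this self-contained)
namespace IO { namespace deck { struct PD; } }
namespace material { struct Elastic; }
namespace problem { template <typename Material> struct Quasistatic; }

template <typename Problem>
void output(IO::deck::PD* deck, Problem* problem)
{
    (void) deck;
    (void) problem;
    // ... real implementation writes the results ...
}

// Explicit instantiation next to the definition, so the symbol the failing
// translation unit links against actually exists.
template void output<problem::Quasistatic<material::Elastic>>(
    IO::deck::PD*, problem::Quasistatic<material::Elastic>*);
```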
hkaiser has quit [Quit: bye]
EverYoung has quit [Remote host closed the connection]
ajaivgeorge has quit [Ping timeout: 268 seconds]
aserio has quit [Ping timeout: 246 seconds]
ajaivgeorge has joined #ste||ar
aserio has joined #ste||ar
mars0000 has joined #ste||ar
EverYoung has joined #ste||ar
bikineev has joined #ste||ar
hkaiser has joined #ste||ar
bikineev has quit [Remote host closed the connection]
<aserio> hkaiser: yt?
mcopik has quit [Ping timeout: 260 seconds]
bikineev has joined #ste||ar
aserio has quit [Quit: aserio]
diehlpk_work has quit [Quit: Leaving]
Matombo has quit [Remote host closed the connection]
mars0000 has quit [Quit: mars0000]
bikineev has quit [Remote host closed the connection]
bikineev has joined #ste||ar
bikineev has quit [Remote host closed the connection]
EverYoun_ has joined #ste||ar
EverYoung has quit [Ping timeout: 246 seconds]
EverYoun_ has quit [Ping timeout: 240 seconds]
EverYoung has joined #ste||ar
EverYoung has quit [Remote host closed the connection]
EverYoung has joined #ste||ar
hkaiser_ has joined #ste||ar
hkaiser has quit [Read error: Connection reset by peer]
hkaiser_ has quit [Read error: Connection reset by peer]
hkaiser has joined #ste||ar