hkaiser changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/
<simbergm>
they seem to time out every time, but only with that builder
<jbjnr>
you shouldn't be. the test is completely broken anyway
<simbergm>
how broken?
<jbjnr>
oh hold on, that's 4 different tests
<simbergm>
i.e. what do you know?
<jbjnr>
the block_executor doesn't work properly with the RP if I recall correctly
<simbergm>
also, if the scheduler branch is pretty far along it would probably be a good idea to open a PR already now so that we can review the parts that are ready
<simbergm>
at all?
<jbjnr>
those tests need updating. I did make flyby fixes to one of them but wish I hadn't, so this stuff has nothing to do with my numa allocator
<jbjnr>
I just didn't like there being an old numa allocator and numa transpose test that didn't work
<simbergm>
right :/
<simbergm>
something changed though... I'll have a look
<jbjnr>
I'll remove changes to that stuff from my pr - looking at the dashboard, it seems that the old tests used to at least pass
<jbjnr>
even if they were not doing numa very well.
<simbergm>
ok, I'm just confused by what might've caused the change... you only changed one of the transpose examples and the topology changes you made seem unrelated
<jbjnr>
correct
<jbjnr>
(we have so much old unmaintained code that can cause problems like this that hold us back from making changes - these are unmaintained examples and not proper unit tests)
<simbergm>
yep, also correct
<simbergm>
they're there to avoid them breaking without us knowing, but I'm not at all against removing them if they're not relevant anymore
<simbergm>
*the examples as tests are there...
mdiers_ has joined #ste||ar
<simbergm>
anyway, if you can remove the unrelated changes that'd be good, we can see if the examples still fail, and then decide if we should remove some of them
<simbergm>
it'd be a shame to remove them completely without a replacement
<jbjnr>
I was going to try to fix the numa_transpose - but it has a ton of code and uses the block_executor and its own numa allocator, plus a bunch of other stuff that doesn't really fit with the RP way of doing things
<simbergm>
and open a PR with the scheduler changes ;) you can set it to a "draft" PR so that we don't merge it before you're done with it
<jbjnr>
(launching threads on cores directly)
<simbergm>
feel free to fix it, but it can go in another PR
<jbjnr>
if I cannot get a simple numa allocator PR in, then there's no point in me submitting my scheduler fixes
<simbergm>
sure, that's why it's good to keep the numa allocator PR free from unrelated changes so that we can get it in
<jbjnr>
it still doesn't work on those docker containers and windows machines
<simbergm>
can you tell by the output why?
<simbergm>
either the numa allocator isn't general enough to run on a machine with one numa domain, or the test just doesn't make sense on whatever machine the circleci tests are run? if the latter you can skip the test as long as you can detect whatever condition it is that breaks it
<jbjnr>
simbergm: because ---- instead of 0000 - I need to put in a default fallback for machines that don't have the hwloc support that we use. It will end up just assuming numa node 0 for everything
<simbergm>
what does ---- mean? hwloc can't detect anything about numa domains?
<jbjnr>
in my unit test, I create an array, bind pages to numa nodes then create a string with the 'detected' numa node for each page and compare it to what it should get. If hwloc can't get the numa I return the string '-' instead of '0' or '1' etc, so the string compare of expected and detected fails.
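The comparison described above can be sketched without hwloc. This is a minimal illustration, not the actual unit test: `pattern_from_nodes`, `round_robin_binding` and `placement_matches` are hypothetical names, the round-robin binding is an assumed layout, and the detected node ids are passed in as plain integers (negative meaning "could not be determined") instead of being queried from the system.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Build one character per page: '0'..'9' for a known node id,
// '-' for anything else (assumed convention: negative = not detected).
std::string pattern_from_nodes(std::vector<int> const& page_nodes)
{
    std::string s;
    s.reserve(page_nodes.size());
    for (int node : page_nodes)
    {
        if (node >= 0 && node <= 9)
            s.push_back(static_cast<char>('0' + node));
        else
            s.push_back('-');
    }
    return s;
}

// Assumed binding for this sketch: page p is bound to node p % num_domains.
std::vector<int> round_robin_binding(std::size_t num_pages, int num_domains)
{
    assert(num_domains > 0 && num_domains <= 10);
    std::vector<int> nodes(num_pages);
    for (std::size_t p = 0; p < num_pages; ++p)
        nodes[p] = static_cast<int>(p % static_cast<std::size_t>(num_domains));
    return nodes;
}

// The check itself: the expected and detected patterns must be identical.
bool placement_matches(
    std::vector<int> const& expected, std::vector<int> const& detected)
{
    return pattern_from_nodes(expected) == pattern_from_nodes(detected);
}
```

With one domain expected and nothing detected for four pages, the two patterns are `0000` and `----`, so the comparison fails, which matches the failure described for the container builds.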
<jbjnr>
so far hwloc worked on laptop/daint/ault/greina/dom/etc, but circleci manages to surprise us
<simbergm>
jbjnr: ok, maybe skipping the test with a warning is better since the test doesn't make much sense after that? I just hope it won't then silently break on daint after hwloc changes something... not sure how we can get loud errors on daint but skip it on circleci
<jbjnr>
The test is quite thorough - if something changes, it will trigger a fail - that's the reason I have '-' as an output as well. If hwloc fails to get the number, it triggers a fail. What would be better is to know why circleci fails - I presume due to container use, but I have no idea how to set up a container and test for that
<jbjnr>
for windows we can easily just disable the test, or default to numa 0 - this is not a problem, but the container one is annoying, because using a default for that might cause silent fails on other machines in the future
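One possible way to get loud errors on some machines and a skip on others is to make the policy explicit rather than defaulting to node 0. The sketch below is an assumption, not something agreed in this discussion: `outcome`, `judge` and the `numa_required` flag are hypothetical, and how that flag would be set per machine (build option, environment variable, etc.) is left open.

```cpp
#include <cassert>
#include <string>

enum class outcome { pass, fail, skip };

// expected/detected hold one character per page ('0', '1', ... or '-').
// numa_required is a hypothetical per-machine switch: true where missing
// NUMA information must be treated as an error.
outcome judge(std::string const& expected, std::string const& detected,
    bool numa_required)
{
    assert(expected.size() == detected.size());

    if (detected == expected)
        return outcome::pass;

    // True only when no page at all reported a node.
    bool const nothing_detected = !detected.empty() &&
        detected.find_first_not_of('-') == std::string::npos;

    // Skip (with a warning at the call site) only when detection is
    // completely unavailable and this machine does not insist on it.
    if (nothing_detected && !numa_required)
        return outcome::skip;

    return outcome::fail;
}
```

A partially detected or wrong pattern still fails everywhere, so a regression on a machine where detection works is not hidden by the skip.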
<jbjnr>
is there a help section anywhere in the docs on container use of hpx?