hkaiser changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/ | GSoC: https://github.com/STEllAR-GROUP/hpx/wiki/Google-Summer-of-Code-%28GSoC%29-2020
kale[m] has quit [Ping timeout: 260 seconds]
kale[m] has joined #ste||ar
hkaiser has quit [Quit: bye]
K-ballo has quit [Ping timeout: 246 seconds]
K-ballo has joined #ste||ar
kale[m] has quit [Ping timeout: 240 seconds]
oleg[m] has quit [Ping timeout: 246 seconds]
ralph[m] has quit [Ping timeout: 246 seconds]
gnikunj[m] has quit [Ping timeout: 246 seconds]
parsa[m] has quit [Ping timeout: 246 seconds]
gonidelis[m] has quit [Remote host closed the connection]
kimbo[m] has quit [Read error: Connection reset by peer]
neill[m] has quit [Write error: Connection reset by peer]
gretax[m] has quit [Write error: Connection reset by peer]
diehlpk_mobile[m has quit [Read error: Connection reset by peer]
mdiers[m] has quit [Read error: Connection reset by peer]
mathegenie[m] has quit [Read error: Connection reset by peer]
carola[m] has quit [Read error: Connection reset by peer]
tiagofg[m] has quit [Read error: Connection reset by peer]
heller1 has quit [Remote host closed the connection]
bennie[m] has quit [Remote host closed the connection]
camila[m] has quit [Remote host closed the connection]
richard[m]1 has quit [Remote host closed the connection]
teonnik has quit [Remote host closed the connection]
joe[m] has quit [Remote host closed the connection]
klaus[m] has quit [Remote host closed the connection]
sidhu[m] has quit [Remote host closed the connection]
gdaiss[m] has quit [Remote host closed the connection]
k-ballo[m] has quit [Remote host closed the connection]
rori has quit [Remote host closed the connection]
marzipan[m] has quit [Read error: Connection reset by peer]
smith[m] has quit [Write error: Connection reset by peer]
ms[m] has quit [Remote host closed the connection]
spring[m] has quit [Write error: Connection reset by peer]
tiagofg[m] has joined #ste||ar
diehlpk_mobile[m has joined #ste||ar
oleg[m] has joined #ste||ar
ms[m] has joined #ste||ar
camila[m] has joined #ste||ar
rori has joined #ste||ar
spring[m] has joined #ste||ar
klaus[m] has joined #ste||ar
mathegenie[m] has joined #ste||ar
smith[m] has joined #ste||ar
marzipan[m] has joined #ste||ar
joe[m] has joined #ste||ar
neill[m] has joined #ste||ar
richard[m]1 has joined #ste||ar
carola[m] has joined #ste||ar
bennie[m] has joined #ste||ar
gretax[m] has joined #ste||ar
teonnik has joined #ste||ar
gdaiss[m] has joined #ste||ar
mdiers[m] has joined #ste||ar
kimbo[m] has joined #ste||ar
sidhu[m] has joined #ste||ar
ralph[m] has joined #ste||ar
gnikunj[m] has joined #ste||ar
k-ballo[m] has joined #ste||ar
heller1 has joined #ste||ar
parsa[m] has joined #ste||ar
gonidelis[m] has joined #ste||ar
bita has quit [Ping timeout: 260 seconds]
elfring has joined #ste||ar
nanmiao99 has quit [Ping timeout: 245 seconds]
norbert[m]2 has joined #ste||ar
smith[m] has quit [Quit: killed]
joe[m] has quit [Quit: killed]
tiagofg[m] has quit [Quit: killed]
diehlpk_mobile[m has quit [Quit: killed]
neill[m] has quit [Quit: killed]
ms[m] has quit [Quit: killed]
rori has quit [Quit: killed]
gdaiss[m] has quit [Quit: killed]
teonnik has quit [Quit: killed]
k-ballo[m] has quit [Quit: killed]
camila[m] has quit [Quit: killed]
klaus[m] has quit [Quit: killed]
marzipan[m] has quit [Quit: killed]
richard[m]1 has quit [Quit: killed]
gnikunj[m] has quit [Quit: killed]
carola[m] has quit [Quit: killed]
gretax[m] has quit [Quit: killed]
mathegenie[m] has quit [Quit: killed]
spring[m] has quit [Quit: killed]
ralph[m] has quit [Quit: killed]
bennie[m] has quit [Quit: killed]
sidhu[m] has quit [Quit: killed]
gonidelis[m] has quit [Quit: killed]
oleg[m] has quit [Quit: killed]
kimbo[m] has quit [Quit: killed]
mdiers[m] has quit [Quit: killed]
heller1 has quit [Quit: killed]
parsa[m] has quit [Quit: killed]
norbert[m]2 has quit [Quit: killed]
kale[m] has joined #ste||ar
norbert[m] has joined #ste||ar
rori has joined #ste||ar
smith[m] has joined #ste||ar
ralph[m] has joined #ste||ar
gretax[m] has joined #ste||ar
joe[m] has joined #ste||ar
klaus[m] has joined #ste||ar
marzipan[m] has joined #ste||ar
carola[m] has joined #ste||ar
camila[m] has joined #ste||ar
oleg[m] has joined #ste||ar
ms[m] has joined #ste||ar
neill[m] has joined #ste||ar
richard[m]1 has joined #ste||ar
spring[m] has joined #ste||ar
mathegenie[m] has joined #ste||ar
bennie[m] has joined #ste||ar
k-ballo[m] has joined #ste||ar
heller1 has joined #ste||ar
teonnik has joined #ste||ar
gdaiss[m] has joined #ste||ar
sidhu[m] has joined #ste||ar
diehlpk_mobile[m has joined #ste||ar
gnikunj[m] has joined #ste||ar
kimbo[m] has joined #ste||ar
tiagofg[m] has joined #ste||ar
mdiers[m] has joined #ste||ar
parsa[m] has joined #ste||ar
gonidelis[m] has joined #ste||ar
hkaiser has joined #ste||ar
<gnikunj[m]> hkaiser: yt?
<hkaiser> here
<gnikunj[m]> I saw your work on the executors. Thanks a lot!
<gnikunj[m]> I'm using them in the 1d stencil example
<gnikunj[m]> also, how do I add validation to a parallel_for? Does a parallel for_each even return anything?
<hkaiser> gnikunj[m]: that's a good question ;-)
<hkaiser> the loops run the chunks in parallel, for which a) there is no way to know (from inside a chunk) what its boundaries are, and b) they don't return anything you could check on
<hkaiser> to support that use case (which I really would like to support for Kokkos) we would need some changes to the API/implementation
<gnikunj[m]> API changes in executors or resiliency work?
<hkaiser> currently we assume that the replayed/replicated task covers all of the work, in parallel loops this is not the case
<hkaiser> resiliency
<gnikunj[m]> I see. What changes are we looking at?
<gnikunj[m]> I wanted to add algorithm-based fault tolerance (essentially replay/replicate validate) to the 1d stencil
<hkaiser> don't know yet, I haven't thought about it too much; for now I've shelved this problem
<hkaiser> yah, that's what should be possible in the end
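For context, the task-level replay/replicate API referred to here already exists in HPX; per-chunk replay inside a parallel loop is what is missing. Below is a minimal sketch, assuming the hpx::resiliency::experimental namespace, an async_replay_validate(n, validate, f) signature, and the modern top-level headers (all from memory, worth checking against the resiliency module). It shows the coarse-grained workaround implied above: wrapping an entire parallel sweep as one replayable task.

```cpp
// Sketch only: namespace, header paths, and signatures are assumptions.
#include <hpx/algorithm.hpp>
#include <hpx/execution.hpp>
#include <hpx/init.hpp>
#include <hpx/modules/resiliency.hpp>

#include <vector>

int hpx_main()
{
    std::vector<double> grid(1000, 1.0);

    // one full sweep; stands in for the 1d stencil update
    auto sweep = [&grid] {
        hpx::for_each(hpx::execution::par, grid.begin(), grid.end(),
            [](double& x) { x *= 0.5; });
        return grid[0];    // value handed to the validator below
    };

    // replay the *whole* sweep (up to 3 attempts) if validation fails;
    // per-chunk replay inside the loop is the unsolved part discussed here
    hpx::future<double> f =
        hpx::resiliency::experimental::async_replay_validate(
            3,                                    // max attempts
            [](double v) { return v == 0.5; },    // hypothetical validator
            sweep);
    f.get();

    return hpx::finalize();
}

int main(int argc, char* argv[])
{
    return hpx::init(argc, argv);
}
```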
<gnikunj[m]> any messy way to get the validation done?
<gonidelis[m]> Hello all!
<hkaiser> might be that I designed the resiliency executors wrongly
<hkaiser> hey gonidelis[m] welcome back
<hkaiser> gnikunj[m]: there is an executor hook function called by the algorithms after all the work is done; we could try to use that to do the verification and replay if needed
<gnikunj[m]> hkaiser: alright. Let me see what I can do to inject some validation code in there. Would be nice to have results for validation as well. Other benchmarks are currently running on LONI, so we should have everything very soon.
<hkaiser> nod, thanks
<gnikunj[m]> Could you also email me regarding Overleaf access? I'll start working on sections of the paper as well.
<gnikunj[m]> hkaiser: I'll look into it. Thanks!
<hkaiser> gnikunj[m]: here is an example for this hook, albeit used for something else: https://github.com/STEllAR-GROUP/hpx/blob/master/examples/quickstart/disable_thread_stealing_executor.cpp#L59-L64
<hkaiser> this function will be called once the algorithm has done all the work
<hkaiser> (if it's exposed by the executor)
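The shape of that pattern, modeled loosely on the linked disable_thread_stealing_executor example: a wrapper executor exposing mark_begin_execution/mark_end_execution hooks (names as they appear in that file, to the best of my reading; the trait plumbing that lets the algorithms find the hooks is omitted), with a purely hypothetical validation callback in the end hook.

```cpp
#include <functional>
#include <utility>

// Wraps any base executor and runs a user-supplied check once the
// algorithm has finished all of its work (the hook is only called if
// the executor exposes it, as noted above).
template <typename BaseExecutor>
struct validating_executor : BaseExecutor
{
    validating_executor(BaseExecutor const& exec, std::function<bool()> v)
      : BaseExecutor(exec)
      , validate_(std::move(v))
    {
    }

    // invoked by the parallel algorithms before any work is scheduled
    template <typename Parameters>
    void mark_begin_execution(Parameters&&)
    {
    }

    // invoked once the algorithm has done all the work: a natural spot
    // for validation; triggering a replay from here is the open question
    template <typename Parameters>
    void mark_end_execution(Parameters&&)
    {
        if (!validate_())
        {
            // e.g. set a flag the caller inspects, or throw to abort;
            // a real replay mechanism would need API changes (see above)
        }
    }

    std::function<bool()> validate_;
};
```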
<gonidelis[m]> How's the team been going?
<gnikunj[m]> hkaiser: so if we define mark_end_execution on the resiliency executors, I should be able to call a validation function. Right?
<hkaiser> yes
<gnikunj[m]> freenode_gonidelis[m]: doing good. How are you?
<hkaiser> and somehow trigger replay
<hkaiser> not sure how, though ;-)
<gnikunj[m]> hkaiser: alright. Let me see what I can do. I need to get everything done before Friday so that we have a week to polish up the work.
<gnikunj[m]> and rerun any benchmarks if need be
<hkaiser> gnikunj[m]: I wouldn't try to implement the algorithmic resiliency at this point, unless you have an idea of how to do it
<hkaiser> but my suspicion is that this might require changes to the algorithms themselves
<hkaiser> gnikunj[m]: for Kokkos we were planning to reimplement the parallel algorithms anyway; there it might be easier to have resiliency implemented
<gnikunj[m]> hkaiser: true. I'm focusing on the current results at the moment. I'll try to play with the executor stuff once I'm done with this.
<hkaiser> ok
<gonidelis[m]> hkaiser: yt?
weilewei has joined #ste||ar
<hkaiser> gonidelis[m]: here
elfring has quit [Quit: Konversation terminated!]
bita has joined #ste||ar
<gonidelis[m]> I just pushed tests for the non-exec policy overloads of transform
<gonidelis[m]> I imitated your work on `generate`
<gonidelis[m]> We do not check bad_alloc test cases without exec policy arg, right?
<gonidelis[m]> hkaiser: ^^
<hkaiser> gonidelis[m]: we should, though ;-)
<gonidelis[m]> oh ok... I don't really get what bad_alloc tests are for
<hkaiser> I regret that I didn't do it right away, wouldn't have been too much work...
<diehlpk_work> hkaiser, Octotiger meeting
<hkaiser> gonidelis[m]: we talked about exception handling for parallel algorithms
<hkaiser> in HPX we have adopted the exception policy that was proposed back then
<hkaiser> under that policy, bad_alloc is handled in a specific way, namely it is rethrown by the algorithm without change
<gonidelis[m]> hkaiser: oh okkk... right. I thought these were the `test_algo_exception` tests though...
<hkaiser> while all other exceptions have to be wrapped into an exception_list
<gonidelis[m]> hkaiser: ohhhhhhh great!!!!! Thank you very much... I will `amend` a bad_alloc test without exec-policy then
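To make that policy concrete, here is a hypothetical test sketch (not taken from the HPX suite; the top-level header paths are assumptions and may differ for the 1.5 era) for the policy overloads: bad_alloc escapes the algorithm unchanged, while anything else surfaces wrapped in an hpx::exception_list.

```cpp
#include <hpx/algorithm.hpp>
#include <hpx/execution.hpp>
#include <hpx/init.hpp>
#include <hpx/modules/errors.hpp>    // hpx::exception_list (path assumed)

#include <cassert>
#include <new>
#include <stdexcept>
#include <vector>

int hpx_main()
{
    std::vector<int> in(100, 1), out(100);

    bool caught_list = false;
    try
    {
        // an ordinary exception must arrive wrapped in an exception_list
        hpx::transform(hpx::execution::par, in.begin(), in.end(),
            out.begin(), [](int) -> int { throw std::runtime_error("boom"); });
    }
    catch (hpx::exception_list const& el)
    {
        caught_list = el.size() != 0;
    }
    assert(caught_list);

    bool caught_bad_alloc = false;
    try
    {
        // bad_alloc must be rethrown by the algorithm without change
        hpx::transform(hpx::execution::par, in.begin(), in.end(),
            out.begin(), [](int) -> int { throw std::bad_alloc(); });
    }
    catch (std::bad_alloc const&)
    {
        caught_bad_alloc = true;
    }
    assert(caught_bad_alloc);

    return hpx::finalize();
}

int main(int argc, char* argv[])
{
    return hpx::init(argc, argv);
}
```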
<gonidelis[m]> hkaiser: After these tests pass, I'll proceed to the `binary transforms`. (They should be quite easy after having adapted the unary transform.)
<hkaiser> gonidelis[m]: thanks, keep in mind that it's missing from the other CPO-based algorithms as well ;-)
<hkaiser> gonidelis[m]: yes, binary_transform sounds good
<gonidelis[m]> hkaiser: Well as you suggested: it shouldn't be that much work after all
<hkaiser> indeed
<hkaiser> but this would be for a separate PR, I think
<gonidelis[m]> hkaiser: oh... So do we like add a stand-alone PR for bad_alloc tests without exec-policies for all the ported algos?
<gonidelis[m]> ...all the ported algos thus far*
<hkaiser> yah, might be worth doing... and sorry again - I should have done that right away
<gonidelis[m]> hkaiser: ahh don't worry. We sure can handle it any time we want. Either we create a ticket and wait for all the adaptations to be completed first, or we just create a PR and rebase and update it every time we adapt an algo...
<gnikunj[m]> hkaiser: we may want to ditch 1d stencil :/ You were right about the overheads of sending data over the wire
<gnikunj[m]> the difference is not just noticeable, it is about 30x
<gnikunj[m]> the normal stencil runs in 8s, while with a single faulty node it takes ~250s
<hkaiser> gnikunj[m]: nod, I'm not surprised
<gnikunj[m]> what should we do?
<gnikunj[m]> any data movement related benchmark will show poorly on our paper
<gnikunj[m]> should I show 1d stencil for a single node and fractal for distributed? That's kinda cheating but we get our point across
<gonidelis[m]> hkaiser: Btw, you can check the tests I pushed whenever you have free time just to be sure...
<hkaiser> ok
<hkaiser> gnikunj[m]: yah, that's what we have; it would be cheating to pretend we have something else
<hkaiser> gnikunj[m]: as discussed, I think we can make the point why we use these benchmarks
<gnikunj[m]> yes, we can always say that distributed APIs are best suited when trying something from scratch that requires initialization on the node it is called on
<gnikunj[m]> or make something similar up. We need to address the point of data movement though, just to let the reviewers know that a person is better off restarting their benchmarks/checkpointing than relying on our distributed APIs in case of heavy data movement
<gnikunj[m]> hkaiser: can we think of a better example than fractals? we will have to show the usage of validate predicate as well.
<hkaiser> let me think
<gnikunj[m]> ok. I'll think of something as well.
akheir has joined #ste||ar
K-ballo has quit [Ping timeout: 240 seconds]
K-ballo has joined #ste||ar
nanmiao11 has joined #ste||ar
<nanmiao11> gnikunj[m] see email please
<gnikunj[m]> nanmiao11: replied
<diehlpk_work> hkaiser, how did you set up the coveralls for HPX?
kale[m] has quit [Ping timeout: 240 seconds]
<hkaiser> diehlpk_work: you mean this: https://coveralls.io/github/STEllAR-GROUP/hpx?
<diehlpk_work> yes
<diehlpk_work> I did the same for the load balancing code, but have some issues
<hkaiser> diehlpk_work: talk to ms[m], they run it over at CSCS
<diehlpk_work> same for the perihpx code
<diehlpk_work> ms[m], I have some questions how to setup the coveralls.io
<gonidelis[m]> diehlpk_work: wow! What is that?
<gonidelis[m]> The coverall I mean ^^
<diehlpk_work> gonidelis[m], Some tool to check the quality of your code
<gonidelis[m]> diehlpk_work: it's beautiful ;p
<ms[m]> diehlpk_work: this is the config file that creates the coverage reports and then uploads them to coveralls: https://gitlab.com/cscs-ci/STEllAR-GROUP/gitlab-pipeline/-/blob/master/.gitlab-ci.yml
<ms[m]> are you wondering about the gitlab ci setup, the coveralls configuration, something else, or all of it? ;)
<ms[m]> freenode_gonidelis[m]: specifically, it's a coverage tool that checks which lines of code actually get run when tests and examples are run
<gonidelis[m]> ms[m]: so it's like you recognize the "useless" lines of code?
<ms[m]> freenode_gonidelis[m]: that's one way of seeing it :P typically if you have code that is not exercised by tests you'd want to add tests that exercise that code
<ms[m]> but likewise you might find unused code like that as well
<diehlpk_work> However, the individual files show 100% coverage but the folder shows 0%
<ms[m]> uhh, I don't know anything about that unfortunately
<ms[m]> I could ask the guy who set it up though
<gonidelis[m]> ms[m]: great tool! Thanks!
<gonidelis[m]> hkaiser: hey we could run this tool to check whether we run tests for all of our `tag_invoke` overloads. ms[m] too
<hkaiser> gonidelis[m]: yes, absolutely
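Along those lines, a toy sketch of the kind of test such a coverage run would check for: one that touches each user-facing overload of an adapted CPO at least once. hpx::transform is used as the example here, with the modern top-level headers assumed.

```cpp
#include <hpx/algorithm.hpp>
#include <hpx/execution.hpp>
#include <hpx/init.hpp>

#include <vector>

int hpx_main()
{
    std::vector<int> in(10, 1), out(10);
    auto twice = [](int x) { return 2 * x; };

    // non-policy overload
    hpx::transform(in.begin(), in.end(), out.begin(), twice);

    // execution-policy overloads; a coverage report shows whether the
    // corresponding tag_invoke overloads were actually exercised
    hpx::transform(hpx::execution::seq, in.begin(), in.end(), out.begin(), twice);
    hpx::transform(hpx::execution::par, in.begin(), in.end(), out.begin(), twice);

    return hpx::finalize();
}

int main(int argc, char* argv[])
{
    return hpx::init(argc, argv);
}
```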
kale[m] has joined #ste||ar
<diehlpk_work> HPX 1.5.0-rc2
<diehlpk_work> This one is on Rawhide
<diehlpk_work> on upcoming F33
<hkaiser> gnikunj[m]: yt?
<gnikunj[m]> @hkaiser here
<hkaiser> gnikunj[m]: saw the overleaf project?
<gnikunj[m]> Yes. I'll start working on it right away
<hkaiser> k, thanks
kale[m] has quit [Ping timeout: 246 seconds]
kale[m] has joined #ste||ar