hkaiser changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar-group.org | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | This channel is logged: irclog.cct.lsu.edu
K-ballo has quit [Quit: K-ballo]
hkaiser has quit [Quit: Bye!]
Yorlik has joined #ste||ar
john98zakaria[m] has joined #ste||ar
K-ballo has joined #ste||ar
hkaiser has joined #ste||ar
<john98zakaria[m]> I am trying to set up the SUMMA algorithm, however my communicators are failing and I don't understand what the error message means
<john98zakaria[m]> index is out of range for this base_and_gate: HPX(bad_parameter)
<hkaiser> john98zakaria[m]: hey
<john98zakaria[m]> Hi sorry for bothering so much
<john98zakaria[m]> void summa_matrix_mult(int num_rows,int num_cols){... (full message at https://libera.ems.host/_matrix/media/r0/download/libera.chat/1c20d475829cc528c1b778ca4561b6c224e9af20)
<hkaiser> this error probably means that more sites are participating in a communicator than were specified when it was created
<john98zakaria[m]> * #include <hpx/wrap_main.hpp>... (full message at https://libera.ems.host/_matrix/media/r0/download/libera.chat/6f9901be1da46887c59f86a547d3b8f9bb9c9f2b)
<john98zakaria[m]> hkaiser: 4 sites
<hkaiser> is that the full code?
<john98zakaria[m]> yes
<hkaiser> I'll try to run it later today (once I've had my coffee)
<hkaiser> how do you run your executable, just one locality?
<john98zakaria[m]> hkaiser: Thank you <3
<john98zakaria[m]> I am running on a single node using mpirun -n 4
<hkaiser> ok, so four localities - got it.
<hkaiser> but your code initializes the communicators with 2 sites
<hkaiser> all_gather by default will use the current locality index, so if your communicator expects only two connecting sites this will lead to the error you see
<john98zakaria[m]> Does a site mean a node or a process?
<hkaiser> john98zakaria[m]: you might want to pass the correct this_site argument to all_gather to constrain things to the correct indices
<hkaiser> a site in your case means a process
<hkaiser> by 'a site' we mean a unique endpoint participating in the collective operation
<hkaiser> if you create a communicator with two sites, then the corresponding collective operation shouldn't use anything but zero or one as its site indices
<hkaiser> if you don't specify a this_site argument, then the sequence number of the locality (process) will be used - i.e. the MPI rank
<john98zakaria[m]> My goal is to let iteration 1 do the talking
<john98zakaria[m]> Of course now just with 4 sets
<john98zakaria[m]> > <@john98zakaria:matrix.org> I modified the comms... (full message at https://libera.ems.host/_matrix/media/r0/download/libera.chat/cd9cfaca6637fbc07c4af94b94d97a739bc4ed4c)
<hkaiser> john98zakaria[m]: the indices used for this_site for each of the communicators should be consecutive
<john98zakaria[m]> hkaiser: I'll try again tonight and let you know.
<john98zakaria[m]> I thought this site means my_rank.
<hkaiser> john98zakaria[m]: yes, it's your rank, but relative to the communicator
<hkaiser> we probably should add more index testing to the API to report these problems early on
<john98zakaria[m]> So if rank 2 wants to talk to 3
<john98zakaria[m]> 3-2 = 1?
<hkaiser> something like that
<hkaiser> each communicator has its own set of unique endpoints that should be numbered sequentially from zero to N-1
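The numbering rule discussed above can be sketched in plain C++. This is only an illustration, not HPX code: it assumes the four localities are arranged as a row-major 2x2 grid (the usual SUMMA layout, though the truncated code in the log does not confirm it), and every name in it (grid_position, position_of, row_comm_site, col_comm_site, is_valid_site) is hypothetical. The computed index is the kind of value one would pass as the this_site argument instead of relying on the default global rank.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper type (not part of HPX): position of a process in a
// row-major process grid.
struct grid_position
{
    std::size_t row;
    std::size_t col;
};

// Map a global rank (the locality sequence number, i.e. the MPI rank) to its
// grid position, assuming a row-major grid with `grid_cols` columns.
constexpr grid_position position_of(std::size_t rank, std::size_t grid_cols)
{
    return {rank / grid_cols, rank % grid_cols};
}

// Site index relative to the communicator shared by one grid row:
// the members of a row are numbered 0..grid_cols-1 by their column.
constexpr std::size_t row_comm_site(std::size_t rank, std::size_t grid_cols)
{
    return position_of(rank, grid_cols).col;
}

// Site index relative to the communicator shared by one grid column:
// the members of a column are numbered 0..grid_rows-1 by their row.
constexpr std::size_t col_comm_site(std::size_t rank, std::size_t grid_cols)
{
    return position_of(rank, grid_cols).row;
}

// A site index is only valid if it lies in [0, num_sites).
constexpr bool is_valid_site(std::size_t this_site, std::size_t num_sites)
{
    return this_site < num_sites;
}

// Ranks 2 and 3 share the second row of a 2x2 grid, so relative to that
// row's communicator they are sites 0 and 1.
static_assert(row_comm_site(2, 2) == 0, "rank 2 is site 0 in its row");
static_assert(row_comm_site(3, 2) == 1, "rank 3 is site 1 in its row");

// Using the global rank 3 directly as a site index in a two-site
// communicator is out of range, which matches the situation in the log.
static_assert(!is_valid_site(3, 2), "global rank 3 is not a valid site");
```

With this layout, "rank 2 talks to rank 3" becomes "site 0 talks to site 1" inside the row communicator, which agrees with the 3-2 = 1 arithmetic above. A different process layout would need a different mapping, but the result must still cover 0 to N-1 without gaps.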
K-ballo has quit [Read error: Connection reset by peer]
K-ballo has joined #ste||ar
Yorlik has quit [Ping timeout: 260 seconds]