hkaiser changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar-group.org | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | This channel is logged: irclog.cct.lsu.edu
tufei has quit [Remote host closed the connection]
tufei has joined #ste||ar
Yorlik__ has joined #ste||ar
Yorlik_ has quit [Ping timeout: 264 seconds]
hkaiser has quit [Quit: Bye!]
tufei has quit [Remote host closed the connection]
tufei has joined #ste||ar
tufei has quit [Remote host closed the connection]
tufei has joined #ste||ar
tufei has quit [Remote host closed the connection]
tufei has joined #ste||ar
tufei has quit [Read error: Connection reset by peer]
tufei has joined #ste||ar
K-ballo1 has joined #ste||ar
K-ballo has quit [Ping timeout: 256 seconds]
K-ballo1 is now known as K-ballo
tufei_ has joined #ste||ar
tufei has quit [Ping timeout: 255 seconds]
tufei_ has quit [Remote host closed the connection]
tufei_ has joined #ste||ar
apop has joined #ste||ar
<apop>
Hi, I've been using HPX on Slurm clusters for a while and am now looking at running multiple jobs at the same time (e.g., a SLURM job matrix) on the same cluster, but I can't seem to run more than one at a time. SLURM is able to launch them, but only one (or none in some cases) runs and all the others crash with the following error. I've tried switching to the MPI parcelport and disabling TCP, but it still doesn't work. Is this a known issue? Any fixes? Thanks!
<apop>
the bootstrap parcelport (tcp) has failed to initialize on locality 0:
<apop>
<unknown>: HPX(network_error),
<apop>
bailing out
<apop>
terminate called without an active exception