hkaiser changed the topic of #ste||ar to: STE||AR: Systems Technology, Emergent Parallelism, and Algorithm Research | stellar.cct.lsu.edu | HPX: A cure for performance impaired parallel applications | github.com/STEllAR-GROUP/hpx | Buildbot: http://rostam.cct.lsu.edu/ | Log: http://irclog.cct.lsu.edu/
eschnett has joined #ste||ar
K-ballo has quit [Quit: K-ballo]
hkaiser has quit [Quit: bye]
david_pfander has joined #ste||ar
K-ballo has joined #ste||ar
hkaiser has joined #ste||ar
eschnett has quit [Quit: eschnett]
aserio has joined #ste||ar
hkaiser has quit [Quit: bye]
hkaiser has joined #ste||ar
daissgr has quit [Quit: WeeChat 1.9.1]
<jbjnr__>
heller: yt?
<heller>
jbjnr__: what's up?
<jbjnr__>
I was asking hkaiser last night about the MPI parcelport and if the MPI rank is always the same as the locality_id
<jbjnr__>
I stopped worrying about it, but now I'm concerned again
<jbjnr__>
as the libfabric rank is frequently different from the locality_id
<jbjnr__>
and I don't like it
<jbjnr__>
is there any mechanism to match them up in the simple case that we are not expecting workers to join after bootup
<jbjnr__>
I had a quick look at the BBB code and address naming stuff, but I want to avoid going through it all!
<jbjnr__>
heller: ran away!
aserio1 has joined #ste||ar
aserio has quit [Ping timeout: 268 seconds]
aserio1 is now known as aserio
<heller>
jbjnr__: yes, with the mpi parcelport, we use the rank
<heller>
For everything else, we hand them out on a first-come, first-served basis
<jbjnr__>
can you point me to where agas takes the rank and assigns the locality_id? I didn't see it.
<heller>
You can modify that though
<heller>
If you give me a second or two
<jbjnr__>
thanks. No hurry
<jbjnr__>
it shouldn't actually matter
<jbjnr__>
but it would be nice for consistency to have them the same
<jbjnr__>
I actually want to do it differently - each rank contacts agas using the libfabric connectionless mode and then we generate an address vector in the order they reach agas (effectively random) - the easiest thing for me to do is use the address vector index as the rank (this works nicely now that I use the TABLE type instead of the MAP type for the AV)
<jbjnr__>
so I don't really care what slurm thinks the rank is
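A minimal sketch of the FI_AV_TABLE idea described above, assuming the usual libfabric fabric/domain/endpoint setup has already happened; the function and parameter names are illustrative, not the actual parcelport code:

    // Sketch only: with an FI_AV_TABLE address vector, fi_av_insert() hands
    // out consecutive table indices in insertion order, so the index itself
    // can double as the rank (error handling omitted).
    #include <rdma/fabric.h>
    #include <rdma/fi_domain.h>

    #include <cstddef>
    #include <vector>

    std::vector<fi_addr_t> build_av(fid_domain* domain,
        void const* raw_addrs, std::size_t num_ranks)
    {
        fi_av_attr av_attr = {};
        av_attr.type  = FI_AV_TABLE;   // indices 0..N-1, in insertion order
        av_attr.count = num_ranks;

        fid_av* av = nullptr;
        fi_av_open(domain, &av_attr, &av, nullptr);

        // insert the raw endpoint addresses in the order the ranks reached
        // AGAS; the returned fi_addr_t values are the table indices, i.e.
        // the rank we hand out
        std::vector<fi_addr_t> ranks(num_ranks);
        fi_av_insert(av, raw_addrs, num_ranks, ranks.data(), 0, nullptr);
        return ranks;
    }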
<jbjnr__>
I will set the config from my bootup if I can and see if that is picked up correctly by the rest of the code
<jbjnr__>
thanks a bundle. I missed that config set/get
<jbjnr__>
I suppose I could insert the address using the slurm config index
<jbjnr__>
then it would be consistent all over. I'll try that first
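A small sketch of the config set/get mentioned above, assuming hpx::set_config_entry / hpx::get_config_entry; the key name "hpx.parcel.libfabric.rank" is invented for illustration and the include paths may differ between HPX versions:

    // Sketch only: stash the rank chosen at bootstrap in the runtime
    // configuration and read it back elsewhere. The config key below is
    // made up; only set_config_entry/get_config_entry are assumed real.
    #include <hpx/hpx_init.hpp>
    #include <hpx/include/runtime.hpp>

    #include <iostream>
    #include <string>

    int hpx_main(int argc, char* argv[])
    {
        // e.g. during bootstrap, after the address vector has been built:
        hpx::set_config_entry("hpx.parcel.libfabric.rank", std::to_string(3));

        // ...and picked up again by other parts of the code:
        std::string rank =
            hpx::get_config_entry("hpx.parcel.libfabric.rank", "-1");
        std::cout << "configured rank: " << rank << "\n";

        return hpx::finalize();
    }

    int main(int argc, char* argv[])
    {
        return hpx::init(argc, argv);
    }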
aserio1 has joined #ste||ar
aserio has quit [Ping timeout: 250 seconds]
aserio1 is now known as aserio
aserio has quit [Ping timeout: 250 seconds]
aserio has joined #ste||ar
nikunj has joined #ste||ar
nikunj has quit [Ping timeout: 245 seconds]
mreese3 has joined #ste||ar
<mreese3>
Can HPX serialize std::unordered_maps?
<K-ballo>
yes
<mreese3>
Okay, thanks!
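For reference, a minimal sketch of how that usually looks with HPX's serialization; the struct is made up, and depending on the HPX version the unordered_map support may need its own extra include:

    // Sketch only: a struct with an unordered_map member serialized via the
    // usual HPX serialize() member function.
    #include <hpx/include/serialization.hpp>

    #include <string>
    #include <unordered_map>

    struct lookup_table
    {
        std::unordered_map<std::string, int> data;

        // used by HPX for both saving and loading
        template <typename Archive>
        void serialize(Archive& ar, unsigned int /* version */)
        {
            ar & data;
        }
    };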
hkaiser has quit [Quit: bye]
nikunj has joined #ste||ar
<nikunj>
Hey! Can anyone tell me why running this code with hpx leads to deadlock while running it normally does not? https://pastebin.com/9THdsHTN
<zao>
No idea about your deadlocking, but you should never really reseed your PRNG while running.
<nikunj>
I'll change that
<zao>
nikunj: std::cin.get() blocks, which might be not-cool on an HPX thread.
<nikunj>
perhaps hpx::cin.get() then?
<zao>
Block indefinitely, I should say.
<zao>
I've got no idea :)
<nikunj>
zao: that is meant to block btw, it's just there to let the user end the infinite execution
<zao>
Yes, the problem is that HPX might expect work to either yield via an HPX synchronization primitive or complete.
<nikunj>
aah
<nikunj>
that might be the underlying issue then
<zao>
Loops of indeterminate duration that don't cause the runtime to switch tasks, as well as OS-blocking operations, grind HPX to a halt.
<zao>
Your loops polling atomic variables don't really let the runtime do anything either, unless you happen to do something that gets the runtime to reconsider whether you should be actively running or not.
<zao>
I don't know if HPX has any "yield if you feel like it" functionality.
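For what it's worth, hpx::this_thread::yield() exists; a minimal sketch of using it in the kind of polling loop described above (the include path may differ between HPX versions):

    // Sketch only: a busy-wait on an atomic flag that yields back to the
    // HPX scheduler, so other tasks on the same worker thread can run and
    // eventually set the flag.
    #include <hpx/include/threads.hpp>

    #include <atomic>

    void wait_for(std::atomic<bool>& flag)
    {
        while (!flag.load(std::memory_order_acquire))
        {
            // without this, the loop pins the worker thread and can
            // starve the task that would set the flag
            hpx::this_thread::yield();
        }
    }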
<nikunj>
that implementation of semaphores with an atomic variable was required by the assignment
<nikunj>
but I think I realize why it deadlocked
<nikunj>
thanks for the help!
<zao>
You could probably cheat by having enough OS threads servicing HPX, so that you get at least a 1:1 mapping of HPX tasks to threads.
<zao>
In the real world, you might have different executors or something where you could run things that are expected to be long-running or blocking, or have some other requirement.
<nikunj>
I see
hkaiser has joined #ste||ar
aserio has quit [Quit: aserio]
nikunj has quit [Ping timeout: 256 seconds]
mreese3 has quit [Read error: Connection reset by peer]