Bug #47275
daemon may be missing in mgr service map
Status: Closed
Description
It has been sporadically observed when running rbd-mirror functional tests that some of the running rbd-mirror daemons may be missing from the mgr service map.
Here is an example [1].
There are only two rbd-mirror daemons registered in the service map:
+ ceph --cluster cluster2 -s
cluster2 status
  cluster:
    id:     50c2ce51-a5bd-4cf9-a22e-ee524f36a3ff
    health: HEALTH_OK

  services:
    mon:        1 daemons, quorum a (age 5m)
    mgr:        x(active, since 5m)
    osd:        3 osds: 3 up (since 5m), 3 in (since 5m)
    rbd-mirror: 2 daemons active (4380, 4384)

  data:
    pools:   4 pools, 137 pgs
    objects: 152 objects, 298 MiB
    usage:   62 MiB used, 270 GiB / 270 GiB avail
    pgs:     137 active+clean

  io:
    client: 12 KiB/s rd, 597 B/s wr, 12 op/s rd, 0 op/s wr

+ CEPH_ARGS=
+ ceph --cluster cluster2 service dump
{
    "epoch": 5,
    "modified": "2020-09-01T15:20:39.851107+0000",
    "services": {
        "rbd-mirror": {
            "daemons": {
                "summary": "",
                "4380": {
                    "start_epoch": 5,
                    "start_stamp": "2020-09-01T15:20:39.808598+0000",
                    "gid": 4380,
                    "addr": "172.21.15.146:0/1490269339",
                    "metadata": {
                        "arch": "x86_64",
                        "ceph_release": "pacific",
                        "ceph_version": "ceph version 16.0.0-4970-g3eec59f6962 (3eec59f6962e804c64d36520f4882f41f9ce481b) pacific (dev)",
                        "ceph_version_short": "16.0.0-4970-g3eec59f6962",
                        "cpu": "Intel(R) Xeon(R) CPU E5-1620 v4 @ 3.50GHz",
                        "distro": "centos",
                        "distro_description": "CentOS Linux 8 (Core)",
                        "distro_version": "8",
                        "hostname": "smithi146",
                        "id": "mirror.0",
                        "instance_id": "4380",
                        "kernel_description": "#1 SMP Sun Jul 26 03:54:29 UTC 2020",
                        "kernel_version": "4.18.0-193.14.2.el8_2.x86_64",
                        "mem_swap_kb": "0",
                        "mem_total_kb": "32653172",
                        "os": "Linux"
                    },
                    "task_status": {}
                },
                "4384": {
                    "start_epoch": 5,
                    "start_stamp": "2020-09-01T15:20:39.820979+0000",
                    "gid": 4384,
                    "addr": "172.21.15.146:0/342955495",
                    "metadata": {
                        "arch": "x86_64",
                        "ceph_release": "pacific",
                        "ceph_version": "ceph version 16.0.0-4970-g3eec59f6962 (3eec59f6962e804c64d36520f4882f41f9ce481b) pacific (dev)",
                        "ceph_version_short": "16.0.0-4970-g3eec59f6962",
                        "cpu": "Intel(R) Xeon(R) CPU E5-1620 v4 @ 3.50GHz",
                        "distro": "centos",
                        "distro_description": "CentOS Linux 8 (Core)",
                        "distro_version": "8",
                        "hostname": "smithi146",
                        "id": "mirror.1",
                        "instance_id": "4384",
                        "kernel_description": "#1 SMP Sun Jul 26 03:54:29 UTC 2020",
                        "kernel_version": "4.18.0-193.14.2.el8_2.x86_64",
                        "mem_swap_kb": "0",
                        "mem_total_kb": "32653172",
                        "os": "Linux"
                    },
                    "task_status": {}
                }
            }
        }
    }
}
There should be four, however, and all four are in fact visible in the `ceph service status` output:
+ ceph --cluster cluster2 service status
{
    "rbd-mirror": {
        "4380": {
            "status_stamp": "2020-09-01T15:23:44.815638+0000",
            "last_beacon": "2020-09-01T15:23:44.815638+0000",
            "status": {
                "json": "{\"3\":{\"name\":\"mirror\",\"callouts\":{},\"image_assigned_count\":1,\"image_error_count\":0,\"image_local_count\":3,\"image_remote_count\":3,\"image_warning_count\":0,\"instance_id\":\"4387\",\"leader\":true,\"namespaces\":{\"ns1\":{\"image_assigned_count\":0,\"image_error_count\":0,\"image_local_count\":0,\"image_remote_count\":0,\"image_warning_count\":0},\"ns2\":{\"image_assigned_count\":0,\"image_error_count\":0,\"image_local_count\":0,\"image_remote_count\":0,\"image_warning_count\":0}}},\"4\":{\"name\":\"mirror_parent\",\"callouts\":{},\"image_assigned_count\":0,\"image_error_count\":0,\"image_local_count\":0,\"image_remote_count\":0,\"image_warning_count\":0,\"instance_id\":\"4406\",\"leader\":true}}"
            }
        },
        "4384": {
            "status_stamp": "2020-09-01T15:23:44.828198+0000",
            "last_beacon": "2020-09-01T15:23:44.828198+0000",
            "status": {
                "json": "{\"3\":{\"name\":\"mirror\",\"callouts\":{},\"image_assigned_count\":1,\"image_error_count\":0,\"image_warning_count\":0,\"instance_id\":\"4394\",\"namespaces\":{\"ns1\":{\"image_assigned_count\":0,\"image_error_count\":0,\"image_warning_count\":0},\"ns2\":{\"image_assigned_count\":0,\"image_error_count\":0,\"image_warning_count\":0}}},\"4\":{\"name\":\"mirror_parent\",\"callouts\":{},\"image_assigned_count\":0,\"image_error_count\":0,\"image_warning_count\":0,\"instance_id\":\"4410\"}}"
            }
        },
        "4393": {
            "status_stamp": "2020-09-01T15:23:44.863397+0000",
            "last_beacon": "2020-09-01T15:23:44.863397+0000",
            "status": {
                "json": "{\"3\":{\"name\":\"mirror\",\"callouts\":{},\"image_assigned_count\":1,\"image_error_count\":0,\"image_warning_count\":0,\"instance_id\":\"4400\",\"namespaces\":{\"ns1\":{\"image_assigned_count\":0,\"image_error_count\":0,\"image_warning_count\":0},\"ns2\":{\"image_assigned_count\":0,\"image_error_count\":0,\"image_warning_count\":0}}},\"4\":{\"name\":\"mirror_parent\",\"callouts\":{},\"image_assigned_count\":0,\"image_error_count\":0,\"image_warning_count\":0,\"instance_id\":\"4413\"}}"
            }
        },
        "4401": {
            "status_stamp": "2020-09-01T15:23:44.886579+0000",
            "last_beacon": "2020-09-01T15:23:44.886579+0000",
            "status": {
                "json": "{\"3\":{\"name\":\"mirror\",\"callouts\":{},\"image_assigned_count\":0,\"image_error_count\":0,\"image_warning_count\":0,\"instance_id\":\"4411\",\"namespaces\":{\"ns1\":{\"image_assigned_count\":0,\"image_error_count\":0,\"image_warning_count\":0},\"ns2\":{\"image_assigned_count\":0,\"image_error_count\":0,\"image_warning_count\":0}}},\"4\":{\"name\":\"mirror_parent\",\"callouts\":{},\"image_assigned_count\":0,\"image_error_count\":0,\"image_warning_count\":0,\"instance_id\":\"4416\"}}"
            }
        }
    }
}
Updated by Mykola Golub over 3 years ago
In the mgr log I see a race that might be responsible for this:
Connections arrive from rbd-mirror.4380 and rbd-mirror.4384, and both daemons are registered in the pending service map:
2020-09-01T15:20:39.807+0000 7fcf8fa14700 10 mgr.server handle_open from rbd-mirror.4380 172.21.15.146:0/1490269339
2020-09-01T15:20:39.807+0000 7fcf8fa14700 4 mgr.server handle_open constructing new DaemonState for rbd-mirror.4380
2020-09-01T15:20:39.807+0000 7fcf8fa14700 10 mgr.server handle_open registering rbd-mirror.4380 in pending_service_map
...
2020-09-01T15:20:39.820+0000 7fcf8fa14700 10 mgr.server handle_open from rbd-mirror.4384 172.21.15.146:0/342955495
2020-09-01T15:20:39.820+0000 7fcf8fa14700 4 mgr.server handle_open constructing new DaemonState for rbd-mirror.4384
2020-09-01T15:20:39.820+0000 7fcf8fa14700 10 mgr.server handle_open registering rbd-mirror.4384 in pending_service_map
Then the mgr (DaemonServer::send_report) sends the pending map (e5), increasing the pending map epoch to 6:
2020-09-01T15:20:39.850+0000 7fcf98a1a700 10 mgr.server operator() sending service_map e5
The mgr is then expected to receive this new map back from the mon and update its current map, but before that happens, the following events occur:
A connection from rbd-mirror.4393 arrives and the daemon is registered in the pending map:
2020-09-01T15:20:39.856+0000 7fcf8fa14700 10 mgr.server handle_open from rbd-mirror.4393 172.21.15.146:0/653430756
2020-09-01T15:20:39.856+0000 7fcf8fa14700 4 mgr.server handle_open constructing new DaemonState for rbd-mirror.4393
2020-09-01T15:20:39.856+0000 7fcf8fa14700 10 mgr.server handle_open registering rbd-mirror.4393 in pending_service_map
A service map is received from the mon and processed, but it is still the older (e4) version:
2020-09-01T15:20:39.862+0000 7fcfad273700 1 -- 172.21.15.197:0/24547 <== mon.0 v2:172.21.15.146:3300/0 363 ==== service_map(e4 0 svc) v1 ==== 22+0+0 (secure 0 0 0) 0x560cc306e180 con 0x560cc332b400
2020-09-01T15:20:39.862+0000 7fcfad273700 10 mgr ms_dispatch2 active service_map(e4 0 svc) v1
2020-09-01T15:20:39.862+0000 7fcfad273700 10 mgr ms_dispatch2 service_map(e4 0 svc) v1
2020-09-01T15:20:39.862+0000 7fcfad273700 10 mgr handle_service_map e4
2020-09-01T15:20:39.862+0000 7fcfad273700 10 mgr.server operator() got updated map e4
A connection from rbd-mirror.4401 arrives and the daemon is registered in the pending map:
2020-09-01T15:20:39.880+0000 7fcf8fa14700 10 mgr.server handle_open from rbd-mirror.4401 172.21.15.146:0/1795443783
2020-09-01T15:20:39.880+0000 7fcf8fa14700 4 mgr.server handle_open constructing new DaemonState for rbd-mirror.4401
2020-09-01T15:20:39.880+0000 7fcf8fa14700 10 mgr.server handle_open registering rbd-mirror.4401 in pending_service_map
The new (e5) service map is received from the mon and processed:
2020-09-01T15:20:39.903+0000 7fcfad273700 1 -- 172.21.15.197:0/24547 <== mon.0 v2:172.21.15.146:3300/0 364 ==== service_map(e5 1 svc) v1 ==== 1386+0+0 (secure 0 0 0) 0x560cc306ea80 con 0x560cc332b400
2020-09-01T15:20:39.903+0000 7fcfad273700 10 mgr ms_dispatch2 active service_map(e5 1 svc) v1
2020-09-01T15:20:39.903+0000 7fcfad273700 10 mgr ms_dispatch2 service_map(e5 1 svc) v1
2020-09-01T15:20:39.903+0000 7fcfad273700 10 mgr handle_service_map e5
2020-09-01T15:20:39.903+0000 7fcfad273700 10 mgr.server operator() got updated map e5
Note that e5 contains only the first two rbd-mirror daemons. We would expect e6 with the remaining daemons to be sent next, but e6 is only sent much later, after the test has already failed and new changes have been registered in the service map:
2020-09-01T15:23:57.531+0000 7fcf8fa14700 4 mgr.server handle_close from 0x560cc3508c00 rbd-mirror.4380
...
2020-09-01T15:23:57.901+0000 7fcf98a1a700 10 mgr.server operator() sending service_map e6
I suppose the problem was caused by the older (e4) service map arriving from the mon after the e5 map had already been advertised by the mgr. In DaemonServer::send_report, after sending the pending service map, the mgr increments its epoch (so it becomes 6). But when the old (e4) map was unexpectedly received, DaemonServer::got_service_map set the pending map epoch to service_map.epoch + 1, i.e. decreased it back to 5. Then, when rbd-mirror.4401 was registered in the pending service map, pending_service_map_dirty was set to 5. When the new (e5) map was subsequently received from the mon, pending_service_map.epoch became 5 + 1 = 6. At that point the condition (pending_service_map_dirty >= pending_service_map.epoch) was false, so DaemonServer::send_report never sent the updated map containing the two new rbd-mirror daemons.
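The epoch bookkeeping described above can be sketched with a small simulation. This is a hypothetical simplified model, not the actual Ceph code: the class and method names only mirror the DaemonServer fields mentioned in the analysis.

```python
class DaemonServerModel:
    """Toy model of DaemonServer's pending service map epoch accounting."""

    def __init__(self, epoch):
        self.pending_epoch = epoch   # models pending_service_map.epoch
        self.dirty_epoch = 0         # models pending_service_map_dirty

    def register_daemon(self, name):
        # handle_open: mark the pending map dirty at its current epoch
        self.dirty_epoch = self.pending_epoch

    def send_report(self):
        # publish the pending map, then bump the pending epoch
        sent = self.pending_epoch
        self.pending_epoch += 1
        return sent

    def got_service_map(self, epoch):
        # got_service_map: unconditionally reset the pending epoch,
        # even if the received map is older than what was already sent
        self.pending_epoch = epoch + 1

    def needs_send(self):
        # send_report only publishes an update when this holds
        return self.dirty_epoch >= self.pending_epoch


s = DaemonServerModel(epoch=5)
s.register_daemon("rbd-mirror.4380")
s.register_daemon("rbd-mirror.4384")
s.send_report()                       # sends e5; pending_epoch -> 6
s.register_daemon("rbd-mirror.4393")  # dirty_epoch = 6
s.got_service_map(4)                  # stale e4 arrives: pending_epoch -> 5
s.register_daemon("rbd-mirror.4401")  # dirty_epoch = 5
s.got_service_map(5)                  # e5 arrives: pending_epoch -> 6
print(s.needs_send())                 # False: 4393 and 4401 are never published
```

Without the stale e4 delivery, `dirty_epoch` would stay at 6 and `needs_send()` would remain true, so the mgr would publish e6 with the two late daemons.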
Updated by Mykola Golub over 3 years ago
- Status changed from In Progress to Fix Under Review
- Pull request ID set to 36964
Updated by Jason Dillaman over 3 years ago
- Status changed from Fix Under Review to Pending Backport
Updated by Nathan Cutler over 3 years ago
- Copied to Backport #47410: octopus: daemon may be missing in mgr service map added
Updated by Nathan Cutler over 3 years ago
- Copied to Backport #47411: nautilus: daemon may be missing in mgr service map added
Updated by Nathan Cutler over 3 years ago
- Status changed from Pending Backport to Resolved
While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".