Bug #47275

daemon may be missing in mgr service map

Added by Mykola Golub over 3 years ago. Updated over 3 years ago.

Status:
Resolved
Priority:
Normal
% Done:
0%
Backport:
octopus, nautilus
Regression:
No
Severity:
3 - minor

Description

It has been sporadically observed when running the rbd-mirror functional tests that some of the running rbd-mirror daemons are missing from the mgr service map.

Here is an example [1].

There are only two rbd-mirror daemons registered in the service map:

2020-09-01T15:23:45.481 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stderr:+ ceph --cluster cluster2 -s
2020-09-01T15:23:45.481 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:
2020-09-01T15:23:45.482 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:cluster2 status
2020-09-01T15:23:45.948 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:  cluster:
2020-09-01T15:23:45.949 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:    id:     50c2ce51-a5bd-4cf9-a22e-ee524f36a3ff
2020-09-01T15:23:45.949 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:    health: HEALTH_OK
2020-09-01T15:23:45.949 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:
2020-09-01T15:23:45.950 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:  services:
2020-09-01T15:23:45.950 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:    mon:        1 daemons, quorum a (age 5m)
2020-09-01T15:23:45.950 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:    mgr:        x(active, since 5m)
2020-09-01T15:23:45.950 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:    osd:        3 osds: 3 up (since 5m), 3 in (since 5m)
2020-09-01T15:23:45.951 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:    rbd-mirror: 2 daemons active (4380, 4384)
2020-09-01T15:23:45.951 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:
2020-09-01T15:23:45.951 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:  data:
2020-09-01T15:23:45.951 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:    pools:   4 pools, 137 pgs
2020-09-01T15:23:45.952 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:    objects: 152 objects, 298 MiB
2020-09-01T15:23:45.952 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:    usage:   62 MiB used, 270 GiB / 270 GiB avail
2020-09-01T15:23:45.952 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:    pgs:     137 active+clean
2020-09-01T15:23:45.953 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:
2020-09-01T15:23:45.953 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:  io:
2020-09-01T15:23:45.953 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:    client:   12 KiB/s rd, 597 B/s wr, 12 op/s rd, 0 op/s wr
2020-09-01T15:23:45.953 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:
2020-09-01T15:23:45.958 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stderr:+ CEPH_ARGS=
2020-09-01T15:23:45.959 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stderr:+ ceph --cluster cluster2 service dump
2020-09-01T15:23:46.316 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:{
2020-09-01T15:23:46.316 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:    "epoch": 5,
2020-09-01T15:23:46.316 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:    "modified": "2020-09-01T15:20:39.851107+0000",
2020-09-01T15:23:46.317 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:    "services": {
2020-09-01T15:23:46.317 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:        "rbd-mirror": {
2020-09-01T15:23:46.317 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:            "daemons": {
2020-09-01T15:23:46.317 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:                "summary": "",
2020-09-01T15:23:46.318 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:                "4380": {
2020-09-01T15:23:46.318 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:                    "start_epoch": 5,
2020-09-01T15:23:46.318 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:                    "start_stamp": "2020-09-01T15:20:39.808598+0000",
2020-09-01T15:23:46.318 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:                    "gid": 4380,
2020-09-01T15:23:46.318 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:                    "addr": "172.21.15.146:0/1490269339",
2020-09-01T15:23:46.319 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:                    "metadata": {
2020-09-01T15:23:46.319 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:                        "arch": "x86_64",
2020-09-01T15:23:46.319 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:                        "ceph_release": "pacific",
2020-09-01T15:23:46.319 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:                        "ceph_version": "ceph version 16.0.0-4970-g3eec59f6962 (3eec59f6962e804c64d36520f4882f41f9ce481b) pacific (dev)",
2020-09-01T15:23:46.319 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:                        "ceph_version_short": "16.0.0-4970-g3eec59f6962",
2020-09-01T15:23:46.320 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:                        "cpu": "Intel(R) Xeon(R) CPU E5-1620 v4 @ 3.50GHz",
2020-09-01T15:23:46.320 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:                        "distro": "centos",
2020-09-01T15:23:46.320 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:                        "distro_description": "CentOS Linux 8 (Core)",
2020-09-01T15:23:46.320 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:                        "distro_version": "8",
2020-09-01T15:23:46.320 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:                        "hostname": "smithi146",
2020-09-01T15:23:46.321 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:                        "id": "mirror.0",
2020-09-01T15:23:46.321 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:                        "instance_id": "4380",
2020-09-01T15:23:46.321 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:                        "kernel_description": "#1 SMP Sun Jul 26 03:54:29 UTC 2020",
2020-09-01T15:23:46.321 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:                        "kernel_version": "4.18.0-193.14.2.el8_2.x86_64",
2020-09-01T15:23:46.321 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:                        "mem_swap_kb": "0",
2020-09-01T15:23:46.322 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:                        "mem_total_kb": "32653172",
2020-09-01T15:23:46.322 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:                        "os": "Linux" 
2020-09-01T15:23:46.322 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:                    },
2020-09-01T15:23:46.322 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:                    "task_status": {}
2020-09-01T15:23:46.322 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:                },
2020-09-01T15:23:46.323 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:                "4384": {
2020-09-01T15:23:46.323 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:                    "start_epoch": 5,
2020-09-01T15:23:46.323 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:                    "start_stamp": "2020-09-01T15:20:39.820979+0000",
2020-09-01T15:23:46.323 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:                    "gid": 4384,
2020-09-01T15:23:46.323 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:                    "addr": "172.21.15.146:0/342955495",
2020-09-01T15:23:46.324 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:                    "metadata": {
2020-09-01T15:23:46.324 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:                        "arch": "x86_64",
2020-09-01T15:23:46.324 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:                        "ceph_release": "pacific",
2020-09-01T15:23:46.324 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:                        "ceph_version": "ceph version 16.0.0-4970-g3eec59f6962 (3eec59f6962e804c64d36520f4882f41f9ce481b) pacific (dev)",
2020-09-01T15:23:46.324 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:                        "ceph_version_short": "16.0.0-4970-g3eec59f6962",
2020-09-01T15:23:46.325 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:                        "cpu": "Intel(R) Xeon(R) CPU E5-1620 v4 @ 3.50GHz",
2020-09-01T15:23:46.325 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:                        "distro": "centos",
2020-09-01T15:23:46.325 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:                        "distro_description": "CentOS Linux 8 (Core)",
2020-09-01T15:23:46.325 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:                        "distro_version": "8",
2020-09-01T15:23:46.325 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:                        "hostname": "smithi146",
2020-09-01T15:23:46.326 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:                        "id": "mirror.1",
2020-09-01T15:23:46.326 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:                        "instance_id": "4384",
2020-09-01T15:23:46.326 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:                        "kernel_description": "#1 SMP Sun Jul 26 03:54:29 UTC 2020",
2020-09-01T15:23:46.326 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:                        "kernel_version": "4.18.0-193.14.2.el8_2.x86_64",
2020-09-01T15:23:46.326 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:                        "mem_swap_kb": "0",
2020-09-01T15:23:46.327 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:                        "mem_total_kb": "32653172",
2020-09-01T15:23:46.327 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:                        "os": "Linux" 
2020-09-01T15:23:46.327 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:                    },
2020-09-01T15:23:46.327 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:                    "task_status": {}
2020-09-01T15:23:46.327 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:                }
2020-09-01T15:23:46.328 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:            }
2020-09-01T15:23:46.328 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:        }
2020-09-01T15:23:46.328 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:    }
2020-09-01T15:23:46.328 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:}

There should be four, and all four are in fact seen in the `ceph service status` output:
2020-09-01T15:23:46.329 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stderr:+ ceph --cluster cluster2 service status
2020-09-01T15:23:46.681 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:{
2020-09-01T15:23:46.681 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:    "rbd-mirror": {
2020-09-01T15:23:46.681 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:        "4380": {
2020-09-01T15:23:46.682 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:            "status_stamp": "2020-09-01T15:23:44.815638+0000",
2020-09-01T15:23:46.682 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:            "last_beacon": "2020-09-01T15:23:44.815638+0000",
2020-09-01T15:23:46.682 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:            "status": {
2020-09-01T15:23:46.682 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:                "json": "{\"3\":{\"name\":\"mirror\",\"callouts\":{},\"image_assigned_count\":1,\"image_error_count\":0,\"image_local_count\":3,\"image_remote_count\":3,\"image_warning_count\":0,\"instance_id\":\"4387\",\"leader\":true,\"namespaces\":{\"ns1\":{\"image_assigned_count\":0,\"image_error_count\":0,\"image_local_count\":0,\"image_remote_count\":0,\"image_warning_count\":0},\"ns2\":{\"image_assigned_count\":0,\"image_error_count\":0,\"image_local_count\":0,\"image_remote_count\":0,\"image_warning_count\":0}}},\"4\":{\"name\":\"mirror_parent\",\"callouts\":{},\"image_assigned_count\":0,\"image_error_count\":0,\"image_local_count\":0,\"image_remote_count\":0,\"image_warning_count\":0,\"instance_id\":\"4406\",\"leader\":true}}" 
2020-09-01T15:23:46.682 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:            }
2020-09-01T15:23:46.683 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:        },
2020-09-01T15:23:46.683 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:        "4384": {
2020-09-01T15:23:46.683 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:            "status_stamp": "2020-09-01T15:23:44.828198+0000",
2020-09-01T15:23:46.683 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:            "last_beacon": "2020-09-01T15:23:44.828198+0000",
2020-09-01T15:23:46.683 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:            "status": {
2020-09-01T15:23:46.684 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:                "json": "{\"3\":{\"name\":\"mirror\",\"callouts\":{},\"image_assigned_count\":1,\"image_error_count\":0,\"image_warning_count\":0,\"instance_id\":\"4394\",\"namespaces\":{\"ns1\":{\"image_assigned_count\":0,\"image_error_count\":0,\"image_warning_count\":0},\"ns2\":{\"image_assigned_count\":0,\"image_error_count\":0,\"image_warning_count\":0}}},\"4\":{\"name\":\"mirror_parent\",\"callouts\":{},\"image_assigned_count\":0,\"image_error_count\":0,\"image_warning_count\":0,\"instance_id\":\"4410\"}}" 
2020-09-01T15:23:46.684 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:            }
2020-09-01T15:23:46.684 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:        },
2020-09-01T15:23:46.684 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:        "4393": {
2020-09-01T15:23:46.684 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:            "status_stamp": "2020-09-01T15:23:44.863397+0000",
2020-09-01T15:23:46.685 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:            "last_beacon": "2020-09-01T15:23:44.863397+0000",
2020-09-01T15:23:46.685 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:            "status": {
2020-09-01T15:23:46.685 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:                "json": "{\"3\":{\"name\":\"mirror\",\"callouts\":{},\"image_assigned_count\":1,\"image_error_count\":0,\"image_warning_count\":0,\"instance_id\":\"4400\",\"namespaces\":{\"ns1\":{\"image_assigned_count\":0,\"image_error_count\":0,\"image_warning_count\":0},\"ns2\":{\"image_assigned_count\":0,\"image_error_count\":0,\"image_warning_count\":0}}},\"4\":{\"name\":\"mirror_parent\",\"callouts\":{},\"image_assigned_count\":0,\"image_error_count\":0,\"image_warning_count\":0,\"instance_id\":\"4413\"}}" 
2020-09-01T15:23:46.685 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:            }
2020-09-01T15:23:46.685 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:        },
2020-09-01T15:23:46.686 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:        "4401": {
2020-09-01T15:23:46.687 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:            "status_stamp": "2020-09-01T15:23:44.886579+0000",
2020-09-01T15:23:46.687 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:            "last_beacon": "2020-09-01T15:23:44.886579+0000",
2020-09-01T15:23:46.687 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:            "status": {
2020-09-01T15:23:46.687 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:                "json": "{\"3\":{\"name\":\"mirror\",\"callouts\":{},\"image_assigned_count\":0,\"image_error_count\":0,\"image_warning_count\":0,\"instance_id\":\"4411\",\"namespaces\":{\"ns1\":{\"image_assigned_count\":0,\"image_error_count\":0,\"image_warning_count\":0},\"ns2\":{\"image_assigned_count\":0,\"image_error_count\":0,\"image_warning_count\":0}}},\"4\":{\"name\":\"mirror_parent\",\"callouts\":{},\"image_assigned_count\":0,\"image_error_count\":0,\"image_warning_count\":0,\"instance_id\":\"4416\"}}" 
2020-09-01T15:23:46.687 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:            }
2020-09-01T15:23:46.687 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:        }
2020-09-01T15:23:46.688 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:    }
2020-09-01T15:23:46.688 INFO:tasks.workunit.cluster1.client.mirror.smithi146.stdout:}
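The discrepancy can be spotted by diffing the daemon GIDs in the two outputs. A minimal sketch (the helper name is hypothetical; it assumes the JSON shapes shown in the logs above, with abbreviated sample data):

```python
import json

def missing_from_service_map(dump_json: str, status_json: str,
                             service: str = "rbd-mirror"):
    """Return GIDs that report beacons (`ceph service status`) but are
    absent from the published service map (`ceph service dump`)."""
    dump = json.loads(dump_json)
    status = json.loads(status_json)
    # the "daemons" dict mixes a "summary" key in with the GIDs
    in_map = {gid for gid in dump["services"].get(service, {}).get("daemons", {})
              if gid != "summary"}
    reporting = set(status.get(service, {}))
    return sorted(reporting - in_map)

# Abbreviated versions of the two outputs above:
dump = '{"services": {"rbd-mirror": {"daemons": {"summary": "", "4380": {}, "4384": {}}}}}'
status = '{"rbd-mirror": {"4380": {}, "4384": {}, "4393": {}, "4401": {}}}'
print(missing_from_service_map(dump, status))  # ['4393', '4401']
```

Applied to the outputs above, this reports GIDs 4393 and 4401 as reporting beacons but missing from the service map.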

[1] http://qa-proxy.ceph.com/teuthology/trociny-2020-09-01_14:55:56-rbd-wip-mgolub-testing-distro-basic-smithi/5397684/teuthology.log


Related issues (2: 0 open, 2 closed)

Copied to mgr - Backport #47410: octopus: daemon may be missing in mgr service map (Resolved, Laura Paduano)
Copied to mgr - Backport #47411: nautilus: daemon may be missing in mgr service map (Resolved, Laura Paduano)
#1

Updated by Mykola Golub over 3 years ago

In the mgr log I see a race that might be responsible for this:

Connections from rbd-mirror.4380 and rbd-mirror.4384 come in and are registered in the pending service map:

2020-09-01T15:20:39.807+0000 7fcf8fa14700 10 mgr.server handle_open from rbd-mirror.4380 172.21.15.146:0/1490269339
2020-09-01T15:20:39.807+0000 7fcf8fa14700  4 mgr.server handle_open constructing new DaemonState for rbd-mirror.4380
2020-09-01T15:20:39.807+0000 7fcf8fa14700 10 mgr.server handle_open registering rbd-mirror.4380 in pending_service_map
...
2020-09-01T15:20:39.820+0000 7fcf8fa14700 10 mgr.server handle_open from rbd-mirror.4384 172.21.15.146:0/342955495
2020-09-01T15:20:39.820+0000 7fcf8fa14700  4 mgr.server handle_open constructing new DaemonState for rbd-mirror.4384
2020-09-01T15:20:39.820+0000 7fcf8fa14700 10 mgr.server handle_open registering rbd-mirror.4384 in pending_service_map

Then the mgr (DaemonServer::send_report) sends the pending map (e5), increasing the pending map epoch to 6:
2020-09-01T15:20:39.850+0000 7fcf98a1a700 10 mgr.server operator() sending service_map e5

The mgr is expected to receive this new map from the mon and update its current map, but before it arrives the following events happen:

A connection from rbd-mirror.4393 arrives and is registered in the pending map:

2020-09-01T15:20:39.856+0000 7fcf8fa14700 10 mgr.server handle_open from rbd-mirror.4393 172.21.15.146:0/653430756
2020-09-01T15:20:39.856+0000 7fcf8fa14700  4 mgr.server handle_open constructing new DaemonState for rbd-mirror.4393
2020-09-01T15:20:39.856+0000 7fcf8fa14700 10 mgr.server handle_open registering rbd-mirror.4393 in pending_service_map

The service map is received from the mon and processed, but it is still the older (e4) version:

2020-09-01T15:20:39.862+0000 7fcfad273700  1 -- 172.21.15.197:0/24547 <== mon.0 v2:172.21.15.146:3300/0 363 ==== service_map(e4 0 svc) v1 ==== 22+0+0 (secure 0 0 0) 0x560cc306e180 con 0x560cc332b400
2020-09-01T15:20:39.862+0000 7fcfad273700 10 mgr ms_dispatch2 active service_map(e4 0 svc) v1
2020-09-01T15:20:39.862+0000 7fcfad273700 10 mgr ms_dispatch2 service_map(e4 0 svc) v1
2020-09-01T15:20:39.862+0000 7fcfad273700 10 mgr handle_service_map e4
2020-09-01T15:20:39.862+0000 7fcfad273700 10 mgr.server operator() got updated map e4

A connection from rbd-mirror.4401 arrives and is registered in the pending map:

2020-09-01T15:20:39.880+0000 7fcf8fa14700 10 mgr.server handle_open from rbd-mirror.4401 172.21.15.146:0/1795443783
2020-09-01T15:20:39.880+0000 7fcf8fa14700  4 mgr.server handle_open constructing new DaemonState for rbd-mirror.4401
2020-09-01T15:20:39.880+0000 7fcf8fa14700 10 mgr.server handle_open registering rbd-mirror.4401 in pending_service_map

The new (e5) service map is received from the mon and processed:

2020-09-01T15:20:39.903+0000 7fcfad273700  1 -- 172.21.15.197:0/24547 <== mon.0 v2:172.21.15.146:3300/0 364 ==== service_map(e5 1 svc) v1 ==== 1386+0+0 (secure 0 0 0) 0x560cc306ea80 con 0x560cc332b400
2020-09-01T15:20:39.903+0000 7fcfad273700 10 mgr ms_dispatch2 active service_map(e5 1 svc) v1
2020-09-01T15:20:39.903+0000 7fcfad273700 10 mgr ms_dispatch2 service_map(e5 1 svc) v1
2020-09-01T15:20:39.903+0000 7fcfad273700 10 mgr handle_service_map e5
2020-09-01T15:20:39.903+0000 7fcfad273700 10 mgr.server operator() got updated map e5

Note that e5 contains only the first two rbd-mirror daemons. We would expect e6 with the remaining daemons to be sent next, but e6 is only sent much later, after the test has already failed and new changes are registered in the service map:
2020-09-01T15:23:57.531+0000 7fcf8fa14700  4 mgr.server handle_close from 0x560cc3508c00  rbd-mirror.4380
...
2020-09-01T15:23:57.901+0000 7fcf98a1a700 10 mgr.server operator() sending service_map e6

I suppose the problem was caused by the older (e4) service map arriving from the mon after the e5 map had already been advertised by the mgr. In DaemonServer::send_report, after sending the pending service map, the mgr increments the pending epoch (so it becomes 6). But when the old (e4) map was unexpectedly received, DaemonServer::got_service_map set the pending map epoch to service_map.epoch + 1, i.e. decreased it back to 5. Then, when rbd-mirror.4401 was registered in the pending service map, pending_service_map_dirty was set to 5. When the new (e5) map was subsequently received from the mon, pending_service_map.epoch became 5 + 1 = 6. After this, the condition (pending_service_map_dirty >= pending_service_map.epoch) was false, so DaemonServer::send_report never sent the updated map containing the two new rbd-mirror daemons.
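The lost-update sequence above can be replayed with a toy model of the epoch bookkeeping (a deliberate simplification; the real logic lives in the C++ DaemonServer, and the field names here only mirror the ones mentioned in the analysis):

```python
class MgrSim:
    """Toy model of the mgr's pending-service-map epoch bookkeeping."""

    def __init__(self, start_epoch):
        self.pending_epoch = start_epoch  # pending_service_map.epoch
        self.dirty = 0                    # pending_service_map_dirty
        self.pending_daemons = set()
        self.published = {}               # epoch -> daemons sent to the mon

    def handle_open(self, gid):
        # registering a daemon marks the pending map dirty at the current epoch
        self.pending_daemons.add(gid)
        self.dirty = self.pending_epoch

    def send_report(self):
        # only publish if there are changes staged at (or after) this epoch
        if self.dirty >= self.pending_epoch:
            self.published[self.pending_epoch] = frozenset(self.pending_daemons)
            self.pending_epoch += 1

    def got_service_map(self, epoch):
        # problematic behavior: unconditionally reset to received epoch + 1,
        # even when the received map is older than what was already advertised
        self.pending_epoch = epoch + 1


m = MgrSim(start_epoch=5)
m.handle_open(4380); m.handle_open(4384)  # dirty = 5
m.send_report()                           # publishes e5 {4380, 4384}; pending_epoch = 6
m.handle_open(4393)                       # dirty = 6
m.got_service_map(4)                      # stale e4 arrives: pending_epoch drops to 5
m.handle_open(4401)                       # dirty = 5
m.got_service_map(5)                      # e5 arrives: pending_epoch = 6
m.send_report()                           # dirty (5) < pending_epoch (6): nothing sent
# 4393 and 4401 are never published, matching the observed service map.
```

Running this replay leaves `m.published` containing only e5 with GIDs 4380 and 4384, reproducing the lost update described above.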

#2

Updated by Mykola Golub over 3 years ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 36964
#3

Updated by Jason Dillaman over 3 years ago

  • Status changed from Fix Under Review to Pending Backport
#4

Updated by Nathan Cutler over 3 years ago

  • Copied to Backport #47410: octopus: daemon may be missing in mgr service map added
#5

Updated by Nathan Cutler over 3 years ago

  • Copied to Backport #47411: nautilus: daemon may be missing in mgr service map added
#6

Updated by Nathan Cutler over 3 years ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
