
Bug #52624

qa: "Health check failed: Reduced data availability: 1 pg peering (PG_AVAILABILITY)"

Added by Patrick Donnelly over 2 years ago. Updated 24 days ago.

Status:
New
Priority:
Normal
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
reef, quincy
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
fs
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

/ceph/teuthology-archive/pdonnell-2021-09-14_01:17:08-fs-wip-pdonnell-testing-20210910.181451-distro-basic-smithi/6388614/teuthology.log

The test wasn't really doing anything to the cluster at the time; it may have stumbled on a bug of some kind.


Related issues

Duplicated by CephFS - Bug #52607: qa: "mon.a (mon.0) 1022 : cluster [WRN] Health check failed: Reduced data availability: 4 pgs peering (PG_AVAILABILITY)" Duplicate

History

#1 Updated by Neha Ojha over 2 years ago

2021-09-14T02:04:30.392+0000 7f75fd230700  1 -- 172.21.15.134:0/17688 <== mon.0 v2:172.21.15.134:3300/0 162 ==== mon_command_ack([{"prefix": "osd pg-upmap-items", "format": "json", "pgid": "7.2e", "id": [0, 5]}]=0 set 7.2e pg_upmap_items mapping to [0->5] v44) v1 ==== 155+0+0 (secure 0 0 0) 0x55d65a8c29c0 con 0x55d655fbd400
...
2021-09-14T02:04:32.312+0000 7f75eb753700 20 mgr.server operator() health checks:
{
    "PG_AVAILABILITY": {
        "severity": "HEALTH_WARN",
        "summary": {
            "message": "Reduced data availability: 1 pg peering",
            "count": 1
        },
        "detail": [
            {
                "message": "pg 7.2e is stuck peering for 62s, current state peering, last acting [4,5]" 
            }
        ]
    }
}

Looks like peering induced by mapping change by the balancer. How often does this happen?
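The health-check payload above has a regular shape, so the stuck PGs can be pulled out programmatically. A minimal sketch (the function name and the assumption that you have the mgr "health checks" JSON as a string are mine, not from the tracker):

```python
import json
import re


def stuck_peering_pgs(health_checks_json):
    """Extract PG ids from a PG_AVAILABILITY health-check payload.

    Expects a dict shaped like the mgr "health checks" output above.
    """
    checks = json.loads(health_checks_json)
    pgs = []
    for entry in checks.get("PG_AVAILABILITY", {}).get("detail", []):
        # Detail messages look like:
        # "pg 7.2e is stuck peering for 62s, current state peering, ..."
        m = re.match(r"pg (\S+) is stuck peering", entry["message"])
        if m:
            pgs.append(m.group(1))
    return pgs


payload = '''{
    "PG_AVAILABILITY": {
        "severity": "HEALTH_WARN",
        "summary": {"message": "Reduced data availability: 1 pg peering", "count": 1},
        "detail": [
            {"message": "pg 7.2e is stuck peering for 62s, current state peering, last acting [4,5]"}
        ]
    }
}'''

print(stuck_peering_pgs(payload))  # ['7.2e']
```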

#2 Updated by Patrick Donnelly over 2 years ago

Neha Ojha wrote:

[...]

Looks like peering induced by mapping change by the balancer. How often does this happen?

Pretty rare, once in ~250 jobs so far.

#3 Updated by Patrick Donnelly over 2 years ago

  • Duplicated by Bug #52607: qa: "mon.a (mon.0) 1022 : cluster [WRN] Health check failed: Reduced data availability: 4 pgs peering (PG_AVAILABILITY)" added

#4 Updated by Patrick Donnelly over 2 years ago

Patrick Donnelly wrote:

Neha Ojha wrote:

[...]

Looks like peering induced by mapping change by the balancer. How often does this happen?

Pretty rare, once in ~250 jobs so far.

3/262 jobs: https://pulpito.ceph.com/pdonnell-2021-09-17_20:49:46-fs-wip-pdonnell-testing-20210917.174826-distro-basic-smithi/

#5 Updated by Neha Ojha over 2 years ago

Patrick Donnelly wrote:

Patrick Donnelly wrote:

Neha Ojha wrote:

[...]

Looks like peering induced by mapping change by the balancer. How often does this happen?

Pretty rare, once in ~250 jobs so far.

3/262 jobs: https://pulpito.ceph.com/pdonnell-2021-09-17_20:49:46-fs-wip-pdonnell-testing-20210917.174826-distro-basic-smithi/

Did this start happening fairly recently? I'll take a look at the logs to see what I can find.

#6 Updated by Patrick Donnelly over 2 years ago

Neha Ojha wrote:

Patrick Donnelly wrote:

Patrick Donnelly wrote:

Neha Ojha wrote:

[...]

Looks like peering induced by mapping change by the balancer. How often does this happen?

Pretty rare, once in ~250 jobs so far.

3/262 jobs: https://pulpito.ceph.com/pdonnell-2021-09-17_20:49:46-fs-wip-pdonnell-testing-20210917.174826-distro-basic-smithi/

Did this start happening fairly recently? I'll take a look at the logs to see what I can find.

Oldest occurrence I can find is

https://pulpito.ceph.com/teuthology-2021-08-03_03:15:03-fs-master-distro-basic-gibba/6308125/

but there may be some I missed.

#7 Updated by Patrick Donnelly over 2 years ago

Patrick Donnelly wrote:

Neha Ojha wrote:

Patrick Donnelly wrote:

Patrick Donnelly wrote:

Neha Ojha wrote:

[...]

Looks like peering induced by mapping change by the balancer. How often does this happen?

Pretty rare, once in ~250 jobs so far.

3/262 jobs: https://pulpito.ceph.com/pdonnell-2021-09-17_20:49:46-fs-wip-pdonnell-testing-20210917.174826-distro-basic-smithi/

Did this start happening fairly recently? I'll take a look at the logs to see what I can find.

Oldest occurrence I can find is

https://pulpito.ceph.com/teuthology-2021-08-03_03:15:03-fs-master-distro-basic-gibba/6308125/

but there may be some I missed.

Scratch that. Looks like the earliest I can grep for is:

/a/pdonnell-2021-05-12_04:01:31-fs-wip-pdonnell-testing-20210511.232042-distro-basic-smithi/6110639/teuthology.log.gz

This is before the .mgr pool change was merged (in June).

#10 Updated by Radoslaw Zarzynski almost 2 years ago

To judge how severe this problem really is, we need to know whether the stall is permanent (the PG gets stuck and makes no progress) or just a (slightly) delayed operation.

#15 Updated by Radoslaw Zarzynski over 1 year ago

  • Assignee set to Aishwarya Mathuria

#17 Updated by Aishwarya Mathuria over 1 year ago

I have been going through the failure logs mentioned above and I see that the health check does pass eventually:

2022-08-19T23:14:08.032190+0000 mon.a (mon.0) 1636 : cluster [WRN] Health check failed: Reduced data availability: 1 pg peering (PG_AVAILABILITY)
2022-08-19T23:14:09.980141+0000 mgr.y (mgr.9288) 36 : cluster [DBG] pgmap v37: 137 pgs: 2 peering, 135 active+clean; 11 KiB data, 170 MiB used, 720 GiB / 720 GiB avail; 1.1 KiB/s wr, 0 op/s
2022-08-19T23:14:11.980868+0000 mgr.y (mgr.9288) 37 : cluster [DBG] pgmap v38: 137 pgs: 2 peering, 135 active+clean; 11 KiB data, 170 MiB used, 720 GiB / 720 GiB avail; 127 B/s wr, 0 op/s
2022-08-19T23:14:13.981488+0000 mgr.y (mgr.9288) 38 : cluster [DBG] pgmap v39: 137 pgs: 137 active+clean; 11 KiB data, 170 MiB used, 720 GiB / 720 GiB avail; 127 B/s wr, 0 op/s; 0 B/s, 0 objects/s recovering
2022-08-19T23:14:14.032388+0000 mon.a (mon.0) 1637 : cluster [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 1 pg peering)
2022-08-19T23:14:14.032413+0000 mon.a (mon.0) 1638 : cluster [INF] Cluster is now healthy

But in tasks/ceph.py, here: https://github.com/ceph/ceph/blob/main/qa/tasks/ceph.py#L1138, we check for the first occurrence of WRN-level logs, so the test fails because it finds the following log:

2022-08-19T23:15:59.841 INFO:tasks.ceph:Checking cluster log for badness...
2022-08-19T23:15:59.841 DEBUG:teuthology.orchestra.run.smithi138:> sudo egrep '\[ERR\]|\[WRN\]|\[SEC\]' /var/log/ceph/ceph.log | egrep -v '\(MDS_ALL_DOWN\)' | egrep -v '\(MDS_UP_LESS_THAN_MAX\)' | egrep -v 'overall HEALTH_' | egrep -v '\(FS_DEGRADED\)' | egrep -v '\(MDS_FAILED\)' | egrep -v '\(MDS_DEGRADED\)' | egrep -v '\(FS_WITH_FAILED_MDS\)' | egrep -v '\(MDS_DAMAGE\)' | egrep -v '\(MDS_ALL_DOWN\)' | egrep -v '\(MDS_UP_LESS_THAN_MAX\)' | egrep -v '\(FS_INLINE_DATA_DEPRECATED\)' | egrep -v 'overall HEALTH_' | egrep -v '\(OSD_DOWN\)' | egrep -v '\(OSD_' | egrep -v 'but it is still running' | egrep -v 'is not responding' | head -n 1
2022-08-19T23:15:59.955 INFO:teuthology.orchestra.run.smithi138.stdout:2022-08-19T22:51:27.586480+0000 mon.a (mon.0) 590 : cluster [WRN] Health check failed: Reduced data availability: 2 pgs peering (PG_AVAILABILITY)
2022-08-19T23:15:59.956 WARNING:tasks.ceph:Found errors (ERR|WRN|SEC) in cluster log
2022-08-19T23:16:00.284 INFO:teuthology.orchestra.run.smithi138.stdout:2022-08-19T22:51:27.586480+0000 mon.a (mon.0) 590 : cluster [WRN] Health check failed: Reduced data availability: 2 pgs peering (PG_AVAILABILITY)

If we look at the timestamp, that log line is from about 25 minutes before the cluster became healthy again; by 2022-08-19T23:16:00, when the check ran, the cluster was actually healthy.
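For reference, the check in qa/tasks/ceph.py effectively greps the whole cluster log for ERR/WRN/SEC lines, drops ignorelisted patterns, and fails on the first surviving match, regardless of whether the warning was later cleared. A simplified sketch of that first-match logic (the function name and the ignorelist subset are illustrative, not the actual QA code):

```python
import re

# Patterns the QA run is told to tolerate (a small subset, for illustration).
IGNORELIST = [
    r"\(MDS_ALL_DOWN\)",
    r"\(MDS_UP_LESS_THAN_MAX\)",
    r"overall HEALTH_",
    r"\(OSD_",
]


def first_badness(cluster_log_lines):
    """Return the first ERR/WRN/SEC line not covered by the ignorelist.

    Mirrors the first-match semantics of the egrep pipeline in the QA
    task: a warning that was later cleared still fails the run.
    """
    level = re.compile(r"\[ERR\]|\[WRN\]|\[SEC\]")
    for line in cluster_log_lines:
        if level.search(line) and not any(re.search(p, line) for p in IGNORELIST):
            return line
    return None


log = [
    "cluster [WRN] Health check failed: Reduced data availability: 2 pgs peering (PG_AVAILABILITY)",
    "cluster [INF] Health check cleared: PG_AVAILABILITY",
    "cluster [INF] Cluster is now healthy",
]
# Prints the PG_AVAILABILITY warning even though it cleared two lines later.
print(first_badness(log))
```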

#18 Updated by Aishwarya Mathuria over 1 year ago

Took a look at why peering was happening in the first place. Looking at the PG 7.16 logs below, we can see that the balancer changed the acting set from [0,5] to [4,5]; this rebalancing of PGs is what starts the peering.
Maybe the test failures can be avoided by adding 'Reduced data availability' to the log-ignorelist in ignorelist_health.yaml?
I am still looking into why peering took so long.

2022-08-19T22:51:25.522+0000 7f3fefbf2700  1 -- [v2:172.21.15.138:3300/0,v1:172.21.15.138:6789/0] <== mgr.6216 172.21.15.138:0/40300 118 ==== mon_command({"prefix": "osd pg-upmap-items", "format": "json", "pgid": "7.16", "id": [0, 5]} v 0) v1 ==== 122+0+0 (secure 0 0 0) 0x5558b7630000 con 0x5558b69aa400
2022-08-19T22:51:25.523+0000 7f3fefbf2700 10 mon.a@0(leader).osd e66 preprocess_query mon_command({"prefix": "osd pg-upmap-items", "format": "json", "pgid": "7.16", "id": [0, 5]} v 0) v1 from mgr.6216 172.21.15.138:0/40300
2022-08-19T22:51:25.523+0000 7f3fefbf2700  7 mon.a@0(leader).osd e66 prepare_update mon_command({"prefix": "osd pg-upmap-items", "format": "json", "pgid": "7.16", "id": [0, 5]} v 0) v1 from mgr.6216 172.21.15.138:0/40300
2022-08-19T22:51:25.523+0000 7f3fefbf2700 10 mon.a@0(leader).log v530  logging 2022-08-19T22:51:25.523722+0000 mon.a (mon.0) 576 : audit [INF] from='mgr.6216 172.21.15.138:0/40300' entity='mgr.y' cmd=[{"prefix": "osd pg-upmap-items", "format": "json", "pgid": "7.16", "id": [0, 5]}]: dispatch
2022-08-19T22:51:25.578+0000 7f3fee3ef700  0 log_channel(audit) log [INF] : from='mgr.6216 172.21.15.138:0/40300' entity='mgr.y' cmd='[{"prefix": "osd pg-upmap-items", "format": "json", "pgid": "7.16", "id": [0, 5]}]': finished
2022-08-19T22:51:25.578+0000 7f3fee3ef700  2 mon.a@0(leader) e1 send_reply 0x5558b6558c30 0x5558b6bd1860 mon_command_ack([{"prefix": "osd pg-upmap-items", "format": "json", "pgid": "7.16", "id": [0, 5]}]=0 set 7.16 pg_upmap_items mapping to [0->5] v67) v1
2022-08-19T22:51:25.578+0000 7f3fee3ef700  1 -- [v2:172.21.15.138:3300/0,v1:172.21.15.138:6789/0] --> 172.21.15.138:0/40300 -- mon_command_ack([{"prefix": "osd pg-upmap-items", "format": "json", "pgid": "7.16", "id": [0, 5]}]=0 set 7.16 pg_upmap_items mapping to [0->5] v67) v1 -- 0x5558b6bd1860 con 0x5558b69aa400
2022-08-19T22:51:25.582+0000 7f3fee3ef700  7 mon.a@0(leader).log v531 update_from_paxos applying incremental log 531 2022-08-19T22:51:25.523722+0000 mon.a (mon.0) 576 : audit [INF] from='mgr.6216 172.21.15.138:0/40300' entity='mgr.y' cmd=[{"prefix": "osd pg-upmap-items", "format": "json", "pgid": "7.16", "id": [0, 5]}]: dispatch
2022-08-19T22:51:27.530+0000 7f3fefbf2700 20 mon.a@0(leader).mgrstat health checks:
{
    "PG_AVAILABILITY": {
        "severity": "HEALTH_WARN",
        "summary": {
            "message": "Reduced data availability: 2 pgs peering",
            "count": 2
        },
        "detail": [
            {
                "message": "pg 7.16 is stuck peering for 61s, current state peering, last acting [4,5]" 
            },
            {
                "message": "pg 8.25 is stuck peering for 61s, current state peering, last acting [1,6]" 
            }
        ]
    }
}
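The ignorelist change suggested above would look roughly like this. This is a sketch following the usual teuthology override convention; the exact file location and surrounding keys in ignorelist_health.yaml are assumptions. Entries are regular expressions matched against cluster log lines:

```yaml
# Hypothetical addition to ignorelist_health.yaml; key placement is assumed
# from the standard teuthology ceph-task override layout.
overrides:
  ceph:
    log-ignorelist:
      - Reduced data availability
      - \(PG_AVAILABILITY\)
```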

#19 Updated by Radoslaw Zarzynski over 1 year ago

Let's move back to it next week.

#28 Updated by Radoslaw Zarzynski 9 months ago

Aishwarya, it started showing again. Could you please take a look?

#31 Updated by Radoslaw Zarzynski 9 months ago

  • Backport set to reef, quincy

#33 Updated by Aishwarya Mathuria 9 months ago

After taking a look at the logs, I think this is related to the following tracker: https://tracker.ceph.com/issues/51688

From monitor logs, the health check that caused the test to fail:


2023-06-14T02:35:19.373+0000 7f07df565700 20 mon.a@0(leader).mgrstat health checks:
{
    "PG_AVAILABILITY": {
        "severity": "HEALTH_WARN",
        "summary": {
            "message": "Reduced data availability: 1 pg peering",
            "count": 1
        },
        "detail": [
            {
                "message": "pg 7.3e is stuck peering for 62s, current state peering, last acting [5,4]" 
            }
        ]
    }
}

However, according to the osd.5 logs, peering for PG 7.3e was over by then, and the PG was in the active state:

2023-06-14T02:35:19.020+0000 7fb5b3b79700 10 osd.5 pg_epoch: 47 pg[7.3e( empty local-lis/les=46/47 n=0 ec=42/42 lis/c=46/46 les/c/f=47/47/0 sis=46) [5,4] r=0 lpr=46 crt=0'0 mlcod 0'0 active+clean] remove_stray_recovery_sources remove osd 0 from missing_loc
2023-06-14T02:35:19.020+0000 7fb5b3b79700 10 osd.5 pg_epoch: 47 pg[7.3e( empty local-lis/les=46/47 n=0 ec=42/42 lis/c=46/46 les/c/f=47/47/0 sis=46) [5,4] r=0 lpr=46 crt=0'0 mlcod 0'0 active+clean] update_heartbeat_peers 0,4,5 -> 4,5
2023-06-14T02:35:19.020+0000 7fb5b3b79700 20 osd.5 47 need_heartbeat_peer_update
2023-06-14T02:35:19.020+0000 7fb5b3b79700 20 osd.5 pg_epoch: 47 pg[7.3e( empty local-lis/les=46/47 n=0 ec=42/42 lis/c=46/46 les/c/f=47/47/0 sis=46) [5,4] r=0 lpr=46 crt=0'0 mlcod 0'0 active+clean] prepare_stats_for_publish reporting purged_snaps []

Peering for PG 7.3e took around one second:

2023-06-14T02:35:17.916+0000 7fb5b3b79700  5 osd.5 pg_epoch: 46 pg[7.3e( empty local-lis/les=42/43 n=0 ec=42/42 lis/c=42/42 les/c/f=43/43/0 sis=46 pruub=11.765387535s) [5,4] r=0 lpr=46 pi=[42,46)/1 crt=0'0 mlcod 0'0 peering pruub 161.579864502s@ mbc={}] enter Started/Primary/Peering/GetInfo
2023-06-14T02:35:17.916+0000 7fb5b3b79700 10 osd.5 pg_epoch: 46 pg[7.3e( empty local-lis/les=42/43 n=0 ec=42/42 lis/c=42/42 les/c/f=43/43/0 sis=46 pruub=11.765387535s) [5,4] r=0 lpr=46 pi=[42,46)/1 crt=0'0 mlcod 0'0 peering pruub 161.579864502s@ mbc={}] build_prior all_probe 0,5
2023-06-14T02:35:17.916+0000 7fb5b3b79700 10 osd.5 pg_epoch: 46 pg[7.3e( empty local-lis/les=42/43 n=0 ec=42/42 lis/c=42/42 les/c/f=43/43/0 sis=46 pruub=11.765387535s) [5,4] r=0 lpr=46 pi=[42,46)/1 crt=0'0 mlcod 0'0 peering pruub 161.579864502s@ mbc={}] build_prior maybe_rw interval:42, acting: 0,5
2023-06-14T02:35:17.916+0000 7fb5b3b79700 10 osd.5 pg_epoch: 46 pg[7.3e( empty local-lis/les=42/43 n=0 ec=42/42 lis/c=42/42 les/c/f=43/43/0 sis=46 pruub=11.765387535s) [5,4] r=0 lpr=46 pi=[42,46)/1 crt=0'0 mlcod 0'0 peering pruub 161.579864502s@ mbc={}] build_prior final: probe 0,4,5 down  blocked_by {}
2023-06-14T02:35:17.916+0000 7fb5b3b79700 10 osd.5 pg_epoch: 46 pg[7.3e( empty local-lis/les=42/43 n=0 ec=42/42 lis/c=42/42 les/c/f=43/43/0 sis=46 pruub=11.765387535s) [5,4] r=0 lpr=46 pi=[42,46)/1 crt=0'0 mlcod 0'0 peering pruub 161.579864502s@ mbc={}] up_thru 42 < same_since 46, must notify monitor
.
.
.
2023-06-14T02:35:18.916+0000 7fb5b3b79700 10 osd.5 pg_epoch: 47 pg[7.3e( empty local-lis/les=42/43 n=0 ec=42/42 lis/c=42/42 les/c/f=43/43/0 sis=46 pruub=11.759792328s) [5,4] r=0 lpr=46 pi=[42,46)/1 crt=0'0 mlcod 0'0 peering pruub 161.579864502s@ mbc={}] state<Started/Primary/Peering>: Leaving Peering
2023-06-14T02:35:18.916+0000 7fb5b3b79700 10 osd.5 pg_epoch: 47 pg[7.3e( empty local-lis/les=42/43 n=0 ec=42/42 lis/c=42/42 les/c/f=43/43/0 sis=46 pruub=11.759792328s) [5,4] r=0 lpr=46 pi=[42,46)/1 crt=0'0 mlcod 0'0 unknown pruub 161.579864502s@ mbc={}] state<Started/Primary/Active>: In Active, about to call activate
2023-06-14T02:35:18.916+0000 7fb5b3b79700 20 osd.5 pg_epoch: 47 pg[7.3e( empty local-lis/les=46/47 n=0 ec=42/42 lis/c=42/42 les/c/f=43/43/0 sis=46) [5,4] r=0 lpr=46 pi=[42,46)/1 crt=0'0 mlcod 0'0 activating mbc={}] update_calc_stats no peer_missing found for 0

There is a PR out for review for the tracker I mentioned; I will do a test run with it and update this tracker if it fixes the issue.

#35 Updated by Radoslaw Zarzynski 8 months ago

In addition to the suggestion that this is a duplicate, an alternative hypothesis is a ceph-mgr problem. My understanding of Aishwarya's last comment is that the issue isn't an actual availability problem: the PG was active, and only the mgr's report was inaccurate.

#37 Updated by Radoslaw Zarzynski 5 months ago

In response to #33: this seems to be something different, as the fix for https://tracker.ceph.com/issues/51688 was merged in July.

#39 Updated by Laura Flores 5 months ago

/a/yuriw-2023-10-02_20:49:32-rados-wip-yuri5-testing-2023-10-02-1105-distro-default-smithi/7409028

#40 Updated by Laura Flores 5 months ago

  • Tags set to test-failure

#48 Updated by Radoslaw Zarzynski 24 days ago

https://github.com/ceph/ceph/pull/49332 got merged to main on Jul 19 2023.

#49 Updated by Laura Flores 24 days ago

Hey Shreyansh, can you take a look at this and see if it's related to what was going on in https://tracker.ceph.com/issues/51688?
