Sam and I did some digging into this bug.
Here is the DEGRADED warning in the cluster log. The osdmap changed epochs right before the warning came up:
2024-03-20T19:01:33.799655+0000 mon.a (mon.0) 1058 : cluster [DBG] osdmap e87: 12 total, 12 up, 12 in
2024-03-20T19:01:34.803442+0000 mon.a (mon.0) 1059 : cluster [DBG] osdmap e88: 12 total, 12 up, 12 in
2024-03-20T19:01:34.998093+0000 mgr.x (mgr.14152) 382 : cluster [DBG] pgmap v349: 97 pgs: 1 activating+degraded, 2 activating, 1 peering, 93 active+clean; 639 KiB data, 360 MiB used, 1.0 TiB / 1.0 TiB avail; 3.1 KiB/s rd, 5 op/s; 2/192 objects degraded (1.042%)
2024-03-20T19:01:35.798559+0000 mon.a (mon.0) 1060 : cluster [WRN] Health check failed: Degraded data redundancy: 2/192 objects degraded (1.042%), 1 pg degraded (PG_DEGRADED)
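As an aside, while the cluster is in this state the same condition can be checked interactively; these are standard Ceph commands given for illustration rather than taken from the test logs:
ceph health detail                  # shows the PG_DEGRADED health check and the affected PG count
ceph pg dump pgs | grep degraded    # lists PGs whose state includes "degraded"
ceph pg 2.1b query                  # detailed peering/recovery state for the PG discussed below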
Right before this osdmap change, the upmap balancer had issued some mapping changes:
2024-03-20T19:01:32.916211+0000 mon.a (mon.0) 1040 : audit [INF] from='mgr.14152 172.21.15.43:0/2091623787' entity='mgr.x' cmd=[{"prefix": "osd pg-upmap-items", "format": "json", "pgid": "2.1b", "id": [2, 4]}]: dispatch
2024-03-20T19:01:32.916522+0000 mon.a (mon.0) 1041 : audit [INF] from='mgr.14152 172.21.15.43:0/2091623787' entity='mgr.x' cmd=[{"prefix": "osd pg-upmap-items", "format": "json", "pgid": "3.0", "id": [5, 9, 4, 7]}]: dispatch
2024-03-20T19:01:32.916701+0000 mon.a (mon.0) 1042 : audit [INF] from='mgr.14152 172.21.15.43:0/2091623787' entity='mgr.x' cmd=[{"prefix": "osd pg-upmap-items", "format": "json", "pgid": "3.6", "id": [3, 6]}]: dispatch
2024-03-20T19:01:32.916992+0000 mon.a (mon.0) 1043 : audit [INF] from='mgr.14152 172.21.15.43:0/2091623787' entity='mgr.x' cmd=[{"prefix": "osd pg-upmap-items", "format": "json", "pgid": "3.d", "id": [1, 9]}]: dispatch
2024-03-20T19:01:32.917171+0000 mon.a (mon.0) 1044 : audit [INF] from='mgr.14152 172.21.15.43:0/2091623787' entity='mgr.x' cmd=[{"prefix": "osd pg-upmap-items", "format": "json", "pgid": "3.13", "id": [4, 9]}]: dispatch
2024-03-20T19:01:32.917335+0000 mon.a (mon.0) 1045 : audit [INF] from='mgr.14152 172.21.15.43:0/2091623787' entity='mgr.x' cmd=[{"prefix": "osd pg-upmap-items", "format": "json", "pgid": "3.1b", "id": [4, 6]}]: dispatch
2024-03-20T19:01:32.917512+0000 mon.a (mon.0) 1046 : audit [INF] from='mgr.14152 172.21.15.43:0/2091623787' entity='mgr.x' cmd=[{"prefix": "osd pg-upmap-items", "format": "json", "pgid": "3.21", "id": [4, 8]}]: dispatch
2024-03-20T19:01:32.917686+0000 mon.a (mon.0) 1047 : audit [INF] from='mgr.14152 172.21.15.43:0/2091623787' entity='mgr.x' cmd=[{"prefix": "osd pg-upmap-items", "format": "json", "pgid": "3.28", "id": [5, 7]}]: dispatch
2024-03-20T19:01:32.917890+0000 mon.a (mon.0) 1048 : audit [INF] from='mgr.14152 172.21.15.43:0/2091623787' entity='mgr.x' cmd=[{"prefix": "osd pg-upmap-items", "format": "json", "pgid": "3.2a", "id": [3, 8]}]: dispatch
2024-03-20T19:01:33.788115+0000 mon.a (mon.0) 1049 : audit [INF] from='mgr.14152 172.21.15.43:0/2091623787' entity='mgr.x' cmd='[{"prefix": "osd pg-upmap-items", "format": "json", "pgid": "2.1b", "id": [2, 4]}]': finished
2024-03-20T19:01:33.788370+0000 mon.a (mon.0) 1050 : audit [INF] from='mgr.14152 172.21.15.43:0/2091623787' entity='mgr.x' cmd='[{"prefix": "osd pg-upmap-items", "format": "json", "pgid": "3.0", "id": [5, 9, 4, 7]}]': finished
2024-03-20T19:01:33.788690+0000 mon.a (mon.0) 1051 : audit [INF] from='mgr.14152 172.21.15.43:0/2091623787' entity='mgr.x' cmd='[{"prefix": "osd pg-upmap-items", "format": "json", "pgid": "3.6", "id": [3, 6]}]': finished
2024-03-20T19:01:33.788912+0000 mon.a (mon.0) 1052 : audit [INF] from='mgr.14152 172.21.15.43:0/2091623787' entity='mgr.x' cmd='[{"prefix": "osd pg-upmap-items", "format": "json", "pgid": "3.d", "id": [1, 9]}]': finished
2024-03-20T19:01:33.789072+0000 mon.a (mon.0) 1053 : audit [INF] from='mgr.14152 172.21.15.43:0/2091623787' entity='mgr.x' cmd='[{"prefix": "osd pg-upmap-items", "format": "json", "pgid": "3.13", "id": [4, 9]}]': finished
2024-03-20T19:01:33.789276+0000 mon.a (mon.0) 1054 : audit [INF] from='mgr.14152 172.21.15.43:0/2091623787' entity='mgr.x' cmd='[{"prefix": "osd pg-upmap-items", "format": "json", "pgid": "3.1b", "id": [4, 6]}]': finished
2024-03-20T19:01:33.789500+0000 mon.a (mon.0) 1055 : audit [INF] from='mgr.14152 172.21.15.43:0/2091623787' entity='mgr.x' cmd='[{"prefix": "osd pg-upmap-items", "format": "json", "pgid": "3.21", "id": [4, 8]}]': finished
2024-03-20T19:01:33.789674+0000 mon.a (mon.0) 1056 : audit [INF] from='mgr.14152 172.21.15.43:0/2091623787' entity='mgr.x' cmd='[{"prefix": "osd pg-upmap-items", "format": "json", "pgid": "3.28", "id": [5, 7]}]': finished
2024-03-20T19:01:33.789931+0000 mon.a (mon.0) 1057 : audit [INF] from='mgr.14152 172.21.15.43:0/2091623787' entity='mgr.x' cmd='[{"prefix": "osd pg-upmap-items", "format": "json", "pgid": "3.2a", "id": [3, 8]}]': finished
2024-03-20T19:01:38.865046+0000 mon.a (mon.0) 1061 : audit [DBG] from='mgr.14152 172.21.15.43:0/2091623787' entity='mgr.x' cmd=[{"prefix": "config dump", "format": "json"}]: dispatch
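For context, each "osd pg-upmap-items" entry above is a remap: the "id" list is read as [from, to] pairs, so the first command remaps PG 2.1b from osd.2 to osd.4, and the one for 3.0 remaps osd.5 to osd.9 and osd.4 to osd.7. Issued by hand, the equivalent CLI would look roughly like this (illustrative, not taken from the test):
ceph osd pg-upmap-items 2.1b 2 4
ceph osd pg-upmap-items 3.0 5 9 4 7
The resulting exceptions to the CRUSH mapping are recorded in the osdmap and can be listed with:
ceph osd dump | grep pg_upmap_items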
This was the degraded PG at that time. In this snippet, the PG advances to the new osdmap epoch, and the up/acting set changes from [5,2,7] to [5,4,7] (osd.2 remapped to osd.4, matching the pg-upmap-items command for 2.1b above).
2024-03-20T19:01:33.797+0000 7f5654c25700 10 osd.5 pg_epoch: 86 pg[2.1b( v 86'5 (0'0,86'5] local-lis/les=83/84 n=2 ec=83/83 lis/c=83/83 les/c/f=84/84/0 sis=83) [5,2,7] r=0 lpr=83 crt=86'5 lcod 86'4 mlcod 86'4 active+clean] PeeringState::advance_map handle_advance_map [5,4,7]/[5,4,7] -- 5/5
2024-03-20T19:01:33.798+0000 7f5654c25700 20 osd.5 pg_epoch: 87 pg[2.1b( v 86'5 (0'0,86'5] local-lis/les=83/84 n=2 ec=83/83 lis/c=83/83 les/c/f=84/84/0 sis=83) [5,2,7] r=0 lpr=83 crt=86'5 lcod 86'4 mlcod 86'4 active+clean] PeeringState::should_restart_peering new interval newup [5,4,7] newacting [5,4,7]
Going a bit farther down, we can see from osd.4's notification ("MNotifyRec 2.1b from 4") that osd.4 does not yet have the PG or any of its objects (DNE empty):
2024-03-20T19:01:33.799+0000 7f5654c25700 10 osd.5 pg_epoch: 87 pg[2.1b( v 86'5 (0'0,86'5] local-lis/les=83/84 n=2 ec=83/83 lis/c=83/83 les/c/f=84/84/0 sis=87 pruub=8.005908012s) [5,4,7] r=0 lpr=87 pi=[83,87)/1 crt=86'5 lcod 86'4 mlcod 0'0 peering pruub 299.747192383s@ mbc={}] do_peering_event: epoch_sent: 87 epoch_requested: 87 MNotifyRec 2.1b from 4 notify: (query:87 sent:87 2.1b( DNE empty local-lis/les=0/0 n=0 ec=0/0 lis/c=0/0 les/c/f=0/0/0 sis=0) ([0,0] all_participants={} intervals=[])) features: 0x3f03cffffffdffff +create_info
This would have led to log-based recovery (as opposed to backfill), since the PG had only 5 writes and its log (0'0,86'5] covers its entire history. Log-based recovery can leave a PG degraded while the new replica catches up: the PG holds n=2 objects and osd.4 started with none, which matches the 2/192 objects degraded reported in the pgmap, and in this case the condition lasted only a short time.
In short, this effect is normal and expected with the upmap balancer, and it likely "came up" only recently because we enabled cluster log monitoring in the cephadm tests. The solution here would be to add this warning to the ignorelist.
Adding the warning to the ignorelist would not put the test at risk of erroneously passing in the case where a PG becomes degraded and stays degraded; the test would still eventually time out and fail.
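For reference, the change would be an entry along these lines in the relevant qa suite yaml; the exact file and pattern are to be confirmed, this is just a sketch of the usual log-ignorelist form:
overrides:
  ceph:
    log-ignorelist:
      - \(PG_DEGRADED\)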