Bug #52385

a possible data loss due to recovery_unfound PG after restarting all nodes

Added by Satoru Takeuchi over 2 years ago. Updated about 1 year ago.

Status: Closed
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags: container
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Related to the discussion in ceph-users ML.
https://marc.info/?l=ceph-users&m=162947327817532&w=2

I encountered a possible data loss in my Rook/Ceph cluster.

I detected an `active+recovery_unfound+undersized+degraded+remapped`
PG after restarting all nodes one by one as follows.

1. Reboot one machine.
2. Wait until the reboot completes at the Kubernetes level (not at the Ceph
cluster level).
3. If there are other nodes to be rebooted, go to step 1.

I understand that I should wait for the cluster to come back to HEALTH_OK in
step 2 to prevent temporarily blocked I/O and `recovery_unfound` PGs.
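
For illustration, the stricter loop would look roughly like the sketch below (the node names, the `rook-ceph` namespace, and the `rook-ceph-tools` deployment are placeholders for my environment, not part of the report itself):

```
# Rough sketch of a rolling reboot that waits for Kubernetes *and* for
# Ceph itself before moving on to the next node.
for node in node-0 node-1 node-2; do
    ssh "$node" sudo reboot || true
    sleep 60   # give the node time to actually go NotReady

    # step 2 above: wait at the Kubernetes level
    kubectl wait --for=condition=Ready "node/$node" --timeout=30m

    # the stricter variant: additionally wait for HEALTH_OK
    until kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph health \
          | grep -q '^HEALTH_OK'; do
        sleep 30
    done
done
```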

However, IIRC, these problems should disappear once all OSDs come back
`up`, because the Ceph cluster keeps consistency through the MONs and the
newest replicas are known via primary log-based replication. Otherwise, if
a power outage happens and some (or all) nodes suffer a surprise power-down,
user data might become corrupted.

I believe this behavior comes from a bug. However, please correct
me if I'm wrong and Ceph can't keep data consistent in such cases.

1. Additional information

1.1. Software

- Ceph: v16.2.4
- Rook: v1.6.3

2. The result of some commands

2.1. ceph -s

```
ceph -s
  cluster:
    id:     b160a475-c579-46a2-9346-416d3a229c5f
    health: HEALTH_ERR
            8/47565926 objects unfound (0.000%)
            Possible data damage: 1 pg recovery_unfound
            Degraded data redundancy: 142103/142697778 objects degraded (0.100%), 1 pg degraded, 1 pg undersized
            1 daemons have recently crashed

  services:
    mon: 3 daemons, quorum eb,ef,eg (age 9h)
    mgr: a(active, since 9h), standbys: b
    osd: 18 osds: 18 up (since 13h), 18 in (since 13h); 1 remapped pgs
    rgw: 3 daemons active (3 hosts, 1 zones)

  data:
    pools:   12 pools, 185 pgs
    objects: 47.57M objects, 3.1 TiB
    usage:   20 TiB used, 111 TiB / 131 TiB avail
    pgs:     142103/142697778 objects degraded (0.100%)
             8/47565926 objects unfound (0.000%)
             181 active+clean
             2   active+clean+scrubbing+deep+repair
             1   active+clean+scrubbing
             1   active+recovery_unfound+undersized+degraded+remapped

  io:
    client: 114 KiB/s rd, 5.7 MiB/s wr, 22 op/s rd, 279 op/s wr
```
2.2. ceph health detail

```
ceph health detail
HEALTH_ERR 8/47565953 objects unfound (0.000%); Possible data damage:
1 pg recovery_unfound; Degraded data redundancy: 142103/142697859
objects degraded (0.100%), 1 pg degraded, 1 pg undersized; 1 daemons
have recently crashed
[WRN] OBJECT_UNFOUND: 8/47565953 objects unfound (0.000%)
pg 10.1 has 8 unfound objects
[ERR] PG_DAMAGED: Possible data damage: 1 pg recovery_unfound
pg 10.1 is active+recovery_unfound+undersized+degraded+remapped,
acting [3,13], 8 unfound
[WRN] PG_DEGRADED: Degraded data redundancy: 142103/142697859 objects
degraded (0.100%), 1 pg degraded, 1 pg undersized
pg 10.1 is stuck undersized for 14h, current state
active+recovery_unfound+undersized+degraded+remapped, last acting
[3,13]
[WRN] RECENT_CRASH: 1 daemons have recently crashed
client.rgw.ceph.hdd.object.store.a crashed on host
rook-ceph-rgw-ceph-hdd-object-store-a-6d5d75c87c-z9rxz at
2021-08-16T17:24:05.471655Z
```

2.3. ceph pg query

```
ceph pg 10.1 query
...
"up": [
11,
13,
3
],
"acting": [
3,
13
],
"backfill_targets": [
"11"
],
"acting_recovery_backfill": [
"3",
"11",
"13"
],
...
"recovery_state": [ {
"name": "Started/Primary/Active",
"enter_time": "2021-08-16T11:24:12.402345+0000",
"might_have_unfound": [ {
"osd": "2",
"status": "already probed"
}, {
"osd": "7",
"status": "already probed"
}, {
"osd": "11",
"status": "already probed"
}, {
"osd": "12",
"status": "already probed"
}, {
"osd": "13",
"status": "already probed"
}, {
"osd": "15",
"status": "already probed"
}
],
```

2.4. ceph pg ls

```
ceph pg ls | grep -v 'active+clean'
PG    OBJECTS  DEGRADED  MISPLACED  UNFOUND  BYTES  OMAP_BYTES*  OMAP_KEYS*  LOG   STATE                                                  SINCE  VERSION         REPORTED         UP            ACTING    SCRUB_STAMP                      DEEP_SCRUB_STAMP
10.1  142087   142103    0          8        0      87077343246  300322300   9716  active+recovery_unfound+undersized+degraded+remapped  14h    46437'28097849  46437:121092131  [11,13,3]p11  [3,13]p3  2021-08-15T00:10:21.615494+0000  2021-08-10T16:42:39.172001+0000
```

2.5. ceph osd tree

```
ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 130.99301 root default
-4 43.66434 zone rack0
-3 7.27739 host 10-69-0-22
0 hdd 7.27739 osd.0 up 1.00000 1.00000
-19 7.27739 host 10-69-0-23
6 hdd 7.27739 osd.6 up 1.00000 1.00000
-7 7.27739 host 10-69-0-24
1 hdd 7.27739 osd.1 up 1.00000 1.00000
-27 7.27739 host 10-69-0-28
13 hdd 7.27739 osd.13 up 1.00000 1.00000
-9 7.27739 host 10-69-0-29
2 hdd 7.27739 osd.2 up 1.00000 1.00000
-15 7.27739 host 10-69-0-30
4 hdd 7.27739 osd.4 up 1.00000 1.00000
-24 43.66434 zone rack1
-41 0 host 10-69-0-214
-33 7.27739 host 10-69-0-215
12 hdd 7.27739 osd.12 up 1.00000 1.00000
-35 0 host 10-69-0-217
-45 7.27739 host 10-69-0-218
9 hdd 7.27739 osd.9 up 1.00000 1.00000
-29 14.55478 host 10-69-0-220
10 hdd 7.27739 osd.10 up 1.00000 1.00000
17 hdd 7.27739 osd.17 up 1.00000 1.00000
-23 7.27739 host 10-69-0-221
8 hdd 7.27739 osd.8 up 1.00000 1.00000
-31 7.27739 host 10-69-0-222
11 hdd 7.27739 osd.11 up 1.00000 1.00000
-12 43.66434 zone rack2
-21 7.27739 host 10-69-1-151
```

History

#1 Updated by Satoru Takeuchi over 2 years ago

I got the logs of osd.{3,11,13} during their boot. This data was collected with the log level turned up.

https://drive.google.com/file/d/1dNEKj9yfD14Xl7NFHzqiFsK0_K-bHrAR/view?usp=sharing

Here is a snippet of `grep -A20 "select_replicated_primary primary is osd.3 with 10.1" osd-3.log`:

```
...
select_replicated_primary primary is osd.3 with 10.1( v 46501'28187664 lc 0'0 (46499'28140309,46501'28187664] local-lis/les=46614/46615 n=142087 ec=211/211 lis/c=46614/45392 les/c/f=46615/45393/0 sis=46645)
calc_replicated_acting
shard 11 (up) backfill 10.1( v 46501'28187664 (46499'28140309,46501'28187664] lb MIN local-lis/les=46137/46138 n=0 ec=211/211 lis/c=46614/45392 les/c/f=46615/45393/0 sis=46645)
osd.13 (up) accepted 10.1( v 46501'28187664 lc 0'0 (46499'28140309,46501'28187664] local-lis/les=46614/46615 n=142087 ec=211/211 lis/c=46614/45392 les/c/f=46615/45393/0 sis=46645 pruub=12.601959229s)
shard 2 (stray) REJECTED 10.1( v 45959'28070775 (45923'28062622,45959'28070775] lb MIN local-lis/les=45954/45955 n=0 ec=211/211 lis/c=46614/45392 les/c/f=46615/45393/0 sis=46645)
shard 7 (stray) REJECTED 10.1( v 46056'28071170 (45923'28063281,46056'28071170] lb 10:80282841:::.dir.213d6d58-f9f1-434e-9ddc-72dbbd388059.46826989.1.518:head local-lis/les=46030/46039 n=724 ec=211/211 lis/c=46614/45392 les/c/f=46615/45393/0 sis=46645)
shard 12 (stray) REJECTED 10.1( v 46135'28077936 (46086'28071790,46135'28077936] lb MIN local-lis/les=46091/46092 n=0 ec=211/211 lis/c=46614/45392 les/c/f=46615/45393/0 sis=46645)
shard 15 (stray) REJECTED 10.1( v 46028'28071122 (45923'28063181,46028'28071122] lb MIN local-lis/les=46022/46023 n=0 ec=211/211 lis/c=46614/45392 les/c/f=46615/45393/0 sis=46645)
...
```

In addition, here is why the above-mentioned object went missing. This information was obtained with `grep "10.1.*also missing on osd." osd-3.log | grep 10:9ef4c2c0:::.dir.213d6d58-f9f1-434e-9ddc-72dbbd388059.24130937.1.150:head`.

```
...
debug 2021-08-18T05:24:14.158+0000 7f1a2e745700 10 osd.3 pg_epoch: 46647 pg[10.1( v 46501'28187664 lc 0'0 (46499'28140309,46501'28187664] local-lis/les=46645/46647 n=142087 ec=211/211 lis/c=46614/45392 les/c/f=46615/45393/0 sis=46645) [11,13,3]/[3,13] backfill=[11] r=0 lpr=46645 pi=[45392,46645)/3 crt=46501'28187664 mlcod 0'0 remapped m=8 u=8 mbc={}] search_for_missing 10:9ef4c2c0:::.dir.213d6d58-f9f1-434e-9ddc-72dbbd388059.24130937.1.150:head 46041'28071169 also missing on osd.3
debug 2021-08-18T05:24:14.158+0000 7f1a2e745700 10 osd.3 pg_epoch: 46647 pg[10.1( v 46501'28187664 lc 0'0 (46499'28140309,46501'28187664] local-lis/les=46645/46647 n=142087 ec=211/211 lis/c=46614/45392 les/c/f=46615/45393/0 sis=46645) [11,13,3]/[3,13] backfill=[11] r=0 lpr=46645 pi=[45392,46645)/3 crt=46501'28187664 mlcod 0'0 remapped m=8 u=8 mbc={}] search_for_missing 10:9ef4c2c0:::.dir.213d6d58-f9f1-434e-9ddc-72dbbd388059.24130937.1.150:head 46041'28071169 also missing on osd.11 (past last_backfill MIN)
debug 2021-08-18T05:24:14.159+0000 7f1a2e745700 10 osd.3 pg_epoch: 46647 pg[10.1( v 46501'28187664 lc 0'0 (46499'28140309,46501'28187664] local-lis/les=46645/46647 n=142087 ec=211/211 lis/c=46614/45392 les/c/f=46615/45393/0 sis=46645) [11,13,3]/[3,13] backfill=[11] r=0 lpr=46645 pi=[45392,46645)/3 crt=46501'28187664 mlcod 0'0 remapped m=8 u=8 mbc={}] search_for_missing 10:9ef4c2c0:::.dir.213d6d58-f9f1-434e-9ddc-72dbbd388059.24130937.1.150:head 46041'28071169 also missing on osd.13
debug 2021-08-18T05:24:15.400+0000 7f1a2e745700 10 osd.3 pg_epoch: 46647 pg[10.1( v 46501'28187664 lc 0'0 (46499'28140309,46501'28187664] local-lis/les=46645/46647 n=142087 ec=211/211 lis/c=46614/45392 les/c/f=46615/45393/0 sis=46645) [11,13,3]/[3,13] backfill=[11] r=0 lpr=46645 pi=[45392,46645)/3 crt=46501'28187664 mlcod 0'0 activating+undersized+degraded+remapped m=8 u=8 mbc={}] search_for_missing 10:9ef4c2c0:::.dir.213d6d58-f9f1-434e-9ddc-72dbbd388059.24130937.1.150:head 46041'28071169 also missing on osd.12 (past last_backfill MIN)
debug 2021-08-18T05:24:15.402+0000 7f1a2e745700 10 osd.3 pg_epoch: 46647 pg[10.1( v 46501'28187664 lc 0'0 (46499'28140309,46501'28187664] local-lis/les=46645/46647 n=142087 ec=211/211 lis/c=46614/45392 les/c/f=46615/45393/0 sis=46645) [11,13,3]/[3,13] backfill=[11] r=0 lpr=46645 pi=[45392,46645)/3 crt=46501'28187664 mlcod 0'0 activating+undersized+degraded+remapped m=8 u=8 mbc={}] search_for_missing 10:9ef4c2c0:::.dir.213d6d58-f9f1-434e-9ddc-72dbbd388059.24130937.1.150:head 46041'28071169 also missing on osd.2 (last_update 45959'28070775 < needed 46041'28071169)
debug 2021-08-18T05:24:15.404+0000 7f1a2e745700 10 osd.3 pg_epoch: 46647 pg[10.1( v 46501'28187664 lc 0'0 (46499'28140309,46501'28187664] local-lis/les=46645/46647 n=142087 ec=211/211 lis/c=46614/45392 les/c/f=46615/45393/0 sis=46645) [11,13,3]/[3,13] backfill=[11] r=0 lpr=46645 pi=[45392,46645)/3 crt=46501'28187664 mlcod 0'0 activating+undersized+degraded+remapped m=8 u=8 mbc={}] search_for_missing 10:9ef4c2c0:::.dir.213d6d58-f9f1-434e-9ddc-72dbbd388059.24130937.1.150:head 46041'28071169 also missing on osd.7 (past last_backfill 10:80282841:::.dir.213d6d58-f9f1-434e-9ddc-72dbbd388059.46826989.1.518:head)
debug 2021-08-18T05:24:15.406+0000 7f1a2e745700 10 osd.3 pg_epoch: 46647 pg[10.1( v 46501'28187664 lc 0'0 (46499'28140309,46501'28187664] local-lis/les=46645/46647 n=142087 ec=211/211 lis/c=46614/45392 les/c/f=46615/45393/0 sis=46645) [11,13,3]/[3,13] backfill=[11] r=0 lpr=46645 pi=[45392,46645)/3 crt=46501'28187664 mlcod 0'0 activating+undersized+degraded+remapped m=8 u=8 mbc={}] search_for_missing 10:9ef4c2c0:::.dir.213d6d58-f9f1-434e-9ddc-72dbbd388059.24130937.1.150:head 46041'28071169 also missing on osd.15 (last_update 46028'28071122 < needed 46041'28071169)
...
```

#2 Updated by Neha Ojha over 2 years ago

  • Project changed from Ceph to RADOS
  • Category deleted (librados)

#3 Updated by Neha Ojha over 2 years ago

Can you share the full set of logs using ceph-post-file (https://docs.ceph.com/en/pacific/man/8/ceph-post-file/)?
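
For reference, a typical invocation looks roughly like this (the description string and log paths below are only placeholders):

```
# -d attaches a short description to the upload; the paths are placeholders
ceph-post-file -d "tracker 52385: recovery_unfound after rolling reboot" \
    /var/log/ceph/ceph-osd.3.log \
    /var/log/ceph/ceph-osd.11.log \
    /var/log/ceph/ceph-osd.13.log
```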

#4 Updated by Satoru Takeuchi over 2 years ago

Thank you for your reply.

My cluster has another problem right now, so I'll collect these logs after resolving it.

#5 Updated by Satoru Takeuchi over 2 years ago

Neha Ojha wrote:

Can you share the full set of logs using ceph-post-file (https://docs.ceph.com/en/pacific/man/8/ceph-post-file/)?

Unfortunately, ceph-post-file won't work because the Ceph daemons' log files are only kept for about one week on my system. However, these logs are stored in Loki.

Could you tell me what kind of logs you need and for what period of time? For example, all OSD/MON logs from one hour before and after the problem.
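
If that would help, I could export a window of OSD logs from Loki with logcli, roughly as sketched below (the label selector and time range are only placeholders and depend on how our log shipping is configured):

```
# Placeholder sketch: pull about an hour of OSD logs around the incident
# out of Loki. The label selector is an assumption about our setup.
logcli query '{app="rook-ceph-osd"}' \
    --from="2021-08-18T04:30:00Z" \
    --to="2021-08-18T05:30:00Z" \
    --limit=1000000 \
    --forward > osd-logs.txt
```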

#6 Updated by Neha Ojha over 2 years ago

Satoru Takeuchi wrote:

Can you share the full set of logs using ceph-post-file (https://docs.ceph.com/en/pacific/man/8/ceph-post-file/)?

Unfortunately, ceph-post-file won't work because the Ceph daemons' log files are only kept for about one week on my system. However, these logs are stored in Loki.

Could you tell me what kind of logs you need and for what period of time? For example, all OSD/MON logs from one hour before and after the problem.

Please provide OSD logs (roughly 10 mins) from all replicas with debug_osd=20, debug_ms=1, when the OSDs are restarted, to capture peering events.

#7 Updated by Satoru Takeuchi over 2 years ago

Please provide OSD logs (roughly 10 mins) from all replicas with debug_osd=20, debug_ms=1, when the OSDs are restarted, to capture peering events.

Unfortunately, the corresponding OSDs are already gone, and we did not capture any logs with sufficient debug configuration.

I'll try to reproduce this problem with the configuration you requested and will let you know the result.
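
Roughly, I plan to apply the requested levels from the Rook toolbox before the reproduction attempt, along these lines (a sketch only):

```
# Apply the debug levels requested in comment #6 cluster-wide, capture
# the OSD restarts, then remove the overrides again afterwards.
ceph config set osd debug_osd 20
ceph config set osd debug_ms 1

# ... restart the OSDs and try to reproduce ...

ceph config rm osd debug_osd
ceph config rm osd debug_ms
```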

#8 Updated by Neha Ojha over 2 years ago

  • Status changed from New to Need More Info

#9 Updated by Satoru Takeuchi about 1 year ago

I tried to reproduce this issue on 17.2.5 for two weeks, but nothing happened. Please close this ticket.

#10 Updated by Radoslaw Zarzynski about 1 year ago

  • Status changed from Need More Info to Closed

Closing per the previous comment.
