Bug #4873
osd: scrub found missing object on primary
Description
failure_reason: '"2013-04-30 02:41:51.132730 osd.0 10.214.132.19:6803/3151 12 : [ERR] 3.0 osd.0 missing f5e3ef31/plana6012082-215/head//3" in cluster log'
The job was:
ubuntu@teuthology:/a/teuthology-2013-04-30_01:00:09-rados-next-testing-basic/3492$ cat orig.config.yaml
kernel:
  kdb: true
  sha1: 1de177703e3fd1b2787045a24eaf0e8c29682388
machine_type: plana
nuke-on-error: true
overrides:
  ceph:
    conf:
      global:
        ms inject delay max: 1
        ms inject delay probability: 0.005
        ms inject delay type: osd
        ms inject socket failures: 2500
      mon:
        debug mon: 20
        debug ms: 20
        debug paxos: 20
    fs: xfs
    log-whitelist:
    - slow request
    sha1: 6f2a7df4b0f848ff3d809925462a395656410e87
  s3tests:
    branch: next
  workunit:
    sha1: 6f2a7df4b0f848ff3d809925462a395656410e87
roles:
- - mon.a
  - mon.c
  - osd.0
  - osd.1
  - osd.2
- - mon.b
  - mds.a
  - osd.3
  - osd.4
  - osd.5
  - client.0
tasks:
- chef: null
- clock.check: null
- install: null
- ceph:
    log-whitelist:
    - wrongly marked me down
    - objects unfound and apparently lost
- thrashosds:
    chance_pgnum_grow: 1
    chance_pgpnum_fix: 1
    timeout: 1200
- rados:
    clients:
    - client.0
    objects: 500
    op_weights:
      delete: 50
      read: 100
      rollback: 50
      snap_create: 50
      snap_remove: 50
      write: 100
    ops: 4000
History
#1 Updated by Ian Colle almost 11 years ago
- Assignee set to David Zafman
#2 Updated by Sage Weil almost 11 years ago
ubuntu@teuthology:/a/sage-2013-04-30_21:21:02-rados-wip-mds-testing-basic/4319
Note that the regression from #4872 might be a trigger, although I suspect some other bug is ultimately responsible here.
#3 Updated by Sage Weil almost 11 years ago
- Status changed from New to In Progress
#4 Updated by Samuel Just almost 11 years ago
The first thing we see is that osd.0 goes down while pgs are creating (actually, splitting):
2013-04-30 21:55:31.270026 mon.0 10.214.131.10:6789/0 65 : [INF] pgmap v26: 59 pgs: 10 creating, 43 active+clean, 6 stale+active+clean; 207 MB data, 1200 MB used, 2792 GB / 2793 GB avail
2013-04-30 21:55:32.022864 mon.0 10.214.131.10:6789/0 66 : [DBG] osd.0 10.214.131.10:6800/32000 reported failed by osd.3 10.214.131.12:6800/8531
2013-04-30 21:55:32.329493 mon.0 10.214.131.10:6789/0 67 : [INF] pgmap v27: 59 pgs: 10 creating, 49 active+clean; 207 MB data, 1200 MB used, 2792 GB / 2793 GB avail; 0B/s rd, 2010B/s wr, 1op/s
2013-04-30 21:55:32.522214 mon.0 10.214.131.10:6789/0 68 : [INF] osdmap e16: 6 osds: 5 up, 6 in
It then comes back up as the pgs finish splitting:
2013-04-30 21:55:33.549274 mon.0 10.214.131.10:6789/0 70 : [INF] osd.0 10.214.131.10:6810/315 boot
2013-04-30 21:55:33.550557 mon.0 10.214.131.10:6789/0 71 : [INF] osdmap e17: 6 osds: 6 up, 6 in
2013-04-30 21:55:33.608243 mon.0 10.214.131.10:6789/0 72 : [INF] pgmap v29: 59 pgs: 42 active+clean, 10 stale+creating, 7 stale+active+clean; 207 MB data, 1200 MB used, 2792 GB / 2793 GB avail
2013-04-30 21:55:34.596397 mon.0 10.214.131.10:6789/0 73 : [INF] osdmap e18: 6 osds: 6 up, 6 in
Then pg 3.0 goes active with osd.0 as primary and osd.4 as replica, and a scrub happens:
2013-04-30 21:55:47.561414 osd.0 10.214.131.10:6810/315 4 : [ERR] 3.0 osd.4 missing 796b1c41/plana288887-147/head//3
2013-04-30 21:55:47.561417 osd.0 10.214.131.10:6810/315 5 : [ERR] 3.0 osd.4 missing e62fb903/plana288887-150/head//3
2013-04-30 21:55:47.583047 osd.0 10.214.131.10:6810/315 6 : [ERR] 3.0 osd.4 missing 6f5e2c13/plana288887-152/head//3
2013-04-30 21:55:47.583049 osd.0 10.214.131.10:6810/315 7 : [ERR] 3.0 osd.4 missing 26eb7c33/plana288887-151/head//3
2013-04-30 21:55:47.583053 osd.0 10.214.131.10:6810/315 8 : [ERR] 3.0 osd.4 missing 33a19f63/plana288887-153/head//3
2013-04-30 21:55:47.583054 osd.0 10.214.131.10:6810/315 9 : [ERR] 3.0 osd.4 missing cdcfafc3/plana288887-154/head//3
2013-04-30 21:55:47.583055 osd.0 10.214.131.10:6810/315 10 : [ERR] 3.0 osd.4 missing 85a83fd5/plana288887-156/head//3
2013-04-30 21:55:47.598104 osd.0 10.214.131.10:6810/315 11 : [ERR] 3.0 osd.4 missing e6f4ab26/plana288887-148/head//3
2013-04-30 21:55:47.598107 osd.0 10.214.131.10:6810/315 12 : [ERR] 3.0 osd.4 missing c88d2ec7/plana288887-157/head//3
2013-04-30 21:55:47.598109 osd.0 10.214.131.10:6810/315 13 : [ERR] 3.0 osd.4 missing d3fc46bb/plana288887-146/head//3
2013-04-30 21:55:47.598110 osd.0 10.214.131.10:6810/315 14 : [ERR] 3.0 osd.4 missing a9b7a23f/plana288887-155/head//3
2013-04-30 21:55:47.598173 osd.0 10.214.131.10:6810/315 15 : [ERR] 3.0 deep-scrub 11 missing, 0 inconsistent objects
The interesting bit is that none of those objects should map to pg 3.0 with any pg_num > 7. This suggests that on startup osd.0 somehow failed to split pg 3.0. Still investigating.
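As background, Ceph derives an object's pg from its hash via a stable modulo, so growing pg_num only moves objects into newly split child pgs. A minimal sketch of that mapping, assuming the leading hex field in each logged name (e.g. 796b1c41 in 796b1c41/plana288887-147/head//3) is the 32-bit object hash and that ceph_stable_mod below matches the helper in Ceph's src/include/rados.h:

/*
 * Sketch: recompute the pg seed for the objects flagged by the scrub.
 * bmask is pg_num rounded up to the next power of two, minus one.
 */
#include <stdio.h>
#include <stdint.h>

static inline uint32_t ceph_stable_mod(uint32_t x, uint32_t b, uint32_t bmask)
{
        if ((x & bmask) < b)
                return x & bmask;
        else
                return x & (bmask >> 1);
}

int main(void)
{
        /* hashes taken from the scrub errors above */
        uint32_t hashes[] = { 0x796b1c41, 0xe62fb903, 0x6f5e2c13,
                              0x26eb7c33, 0xf5e3ef31 };
        uint32_t pg_num = 8;    /* the smallest pg_num > 7; mask = 7 */
        uint32_t mask   = 7;

        for (int i = 0; i < 5; i++)
                printf("%08x -> pg seed %u\n", hashes[i],
                       ceph_stable_mod(hashes[i], pg_num, mask));
        /* prints seeds 1, 3, 3, 3, 1 -- none map to seed 0, i.e. none of
         * these objects belong in pg 3.0 once pg_num has grown past 7 */
        return 0;
}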
#5 Updated by Anonymous almost 11 years ago
- Priority changed from Urgent to High
#6 Updated by David Zafman almost 11 years ago
- Status changed from In Progress to Can't reproduce