Bug #4873
osd: scrub found missing object on primary
Status: Closed
Description
failure_reason: '"2013-04-30 02:41:51.132730 osd.0 10.214.132.19:6803/3151 12 : [ERR] 3.0 osd.0 missing f5e3ef31/plana6012082-215/head//3" in cluster log'
job was:
ubuntu@teuthology:/a/teuthology-2013-04-30_01:00:09-rados-next-testing-basic/3492$ cat orig.config.yaml
kernel:
  kdb: true
  sha1: 1de177703e3fd1b2787045a24eaf0e8c29682388
machine_type: plana
nuke-on-error: true
overrides:
  ceph:
    conf:
      global:
        ms inject delay max: 1
        ms inject delay probability: 0.005
        ms inject delay type: osd
        ms inject socket failures: 2500
      mon:
        debug mon: 20
        debug ms: 20
        debug paxos: 20
    fs: xfs
    log-whitelist:
    - slow request
    sha1: 6f2a7df4b0f848ff3d809925462a395656410e87
  s3tests:
    branch: next
  workunit:
    sha1: 6f2a7df4b0f848ff3d809925462a395656410e87
roles:
- - mon.a
  - mon.c
  - osd.0
  - osd.1
  - osd.2
- - mon.b
  - mds.a
  - osd.3
  - osd.4
  - osd.5
  - client.0
tasks:
- chef: null
- clock.check: null
- install: null
- ceph:
    log-whitelist:
    - wrongly marked me down
    - objects unfound and apparently lost
- thrashosds:
    chance_pgnum_grow: 1
    chance_pgpnum_fix: 1
    timeout: 1200
- rados:
    clients:
    - client.0
    objects: 500
    op_weights:
      delete: 50
      read: 100
      rollback: 50
      snap_create: 50
      snap_remove: 50
      write: 100
    ops: 4000
Updated by Sage Weil almost 11 years ago
ubuntu@teuthology:/a/sage-2013-04-30_21:21:02-rados-wip-mds-testing-basic/4319
note that the regression from #4872 might be a trigger, although I suspect some other bug is ultimately responsible here
Updated by Samuel Just almost 11 years ago
The first thing we see is that osd.0 goes down while pgs are creating (actually, splitting, since the thrashosds task grows pg_num):
2013-04-30 21:55:31.270026 mon.0 10.214.131.10:6789/0 65 : [INF] pgmap v26: 59 pgs: 10 creating, 43 active+clean, 6 stale+active+clean; 207 MB data, 1200 MB used, 2792 GB / 2793 GB avail
2013-04-30 21:55:32.022864 mon.0 10.214.131.10:6789/0 66 : [DBG] osd.0 10.214.131.10:6800/32000 reported failed by osd.3 10.214.131.12:6800/8531
2013-04-30 21:55:32.329493 mon.0 10.214.131.10:6789/0 67 : [INF] pgmap v27: 59 pgs: 10 creating, 49 active+clean; 207 MB data, 1200 MB used, 2792 GB / 2793 GB avail; 0B/s rd, 2010B/s wr, 1op/s
2013-04-30 21:55:32.522214 mon.0 10.214.131.10:6789/0 68 : [INF] osdmap e16: 6 osds: 5 up, 6 in
It then comes back up as the pgs finish splitting:
2013-04-30 21:55:33.549274 mon.0 10.214.131.10:6789/0 70 : [INF] osd.0 10.214.131.10:6810/315 boot
2013-04-30 21:55:33.550557 mon.0 10.214.131.10:6789/0 71 : [INF] osdmap e17: 6 osds: 6 up, 6 in
2013-04-30 21:55:33.608243 mon.0 10.214.131.10:6789/0 72 : [INF] pgmap v29: 59 pgs: 42 active+clean, 10 stale+creating, 7 stale+active+clean; 207 MB data, 1200 MB used, 2792 GB / 2793 GB avail
2013-04-30 21:55:34.596397 mon.0 10.214.131.10:6789/0 73 : [INF] osdmap e18: 6 osds: 6 up, 6 in
Then pg 3.0 goes active with osd.0 as primary and osd.4 as replica, and a deep scrub runs:
2013-04-30 21:55:47.561414 osd.0 10.214.131.10:6810/315 4 : [ERR] 3.0 osd.4 missing 796b1c41/plana288887-147/head//3
2013-04-30 21:55:47.561417 osd.0 10.214.131.10:6810/315 5 : [ERR] 3.0 osd.4 missing e62fb903/plana288887-150/head//3
2013-04-30 21:55:47.583047 osd.0 10.214.131.10:6810/315 6 : [ERR] 3.0 osd.4 missing 6f5e2c13/plana288887-152/head//3
2013-04-30 21:55:47.583049 osd.0 10.214.131.10:6810/315 7 : [ERR] 3.0 osd.4 missing 26eb7c33/plana288887-151/head//3
2013-04-30 21:55:47.583053 osd.0 10.214.131.10:6810/315 8 : [ERR] 3.0 osd.4 missing 33a19f63/plana288887-153/head//3
2013-04-30 21:55:47.583054 osd.0 10.214.131.10:6810/315 9 : [ERR] 3.0 osd.4 missing cdcfafc3/plana288887-154/head//3
2013-04-30 21:55:47.583055 osd.0 10.214.131.10:6810/315 10 : [ERR] 3.0 osd.4 missing 85a83fd5/plana288887-156/head//3
2013-04-30 21:55:47.598104 osd.0 10.214.131.10:6810/315 11 : [ERR] 3.0 osd.4 missing e6f4ab26/plana288887-148/head//3
2013-04-30 21:55:47.598107 osd.0 10.214.131.10:6810/315 12 : [ERR] 3.0 osd.4 missing c88d2ec7/plana288887-157/head//3
2013-04-30 21:55:47.598109 osd.0 10.214.131.10:6810/315 13 : [ERR] 3.0 osd.4 missing d3fc46bb/plana288887-146/head//3
2013-04-30 21:55:47.598110 osd.0 10.214.131.10:6810/315 14 : [ERR] 3.0 osd.4 missing a9b7a23f/plana288887-155/head//3
2013-04-30 21:55:47.598173 osd.0 10.214.131.10:6810/315 15 : [ERR] 3.0 deep-scrub 11 missing, 0 inconsistent objects
The interesting bit is that none of those objects should map to pg 3.0 with a pg_num > 7. This suggests that osd.0 somehow failed to split pg 3.0 on startup. Still investigating.
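For context, here is a minimal sketch (in C) of the placement math behind that claim, modeled on ceph_stable_mod from the Ceph source; the hash value is hypothetical and chosen to show an object moving into a child pg when pg_num grows:

#include <stdint.h>
#include <stdio.h>

/* Simplified sketch of Ceph's stable-mod placement (modeled on
 * ceph_stable_mod in the Ceph tree): an object's placement seed (ps)
 * is its name hash folded into pg_num, where bmask is the smallest
 * 2^n - 1 covering pg_num - 1. The fold is "stable": growing pg_num
 * only ever splits objects out of a parent pg into new child pgs. */
static uint32_t stable_mod(uint32_t x, uint32_t b, uint32_t bmask)
{
    if ((x & bmask) < b)
        return x & bmask;        /* hash fits under pg_num directly */
    else
        return x & (bmask >> 1); /* fold back into the lower half */
}

int main(void)
{
    uint32_t hash = 0x12345;  /* hypothetical object hash, not from the log */

    /* pg_num = 4 (mask 3): ps = hash & 3 = 1 -> object lives in pg 3.1
     * pg_num = 8 (mask 7): ps = hash & 7 = 5 -> object must split into pg 3.5 */
    printf("pg_num=4: pg 3.%u\n", (unsigned)stable_mod(hash, 4, 3));
    printf("pg_num=8: pg 3.%u\n", (unsigned)stable_mod(hash, 8, 7));
    return 0;
}

If osd.0 failed to carry out the split, objects that now belong to child pgs would remain in its pg 3.0 collection, so a scrub led by osd.0 would report them as "missing" on a replica that did split correctly, which matches the errors above. (Where a live object maps can be checked with: ceph osd map <pool> <object>.)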
Updated by David Zafman almost 11 years ago
- Status changed from In Progress to Can't reproduce