Bug #4873

osd: scrub found missing object on primary

Added by Sage Weil almost 11 years ago. Updated almost 11 years ago.

Status:
Can't reproduce
Priority:
High
Assignee:
David Zafman
Category:
OSD
Target version:
-
% Done:

0%

Source:
Q/A
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

failure_reason: '"2013-04-30 02:41:51.132730 osd.0 10.214.132.19:6803/3151 12 : [ERR] 3.0 osd.0 missing f5e3ef31/plana6012082-215/head//3" in cluster log'

job was
ubuntu@teuthology:/a/teuthology-2013-04-30_01:00:09-rados-next-testing-basic/3492$ cat orig.config.yaml 
kernel:
  kdb: true
  sha1: 1de177703e3fd1b2787045a24eaf0e8c29682388
machine_type: plana
nuke-on-error: true
overrides:
  ceph:
    conf:
      global:
        ms inject delay max: 1
        ms inject delay probability: 0.005
        ms inject delay type: osd
        ms inject socket failures: 2500
      mon:
        debug mon: 20
        debug ms: 20
        debug paxos: 20
    fs: xfs
    log-whitelist:
    - slow request
    sha1: 6f2a7df4b0f848ff3d809925462a395656410e87
  s3tests:
    branch: next
  workunit:
    sha1: 6f2a7df4b0f848ff3d809925462a395656410e87
roles:
- - mon.a
  - mon.c
  - osd.0
  - osd.1
  - osd.2
- - mon.b
  - mds.a
  - osd.3
  - osd.4
  - osd.5
  - client.0
tasks:
- chef: null
- clock.check: null
- install: null
- ceph:
    log-whitelist:
    - wrongly marked me down
    - objects unfound and apparently lost
- thrashosds:
    chance_pgnum_grow: 1
    chance_pgpnum_fix: 1
    timeout: 1200
- rados:
    clients:
    - client.0
    objects: 500
    op_weights:
      delete: 50
      read: 100
      rollback: 50
      snap_create: 50
      snap_remove: 50
      write: 100
    ops: 4000

History

#1 Updated by Ian Colle almost 11 years ago

  • Assignee set to David Zafman

#2 Updated by Sage Weil almost 11 years ago

ubuntu@teuthology:/a/sage-2013-04-30_21:21:02-rados-wip-mds-testing-basic/4319

Note that the regression from #4872 might be a trigger, although I suspect some other bug is ultimately responsible here.

#3 Updated by Sage Weil almost 11 years ago

  • Status changed from New to In Progress

#4 Updated by Samuel Just almost 11 years ago

The first thing we see is that osd.0 goes down while pgs are creating (actually, splitting).

2013-04-30 21:55:31.270026 mon.0 10.214.131.10:6789/0 65 : [INF] pgmap v26: 59 pgs: 10 creating, 43 active+clean, 6 stale+active+clean; 207 MB data, 1200 MB used, 2792 GB / 2793 GB avail
2013-04-30 21:55:32.022864 mon.0 10.214.131.10:6789/0 66 : [DBG] osd.0 10.214.131.10:6800/32000 reported failed by osd.3 10.214.131.12:6800/8531
2013-04-30 21:55:32.329493 mon.0 10.214.131.10:6789/0 67 : [INF] pgmap v27: 59 pgs: 10 creating, 49 active+clean; 207 MB data, 1200 MB used, 2792 GB / 2793 GB avail; 0B/s rd, 2010B/s wr, 1op/s
2013-04-30 21:55:32.522214 mon.0 10.214.131.10:6789/0 68 : [INF] osdmap e16: 6 osds: 5 up, 6 in

It then comes back up as the pgs finish splitting:

2013-04-30 21:55:33.549274 mon.0 10.214.131.10:6789/0 70 : [INF] osd.0 10.214.131.10:6810/315 boot
2013-04-30 21:55:33.550557 mon.0 10.214.131.10:6789/0 71 : [INF] osdmap e17: 6 osds: 6 up, 6 in
2013-04-30 21:55:33.608243 mon.0 10.214.131.10:6789/0 72 : [INF] pgmap v29: 59 pgs: 42 active+clean, 10 stale+creating, 7 stale+active+clean; 207 MB data, 1200 MB used, 2792 GB / 2793 GB avail
2013-04-30 21:55:34.596397 mon.0 10.214.131.10:6789/0 73 : [INF] osdmap e18: 6 osds: 6 up, 6 in

Then, pg 3.0 goes active with osd.0 as primary and osd.4 as replica. A scrub happens:

2013-04-30 21:55:47.561414 osd.0 10.214.131.10:6810/315 4 : [ERR] 3.0 osd.4 missing 796b1c41/plana288887-147/head//3
2013-04-30 21:55:47.561417 osd.0 10.214.131.10:6810/315 5 : [ERR] 3.0 osd.4 missing e62fb903/plana288887-150/head//3
2013-04-30 21:55:47.583047 osd.0 10.214.131.10:6810/315 6 : [ERR] 3.0 osd.4 missing 6f5e2c13/plana288887-152/head//3
2013-04-30 21:55:47.583049 osd.0 10.214.131.10:6810/315 7 : [ERR] 3.0 osd.4 missing 26eb7c33/plana288887-151/head//3
2013-04-30 21:55:47.583053 osd.0 10.214.131.10:6810/315 8 : [ERR] 3.0 osd.4 missing 33a19f63/plana288887-153/head//3
2013-04-30 21:55:47.583054 osd.0 10.214.131.10:6810/315 9 : [ERR] 3.0 osd.4 missing cdcfafc3/plana288887-154/head//3
2013-04-30 21:55:47.583055 osd.0 10.214.131.10:6810/315 10 : [ERR] 3.0 osd.4 missing 85a83fd5/plana288887-156/head//3
2013-04-30 21:55:47.598104 osd.0 10.214.131.10:6810/315 11 : [ERR] 3.0 osd.4 missing e6f4ab26/plana288887-148/head//3
2013-04-30 21:55:47.598107 osd.0 10.214.131.10:6810/315 12 : [ERR] 3.0 osd.4 missing c88d2ec7/plana288887-157/head//3
2013-04-30 21:55:47.598109 osd.0 10.214.131.10:6810/315 13 : [ERR] 3.0 osd.4 missing d3fc46bb/plana288887-146/head//3
2013-04-30 21:55:47.598110 osd.0 10.214.131.10:6810/315 14 : [ERR] 3.0 osd.4 missing a9b7a23f/plana288887-155/head//3
2013-04-30 21:55:47.598173 osd.0 10.214.131.10:6810/315 15 : [ERR] 3.0 deep-scrub 11 missing, 0 inconsistent objects

The interesting bit is that none of those objects should map to pg 3.0 once pg_num > 7. This suggests that on startup osd.0 somehow failed to split pg 3.0. Still investigating.
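
For context on that claim, PG placement is a stable mod of the object's hash by the pool's pg_num (ceph_stable_mod() in the Ceph source), so the expected PG for each object can be checked directly from the hashes in the scrub errors above. The following is a minimal Python sketch of that mapping; it assumes the leading hex field of the "missing" lines (e.g. 796b1c41) is the raw object hash, and the pg_num values are chosen purely for illustration.

def ceph_stable_mod(x, b, bmask):
    # Mirrors ceph_stable_mod() from the Ceph source: map a hash onto
    # [0, b) in a way that stays stable as b (pg_num) grows toward the
    # next power of two (bmask = that power of two, minus one).
    return x & bmask if (x & bmask) < b else x & (bmask >> 1)

# Hashes taken from the "osd.4 missing" scrub errors above (pool 3).
hashes = [0x796b1c41, 0xe62fb903, 0x6f5e2c13, 0x26eb7c33, 0x33a19f63]

for pg_num in (4, 8, 16):                        # illustrative pg_num values
    mask = (1 << (pg_num - 1).bit_length()) - 1  # pg_num_mask: next pow2 - 1
    seeds = [ceph_stable_mod(h, pg_num, mask) for h in hashes]
    print("pg_num=%d -> pg seeds %s" % (pg_num, seeds))

With pg_num = 8, for example, these hashes land in seeds 1 and 3 rather than 0, consistent with the objects no longer belonging in pg 3.0 after the split.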

#5 Updated by Anonymous almost 11 years ago

  • Priority changed from Urgent to High

#6 Updated by David Zafman almost 11 years ago

  • Status changed from In Progress to Can't reproduce
