Fix #6116

osd: incomplete pg from thrashing on next

Added by Sage Weil over 10 years ago. Updated over 10 years ago.

Status: Resolved
Priority: Urgent
Category: OSD
% Done: 100%
Source: Q/A

Description

... u'overall_status': u'HEALTH_WARN', u'summary': [{u'severity': u'HEALTH_WARN', u'summary': u'1 pgs incomplete'}]} ...

ubuntu@teuthology:/a/teuthology-2013-08-24_14:13:07-rados-next-testing-basic-plana/4228$ cat orig.config.yaml 
kernel:
  kdb: true
  sha1: c2f29906882bd30794da6993e755a0dab2b7a665
machine_type: plana
nuke-on-error: true
os_type: ubuntu
overrides:
  admin_socket:
    branch: next
  ceph:
    conf:
      global:
        ms inject delay max: 1
        ms inject delay probability: 0.005
        ms inject delay type: osd
        ms inject internal delays: 0.002
        ms inject socket failures: 2500
      mon:
        debug mon: 20
        debug ms: 1
        debug paxos: 20
        mon min osdmap epochs: 2
      osd:
        osd map cache size: 1
    fs: ext4
    log-whitelist:
    - slow request
    sha1: 4b529c8bceea98aaf69dceec3a4d1a239036d5d7
  ceph-deploy:
    branch:
      dev: next
    conf:
      client:
        log file: /var/log/ceph/ceph-$name.$pid.log
      mon:
        debug mon: 1
        debug ms: 20
        debug paxos: 20
  install:
    ceph:
      sha1: 4b529c8bceea98aaf69dceec3a4d1a239036d5d7
  s3tests:
    branch: next
  workunit:
    sha1: 4b529c8bceea98aaf69dceec3a4d1a239036d5d7
roles:
- - mon.a
  - mon.c
  - osd.0
  - osd.1
  - osd.2
  - client.0
- - mon.b
  - mds.a
  - osd.3
  - osd.4
  - osd.5
  - client.1
tasks:
- chef: null
- clock.check: null
- install: null
- ceph:
    log-whitelist:
    - wrongly marked me down
    - objects unfound and apparently lost
- thrashosds:
    chance_pgnum_grow: 1
    chance_pgpnum_fix: 1
    chance_test_map_discontinuity: 0.5
    timeout: 1200
- rados:
    clients:
    - client.0
    objects: 50
    op_weights:
      delete: 50
      read: 100
      rollback: 50
      snap_create: 50
      snap_remove: 50
      write: 100
    ops: 4000
teuthology_branch: next
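
The combination above (msgr delay and socket-failure injection, "osd map cache size: 1", and thrashosds with map discontinuity) is what drives a pg into the incomplete state reported in the health output. As a rough sketch of how one might inspect such a pg on the test cluster — standard ceph CLI commands of that era, with the pg id substituted for the 1.3f identified later in the history:

  ceph health detail             # list the pgs behind the HEALTH_WARN
  ceph pg dump_stuck inactive    # show pgs stuck inactive/incomplete
  ceph pg 1.3f query             # peering state, past intervals, probing osds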

Subtasks

Subtask #6391: stuck incomplete (Duplicate; assigned to Greg Farnum)


Related issues

Duplicated by Ceph - Bug #5922: osd: unfound objects on next (Duplicate, 08/09/2013)

History

#1 Updated by Sage Weil over 10 years ago

ubuntu@teuthology:/a/teuthology-2013-08-26_15:47:58-rados-next-testing-basic-plana/6694

cluster is still hung

#2 Updated by Ian Colle over 10 years ago

  • Assignee set to Samuel Just

Sam, please take a look.

#3 Updated by Sage Weil over 10 years ago

ubuntu@teuthology:/a/teuthology-2013-08-28_01:00:04-rados-master-testing-basic-plana/10150

#4 Updated by Samuel Just over 10 years ago

time: 2717s
log: http://qa-proxy.ceph.com/teuthology/teuthology-2013-09-09_20:00:20-rados-dumpling-testing-basic-plana/27708/

failed to become clean before timeout expired

Hung

2013-09-09 22:33:32.641520 mon.0 10.214.131.15:6789/0 3025 : [INF] pgmap v1622: 172 pgs: 171 active+clean, 1 incomplete; 21590 bytes data, 892 MB used, 2174 GB / 2291 GB avail
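
The thrasher fails the run when the cluster does not reach all active+clean within the 1200s timeout. A minimal sketch of that kind of wait loop, assuming bash and a client with admin credentials (this is not the teuthology implementation, just the shape of the check):

  # Poll 'ceph pg stat' until no non-clean pg states remain or a deadline passes.
  deadline=$((SECONDS + 1200))
  while ceph pg stat | grep -qE 'incomplete|peering|recovering|backfill|degraded|stale'; do
      if [ "$SECONDS" -ge "$deadline" ]; then
          echo "failed to become clean before timeout expired" >&2
          break
      fi
      sleep 10
  done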

#5 Updated by Samuel Just over 10 years ago

Hmm, the last osd log entry indicates that the pg in question may have gone clean?

2013-09-09 22:27:19.022997 7f1724a49700 5 osd.3 pg_epoch: 1049 pg[2.1e( empty local-les=1044 n=0 ec=1 les/c 1044/972 1043/1043/1043) [3,0] r=0 lpr=1043 pi=791-1042/8 bft=0 mlcod 0'0 active] enter Started/Primary/Active/Clean
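
That "enter Started/Primary/Active/Clean" line is the peering state machine on osd.3 logging its transition for pg 2.1e. A hedged way to trace a single pg's transitions across the collected osd logs (the remote/*/log paths assume the usual teuthology archive layout; adjust to wherever the logs were gathered):

  # Pull the state-machine transitions for one pg out of all osd logs.
  grep -hF 'pg[2.1e(' remote/*/log/ceph-osd.*.log | grep ' enter ' | sort | tail -20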

#6 Updated by Samuel Just over 10 years ago

The task was in process of letting the cluster recover with osd.2 down.

#7 Updated by Samuel Just over 10 years ago

There appear to be no pgs in incomplete state according to the osd log. Issue notifying the mon?

#8 Updated by Samuel Just over 10 years ago

From the mon logs, the last reported state seems to be:

2013-09-09 22:31:28.047555 7f56db94d700 15 mon.a@0(leader).pg v1614 got 1.3f reported at 1348:307 state incomplete -> incomplete

#9 Updated by Samuel Just over 10 years ago

1.3f does appear to be incomplete in the osd log.
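
To line the two views up, it helps to compare the mon's last report for 1.3f against what the osds logged for it. A rough sketch over the same archived logs (paths again assume the teuthology layout):

  # Last few reports the mon recorded for pg 1.3f ...
  grep -h '1\.3f reported' remote/*/log/ceph-mon.*.log | tail -5
  # ... versus the pg's state transitions on the osd side.
  grep -hF 'pg[1.3f(' remote/*/log/ceph-osd.*.log | grep ' enter ' | tail -20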

#10 Updated by Samuel Just over 10 years ago

Ok, there are enough logs to confirm that this is the primary-thinks-it's-clean vs backfill-peer-thinks-it's-clean race.

#11 Updated by Samuel Just over 10 years ago

  • Tracker changed from Bug to Fix
  • Target version set to v0.70

#12 Updated by Samuel Just over 10 years ago

  • Target version deleted (v0.70)

#13 Updated by Samuel Just over 10 years ago

The workaround I put into teuthology was inadequate; I'm going to put this in the backlog and downgrade it now that it should stop messing up the nightlies.

#14 Updated by Samuel Just over 10 years ago

  • Target version set to v0.73

#15 Updated by Samuel Just over 10 years ago

  • Story points set to 5.0

#16 Updated by Samuel Just over 10 years ago

  • Status changed from New to Resolved

I was way off on this one. We do ack the backfill completion. I suspect the actual problem was fixed by the #6585 fixes (backfill_pos vs last_backfill confusion).

#17 Updated by Samuel Just over 10 years ago

Removed the teuthology workaround as well.
