Bug #5392

closed

osd: unfound objects from thrashing

Added by Sage Weil almost 11 years ago. Updated almost 11 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
-
Category:
OSD
Target version:
-
% Done:
0%

Source:
Q/A
Severity:
3 - minor

Description

ubuntu@plana41:~$ ceph health detail
HEALTH_WARN 4 pgs recovering; 4 pgs stuck unclean; recovery 699/25994 degraded (2.689%); 134/12997 unfound (1.031%)
pg 0.3b is stuck unclean for 32312.629395, current state active+recovering, last acting [3,5]
pg 0.39 is stuck unclean for 32310.951536, current state active+recovering, last acting [5,3]
pg 0.37 is stuck unclean for 32309.638516, current state active+recovering, last acting [5,3]
pg 0.36 is stuck unclean for 32318.087019, current state active+recovering, last acting [3,5]
pg 0.3b is active+recovering, acting [3,5], 50 unfound
pg 0.39 is active+recovering, acting [5,3], 1 unfound
pg 0.37 is active+recovering, acting [5,3], 46 unfound
pg 0.36 is active+recovering, acting [3,5], 37 unfound
recovery 699/25994 degraded (2.689%); 134/12997 unfound (1.031%)
ubuntu@plana41:~$ ceph osd dump
epoch 193
fsid 9f81f67c-6ad0-49e9-ae23-d60e7cb3beed
created 2013-06-18 02:12:09.286514
modified 2013-06-18 03:11:56.787216
flags 

pool 0 'data' rep size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 34 last_change 63 owner 0 flags 1 crash_replay_interval 45
pool 1 'metadata' rep size 2 min_size 1 crush_ruleset 1 object_hash rjenkins pg_num 24 pgp_num 24 last_change 1 owner 0 flags 1
pool 2 'rbd' rep size 2 min_size 1 crush_ruleset 2 object_hash rjenkins pg_num 24 pgp_num 24 last_change 1 owner 0 flags 1

max_osd 6
osd.0 up   out weight 0 up_from 33 up_thru 116 down_at 31 last_clean_interval [5,28) 10.214.132.37:6812/1720 10.214.132.37:6813/1720 10.214.132.37:6814/1720 10.214.132.37:6815/1720 exists,up 89a3a95b-1039-4f5f-ae8b-89b91e47965d
osd.1 up   out weight 0 up_from 5 up_thru 13 down_at 0 last_clean_interval [0,0) 10.214.132.37:6808/19825 10.214.132.37:6809/19825 10.214.132.37:6810/19825 10.214.132.37:6811/19825 exists,up f8a5ad3e-a9c2-4fbf-a6a8-80f81e8f7b58
osd.2 up   out weight 0 up_from 2 up_thru 183 down_at 0 last_clean_interval [0,0) 10.214.132.37:6800/19821 10.214.132.37:6801/19821 10.214.132.37:6802/19821 10.214.132.37:6803/19821 exists,up c577c923-04e7-4bac-bab1-b9324d6f40e0
osd.3 up   in  weight 1 up_from 5 up_thru 192 down_at 0 last_clean_interval [0,0) 10.214.132.33:6801/17858 10.214.132.33:6803/17858 10.214.132.33:6805/17858 10.214.132.33:6807/17858 exists,up cbd7563d-5da4-40dc-b89f-ab8a2db76f34
osd.4 up   out weight 0 up_from 3 up_thru 55 down_at 0 last_clean_interval [0,0) 10.214.132.33:6800/17859 10.214.132.33:6802/17859 10.214.132.33:6804/17859 10.214.132.33:6806/17859 exists,up e6d4f9ac-7ff8-490b-b738-ab59c6e5bbb0
osd.5 up   in  weight 1 up_from 10 up_thru 186 down_at 9 last_clean_interval [5,6) 10.214.132.33:6813/18438 10.214.132.33:6814/18438 10.214.132.33:6815/18438 10.214.132.33:6816/18438 exists,up 6620bae6-5e44-48b6-801d-017e536b2aba
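
As a cross-check on the `ceph health detail` output above, the per-PG unfound counts can be summed and compared with the cluster-wide figure of 134. This is only a minimal sketch: the regular expression and the helper name `unfound_by_pg` are illustrative assumptions, not part of any Ceph tool, and the input is the four per-PG lines copied from the report.

```python
import re

# Per-PG lines copied verbatim from the `ceph health detail` output in this report.
HEALTH_DETAIL = """\
pg 0.3b is active+recovering, acting [3,5], 50 unfound
pg 0.39 is active+recovering, acting [5,3], 1 unfound
pg 0.37 is active+recovering, acting [5,3], 46 unfound
pg 0.36 is active+recovering, acting [3,5], 37 unfound
"""

# Assumed line shape: "pg <pgid> is <state>, acting [<osds>], <n> unfound".
PG_UNFOUND = re.compile(r"^pg (\S+) is (\S+), acting \[([\d,]+)\], (\d+) unfound$")


def unfound_by_pg(text):
    """Return a {pgid: unfound_count} mapping for lines that match PG_UNFOUND."""
    counts = {}
    for line in text.splitlines():
        match = PG_UNFOUND.match(line.strip())
        if match:
            counts[match.group(1)] = int(match.group(4))
    return counts


counts = unfound_by_pg(HEALTH_DETAIL)
total = sum(counts.values())
print(counts)
print("total unfound:", total)
```

The four PGs account for all 134 unfound objects, and each of them has only osd.3 and osd.5 in its acting set, the two OSDs still marked `in` in the `ceph osd dump` output.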

The job is:
ubuntu@teuthology:/var/lib/teuthworker/archive/teuthology-2013-06-18_01:00:05-rados-next-testing-basic/38759$ cat orig.config.yaml 
kernel:
  kdb: true
  sha1: dbb898fa64ead2446a8e7e40b90ab55b2e066e09
machine_type: plana
nuke-on-error: true
overrides:
  ceph:
    conf:
      global:
        ms inject delay max: 1
        ms inject delay probability: 0.005
        ms inject delay type: osd
        ms inject socket failures: 2500
      mon:
        debug mon: 20
        debug ms: 20
        debug paxos: 20
    fs: xfs
    log-whitelist:
    - slow request
    sha1: df8a3e5591948dfd94de2e06640cfe54d2de4322
  install:
    ceph:
      sha1: df8a3e5591948dfd94de2e06640cfe54d2de4322
  s3tests:
    branch: next
  workunit:
    sha1: df8a3e5591948dfd94de2e06640cfe54d2de4322
roles:
- - mon.a
  - mon.c
  - osd.0
  - osd.1
  - osd.2
- - mon.b
  - mds.a
  - osd.3
  - osd.4
  - osd.5
  - client.0
tasks:
- chef: null
- clock.check: null
- install: null
- ceph:
    log-whitelist:
    - wrongly marked me down
    - objects unfound and apparently lost
- thrashosds:
    chance_pgnum_grow: 1
    chance_pgpnum_fix: 1
    timeout: 1200
- radosbench:
    clients:
    - client.0
    time: 1800

The cluster is still stuck.


Related issues: 1 (0 open, 1 closed)

Related to Ceph - Bug #5269: osd: EEXIST on mkcoll (Resolved, 06/06/2013)
