Bug #5873

closed

osd: unfound object from thrashing when all osds are up

Added by Sage Weil over 10 years ago. Updated over 10 years ago.

Status: Duplicate
Priority: Urgent
Assignee:
Category: OSD
Target version: -
% Done: 0%
Source: Q/A
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

ubuntu@plana26:~$ ceph osd dump
epoch 24
fsid a235a407-d6c3-40cb-9a2d-e02877f6a370
created 2013-08-03 09:55:28.662129
modified 2013-08-03 09:57:06.624303
flags 

pool 0 'data' rep size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 24 pgp_num 24 last_change 7 owner 0 crash_replay_interval 45
pool 1 'metadata' rep size 2 min_size 1 crush_ruleset 1 object_hash rjenkins pg_num 34 pgp_num 24 last_change 16 owner 0
pool 2 'rbd' rep size 2 min_size 1 crush_ruleset 2 object_hash rjenkins pg_num 34 pgp_num 24 last_change 13 owner 0

max_osd 6
osd.0 up   out weight 0 up_from 3 up_thru 4 down_at 0 last_clean_interval [0,0) 10.214.131.14:6801/30903 10.214.131.14:6803/30903 10.214.131.14:6805/30903 10.214.131.14:6807/30903 exists,up 01aba9ee-208b-4584-8894-8959bf0f43e6
osd.1 up   out weight 0 up_from 3 up_thru 8 down_at 0 last_clean_interval [0,0) 10.214.131.14:6800/30902 10.214.131.14:6802/30902 10.214.131.14:6804/30902 10.214.131.14:6806/30902 exists,up c098f8ad-e4e5-4418-982f-f09441395ae7
osd.2 up   in  weight 1 up_from 2 up_thru 23 down_at 0 last_clean_interval [0,0) 10.214.131.14:6808/30904 10.214.131.14:6809/30904 10.214.131.14:6810/30904 10.214.131.14:6811/30904 exists,up 8a977a93-30e3-4a4c-a9ea-2c871338bc22
osd.3 up   in  weight 1 up_from 3 up_thru 22 down_at 0 last_clean_interval [0,0) 10.214.132.38:6804/25399 10.214.132.38:6805/25399 10.214.132.38:6806/25399 10.214.132.38:6807/25399 exists,up 96836c48-dabb-4794-863a-8b2e8191d326
osd.4 up   in  weight 1 up_from 3 up_thru 22 down_at 0 last_clean_interval [0,0) 10.214.132.38:6800/25397 10.214.132.38:6801/25397 10.214.132.38:6802/25397 10.214.132.38:6803/25397 exists,up 551d1bf2-8c3c-45bd-b472-aef32b061530
osd.5 up   in  weight 1 up_from 3 up_thru 22 down_at 0 last_clean_interval [0,0) 10.214.132.38:6808/25403 10.214.132.38:6809/25403 10.214.132.38:6810/25403 10.214.132.38:6811/25403 exists,up b6ca68f7-bd89-4e98-a288-00dff5db7ea2
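
For anyone triaging a similar run, a minimal sketch of tallying the up/in state from `ceph osd dump` text. The sample lines are abbreviated copies of the dump above (addresses and UUIDs dropped), and the helper name `summarize_osds` is hypothetical, not part of any Ceph tool. It assumes the `osd.N up|down in|out weight W` line layout shown in this dump.

```python
import re

# Abbreviated copies of the osd lines from the dump above
# (addresses and UUIDs dropped for brevity).
OSD_DUMP = """\
osd.0 up   out weight 0 up_from 3 up_thru 4
osd.1 up   out weight 0 up_from 3 up_thru 8
osd.2 up   in  weight 1 up_from 2 up_thru 23
osd.3 up   in  weight 1 up_from 3 up_thru 22
osd.4 up   in  weight 1 up_from 3 up_thru 22
osd.5 up   in  weight 1 up_from 3 up_thru 22
"""

# Assumed layout: "osd.<id> <up|down> <in|out> weight <w> ..."
OSD_LINE = re.compile(r"^osd\.(\d+)\s+(up|down)\s+(in|out)\s+weight\s+(\S+)")


def summarize_osds(dump_text):
    """Return the ids of OSDs that are up and the ids that are in."""
    up_ids, in_ids = [], []
    for line in dump_text.splitlines():
        match = OSD_LINE.match(line)
        if match is None:
            continue  # skip pool lines, blank lines, headers
        osd_id = int(match.group(1))
        if match.group(2) == "up":
            up_ids.append(osd_id)
        if match.group(3) == "in":
            in_ids.append(osd_id)
    return {"up": up_ids, "in": in_ids}


summary = summarize_osds(OSD_DUMP)
print(f"{len(summary['up'])} up, {len(summary['in'])} in")
```

On this dump the tally is 6 up and 4 in, with osd.0 and osd.1 marked out.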

ubuntu@plana26:~$ ceph -s
  cluster a235a407-d6c3-40cb-9a2d-e02877f6a370
   health HEALTH_WARN 1 pgs recovering; 1 pgs stuck unclean; 27 requests are blocked > 32 sec; recovery 838/29008 degraded (2.889%); 1/14504 unfound (0.007%)
   monmap e1: 3 mons at {a=10.214.131.14:6789/0,b=10.214.132.38:6789/0,c=10.214.131.14:6790/0}, election epoch 6, quorum 0,1,2 a,b,c
   osdmap e24: 6 osds: 6 up, 4 in
    pgmap v806: 92 pgs: 91 active+clean, 1 active+recovering; 57932 MB data, 114 GB used, 2495 GB / 2749 GB avail; 838/29008 degraded (2.889%); 1/14504 unfound (0.007%)
   mdsmap e5: 1/1/1 up {0=a=up:active}
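
The percentages in the health line can be re-derived from the raw counts. The sketch below copies the health string from the output above; `parse_ratio` is a hypothetical helper, and reading the degraded denominator as object copies (objects times rep size 2) is an inference from these numbers, not something the output states.

```python
import re

# Health string copied from the `ceph -s` output above.
HEALTH = (
    "HEALTH_WARN 1 pgs recovering; 1 pgs stuck unclean; "
    "27 requests are blocked > 32 sec; "
    "recovery 838/29008 degraded (2.889%); 1/14504 unfound (0.007%)"
)


def parse_ratio(text, label):
    """Find '<num>/<den> <label>' in text and return (num, den), or None."""
    match = re.search(r"(\d+)/(\d+) " + re.escape(label), text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


degraded = parse_ratio(HEALTH, "degraded")
unfound = parse_ratio(HEALTH, "unfound")

degraded_pct = round(100 * degraded[0] / degraded[1], 3)
unfound_pct = round(100 * unfound[0] / unfound[1], 3)

# Inference: with rep size 2 on every pool, the degraded denominator
# matches twice the object count used for the unfound ratio.
copies_match = degraded[1] == 2 * unfound[1]

print(degraded, degraded_pct)
print(unfound, unfound_pct)
print(copies_match)
```

The recomputed values agree with the reported 2.889% and 0.007%.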

ubuntu@teuthology:/a/teuthology-2013-08-02_01:00:11-rados-next-testing-basic-plana/93499$ cat orig.config.yaml 
kernel:
  kdb: true
  sha1: 05542c395ce50bb1750cc6fead85727903fc3e72
machine_type: plana
nuke-on-error: true
os_type: ubuntu
overrides:
  admin_socket:
    branch: next
  ceph:
    conf:
      global:
        ms inject socket failures: 5000
      mon:
        debug mon: 20
        debug ms: 1
        debug paxos: 20
    fs: ext4
    log-whitelist:
    - slow request
    sha1: ef036bd4bc0e79bff8a5805800fbdeb0cc2db6ae
  ceph-deploy:
    branch:
      dev: next
    conf:
      client:
        log file: /var/log/ceph/ceph-$name.$pid.log
      mon:
        debug mon: 1
        debug ms: 20
        debug paxos: 20
  install:
    ceph:
      sha1: ef036bd4bc0e79bff8a5805800fbdeb0cc2db6ae
  s3tests:
    branch: next
  workunit:
    sha1: ef036bd4bc0e79bff8a5805800fbdeb0cc2db6ae
roles:
- - mon.a
  - mon.c
  - osd.0
  - osd.1
  - osd.2
- - mon.b
  - mds.a
  - osd.3
  - osd.4
  - osd.5
  - client.0
tasks:
- chef: null
- clock.check: null
- install: null
- ceph:
    log-whitelist:
    - wrongly marked me down
    - objects unfound and apparently lost
- thrashosds:
    chance_pgnum_grow: 3
    chance_pgpnum_fix: 1
    timeout: 1200
- radosbench:
    clients:
    - client.0
    time: 1800
teuthology_branch: next

cluster is still running

Related issues 1 (0 open, 1 closed)

Is duplicate of Ceph - Bug #5799: SIGABRT in build_push_op -> object_info_t::decode (Resolved, Samuel Just, 07/29/2013)

#1

Updated by Sage Weil over 10 years ago

Also seen at ubuntu@teuthology:/a/teuthology-2013-08-02_01:00:11-rados-next-testing-basic-plana/93547. Still running too.

#2

Updated by Ian Colle over 10 years ago

  • Assignee set to Samuel Just
#3

Updated by Samuel Just over 10 years ago

  • Status changed from New to Duplicate

This was probably caused by the same bug as 5799. We'll see if it comes up again with that patch in next.
