Bug #5270

closed

osd: crash in PG::peek_map_epoch()

Added by Sage Weil almost 11 years ago. Updated almost 11 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: OSD
Target version: -
% Done: 0%
Source: Q/A
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

    -3> 2013-06-06 22:11:43.947436 7f69f8cda780 10 osd.1 245 pgid 76.5 coll 76.5_head
    -2> 2013-06-06 22:11:43.947446 7f69f8cda780 15 filestore(/var/lib/ceph/osd/ceph-1) collection_getattr /var/lib/ceph/osd/ceph-1/current/76.5_head 'info'
    -1> 2013-06-06 22:11:43.947468 7f69f8cda780 10 filestore(/var/lib/ceph/osd/ceph-1) collection_getattr /var/lib/ceph/osd/ceph-1/current/76.5_head 'info' = -61
     0> 2013-06-06 22:11:43.950057 7f69f8cda780 -1 *** Caught signal (Aborted) **
 in thread 7f69f8cda780

 ceph version 0.63-393-g08923eb (08923eb842a9768ff556939221e63b983724e9bf)
 1: ceph-osd() [0x7a825a]
 2: (()+0xfcb0) [0x7f69f86bdcb0]
 3: (gsignal()+0x35) [0x7f69f678b425]
 4: (abort()+0x17b) [0x7f69f678eb8b]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f69f70dd69d]
 6: (()+0xb5846) [0x7f69f70db846]
 7: (()+0xb5873) [0x7f69f70db873]
 8: (()+0xb596e) [0x7f69f70db96e]
 9: (ceph::buffer::list::iterator::copy(unsigned int, char*)+0x127) [0x85ab37]
 10: (PG::peek_map_epoch(ObjectStore*, coll_t, hobject_t&, ceph::buffer::list*)+0x112) [0x6c1e82]
 11: (OSD::load_pgs()+0x14dd) [0x64cd6d]
 12: (OSD::init()+0xf85) [0x64f1c5]
 13: (main()+0x1cff) [0x5836cf]
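
Frames 5 to 9 suggest that an exception thrown from ceph::buffer::list::iterator::copy() went uncaught and ended in abort(), right after collection_getattr returned -61 (ENODATA on Linux). The sketch below illustrates that failure mode only; it is not the Ceph code. The names EndOfBuffer, copy_out, peek_epoch_unchecked and peek_epoch_checked are hypothetical, and the 4-byte little-endian epoch layout is an assumption made for illustration.

```python
import struct

ENODATA_RET = -61  # return value seen in the log above (ENODATA on Linux)


class EndOfBuffer(Exception):
    """Hypothetical stand-in for the exception a bounds-checked copy raises."""


def copy_out(buf: bytes, offset: int, length: int) -> bytes:
    """Copy `length` bytes starting at `offset`, refusing to read past the end."""
    if offset + length > len(buf):
        raise EndOfBuffer(f"need {length} bytes at {offset}, have {len(buf)}")
    return buf[offset:offset + length]


def peek_epoch_unchecked(ret: int, buf: bytes) -> int:
    """Ignore the getattr return code and decode anyway (the failing pattern)."""
    # Assumed layout: a 4-byte little-endian epoch at the start of the buffer.
    return struct.unpack("<I", copy_out(buf, 0, 4))[0]


def peek_epoch_checked(ret: int, buf: bytes):
    """Return the epoch, or None when the attribute could not be read."""
    if ret < 0:
        return None
    return struct.unpack("<I", copy_out(buf, 0, 4))[0]
```

With ret = -61 and an empty buffer, peek_epoch_unchecked raises EndOfBuffer, while peek_epoch_checked returns None so the caller can decide how to handle the missing attribute.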

Several instances on current master. One test was:

ubuntu@teuthology:/a/sage-2013-06-06_17:56:44-rados-wip-mon-testing-basic/32596$ cat orig.config.yaml 
kernel:
  kdb: true
  sha1: 19bb6a83cb93383b363cc5956e304213f0f1b79f
machine_type: plana
nuke-on-error: true
overrides:
  ceph:
    conf:
      global:
        ms inject delay max: 1
        ms inject delay probability: 0.005
        ms inject delay type: osd
        ms inject socket failures: 2500
      mon:
        debug mon: 20
        debug ms: 20
        debug paxos: 20
    fs: xfs
    log-whitelist:
    - slow request
    sha1: 08923eb842a9768ff556939221e63b983724e9bf
  install:
    ceph:
      sha1: 08923eb842a9768ff556939221e63b983724e9bf
  s3tests:
    branch: master
  workunit:
    sha1: 08923eb842a9768ff556939221e63b983724e9bf
roles:
- - mon.a
  - mon.c
  - osd.0
  - osd.1
  - osd.2
- - mon.b
  - mds.a
  - osd.3
  - osd.4
  - osd.5
  - client.0
tasks:
- chef: null
- clock.check: null
- install: null
- ceph:
    log-whitelist:
    - wrongly marked me down
    - objects unfound and apparently lost
- thrashosds:
    chance_pgnum_grow: 1
    chance_pgpnum_fix: 1
    timeout: 1200
- ceph-fuse: null
- workunit:
    clients:
      client.0:
      - rados/test.sh

Looking at the store, the xattr is not there:

root@plana24:/var/lib/ceph/osd/ceph-1/current/76.5_head# getfattr -d .
# file: .
user.cephos.collection_version=0sAwAAAA==
user.cephos.phash.contents=0sAQAAAAAAAAAAAAAAAAAAAAA=
user.cephos.seq=0sAQEQAAAAHRMAAAAAAAAAAAAABQAAAAA=
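
The same check can be scripted across many collection directories. This is a rough sketch, assuming a Linux host (os.listxattr is Linux-only) and read access to the store. The helper names missing_names and missing_xattrs are hypothetical, and the exact on-disk name of the 'info' attribute is not shown in this report, so the required names are passed in by the caller.

```python
import os
from typing import Iterable, List


def missing_names(present: Iterable[str], required: Iterable[str]) -> List[str]:
    """Return the required names that do not appear in `present`, in order."""
    present_set = set(present)
    return [name for name in required if name not in present_set]


def missing_xattrs(path: str, required: Iterable[str]) -> List[str]:
    """List the required extended attributes that are absent on `path`."""
    # os.listxattr returns the attribute names set on the file or directory.
    return missing_names(os.listxattr(path), required)
```

For example, calling missing_xattrs on a collection directory with the expected attribute names returns an empty list when all of them are present, and the absent names otherwise.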

Other jobs, also with PG splitting:

ubuntu@teuthology:/a/sage-2013-06-06_17:56:44-rados-wip-mon-testing-basic/32578
ubuntu@teuthology:/a/sage-2013-06-06_17:56:44-rados-wip-mon-testing-basic/32584
ubuntu@teuthology:/a/sage-2013-06-06_17:56:44-rados-wip-mon-testing-basic/32590


Related issues 1 (0 open, 1 closed)

Has duplicate Ceph - Bug #5269: osd: EEXIST on mkcoll (Resolved, 06/06/2013)

Actions #1

Updated by Ian Colle almost 11 years ago

  • Assignee set to Samuel Just
Actions #2

Updated by Samuel Just almost 11 years ago

Very odd. That xattr is written atomically on pg collection creation and never overwritten thereafter.

Actions #3

Updated by Sergey Fionov almost 11 years ago

I got the same error when some pginfo files were lost due to XFS corruption. Removing the PG collection allowed the OSD to start again.

Actions #4

Updated by Samuel Just almost 11 years ago

  • Status changed from 12 to Resolved