Bug #6233

OSD crash during repair

Added by Chris Dunlop over 10 years ago. Updated almost 7 years ago.

Status: Closed
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: other
Tags:
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

On 0.56.7-1~bpo70+1, whilst trying to repair an OSD:

2013-09-05 09:19:33.020619 7f540a12d700 0 log [ERR] : 2.12 repair stat mismatch, got 2842/2843 objects, 280/280 clones, 11744127488/11748321792 bytes.
2013-09-05 09:19:33.020722 7f540a12d700 0 log [ERR] : 2.12 repair 0 missing, 1 inconsistent objects
2013-09-05 09:19:33.037816 7f540a12d700 -1 *** Caught signal (Aborted) **
in thread 7f540a12d700

ceph version 0.56.7 (14f23ab86b0058a8651895b3dc972a29459f3a33)
1: /usr/bin/ceph-osd() [0x8530a2]
2: (()+0xf030) [0x7f541ca39030]
3: (gsignal()+0x35) [0x7f541b132475]
4: (abort()+0x180) [0x7f541b1356f0]
5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f541b98789d]
6: (()+0x63996) [0x7f541b985996]
7: (()+0x639c3) [0x7f541b9859c3]
8: (()+0x63bee) [0x7f541b985bee]
9: (ceph::buffer::list::iterator::copy(unsigned int, char*)+0x127) [0x8fa9a7]
10: (object_info_t::decode(ceph::buffer::list::iterator&)+0x29) [0x95b579]
11: (object_info_t::object_info_t(ceph::buffer::list&)+0x180) [0x695ec0]
12: (PG::repair_object(hobject_t const&, ScrubMap::object*, int, int)+0xc7) [0x7646b7]
13: (PG::scrub_process_inconsistent()+0x9bd) [0x76534d]
14: (PG::scrub_finish()+0x4f) [0x76587f]
15: (PG::chunky_scrub(ThreadPool::TPHandle&)+0x10d6) [0x76cb96]
16: (PG::scrub(ThreadPool::TPHandle&)+0x138) [0x76d7e8]
17: (OSD::ScrubWQ::_process(PG*, ThreadPool::TPHandle&)+0xf) [0x70515f]
18: (ThreadPool::worker(ThreadPool::WorkThread*)+0x992) [0x8f0542]
19: (ThreadPool::WorkThread::entry()+0x10) [0x8f14d0]
20: (()+0x6b50) [0x7f541ca30b50]
21: (clone()+0x6d) [0x7f541b1daa7d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
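Reading the trace, frames 9-12 show object_info_t being decoded from a bufferlist and the copy running off the end of the buffer; the resulting exception apparently goes uncaught and reaches the terminate handler (frames 5-8), hence the abort. As the NOTE says, the raw addresses are only useful against the binary; a sketch of resolving them, assuming binutils is installed:

# generate the disassembly the NOTE asks for (this is the attached ceph-osd-objdump)
objdump -rdS /usr/bin/ceph-osd > ceph-osd-objdump
# resolve individual return addresses; -C demangles, -f prints function names
addr2line -Cfe /usr/bin/ceph-osd 0x8fa9a7 0x95b579
# or jump straight to the faulting instruction in the objdump output
grep -n '8fa9a7:' ceph-osd-objdump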

ceph-osd.6.log - OSD log (2.78 MB) Chris Dunlop, 09/04/2013 05:11 PM

ceph-osd-objdump - OSD objdump (57.9 MB) Chris Dunlop, 09/04/2013 05:11 PM

ceph-osd.6.log (2.57 MB) Chris Dunlop, 09/05/2013 07:46 PM

History

#1 Updated by Chris Dunlop over 10 years ago

The pg being repaired at the time is 2.12, which 'ceph pg dump' tells me lives on [6,7] (a quick way to double-check that mapping is sketched after the list). The attached log is the output after:

  1. ceph osd tell 6 injectargs '--debug_osd 0/10'
  2. ceph pg repair 2.12
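For the record, confirming which OSDs a PG maps to, assuming a standard ceph CLI of this vintage:

ceph pg map 2.12                    # prints the up/acting OSD sets, e.g. [6,7]
ceph pg dump | awk '$1 == "2.12"'   # or pull the same row from the full dump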

#2 Updated by Chris Dunlop over 10 years ago

Was missing xattrs:

2013-09-06 09:30:19.813811 7f0ae8cbc700 0 log [INF] : applying configuration change: internal_safe_to_start_threads = 'true'
2013-09-06 09:33:28.303658 7f0ae94bd700 0 log [ERR] : 2.12 osd.7: soid 56987a12/rb.0.17d9b.2ae8944a.000000001e11/head//2 extra attr _, extra attr snapset
2013-09-06 09:33:28.303685 7f0ae94bd700 0 log [ERR] : repair 2.12 56987a12/rb.0.17d9b.2ae8944a.000000001e11/head//2 no 'snapset' attr
2013-09-06 09:34:45.138468 7f0ae94bd700 0 log [ERR] : 2.12 repair stat mismatch, got 2722/2723 objects, 339/339 clones, 11307104768/11311299072 bytes.
2013-09-06 09:34:45.142215 7f0ae94bd700 0 log [ERR] : 2.12 repair 0 missing, 1 inconsistent objects
2013-09-06 09:34:45.206621 7f0ae94bd700 -1 *** Caught signal (Aborted) **

b5# cd /var/lib/ceph/osd/ceph-6/current
b5# find 2.12* | grep -i 17d9b.2ae8944a.000000001e11
2.12_head/DIR_2/DIR_1/DIR_A/rb.0.17d9b.2ae8944a.000000001e11__head_56987A12__2
b5# getfattr -d 2.12_head/DIR_2/DIR_1/DIR_A/rb.0.17d9b.2ae8944a.000000001e11__head_56987A12__2
<<< ...crickets... >>>

vs.

b4# cd /var/lib/ceph/osd/ceph-7/current
b4# getfattr -d 2.12_head/DIR_2/DIR_1/DIR_A/rb.0.17d9b.2ae8944a.000000001e11__head_56987A12__2
# file: 2.12_head/DIR_2/DIR_1/DIR_A/rb.0.17d9b.2ae8944a.000000001e11__head_56987A12__2
user.ceph._=0sCgjhAAAABANBAAAAAAAAACAAAAByYi4wLjE3ZDliLjJhZTg5NDRhLjAwMDAwMDAwMWUxMf7/////////EnqYVgAAAAAAAgAAAAAAAAAEAxAAAAACAAAAAAAAAP////8AAAAAAAAAAEInCgAAAAAAuEsAAEEnCgAAAAAAuEsAAAICFQAAAAgTmwEAAAAAAHD1AgAAAAAAAAAAAAAAQAAAAAAAyY4dUpjCTSACAhUAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAABCJwoAAAAAALhLAAAA
user.ceph.snapset=0sAgIZAAAAAAAAAAAAAAABAAAAAAAAAAAAAAAAAAAAAA==
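
For anyone running the same comparison, a sketch of the per-replica check (paths as in this report; getfattr dumps the user.* namespace by default, which is where these attrs live):

cd /var/lib/ceph/osd/ceph-6/current      # repeat on each replica's host
obj=$(find 2.12_head -name '*17d9b.2ae8944a.000000001e11*')
getfattr -d "$obj"              # healthy copy shows user.ceph._ and user.ceph.snapset
getfattr -n user.ceph._ "$obj"  # errors out explicitly if the object_info attr is gone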

The pool was fixed by (re)moving ceph-6/current/2.12_head/DIR_2/DIR_1/DIR_A/rb.0.17d9b.2ae8944a.000000001e11__head_56987A12__2 and running 'ceph pg repair 2.12'.
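
In sketch form, the recovery that worked here (the 'service ceph' invocation is a sysvinit-era assumption; stopping the OSD before touching files under current/ is the cautious route):

# on the host carrying osd.6
service ceph stop osd.6
mkdir -p /root/quarantine
mv /var/lib/ceph/osd/ceph-6/current/2.12_head/DIR_2/DIR_1/DIR_A/rb.0.17d9b.2ae8944a.000000001e11__head_56987A12__2 /root/quarantine/
service ceph start osd.6
ceph pg repair 2.12     # repair then restores the object from the good copy on osd.7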

See Also: http://thread.gmane.org/gmane.comp.file-systems.ceph.devel/16897

...if there isn't a qa test for missing xattrs + repair, there probably should be?
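
Something along these lines could do it; a hypothetical sketch, not an existing ceph-qa-suite task (setfattr -x removes a named attr, mimicking the damage seen here):

# on exactly one replica, against an expendable test object
obj=/var/lib/ceph/osd/ceph-6/current/2.12_head/<some test object>   # placeholder path
setfattr -x user.ceph._ "$obj"        # strip the object_info xattr
setfattr -x user.ceph.snapset "$obj"  # strip the snapset xattr
ceph pg repair 2.12                   # expected: repair heals it; this bug: OSD aborts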

#3 Updated by Samuel Just about 9 years ago

  • Status changed from New to Won't Fix

#4 Updated by Samuel Just about 9 years ago

  • Status changed from Won't Fix to 12

#5 Updated by Sage Weil almost 7 years ago

  • Status changed from 12 to Closed
