Bug #4556
Status: Closed
OSDs crash with OSD::handle_op during recovery
Description
While tracking down #3816 I stumbled upon this one multiple times.
I tried upgrading to 0.56.4 to be sure, but that didn't change anything.
ceph version 0.56.4 (63b0f854d1cef490624de5d6cf9039735c7de5ca)
 1: /usr/bin/ceph-osd() [0x788fba]
 2: (()+0xfcb0) [0x7f083e63ecb0]
 3: (gsignal()+0x35) [0x7f083cffd425]
 4: (abort()+0x17b) [0x7f083d000b8b]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f083d94f69d]
 6: (()+0xb5846) [0x7f083d94d846]
 7: (()+0xb5873) [0x7f083d94d873]
 8: (()+0xb596e) [0x7f083d94d96e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1df) [0x8343af]
 10: (OSD::handle_op(std::tr1::shared_ptr<OpRequest>)+0x12d8) [0x624668]
 11: (OSD::dispatch_op(std::tr1::shared_ptr<OpRequest>)+0xe9) [0x62cba9]
 12: (OSD::do_waiters()+0x1a5) [0x62d105]
 13: (OSD::ms_dispatch(Message*)+0x1c2) [0x636a82]
 14: (DispatchQueue::entry()+0x349) [0x8c7399]
 15: (DispatchQueue::DispatchThread::entry()+0xd) [0x81fbad]
 16: (()+0x7e9a) [0x7f083e636e9a]
 17: (clone()+0x6d) [0x7f083d0bacbd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
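(For anyone trying to map the raw addresses above back to source lines: a rough sketch, assuming you have the exact 0.56.4 ceph-osd binary with debug symbols at hand. The binary path is taken from frame 1 and the address from frame 10.)

    # disassemble with interleaved source, as the NOTE suggests
    objdump -rdS /usr/bin/ceph-osd > ceph-osd.asm

    # or resolve a single frame address to function and file:line
    addr2line -Cfe /usr/bin/ceph-osd 0x624668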
This cluster has already sustained a lot of issues and some OSDs have been down and out for quite some time now.
I added the logs of two OSDs:
- osd.2
- osd.38
It goes wrong during the peering process. All 40 OSDs come up and try to recover, but one by one they keep crashing until I eventually end up with just 11 OSDs and a cluster in a very bad state.
osdmap e21977: 40 osds: 11 up, 11 in
In the end only 11 OSDs survive, but not always the same ones; I attached the output of "ceph osd tree".
The attached logs were produced with "debug osd = 20".
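(For anyone reproducing this, a minimal sketch of how to get that verbosity: the ceph.conf stanza needs a daemon restart, while injectargs applies it at runtime; osd.2 is just an example target.)

    ; in ceph.conf, then restart the daemons
    [osd]
        debug osd = 20

    # or inject at runtime into a single daemon
    ceph tell osd.2 injectargs '--debug-osd 20'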
From what I can make out of the logs, it goes wrong when an OSD transitions to the Primary state for a PG; that transition seems to fail and the OSD crashes.
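(Frame 9 in the backtrace shows this is a failed assertion rather than a stray segfault: ceph::__ceph_assert_fail prints the failed condition and aborts, so the interesting frame is the one after it in the listing, OSD::handle_op. A generic C++ sketch of that pattern, purely illustrative and not Ceph's actual code:)

    #include <cstdio>
    #include <cstdlib>

    // Hypothetical assert handler in the style of ceph::__ceph_assert_fail;
    // names and message format are illustrative only.
    [[noreturn]] void assert_fail(const char* expr, const char* file,
                                  int line, const char* func) {
        std::fprintf(stderr, "%s: %d: FAILED assert(%s) in %s\n",
                     file, line, expr, func);
        std::abort();  // raises SIGABRT -> the gsignal/abort frames above
    }

    #define MY_ASSERT(expr) \
        ((expr) ? (void)0 : assert_fail(#expr, __FILE__, __LINE__, __func__))

    int main() {
        bool pg_state_valid = false;  // stand-in for whatever handle_op checks
        MY_ASSERT(pg_state_valid);    // fails and aborts, like frames 9/10 above
    }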
Files