Bug #4556 (closed)

OSDs crash with OSD::handle_op during recovery

Added by Wido den Hollander about 11 years ago. Updated about 9 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
OSD
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

While tracking down #3816 I stumbled upon this one multiple times.

I tried upgrading to 0.56.4 to be sure, but that didn't change anything.

 ceph version 0.56.4 (63b0f854d1cef490624de5d6cf9039735c7de5ca)
 1: /usr/bin/ceph-osd() [0x788fba]
 2: (()+0xfcb0) [0x7f083e63ecb0]
 3: (gsignal()+0x35) [0x7f083cffd425]
 4: (abort()+0x17b) [0x7f083d000b8b]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f083d94f69d]
 6: (()+0xb5846) [0x7f083d94d846]
 7: (()+0xb5873) [0x7f083d94d873]
 8: (()+0xb596e) [0x7f083d94d96e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1df) [0x8343af]
 10: (OSD::handle_op(std::tr1::shared_ptr<OpRequest>)+0x12d8) [0x624668]
 11: (OSD::dispatch_op(std::tr1::shared_ptr<OpRequest>)+0xe9) [0x62cba9]
 12: (OSD::do_waiters()+0x1a5) [0x62d105]
 13: (OSD::ms_dispatch(Message*)+0x1c2) [0x636a82]
 14: (DispatchQueue::entry()+0x349) [0x8c7399]
 15: (DispatchQueue::DispatchThread::entry()+0xd) [0x81fbad]
 16: (()+0x7e9a) [0x7f083e636e9a]
 17: (clone()+0x6d) [0x7f083d0bacbd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
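As the NOTE says, the trace needs to be interpreted against the matching binary. A minimal sketch of one way to do that (the `/tmp` path and the two sample frames below are illustrative, copied from the trace above): pull the bracketed return addresses out with standard tools, then feed them to `addr2line` on a machine that has the same `ceph-osd` build with debug symbols.

```shell
# Illustrative sample: two frames copied from the backtrace above.
cat > /tmp/trace.txt <<'EOF'
 10: (OSD::handle_op(std::tr1::shared_ptr<OpRequest>)+0x12d8) [0x624668]
 11: (OSD::dispatch_op(std::tr1::shared_ptr<OpRequest>)+0xe9) [0x62cba9]
EOF

# Extract the bracketed addresses (0x624668, 0x62cba9). With debug symbols
# installed, each address could then be resolved to a source line with e.g.:
#   addr2line -Cfe /usr/bin/ceph-osd 0x624668
# (assumption: the binary is the exact build that produced the trace).
grep -o '\[0x[0-9a-f]*\]' /tmp/trace.txt | tr -d '[]'
```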

This cluster has already sustained a lot of issues and some OSDs have been down and out for quite some time now.

I added the logs of two OSDs:
- osd.2
- osd.38

It goes wrong during the peering process. All 40 OSDs are active and trying to recover, but one by one they keep going down until I eventually end up with 11 OSDs and a cluster in a very bad state.

osdmap e21977: 40 osds: 11 up, 11 in

In the end just 11 OSDs survive, though not always the same ones; I attached the output of "ceph osd tree" to show this.

The attached logs were produced with debug osd = 20.
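For reference, that debug level corresponds to a ceph.conf entry along these lines (a sketch; the exact section placement and any additional debug subsystems may differ per deployment):

```ini
; Assumed ceph.conf fragment matching the log level mentioned above
[osd]
    debug osd = 20
```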

From what I can make out of the logs, it goes wrong when OSDs transition to the Primary state for a PG; that transition seems to fail and the OSD crashes.


Files

ceph-osd.2.log.gz (901 KB), Wido den Hollander, 03/26/2013 05:41 AM
ceph-osd.38.log.gz (412 KB), Wido den Hollander, 03/26/2013 05:41 AM
tree.1.txt (1.12 KB), Wido den Hollander, 03/26/2013 05:41 AM
tree.2.txt (1.12 KB), Wido den Hollander, 03/26/2013 05:41 AM

Related issues: 1 (0 open, 1 closed)

Related to Ceph - Bug #3816: osd/OSD.cc: 3318: FAILED assert(osd_lock.is_locked()) (Resolved, Sage Weil, 01/16/2013)
