Bug #3770

closed

OSD crashes on boot

Added by Faidon Liambotis over 11 years ago. Updated over 11 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: OSD
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression:
Severity:
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

One of my 0.56.1 OSDs crashed and could not boot again: it kept hitting the tp_op heartbeat timeout, and even after I increased that timeout the log showed nothing but:
2013-01-08 23:57:25.337731 7fc515c26700 0 -- 0.0.0.0:6805/31953 >> 10.64.0.174:6818/8710 pipe(0x3cceb6c0 sd=56 :0 pgs=0 cs=0 l=0).fault with nothing to send, going to standby
2013-01-08 23:57:29.043846 7fc515b25700 0 -- 0.0.0.0:6805/31953 >> 10.64.0.174:6845/4111 pipe(0x3cceb240 sd=57 :32953 pgs=0 cs=0 l=0).connect claims to be 10.64.0.174:6845/11414 not 10.64.0.174:6845/4111 - wrong node!
2013-01-08 23:57:29.043957 7fc515b25700 0 -- 0.0.0.0:6805/31953 >> 10.64.0.174:6845/4111 pipe(0x3cceb240 sd=57 :32953 pgs=0 cs=0 l=0).fault with nothing to send, going to standby
2013-01-08 23:57:38.310206 7fc515a24700 0 -- 0.0.0.0:6805/31953 >> 10.64.0.173:6842/821 pipe(0x16bf0d80 sd=58 :0 pgs=0 cs=0 l=0).fault with nothing to send, going to standby
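
To triage messages like these across a larger log, a small parser can group them by peer address and event type. The following is a minimal sketch only: the regular expression is inferred from the four lines quoted above and is not an official Ceph log format, and the names `LINE_RE`, `parse_line`, and `summarize` are hypothetical helpers introduced for illustration.

```python
import re

# The four messenger lines quoted in this report, used as sample input.
SAMPLE = [
    "2013-01-08 23:57:25.337731 7fc515c26700 0 -- 0.0.0.0:6805/31953 >> 10.64.0.174:6818/8710 pipe(0x3cceb6c0 sd=56 :0 pgs=0 cs=0 l=0).fault with nothing to send, going to standby",
    "2013-01-08 23:57:29.043846 7fc515b25700 0 -- 0.0.0.0:6805/31953 >> 10.64.0.174:6845/4111 pipe(0x3cceb240 sd=57 :32953 pgs=0 cs=0 l=0).connect claims to be 10.64.0.174:6845/11414 not 10.64.0.174:6845/4111 - wrong node!",
    "2013-01-08 23:57:29.043957 7fc515b25700 0 -- 0.0.0.0:6805/31953 >> 10.64.0.174:6845/4111 pipe(0x3cceb240 sd=57 :32953 pgs=0 cs=0 l=0).fault with nothing to send, going to standby",
    "2013-01-08 23:57:38.310206 7fc515a24700 0 -- 0.0.0.0:6805/31953 >> 10.64.0.173:6842/821 pipe(0x16bf0d80 sd=58 :0 pgs=0 cs=0 l=0).fault with nothing to send, going to standby",
]

# Assumption: this pattern is derived only from the sample lines above.
LINE_RE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+"
    r"(?P<thread>[0-9a-f]+)\s+(?P<level>\d+) -- "
    r"(?P<local>\S+) >> (?P<peer>\S+) "
    r"pipe\((?P<pipe>0x[0-9a-f]+) [^)]*\)\.(?P<event>.*)$"
)


def parse_line(line):
    """Return a dict of fields for a messenger line, or None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    if record["event"].endswith("wrong node!"):
        record["kind"] = "wrong node"
    elif record["event"].startswith("fault"):
        record["kind"] = "fault"
    else:
        record["kind"] = "other"
    return record


def summarize(lines):
    """Count matching lines per (peer address, event kind)."""
    counts = {}
    for line in lines:
        record = parse_line(line)
        if record is None:
            continue  # skip lines that are not messenger pipe events
        key = (record["peer"], record["kind"])
        counts[key] = counts.get(key, 0) + 1
    return counts


if __name__ == "__main__":
    for (peer, kind), count in sorted(summarize(SAMPLE).items()):
        print(peer, kind, count)
```

Run against the full attached log, the same functions would show which peers produce "wrong node" mismatches as opposed to plain connection faults.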

I left the cluster alone for a few hours to recover and become healthy again. It is now HEALTH_OK and all PGs are active+clean.

However, when I now try to start the OSD in question, it dies immediately on boot at assert(_get_map_bl(epoch, bl)). Attached are the --debug_ms 20 --debug_osd 20 log and a full backtrace from gdb.

This is with the ceph.com 0.56.1 packages on Ubuntu 12.04 LTS.


Files

ceph-osd.27.log (3.19 MB), Faidon Liambotis, 01/09/2013 12:23 AM
ceph-osd.27.gdb (17.7 KB), Faidon Liambotis, 01/09/2013 12:23 AM
ceph-osd.27.meta.gz (273 KB), Faidon Liambotis, 01/10/2013 04:15 PM