Bug #36250 (closed): ceph-osd process crashing

Added by Josh Haft over 5 years ago. Updated almost 5 years ago.

Status: Can't reproduce
Priority: Normal
Assignee:
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS): OSD
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

The ceph-osd process crashes in thread msgr-worker. This happens with all OSDs in the cluster, roughly once per day at peak frequency. It does seem to happen more often during evening/overnight hours, when there is more load on the cluster. Originally posted on the ceph-users mailing list: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-September/030040.html

Version 12.2.2

From the log:
Sep 28 00:30:10 sn02 ceph-osd[192103]: 2018-09-28 00:30:10.399237 7fb5031f6700 -1 *** Caught signal (Aborted) **
in thread 7fb5031f6700 thread_name:msgr-worker-0

Stack:
#0 0x00007f9e738764ab in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:37
#1 0x000055925e1edab6 in reraise_fatal (signum=6) at /usr/src/debug/ceph-12.2.2/src/global/signal_handler.cc:74
#2 handle_fatal_signal (signum=6) at /usr/src/debug/ceph-12.2.2/src/global/signal_handler.cc:138
#3 <signal handler called>
#4 0x00007f9e7289f1f7 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#5 0x00007f9e728a08e8 in __GI_abort () at abort.c:90
#6 0x00007f9e731a5ac5 in __gnu_cxx::__verbose_terminate_handler () at ../../../../libstdc++-v3/libsupc++/vterminate.cc:95
#7 0x00007f9e731a3a36 in __cxxabiv1::__terminate (handler=<optimized out>) at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:38
#8 0x00007f9e731a3a63 in std::terminate () at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:48
#9 0x00007f9e731fa345 in std::(anonymous namespace)::execute_native_thread_routine (__p=<optimized out>) at ../../../../../libstdc++-v3/src/c++11/thread.cc:92
#10 0x00007f9e7386ee25 in start_thread (arg=0x7f9e6ff94700) at pthread_create.c:308
#11 0x00007f9e7296234d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

I've uploaded the log from a process that crashed, with debug_ms set to 5 well before the crash occurred. ID from ceph-post-file: 83aa1468-7dc5-401a-82fd-22c344322efe

Actions #1

Updated by Greg Farnum over 5 years ago

  • Project changed from Ceph to RADOS
  • Category deleted (msgr)
  • Component(RADOS) OSD added
Actions #2

Updated by Brad Hubbard over 5 years ago

  • Assignee set to Brad Hubbard
Actions #3

Updated by Brad Hubbard over 5 years ago

Hello Josh,

Sorry it took me a while to see this.

Could you attach the output of "ceph report" please?

Actions #4

Updated by Brad Hubbard over 5 years ago

Also...

In your original post you showed a log message with the exception "buffer::malformed_input: entity_addr_t marker != 1". There is no such message in the logs you uploaded, making this seemingly a different crash (although definitely similar, and likely with a common cause).

Could you please upload at least one core dump, and more if there is evidence the crashes differ even slightly? Please also upload an sosreport from the system where the core dump was captured. You can use ceph-post-file and let us know the IDs here.

Actions #5

Updated by Brad Hubbard over 5 years ago

  • Status changed from New to Need More Info
Actions #6

Updated by Josh Haft over 5 years ago

I believe this issue was due to a malfunctioning ceph-fuse client, although I don't have data to back that up as it was not mounted in debug mode. I needed to reboot a machine which had this CephFS mounted via ceph-fuse, and the problem has not occurred since then. It was happening nightly, and has not happened again in over 6 weeks.

Actions #7

Updated by Brad Hubbard almost 5 years ago

  • Status changed from Need More Info to Can't reproduce