Bug #36250 (closed): ceph-osd process crashing

Added by Josh Haft over 5 years ago. Updated almost 5 years ago.

Status: Can't reproduce
Priority: Normal
Assignee:
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS): OSD
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

The ceph-osd process crashes in thread msgr-worker. This happens with all OSDs in the cluster, roughly once per day at peak frequency. It does seem to happen more often during evening/overnight hours, when there is more load on the cluster. Originally posted on the ceph-users mailing list: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-September/030040.html

Version 12.2.2

From the log:
Sep 28 00:30:10 sn02 ceph-osd[192103]: 2018-09-28 00:30:10.399237 7fb5031f6700 -1 *** Caught signal (Aborted) **
in thread 7fb5031f6700 thread_name:msgr-worker-0

Stack:
#0 0x00007f9e738764ab in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:37
#1 0x000055925e1edab6 in reraise_fatal (signum=6) at /usr/src/debug/ceph-12.2.2/src/global/signal_handler.cc:74
#2 handle_fatal_signal (signum=6) at /usr/src/debug/ceph-12.2.2/src/global/signal_handler.cc:138
#3 <signal handler called>
#4 0x00007f9e7289f1f7 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#5 0x00007f9e728a08e8 in __GI_abort () at abort.c:90
#6 0x00007f9e731a5ac5 in __gnu_cxx::__verbose_terminate_handler () at ../../../../libstdc++-v3/libsupc++/vterminate.cc:95
#7 0x00007f9e731a3a36 in __cxxabiv1::__terminate (handler=<optimized out>) at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:38
#8 0x00007f9e731a3a63 in std::terminate () at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:48
#9 0x00007f9e731fa345 in std::(anonymous namespace)::execute_native_thread_routine (__p=<optimized out>) at ../../../../../libstdc++-v3/src/c++11/thread.cc:92
#10 0x00007f9e7386ee25 in start_thread (arg=0x7f9e6ff94700) at pthread_create.c:308
#11 0x00007f9e7296234d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

I've uploaded the log from a process that crashed, with debug_ms set to 5 well before the crash occurred. ID from ceph-post-file: 83aa1468-7dc5-401a-82fd-22c344322efe

Actions #1

Updated by Greg Farnum over 5 years ago

  • Project changed from Ceph to RADOS
  • Category deleted (msgr)
  • Component(RADOS) OSD added
Actions #2

Updated by Brad Hubbard over 5 years ago

  • Assignee set to Brad Hubbard
Actions #3

Updated by Brad Hubbard over 5 years ago

Hello Josh,

Sorry it took me a while to see this.

Could you attach the output of "ceph report" please?

Actions #4

Updated by Brad Hubbard over 5 years ago

Also...

In your original post you showed a log message with the exception "buffer::malformed_input: entity_addr_t marker != 1". There is no such message in the logs you uploaded, making this seemingly a different crash (although definitely similar, and likely with a common cause).

Could you please upload at least one core dump, and more if there is evidence the crashes differ even slightly? Please also upload an sosreport from the system where the core dump was captured. You can use ceph-post-file and let us know the IDs here.

Actions #5

Updated by Brad Hubbard over 5 years ago

  • Status changed from New to Need More Info
Actions #6

Updated by Josh Haft over 5 years ago

I believe this issue was due to a malfunctioning ceph-fuse client, although I don't have data to back that up as it was not mounted in debug mode. I needed to reboot a machine which had this CephFS mounted via ceph-fuse, and the problem has not occurred since then. It was happening nightly, and has not happened again in over 6 weeks.

Actions #7

Updated by Brad Hubbard almost 5 years ago

  • Status changed from Need More Info to Can't reproduce