Bug #37721

mds crashes frequently when using snapshots in CephFS on mimic

Added by Soenke Schippmann 27 days ago. Updated 8 days ago.

Status: Pending Backport
Priority: High
Assignee:
Category: Snapshots
Target version:
Start date:
Due date:
% Done: 0%
Source: Community (user)
Tags:
Backport: mimic
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDS
Labels (FS): crash, snapshots
Pull request ID:

Description

Since we started using snapshots in CephFS, we have been experiencing frequent crashes (every couple of hours) of the active MDS daemon. The file system was created with Mimic (13.2.1), so snapshots were already enabled by default.

Here is an example backtrace:

(gdb) bt
#0  raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00005643386d53ae in reraise_fatal (signum=6) at ./src/global/signal_handler.cc:74
#2  handle_fatal_signal (signum=6) at ./src/global/signal_handler.cc:138
#3  <signal handler called>
#4  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#5  0x00007f10c61c5801 in __GI_abort () at abort.c:79
#6  0x00007f10c77b4080 in ceph::__ceph_assert_fail(char const*, char const*, int, char const*) () from /usr/lib/ceph/libceph-common.so.0
#7  0x00007f10c77b40f7 in ceph::__ceph_assert_fail(ceph::assert_data const&) () from /usr/lib/ceph/libceph-common.so.0
#8  0x000056433855c4a1 in Locker::snapflush_nudge (this=0x56433afa0640, in=0x5643d0467800) at ./src/mds/Locker.cc:2577
#9  0x000056433855c5f6 in Locker::caps_tick (this=0x56433afa0640) at ./src/mds/Locker.cc:3641
#10 0x0000564338560932 in Locker::tick (this=<optimized out>) at ./src/mds/Locker.cc:127
#11 0x00005643383fe754 in MDSRankDispatcher::tick (this=0x56433b24f000) at ./src/mds/MDSRank.cc:325
#12 0x00005643383ee96c in boost::function1<void, int>::operator() (a0=<optimized out>, this=<optimized out>)
    at ./obj-x86_64-linux-gnu/boost/include/boost/function/function_template.hpp:768
#13 FunctionContext::finish (this=<optimized out>, r=<optimized out>) at ./src/include/Context.h:522
#14 0x00005643383ec719 in Context::complete (this=0x56433b1432c0, r=<optimized out>) at ./src/include/Context.h:77
#15 0x00007f10c77b08ab in SafeTimer::timer_thread() () from /usr/lib/ceph/libceph-common.so.0
#16 0x00007f10c77b1e6d in SafeTimerThread::entry() () from /usr/lib/ceph/libceph-common.so.0
#17 0x00007f10c70c06db in start_thread (arg=0x7f10bc3b7700) at pthread_create.c:463
#18 0x00007f10c62a688f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

A coredump, the mds log and the ceph.conf have been uploaded with ceph-post-file:

  • ceph-mds.log: 7bc127e2-27e9-4a24-b126-b7bd930a82bd
  • core.safe_timer.2474: 7af3b92f-0970-4885-ad86-f5c860616554
  • ceph.conf: 089f90bf-83e5-4915-bc47-1cafeba0c3ff

Thanks for investigating this issue.


Related issues

Copied to fs - Backport #37818: mimic: mds crashes frequently when using snapshots in CephFS on mimic (In Progress)

History

#1 Updated by Greg Farnum 27 days ago

  • Project changed from RADOS to fs
  • Category set to Snapshots
  • Priority changed from Normal to High
  • Component(FS) MDS added
  • Labels (FS) crash, snapshots added

#2 Updated by Zheng Yan 14 days ago

  • Status changed from New to Need Review

#3 Updated by Patrick Donnelly 14 days ago

  • Assignee set to Zheng Yan
  • Target version set to v14.0.0
  • Start date deleted (12/20/2018)
  • Source set to Community (user)
  • Backport set to mimic
  • Pull request ID set to 25741
  • Affected Versions deleted (v13.2.1)
  • ceph-qa-suite deleted (fs)

#4 Updated by Patrick Donnelly 8 days ago

  • Status changed from Need Review to Pending Backport

#5 Updated by Nathan Cutler 8 days ago

  • Copied to Backport #37818: mimic: mds crashes frequently when using snapshots in CephFS on mimic added
