Bug #41187 (closed)

src/mds/Server.cc: 958: FAILED assert(in->snaprealm)

Added by Jan Fajerski over 4 years ago. Updated over 4 years ago.

Status: Rejected
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDS
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We saw this on a cluster where two of the three MDS daemons needed to be rebooted. After the reboot, both of those MDS daemons dumped core, while the remaining one kept running.

2019-07-17 09:33:01.655344 7fc31c040700 -1 /home/abuild/rpmbuild/BUILD/ceph-12.2.8-467-g080f2248ff/src/mds/Server.cc: In function 'void Server::handle_client_reconnect(MClientReconnect*)' thread 7fc31c040700 time 2019-07-17 09:33:01.653041
/home/abuild/rpmbuild/BUILD/ceph-12.2.8-467-g080f2248ff/src/mds/Server.cc: 958: FAILED assert(in->snaprealm)

 ceph version 12.2.8-467-g080f2248ff (080f2248ff90306f7b2c50b2ddd8da094e5ca3a7) luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x10e) [0x555635e18c0e]
 2: (Server::handle_client_reconnect(MClientReconnect*)+0x1b67) [0x555635b61287]
 3: (Server::dispatch(Message*)+0x305) [0x555635b86ef5]
 4: (MDSRank::handle_deferrable_message(Message*)+0x7ec) [0x555635aff72c]
 5: (MDSRank::_dispatch(Message*, bool)+0x1d3) [0x555635b0d4e3]
 6: (MDSRankDispatcher::ms_dispatch(Message*)+0x15) [0x555635b0e335]
 7: (MDSDaemon::ms_dispatch(Message*)+0xc3) [0x555635af7053]
 8: (DispatchQueue::entry()+0x78b) [0x5556360f3deb]
 9: (DispatchQueue::DispatchThread::entry()+0xd) [0x555635e9936d]
 10: (()+0x8724) [0x7fc321508724]
 11: (clone()+0x6d) [0x7fc32057de8d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Clients were ceph-fuse, and snapshots were not used (as reported by the user).
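
For illustration: judging from the backtrace, the assertion fires while handle_client_reconnect walks the capabilities reported by the reconnecting client and expects each cached inode to already have a snaprealm attached. Below is a minimal, self-contained C++ sketch of that failure mode; the types (Inode, SnapRealm, Cache, ReconnectMsg) are simplified stand-ins for illustration, not the actual Ceph classes.

    // Minimal illustration only -- simplified stand-in types, not Ceph code.
    // Models a reconnect loop: for each capability the client claims to hold,
    // look up the inode in cache and assert that a SnapRealm is attached,
    // mirroring the failed check at src/mds/Server.cc:958.
    #include <cassert>
    #include <cstdint>
    #include <map>
    #include <memory>
    #include <vector>

    struct SnapRealm {};              // stand-in for the MDS snaprealm

    struct Inode {
      uint64_t ino;
      SnapRealm *snaprealm;           // expected to be non-null on reconnect
    };

    struct ReconnectMsg {
      std::vector<uint64_t> caps;     // inode numbers the client holds caps on
    };

    struct Cache {
      std::map<uint64_t, std::unique_ptr<Inode>> inodes;
      Inode *get_inode(uint64_t ino) {
        auto it = inodes.find(ino);
        return it == inodes.end() ? nullptr : it->second.get();
      }
    };

    void handle_client_reconnect(Cache &cache, const ReconnectMsg &m) {
      for (uint64_t ino : m.caps) {
        Inode *in = cache.get_inode(ino);
        if (!in)
          continue;                   // unknown inodes are handled elsewhere
        assert(in->snaprealm);        // analogue of FAILED assert(in->snaprealm)
      }
    }

    int main() {
      Cache cache;
      // a cached inode with no snaprealm attached
      cache.inodes[1] = std::unique_ptr<Inode>(new Inode{1, nullptr});
      ReconnectMsg m{{1}};
      handle_client_reconnect(cache, m);  // aborts on the assert, like the crashing MDS
      return 0;
    }

Compiled and run, this aborts on the assert precisely because the cached inode has no snaprealm, which appears to be the state the crashing MDS found itself in.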

I realize this is a rather old Luminous build, but I noticed that

commit 34682e447522b94068425b761772d6a52634477c
Author: Yan, Zheng <zyan@redhat.com>
Date:   Mon Jul 31 21:12:25 2017 +0800

    mds: rollback snaprealms when rolling back slave request

removed that particular assertion in Mimic, so this might be in newer Luminous point releases.

Unfortunately this cluster has since been scrapped. Any insight into what could cause this?

#1

Updated by Patrick Donnelly over 4 years ago

  • Status changed from New to Rejected

Sorry Jan, snapshots are not stable in Luminous, and we don't spend time looking at snapshot-related failures for that release.

