Bug #41187 (closed)

src/mds/Server.cc: 958: FAILED assert(in->snaprealm)

Added by Jan Fajerski over 4 years ago. Updated over 4 years ago.

Status: Rejected
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDS
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We saw this on a cluster where two of the three MDS daemons needed to be rebooted. After the reboot, both of those MDS daemons dumped core, while the remaining one kept running.

2019-07-17 09:33:01.655344 7fc31c040700 -1 /home/abuild/rpmbuild/BUILD/ceph-12.2.8-467-g080f2248ff/src/mds/Server.cc: In function 'void Server::handle_client_reconnect(MClientReconnect*)' thread 7fc31c040700 time 2019-07-17 09:33:01.653041
/home/abuild/rpmbuild/BUILD/ceph-12.2.8-467-g080f2248ff/src/mds/Server.cc: 958: FAILED assert(in->snaprealm)

 ceph version 12.2.8-467-g080f2248ff (080f2248ff90306f7b2c50b2ddd8da094e5ca3a7) luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x10e) [0x555635e18c0e]
 2: (Server::handle_client_reconnect(MClientReconnect*)+0x1b67) [0x555635b61287]
 3: (Server::dispatch(Message*)+0x305) [0x555635b86ef5]
 4: (MDSRank::handle_deferrable_message(Message*)+0x7ec) [0x555635aff72c]
 5: (MDSRank::_dispatch(Message*, bool)+0x1d3) [0x555635b0d4e3]
 6: (MDSRankDispatcher::ms_dispatch(Message*)+0x15) [0x555635b0e335]
 7: (MDSDaemon::ms_dispatch(Message*)+0xc3) [0x555635af7053]
 8: (DispatchQueue::entry()+0x78b) [0x5556360f3deb]
 9: (DispatchQueue::DispatchThread::entry()+0xd) [0x555635e9936d]
 10: (()+0x8724) [0x7fc321508724]
 11: (clone()+0x6d) [0x7fc32057de8d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Clients were ceph-fuse, and snapshots were not used (as reported by the user).
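
For illustration: judging from the backtrace, the assertion fires while handle_client_reconnect walks the capabilities reported by the reconnecting client and expects each cached inode to already have a snaprealm attached. Below is a minimal, self-contained C++ sketch of that failure mode; the types (Inode, SnapRealm, Cache, ReconnectMsg) are simplified stand-ins for illustration, not the actual Ceph classes.

    // Minimal illustration only -- simplified stand-in types, not Ceph code.
    // Models a reconnect loop: for each capability the client claims to hold,
    // look up the inode in cache and assert that a SnapRealm is attached,
    // mirroring the failed check at src/mds/Server.cc:958.
    #include <cassert>
    #include <cstdint>
    #include <map>
    #include <memory>
    #include <vector>

    struct SnapRealm {};              // stand-in for the MDS snaprealm

    struct Inode {
      uint64_t ino;
      SnapRealm *snaprealm;           // expected to be non-null on reconnect
    };

    struct ReconnectMsg {
      std::vector<uint64_t> caps;     // inode numbers the client holds caps on
    };

    struct Cache {
      std::map<uint64_t, std::unique_ptr<Inode>> inodes;
      Inode *get_inode(uint64_t ino) {
        auto it = inodes.find(ino);
        return it == inodes.end() ? nullptr : it->second.get();
      }
    };

    void handle_client_reconnect(Cache &cache, const ReconnectMsg &m) {
      for (uint64_t ino : m.caps) {
        Inode *in = cache.get_inode(ino);
        if (!in)
          continue;                   // unknown inodes are handled elsewhere
        assert(in->snaprealm);        // analogue of FAILED assert(in->snaprealm)
      }
    }

    int main() {
      Cache cache;
      // a cached inode with no snaprealm attached
      cache.inodes[1] = std::unique_ptr<Inode>(new Inode{1, nullptr});
      ReconnectMsg m{{1}};
      handle_client_reconnect(cache, m);  // aborts on the assert, like the crashing MDS
      return 0;
    }

Compiled and run, this aborts on the assert precisely because the cached inode has no snaprealm, which appears to be the state the crashing MDS found itself in.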

I realize this is a rather old Luminous build, but I noticed that

commit 34682e447522b94068425b761772d6a52634477c
Author: Yan, Zheng <zyan@redhat.com>
Date:   Mon Jul 31 21:12:25 2017 +0800

    mds: rollback snaprealms when rolling back slave request

removed that particular assertion in Mimic, so this might be in newer Luminous point releases.

Unfortunately this cluster has since been scrapped. Any insight into what could cause this?

#1

Updated by Patrick Donnelly over 4 years ago

  • Status changed from New to Rejected

Sorry Jan, snapshots are not stable in Luminous, and we don't spend time looking at snapshot-related failures for that release.

