Bug #52280 (closed): Mds crash and fails with assert on prepare_new_inode

Added by Yael Azulay over 2 years ago. Updated 9 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Correctness/Safety
Target version:
% Done: 0%
Source: Community (user)
Tags: backport_processed
Backport: reef, quincy, pacific
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite: ceph-disk
Component(FS): MDS, libcephfs
Labels (FS): crash
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi all,
We have a Nautilus 14.2.7 cluster with 3 MDS daemons.
Sometimes, under heavy load from Kubernetes pods, the MDS daemons keep restarting, failing with an assert in MDCache::add_inode.

On one of the setups where this crash happened, we also noticed that the cephfs_metadata pool had grown large: 1.3 TB.

Stack trace from the MDS log file:

E/huge/release/14.2.7/rpm/el7/BUILD/ceph-14.2.7/src/mds/MDCache.cc: In function 'void MDCache::add_inode(CInode*)' thread 7f657ec45700 time 2021-08-16 15:14:11.438857
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.7/rpm/el7/BUILD/ceph-14.2.7/src/mds/MDCache.cc: 268: FAILED ceph_assert(!p)

 ceph version 14.2.7 (3d58626ebeec02d8385a4cefb92c6cbc3a45bfe8) nautilus (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14a) [0x7f658816b031]
 2: (()+0x2661f9) [0x7f658816b1f9]
 3: (()+0x20aeee) [0x5588cc076eee]
 4: (Server::prepare_new_inode(boost::intrusive_ptr<MDRequestImpl>&, CDir*, inodeno_t, unsigned int, file_layout_t*)+0x2a4) [0x5588cc00a054]
 5: (Server::handle_client_openc(boost::intrusive_ptr<MDRequestImpl>&)+0xcf1) [0x5588cc019da1]
 6: (Server::dispatch_client_request(boost::intrusive_ptr<MDRequestImpl>&)+0xb5b) [0x5588cc040bbb]
 7: (Server::handle_client_request(boost::intrusive_ptr<MClientRequest const> const&)+0x308) [0x5588cc041048]
 8: (Server::dispatch(boost::intrusive_ptr<Message const> const&)+0x122) [0x5588cc04cb02]
 9: (MDSRank::handle_deferrable_message(boost::intrusive_ptr<Message const> const&)+0x6dc) [0x5588cbfc315c]
 10: (MDSRank::_dispatch(boost::intrusive_ptr<Message const> const&, bool)+0x7fa) [0x5588cbfc55ca]
 11: (MDSRank::retry_dispatch(boost::intrusive_ptr<Message const> const&)+0x12) [0x5588cbfc5c12]
 12: (MDSContext::complete(int)+0x74) [0x5588cc232b14]
 13: (MDSRank::_advance_queues()+0xa4) [0x5588cbfc4634]
 14: (MDSRank::_dispatch(boost::intrusive_ptr<Message const> const&, bool)+0x1d8) [0x5588cbfc4fa8]
 15: (MDSRankDispatcher::ms_dispatch(boost::intrusive_ptr<Message const> const&)+0x40) [0x5588cbfc5b50]
 16: (MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x108) [0x5588cbfb3078]
 17: (DispatchQueue::entry()+0x1709) [0x7f65883819d9]
 18: (DispatchQueue::DispatchThread::entry()+0xd) [0x7f658842e9cd]
 19: (()+0x7e65) [0x7f6586018e65]
 20: (clone()+0x6d) [0x7f6584cc688d]

     0> 2021-08-16 15:14:11.441 7f657ec45700 -1 *** Caught signal (Aborted) **
 in thread 7f657ec45700 thread_name:ms_dispatch
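
For context, the FAILED ceph_assert(!p) at MDCache.cc:268 is the duplicate-inode-number guard in MDCache::add_inode(): the MDS aborts when the inode being added already has an entry in the cache's inode map. The snippet below is a minimal standalone C++ model of that guard, not the actual Ceph source; the CInodeModel/MDCacheModel names and the plain std::map are illustrative assumptions.

// Standalone model (illustrative only, not Ceph code) of the duplicate-ino
// guard that fires in the backtrace above.
#include <cassert>
#include <cstdint>
#include <map>

struct CInodeModel {                      // stand-in for CInode
  uint64_t ino;
};

struct MDCacheModel {                     // stand-in for MDCache
  std::map<uint64_t, CInodeModel*> inode_map;

  void add_inode(CInodeModel *in) {
    // operator[] creates a null slot for a new ino; if another inode already
    // occupies the slot, p is non-null and the assert aborts the daemon,
    // which corresponds to FAILED ceph_assert(!p) in the log.
    auto &p = inode_map[in->ino];
    assert(!p);                           // models ceph_assert(!p)
    p = in;
  }
};

int main() {
  MDCacheModel cache;
  CInodeModel a{42}, b{42};               // two inodes claiming the same ino
  cache.add_inode(&a);                    // ok: slot 42 was empty
  cache.add_inode(&b);                    // aborts: slot 42 already taken
  return 0;
}

In the backtrace, the inode reaching add_inode() comes from Server::prepare_new_inode() while servicing handle_client_openc(), so the assert indicates that a freshly allocated inode number collided with one already present in the cache.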

ceph df output:

RAW STORAGE:
    CLASS     SIZE        AVAIL       USED        RAW USED     %RAW USED
    ssd       8.7 TiB     5.0 TiB     3.8 TiB      3.8 TiB         43.29
    TOTAL     8.7 TiB     5.0 TiB     3.8 TiB      3.8 TiB         43.29

POOLS:
    POOL                          ID     STORED      OBJECTS     USED        %USED     MAX AVAIL
    cephfs_data                    1     246 GiB     591.31k     499 GiB     13.36       1.6 TiB
    cephfs_metadata                2     1.5 TiB     561.84k     3.0 TiB     48.69       1.6 TiB
    default.rgw.meta               3         0 B           0         0 B         0       1.6 TiB
    .rgw.root                      4     3.5 KiB           8     256 KiB         0       1.6 TiB
    default.rgw.buckets.index      5         0 B           0         0 B         0       1.6 TiB
    default.rgw.control            6         0 B           8         0 B         0       1.6 TiB
    default.rgw.buckets.data       7         0 B           0         0 B         0       1.6 TiB
    default.rgw.log                8         0 B         207         0 B         0       1.6 TiB
    volumes                        9     141 GiB      57.69k     282 GiB      8.01       1.6 TiB
    backups                       10         0 B           0         0 B         0       1.6 TiB
    metrics                       11         0 B           0         0 B         0       1.6 TiB


Related issues 5 (2 open, 3 closed)

Related to CephFS - Bug #40002: mds: not trim log under heavy load (Fix Under Review, assignee: Xiubo Li)
Related to CephFS - Bug #53542: Ceph Metadata Pool disk throughput usage increasing (Fix Under Review, assignee: Xiubo Li)
Copied to CephFS - Backport #59706: pacific: Mds crash and fails with assert on prepare_new_inode (Resolved, assignee: Xiubo Li)
Copied to CephFS - Backport #59707: quincy: Mds crash and fails with assert on prepare_new_inode (Resolved, assignee: Xiubo Li)
Copied to CephFS - Backport #59708: reef: Mds crash and fails with assert on prepare_new_inode (Resolved, assignee: Xiubo Li)
