Bug #52280 (closed): Mds crash and fails with assert on prepare_new_inode

Added by Yael Azulay over 2 years ago. Updated 9 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Correctness/Safety
Target version:
% Done: 0%
Source: Community (user)
Tags: backport_processed
Backport: reef, quincy, pacific
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite: ceph-disk
Component(FS): MDS, libcephfs
Labels (FS): crash
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi all,
We have a Nautilus 14.2.7 cluster with 3 MDS daemons.
Sometimes, under heavy load from Kubernetes pods, the MDS daemons keep restarting, failing with an assert in MDCache::add_inode.

On one of the setups where this crash happened, we also noticed that the cephfs_metadata pool had grown large: 1.3 TB.

Stack trace from the MDS log file:

E/huge/release/14.2.7/rpm/el7/BUILD/ceph-14.2.7/src/mds/MDCache.cc: In function 'void MDCache::add_inode(CInode*)' thread 7f657ec45700 time 2021-08-16 15:14:11.438857
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.7/rpm/el7/BUILD/ceph-14.2.7/src/mds/MDCache.cc: 268: FAILED ceph_assert(!p)

 ceph version 14.2.7 (3d58626ebeec02d8385a4cefb92c6cbc3a45bfe8) nautilus (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14a) [0x7f658816b031]
 2: (()+0x2661f9) [0x7f658816b1f9]
 3: (()+0x20aeee) [0x5588cc076eee]
 4: (Server::prepare_new_inode(boost::intrusive_ptr<MDRequestImpl>&, CDir*, inodeno_t, unsigned int, file_layout_t*)+0x2a4) [0x5588cc00a054]
 5: (Server::handle_client_openc(boost::intrusive_ptr<MDRequestImpl>&)+0xcf1) [0x5588cc019da1]
 6: (Server::dispatch_client_request(boost::intrusive_ptr<MDRequestImpl>&)+0xb5b) [0x5588cc040bbb]
 7: (Server::handle_client_request(boost::intrusive_ptr<MClientRequest const> const&)+0x308) [0x5588cc041048]
 8: (Server::dispatch(boost::intrusive_ptr<Message const> const&)+0x122) [0x5588cc04cb02]
 9: (MDSRank::handle_deferrable_message(boost::intrusive_ptr<Message const> const&)+0x6dc) [0x5588cbfc315c]
 10: (MDSRank::_dispatch(boost::intrusive_ptr<Message const> const&, bool)+0x7fa) [0x5588cbfc55ca]
 11: (MDSRank::retry_dispatch(boost::intrusive_ptr<Message const> const&)+0x12) [0x5588cbfc5c12]
 12: (MDSContext::complete(int)+0x74) [0x5588cc232b14]
 13: (MDSRank::_advance_queues()+0xa4) [0x5588cbfc4634]
 14: (MDSRank::_dispatch(boost::intrusive_ptr<Message const> const&, bool)+0x1d8) [0x5588cbfc4fa8]
 15: (MDSRankDispatcher::ms_dispatch(boost::intrusive_ptr<Message const> const&)+0x40) [0x5588cbfc5b50]
 16: (MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x108) [0x5588cbfb3078]
 17: (DispatchQueue::entry()+0x1709) [0x7f65883819d9]
 18: (DispatchQueue::DispatchThread::entry()+0xd) [0x7f658842e9cd]
 19: (()+0x7e65) [0x7f6586018e65]
 20: (clone()+0x6d) [0x7f6584cc688d]

     0> 2021-08-16 15:14:11.441 7f657ec45700 -1 *** Caught signal (Aborted) **
 in thread 7f657ec45700 thread_name:ms_dispatch
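
For context, the FAILED ceph_assert(!p) at MDCache.cc:268 is the duplicate-inode-number guard in MDCache::add_inode(): the MDS aborts when the inode being added already has an entry in the cache's inode map. The snippet below is a minimal standalone C++ model of that guard, not the actual Ceph source; the CInodeModel/MDCacheModel names and the plain std::map are illustrative assumptions.

// Standalone model (illustrative only, not Ceph code) of the duplicate-ino
// guard that fires in the backtrace above.
#include <cassert>
#include <cstdint>
#include <map>

struct CInodeModel {                      // stand-in for CInode
  uint64_t ino;
};

struct MDCacheModel {                     // stand-in for MDCache
  std::map<uint64_t, CInodeModel*> inode_map;

  void add_inode(CInodeModel *in) {
    // operator[] creates a null slot for a new ino; if another inode already
    // occupies the slot, p is non-null and the assert aborts the daemon,
    // which corresponds to FAILED ceph_assert(!p) in the log.
    auto &p = inode_map[in->ino];
    assert(!p);                           // models ceph_assert(!p)
    p = in;
  }
};

int main() {
  MDCacheModel cache;
  CInodeModel a{42}, b{42};               // two inodes claiming the same ino
  cache.add_inode(&a);                    // ok: slot 42 was empty
  cache.add_inode(&b);                    // aborts: slot 42 already taken
  return 0;
}

In the backtrace, the inode reaching add_inode() comes from Server::prepare_new_inode() while servicing handle_client_openc(), so the assert indicates that a freshly allocated inode number collided with one already present in the cache.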

ceph df output:

RAW STORAGE:
    CLASS     SIZE        AVAIL       USED        RAW USED     %RAW USED
    ssd       8.7 TiB     5.0 TiB     3.8 TiB      3.8 TiB         43.29
    TOTAL     8.7 TiB     5.0 TiB     3.8 TiB      3.8 TiB         43.29

POOLS:
    POOL                          ID     STORED      OBJECTS     USED        %USED     MAX AVAIL
    cephfs_data                    1     246 GiB     591.31k     499 GiB     13.36       1.6 TiB
    cephfs_metadata                2     1.5 TiB     561.84k     3.0 TiB     48.69       1.6 TiB
    default.rgw.meta               3         0 B           0         0 B         0       1.6 TiB
    .rgw.root                      4     3.5 KiB           8     256 KiB         0       1.6 TiB
    default.rgw.buckets.index      5         0 B           0         0 B         0       1.6 TiB
    default.rgw.control            6         0 B           8         0 B         0       1.6 TiB
    default.rgw.buckets.data       7         0 B           0         0 B         0       1.6 TiB
    default.rgw.log                8         0 B         207         0 B         0       1.6 TiB
    volumes                        9     141 GiB      57.69k     282 GiB      8.01       1.6 TiB
    backups                       10         0 B           0         0 B         0       1.6 TiB
    metrics                       11         0 B           0         0 B         0       1.6 TiB


Related issues 5 (2 open, 3 closed)

Related to CephFS - Bug #40002: mds: not trim log under heavy load (Fix Under Review, assignee: Xiubo Li)
Related to CephFS - Bug #53542: Ceph Metadata Pool disk throughput usage increasing (Fix Under Review, assignee: Xiubo Li)
Copied to CephFS - Backport #59706: pacific: Mds crash and fails with assert on prepare_new_inode (Resolved, assignee: Xiubo Li)
Copied to CephFS - Backport #59707: quincy: Mds crash and fails with assert on prepare_new_inode (Resolved, assignee: Xiubo Li)
Copied to CephFS - Backport #59708: reef: Mds crash and fails with assert on prepare_new_inode (Resolved, assignee: Xiubo Li)
