Bug #52280


Mds crash and fails with assert on prepare_new_inode

Added by Yael Azulay over 2 years ago. Updated 9 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Correctness/Safety
Target version:
% Done:

0%

Source:
Community (user)
Tags:
backport_processed
Backport:
reef,quincy,pacific
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
ceph-disk
Component(FS):
MDS, libcephfs
Labels (FS):
crash
Pull request ID:
43184
Crash signature (v1):
Crash signature (v2):

Description

Hi All
We have Nautilus 14.2.7, a cluster with 3 MDSs.
Sometimes, under heavy load from Kubernetes pods, the MDSs keep restarting and fail in MDCache::add_inode.

On one of the setups where this crash happened, we also noticed that the cephfs_metadata pool was large: 1.3TB.

Stack trace from the MDS log file:

/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.7/rpm/el7/BUILD/ceph-14.2.7/src/mds/MDCache.cc: In function 'void MDCache::add_inode(CInode*)' thread 7f657ec45700 time 2021-08-16 15:14:11.438857
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.7/rpm/el7/BUILD/ceph-14.2.7/src/mds/MDCache.cc: 268: FAILED ceph_assert(!p)

 ceph version 14.2.7 (3d58626ebeec02d8385a4cefb92c6cbc3a45bfe8) nautilus (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14a) [0x7f658816b031]
 2: (()+0x2661f9) [0x7f658816b1f9]
 3: (()+0x20aeee) [0x5588cc076eee]
 4: (Server::prepare_new_inode(boost::intrusive_ptr<MDRequestImpl>&, CDir*, inodeno_t, unsigned int, file_layout_t*)+0x2a4) [0x5588cc00a054]
 5: (Server::handle_client_openc(boost::intrusive_ptr<MDRequestImpl>&)+0xcf1) [0x5588cc019da1]
 6: (Server::dispatch_client_request(boost::intrusive_ptr<MDRequestImpl>&)+0xb5b) [0x5588cc040bbb]
 7: (Server::handle_client_request(boost::intrusive_ptr<MClientRequest const> const&)+0x308) [0x5588cc041048]
 8: (Server::dispatch(boost::intrusive_ptr<Message const> const&)+0x122) [0x5588cc04cb02]
 9: (MDSRank::handle_deferrable_message(boost::intrusive_ptr<Message const> const&)+0x6dc) [0x5588cbfc315c]
 10: (MDSRank::_dispatch(boost::intrusive_ptr<Message const> const&, bool)+0x7fa) [0x5588cbfc55ca]
 11: (MDSRank::retry_dispatch(boost::intrusive_ptr<Message const> const&)+0x12) [0x5588cbfc5c12]
 12: (MDSContext::complete(int)+0x74) [0x5588cc232b14]
 13: (MDSRank::_advance_queues()+0xa4) [0x5588cbfc4634]
 14: (MDSRank::_dispatch(boost::intrusive_ptr<Message const> const&, bool)+0x1d8) [0x5588cbfc4fa8]
 15: (MDSRankDispatcher::ms_dispatch(boost::intrusive_ptr<Message const> const&)+0x40) [0x5588cbfc5b50]
 16: (MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x108) [0x5588cbfb3078]
 17: (DispatchQueue::entry()+0x1709) [0x7f65883819d9]
 18: (DispatchQueue::DispatchThread::entry()+0xd) [0x7f658842e9cd]
 19: (()+0x7e65) [0x7f6586018e65]
 20: (clone()+0x6d) [0x7f6584cc688d]

     0> 2021-08-16 15:14:11.441 7f657ec45700 -1 *** Caught signal (Aborted) **
 in thread 7f657ec45700 thread_name:ms_dispatch
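
For context, the failing assertion is the duplicate-ino check in MDCache::add_inode; a simplified reading of it, paraphrased from the assert message above rather than the exact upstream source:

    // MDCache::add_inode (simplified): each inode number may appear only once
    void MDCache::add_inode(CInode *in) {
      CInode *&p = inode_map[in->ino()];  // slot for this inode number
      ceph_assert(!p);                    // FAILED ceph_assert(!p): ino already cached
      p = in;
      // ...
    }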

ceph df 

RAW STORAGE:
    CLASS     SIZE        AVAIL       USED        RAW USED     %RAW USED
    ssd       8.7 TiB     5.0 TiB     3.8 TiB      3.8 TiB         43.29
    TOTAL     8.7 TiB     5.0 TiB     3.8 TiB      3.8 TiB         43.29

POOLS:
    POOL                          ID     STORED      OBJECTS     USED        %USED     MAX AVAIL
    cephfs_data                    1     246 GiB     591.31k     499 GiB     13.36       1.6 TiB
    cephfs_metadata                2     1.5 TiB     561.84k     3.0 TiB     48.69       1.6 TiB
    default.rgw.meta               3         0 B           0         0 B         0       1.6 TiB
    .rgw.root                      4     3.5 KiB           8     256 KiB         0       1.6 TiB
    default.rgw.buckets.index      5         0 B           0         0 B         0       1.6 TiB
    default.rgw.control            6         0 B           8         0 B         0       1.6 TiB
    default.rgw.buckets.data       7         0 B           0         0 B         0       1.6 TiB
    default.rgw.log                8         0 B         207         0 B         0       1.6 TiB
    volumes                        9     141 GiB      57.69k     282 GiB      8.01       1.6 TiB
    backups                       10         0 B           0         0 B         0       1.6 TiB
    metrics                       11         0 B           0         0 B         0       1.6 TiB


Related issues 5 (2 open, 3 closed)

Related to CephFS - Bug #40002: mds: not trim log under heavy load (Fix Under Review, Xiubo Li)

Related to CephFS - Bug #53542: Ceph Metadata Pool disk throughput usage increasing (Fix Under Review, Xiubo Li)

Copied to CephFS - Backport #59706: pacific: Mds crash and fails with assert on prepare_new_inode (Resolved, Xiubo Li)
Copied to CephFS - Backport #59707: quincy: Mds crash and fails with assert on prepare_new_inode (Resolved, Xiubo Li)
Copied to CephFS - Backport #59708: reef: Mds crash and fails with assert on prepare_new_inode (Resolved, Xiubo Li)
Actions #1

Updated by Patrick Donnelly over 2 years ago

  • Status changed from New to Need More Info

Were there any cluster health warnings? Some of the surrounding MDS logs would also be helpful.

Actions #2

Updated by Patrick Donnelly over 2 years ago

  • Assignee set to Xiubo Li
Actions #3

Updated by Yael Azulay over 2 years ago

Patrick Donnelly wrote:

Were there any cluster health warnings? Some of the surrounding MDS logs would also be helpful.

Hi Patrick
Under load we see:
1 MDSs behind on trimming

and/or:
1 filesystem is degraded

Also, I want to mention that we enlarged mds_cache_memory_limit to 8G (on one of the setups we have now changed it to 16GB)
and changed mds_min_caps_per_client to 150.
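
For reference, settings like these are usually applied cluster-wide with ceph config set; a sketch using the values quoted above (8G written out in bytes):

    ceph config set mds mds_cache_memory_limit 8589934592   # 8 GiB
    ceph config set mds mds_min_caps_per_client 150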

In addition, we now have a setup whose MDS keeps restarting, and I didn't manage to recover it at all, even after trying the Red Hat recovery procedure: http://people.redhat.com/bhubbard/nature/default/cephfs/disaster-recovery-experts/

Actions #4

Updated by Patrick Donnelly over 2 years ago

  • Description updated (diff)
Actions #5

Updated by Xiubo Li over 2 years ago

Hi Yael,

Do you have the MDS-side logs? Thanks.

Actions #6

Updated by Yael Azulay over 2 years ago

Xiubo Li wrote:

Hi Yael,

Do you have the MDS-side logs? Thanks.

Hi Xiubo
We reinstalled the setup and unfortunately I can't find logs of the failure now.
Thanks

Actions #7

Updated by Xiubo Li over 2 years ago

Yael Azulay wrote:

Xiubo Li wrote:

Hi Yael,

Do you have the MDS-side logs? Thanks.

Hi Xiubo
We reinstalled the setup and unfortunately I can't find logs of the failure now.
Thanks

That's okay, thanks.

I have checked the related MDS code, and I suspect the cephfs_metadata pool was full of journal data. Since you enlarged mds_cache_memory_limit from 1G to 8G, the journal flush function, which trims the journal logs, is only triggered once the MDCache exceeds the "too full" threshold (8G * 95%). In your case I am afraid it was never triggered, because the MDCache never reached that threshold, so the journals piled up to 1.5TB.
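
Taking those figures at face value, the flush would only kick in at roughly:

    8 GiB * 0.95 = 8192 MiB * 0.95 ≈ 7782 MiB ≈ 7.6 GiB of MDCache usage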

Actions #8

Updated by Xiubo Li over 2 years ago

Hi Yael,

BTW, did you see any OOM-killer logs in the /var/log/message* log files? That would indicate either that the OS's memory was used up or that the mds process had reached its memory limit.

Actions #9

Updated by Xiubo Li over 2 years ago

I found that some memory is not counted towards the MDCache memory usage, such as the inode_map in the MDCache class.

Actions #10

Updated by Xiubo Li over 2 years ago

Hi Yael,

BTW, have you ever changed the "mds_log_segment_size" and "mds_log_events_per_segment" options?

If not, and assuming the MDS can trim the mdlog without getting stuck, the mdlog will usually occupy around 128 * 4MB == 512MB of disk space.
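
If it helps, the current values can be checked on a running MDS through the admin socket (mds.<name> below is a placeholder for the daemon name):

    ceph daemon mds.<name> config get mds_log_max_segments       # default 128
    ceph daemon mds.<name> config get mds_log_segment_size       # 0 falls back to the journal object size (~4 MiB)
    ceph daemon mds.<name> config get mds_log_events_per_segment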

Actions #11

Updated by Yael Azulay over 2 years ago

@xiubo Li
Thanks very much, Li, for your analysis.
I didn't change mds_log_segment_size or mds_log_events_per_segment.
To what values should I change them, given that mds_cache_memory_limit is 8G?

At the time this crash happened, mds_min_caps_per_client had been enlarged to 150.

(Currently we have also enlarged other parameters like mds_cache_trim_threshold, mds_recall_global_max_decay_threshold, mds_recall_max_caps,
mds_recall_max_decay_rate)

- Not sure why my journal was never flushed. What should trigger a journal flush?

- Did you mean there is a bug where inode_map is not counted in the MDCache? Which case is this?

Actions #12

Updated by Xiubo Li over 2 years ago

  • Pull request ID set to 43184
Actions #13

Updated by Xiubo Li over 2 years ago

  • Status changed from Need More Info to Fix Under Review

Yael Azulay wrote:

@xiubo Li
Thanks very much, Li, for your analysis.
I didn't change mds_log_segment_size or mds_log_events_per_segment.
To what values should I change them, given that mds_cache_memory_limit is 8G?

I don't think they are related.

At the time this crash happened, mds_min_caps_per_client had been enlarged to 150.

This should be okay. Won't affect much.

(Currently we have also enlarged other parameters like mds_cache_trim_threshold, mds_recall_global_max_decay_threshold, mds_recall_max_caps,
mds_recall_max_decay_rate)

- Not sure why my journal was never flushed. What should trigger a journal flush?

Since I don't have the MDS logs, I don't know what happened. If mdlog->trim() got stuck, that could cause this, but from the code I still haven't figured out how that could happen.

In theory, mdlog->trim() should be called from mds->tick() every 5 seconds.
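
Schematically, the flow being described is (function names follow the comment and the stack trace above; not the exact upstream source):

    // simplified: the periodic MDS tick is what drives journal trimming
    void MDSRank::tick() {   // runs every few seconds
      // ... heartbeat, cache trimming, session/client checks ...
      mdlog->trim();         // drop expired segments back toward mds_log_max_segments
    }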

- Did you mean there is a bug where inode_map is not counted in the MDCache? Which case is this?

I am not sure whether it should be counted towards the MDCache memory usage; I will check it further.

I have fixed the crash issue.

Actions #14

Updated by Yael Azulay over 2 years ago

@xiubo Li
Hi Li
Thanks again.
- What are the recommended values for mds_log_segment_size and mds_log_events_per_segment?
- I don't understand: is the journal pool supposed to be emptied from time to time?
- What is the fix you made for the crash? In which version is it included?

Actions #15

Updated by Xiubo Li over 2 years ago

Yael Azulay wrote:

@xiubo Li
Hi Li
Thanks again.
- What are the recommended values for mds_log_segment_size and mds_log_events_per_segment?

There are no recommended values for these two options other than the defaults; it depends on the capacity of the disk or pool you use. In most cases the default values should be okay.

- I don't understand: is the journal pool supposed to be emptied from time to time?

The journals won't be emptied, but they will be trimmed once the usage reaches the upper limit. For each MDS the threshold should be 8G * 95% in your case.

- What is the fix you made for the crash? In which version is it included?

The fix is about not taking the same inode number twice: the first use happens when replaying the journal logs, but it was not erased from the inotable, so when a new client request allocates a new inode number it may take the already-used ino, which triggers the crash. This goes back to Nautilus or earlier versions.

I will check whether the frequent crashes of the MDSes cause the journal logs to be cleaned or not.
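
To illustrate the double-allocation mechanism described above, here is a small toy model (not Ceph code; ToyInoTable and the ino values are invented for illustration):

    // Toy model of the double allocation: if an ino consumed during journal
    // replay is not removed from the free set, a later allocation can hand
    // out an ino that is already present in the cache map.
    #include <cassert>
    #include <cstdint>
    #include <map>
    #include <set>

    struct ToyInoTable {
      std::set<uint64_t> free_inos{1001, 1002, 1003};
      uint64_t alloc() {                      // hand out the lowest free ino
        uint64_t ino = *free_inos.begin();
        free_inos.erase(free_inos.begin());
        return ino;
      }
    };

    int main() {
      ToyInoTable table;
      std::map<uint64_t, int> inode_map;      // stands in for the MDS cache's ino -> inode map

      // Journal replay recreates the inode with ino 1001, but (the bug) the
      // ino is not erased from the table's free set.
      inode_map[1001] = 1;

      // A new client request (openc) then allocates an ino and gets 1001 again.
      uint64_t ino = table.alloc();

      // add_inode-style check: the slot must be empty; here it is not,
      // mirroring the FAILED ceph_assert(!p) in the report.
      assert(inode_map.count(ino) == 0);
      return 0;
    }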

Actions #16

Updated by Xiubo Li over 2 years ago

  • Related to Feature #10764: optimize memory usage of MDSCacheObject added
Actions #17

Updated by Xiubo Li over 2 years ago

  • Related to deleted (Feature #10764: optimize memory usage of MDSCacheObject)
Actions #18

Updated by Xiubo Li over 2 years ago

  • Related to Bug #40002: mds: not trim log under heavy load added
Actions #19

Updated by Xiubo Li over 2 years ago

ceph df 

RAW STORAGE:
    CLASS     SIZE        AVAIL       USED        RAW USED     %RAW USED
    ssd       8.7 TiB     5.0 TiB     3.8 TiB      3.8 TiB         43.29
    TOTAL     8.7 TiB     5.0 TiB     3.8 TiB      3.8 TiB         43.29

POOLS:
    POOL                          ID     STORED      OBJECTS     USED        %USED     MAX AVAIL
    cephfs_data                    1     246 GiB     591.31k     499 GiB     13.36       1.6 TiB
    cephfs_metadata                2     1.5 TiB     561.84k     3.0 TiB     48.69       1.6 TiB
    default.rgw.meta               3         0 B           0         0 B         0       1.6 TiB
    .rgw.root                      4     3.5 KiB           8     256 KiB         0       1.6 TiB
    default.rgw.buckets.index      5         0 B           0         0 B         0       1.6 TiB
    default.rgw.control            6         0 B           8         0 B         0       1.6 TiB
    default.rgw.buckets.data       7         0 B           0         0 B         0       1.6 TiB
    default.rgw.log                8         0 B         207         0 B         0       1.6 TiB
    volumes                        9     141 GiB      57.69k     282 GiB      8.01       1.6 TiB
    backups                       10         0 B           0         0 B         0       1.6 TiB
    metrics                       11         0 B           0         0 B         0       1.6 TiB

For this issue, it's the same as https://tracker.ceph.com/issues/40002, which was caused by the submit_mutex unfairness problem. It's similar to https://tracker.ceph.com/issues/51722.

Actions #20

Updated by Yael Azulay over 2 years ago

Hi Xiubo Li
Thanks for the information.
The problem happened to us on Nautilus 14.2.7.

Will the fixes be included in this version?

Actions #21

Updated by Xiubo Li over 2 years ago

  • Related to Bug #53542: Ceph Metadata Pool disk throughput usage increasing added
Actions #22

Updated by Venky Shankar 12 months ago

  • Category set to Correctness/Safety
  • Status changed from Fix Under Review to Pending Backport
  • Target version set to v19.0.0
  • Source set to Community (user)
  • Backport set to reef,quincy,pacific
Actions #23

Updated by Backport Bot 12 months ago

  • Copied to Backport #59706: pacific: Mds crash and fails with assert on prepare_new_inode added
Actions #24

Updated by Backport Bot 12 months ago

  • Copied to Backport #59707: quincy: Mds crash and fails with assert on prepare_new_inode added
Actions #25

Updated by Backport Bot 12 months ago

  • Copied to Backport #59708: reef: Mds crash and fails with assert on prepare_new_inode added
Actions #26

Updated by Backport Bot 12 months ago

  • Tags set to backport_processed
Actions #27

Updated by Xiubo Li 9 months ago

  • Status changed from Pending Backport to Resolved
