Bug #52280
Mds crash and fails with assert on prepare_new_inode
Description
Hi All
We have Nautilus 14.2.7, a cluster with 3 MDSs.
Sometimes, under heavy load from Kubernetes pods, the MDSs keep restarting and fail in MDCache::add_inode.
On one of the setups where this crash happened, we also noticed that the cephfs_metadata pool had grown large, 1.3 TB.
Stack trace from the MDS log file:
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.7/rpm/el7/BUILD/ceph-14.2.7/src/mds/MDCache.cc: In function 'void MDCache::add_inode(CInode*)' thread 7f657ec45700 time 2021-08-16 15:14:11.438857
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.7/rpm/el7/BUILD/ceph-14.2.7/src/mds/MDCache.cc: 268: FAILED ceph_assert(!p)

 ceph version 14.2.7 (3d58626ebeec02d8385a4cefb92c6cbc3a45bfe8) nautilus (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14a) [0x7f658816b031]
 2: (()+0x2661f9) [0x7f658816b1f9]
 3: (()+0x20aeee) [0x5588cc076eee]
 4: (Server::prepare_new_inode(boost::intrusive_ptr<MDRequestImpl>&, CDir*, inodeno_t, unsigned int, file_layout_t*)+0x2a4) [0x5588cc00a054]
 5: (Server::handle_client_openc(boost::intrusive_ptr<MDRequestImpl>&)+0xcf1) [0x5588cc019da1]
 6: (Server::dispatch_client_request(boost::intrusive_ptr<MDRequestImpl>&)+0xb5b) [0x5588cc040bbb]
 7: (Server::handle_client_request(boost::intrusive_ptr<MClientRequest const> const&)+0x308) [0x5588cc041048]
 8: (Server::dispatch(boost::intrusive_ptr<Message const> const&)+0x122) [0x5588cc04cb02]
 9: (MDSRank::handle_deferrable_message(boost::intrusive_ptr<Message const> const&)+0x6dc) [0x5588cbfc315c]
 10: (MDSRank::_dispatch(boost::intrusive_ptr<Message const> const&, bool)+0x7fa) [0x5588cbfc55ca]
 11: (MDSRank::retry_dispatch(boost::intrusive_ptr<Message const> const&)+0x12) [0x5588cbfc5c12]
 12: (MDSContext::complete(int)+0x74) [0x5588cc232b14]
 13: (MDSRank::_advance_queues()+0xa4) [0x5588cbfc4634]
 14: (MDSRank::_dispatch(boost::intrusive_ptr<Message const> const&, bool)+0x1d8) [0x5588cbfc4fa8]
 15: (MDSRankDispatcher::ms_dispatch(boost::intrusive_ptr<Message const> const&)+0x40) [0x5588cbfc5b50]
 16: (MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x108) [0x5588cbfb3078]
 17: (DispatchQueue::entry()+0x1709) [0x7f65883819d9]
 18: (DispatchQueue::DispatchThread::entry()+0xd) [0x7f658842e9cd]
 19: (()+0x7e65) [0x7f6586018e65]
 20: (clone()+0x6d) [0x7f6584cc688d]

 0> 2021-08-16 15:14:11.441 7f657ec45700 -1 *** Caught signal (Aborted) **
 in thread 7f657ec45700 thread_name:ms_dispatch

ceph df
RAW STORAGE:
    CLASS     SIZE        AVAIL       USED        RAW USED     %RAW USED
    ssd       8.7 TiB     5.0 TiB     3.8 TiB     3.8 TiB          43.29
    TOTAL     8.7 TiB     5.0 TiB     3.8 TiB     3.8 TiB          43.29

POOLS:
    POOL                          ID     STORED      OBJECTS     USED        %USED     MAX AVAIL
    cephfs_data                    1     246 GiB     591.31k     499 GiB     13.36       1.6 TiB
    cephfs_metadata                2     1.5 TiB     561.84k     3.0 TiB     48.69       1.6 TiB
    default.rgw.meta               3         0 B           0         0 B         0       1.6 TiB
    .rgw.root                      4     3.5 KiB           8     256 KiB         0       1.6 TiB
    default.rgw.buckets.index      5         0 B           0         0 B         0       1.6 TiB
    default.rgw.control            6         0 B           8         0 B         0       1.6 TiB
    default.rgw.buckets.data       7         0 B           0         0 B         0       1.6 TiB
    default.rgw.log                8         0 B         207         0 B         0       1.6 TiB
    volumes                        9     141 GiB      57.69k     282 GiB      8.01       1.6 TiB
    backups                       10         0 B           0         0 B         0       1.6 TiB
    metrics                       11         0 B           0         0 B         0       1.6 TiB
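For context on the failed assert: the check at MDCache.cc:268 rejects adding an inode whose number is already present in the cache's inode map. A minimal, hypothetical C++ sketch of that duplicate-ino check (simplified names such as MDCacheSketch and CInodeStub; not the actual Ceph source):

    #include <cassert>
    #include <cstdint>
    #include <unordered_map>

    struct CInodeStub { uint64_t ino; };

    class MDCacheSketch {
      std::unordered_map<uint64_t, CInodeStub*> inode_map;
    public:
      void add_inode(CInodeStub *in) {
        // operator[] creates a null entry if this ino is not cached yet
        auto &p = inode_map[in->ino];
        assert(!p);   // mirrors "FAILED ceph_assert(!p)": the ino is already in the cache
        p = in;
      }
    };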
Related issues
History
#1 Updated by Patrick Donnelly about 2 years ago
- Status changed from New to Need More Info
Were there any cluster health warnings? Some of the surrounding MDS logs would also be helpful.
#2 Updated by Patrick Donnelly about 2 years ago
- Assignee set to Xiubo Li
#3 Updated by Yael Azulay about 2 years ago
Patrick Donnelly wrote:
Were there any cluster health warnings? Some of the surrounding MDS logs would also be helpful.
Hi Patrick
Under load we see:
1 MDSs behind on trimming
and/or:
1 filesystem is degraded
Also, I want to mention that we enlarged mds_cache_memory_limit to 8 GB (on one of the setups we have now changed it to 16 GB)
and changed mds_min_caps_per_client to 150.
In addition, we now have a setup whose MDS keeps restarting, and I didn't manage to recover it at all, even after trying the Red Hat recovery procedure: http://people.redhat.com/bhubbard/nature/default/cephfs/disaster-recovery-experts/
#4 Updated by Patrick Donnelly about 2 years ago
- Description updated (diff)
#5 Updated by Xiubo Li about 2 years ago
Hi Yael,
Do you have the MDS-side logs? Thanks.
#6 Updated by Yael Azulay about 2 years ago
Xiubo Li wrote:
Hi Yael,
Do you have the MDS-side logs? Thanks.
Hi Xiubo
We reinstalled the setup and unfortunately I can't find the logs from the failure now.
Thanks
#7 Updated by Xiubo Li about 2 years ago
Yael Azulay wrote:
Xiubo Li wrote:
Hi Yael,
Do you have the MDS-side logs? Thanks.
Hi Xiubo
We reinstalled the setup and unfortunately I can't find the logs from the failure now.
Thanks
That's okay, thanks.
I have checked the related MDS code, and I suspect the cephfs_metadata pool was full of journal data. Since you enlarged mds_cache_memory_limit from 1 GB to 8 GB, the journal flush function (which tries to trim the journal logs) is only triggered once the MDCache exceeds the "too full" threshold (8 GB * 95%). In your case I am afraid it was never triggered, because the MDCache never reached that threshold, so the journals kept piling up until they reached 1.5 TB.
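A minimal sketch of the trigger logic described above (illustrative only, assuming a 5% reservation so the "too full" mark sits at 95% of mds_cache_memory_limit; the names and the flush helper are hypothetical, not the Ceph implementation):

    #include <cstdint>

    constexpr uint64_t kCacheMemoryLimit = 8ULL << 30;  // mds_cache_memory_limit = 8 GiB
    constexpr double   kCacheReservation = 0.05;        // 5% headroom -> 95% "too full" mark

    bool cache_too_full(uint64_t cache_usage_bytes) {
      return cache_usage_bytes > kCacheMemoryLimit * (1.0 - kCacheReservation);
    }

    void maybe_flush_journal(uint64_t cache_usage_bytes) {
      if (cache_too_full(cache_usage_bytes)) {
        // flush_and_trim_journal();  // hypothetical helper; only reached under cache pressure
      }
      // otherwise nothing trims here, so with a mostly-idle cache the journal can keep growing
    }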
#8 Updated by Xiubo Li about 2 years ago
Hi Yael,
BTW, did you see any OOM killer messages in the /var/log/message* log files? That would indicate either that the OS ran out of memory or that the MDS process hit its memory limit.
#9 Updated by Xiubo Li about 2 years ago
I found that some memory is not counted towards the MDCache memory usage, such as the inode_map in the MDCache class.
#10 Updated by Xiubo Li about 2 years ago
Hi Yael,
BTW, have you ever changed the "mds_log_segment_size" and "mds_log_events_per_segment" options?
If not, then as long as the MDS can trim the mdlog without getting stuck, the mdlog will usually occupy around 128 * 4 MB == 512 MB of disk space.
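A back-of-envelope check of that estimate, assuming the commonly cited defaults of about 128 trimmable segments at roughly 4 MB each (values not verified against a specific release):

    #include <cstdint>
    #include <cstdio>

    int main() {
      constexpr uint64_t segments      = 128;         // assumed mds_log_max_segments default
      constexpr uint64_t segment_bytes = 4ULL << 20;  // assumed ~4 MiB per log segment
      // 128 * 4 MiB == 512 MiB of journal data under normal trimming
      std::printf("expected mdlog footprint: %llu MiB\n",
                  static_cast<unsigned long long>((segments * segment_bytes) >> 20));
      return 0;
    }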
#11 Updated by Yael Azulay about 2 years ago
@xiubo Li
Thanks much, Li, for your analysis
I didn't change mds_log_segment_size and mds_log_events_per_segment.
To what values should I change them, given that my mds_cache_memory_limit is 8 GB?
At the time this crash happened, mds_min_caps_per_client was enlarged to 150
(Currently we have also enlarged other parameters like mds_cache_trim_threshold, mds_recall_global_max_decay_threshold, mds_recall_max_caps,
mds_recall_max_decay_rate)
- Not sure why my journal was never flushed. What should trigger a journal flush?
- Did you mean there is a bug where inode_map is not counted towards the MDCache memory usage? Could that cause this?
#12 Updated by Xiubo Li about 2 years ago
- Pull request ID set to 43184
#13 Updated by Xiubo Li about 2 years ago
- Status changed from Need More Info to Fix Under Review
Yael Azulay wrote:
@xiubo Li
Thanks much, Li, for your analysis
I didn't change mds_log_segment_size and mds_log_events_per_segment.
To what values should I change them, given that my mds_cache_memory_limit is 8 GB?
I don't think they are related.
At the time this crash happened, mds_min_caps_per_client was enlarged to 150
This should be okay. Won't affect much.
(Currently we have also enlarged other parameters like mds_cache_trim_threshold, mds_recall_global_max_decay_threshold, mds_recall_max_caps,
mds_recall_max_decay_rate)
- Not sure why my journal was never flushed. What should trigger a journal flush?
Since I don't have the MDS logs, I don't know exactly what happened. If mdlog->trim() got stuck, that could cause this, but from the code I still haven't figured out how that could happen.
In theory, mdlog->trim() should be called from mds->tick() every 5 seconds.
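Roughly, that periodic path can be pictured with the following sketch (hypothetical names, not the MDSRank/MDLog code): a tick loop that fires every 5 seconds and trims the journal on every pass, independent of cache pressure:

    #include <atomic>
    #include <chrono>
    #include <thread>

    struct MDLogSketch {
      void trim() { /* drop expired/expirable log segments */ }
    };

    // Hypothetical tick loop: trimming runs on a fixed cadence, so in theory the
    // journal should not grow without bound even when the cache is far from full.
    void tick_loop(MDLogSketch &mdlog, const std::atomic<bool> &stop) {
      while (!stop) {
        mdlog.trim();
        std::this_thread::sleep_for(std::chrono::seconds(5));
      }
    }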
- Did you mean there is a bug where inode_map is not counted towards the MDCache memory usage? Could that cause this?
I am not sure whether it should be counted towards the MDCache memory usage; I will check it further.
I have fixed the crash issue.
#14 Updated by Yael Azulay about 2 years ago
@xiubo Li
Hi Li
Thanks again
- What are the recommended values for mds_log_segment_size and mds_log_events_per_segment?
- I don't understand: is the journal pool supposed to be emptied from time to time?
- What is the fix you made for the crash? In which version is it included?
#15 Updated by Xiubo Li almost 2 years ago
Yael Azulay wrote:
@xiubo Li
Hi Li
Thanks again
- What are the recommended values for mds_log_segment_size and mds_log_events_per_segment?
There are no recommended values for these two options other than the defaults; it depends on the capacity of the disk or pool you use. In most cases the defaults should be okay.
- I don't understand: is the journal pool supposed to be emptied from time to time?
The journal won't be emptied, but it will be trimmed once usage reaches the upper limit. For each MDS the threshold should be 8 GB * 95% in your case.
- What is the fix you made for the crash? In which version is it included?
The fix is about not taking an inode number twice: the first use happens when replaying the journal logs, but that ino was not erased from the ino table, so when a new client request later allocates a new inode number it may take the already-used ino, which triggers the crash. This bug should exist in Nautilus and even earlier versions.
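To make that failure mode concrete, here is a simplified, hypothetical sketch (not the actual InoTable code) of how an inode number that is consumed during journal replay but never erased from the table can be handed out a second time:

    #include <cstdint>
    #include <set>

    struct InoTableSketch {
      std::set<uint64_t> free_inos{1001, 1002, 1003};

      uint64_t allocate() {                  // e.g. a new client openc request
        uint64_t ino = *free_inos.begin();
        free_inos.erase(free_inos.begin());
        return ino;
      }

      void replay_used(uint64_t ino, bool buggy) {
        if (!buggy)
          free_inos.erase(ino);              // correct: a replayed ino is no longer allocatable
        // buggy path: the ino stays in free_inos and can be allocated again later,
        // which leads MDCache::add_inode() to hit ceph_assert(!p) on the duplicate
      }
    };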
I will also check whether the frequent MDS crashes cause the journal logs to be cleaned up or not.
#16 Updated by Xiubo Li almost 2 years ago
- Related to Feature #10764: optimize memory usage of MDSCacheObject added
#17 Updated by Xiubo Li almost 2 years ago
- Related to deleted (Feature #10764: optimize memory usage of MDSCacheObject)
#18 Updated by Xiubo Li almost 2 years ago
- Related to Bug #40002: mds: not trim log under heavy load added
#19 Updated by Xiubo Li almost 2 years ago
ceph df
RAW STORAGE:
    CLASS     SIZE        AVAIL       USED        RAW USED     %RAW USED
    ssd       8.7 TiB     5.0 TiB     3.8 TiB     3.8 TiB          43.29
    TOTAL     8.7 TiB     5.0 TiB     3.8 TiB     3.8 TiB          43.29

POOLS:
    POOL                          ID     STORED      OBJECTS     USED        %USED     MAX AVAIL
    cephfs_data                    1     246 GiB     591.31k     499 GiB     13.36       1.6 TiB
    cephfs_metadata                2     1.5 TiB     561.84k     3.0 TiB     48.69       1.6 TiB
    default.rgw.meta               3         0 B           0         0 B         0       1.6 TiB
    .rgw.root                      4     3.5 KiB           8     256 KiB         0       1.6 TiB
    default.rgw.buckets.index      5         0 B           0         0 B         0       1.6 TiB
    default.rgw.control            6         0 B           8         0 B         0       1.6 TiB
    default.rgw.buckets.data       7         0 B           0         0 B         0       1.6 TiB
    default.rgw.log                8         0 B         207         0 B         0       1.6 TiB
    volumes                        9     141 GiB      57.69k     282 GiB      8.01       1.6 TiB
    backups                       10         0 B           0         0 B         0       1.6 TiB
    metrics                       11         0 B           0         0 B         0       1.6 TiB
For this issue, it's the same as https://tracker.ceph.com/issues/40002, which was caused by the submit_mutex unfairness problem. It's also similar to https://tracker.ceph.com/issues/51722.
#20 Updated by Yael Azulay almost 2 years ago
Hi Xiubo Li
Thanks for the information.
The problem happened to us on Nautilus 14.2.7.
Will the fixes be included in this version?
#21 Updated by Xiubo Li almost 2 years ago
- Related to Bug #53542: Ceph Metadata Pool disk throughput usage increasing added
#22 Updated by Venky Shankar 4 months ago
- Category set to Correctness/Safety
- Status changed from Fix Under Review to Pending Backport
- Target version set to v19.0.0
- Source set to Community (user)
- Backport set to reef,quincy,pacific
#23 Updated by Backport Bot 4 months ago
- Copied to Backport #59706: pacific: Mds crash and fails with assert on prepare_new_inode added
#24 Updated by Backport Bot 4 months ago
- Copied to Backport #59707: quincy: Mds crash and fails with assert on prepare_new_inode added
#25 Updated by Backport Bot 4 months ago
- Copied to Backport #59708: reef: Mds crash and fails with assert on prepare_new_inode added
#26 Updated by Backport Bot 4 months ago
- Tags set to backport_processed
#27 Updated by Xiubo Li about 1 month ago
- Status changed from Pending Backport to Resolved