Bug #52280
Mds crash and fails with assert on prepare_new_inode
Description
Hi All
We have Nautilus 14.2.7, a cluster with 3 MDSs.
Sometimes, under heavy load from Kubernetes pods, the MDSs keep restarting and fail in MDCache::add_inode.
On one of the setups where this crash happened, we also noticed that the cephfs_metadata pool had grown large, 1.3 TB.
Stack trace from the MDS log file:
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.7/rpm/el7/BUILD/ceph-14.2.7/src/mds/MDCache.cc: In function 'void MDCache::add_inode(CInode*)' thread 7f657ec45700 time 2021-08-16 15:14:11.438857
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.7/rpm/el7/BUILD/ceph-14.2.7/src/mds/MDCache.cc: 268: FAILED ceph_assert(!p)

 ceph version 14.2.7 (3d58626ebeec02d8385a4cefb92c6cbc3a45bfe8) nautilus (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14a) [0x7f658816b031]
 2: (()+0x2661f9) [0x7f658816b1f9]
 3: (()+0x20aeee) [0x5588cc076eee]
 4: (Server::prepare_new_inode(boost::intrusive_ptr<MDRequestImpl>&, CDir*, inodeno_t, unsigned int, file_layout_t*)+0x2a4) [0x5588cc00a054]
 5: (Server::handle_client_openc(boost::intrusive_ptr<MDRequestImpl>&)+0xcf1) [0x5588cc019da1]
 6: (Server::dispatch_client_request(boost::intrusive_ptr<MDRequestImpl>&)+0xb5b) [0x5588cc040bbb]
 7: (Server::handle_client_request(boost::intrusive_ptr<MClientRequest const> const&)+0x308) [0x5588cc041048]
 8: (Server::dispatch(boost::intrusive_ptr<Message const> const&)+0x122) [0x5588cc04cb02]
 9: (MDSRank::handle_deferrable_message(boost::intrusive_ptr<Message const> const&)+0x6dc) [0x5588cbfc315c]
 10: (MDSRank::_dispatch(boost::intrusive_ptr<Message const> const&, bool)+0x7fa) [0x5588cbfc55ca]
 11: (MDSRank::retry_dispatch(boost::intrusive_ptr<Message const> const&)+0x12) [0x5588cbfc5c12]
 12: (MDSContext::complete(int)+0x74) [0x5588cc232b14]
 13: (MDSRank::_advance_queues()+0xa4) [0x5588cbfc4634]
 14: (MDSRank::_dispatch(boost::intrusive_ptr<Message const> const&, bool)+0x1d8) [0x5588cbfc4fa8]
 15: (MDSRankDispatcher::ms_dispatch(boost::intrusive_ptr<Message const> const&)+0x40) [0x5588cbfc5b50]
 16: (MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x108) [0x5588cbfb3078]
 17: (DispatchQueue::entry()+0x1709) [0x7f65883819d9]
 18: (DispatchQueue::DispatchThread::entry()+0xd) [0x7f658842e9cd]
 19: (()+0x7e65) [0x7f6586018e65]
 20: (clone()+0x6d) [0x7f6584cc688d]

 0> 2021-08-16 15:14:11.441 7f657ec45700 -1 *** Caught signal (Aborted) **
 in thread 7f657ec45700 thread_name:ms_dispatch

ceph df
RAW STORAGE:
    CLASS     SIZE        AVAIL       USED        RAW USED     %RAW USED
    ssd       8.7 TiB     5.0 TiB     3.8 TiB     3.8 TiB          43.29
    TOTAL     8.7 TiB     5.0 TiB     3.8 TiB     3.8 TiB          43.29

POOLS:
    POOL                          ID     STORED      OBJECTS     USED        %USED     MAX AVAIL
    cephfs_data                    1     246 GiB     591.31k     499 GiB     13.36       1.6 TiB
    cephfs_metadata                2     1.5 TiB     561.84k     3.0 TiB     48.69       1.6 TiB
    default.rgw.meta               3         0 B           0         0 B         0       1.6 TiB
    .rgw.root                      4     3.5 KiB           8     256 KiB         0       1.6 TiB
    default.rgw.buckets.index      5         0 B           0         0 B         0       1.6 TiB
    default.rgw.control            6         0 B           8         0 B         0       1.6 TiB
    default.rgw.buckets.data       7         0 B           0         0 B         0       1.6 TiB
    default.rgw.log                8         0 B         207         0 B         0       1.6 TiB
    volumes                        9     141 GiB      57.69k     282 GiB      8.01       1.6 TiB
    backups                       10         0 B           0         0 B         0       1.6 TiB
    metrics                       11         0 B           0         0 B         0       1.6 TiB
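For context on the failed assert: the check at MDCache.cc:268 rejects adding an inode whose number is already present in the cache's inode map. A minimal, hypothetical C++ sketch of that duplicate-ino check (simplified names such as MDCacheSketch and CInodeStub; not the actual Ceph source):

    #include <cassert>
    #include <cstdint>
    #include <unordered_map>

    struct CInodeStub { uint64_t ino; };

    class MDCacheSketch {
      std::unordered_map<uint64_t, CInodeStub*> inode_map;
    public:
      void add_inode(CInodeStub *in) {
        // operator[] creates a null entry if this ino is not cached yet
        auto &p = inode_map[in->ino];
        assert(!p);   // mirrors "FAILED ceph_assert(!p)": the ino is already in the cache
        p = in;
      }
    };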
Related issues
History
#1 Updated by Patrick Donnelly about 2 years ago
- Status changed from New to Need More Info
Were there any cluster health warnings? Some of the surrounding MDS logs would also be helpful.
#2 Updated by Patrick Donnelly about 2 years ago
- Assignee set to Xiubo Li
#3 Updated by Yael Azulay about 2 years ago
Patrick Donnelly wrote:
Were there any cluster health warnings? Some of the surrounding MDS logs would also be helpful.
Hi Patrick
Under load we see:
1 MDSs behind on trimming
and/or:
1 filesystem is degraded
Also, I want to mention that we enlarged mds_cache_memory_limit to 8 GB (on one of the setups we have now changed it to 16 GB)
and changed mds_min_caps_per_client to 150.
In addition, we now have a setup whose MDS keeps restarting, and I didn't manage to recover it at all, even after trying the Red Hat recovery procedure: http://people.redhat.com/bhubbard/nature/default/cephfs/disaster-recovery-experts/
#4 Updated by Patrick Donnelly about 2 years ago
- Description updated (diff)
#5 Updated by Xiubo Li about 2 years ago
Hi Yael,
Do you have the MDS-side logs? Thanks.
#6 Updated by Yael Azulay about 2 years ago
Xiubo Li wrote:
Hi Yael,
Do you have the MDS-side logs? Thanks.
Hi Xiubo
We reinstalled the setup and unfortunately I can't find the logs from the failure now.
Thanks
#7 Updated by Xiubo Li about 2 years ago
Yael Azulay wrote:
Xiubo Li wrote:
Hi Yael,
Do you have the MDS-side logs? Thanks.
Hi Xiubo
We reinstalled the setup and unfortunately I can't find the logs from the failure now.
Thanks
That's okay, thanks.
I have checked the related MDS code, and I suspect the cephfs_metadata pool was full of journal data. Since you enlarged mds_cache_memory_limit from 1 GB to 8 GB, the journal flush function (which tries to trim the journal logs) is only triggered once the MDCache exceeds the "too full" threshold (8 GB * 95%). In your case I am afraid it was never triggered, because the MDCache never reached that threshold, so the journals kept piling up until they reached 1.5 TB.
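A minimal sketch of the trigger logic described above (illustrative only, assuming a 5% reservation so the "too full" mark sits at 95% of mds_cache_memory_limit; the names and the flush helper are hypothetical, not the Ceph implementation):

    #include <cstdint>

    constexpr uint64_t kCacheMemoryLimit = 8ULL << 30;  // mds_cache_memory_limit = 8 GiB
    constexpr double   kCacheReservation = 0.05;        // 5% headroom -> 95% "too full" mark

    bool cache_too_full(uint64_t cache_usage_bytes) {
      return cache_usage_bytes > kCacheMemoryLimit * (1.0 - kCacheReservation);
    }

    void maybe_flush_journal(uint64_t cache_usage_bytes) {
      if (cache_too_full(cache_usage_bytes)) {
        // flush_and_trim_journal();  // hypothetical helper; only reached under cache pressure
      }
      // otherwise nothing trims here, so with a mostly-idle cache the journal can keep growing
    }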
#8 Updated by Xiubo Li about 2 years ago
Hi Yael,
BTW, did you see any OOM killer messages in the /var/log/message* log files? That would indicate either that the OS ran out of memory or that the MDS process hit its memory limit.
#9 Updated by Xiubo Li about 2 years ago
I found that some memory is not counted towards the MDCache memory usage, such as the inode_map in the MDCache class.
#10 Updated by Xiubo Li about 2 years ago
Hi Yael,
BTW, have you ever changed the "mds_log_segment_size" and "mds_log_events_per_segment" options?
If not, then as long as the MDS can trim the mdlog without getting stuck, the mdlog will usually occupy around 128 * 4 MB == 512 MB of disk space.
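A back-of-envelope check of that estimate, assuming the commonly cited defaults of about 128 trimmable segments at roughly 4 MB each (values not verified against a specific release):

    #include <cstdint>
    #include <cstdio>

    int main() {
      constexpr uint64_t segments      = 128;         // assumed mds_log_max_segments default
      constexpr uint64_t segment_bytes = 4ULL << 20;  // assumed ~4 MiB per log segment
      // 128 * 4 MiB == 512 MiB of journal data under normal trimming
      std::printf("expected mdlog footprint: %llu MiB\n",
                  static_cast<unsigned long long>((segments * segment_bytes) >> 20));
      return 0;
    }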
#11 Updated by Yael Azulay about 2 years ago
@xiubo Li
Thanks much, Li, for your analysis
I didn't change mds_log_segment_size and mds_log_events_per_segment.
To what values should I change them, given that my mds_cache_memory_limit is 8 GB?
At the time this crash happened, mds_min_caps_per_client was enlarged to 150
(Currently we have also enlarged other parameters like mds_cache_trim_threshold, mds_recall_global_max_decay_threshold, mds_recall_max_caps,
mds_recall_max_decay_rate)
- Not sure why my journal was never flushed. What should trigger a journal flush?
- Did you mean there is a bug where inode_map is not counted towards the MDCache memory usage? Could that cause this?
#12 Updated by Xiubo Li about 2 years ago
- Pull request ID set to 43184
#13 Updated by Xiubo Li about 2 years ago
- Status changed from Need More Info to Fix Under Review
Yael Azulay wrote:
@xiubo Li
Thanks much, Li, for your analysis
I didn't change mds_log_segment_size and mds_log_events_per_segment.
To what values should I change them, given that my mds_cache_memory_limit is 8 GB?
I don't think they are related.
At the time this crash happened, mds_min_caps_per_client was enlarged to 150
This should be okay. Won't affect much.
(Currently we have also enlarged other parameters like mds_cache_trim_threshold, mds_recall_global_max_decay_threshold, mds_recall_max_caps,
mds_recall_max_decay_rate)
- Not sure why my journal was never flushed. What should trigger a journal flush?
Since I don't have the MDS logs, I don't know exactly what happened. If mdlog->trim() got stuck, that could cause this, but from the code I still haven't figured out how that could happen.
In theory, mdlog->trim() should be called from mds->tick() every 5 seconds.
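Roughly, that periodic path can be pictured with the following sketch (hypothetical names, not the MDSRank/MDLog code): a tick loop that fires every 5 seconds and trims the journal on every pass, independent of cache pressure:

    #include <atomic>
    #include <chrono>
    #include <thread>

    struct MDLogSketch {
      void trim() { /* drop expired/expirable log segments */ }
    };

    // Hypothetical tick loop: trimming runs on a fixed cadence, so in theory the
    // journal should not grow without bound even when the cache is far from full.
    void tick_loop(MDLogSketch &mdlog, const std::atomic<bool> &stop) {
      while (!stop) {
        mdlog.trim();
        std::this_thread::sleep_for(std::chrono::seconds(5));
      }
    }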
- Did you mean there is a bug where inode_map is not counted towards the MDCache memory usage? Could that cause this?
I am not sure whether it should be counted towards the MDCache memory usage; I will check it further.
I have fixed the crash issue.
#14 Updated by Yael Azulay about 2 years ago
@xiubo Li
Hi Li
Thanks again
- What are the recommended values for mds_log_segment_size and mds_log_events_per_segment?
- I don't understand: is the journal pool supposed to be emptied from time to time?
- What is the fix you made for the crash? In which version is it included?
#15 Updated by Xiubo Li almost 2 years ago
Yael Azulay wrote:
@xiubo Li
Hi Li
Thanks again
- What are the recommended values for mds_log_segment_size and mds_log_events_per_segment?
There are no recommended values for these two options other than the defaults; it depends on the capacity of the disk or pool you use. In most cases the defaults should be okay.
- I don't understand: is the journal pool supposed to be emptied from time to time?
The journal won't be emptied, but it will be trimmed once usage reaches the upper limit. For each MDS the threshold should be 8 GB * 95% in your case.
- What is the fix you made for the crash? In which version is it included?
The fix is about not taking an inode number twice: the first use happens when replaying the journal logs, but that ino was not erased from the ino table, so when a new client request later allocates a new inode number it may take the already-used ino, which triggers the crash. This bug should exist in Nautilus and even earlier versions.
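To make that failure mode concrete, here is a simplified, hypothetical sketch (not the actual InoTable code) of how an inode number that is consumed during journal replay but never erased from the table can be handed out a second time:

    #include <cstdint>
    #include <set>

    struct InoTableSketch {
      std::set<uint64_t> free_inos{1001, 1002, 1003};

      uint64_t allocate() {                  // e.g. a new client openc request
        uint64_t ino = *free_inos.begin();
        free_inos.erase(free_inos.begin());
        return ino;
      }

      void replay_used(uint64_t ino, bool buggy) {
        if (!buggy)
          free_inos.erase(ino);              // correct: a replayed ino is no longer allocatable
        // buggy path: the ino stays in free_inos and can be allocated again later,
        // which leads MDCache::add_inode() to hit ceph_assert(!p) on the duplicate
      }
    };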
I will also check whether the frequent MDS crashes cause the journal logs to be cleaned up or not.
#16 Updated by Xiubo Li almost 2 years ago
- Related to Feature #10764: optimize memory usage of MDSCacheObject added
#17 Updated by Xiubo Li almost 2 years ago
- Related to deleted (Feature #10764: optimize memory usage of MDSCacheObject)
#18 Updated by Xiubo Li almost 2 years ago
- Related to Bug #40002: mds: not trim log under heavy load added
#19 Updated by Xiubo Li almost 2 years ago
ceph df
RAW STORAGE:
    CLASS     SIZE        AVAIL       USED        RAW USED     %RAW USED
    ssd       8.7 TiB     5.0 TiB     3.8 TiB     3.8 TiB          43.29
    TOTAL     8.7 TiB     5.0 TiB     3.8 TiB     3.8 TiB          43.29

POOLS:
    POOL                          ID     STORED      OBJECTS     USED        %USED     MAX AVAIL
    cephfs_data                    1     246 GiB     591.31k     499 GiB     13.36       1.6 TiB
    cephfs_metadata                2     1.5 TiB     561.84k     3.0 TiB     48.69       1.6 TiB
    default.rgw.meta               3         0 B           0         0 B         0       1.6 TiB
    .rgw.root                      4     3.5 KiB           8     256 KiB         0       1.6 TiB
    default.rgw.buckets.index      5         0 B           0         0 B         0       1.6 TiB
    default.rgw.control            6         0 B           8         0 B         0       1.6 TiB
    default.rgw.buckets.data       7         0 B           0         0 B         0       1.6 TiB
    default.rgw.log                8         0 B         207         0 B         0       1.6 TiB
    volumes                        9     141 GiB      57.69k     282 GiB      8.01       1.6 TiB
    backups                       10         0 B           0         0 B         0       1.6 TiB
    metrics                       11         0 B           0         0 B         0       1.6 TiB
For this issue, it's the same as https://tracker.ceph.com/issues/40002, which was caused by the submit_mutex unfairness problem. It's also similar to https://tracker.ceph.com/issues/51722.
#20 Updated by Yael Azulay almost 2 years ago
Hi Xiubo Li
Thanks for the information.
The problem happened to us on Nautilus 14.2.7.
Will the fixes be included in this version?
#21 Updated by Xiubo Li almost 2 years ago
- Related to Bug #53542: Ceph Metadata Pool disk throughput usage increasing added
#22 Updated by Venky Shankar 4 months ago
- Category set to Correctness/Safety
- Status changed from Fix Under Review to Pending Backport
- Target version set to v19.0.0
- Source set to Community (user)
- Backport set to reef,quincy,pacific
#23 Updated by Backport Bot 4 months ago
- Copied to Backport #59706: pacific: Mds crash and fails with assert on prepare_new_inode added
#24 Updated by Backport Bot 4 months ago
- Copied to Backport #59707: quincy: Mds crash and fails with assert on prepare_new_inode added
#25 Updated by Backport Bot 4 months ago
- Copied to Backport #59708: reef: Mds crash and fails with assert on prepare_new_inode added
#26 Updated by Backport Bot 4 months ago
- Tags set to backport_processed
#27 Updated by Xiubo Li about 1 month ago
- Status changed from Pending Backport to Resolved