Bug #41066: mds: skip trim mds cache if mdcache is not opened - CephFS - Ceph

Added by Zhi Zhang over 4 years ago. Updated over 4 years ago.

Status: Closed

Priority: Normal

Assignee:

Category: Correctness/Safety

Target version:

% Done:

Source: Community (dev)

Tags:

Backport:

Regression:

Severity: 3 - minor

Reviewed:

Affected Versions:

ceph-qa-suite:

Component(FS): MDS

Labels (FS):

Pull request ID: 29481

Crash signature (v1):

Crash signature (v2):
Description

```
2019-07-24 14:51:28.028198 7f6dc2543700 1 mds.0.940446 active_start
2019-07-24 14:51:39.452890 7f6dc2543700 1 mds.0.940446 cluster recovered.
2019-07-24 14:51:39.462159 7f6dc2543700 1 mds.docker-xxx Updating MDS map to version 940473 from mon.2
2019-07-24 14:51:39.474304 7f6dbfd3e700 0 mds.0.cache.dir(0x606) remove_dentry [dentry #0x100/stray6/1009f12e802 [2,head] auth NULL (dversion lock) v=3041765611 inode=0 state=1073741824 0x7f70098a10e0] elist item still on list: item_stray:1 item_dirty:0 item_dir_dirty:0
2019-07-24 14:51:39.477246 7f6dbfd3e700 -1 /data/build_ceph/ceph-build-luminous/BUILD/ceph-12.2.8-247-gafc50e0a32/src/include/elist.h: In function 'elist<T>::item::~item() [with T = CDentry*]' thread 7f6dbfd3e700 time 2019-07-24 14:51:39.474321
/data/build_ceph/ceph-build-luminous/BUILD/ceph-12.2.8-247-gafc50e0a32/src/include/elist.h: 39: FAILED assert(!is_on_list())

ceph version 12.2.8-247-gafc50e0a32 (afc50e0a327ac83baafe8af20e7ab628fdedb9f6) luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x7f6dcaa7e240]
 2: (CDentry::~CDentry()+0x491) [0x7f6dca930351]
 3: (CDentry::~CDentry()+0x9) [0x7f6dca930379]
 4: (CDir::remove_dentry(CDentry*)+0x30c) [0x7f6dca93acdc]
 5: (MDCache::trim_dentry(CDentry*, std::map<int, MCacheExpire*, std::less<int>, std::allocator<std::pair<int const, MCacheExpire*> > >&)+0xfa) [0x7f6dca813d8a]
 6: (MDCache::trim_lru(unsigned long, std::map<int, MCacheExpire*, std::less<int>, std::allocator<std::pair<int const, MCacheExpire*> > >&)+0x660) [0x7f6dca866610]
 7: (MDCache::trim(unsigned long)+0x27a) [0x7f6dca868f4a]
 8: (MDSRankDispatcher::tick()+0xe8) [0x7f6dca7441d8]
 9: (FunctionContext::finish(int)+0x2a) [0x7f6dca7336fa]
 10: (Context::complete(int)+0x9) [0x7f6dca730be9]
 11: (SafeTimer::timer_thread()+0x104) [0x7f6dcaa7aa74]
 12: (SafeTimerThread::entry()+0xd) [0x7f6dcaa7c49d]
 13: (()+0x7dc5) [0x7f6dc855adc5]
 14: (clone()+0x6d) [0x7f6dc764074d]
 NOTE: a copy of the executable, or objdump -rdS <executable> is needed to interpret this.

```

This crash happened only a few times, on our clusters under very heavy load. We added some of the logging shown above and found that the deleted dentry was still on the stray list (item_stray:1) when it was being destroyed.
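To illustrate the invariant behind the failed assert, here is a minimal sketch of an intrusive list item that must be unlinked before its owner is destroyed. `ListItem`, `ToyDentry` and `safe_to_destroy()` are hypothetical names for illustration only and are not Ceph's `elist` or `CDentry`; the real code asserts in the destructor, while the sketch exposes the same condition as a predicate so it can be checked without aborting.

```cpp
#include <cassert>

// Hypothetical, simplified intrusive list node (not Ceph's elist).
// An unlinked node points at itself in both directions.
struct ListItem {
  ListItem* prev = this;
  ListItem* next = this;

  ListItem() = default;
  ListItem(const ListItem&) = delete;
  ListItem& operator=(const ListItem&) = delete;

  bool is_on_list() const { return next != this; }

  // Link this node directly after `head`, unlinking it first if needed.
  void insert_after(ListItem& head) {
    remove();
    prev = &head;
    next = head.next;
    head.next->prev = this;
    head.next = this;
  }

  // Unlink this node; harmless if it is already unlinked.
  void remove() {
    prev->next = next;
    next->prev = prev;
    prev = next = this;
  }
};

// Hypothetical stand-in for a dentry that can sit on a stray list.
struct ToyDentry {
  ListItem item_stray;

  // Mirrors the condition of the failed assert: the item must be
  // off every list before the object may be destroyed.
  bool safe_to_destroy() const { return !item_stray.is_on_list(); }
};
```

In this model, a `ToyDentry` linked onto a stray list reports `safe_to_destroy() == false` until `item_stray.remove()` is called, which corresponds to the `item_stray:1` state in the log line above.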

This crash always happened right after the MDS became active. Under very heavy load the MDS cache might not yet have been opened at that point, so the stray manager had not been started either and handling of stray dentries would be delayed.
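The guard named in the title can be sketched roughly as follows. This is a toy model under stated assumptions, not the change from the pull request: `ToyCache`, `open()`, `is_open()`, `add_entries()` and `tick()` are hypothetical names, and the only point is that the periodic tick does no trimming until the cache reports itself open.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical toy cache; it only tracks an entry count and an "opened" flag.
class ToyCache {
public:
  void open() { opened_ = true; }
  bool is_open() const { return opened_; }

  void add_entries(std::size_t n) { entries_ += n; }
  std::size_t size() const { return entries_; }

  // Trim down to at most `limit` entries; returns how many were removed.
  std::size_t trim(std::size_t limit) {
    if (entries_ <= limit) {
      return 0;
    }
    const std::size_t removed = entries_ - limit;
    entries_ = limit;
    return removed;
  }

private:
  bool opened_ = false;
  std::size_t entries_ = 0;
};

// Periodic tick (hypothetical): skip trimming while the cache is not open,
// so nothing is removed before dependent components have started.
inline std::size_t tick(ToyCache& cache, std::size_t limit) {
  if (!cache.is_open()) {
    return 0;
  }
  return cache.trim(limit);
}
```

With ten entries and a limit of four, `tick()` removes nothing while the cache is closed and removes six entries once `open()` has been called.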