Bug #41066

mds: skip trim mds cache if mdcache is not opened

Added by Zhi Zhang 7 months ago. Updated 7 months ago.

Status: Closed
Priority: Normal
Assignee: -
Category: Correctness/Safety
Target version: -
% Done: 0%
Source: Community (dev)
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDS
Labels (FS):
Pull request ID:
Crash signature:

Description

```
2019-07-24 14:51:28.028198 7f6dc2543700 1 mds.0.940446 active_start
2019-07-24 14:51:39.452890 7f6dc2543700 1 mds.0.940446 cluster recovered.
2019-07-24 14:51:39.462159 7f6dc2543700 1 mds.docker-xxx Updating MDS map to version 940473 from mon.2
2019-07-24 14:51:39.474304 7f6dbfd3e700 0 mds.0.cache.dir(0x606) remove_dentry [dentry #0x100/stray6/1009f12e802 [2,head] auth NULL (dversion lock) v=3041765611 inode=0 state=1073741824 0x7f70098a10e0] elist item still on list: item_stray:1 item_dirty:0 item_dir_dirty:0
2019-07-24 14:51:39.477246 7f6dbfd3e700 -1 /data/build_ceph/ceph-build-luminous/BUILD/ceph-12.2.8-247-gafc50e0a32/src/include/elist.h: In function 'elist<T>::item::~item() [with T = CDentry*]' thread 7f6dbfd3e700 time 2019-07-24 14:51:39.474321
/data/build_ceph/ceph-build-luminous/BUILD/ceph-12.2.8-247-gafc50e0a32/src/include/elist.h: 39: FAILED assert(!is_on_list())

ceph version 12.2.8-247-gafc50e0a32 (afc50e0a327ac83baafe8af20e7ab628fdedb9f6) luminous (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x7f6dcaa7e240]
2: (CDentry::~CDentry()+0x491) [0x7f6dca930351]
3: (CDentry::~CDentry()+0x9) [0x7f6dca930379]
4: (CDir::remove_dentry(CDentry*)+0x30c) [0x7f6dca93acdc]
5: (MDCache::trim_dentry(CDentry*, std::map<int, MCacheExpire*, std::less<int>, std::allocator<std::pair<int const, MCacheExpire*> > >&)+0xfa) [0x7f6dca813d8a]
6: (MDCache::trim_lru(unsigned long, std::map<int, MCacheExpire*, std::less<int>, std::allocator<std::pair<int const, MCacheExpire*> > >&)+0x660) [0x7f6dca866610]
7: (MDCache::trim(unsigned long)+0x27a) [0x7f6dca868f4a]
8: (MDSRankDispatcher::tick()+0xe8) [0x7f6dca7441d8]
9: (FunctionContext::finish(int)+0x2a) [0x7f6dca7336fa]
10: (Context::complete(int)+0x9) [0x7f6dca730be9]
11: (SafeTimer::timer_thread()+0x104) [0x7f6dcaa7aa74]
12: (SafeTimerThread::entry()+0xd) [0x7f6dcaa7c49d]
13: (()+0x7dc5) [0x7f6dc855adc5]
14: (clone()+0x6d) [0x7f6dc764074d]
NOTE: a copy of the executable, or objdump -rdS <executable> is needed to interpret this.

```

This crash happened only a few times, on our clusters under very heavy load. We added the extra logging shown above (the remove_dentry line) and found that the dentry being deleted was still on the stray list (item_stray:1) when it was destroyed.
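
To illustrate the invariant behind the failed assert, here is a minimal sketch of an intrusive list item whose destructor checks that it has already been unlinked. It is only loosely modeled on the idea of `elist<T>::item`; the `Item` type and its member names are hypothetical and are not copied from Ceph's `elist.h`.

```cpp
#include <cassert>

// Hypothetical, simplified intrusive list node (not Ceph's elist<T>::item).
// An item that is not linked anywhere points at itself.
struct Item {
    Item* prev;
    Item* next;

    Item() : prev(this), next(this) {}
    Item(const Item&) = delete;
    Item& operator=(const Item&) = delete;

    // True while the item is linked into some list.
    bool is_on_list() const { return prev != this; }

    // Unlink the item; a no-op if it is not on a list.
    void remove_myself() {
        if (next == this) return;
        prev->next = next;
        next->prev = prev;
        prev = next = this;
    }

    // Link the item right after `other` (for example, a list head).
    void insert_after(Item& other) {
        remove_myself();
        prev = &other;
        next = other.next;
        other.next->prev = this;
        other.next = this;
    }

    // Destroying a still-linked item would leave dangling pointers in the
    // list, so the destructor insists that the owner unlinked it first.
    ~Item() { assert(!is_on_list()); }
};
```

In this sketch, an owner that destroys an `Item` without calling `remove_myself()` first trips the destructor's assert, which is the same shape of failure as the `FAILED assert(!is_on_list())` in the log.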

The crash always happened right after the MDS became active. Under very heavy load the MDS cache might not yet be opened at that point, so the stray manager had not been started either and the handling of stray dentries was still being delayed.
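
The guard named in the issue title can be sketched with a toy model. This is an assumption-laden illustration, not the actual patch: `ToyCache`, `ToyDentry`, `open()` and `trim()` are hypothetical names, and the idea that opening the cache clears the delayed strays is a deliberate simplification of what the stray manager does.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a dentry; `on_stray_list` mimics the
// item_stray flag reported in the log.
struct ToyDentry {
    bool on_stray_list = false;
};

// Hypothetical stand-in for the MDS cache (not Ceph's MDCache).
class ToyCache {
public:
    void add(bool stray) { lru_.push_back(ToyDentry{stray}); }

    // Simplified model: opening the cache starts stray handling, which
    // takes the delayed dentries off the stray list.
    void open() {
        opened_ = true;
        for (auto& dn : lru_) dn.on_stray_list = false;
    }

    bool is_open() const { return opened_; }
    std::size_t size() const { return lru_.size(); }

    // Trim the cache down to `max_entries`; returns how many were dropped.
    std::size_t trim(std::size_t max_entries) {
        // The guard: do nothing until the cache has been opened.
        if (!opened_) return 0;

        std::size_t trimmed = 0;
        while (lru_.size() > max_entries) {
            // Mirrors the invariant that failed in the crash: a dentry
            // must be off the stray list before it is destroyed.
            assert(!lru_.back().on_stray_list);
            lru_.pop_back();
            ++trimmed;
        }
        return trimmed;
    }

private:
    bool opened_ = false;
    std::vector<ToyDentry> lru_;
};
```

With the early return in place, a periodic `trim()` call that fires before `open()` leaves the entries alone, so the invariant check is never reached while dentries are still on the stray list.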

History

#1 Updated by Zhi Zhang 7 months ago

  • Pull request ID set to 29481

#2 Updated by Zhi Zhang 7 months ago

  • Status changed from New to Closed
