Bug #23503

mds: crash during pressure test

Added by wei jin almost 6 years ago. Updated almost 6 years ago.

Status:
Duplicate
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

ceph version: 12.2.4
10 MDS daemons: 9 active + 1 standby
directory fragmentation disabled

We created 9 directories and pinned each of them to an active MDS (one directory per active MDS). Then we ran our script in each directory (it decompresses an archive of 100,000 small files into different subdirs).
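
Roughly, the setup looks like the following (the fs name "cephfs", the mount point /mnt/cephfs, and the archive name are placeholders rather than our real ones):

# allow 9 active MDS ranks; the 10th daemon stays standby
# (on luminous, "ceph fs set cephfs allow_multimds true" may be required first)
ceph fs set cephfs max_mds 9

# pin each top-level directory to its own rank via the ceph.dir.pin vxattr
for rank in $(seq 0 8); do
    mkdir -p /mnt/cephfs/dir_${rank}
    setfattr -n ceph.dir.pin -v ${rank} /mnt/cephfs/dir_${rank}
done

# workload: unpack an archive of ~100,000 small files into each pinned directory
for rank in $(seq 0 8); do
    tar xf smallfiles.tar -C /mnt/cephfs/dir_${rank} &
done
wait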

mds.A crash log:

2018-03-29 10:11:19.451099 7faf33d69700 -1 /build/ceph-12.2.4/src/mds/MDCache.cc: In function 'MDRequestRef MDCache::request_get(metareqid_t)' thread 7faf33d69700 time 2018-03-29 10:11:19.439198
/build/ceph-12.2.4/src/mds/MDCache.cc: 9043: FAILED assert(p != active_requests.end())

ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x555e64b178d2]
2: (MDCache::request_get(metareqid_t)+0x24f) [0x555e648c735f]
3: (Server::handle_slave_request_reply(MMDSSlaveRequest*)+0x2ca) [0x555e6487d9ea]
4: (Server::handle_slave_request(MMDSSlaveRequest*)+0x94f) [0x555e6487f01f]
5: (Server::dispatch(Message*)+0x383) [0x555e6487faa3]
6: (MDSRank::handle_deferrable_message(Message*)+0x7fc) [0x555e647f510c]
7: (MDSRank::_dispatch(Message*, bool)+0x1db) [0x555e6480258b]
8: (MDSRankDispatcher::ms_dispatch(Message*)+0x15) [0x555e64803355]
9: (MDSDaemon::ms_dispatch(Message*)+0xf3) [0x555e647ecb13]
10: (DispatchQueue::entry()+0x7ca) [0x555e64e16eda]
11: (DispatchQueue::DispatchThread::entry()+0xd) [0x555e64b9c5ad]
12: (()+0x8064) [0x7faf38b41064]
13: (clone()+0x6d) [0x7faf37c2c62d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Before the crash, we observed subdir migration:
mds.1.migrator nicely exporting to mds.0 [dir 0x20013442763 /tmp/n20-064-085/n20-064-085_9275/
......
mds.1.migrator nicely exporting to mds.0 [dir 0x200134734a7 /tmp/n20-064-085/n20-064-085_9274/

The 'base' dir, such as n20-064-085, is pinned; however, its subdirs can still be migrated to other ranks. Is this expected behavior? Can we disable the migration completely?

The migration does not seem stable enough; it is very easy to stall the whole filesystem. I first tested without pinning, then chose to pin the directories so that I could still use multiple active MDSs.
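
For completeness, this is how we check that a pin actually took effect, from the client side and from the MDS side (the mount point below is just an example):

getfattr -n ceph.dir.pin /mnt/cephfs/tmp/n20-064-085
# expected output: ceph.dir.pin="1"  (the rank the subtree is pinned to)

# the MDS's own view of its subtrees, including export pins, via the admin socket
ceph daemon mds.<name> get subtrees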


Related issues

Related to CephFS - Bug #23518: mds: crash when failover Resolved 03/30/2018
Duplicates CephFS - Bug #23059: mds: FAILED assert (p != active_requests.end()) in MDRequestRef MDCache::request_get(metareqid_t) Resolved 02/21/2018

History

#1 Updated by wei jin almost 6 years ago

After the crash, the standby MDS took over; however, we observed another crash:

2018-03-29 10:25:04.719502 7f5ae5ad2700 -1 /build/ceph-12.2.4/src/mds/MDCache.cc: In function 'void MDCache::handle_cache_rejoin_ack(MMDSCacheRejoin*)' thread 7f5ae5ad2700 time 2018-03-29 10:25:04.716917
/build/ceph-12.2.4/src/mds/MDCache.cc: 5087: FAILED assert(session)

ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x55ba1428d8d2]
2: (MDCache::handle_cache_rejoin_ack(MMDSCacheRejoin*)+0x2422) [0x55ba14071542]
3: (MDCache::handle_cache_rejoin(MMDSCacheRejoin*)+0x233) [0x55ba1407def3]
4: (MDCache::dispatch(Message*)+0xa5) [0x55ba1407e045]
5: (MDSRank::handle_deferrable_message(Message*)+0x5bc) [0x55ba13f6aecc]
6: (MDSRank::_dispatch(Message*, bool)+0x1db) [0x55ba13f7858b]
7: (MDSRankDispatcher::ms_dispatch(Message*)+0x15) [0x55ba13f79355]
8: (MDSDaemon::ms_dispatch(Message*)+0xf3) [0x55ba13f62b13]
9: (DispatchQueue::entry()+0x7ca) [0x55ba1458ceda]
10: (DispatchQueue::DispatchThread::entry()+0xd) [0x55ba143125ad]
11: (()+0x8064) [0x7f5aea8aa064]
12: (clone()+0x6d) [0x7f5ae999562d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

#2 Updated by Patrick Donnelly almost 6 years ago

  • Subject changed from luminous: mds crash during pressure test to mds: crash during pressure test
  • Status changed from New to Duplicate

#3 Updated by Patrick Donnelly almost 6 years ago

  • Duplicates Bug #23059: mds: FAILED assert (p != active_requests.end()) in MDRequestRef MDCache::request_get(metareqid_t) added

#4 Updated by Patrick Donnelly almost 6 years ago

wei jin wrote:

After the crash, the standby MDS took over; however, we observed another crash:

This smells like a different bug. Please open a separate issue.

#5 Updated by wei jin almost 6 years ago

Patrick Donnelly wrote:

wei jin wrote:

After the crash, the standby MDS took over; however, we observed another crash:

This smells like a different bug. Please open a separate issue.

Done. https://tracker.ceph.com/issues/23518

Hi Patrick, I have a question: after pinning the base dir, will its subdirs still be migrated to other active MDSs under heavy load?

#6 Updated by Patrick Donnelly almost 6 years ago

wei jin wrote:

Hi Patrick, I have a question: after pinning the base dir, will its subdirs still be migrated to other active MDSs under heavy load?

No. Export pins are applied recursively.
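
Roughly (the paths here are only examples): a pin set on a directory covers its whole subtree until a descendant sets its own pin, and -v -1 removes an explicit pin so the directory follows its closest pinned ancestor again:

setfattr -n ceph.dir.pin -v 2 /mnt/cephfs/base        # base and everything under it -> rank 2
setfattr -n ceph.dir.pin -v 0 /mnt/cephfs/base/hot    # override: this subtree -> rank 0
setfattr -n ceph.dir.pin -v -1 /mnt/cephfs/base/hot   # drop the override; inherit rank 2 again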

#7 Updated by wei jin almost 6 years ago

Patrick Donnelly wrote:

wei jin wrote:

Hi Patrick, I have a question: after pinning the base dir, will its subdirs still be migrated to other active MDSs under heavy load?

No. Export pins are applied recursively.

Thanks. I saw your mail on the mailing list, which mentioned the patch https://github.com/ceph/ceph/pull/19220/commits/fb7a4cf2aaf68dc5e16733d8daf2e1bf716f183a.

It seems that is just a logging issue.

#8 Updated by Zheng Yan almost 6 years ago

  • Related to Bug #23518: mds: crash when failover added
