Bug #23503
mds: crash during pressure test
Description
ceph version: 12.2.4
10 MDS daemons: 9 active + 1 standby
directory fragmentation disabled
We created 9 directories and pinned each one to an active MDS (one directory per active MDS). Then we ran our script in each directory (decompressing an archive containing 100,000 small files into different subdirectories).
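For context, a directory is pinned to an MDS rank by setting the `ceph.dir.pin` extended attribute on it. A minimal sketch of the kind of setup described above — the mount point and directory names here are hypothetical, not the reporter's actual paths:

```shell
# Pin nine test directories to the nine active MDS ranks (0-8).
# Requires a mounted CephFS; /mnt/cephfs and dir_$rank are placeholders.
for rank in $(seq 0 8); do
    mkdir -p /mnt/cephfs/dir_$rank
    setfattr -n ceph.dir.pin -v "$rank" /mnt/cephfs/dir_$rank
done
```

The test workload (decompressing many small files) was then run inside each pinned directory.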
mds.A crash log:
2018-03-29 10:11:19.451099 7faf33d69700 -1 /build/ceph-12.2.4/src/mds/MDCache.cc: In function 'MDRequestRef MDCache::request_get(metareqid_t)' thread 7faf33d69700 time 2018-03-29 10:11:19.439198
/build/ceph-12.2.4/src/mds/MDCache.cc: 9043: FAILED assert(p != active_requests.end())
ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x555e64b178d2]
2: (MDCache::request_get(metareqid_t)+0x24f) [0x555e648c735f]
3: (Server::handle_slave_request_reply(MMDSSlaveRequest*)+0x2ca) [0x555e6487d9ea]
4: (Server::handle_slave_request(MMDSSlaveRequest*)+0x94f) [0x555e6487f01f]
5: (Server::dispatch(Message*)+0x383) [0x555e6487faa3]
6: (MDSRank::handle_deferrable_message(Message*)+0x7fc) [0x555e647f510c]
7: (MDSRank::_dispatch(Message*, bool)+0x1db) [0x555e6480258b]
8: (MDSRankDispatcher::ms_dispatch(Message*)+0x15) [0x555e64803355]
9: (MDSDaemon::ms_dispatch(Message*)+0xf3) [0x555e647ecb13]
10: (DispatchQueue::entry()+0x7ca) [0x555e64e16eda]
11: (DispatchQueue::DispatchThread::entry()+0xd) [0x555e64b9c5ad]
12: (()+0x8064) [0x7faf38b41064]
13: (clone()+0x6d) [0x7faf37c2c62d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Before the crash, we observed subdirectory migration:
mds.1.migrator nicely exporting to mds.0 [dir 0x20013442763 /tmp/n20-064-085/n20-064-085_9275/
......
mds.1.migrator nicely exporting to mds.0 [dir 0x200134734a7 /tmp/n20-064-085/n20-064-085_9274/
The 'base' directory, such as n20-064-085, is pinned; however, its subdirectories can still be migrated to other ranks. Is this expected behavior? Can we disable migration completely?
Migration does not seem stable enough; it is very easy to stall the whole filesystem. I first tested without pinning, then chose to pin directories so that I could use multiple active MDSs.
Related issues
History
#1 Updated by wei jin about 6 years ago
After the crash, the standby MDS took over; however, we observed another crash:
2018-03-29 10:25:04.719502 7f5ae5ad2700 -1 /build/ceph-12.2.4/src/mds/MDCache.cc: In function 'void MDCache::handle_cache_rejoin_ack(MMDSCacheRejoin*)' thread 7f5ae5ad2700 time 2018-03-29 10:25:04.716917
/build/ceph-12.2.4/src/mds/MDCache.cc: 5087: FAILED assert(session)
ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x55ba1428d8d2]
2: (MDCache::handle_cache_rejoin_ack(MMDSCacheRejoin*)+0x2422) [0x55ba14071542]
3: (MDCache::handle_cache_rejoin(MMDSCacheRejoin*)+0x233) [0x55ba1407def3]
4: (MDCache::dispatch(Message*)+0xa5) [0x55ba1407e045]
5: (MDSRank::handle_deferrable_message(Message*)+0x5bc) [0x55ba13f6aecc]
6: (MDSRank::_dispatch(Message*, bool)+0x1db) [0x55ba13f7858b]
7: (MDSRankDispatcher::ms_dispatch(Message*)+0x15) [0x55ba13f79355]
8: (MDSDaemon::ms_dispatch(Message*)+0xf3) [0x55ba13f62b13]
9: (DispatchQueue::entry()+0x7ca) [0x55ba1458ceda]
10: (DispatchQueue::DispatchThread::entry()+0xd) [0x55ba143125ad]
11: (()+0x8064) [0x7f5aea8aa064]
12: (clone()+0x6d) [0x7f5ae999562d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
#2 Updated by Patrick Donnelly almost 6 years ago
- Subject changed from luminous: mds crash during pressure test to mds: crash during pressure test
- Status changed from New to Duplicate
#3 Updated by Patrick Donnelly almost 6 years ago
- Duplicates Bug #23059: mds: FAILED assert (p != active_requests.end()) in MDRequestRef MDCache::request_get(metareqid_t) added
#4 Updated by Patrick Donnelly almost 6 years ago
wei jin wrote:
After the crash, the standby MDS took over; however, we observed another crash:
This smells like a different bug. Please open a separate issue.
#5 Updated by wei jin almost 6 years ago
Patrick Donnelly wrote:
wei jin wrote:
After the crash, the standby MDS took over; however, we observed another crash:
This smells like a different bug. Please open a separate issue.
Done. https://tracker.ceph.com/issues/23518
Hi, Patrick, I have a question: after pinning a base directory, will subdirectories still be migrated to other active MDSs under heavy load?
#6 Updated by Patrick Donnelly almost 6 years ago
wei jin wrote:
Hi, Patrick, I have a question: after pinning a base directory, will subdirectories still be migrated to other active MDSs under heavy load?
No. Export pins are applied recursively.
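The recursive behavior Patrick describes can be illustrated with the `ceph.dir.pin` attribute: a pin on a parent covers the whole subtree until a descendant sets its own pin, and the special value `-1` removes a pin. Paths and ranks below are hypothetical:

```shell
# Pin /mnt/cephfs/base (and, recursively, everything under it) to rank 2.
setfattr -n ceph.dir.pin -v 2 /mnt/cephfs/base

# A subdirectory may override the inherited pin with a different rank...
setfattr -n ceph.dir.pin -v 0 /mnt/cephfs/base/special

# ...and setting -1 clears the override, so the parent's pin applies again.
setfattr -n ceph.dir.pin -v -1 /mnt/cephfs/base/special
```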
#7 Updated by wei jin almost 6 years ago
Patrick Donnelly wrote:
wei jin wrote:
Hi, Patrick, I have a question: after pinning a base directory, will subdirectories still be migrated to other active MDSs under heavy load?
No. Export pins are applied recursively.
Thanks. I saw your mail on the mailing list, which mentioned the patch https://github.com/ceph/ceph/pull/19220/commits/fb7a4cf2aaf68dc5e16733d8daf2e1bf716f183a.
It seems to be just a logging fix.
#8 Updated by Zheng Yan almost 6 years ago
- Related to Bug #23518: mds: crash when failover added