Project

General

Profile

Bug #37723

mds: stopping MDS with a large cache (40+GB) causes it to miss heartbeats

Added by Patrick Donnelly 10 months ago. Updated 9 months ago.

Status:
Pending Backport
Priority:
High
Category:
Performance/Resource Usage
Target version:
Start date:
12/20/2018
Due date:
% Done:

0%

Source:
Development
Tags:
Backport:
mimic,luminous
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
Pull request ID:
Crash signature:

Description

2018-12-20 15:11:15.450875 7f5c3d7ac700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2018-12-20 15:11:15.450913 7f5c3d7ac700  0 mds.beacon.li638-125 Skipping beacon heartbeat to monitors (last acked 261.523s ago); MDS internal heartbeat is not healthy!
2018-12-20 15:11:16.450675 7f5c417b4700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2018-12-20 15:11:16.451795 7f5c3e7ae700  1 -- 192.168.198.247:6800/1087296702 --> 192.168.214.32:6800/2326 -- mgrreport(unknown.li638-125 +0-0 packed 1366) v5 -- 0x55cab2fce000 con 0
2018-12-20 15:11:19.451064 7f5c3d7ac700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2018-12-20 15:11:19.451105 7f5c3d7ac700  0 mds.beacon.li638-125 Skipping beacon heartbeat to monitors (last acked 265.523s ago); MDS internal heartbeat is not healthy!
2018-12-20 15:11:21.450857 7f5c417b4700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2018-12-20 15:11:21.452130 7f5c3e7ae700  1 -- 192.168.198.247:6800/1087296702 --> 192.168.214.32:6800/2326 -- mgrreport(unknown.li638-125 +0-0 packed 1366) v5 -- 0x55cb97fc7080 con 0
2018-12-20 15:11:23.451262 7f5c3d7ac700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2018-12-20 15:11:23.451306 7f5c3d7ac700  0 mds.beacon.li638-125 Skipping beacon heartbeat to monitors (last acked 269.524s ago); MDS internal heartbeat is not healthy!
2018-12-20 15:11:26.451050 7f5c417b4700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2018-12-20 15:11:26.452546 7f5c3e7ae700  1 -- 192.168.198.247:6800/1087296702 --> 192.168.214.32:6800/2326 -- mgrreport(unknown.li638-125 +0-0 packed 1366) v5 -- 0x55d675af0580 con 0
2018-12-20 15:11:27.451471 7f5c3d7ac700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2018-12-20 15:11:27.451538 7f5c3d7ac700  0 mds.beacon.li638-125 Skipping beacon heartbeat to monitors (last acked 273.524s ago); MDS internal heartbeat is not healthy!
2018-12-20 15:11:31.451186 7f5c417b4700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2018-12-20 15:11:31.451662 7f5c3d7ac700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2018-12-20 15:11:31.451673 7f5c3d7ac700  0 mds.beacon.li638-125 Skipping beacon heartbeat to monitors (last acked 277.525s ago); MDS internal heartbeat is not healthy!
2018-12-20 15:11:31.452871 7f5c3e7ae700  1 -- 192.168.198.247:6800/1087296702 --> 192.168.214.32:6800/2326 -- mgrreport(unknown.li638-125 +0-0 packed 1366) v5 -- 0x55d67bf462c0 con 0
2018-12-20 15:11:35.451783 7f5c3d7ac700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2018-12-20 15:11:35.451814 7f5c3d7ac700  0 mds.beacon.li638-125 Skipping beacon heartbeat to monitors (last acked 281.525s ago); MDS internal heartbeat is not healthy!

Inspection via gdb shows the MDS is trimming its cache. (No outstanding caps because all clients were unmounted.)

Thread 10 (Thread 0x7f5c3dfad700 (LWP 14646)):
#0  _S_equals (__eq=..., __extract=..., __n=0x55cc0b879a60, __c=8939229977632522906, __k=...) at /usr/include/c++/4.8.2/bits/hashtable_policy.h:1166
#1  _M_equals (this=0x55c8b75a80b8, __n=0x55cc0b879a60, __c=8939229977632522906, __k=...) at /usr/include/c++/4.8.2/bits/hashtable_policy.h:1470
#2  std::_Hashtable<inodeno_t, std::pair<inodeno_t const, CInode*>, std::allocator<std::pair<inodeno_t const, CInode*> >, std::__detail::_Select1st, std::equal_to<inodeno_t>, std::hash<inodeno_t>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_find_before_node (this=this@entry=0x55c8b75a80b8,
    __n=__n@entry=13332437, __k=..., __code=__code@entry=8939229977632522906) at /usr/include/c++/4.8.2/bits/hashtable.h:1165
#3  0x000055c8ad5907c8 in _M_erase (__k=..., this=0x55c8b75a80b8) at /usr/include/c++/4.8.2/bits/hashtable.h:1517
#4  erase (__k=..., this=0x55c8b75a80b8) at /usr/include/c++/4.8.2/bits/hashtable.h:685
#5  erase (__x=..., this=0x55c8b75a80b8) at /usr/include/c++/4.8.2/bits/unordered_map.h:470
#6  MDCache::remove_inode (this=this@entry=0x55c8b75a8000, o=o@entry=0x55cf40267800) at /usr/src/debug/ceph-12.2.10-141-g55a0482/src/mds/MDCache.cc:338
#7  0x000055c8ad59b81d in MDCache::trim_inode (this=this@entry=0x55c8b75a8000, dn=dn@entry=0x55cf3661c780, in=0x55cf40267800, con=con@entry=0x55ce600eee00, expiremap=std::map with 0 elements)
    at /usr/src/debug/ceph-12.2.10-141-g55a0482/src/mds/MDCache.cc:6877
#8  0x000055c8ad59bfb3 in MDCache::trim_dentry (this=this@entry=0x55c8b75a8000, dn=0x55cf3661c780, expiremap=std::map with 0 elements) at /usr/src/debug/ceph-12.2.10-141-g55a0482/src/mds/MDCache.cc:6714
#9  0x000055c8ad59d6f4 in MDCache::trim_lru (this=this@entry=0x55c8b75a8000, count=18446744073697189344, count@entry=18446744073709551615, expiremap=std::map with 0 elements)
    at /usr/src/debug/ceph-12.2.10-141-g55a0482/src/mds/MDCache.cc:6518
#10 0x000055c8ad59dfe1 in MDCache::trim (this=this@entry=0x55c8b75a8000, count=count@entry=18446744073709551615) at /usr/src/debug/ceph-12.2.10-141-g55a0482/src/mds/MDCache.cc:6554
#11 0x000055c8ad5b8fc4 in MDCache::shutdown_pass (this=0x55c8b75a8000) at /usr/src/debug/ceph-12.2.10-141-g55a0482/src/mds/MDCache.cc:7590
#12 0x000055c8ad4d69e8 in MDSRankDispatcher::tick (this=0x55c8b7465000) at /usr/src/debug/ceph-12.2.10-141-g55a0482/src/mds/MDSRank.cc:303
#13 0x000055c8ad4c575a in operator() (a0=<optimized out>, this=<optimized out>) at /usr/src/debug/ceph-12.2.10-141-g55a0482/build/boost/include/boost/function/function_template.hpp:760
#14 FunctionContext::finish (this=<optimized out>, r=<optimized out>) at /usr/src/debug/ceph-12.2.10-141-g55a0482/src/include/Context.h:493
#15 0x000055c8ad4c2be9 in Context::complete (this=0x55cc778818c0, r=<optimized out>) at /usr/src/debug/ceph-12.2.10-141-g55a0482/src/include/Context.h:70
#16 0x000055c8ad8050a4 in SafeTimer::timer_thread (this=0x55c8b740e078) at /usr/src/debug/ceph-12.2.10-141-g55a0482/src/common/Timer.cc:97
#17 0x000055c8ad806acd in SafeTimerThread::entry (this=<optimized out>) at /usr/src/debug/ceph-12.2.10-141-g55a0482/src/common/Timer.cc:30
#18 0x00007f5c457fbdd5 in start_thread () from /lib64/libpthread.so.0
#19 0x00007f5c448d8ead in clone () from /lib64/libc.so.6

The fix for this should be part of a broader fix to make the MDS only shrink its cache gradually (e.g. if the operator reduces mds_cache_memory_limit).


Related issues

Copied to fs - Backport #38131: mimic: mds: stopping MDS with a large cache (40+GB) causes it to miss heartbeats In Progress
Copied to fs - Backport #38132: luminous: mds: stopping MDS with a large cache (40+GB) causes it to miss heartbeats Resolved

History

#1 Updated by Patrick Donnelly 9 months ago

  • Status changed from New to In Progress

#2 Updated by Patrick Donnelly 9 months ago

  • Category set to Performance/Resource Usage
  • Pull request ID set to 26038

#3 Updated by Patrick Donnelly 9 months ago

  • Status changed from In Progress to Pending Backport

#4 Updated by Patrick Donnelly 9 months ago

  • Copied to Backport #38131: mimic: mds: stopping MDS with a large cache (40+GB) causes it to miss heartbeats added

#5 Updated by Patrick Donnelly 9 months ago

  • Copied to Backport #38132: luminous: mds: stopping MDS with a large cache (40+GB) causes it to miss heartbeats added

Also available in: Atom PDF