Bug #53521

closed

mds: heartbeat timeout by _prefetch_dirfrags during up:rejoin

Added by Yongseok Oh over 2 years ago. Updated about 2 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Correctness/Safety
Target version:
% Done: 0%
Source:
Tags:
Backport: pacific
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDS
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

This heartbeat timeout occurs on v14.2.19 while the MDS is in up:rejoin. It may also be reproducible on the latest version.

2021-12-05 20:42:13.472 7f1d2863a700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2021-12-05 20:42:13.472 7f1d2863a700 0 mds.beacon.mds005 Skipping beacon heartbeat to monitors (last acked 3.99999s ago); MDS internal heartbeat is not healthy!
2021-12-05 20:42:13.972 7f1d2863a700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2021-12-05 20:42:13.972 7f1d2863a700 0 mds.beacon.mds005 Skipping beacon heartbeat to monitors (last acked 4.49999s ago); MDS internal heartbeat is not healthy!
2021-12-05 20:42:14.472 7f1d2863a700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2021-12-05 20:42:14.472 7f1d2863a700 0 mds.beacon.mds005 Skipping beacon heartbeat to monitors (last acked 4.99999s ago); MDS internal heartbeat is not healthy!
2021-12-05 20:42:14.972 7f1d2863a700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2021-12-05 20:43:36.787 7f1d2ae3f700 1 mds.mds005 Map removed me [mds.mds005{0:296030} state up:rejoin seq 2076 addr [] from cluster; respawning! See cluster/monitor logs for details.
...
2021-12-05 20:43:36.787 7f1d2ae3f700 1 mds.mds005 respawn!

#0 0x00007f122f75a1f0 in ceph::buffer::v14_2_0::list::iterator_impl<false>::advance (this=0x7f121d007460, o=0) at /usr/src/debug/ceph-14.2.19-307.g505bc2a/src/common/buffer.cc:743
#1 0x00007f122f75a082 in ceph::buffer::v14_2_0::list::iterator_impl<false>::iterator_impl (this=0x7f121d007460, l=0x7f121d007988, o=0) at /usr/src/debug/ceph-14.2.19-307.g505bc2a/src/common/buffer.cc:728
#2 0x00007f122f750f79 in ceph::buffer::v14_2_0::list::iterator::iterator (this=0x7f121d007460, l=0x7f121d007988, o=0) at /usr/src/debug/ceph-14.2.19-307.g505bc2a/src/common/buffer.cc:956
#3 0x000055b8ad3f8630 in ceph::buffer::v14_2_0::list::begin (this=0x7f121d007988) at /usr/src/debug/ceph-14.2.19-307.g505bc2a/src/include/buffer.h:1148
#4 0x000055b8ad3f84d0 in ceph::buffer::v14_2_0::list::clear (this=0x7f121d007988) at /usr/src/debug/ceph-14.2.19-307.g505bc2a/src/include/buffer.h:1071
#5 0x000055b8ad43da92 in ceph::decode (s=..., p=...) at /usr/src/debug/ceph-14.2.19-307.g505bc2a/src/include/encoding.h:293
#6 0x000055b8ad8096f2 in InodeStoreBase::decode_bare (this=0x7f121d007760, bl=..., snap_blob=..., struct_v=5 '\005') at /usr/src/debug/ceph-14.2.19-307.g505bc2a/src/mds/CInode.cc:1485
#7 0x000055b8ad7e4318 in InodeStore::decode_bare (this=0x7f121d007760, bl=...) at /usr/src/debug/ceph-14.2.19-307.g505bc2a/src/mds/CInode.h:137
#8 0x000055b8ad7d160a in CDir::_load_dentry (this=0x55b8c79c8500, key=..., dname=..., last=..., bl=..., pos=1135, snaps=0x0, force_dirty=0x7f121d007bcc) at /usr/src/debug/ceph-14.2.19-307.g505bc2a/src/mds/CDir.cc:1795
Python Exception <class 'gdb.error'> There is no member or method named _M_value_field.:
#9 0x000055b8ad7d3c85 in CDir::_omap_fetched (this=0x55b8c79c8500, hdrbl=..., omap=std::map with 2512 elements, complete=true, r=0) at /usr/src/debug/ceph-14.2.19-307.g505bc2a/src/mds/CDir.cc:1995
#10 0x000055b8ad7e690f in C_IO_Dir_OMAP_Fetched::finish (this=0x55b8cd02cd80, r=0) at /usr/src/debug/ceph-14.2.19-307.g505bc2a/src/mds/CDir.cc:1643
#11 0x000055b8ad3fa23d in Context::complete (this=0x55b8cd02cd80, r=0) at /usr/src/debug/ceph-14.2.19-307.g505bc2a/src/include/Context.h:77
#12 0x000055b8ad8bbdee in MDSContext::complete (this=0x55b8cd02cd80, r=0) at /usr/src/debug/ceph-14.2.19-307.g505bc2a/src/mds/MDSContext.cc:29
#13 0x000055b8ad8bc577 in MDSIOContextBase::complete (this=0x55b8cd02cd80, r=0) at /usr/src/debug/ceph-14.2.19-307.g505bc2a/src/mds/MDSContext.cc:114
#14 0x00007f122f2675ad in Finisher::finisher_thread_entry (this=0x55b8afd03440) at /usr/src/debug/ceph-14.2.19-307.g505bc2a/src/common/Finisher.cc:67
#15 0x000055b8ad44435c in Finisher::FinisherThread::entry (this=0x55b8afd03530) at /usr/src/debug/ceph-14.2.19-307.g505bc2a/src/common/Finisher.h:62
#16 0x00007f122f2d93a8 in Thread::entry_wrapper (this=0x55b8afd03530) at /usr/src/debug/ceph-14.2.19-307.g505bc2a/src/common/Thread.cc:84
#17 0x00007f122f2d9326 in Thread::_entry_func (arg=0x55b8afd03530) at /usr/src/debug/ceph-14.2.19-307.g505bc2a/src/common/Thread.cc:71
#18 0x00007f122c19eea5 in start_thread () from /lib64/libpthread.so.0
#19 0x00007f122ae4b96d in clone () from /lib64/libc.so.6
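
The log and backtrace above suggest that a thread was busy loading dentries (CDir::_omap_fetched) while the 'MDSRank' heartbeat went unreset for longer than its 15-second grace. The following is a toy model of that failure mode, not Ceph code: FakeClock, Heartbeat, process_fetched_dirfrags and the 0.5-second per-dirfrag cost are all hypothetical names and numbers chosen for illustration. It only shows why resetting the heartbeat between completions keeps a long batch healthy; it makes no claim about how the actual fix is implemented.

```python
GRACE_SECONDS = 15.0  # mirrors "had timed out after 15" in the log above


class FakeClock:
    """Deterministic clock so the example does not need to sleep."""

    def __init__(self):
        self.now = 0.0

    def advance(self, seconds):
        self.now += seconds


class Heartbeat:
    """Toy heartbeat handle (hypothetical; not Ceph's HeartbeatMap)."""

    def __init__(self, clock, grace):
        self.clock = clock
        self.grace = grace
        self.last_reset = clock.now

    def reset(self):
        self.last_reset = self.clock.now

    def is_healthy(self):
        return self.clock.now - self.last_reset <= self.grace


def process_fetched_dirfrags(clock, heartbeat, dirfrag_costs, reset_each):
    """Process completions back to back.

    Returns True if the heartbeat stayed healthy for the whole batch.
    """
    healthy = True
    for cost in dirfrag_costs:
        clock.advance(cost)          # simulated time spent decoding one dirfrag
        if not heartbeat.is_healthy():
            healthy = False          # a monitor check here would report a timeout
        if reset_each:
            heartbeat.reset()        # progress was made, so refresh the heartbeat
    return healthy


# 40 dirfrags at 0.5 s each is 20 s of uninterrupted work, more than the grace.
costs = [0.5] * 40

clock = FakeClock()
without_reset = process_fetched_dirfrags(
    clock, Heartbeat(clock, GRACE_SECONDS), costs, reset_each=False)

clock = FakeClock()
with_reset = process_fetched_dirfrags(
    clock, Heartbeat(clock, GRACE_SECONDS), costs, reset_each=True)

print(without_reset, with_reset)
```

In this model the total work is identical in both runs; only the interval between heartbeat resets differs, which is what determines whether the grace period is exceeded.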


Related issues 1 (0 open, 1 closed)

Copied to CephFS - Backport #53759: pacific: mds: heartbeat timeout by _prefetch_dirfrags during up:rejoin (Resolved, Xiubo Li)
#1

Updated by Xiubo Li over 2 years ago

  • Pull request ID set to 44246
#2

Updated by Venky Shankar over 2 years ago

  • Status changed from New to Fix Under Review
  • Assignee set to Yongseok Oh
  • Target version set to v17.0.0
  • Backport set to pacific
#3

Updated by Venky Shankar over 2 years ago

  • Category set to Correctness/Safety
#4

Updated by Venky Shankar over 2 years ago

  • Status changed from Fix Under Review to Pending Backport
#5

Updated by Backport Bot over 2 years ago

  • Copied to Backport #53759: pacific: mds: heartbeat timeout by _prefetch_dirfrags during up:rejoin added
#6

Updated by Xiubo Li about 2 years ago

  • Status changed from Pending Backport to Resolved