Bug #46023

closed

mds: MetricAggregator.cc: 178: FAILED ceph_assert(rm)

Added by Patrick Donnelly almost 4 years ago. Updated almost 4 years ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Target version:
% Done: 0%
Source: Q/A
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDS
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

    -1> 2020-06-12T17:05:30.067+0000 7f250f92f700 -1 /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.0.0-2463-gdb551b9c0ad/rpm/el8/BUILD/ceph-16.0.0-2463-gdb551b9c0ad/src/mds/MetricAggregator.cc: In function 'void MetricAggregator::remove_metrics_for_rank(const entity_inst_t&, mds_rank_t, bool)' thread 7f250f92f700 time 2020-06-12T17:05:30.066358+0000
/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.0.0-2463-gdb551b9c0ad/rpm/el8/BUILD/ceph-16.0.0-2463-gdb551b9c0ad/src/mds/MetricAggregator.cc: 178: FAILED ceph_assert(rm)

 ceph version 16.0.0-2463-gdb551b9c0ad (db551b9c0ad5c77ac86a97f6be17dc25e4ab80ce) pacific (dev)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x7f2515ccf3a8]
 2: (()+0x2885c2) [0x7f2515ccf5c2]
 3: (MetricAggregator::remove_metrics_for_rank(entity_inst_t const&, int, bool)+0x1541) [0x557ebe07f7d1]
 4: (MetricAggregator::handle_mds_metrics(boost::intrusive_ptr<MMDSMetrics const> const&)+0x23b) [0x557ebe07fe6b]
 5: (MetricAggregator::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x6c) [0x557ebe07ff9c]
 6: (MetricAggregator::ms_fast_dispatch2(boost::intrusive_ptr<Message> const&)+0xe) [0x557ebe07b03e]
 7: (DispatchQueue::fast_dispatch(boost::intrusive_ptr<Message> const&)+0x190) [0x7f2515ef72b0]
 8: (ProtocolV2::handle_message()+0x12ee) [0x7f2515fd8d7e]
 9: (ProtocolV2::handle_read_frame_dispatch()+0x258) [0x7f2515feaaa8]
 10: (ProtocolV2::handle_read_frame_epilogue_main(std::unique_ptr<ceph::buffer::v15_2_0::ptr_node, ceph::buffer::v15_2_0::ptr_node::disposer>&&, int)+0x37d) [0x7f2515feae8d]
 11: (ProtocolV2::run_continuation(Ct<ProtocolV2>&)+0x3c) [0x7f2515fd179c]
 12: (AsyncConnection::process()+0x8a9) [0x7f2515f993f9]
 13: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0xcb7) [0x7f2515ff1ea7]
 14: (()+0x5b286c) [0x7f2515ff986c]
 15: (()+0xc2b23) [0x7f2513d53b23]
 16: (()+0x82de) [0x7f25148a22de]
 17: (clone()+0x43) [0x7f2513430133]

From: /ceph/teuthology-archive/pdonnell-2020-06-12_09:40:42-multimds-wip-pdonnell-testing-20200612.063208-distro-basic-smithi/5142102/remote/smithi149/log/ceph-mds.a.log.gz

and: /ceph/teuthology-archive/pdonnell-2020-06-12_09:40:42-multimds-wip-pdonnell-testing-20200612.063208-distro-basic-smithi/5142298/teuthology.log

and: /ceph/teuthology-archive/pdonnell-2020-06-12_09:40:42-multimds-wip-pdonnell-testing-20200612.063208-distro-basic-smithi/5142487/teuthology.log

It seems easily reproducible with test_snapshots.

Actions #1

Updated by Patrick Donnelly almost 4 years ago

  • Description updated (diff)
Actions #2

Updated by Venky Shankar almost 4 years ago

This happens when the rank 0 MDS goes offline after handling metrics that another rank (say, rank 1) reported for a client, and that client then closes its session with rank 1 once a new rank 0 MDS is active. The new rank 0 receives a metric remove message for a client it is not tracking.

For the fix, ignoring the metric remove message (for client removal from tracking) should suffice. Another way is to always send a metrics update message (for all tracked clients) when a new rank 0 is chosen before sending the metric remove message.
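The sequence and the two options above can be illustrated with a toy model. This is only a sketch under stated assumptions: `ToyAggregator`, `handle_update`, `handle_remove`, and `remove_after_failover` are hypothetical names invented for illustration and are not the real `MetricAggregator` interface, and the tracking state is reduced to a plain map.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Toy model only: ToyAggregator and its members are hypothetical names,
// not the real MetricAggregator interface.
struct ToyAggregator {
  // client name -> set of ranks that have reported metrics for that client
  std::map<std::string, std::set<int>> tracked;

  // A rank reports metrics for a client; start (or keep) tracking it.
  void handle_update(const std::string& client, int rank) {
    tracked[client].insert(rank);
  }

  // A rank reports that a client session closed. Returns true if a tracking
  // entry was actually removed, false if the client/rank pair was unknown.
  // A strict caller would assert on the result; a tolerant caller would
  // simply ignore a false return.
  bool handle_remove(const std::string& client, int rank) {
    auto it = tracked.find(client);
    if (it == tracked.end()) {
      return false;
    }
    bool removed = it->second.erase(rank) > 0;
    if (it->second.empty()) {
      tracked.erase(it);
    }
    return removed;
  }
};

// Replays the failover sequence. If resend_updates is true, rank 1 first
// re-announces its tracked client to the new rank 0 (the second option).
bool remove_after_failover(bool resend_updates) {
  ToyAggregator old_rank0;
  old_rank0.handle_update("client.a", 1);  // metrics seen by the old rank 0

  ToyAggregator new_rank0;                 // fresh state after failover
  if (resend_updates) {
    new_rank0.handle_update("client.a", 1);
  }
  return new_rank0.handle_remove("client.a", 1);
}
```

In this model, `remove_after_failover(false)` finds nothing to remove, which is the condition a strict assert would trip on, while `remove_after_failover(true)` removes the entry normally because the new rank 0 was told about the client first.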

Actions #3

Updated by Venky Shankar almost 4 years ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 35619
Actions #4

Updated by Venky Shankar almost 4 years ago

Note that the fix in the PR is to "patch" the sequence number in the tracking map. I didn't want to do away with the assert (that was hit) since it really catches this kind of bug.

Actions #5

Updated by Patrick Donnelly almost 4 years ago

  • Status changed from Fix Under Review to Resolved