Bug #39484

mon: "FAILED assert(pending_finishers.empty())" when paxos restart

Added by yu feng 5 months ago. Updated 28 days ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Correctness/Safety
Target version:
Start date: 04/25/2019
Due date:
% Done: 0%
Source:
Tags:
Backport: nautilus, mimic
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS): Monitor
Pull request ID:

Description

We are running Ceph 13.2.5 on CentOS Linux 7.5.1804, and the cluster has 5 ceph-mon daemons. Every 30 seconds we modify a couple of keys in the config-key store, and we noticed that ceph-mon crashed several times.
The backtrace is shown below.

0> 2019-04-25 13:17:18.708 7fe4870cc700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.5/rpm/el7/BUILD/ceph-13.2.5/src/mon/Paxos.cc: In function 'MonitorDBStore::TransactionRef Paxos::get_pending_transaction()' thread 7fe4870cc700 time 2019-04-25 13:17:18.707278
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.5/rpm/el7/BUILD/ceph-13.2.5/src/mon/Paxos.cc: 1559: FAILED assert(pending_finishers.empty())
ceph version 13.2.5 (cbff874f9007f1869bfd3821b7e33b2a6ffd4988) mimic (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0xff) [0x7fe494e3afbf]
2: (()+0x26d187) [0x7fe494e3b187]
3: (Paxos::get_pending_transaction()+0xf7) [0x560e85393ba7]
4: (ConfigKeyService::store_put(std::string const&, ceph::buffer::list&, Context*)+0x36) [0x560e85346ba6]
5: (ConfigKeyService::service_dispatch(boost::intrusive_ptr<MonOpRequest>)+0xab6) [0x560e85349576]
6: (Monitor::handle_command(boost::intrusive_ptr<MonOpRequest>)+0x1faf) [0x560e852b817f]
7: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x39d) [0x560e852bcbad]
8: (Monitor::C_Command::_finish(int)+0x5c) [0x560e852f729c]
9: (C_MonOp::finish(int)+0x43) [0x560e852c07d3]
10: (Context::complete(int)+0x9) [0x560e852bf9e9]
11: (void finish_contexts<Context>(CephContext*, std::list<Context*, std::allocator<Context*> >&, int)+0x93) [0x560e852c7513]
12: (Paxos::restart()+0xe3) [0x560e85393733]
13: (Monitor::_reset()+0x17e) [0x560e8528b0ce]
14: (Monitor::join_election()+0x35) [0x560e8528b1f5]
15: (Elector::bump_epoch(unsigned int)+0x147) [0x560e853258c7]
16: (Elector::handle_propose(boost::intrusive_ptr<MonOpRequest>)+0x6b0) [0x560e853267f0]
17: (Elector::dispatch(boost::intrusive_ptr<MonOpRequest>)+0x449) [0x560e85327f09]
18: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0xbbb) [0x560e852bd3cb]
19: (Monitor::_ms_dispatch(Message*)+0x732) [0x560e852be212]
20: (Monitor::ms_dispatch(Message*)+0x23) [0x560e852e3cf3]
21: (DispatchQueue::entry()+0xb7a) [0x7fe494ef739a]
22: (DispatchQueue::DispatchThread::entry()+0xd) [0x7fe494f952cd]
23: (()+0x7dd5) [0x7fe49417ddd5]
24: (clone()+0x6d) [0x7fe490a98ead]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

My guess is that the assert fires while Paxos::restart() is running "finish_contexts(g_ceph_context, committing_finishers, -EAGAIN)" and pending_finishers is still non-empty.
I also noticed something interesting: in Paxos::shutdown, the sequence is:

finish_contexts(g_ceph_context, pending_finishers, -ECANCELED);
finish_contexts(g_ceph_context, committing_finishers, -ECANCELED);

while in Paxos::restart, the sequence is:

finish_contexts(g_ceph_context, committing_finishers, -EAGAIN);
finish_contexts(g_ceph_context, pending_finishers, -EAGAIN);

Shouldn't pending_finishers always be handled before committing_finishers?
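
To illustrate the suspected failure mode, here is a toy model of the restart ordering. It is only a sketch, not the Ceph code: ToyPaxos, Finisher, restart_as_reported(), reproduce(), and the assert_would_fire flag are hypothetical names, and std::function stands in for the real Context objects. The committing finisher re-enters get_pending_transaction(), as frames 3 to 12 of the backtrace suggest, while pending_finishers still holds an entry.

```cpp
#include <cassert>
#include <cerrno>
#include <functional>
#include <list>

// Hypothetical stand-in for Context*; the real lists hold Context pointers.
using Finisher = std::function<void(int)>;

struct ToyPaxos {
  std::list<Finisher> pending_finishers;
  std::list<Finisher> committing_finishers;
  bool assert_would_fire = false;  // toy flag instead of aborting

  // Mirrors the check that failed: pending_finishers must be empty here.
  void get_pending_transaction() {
    if (!pending_finishers.empty())
      assert_would_fire = true;  // the real monitor hits the assert here
  }

  // Moves the callbacks into a local list, then runs each of them.
  static void finish_contexts(std::list<Finisher>& finishers, int r) {
    std::list<Finisher> local;
    local.swap(finishers);
    for (auto& f : local) f(r);
  }

  // Same order as the Paxos::restart() sequence quoted above.
  void restart_as_reported() {
    finish_contexts(committing_finishers, -EAGAIN);
    finish_contexts(pending_finishers, -EAGAIN);
  }
};

// Returns true if the toy assert condition was violated.
bool reproduce() {
  ToyPaxos p;
  p.pending_finishers.push_back([](int) {});
  // A committing finisher that re-enters get_pending_transaction(),
  // like the retried config-key command in the backtrace.
  p.committing_finishers.push_back([&p](int) { p.get_pending_transaction(); });
  p.restart_as_reported();
  return p.assert_would_fire;
}
```

In this model the flag is set because pending_finishers is only drained after the committing callbacks have already run.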

core_dump.txt.tar.gz - The core dump log file (202 KB) haitao chen, 04/26/2019 01:36 AM

ceph_status.png View - The ceph -s info (46.6 KB) haitao chen, 04/26/2019 01:39 AM

选区_021.png View (44.5 KB) haitao chen, 04/30/2019 02:37 AM

选区_022.png View (42 KB) haitao chen, 04/30/2019 02:44 AM


Related issues

Copied to RADOS - Backport #39743: nautilus: mon: "FAILED assert(pending_finishers.empty())" when paxos restart Resolved
Copied to RADOS - Backport #39744: mimic: mon: "FAILED assert(pending_finishers.empty())" when paxos restart Resolved

History

#1 Updated by haitao chen 5 months ago

Uploaded the core dump log file (core_dump.txt.tar.gz) and the ceph -s output (ceph_status.png).
mon.b01 crashes again and again.

#2 Updated by Greg Farnum 5 months ago

  • Project changed from Ceph to RADOS
  • Category changed from Monitor to Correctness/Safety
  • Component(RADOS) Monitor added

pending_finishers get moved into committing_finishers once they have been submitted to disk, so we probably want to finish the committing_finishers first (not certain, though!).

#3 Updated by Greg Farnum 5 months ago

  • Status changed from New to In Progress
  • Assignee set to Greg Farnum

Hmm this doesn't make a lot of sense. finish_contexts() swaps out the input list with a local one before running finish() on any of them, so pending_finishers should be empty here. Unless one of the committing_finishers tries to get_pending_transaction() ?

...which ConfigKeyService::store_put() does.

Okay, so I guess the fix is to swap the lists into function-local ones in Paxos::restart()?
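
A minimal sketch of that idea, using a toy model rather than the real Paxos class (ToyPaxos, Finisher, restart_with_local_lists(), run_fixed_restart(), and the reentrant_calls counter are hypothetical names, and this is not necessarily what the eventual patch looks like): both member lists are swapped into function-local lists before any callback runs, so a callback that re-enters get_pending_transaction() sees an empty pending_finishers.

```cpp
#include <cassert>
#include <cerrno>
#include <functional>
#include <list>

// Hypothetical stand-in for Context*.
using Finisher = std::function<void(int)>;

struct ToyPaxos {
  std::list<Finisher> pending_finishers;
  std::list<Finisher> committing_finishers;
  int reentrant_calls = 0;  // counts successful re-entrant calls

  void get_pending_transaction() {
    assert(pending_finishers.empty());  // the condition that failed in the report
    ++reentrant_calls;
  }

  // Empty both member lists first, then finish the local copies.
  void restart_with_local_lists() {
    std::list<Finisher> committing, pending;
    committing.swap(committing_finishers);
    pending.swap(pending_finishers);
    for (auto& f : committing) f(-EAGAIN);
    for (auto& f : pending) f(-EAGAIN);
  }
};

// Runs the same scenario as the crash: one pending finisher plus one
// committing finisher that calls back into get_pending_transaction().
int run_fixed_restart() {
  ToyPaxos p;
  p.pending_finishers.push_back([](int) {});
  p.committing_finishers.push_back([&p](int) { p.get_pending_transaction(); });
  p.restart_with_local_lists();
  return p.reentrant_calls;
}
```

With both lists emptied up front, the order in which the two local lists are finished no longer affects the assert.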

#4 Updated by Greg Farnum 5 months ago

  • Status changed from In Progress to Need Review
  • Pull request ID set to 27877

#5 Updated by haitao chen 5 months ago

Greg Farnum wrote:

pending_finishers get moved into committing_finishers once they have been submitted to disk, so we probably want to finish the committing_finishers first (not certain, though!).

Hi Greg,
Do leader_init() and peon_init() have the same problem?

#6 Updated by Greg Farnum 5 months ago

Hmm probably!

#7 Updated by Greg Farnum 5 months ago

Updated the PR. Please put further code reviews there. :)

#8 Updated by Nathan Cutler 5 months ago

  • Backport set to nautilus, mimic

#9 Updated by Sage Weil 4 months ago

  • Status changed from Need Review to Pending Backport

#10 Updated by Nathan Cutler 4 months ago

  • Copied to Backport #39743: nautilus: mon: "FAILED assert(pending_finishers.empty())" when paxos restart added

#11 Updated by Nathan Cutler 4 months ago

  • Copied to Backport #39744: mimic: mon: "FAILED assert(pending_finishers.empty())" when paxos restart added

#12 Updated by Greg Farnum 28 days ago

  • Status changed from Pending Backport to Resolved
