Bug #17719

closed

OSDs marked OUT wrongly after monitor failover

Added by Ridge Chen over 7 years ago. Updated about 7 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: Monitor
Target version: -
% Done: 0%
Source: other
Tags:
Backport: hammer,jewel
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We recently found an issue with our Ceph cluster, which runs version 0.94.6.

We wanted to add RAM to the Ceph nodes, so we first had to stop
the ceph service on each node. When we did that on the first
node, the OSDs on that node were marked OUT and backfill started
(only DOWN is expected in this case). The first node is somewhat special
in that it also hosts the leader monitor.

We then checked the monitor log and found the following:

cluster [INF] osd.0 out (down for 3375169.141844)

It looks like the monitor that had just become leader held stale
"down_pending_out" records, computed a very long DOWN time from them,
and therefore decided to mark the OSDs OUT.
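To put the reported DOWN time in perspective, the value from the log line can be converted to days. This is only a quick arithmetic check on the number quoted above; the helper name `down_time` is made up for illustration.

```python
from datetime import timedelta

# Value copied from the log line "osd.0 out (down for 3375169.141844)".
DOWN_FOR_SECONDS = 3375169.141844


def down_time(seconds: float) -> timedelta:
    """Convert a DOWN duration in seconds to a timedelta (hypothetical helper)."""
    return timedelta(seconds=seconds)


if __name__ == "__main__":
    d = down_time(DOWN_FOR_SECONDS)
    # Whole days plus the leftover hours, for a human-readable summary.
    print(f"{d.days} days, {d.seconds // 3600} hours")
```

The result is a little over 39 days, far longer than the OSDs had actually been down when the service was stopped.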

After researching the related code, we think the cause could be the following:

1. "down_pending_out" was set about a month ago for those OSDs because
of a network issue.
2. The down OSDs came back up and rejoined the cluster. "down_pending_out"
is cleared in the "OSDMonitor::tick()" method, but that happened only on
the leader monitor.
3. When we stopped the ceph service on the first node, the monitor group
failed over. The new leader monitor believed the OSDs had been DOWN
for a very long time and wrongly marked them OUT.
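The three steps above can be sketched with a toy model. This is not Ceph code: `ToyMonitor`, `observe_map`, `simulate`, the monitor names, and the 300-second grace period are all assumptions made for illustration, and the only behavior taken from the report is that stale "down_pending_out" entries are cleared in the tick on the leader alone.

```python
class ToyMonitor:
    """Toy stand-in for a monitor; not the real OSDMonitor."""

    def __init__(self, name, is_leader=False):
        self.name = name
        self.is_leader = is_leader
        self.down_pending_out = {}  # osd id -> time it was first seen down

    def observe_map(self, down_osds, now):
        # Every monitor records when it first saw an OSD go down.
        for osd in down_osds:
            self.down_pending_out.setdefault(osd, now)

    def tick(self, down_osds, now, grace):
        # Only the leader cleans up entries and decides to mark OSDs out.
        if not self.is_leader:
            return []
        marked_out = []
        for osd, since in list(self.down_pending_out.items()):
            if osd not in down_osds:
                del self.down_pending_out[osd]  # OSD is up again: forget it
            elif now - since >= grace:
                marked_out.append((osd, now - since))
                del self.down_pending_out[osd]
        return marked_out


def simulate(grace=300.0):
    """Replay the scenario; returns (stale peon state, OSDs marked out)."""
    leader = ToyMonitor("mon.a", is_leader=True)
    peon = ToyMonitor("mon.b")

    # Step 1: a network issue takes osd.0 down at t=0.
    for mon in (leader, peon):
        mon.observe_map({0}, 0.0)
        mon.tick({0}, 0.0, grace)

    # Step 2: osd.0 comes back at t=60; only the leader's tick clears it.
    for mon in (leader, peon):
        mon.observe_map(set(), 60.0)
        mon.tick(set(), 60.0, grace)
    stale = dict(peon.down_pending_out)

    # Step 3: much later the leader's node is stopped, osd.0 goes down,
    # and the former peon takes over with its stale record.
    later = 3375169.0
    leader.is_leader, peon.is_leader = False, True
    peon.observe_map({0}, later)
    return stale, peon.tick({0}, later, grace)


if __name__ == "__main__":
    stale, marked_out = simulate()
    print("stale peon state:", stale)
    print("marked out by new leader:", marked_out)
```

In this sketch the new leader marks osd.0 out on its first tick, because it measures the DOWN time from the month-old timestamp instead of from the moment the service was stopped.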


Related issues 2 (0 open, 2 closed)

Copied to Ceph - Backport #17883: hammer: OSDs marked OUT wrongly after monitor failover (Resolved, Nathan Cutler)
Copied to Ceph - Backport #17884: jewel: OSDs marked OUT wrongly after monitor failover (Resolved, Nathan Cutler)
Actions #2

Updated by Kefu Chai over 7 years ago

  • Category set to Monitor
  • Status changed from New to Fix Under Review
Actions #3

Updated by Kefu Chai over 7 years ago

  • Backport set to hammer,jewel
Actions #4

Updated by Kefu Chai over 7 years ago

  • Status changed from Fix Under Review to Pending Backport
Actions #5

Updated by Kefu Chai over 7 years ago

It would be ideal if we could have a test for this bug in the ceph-qa-suite rados suite.

Actions #6

Updated by Nathan Cutler over 7 years ago

  • Copied to Backport #17883: hammer: OSDs marked OUT wrongly after monitor failover added
Actions #7

Updated by Nathan Cutler over 7 years ago

  • Copied to Backport #17884: jewel: OSDs marked OUT wrongly after monitor failover added
Actions #8

Updated by Nathan Cutler about 7 years ago

  • Status changed from Pending Backport to Resolved