Bug #62163

open

[rbd-mirror] Handle timeouts within mirroring daemon so that force promote doesn't lead to any stuck ops

Added by Prasanna Kumar Kalever 10 months ago.

Status:
New
Priority:
Normal
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

When the primary cluster is hit by a disaster and becomes unreachable, a force promote on the secondary cluster leaves various components stuck, including image replayers, pool replayers, RemotePoolPoller, the cluster watcher, the leader watcher, and others. These components depend on OSD ops that currently hang indefinitely because the remote cluster is no longer available.

This tracker ensures that no components are left stuck within the mirroring daemon. We set the timeout option `--rbd-mirror-remote-osd-op-timeout=30`, which internally uses the OSD's `--rados-osd-op-timeout` to set the OSD op timeout for the mirroring daemon. The remaining work is to fix the various components that are not yet able to handle timeout errors within the mirroring daemon.

Goal:
  • The rbd-mirror daemon should handle a force promote gracefully.
  • The rbd-mirror daemon should gracefully handle random timeout errors from OSDs (which might happen because of a sudden spike on the network or overall pressure on the underlying node).
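
As a rough illustration of the two goals above, the following Python sketch shows one way a component could treat an OSD op timeout as a recoverable error instead of blocking forever. It is not rbd-mirror code: `OpTimedOut`, `run_with_timeout_handling`, `max_attempts`, `backoff_s`, and `should_stop` are hypothetical names introduced only for this example, and the retry-then-give-up policy is an assumption, not the approach chosen for this tracker.

```python
import errno
import time


class OpTimedOut(Exception):
    """Hypothetical stand-in for an OSD op that failed with -ETIMEDOUT."""


def run_with_timeout_handling(op, max_attempts=3, backoff_s=0.0,
                              should_stop=lambda: False):
    """Run op() and return (ok, result_or_negative_errno).

    - A timeout is treated as transient and retried up to max_attempts.
    - should_stop() models a shutdown request such as a force promote:
      once it returns True, no further ops are issued.
    """
    for attempt in range(1, max_attempts + 1):
        if should_stop():
            # Give up immediately instead of waiting on a dead remote.
            return False, -errno.ECANCELED
        try:
            return True, op()
        except OpTimedOut:
            if attempt < max_attempts:
                # Simple linear backoff before the next attempt.
                time.sleep(backoff_s * attempt)
    # All attempts timed out: report the error to the caller.
    return False, -errno.ETIMEDOUT
```

The point of the sketch is that every exit path returns to the caller: a transient timeout is retried, a persistent one surfaces as an error code, and a stop request short-circuits the loop so nothing waits on an unreachable remote cluster.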
