Backport #18760

closed

Long delays awaiting OSD disk failure propagation

Added by David Disseldorp about 7 years ago. Updated over 5 years ago.

Status:
Rejected
Priority:
Normal
Assignee:
-
Target version:
-
Release:
jewel
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Related issues 1 (0 open, 1 closed)

Related to Ceph - Backport #18761: Jewel: async messenger bug fix additional backports (Rejected)
Actions #1

Updated by Nathan Cutler about 7 years ago

  • Tracker changed from Bug to Backport
  • Description updated (diff)

description

IIUC, OSD disk-pull errors currently propagate through to the mons via:
OSD device I/O error -> filestore I/O error -> ceph-osd ceph_abort() -> heartbeat failure.

The master branch was recently updated to have the OSDs immediately report peer OSDs as down if they refuse to accept connections, rather than waiting for intermittent heartbeat exchanges to detect the outage:

commit 5083742803479a1ef18431f00dc1f8b5f2cd7ab5
Author: Piotr Dałek <>
Date: Sun May 22 15:30:49 2016 +0200

msg/async: implement ECONNREFUSED detection
This commit adds code that detects ECONNREFUSED and dispatches appropriate
event further in Async messenger.
Signed-off-by: Piotr Dałek <>

commit fca6817d7ed1534157d04e2100313e1c2e222e21
Author: Piotr Dałek <>
Date: Thu May 5 21:48:31 2016 +0200

messages/MOSDFailure.h: distinguish between timeout and immediate failure
Change "is_failed" field to "flags" and use it to distinguish between timeout
and immediate, known OSD failure. Then use that in OSD and MON, and make sure
"min_reporters" don't affect known failures by actually going around failure
heuristic code.
Signed-off-by: Piotr Dałek <>

commit 75074524fe15afff1374a6006628adab4f7abf7b
Author: Piotr Dałek <>
Date: Sun May 22 13:08:48 2016 +0200

OSD: Implement ms_handle_refused
Added implementation of ms_handle_refused in OSD code, so it sends
MOSDFailure message in case the peer connection fails with ECONNREFUSED
and it is known to be up and new option "osd fast fail on connection
refused" which enables or disables new behavior.
Signed-off-by: Piotr Dałek <>

commit d58d7d3dbd21a8dca0a19964f51cb9bf78814a75
Author: Piotr Dałek <>
Date: Thu May 5 21:03:37 2016 +0200

msg/simple: add ms_handle_refused callback
Added new callback (ms_handle_refused) to dispatchers. It is called
once connection attempt fails with ECONNREFUSED.
Also added dummy ms_handle_refused handlers across codebase.
Signed-off-by: Piotr Dałek <>

This patch set considerably speeds up failure propagation for disk pull/failure scenarios. This ticket tracks merging the patch set into the Jewel branch.
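
To illustrate the idea behind the patch set, here is a minimal Python sketch, not the Ceph implementation. It shows how a refused connection can be told apart from a timeout on a single connect attempt, and how a known failure might bypass a reporter-count threshold. The names `Failure`, `probe_peer` and `should_mark_down` are hypothetical, and the `min_reporters` check is a simplified assumption about the monitor-side heuristic described in the commit messages above.

```python
import enum
import socket


class Failure(enum.Enum):
    """Hypothetical stand-in for the timeout/immediate distinction."""
    TIMEOUT = "timeout"
    IMMEDIATE = "immediate"


def probe_peer(host, port, timeout=1.0):
    """Try one TCP connect and classify the outcome.

    Returns Failure.IMMEDIATE if the peer refuses the connection
    (ECONNREFUSED), Failure.TIMEOUT if the attempt times out, and
    None if the connection succeeds. Other OSError cases propagate.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect((host, port))
    except ConnectionRefusedError:
        # Nothing is listening on the port: the peer process is gone.
        return Failure.IMMEDIATE
    except socket.timeout:
        # No answer at all: the peer may be slow, partitioned or dead.
        return Failure.TIMEOUT
    finally:
        sock.close()
    return None


def should_mark_down(failure, reporters, min_reporters):
    """Simplified, assumed decision rule for a failure report.

    A known, immediate failure is acted on at once; a timeout still
    needs at least min_reporters independent reports.
    """
    if failure is Failure.IMMEDIATE:
        return True
    return reporters >= min_reporters
```

In this sketch a single `Failure.IMMEDIATE` report is enough to mark the peer down, while `Failure.TIMEOUT` reports still wait for the threshold, which is where the faster propagation comes from.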

Actions #2

Updated by David Disseldorp about 7 years ago

  • Status changed from New to Fix Under Review
Actions #3

Updated by Nathan Cutler about 7 years ago

  • Description updated (diff)
  • Status changed from Fix Under Review to In Progress
Actions #4

Updated by Nathan Cutler about 6 years ago

  • Status changed from In Progress to Need More Info
  • Assignee deleted (David Disseldorp)

Non-trivial backport, due to memory leaks in Async Messenger.

Actions #5

Updated by Nathan Cutler about 6 years ago

  • Description updated (diff)

First attempted backport was https://github.com/ceph/ceph/pull/13212

Actions #6

Updated by Nathan Cutler about 6 years ago

  • Related to Backport #18761: Jewel: async messenger bug fix additional backports added
Actions #7

Updated by Nathan Cutler over 5 years ago

  • Status changed from Need More Info to Rejected

Jewel is EOL.
