Bug #41866

open

OSD cannot report slow operation warnings in time.

Added by Ilsoo Byun over 4 years ago. Updated over 4 years ago.

Status:
Fix Under Review
Priority:
Normal
Assignee:
Category:
Administration/Usability
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

If an underlying device blocks because of a hardware problem, the thread that checks for slow ops is blocked as well, so in the current implementation it cannot report slow op warnings in time.
For example:
1. If a DATA disk is blocked while a PG lock is held, the thread that executes the TrackedOp::visit_ops_in_flight method is also blocked, waiting for the PG lock or the MGRClient::lock.
2. If a WAL disk is blocked while flushing, the thread that executes the TrackedOp::visit_ops_in_flight method is also blocked, waiting for the BlueFS::lock.

As a result, the OSD cannot report slow op warnings in time.
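
The failure mode can be reproduced in isolation. The sketch below is not Ceph code: `check_runs_in_time` and `shared_lock` are hypothetical names, with `shared_lock` standing in for a lock such as the PG lock or the BlueFS lock. A timed mutex is used only so that the example terminates; a checker taking a plain lock would simply wait for as long as the device stays blocked.

```cpp
#include <chrono>
#include <future>
#include <mutex>
#include <thread>

// Returns true if a check that needs `shared_lock` could start within
// `deadline` while an I/O thread holds that lock for `stall`.
// All names here are hypothetical; nothing below is Ceph's API.
bool check_runs_in_time(std::chrono::milliseconds stall,
                        std::chrono::milliseconds deadline) {
  std::timed_mutex shared_lock;  // stand-in for the PG lock or BlueFS lock
  std::promise<void> locked;

  // I/O thread: takes the lock, then stalls as if the device were blocked.
  std::thread io([&] {
    std::lock_guard<std::timed_mutex> guard(shared_lock);
    locked.set_value();
    std::this_thread::sleep_for(stall);
  });

  // Wait until the I/O thread really holds the lock.
  locked.get_future().wait();

  // Checker: needs the same lock before it can look at the ops in flight.
  bool ran = shared_lock.try_lock_for(deadline);
  if (ran) {
    shared_lock.unlock();
  }
  io.join();
  return ran;
}
```

With a stall longer than the deadline, the checker never gets to run inside its reporting window, which is the situation described in cases 1 and 2.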

How about running the slow op checking code in a thread separate from the 'tick_without_osd_lock' thread?

Actions #1

Updated by Ilsoo Byun over 4 years ago

The description above assumes that BlueStore is used.

Actions #2

Updated by Greg Farnum over 4 years ago

  • Project changed from Ceph to RADOS
  • Category deleted (OSD)
Actions #3

Updated by Ilsoo Byun over 4 years ago

The report_callback thread is also blocked on PG::lock, while holding MGRClient::lock, when it gets the pg stats. This in turn blocks the tick_without_osd_lock thread.

Actions #4

Updated by Kefu Chai over 4 years ago

  • Category set to Administration/Usability
  • Status changed from New to 17
  • Assignee set to Ilsoo Byun
  • Pull request ID set to 30550
Actions #5

Updated by Kefu Chai over 4 years ago

  • Status changed from 17 to Fix Under Review