Bug #36268

Unable to recover from ENOSPC in BlueFS

Added by Igor Fedotov over 5 years ago. Updated over 3 years ago.

Status: Resolved
Priority: Normal
Assignee:
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Under heavy load with a full DB volume, BlueStore can reach a state where BlueFS has no additional space available to it, even though free space remains on the block device.
This is caused by the "lazy" behavior of free space rebalancing: it runs periodically in the background rather than on demand.
On the first allocation failure the OSD asserts, and it is then unable to restart, because log replay during BlueFS open also needs space while the rebalance still has not run.
The OSD then asserts again, which leaves it in a kind of unrecoverable deadlock.
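
The failure mode above can be illustrated with a small toy model. This is only a sketch and not Ceph code: `ToyStore`, `NoSpace`, `start_osd`, the `on_demand` flag and all of the numbers are hypothetical names and values chosen for illustration. It assumes that log replay needs a fixed amount of space and that background rebalancing never gets a chance to run before replay.

```python
from dataclasses import dataclass


class NoSpace(Exception):
    """Stand-in for an ENOSPC allocation failure (hypothetical)."""


@dataclass
class ToyStore:
    bluefs_free: int   # units currently owned by the toy "BlueFS"
    block_free: int    # units still free on the main block device
    on_demand: bool    # True: pull space from the block device on failure

    def rebalance(self, amount: int) -> None:
        """Move up to `amount` free units from the block device to BlueFS."""
        moved = min(amount, self.block_free)
        self.block_free -= moved
        self.bluefs_free += moved

    def bluefs_allocate(self, want: int) -> None:
        """Allocate `want` units for BlueFS, or raise NoSpace."""
        if self.bluefs_free < want and self.on_demand:
            # On-demand variant: fetch the shortfall immediately.
            self.rebalance(want - self.bluefs_free)
        if self.bluefs_free < want:
            raise NoSpace(f"need {want}, have {self.bluefs_free}")
        self.bluefs_free -= want


def start_osd(store: ToyStore, replay_cost: int) -> bool:
    """Model a restart: log replay must allocate before any
    background rebalance has had a chance to run."""
    try:
        store.bluefs_allocate(replay_cost)
    except NoSpace:
        return False  # toy equivalent of asserting during BlueFS open
    return True


# Same starting state for both: BlueFS is full, the block device is not.
lazy = ToyStore(bluefs_free=0, block_free=100, on_demand=False)
eager = ToyStore(bluefs_free=0, block_free=100, on_demand=True)

# With lazy rebalancing every restart attempt fails in the same way.
lazy_restarts = [start_osd(lazy, replay_cost=4) for _ in range(3)]
# With on-demand rebalancing the first restart succeeds.
eager_restart = start_osd(eager, replay_cost=4)
```

In this model the lazy store fails on every restart attempt although 100 units stay free on the block device, while the on-demand store recovers on the first attempt. The real allocator is considerably more involved; the sketch only shows why retrying cannot help when rebalancing is never triggered by the failure itself.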


Related issues

Copied to bluestore - Backport #36640: luminous: Unable to recover from ENOSPC in BlueFS (Rejected)
Copied to bluestore - Backport #36641: mimic: Unable to recover from ENOSPC in BlueFS (Rejected)

History

#2 Updated by Sage Weil over 5 years ago

  • Status changed from New to Pending Backport
  • Backport set to mimic,luminous

#3 Updated by Patrick Donnelly over 5 years ago

  • Copied to Backport #36640: luminous: Unable to recover from ENOSPC in BlueFS added

#4 Updated by Patrick Donnelly over 5 years ago

  • Copied to Backport #36641: mimic: Unable to recover from ENOSPC in BlueFS added

#5 Updated by Igor Fedotov over 5 years ago

  • Status changed from Pending Backport to In Progress

In fact, the previously mentioned PR is just a workaround that makes it possible to fix the issue manually.
Working on the actual solution, which is to fix the BlueFS allocation strategy.

#6 Updated by Igor Fedotov over 5 years ago

  • Status changed from In Progress to Fix Under Review

#7 Updated by Sage Weil about 5 years ago

  • Status changed from Fix Under Review to Resolved

#8 Updated by Nathan Cutler about 5 years ago

  • Status changed from Resolved to Pending Backport

Sage, did you mean to cancel the mimic and luminous backports when you changed the status to Resolved?

#9 Updated by Sage Weil about 5 years ago

  • Status changed from Pending Backport to Resolved
  • Backport deleted (mimic,luminous)

Alternative fix for mimic and luminous: https://github.com/ceph/ceph/pull/26735

#10 Updated by 鹏 张 almost 5 years ago

Sage Weil wrote:

Alternative fix for mimic and luminous: https://github.com/ceph/ceph/pull/26735

Hello Sage Weil, I hit the same issue earlier on Luminous and merged the new patch you mentioned, but it did not help: restarting the OSD triggers the same assert. Is there any other way to restore the OSD, such as cleaning up BlueFS space or expanding the BlueFS size?

#11 Updated by Igor Fedotov over 3 years ago

  • Pull request ID set to 25132
