Documentation #49406

closed

Exceeding osd nearfull ratio causes write throttle.

Added by Justin Mammarella about 3 years ago. Updated almost 2 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Target version:
% Done: 0%
Tags:
Backport:
Reviewed:
Affected Versions:
Labels (FS):
Pull request ID:

Description

We noticed a 20x reduction in write performance on our CephFS cluster shortly after one of our OSDs exceeded the nearfull ratio.

Historically (pre-Nautilus), the "nearfull" warning never had any operational impact, so it took us a while to narrow down the actual cause.

It would be worth adding the following to the user documentation: "If we are near [nearfull] ENOSPC, write synchronously.", along with the implications for client performance.

This appears to be related to the following commit:

https://github.com/ceph/ceph-client/commit/7614209736fbc4927584d4387faade4f31444fce

Kernel 4.19.154
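
For anyone trying to confirm whether this is what they are hitting, here is a minimal sketch that lists the OSDs at or above the cluster's nearfull ratio. It is not taken from the commit or the Ceph documentation. It assumes that `ceph osd df --format json` returns a `nodes` list whose entries carry `name` and `utilization` (a percentage), and that `ceph osd dump --format json` exposes `nearfull_ratio` as a fraction; check both against your release. The helper names `osds_over_nearfull` and `ceph_json` are hypothetical.

```python
import json
import subprocess


def osds_over_nearfull(osd_df, osd_dump):
    """Return (name, utilization_percent) for OSDs at or above the nearfull ratio.

    Assumptions: osd_dump["nearfull_ratio"] is a fraction (e.g. 0.85) and
    each node's "utilization" is a percentage (e.g. 86.1).
    """
    threshold_pct = float(osd_dump["nearfull_ratio"]) * 100.0
    over = [
        (node["name"], float(node["utilization"]))
        for node in osd_df.get("nodes", [])
        if float(node.get("utilization", 0.0)) >= threshold_pct
    ]
    # Fullest OSDs first, since those are the ones to rebalance or expand.
    return sorted(over, key=lambda item: item[1], reverse=True)


def ceph_json(*args):
    """Run a ceph CLI subcommand and parse its JSON output."""
    result = subprocess.run(
        ["ceph", *args, "--format", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


if __name__ == "__main__":
    hits = osds_over_nearfull(ceph_json("osd", "df"), ceph_json("osd", "dump"))
    for name, pct in hits:
        print(f"{name}: {pct:.1f}% used, at or above the nearfull ratio")
```

If this prints anything while kernel-client writes are slow, the synchronous-write behaviour described above is a likely suspect.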

Actions #1

Updated by Jeff Layton about 3 years ago

It's unfortunate that it caught you by surprise. Would you care to draft a patch to update the documentation? Where would it have been most helpful to read this?

Actions #2

Updated by Jan-Philipp Litza almost 3 years ago

I got caught by surprise, too. Maybe it could at least go in "Kernel Mount Debugging", so that when things get slow, one can find the answer. And/or in the mount.ceph man page and "Mount CephFS using kernel driver".

At least those are the pages I currently have open...

Actions #3

Updated by Jeff Layton over 2 years ago

  • Assignee set to Jeff Layton

Actions #4

Updated by Jeff Layton over 2 years ago

  • Tracker changed from Bug to Documentation
  • Project changed from Linux kernel client to CephFS
  • Category deleted (fs/ceph)
  • Status changed from New to In Progress
  • Pull request ID set to 42749

Actions #5

Updated by Patrick Donnelly over 2 years ago

  • Status changed from In Progress to Resolved

Actions #6

Updated by Niklas Hambuechen almost 2 years ago

After wondering for a long time why my clusters get slow at some point, I finally found this as well.

It would be fantastic if `ceph status` could not only point out when a device becomes NEARFULL, but also hint at the massive impact that can have.
