Bug #65298

open

Free space can be leaked in Quincy+ when bdev_async_discard is enabled

Added by Joshua Baergen 27 days ago. Updated 19 days ago.

Status:
New
Priority:
Normal
Assignee:
-
Target version:
-
% Done:

80%

Source:
Tags:
Backport:
Regression:
Yes
Severity:
2 - major
Reviewed:

Description

Starting in Quincy, BlueStore no longer maintains a free space map in RocksDB (https://github.com/ceph/ceph/pull/39871).

When BlueStore shuts down, it serializes the current allocator state to disk, and this serialized state is what gets used on the next boot. The issue is that, with bdev_async_discard enabled, the allocator is not updated with freed blocks immediately after a BlueStore txn completes; rather, it is updated only once the actual discard happens. KernelDevice::close() appears to wait for all discards to complete, but it does not run until after BlueStore::close_db(), which is where the allocator state is serialized. Any space tied up in outstanding discards at that point is therefore missing from the serialized state and is leaked.
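
To make the suspected ordering problem concrete, here is a toy Python model of the sequence described above. It is not Ceph code: `ToyStore`, `free_after_txn`, `drain_discards` and `leaked_blocks` are hypothetical names invented for this illustration, and the only thing the model borrows from the description is the order of events (serialize the allocator first, drain the discard queue afterwards).

```python
class ToyStore:
    """Toy stand-in for a store with an allocator and a discard queue."""

    def __init__(self, total_blocks, async_discard):
        self.free = set(range(total_blocks))   # allocator state (free blocks)
        self.async_discard = async_discard
        self.discard_queue = []                # freed extents awaiting discard

    def allocate(self, n):
        """Take the n lowest-numbered free blocks out of the allocator."""
        blocks = sorted(self.free)[:n]
        self.free.difference_update(blocks)
        return blocks

    def free_after_txn(self, blocks):
        """Called when a txn frees blocks."""
        if self.async_discard:
            # Blocks return to the allocator only after the discard runs.
            self.discard_queue.append(list(blocks))
        else:
            self.free.update(blocks)

    def drain_discards(self):
        """Complete every queued discard and return its blocks to the allocator."""
        while self.discard_queue:
            self.free.update(self.discard_queue.pop(0))

    def shutdown(self, drain_first=False):
        """Return the serialized allocator state, mimicking the shutdown order."""
        if drain_first:
            self.drain_discards()
        snapshot = sorted(self.free)   # stands in for the close_db() step
        self.drain_discards()          # stands in for the bdev close; too late
        return snapshot


def leaked_blocks(total, async_discard, drain_first=False):
    """Number of blocks missing from the snapshot after one alloc/free cycle."""
    store = ToyStore(total, async_discard)
    store.free_after_txn(store.allocate(4))
    return total - len(store.shutdown(drain_first))
```

In this model, `leaked_blocks(16, async_discard=True)` reports the four queued blocks as missing from the snapshot, while running with `async_discard=False` or with `drain_first=True` reports none missing.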

I have not observed this directly; it came to mind as a possibility during a conversation with another Ceph user, who mentioned seeing something that sounds a lot like this (he has async discard queues that can back up for hours under some workloads). He mentioned that resharding would reclaim the leaked space, which I'm guessing is because the allocation map gets regenerated in this case.

A few options come immediately to mind:
  1. Create an in-memory freespace manager which is what gets serialized to disk at shutdown. This has more complexity, but means that we don't need to wait for discards to complete during OSD shutdown (though maybe we're already waiting for them during bdev shutdown, per above).
  2. Switch to synchronous discards and flush outstanding discards at the bdev level before proceeding with serializing allocator state. This can take a while in extreme circumstances.
  3. Similar to option 2, except simply disable discards and drop any outstanding discards, returning the pending space to the allocator.
  4. Never serialize allocator state when async discards are outstanding.
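
As a rough illustration of how options 2 to 4 differ at the point where allocator state is serialized, here is a small Python sketch. It is not based on the BlueStore code or on the PR referenced below; the function names, the `discard` callback, and the representation of free space as block numbers are all assumptions made for the example. Option 1 is omitted because it needs a separate free space structure rather than a change in shutdown order.

```python
def snapshot_option2(free, pending, discard):
    """Option 2: issue each outstanding discard synchronously, then serialize."""
    free = set(free)
    for extent in pending:
        discard(extent)        # may be slow if the queue is long
        free.update(extent)    # space is returned once the discard is done
    return sorted(free)


def snapshot_option3(free, pending):
    """Option 3: drop outstanding discards, but still return their space."""
    free = set(free)
    for extent in pending:
        free.update(extent)    # no discard is sent to the device
    return sorted(free)


def snapshot_option4(free, pending):
    """Option 4: write no snapshot (None) while discards are outstanding."""
    if pending:
        return None
    return sorted(free)
```

With `free = {0, 1, 2}` and `pending = [[3, 4], [5]]`, options 2 and 3 both produce a snapshot covering blocks 0 through 5, and option 4 produces no snapshot at all. What the next boot does when no snapshot exists is outside the scope of this sketch.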
Actions #1

Updated by Gabriel BenHanokh 21 days ago

  • % Done changed from 0 to 50
  • Pull request ID set to 56744

PR https://github.com/ceph/ceph/pull/56744 should solve this issue

Actions #2

Updated by Gabriel BenHanokh 19 days ago

  • % Done changed from 50 to 80