Backport #22450

Updated by Nathan Cutler over 6 years ago

We observed an unexplained, constant increase in disk space usage on a few of our production clusters. At first we thought customers were abusing them, but that wasn't it. Then we thought that images were constantly being filled with data, but the space usage reported by Ceph wasn't consistent with the filesystem. After further digging, we realized that the snap trim queues for some PGs were in 250k-element territory. We increased the snap trimmer frequency and the number of parallel snap trim ops, and disk space usage finally started to drop.
Ceph needs a feature to efficiently and conveniently access snap trim queue lengths so they can be used for monitoring, and a feature to warn Ceph cluster admins when snap trim queues are long enough to require attention.
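
To illustrate the kind of check a monitoring system could run once queue lengths are accessible, here is a minimal sketch. It assumes the lengths have already been collected into a plain mapping of PG id to queue length; the function name, the sample PG ids, and the threshold value are all hypothetical and are not taken from Ceph or from the linked pull request.

```python
from typing import Dict, List, Tuple

# Hypothetical threshold chosen for illustration only; not a Ceph default.
WARN_THRESHOLD = 100_000


def long_snap_trim_queues(
    queue_lengths: Dict[str, int],
    threshold: int = WARN_THRESHOLD,
) -> List[Tuple[str, int]]:
    """Return (pg_id, length) pairs whose queue length exceeds the threshold.

    The result is sorted with the longest queue first.
    """
    flagged = [(pg, n) for pg, n in queue_lengths.items() if n > threshold]
    return sorted(flagged, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Made-up sample data: PG id -> snap trim queue length.
    sample = {"1.2a": 250_000, "1.2b": 12, "3.7": 40_000}
    for pg, n in long_snap_trim_queues(sample):
        print(f"PG {pg}: snap trim queue length {n} exceeds {WARN_THRESHOLD}")
```

How the mapping is populated is left open here, since that is exactly the access feature this ticket asks for.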

https://github.com/ceph/ceph/pull/19520
