Bug #52581

Dangling fs snapshots on data pool after change of directory layout

Added by Frank Schilder over 2 years ago. Updated 5 months ago.

Status: New
Priority: Normal
Category: Correctness/Safety
Target version: -
% Done: 0%
Source: Community (user)
Tags: -
Backport: quincy, reef
Regression: No
Severity: 2 - major
Reviewed: -
Affected Versions: -
ceph-qa-suite: -
Component(FS): MDS
Labels (FS): multimds, snapshots
Pull request ID: -
Crash signature (v1): -
Crash signature (v2): -

Description

# ceph version
ceph version 13.2.10 (564bdc4ae87418a232fc901524470e1a0f76d641) mimic (stable)

After changing the data pool on the root directory of our CephFS, deleted snapshots appear to be stuck in the new data pool. We rotate daily snapshots. Our ceph fs status, excluding stand-bys, is:

# ceph fs status
con-fs2 - 1640 clients
=======
+------+--------+---------+---------------+-------+-------+
| Rank | State  |   MDS   |    Activity   |  dns  |  inos |
+------+--------+---------+---------------+-------+-------+
|  0   | active | ceph-23 | Reqs:    5 /s | 2399k | 2346k |
|  1   | active | ceph-12 | Reqs:   25 /s | 1225k | 1203k |
|  2   | active | ceph-08 | Reqs:   25 /s | 2148k | 2027k |
|  3   | active | ceph-15 | Reqs:   26 /s | 2088k | 2032k |
+------+--------+---------+---------------+-------+-------+
+---------------------+----------+-------+-------+
|         Pool        |   type   |  used | avail |
+---------------------+----------+-------+-------+
|    con-fs2-meta1    | metadata | 4040M | 1314G |
|    con-fs2-meta2    |   data   |    0  | 1314G |
|     con-fs2-data    |   data   | 1361T | 6023T |
| con-fs2-data-ec-ssd |   data   |  239G | 4205G |
|    con-fs2-data2    |   data   | 35.8T | 5475T |
+---------------------+----------+-------+-------+
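
The daily rotation itself uses the usual CephFS mechanism of creating and removing directories under .snap; roughly like this (the mount point and naming scheme here are only illustrative):

# mkdir /mnt/con-fs2/.snap/daily_$(date +%F)
# rmdir /mnt/con-fs2/.snap/daily_2021-09-01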

We changed the data pool on the root from the 8+2 EC pool con-fs2-data to the 8+3 EC pool con-fs2-data2. It looks like some deleted snapshots are not being purged on the new pool (snippet from ceph osd pool ls detail):

pool 12 'con-fs2-meta1' replicated size 4 min_size 2 ... application cephfs
pool 13 'con-fs2-meta2' replicated size 4 min_size 2 ... application cephfs
    removed_snaps [2~18e,191~2c,1be~144,303~3,307~1,309~1,30b~1,30d~1,30f~1,311~1,313~1,315~2]
pool 14 'con-fs2-data' erasure size 10 min_size 9 ... application cephfs
    removed_snaps [2~18e,191~2c,1be~144,303~3,307~1,309~1,30b~1,30d~1,30f~1,311~1,313~1,315~2]
pool 17 'con-fs2-data-ec-ssd' erasure size 10 min_size 9 ... application cephfs
    removed_snaps [2~18e,191~2c,1be~144,303~3,307~1,309~1,30b~1,30d~1,30f~1,311~1,313~1,315~2]
pool 19 'con-fs2-data2' erasure size 11 min_size 9 ... application cephfs
    removed_snaps [2d6~1,2d8~1,2da~1,2dc~1,2de~1,2e0~1,2e2~1,2e4~1,2e6~1,2e8~1,2ea~18,303~3,307~1,309~1,30b~1,30d~1,30f~1,311~1,313~1,315~2]
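
For reference, a root-layout change like this is normally made by adding the new pool to the file system and re-pointing the layout of the root directory, roughly as follows (the exact commands and mount point are only illustrative; existing file data stays in the old pool, only newly created files go to con-fs2-data2):

# ceph fs add_data_pool con-fs2 con-fs2-data2
# setfattr -n ceph.dir.layout.pool -v con-fs2-data2 /mnt/con-fs2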

The problematic snapshots are the ones still present in pool con-fs2-data2, in the set [2d6~1,2d8~1,2da~1,2dc~1,2de~1,2e0~1,2e2~1,2e4~1,2e6~1,2e8~1,2ea~18,303~3], which should not be there. They correspond to decimal snap IDs 727, 729, 731, 733, 735, 737, 739, 741, 743, 745 and 747. All MDS daemons report the following snap IDs:

# ceph daemon mds.ceph-23 dump snaps | grep snapid
            "snapid": 400,
            "snapid": 445,
            "snapid": 770,
            "snapid": 774,
            "snapid": 776,
            "snapid": 778,
            "snapid": 780,
            "snapid": 782,
            "snapid": 784,
            "snapid": 786,
            "snapid": 788,
            "snapid": 791,

These extra snapshots seem to cause performance issues and I would like to know how to get rid of them.
