Bug #58376

CephFS snapshots are still accessible from one client even after being deleted from another client

Added by Kotresh Hiremath Ravishankar over 1 year ago. Updated 5 months ago.

Status:
Triaged
Priority:
Normal
Assignee:
Category:
Correctness/Safety
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
pacific,quincy
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Client, MDS
Labels (FS):
snapshots
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

The issue was reported by an upstream user: a snapshot is still accessible from a client that was copying from it, even after the snapshot has been deleted from another client.
Local reproducer:

# Create a ~1 GB file and take a snapshot. FUSE mounts are used on both clients.
$ dd if=/dev/urandom of=/mnt/file-1GB bs=4M count=250
250+0 records in
250+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 3.19042 s, 329 MB/s
[kotresh@fedora mnt]$ ls -lrth
total 1000M
-rw-r--r--. 1 kotresh kotresh 1000M Jan  4 14:54 file-1GB
$ mkdir /mnt/.snap/snapshot1

# Copy the snapshot file from client /mnt
$ cp -p /mnt/.snap/snapshot1/file-1GB ~/

# Delete from other client /mnt1
$ rmdir /mnt1/.snap/snapshot1
$ ls -l /mnt1/.snap
total 0

# Copy is still successful from client /mnt
$ cp -p /mnt/.snap/snapshot1/file-1GB ~/file3
$ ls -l /mnt/.snap/
total 0
$ cp -p /mnt/.snap/snapshot1/file-1GB ~/file4
$ ls -l /mnt/.snap/
total 0
$ cp -p /mnt/.snap/snapshot1/file-1GB ~/file5
$ cp -p /mnt/.snap/snapshot1/file-1GB ~/file6
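
For convenience, here is a minimal scripted form of the reproducer above. It is only a sketch, assuming the same CephFS file system is mounted via ceph-fuse at /mnt (client A) and /mnt1 (client B) and that the snapshot name snapshot1 is not already in use; paths and names are illustrative.

#!/bin/sh
# Sketch: automates the manual reproducer above.
# Assumes ceph-fuse mounts of the same file system at /mnt and /mnt1.
set -x

# Client A: create a ~1 GB file and take a snapshot.
dd if=/dev/urandom of=/mnt/file-1GB bs=4M count=250
mkdir /mnt/.snap/snapshot1

# Client A: copy from the snapshot once, so this client has it open/cached.
cp -p /mnt/.snap/snapshot1/file-1GB ~/file1

# Client B: delete the snapshot.
rmdir /mnt1/.snap/snapshot1
ls -l /mnt1/.snap          # no snapshot listed here anymore

# Client A: the snapshot is no longer listed ...
ls -l /mnt/.snap
# ... yet copying from it still succeeds, which is the reported bug.
cp -p /mnt/.snap/snapshot1/file-1GB ~/file2 && echo "BUG: trimmed snapshot still readable"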

Here is a copy of the mail from the ceph-users list (https://www.spinics.net/lists/ceph-users/msg75291.html).

Dear ceph community,

We have two Ceph clusters of equal size, one main and one mirror, both deployed with cephadm and running version

ceph version 17.2.1 (ec95624474b1871a821a912b8c3af68f8f8e7aa1) quincy (stable)

We are stuck copying a large file (~64 GB) between the CephFS file systems of the two clusters.

The source path is a snapshot (i.e. something like /my/path/.snap/schedule_some-date/…).
But I don't think that should make any difference.

First, I thought I needed to adapt some rsync parameters to work better with bigger files on CephFS.
But when I checked by simply copying the file with cp, the transfer also gets stuck.
There is no error message; the process (rsync or cp) just keeps running.
But at some point the file size on the target stops increasing (at almost 85%).

Main:
-rw------- 1 cockpit-ws printadmin 68360698297 16. Nov 13:40 LB22_2764_dragen.bam

Mirror:
-rw------- 1 root root 58099499008 22. Dez 15:54 LB22_2764_dragen.bam

Our CephFS file size limit of 10 TB is more than generous.
And as far as I know from clients, there are indeed files in the TB range on the cluster without issues.

I don't know whether this is the file's fault or some issue with either of the CephFS file systems or clusters.
And I don't know where to look or how to troubleshoot this.
Can anybody give me a tip on where to start looking and how to debug these kinds of issues?

---

Trying to exclude clusters and/or clients might have gotten me on the right track. It might have been a client issue or actually a snapshot retention issue. As it turned out, when I tried other routes for the data using a different client, the data was not available anymore since the snapshot had been trimmed.

We got behind on syncing our snapshots a while ago (due to other issues). And now we are somewhere in between our weekly (16 weeks) and daily (30 days) snapshots. So I assume that until we catch up with the dailies (<30 days), there is a general risk that snapshots disappear while we are syncing them.

The funny/weird thing, though (and why I didn't catch on to this), is that the particular file (and potentially others) from this trimmed snapshot was apparently still available to the client I initially used for the transfer. I'm wondering if the client somehow cached the data until the snapshot got trimmed, and then just re-tried copying the incompletely cached data.

Continuing with the next available snapshot, mirroring/syncing is now catching up again. I expect it might happen again once we catch up to the 30-day threshold, if the snapshot trimming falls into the syncing time frame. But then I know to just cancel/skip the current snapshot and continue with the next one. Syncing time is short enough to get me over the hill before the next trimming.

Note to myself: Next time something similar happens, check whether different clients AND different snapshots or the original data behave the same.
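
Not from the original report, but as a rough sketch of the check suggested in the note above: before (re)starting a sync, one can verify from an independent client mount that the snapshot still exists, and inspect the snap-schedule retention so the sync window stays shorter than the retention period. The mount point /mnt2 and the path /my/path are placeholders.

# From a second, independently mounted client (assumed at /mnt2),
# confirm the snapshot directory is still listed before syncing from it:
ls -l /mnt2/my/path/.snap/

# Inspect the schedules and retention configured for the path
# (snap-schedule mgr module, available in quincy):
ceph fs snap-schedule status /my/path
ceph fs snap-schedule list /my/path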


Related issues: 1 (0 open, 1 closed)

Related to CephFS - Bug #62674: cephfs snapshot remains visible in nfs export after deletion and new snaps not shown (Duplicate)
