Bug #56605

Snapshot and xattr scanning in cephfs-data-scan

Added by Xiubo Li 7 months ago. Updated 6 months ago.

Status: In Progress
Priority: High
Assignee:
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): tools
Labels (FS): scrub, snapshots
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We are performing a step-by-step recovery using an alternate metadata pool; for details, please see https://docs.ceph.com/en/nautilus/cephfs/disaster-recovery-experts/#using-an-alternate-metadata-pool-for-recovery.

We found that the snapshot info could not be recovered.

For example, take the path /mnt/cephfs/mydir/myfile. The 0th object of myfile in the data pool has a parent xattr, which holds the backtrace info for this object. Then make a snapshot in /mnt/cephfs/.snap; let's assume the snapid is a.

Under mydir, the dentry name of myfile is:

# rados -p recovery listomapkeys 1000098a1a4.00000000
myfile_head

And from the data pool we can see that the parent xattr is set for myfile:

# ./bin/rados -p cephfs.a.data listxattr 1000098a1a5.00000000
layout
parent

After removing myfile, the dentry name becomes:

# rados -p recovery listomapkeys 1000098a1a4.00000000
myfile_a

The 0th object of myfile loses its parent xattr:

# rados -p cephfs.a.data getxattr 1000098a1a5.00000000 parent
error getting xattr cephfs.a.data/1000098a1a5.00000000/parent: (2) No such file or directory
# ./bin/rados -p cephfs.a.data listxattr 1000098a1a5.00000000
error getting xattr set cephfs.a.data/1000098a1a5.00000000: (2) No such file or directory

We can see the 1000098a1a5.00000000 object is still in the data pool:

# ./bin/rados -p cephfs.a.data ls
1000098a39c.00000002
10000000009.00000000
10000000b2b.00000000
1000000042d.00000000
1000098a39c.00000000
1000098a1a5.00000000
...

So when scan_inodes runs, it cannot find a backtrace in object 1000098a1a5.00000000, and therefore cannot add the myfile_a dentry to mydir/:

cephfs-data-scan scan_inodes --alternate-pool recovery --filesystem <original filesystem name> --force-corrupt --force-init <original data pool name>

Consequently, scan_links later cannot add snapid a to the SnapServer table:

cephfs-data-scan scan_links --filesystem recovery-fs

Is this a bug? If not, where should we find the parent xattr?

History

#1 Updated by Greg Farnum 7 months ago

Do you have logs/shell output or can you reproduce this, demonstrating the presence of the xattr before taking the snapshot?

#2 Updated by Xiubo Li 7 months ago

Greg Farnum wrote:

Do you have logs/shell output or can you reproduce this, demonstrating the presence of the xattr before taking the snapshot?

Yeah, I can reproduce it every time I run the above test.

Here is another test I did:

[root@lxbceph1 kcephfs]# mkdir mydir444
[root@lxbceph1 kcephfs]# echo dklfjsa > mydir444/fiel

[root@lxbceph1 build]# ./bin/rados -p cephfs.a.data getxattr 1000098a39e.00000000 parent
X������fiemydir444T
[root@lxbceph1 build]# ./bin/rados -p cephfs.a.data listxattr 1000098a39e.00000000
layout
parent

Then make a snapshot and remove the file:

[root@lxbceph1 kcephfs]# mkdir .snap/dkfjals
[root@lxbceph1 kcephfs]# rm mydir444/fiel 
rm: remove regular file 'mydir444/fiel'? y

Then flush the journal and check the xattr again:

[root@lxbceph1 build]# ./bin/ceph daemon mds.a flush journal
{
    "message": "",
    "return_code": 0
}
[root@lxbceph1 build]# ./bin/rados -p cephfs.a.data getxattr 1000098a39e.00000000 parent
error getting xattr cephfs.a.data/1000098a39e.00000000/parent: (2) No such file or directory
[root@lxbceph1 build]# ./bin/rados -p cephfs.a.data listxattr 1000098a39e.00000000
error getting xattr set cephfs.a.data/1000098a39e.00000000: (2) No such file or directory

#3 Updated by Xiubo Li 7 months ago

Thinking about this more: even if the xattrs were not lost, we still couldn't recover the snapshot from the data pool. I don't see anywhere in the MDS that updates the parent xattr from myfile_head to myfile_${snapid} in the backtrace info.

RADOS does copy-on-write (COW), and after the COW happens the old contents will be copied to other objects(?), but we can't see them with rados -p $POOL ls. If we want to recover the snapshots using a fresh metadata pool, we must find a way to get at them.

#4 Updated by Neha Ojha 7 months ago

  • Assignee set to Matan Breizman

#5 Updated by Xiubo Li 7 months ago

Let me describe how CephFS behaves here:

1, A directory and its contents are all metadata, and are stored in the metadata pool.

2, A file's metadata is stored together with its parent, which is a directory as mentioned above, and its data is stored in a separate data pool.

3, In the data pool, the first object of a file, also called the 0th object, also carries some metadata, such as the parent xattr that stores the backtrace of the file.
We need the backtrace to recover the metadata; with it we can know the dentry name, which directory the file belongs to, etc.

4, If no snapshot has been made since the file's creation, deleting the file from CephFS deletes all of its metadata and data objects from both pools.

5, But if we create a snapshot and then delete the file, they are kept.

==============

It's very easy to reproduce this via CephFS; the detailed steps follow:

a), Build Ceph from source.

b), Set up a CephFS cluster:

  # cd ./ceph/build/ && MDS=1 MON=3 OSD=3 MGR=1 ../src/vstart.sh -n -X -G --msgr1
  # ./bin/ceph fs status
  a - 0 clients
  =
  RANK  STATE   MDS     ACTIVITY     DNS    INOS   DIRS   CAPS  
   0    active   c   Reqs:    0 /s    10     13     12      0   
       POOL        TYPE     USED  AVAIL  
  cephfs.a.meta  metadata  96.0k  98.9G  
  cephfs.a.data    data       0   98.9G  
  STANDBY MDS  
       a       
       b       
  MDS version: ceph version 17.0.0-13587-gdcc92e07b25 (dcc92e07b2557170293e55675763614717c12d98) quincy (dev)

You should see the metadata and data pools as above.

c), Mount a CephFS client:

  # ./bin/ceph-fuse /mnt/cephfs/

d), Make one directory and create a file:

  # mkdir /mnt/cephfs/mydir && echo 12345 > /mnt/cephfs/mydir/myfile

e), Flush the journal before checking the data pool:

  # ./bin/ceph daemon mds.a flush journal

f), Get the ino# for myfile:

  # stat /mnt/cephfs/mydir/myfile |grep Inode
  Device: 2fh/47d    Inode: 1099511627777  Links: 1

1099511627777 is 0x10000000001 in hex.

g), List myfile's 0th object in the data pool, list all of its xattrs, and get the value of the parent xattr:

  # ./bin/rados -p cephfs.a.data ls
  10000000001.00000000

  # ./bin/rados -p cephfs.a.data listxattr 10000000001.00000000
  layout
  parent

  # ./bin/rados -p cephfs.a.data getxattr 10000000001.00000000 parent
  W�myfilemydir

h), Create a snapshot under mydir:

  # mkdir /mnt/cephfs/mydir/.snap/mysnap
  # ls /mnt/cephfs/mydir/.snap
  mysnap

i), Delete myfile:

  # rm /mnt/cephfs/mydir/myfile -f
  # ll /mnt/cephfs/mydir/
  total 0

j), Flush the journal again before checking the data pool:

  # ./bin/ceph daemon mds.a flush journal

k), Check the parent xattr from the 0th object of myfile:

  # ./bin/rados -p cephfs.a.data ls
  10000000001.00000000

  # ./bin/rados -p cephfs.a.data listxattr 10000000001.00000000
  error getting xattr set cephfs.a.data/10000000001.00000000: (2) No such file or directory

  # ./bin/rados -p cephfs.a.data getxattr 10000000001.00000000 parent
  error getting xattr cephfs.a.data/10000000001.00000000/parent: (2) No such file or directory

They have disappeared!
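
The object names used in steps f), g) and k) follow from the inode number: inode 1099511627777 corresponds to object 10000000001.00000000. A minimal Python sketch of that mapping, assuming the `<inode in hex>.<object index as 8 hex digits>` pattern visible in the `rados ls` output in this thread (the helper name `data_object_name` is hypothetical, not a Ceph API):

```python
def data_object_name(ino: int, index: int = 0) -> str:
    """Return the data-pool object name for object `index` of inode `ino`.

    Assumption: names look like "<ino hex>.<index as 8 hex digits>",
    as in the `rados ls` listings shown in this thread.
    """
    return f"{ino:x}.{index:08x}"


if __name__ == "__main__":
    # Inode number reported by `stat` in step f)
    print(data_object_name(1099511627777))  # 10000000001.00000000
```

With this, the 0th object for any inode reported by `stat` can be located without listing the whole pool.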

==============

I am still not sure how to reproduce it independently of CephFS, but in theory you could:

aa), Create a pool
bb), Allocate an object in it
cc), Set some xattrs on this object
dd), Make a snapshot of it
ee), Delete the object(?)
ff), Check whether the xattrs on that object are still there

Thanks!

#6 Updated by Radoslaw Zarzynski 7 months ago

  • Status changed from New to In Progress

#7 Updated by Matan Breizman 7 months ago

Hi Xiubo, thank you for the detailed information!

From a RADOS standpoint everything is working as expected.
We are able to retrieve the xattrs of cloned objects by specifying `-s/--snap` <snapname> (as shown below).

After setting xattrs manually on the object (obj):

$ rados -p test_clone mksnap snpp
created pool test_clone snap snpp 

$ rados -p test_clone rm obj
$ rados -p test_clone ls
obj

$ rados -p test_clone listxattr obj
error getting xattr set test_clone/obj: (2) No such file or directory 

$ rados -p test_clone getxattr obj test
error getting xattr test_clone/obj/test: (2) No such file or directory 

$ rados -p test_clone -s snpp listxattr obj 
selected snap 1 'snpp'
test

$ rados -p test_clone -s snpp getxattr obj test
selected snap 1 'snpp'
test1

The underlying issue is that the rados tool is not aware of the snapshots created with CephFS (self-managed snaps). It is also impossible to `lssnap` the snapshots created by CephFS.

For the test case provided earlier, I was able to locate the xattrs using ceph-objectstore-tool (so no xattrs are actually being lost):

$ ceph-objectstore-tool --data-path ./dev/osd0 --pgid 3.7 '{"oid":"10000000001.00000000","key":"","snapid":2,"hash":211683143,"max":0,"pool":3,"namespace":"","max":0}' list-attrs

_
_layout
_parent

Meaning that

We can see the 1000098a1a5.00000000 object is still in the data pool: ...

is the expected behavior, as the head object doesn't have any xattrs.
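
The behavior in this comment (the object is still listed, the head has no xattrs, the snapshot clone keeps them) can be pictured with a toy model. This is a deliberately simplified Python sketch, not how RADOS is implemented: the class `ToyPool` and all of its methods are hypothetical, and it ignores details such as objects created after the latest snapshot.

```python
class ToyPool:
    """Toy model of head/clone behavior; NOT the real RADOS implementation."""

    def __init__(self):
        self.head = {}    # oid -> xattr dict of the live (head) object
        self.clones = {}  # (oid, snapid) -> xattr dict preserved for a snapshot
        self.snaps = []   # snapids taken so far, oldest first

    def set_xattr(self, oid, name, value):
        self.head.setdefault(oid, {})[name] = value

    def snapshot(self):
        snapid = len(self.snaps) + 1
        self.snaps.append(snapid)
        return snapid

    def remove(self, oid):
        # Removing the head keeps its state as a clone for the newest snapshot.
        xattrs = self.head.pop(oid)
        if self.snaps:
            self.clones[(oid, self.snaps[-1])] = dict(xattrs)

    def list_objects(self):
        # An object stays listed while any clone of it still exists.
        return sorted(set(self.head) | {oid for oid, _ in self.clones})

    def list_xattrs(self, oid, snapid=None):
        if snapid is None:
            src = self.head.get(oid)
        else:
            src = self.clones.get((oid, snapid))
        if src is None:
            # Mirrors "(2) No such file or directory" in the transcripts.
            raise FileNotFoundError(oid)
        return sorted(src)


if __name__ == "__main__":
    pool = ToyPool()
    pool.set_xattr("obj", "parent", b"backtrace")
    snap = pool.snapshot()
    pool.remove("obj")
    print(pool.list_objects())            # ['obj']
    print(pool.list_xattrs("obj", snap))  # ['parent']
```

Reading the head after the removal raises `FileNotFoundError`, while reading through the snapshot id still returns the xattr names, which matches the `-s snpp` transcript above.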

#8 Updated by Greg Farnum 7 months ago

  • Project changed from RADOS to CephFS
  • Subject changed from "All the xattrs will but lost after the files are delete if they have been snapshoted" to "Snapshot and xattr scanning in cephfs-data-scan"
  • Status changed from In Progress to Need More Info
  • Assignee changed from Matan Breizman to Xiubo Li
  • Component(FS) tools added
  • Labels (FS) scrub, snapshots added

Matan Breizman wrote:

Meaning that

We can see the 1000098a1a5.00000000 object is still in the data pool: ...

is the expected behavior, as the head object doesn't have any xattrs.

Ah yes, I missed that we were asking the nonexistent HEAD object for xattrs, instead of the existing snapshot. I'm moving this back to CephFS so we can figure out if we need to make some adjustments to the interface we're using.

Xiubo, we do have interfaces that let us read snapshot data when scanning through PGs, but it's possible cephfs-data-scan isn't using them and currently just works on the HEAD state of the filesystem. Let's investigate that.

#9 Updated by Xiubo Li 7 months ago

Sure, I will investigate it and try to fix the cephfs-data-scan tool next week.

And thanks very much @Matan for your info.

#10 Updated by Xiubo Li 7 months ago

Here is my local test case: https://github.com/lxbsz/ceph/tree/wip-56605-draft.

By using:

int librados::IoCtx::list_snaps(const std::string& oid, snap_set_t *out_snaps)

I can list all the snapids of the objects in the data pool, something like:

cloneid: 3 snaps: [2,3] overlap: [] size: 7

That is all. With only the snapids we cannot know which snaprealm each snapid belongs to, or what the hierarchy of the snaprealms is. Most importantly, all the snapshot names will be lost.

Also, if we create a snapshot under an empty directory, that snapid will be lost too.

IMO if we want to recover this from the data pool, we need a dedicated object in the data pool that saves the hierarchy of the snaprealms.

@Greg, any better idea?

Thanks!

#11 Updated by Greg Farnum 6 months ago

Hmm, I think we need to do more architecture work here now that you've pointed that out. A single object is a big problem both in terms of scalability and the potential for getting lost. We should be able to see that we're missing snapshots by listing snaps on objects? I don't remember if there's a cheap way to do that, though. :/ And it won't tell us clearly what the snapshot roots are supposed to be or what the start and end intervals are.

#12 Updated by Xiubo Li 6 months ago

Our goal here is to recover the snaprealms and snaptable from the data pool. That is hard to do with only the backtraces and snapids we can get from the objects.

Why not store a copy of the snaprealms in the data pool when committing them to the metadata pool? That would also resolve the empty-directory issue mentioned in previous comments.

#13 Updated by Xiubo Li 6 months ago

We should be able to see that we're missing snapshots by listing snaps on objects?

Yeah. If a file was snapshotted before being deleted, its objects in the data pool will be COWed, and we can see them when listing the objects in the data pool.

I found another issue when reading the source code of the recovery tools, from https://docs.ceph.com/en/latest/cephfs/disaster-recovery-experts/#disaster-recovery-experts:

Finally, you can regenerate metadata objects for missing files and directories based on the contents of a data pool. This is a three-phase process. First, scanning all objects to calculate size and mtime metadata for inodes. Second, scanning the first object from every file to collect this metadata and inject it into the metadata pool. Third, checking inode linkages and fixing found errors.

cephfs-data-scan scan_extents <data pool>
cephfs-data-scan scan_inodes <data pool>
cephfs-data-scan scan_links

scan_inodes fails to recover the snapshot dentry items. In the MDS code, COWing a dentry/inode adds the COWed dentry item to the parent directory, but scan_inodes doesn't do that.

Shouldn't we recover them?
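
The dentry names seen earlier in this thread (`myfile_head` for the live file, `myfile_a` once the file survives only in snapshot a) hint at the key a recovered snapshot dentry would need. A minimal Python sketch under two assumptions: the suffix is the snapid written in hex, and the head is marked by the `CEPH_NOSNAP` sentinel. `dentry_key` is a hypothetical helper for illustration, not part of cephfs-data-scan.

```python
# Assumption: sentinel for "no snapshot" (the head), i.e. (uint64)-2.
CEPH_NOSNAP = 0xFFFFFFFFFFFFFFFE


def dentry_key(name: str, snapid: int = CEPH_NOSNAP) -> str:
    """Build the omap key of a dentry in its parent directory object.

    Assumption: "<name>_head" for the live dentry and "<name>_<snapid hex>"
    for a snapshotted one, matching the listomapkeys output in this thread.
    """
    suffix = "head" if snapid == CEPH_NOSNAP else format(snapid, "x")
    return f"{name}_{suffix}"


if __name__ == "__main__":
    print(dentry_key("myfile"))       # myfile_head
    print(dentry_key("myfile", 0xA))  # myfile_a
```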

#14 Updated by Xiubo Li 6 months ago

The listsnaps command can list the snapids of an object:

# ./bin/rados --pool cephfs.a.data listsnaps 10000000006.00000000
10000000006.00000000:
cloneid    snaps    size    overlap
8    8    0    []
15    9,10,11,12,13,14,15    9    []
16    16    17    []
head    -    0
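
To check for missing snapshots by listing snaps on objects, as discussed above, this output first has to be turned into a set of snapids. A minimal Python sketch, assuming the whitespace-separated layout shown here; `parse_listsnaps` and `snapids_seen` are hypothetical helpers, and the real output may contain variations this does not handle:

```python
def parse_listsnaps(text: str):
    """Parse `rados listsnaps` text into (cloneid, [snapids], size) tuples.

    Assumption: clone rows look like "<cloneid> <snap,snap,...> <size> <overlap>".
    The object-name line, the column header and the head row are skipped.
    """
    clones = []
    for line in text.splitlines():
        fields = line.split()
        # Only clone rows start with a purely numeric clone id.
        if len(fields) < 3 or not fields[0].isdigit():
            continue
        snaps = [int(s) for s in fields[1].split(",")]
        clones.append((int(fields[0]), snaps, int(fields[2])))
    return clones


def snapids_seen(clones):
    """Return the sorted list of all snapids referenced by any clone."""
    return sorted({s for _, snaps, _ in clones for s in snaps})


if __name__ == "__main__":
    sample = """10000000006.00000000:
cloneid    snaps    size    overlap
8    8    0    []
15    9,10,11,12,13,14,15    9    []
16    16    17    []
head    -    0
"""
    print(snapids_seen(parse_listsnaps(sample)))
```

Comparing the resulting snapids with those known to the recovered snap table would show which snapshots are missing, though not their names or snaprealms.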

#15 Updated by Greg Farnum 6 months ago

Xiubo Li wrote:

Our goal here is to recover the snaprealms and snaptable from the data pool. That is hard to do with only the backtraces and snapids we can get from the objects.

Why not store a copy of the snaprealms in the data pool when committing them to the metadata pool? That would also resolve the empty-directory issue mentioned in previous comments.

Where in the data pool would we store them? Are we going to copy the entire metadata backing objects into the data pool? Keep in mind we need to maintain our security model and I’m not sure namespaces will do enough there since users will be able to read a lot more than the mds necessarily shares with them.

Once we actually have some data, how do we use it to recover fully? If we’re making a full copy that’s very expensive; if we aren’t, how do we use it to get closer to recovering everything?

I don’t have good answers for this, and it doesn’t sound like you do yet either. That’s why I said we would need to do more architecture, and that needs to happen before we start any coding.
I can almost see how we can recover snapshots if we assume no renames happen, but with renames across snaprealms (and the things associated with past_parents in the mds) I just don’t see how we resolve this cleanly. :/

#16 Updated by Xiubo Li 6 months ago

  • Status changed from Need More Info to In Progress

Greg Farnum wrote:

Xiubo Li wrote:

Our goal here is to recover the snaprealms and snaptable from the data pool. That is hard to do with only the backtraces and snapids we can get from the objects.

Why not store a copy of the snaprealms in the data pool when committing them to the metadata pool? That would also resolve the empty-directory issue mentioned in previous comments.

Where in the data pool would we store them?

In one dedicated object in the data pool, such as 3.00000000.

Are we going to copy the entire metadata backing objects into the data pool?

No, only the snaprealms, saved in omap as <SnapRealm's ino#, struct SnapRealm> pairs, where the SnapRealm's ino# is the parent directory's ino#.

Keep in mind we need to maintain our security model and I’m not sure namespaces will do enough there since users will be able to read a lot more than the mds necessarily shares with them.

Once we actually have some data, how do we use it to recover fully?

The other objects in the data pool can help us recover the whole hierarchy of the filesystem. Then we just need to iterate over the 3.00000000 object to get the key:value pairs and fill each SnapRealm into its parent directory.
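
The recovery step proposed here can be sketched as follows. This is purely illustrative Python for the proposal, not existing cephfs-data-scan behavior: plain dicts stand in for the recovered directories and for the omap of the dedicated object, and `restore_snaprealms` and the record shapes are hypothetical.

```python
def restore_snaprealms(directories, realm_omap):
    """Attach saved snaprealm records to recovered directories.

    directories: dict ino -> dict describing a recovered directory (hypothetical shape)
    realm_omap:  dict ino -> saved realm record, standing in for the
                 <SnapRealm's ino#, struct SnapRealm> omap pairs
    Returns the inos whose realm has no recovered directory to attach to.
    """
    orphaned = []
    for ino, realm in sorted(realm_omap.items()):
        directory = directories.get(ino)
        if directory is None:
            # E.g. an empty directory that could not be rebuilt from data objects.
            orphaned.append(ino)
            continue
        directory["snaprealm"] = realm
    return orphaned


if __name__ == "__main__":
    dirs = {0x10000000000: {"name": "mydir"}}
    realms = {
        0x10000000000: {"snaps": {2: "mysnap"}},
        0x10000000005: {"snaps": {3: "lost"}},
    }
    print(restore_snaprealms(dirs, realms))  # [1099511627781]
    print(dirs[0x10000000000]["snaprealm"])  # {'snaps': {2: 'mysnap'}}
```

The returned list makes the empty-directory gap explicit: a saved realm whose directory was never rebuilt has nowhere to go.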

If we’re making a full copy that’s very expensive; if we aren’t, how do we use it to get closer to recovering everything?

Since snaprealms normally won't exist in every directory, a whole filesystem shouldn't have too many of them. IMO it won't be expensive to store a copy of them in the data pool.

I don’t have good answers for this, and it doesn’t sound like you do yet either. That’s why I said we would need to do more architecture, and that needs to happen before we start any coding.

Yeah, right.

But currently this is the best choice I can come up with; there may be a better one.

There is also one exception: currently empty directories will be lost, including the snaprealms in them.

I can almost see how we can recover snapshots if we assume no renames happen, but with renames across snaprealms (and the things associated with past_parents in the mds) I just don’t see how we resolve this cleanly. :/
