Support #40906

Full CephFS causes hang when accessing inode.

Added by Robert LeBlanc about 2 months ago. Updated 24 days ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
Start date:
07/23/2019
Due date:
% Done:

0%

Tags:
Reviewed:
Affected Versions:
Component(FS):
Labels (FS):
Pull request ID:

Description

Our CephFS filled up, and we are having trouble operating on a directory that has a file with a bad inode state. We would like CephFS to simply delete this bad inode; we can remove the RADOS objects manually if needed.

History

#1 Updated by Robert LeBlanc about 2 months ago

The complete logs were 81 GB, so I just filtered on the client session. If you need more complete logs, let me know the time span you want. The cluster was doing about 1.5K ops/s.

https://owncloud.leblancnet.us/owncloud/index.php/s/4KCp4UMclt4qfwZ

#2 Updated by Zheng Yan about 2 months ago

The log does not contain information about the bad inode.

To delete the inode with rados:

1. Find the inode number of the parent directory.
2. rados -p cephfs_metadata listomapkeys <inode number of parent directory in hex>.00000000
3. rados -p cephfs_metadata rmomapkey <inode number of parent directory in hex>.00000000 <file name>_head
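Concretely, the steps above might look like the following sketch. The mount path, pool name, inode numbers, and file name are all placeholders for illustration, not values from this cluster:

```shell
# Step 1: find the parent directory's inode number. ls prints it in
# decimal; the metadata object name needs it in hex.
ls -id /mnt/cephfs/parent_dir         # prints e.g. 1099511627776
printf '%x\n' 1099511627776           # -> 10000000000

# Step 2: list the dentry keys stored in the directory's first object;
# the bad file should appear as "<file name>_head".
rados -p cephfs_metadata listomapkeys 10000000000.00000000

# Step 3: remove the bad file's dentry from the directory object.
rados -p cephfs_metadata rmomapkey 10000000000.00000000 badfile_head
```

The rados commands require access to the cluster's metadata pool, so this is only a template to adapt.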

#3 Updated by Patrick Donnelly about 2 months ago

  • Tracker changed from Bug to Support
  • Status changed from New to Rejected

This discussion should move to the ceph-users list: https://lists.ceph.io/postorius/lists/ceph-users.ceph.io/

#4 Updated by Patrick Donnelly about 2 months ago

  • Status changed from Rejected to New

Reopening this as I see now that Zheng asked you to create the ticket.

#5 Updated by Robert LeBlanc about 2 months ago

Is deleting the RADOS object for the inode going to cause more problems with the MDS because they get out of sync?

#6 Updated by Zheng Yan about 2 months ago

After deleting the omap key, use the scrub_path asok command to repair the parent directory.
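A sketch of that repair step, run against the admin socket of the active MDS. The daemon name `mds.a` and the path are placeholders, and the `recursive`/`repair` flags (which tell the forward scrub to descend and fix what it finds) may vary by release:

```shell
# Run on the host of the active MDS, against its admin socket.
# "repair" asks the MDS to fix inconsistencies the scrub finds.
ceph daemon mds.a scrub_path /parent_dir recursive repair
```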

#7 Updated by Robert LeBlanc about 2 months ago

Okay, I'll give it a shot next week on some of my smaller directories.

Just to be sure I understand the process:
1. Delete all RADOS objects for the bad inode, `<inode_in_hex>.00000000` through `<inode_in_hex>.xxxxxxxx` (file size divided by 4 MB, or all parts listed in `rados ls`).
2. Find the inode of the parent directory: `ls -id /parent_dir`.
3. List the omap keys for the parent directory in the CephFS metadata pool: `rados -p cephfs_metadata listomapkeys <inode number of parent directory in hex>.00000000`.
4. Remove the omap key for the bad file, basically 'erasing' it from the file system metadata: `rados -p cephfs_metadata rmomapkey <inode number of parent directory in hex>.00000000 <file name>_head`.
5. Have Ceph rescan the directory and rebuild the index to make sure it is in sync: `ceph daemon mds.0 scrub_path /parent_dir`. (Do I need to specify a pool or file system, or is mds.0 good enough? I'm not finding much information about this command.)

If I understand what is going on, it helps me to feel more comfortable about it. Thanks!

#8 Updated by Robert LeBlanc about 2 months ago

Please confirm that I understand the process so that I can give it a try.

Thanks!

#9 Updated by Robert LeBlanc about 1 month ago

Okay, we had another data corruption incident, so I took some time to look deeper into the problem. I did some scrub_path runs on some subdirectories (we have a very large FS, so doing it all at once is rough). I scanned the top directory here and it found and fixed some issues, but walking all the files and subdirectories, this one still had problems. I tried scrub_path on the directory above it and it didn't find anything.

I targeted one file in this example; when I pulled the object from RADOS, it was zero size, but the xattrs looked intact. I'm guessing there is some discrepancy in the omap causing issues. I could not find any documentation on the format, but I figured out that the inode is in bytes 15-26 (? not sure of the exact end), stored least-significant byte first. So I was able to figure out what the inode of the 'bad' file was to check it. I hope this helps.

root@rleblanc-ceph-test:/mnt/fuse/pcs01-fs# ls -lh 2019-08-04_v8_pv/test/10/
=====client hung here, I evicted the client=====
ls: cannot access '2019-08-04_v8_pv/test/10/0.lmdb': Cannot send after transport endpoint shutdown
ls: cannot access '2019-08-04_v8_pv/test/10/8.tfrecord': Cannot send after transport endpoint shutdown
ls: cannot access '2019-08-04_v8_pv/test/10/3.tfrecord': Cannot send after transport endpoint shutdown
ls: cannot access '2019-08-04_v8_pv/test/10/4.tfrecord': Cannot send after transport endpoint shutdown
ls: cannot access '2019-08-04_v8_pv/test/10/1.tfrecord': Cannot send after transport endpoint shutdown
ls: cannot access '2019-08-04_v8_pv/test/10/13.tfrecord': Cannot send after transport endpoint shutdown
ls: cannot access '2019-08-04_v8_pv/test/10/0.tfrecord': Cannot send after transport endpoint shutdown
ls: cannot access '2019-08-04_v8_pv/test/10/2.tfrecord': Cannot send after transport endpoint shutdown
ls: cannot access '2019-08-04_v8_pv/test/10/11.tfrecord': Cannot send after transport endpoint shutdown
ls: cannot access '2019-08-04_v8_pv/test/10/16.tfrecord': Cannot send after transport endpoint shutdown
ls: cannot access '2019-08-04_v8_pv/test/10/9.tfrecord': Cannot send after transport endpoint shutdown
ls: reading directory '2019-08-04_v8_pv/test/10/': Cannot send after transport endpoint shutdown
total 4.5G
-????????? ? ?        ?           ?            ? 0.lmdb
-????????? ? ?        ?           ?            ? 0.tfrecord
-rw-r--r-- 1 rleblanc rleblanc  26M Aug 12 10:15 10.tfrecord
-????????? ? ?        ?           ?            ? 11.tfrecord
-????????? ? ?        ?           ?            ? 13.tfrecord
-????????? ? ?        ?           ?            ? 16.tfrecord
-rw-r--r-- 1 rleblanc rleblanc 756M Aug 12 22:45 18.tfrecord
-????????? ? ?        ?           ?            ? 1.tfrecord
-rw-r--r-- 1 rleblanc rleblanc 2.2G Aug 12 10:18 20.tfrecord
-????????? ? ?        ?           ?            ? 2.tfrecord
-????????? ? ?        ?           ?            ? 3.tfrecord
-????????? ? ?        ?           ?            ? 4.tfrecord
-rw-r--r-- 1 rleblanc rleblanc 1.6G Aug 12 10:17 5.tfrecord
-????????? ? ?        ?           ?            ? 8.tfrecord
-????????? ? ?        ?           ?            ? 9.tfrecord
root@rleblanc-ceph-test:/mnt/fuse/pcs01-fs# cd ../..
root@rleblanc-ceph-test:/mnt/fuse# ls -lh pcs01-fs/2019-08-04_v8_pv/test/10/10.tfrecord
-rw-r--r-- 1 rleblanc rleblanc 26M Aug 12 10:15 pcs01-fs/2019-08-04_v8_pv/test/10/10.tfrecord
root@rleblanc-ceph-test:/mnt/fuse# ls -lh pcs01-fs/2019-08-04_v8_pv/test/10/18.tfrecord
-rw-r--r-- 1 rleblanc rleblanc 756M Aug 12 22:45 pcs01-fs/2019-08-04_v8_pv/test/10/18.tfrecord
root@rleblanc-ceph-test:/mnt/fuse# ls -lh pcs01-fs/2019-08-04_v8_pv/test/10/20.tfrecord
-rw-r--r-- 1 rleblanc rleblanc 2.2G Aug 12 10:18 pcs01-fs/2019-08-04_v8_pv/test/10/20.tfrecord
root@rleblanc-ceph-test:/mnt/fuse# ls -ldhn pcs01-fs/2019-08-04_v8_pv/test/10
drwxrwxrwx 2 1160776091 1160774145 72G Aug 13 00:42 pcs01-fs/2019-08-04_v8_pv/test/10
root@rleblanc-ceph-test:/mnt/fuse# ls -ldhi pcs01-fs/2019-08-04_v8_pv/test/10
1101963322799 drwxrwxrwx 2 1160776091 1160774145 72G Aug 13 00:42 pcs01-fs/2019-08-04_v8_pv/test/10
root@rleblanc-ceph-test:/mnt/fuse# ls -ldhi pcs01-fs/2019-08-04_v8_pv/test/1
1101963322781 drwxrwxrwx 2 1160776091 1160774145 46G Aug 13 02:22 pcs01-fs/2019-08-04_v8_pv/test/1
root@rleblanc-ceph-test:/mnt/fuse# ls -lhi pcs01-fs/2019-08-04_v8_pv/test/10/10.tfrecord
1101964471700 -rw-r--r-- 1 rleblanc rleblanc 26M Aug 12 10:15 pcs01-fs/2019-08-04_v8_pv/test/10/10.tfrecord
root@rleblanc-ceph-test:/mnt/fuse# ls -lhi pcs01-fs/2019-08-04_v8_pv/test/10/0.lmdb
=====client hung here, evicted the client=====
ls: cannot access 'pcs01-fs/2019-08-04_v8_pv/test/10/0.lmdb': Cannot send after transport endpoint shutdown

=====dig into the directory with the bad metadata=====
$ rados -p sun_pcs01_fs_meta listomapkeys 1009221e5af.00000000
0.lmdb_head
0.tfrecord_head
1.tfrecord_head
10.tfrecord_head
11.tfrecord_head
13.tfrecord_head
16.tfrecord_head
18.tfrecord_head
2.tfrecord_head
20.tfrecord_head
3.tfrecord_head
4.tfrecord_head
5.tfrecord_head
8.tfrecord_head
9.tfrecord_head
=====this file hung, lets look at it=====
$ rados -p sun_pcs01_fs_meta getomapval 1009221e5af.00000000 0.lmdb_head
value (500 bytes) :
00000000  02 00 00 00 00 00 00 00  49 0f 06 c9 01 00 00 ae  |........I.......|
00000010  53 4b 91 00 01 00 00 00  00 00 00 fe 69 52 5d fb  |SK..........iR].|
00000020  00 ce 25 a4 81 00 00 e8  03 00 00 e8 03 00 00 01  |..%.............|
00000030  00 00 00 00 00 00 00 00  00 00 00 00 02 02 18 00  |................|
00000040  00 00 00 00 40 00 01 00  00 00 00 00 40 00 03 00  |....@.......@...|
00000050  00 00 00 00 00 00 00 00  00 00 00 d0 e2 1c 00 00  |................|
00000060  00 00 01 00 00 00 ff ff  ff ff ff ff ff ff 00 00  |................|
00000070  00 00 00 00 00 00 00 00  00 00 00 6a 52 5d f0 a3  |...........jR]..|
00000080  c3 07 fe 69 52 5d fb 00  ce 25 00 00 00 00 01 00  |...iR]...%......|
00000090  00 00 af 70 22 07 00 00  00 00 02 02 18 00 00 00  |...p"...........|
000000a0  00 00 00 00 00 00 00 00  00 00 00 3a 00 00 00 00  |...........:....|
000000b0  01 00 00 00 00 00 00 00  03 02 28 00 00 00 00 00  |..........(.....|
000000c0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
000000e0  00 00 00 00 00 00 03 02  38 00 00 00 00 00 00 00  |........8.......|
000000f0  00 00 00 00 00 d0 e2 1c  00 00 00 00 01 00 00 00  |................|
00000100  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000120  00 00 00 00 03 02 38 00  00 00 00 00 00 00 00 00  |......8.........|
00000130  00 00 00 d0 e2 1c 00 00  00 00 01 00 00 00 00 00  |................|
00000140  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000160  00 00 47 33 00 00 00 00  00 00 00 00 00 00 00 00  |..G3............|
00000170  00 00 01 00 00 00 00 00  00 00 53 32 00 00 00 00  |..........S2....|
00000180  00 00 00 00 00 00 00 00  00 00 00 00 00 00 ff ff  |................|
00000190  ff ff ff ff ff ff 00 00  00 00 01 01 10 00 00 00  |................|
000001a0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
000001c0  00 00 00 00 fe 69 52 5d  fb 00 ce 25 00 00 00 00  |.....iR]...%....|
000001d0  00 00 00 00 ff ff ff ff  00 00 00 00 00 00 00 00  |................|
000001e0  00 00 00 00 00 00 00 00  fe ff ff ff ff ff ff ff  |................|
000001f0  00 00 00 00                                       |....|
000001f4
$ rados -p sun_pcs01_fs_data get 100914b53ae.00000000 test.100914b53ae
$ ls -lh test*
-rw-r--r-- 1 rleblanc rleblanc    0 Aug 13 18:46 test.100914b53ae
$ rados -p sun_pcs01_fs_data getxattr 100914b53ae.00000000 parent | ceph-dencoder type inode_backtrace_t import - decode dump_json
{
    "ino": 1101949260718,
    "ancestors": [
        {
            "dirino": 1101963322799,
            "dname": "0.lmdb",
            "version": 13127
        },
        {
            "dirino": 1101963322778,
            "dname": "10",
            "version": 6026198
        },
        {
            "dirino": 1101951188216,
            "dname": "test",
            "version": 72100
        },
        {
            "dirino": 1101005585283,
            "dname": "2019-08-04_v8_pv",
            "version": 1546747
        }
    ],
    "pool": 3,
    "old_pools": []
}
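As a cross-check on the omap decoding: the dentry omap value stores the inode number least-significant byte first, so the eight bytes starting at offset 15 of the 0.lmdb_head dump, `ae 53 4b 91 00 01 00 00`, reverse to hex 0x100914b53ae. Converting to decimal:

```shell
# Bytes 15..22 of the 0.lmdb_head omap value, reversed (the inode is
# stored least-significant byte first): 0x100914b53ae. In decimal:
printf '%d\n' 0x100914b53ae
# -> 1101949260718
```

which matches both `"ino": 1101949260718` in the backtrace and the data object name 100914b53ae.00000000.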

#10 Updated by Zheng Yan about 1 month ago

Please provide logs of both ceph-fuse and the MDS while accessing the bad file.

#11 Updated by Robert LeBlanc about 1 month ago

It seems that after some time (hours, days, weeks) things get fixed, but it sure would be nice to know how to get it into a healthy state much faster. Luckily this directory still has the problem. Do you need debug level 20 on both? The MDS is currently doing 3,000 IOPs, and that will lead to a huge log.

#12 Updated by Robert LeBlanc about 1 month ago

I've sent an e-mail to Zheng with a link to download the logs due to sensitive info. The client requested the file 0.lmdb at 2019-08-16 10:02:12.644431 (the path is slightly different from what I posted due to removing sensitive data, but it is the same file).

#13 Updated by Robert LeBlanc 26 days ago

Did the logs provide the information that you needed, or do you need more/different information?

#14 Updated by Robert LeBlanc 24 days ago

The MDS crashed while I was working in the damaged directories at `2019-08-23 15:51:03.871241`. The standby took over just fine. I sent the last 11,000 lines of the log to your e-mail address. Thanks.
