Support #40906

Full CephFS causes hang when accessing inode.

Added by Robert LeBlanc about 2 months ago. Updated 24 days ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
Start date:
07/23/2019
Due date:
% Done:

0%

Tags:
Reviewed:
Affected Versions:
Component(FS):
Labels (FS):
Pull request ID:

Description

Our CephFS filled up, and we are having trouble operating on a directory that has a file with a bad inode state. We would like CephFS to simply delete this bad inode; we can remove the RADOS objects manually if needed.

History

#1 Updated by Robert LeBlanc about 2 months ago

The complete logs were 81 GB, so I just filtered on the client session. If you need more complete logs, let me know the time span you want. The cluster was doing about 1.5K ops/s.

https://owncloud.leblancnet.us/owncloud/index.php/s/4KCp4UMclt4qfwZ

#2 Updated by Zheng Yan about 2 months ago

The log does not contain information about the bad inode.

To delete the inode with rados:

1. Find the inode number of the parent directory.
2. rados -p cephfs_metadata listomapkeys <inode number of parent directory in hex>.00000000
3. rados -p cephfs_metadata rmomapkey <inode number of parent directory in hex>.00000000 <file name>_head
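Concretely, the steps above might look like the following sketch. The mount path, pool name, inode numbers, and file name are all placeholders for illustration, not values from this cluster:

```shell
# Step 1: find the parent directory's inode number. ls prints it in
# decimal; the metadata object name needs it in hex.
ls -id /mnt/cephfs/parent_dir         # prints e.g. 1099511627776
printf '%x\n' 1099511627776           # -> 10000000000

# Step 2: list the dentry keys stored in the directory's first object;
# the bad file should appear as "<file name>_head".
rados -p cephfs_metadata listomapkeys 10000000000.00000000

# Step 3: remove the bad file's dentry from the directory object.
rados -p cephfs_metadata rmomapkey 10000000000.00000000 badfile_head
```

The rados commands require access to the cluster's metadata pool, so this is only a template to adapt.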

#3 Updated by Patrick Donnelly about 2 months ago

  • Tracker changed from Bug to Support
  • Status changed from New to Rejected

This discussion should move to the ceph-users list: https://lists.ceph.io/postorius/lists/ceph-users.ceph.io/

#4 Updated by Patrick Donnelly about 2 months ago

  • Status changed from Rejected to New

Reopening this as I see now that Zheng asked you to create the ticket.

#5 Updated by Robert LeBlanc about 2 months ago

Is deleting the RADOS object for the inode going to cause more problems with the MDS because they get out of sync?

#6 Updated by Zheng Yan about 2 months ago

After deleting the omap key, use the scrub_path asok command to repair the parent directory.
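A sketch of that repair step, run against the admin socket of the active MDS. The daemon name `mds.a` and the path are placeholders, and the `recursive`/`repair` flags (which tell the forward scrub to descend and fix what it finds) may vary by release:

```shell
# Run on the host of the active MDS, against its admin socket.
# "repair" asks the MDS to fix inconsistencies the scrub finds.
ceph daemon mds.a scrub_path /parent_dir recursive repair
```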

#7 Updated by Robert LeBlanc about 2 months ago

Okay, I'll give it a shot next week on some of my smaller directories.

Just to be sure I understand the process:
1. Delete all RADOS objects for the bad inode, `<inode_in_hex>.00000000` through `<inode_in_hex>.xxxxxxxx` (file size divided by 4 MB, or all parts listed in `rados ls`).
2. Find the inode of the parent directory: `ls -id /parent_dir`.
3. List the omap keys for the parent directory in the CephFS metadata pool: `rados -p cephfs_metadata listomapkeys <inode number of parent directory in hex>.00000000`.
4. Remove the omap key for the bad file, basically 'erasing' it from the file system metadata: `rados -p cephfs_metadata rmomapkey <inode number of parent directory in hex>.00000000 <file name>_head`.
5. Have Ceph rescan the directory and rebuild the index to make sure it is in sync: `ceph daemon mds.0 scrub_path /parent_dir`. (Do I need to specify a pool or file system, or is mds.0 good enough? I'm not finding much information about this command.)

If I understand what is going on, it helps me to feel more comfortable about it. Thanks!

#8 Updated by Robert LeBlanc about 2 months ago

Please confirm that I understand the process so that I can give it a try.

Thanks!

#9 Updated by Robert LeBlanc about 1 month ago

Okay, we had another data corruption incident, so I took some time to look deeper into the problem. I did some scrub_path runs on some subdirectories (we have a very large FS, so doing it all at once is rough). I scanned the top directory here and it found and fixed some issues, but walking all the files and subdirectories, this one still had problems. I tried scrub_path on the directory above it and it didn't find anything.

I targeted one file in this example; when I pulled the object from RADOS, it was zero size, but the xattrs looked intact. I'm guessing there is some discrepancy in the omap causing issues. I could not find any documentation on the format, but I figured out that the inode is in bytes 15-26 (? not sure of the exact end), stored least-significant byte first. So I was able to figure out what the inode of the 'bad' file was to check it. I hope this helps.

root@rleblanc-ceph-test:/mnt/fuse/pcs01-fs# ls -lh 2019-08-04_v8_pv/test/10/
=====client hung here, I evicted the client=====
ls: cannot access '2019-08-04_v8_pv/test/10/0.lmdb': Cannot send after transport endpoint shutdown
ls: cannot access '2019-08-04_v8_pv/test/10/8.tfrecord': Cannot send after transport endpoint shutdown
ls: cannot access '2019-08-04_v8_pv/test/10/3.tfrecord': Cannot send after transport endpoint shutdown
ls: cannot access '2019-08-04_v8_pv/test/10/4.tfrecord': Cannot send after transport endpoint shutdown
ls: cannot access '2019-08-04_v8_pv/test/10/1.tfrecord': Cannot send after transport endpoint shutdown
ls: cannot access '2019-08-04_v8_pv/test/10/13.tfrecord': Cannot send after transport endpoint shutdown
ls: cannot access '2019-08-04_v8_pv/test/10/0.tfrecord': Cannot send after transport endpoint shutdown
ls: cannot access '2019-08-04_v8_pv/test/10/2.tfrecord': Cannot send after transport endpoint shutdown
ls: cannot access '2019-08-04_v8_pv/test/10/11.tfrecord': Cannot send after transport endpoint shutdown
ls: cannot access '2019-08-04_v8_pv/test/10/16.tfrecord': Cannot send after transport endpoint shutdown
ls: cannot access '2019-08-04_v8_pv/test/10/9.tfrecord': Cannot send after transport endpoint shutdown
ls: reading directory '2019-08-04_v8_pv/test/10/': Cannot send after transport endpoint shutdown
total 4.5G
-????????? ? ?        ?           ?            ? 0.lmdb
-????????? ? ?        ?           ?            ? 0.tfrecord
-rw-r--r-- 1 rleblanc rleblanc  26M Aug 12 10:15 10.tfrecord
-????????? ? ?        ?           ?            ? 11.tfrecord
-????????? ? ?        ?           ?            ? 13.tfrecord
-????????? ? ?        ?           ?            ? 16.tfrecord
-rw-r--r-- 1 rleblanc rleblanc 756M Aug 12 22:45 18.tfrecord
-????????? ? ?        ?           ?            ? 1.tfrecord
-rw-r--r-- 1 rleblanc rleblanc 2.2G Aug 12 10:18 20.tfrecord
-????????? ? ?        ?           ?            ? 2.tfrecord
-????????? ? ?        ?           ?            ? 3.tfrecord
-????????? ? ?        ?           ?            ? 4.tfrecord
-rw-r--r-- 1 rleblanc rleblanc 1.6G Aug 12 10:17 5.tfrecord
-????????? ? ?        ?           ?            ? 8.tfrecord
-????????? ? ?        ?           ?            ? 9.tfrecord
root@rleblanc-ceph-test:/mnt/fuse/pcs01-fs# cd ../..
root@rleblanc-ceph-test:/mnt/fuse# ls -lh pcs01-fs/2019-08-04_v8_pv/test/10/10.tfrecord
-rw-r--r-- 1 rleblanc rleblanc 26M Aug 12 10:15 pcs01-fs/2019-08-04_v8_pv/test/10/10.tfrecord
root@rleblanc-ceph-test:/mnt/fuse# ls -lh pcs01-fs/2019-08-04_v8_pv/test/10/18.tfrecord
-rw-r--r-- 1 rleblanc rleblanc 756M Aug 12 22:45 pcs01-fs/2019-08-04_v8_pv/test/10/18.tfrecord
root@rleblanc-ceph-test:/mnt/fuse# ls -lh pcs01-fs/2019-08-04_v8_pv/test/10/20.tfrecord
-rw-r--r-- 1 rleblanc rleblanc 2.2G Aug 12 10:18 pcs01-fs/2019-08-04_v8_pv/test/10/20.tfrecord
root@rleblanc-ceph-test:/mnt/fuse# ls -ldhn pcs01-fs/2019-08-04_v8_pv/test/10
drwxrwxrwx 2 1160776091 1160774145 72G Aug 13 00:42 pcs01-fs/2019-08-04_v8_pv/test/10
root@rleblanc-ceph-test:/mnt/fuse# ls -ldhi pcs01-fs/2019-08-04_v8_pv/test/10
1101963322799 drwxrwxrwx 2 1160776091 1160774145 72G Aug 13 00:42 pcs01-fs/2019-08-04_v8_pv/test/10
root@rleblanc-ceph-test:/mnt/fuse# ls -ldhi pcs01-fs/2019-08-04_v8_pv/test/1
1101963322781 drwxrwxrwx 2 1160776091 1160774145 46G Aug 13 02:22 pcs01-fs/2019-08-04_v8_pv/test/1
root@rleblanc-ceph-test:/mnt/fuse# ls -lhi pcs01-fs/2019-08-04_v8_pv/test/10/10.tfrecord
1101964471700 -rw-r--r-- 1 rleblanc rleblanc 26M Aug 12 10:15 pcs01-fs/2019-08-04_v8_pv/test/10/10.tfrecord
root@rleblanc-ceph-test:/mnt/fuse# ls -lhi pcs01-fs/2019-08-04_v8_pv/test/10/0.lmdb
=====client hung here, evicted the client=====
ls: cannot access 'pcs01-fs/2019-08-04_v8_pv/test/10/0.lmdb': Cannot send after transport endpoint shutdown

=====dig into the directory with the bad metadata=====
$ rados -p sun_pcs01_fs_meta listomapkeys 1009221e5af.00000000
0.lmdb_head
0.tfrecord_head
1.tfrecord_head
10.tfrecord_head
11.tfrecord_head
13.tfrecord_head
16.tfrecord_head
18.tfrecord_head
2.tfrecord_head
20.tfrecord_head
3.tfrecord_head
4.tfrecord_head
5.tfrecord_head
8.tfrecord_head
9.tfrecord_head
=====this file hung, lets look at it=====
$ rados -p sun_pcs01_fs_meta getomapval 1009221e5af.00000000 0.lmdb_head
value (500 bytes) :
00000000  02 00 00 00 00 00 00 00  49 0f 06 c9 01 00 00 ae  |........I.......|
00000010  53 4b 91 00 01 00 00 00  00 00 00 fe 69 52 5d fb  |SK..........iR].|
00000020  00 ce 25 a4 81 00 00 e8  03 00 00 e8 03 00 00 01  |..%.............|
00000030  00 00 00 00 00 00 00 00  00 00 00 00 02 02 18 00  |................|
00000040  00 00 00 00 40 00 01 00  00 00 00 00 40 00 03 00  |....@.......@...|
00000050  00 00 00 00 00 00 00 00  00 00 00 d0 e2 1c 00 00  |................|
00000060  00 00 01 00 00 00 ff ff  ff ff ff ff ff ff 00 00  |................|
00000070  00 00 00 00 00 00 00 00  00 00 00 6a 52 5d f0 a3  |...........jR]..|
00000080  c3 07 fe 69 52 5d fb 00  ce 25 00 00 00 00 01 00  |...iR]...%......|
00000090  00 00 af 70 22 07 00 00  00 00 02 02 18 00 00 00  |...p"...........|
000000a0  00 00 00 00 00 00 00 00  00 00 00 3a 00 00 00 00  |...........:....|
000000b0  01 00 00 00 00 00 00 00  03 02 28 00 00 00 00 00  |..........(.....|
000000c0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
000000e0  00 00 00 00 00 00 03 02  38 00 00 00 00 00 00 00  |........8.......|
000000f0  00 00 00 00 00 d0 e2 1c  00 00 00 00 01 00 00 00  |................|
00000100  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000120  00 00 00 00 03 02 38 00  00 00 00 00 00 00 00 00  |......8.........|
00000130  00 00 00 d0 e2 1c 00 00  00 00 01 00 00 00 00 00  |................|
00000140  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000160  00 00 47 33 00 00 00 00  00 00 00 00 00 00 00 00  |..G3............|
00000170  00 00 01 00 00 00 00 00  00 00 53 32 00 00 00 00  |..........S2....|
00000180  00 00 00 00 00 00 00 00  00 00 00 00 00 00 ff ff  |................|
00000190  ff ff ff ff ff ff 00 00  00 00 01 01 10 00 00 00  |................|
000001a0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
000001c0  00 00 00 00 fe 69 52 5d  fb 00 ce 25 00 00 00 00  |.....iR]...%....|
000001d0  00 00 00 00 ff ff ff ff  00 00 00 00 00 00 00 00  |................|
000001e0  00 00 00 00 00 00 00 00  fe ff ff ff ff ff ff ff  |................|
000001f0  00 00 00 00                                       |....|
000001f4
$ rados -p sun_pcs01_fs_data get 100914b53ae.00000000 test.100914b53ae
$ ls -lh test*
-rw-r--r-- 1 rleblanc rleblanc    0 Aug 13 18:46 test.100914b53ae
$ rados -p sun_pcs01_fs_data getxattr 100914b53ae.00000000 parent | ceph-dencoder type inode_backtrace_t import - decode dump_json
{
    "ino": 1101949260718,
    "ancestors": [
        {
            "dirino": 1101963322799,
            "dname": "0.lmdb",
            "version": 13127
        },
        {
            "dirino": 1101963322778,
            "dname": "10",
            "version": 6026198
        },
        {
            "dirino": 1101951188216,
            "dname": "test",
            "version": 72100
        },
        {
            "dirino": 1101005585283,
            "dname": "2019-08-04_v8_pv",
            "version": 1546747
        }
    ],
    "pool": 3,
    "old_pools": []
}
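As a cross-check on the omap decoding: the dentry omap value stores the inode number least-significant byte first, so the eight bytes starting at offset 15 of the 0.lmdb_head dump, `ae 53 4b 91 00 01 00 00`, reverse to hex 0x100914b53ae. Converting to decimal:

```shell
# Bytes 15..22 of the 0.lmdb_head omap value, reversed (the inode is
# stored least-significant byte first): 0x100914b53ae. In decimal:
printf '%d\n' 0x100914b53ae
# -> 1101949260718
```

which matches both `"ino": 1101949260718` in the backtrace and the data object name 100914b53ae.00000000.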

#10 Updated by Zheng Yan about 1 month ago

Please provide logs of both ceph-fuse and the MDS while accessing the bad file.

#11 Updated by Robert LeBlanc about 1 month ago

It seems that after some time (hours, days, weeks) things get fixed, but it sure would be nice to know how to get it into a healthy state much faster. Luckily this directory still has the problem. Do you need debug level 20 on both? The MDS is currently doing 3,000 IOPs, and that will lead to a huge log.

#12 Updated by Robert LeBlanc about 1 month ago

I've sent an e-mail to Zheng with a link to download the logs due to sensitive info. The client requested the file 0.lmdb at 2019-08-16 10:02:12.644431 (the path is slightly different from what I posted due to removing sensitive data, but it is the same file).

#13 Updated by Robert LeBlanc 26 days ago

Did the logs provide the information that you needed, or do you need more/different information?

#14 Updated by Robert LeBlanc 24 days ago

The MDS crashed while I was working in the damaged directories at `2019-08-23 15:51:03.871241`. The standby took over just fine. I sent the last 11,000 lines of the log to your e-mail address. Thanks.
