Support #40906
Full CephFS causes hang when accessing inode.
Description
Our CephFS filled up and we are having trouble operating on a directory that has a file with a bad inode state. We would like to just have CephFS delete this bad inode; we can remove the RADOS objects manually if needed.
Updated by Robert LeBlanc over 4 years ago
The complete logs were 81GB, so I just filtered on the client session. If you need some more complete logs, let me know the time span you want. The cluster was doing about 1.5K ops/s.
https://owncloud.leblancnet.us/owncloud/index.php/s/4KCp4UMclt4qfwZ
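For reference, a hypothetical sketch of that kind of filtering (the log path and client id here are placeholders, not the actual ones):

grep 'client.412345' /var/log/ceph/ceph-mds.mds01.log | gzip > mds-client-session.log.gz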
Updated by Zheng Yan over 4 years ago
The log does not contain information about the bad inode.
To delete the inode using rados:
1. Find the inode number of the parent directory.
2. rados -p cephfs_metadata listomapkeys <inode number of parent directory in hex>.00000000
3. rados -p cephfs_metadata rmomapkey <inode number of parent directory in hex>.00000000 <file name>_head
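A concrete sketch of those steps, assuming the parent directory is /mnt/cephfs/baddir, the bad file is badfile, and the metadata pool is cephfs_metadata (all placeholders); the printf converts the decimal inode from ls into the hex object name:

ls -id /mnt/cephfs/baddir                 # prints e.g. "1099511627776 /mnt/cephfs/baddir"
printf '%x.00000000\n' 1099511627776      # -> 10000000000.00000000, the directory's metadata object
rados -p cephfs_metadata listomapkeys 10000000000.00000000
rados -p cephfs_metadata rmomapkey 10000000000.00000000 badfile_head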
Updated by Patrick Donnelly over 4 years ago
- Tracker changed from Bug to Support
- Status changed from New to Rejected
This discussion should move to the ceph-users list: https://lists.ceph.io/postorius/lists/ceph-users.ceph.io/
Updated by Patrick Donnelly over 4 years ago
- Status changed from Rejected to New
Reopening this as I see now that Zheng asked you to create the ticket.
Updated by Robert LeBlanc over 4 years ago
Is deleting the RADOS object for the inode going to cause more problems with the MDS because they get out of sync?
Updated by Zheng Yan over 4 years ago
After deleting the omap key, use the scrub_path asok command to repair the parent directory.
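A minimal sketch of that repair step, assuming a Nautilus-era MDS admin socket (the mds id and path are placeholders, and the exact scrub options can vary by release):

ceph daemon mds.a scrub_path /parent_dir recursive repair   # run on the host of the active MDS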
Updated by Robert LeBlanc over 4 years ago
Okay, I'll give it a shot next week on some of my smaller directories.
Just to be sure I understand the process:
1. Delete all RADOS objects for the bad inode: `<inode_in_hex>.00000000` through `<inode_in_hex>.xxxxxxxx` (file size divided by the 4MB object size, or all parts listed in `rados ls`; see the sketch after this list).
2. Find the inode of the parent directory: `ls -id /parent_dir`.
3. Get the omap keys for the parent directory in the CephFS metadata pool: `rados -p cephfs_metadata listomapkeys <inode number of parent directory in hex>.00000000`
4. Remove the omap key for the bad file, basically 'erasing' it from the file system metadata: `rados -p cephfs_metadata rmomapkey <inode number of parent directory in hex>.00000000 <file name>_head`
5. Have Ceph rescan the directory and rebuild an index to make sure it is in sync?? `ceph daemon mds.0 scrub_path /parent_dir` (Do I need to specify a pool or file system, or is mds.0 good enough? I'm not finding a lot of information about this command.)
If I understand what is going on, it helps me feel more comfortable about it. Thanks!
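A hedged sketch of step 1, assuming the data pool is cephfs_data and the bad inode is 0x10000000000 (both placeholders); listing a large pool this way can take a long time:

rados -p cephfs_data ls | grep '^10000000000\.' > /tmp/bad_inode_objects   # every object for the inode starts with its hex value
while read obj; do rados -p cephfs_data rm "$obj"; done < /tmp/bad_inode_objects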
Updated by Robert LeBlanc over 4 years ago
Please confirm that I understand the process so that I can give it a try.
Thanks!
Updated by Robert LeBlanc over 4 years ago
Okay, we had another data corruption incident, so I took some time to look deeper into the problem. I did some scrub_path runs on some subdirectories (we have a very large FS, so doing it all at once is rough). I scrubbed the top directory here and it found and fixed some issues, but after walking all the files and subdirectories, this one still had problems. I tried scrub_path on the directory above it and it didn't find anything. I targeted one file in this example; when I pulled the object from RADOS it was zero size, but the xattrs looked intact. I'm guessing there is some discrepancy in the omap causing issues. I could not find any documentation on the format, but I figured out that the inode is in bytes 15-26 (not sure of the exact end) in big-endian order, so I was able to work out the inode of the 'bad' file and check it. I hope this helps.
root@rleblanc-ceph-test:/mnt/fuse/pcs01-fs# ls -lh 2019-08-04_v8_pv/test/10/
=====client hung here, I evicted the client=====
ls: cannot access '2019-08-04_v8_pv/test/10/0.lmdb': Cannot send after transport endpoint shutdown
ls: cannot access '2019-08-04_v8_pv/test/10/8.tfrecord': Cannot send after transport endpoint shutdown
ls: cannot access '2019-08-04_v8_pv/test/10/3.tfrecord': Cannot send after transport endpoint shutdown
ls: cannot access '2019-08-04_v8_pv/test/10/4.tfrecord': Cannot send after transport endpoint shutdown
ls: cannot access '2019-08-04_v8_pv/test/10/1.tfrecord': Cannot send after transport endpoint shutdown
ls: cannot access '2019-08-04_v8_pv/test/10/13.tfrecord': Cannot send after transport endpoint shutdown
ls: cannot access '2019-08-04_v8_pv/test/10/0.tfrecord': Cannot send after transport endpoint shutdown
ls: cannot access '2019-08-04_v8_pv/test/10/2.tfrecord': Cannot send after transport endpoint shutdown
ls: cannot access '2019-08-04_v8_pv/test/10/11.tfrecord': Cannot send after transport endpoint shutdown
ls: cannot access '2019-08-04_v8_pv/test/10/16.tfrecord': Cannot send after transport endpoint shutdown
ls: cannot access '2019-08-04_v8_pv/test/10/9.tfrecord': Cannot send after transport endpoint shutdown
ls: reading directory '2019-08-04_v8_pv/test/10/': Cannot send after transport endpoint shutdown
total 4.5G
-????????? ? ? ? ? ? 0.lmdb
-????????? ? ? ? ? ? 0.tfrecord
-rw-r--r-- 1 rleblanc rleblanc 26M Aug 12 10:15 10.tfrecord
-????????? ? ? ? ? ? 11.tfrecord
-????????? ? ? ? ? ? 13.tfrecord
-????????? ? ? ? ? ? 16.tfrecord
-rw-r--r-- 1 rleblanc rleblanc 756M Aug 12 22:45 18.tfrecord
-????????? ? ? ? ? ? 1.tfrecord
-rw-r--r-- 1 rleblanc rleblanc 2.2G Aug 12 10:18 20.tfrecord
-????????? ? ? ? ? ? 2.tfrecord
-????????? ? ? ? ? ? 3.tfrecord
-????????? ? ? ? ? ? 4.tfrecord
-rw-r--r-- 1 rleblanc rleblanc 1.6G Aug 12 10:17 5.tfrecord
-????????? ? ? ? ? ? 8.tfrecord
-????????? ? ? ? ? ? 9.tfrecord
root@rleblanc-ceph-test:/mnt/fuse/pcs01-fs# cd ../..
root@rleblanc-ceph-test:/mnt/fuse# ls -lh pcs01-fs/2019-08-04_v8_pv/test/10/10.tfrecord
-rw-r--r-- 1 rleblanc rleblanc 26M Aug 12 10:15 pcs01-fs/2019-08-04_v8_pv/test/10/10.tfrecord
root@rleblanc-ceph-test:/mnt/fuse# ls -lh pcs01-fs/2019-08-04_v8_pv/test/10/18.tfrecord
-rw-r--r-- 1 rleblanc rleblanc 756M Aug 12 22:45 pcs01-fs/2019-08-04_v8_pv/test/10/18.tfrecord
root@rleblanc-ceph-test:/mnt/fuse# ls -lh pcs01-fs/2019-08-04_v8_pv/test/10/20.tfrecord
-rw-r--r-- 1 rleblanc rleblanc 2.2G Aug 12 10:18 pcs01-fs/2019-08-04_v8_pv/test/10/20.tfrecord
root@rleblanc-ceph-test:/mnt/fuse# ls -ldhn pcs01-fs/2019-08-04_v8_pv/test/10
drwxrwxrwx 2 1160776091 1160774145 72G Aug 13 00:42 pcs01-fs/2019-08-04_v8_pv/test/10
root@rleblanc-ceph-test:/mnt/fuse# ls -ldhi pcs01-fs/2019-08-04_v8_pv/test/10
1101963322799 drwxrwxrwx 2 1160776091 1160774145 72G Aug 13 00:42 pcs01-fs/2019-08-04_v8_pv/test/10
root@rleblanc-ceph-test:/mnt/fuse# ls -ldhi pcs01-fs/2019-08-04_v8_pv/test/1
1101963322781 drwxrwxrwx 2 1160776091 1160774145 46G Aug 13 02:22 pcs01-fs/2019-08-04_v8_pv/test/1
root@rleblanc-ceph-test:/mnt/fuse# ls -lhi pcs01-fs/2019-08-04_v8_pv/test/10/10.tfrecord
1101964471700 -rw-r--r-- 1 rleblanc rleblanc 26M Aug 12 10:15 pcs01-fs/2019-08-04_v8_pv/test/10/10.tfrecord
root@rleblanc-ceph-test:/mnt/fuse# ls -lhi pcs01-fs/2019-08-04_v8_pv/test/10/0.lmdb
=====client hung here, evicted the client=====
ls: cannot access 'pcs01-fs/2019-08-04_v8_pv/test/10/0.lmdb': Cannot send after transport endpoint shutdown

=====dig into the directory with the bad metadata=====
$ rados -p sun_pcs01_fs_meta listomapkeys 1009221e5af.00000000
0.lmdb_head
0.tfrecord_head
1.tfrecord_head
10.tfrecord_head
11.tfrecord_head
13.tfrecord_head
16.tfrecord_head
18.tfrecord_head
2.tfrecord_head
20.tfrecord_head
3.tfrecord_head
4.tfrecord_head
5.tfrecord_head
8.tfrecord_head
9.tfrecord_head

=====this file hung, lets look at it=====
$ rados -p sun_pcs01_fs_meta getomapval 1009221e5af.00000000 0.lmdb_head
value (500 bytes) :
00000000 02 00 00 00 00 00 00 00 49 0f 06 c9 01 00 00 ae |........I.......|
00000010 53 4b 91 00 01 00 00 00 00 00 00 fe 69 52 5d fb |SK..........iR].|
00000020 00 ce 25 a4 81 00 00 e8 03 00 00 e8 03 00 00 01 |..%.............|
00000030 00 00 00 00 00 00 00 00 00 00 00 00 02 02 18 00 |................|
00000040 00 00 00 00 40 00 01 00 00 00 00 00 40 00 03 00 |....@.......@...|
00000050 00 00 00 00 00 00 00 00 00 00 00 d0 e2 1c 00 00 |................|
00000060 00 00 01 00 00 00 ff ff ff ff ff ff ff ff 00 00 |................|
00000070 00 00 00 00 00 00 00 00 00 00 00 6a 52 5d f0 a3 |...........jR]..|
00000080 c3 07 fe 69 52 5d fb 00 ce 25 00 00 00 00 01 00 |...iR]...%......|
00000090 00 00 af 70 22 07 00 00 00 00 02 02 18 00 00 00 |...p"...........|
000000a0 00 00 00 00 00 00 00 00 00 00 00 3a 00 00 00 00 |...........:....|
000000b0 01 00 00 00 00 00 00 00 03 02 28 00 00 00 00 00 |..........(.....|
000000c0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
000000e0 00 00 00 00 00 00 03 02 38 00 00 00 00 00 00 00 |........8.......|
000000f0 00 00 00 00 00 d0 e2 1c 00 00 00 00 01 00 00 00 |................|
00000100 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00000120 00 00 00 00 03 02 38 00 00 00 00 00 00 00 00 00 |......8.........|
00000130 00 00 00 d0 e2 1c 00 00 00 00 01 00 00 00 00 00 |................|
00000140 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00000160 00 00 47 33 00 00 00 00 00 00 00 00 00 00 00 00 |..G3............|
00000170 00 00 01 00 00 00 00 00 00 00 53 32 00 00 00 00 |..........S2....|
00000180 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ff ff |................|
00000190 ff ff ff ff ff ff 00 00 00 00 01 01 10 00 00 00 |................|
000001a0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
000001c0 00 00 00 00 fe 69 52 5d fb 00 ce 25 00 00 00 00 |.....iR]...%....|
000001d0 00 00 00 00 ff ff ff ff 00 00 00 00 00 00 00 00 |................|
000001e0 00 00 00 00 00 00 00 00 fe ff ff ff ff ff ff ff |................|
000001f0 00 00 00 00 |....|
000001f4

$ rados -p sun_pcs01_fs_data get 100914b53ae.00000000 test.100914b53ae
$ ls -lh test*
-rw-r--r-- 1 rleblanc rleblanc 0 Aug 13 18:46 test.100914b53ae
$ rados -p sun_pcs01_fs_data getxattr 100914b53ae.00000000 parent | ceph-dencoder type inode_backtrace_t import - decode dump_json
{
    "ino": 1101949260718,
    "ancestors": [
        {
            "dirino": 1101963322799,
            "dname": "0.lmdb",
            "version": 13127
        },
        {
            "dirino": 1101963322778,
            "dname": "10",
            "version": 6026198
        },
        {
            "dirino": 1101951188216,
            "dname": "test",
            "version": 72100
        },
        {
            "dirino": 1101005585283,
            "dname": "2019-08-04_v8_pv",
            "version": 1546747
        }
    ],
    "pool": 3,
    "old_pools": []
}
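For reference, a rough sketch of pulling the inode bytes mentioned above out of the dentry's omap value (the offsets are my guess from this dump, not a documented format):

rados -p sun_pcs01_fs_meta getomapval 1009221e5af.00000000 0.lmdb_head /tmp/0.lmdb_head.bin
xxd -s 15 -l 8 /tmp/0.lmdb_head.bin   # the 8 bytes that appear to hold the inode number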
Updated by Zheng Yan over 4 years ago
Please provide logs from both ceph-fuse and the MDS captured while accessing the bad file.
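A sketch of how that capture is typically set up, using standard Ceph debug options (the mds id "a" is a placeholder):

ceph daemon mds.a config set debug_mds 20   # on the active MDS, before reproducing the hang
ceph daemon mds.a config set debug_ms 1
# on the client, set "debug client = 20" under [client] in ceph.conf before mounting with ceph-fuse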
Updated by Robert LeBlanc over 4 years ago
It seems that after some time (hours, days, weeks) things get fixed, but it sure would be nice to know how to get it into a healthy state much faster. Luckily this directory still has the problem. Do you need debug level 20 on both? The MDS is currently doing 3,000 ops/s, so that will lead to a huge log.
Updated by Robert LeBlanc over 4 years ago
I've sent an e-mail to Zheng with a link to download the logs, since they contain sensitive info. The client requested the file 0.lmdb at 2019-08-16 10:02:12.644431 (the path is slightly different from what I posted due to removing sensitive data, but it is the same file).
Updated by Robert LeBlanc over 4 years ago
Did the logs provide the information that you needed, or do you need more/different information?
Updated by Robert LeBlanc over 4 years ago
The MDS crashed while I was working in the damaged directories at `2019-08-23 15:51:03.871241`. The standby took over just fine. I sent the last 11,000 lines of the log to your e-mail address. Thanks.