Bug #12428
garbage data in osd data dir crashes ceph-objectstore-tool
Status: Closed
Description
Hi,
Running hammer 0.94.2, we are deleting pool 36, and OSDs 30, 171, and 69 all crash when trying to delete pg 36.10d. They all crash with:
ENOTEMPTY suggests garbage data in osd data dir
There is indeed some "garbage" in there:
# find /var/lib/ceph/osd/ceph-171/current/36.10d_head/
/var/lib/ceph/osd/ceph-171/current/36.10d_head/
/var/lib/ceph/osd/ceph-171/current/36.10d_head/DIR_D
/var/lib/ceph/osd/ceph-171/current/36.10d_head/DIR_D/DIR_0
/var/lib/ceph/osd/ceph-171/current/36.10d_head/DIR_D/DIR_0/DIR_1
/var/lib/ceph/osd/ceph-171/current/36.10d_head/DIR_D/DIR_0/DIR_1/__head_BD49D10D__24
/var/lib/ceph/osd/ceph-171/current/36.10d_head/DIR_D/DIR_0/DIR_9
Greg suggested we use ceph-objectstore-tool to cleanly remove that PG. But ceph-objectstore-tool actually fails to list-pgs, namely:
# ceph-objectstore-tool --debug --op list-pgs --data-path /var/lib/ceph/osd/ceph-171/ --journal-path /var/lib/ceph/osd/ceph-171/journal
2015-07-22 10:50:11.374925 7f9662eab800  0 filestore(/var/lib/ceph/osd/ceph-171/) backend xfs (magic 0x58465342)
2015-07-22 10:50:11.377785 7f9662eab800  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-171/) detect_features: FIEMAP ioctl is supported and appears to work
2015-07-22 10:50:11.377801 7f9662eab800  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-171/) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
2015-07-22 10:50:11.468428 7f9662eab800  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-171/) detect_features: syscall(SYS_syncfs, fd) fully supported
2015-07-22 10:50:11.468588 7f9662eab800  0 xfsfilestorebackend(/var/lib/ceph/osd/ceph-171/) detect_features: disabling extsize, kernel 2.6.32-431.el6.x86_64 is older than 3.5 and has buggy extsize ioctl
2015-07-22 10:50:11.545517 7f9662eab800  0 filestore(/var/lib/ceph/osd/ceph-171/) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled
HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
2015-07-22 10:50:11.551059 7f9662eab800  1 journal _open /var/lib/ceph/osd/ceph-171/journal fd 12: 5367660544 bytes, block size 4096 bytes, directio = 1, aio = 1
2015-07-22 10:50:11.807632 7f9662eab800  0 filestore(/var/lib/ceph/osd/ceph-171/) error (39) Directory not empty not handled on operation 0x3b8a716 (2253920.0.1, or op 1, counting from 0)
2015-07-22 10:50:11.807647 7f9662eab800  0 filestore(/var/lib/ceph/osd/ceph-171/) ENOTEMPTY suggests garbage data in osd data dir
2015-07-22 10:50:11.807650 7f9662eab800  0 filestore(/var/lib/ceph/osd/ceph-171/) transaction dump:
{
    "ops": [
        {
            "op_num": 0,
            "op_name": "remove",
            "collection": "36.10d_head",
            "oid": "10d\/\/head\/\/36"
        },
        {
            "op_num": 1,
            "op_name": "rmcoll",
            "collection": "36.10d_head"
        }
    ]
}
os/FileStore.cc: In function 'unsigned int FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int, ThreadPool::TPHandle*)' thread 7f9662eab800 time 2015-07-22 10:50:11.807681
os/FileStore.cc: 2757: FAILED assert(0 == "unexpected error")
I didn't try the remove op yet, but I suspect it will fail the same way.
If we manually remove the garbage with:
cd /var/lib/ceph/osd/ceph-171/current/36.10d_head/
rm -rf *
then the OSD starts correctly.
Should the OSD and ceph-objectstore-tool handle garbage data like this more gracefully? Or is the manual deletion procedure good enough?
Thanks, Dan
Updated by David Zafman almost 9 years ago
The ceph-objectstore-tool does an objectstore mount just as the OSD does. If the OSD crashes during this process, so will ceph-objectstore-tool, unless the --skip-journal-replay option is given. That's what the option is for. I'm going to try to reproduce this scenario and verify that it will work with --skip-journal-replay.
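Assuming the same OSD and paths as in the report above, the suggested invocation would look like this (a sketch only; not verified against this cluster):

```shell
# Mount the objectstore without replaying the journal, so the rmcoll op
# that hits ENOTEMPTY is not re-applied during mount.
ceph-objectstore-tool --op list-pgs \
    --data-path /var/lib/ceph/osd/ceph-171/ \
    --journal-path /var/lib/ceph/osd/ceph-171/journal \
    --skip-journal-replay
```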
Updated by Dan van der Ster almost 9 years ago
Ahh, sorry for the obvious mistake.
So now list-pgs works.
Would it be safe to now --op delete that pg with the skip-journal-replay option? That lingering delete op will still be there when I try to restart the actual osd.
Updated by David Zafman over 8 years ago
You could try manually moving the pg directory somewhere for safekeeping, and then restart the OSD.
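A sketch of that manual procedure, assuming osd.171 and pg 36.10d as above (the backup destination and the service commands are assumptions; adjust for your init system and available space):

```shell
# Stop the OSD before touching its data directory (sysvinit assumed here).
service ceph stop osd.171

# Move the leftover PG directory aside rather than deleting it outright,
# so it can be restored if anything still references it.
mkdir -p /root/pg-36.10d-backup
mv /var/lib/ceph/osd/ceph-171/current/36.10d_head /root/pg-36.10d-backup/

# Restart the OSD; with the stale collection gone, mount should succeed.
service ceph start osd.171
```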
Updated by Nathan Cutler over 8 years ago
- Has duplicate Bug #13815: OSDs failed after upgrade from 0.80.10 to 0.94.5 added
Updated by Sage Weil about 7 years ago
- Status changed from New to Can't reproduce