Bug #12428 (closed): garbage data in osd data dir crashes ceph-objectstore-tool

Added by Dan van der Ster almost 9 years ago. Updated about 7 years ago.

Status: Can't reproduce
Priority: Normal
Assignee: -
Category: OSD
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi,

Running hammer 0.94.2, we are deleting pool 36, and OSDs 30, 171, and 69 all crash when trying to delete pg 36.10d. They all crash with:

ENOTEMPTY suggests garbage data in osd data dir

There is indeed some "garbage" in there:

# find /var/lib/ceph/osd/ceph-171/current/36.10d_head/
/var/lib/ceph/osd/ceph-171/current/36.10d_head/
/var/lib/ceph/osd/ceph-171/current/36.10d_head/DIR_D
/var/lib/ceph/osd/ceph-171/current/36.10d_head/DIR_D/DIR_0
/var/lib/ceph/osd/ceph-171/current/36.10d_head/DIR_D/DIR_0/DIR_1
/var/lib/ceph/osd/ceph-171/current/36.10d_head/DIR_D/DIR_0/DIR_1/__head_BD49D10D__24
/var/lib/ceph/osd/ceph-171/current/36.10d_head/DIR_D/DIR_0/DIR_9
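
As a quick cross-check (a sketch only; the glob assumes the standard /var/lib/ceph/osd/ceph-*/current layout shown above), the same kind of leftover files can be looked for on each affected host with:

# find /var/lib/ceph/osd/ceph-*/current/36.*_head -type f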

Greg suggested we use ceph-objectstore-tool to cleanly remove that PG, but ceph-objectstore-tool itself fails on list-pgs:

# ceph-objectstore-tool --debug --op list-pgs --data-path /var/lib/ceph/osd/ceph-171/ --journal-path /var/lib/ceph/osd/ceph-171/journal
2015-07-22 10:50:11.374925 7f9662eab800  0 filestore(/var/lib/ceph/osd/ceph-171/) backend xfs (magic 0x58465342)
2015-07-22 10:50:11.377785 7f9662eab800  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-171/) detect_features: FIEMAP ioctl is supported and appears to work
2015-07-22 10:50:11.377801 7f9662eab800  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-171/) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
2015-07-22 10:50:11.468428 7f9662eab800  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-171/) detect_features: syscall(SYS_syncfs, fd) fully supported
2015-07-22 10:50:11.468588 7f9662eab800  0 xfsfilestorebackend(/var/lib/ceph/osd/ceph-171/) detect_features: disabling extsize, kernel 2.6.32-431.el6.x86_64 is older than 3.5 and has buggy extsize ioctl
2015-07-22 10:50:11.545517 7f9662eab800  0 filestore(/var/lib/ceph/osd/ceph-171/) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled
 HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
2015-07-22 10:50:11.551059 7f9662eab800  1 journal _open /var/lib/ceph/osd/ceph-171/journal fd 12: 5367660544 bytes, block size 4096 bytes, directio = 1, aio = 1
2015-07-22 10:50:11.807632 7f9662eab800  0 filestore(/var/lib/ceph/osd/ceph-171/)  error (39) Directory not empty not handled on operation 0x3b8a716 (2253920.0.1, or op 1, counting from 0)
2015-07-22 10:50:11.807647 7f9662eab800  0 filestore(/var/lib/ceph/osd/ceph-171/) ENOTEMPTY suggests garbage data in osd data dir
2015-07-22 10:50:11.807650 7f9662eab800  0 filestore(/var/lib/ceph/osd/ceph-171/)  transaction dump:
{
    "ops": [
        {
            "op_num": 0,
            "op_name": "remove",
            "collection": "36.10d_head",
            "oid": "10d\/\/head\/\/36" 
        },
        {
            "op_num": 1,
            "op_name": "rmcoll",
            "collection": "36.10d_head" 
        }
    ]
}

os/FileStore.cc: In function 'unsigned int FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int, ThreadPool::TPHandle*)' thread 7f9662eab800 time 2015-07-22 10:50:11.807681
os/FileStore.cc: 2757: FAILED assert(0 == "unexpected error")
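
For context on the failure: op 1 (rmcoll) ultimately has to remove the 36.10d_head collection directory on disk, and removing a directory that still contains entries returns ENOTEMPTY, which FileStore treats as an unexpected error, hence the assert. A minimal illustration of that errno, using throwaway paths unrelated to Ceph:

# mkdir -p /tmp/enotempty-demo/DIR_D
# rmdir /tmp/enotempty-demo
rmdir: failed to remove '/tmp/enotempty-demo': Directory not empty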

I didn't try the remove op yet, but I suspect it will fail the same way.

If we manually remove the garbage with:

cd /var/lib/ceph/osd/ceph-171/current/36.10d_head/
rm -rf *

then the OSD starts correctly.

Should the OSD and ceph-objectstore-tool handle garbage like this more gracefully? Or is the manual deletion procedure good enough?

Thanks, Dan


Related issues (1): 0 open, 1 closed

Has duplicate: Ceph - Bug #13815: OSDs failed after upgrade from 0.80.10 to 0.94.5 (Duplicate, 2015-11-17)

#1

Updated by David Zafman almost 9 years ago

The ceph-objectstore-tool does an objectstore mount just as the OSD does. If the OSD is crashing during this process, so will ceph-objectstore-tool, unless the --skip-journal-replay option is given. That's what the option is for. I'm going to try to reproduce this scenario and verify that it works with --skip-journal-replay.
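
For example, against the paths used in the description, the list-pgs invocation would become something like the following (a sketch, not verified on this cluster):

# ceph-objectstore-tool --op list-pgs --data-path /var/lib/ceph/osd/ceph-171/ --journal-path /var/lib/ceph/osd/ceph-171/journal --skip-journal-replay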

#2

Updated by Dan van der Ster almost 9 years ago

Ahh, sorry for the obvious mistake.
So now list-pgs works.

Would it be safe to now run --op delete on that pg with the --skip-journal-replay option? The lingering delete op will still be there when I try to restart the actual OSD.
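
For reference, the PG removal op is spelled "remove", so the invocation in question would look roughly like this (a sketch only; whether combining it with --skip-journal-replay is safe is exactly the open question):

# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-171/ --journal-path /var/lib/ceph/osd/ceph-171/journal --skip-journal-replay --pgid 36.10d --op remove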

#3

Updated by David Zafman over 8 years ago

You could try manually moving the pg directory somewhere for safekeeping, and then restart the OSD.
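
A sketch of that, with the OSD stopped and using the same OSD 171 paths as in the description (the backup destination here is arbitrary; anywhere with enough free space outside the OSD's current/ tree works):

# mv /var/lib/ceph/osd/ceph-171/current/36.10d_head /root/36.10d_head.bak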

#4

Updated by Nathan Cutler over 8 years ago

  • Has duplicate Bug #13815: OSDs failed after upgrade from 0.80.10 to 0.94.5 added
#5

Updated by Sage Weil about 7 years ago

  • Status changed from New to Can't reproduce