Bug #40300

open

ceph-osd segfault: "rocksdb: Corruption: file is too short"

Added by Harald Staub almost 5 years ago. Updated about 3 years ago.

Status: New
Priority: Normal
Assignee: -
Target version: -
% Done: 0%
Source:
Tags:
Backport: luminous, mimic, nautilus, octopus
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Cluster is Nautilus 14.2.1, 350 OSDs with BlueStore.

Steps that led to the problem:

1. There is a bucket with about 60 million objects, without shards.

2. radosgw-admin bucket reshard --bucket $BIG_BUCKET --num-shards 1024

3. Resharding looked fine at first: it counted up to the number of objects,
but then it hung.

4. 3 OSDs crashed with a segfault: "rocksdb: Corruption: file is too short"

5. Trying to start the OSDs manually led to the same segfaults.

6. ceph-bluestore-tool repair ...

7. The repairs all aborted, with the same rocksdb error as above.

8. Now 1 PG is stale. It belongs to the radosgw bucket index pool, and it contained the index of this big bucket.

Is there any hope of getting these RocksDB instances up again?

Some more details that may be of interest.

ceph-bluestore-tool repair says:

2019-06-12 11:15:38.345 7f56269670c0 -1 rocksdb: Corruption: file is too short (6139497190 bytes) to be an sstabledb/079728.sst
2019-06-12 11:15:38.345 7f56269670c0 -1 bluestore(/var/lib/ceph/osd/ceph-49) _open_db erroring opening db:
error from fsck: (5) Input/output error

The repairs also showed several warnings like:

tcmalloc: large alloc 17162051584 bytes == 0x56167918a000 @ 0x7f5626521887 0x56126a287229 0x56126a2873a3 0x56126a5dc1ec 0x56126a584ce2 0x56126a586a05 0x56126a587dd0 0x56126a589344 0x56126a38c3cf 0x56126a2eae94 0x56126a30654e 0x56126a337ae1 0x56126a1a73a1 0x7f561b228b97 0x56126a28077a

The repair processes showed up with about 45 GB of RAM in use. Fortunately, none of them ran out of memory.
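
For scale, the byte counts in the two messages quoted above can be converted to GiB. This is only a minimal sketch: it assumes log lines of exactly the shape shown in this report, and the helper names byte_counts and to_gib are made up for illustration, not part of any Ceph tool.

```python
import re

# Log lines copied from this report (tcmalloc stack addresses trimmed).
LINES = [
    "2019-06-12 11:15:38.345 7f56269670c0 -1 rocksdb: Corruption: "
    "file is too short (6139497190 bytes) to be an sstabledb/079728.sst",
    "tcmalloc: large alloc 17162051584 bytes == 0x56167918a000 @ 0x7f5626521887",
]

# Matches the first "<N> bytes" figure in a line.
BYTES_RE = re.compile(r"(\d+) bytes")


def byte_counts(lines):
    """Return the first '<N> bytes' figure of each matching line as an int."""
    counts = []
    for line in lines:
        match = BYTES_RE.search(line)
        if match:
            counts.append(int(match.group(1)))
    return counts


def to_gib(n_bytes):
    """Convert a byte count to GiB (2**30 bytes)."""
    return n_bytes / 2**30


for n in byte_counts(LINES):
    print(f"{n} bytes = {to_gib(n):.2f} GiB")
# prints:
# 6139497190 bytes = 5.72 GiB
# 17162051584 bytes = 15.98 GiB
```

So the .sst file reported as too short is about 5.7 GiB, and the single large allocation flagged by tcmalloc is about 16 GiB.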

Harry

Related issues 1 (0 open, 1 closed)

Related to bluestore - Bug #49170: BlueFS might end-up with huge WAL files when upgrading OMAPs (Resolved)
