Bug #48827 (closed): Ceph Bluestore OSDs fail to start on WAL corruption

Added by William Law over 3 years ago. Updated over 1 year ago.

Status: Duplicate
Priority: Normal
Assignee: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi -

I also posted a note to the Ceph users list. We've run into this bug and it unfortunately hit 5 OSDs at the same time, so we currently have some data unavailable and need help recovering.

Here's the initial crash from one OSD:
@2021-01-07 23:19:54.352 7f6f4bbd5c00 4 rocksdb: [db/version_set.cc:3757] Recovered from manifest file:db/MANIFEST-066289 succeeded,manifest_file_number is 66289, next_file_number is 72848, last_sequence is 2591407985, log_number is 72845,prev_log_number is 0,max_column_family is 0,min_log_number_to_keep is 0

2021-01-07 23:19:54.352 7f6f4bbd5c00 4 rocksdb: [db/version_set.cc:3766] Column family [default] (ID 0), log number is 72845

2021-01-07 23:19:54.352 7f6f4bbd5c00 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1610061594356573, "job": 1, "event": "recovery_started", "log_files": [72843, 72845]}
2021-01-07 23:19:54.352 7f6f4bbd5c00 4 rocksdb: [db/db_impl_open.cc:583] Recovering log #72843 mode 0
2021-01-07 23:19:54.596 7f6f4bbd5c00 4 rocksdb: [db/db_impl_open.cc:583] Recovering log #72845 mode 0
2021-01-07 23:19:54.676 7f6f4bbd5c00 3 rocksdb: [db/db_impl_open.cc:518] db.wal/072845.log: dropping 3163 bytes; Corruption: error in middle of record
2021-01-07 23:19:54.676 7f6f4bbd5c00 3 rocksdb: [db/db_impl_open.cc:518] db.wal/072845.log: dropping 32757 bytes; Corruption: missing start of fragmented record(1)
2021-01-07 23:19:54.676 7f6f4bbd5c00 3 rocksdb: [db/db_impl_open.cc:518] db.wal/072845.log: dropping 32757 bytes; Corruption: missing start of fragmented record(1)
2021-01-07 23:19:54.676 7f6f4bbd5c00 3 rocksdb: [db/db_impl_open.cc:518] db.wal/072845.log: dropping 32757 bytes; Corruption: missing start of fragmented record(1)
2021-01-07 23:19:54.676 7f6f4bbd5c00 3 rocksdb: [db/db_impl_open.cc:518] db.wal/072845.log: dropping 32757 bytes; Corruption: missing start of fragmented record(1)
2021-01-07 23:19:54.676 7f6f4bbd5c00 3 rocksdb: [db/db_impl_open.cc:518] db.wal/072845.log: dropping 32757 bytes; Corruption: missing start of fragmented record(1)
2021-01-07 23:19:54.676 7f6f4bbd5c00 3 rocksdb: [db/db_impl_open.cc:518] db.wal/072845.log: dropping 32757 bytes; Corruption: missing start of fragmented record(1)
2021-01-07 23:19:54.676 7f6f4bbd5c00 3 rocksdb: [db/db_impl_open.cc:518] db.wal/072845.log: dropping 32757 bytes; Corruption: missing start of fragmented record(1)
2021-01-07 23:19:54.676 7f6f4bbd5c00 3 rocksdb: [db/db_impl_open.cc:518] db.wal/072845.log: dropping 20402 bytes; Corruption: missing start of fragmented record(2)
2021-01-07 23:19:54.680 7f6f4bbd5c00 4 rocksdb: [db/db_impl.cc:390] Shutdown: canceling all background work
2021-01-07 23:19:54.684 7f6f4bbd5c00 4 rocksdb: [db/db_impl.cc:563] Shutdown complete
2021-01-07 23:19:54.684 7f6f4bbd5c00 -1 rocksdb: Corruption: error in middle of record
2021-01-07 23:19:54.684 7f6f4bbd5c00 -1 bluestore(/var/lib/ceph/osd/ceph-337) _open_db erroring opening db:
2021-01-07 23:19:54.688 7f6f4bbd5c00 1 bluefs umount
2021-01-07 23:19:54.688 7f6f4bbd5c00 1 bdev(0x55aeff35b500 /var/lib/ceph/osd/ceph-337/block.db) close
2021-01-07 23:19:54.876 7f6f4bbd5c00 1 bdev(0x55aeff35b880 /var/lib/ceph/osd/ceph-337/block) close
2021-01-07 23:19:55.304 7f6f4bbd5c00 1 freelist shutdown
2021-01-07 23:19:55.308 7f6f4bbd5c00 1 bluestore(/var/lib/ceph/osd/ceph-337) _upgrade_super from 2, latest 2
2021-01-07 23:19:55.308 7f6f4bbd5c00 1 bluestore(/var/lib/ceph/osd/ceph-337) _upgrade_super done
2021-01-07 23:19:55.344 7f6f4bbd5c00 -1 *** Caught signal (Segmentation fault) **
in thread 7f6f4bbd5c00 thread_name:ceph-osd@

The relevant portion of the log is attached as ceph-osd.337.log.3.shortend.gz; I have the full log, but it came in at 70 MB, so this is only a segment. It looks like the OSD hit a suicide timeout and then wouldn't start afterwards.

On subsequent start attempts it shows this output (attached as ceph-osd.337.log.gz):
@2021-01-11 04:40:42.512 7f22bcf8ac00 1 bluefs mount
2021-01-11 04:40:42.512 7f22bcf8ac00 1 bluefs _init_alloc id 1 alloc_size 0x100000 size 0x1440000000
2021-01-11 04:40:42.512 7f22bcf8ac00 1 bluefs _init_alloc id 2 alloc_size 0x10000 size 0x22ee7400000
2021-01-11 04:40:42.528 7f22bcf8ac00 1 /build/ceph-14.2.16/src/os/bluestore/BlueFS.cc: In function 'int BlueFS::_replay(bool, bool)' thread 7f22bcf8ac00 time 2021-01-11 04:40:42.527150
/build/ceph-14.2.16/src/os/bluestore/BlueFS.cc: 1029: FAILED ceph_assert(file->fnode.ino)

ceph version 14.2.16 (762032d6f509d5e7ee7dc008d80fe9c87086603c) nautilus (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x152) [0x55583ad1dfba]
2: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x55583ad1e195]
3: (BlueFS::_replay(bool, bool)+0x4fa5) [0x55583b33b8f5]
4: (BlueFS::mount()+0x229) [0x55583b33bd79]
5: (BlueStore::_open_bluefs(bool)+0x78) [0x55583b223958]
6: (BlueStore::_open_db(bool, bool, bool)+0x8a3) [0x55583b224e63]
7: (BlueStore::_open_db_and_around(bool)+0x44) [0x55583b2361a4]
8: (BlueStore::_mount(bool, bool)+0x584) [0x55583b28cb64]
9: (OSD::init()+0x3f3) [0x55583adcddb3]
10: (main()+0x5214) [0x55583ad26cf4]
11: (__libc_start_main()+0xe7) [0x7f22b9807bf7]
12: (_start()+0x2a) [0x55583ad5872a]

2021-01-11 04:40:42.532 7f22bcf8ac00 -1 *** Caught signal (Aborted) **
in thread 7f22bcf8ac00 thread_name:ceph-osd

ceph version 14.2.16 (762032d6f509d5e7ee7dc008d80fe9c87086603c) nautilus (stable)
1: (()+0x12980) [0x7f22bab72980]
2: (gsignal()+0xc7) [0x7f22b9824fb7]
3: (abort()+0x141) [0x7f22b9826921]
4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a3) [0x55583ad1e00b]
5: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x55583ad1e195]
6: (BlueFS::_replay(bool, bool)+0x4fa5) [0x55583b33b8f5]
7: (BlueFS::mount()+0x229) [0x55583b33bd79]
8: (BlueStore::_open_bluefs(bool)+0x78) [0x55583b223958]
9: (BlueStore::_open_db(bool, bool, bool)+0x8a3) [0x55583b224e63]
10: (BlueStore::_open_db_and_around(bool)+0x44) [0x55583b2361a4]
11: (BlueStore::_mount(bool, bool)+0x584) [0x55583b28cb64]
12: (OSD::init()+0x3f3) [0x55583adcddb3]
13: (main()+0x5214) [0x55583ad26cf4]
14: (__libc_start_main()+0xe7) [0x7f22b9807bf7]
15: (_start()+0x2a) [0x55583ad5872a]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.@
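I've also attached a startup attempt captured with debug_bluefs = 20 (ceph-osd-startup.337.log.gz). It was captured roughly like this; treat it as a sketch rather than the exact invocation, since the same option can also be set in ceph.conf instead of on the command line:

@# sketch: restart the failing OSD in the foreground with verbose BlueFS logging
ceph-osd -f --cluster ceph --id 337 --debug_bluefs 20 2>&1 | tee ceph-osd-startup.337.log@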

I've dd'd all the affected OSDs to a test environment. Right now I've reconstituted this OSD there and get similar errors; I can't fsck, expand, or repair it.
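For reference, these are roughly the ceph-bluestore-tool invocations that fail on the reconstituted copy (a sketch of what I ran, assuming the standard subcommand syntax; the OSD path is from my test environment):

@# sketch: offline check/repair attempts against the reconstituted OSD copy
ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-337
ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-337
ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-337@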

Thanks for any assistance you can provide.

William Law


Files

ceph-osd.337.log.gz (14.5 KB) - subsequent start - William Law, 01/11/2021 03:37 PM
ceph-osd.337.log.3.shortend.gz (6.64 KB) - first issue - William Law, 01/11/2021 04:02 PM
ceph-osd-startup.337.log.gz (254 KB) - startup w/ debug_bluefs = 20 - William Law, 01/11/2021 04:59 PM
ceph-osd.337.log.current.gz (512 KB) - William Law, 01/11/2021 09:19 PM
ceph-osd.337.log.gz (254 KB) - 337 log w/ debug_bluefs = 20 - William Law, 01/12/2021 02:02 PM
337-dump.out (32 KB) - hexdump - William Law, 01/12/2021 02:02 PM
extract-337 (4 KB) - initial 4k extract - William Law, 01/14/2021 11:03 PM
extract-337-modified (4 KB) - edited 4k extract - William Law, 01/14/2021 11:03 PM
extract-337-modified-fixed (4 KB) - edited 4k extract (new fixed revision) - William Law, 01/15/2021 05:55 PM

Related issues: 1 (0 open, 1 closed)

Is duplicate of bluestore - Bug #45613: ObjectStore/StoreTestSpecificAUSize.SyntheticMatrixCompression/2 failed (Resolved)
