Bug #48827 (closed): Ceph Bluestore OSDs fail to start on WAL corruption

Added by William Law over 3 years ago. Updated over 1 year ago.

Status:
Duplicate
Priority:
Normal
Assignee:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi -

I also posted a note to the Ceph users list. We've run into this bug and unfortunately it hit 5 OSDs at the same time, so we currently have some data unavailable. We need help recovering.

Here's the initial crash from one OSD:
2021-01-07 23:19:54.352 7f6f4bbd5c00 4 rocksdb: [db/version_set.cc:3757] Recovered from manifest file:db/MANIFEST-066289 succeeded,manifest_file_number is 66289, next_file_number is 72848, last_sequence is 2591407985, log_number is 72845,prev_log_number is 0,max_column_family is 0,min_log_number_to_keep is 0

2021-01-07 23:19:54.352 7f6f4bbd5c00 4 rocksdb: [db/version_set.cc:3766] Column family [default] (ID 0), log number is 72845

2021-01-07 23:19:54.352 7f6f4bbd5c00 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1610061594356573, "job": 1, "event": "recovery_started", "log_files": [72843, 72845]}
2021-01-07 23:19:54.352 7f6f4bbd5c00 4 rocksdb: [db/db_impl_open.cc:583] Recovering log #72843 mode 0
2021-01-07 23:19:54.596 7f6f4bbd5c00 4 rocksdb: [db/db_impl_open.cc:583] Recovering log #72845 mode 0
2021-01-07 23:19:54.676 7f6f4bbd5c00 3 rocksdb: [db/db_impl_open.cc:518] db.wal/072845.log: dropping 3163 bytes; Corruption: error in middle of record
2021-01-07 23:19:54.676 7f6f4bbd5c00 3 rocksdb: [db/db_impl_open.cc:518] db.wal/072845.log: dropping 32757 bytes; Corruption: missing start of fragmented record(1)
2021-01-07 23:19:54.676 7f6f4bbd5c00 3 rocksdb: [db/db_impl_open.cc:518] db.wal/072845.log: dropping 32757 bytes; Corruption: missing start of fragmented record(1)
2021-01-07 23:19:54.676 7f6f4bbd5c00 3 rocksdb: [db/db_impl_open.cc:518] db.wal/072845.log: dropping 32757 bytes; Corruption: missing start of fragmented record(1)
2021-01-07 23:19:54.676 7f6f4bbd5c00 3 rocksdb: [db/db_impl_open.cc:518] db.wal/072845.log: dropping 32757 bytes; Corruption: missing start of fragmented record(1)
2021-01-07 23:19:54.676 7f6f4bbd5c00 3 rocksdb: [db/db_impl_open.cc:518] db.wal/072845.log: dropping 32757 bytes; Corruption: missing start of fragmented record(1)
2021-01-07 23:19:54.676 7f6f4bbd5c00 3 rocksdb: [db/db_impl_open.cc:518] db.wal/072845.log: dropping 32757 bytes; Corruption: missing start of fragmented record(1)
2021-01-07 23:19:54.676 7f6f4bbd5c00 3 rocksdb: [db/db_impl_open.cc:518] db.wal/072845.log: dropping 32757 bytes; Corruption: missing start of fragmented record(1)
2021-01-07 23:19:54.676 7f6f4bbd5c00 3 rocksdb: [db/db_impl_open.cc:518] db.wal/072845.log: dropping 20402 bytes; Corruption: missing start of fragmented record(2)
2021-01-07 23:19:54.680 7f6f4bbd5c00 4 rocksdb: [db/db_impl.cc:390] Shutdown: canceling all background work
2021-01-07 23:19:54.684 7f6f4bbd5c00 4 rocksdb: [db/db_impl.cc:563] Shutdown complete
2021-01-07 23:19:54.684 7f6f4bbd5c00 -1 rocksdb: Corruption: error in middle of record
2021-01-07 23:19:54.684 7f6f4bbd5c00 -1 bluestore(/var/lib/ceph/osd/ceph-337) _open_db erroring opening db:
2021-01-07 23:19:54.688 7f6f4bbd5c00 1 bluefs umount
2021-01-07 23:19:54.688 7f6f4bbd5c00 1 bdev(0x55aeff35b500 /var/lib/ceph/osd/ceph-337/block.db) close
2021-01-07 23:19:54.876 7f6f4bbd5c00 1 bdev(0x55aeff35b880 /var/lib/ceph/osd/ceph-337/block) close
2021-01-07 23:19:55.304 7f6f4bbd5c00 1 freelist shutdown
2021-01-07 23:19:55.308 7f6f4bbd5c00 1 bluestore(/var/lib/ceph/osd/ceph-337) _upgrade_super from 2, latest 2
2021-01-07 23:19:55.308 7f6f4bbd5c00 1 bluestore(/var/lib/ceph/osd/ceph-337) _upgrade_super done
2021-01-07 23:19:55.344 7f6f4bbd5c00 -1 *** Caught signal (Segmentation fault) **
in thread 7f6f4bbd5c00 thread_name:ceph-osd

The full log is attached as ceph-osd.337.log.3.shortend.gz; the complete log came in at 70MB, so the attachment is only a segment. It looks like the OSD hit a suicide timeout and then wouldn't start.

Subsequently it shows this output; the log is attached as ceph-osd.337.log.gz:
2021-01-11 04:40:42.512 7f22bcf8ac00 1 bluefs mount
2021-01-11 04:40:42.512 7f22bcf8ac00 1 bluefs _init_alloc id 1 alloc_size 0x100000 size 0x1440000000
2021-01-11 04:40:42.512 7f22bcf8ac00 1 bluefs _init_alloc id 2 alloc_size 0x10000 size 0x22ee7400000
2021-01-11 04:40:42.528 7f22bcf8ac00 1 /build/ceph-14.2.16/src/os/bluestore/BlueFS.cc: In function 'int BlueFS::_replay(bool, bool)' thread 7f22bcf8ac00 time 2021-01-11 04:40:42.527150
/build/ceph-14.2.16/src/os/bluestore/BlueFS.cc: 1029: FAILED ceph_assert(file->fnode.ino)

ceph version 14.2.16 (762032d6f509d5e7ee7dc008d80fe9c87086603c) nautilus (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x152) [0x55583ad1dfba]
2: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x55583ad1e195]
3: (BlueFS::_replay(bool, bool)+0x4fa5) [0x55583b33b8f5]
4: (BlueFS::mount()+0x229) [0x55583b33bd79]
5: (BlueStore::_open_bluefs(bool)+0x78) [0x55583b223958]
6: (BlueStore::_open_db(bool, bool, bool)+0x8a3) [0x55583b224e63]
7: (BlueStore::_open_db_and_around(bool)+0x44) [0x55583b2361a4]
8: (BlueStore::_mount(bool, bool)+0x584) [0x55583b28cb64]
9: (OSD::init()+0x3f3) [0x55583adcddb3]
10: (main()+0x5214) [0x55583ad26cf4]
11: (__libc_start_main()+0xe7) [0x7f22b9807bf7]
12: (_start()+0x2a) [0x55583ad5872a]

2021-01-11 04:40:42.532 7f22bcf8ac00 -1 *** Caught signal (Aborted) **
in thread 7f22bcf8ac00 thread_name:ceph-osd

ceph version 14.2.16 (762032d6f509d5e7ee7dc008d80fe9c87086603c) nautilus (stable)
1: (()+0x12980) [0x7f22bab72980]
2: (gsignal()+0xc7) [0x7f22b9824fb7]
3: (abort()+0x141) [0x7f22b9826921]
4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a3) [0x55583ad1e00b]
5: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x55583ad1e195]
6: (BlueFS::_replay(bool, bool)+0x4fa5) [0x55583b33b8f5]
7: (BlueFS::mount()+0x229) [0x55583b33bd79]
8: (BlueStore::_open_bluefs(bool)+0x78) [0x55583b223958]
9: (BlueStore::_open_db(bool, bool, bool)+0x8a3) [0x55583b224e63]
10: (BlueStore::_open_db_and_around(bool)+0x44) [0x55583b2361a4]
11: (BlueStore::_mount(bool, bool)+0x584) [0x55583b28cb64]
12: (OSD::init()+0x3f3) [0x55583adcddb3]
13: (main()+0x5214) [0x55583ad26cf4]
14: (__libc_start_main()+0xe7) [0x7f22b9807bf7]
15: (_start()+0x2a) [0x55583ad5872a]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

I've dd'd all the affected OSDs to a test environment. I've reconstituted this OSD there and get similar errors; I can't fsck/expand/repair it.

Thanks for any assistance you can provide.

William Law


Files

ceph-osd.337.log.gz (14.5 KB) ceph-osd.337.log.gz subsequent start William Law, 01/11/2021 03:37 PM
ceph-osd.337.log.3.shortend.gz (6.64 KB) ceph-osd.337.log.3.shortend.gz first issue William Law, 01/11/2021 04:02 PM
ceph-osd-startup.337.log.gz (254 KB) ceph-osd-startup.337.log.gz startup w debug_bluefs = 20 William Law, 01/11/2021 04:59 PM
ceph-osd.337.log.current.gz (512 KB) ceph-osd.337.log.current.gz William Law, 01/11/2021 09:19 PM
ceph-osd.337.log.gz (254 KB) ceph-osd.337.log.gz 337 log w debug_bluefs=20 William Law, 01/12/2021 02:02 PM
337-dump.out (32 KB) 337-dump.out hexdump William Law, 01/12/2021 02:02 PM
extract-337 (4 KB) extract-337 initial 4k extract William Law, 01/14/2021 11:03 PM
extract-337-modified (4 KB) extract-337-modified edited 4k extract William Law, 01/14/2021 11:03 PM
extract-337-modified-fixed (4 KB) extract-337-modified-fixed edited 4k extract (new fixed revision) William Law, 01/15/2021 05:55 PM

Related issues 1 (0 open, 1 closed)

Is duplicate of bluestore - Bug #45613: ObjectStore/StoreTestSpecificAUSize.SyntheticMatrixCompression/2 failed (Resolved)

Actions #1

Updated by Igor Fedotov over 3 years ago

@William - would you please share an OSD startup log with debug_bluefs set to 20?
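
For example, a manual start along these lines should produce such a log (just a sketch; the OSD id, unit name, paths and flags are assumptions to adjust for your deployment):

systemctl stop ceph-osd@337
ceph-osd -i 337 --debug-bluefs 20 --debug-bluestore 20 --log-file /var/log/ceph/ceph-osd.337.debug.log

Setting 'debug bluefs = 20' under [osd] in ceph.conf and restarting the OSD normally works as well.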

Actions #2

Updated by William Law over 3 years ago

Igor, thank you! It's attached.

Will

Actions #3

Updated by Igor Fedotov over 3 years ago

@Will, would you please share a hex dump of the block.db file, starting at offset 0xede000, length 0x8000?

The latest startup log would be nice to have as well, to make sure nothing has changed in BlueFS since the startup I already have a log for. (I have a feeling that RocksDB might attempt data recovery at some point, which modifies BlueFS; hence I want to make sure both the dump and the startup log are in sync.)

Actions #4

Updated by William Law over 3 years ago

Hi Igor -

I feel like I did something wrong, as hexdump returned nothing... My apologies, we haven't slept much.
root@ns-b14-4:~# hd -s 0xede000 -n 0x8000 /var/lib/ceph/ceph-337/block.db 2>&1 | tee hd-337-block.db.log
root@ns-b14-4:~# cd /var/lib/ceph/osd
root@ns-b14-4:/var/lib/ceph/osd# hd -s 0xede000 -n 0x8000 /var/lib/ceph/c^Ch-337/block.db 2>&1 | tee hd-337-block.db.log
root@ns-b14-4:/var/lib/ceph/osd# cd
root@ns-b14-4:~# cd /var/lib/ceph/osd^C
root@ns-b14-4:~# hd -s 0xede000 -n 0x8000 /var/lib/ceph/osd/ceph-337/block.db 2>&1 | tee hd-337-block.db.log
root@ns-b14-4:~# ls /var/lib/ceph/osd/ceph-337/block.db
/var/lib/ceph/osd/ceph-337/block.db
lrwxrwxrwx 1 ceph ceph 12 Jan 8 16:42 /var/lib/ceph/osd/ceph-337/block.db -> /dev/db0/337
root@ns-b14-4:~# ls -alth /dev/db0/337
lrwxrwxrwx 1 ceph ceph 8 Jan 11 17:32 /dev/db0/337 -> ../dm-26
root@ns-b14-4:~# hd -s 0xede000 -n 0x8000 /dev/dm-26
root@ns-b14-4:~# hexdump -s 0xede000 -n 0x8000 /var/lib/ceph/osd/ceph-337/block.db

I do see output without the length limit:
root@ns-b14-4:~# hexdump -s 0xede000 /var/lib/ceph/osd/ceph-337/block.db
0ede000 0101 0071 0000 a5ed de4f 4a29 c145 f8be
0ede010 e171 232d af4d 6361 0039 0000 0000 0051
0ede020 0000 0108 4a01 0000 9200 038b e0b3 7aec
0ede030 6237 5ff8 a813 1222 0400 0000 0100 0701
0ede040 0000 9300 0018 8300 0108 0101 0007 0000
0ede050 18a3 0000 2083 0101 0801 0000 b300 001a

I've attached the start log.

Regards,

Will

Actions #5

Updated by Igor Fedotov over 3 years ago

@Will - to make the block.db extract, just use:

dd if=block.db ibs=1 skip=15589376 count=32768 of=dump.out
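
The decimal skip/count values are simply the hex offset and length requested above, e.g.:

printf '%d %d\n' 0xede000 0x8000    # prints: 15589376 32768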

Actions #6

Updated by William Law over 3 years ago

Thank you thank you. They are attached.

Best,

Will

Actions #7

Updated by Igor Fedotov over 3 years ago

So, an intermediate summary on the issue.
The bluefs log contains a transaction (seq = 3760994) with improperly ordered operations in it.
The operations are:
OP_LINK db.wal/076554.log ino 50751
OP_UNLINK db.wal/076544.log
OP_UPDATE ino 50571 ...
...

Highly likely, reordering the OP_LINK and OP_UPDATE ops will fix the issue.
To do this, one should swap extent 0xedf022~0x11 and extent 0xedf05c~0x36 in block.db.

Actions #8

Updated by William Law over 3 years ago

OK great! How do we do that? And we have at least 4 other OSDs this happened to; should it be the same or how do we check?

Thank you for your help!

Actions #9

Updated by Igor Fedotov over 3 years ago

William Law wrote:

OK great! How do we do that? And we have at least 4 other OSDs this happened to; should it be the same or how do we check?

Thank you for your help!

Generally, any hex editor that allows insert mode should be sufficient; tweak works for me. Just swap the extents manually in the editor...
I suggest dumping a 4K extent at 0xedf000 out of block.db using dd, backing it up, editing it with tweak, and copying it back to block.db. Perhaps a more elegant way of doing this exists, though.
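
A possible dd-based version of that workflow (a sketch only, using the OSD path from earlier in this ticket; conv=notrunc matters if block.db is a regular file rather than a block device):

dd if=/var/lib/ceph/osd/ceph-337/block.db of=extract-337 bs=1 skip=15593472 count=4096
cp extract-337 extract-337.bak
# edit extract-337 with a hex editor (e.g. tweak), then write it back:
dd if=extract-337 of=/var/lib/ceph/osd/ceph-337/block.db bs=1 seek=15593472 count=4096 conv=notrunc

(15593472 is 0xedf000 in decimal.)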

As for the other OSDs, I suggest treating this one first and checking whether this approach works. Even if it does, more analysis of the other OSDs is required, as I'm not 100% certain they fail the same way...

Actions #10

Updated by William Law over 3 years ago

Hi Igor -

Thanks again for your kind and patient assistance. I think we got mixed up a little and hope you can help.

Did you mean 0xedf022~0x11 or did you mean 0xedf022~0x21? It looks to us like it should perhaps be 0xedf022~0x21.

We got the following error with either 0xedf022~0x21 or 0xedf022~0x11:
root@nero-ceph-debug:~/sthiell# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-337 --op list-pgs --no-mon-config
/build/ceph-14.2.16/src/os/bluestore/BlueStore.cc: In function 'int BlueStore::_upgrade_super()' thread 7f304f4c09c0 time 2021-01-14 19:23:24.558046
/build/ceph-14.2.16/src/os/bluestore/BlueStore.cc: 10561: FAILED ceph_assert(ondisk_format > 0)
ceph version 14.2.16 (762032d6f509d5e7ee7dc008d80fe9c87086603c) nautilus (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x152) [0x7f304554dc82]
2: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x7f304554de5d]
3: (BlueStore::_upgrade_super()+0x4b6) [0x55921ff1cc56]
4: (BlueStore::_mount(bool, bool)+0x592) [0x55921ff72fe2]
5: (main()+0x2f75) [0x55921f9ce415]
6: (__libc_start_main()+0xe7) [0x7f3043ce9bf7]
7: (_start()+0x2a) [0x55921f9fb20a]
*** Caught signal (Aborted) **
in thread 7f304f4c09c0 thread_name:ceph-objectstor
ceph version 14.2.16 (762032d6f509d5e7ee7dc008d80fe9c87086603c) nautilus (stable)
1: (()+0x12980) [0x7f3044c12980]
2: (gsignal()+0xc7) [0x7f3043d06fb7]
3: (abort()+0x141) [0x7f3043d08921]
4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a3) [0x7f304554dcd3]
5: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x7f304554de5d]
6: (BlueStore::_upgrade_super()+0x4b6) [0x55921ff1cc56]
7: (BlueStore::_mount(bool, bool)+0x592) [0x55921ff72fe2]
8: (main()+0x2f75) [0x55921f9ce415]
9: (__libc_start_main()+0xe7) [0x7f3043ce9bf7]
10: (_start()+0x2a) [0x55921f9fb20a]
Aborted (core dumped)

Here are the commands we used to extract and apply:
extract: dd if=/var/lib/ceph/osd/ceph-337/block.db of=extract-337 skip=15593472 ibs=1 obs=1 count=4096
apply: dd if=extract-337-modified of=/var/lib/ceph/osd/ceph-337/block.db seek=15593472 obs=1 ibs=1 count=4096

I've attached those files.

Will

Actions #11

Updated by Igor Fedotov over 3 years ago

Hi William,
this should be 0xedf022~0x21 indeed.

But extract-337-modified doesn't look correct in any case.
There are at least the following issues:
1) The byte at offset 0x22 should be 0x8, not 0x28.
2) Bytes starting at 0x58 (0x22+0x36) should be 05 06 00 etc. (i.e. the same as at offset 0x43+ of the original data).
3) Bytes at offset 0x71 (0x22+0x36+0x19) should be 04 06 00 ... (originally present at offset 0x22+).

So the general idea once more.
Originally there are the following chunks (one chunk per op):
(1) 0x22~0x21, (2) 0x43~0x19, (3) 0x5c~0x36

The resulting list should look like (if I made the calculations properly):
(3) 0x22~0x36, (2) 0x58~0x19, (1) 0x71~0x21
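
For illustration, the reordering could even be done with dd on the 4K extract itself (a sketch, assuming the offsets above are relative to the start of the extract):

dd if=extract-337 of=chunk1 bs=1 skip=$((0x22)) count=$((0x21))
dd if=extract-337 of=chunk2 bs=1 skip=$((0x43)) count=$((0x19))
dd if=extract-337 of=chunk3 bs=1 skip=$((0x5c)) count=$((0x36))
cp extract-337 extract-337-modified
cat chunk3 chunk2 chunk1 | dd of=extract-337-modified bs=1 seek=$((0x22)) conv=notrunc

The three chunks have the same total length (0x70), so the rest of the extract is left untouched.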

William Law wrote:

Hi Igor -

Thanks again for your kind and patient assistance. I think we got mixed up a little and hope you can help.

Did you mean 0xedf022~0x11 or did you mean 0xedf022~0x21? It looks to us like it should perhaps be 0xedf022~0x21.

We got the following error with either 0xedf022~0x21 or 0xedf022~0x11:
root@nero-ceph-debug:~/sthiell# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-337 --op list-pgs --no-mon-config
/build/ceph-14.2.16/src/os/bluestore/BlueStore.cc: In function 'int BlueStore::_upgrade_super()' thread 7f304f4c09c0 time 2021-01-14 19:23:24.558046
/build/ceph-14.2.16/src/os/bluestore/BlueStore.cc: 10561: FAILED ceph_assert(ondisk_format > 0)
ceph version 14.2.16 (762032d6f509d5e7ee7dc008d80fe9c87086603c) nautilus (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x152) [0x7f304554dc82]
2: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x7f304554de5d]
3: (BlueStore::_upgrade_super()+0x4b6) [0x55921ff1cc56]
4: (BlueStore::_mount(bool, bool)+0x592) [0x55921ff72fe2]
5: (main()+0x2f75) [0x55921f9ce415]
6: (__libc_start_main()+0xe7) [0x7f3043ce9bf7]
7: (_start()+0x2a) [0x55921f9fb20a]
*** Caught signal (Aborted) **
in thread 7f304f4c09c0 thread_name:ceph-objectstor
ceph version 14.2.16 (762032d6f509d5e7ee7dc008d80fe9c87086603c) nautilus (stable)
1: (()+0x12980) [0x7f3044c12980]
2: (gsignal()+0xc7) [0x7f3043d06fb7]
3: (abort()+0x141) [0x7f3043d08921]
4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a3) [0x7f304554dcd3]
5: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x7f304554de5d]
6: (BlueStore::_upgrade_super()+0x4b6) [0x55921ff1cc56]
7: (BlueStore::_mount(bool, bool)+0x592) [0x55921ff72fe2]
8: (main()+0x2f75) [0x55921f9ce415]
9: (__libc_start_main()+0xe7) [0x7f3043ce9bf7]
10: (_start()+0x2a) [0x55921f9fb20a]
Aborted (core dumped)

Here are the commands we used to extract and apply:
extract: dd if=/var/lib/ceph/osd/ceph-337/block.db of=extract-337 skip=15593472 ibs=1 obs=1 count=4096
apply: dd if=extract-337-modified of=/var/lib/ceph/osd/ceph-337/block.db seek=15593472 obs=1 ibs=1 count=4096

I've attached those files.

Will

Actions #12

Updated by William Law over 3 years ago

Hi Igor – thanks for the clarification. We actually did that but for whatever reason, we uploaded the wrong file, sorry. Please find extract-337-modified-fixed that should be correct this time. We also had to change the 32-bit checksum at the end of the transaction, otherwise we get a bad crc error.

Now, with extract-337-modified-fixed applied, ceph-objectstore-tool --op fsck is still failing with the following error:

-7> 2021-01-15 17:13:28.959 7f64626459c0 10 bluefs _read h 0x55d8d26e3020 0x3df000~1000 from file(ino 1 size 0x3df000 mtime 0.000000 allocated 500000 extents [1:0x31300000~100000,1:0xc00000~400000])
-6> 2021-01-15 17:13:28.959 7f64626459c0 20 bluefs _read left 0x21000 len 0x1000
-5> 2021-01-15 17:13:28.959 7f64626459c0 20 bluefs _read got 4096
-4> 2021-01-15 17:13:28.959 7f64626459c0 10 bluefs _replay 0x3df000: txn(seq 3760994 len 0xf6 crc 0x20e83d27)
-3> 2021-01-15 17:13:28.959 7f64626459c0 20 bluefs _replay 0x3df000: op_file_update file(ino 50571 size 0xf6bb6de mtime 2021-01-08 13:46:31.307742 allocated f700000 extents [1:0x400000~100000,1:0x21c00000~f600000])
-2> 2021-01-15 17:13:28.959 7f64626459c0 20 bluefs _replay 0x3df000: op_dir_unlink db.wal/076544.log
-1> 2021-01-15 17:13:28.987 7f64626459c0 -1 /build/ceph-14.2.16/src/os/bluestore/BlueFS.cc: In function 'int BlueFS::_replay(bool, bool)' thread 7f64626459c0 time 2021-01-15 17:13:28.966088
/build/ceph-14.2.16/src/os/bluestore/BlueFS.cc: 1064: FAILED ceph_assert(r != q->second->file_map.end())

Please note that OP_UPDATE and OP_LINK/OP_UNLINK are not on the same inode (50571 vs 50751 - a bit confusing).

We also looked at this further last night, and we think the unlink cannot succeed, perhaps because there is an earlier transaction that already removes it:

-3690> 2021-01-15 17:13:28.947 7f64626459c0 10 bluefs _read h 0x55d8d26e3020 0x102000~1000 from file(ino 1 size 0x102000 mtime 0.000000 allocated 500000 extents [1:0x31300000~100000,1:0xc00000~400000])
-3689> 2021-01-15 17:13:28.947 7f64626459c0 20 bluefs _read left 0xfe000 len 0x1000
-3688> 2021-01-15 17:13:28.947 7f64626459c0 20 bluefs _read got 4096
-3687> 2021-01-15 17:13:28.947 7f64626459c0 10 bluefs _replay 0x102000: txn(seq 3760261 len 0x10a crc 0xcb08369d)
-3686> 2021-01-15 17:13:28.947 7f64626459c0 20 bluefs _replay 0x102000: op_dir_unlink db/CURRENT
-3685> 2021-01-15 17:13:28.947 7f64626459c0 20 bluefs _replay 0x102000: op_file_remove 50570
-3684> 2021-01-15 17:13:28.947 7f64626459c0 20 bluefs _replay 0x102000: op_dir_link db/CURRENT to 50581
-3683> 2021-01-15 17:13:28.947 7f64626459c0 20 bluefs _replay 0x102000: op_dir_unlink db/076554.dbtmp
-3682> 2021-01-15 17:13:28.947 7f64626459c0 20 bluefs _replay 0x102000: op_file_update file(ino 50582 size 0x0 mtime 2021-01-08 16:58:40.964528 allocated 0 extents [])
-3681> 2021-01-15 17:13:28.947 7f64626459c0 20 bluefs _replay 0x102000: op_dir_link db.wal/076555.log to 50582
-3680> 2021-01-15 17:13:28.947 7f64626459c0 20 bluefs _replay 0x102000: op_dir_unlink db.wal/076552.log
-3679> 2021-01-15 17:13:28.947 7f64626459c0 20 bluefs _replay 0x102000: op_file_remove 50578
-3678> 2021-01-15 17:13:28.947 7f64626459c0 20 bluefs _replay 0x102000: op_dir_unlink db.wal/076544.log <<<
-3677> 2021-01-15 17:13:28.947 7f64626459c0 20 bluefs _replay 0x102000: op_file_remove 50571
-3676> 2021-01-15 17:13:28.947 7f64626459c0 20 bluefs _replay 0x102000: op_dir_unlink db/MANIFEST-076543
-3675> 2021-01-15 17:13:28.947 7f64626459c0 20 bluefs _replay 0x102000: op_file_remove 50569
-3674> 2021-01-15 17:13:28.947 7f64626459c0 20 bluefs _replay 0x102000: op_dir_unlink db/076553.sst
-3673> 2021-01-15 17:13:28.947 7f64626459c0 20 bluefs _replay 0x102000: op_file_remove 50579

Let us know what you think. Thanks again.

Igor Fedotov wrote:

Hi William,
this should be 0xedf022~0x21 indeed.

But extract-337-modified doesn't look correct in any case.
There are at least the following issues:
1) The byte at offset 0x22 should be 0x8, not 0x28.
2) Bytes starting at 0x58 (0x22+0x36) should be 05 06 00 etc. (i.e. the same as at offset 0x43+ of the original data).
3) Bytes at offset 0x71 (0x22+0x36+0x19) should be 04 06 00 ... (originally present at offset 0x22+).

So the general idea once more.
Originally there are the following chunks (one chunk per op):
(1) 0x22~0x21, (2) 0x43~0x19, (3) 0x5c~0x36

The resulting list should look like (if I made the calculations properly):
(3) 0x22~0x36, (2) 0x58~0x19, (1) 0x71~0x21

Actions #13

Updated by Igor Fedotov about 3 years ago

William Law wrote:

Hi Igor – thanks for the clarification. We actually did that but for whatever reason, we uploaded the wrong file, sorry. Please find extract-337-modified-fixed that should be correct this time. We also had to change the 32-bit checksum at the end of the transaction, otherwise we get a bad crc error.

Looks good now.
And yeah, sorry, my bad - forgot about checksum wrapping.

Now, with extract-337-modified-fixed applied, ceph-objectstore-tool --op fsck is still failing with the following error:

-7> 2021-01-15 17:13:28.959 7f64626459c0 10 bluefs _read h 0x55d8d26e3020 0x3df000~1000 from file(ino 1 size 0x3df000 mtime 0.000000 allocated 500000 extents [1:0x31300000~100000,1:0xc00000~400000])
-6> 2021-01-15 17:13:28.959 7f64626459c0 20 bluefs _read left 0x21000 len 0x1000
-5> 2021-01-15 17:13:28.959 7f64626459c0 20 bluefs _read got 4096
-4> 2021-01-15 17:13:28.959 7f64626459c0 10 bluefs _replay 0x3df000: txn(seq 3760994 len 0xf6 crc 0x20e83d27)
-3> 2021-01-15 17:13:28.959 7f64626459c0 20 bluefs _replay 0x3df000: op_file_update file(ino 50571 size 0xf6bb6de mtime 2021-01-08 13:46:31.307742 allocated f700000 extents [1:0x400000~100000,1:0x21c00000~f600000])
-2> 2021-01-15 17:13:28.959 7f64626459c0 20 bluefs _replay 0x3df000: op_dir_unlink db.wal/076544.log
-1> 2021-01-15 17:13:28.987 7f64626459c0 -1 /build/ceph-14.2.16/src/os/bluestore/BlueFS.cc: In function 'int BlueFS::_replay(bool, bool)' thread 7f64626459c0 time 2021-01-15 17:13:28.966088
/build/ceph-14.2.16/src/os/bluestore/BlueFS.cc: 1064: FAILED ceph_assert(r != q->second->file_map.end())

Yeah, indeed, now it asserts on the next op - removing the previously erased db.wal/076544.log. See below for options...

Please note that OP_UPDATE and OP_LINK/OP_UNLINK are not on the same inode (50571 vs 50751 - a bit confusing).

IMO you're wrong here. The inode is 50571 for both OP_UPDATE and OP_LINK. Here is the snippet from your original startup log:
-2> 2021-01-11 16:55:38.817 7f0157f26c00 20 bluefs _replay 0x3df000: op_dir_link db.wal/076554.log to 50571

and here is the new one:
> -3> 2021-01-15 17:13:28.959 7f64626459c0 20 bluefs _replay 0x3df000: op_file_update file(ino 50571 size 0xf6bb6de mtime 2021-01-08 13:46:31.307742

We also looked at this further last night, and we think the unlink cannot succeed, perhaps because there is an earlier transaction that already removes it:

-3690> 2021-01-15 17:13:28.947 7f64626459c0 10 bluefs _read h 0x55d8d26e3020 0x102000~1000 from file(ino 1 size 0x102000 mtime 0.000000 allocated 500000 extents [1:0x31300000~100000,1:0xc00000~400000])
-3689> 2021-01-15 17:13:28.947 7f64626459c0 20 bluefs _read left 0xfe000 len 0x1000
-3688> 2021-01-15 17:13:28.947 7f64626459c0 20 bluefs _read got 4096
-3687> 2021-01-15 17:13:28.947 7f64626459c0 10 bluefs _replay 0x102000: txn(seq 3760261 len 0x10a crc 0xcb08369d)
-3686> 2021-01-15 17:13:28.947 7f64626459c0 20 bluefs _replay 0x102000: op_dir_unlink db/CURRENT
-3685> 2021-01-15 17:13:28.947 7f64626459c0 20 bluefs _replay 0x102000: op_file_remove 50570
-3684> 2021-01-15 17:13:28.947 7f64626459c0 20 bluefs _replay 0x102000: op_dir_link db/CURRENT to 50581
-3683> 2021-01-15 17:13:28.947 7f64626459c0 20 bluefs _replay 0x102000: op_dir_unlink db/076554.dbtmp
-3682> 2021-01-15 17:13:28.947 7f64626459c0 20 bluefs _replay 0x102000: op_file_update file(ino 50582 size 0x0 mtime 2021-01-08 16:58:40.964528 allocated 0 extents [])
-3681> 2021-01-15 17:13:28.947 7f64626459c0 20 bluefs _replay 0x102000: op_dir_link db.wal/076555.log to 50582
-3680> 2021-01-15 17:13:28.947 7f64626459c0 20 bluefs _replay 0x102000: op_dir_unlink db.wal/076552.log
-3679> 2021-01-15 17:13:28.947 7f64626459c0 20 bluefs _replay 0x102000: op_file_remove 50578
-3678> 2021-01-15 17:13:28.947 7f64626459c0 20 bluefs _replay 0x102000: op_dir_unlink db.wal/076544.log <<<
-3677> 2021-01-15 17:13:28.947 7f64626459c0 20 bluefs _replay 0x102000: op_file_remove 50571
-3676> 2021-01-15 17:13:28.947 7f64626459c0 20 bluefs _replay 0x102000: op_dir_unlink db/MANIFEST-076543
-3675> 2021-01-15 17:13:28.947 7f64626459c0 20 bluefs _replay 0x102000: op_file_remove 50569
-3674> 2021-01-15 17:13:28.947 7f64626459c0 20 bluefs _replay 0x102000: op_dir_unlink db/076553.sst
-3673> 2021-01-15 17:13:28.947 7f64626459c0 20 bluefs _replay 0x102000: op_file_remove 50579

Let us know what you think. Thanks again.

Yeah, this analysis is valid. By swapping OP_UPDATE and OP_LINK I was actually trying to work around this, but I missed that OP_UNLINK would trigger another assertion for the same reason - an already-removed file.

So I can see two (non-trivial) options to proceed:
1) Build a custom ceph-osd with a patch to avoid the assertion on this "invalid" remove op.
2) Rebuild the whole transaction through manual hex editing and wipe the remove op out of it. This might require rebuilding all the length fields within the transaction that wraps the op list, not to mention another checksum update.

Igor Fedotov wrote:

Hi William,
this should be 0xedf022~0x21 indeed.

But extract-337-modified doesn't look correct in any case.
There are at least the following issues:
1) The byte at offset 0x22 should be 0x8, not 0x28.
2) Bytes starting at 0x58 (0x22+0x36) should be 05 06 00 etc. (i.e. the same as at offset 0x43+ of the original data).
3) Bytes at offset 0x71 (0x22+0x36+0x19) should be 04 06 00 ... (originally present at offset 0x22+).

So the general idea once more.
Originally there are the following chunks (one chunk per op):
(1) 0x22~0x21, (2) 0x43~0x19, (3) 0x5c~0x36

The resulting list should look like (if I made the calculations properly):
(3) 0x22~0x36, (2) 0x58~0x19, (1) 0x71~0x21

Actions #14

Updated by William Law about 3 years ago

Hi Igor -

Thank you for your help and two options; we will discuss.

Given this, is there any way we could (relatively easily) extract PGs from the OSD?

Will

Actions #15

Updated by Igor Fedotov about 3 years ago

William Law wrote:

Hi Igor -

Thank you for your help and two options; we will discuss.

Given this, is there any way we could (relatively easily) extract PGs from the OSD?

I suppose this would require a bluefs log replay anyway, hence we are back to the same issue.

Actions #16

Updated by Wout van Heeswijk about 3 years ago

Hi @Igor Gajowiak,

It looks like we're experiencing this at the moment. We've moved some pools from hdd-only osds to nvme osds and added some extra hdd osds. This results, to no surprise to anyone, in very intense backfills. All pools had already been successfully migrated to the new nvmes before I added the hdds. Now the hdds have started to die after a few hours of intense backfill activity. At the moment there is no indication that any of this is caused by hardware failures, and we've now encountered this on 5 OSDs.
I've stopped all I/O on the cluster (pauserd,pausewr,noout,nobackfill,norebalance,norecover,nodeep-scrub). One pg is 'down', from which I could recover. I've done this because I really can't afford to lose one more OSD.

I'm on a 15.2.2 cluster.

Do we have any idea what the cause of this bug is? The closest thing I could find was this bug report: https://tracker.ceph.com/issues/45613

We have not tuned the osds at all and we are not using bluefs_buffered_io. Example from the ceph daemon:

"bluefs_buffered_io": "false",
"bluefs_preextend_wal_files": "true",

Example log tail:


2021-02-26T03:56:41.442-0500 7fb4ae7d6f00  4 rocksdb: [db/version_set.cc:3766] Column family [default] (ID 0), log number is 76232

2021-02-26T03:56:41.445-0500 7fb4ae7d6f00  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1614329801446019, "job": 1, "event": "recovery_started", "log_files": [76207, 76232]}
2021-02-26T03:56:41.445-0500 7fb4ae7d6f00  4 rocksdb: [db/db_impl_open.cc:583] Recovering log #76207 mode 0
2021-02-26T03:56:43.968-0500 7fb4ae7d6f00  3 rocksdb: [db/db_impl_open.cc:518] db.wal/076207.log: dropping 107997389 bytes; Corruption: error in middle of record
2021-02-26T03:56:43.968-0500 7fb4ae7d6f00  3 rocksdb: [db/db_impl_open.cc:518] db.wal/076207.log: dropping 32757 bytes; Corruption: missing start of fragmented record(1)
2021-02-26T03:56:43.968-0500 7fb4ae7d6f00  3 rocksdb: [db/db_impl_open.cc:518] db.wal/076207.log: dropping 32757 bytes; Corruption: missing start of fragmented record(1)
2021-02-26T03:56:43.968-0500 7fb4ae7d6f00  3 rocksdb: [db/db_impl_open.cc:518] db.wal/076207.log: dropping 32757 bytes; Corruption: missing start of fragmented record(1)
2021-02-26T03:56:43.968-0500 7fb4ae7d6f00  3 rocksdb: [db/db_impl_open.cc:518] db.wal/076207.log: dropping 32757 bytes; Corruption: missing start of fragmented record(1)
2021-02-26T03:56:43.968-0500 7fb4ae7d6f00  3 rocksdb: [db/db_impl_open.cc:518] db.wal/076207.log: dropping 32757 bytes; Corruption: missing start of fragmented record(1)
2021-02-26T03:56:43.968-0500 7fb4ae7d6f00  3 rocksdb: [db/db_impl_open.cc:518] db.wal/076207.log: dropping 32757 bytes; Corruption: missing start of fragmented record(1)
2021-02-26T03:56:43.968-0500 7fb4ae7d6f00  3 rocksdb: [db/db_impl_open.cc:518] db.wal/076207.log: dropping 32757 bytes; Corruption: missing start of fragmented record(1)
2021-02-26T03:56:43.968-0500 7fb4ae7d6f00  3 rocksdb: [db/db_impl_open.cc:518] db.wal/076207.log: dropping 32757 bytes; Corruption: missing start of fragmented record(1)
~~~~~
2021-02-26T03:56:45.020-0500 7fb4ae7d6f00  3 rocksdb: [db/db_impl_open.cc:518] db.wal/076207.log: dropping 32757 bytes; Corruption: missing start of fragmented record(1)                                          
2021-02-26T03:56:45.020-0500 7fb4ae7d6f00  3 rocksdb: [db/db_impl_open.cc:518] db.wal/076207.log: dropping 32757 bytes; Corruption: missing start of fragmented record(1)                                          
2021-02-26T03:56:45.020-0500 7fb4ae7d6f00  3 rocksdb: [db/db_impl_open.cc:518] db.wal/076207.log: dropping 32757 bytes; Corruption: missing start of fragmented record(1)                                          
2021-02-26T03:56:45.020-0500 7fb4ae7d6f00  3 rocksdb: [db/db_impl_open.cc:518] db.wal/076207.log: dropping 32757 bytes; Corruption: missing start of fragmented record(1)                                          
2021-02-26T03:56:45.020-0500 7fb4ae7d6f00  3 rocksdb: [db/db_impl_open.cc:518] db.wal/076207.log: dropping 32757 bytes; Corruption: missing start of fragmented record(1)                                          
2021-02-26T03:56:45.020-0500 7fb4ae7d6f00  3 rocksdb: [db/db_impl_open.cc:518] db.wal/076207.log: dropping 32757 bytes; Corruption: missing start of fragmented record(1)                                          
2021-02-26T03:56:45.020-0500 7fb4ae7d6f00  3 rocksdb: [db/db_impl_open.cc:518] db.wal/076207.log: dropping 32757 bytes; Corruption: missing start of fragmented record(1)                                          
2021-02-26T03:56:45.020-0500 7fb4ae7d6f00  3 rocksdb: [db/db_impl_open.cc:518] db.wal/076207.log: dropping 29531 bytes; Corruption: missing start of fragmented record(2)                                          
2021-02-26T03:56:45.020-0500 7fb4ae7d6f00  4 rocksdb: [db/db_impl.cc:390] Shutdown: canceling all background work                                                                                                  
2021-02-26T03:56:45.031-0500 7fb4ae7d6f00  4 rocksdb: [db/db_impl.cc:563] Shutdown complete
2021-02-26T03:56:45.031-0500 7fb4ae7d6f00 -1 rocksdb: Corruption: error in middle of record
2021-02-26T03:56:45.031-0500 7fb4ae7d6f00 -1 bluestore(/var/lib/ceph/osd/ceph-49) _open_db erroring opening db:                                                                                                    
2021-02-26T03:56:45.031-0500 7fb4ae7d6f00  1 bluefs umount
2021-02-26T03:56:45.031-0500 7fb4ae7d6f00  1 fbmap_alloc 0x55aaac2a2800 shutdown
2021-02-26T03:56:45.032-0500 7fb4ae7d6f00  1 bdev(0x55aaacf8c700 /var/lib/ceph/osd/ceph-49/block) close
2021-02-26T03:56:45.050-0500 7fb4ae7d6f00  1 bdev(0x55aaacf8c000 /var/lib/ceph/osd/ceph-49/block) close
2021-02-26T03:56:45.298-0500 7fb4ae7d6f00 -1 osd.49 0 OSD:init: unable to mount object store
2021-02-26T03:56:45.298-0500 7fb4ae7d6f00 -1 ** ERROR: osd init failed: (5) Input/output error

Actions #17

Updated by Igor Fedotov about 3 years ago

Hi Wout,

I think https://tracker.ceph.com/issues/45613 is relevant indeed.
I suggest setting bluefs_preextend_wal_files to false ASAP.
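
For example, via the centralized config store (assuming a reachable mon quorum; adjust the osd id used for verification):

ceph config set osd bluefs_preextend_wal_files false
ceph config get osd.49 bluefs_preextend_wal_files   # verify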

Actions #18

Updated by Wout van Heeswijk about 3 years ago

Thanks for the reply @Igor Gajowiak.

Actions #19

Updated by Wout van Heeswijk about 3 years ago

Can we link the tickets with a "caused by" to https://tracker.ceph.com/issues/45613 ?

Actions #20

Updated by Neha Ojha about 3 years ago

  • Is duplicate of Bug #45613: ObjectStore/StoreTestSpecificAUSize.SyntheticMatrixCompression/2 failed added
Actions #21

Updated by Igor Fedotov over 2 years ago

  • Status changed from New to Duplicate
Actions #22

Updated by Adam Kupczyk over 1 year ago

In a customer case running Luminous, where the OSD process was run twice (let's skip how), the assert 'file->fnode.ino' from BlueFS::_replay(...) accounted for roughly 1 in 10 of the fatal OSD failures.
