Bug #39151

closed

past end of block device is detected but not repaired

Added by Daniel Poelzleithner about 5 years ago. Updated about 5 years ago.

Status: Won't Fix
Priority: Normal
Assignee: -
Target version: -
% Done: 0%
Source: Community (user)
Regression: No
Severity: 2 - major

Description

While reconfiguring our cluster we ran into the issue that BlueStore somehow scheduled a transaction beyond the end of the device.

We run a Proxmox cluster with
ceph version 12.2.11 (c96e82ac735a75ae99d4847983711e1f2dbf12e5) luminous (stable)

While tracking this down, newer versions of ceph-bluestore-tool correctly identified the problem:

2019-04-08 21:27:18.347 7f2f3ab98f00 -1 bluestore(/var/lib/ceph/osd/ceph-7) fsck error:  #7:be05b038:::1000011945a.00000000:head# extent 0x1d199eb0000~10000 past end of block device
2019-04-08 21:27:18.671 7f2f3ab98f00 -1 bluestore(/var/lib/ceph/osd/ceph-7) fsck error:  #7:c993b83c:::100001194ab.00000000:head# extent 0x1d199ed0000~10000 past end of block device
2019-04-08 21:27:18.763 7f2f3ab98f00 -1 bluestore(/var/lib/ceph/osd/ceph-7) fsck error:  #7:ca705893:::10000119489.00000000:head# extent 0x1d199ec0000~10000 past end of block device
2019-04-08 21:27:18.819 7f2f3ab98f00 -1 bluestore(/var/lib/ceph/osd/ceph-7) fsck error:  #7:cb0b10cf:::100001194ee.00000000:head# extent 0x1d199ee0000~10000 past end of block device
2019-04-08 21:27:18.883 7f2f3ab98f00 -1 bluestore(/var/lib/ceph/osd/ceph-7) fsck error:  #7:cbb28521:::10000119441.00000000:head# extent 0x1d199e90000~10000 past end of block device

The repair command, however, still crashes while trying to run the transaction that extends beyond the end of the device, as it has not been committed yet:

docker run --privileged=true --rm -it --entrypoint /bin/bash -v /var/lib/ceph/osd:/var/lib/ceph/osd -v /dev:/dev -v /root:/root ceph/daemon                                                                                      
bash-4.2# ceph-bluestore-tool --command repair --log-level 10 -l /root/ceph-repair.log --path /var/lib/ceph/osd/ceph-7/
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.0/rpm/el7/BUILD/ceph-14.2.0/src/os/bluestore/KernelDevice.cc: In function 'virtual int KernelDevice::aio_write(uint64_t, ceph::bufferlist&, IOContext*, bool, int)' thread 7ff62617ff00 time 2019-04-08 21:39:42.411246
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.0/rpm/el7/BUILD/ceph-14.2.0/src/os/bluestore/KernelDevice.cc: 814: FAILED ceph_assert(is_valid_io(off, len))
 ceph version 14.2.0 (3a54b2b6d167d4a2a19e003a705696d4fe619afc) nautilus (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14a) [0x7ff61c1671a6]
 2: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x7ff61c167374]
 3: (KernelDevice::aio_write(unsigned long, ceph::buffer::v14_2_0::list&, IOContext*, bool, int)+0x128c) [0x562d77d9cbbc]
 4: (BlueStore::_deferred_submit_unlock(BlueStore::OpSequencer*)+0x621) [0x562d77c01721]
 5: (BlueStore::deferred_try_submit()+0x440) [0x562d77c203f0]
 6: (BlueStore::_osr_drain_all()+0x19d) [0x562d77c227cd]
 7: (BlueStore::_deferred_replay()+0x1a6) [0x562d77c23556]
 8: (BlueStore::_fsck(bool, bool)+0x4ecc) [0x562d77c2f78c]
 9: (main()+0x107e) [0x562d77a88e6e]
 10: (__libc_start_main()+0xf5) [0x7ff618e0b3d5]
 11: (()+0x271f0f) [0x562d77b6ef0f]
*** Caught signal (Aborted) **
 in thread 7ff62617ff00 thread_name:ceph-bluestore-
 ceph version 14.2.0 (3a54b2b6d167d4a2a19e003a705696d4fe619afc) nautilus (stable)
 1: (()+0xf5d0) [0x7ff61a64d5d0]
 2: (gsignal()+0x37) [0x7ff618e1f207]
 3: (abort()+0x148) [0x7ff618e208f8]
 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x199) [0x7ff61c1671f5]
 5: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x7ff61c167374]
 6: (KernelDevice::aio_write(unsigned long, ceph::buffer::v14_2_0::list&, IOContext*, bool, int)+0x128c) [0x562d77d9cbbc]
 7: (BlueStore::_deferred_submit_unlock(BlueStore::OpSequencer*)+0x621) [0x562d77c01721]
 8: (BlueStore::deferred_try_submit()+0x440) [0x562d77c203f0]
 9: (BlueStore::_osr_drain_all()+0x19d) [0x562d77c227cd]
 10: (BlueStore::_deferred_replay()+0x1a6) [0x562d77c23556]
 11: (BlueStore::_fsck(bool, bool)+0x4ecc) [0x562d77c2f78c]
 12: (main()+0x107e) [0x562d77a88e6e]
 13: (__libc_start_main()+0xf5) [0x7ff618e0b3d5]
 14: (()+0x271f0f) [0x562d77b6ef0f]
2019-04-08 21:39:42.411 7ff62617ff00 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.0/rpm/el7/BUILD/ceph-14.2.0/src/os/bluestore/KernelDevice.cc: In function 'virtual int KernelDevice::aio_write(uint64_t, ceph::bufferlist&, IOContext*, bool, int)' thread 7ff62617ff00 time 2019-04-08 21:39:42.411246
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.0/rpm/el7/BUILD/ceph-14.2.0/src/os/bluestore/KernelDevice.cc: 814: FAILED ceph_assert(is_valid_io(off, len))
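For context, the assertion that aborts the repair is essentially a bounds check of the I/O against the device size. A minimal standalone sketch of an equivalent check (hypothetical names and simplified logic, not the actual KernelDevice code):

#include <cstdint>
#include <cstdio>

// Hypothetical stand-in for the is_valid_io() check: an I/O is only valid
// if it is block-aligned and ends at or before the end of the device.
static bool io_in_bounds(uint64_t off, uint64_t len,
                         uint64_t block_size, uint64_t dev_size) {
  return len > 0 &&
         off % block_size == 0 &&
         len % block_size == 0 &&
         off + len <= dev_size;
}

int main() {
  const uint64_t block_size = 0x1000;        // 4 KiB, as reported in the log
  const uint64_t dev_size   = 0x1d199afb000; // open size from the fsck log
  // One of the offending extents reported by fsck: 0x1d199eb0000~10000
  const uint64_t off = 0x1d199eb0000, len = 0x10000;
  printf("in bounds: %d (end 0x%llx, device size 0x%llx)\n",
         io_in_bounds(off, len, block_size, dev_size) ? 1 : 0,
         (unsigned long long)(off + len), (unsigned long long)dev_size);
  return 0;
}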


Files

ceph-fsck.log (118 KB) - Daniel Poelzleithner, 04/08/2019 11:47 PM
fix_crash3.patch (2.73 KB) - Daniel Poelzleithner, 04/10/2019 01:06 PM
#2

Updated by Igor Fedotov about 5 years ago

@Daniel, I'm wondering if you remember what the initial Ceph version was that this cluster (or this specific OSD) was created with?

#3

Updated by Daniel Poelzleithner about 5 years ago

Definitely Luminous, but at some low patch level, maybe ~12.2.6.

#4

Updated by Daniel Poelzleithner about 5 years ago

FYI, I was able to repair the OSD by pruning the transaction log. There was only one transaction in the log, so not much of a problem there:

ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-7/ rm-prefix L
#5

Updated by Daniel Poelzleithner about 5 years ago

Before finding the kvstore solution I started developing a fix that is mostly finished. I'm not sure about the proper way of correcting the problem, though. Should the transaction be dropped?
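(Purely as an illustration of the "drop the transaction" option, not the attached patch: a minimal sketch with hypothetical names that skips deferred writes whose extents fall past the end of the device instead of submitting them.)

#include <cstdint>
#include <cstdio>
#include <vector>

// Hypothetical deferred-write record and replay loop; these are not Ceph's
// actual structures. Out-of-range writes are dropped (and counted) instead
// of being submitted, which would trip the aio_write assert.
struct DeferredWrite { uint64_t off; uint64_t len; };

size_t replay_deferred(const std::vector<DeferredWrite>& log, uint64_t dev_size) {
  size_t dropped = 0;
  for (const auto& w : log) {
    if (w.off + w.len > dev_size) {
      printf("dropping deferred write 0x%llx~%llx past end of device\n",
             (unsigned long long)w.off, (unsigned long long)w.len);
      ++dropped;
      continue;  // the data in this write is lost, but replay can proceed
    }
    // submit_write(w);  // in real code this would issue the actual write
  }
  return dropped;
}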

#6

Updated by Igor Fedotov about 5 years ago

Looks like the OSD was created by a release which improperly handled a disk size that is unaligned with the allocation granularity (64K for HDD by default).

I can see disk size = 1999738220544 (0x1d199afb000) bytes which isn't aligned with 64K:

378> 2019-04-08 21:16:30.099 7f23f682df00 1 bdev(0x55fde7ffee00 /var/lib/ceph/osd/ceph-7/block) open size 1999738220544 (0x1d199afb000, 1.8 TiB) block_size 4096 (4 KiB) rotational discard not supported
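
(Side note: the misalignment is easy to verify, assuming the default 64K HDD allocation unit mentioned above. A trivial check:)

#include <cstdint>
#include <cstdio>

int main() {
  const uint64_t dev_size   = 0x1d199afb000;  // open size from the log above
  const uint64_t alloc_unit = 0x10000;        // 64 KiB default HDD allocation unit
  // Prints "remainder 0xb000", i.e. the size is not a multiple of 64 KiB.
  printf("remainder 0x%llx\n", (unsigned long long)(dev_size % alloc_unit));
  return 0;
}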

Most probably that's the root cause for the past-end allocations.
Cleaning the transaction log (which causes data loss by itself) is only a partial solution; new data might still go to invalid offsets, causing more data loss.

The most straightforward fix is probably to redeploy the OSD. But if you don't mind, I'd like to poke at it a bit more to check how to both report such a case and fix it less painfully.
If you're fine with that, please collect another OSD startup log with 'debug bluestore' set to 20.

#7

Updated by Daniel Poelzleithner about 5 years ago

I rebuilt the original problematic OSD, but there is another one that now runs into the same problem. Log of it starting and crashing:
http://files.poelzi.org/ceph/ceph-osd.9.log.gz

#8

Updated by Igor Fedotov about 5 years ago

Thanks, Daniel.
In the new log I can see the following lines:
2019-04-12 11:54:13.109186 7fab6fc4fe00 10 bluestore(/var/lib/ceph/osd/ceph-9/block) _read_bdev_label got bdev(osd_uuid 5fb3a7cb-d807-4f26-8807-fea3adc6f161, size 0x1d1b8c11000, btime 2018-12-14 01:04:10.293969, desc main, 7 meta)
2019-04-12 11:54:13.109275 7fab6fc4fe00 1 bdev create path /var/lib/ceph/osd/ceph-9/block type kernel
2019-04-12 11:54:13.109282 7fab6fc4fe00 1 bdev(0x55b150838d80 /var/lib/ceph/osd/ceph-9/block) open path /var/lib/ceph/osd/ceph-9/block
2019-04-12 11:54:13.109535 7fab6fc4fe00 1 bdev(0x55b150838d80 /var/lib/ceph/osd/ceph-9/block) open size 1999738220544 (0x1d199afb000, 1.82TiB) block_size 4096 (4KiB) rotational

Please note the size reported by _read_bdev_label (0x1d1b8c11000) versus the size reported by the device on open: 1999738220544 (0x1d199afb000).

The first one is the size that was provided by the disk (probably with some manipulation by the BlueStore code) at creation time, and it doesn't look like the result of the improper rounding up that I expected.

Just in case: I'm wondering if you performed some data migration (or disk replacement) using similar(?) disks/partitions after the initial deployment?

#9

Updated by Daniel Poelzleithner about 5 years ago

If I remember correctly, the disks were in a different server back then and the hardware died. We had to move the disks to a new old server with a different RAID card (our cluster is built from donated hardware, total mess ;). I think we had to use RAID0 instead of JBOD because of RAID hardware firmware issues, so it could be that the disks were truncated at some point (the new storage expander also has a 2 TB limit, ...). I definitely ran deep scrub on the OSDs without problems, but fsck did not exist for BlueStore in this version.

Administrative mistakes aside, I think it would be good if at least fsck would check for this mismatch and the OSD would report some warning when starting.
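
(As a rough illustration of such a check, comparing the size recorded in the bdev label at creation time against the size the device reports on open; hypothetical names, not actual Ceph code:)

#include <cstdint>
#include <cstdio>

// Hypothetical startup/fsck check: a device that now reports fewer bytes
// than the bdev label recorded at creation time has shrunk, which is
// exactly the situation seen on ceph-9 above.
void check_label_vs_device(uint64_t label_size, uint64_t device_size) {
  if (device_size < label_size) {
    fprintf(stderr,
            "warning: device reports 0x%llx bytes but bdev label says 0x%llx; "
            "device appears to have shrunk, allocations may be past the end\n",
            (unsigned long long)device_size, (unsigned long long)label_size);
  }
}

int main() {
  // Sizes quoted from the ceph-9 log earlier in this ticket.
  check_label_vs_device(0x1d1b8c11000, 0x1d199afb000);
  return 0;
}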

#11

Updated by Igor Fedotov about 5 years ago

Daniel, mind if I close the ticket?

#12

Updated by Igor Fedotov about 5 years ago

  • Status changed from New to Won't Fix

The root cause of the issue is a disk size change that happened after the OSD was deployed.
