
Bug #53466

OSD in CrashLoopBackOff

Added by Jan Tilsner 10 months ago. Updated 7 months ago.

Status:
New
Priority:
Normal
Assignee:
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We have a Ceph cluster using Rook on Kubernetes with some OSDs that are in CrashLoopBackOff state.

We suspect this started after the last round of updates (Rook 1.5.12 with Ceph 15.2.11, then Rook 1.6.10 with Ceph 15.2.13, then Rook 1.7.4 with Ceph 16.2.6).
After that update the cluster was stable for several weeks (Oct. 18 until Nov. 15); now every OSD that is restarted crashes.

I will attach a log from a crashing OSD.

We have a second, identical cluster with the same problem (updated on Oct. 18, crashed on Nov. 24). Both of these clusters are virtual machines on ESX.
These are test clusters and not critical, but we performed the same update and are running identical software on our production clusters (AWS EC2 and bare metal).
We fear that after some time they will be affected too.

We were able to repair one of the two clusters by adding a new OSD and removing the broken one. Unfortunately that didn't help with the other cluster (now 6 of 10 OSDs are broken).

Since the error message is "unable to allocate 0x90000 on bdev 1, allocator name block, allocator type hybrid, capacity 0x31ffc00000, block size 0x1000, free 0x5d008c000, fragmentation 0.211058, allocated 0x0", we found the following issue:
https://tracker.ceph.com/issues/47883
We tried switching the allocator to bitmap, but still no luck. Changing the image tag of the OSD to quay.io/ceph/daemon-base:latest-pacific-devel did not help either.
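For reference, switching the allocator is typically done with a ceph.conf fragment like the one below (or the equivalent `ceph config set osd ...`); this is a sketch of the workaround tried above, and requires an OSD restart to take effect:

```ini
# Sketch: switch BlueStore's allocator from the default "hybrid" to "bitmap"
# for all OSDs (the workaround suggested in tracker issue 47883)
[osd]
bluestore_allocator = bitmap
```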

rook-version: 1.7.4
ceph-version: 16.2.6

osd-2.log View (289 KB) Jan Tilsner, 12/02/2021 03:10 PM


Related issues

Related to bluestore - Fix #54299: osd error restart New
Related to bluestore - Bug #53899: bluefs _allocate allocation failed - BlueFS.cc: 2768: ceph_abort_msg("bluefs enospc") Need More Info

History

#1 Updated by Neha Ojha 10 months ago

  • Project changed from Ceph to bluestore
  • Category deleted (OSD)

#2 Updated by Igor Fedotov 10 months ago

This might be a duplicate of #53590.
As per my comment in that ticket - could you please collect and share "free-dump" output?

#3 Updated by Jan Tilsner 10 months ago

I am not sure if I did it right:
- scale the rook operator down
- edit the deployment
-- command: sleep 3600
-- remove liveness probe
- exec into the OSD pod and fire the command "ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-7 --command free-dump"


2021-12-14T14:58:16.883+0000 7fa9046ea3c0 -1 bluefs _check_new_allocations invalid extent 1: 0xff2a0f000~81000: duplicate reference, ino 3507
2021-12-14T14:58:16.883+0000 7fa9046ea3c0 -1 bluefs mount failed to replay log: (14) Bad address
2021-12-14T14:58:16.883+0000 7fa9046ea3c0 -1 bluestore(/var/lib/ceph/osd/ceph-7) _open_bluefs failed bluefs mount: (14) Bad address
2021-12-14T14:58:16.883+0000 7fa9046ea3c0 -1 bluestore(/var/lib/ceph/osd/ceph-7) _open_db failed to prepare db environment:
error from cold_open: (5) Input/output error
command terminated with exit code 1
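The steps above would look roughly like this with kubectl (namespace, deployment name, and OSD id are illustrative and will differ per cluster):

```shell
# Stop the Rook operator so it doesn't revert the manual changes below
kubectl -n rook-ceph scale deployment rook-ceph-operator --replicas=0

# Keep the OSD container alive without starting the daemon,
# and remove the liveness probe so the pod isn't restarted
kubectl -n rook-ceph patch deployment rook-ceph-osd-7 --type=json -p='[
  {"op": "replace", "path": "/spec/template/spec/containers/0/command",
   "value": ["sleep", "3600"]},
  {"op": "remove", "path": "/spec/template/spec/containers/0/livenessProbe"}
]'

# Exec into the (now idle) OSD pod and collect the free-dump
kubectl -n rook-ceph exec deploy/rook-ceph-osd-7 -- \
  ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-7 --command free-dump
```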

#4 Updated by Igor Fedotov 10 months ago

Jan Tilsner wrote:

I am not sure if I did it right:
- scale the rook operator down
- edit the deployment
-- command: sleep 3600
-- remove liveness probe
- exec into the OSD pod and fire the command "ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-7 --command free-dump"


2021-12-14T14:58:16.883+0000 7fa9046ea3c0 -1 bluefs _check_new_allocations invalid extent 1: 0xff2a0f000~81000: duplicate reference, ino 3507
2021-12-14T14:58:16.883+0000 7fa9046ea3c0 -1 bluefs mount failed to replay log: (14) Bad address
2021-12-14T14:58:16.883+0000 7fa9046ea3c0 -1 bluestore(/var/lib/ceph/osd/ceph-7) _open_bluefs failed bluefs mount: (14) Bad address
2021-12-14T14:58:16.883+0000 7fa9046ea3c0 -1 bluestore(/var/lib/ceph/osd/ceph-7) _open_db failed to prepare db environment:
error from cold_open: (5) Input/output error
command terminated with exit code 1

No, something is definitely wrong with this attempt; wrong device access or something similar. Unfortunately I'm unable to advise anything here, due to my pretty limited experience with Rook deployments.

#5 Updated by Alexander Trost 8 months ago

Sorry for the late response; Jan and I have been looking into the issue together.

We were unable to retrieve the ceph-bluestore-tool --command free-dump output from the initially mentioned OSDs and clusters. This did not seem to be caused by Rook and, interestingly, went away after creating brand-new OSDs. The new OSDs crash with the same error as above, but for them we are able to get the `free-dump` output, e.g.:

# ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-0/ --command free-dump > free-dump
2022-01-27T13:12:54.482+0000 7fd928c4a3c0 -1 asok(0x55c99cec4000) AdminSocket: request '[{"prefix": "bluestore allocator dump bluefs-db"}]' not defined
failure querying 'bluefs-db'
...

I have uploaded the `free-dump` output here: https://nextcloud.cloudical.net/s/Xy8fDyT3cWWTqDw (Password for the share is: "cephosd")

Would OSD logs with a higher verbosity level help as well?

#6 Updated by Igor Fedotov 7 months ago

Hi Alexander,
thanks for the dump; I finally got a chance to take a look. Sorry for the delay.

I was hoping the dump would be collected from the same OSD that suffered the outage, which, if I understand correctly, isn't the case.
Since the dump is for a different OSD, I can't tell what space BlueFS attempted to allocate, and hence I'm unable to try that allocation against your allocator snapshot.
So if possible, please collect both the free-dump and the startup log for the same OSD, unmodified between probings.

But actually I'm pretty sure the root cause is the lack of contiguous 64K extents, which BlueFS tries to allocate. That in turn is caused by high disk utilization/fragmentation and by the fact that BlueStore uses 4K chunks for user data while BlueFS relies on 64K allocations. Once these allocation policies are mixed on the same volume, larger chunks might become unavailable at some point, even before an out-of-free-space state is reached...

If my preliminary analysis is valid (and I'm 99% sure it is), you might want to start using an additional standalone WAL/DB volume for your OSDs, even if it physically resides on the same disk. This workaround keeps fragmentation on the DB volume less harmful: BlueFS would be able to allocate space as long as there is free space on the DB volume, and even beyond that, since it can spill data over to the main device. In the latter case the issue might reappear, though, so it's important to make the DB volume large enough, which tends to mean some wasted space...
We're currently thinking about redesigning the allocation strategies, but that's evidently not available at the moment. Hence the workaround above.
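In Rook terms, this workaround corresponds to pointing each OSD at a dedicated metadata device in the CephCluster spec. A sketch (node and device names are illustrative):

```yaml
# Sketch: give each OSD a standalone DB/WAL volume so BlueFS's 64K
# allocations don't compete with BlueStore's 4K user-data allocations
storage:
  nodes:
    - name: "node-a"
      devices:
        - name: "sdb"                # main (block) device
          config:
            metadataDevice: "sdc"    # dedicated DB/WAL device
```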

#7 Updated by Igor Fedotov 7 months ago

  • Related to Fix #54299: osd error restart added

#8 Updated by Igor Fedotov 7 months ago

  • Related to Bug #53899: bluefs _allocate allocation failed - BlueFS.cc: 2768: ceph_abort_msg("bluefs enospc") added

#9 Updated by Jan Tilsner 7 months ago

Hi Igor

thank you for your input.
You are correct, the logs and the dump are not from the same OSD; they only share the same error. Sorry for that, but we had to repair the cluster at some point.

Your theory about high utilization might be right in general, but it is not the case in our clusters: we have less than 25% usage.
High fragmentation could indeed be an issue, as volumes are permanently created, used and deleted in those clusters.

Speaking about the 4K / 64K allocation:
If we add bluefs_shared_alloc_size=4096 to our ceph.conf for the OSDs, they start up and run normally for hours. Eventually an OSD crashes with the following error message and restarts:

debug -4> 2022-02-18T11:26:20.074+0000 7f98000f6700 5 bluestore(/var/lib/ceph/osd/ceph-9) _kv_sync_thread utilization: idle 9.130752563s of 10.114350319s, submitted: 326
debug -3> 2022-02-18T11:26:20.534+0000 7f98000f6700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.6/rpm/el8/BUILD/ceph-16.2.6/src/os/bluestore/BlueFS.cc: In function 'int BlueFS::_flush_and_sync_log(std::unique_lock<std::mutex>&, uint64_t, uint64_t)' thread 7f98000f6700 time 2022-02-18T11:26:20.526317+0000
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.6/rpm/el8/BUILD/ceph-16.2.6/src/os/bluestore/BlueFS.cc: 2571: FAILED ceph_assert(bl.length() <= runway)
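For reference, the workaround mentioned above amounts to a ceph.conf fragment like the following (a sketch; as discussed, this is not a fully tested configuration):

```ini
[osd]
# Make BlueFS use the same 4K allocation unit as BlueStore user data,
# so it can use any free 4K chunk instead of needing contiguous 64K extents
bluefs_shared_alloc_size = 4096
```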

We discussed putting the DB/WAL on a standalone volume, but that is not something Rook supports (at least not on the same physical disk).

Setting the OSD config to 4096 is a workaround that keeps our clusters going, but it clearly isn't ideal.
Is there a timeline for the rethinking/redesign of the allocation strategy?

#10 Updated by Igor Fedotov 7 months ago

Hey Jan,

as for high utilization: both of the OSDs I'm aware of are at more than 80% space utilization:
- from attached osd-2.log (capacity vs. free) :
debug -4> 2021-11-15T12:49:14.556+0000 7f86cebf1080 1 bluefs _allocate unable to allocate 0x90000 on bdev 1, allocator name block, allocator type hybrid, capacity 0x31ffc00000, block size 0x1000, free 0x5d008c000, fragmentation 0.211058, allocated 0x0
i.e. ~25 GB free of 200 GB

- from free-dump output:
"capacity": 214744170496,
Free: 0x66b5fb000 (that's what I got after replaying the free-dump in my lab),
i.e. ~27 GB free of ~215 GB
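The hex values from the log and the free-dump convert as follows (bash arithmetic; note these print GiB, i.e. 2^30 bytes, so they read slightly lower than the decimal-GB figures above):

```shell
# capacity / free as reported in the attached osd-2.log
capacity=$(( 0x31ffc00000 ))   # 214744170496 bytes
free=$(( 0x5d008c000 ))        # 24965070848 bytes
echo "osd-2: $(( free / 1024**3 )) GiB free of $(( capacity / 1024**3 )) GiB"

# free value obtained from the free-dump replay
free2=$(( 0x66b5fb000 ))       # 27571236864 bytes
echo "free-dump OSD: $(( free2 / 1024**3 )) GiB free"

# rough utilization percentage (integer math)
echo "osd-2 used: $(( 100 - 100 * free / capacity ))%"
```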

Setting bluefs-shared-alloc-size to 4K generally looks like the proper solution, but unfortunately it's not properly tested and hence apparently not 100% reliable. I'm aware of at least one more case where it didn't work, not to mention its potential performance impact...

And I'm sorry, I wasn't aware of that Rook limitation; good to know.

As for the redesign timeline: we're currently at the early stage of discussing how to proceed, so it's hard to say anything specific. Sorry again.

#11 Updated by Igor Fedotov 7 months ago

Just an addendum to my previous comment.
From free-dump analysis the amount of free space available for 64K allocations is:
Available(strict align): 0x120000

The same for 16K allocations:
Available(strict align): 0x2daf84000

So maybe try bluefs_shared_alloc_size=16384?
But the same concerns as with 4K apply. Plus this is still vulnerable to high fragmentation; there's no guarantee that it won't get worse under the given access pattern...
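Converting the two "Available(strict align)" figures shows how dramatic the difference between 64K- and 16K-aligned free space is (bash arithmetic):

```shell
# Free space reachable with 64K-aligned allocations vs 16K-aligned ones,
# taken from the free-dump analysis above
align64=$(( 0x120000 ))      # 1179648 bytes, i.e. ~1.1 MiB
align16=$(( 0x2daf84000 ))   # 12263636992 bytes, i.e. ~11.4 GiB
echo "64K-alignable: $(( align64 / 1024 )) KiB"
echo "16K-alignable: $(( align16 / 1024 / 1024 )) MiB"
```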

#12 Updated by Adam Kupczyk 7 months ago

  • Assignee set to Igor Fedotov
