
Bug #57672

SSD OSD won't start after high fragmentation score!

Added by Vikhyat Umrao 2 months ago. Updated 18 days ago.

Status:
Duplicate
Priority:
Normal
Assignee:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

One of the Rook upstream users reported this issue in the upstream Rook channel.

2022-09-21T17:02:18.340+0000 7ffaa8387540  1 bluefs _allocate unable to allocate 0x80000 on bdev 1, allocator name block, allocator type hybrid, capacity 0x37e3ec00000, block size 0x1000, free 0xc3053b2000, fragmentation 0.948561, allocated 0x0

Related issues

Duplicated by bluestore - Bug #53466: OSD is unable to allocate free space for BlueFS Fix Under Review

History

#1 Updated by Vikhyat Umrao 2 months ago

The user was not able to capture any debug data because it hit the cluster so hard that it went down.

#2 Updated by Vikhyat Umrao 2 months ago

User question:

any idea how to fix it?

#3 Updated by Igor Fedotov 2 months ago

@Vikhyat - what Ceph release are we talking about?

#4 Updated by Igor Fedotov 2 months ago

  • Status changed from New to Need More Info

#5 Updated by Kevin Fox 2 months ago

The cluster involved was provisioned with ceph:v14.2.4-20190917 in Oct 2019. It had been running Nautilus until last month, when it was upgraded to ceph:v15.2.17-20220809.
On Sep 20, in the morning, we upgraded to ceph:v16.2.10-20220805. Everything seemed fine and the workload continued to function until around 10pm that night. Then the cluster fell apart.
When I got in the next morning, things were not functioning and we saw output like:
services:
mon: 3 daemons, quorum a,e,f (age 23h)
mgr: a(active, since 23h)
mds: 1/1 daemons up, 1 hot standby
osd: 152 osds: 61 up (since 7h), 113 in (since 10h); 236 remapped pgs
There are 38 HDDs in an archive set and 114 in the busy SSD set. All but 23 of the SSDs were down.

After digging in, the down ones were all in a crash loop with messages like:
debug 2022-09-21T15:47:58.330+0000 7fdbe71fe200 4 rocksdb: [db_impl/db_impl_open.cc:760] Recovering log #370981 mode 2
debug 2022-09-21T15:47:59.832+0000 7fdbe71fe200 3 rocksdb: [le/block_based/filter_policy.cc:584] Using legacy Bloom filter with high (20) bits/key. Dramatic filter space and/or accuracy improvement is available
with format_version>=5.
debug 2022-09-21T15:48:02.582+0000 7fdbe71fe200 1 bluefs _allocate unable to allocate 0x90000 on bdev 1, allocator name block, allocator type hybrid, capacity 0x37e3ec00000, block size 0x1000, free 0x9ddc9bc000
, fragmentation 0.935907, allocated 0x0
debug 2022-09-21T15:48:02.582+0000 7fdbe71fe200 -1 bluefs _allocate allocation failed, needed 0x80af0
debug 2022-09-21T15:48:02.582+0000 7fdbe71fe200 -1 bluefs _flush_range allocated: 0x0 offset: 0x0 length: 0x80af0
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.10/rpm/el8/BUILD/ceph-16.2.10/src/os/bluestore/BlueFS.cc: I
n function 'int BlueFS::_flush_range(BlueFS::FileWriter*, uint64_t, uint64_t)' thread 7fdbe71fe200 time 2022-09-21T15:48:02.584656+0000
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.10/rpm/el8/BUILD/ceph-16.2.10/src/os/bluestore/BlueFS.cc: 2
768: ceph_abort_msg("bluefs enospc")
ceph version 16.2.10 (45fa1a083152e41a408d15505f594ec5f1b4fe17) pacific (stable)

#6 Updated by Kevin Fox 2 months ago

I switched one of the pods to /bin/bash and tried various things to fsck the OSDs. Every time it hit the point where it would try to access the data, it would fail similarly to the above. Even read-only operations, like just statting things, failed the same way.

After hours of testing/debugging we managed to get the first one up using steps 1-5 in the last comment of:
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/6L5ZCE47LS7KOZKMVVGG2QWX6TWTTCGU/
No other suggestions on the thread worked.

So, adding space was the only solution. I came up with a final procedure late that night. I had to do some things not documented in the thread to get ceph-volume lvm activate to automatically work with the new database volumes and start properly.

The remaining up OSDs were kind of deadlocked but throwing some other errors. After a restart, they hit the same allocation issue as the rest of them.

The morning of the 22nd, I applied the procedure to move the db to a separate volume on all the OSDs (it took most of the day). The procedure worked on most of the OSDs, but 12 of them wouldn't allow moving the db to the separate volume; they complained similarly about not being able to allocate space.

Bringing up the OSDs that were able to successfully run through the procedure, there were still unavailable/unknown/inactive PGs.

I then took a list of the remaining down OSDs. For all but 4 of the remainder, I had to add space by adding an additional drive partition as a PV, extending the OSD VG with it, extending the OSD LV, and then doing a ceph-bluestore-tool bluefs-bdev-expand.

That made all data available. I then removed the last 4 broken OSDs from the cluster and re-added them over the weekend.

#7 Updated by Kevin Fox 2 months ago

I can find no evidence that the cluster got full. I've seen it occasionally go a little past 85% (usually if I'm re-balancing things). So it was very unexpected to see the allocation failure about lack of free space, along with all the OSDs having the same issue.

#8 Updated by Vikhyat Umrao 2 months ago

Thank you, Igor. I think Kevin answered with as much background as he had on the issue.

#9 Updated by Igor Fedotov about 2 months ago

Kevin Fox wrote:

I can find no evidence that the cluster got full. I've seen it occasionally go a little past 85% (usually if I'm re-balancing things). So it was very unexpected to see the allocation failure about lack of free space, along with all the OSDs having the same issue.

Hey Kevin,
first of all thanks a lot for the thorough overview above.

Now let me provide some insight into why ENOSPC might be returned even when there seems to be available space on a disk. This primarily applies to a single shared-disk setup where the main and DB volumes share a single partition/LV. I presume this is definitely the case for your SSD OSDs (and apparently for the HDD ones too). But a spilled-over standalone BlueFS might suffer from the same issue as well. Are my assumptions about your disk layout valid?

So one should realize that BlueFS (which stands behind the DB) uses a larger minimum allocation unit than the one for user data (64K vs. 4K). Hence, after long, extensive usage the disk space might reach a highly fragmented state (due to previous 4K alloc/release cycles) in which there are no contiguous (and properly aligned) 64K chunks left. In that case the allocator is unable to provide more space for RocksDB/BlueFS and ENOSPC is returned. Highly likely this is what happened in your case.
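To make the failure mode concrete, here is a toy model (not Ceph code; the allocator logic and numbers are illustrative assumptions only) of how a device can report gigabytes free yet fail a single 64K-aligned BlueFS allocation:

```python
# Toy model (not Ceph code) of why a device can report plenty of free
# space yet fail a 64K BlueFS allocation: 4K alloc/release cycles can
# leave only small, scattered free extents.

BLOCK = 0x1000          # 4K: minimum allocation unit for user data
BLUEFS_UNIT = 0x10000   # 64K: minimum allocation unit for BlueFS

def free_extents(bitmap):
    """Yield (offset, length) runs of free 4K blocks."""
    run_start = None
    for i, used in enumerate(bitmap):
        if not used and run_start is None:
            run_start = i
        elif used and run_start is not None:
            yield run_start * BLOCK, (i - run_start) * BLOCK
            run_start = None
    if run_start is not None:
        yield run_start * BLOCK, (len(bitmap) - run_start) * BLOCK

def can_allocate(bitmap, want, unit):
    """Check for a contiguous, unit-aligned extent of size `want`."""
    for off, length in free_extents(bitmap):
        aligned = (off + unit - 1) // unit * unit  # round up to alignment
        if off + length - aligned >= want:
            return True
    return False

# 1024 blocks = a 4M device where every other 4K block is in use:
# 50% of the space is free, but no contiguous 64K extent exists.
bitmap = [i % 2 == 0 for i in range(1024)]
free_bytes = sum(not b for b in bitmap) * BLOCK

print(f"free: {free_bytes:#x} bytes")  # 2M free
print("64K alloc ok:", can_allocate(bitmap, BLUEFS_UNIT, BLUEFS_UNIT))  # False
print("4K alloc ok:", can_allocate(bitmap, BLOCK, BLOCK))               # True
```

This mirrors the log lines above: `free 0xc3053b2000` is large, yet `_allocate` fails for `0x80000` because no suitably contiguous chunk remains.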

I need to mention that we previously had some bugs in the AVL/Hybrid allocators which caused unexpected ENOSPC returns as well. E.g. see https://tracker.ceph.com/issues/50656
These should be fixed in 16.2.10, but there could be other bugs as well. So it might be helpful to try switching to the bitmap/stupid allocator when such an issue pops up, to make sure it's not yet another allocator bug.
Generally there is no ideal solution that 100% guarantees safety in this respect. Using a standalone DB volume would mostly help, but it is apparently less cost-effective, as one has to pre-allocate a large enough volume, which tends to be over-provisioned. And failing to provide a large enough DB volume would lead us back to the unsafe state if BlueFS spillover occurs.

We could make a deeper investigation if you have a broken OSD. But I presume there are none left for dissection any more, right?
It's pretty surprising that this happened simultaneously for such a large bunch of OSDs...

#10 Updated by Kevin Fox about 2 months ago

Hi Igor,

Thanks for the details. That makes sense and helps me feel much more comfortable that the hack I put in place to get it going again will, at least for now, keep it running rather than failing randomly again.

Yeah. When the issue first started there was only one disk per OSD for all the OSDs. So that tracks.

Does the fragmentation score alone show how fragmented things are? I still see most OSDs at .93+ after moving the database to 30G volumes, and the usage (ceph_bluefs_db_used_bytes/ceph_bluefs_db_total_bytes) is in the 30% range now. The recently formatted OSDs quickly hit .80+ fragmentation. So, I'm thinking the fragmentation score alone matters a lot on a single drive, but maybe not so much when the db is on a separate volume? Are there other metrics I should track here? With the db on separate volumes, it seems like the thing to track is ceph_bluefs_db_used_bytes/ceph_bluefs_db_total_bytes > .90.

One of the things I tried when bringing the OSDs back up originally was changing the allocator to block. That didn't clear the issue, so I set it back to hybrid. I hadn't known to try the stupid allocator. Hopefully, I'll never get the "opportunity" in the future. :)

I'm OK over-allocating some storage to ensure stability. It's unfortunate that the recommendations make it seem like having a separate db volume is a fairly optional thing to do. I'd consider at least putting a caveat in the documentation saying it's optional if you never run the OSD space past 50%, otherwise strongly recommended, or some other rule of thumb. Also, some kind of Ceph cluster warning state on fragmentation being too high might help prevent this kind of problem for others.

Thanks,
Kevin

#11 Updated by Kevin Fox about 2 months ago

One note I see in the rook documentation:
"Notably, ceph-volume will not use a device of the same device class (HDD, SSD, NVMe) as OSD devices for metadata, resulting in this failure."

This seems very wrong with respect to this issue. It can be life-saving to have the db separated from the block store, so an SSD DB alongside an SSD block device should still be allowed?

Thanks,
Kevin

#12 Updated by Kevin Fox about 2 months ago

Just saw this again, on a small scale. Just one of the OSDs that I had moved the db off to its own volume entered a crash loop with:
debug -9> 2022-09-29T15:48:18.404+0000 7ffb68d58200 4 rocksdb: [version_set.cc:4568] Recovered from manifest file:db/MANIFEST-352469 succeeded,manifest_file_number is 352469, next_file_number is 367828, last_sequence is 104078451533, log_number is 367825,prev_log_number is 0,max_column_family is 0,min_log_number_to_keep is 0

debug -8> 2022-09-29T15:48:18.404+0000 7ffb68d58200 4 rocksdb: [version_set.cc:4577] Column family [default] (ID 0), log number is 367825

debug -7> 2022-09-29T15:48:18.405+0000 7ffb68d58200 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1664466498406749, "job": 1, "event": "recovery_started", "log_files": [367825]}
debug -6> 2022-09-29T15:48:18.405+0000 7ffb68d58200 4 rocksdb: [db_impl/db_impl_open.cc:760] Recovering log #367825 mode 2
debug -5> 2022-09-29T15:48:19.439+0000 7ffb68d58200 3 rocksdb: [le/block_based/filter_policy.cc:584] Using legacy Bloom filter with high (20) bits/key. Dramatic filter space and/or accuracy improvement is available with format_version>=5.
debug -4> 2022-09-29T15:48:27.861+0000 7ffb68d58200 1 bluefs _allocate unable to allocate 0x80000 on bdev 1, allocator name block, allocator type hybrid, capacity 0x3eebec00000, block size 0x1000, free 0x11376261000, fragmentation 0.947781, allocated 0x0
debug -3> 2022-09-29T15:48:27.861+0000 7ffb68d58200 -1 bluefs _allocate allocation failed, needed 0x729ce
debug -2> 2022-09-29T15:48:27.861+0000 7ffb68d58200 -1 bluefs _flush_range allocated: 0x210000 offset: 0x2022fa length: 0x806d4
debug -1> 2022-09-29T15:48:27.878+0000 7ffb68d58200 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.10/rpm/el8/BUILD/ceph-16.2.10/src/os/bluestore/BlueFS.cc: In function 'int BlueFS::_flush_range(BlueFS::FileWriter*, uint64_t, uint64_t)' thread 7ffb68d58200 time 2022-09-29T15:48:27.863299+0000
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.10/rpm/el8/BUILD/ceph-16.2.10/src/os/bluestore/BlueFS.cc: 2768: ceph_abort_msg("bluefs enospc")

#13 Updated by Kevin Fox about 2 months ago

Got some more info... during the outage, I had 12 drives that wouldn't recover by moving off the db. Looking back through the logs, this drive was one where I had to use the alternate approach: add a second drive PV to the VG, extend the LV, and expand... so it doesn't have a separate db volume. So it's probably the same problem as before.

#14 Updated by Kevin Fox about 2 months ago

Random other thing... during repairs, I see:
[root@pc20 ceph]# ceph-bluestore-tool --log-level 30 --path /var/lib/ceph/osd/ceph-$ROOK_OSD_ID --command repair

2022-09-29T16:24:24.126+0000 7f4d197bb540 -1 bluestore(/var/lib/ceph/osd/ceph-23) fsck warning: store not yet converted to per-pg omap
2022-09-29T16:24:24.177+0000 7f4d197bb540 -1 bluestore(/var/lib/ceph/osd/ceph-23) fsck warning: #-1:c1a3fc6e:::purged_snaps:0# has omap that is not per-pg or pgmeta
2022-09-29T16:24:33.464+0000 7f4d197bb540 -1 bluestore(/var/lib/ceph/osd/ceph-23) fsck warning: #1:04ab7765:::csi.volume.f1293286-8997-11ea-80b3-4e9e86dc2a19:head# has omap that is not per-pg or pgmeta
2022-09-29T16:24:40.522+0000 7f4d197bb540 -1 bluestore(/var/lib/ceph/osd/ceph-23) fsck warning: #1:09871afc:::csi.volume.7786c49a-8bc7-11ea-9888-4e9e86dc2a19:head# has omap that is not per-pg or pgmeta
2022-09-29T16:24:48.589+0000 7f4d197bb540 -1 bluestore(/var/lib/ceph/osd/ceph-23) fsck warning: #1:0faa3e17:::csi.volume.04abe18e-29d8-11eb-b20e-b2a45f9bcc74:head# has omap that is not per-pg or pgmeta
2022-09-29T16:25:02.622+0000 7f4d197bb540 -1 bluestore(/var/lib/ceph/osd/ceph-23) fsck warning: #1:18d724d6:::csi.volume.dbc7a9d1-312f-11ec-aafe-128649be98fe:head# has omap that is not per-pg or pgmeta
2022-09-29T16:25:12.442+0000 7f4d197bb540 -1 bluestore(/var/lib/ceph/osd/ceph-23) fsck warning: #1:1a2aaab2:::csi.volume.af4f4b1f-29d5-11eb-b20e-b2a45f9bcc74:head# has omap that is not per-pg or pgmeta

Is this expected?

#15 Updated by Igor Fedotov about 2 months ago

Kevin Fox wrote:

Hi Igor,

Does the fragmentation score alone show how fragmented things are? I still see most OSDs at .93+ after moving the database to 30G volumes, and the usage (ceph_bluefs_db_used_bytes/ceph_bluefs_db_total_bytes) is in the 30% range now. The recently formatted OSDs quickly hit .80+ fragmentation. So, I'm thinking the fragmentation score alone matters a lot on a single drive, but maybe not so much when the db is on a separate volume? Are there other metrics I should track here? With the db on separate volumes, it seems like the thing to track is ceph_bluefs_db_used_bytes/ceph_bluefs_db_total_bytes > .90.

The issue with that fragmentation score is that there is no strong math behind it. Originally it was introduced to briefly get some insight (at a cheap computational cost!) into the fragmentation state. Primarily it is no more than a pretty simple indicator/signal that more extensive inspection is needed (which in fact isn't implemented). So I don't feel comfortable exposing it for making global decisions... Not to mention that I doubt the space fragmentation state can be well described by a single number.
A simple example - how can one express (and compare) the following two cases with a single number:
1) There is one 4K chunk and one 1M chunk.
2) There are two 512K chunks.

Which is more fragmented?

IMO this should be (at least?) a vector showing how many chunks of each length range are available, plus some logic on top of it to decide whether this is critical in any given aspect.
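Such a vector could be sketched roughly as follows (a hypothetical illustration, not Ceph code; the bucketing scheme and function names are assumptions): bucket free extents by power-of-two size, then check whether any bucket can still serve a BlueFS-sized request.

```python
# Hypothetical sketch (not Ceph code): summarize free space as a vector of
# chunk counts per power-of-two size bucket, instead of a single score.

def chunk_histogram(extent_lengths):
    """Map bucket exponent b -> count, where bucket b holds extent
    lengths in [2^b, 2^(b+1))."""
    hist = {}
    for length in extent_lengths:
        b = length.bit_length() - 1   # floor(log2(length))
        hist[b] = hist.get(b, 0) + 1
    return hist

def can_serve(hist, want):
    """True if some free chunk is guaranteed to be >= want bytes."""
    need = (want - 1).bit_length()    # smallest b with 2^b >= want
    return any(b >= need and n > 0 for b, n in hist.items())

# Free extents in bytes: lots of total space, but all in small pieces.
free = [0x1000] * 500 + [0x8000] * 20   # 500 x 4K + 20 x 32K
hist = chunk_histogram(free)

print(hist)                       # {12: 500, 15: 20}
print(can_serve(hist, 0x10000))   # False: no 64K chunk available
```

Unlike a single score, the histogram immediately shows that this layout can still serve 4K user-data allocations while 64K BlueFS allocations must fail.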

And yes, ceph_bluefs_db_used_bytes/ceph_bluefs_db_total_bytes > .90 might to some degree be an indicator (yellow rather than red) for a standalone DB volume, since it warns about an imminent spillover event - which in turn can lead to either perf degradation (if the main device is slower) or ENOSPC. The latter isn't deterministic though - the main device might still have plenty of unfragmented space...

One of the things I tried when bringing the OSDs back up originally was changing the allocator to block. That didn't clear the issue, so I set it back to hybrid. I hadn't known to try the stupid allocator. Hopefully, I'll never get the "opportunity" in the future. :)

I'm OK over-allocating some storage to ensure stability. It's unfortunate that the recommendations make it seem like having a separate db volume is a fairly optional thing to do. I'd consider at least putting a caveat in the documentation saying it's optional if you never run the OSD space past 50%, otherwise strongly recommended, or some other rule of thumb. Also, some kind of Ceph cluster warning state on fragmentation being too high might help prevent this kind of problem for others.

Actually, we've never been planning/expecting fragmentation to be that critical... Nor do I think we'll finally beat it (globally/permanently) through tricks with the DB volume layout. It should be either defragmentation (which makes sense mostly for spinners) or making BlueFS use a 4K allocation unit. In the latter case we would get rid of the issue you're facing automatically.

Hence the little attention in the current upstream recommendations. In other words, my suggestion to use a standalone DB volume is mostly a workaround, not a full-grade solution...

#16 Updated by Igor Fedotov about 2 months ago

Kevin Fox wrote:

One note I see in the rook documentation:
"Notably, ceph-volume will not use a device of the same device class (HDD, SSD, NVMe) as OSD devices for metadata, resulting in this failure."

This seems very wrong with respect to this issue. It can be life-saving to have the db separated from the block store, so an SSD DB alongside an SSD block device should still be allowed?

Thanks,
Kevin

Apparently this has the same root cause as the lack of proper recommendations - a standalone DB volume is a workaround, not a final solution...

#17 Updated by Igor Fedotov about 2 months ago

Kevin Fox wrote:

Just saw this again, on a small scale. Just one of the OSDs that I had moved the db off to its own volume entered a crash loop with:
debug -9> 2022-09-29T15:48:18.404+0000 7ffb68d58200 4 rocksdb: [version_set.cc:4568] Recovered from manifest file:db/MANIFEST-352469 succeeded,manifest_file_number is 352469, next_file_number is 367828, last_sequence is 104078451533, log_number is 367825,prev_log_number is 0,max_column_family is 0,min_log_number_to_keep is 0

debug -8> 2022-09-29T15:48:18.404+0000 7ffb68d58200 4 rocksdb: [version_set.cc:4577] Column family [default] (ID 0), log number is 367825

debug -7> 2022-09-29T15:48:18.405+0000 7ffb68d58200 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1664466498406749, "job": 1, "event": "recovery_started", "log_files": [367825]}
debug -6> 2022-09-29T15:48:18.405+0000 7ffb68d58200 4 rocksdb: [db_impl/db_impl_open.cc:760] Recovering log #367825 mode 2
debug -5> 2022-09-29T15:48:19.439+0000 7ffb68d58200 3 rocksdb: [le/block_based/filter_policy.cc:584] Using legacy Bloom filter with high (20) bits/key. Dramatic filter space and/or accuracy improvement is available with format_version>=5.
debug -4> 2022-09-29T15:48:27.861+0000 7ffb68d58200 1 bluefs _allocate unable to allocate 0x80000 on bdev 1, allocator name block, allocator type hybrid, capacity 0x3eebec00000, block size 0x1000, free 0x11376261000, fragmentation 0.947781, allocated 0x0
debug -3> 2022-09-29T15:48:27.861+0000 7ffb68d58200 -1 bluefs _allocate allocation failed, needed 0x729ce
debug -2> 2022-09-29T15:48:27.861+0000 7ffb68d58200 -1 bluefs _flush_range allocated: 0x210000 offset: 0x2022fa length: 0x806d4
debug -1> 2022-09-29T15:48:27.878+0000 7ffb68d58200 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.10/rpm/el8/BUILD/ceph-16.2.10/src/os/bluestore/BlueFS.cc: In function 'int BlueFS::_flush_range(BlueFS::FileWriter*, uint64_t, uint64_t)' thread 7ffb68d58200 time 2022-09-29T15:48:27.863299+0000
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.10/rpm/el8/BUILD/ceph-16.2.10/src/os/bluestore/BlueFS.cc: 2768: ceph_abort_msg("bluefs enospc")

If this is still available - may I ask you to run ceph-bluestore-tool's free-dump command and share the output?

#18 Updated by Igor Fedotov about 2 months ago

Kevin Fox wrote:

Random other thing... during repairs, I see:
[root@pc20 ceph]# ceph-bluestore-tool --log-level 30 --path /var/lib/ceph/osd/ceph-$ROOK_OSD_ID --command repair

2022-09-29T16:24:24.126+0000 7f4d197bb540 -1 bluestore(/var/lib/ceph/osd/ceph-23) fsck warning: store not yet converted to per-pg omap
2022-09-29T16:24:24.177+0000 7f4d197bb540 -1 bluestore(/var/lib/ceph/osd/ceph-23) fsck warning: #-1:c1a3fc6e:::purged_snaps:0# has omap that is not per-pg or pgmeta
2022-09-29T16:24:33.464+0000 7f4d197bb540 -1 bluestore(/var/lib/ceph/osd/ceph-23) fsck warning: #1:04ab7765:::csi.volume.f1293286-8997-11ea-80b3-4e9e86dc2a19:head# has omap that is not per-pg or pgmeta
2022-09-29T16:24:40.522+0000 7f4d197bb540 -1 bluestore(/var/lib/ceph/osd/ceph-23) fsck warning: #1:09871afc:::csi.volume.7786c49a-8bc7-11ea-9888-4e9e86dc2a19:head# has omap that is not per-pg or pgmeta
2022-09-29T16:24:48.589+0000 7f4d197bb540 -1 bluestore(/var/lib/ceph/osd/ceph-23) fsck warning: #1:0faa3e17:::csi.volume.04abe18e-29d8-11eb-b20e-b2a45f9bcc74:head# has omap that is not per-pg or pgmeta
2022-09-29T16:25:02.622+0000 7f4d197bb540 -1 bluestore(/var/lib/ceph/osd/ceph-23) fsck warning: #1:18d724d6:::csi.volume.dbc7a9d1-312f-11ec-aafe-128649be98fe:head# has omap that is not per-pg or pgmeta
2022-09-29T16:25:12.442+0000 7f4d197bb540 -1 bluestore(/var/lib/ceph/osd/ceph-23) fsck warning: #1:1a2aaab2:::csi.volume.af4f4b1f-29d5-11eb-b20e-b2a45f9bcc74:head# has omap that is not per-pg or pgmeta

Is this expected?

This is totally unrelated - these are warnings showing legacy-formatted omaps for this OSD. It just hasn't been converted yet (which is done via ceph-bluestore-tool's repair/quick-fix). This causes no harm beyond incomplete (or - pretty unlikely - broken) cluster stats reporting.

#19 Updated by Kevin Fox about 2 months ago

Igor Fedotov wrote:

This is totally unrelated - these are warnings showing legacy-formatted omaps for this OSD. It just hasn't been converted yet (which is done via ceph-bluestore-tool's repair/quick-fix). This causes no harm beyond incomplete (or - pretty unlikely - broken) cluster stats reporting.

Ok, thanks.

#20 Updated by Kevin Fox about 2 months ago

Igor Fedotov wrote:

If this is still available - may I ask you to run ceph-bluestore-tool's free-dump command and share the output?

Unfortunately, I already repaired it by moving the db off to its own volume. Sorry.

If I see it again I'll try to do that.

#21 Updated by Kevin Fox about 2 months ago

Igor Fedotov wrote:

The issue with that fragmentation score is that there is no strong math behind it. Originally it was introduced to briefly get some insight (at a cheap computational cost!) into the fragmentation state. Primarily it is no more than a pretty simple indicator/signal that more extensive inspection is needed (which in fact isn't implemented). So I don't feel comfortable exposing it for making global decisions... Not to mention that I doubt the space fragmentation state can be well described by a single number.
A simple example - how can one express (and compare) the following two cases with a single number:
1) There is one 4K chunk and one 1M chunk.
2) There are two 512K chunks.

Which is more fragmented?

Yeah, thanks. I think I understand the problem, and that it's tricky. But from a user's side, any indicator of possible impending disaster is better than being completely blind. Having it in Prometheus would let us start keeping an eye on it if it gets too high.

In particular, 4 of the drives I recently formatted after the incident are already at:
"fragmentation_rating": 0.87042085849503159
"fragmentation_rating": 0.86825600918096946
"fragmentation_rating": 0.86712285657496924
"fragmentation_rating": 0.86201485424058399

So I'm probably going to need to do something with them very soon. Having that number is helping me judge when I need to deal with that.
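Watching those numbers can be automated with a small helper along these lines (hypothetical, not part of Ceph; the OSD ids, the 0.85 threshold, and the function name are made up for illustration), fed with per-OSD JSON like the `fragmentation_rating` snippets quoted above:

```python
import json

# Hypothetical helper (not part of Ceph): flag OSDs whose
# fragmentation_rating exceeds a threshold, given per-OSD JSON
# output such as the values quoted above.

def flag_fragmented(ratings, threshold=0.85):
    """ratings: dict of OSD id -> JSON string like
    '{"fragmentation_rating": 0.87}'. Returns ids over threshold."""
    flagged = []
    for osd_id, blob in ratings.items():
        score = json.loads(blob)["fragmentation_rating"]
        if score > threshold:
            flagged.append(osd_id)
    return sorted(flagged)

# Example values (OSD ids are made up):
ratings = {
    12: '{"fragmentation_rating": 0.87042085849503159}',
    37: '{"fragmentation_rating": 0.86825600918096946}',
    80: '{"fragmentation_rating": 0.62}',
}
print(flag_fragmented(ratings))   # [12, 37]
```

As discussed above, the score is only a rough signal, so a flagged OSD is a prompt for closer inspection rather than proof of imminent failure.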

Next up is figuring out how to do the db move to an LVM volume when the underlying block device is raw. Apparently rook only does raw volumes now, unless it can put the metadata db on a faster class of drive (which I currently do not have).

And yes, ceph_bluefs_db_used_bytes/ceph_bluefs_db_total_bytes > .90 might to some degree be an indicator (yellow rather than red) for a standalone DB volume, since it warns about an imminent spillover event - which in turn can lead to either perf degradation (if the main device is slower) or ENOSPC. The latter isn't deterministic though - the main device might still have plenty of unfragmented space...

Thanks. At this point, any indicator of a possible problem is better than none. If it's always safe under yellow, but possibly unsafe beyond that, that's still a useful metric. I can sleep much more soundly if it's under that line, I think.

Actually, we've never been planning/expecting fragmentation to be that critical... Nor do I think we'll finally beat it (globally/permanently) through tricks with the DB volume layout. It should be either defragmentation (which makes sense mostly for spinners) or making BlueFS use a 4K allocation unit. In the latter case we would get rid of the issue you're facing automatically.

Hence the little attention in the current upstream recommendations. In other words, my suggestion to use a standalone DB volume is mostly a workaround, not a full-grade solution...

Yup, I understand. But at this point I'd say that for at least one workload, fragmentation is a major problem, and since it's unlikely that a defragmentation tool or making BlueFS use 4K (which sounds like a good solution) will land in the next X months, it may be worth getting the word out / tweaking some documentation to let people know this can be a major problem, so they can work around it before it gets as bad as what happened to us.

One more data point. Of the 8 drives I initially repaired by extending the LV with another disk, each got a 500GB extension. The one that failed today had exhausted its 500GB of extra space. In under a week. This problem did not happen until we upgraded to Pacific, and we've been running the same workload for years. So I think some change in the allocator in Pacific has made it substantially better at causing fragmentation. It may be that other clusters get there with much less workload, just over longer periods of time. Something to consider.

#22 Updated by Kevin Fox about 2 months ago

So, it looks like moving the db to a db volume works with ceph-bluestore-tool bluefs-bdev-migrate. So that's most of the way to fixing the last issue.

How does ceph-volume activate raw try to autodetect it? I'm missing some labels or something to get it to autodetect the block device along with the block.db link.

#23 Updated by Kevin Fox about 2 months ago

For the record, ssd/ssd or hdd/hdd seems to work fine even though the documentation makes it sound like it doesn't.

This kind of thing works with rook:

    - name: "minikube-m03" 
      devices:
      - name: "vdb" 
        config:
          metadataDevice: db/db1
      - name: "vdc" 
        config:
          metadataDevice: db/db2

Probably should update the documentation of both ceph and rook so that others don't spend a lot of time trying to work around something that just works.

#24 Updated by Igor Fedotov about 2 months ago

Kevin Fox wrote:

For the record, ssd/ssd or hdd/hdd seems to work fine even though the documentation makes it sound like it doesn't.

This kind of thing works with rook:
[...]

Probably should update the documentation of both ceph and rook so that others don't spend a lot of time trying to work around something that just works.

Mind creating another ticket?

#26 Updated by Kevin Fox about 2 months ago

I created an issue to surface the fragmentation score via prom here: https://tracker.ceph.com/issues/57785

Not a 100% solution, but still useful for this kind of issue.

#27 Updated by Igor Fedotov 18 days ago

  • Status changed from Need More Info to Duplicate

#28 Updated by Igor Fedotov 18 days ago

  • Duplicated by Bug #53466: OSD is unable to allocate free space for BlueFS added
