Project

General

Profile

Bug #48276

Bug #47751: Hybrid allocator might segfault when fallback allocator is present

OSD Crash with ceph_assert(is_valid_io(off, len))

Added by Bastian Mäuser 10 months ago. Updated 9 months ago.

Status:
Duplicate
Priority:
Normal
Assignee:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hello,

Last night one OSD in a 3-node Cluster crashed with the attached Crashreport. I can pretty much rule out Hardware issues, dmesg shows no errors on that NVME. Hardware is a few month old. There was only very few load on the System when it happened.

OSD came back fine. I want to know more about the root cause.

{
"os_version_id": "10",
"utsname_machine": "x86_64",
"entity_name": "osd.29",
"backtrace": [
"(()+0x12730) [0x7fb4df645730]",
"(gsignal()+0x10b) [0x7fb4df1287bb]",
"(abort()+0x121) [0x7fb4df113535]",
"(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a3) [0x55def3ba0419]",
"(()+0x5115a0) [0x55def3ba05a0]",
"(KernelDevice::aio_write(unsigned long, ceph::buffer::v14_2_0::list&, IOContext*, bool, int)+0x90) [0x55def4214570]",
"(BlueStore::_do_alloc_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>, boost::intrusive_ptr<BlueStore::Onode>, BlueStore::WriteContext*)+0x2237) [0x55def40f4247]",
"(BlueStore::_do_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>, unsigned long, unsigned long, ceph::buffer::v14_2_0::list&, unsigned int)+0x318) [0x55def411cef8]",
"(BlueStore::_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&, unsigned long, unsigned long, ceph::buffer::v14_2_0::list&, unsigned int)+0xda) [0x55def411ddfa]",
"(BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)+0x1671) [0x55def4121481]",
"(BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x3c8) [0x55def4122eb8]",
"(non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<OpRequest>)+0x54) [0x55def3e908b4]",
"(ReplicatedBackend::do_repop(boost::intrusive_ptr<OpRequest>)+0xdf8) [0x55def3f89978]",
"(ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x267) [0x55def3f97ab7]",
"(PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x57) [0x55def3ea8e17]",
"(PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x61f) [0x55def3e5784f]",
"(OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x392) [0x55def3c83f02]",
"(PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x62) [0x55def3f27e92]",
"(OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x7d7) [0x55def3c9fba7]",
"(ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b4) [0x55def426c0c4]",
"(ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55def426ead0]",
"(()+0x7fa3) [0x7fb4df63afa3]",
"(clone()+0x3f) [0x7fb4df1ea4cf]"
],
"assert_line": 864,
"utsname_release": "5.4.65-1-pve",
"assert_file": "/build/ceph-JY24tx/ceph-14.2.11/src/os/bluestore/KernelDevice.cc",
"utsname_sysname": "Linux",
"os_version": "10 (buster)",
"os_id": "10",
"assert_thread_name": "tp_osd_tp",
"assert_msg": "/build/ceph-JY24tx/ceph-14.2.11/src/os/bluestore/KernelDevice.cc: In function 'virtual int KernelDevice::aio_write(uint64_t, ceph::bufferlist&, IOContext*, bool, int)' thread 7fb4c06c2700 time 2020-11-18 03:24:35.419736\n/build/ceph-JY24tx/ceph-14.2.11/src/os/bluestore/KernelDevice.cc: 864: FAILED ceph_assert(is_valid_io(off, len))\n",
"assert_func": "virtual int KernelDevice::aio_write(uint64_t, ceph::bufferlist&, IOContext*, bool, int)",
"ceph_version": "14.2.11",
"os_name": "Debian GNU/Linux 10 (buster)",
"timestamp": "2020-11-18 02:24:35.429967Z",
"process_name": "ceph-osd",
"archived": "2020-11-18 10:14:48.914391",
"utsname_hostname": "px3",
"crash_id": "2020-11-18_02:24:35.429967Z_800333e3-630a-406b-9a0e-c7c345336087",
"assert_condition": "is_valid_io(off, len)",
"utsname_version": "#1 SMP PVE 5.4.65-1 (Mon, 21 Sep 2020 15:40:22 +0200)"
}

cephlog-px2-osd17.gz - Ceph OSD Log (347 KB) Bastian Mäuser, 11/24/2020 12:39 PM

osd147_fsck.log View (18.1 KB) Jonas Jelten, 12/16/2020 05:37 PM


Related issues

Related to bluestore - Bug #46800: Octopus OSD died and fails to start with FAILED ceph_assert(is_valid_io(off, len)) Duplicate

History

#1 Updated by Igor Fedotov 10 months ago

The following patch once merged [and backported] will provide more insight on the issue's root cause.

https://github.com/ceph/ceph/pull/38240

#2 Updated by Igor Fedotov 10 months ago

Meanwhile I see no way to troubleshoot this unless one is able to repro the issue with debug-bdev set to 20.

#3 Updated by Bastian Mäuser 10 months ago

Hi Igor,

thanks for answering.

The thing is:

- Issue isn't reproduceable
- Happens on Production Systems.
- You don't want excessive logging on Production Environments.

An Update: It just happened on another OSD on another Node. I can Upload the Crashreport, but I don't know if it helps because it looks similar.

#4 Updated by Seena Fallah 10 months ago

I got the same issue in nautilus 14.2.14
Full trace: https://paste.ubuntu.com/p/4KHcCG9YQx/

#5 Updated by Bálint Szűcs 10 months ago

I got the same issue in nautilus 14.2.14

it happened four times on different nodes..

#6 Updated by Igor Fedotov 10 months ago

Thanks everybody for updates. Yeah I understand all the complexities for the debugging this sort of issues in a production system...

For now I can only suggest to switch back to bitmap allocator for both bluefs_allocator and bluestore_allocator and monitor if this helps... Perhaps the issue is caused by the new allocator which returns out-of-bound space... But just speculating for now..

It would be great to install the above patch (which is pretty painless as it just logs before the assertion) whenever possible to proceed with the troubleshooting. Hybrid allocator to be restored then unless one is able to reproduce the issue with bitmap one...

#7 Updated by Dan van der Ster 10 months ago

Igor Fedotov wrote:

Thanks everybody for updates. Yeah I understand all the complexities for the debugging this sort of issues in a production system...

For now I can only suggest to switch back to bitmap allocator for both bluefs_allocator and bluestore_allocator and monitor if this helps... Perhaps the issue is caused by the new allocator which returns out-of-bound space... But just speculating for now..

It would be great to install the above patch (which is pretty painless as it just logs before the assertion) whenever possible to proceed with the troubleshooting. Hybrid allocator to be restored then unless one is able to reproduce the issue with bitmap one...

Hi, 14.2.12 also uses hybrid allocator by default but we haven't seen this running lots of 14.2.12 osds.
But 14.2.13/14 have the new `bluestore_volume_selection_policy = use_some_extra` as default. Could that be the cause here? Should we use rocksdb_original config?

#8 Updated by Bálint Szűcs 10 months ago

I got the same issue in nautilus 14.2.11

it happened four times on different nodes..

#9 Updated by Igor Fedotov 10 months ago

v14.2.11 has got hybrid allocator enabled but bluestore_volume_selection_policy was still at original there. Hence the issue is rather not related to the bluestore_volume_selection_policy.
Not to mention that by its nature this looks unrelated too..

#10 Updated by Igor Fedotov 10 months ago

It would be great if one try to grep failing OSD's logs (prior to the assertion) for "constructing fallback allocator" substring and report back if it's there or not.

#11 Updated by Bastian Mäuser 10 months ago

Hi Igor,

it's there, but not prior but after - assertion occured at 20:24:41:

root@px2# zgrep fallback *
ceph-osd.17.log.6.gz:2020-11-18 20:25:44.202 7f7601366c80 1 HybridAllocator _spillover_range constructing fallback allocator

I have attached the OSD Logfile.

#12 Updated by Bastian Mäuser 10 months ago

Bastian Mäuser wrote:

Hi Igor,

it's there, but not prior but after - assertion occured at 20:24:41:

I meant 20:25:41

root@px2# zgrep fallback *
ceph-osd.17.log.6.gz:2020-11-18 20:25:44.202 7f7601366c80 1 HybridAllocator _spillover_range constructing fallback allocator

I have attached the OSD Logfile.

#13 Updated by Igor Fedotov 10 months ago

Bastian, thanks!

#14 Updated by Seena Fallah 10 months ago

I faced this issue again in nautilus 14.2.14 and there is a log about the HybridAllocator

2020-11-26 09:29:21.764 7ff2f2ad3c00  1 HybridAllocator _spillover_range constructing fallback allocator

#15 Updated by Igor Fedotov 10 months ago

Seena Fallah wrote:

I faced this issue again in nautilus 14.2.14 and there is a log about the HybridAllocator
[...]

It this the same OSD as before? If it's the same - how many days has passed since the previous failure?
How many volumes are behind this OSD? How large they are?
How intensive (approximately) is the write load for this OSD? 1% of disk size per day, 10%, 100% or what?

#16 Updated by Seena Fallah 10 months ago

Igor Fedotov wrote:

Seena Fallah wrote:

I faced this issue again in nautilus 14.2.14 and there is a log about the HybridAllocator
[...]

It this the same OSD as before? If it's the same - how many days has passed since the previous failure?
How many volumes are behind this OSD? How large they are?
How intensive (approximately) is the write load for this OSD? 1% of disk size per day, 10%, 100% or what?

No, it wasn't as before. There is just one disk behind each of my OSDs. They are 1.92TiB but ceph shows it 1.7TiB.
At this time this bug just causes in my SSD tier and none of my HDD tier disks get failed! The write disk throughput is about 20MB/s.

Another disk in my SSD tier gets failed again!

#17 Updated by Seena Fallah 10 months ago

Just to prioritize this issue another OSD from my SSD tier fails :(

#18 Updated by Igor Fedotov 10 months ago

Seena Fallah wrote:

Just to prioritize this issue another OSD from my SSD tier fails :(

Mind switching to bitmap allocator as per comment #6?

#19 Updated by Seena Fallah 10 months ago

Did QA and QE run on bitmap allocator too in nautilus 14.2.14?

#20 Updated by Igor Fedotov 10 months ago

Seena Fallah wrote:

Did QA and QE run on bitmap allocator too in nautilus 14.2.14?

Sorry I'm not getting the question...
The idea is just to switch bluefs/bluestore allocators back to bitmap and leave cluster run for a while. OSD restart is required after the switch.

#21 Updated by Seena Fallah 10 months ago

Igor Fedotov wrote:

Seena Fallah wrote:

Did QA and QE run on bitmap allocator too in nautilus 14.2.14?

Sorry I'm not getting the question...
The idea is just to switch bluefs/bluestore allocators back to bitmap and leave cluster run for a while. OSD restart is required after the switch.

I mean does QA runs on nautilus 14.2.14 with a bitmap allocator? This is my production cluster and I want to sure Ceph test all nautilus 14.2.11 till 14.2.14 changes with the bitmap allocator, too to be sure nothing will crash again!

#22 Updated by Igor Fedotov 10 months ago

Seena Fallah wrote:

Igor Fedotov wrote:

Seena Fallah wrote:

Did QA and QE run on bitmap allocator too in nautilus 14.2.14?

Sorry I'm not getting the question...
The idea is just to switch bluefs/bluestore allocators back to bitmap and leave cluster run for a while. OSD restart is required after the switch.

I mean does QA runs on nautilus 14.2.14 with a bitmap allocator? This is my production cluster and I want to sure Ceph test all nautilus 14.2.11 till 14.2.14 changes with the bitmap allocator, too to be sure nothing will crash again!

Ah... I see. Not sure it makes sense to do the switch then... Looks too complicated to me...

#23 Updated by Bastian Mäuser 10 months ago

It would be good to have a confirmation which Version fixes this regression. It must have been introduced after 14.2.9.

#24 Updated by Jonas Jelten 10 months ago

Is there any progress on tracking this down? Is it octopus also affected (#46800)?

I've lost 2 different OSDs running v14.2.13, about 10 days between the failures. They're 10krpm HDDs with 1.1T, and they were on the same host. Not that much IO load.

#25 Updated by Jonas Jelten 10 months ago

Another OSD died, this time on a different host. Also 1.1TiB 10k HDD. I can dump and analyze things there if you like.

#26 Updated by Igor Fedotov 10 months ago

  • Related to Bug #46800: Octopus OSD died and fails to start with FAILED ceph_assert(is_valid_io(off, len)) added

#27 Updated by Igor Fedotov 10 months ago

Jonas Jelten wrote:

Another OSD died, this time on a different host. Also 1.1TiB 10k HDD. I can dump and analyze things there if you like.

Could you please run "ceph-bluestore-tool --path <osd path> --allocator block --command free-dump" and share the output...

#28 Updated by Jonas Jelten 10 months ago

here's the two dumps. first i launched the the osd itself and it crashed. then the bluestore free-dump (which also crashed).

http://home.in.tum.de/~jelten/logs/ceph/osd147_is_valid_io.log.xz
http://home.in.tum.de/~jelten/logs/ceph/osd147_free_dump.log.xz

#29 Updated by Igor Fedotov 10 months ago

Jonas Jelten wrote:

here's the two dumps. first i launched the the osd itself and it crashed. then the bluestore free-dump (which also crashed).

http://home.in.tum.de/~jelten/logs/ceph/osd147_is_valid_io.log.xz
http://home.in.tum.de/~jelten/logs/ceph/osd147_free_dump.log.xz

Hi Jonas,
thanks a lot for the lgos.

Could you please also run "ceph-bluestore-tool fsck" for this OSD and share errors list if any?

#30 Updated by Jonas Jelten 10 months ago

here ya go, fsck crashes in BlueStore::_fsck_check_extents with ceph_assert(pos < bs.size()), so fsck also seeks beyond the device size.

#31 Updated by Igor Fedotov 9 months ago

  • Related to Bug #47751: Hybrid allocator might segfault when fallback allocator is present added

#32 Updated by Igor Fedotov 9 months ago

After some analysis IMO the root cause is highly likely the same as for https://tracker.ceph.com/issues/47751
Under some circumstances Hybrid allocator might claim out-of-bound extent from a fallback bitmap allocator which subsequently results is using this extent for disk write op. Hence the assertion in is_valid_io().
Using bitmap or avl allocators is a recommended workaround for now. Patches to fix allocator are on their ways:

master:
https://github.com/ceph/ceph/pull/37547

octopus:
https://github.com/ceph/ceph/pull/38428

nautilus:
https://github.com/ceph/ceph/pull/38637

#33 Updated by Igor Fedotov 9 months ago

  • Status changed from New to Triaged

#34 Updated by Bastian Mäuser 9 months ago

It would be good if you could issue a workaround howto.

#35 Updated by Jonas Jelten 9 months ago

It's set by bluefs_allocator at bluestore mkfs time: https://github.com/ceph/ceph/blob/b1d0fa70590c23e80a09638df958d763b9c704bb/src/os/bluestore/BlueFS.cc#L582-L584

You can query the OSD's current allocator with ceph daemon osd.$osdid config get bluefs_allocator.

Since I now have 6 dead OSDs it would be really good if the bugfix could actually be released :)

The dead OSDs are all quite old (created around 14.2.3 I think), none of the "newer" ones died so far.

#36 Updated by Bastian Mäuser 9 months ago

@Jonas:

Dead OSD's? permanently? In my case "just" the osd process died and restarted itsself.

#37 Updated by Bastian Mäuser 9 months ago

So is the proposed workaround to set bluefs_allocator to "bitmap" or what? Can I do that on a running cluster?

@Igor: A solid recommendation would be good.

#38 Updated by Jonas Jelten 9 months ago

Yes dead in permanently dead until I recreate the OSD. The drive itself is well, just the OSD data corrupts. The startup fails with the dump from comment 28.

You can change the space allocator when the OSD boots by setting bluefs_allocator = bitmap (or avl, stupid).

This doesn't revive my OSDs though :)

#39 Updated by Bastian Mäuser 9 months ago

Interessting. In my Case (I am on .11, osd's were initially created on .8) I don't need to recreate the OSD's. Maybe laster Versions (.15) have worsened the effect?

Any detailed clarification from the Ceph Developers regarding indication, impact and proposed workaround would be really helpful.

#40 Updated by Igor Fedotov 9 months ago

So generally the issue is that hybrid allocator might return out-of-bound extent while processing write request.
Depending on writing block size I/O request goes either directly to disk or is deferred.
1) When write is direct the assertion fires just ones (due to invalid extent allocation isn't persistent) and OSD is able to restart. Some day the issue might fire again but before that OSD data is consistent.
2) When write is deferred the invalid extent reference is persisted into RocksDB (at both onode extent map and at a deferred operation entry). Then OSD starts to crash while applying deferred ops. The latter takes place on each OSD start attempt. Not to mention the initial crash during regular operation.

The patch I shared earlier just fixes the allocator bug. It doesn't fix the broken data consistency for 2)
And since the bug is in hybrid allocator one can work around the issue by using different allocator type for both bluefs and bluestore. But again this doesn't fix the broken data consistency for 2). This just eliminates new invalid extent allocations.

Potentially it's possible to fix broken OSDs from 2) by repairing invalid offsets in DB by either manual DB cleanup or by implementing such a repair within bluestore fsck.
Wondering if the latter is really needed...

#41 Updated by Dan van der Ster 9 months ago

Igor or others, do you have any insight into which exact conditions can trigger the allocation bug? Any particular use-case, configuration, hdd/ssd, utilization, etc?

Also, for (2) is there a chance that the bug might be silently corrupting osds, only to manifest days or weeks later? Or do you expect even the deferred write scenario to assert within a few seconds of the bug happening.

(I'm just checking all bases -- we have been lucky so far to not see a single instance or this crash on 5000 osds).

#42 Updated by Seena Fallah 9 months ago

Igor Fedotov wrote:

So generally the issue is that hybrid allocator might return out-of-bound extent while processing write request.
Depending on writing block size I/O request goes either directly to disk or is deferred.
1) When write is direct the assertion fires just ones (due to invalid extent allocation isn't persistent) and OSD is able to restart. Some day the issue might fire again but before that OSD data is consistent.
2) When write is deferred the invalid extent reference is persisted into RocksDB (at both onode extent map and at a deferred operation entry). Then OSD starts to crash while applying deferred ops. The latter takes place on each OSD start attempt. Not to mention the initial crash during regular operation.

For (2), Do you mean if we enable for example bluefs_buffered_io? And if yes does it mean this can lead to a data loss because of the broken data that has been written to RocksDB?

#43 Updated by Bastian Mäuser 9 months ago

How does it behave on rbd only usage? Do deferred writes occur there at all? Maybe this is why my errors allowed all osds to restart again?

Which version will have the fix?

The severity of this bug is massive.

And again: is it safe to set the allocator to hybrid while the cluster is running to mitigate for this?

#44 Updated by Dan van der Ster 9 months ago

For (2), Do you mean if we enable for example bluefs_buffered_io?

Deferred writes have nothing to do with buffered io or rbd.
A deferred write is determined by its size and `bluestore_prefer_deferred_size`. By default, hdd writes smaller than 32kB are deferred, otherwise they are direct. And it looks like for ssds all writes are direct by default. My question above was about how long writes can be deferred for -- I assume this is a rather short period, like 30s, but don't know for sure.

And again: is it safe to set the allocator to hybrid while the cluster is running to mitigate for this?

Like Igor said, the mitigation is "using different allocator type for both bluefs and bluestore", i.e.:

[osd]
bluefs_allocator = bitmap
bluestore_allocator = bitmap

and then restart the osd.

#45 Updated by Konstantin Shalygin 9 months ago

(I'm just checking all bases -- we have been lucky so far to not see a single instance or this crash on 5000 osds)

Dan, not all your clusters already on Nautilus?

And again: is it safe to set the allocator to hybrid while the cluster is running to mitigate for this?

Bastian, yes it's safe. And Igor actually waiting for a weeks when you, as affected user, change it on your cluster.

#46 Updated by Dan van der Ster 9 months ago

Konstantin Shalygin wrote:

(I'm just checking all bases -- we have been lucky so far to not see a single instance or this crash on 5000 osds)

Dan, not all your clusters already on Nautilus?

All of our clusters are indeed nautilus, 14.2.11 plus PRs 36909 and 36982; I just can't count so well at 2am during a holiday.

#47 Updated by Seena Fallah 9 months ago

Isn't it better to change the default allocator to bitmap while the bug fix? I have a various heartbeat_map timeout in a day on my SSD tier that I don't know can this be related to it or not!

#48 Updated by Konstantin Shalygin 9 months ago

Seena, Igor already push fixes for hybrid allocator to review for next Nautilus release.

#49 Updated by Seena Fallah 9 months ago

Konstantin Shalygin wrote:

Seena, Igor already push fixes for hybrid allocator to review for next Nautilus release.

Can this be related to my heartbeat_map timed out?

#50 Updated by Igor Fedotov 9 months ago

Bastian Mäuser wrote:

How does it behave on rbd only usage? Do deferred writes occur there at all? Maybe this is why my errors allowed all osds to restart again?

Whether write is deferred depends on input blocks size, osd settings and the target object (overwrites only tend to be deferred).
Hence you likely to have deferred overwrites in your cluster.

But it's just an accident whether you have deferred write or not while hitting a bug.

Which version will have the fix?

Hopefully next Octopus and Nautilus releases.

The severity of this bug is massive.

And again: is it safe to set the allocator to hybrid while the cluster is running to mitigate for this?

You mean setting to bitmap? Yes, it's safe but OSD to be restarted.

#51 Updated by Igor Fedotov 9 months ago

Dan van der Ster wrote:

Igor or others, do you have any insight into which exact conditions can trigger the allocation bug? Any particular use-case, configuration, hdd/ssd, utilization, etc?

The following factors impact the probability to face the bug (not to mention it's specific to Hybrid allocator):
1) High space fragmentation which causes Hybrid allocator (which is a combination of primary AVL allocator plus optional bitmap one) to operate in "fallback" mode where secondary bitmap allocator is used to keep part of free extents to cap RAM usage.
2) Last disk extent is released and put into primary AVL allocator (which needs that extent to be rather long). Once this happens hybrid allocator attempts to merge this extent with some space above disk bound and bitmap allocator might improperly use out-of-bound bits to determine whether such a space if available. Depending on the state of these bits (which is a matter of chance) one can either get or do not get an invalid free extent lasting out of disk space. And once this invalid extent is allocated - disk write fails.
As one of the consequence of the above - if last extent is permanently allocated - the bug wouldn't fire.

3) The information about invalid free extent isn't persistent hence OSD restart fixes the allocator state making the wrong allocation impossible until conditions in 2) are met again. Hence long running OSDs are more vulnerable to the bug.

Also, for (2) is there a chance that the bug might be silently corrupting osds, only to manifest days or weeks later? Or do you expect even the deferred write scenario to assert within a few seconds of the bug happening.

(I'm just checking all bases -- we have been lucky so far to not see a single instance or this crash on 5000 osds).

No, I don't see such a chance. The bug becomes 100% visible when doing disk write to an invalid extent and I recall no scenario when OSDs preserves invalid metadata but totally bypasses disk write. Certainly some delay between user write and disk write might take place for deferred writes but that's no more than a matter of seconds.

#52 Updated by Igor Fedotov 9 months ago

Seena Fallah wrote:

Igor Fedotov wrote:

So generally the issue is that hybrid allocator might return out-of-bound extent while processing write request.
Depending on writing block size I/O request goes either directly to disk or is deferred.
1) When write is direct the assertion fires just ones (due to invalid extent allocation isn't persistent) and OSD is able to restart. Some day the issue might fire again but before that OSD data is consistent.
2) When write is deferred the invalid extent reference is persisted into RocksDB (at both onode extent map and at a deferred operation entry). Then OSD starts to crash while applying deferred ops. The latter takes place on each OSD start attempt. Not to mention the initial crash during regular operation.

For (2), Do you mean if we enable for example bluefs_buffered_io? And if yes does it mean this can lead to a data loss because of the broken data that has been written to RocksDB?

This has no relation to bluefs_buffered_io indeed.

Data loss can occur due to the bug but this is rather related to a failure to restart OSD afterwards. (which is theoretically repairable but we don't have any means for now).

#53 Updated by Igor Fedotov 9 months ago

Seena Fallah wrote:

Konstantin Shalygin wrote:

Seena, Igor already push fixes for hybrid allocator to review for next Nautilus release.

Can this be related to my heartbeat_map timed out?

Implicitly only. Something like the following chain of events.
OSD died due to the bug, this triggers data rebalance which in turn triggers PG migration and hence their removal at source PGs. And massive PG might cause RocksDB to operate slowly which results in hearbeat timeouts.
Certainly this hypothesis is just a speculation for now. More investigation is required.

#54 Updated by Dan van der Ster 9 months ago

Igor Fedotov wrote:

Dan van der Ster wrote:

Igor or others, do you have any insight into which exact conditions can trigger the allocation bug? Any particular use-case, configuration, hdd/ssd, utilization, etc?

The following factors impact the probability to face the bug (not to mention it's specific to Hybrid allocator):
1) High space fragmentation which causes Hybrid allocator (which is a combination of primary AVL allocator plus optional bitmap one) to operate in "fallback" mode where secondary bitmap allocator is used to keep part of free extents to cap RAM usage.

I can find "HybridAllocator _spillover_range constructing fallback allocator" in nearly all osds in our active clusters.

2) Last disk extent is released and put into primary AVL allocator (which needs that extent to be rather long). Once this happens hybrid allocator attempts to merge this extent with some space above disk bound and bitmap allocator might improperly use out-of-bound bits to determine whether such a space if available. Depending on the state of these bits (which is a matter of chance) one can either get or do not get an invalid free extent lasting out of disk space. And once this invalid extent is allocated - disk write fails.
As one of the consequence of the above - if last extent is permanently allocated - the bug wouldn't fire.

Don't know how I could check if this is getting likely. Do any of the `ceph daemon osd.x bluestore allocator dump` commands give a clue here?

To assess the risk (i.e. should we continue running until next nautilus has this fix vs change to bitmap now): Something I don't understand is, there seems to be only 5-10 users suffering this issue, right? It must be remarkably rare, given that telemetry is showing more than 25k osds running affected versions.

Another idea I'm considering is changing the deferred write size for hdd to 0 bytes, dynamically. If I understood correctly, this would prevent any permanent osd corruptions, at the cost of performance until the fix is released or when we can schedule the cluster reboots. I have no idea how bad the performance will get though, so maybe this isn't a viable mitigation.

#55 Updated by Seena Fallah 9 months ago

Igor Fedotov wrote:

Seena Fallah wrote:

Konstantin Shalygin wrote:

Seena, Igor already push fixes for hybrid allocator to review for next Nautilus release.

Can this be related to my heartbeat_map timed out?

Implicitly only. Something like the following chain of events.
OSD died due to the bug, this triggers data rebalance which in turn triggers PG migration and hence their removal at source PGs. And massive PG might cause RocksDB to operate slowly which results in hearbeat timeouts.
Certainly this hypothesis is just a speculation for now. More investigation is required.

It doesn't happen in crash failure time. OSDs are running and suddenly one of the OSDs logs the heartbeat_map timed out and the cluster gets failed! Because I see it in all of my OSD nodes randomly and it's just on my SSD tier I thought it might be related to this issue! Could it be related to? Is there any way I can find out why is it related or not?

#56 Updated by Igor Fedotov 9 months ago

Dan van der Ster wrote:

Igor Fedotov wrote:

Dan van der Ster wrote:

Igor or others, do you have any insight into which exact conditions can trigger the allocation bug? Any particular use-case, configuration, hdd/ssd, utilization, etc?

The following factors impact the probability to face the bug (not to mention it's specific to Hybrid allocator):
1) High space fragmentation which causes Hybrid allocator (which is a combination of primary AVL allocator plus optional bitmap one) to operate in "fallback" mode where secondary bitmap allocator is used to keep part of free extents to cap RAM usage.

I can find "HybridAllocator _spillover_range constructing fallback allocator" in nearly all osds in our active clusters.

2) Last disk extent is released and put into primary AVL allocator (which needs that extent to be rather long). Once this happens hybrid allocator attempts to merge this extent with some space above disk bound and bitmap allocator might improperly use out-of-bound bits to determine whether such a space if available. Depending on the state of these bits (which is a matter of chance) one can either get or do not get an invalid free extent lasting out of disk space. And once this invalid extent is allocated - disk write fails.
As one of the consequence of the above - if last extent is permanently allocated - the bug wouldn't fire.

Don't know how I could check if this is getting likely. Do any of the `ceph daemon osd.x bluestore allocator dump` commands give a clue here?

IMO there is no need to check anything here - there are some chances it might fire. These chances depend on a variety of factors some of which you can't control except the one - allocator type. Hence that's what one should modify at the momemnt.

To assess the risk (i.e. should we continue running until next nautilus has this fix vs change to bitmap now): Something I don't understand is, there seems to be only 5-10 users suffering this issue, right? It must be remarkably rare, given that telemetry is showing more than 25k osds running affected versions.

Yeah it's rare indeed. Some might also miss such a crash if OSD successfully restarts automatically...

Another idea I'm considering is changing the deferred write size for hdd to 0 bytes, dynamically. If I understood correctly, this would prevent any permanent osd corruptions, at the cost of performance until the fix is released or when we can schedule the cluster reboots. I have no idea how bad the performance will get though, so maybe this isn't a viable mitigation.

No, IMO that's a bad idea given the lack of experience in such a tune.
I suggest to switch to either bitmap (which had been a default before with a little concerns) or avl allocators (which will provide the performance similar or even higher that hybrid allocator). For the latter one should monitor OSD RAM utilization though as it might go high when space is fragmented.

#57 Updated by Igor Fedotov 9 months ago

Seena Fallah wrote:

Igor Fedotov wrote:

Seena Fallah wrote:

Konstantin Shalygin wrote:

Seena, Igor already push fixes for hybrid allocator to review for next Nautilus release.

Can this be related to my heartbeat_map timed out?

Implicitly only. Something like the following chain of events.
OSD died due to the bug, this triggers data rebalance which in turn triggers PG migration and hence their removal at source PGs. And massive PG might cause RocksDB to operate slowly which results in hearbeat timeouts.
Certainly this hypothesis is just a speculation for now. More investigation is required.

It doesn't happen in crash failure time. OSDs are running and suddenly one of the OSDs logs the heartbeat_map timed out and the cluster gets failed! Because I see it in all of my OSD nodes randomly and it's just on my SSD tier I thought it might be related to this issue! Could it be related to? Is there any way I can find out why is it related or not?

As I already mentioned this bug could implicitly cause some cluster misbehavior (as any other OSD crash though). But IMO this is unlikely. And I don't know the way to check the relation other than learn the root cause for suicide timeouts in your case...
If you don't want or can't run deep investigation - just enable bluefs_buffered_io and do manual DB compaction for all the OSDs as I recommended in ceph-users mailing list.

#58 Updated by Dan van der Ster 9 months ago

@Igor: Thanks for all the clear info. We've changed to the bitmap allocators on all our hdd clusters. (we left ssd-only clusters with hybrid for now, because as we understand they don't write deferred so won't persist the bad allocation).

#59 Updated by Igor Fedotov 9 months ago

Dan van der Ster wrote:

@Igor: Thanks for all the clear info. We've changed to the bitmap allocators on all our hdd clusters. (we left ssd-only clusters with hybrid for now, because as we understand they don't write deferred so won't persist the bad allocation).

One more clarification on deferred writes - they might still take place for all-flash BlueStore when doing overwrite with block size less than 4K (see BlueStore::_do_write_small() function:
...
// chunk-aligned deferred overwrite?
if (b->get_blob().get_ondisk_length() >= b_off + b_len &&
b_off % chunk_size 0 &&
b_len % chunk_size 0 &&
b->get_blob().is_allocated(b_off, b_len)) {
...
if (!g_conf()->bluestore_debug_omit_block_device_write) {
bluestore_deferred_op_t *op = _get_deferred_op(txc);
op->op = bluestore_deferred_op_t::OP_WRITE;
...

#60 Updated by Igor Fedotov 9 months ago

  • Status changed from Triaged to Duplicate
  • Parent task set to #47751

Also available in: Atom PDF