Bug #53139 (closed): OSD might wrongly attempt to use "slow" device when single device is backing the store

Added by Igor Fedotov over 2 years ago. Updated over 2 years ago.

Status: Resolved
Priority: Urgent
Target version: -
% Done: 0%
Backport: pacific
Regression: No
Severity: 2 - major
Pull request ID: 43818

Description

This looks like a regression introduced by https://github.com/ceph/ceph/pull/42992
Providing RocksDB with an additional db.slow path allows the DB to use it, which results in an attempt to get space from the allocator of the non-present main ("slow") device and a subsequent BlueFS ENOSPC error.

In the case we observed, this prevents the OSD from starting. The only known workaround (for Pacific) so far is to downgrade to a version prior to 16.2.6.
The bug is applicable to single-device OSDs only.

```
Nov 03 12:41:19 ceph-osd11 ceph-osd[3592789]: -1288> 2021-11-03T12:41:18.649+0000 7f59bc519700 10 bluefs open_for_write db.slow/371619.sst
Nov 03 12:41:19 ceph-osd11 ceph-osd[3592789]: -1287> 2021-11-03T12:41:18.649+0000 7f59bc519700 20 bluefs open_for_write mapping db.slow/371619.sst vsel_hint 0x2
Nov 03 12:41:19 ceph-osd11 ceph-osd[3592789]: -1286> 2021-11-03T12:41:18.649+0000 7f59bc519700 10 bluefs open_for_write h 0x55a7307d1a40 on file(ino 361633 size 0x0 mtime 2021-11-03T12:41:18.655136+0000 allocated 0 extents [])
...
Nov 03 12:41:19 ceph-osd11 ceph-osd[3592789]: -452> 2021-11-03T12:41:18.753+0000 7f59bc519700 10 bluefs _flush 0x55a7307d1a40 0x0~80020 to file(ino 361633 size 0x0 mtime 2021-11-03T12:41:18.655136+0000 allocated 0 extents [])
Nov 03 12:41:19 ceph-osd11 ceph-osd[3592789]: -451> 2021-11-03T12:41:18.753+0000 7f59bc519700 10 bluefs _flush_range 0x55a7307d1a40 pos 0x0 0x0~80020 to file(ino 361633 size 0x0 mtime 2021-11-03T12:41:18.655136+0000 allocated 0 extents [])
Nov 03 12:41:19 ceph-osd11 ceph-osd[3592789]: -450> 2021-11-03T12:41:18.753+0000 7f59bc519700 10 bluefs _allocate len 0x80020 from 2
Nov 03 12:41:19 ceph-osd11 ceph-osd[3592789]: -449> 2021-11-03T12:41:18.753+0000 7f59bc519700 -1 bluefs _allocate allocation failed, needed 0x80020
Nov 03 12:41:19 ceph-osd11 ceph-osd[3592789]: -448> 2021-11-03T12:41:18.753+0000 7f59bc519700 -1 bluefs _flush_range allocated: 0x0 offset: 0x0 length: 0x80020
...
Nov 03 12:41:19 ceph-osd11 ceph-osd[3592789]: -238> 2021-11-03T12:41:18.773+0000 7f59bc519700 -1 /build/ceph-16.2.6/src/os/bluestore/BlueFS.cc: In function 'int BlueFS::_flush_range(BlueFS::FileWriter*, uint64_t, uint64_t)' thread 7f59bc5197
/build/ceph-16.2.6/src/os/bluestore/BlueFS.cc: 2768: ceph_abort_msg("bluefs enospc")

in thread 7f59bc519700 thread_name:rocksdb:low0
ceph version 16.2.6 (ee28fb57e47e9f88813e24bbf4c14496ca299d31) pacific (stable)
1: /lib/x86_64-linux-gnu/libpthread.so.0(+0x12730) [0x7f59c7a7e730]
2: gsignal()
3: abort()
4: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x1a7) [0x55a724bc9ff0]
5: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, unsigned long)+0x1185) [0x55a72523ae15]
6: (BlueFS::_flush(BlueFS::FileWriter*, bool, bool*)+0xa1) [0x55a72523afa1]
7: (BlueFS::_flush(BlueFS::FileWriter*, bool, std::unique_lock<std::mutex>&)+0x2e) [0x55a72525339e]
8: (BlueRocksWritableFile::Append(rocksdb::Slice const&)+0x11b) [0x55a725263d0b]
```
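To make the failure mode concrete, here is a minimal self-contained sketch (a mock, not Ceph code; the BDEV_* ids follow BlueFS conventions, everything else is illustrative) of why a single-device OSD cannot satisfy an allocation routed to the "slow" device:

```cpp
// Illustrative mock, not BlueFS itself: on a single-device OSD only the
// DB slot has an allocator behind it, so routing a db.slow file to
// BDEV_SLOW can only fail, no matter how much free space the disk has.
#include <array>
#include <cerrno>
#include <cstdint>
#include <iostream>

enum { BDEV_WAL = 0, BDEV_DB = 1, BDEV_SLOW = 2 };

struct MockAllocator {
  int64_t allocate(uint64_t need) { return (int64_t)need; }  // always succeeds
};

int main() {
  MockAllocator db_alloc;
  // Single-device layout: everything except BDEV_DB is absent (null).
  std::array<MockAllocator*, 3> alloc{nullptr, &db_alloc, nullptr};

  int id = BDEV_SLOW;        // what the volume selector picks for db.slow
  uint64_t need = 0x80020;   // the length from the log above

  if (!alloc[id]) {          // no allocator behind the "slow" device
    std::cerr << "allocation failed, needed 0x" << std::hex << need << "\n";
    return ENOSPC;           // BlueFS aborts with "bluefs enospc" here
  }
  alloc[id]->allocate(need);
  return 0;
}
```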

Related issues 1 (0 open, 1 closed)

Copied to bluestore - Backport #53292: pacific: OSD might wrongly attempt to use "slow" device when single device is backing the store (Resolved)
#1 - Updated by Igor Fedotov over 2 years ago

  • Subject changed from OSD might try to use "slow" device when single device is backing the store to OSD might wrongly attempt to use "slow" device when single device is backing the store
#2 - Updated by Adam Kupczyk over 2 years ago

It looks to me more like a problem with the allocator code itself.
Checking the douts from _allocate() (or rather the lack of them), I can conclude that we failed on the condition
```if (alloc_len < 0 || alloc_len < need)```
but still got no extents.
This is possible only if the call
```alloc_len = alloc[id]->allocate(need, alloc_size[id], hint, &extents);```
gives us a valid alloc_len but no extents.

Can the problem be replicated with debug_bluestore set high enough that we can get something from the allocator?
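For illustration only, a minimal self-contained mock of the state hypothesized here (not Ceph's allocator; the real call is `alloc[id]->allocate(need, alloc_size[id], hint, &extents)`): a returned length that looks valid while the extents vector stays empty.

```cpp
// Hypothetical sketch of the suspected inconsistency, not Ceph's code:
// allocate() reports a length that satisfies the request while the
// extents vector it was supposed to fill stays empty.
#include <cstdint>
#include <iostream>
#include <utility>
#include <vector>

using Extent = std::pair<uint64_t, uint64_t>;  // offset, length

// Stand-in for alloc[id]->allocate(need, alloc_size[id], hint, &extents).
int64_t allocate(uint64_t need, std::vector<Extent>* extents) {
  (void)extents;            // suspected bug: extents never filled...
  return (int64_t)need;     // ...yet the returned length looks valid
}

int main() {
  uint64_t need = 0x80020;  // the length from the log in the description
  std::vector<Extent> extents;
  int64_t alloc_len = allocate(need, &extents);
  std::cout << "alloc_len=0x" << std::hex << alloc_len
            << " extents=" << std::dec << extents.size() << "\n";  // 0
  return 0;
}
```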

#3 - Updated by Igor Fedotov over 2 years ago

  • Status changed from New to In Progress
#4 - Updated by Igor Fedotov over 2 years ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 43818
#5 - Updated by Igor Fedotov over 2 years ago

Adam Kupczyk wrote:

> It looks to me more like a problem with the allocator code itself.
> Checking the douts from _allocate() (or rather the lack of them), I can conclude that we failed on the condition
> ```if (alloc_len < 0 || alloc_len < need)```
> but still got no extents.
> This is possible only if the call
> ```alloc_len = alloc[id]->allocate(need, alloc_size[id], hint, &extents);```
> gives us a valid alloc_len but no extents.
>
> Can the problem be replicated with debug_bluestore set high enough that we can get something from the allocator?

As you can see in the log snippet above, RocksDB attempts to put this file into db.slow.
This causes the volume selector to return BDEV_SLOW as the preferred device; the corresponding allocator, alloc[BDEV_SLOW], is then attempted but skipped because it is null - only BDEV_DB is present in this setup...
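One plausible shape of a fix (a hedged sketch only; the actual change is in PR 43818 and may differ) is to fall back to a device that actually has an allocator when the preferred one is absent:

```cpp
// Hedged sketch of a fallback strategy, not the code from PR 43818:
// if the volume selector prefers a device with no allocator behind it
// (BDEV_SLOW on a single-device OSD), retry with the DB device instead.
#include <array>
#include <cstdio>

enum { BDEV_WAL = 0, BDEV_DB = 1, BDEV_SLOW = 2 };

struct MockAllocator { /* placeholder for an allocator instance */ };

int pick_bdev(int preferred, const std::array<MockAllocator*, 3>& alloc) {
  if (alloc[preferred]) return preferred;   // normal multi-device path
  if (alloc[BDEV_DB]) return BDEV_DB;       // single-device fallback
  return -1;                                // nothing usable: real ENOSPC
}

int main() {
  MockAllocator db;
  std::array<MockAllocator*, 3> alloc{nullptr, &db, nullptr};
  std::printf("selected bdev: %d\n", pick_bdev(BDEV_SLOW, alloc));  // 1
  return 0;
}
```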

#6 - Updated by A. Saber Shenouda over 2 years ago

In which case can this bug be reproduced, exactly?

#7 - Updated by Igor Fedotov over 2 years ago

A. Saber Shenouda wrote:

> In which case can this bug be reproduced, exactly?

I failed to reproduce this locally, hence I am unable to provide all the details.

The key point is the use of a single device for BlueStore. In the case I observed, the device size was ~1.9TB with a DB size of around 190GB. At that point RocksDB attempted to put an SST file into the db.slow folder, which resulted in an attempt to allocate space from the non-existent main device - the single device is treated as the DB one in this setup.
Unfortunately I don't have devices of that size available, and RocksDB didn't behave the same way with lower numbers...
The unit test in the attached PR simulates that RocksDB behavior - it just opens a file in the db.slow folder and writes some data; the final effect is much the same.
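In outline, the reproducer does something like the following (a hedged sketch modeled on Ceph's BlueFS unit tests; call signatures are approximate and the authoritative version is the unit test added in PR 43818):

```cpp
// Hedged outline of the reproducer, modeled on the BlueFS tests under
// src/test/objectstore; signatures are approximate, see PR 43818 for
// the real unit test. Only one device (BDEV_DB) backs the store.
BlueFS fs(g_ceph_context);
fs.add_block_device(BlueFS::BDEV_DB, path, false);
fs.mkfs(fsid, { BlueFS::BDEV_DB, false, false });
fs.mount();

// Mimic RocksDB: create and write a file under db.slow.
BlueFS::FileWriter* h = nullptr;
ASSERT_EQ(0, fs.open_for_write("db.slow", "file1.sst", &h, false));
h->append(buf, sizeof(buf));   // unfixed: aborts with "bluefs enospc"
fs.fsync(h);                   // fixed: the data lands on the DB device
fs.close_writer(h);
fs.umount();
```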

#9 - Updated by Igor Fedotov over 2 years ago

  • Status changed from Fix Under Review to Pending Backport
#10 - Updated by Backport Bot over 2 years ago

  • Copied to Backport #53292: pacific: OSD might wrongly attempt to use "slow" device when single device is backing the store added
#11 - Updated by Igor Fedotov over 2 years ago

  • Status changed from Pending Backport to Resolved
