Bug #50656


bluefs _allocate unable to allocate, though enough free

Added by Jan-Philipp Litza almost 3 years ago. Updated almost 3 years ago.

Status: Resolved
Priority: High
Assignee: Igor Fedotov
Target version: -
% Done: 0%
Source:
Tags:
Backport: pacific, octopus, nautilus
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID: 41369
Crash signature (v1):
Crash signature (v2):

Description

Yesterday evening, 4 OSDs on SSDs on 2 hosts crashed almost simultaneously and didn't come back up (crashed again right on startup). Later last night, another one followed. The error was identical in every case:

2021-05-04T19:34:41.692+0200 7f9f1b761700  1 bluefs _allocate unable to allocate 0x400000 on bdev 1, allocator name block, allocator type hybrid, capacity 0x3b9e400000, block size 0x1000, free 0x21132a7000, fragmentation 0.0240807, allocated 0x0
2021-05-04T19:34:41.692+0200 7f9f1b761700 -1 bluefs _allocate allocation failed, needed 0x400000
2021-05-04T19:34:41.708+0200 7f9f1b761700 -1 /build/ceph-16.2.1/src/os/bluestore/BlueFS.cc: In function 'int BlueFS::_flush_and_sync_log(std::unique_lock<std::mutex>&, uint64_t, uint64_t)' thread 7f9f1b761700 time 2021-05-04T19:34:41.699837+0200
/build/ceph-16.2.1/src/os/bluestore/BlueFS.cc: 2542: FAILED ceph_assert(r == 0)

 ceph version 16.2.1 (afb9061ab4117f798c858c741efa6390e48ccf10) pacific (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14e) [0x561d9f2ec277]
 2: /usr/bin/ceph-osd(+0xb33461) [0x561d9f2ec461]
 3: (BlueFS::_flush_and_sync_log(std::unique_lock<std::mutex>&, unsigned long, unsigned long)+0x209d) [0x561d9fa5f6cd]
 4: (BlueFS::_fsync(BlueFS::FileWriter*, std::unique_lock<std::mutex>&)+0x9e) [0x561d9fa5f7fe]
 5: (BlueRocksWritableFile::Sync()+0x63) [0x561d9fa837c3]
 6: (rocksdb::LegacyWritableFileWrapper::Sync(rocksdb::IOOptions const&, rocksdb::IODebugContext*)+0x4e) [0x561d9ff6a530]
 7: (rocksdb::WritableFileWriter::SyncInternal(bool)+0x212) [0x561da015e2f4]
 8: (rocksdb::WritableFileWriter::Sync(bool)+0x177) [0x561da015dd05]
 9: (rocksdb::DBImpl::WriteToWAL(rocksdb::WriteThread::WriteGroup const&, rocksdb::log::Writer*, unsigned long*, bool, bool, unsigned long)+0x396) [0x561d9ffc2b26]
 10: (rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&, rocksdb::WriteBatch*, rocksdb::WriteCallback*, unsigned long*, unsigned long, bool, unsigned long*, unsigned long, rocksdb::PreReleaseCallback*)+0x135a) [0x561d9ffbe988]
 11: (rocksdb::DBImpl::Write(rocksdb::WriteOptions const&, rocksdb::WriteBatch*)+0x59) [0x561d9ffbd593]
 12: (RocksDBStore::submit_common(rocksdb::WriteOptions&, std::shared_ptr<KeyValueDB::TransactionImpl>)+0x81) [0x561d9ff0e7e1]
 13: (RocksDBStore::submit_transaction_sync(std::shared_ptr<KeyValueDB::TransactionImpl>)+0x97) [0x561d9ff0f1c7]
 14: (BlueStore::_kv_sync_thread()+0x2384) [0x561d9f9b83e4]
 15: (BlueStore::KVSyncThread::entry()+0xd) [0x561d9f9e22dd]
 16: /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7f9f3043b6db]
 17: clone()

2021-05-04T19:34:41.724+0200 7f9f1b761700 -1 *** Caught signal (Aborted) **
 in thread 7f9f1b761700 thread_name:bstore_kv_sync

 ceph version 16.2.1 (afb9061ab4117f798c858c741efa6390e48ccf10) pacific (stable)
 1: /lib/x86_64-linux-gnu/libpthread.so.0(+0x12980) [0x7f9f30446980]
 2: gsignal()
 3: abort()
 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a9) [0x561d9f2ec2d2]
 5: /usr/bin/ceph-osd(+0xb33461) [0x561d9f2ec461]
 6: (BlueFS::_flush_and_sync_log(std::unique_lock<std::mutex>&, unsigned long, unsigned long)+0x209d) [0x561d9fa5f6cd]
 7: (BlueFS::_fsync(BlueFS::FileWriter*, std::unique_lock<std::mutex>&)+0x9e) [0x561d9fa5f7fe]
 8: (BlueRocksWritableFile::Sync()+0x63) [0x561d9fa837c3]
 9: (rocksdb::LegacyWritableFileWrapper::Sync(rocksdb::IOOptions const&, rocksdb::IODebugContext*)+0x4e) [0x561d9ff6a530]
 10: (rocksdb::WritableFileWriter::SyncInternal(bool)+0x212) [0x561da015e2f4]
 11: (rocksdb::WritableFileWriter::Sync(bool)+0x177) [0x561da015dd05]
 12: (rocksdb::DBImpl::WriteToWAL(rocksdb::WriteThread::WriteGroup const&, rocksdb::log::Writer*, unsigned long*, bool, bool, unsigned long)+0x396) [0x561d9ffc2b26]
 13: (rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&, rocksdb::WriteBatch*, rocksdb::WriteCallback*, unsigned long*, unsigned long, bool, unsigned long*, unsigned long, rocksdb::PreReleaseCallback*)+0x135a) [0x561d9ffbe988]
 14: (rocksdb::DBImpl::Write(rocksdb::WriteOptions const&, rocksdb::WriteBatch*)+0x59) [0x561d9ffbd593]
 15: (RocksDBStore::submit_common(rocksdb::WriteOptions&, std::shared_ptr<KeyValueDB::TransactionImpl>)+0x81) [0x561d9ff0e7e1]
 16: (RocksDBStore::submit_transaction_sync(std::shared_ptr<KeyValueDB::TransactionImpl>)+0x97) [0x561d9ff0f1c7]
 17: (BlueStore::_kv_sync_thread()+0x2384) [0x561d9f9b83e4]
 18: (BlueStore::KVSyncThread::entry()+0xd) [0x561d9f9e22dd]
 19: /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7f9f3043b6db]
 20: clone()
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Note that free 0x21132a7000 is far larger than the needed 0x400000. The fragmentation score of each OSD in question was roughly 0.85.

The crashes happened while the cluster was still upgrading from 14.2.20 to 16.2.1, although the OSDs in question finished the upgrade hours earlier, including the reformatting of omap accounting data. I'm not sure when hybrid became the default allocator, so it might also be worth mentioning we updated from 14.2.16 to 14.2.20 earlier the same day.

After 3 hours of research, we were able to start the OSDs again by setting bluestore_allocator = bitmap (it would have been 1 hour if I hadn't confused bluefs_allocator with bluestore_allocator; I wish there were a complete list of Ceph options in the docs!).
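
To illustrate the workaround, this is roughly what we ended up with in ceph.conf (a minimal sketch; section placement and scope may differ per setup, and the OSDs must be restarted for the change to take effect):

# ceph.conf on the affected hosts
# bluestore_allocator selects the allocator for the main block device;
# bluefs_allocator is the different option I initially confused it with.
[osd]
bluestore_allocator = bitmap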

Here's the full log. It contains three parts: the original crash, one startup attempt with debug bluefs = 20 (which also shows that there are plenty of suitable free extents), and finally the working startup with the bitmap allocator. Let me know if you need anything else; I guess I can reproduce the crash as soon as I switch back to hybrid...
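
For completeness, the verbose startup attempt corresponds to raising the BlueFS log level, e.g. like this (a sketch, one of several equivalent ways to set it temporarily):

# temporarily in ceph.conf to capture a verbose startup, removed again afterwards
[osd]
debug bluefs = 20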


Files

crash.log (159 KB), Neha Ojha, 05/14/2021 08:14 PM

Related issues 5 (0 open, 5 closed)

Related to bluestore - Bug #46804: nautilus: upgrade:nautilus-p2p/nautilus-p2p-stress-split: BlueFS::_flush_and_sync_log ceph_assert(r == 0) (Duplicate)
Related to bluestore - Bug #51684: OSD crashes after update to 16.2.4 (Duplicate)
Copied to bluestore - Backport #51040: pacific: bluefs _allocate unable to allocate, though enough free (Resolved, Neha Ojha)
Copied to bluestore - Backport #51041: octopus: bluefs _allocate unable to allocate, though enough free (Resolved, Neha Ojha)
Copied to bluestore - Backport #51042: nautilus: bluefs _allocate unable to allocate, though enough free (Resolved, Igor Fedotov)
Actions #1

Updated by Igor Fedotov almost 3 years ago

Could you please collect a free-blocks dump via 'ceph daemon OSD.N bluestore allocator dump block'?

Afterwards it would be great if you could verify that the OSD above is still failing with the hybrid allocator.
You might try the 'ceph daemon OSD.N compact' command as a trigger if it doesn't fail after an OSD restart.
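
Concretely, something like this (a sketch; osd.12 and the output path are placeholders for the actual OSD id and wherever you want to keep the dump):

# free-extent dump from the running OSD's admin socket
ceph daemon osd.12 bluestore allocator dump block > /tmp/osd.12-allocator-dump.json
# after restarting with the hybrid allocator, try to provoke the failure
ceph daemon osd.12 compact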

Actions #2

Updated by Jan-Philipp Litza almost 3 years ago

Could you please collect a free-blocks dump via 'ceph daemon OSD.N bluestore allocator dump block'?

Isn't that what the "freelist" lines in the log say?

Afterwards it would be great if you could verify that the OSD above is still failing with the hybrid allocator.
You might try the 'ceph daemon OSD.N compact' command as a trigger if it doesn't fail after an OSD restart.

Too bad (or not), neither a restart nor a manual compaction triggers a crash now. And since the crash was 20 hours ago, the dump probably isn't worth much without a new crash.

One OSD on each host hasn't crashed yet. I removed the allocator setting from ceph.conf again, so if either of them crashes, it won't come back up automatically. I will then provide the output of ceph-bluestore-tool free-dump (that's identical to the command above, but works without a running OSD, right?). If they don't crash in the next few days, I guess I'll continue with the cluster upgrade. Then there will be more hosts where crashes can occur.
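
For reference, the offline variant would look roughly like this (a sketch; the data path below is just the usual default location for a placeholder OSD id):

# offline equivalent of the allocator dump; run while the OSD is stopped
ceph-bluestore-tool free-dump --path /var/lib/ceph/osd/ceph-12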

Actions #3

Updated by Igor Fedotov almost 3 years ago

Jan-Philipp Litza wrote:

Could you please collect a free-blocks dump via 'ceph daemon OSD.N bluestore allocator dump block'?

Isn't that what the "freelist" lines in the log say?

It is!!! Thanks for pointing that out, I had missed that it's there.

Will do some experiments and report back.

Afterwards it would be great if you could verify that the OSD above is still failing with the hybrid allocator.
You might try the 'ceph daemon OSD.N compact' command as a trigger if it doesn't fail after an OSD restart.

Too bad (or not), neither a restart nor a manual compaction triggers a crash now. And since the crash was 20 hours ago, the dump probably isn't worth much without a new crash.

One OSD on each host hasn't crashed yet. I removed the allocator setting from ceph.conf again, so if either of them crashes, it won't come back up automatically. I will then provide the output of ceph-bluestore-tool free-dump (that's identical to the command above, but works without a running OSD, right?). If they don't crash in the next few days, I guess I'll continue with the cluster upgrade. Then there will be more hosts where crashes can occur.

Actions #4

Updated by Jan-Philipp Litza almost 3 years ago

Oh well, the OSDs ran roughly an hour with the hybrid allocator before crashing again, and again in a cascading manner. Switched to bitmap again, but apparently it just takes some time to trigger the crash.

Actions #5

Updated by Igor Fedotov almost 3 years ago

  • Status changed from New to Triaged

OK, I managed to reproduce the issue via an allocator log replay; it looks to some degree similar to https://tracker.ceph.com/issues/47883.

But this one occurs in "first-fit" mode when only a single long-enough extent (or no extent at all) is available starting from the cursor position. The resulting extent length after proper alignment isn't enough, though. Since there is no allocation retry for positions beyond the cursor and/or with smaller sizes, ENOSPC is returned.

Generally, the AVL allocator design should be revised, as its 'cursor' model looks questionable.

Using the bitmap allocator seems to be the proper workaround for now.
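
A quick way to check which allocator a running OSD is actually using (osd.12 again being a placeholder id):

# query the effective allocator setting over the admin socket
ceph daemon osd.12 config get bluestore_allocator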

Actions #6

Updated by Jan-Philipp Litza almost 3 years ago

Thanks for looking into this so quickly!

So in #47883 it says that hybrid is the default allocator since 14.2.11, which we have been running for months already. But I doubt it's a coincidence that the crashes happened so soon after the Pacific upgrade, right?

Also, is there any reason (besides more frequent usage probably) this hit the SSD-based OSDs and not those with HDDs? Should I set the bitmap allocator cluster-wide?

Actions #7

Updated by Igor Fedotov almost 3 years ago

Jan-Philipp Litza wrote:

Thanks for looking into this so quickly!

So in #47883 it says that hybrid is the default allocator since 14.2.11, which we have been running for months already. But I doubt it's a coincidence that the crashes happened so soon after the Pacific upgrade, right?

I don't have an exact explanation for that. Just a speculation: it could be some modification at an upper layer, e.g. BlueFS or RocksDB, that caused longer extent allocations, which in turn revealed the issue.

Also, is there any reason (besides more frequent usage probably) this hit the SSD-based OSDs and not those with HDDs? Should I set the bitmap allocator cluster-wide?

Probably the SSD-based OSDs have a higher fill ratio and/or fragmentation.
As for switching everything to bitmap: I'm not sure it's mandatory at the moment, as it doesn't look like a frequently hit issue. But you can definitely do that to feel safer and/or more comfortable.
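
If you do decide to set it cluster-wide, something like this would do it (a sketch; it stores the option in the monitors' central config database and still requires OSD restarts to take effect):

# apply the bitmap allocator to all OSDs via the central config
ceph config set osd bluestore_allocator bitmap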

Actions #8

Updated by Neha Ojha almost 3 years ago

Quote from a ceph-users thread: after an upgrade from 16.2.3 to 16.2.4 and after adding a few HDDs, OSDs started to fail one by one.

"Today I upgraded from 16.2.3 to 16.2.4 and added few hosts and osds.
After data migration for few hours, 1 SSD failed, then another and
another 1 by 1. Now I have cluster in pause and 5 failed SSD's, same
host has SSD and HDD, but only SSD's are failing so I think this has to
be balancing refiling or something bug and probably not upgrade bug.

Cluster has been in pause for 4 hours and no more OSD's are failing."

Using the workaround of switching to bitmap helped.

"It probably is same bug.

bluestore_allocator = bitmap

by setting this parameter all failed OSD started."

I have attached the relevant log.

Actions #9

Updated by Dan van der Ster almost 3 years ago

Igor, I was checking what is different between pacific's avl and octopus, and found it is only this: https://tracker.ceph.com/issues/48272

If `rs_start == t.begin()` in AvlAllocator::_block_picker, we get ENOSPACE.

Could this be the root cause?

Actions #10

Updated by Igor Fedotov almost 3 years ago

Dan van der Ster wrote:

Igor, I was checking what is different between pacific's avl and octopus, and found it is only this: https://tracker.ceph.com/issues/48272

If `rs_start == t.begin()` in AvlAllocator::_block_picker, we get ENOSPACE.

Could this be the root cause?

Hmm... interesting finding, but I don't think so. Will try to revert and double-check on Jan-Philipp's data though...

Actions #11

Updated by Igor Fedotov almost 3 years ago

Igor Fedotov wrote:

Dan van der Ster wrote:

Igor, I was checking what is different between pacific's avl and octopus, and found it is only this: https://tracker.ceph.com/issues/48272

If `rs_start == t.begin()` in AvlAllocator::_block_picker, we get ENOSPACE.

Could this be the root cause?

Hmm... interesting finding, but I don't think so. Will try to revert and double-check on Jan-Philipp's data though...

Reverting #48272 doesn't help when replaying Jan-Philipp's log. So it isn't a regression.

Actions #12

Updated by Neha Ojha almost 3 years ago

  • Status changed from Triaged to Fix Under Review
  • Assignee set to Igor Fedotov
  • Pull request ID set to 41369
Actions #13

Updated by Neha Ojha almost 3 years ago

  • Backport set to pacific
Actions #14

Updated by Neha Ojha almost 3 years ago

  • Backport changed from pacific to pacific, octopus, nautilus
Actions #15

Updated by Igor Fedotov almost 3 years ago

  • Status changed from Fix Under Review to Pending Backport
Actions #16

Updated by Backport Bot almost 3 years ago

  • Copied to Backport #51040: pacific: bluefs _allocate unable to allocate, though enough free added
Actions #17

Updated by Backport Bot almost 3 years ago

  • Copied to Backport #51041: octopus: bluefs _allocate unable to allocate, though enough free added
Actions #18

Updated by Backport Bot almost 3 years ago

  • Copied to Backport #51042: nautilus: bluefs _allocate unable to allocate, though enough free added
Actions #19

Updated by Loïc Dachary almost 3 years ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".

Actions #20

Updated by Igor Fedotov over 2 years ago

  • Related to Bug #46804: nautilus: upgrade:nautilus-p2p/nautilus-p2p-stress-split: BlueFS::_flush_and_sync_log ceph_assert(r == 0) added
Actions #21

Updated by Igor Fedotov over 2 years ago

  • Related to Bug #51684: OSD crashes after update to 16.2.4 added