Bug #50555

closed

AvlAllocator.cc: 60: FAILED ceph_assert(size != 0)

Added by Hector Martin almost 3 years ago. Updated almost 3 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
pacific, octopus, nautilus
Regression:
Yes
Severity:
2 - major
Reviewed:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

This started happening for an existing OSD after an upgrade from 15.2.2 Octopus to 16.2.0 Pacific, but it turns out the underlying issue was introduced during the Octopus cycle. I haven't tested it on other OSDs yet (since I don't want to break the cluster).

It happens in at least 15.2.11, 16.2.0, and 16.2.1, and does not happen on 15.2.3 and earlier. After downgrading to 15.2.3, I was able to bring the OSD back up. I suspect commit cccf94da (30fcf028 in 15.2.4) might be the culprit here.

Here is a debug log (warning: 150MB): https://mrcn.st/p/A2tsOw35

Last bit of the log:

-9> 2021-04-28T17:38:20.621+0900 7fca0a9be240 10 freelist enumerate_next 0x733fcde0000~10000
-8> 2021-04-28T17:38:20.621+0900 7fca0a9be240 10 AvlAllocator init_add_free offset 0x733fcde0000 length 0x10000
-7> 2021-04-28T17:38:20.621+0900 7fca0a9be240 10 freelist enumerate_next 0x733fef50000~10000
-6> 2021-04-28T17:38:20.621+0900 7fca0a9be240 10 AvlAllocator init_add_free offset 0x733fef50000 length 0x10000
-5> 2021-04-28T17:38:20.621+0900 7fca0a9be240 10 freelist enumerate_next 0x733ff370000~10000
-4> 2021-04-28T17:38:20.621+0900 7fca0a9be240 10 AvlAllocator init_add_free offset 0x733ff370000 length 0x10000
-3> 2021-04-28T17:38:20.621+0900 7fca0a9be240 10 freelist enumerate_next 0x733ffe00000~0
-2> 2021-04-28T17:38:20.621+0900 7fca0a9be240 10 AvlAllocator init_add_free offset 0x733ffe00000 length 0x0
-1> 2021-04-28T17:38:20.623+0900 7fca0a9be240 -1 /var/tmp/portage/sys-cluster/ceph-16.2.1/work/ceph-16.2.1/src/os/bluestore/AvlAllocator.cc: In function 'virtual void AvlAllocator::_add_to_tree(uint64_t, uint64_t)' thread 7fca0a9be240 time 2021-04-28T17:38:20.623154+0900
/var/tmp/portage/sys-cluster/ceph-16.2.1/work/ceph-16.2.1/src/os/bluestore/AvlAllocator.cc: 60: FAILED ceph_assert(size != 0)
ceph version 16.2.1 (afb9061ab4117f798c858c741efa6390e48ccf10) pacific (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x124) [0x5629d70ed115]
2: /usr/bin/ceph-osd(+0x42d2b2) [0x5629d70ed2b2]
3: /usr/bin/ceph-osd(+0xb4c956) [0x5629d780c956]
4: (AvlAllocator::init_add_free(unsigned long, unsigned long)+0x6d) [0x5629d780e44d]
5: (BlueStore::_init_alloc()+0x192) [0x5629d76e9b52]
6: (BlueStore::_open_db_and_around(bool, bool)+0x309) [0x5629d7735c89]
7: (BlueStore::_mount()+0x191) [0x5629d7738671]
8: (OSD::init()+0x58e) [0x5629d71e165e]
9: main()
10: __libc_start_main()
11: _start()
0> 2021-04-28T17:38:20.624+0900 7fca0a9be240 -1 ** Caught signal (Aborted) *

It seems I have a (bad?) freelist entry with a zero size. It's worth mentioning that this OSD has had its bdev expanded in the past. 0x733ffe00000 is precisely the current device size (as returned by blockdev --getsize64 /var/lib/ceph/osd/ceph-0/block), so perhaps somehow the freelist ended up with a dummy 0-size entry right at the end at some point?
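
One way to confirm this kind of entry without reading a 150MB log by hand is to scan it for suspicious extents. The following is a minimal sketch, not a Ceph tool: the function name `find_suspect_extents` is hypothetical, and it assumes the `freelist enumerate_next 0x<offset>~<length>` line format shown above, with both values in hexadecimal and the device size taken from `blockdev --getsize64`.

```python
import re

# Matches lines such as:
#   ... 10 freelist enumerate_next 0x733ffe00000~0
# Offset and length are both hexadecimal; only the offset has a 0x prefix.
EXTENT_RE = re.compile(r"freelist enumerate_next 0x([0-9a-fA-F]+)~([0-9a-fA-F]+)")


def find_suspect_extents(lines, device_size):
    """Return (offset, length, reason) for zero-length or out-of-bounds extents."""
    suspects = []
    for line in lines:
        m = EXTENT_RE.search(line)
        if m is None:
            continue  # not a freelist enumeration line
        offset = int(m.group(1), 16)
        length = int(m.group(2), 16)
        if length == 0:
            suspects.append((offset, length, "zero length"))
        elif offset + length > device_size:
            suspects.append((offset, length, "past end of device"))
    return suspects


# Two lines copied from the log excerpt above.
sample = [
    "-5> 2021-04-28T17:38:20.621+0900 7fca0a9be240 10 freelist enumerate_next 0x733ff370000~10000",
    "-3> 2021-04-28T17:38:20.621+0900 7fca0a9be240 10 freelist enumerate_next 0x733ffe00000~0",
]
device_size = 0x733ffe00000  # value reported for this OSD's block device

for offset, length, reason in find_suspect_extents(sample, device_size):
    print(f"{offset:#x}~{length:x}: {reason}")
```

To scan a real log, pass an open file object as `lines`; the function iterates line by line, so the whole file is never held in memory.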

ceph-bluestore-tool also crashes with the same assert.


Related issues 3 (0 open, 3 closed)

Copied to bluestore - Backport #50780: nautilus: AvlAllocator.cc: 60: FAILED ceph_assert(size != 0) (Resolved, Igor Fedotov)
Copied to bluestore - Backport #50781: octopus: AvlAllocator.cc: 60: FAILED ceph_assert(size != 0) (Resolved, Cory Snyder)
Copied to bluestore - Backport #50782: pacific: AvlAllocator.cc: 60: FAILED ceph_assert(size != 0) (Resolved)
Actions #1

Updated by Igor Fedotov almost 3 years ago

  • Status changed from New to Fix Under Review
  • Backport set to pacific, octopus, nautilus
  • Pull request ID set to 41092

It looks like 'expand' improperly marked out-of-bound blocks as unallocated due to the bug fixed by https://github.com/ceph/ceph/pull/34022
This permits BitmapFreelistManager to return extents with zero length, which in turn isn't properly handled by the Hybrid/Avl allocators. Since the hybrid allocator became the default after the upgrade, the issue has been exposed.

The current workaround is to switch back to the bitmap allocator (bluestore_allocator = bitmap) until the relevant patch is backported.
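
The failure mode can be illustrated with a toy model. This is only a sketch under stated assumptions: `ToyAllocator` and its methods are hypothetical stand-ins, not Ceph code, and skipping empty ranges is shown as one plausible way to tolerate such input, not necessarily what the pull request does.

```python
class ToyAllocator:
    """Hypothetical stand-in for an allocator that rejects empty ranges."""

    def __init__(self):
        self.free = []  # list of (offset, length) free ranges

    def _add_to_tree(self, offset, length):
        # Mirrors the spirit of ceph_assert(size != 0) from the backtrace.
        if length == 0:
            raise AssertionError("size != 0")
        self.free.append((offset, length))

    def init_add_free_strict(self, offset, length):
        # Passes every extent through, including empty ones.
        self._add_to_tree(offset, length)

    def init_add_free_tolerant(self, offset, length):
        # Silently ignores an empty extent instead of asserting.
        if length == 0:
            return
        self._add_to_tree(offset, length)


# Extents as a freelist might enumerate them, ending with the
# zero-length entry at the device boundary seen in the log.
extents = [(0x733ff370000, 0x10000), (0x733ffe00000, 0x0)]

strict = ToyAllocator()
try:
    for offset, length in extents:
        strict.init_add_free_strict(offset, length)
    strict_failed = False
except AssertionError:
    strict_failed = True

tolerant = ToyAllocator()
for offset, length in extents:
    tolerant.init_add_free_tolerant(offset, length)

print("strict failed:", strict_failed)
print("tolerant free ranges:", len(tolerant.free))
```

The strict path aborts on the last extent, as the OSD does at startup, while the tolerant path keeps the one valid range and carries on.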

Actions #2

Updated by Kefu Chai almost 3 years ago

  • Status changed from Fix Under Review to Pending Backport
Actions #3

Updated by Backport Bot almost 3 years ago

  • Copied to Backport #50780: nautilus: AvlAllocator.cc: 60: FAILED ceph_assert(size != 0) added
Actions #4

Updated by Backport Bot almost 3 years ago

  • Copied to Backport #50781: octopus: AvlAllocator.cc: 60: FAILED ceph_assert(size != 0) added
Actions #5

Updated by Backport Bot almost 3 years ago

  • Copied to Backport #50782: pacific: AvlAllocator.cc: 60: FAILED ceph_assert(size != 0) added
Actions #6

Updated by Loïc Dachary almost 3 years ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
