Bug #53899


bluefs _allocate allocation failed - BlueFS.cc: 2768: ceph_abort_msg("bluefs enospc")

Added by Pivert Dubuisson over 2 years ago. Updated 28 days ago.

Status:
Need More Info
Priority:
Normal
Assignee:
-
Target version:
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

All OSDs fail to start after the OSDs got near full. Cluster down.
3-node cluster (pve1, pve2, pve3) - BlueStore on a single LVM logical volume on NVMe.

Sequence of events:
- 00:30 The Ceph RBD pool was about 90% full on the 3-node cluster, with about 140GB on each node
- 00:47 pve2: Issued a VM/CT restore (in Proxmox VE) to overwrite a 20GB CT. From the pve logs:
  - Logical volume "vm-105-disk-0" successfully removed
  - restoring 'OneDriveSecure1:backup/vzdump-lxc-105-2022_01_15-23_14_19.tar.zst' now..
- 00:48:26 pve3 ceph-osd1681: 2022-01-16T00:48:26.215+0100 7f7bfa5a5700 1 bluefs _allocate allocation failed, needed 0x1687 -> ceph_abort_msg("bluefs enospc")
- 00:48:33 pve1 ceph-osd2157: 2022-01-16T00:48:33.904+0100 7f3b78139700 1 bluefs _allocate allocation failed, needed 0x6ff1 -> ceph_abort_msg("bluefs enospc")
- 01:20 pve2: The restore on pve2 never finished. Server hanging (iowaits). Reboot did not complete.
- 01:38 pve2: I had to power off and restart the server.
- 01:39:12 pve2 ceph-crash687: INFO:ceph-crash:monitoring path /var/lib/ceph/crash, delay 600s
- 01:40:00 pve2 ceph-osd1681: 2022-01-16T01:40:00.023+0100 7f628f9cbf00 -1 bluefs _allocate allocation failed, needed 0x74240 -> ceph_abort_msg("bluefs enospc")

At this stage, all OSDs are down, stuck in an endless crash-and-restart loop.

All standard logs from the 3 pve hosts are attached.
(grep -a ceph /var/log/syslog | xz -z > /root/cephcrash_$HOSTNAME.xz)


Files

cephcrash_pve1.xz (944 KB) - Pivert Dubuisson, 01/16/2022 05:24 PM
cephcrash_pve2.xz (59.2 KB) - Pivert Dubuisson, 01/16/2022 05:24 PM
cephcrash_pve3.xz (750 KB) - Pivert Dubuisson, 01/16/2022 05:24 PM
ceph_crash_logs_pve1.xz (840 KB) - Pivert Dubuisson, 01/16/2022 11:14 PM
ceph_crash_logs_pve2.xz (87 KB) - Pivert Dubuisson, 01/16/2022 11:14 PM
ceph_crash_logs_pve3.xz (758 KB) - Pivert Dubuisson, 01/16/2022 11:14 PM
ceph-osd_pve1.1.log.xz (482 KB) - Pivert Dubuisson, 01/16/2022 11:36 PM
ceph-osd_pve2.0.log.xz (117 KB) - Pivert Dubuisson, 01/16/2022 11:36 PM
ceph-osd_pve3.2.log.xz (506 KB) - Pivert Dubuisson, 01/16/2022 11:36 PM
issue-53899-journalctl-log.txt (53.1 KB) - Niklas Hambuechen, 03/29/2024 12:12 AM
issue-53899-osd-log.txt.gz (103 KB) - Niklas Hambuechen, 03/29/2024 12:12 AM

Related issues 2 (0 open, 2 closed)

Related to bluestore - Bug #53466: OSD is unable to allocate free space for BlueFS (Resolved) - Igor Fedotov

Has duplicate bluestore - Bug #53590: ceph abort at bluefs enospc (Duplicate)

Actions #1

Updated by Igor Fedotov over 2 years ago

Hey Pivert,
is it possible to get an OSD startup log that is more verbose than just the stderr output? I actually need one with debug-bluefs set to at least 1. You might need to look for it under /var/log/ceph or in the systemd journal, rather than in the system log. And please make sure debug-bluefs is set to a value higher than 0.
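For reference, one way to raise the BlueFS debug level for an OSD that will not start is via ceph.conf (a minimal sketch, assuming the standard /etc/ceph/ceph.conf location; level 10 is just an example):

# /etc/ceph/ceph.conf - raise BlueFS logging so the failed startup attempt
# is captured in /var/log/ceph/ceph-osd.<id>.log
[osd]
    debug_bluefs = 10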

Actions #2

Updated by Pivert Dubuisson over 2 years ago

Thanks Igor,

Just realized that the existing logs are already verbose and quite huge.
Attached are the 3 results of this command:

find /var/lib/ceph/crash/ ! -newermt '2022-01-16T03:00:00' -newermt '2022-01-15T02:00:00' -print0 | tar -Jcvf ceph_crash_logs_$HOSTNAME.xz --null -T -

And the OSD logs, truncated to the first 250k lines with commands like:

gzip -d -c /var/log/ceph/ceph-osd.1.log.1.gz | head -n 250000 | xz -z > ceph-osd_$HOSTNAME.1.log.xz

Tell me if you need extra logging.

Actions #3

Updated by Pivert Dubuisson over 2 years ago

Just realized that I forgot to mention the tar extension for some files, which could be confusing.
Files like:
ceph_crash_logs_$HOSTNAME.xz
are actually tar.xz archives.

Log level is already set to 1 by default on PVE for OSDs.

Actions #4

Updated by Pivert Dubuisson over 2 years ago

Hi,

Obviously, I've been able to get out of this terrible situation.
Since I'm on LVM (on top of NVMe) and had kept some 27GB aside for an extra Ceph DB SSD volume, I was able to extend the underlying volume, which unlocked the situation. I did this on all the nodes (on pve3 I only gave 5GB instead of the 27GB on pve1 and pve2, and should probably give it the 27GB / 100% free as well):

root@pve2:~# lvextend -l +100%FREE /dev/pve/cephosdnvme
  Size of logical volume pve/cephosdnvme changed from 290,00 GiB (74240 extents) to 317,57 GiB (81298 extents).
  Logical volume pve/cephosdnvme successfully resized.
root@pve2:~# ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-0/ bluefs-bdev-expand
inferring bluefs devices from bluestore path
1 : device size 0x4f64800000 : using 0x4a951a4000(298 GiB)
Expanding DB/WAL...
1 : expanding  from 0x4880000000 to 0x4f64800000
2022-01-17T01:15:41.775+0100 7f5fb6d0f240 -1 bluestore(/var/lib/ceph/osd/ceph-0) _read_bdev_label failed to read from /var/lib/ceph/osd/ceph-0: (21) Is a directory
2022-01-17T01:15:41.775+0100 7f5fb6d0f240 -1 bluestore(/var/lib/ceph/osd/ceph-0) unable to read label for /var/lib/ceph/osd/ceph-0: (21) Is a directory
(failed reverse-i-search)`start': systemctl ^Cart rclonemount.service
(failed reverse-i-search)`restart': view "/var/log/pve/tasks/3/UPID:pve2:000D2FF5:015D1638:61E35D23:vz^Cstore:105:root@pam:" 
root@pve2:~# systemctl start ceph-osd@0.service
root@pve2:~# ceph status
  cluster:
    id:     e7628d51-32b5-4f5c-8eec-1cafb41ead74
    health: HEALTH_WARN
            mon pve3 is low on available space
            1 backfillfull osd(s)
            1 nearfull osd(s)
            Degraded data redundancy: 497581/1492743 objects degraded (33.333%), 82 pgs degraded, 82 pgs undersized
            4 pool(s) backfillfull
            1344 daemons have recently crashed

  services:
    mon: 3 daemons, quorum pve1,pve3,pve2 (age 22h)
    mgr: pve1(active, since 23h), standbys: pve3, pve2
    mds: 1/1 daemons up, 2 standby
    osd: 3 osds: 2 up (since 7m), 2 in (since 7m)

  data:
    volumes: 1/1 healthy
    pools:   4 pools, 82 pgs
    objects: 497.58k objects, 269 GiB
    usage:   540 GiB used, 72 GiB / 613 GiB avail
    pgs:     497581/1492743 objects degraded (33.333%)
             82 active+undersized+degraded

  io:
    client:   682 B/s rd, 127 KiB/s wr, 0 op/s rd, 15 op/s wr

  progress:
    Global Recovery Event (0s)
      [............................] 

root@pve2:~# ceph status
  cluster:
    id:     e7628d51-32b5-4f5c-8eec-1cafb41ead74
    health: HEALTH_WARN
            mon pve3 is low on available space
            1 backfillfull osd(s)
            2 nearfull osd(s)
            4 pool(s) backfillfull
            1344 daemons have recently crashed

  services:
    mon: 3 daemons, quorum pve1,pve3,pve2 (age 22h)
    mgr: pve1(active, since 23h), standbys: pve3, pve2
    mds: 1/1 daemons up, 2 standby
    osd: 3 osds: 3 up (since 43s), 3 in (since 43s)

  data:
    volumes: 1/1 healthy
    pools:   4 pools, 82 pgs
    objects: 497.58k objects, 269 GiB
    usage:   810 GiB used, 120 GiB / 930 GiB avail
    pgs:     81 active+clean
             1  active+clean+scrubbing+deep

  io:
    client:   103 KiB/s wr, 0 op/s rd, 11 op/s wr

root@pve2:~# ceph status
  cluster:
    id:     e7628d51-32b5-4f5c-8eec-1cafb41ead74
    health: HEALTH_WARN
            mon pve3 is low on available space
            1 backfillfull osd(s)
            2 nearfull osd(s)
            4 pool(s) backfillfull
            1792 daemons have recently crashed

  services:
    mon: 3 daemons, quorum pve1,pve3,pve2 (age 22h)
    mgr: pve1(active, since 23h), standbys: pve3, pve2
    mds: 1/1 daemons up, 2 standby
    osd: 3 osds: 3 up (since 7m), 3 in (since 7m)

  data:
    volumes: 1/1 healthy
    pools:   4 pools, 82 pgs
    objects: 497.58k objects, 269 GiB
    usage:   810 GiB used, 120 GiB / 930 GiB avail
    pgs:     82 active+clean

  io:
    client:   247 KiB/s wr, 0 op/s rd, 42 op/s wr

Do you know how relevant the bluefs-bdev-expand errors are?

Anyway, we should not reach that situation. What if I were on a physical device?

Many thanks for your support.
Feel free to ask any more detail.

Have a nice day ;-)

Actions #5

Updated by Igor Fedotov over 2 years ago

Pivert Dubuisson wrote:

Hi,

Obviously, I've been able to get out of this terrible situation.
Since I'm on LVM (on top of NVMe) and had kept some 27GB aside for an extra Ceph DB SSD volume, I was able to extend the underlying volume, which unlocked the situation. I did this on all the nodes (on pve3 I only gave 5GB instead of the 27GB on pve1 and pve2, and should probably give it the 27GB / 100% free as well):

[...]

Do you know how relevant the bluefs-bdev-expand errors are?

You can ignore them.

Anyway, we should not reach that situation. What if I were on a physical device?

Many thanks for your support.
Feel free to ask any more detail.

Have a nice day ;-)

Well, you've found a good enough workaround, but unfortunately it hides the root cause of the issue.

Which IMO might be either a software bug in the hybrid allocator or the lack of a free contiguous 64K block due to high fragmentation and a limited amount of free space.
The bad news is that the workaround doesn't guarantee the issue won't reappear in the future, particularly if space runs low again. And right, it wouldn't work for physical partitions.

Just in case, let me try to do a free space fragmentation analysis on the current allocation map. To do so, please dump the free extents with the 'ceph tell osd.N bluestore allocator dump block' command.
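For illustration, on one node that could look like this (a sketch; osd.2 and the output file name are just placeholders, and the OSD must be up for 'ceph tell' to reach it):

ceph tell osd.2 bluestore allocator dump block > osd.2_allocator_dump_block.json
xz -9 osd.2_allocator_dump_block.json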

Thanks in advance!

Actions #6

Updated by Igor Fedotov over 2 years ago

  • Has duplicate Bug #53590: ceph abort at bluefs enospc added
Actions #7

Updated by Pivert Dubuisson over 2 years ago

Hi,
the file size exceeds the max 1MB upload limit... The file is temporarily available here:
www.pivert.org/osd.2_bluestore_allocator_dump_block.xz

osd.2 on pve3 was the first to fail. Do you need the 2 other nodes as well?

Thanks,

Actions #8

Updated by Igor Fedotov over 2 years ago

Pivert Dubuisson wrote:

Hi,
the file size exceeds the max 1MB upload limit... The file is temporarily available here:
www.pivert.org/osd.2_bluestore_allocator_dump_block.xz

osd.2 on pve3 was the first to fail. Do you need the 2 other nodes as well?

Thanks,

Unfortunately the allocator's state looks totally different now - it has 69GB of free space vs. the ~20GB observed when it crashed, so it's of no use for this analysis.

Let's try dumps from the other nodes, but I expect they're in a similar state...

Regards,
Igor

Actions #9

Updated by Pivert Dubuisson over 2 years ago

Hi,

I added the 2 other dumps to the same location, but I expect the same.

Specifics of my setup that might be related:
  • Ceph BlueStore on LVM on Proxmox7 (Debian 11.2) with ceph version 16.2.7
  • Using the default pve VG, on top of a 512GB NVME SSD
  • I scratched & reconfigured the Ceph & data setup a couple of times using a self-made script on every node. The script is just below. (Don't try it unless you're ready to lose all the data, including data from PVE local storage and PVE cluster data; it basically resets PVE as if it was freshly installed.)

Looking back at the script, I should also have added 'rm -rf /var/lib/ceph' at the end to also delete other configs such as the OSD ones... But that should not be the cause of the problem since the OSD config is on a mounted tmpfs.

[code]
#!/bin/bash
  1. scratch_pve_cluster.sh to delete all data and reset PVE as freshly installed

systemctl stop ceph-crash.service
systemctl stop ceph-mon.target
systemctl stop ceph-osd.target
systemctl stop ceph-mgr.target
systemctl stop ceph
pveceph purge
systemctl stop pvestatd.service
systemctl stop pvedaemon.service
systemctl stop pve-cluster.service
systemctl stop corosync
systemctl stop pve-cluster

mkdir /root/$(date +%Y%m%d)
mv -f /var/lib/pve-cluster/config.db /root/$(date +%Y%m%d)
mv /etc/pve/* /root/$(date +%Y%m%d)
rm -rf /etc/corosync/*
  1. rm /etc/corosync/uidgid.d/*
    rm /var/lib/corosync/*
    rm -rf /var/lib/ceph/mon/ceph-pve*
    apt purge ceph-base -y
    apt purge ceph -y
    rm -rf /etc/ceph
    [/code]

So, if someone experiences the same situation, please run the command below on all the nodes before extending the BlueStore LV:
[code]
ceph tell osd.x bluestore allocator dump block | xz -c > osd.x_bluestore_allocator_dump_block_on_${HOSTNAME}.xz
[/code]
Replace x with the relevant OSD id (different on each node). You'll find info about the OSDs in the PVE GUI or in the subfolders of /var/lib/ceph/osd/. A loop like the one sketched below can cover all OSDs on a node.
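For convenience, here is such a loop (a sketch, assuming the default /var/lib/ceph/osd/ceph-<id> directory layout and that the OSDs are up):
[code]
# Sketch: dump the block allocator state of every OSD hosted on this node.
for osd_path in /var/lib/ceph/osd/ceph-*; do
    id="${osd_path##*-}"
    ceph tell "osd.${id}" bluestore allocator dump block \
        | xz -c > "osd.${id}_bluestore_allocator_dump_block_on_${HOSTNAME}.xz"
done
[/code]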

Actions #10

Updated by Pivert Dubuisson over 2 years ago

Re-posting the script as indentation broke it:

#!/bin/bash
# scratch_pve_cluster.sh to delete all data and reset PVE as freshly installed

systemctl stop ceph-crash.service
systemctl stop ceph-mon.target
systemctl stop ceph-osd.target
systemctl stop ceph-mgr.target
systemctl stop ceph
pveceph purge
systemctl stop pvestatd.service
systemctl stop pvedaemon.service
systemctl stop pve-cluster.service
systemctl stop corosync
systemctl stop pve-cluster

mkdir /root/$(date +%Y%m%d)
mv -f /var/lib/pve-cluster/config.db /root/$(date +%Y%m%d)
mv /etc/pve/* /root/$(date +%Y%m%d)
rm -rf /etc/corosync/*
# rm /etc/corosync/uidgid.d/*
rm /var/lib/corosync/*
rm -rf /var/lib/ceph/mon/ceph-pve*
apt purge ceph-base -y
apt purge ceph -y
rm -rf /etc/ceph
Actions #11

Updated by Igor Fedotov over 2 years ago

Thanks for the information!

Pivert Dubuisson wrote:

Hi,

I added the 2 other dumps to the same location, but I expect the same.

Yeah, this shows the same - much more free space is currently available, hence the issue isn't reproducible.

Specifics of my setup that might be related:

I don't think this was caused by any specifics of your setup other than the low free space conditions. The major question is whether this was a proper failure caused by the lack of enough contiguous 16K extents, or a bug in the allocator which was unable to provide such extents although they were available. IMO the former is highly likely.

Anyway we need proper allocator dumps for that, so let's wait for another occurrence.

Actions #12

Updated by Igor Fedotov over 2 years ago

  • Status changed from New to Need More Info
Actions #13

Updated by Seena Fallah about 2 years ago

Hi,

I got the same issue with some of my OSDs (pacific v16.2.7 & SSD)

Feb 16 14:14:48 osd-server docker[13282]: 2022-02-16T14:14:48.499+0330 7fd2cfad3080 -1 bluefs _allocate allocation failed, needed 0x73d8e
Feb 16 14:14:48 osd-server docker[13282]: 2022-02-16T14:14:48.499+0330 7fd2cfad3080 -1 bluefs _flush_range allocated: 0x390000 offset: 0x383b90 length: 0x801fe
Feb 16 14:14:48 osd-server docker[13282]: /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.7/rpm/el8/BUILD/ceph-16.2.7/src/os/bluestore/BlueFS.cc: In function 'int BlueFS::_flush_range(BlueFS::FileWriter*, uint64_t, uint64_t)' thread 7fd2cfad3080 time 2022-02-16T14:14:48.502032+0330
Feb 16 14:14:48 osd-server docker[13282]: /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.7/rpm/el8/BUILD/ceph-16.2.7/src/os/bluestore/BlueFS.cc: 2768: ceph_abort_msg("bluefs enospc")
Feb 16 14:14:48 osd-server docker[13282]:  ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)
Feb 16 14:14:48 osd-server docker[13282]:  1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xe5) [0x5571561272be]
Feb 16 14:14:48 osd-server docker[13282]:  2: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, unsigned long)+0x1131) [0x557156822eb1]
Feb 16 14:14:48 osd-server docker[13282]:  3: (BlueFS::_flush(BlueFS::FileWriter*, bool, bool*)+0x90) [0x557156823190]
Feb 16 14:14:48 osd-server docker[13282]:  4: (BlueFS::_flush(BlueFS::FileWriter*, bool, std::unique_lock<std::mutex>&)+0x32) [0x5571568342a2]
Feb 16 14:14:48 osd-server docker[13282]:  5: (BlueRocksWritableFile::Append(rocksdb::Slice const&)+0x11b) [0x55715684cbdb]
Feb 16 14:14:48 osd-server docker[13282]:  6: (rocksdb::LegacyWritableFileWrapper::Append(rocksdb::Slice const&, rocksdb::IOOptions const&, rocksdb::IODebugContext*)+0x1f) [0x557156ce6e0f]
Feb 16 14:14:48 osd-server docker[13282]:  7: (rocksdb::WritableFileWriter::WriteBuffered(char const*, unsigned long)+0x58a) [0x557156df8b5a]
Feb 16 14:14:48 osd-server docker[13282]:  8: (rocksdb::WritableFileWriter::Append(rocksdb::Slice const&)+0x2d0) [0x557156df9fb0]
Feb 16 14:14:48 osd-server docker[13282]:  9: (rocksdb::BlockBasedTableBuilder::WriteRawBlock(rocksdb::Slice const&, rocksdb::CompressionType, rocksdb::BlockHandle*, bool)+0xb6) [0x557156f15736]
Feb 16 14:14:48 osd-server docker[13282]:  10: (rocksdb::BlockBasedTableBuilder::WriteBlock(rocksdb::Slice const&, rocksdb::BlockHandle*, bool)+0x26c) [0x557156f1607c]
Feb 16 14:14:48 osd-server docker[13282]:  11: (rocksdb::BlockBasedTableBuilder::WriteBlock(rocksdb::BlockBuilder*, rocksdb::BlockHandle*, bool)+0x3c) [0x557156f1677c]
Feb 16 14:14:48 osd-server docker[13282]:  12: (rocksdb::BlockBasedTableBuilder::Flush()+0x6d) [0x557156f1680d]
Feb 16 14:14:48 osd-server docker[13282]:  13: (rocksdb::BlockBasedTableBuilder::Add(rocksdb::Slice const&, rocksdb::Slice const&)+0x2b8) [0x557156f19c78]
Feb 16 14:14:48 osd-server docker[13282]:  14: (rocksdb::BuildTable(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::Env*, rocksdb::FileSystem*, rocksdb::ImmutableCFOptions const&, rocksdb::MutableCFOptions const&, rocksdb::FileOptions const&, rocksdb::TableCache*, rocksdb::InternalIteratorBase<rocksdb::Slice>*, std::vector<std::unique_ptr<rocksdb::FragmentedRangeTombstoneIterator, std::default_delete<rocksdb::FragmentedRangeTombstoneIterator> >, std::allocator<std::unique_ptr<rocksdb::FragmentedRangeTombstoneIterator, std::default_delete<rocksdb::FragmentedRangeTombstoneIterator> > > >, rocksdb::FileMetaData*, rocksdb::InternalKeyComparator const&, std::vector<std::unique_ptr<rocksdb::IntTblPropCollectorFactory, std::default_delete<rocksdb::IntTblPropCollectorFactory> >, std::allocator<std::unique_ptr<rocksdb::IntTblPropCollectorFactory, std::default_delete<rocksdb::IntTblPropCollectorFactory> > > > const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<unsigned long, std::allocator<unsigned long> >, unsigned long, rocksdb::SnapshotChecker*, rocksdb::CompressionType, unsigned long, rocksdb::CompressionOptions const&, bool, rocksdb::InternalStats*, rocksdb::TableFileCreationReason, rocksdb::EventLogger*, int, rocksdb::Env::IOPriority, rocksdb::TableProperties*, int, unsigned long, unsigned long, rocksdb::Env::WriteLifeTimeHint, unsigned long)+0xa45) [0x557156ec46f5]
Feb 16 14:14:48 osd-server docker[13282]:  15: (rocksdb::DBImpl::WriteLevel0TableForRecovery(int, rocksdb::ColumnFamilyData*, rocksdb::MemTable*, rocksdb::VersionEdit*)+0xcf5) [0x557156d29755]
Feb 16 14:14:48 osd-server docker[13282]:  16: (rocksdb::DBImpl::RecoverLogFiles(std::vector<unsigned long, std::allocator<unsigned long> > const&, unsigned long*, bool, bool*)+0x1c2e) [0x557156d2be8e]
Feb 16 14:14:48 osd-server docker[13282]:  17: (rocksdb::DBImpl::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool, bool, bool, unsigned long*)+0xae8) [0x557156d2d1e8]
Feb 16 14:14:48 osd-server docker[13282]:  18: (rocksdb::DBImpl::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**, bool, bool)+0x59d) [0x557156d26f0d]
Feb 16 14:14:48 osd-server docker[13282]:  19: (rocksdb::DB::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**)+0x15) [0x557156d282a5]
Feb 16 14:14:48 osd-server docker[13282]:  20: (RocksDBStore::do_open(std::ostream&, bool, bool, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x10c1) [0x557156ca0201]
Feb 16 14:14:48 osd-server docker[13282]:  21: (BlueStore::_open_db(bool, bool, bool)+0x8c7) [0x55715671d027]
Feb 16 14:14:48 osd-server docker[13282]:  22: (BlueStore::_open_db_and_around(bool, bool)+0x2f7) [0x557156787ab7]
Feb 16 14:14:48 osd-server docker[13282]:  23: (BlueStore::_mount()+0x204) [0x55715678a974]
Feb 16 14:14:48 osd-server docker[13282]:  24: (OSD::init()+0x380) [0x55715625e400]
Feb 16 14:14:48 osd-server docker[13282]:  25: main()
Feb 16 14:14:48 osd-server docker[13282]:  26: __libc_start_main()
Feb 16 14:14:48 osd-server docker[13282]:  27: _start()

Actions #14

Updated by Seena Fallah about 2 years ago

Seena Fallah wrote:

Hi,

I got the same issue with some of my OSDs (pacific v16.2.7 & SSD)
[...]

I tried to boot my OSD with the `bitmap` and `stupid` allocators but it didn't help!

Actions #15

Updated by Igor Fedotov about 2 years ago

Hi Seena,
I presume you're sharing stderr output only, right?
Could you please share a more verbose OSD startup log?

Additionally, please run ceph-bluestore-tool's free-dump command and share the output for inspection.
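For reference, the free-dump can be produced roughly like this while the OSD is stopped (a sketch; <id> is the affected OSD's id, and the --allocator selector limits the dump to the main block allocator):

ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-<id> --allocator block free-dump > osd.<id>.free-dump.json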

Actions #16

Updated by Seena Fallah about 2 years ago

Igor Fedotov wrote:

Hi Seena,
I presume you're sharing stderr output only, right?
Could you please share a more verbose OSD startup log?

Additionally, please run ceph-bluestore-tool's free-dump command and share the output for inspection.

The free-dump output is 2GB and, even compressed to tar.gz, it is still 100MB, and I can't upload more than 1MB here! =)
Do you want me to extract some data for you?
For example, do you need entries like these?

{
            "offset": "0x3900001000",
            "length": "0x1000" 
        },
        {
            "offset": "0x3900005000",
            "length": "0x1000" 
        },
        {
            "offset": "0x3900007000",
            "length": "0x4000" 
        },
        {
            "offset": "0x390000d000",
            "length": "0x2000" 
        },
        {
            "offset": "0x383b909000",
            "length": "0x1000" 
        },
        {
            "offset": "0x383b90c000",
            "length": "0x2000" 
        },

Or am I going about it the wrong way? =)

Actions #17

Updated by Igor Fedotov about 2 years ago

Hi Seena,
no. that's right. The dump might be pretty large if your disk is very big. But unfortunately I need the full dump to simulate the allocation. So please use some external cloud storage to transfer it.

As for osd startup log I primarily need the line which has "_allocate unable to allocate 0x" substring as it contains all the failing allocation attempt's details. So you can omit sending the full OSD log...
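For example, that line can be pulled out of the startup log with something like (adjust the log path to the affected OSD):

grep -a '_allocate unable to allocate 0x' /var/log/ceph/ceph-osd.<id>.log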

Actions #18

Updated by Seena Fallah about 2 years ago

Igor Fedotov wrote:

Hi Seena,
no. that's right. The dump might be pretty large if your disk is very big. But unfortunately I need the full dump to simulate the allocation. So please use some external cloud storage to transfer it.

As for osd startup log I primarily need the line which has "_allocate unable to allocate 0x" substring as it contains all the failing allocation attempt's details. So you can omit sending the full OSD log...

Here you can find the free dump: https://ceph-debug.s3.ir-thr-at1.arvanstorage.com/osd.92.free-dump.tar.gz

This is the allocation that failed on OSD:

2022-02-16T20:19:35.333+0330 7f78ec092080  1 bluefs _allocate unable to allocate 0x80000 on bdev 1, allocator name block, allocator type hybrid, capacity 0x1bf1f800000, block size 0x1000, free 0x50ccc46000, fragmentation 0.876362, allocated 0x0
2022-02-16T20:19:35.333+0330 7f78ec092080 -1 bluefs _allocate allocation failed, needed 0x73d8e

Actions #19

Updated by Seena Fallah about 2 years ago

Igor Fedotov wrote:

Hi Seena,
no. that's right. The dump might be pretty large if your disk is very big. But unfortunately I need the full dump to simulate the allocation. So please use some external cloud storage to transfer it.

As for osd startup log I primarily need the line which has "_allocate unable to allocate 0x" substring as it contains all the failing allocation attempt's details. So you can omit sending the full OSD log...

Also, I got interested in how you simulate it =)
If it's an open-source tool or a written blog post, please share it; otherwise ignore this :)

Thanks :D

Actions #20

Updated by Igor Fedotov about 2 years ago

  • Related to Bug #53466: OSD is unable to allocate free space for BlueFS added
Actions #21

Updated by Igor Fedotov about 2 years ago

Seena Fallah wrote:

Igor Fedotov wrote:

Hi Seena,
no. that's right. The dump might be pretty large if your disk is very big. But unfortunately I need the full dump to simulate the allocation. So please use some external cloud storage to transfer it.

As for osd startup log I primarily need the line which has "_allocate unable to allocate 0x" substring as it contains all the failing allocation attempt's details. So you can omit sending the full OSD log...

Also, I got interested in how you simulate it =)
If it's an open-source tool or a written blog post, please share it; otherwise ignore this :)

Thanks :D

There is a tool, ceph_test_alloc_replay, in Ceph's codebase which allows one to replay that free-dump output and perform some operations (e.g. an allocation) against the resulting allocator.

One has to build it manually as it isn't published along with regular Ceph binaries though... Just some dev stuff....
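For the curious, such dev-only tools are typically built from a Ceph source checkout roughly like this (a sketch; the target name is taken from the comment above, and the tool's exact command-line arguments should be checked against its own usage output):

git clone https://github.com/ceph/ceph.git
cd ceph
./install-deps.sh
./do_cmake.sh
cd build
cmake --build . --target ceph_test_alloc_replay   # the binary ends up under build/bin/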

Actions #22

Updated by Igor Fedotov about 2 years ago

Seena Fallah wrote:

Igor Fedotov wrote:

Hi Seena,
no. that's right. The dump might be pretty large if your disk is very big. But unfortunately I need the full dump to simulate the allocation. So please use some external cloud storage to transfer it.

As for osd startup log I primarily need the line which has "_allocate unable to allocate 0x" substring as it contains all the failing allocation attempt's details. So you can omit sending the full OSD log...

Here you can find the free dump: https://ceph-debug.s3.ir-thr-at1.arvanstorage.com/osd.92.free-dump.tar.gz

This is the allocation that failed on OSD:
[...]

I'm curious whether the free dump and the log output are from exactly the same OSD? There are some minor mismatches, e.g. in available space, which make me suspect they aren't... If so, could you please share the log output from osd.92?

Actions #23

Updated by Seena Fallah about 2 years ago

Igor Fedotov wrote:

Seena Fallah wrote:

Igor Fedotov wrote:

Hi Seena,
no. that's right. The dump might be pretty large if your disk is very big. But unfortunately I need the full dump to simulate the allocation. So please use some external cloud storage to transfer it.

As for osd startup log I primarily need the line which has "_allocate unable to allocate 0x" substring as it contains all the failing allocation attempt's details. So you can omit sending the full OSD log...

Here you can find the free dump: https://ceph-debug.s3.ir-thr-at1.arvanstorage.com/osd.92.free-dump.tar.gz

This is the allocation that failed on OSD:
[...]

I'm curious whether the free dump and the log output are from exactly the same OSD? There are some minor mismatches, e.g. in available space, which make me suspect they aren't... If so, could you please share the log output from osd.92?

I've dumped both again from the same point in time (start the OSD first, then after it goes down, run the free-dump):
free-dump: https://www.dropbox.com/s/5zkqjrenh1h0g60/osd.92.free-dump.tar.gz?dl=0
full log: https://www.dropbox.com/s/dug0j7avwphcwhp/ceph-osd.92.log?dl=0

Actions #24

Updated by Igor Fedotov about 2 years ago

Seena, thanks a lot for the logs.
So everything is fine with the allocator (I mean there is no bug); your volume just doesn't have enough contiguous 64K chunks to allocate.
That's what I obtained from your free dump:
Free: 0x50ccff6000
Available: 0xa7b00000 Available(strict align): 0x3b0000 alloc_unit: 0x10000

Note that just 0x3b0000 bytes are available for contiguous 64K block allocations.

And from osd.92 log BlueFS needs 0x390000 + 0x80000 bytes:
-4> 2022-02-17T13:14:53.281+0330 7f4141b23080 1 bluefs _allocate unable to allocate 0x80000 on bdev 1, allocator name block, allocator type hybrid, capacity 0x1bf1f800000, block size 0x1000, free 0x50ccc46000, fragmentation 0.876362, allocated 0x0
-3> 2022-02-17T13:14:53.281+0330 7f4141b23080 -1 bluefs _allocate allocation failed, needed 0x73d8e
-2> 2022-02-17T13:14:53.281+0330 7f4141b23080 -1 bluefs _flush_range allocated: 0x390000 offset: 0x383b90 length: 0x801fe

Hence the exception.
The root cause of the issue is the high fragmentation of the main volume combined with the collocation of the DB and main volumes. The main volume uses a 4K allocation unit while the DB uses a 64K one, and under high utilization you can reach a state where there are not enough contiguous 64K blocks available even though the total amount of free space still looks fine.
Unfortunately, at this point I can't recommend anything but redeploying the OSD and using a standalone logical DB volume, even if it physically shares the same disk with the main one. See my comment #6 at https://tracker.ceph.com/issues/53466 for a bit more detail on this approach.
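For illustration, redeploying an OSD with a standalone DB logical volume on the same physical disk could look roughly like this (a sketch; the VG/LV names and sizes are placeholders, and on Proxmox the pveceph tooling is normally used instead of calling ceph-volume directly):

# Sketch only - placeholder VG/LV names and sizes.
lvcreate -n ceph-db-0 -L 30G pve          # dedicated DB LV, so BlueFS no longer competes with user data
lvcreate -n ceph-data-0 -l 100%FREE pve   # main (block) LV
ceph-volume lvm create --bluestore --data pve/ceph-data-0 --block.db pve/ceph-db-0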

Actions #25

Updated by Niklas Hambuechen 29 days ago

@Igor Gajowiak

It seems I hit this bug with a cluster that had initially deployed 16.2.7 and was then upgraded to 18.2.1.

In thread https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/WPD2A2W45OGREMU2AD6WHPTOSATZFNLG/#WF2OFCHMHNTSY6SFEB74NACP7ZAX4LI6
you write:

Quincy brings support for 4K allocation unit but doesn't start using it immediately.
Instead it falls back to 4K when bluefs is unable to allocate more space with the default size.

So I would expect that issue to be gone.

But I still have an OSD crash-looping with:

Mar 28 23:55:26 backupfs-2 ceph-osd[864328]: 2024-03-28T23:55:26.072+0000 7fcf3ca7ee80 -1 bluefs _allocate allocation failed, needed 0x80000
Mar 28 23:55:26 backupfs-2 ceph-osd[864328]: 2024-03-28T23:55:26.072+0000 7fcf3ca7ee80 -1 bluefs _flush_range_F allocated: 0x383000 offset: 0x3824da length: 0x805b1
Mar 28 23:55:26 backupfs-2 ceph-osd[864328]: /build/ceph-18.2.1/src/os/bluestore/BlueFS.cc: In function 'int BlueFS::_flush_range_F(FileWriter*, uint64_t, uint64_t)' thread 7fcf3ca7ee80 time 2024-03-28T23:55:26.072920+0000
Mar 28 23:55:26 backupfs-2 ceph-osd[864328]: /build/ceph-18.2.1/src/os/bluestore/BlueFS.cc: 3396: ceph_abort_msg("bluefs enospc")
Mar 28 23:55:26 backupfs-2 ceph-osd[864328]:  ceph version 18.2.1 (e3fce6809130d78ac0058fc87e537ecd926cd213) reef (stable)
Mar 28 23:55:26 backupfs-2 ceph-osd[864328]:  1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xdc) [0x5642b4dbcdba]
Mar 28 23:55:26 backupfs-2 ceph-osd[864328]:  2: (BlueFS::_flush_range_F(BlueFS::FileWriter*, unsigned long, unsigned long)+0x908) [0x5642b54ff2f8]
Mar 28 23:55:26 backupfs-2 ceph-osd[864328]:  3: (BlueFS::_flush_F(BlueFS::FileWriter*, bool, bool*)+0x90) [0x5642b54ff400]
Mar 28 23:55:26 backupfs-2 ceph-osd[864328]:  4: (BlueFS::append_try_flush(BlueFS::FileWriter*, char const*, unsigned long)+0xd6) [0x5642b55116e6]
Mar 28 23:55:26 backupfs-2 ceph-osd[864328]:  5: (BlueRocksWritableFile::Append(rocksdb::Slice const&)+0x1b) [0x5642b552de4b]
Mar 28 23:55:26 backupfs-2 ceph-osd[864328]:  6: /nix/store/fhqil7cb11jb5ckrkayjllsm173agiyd-rocksdb-7.9.2/lib/librocksdb.so.7(+0x590a56) [0x7fcf3d790a56]
Mar 28 23:55:26 backupfs-2 ceph-osd[864328]:  7: (rocksdb::WritableFileWriter::WriteBuffered(char const*, unsigned long, rocksdb::Env::IOPriority)+0x3fc) [0x7fcf3d7e696c]
Mar 28 23:55:26 backupfs-2 ceph-osd[864328]:  8: (rocksdb::WritableFileWriter::Append(rocksdb::Slice const&, unsigned int, rocksdb::Env::IOPriority)+0x6fe) [0x7fcf3d7ec9be]
Mar 28 23:55:26 backupfs-2 ceph-osd[864328]:  9: (rocksdb::BlockBasedTableBuilder::WriteMaybeCompressedBlock(rocksdb::Slice const&, rocksdb::CompressionType, rocksdb::BlockHandle*, rocksdb::BlockType, rocksdb::Slice const*)+0x1f5) [0x7fcf3d86a7c5]
Mar 28 23:55:26 backupfs-2 ceph-osd[864328]:  10: (rocksdb::BlockBasedTableBuilder::WriteBlock(rocksdb::Slice const&, rocksdb::BlockHandle*, rocksdb::BlockType)+0x124) [0x7fcf3d86af14]
Mar 28 23:55:26 backupfs-2 ceph-osd[864328]:  11: (rocksdb::BlockBasedTableBuilder::WriteBlock(rocksdb::BlockBuilder*, rocksdb::BlockHandle*, rocksdb::BlockType)+0x9f) [0x7fcf3d86c34f]
Mar 28 23:55:26 backupfs-2 ceph-osd[864328]:  12: (rocksdb::BlockBasedTableBuilder::Add(rocksdb::Slice const&, rocksdb::Slice const&)+0x2c7) [0x7fcf3d86e8e7]
Mar 28 23:55:26 backupfs-2 ceph-osd[864328]:  13: (rocksdb::BuildTable(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::VersionSet*, rocksdb::ImmutableDBOptions const&, rocksdb::TableBuilderOptions const&, rocksdb::FileOptions const&, rocksdb::TableCache*, rocksdb::InternalIteratorBase<rocksdb::Slice>*, std::vector<std::unique_ptr<rocksdb::FragmentedRangeTombstoneIterator, std::default_delete<rocksdb::FragmentedRangeTombstoneIterator> >, std::allocator<std::unique_ptr<rocksdb::FragmentedRangeTombstoneIterator, std::default_delete<rocksdb::FragmentedRangeTombstoneIterator> > > >, rocksdb::FileMetaData*, std::vector<rocksdb::BlobFileAddition, std::allocator<rocksdb::BlobFileAddition> >*, std::vector<unsigned long, std::allocator<unsigned long> >, unsigned long, unsigned long, rocksdb::SnapshotChecker*, bool, rocksdb::InternalStats*, rocksdb::IOStatus*, std::shared_ptr<rocksdb::IOTracer> const&, rocksdb::BlobFileCreationReason, rocksdb::SeqnoToTimeMapping const&, rocksdb::EventLogger*, int, rocksdb::Env::IOPriority, rocksdb::TableProperties*, rocksdb::Env::WriteLifeTimeHint, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const*, rocksdb::BlobFileCompletionCallback*, unsigned >
Mar 28 23:55:26 backupfs-2 ceph-osd[864328]:  14: (rocksdb::DBImpl::WriteLevel0TableForRecovery(int, rocksdb::ColumnFamilyData*, rocksdb::MemTable*, rocksdb::VersionEdit*)+0x112b) [0x7fcf3d6308bb]
Mar 28 23:55:26 backupfs-2 ceph-osd[864328]:  15: (rocksdb::DBImpl::RecoverLogFiles(std::vector<unsigned long, std::allocator<unsigned long> > const&, unsigned long*, bool, bool*, rocksdb::DBImpl::RecoveryContext*)+0x1d15) [0x7fcf3d633825]
Mar 28 23:55:26 backupfs-2 ceph-osd[864328]:  16: (rocksdb::DBImpl::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool, bool, bool, unsigned long*, rocksdb::DBImpl::RecoveryContext*)+0x1891) [0x7fcf3d635ab1]
Mar 28 23:55:26 backupfs-2 ceph-osd[864328]:  17: (rocksdb::DBImpl::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**, bool, bool)+0x7df) [0x7fcf3d62c43f]
Mar 28 23:55:26 backupfs-2 ceph-osd[864328]:  18: (rocksdb::DB::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**)+0x24) [0x7fcf3d62e6f4]
Mar 28 23:55:26 backupfs-2 ceph-osd[864328]:  19: (RocksDBStore::do_open(std::ostream&, bool, bool, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xb43) [0x5642b5a96253]
Mar 28 23:55:26 backupfs-2 ceph-osd[864328]:  20: (BlueStore::_open_db(bool, bool, bool)+0x658) [0x5642b544a4e8]
Mar 28 23:55:26 backupfs-2 ceph-osd[864328]:  21: (BlueStore::_open_db_and_around(bool, bool)+0x481) [0x5642b5493711]
Mar 28 23:55:26 backupfs-2 ceph-osd[864328]:  22: (BlueStore::_mount()+0x38d) [0x5642b549648d]
Mar 28 23:55:26 backupfs-2 ceph-osd[864328]:  23: (OSD::init()+0x4f1) [0x5642b4edd881]
Mar 28 23:55:26 backupfs-2 ceph-osd[864328]:  24: main()
Mar 28 23:55:26 backupfs-2 ceph-osd[864328]:  25: /nix/store/7jiqcrg061xi5clniy7z5pvkc4jiaqav-glibc-2.38-27/lib/libc.so.6(+0x280ce) [0x7fcf3c63f0ce]
Mar 28 23:55:26 backupfs-2 ceph-osd[864328]:  26: __libc_start_main()
Mar 28 23:55:26 backupfs-2 ceph-osd[864328]:  27: _start()

I've attached the stderr (journalctl) and OSD log files.

I believe the reason this appeared for me is https://tracker.ceph.com/issues/65199, where a CephFS "default" pool (added to follow the docs-recommended best practice for CephFS-on-EC) created a huge number of objects on my SSDs, which were too small.

As shown in that bug, these inodes count as "0 bytes", so the fullness ratios may not prevent an OSD from becoming 100% full, if I understand correctly.
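For reference, the per-pool object counts versus stored bytes (where such 0-byte objects show up) can be checked with standard commands:

ceph df detail   # per-pool STORED bytes and OBJECTS counts
rados df         # per-pool object counts and space usage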

Actions #26

Updated by Niklas Hambuechen 29 days ago

Small addition: `ceph osd status` shows the following:

# ceph osd status
ID  HOST          USED  AVAIL  WR OPS  WR DATA  RD OPS  RD DATA  STATE                
 0  backupfs-1    122G  45.3G      0        0       0        0   exists,up          
...
12  backupfs-2      0      0       0        0       0        0   autoout,exists,full  
Actions #27

Updated by Igor Fedotov 29 days ago

Hi Niklas,

it looks like your BlueFS is almost full. Given the following output:
-5> 2024-03-28T23:55:26.039+0000 7fcf3ca7ee80 1 bluefs _allocate unable to allocate 0x90000 on bdev 1, allocator name block, allocator type hybrid, capacity 0x29e7e00000, block size 0x1000, alloc unit 0x10000, free 0x3b8000, fragmentation 0.16, allocated 0x0
-4> 2024-03-28T23:55:26.072+0000 7fcf3ca7ee80 1 bluefs _allocate unable to allocate 0x80000 on bdev 1, allocator name block, allocator type hybrid, capacity 0x29e7e00000, block size 0x1000, alloc unit 0x1000, free 0x35000, fragmentation 0.56, allocated 0x35000

we can see "free 0x3b8000" (=3899392) bytes are free only.
There is small discrepancy as the requested amount is 0x90000 (=589824) bytes which is less than free space and hence the allocation request should be rather satisfied. But that just makes the failure happen a couple allocation iterations earlier as free space is almost over.
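Just to make the hex comparison explicit (a trivial shell check, nothing cluster-specific):

printf '%d %d\n' 0x3b8000 0x90000   # 3899392 589824 - the request is indeed smaller than the reported free space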

Actions #28

Updated by Niklas Hambuechen 28 days ago

Thanks Igor for investigating.

I worked around it now by re-provisioning the OSD, and letting redundancy fix it.

But I feel I was lucky that it happened only to 1 OSD, so I had no downtime in the cluster.

Should it be possible for BlueFS to become this full? Do you think it's because of the 0-byte inodes, as I suspected above?

Actions #29

Updated by Igor Fedotov 28 days ago

You're using a DB volume collocated with the main one, so BlueFS shares the same space as user data, and your OSD ran out of that space.
I can't say for sure what triggered it, but it is highly likely the issue you mentioned: "created a huge amount of objects on my SSDs that were too small".
I believe you should have been seeing some alerts about OSDs being too full in the 'ceph -s' output or similar...
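For reference, those fullness warnings and the ratios behind them can be inspected with standard commands (and, strictly as a temporary emergency measure, the full ratio can be relaxed):

ceph health detail                                                      # nearfull/backfillfull/full warnings per OSD
ceph osd dump | grep -E 'full_ratio|backfillfull_ratio|nearfull_ratio'  # current thresholds
# ceph osd set-full-ratio 0.97                                          # emergency-only; revert afterwards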
