Bug #43370
OSD crash in function BlueFS::_flush_range with ceph_abort_msg "bluefs enospc"
Status: Rejected
Priority: Normal
Assignee: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
Setup:
- 5x Ubuntu 16.04 VM, 2GB RAM, 16GB root disk
- 2 VMs have 5GB extra (HDD) disk, 1 OSD per disk
- 2 VMs have 2GB extra (SSD) disk, 1 OSD per disk
- 3 monitors (in quorum), 2 managers (active+standby), 1 RGW
- All nodes run Ceph version 14.2.4 (Nautilus), fresh install
- OSDs created using
ceph-deploy osd create --data /dev/sdb node{2,3,4,5}
The cluster had run in HEALTH_OK for about a week, without any explicit I/O, when one of the smaller (SSD-backed) OSDs crashed. From journalctl:
Dec 13 04:13:55 node5 ceph-osd[10184]: 2019-12-13 04:13:55.914 7f9f0f83b700 -1 bluefs _allocate failed to allocate 0x100000 on bdev 1, free 0x0
Dec 13 04:13:55 node5 ceph-osd[10184]: 2019-12-13 04:13:55.914 7f9f0f83b700 -1 bluefs _flush_range allocated: 0xce00000 offset: 0xcdffe68 length: 0x40d
Dec 13 04:13:55 node5 ceph-osd[10184]: /build/ceph-14.2.4/src/os/bluestore/BlueFS.cc: In function 'int BlueFS::_flush_range(BlueFS::FileWriter*, uint64_t, uint64_t)' thread 7f9f0f83b700 time 2019-12-13 04:13:55.919740
Dec 13 04:13:55 node5 ceph-osd[10184]: /build/ceph-14.2.4/src/os/bluestore/BlueFS.cc: 2132: ceph_abort_msg("bluefs enospc")
Dec 13 04:13:55 node5 ceph-osd[10184]: ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus (stable)
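For reference, the hex values in the abort line are self-consistent: the write runs just past the end of the file's allocated extent, so BlueFS must allocate a new extent and finds nothing free. A quick arithmetic check (my own illustration, not part of the original log):

```python
# Hex values copied from the journalctl lines above.
allocated = 0xce00000  # bytes already allocated to the BlueFS file
offset    = 0xcdffe68  # write offset reported by _flush_range
length    = 0x40d      # write length reported by _flush_range

end = offset + length
assert end > allocated       # the write extends past the allocated extent
print(hex(end))              # 0xce00275
print(hex(end - allocated))  # 0x275 bytes beyond the extent
# BlueFS therefore asks the allocator for more space (the 0x100000, i.e.
# 1 MiB, request seen in the log), but bdev 1 reports free 0x0, hence
# the "bluefs enospc" abort.
```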
The OSD then tries to restart, but fails with:
Dec 13 04:13:58 node5 ceph-osd[33450]: 2019-12-13 04:13:58.834 7f7bfc889f80 -1 Falling back to public interface
Dec 13 04:14:28 node5 ceph-osd[33450]: 2019-12-13 04:14:28.878 7f7bfc889f80 -1 bluefs _allocate failed to allocate 0x400000 on bdev 1, free 0x0
Dec 13 04:14:28 node5 ceph-osd[33450]: /build/ceph-14.2.4/src/os/bluestore/BlueFS.cc: In function 'void BlueFS::_compact_log_async(std::unique_lock<std::mutex>&)' thread 7f7bfc889f80 time 2019-12-13 04:14:28.881756
Dec 13 04:14:28 node5 ceph-osd[33450]: /build/ceph-14.2.4/src/os/bluestore/BlueFS.cc: 1809: FAILED ceph_assert(r == 0)
Dec 13 04:14:28 node5 ceph-osd[33450]: ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus (stable)
This last error repeats until systemd's restart limit stops the service.
$ sudo ceph -s
  cluster:
    id:     77b0f639-26c6-4d18-a41d-90599c28ca05
    health: HEALTH_WARN
            Degraded data redundancy: 6/473 objects degraded (1.268%), 6 pgs degraded, 129 pgs undersized
            13 pgs not deep-scrubbed in time

  services:
    mon: 3 daemons, quorum node1,node3,node5 (age 2w)
    mgr: node2(active, since 2w), standbys: node4
    osd: 4 osds: 3 up (since 5d), 3 in (since 5d)
    rgw: 1 daemon active (node1)

  data:
    pools:   7 pools, 296 pgs
    objects: 241 objects, 20 MiB
    usage:   3.2 GiB used, 5.8 GiB / 9 GiB avail
    pgs:     6/473 objects degraded (1.268%)
             167 active+clean
             123 active+undersized
             6   active+undersized+degraded
$ sudo ceph -w
[..]
2019-12-04 12:14:29.011074 mon.node1 [INF] Health check cleared: POOL_APP_NOT_ENABLED (was: application not enabled on 1 pool(s))
2019-12-04 12:14:29.012031 mon.node1 [INF] Cluster is now healthy
2019-12-04 13:00:00.000157 mon.node1 [INF] overall HEALTH_OK
2019-12-04 14:00:00.000164 mon.node1 [INF] overall HEALTH_OK
[..]
2019-12-13 03:00:00.000204 mon.node1 [INF] overall HEALTH_OK
2019-12-13 04:00:00.000202 mon.node1 [INF] overall HEALTH_OK
2019-12-13 04:13:56.491781 mon.node1 [INF] osd.3 failed (root=default,datacenter=nijmegen,host=node5) (connection refused reported by osd.1)
2019-12-13 04:13:56.544101 mon.node1 [WRN] Health check failed: 1 osds down (OSD_DOWN)
2019-12-13 04:13:56.544156 mon.node1 [WRN] Health check failed: 1 host (1 osds) down (OSD_HOST_DOWN)
2019-12-13 04:13:59.616177 mon.node1 [WRN] Health check failed: Reduced data availability: 64 pgs inactive, 136 pgs peering (PG_AVAILABILITY)
2019-12-13 04:14:03.014943 mon.node1 [WRN] Health check failed: Degraded data redundancy: 116/473 objects degraded (24.524%), 16 pgs degraded (PG_DEGRADED)
2019-12-13 04:14:03.015008 mon.node1 [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 64 pgs inactive, 136 pgs peering)
2019-12-13 04:14:59.060075 mon.node1 [WRN] Health check update: Degraded data redundancy: 116/473 objects degraded (24.524%), 16 pgs degraded, 144 pgs undersized (PG_DEGRADED)
2019-12-13 04:23:59.966904 mon.node1 [INF] Marking osd.3 out (has been down for 603 seconds)
2019-12-13 04:23:59.967441 mon.node1 [INF] Health check cleared: OSD_DOWN (was: 1 osds down)
2019-12-13 04:23:59.967517 mon.node1 [INF] Health check cleared: OSD_HOST_DOWN (was: 1 host (1 osds) down)
2019-12-13 04:24:01.983211 mon.node1 [WRN] Health check update: Degraded data redundancy: 60/473 objects degraded (12.685%), 13 pgs degraded, 137 pgs undersized (PG_DEGRADED)
2019-12-13 04:24:08.093961 mon.node1 [WRN] Health check update: Degraded data redundancy: 110/473 objects degraded (23.256%), 11 pgs degraded, 129 pgs undersized (PG_DEGRADED)
2019-12-13 04:24:13.601381 mon.node1 [WRN] Health check update: Degraded data redundancy: 62/473 objects degraded (13.108%), 9 pgs degraded, 129 pgs undersized (PG_DEGRADED)
2019-12-13 04:24:19.973129 mon.node1 [WRN] Health check update: Degraded data redundancy: 41/473 objects degraded (8.668%), 8 pgs degraded, 129 pgs undersized (PG_DEGRADED)
2019-12-13 04:24:24.974256 mon.node1 [WRN] Health check update: Degraded data redundancy: 6/473 objects degraded (1.268%), 6 pgs degraded, 129 pgs undersized (PG_DEGRADED)
2019-12-13 05:00:00.000239 mon.node1 [WRN] overall HEALTH_WARN Degraded data redundancy: 6/473 objects degraded (1.268%), 6 pgs degraded, 129 pgs undersized
2019-12-13 06:00:00.000237 mon.node1 [WRN] overall HEALTH_WARN Degraded data redundancy: 6/473 objects degraded (1.268%), 6 pgs degraded, 129 pgs undersized
[..]
$ sudo ceph osd tree
ID  CLASS WEIGHT  TYPE NAME                   STATUS REWEIGHT PRI-AFF
 -1       0.00970 root default
-16       0.00970     datacenter nijmegen
 -3       0.00388         host node2
  0   hdd 0.00388             osd.0               up  1.00000 1.00000
 -5       0.00388         host node3
  1   hdd 0.00388             osd.1               up  1.00000 1.00000
 -7       0.00098         host node4
  2   ssd 0.00098             osd.2               up  1.00000 1.00000
 -9       0.00098         host node5
  3   ssd 0.00098             osd.3             down        0 1.00000
Please let me know if you need more information.
A truncated log (due to the attachment size limit) from /var/lib/ceph/crash is attached; it contains lines from a few minutes before the crash through the end of the file.