Bug #43370

closed

OSD crash in function bluefs::_flush_range with ceph_abort_msg "bluefs enospc"

Added by Gerdriaan Mulder over 4 years ago. Updated almost 4 years ago.

Status:
Rejected
Priority:
Normal
Assignee:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Setup:
  • 5x Ubuntu 16.04 VM, 2GB RAM, 16GB root disk
    • 2 VMs have 5GB extra (HDD) disk, 1 OSD per disk
    • 2 VMs have 2GB extra (SSD) disk, 1 OSD per disk
  • 3 monitors (in quorum), 2 managers (active+standby), 1 RGW
  • All nodes run Ceph version 14.2.4 (Nautilus), fresh install
  • OSDs created using ceph-deploy osd create --data /dev/sdb node{2,3,4,5}

The cluster had been in HEALTH_OK for about a week, without any explicit I/O, when one of the smaller OSDs crashed. From journalctl:

Dec 13 04:13:55 node5 ceph-osd[10184]: 2019-12-13 04:13:55.914 7f9f0f83b700 -1 bluefs _allocate failed to allocate 0x100000 on bdev 1, free 0x0 
Dec 13 04:13:55 node5 ceph-osd[10184]: 2019-12-13 04:13:55.914 7f9f0f83b700 -1 bluefs _flush_range allocated: 0xce00000 offset: 0xcdffe68 length: 0x40d
Dec 13 04:13:55 node5 ceph-osd[10184]: /build/ceph-14.2.4/src/os/bluestore/BlueFS.cc: In function 'int BlueFS::_flush_range(BlueFS::FileWriter*, uint64_t, uint64_t)' thread 7f9f0f83b700 time 2019-12-13 04:13:55.919740
Dec 13 04:13:55 node5 ceph-osd[10184]: /build/ceph-14.2.4/src/os/bluestore/BlueFS.cc: 2132: ceph_abort_msg("bluefs enospc")
Dec 13 04:13:55 node5 ceph-osd[10184]:  ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus (stable)

The OSD then tries to restart, but fails with:

Dec 13 04:13:58 node5 ceph-osd[33450]: 2019-12-13 04:13:58.834 7f7bfc889f80 -1 Falling back to public interface
Dec 13 04:14:28 node5 ceph-osd[33450]: 2019-12-13 04:14:28.878 7f7bfc889f80 -1 bluefs _allocate failed to allocate 0x400000 on bdev 1, free 0x0
Dec 13 04:14:28 node5 ceph-osd[33450]: /build/ceph-14.2.4/src/os/bluestore/BlueFS.cc: In function 'void BlueFS::_compact_log_async(std::unique_lock<std::mutex>&)' thread 7f7bfc889f80 time 2019-12-13 04:14:28.881756
Dec 13 04:14:28 node5 ceph-osd[33450]: /build/ceph-14.2.4/src/os/bluestore/BlueFS.cc: 1809: FAILED ceph_assert(r == 0)
Dec 13 04:14:28 node5 ceph-osd[33450]:  ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus (stable)

The last error repeats until systemd's restart limit for this service kicks in.

$ sudo ceph -s
  cluster:
    id:     77b0f639-26c6-4d18-a41d-90599c28ca05
    health: HEALTH_WARN
            Degraded data redundancy: 6/473 objects degraded (1.268%), 6 pgs degraded, 129 pgs undersized
            13 pgs not deep-scrubbed in time

  services:
    mon: 3 daemons, quorum node1,node3,node5 (age 2w)
    mgr: node2(active, since 2w), standbys: node4
    osd: 4 osds: 3 up (since 5d), 3 in (since 5d)
    rgw: 1 daemon active (node1)

  data:
    pools:   7 pools, 296 pgs
    objects: 241 objects, 20 MiB
    usage:   3.2 GiB used, 5.8 GiB / 9 GiB avail
    pgs:     6/473 objects degraded (1.268%)
             167 active+clean
             123 active+undersized
             6   active+undersized+degraded
$ sudo ceph -w
[..]
2019-12-04 12:14:29.011074 mon.node1 [INF] Health check cleared: POOL_APP_NOT_ENABLED (was: application not enabled on 1 pool(s))
2019-12-04 12:14:29.012031 mon.node1 [INF] Cluster is now healthy
2019-12-04 13:00:00.000157 mon.node1 [INF] overall HEALTH_OK
2019-12-04 14:00:00.000164 mon.node1 [INF] overall HEALTH_OK
[..]
2019-12-13 03:00:00.000204 mon.node1 [INF] overall HEALTH_OK
2019-12-13 04:00:00.000202 mon.node1 [INF] overall HEALTH_OK
2019-12-13 04:13:56.491781 mon.node1 [INF] osd.3 failed (root=default,datacenter=nijmegen,host=node5) (connection refused reported by osd.1)
2019-12-13 04:13:56.544101 mon.node1 [WRN] Health check failed: 1 osds down (OSD_DOWN)
2019-12-13 04:13:56.544156 mon.node1 [WRN] Health check failed: 1 host (1 osds) down (OSD_HOST_DOWN)
2019-12-13 04:13:59.616177 mon.node1 [WRN] Health check failed: Reduced data availability: 64 pgs inactive, 136 pgs peering (PG_AVAILABILITY)
2019-12-13 04:14:03.014943 mon.node1 [WRN] Health check failed: Degraded data redundancy: 116/473 objects degraded (24.524%), 16 pgs degraded (PG_DEGRADED)
2019-12-13 04:14:03.015008 mon.node1 [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 64 pgs inactive, 136 pgs peering)
2019-12-13 04:14:59.060075 mon.node1 [WRN] Health check update: Degraded data redundancy: 116/473 objects degraded (24.524%), 16 pgs degraded, 144 pgs undersized (PG_DEGRADED)
2019-12-13 04:23:59.966904 mon.node1 [INF] Marking osd.3 out (has been down for 603 seconds)
2019-12-13 04:23:59.967441 mon.node1 [INF] Health check cleared: OSD_DOWN (was: 1 osds down)
2019-12-13 04:23:59.967517 mon.node1 [INF] Health check cleared: OSD_HOST_DOWN (was: 1 host (1 osds) down)
2019-12-13 04:24:01.983211 mon.node1 [WRN] Health check update: Degraded data redundancy: 60/473 objects degraded (12.685%), 13 pgs degraded, 137 pgs undersized (PG_DEGRADED)
2019-12-13 04:24:08.093961 mon.node1 [WRN] Health check update: Degraded data redundancy: 110/473 objects degraded (23.256%), 11 pgs degraded, 129 pgs undersized (PG_DEGRADED)
2019-12-13 04:24:13.601381 mon.node1 [WRN] Health check update: Degraded data redundancy: 62/473 objects degraded (13.108%), 9 pgs degraded, 129 pgs undersized (PG_DEGRADED)
2019-12-13 04:24:19.973129 mon.node1 [WRN] Health check update: Degraded data redundancy: 41/473 objects degraded (8.668%), 8 pgs degraded, 129 pgs undersized (PG_DEGRADED)
2019-12-13 04:24:24.974256 mon.node1 [WRN] Health check update: Degraded data redundancy: 6/473 objects degraded (1.268%), 6 pgs degraded, 129 pgs undersized (PG_DEGRADED)
2019-12-13 05:00:00.000239 mon.node1 [WRN] overall HEALTH_WARN Degraded data redundancy: 6/473 objects degraded (1.268%), 6 pgs degraded, 129 pgs undersized
2019-12-13 06:00:00.000237 mon.node1 [WRN] overall HEALTH_WARN Degraded data redundancy: 6/473 objects degraded (1.268%), 6 pgs degraded, 129 pgs undersized
[..]
$ sudo ceph osd tree
ID  CLASS WEIGHT  TYPE NAME               STATUS REWEIGHT PRI-AFF 
 -1       0.00970 root default                                    
-16       0.00970     datacenter nijmegen                         
 -3       0.00388         host node2                              
  0   hdd 0.00388             osd.0           up  1.00000 1.00000 
 -5       0.00388         host node3                              
  1   hdd 0.00388             osd.1           up  1.00000 1.00000 
 -7       0.00098         host node4                              
  2   ssd 0.00098             osd.2           up  1.00000 1.00000 
 -9       0.00098         host node5                              
  3   ssd 0.00098             osd.3         down        0 1.00000 

Please let me know if you need more information.

A truncated log (due to the attachment size limit) from /var/lib/ceph/crash is attached. It covers the period from a few minutes before the crash until the end of the file.


Files

2019-12-18_log_truncated_1000KB.txt (26 KB) /var/lib/ceph/crash/posted/datetime_uuid/log Gerdriaan Mulder, 12/18/2019 02:33 PM
Actions #1

Updated by Igor Fedotov over 4 years ago

Looks like you're simply lacking free space for WAL/DB. The following output confirms this:

2019-12-13 04:13:55.914 7f9f0f83b700 1 bluefs _allocate unable to allocate 0x100000 on bdev 2, free 0xffffffffffffffff; fallback to slow device expander
-3> 2019-12-13 04:13:55.914 7f9f0f83b700 -1 bluefs _allocate failed to allocate 0x100000 on bdev 1, free 0x0
-2> 2019-12-13 04:13:55.914 7f9f0f83b700 -1 bluefs _flush_range allocated: 0xce00000 offset: 0xcdffe68 length: 0x40d
-1> 2019-12-13 04:13:55.938 7f9f0f83b700 -1 /build/ceph-14.2.4/src/os/bluestore/BlueFS.cc: In function 'int BlueFS::_flush_range(BlueFS::FileWriter*, uint64_t, uint64_t)' thread 7f9f0f83b700 time 2019-12-13 04:13:55.919740
/build/ceph-14.2.4/src/os/bluestore/BlueFS.cc: 2132: ceph_abort_msg("bluefs enospc")

I can't determine the real BlueStore device layout from the info you provided, but you mentioned 2/4 GB drives, and I can see a 1 GB (= 0x40000000) volume (which, AFAIR, sums the main and DB devices) in this line:
-72> 2019-12-13 04:12:48.004 7f9efa35e700 5 osd.3 111 heartbeat osd_stat(store_statfs(0x50000/0x3f100000/0x40000000, data 0x881e0/0xe10000, compress 0x0/0x0/0x0, omap 0x0, meta 0x3f100000), peers [0,1,2] op hist [])

In my experience, 1 GB is terribly low for any OSD deployment, so I suggest simply increasing the available space to 10 GB.
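
For reference, the hex figures in that line convert as follows; a minimal sketch in plain Python (nothing Ceph-specific, the values are just copied from the store_statfs line above):

def human(n):
    # Step through binary units until the value is small enough to print.
    for unit in ("B", "KiB", "MiB", "GiB", "TiB"):
        if n < 1024:
            return f"{n:.0f} {unit}"
        n /= 1024
    return f"{n:.0f} PiB"

# Values copied from store_statfs(0x50000/0x3f100000/0x40000000, ..., meta 0x3f100000)
print(human(0x40000000))  # 1 GiB    -> the whole volume, as read above
print(human(0x3f100000))  # 1009 MiB -> matches the "meta" figure on the same line
print(human(0x50000))     # 320 KiB  -> presumably what little space is left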

Actions #2

Updated by Gerdriaan Mulder over 4 years ago

Thanks for the clarification. If you can provide me with suitable commands to find the "real device layout for BlueStore", I'd be happy to oblige.

It seems that ceph-deploy osd create --data /dev/sdb does not fully use the available space on the disk?

On the smaller VM:

ceph-system@node4:~$ lsblk
NAME                                                                                                  MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
fd0                                                                                                     2:0    1    4K  0 disk 
sda                                                                                                     8:0    0   16G  0 disk 
├─sda1                                                                                                  8:1    0  243M  0 part /boot/efi
└─sda2                                                                                                  8:2    0 15.8G  0 part /
sdb                                                                                                     8:16   0    2G  0 disk 
└─ceph--e58da8e8--c116--4670--97ff--ba96103533ea-osd--block--a2cc7ef0--ec82--4ca2--a395--eb78027b0d4c 252:0    0    1G  0 lvm  
sr0                                                                                                    11:0    1 1024M  0 rom  

On the bigger VM:

ceph-system@node2:~$ lsblk
NAME                                                                                                  MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
fd0                                                                                                     2:0    1    4K  0 disk 
sda                                                                                                     8:0    0   16G  0 disk 
├─sda1                                                                                                  8:1    0  243M  0 part /boot/efi
└─sda2                                                                                                  8:2    0 15.8G  0 part /
sdb                                                                                                     8:16   0    5G  0 disk 
└─ceph--1b971f0b--3fb1--48b2--a8ef--c62ac590de8a-osd--block--7da6c451--c890--4474--b81d--7eac644a2ef0 252:0    0    4G  0 lvm  
sr0                                                                                                    11:0    1 1024M  0 rom  

(identical for node3 and node5)

What I find peculiar is that the cluster seems to fill up on its own. As I mentioned, the cluster was pretty much (if not completely) idle. So, in theory, this could also happen with a 10 GB OSD; it would just take longer to notice.
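
One way to watch this gradual fill-up would be to poll the BlueFS counters through the OSD admin socket. A rough sketch only, assuming ceph daemon osd.<id> perf dump is available on the OSD host; it just prints whatever byte-valued counters the bluefs section reports, so no release-specific counter names are hard-coded:

import json
import subprocess

def bluefs_usage(osd_id):
    # Ask the OSD for its perf counters via the admin socket.
    out = subprocess.check_output(["ceph", "daemon", f"osd.{osd_id}", "perf", "dump"])
    perf = json.loads(out)
    # Print only the byte-valued counters of the bluefs section.
    for key, value in sorted(perf.get("bluefs", {}).items()):
        if key.endswith("_bytes"):
            print(f"{key:>24}: {value / 2**20:8.1f} MiB")

if __name__ == "__main__":
    bluefs_usage(2)  # osd.2 in this cluster; run this on its host (node4)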

When checking the logs, osd.2 also failed:

2019-12-21 01:23:25.639041 mon.node1 [WRN] Health check update: 36 pgs not deep-scrubbed in time (PG_NOT_DEEP_SCRUBBED)
2019-12-21 02:00:00.000241 mon.node1 [WRN] overall HEALTH_WARN Degraded data redundancy: 6/473 objects degraded (1.268%), 6 pgs degraded, 129 pgs undersized; 36 pgs not deep-scrubbed in time
2019-12-21 02:19:06.618841 mon.node1 [INF] osd.2 failed (root=default,datacenter=nijmegen,host=node4) (connection refused reported by osd.1)
2019-12-21 02:19:06.674791 mon.node1 [WRN] Health check failed: 1 osds down (OSD_DOWN)
2019-12-21 02:19:06.674853 mon.node1 [WRN] Health check failed: 1 host (1 osds) down (OSD_HOST_DOWN)
2019-12-21 02:19:09.761255 mon.node1 [WRN] Health check update: Degraded data redundancy: 41/473 objects degraded (8.668%), 9 pgs degraded, 129 pgs undersized (PG_DEGRADED)
2019-12-21 02:19:19.466891 mon.node1 [WRN] Health check update: Degraded data redundancy: 90/473 objects degraded (19.027%), 13 pgs degraded, 129 pgs undersized (PG_DEGRADED)
2019-12-21 02:20:07.555790 mon.node1 [WRN] Health check failed: Reduced data availability: 128 pgs stale (PG_AVAILABILITY)
2019-12-21 02:20:09.624054 mon.node1 [WRN] Health check update: Degraded data redundancy: 90/473 objects degraded (19.027%), 13 pgs degraded, 143 pgs undersized (PG_DEGRADED)
2019-12-21 02:28:08.219386 mon.node1 [WRN] Health check update: 37 pgs not deep-scrubbed in time (PG_NOT_DEEP_SCRUBBED)
2019-12-21 02:29:09.594855 mon.node1 [INF] Marking osd.2 out (has been down for 602 seconds)
2019-12-21 02:29:09.595486 mon.node1 [INF] Health check cleared: OSD_DOWN (was: 1 osds down)
2019-12-21 02:29:09.595524 mon.node1 [INF] Health check cleared: OSD_HOST_DOWN (was: 1 host (1 osds) down)

On the host (node4):

Dec 21 02:19:06 node4 ceph-osd[9962]: 2019-12-21 02:19:06.050 7f8454dcb700 -1 bluefs _allocate failed to allocate 0x400000 on bdev 1, free 0x0
Dec 21 02:19:06 node4 ceph-osd[9962]: /build/ceph-14.2.4/src/os/bluestore/BlueFS.cc: In function 'int
    BlueFS::_flush_and_sync_log(std::unique_lock<std::mutex>&, uint64_t, uint64_t)' thread 7f8454dcb700 time 2019-12-21 02:19:06.055308
Dec 21 02:19:06 node4 ceph-osd[9962]: /build/ceph-14.2.4/src/os/bluestore/BlueFS.cc: 2001: FAILED ceph_assert(r == 0)
Dec 21 02:19:06 node4 ceph-osd[9962]:  ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus (stable)
Dec 21 02:19:06 node4 ceph-osd[9962]:  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x152) [0x85c3c8]
Dec 21 02:19:06 node4 ceph-osd[9962]:  2: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x85c5a3]
Dec 21 02:19:06 node4 ceph-osd[9962]:  3: (BlueFS::_flush_and_sync_log(std::unique_lock<std::mutex>&, unsigned long, unsigned long)+0x1a30) [0xe86f50]
Dec 21 02:19:06 node4 ceph-osd[9962]:  4: (BlueFS::_fsync(BlueFS::FileWriter*, std::unique_lock<std::mutex>&)+0x9c) [0xe881fc]
Dec 21 02:19:06 node4 ceph-osd[9962]:  5: (BlueRocksWritableFile::Sync()+0x63) [0xea62c3]
Dec 21 02:19:06 node4 ceph-osd[9962]:  6: (rocksdb::WritableFileWriter::SyncInternal(bool)+0x3e9) [0x15184b9]
Dec 21 02:19:06 node4 ceph-osd[9962]:  7: (rocksdb::WritableFileWriter::Sync(bool)+0x376) [0x151b8a6]
Dec 21 02:19:06 node4 ceph-osd[9962]:  8: (rocksdb::DBImpl::WriteToWAL(rocksdb::WriteThread::WriteGroup const&, rocksdb::log::Writer*, 
    unsigned long*, bool, bool, unsigned long)+0x32c) [0x13866dc]
Dec 21 02:19:06 node4 ceph-osd[9962]:  9: (rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&, rocksdb::WriteBatch*,
    rocksdb::WriteCallback*, unsigned long*, unsigned long, bool, unsigned long*, unsigned long, rocksdb::PreReleaseCallback*)+0x245d) [0x138f75d]
Dec 21 02:19:06 node4 ceph-osd[9962]:  10: (rocksdb::DBImpl::Write(rocksdb::WriteOptions const&, rocksdb::WriteBatch*)+0x30) [0x138fdf0]
Dec 21 02:19:06 node4 ceph-osd[9962]:  11: (RocksDBStore::submit_common(rocksdb::WriteOptions&, std::shared_ptr<KeyValueDB::TransactionImpl>)+0x81) [0xe0e0d1]
Dec 21 02:19:06 node4 ceph-osd[9962]:  12: (RocksDBStore::submit_transaction_sync(std::shared_ptr<KeyValueDB::TransactionImpl>)+0x97) [0xe0eab7]
Dec 21 02:19:06 node4 ceph-osd[9962]:  13: (BlueStore::_kv_sync_thread()+0x1b2a) [0xdc1b8a]
Dec 21 02:19:06 node4 ceph-osd[9962]:  14: (BlueStore::KVSyncThread::entry()+0xd) [0xde352d]
Dec 21 02:19:06 node4 ceph-osd[9962]:  15: (()+0x76ba) [0x7f8463e1f6ba]
Dec 21 02:19:06 node4 ceph-osd[9962]:  16: (clone()+0x6d) [0x7f846342641d]
Dec 21 02:19:06 node4 ceph-osd[9962]: 2019-12-21 02:19:06.078 7f8454dcb700 -1 /build/ceph-14.2.4/src/os/bluestore/BlueFS.cc: In function 'int
    BlueFS::_flush_and_sync_log(std::unique_lock<std::mutex>&, uint64_t, uint64_t)' thread 7f8454dcb700 time 2019-12-21 02:19:06.055308
Dec 21 02:19:06 node4 ceph-osd[9962]: /build/ceph-14.2.4/src/os/bluestore/BlueFS.cc: 2001: FAILED ceph_assert(r == 0)
Dec 21 02:19:06 node4 ceph-osd[9962]:  ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus (stable)
[..]

Actions #3

Updated by Sage Weil about 4 years ago

  • Crash signature (v1) updated (diff)

crash sig (for the record):

{
    "crash_id": "2019-12-13_03:13:55.981131Z_e3f0cf94-b3a9-4acc-8917-4d6dcee735a6",
    "timestamp": "2019-12-13 03:13:55.981131Z",
    "process_name": "ceph-osd",
    "entity_name": "osd.3",
    "ceph_version": "14.2.4",
    "utsname_hostname": "node5",
    "utsname_sysname": "Linux",
    "utsname_release": "4.4.0-169-generic",
    "utsname_version": "#198-Ubuntu SMP Tue Nov 12 10:38:00 UTC 2019",
    "utsname_machine": "x86_64",
    "os_name": "Ubuntu",
    "os_id": "ubuntu",
    "os_version_id": "16.04",
    "os_version": "16.04.6 LTS (Xenial Xerus)",
    "assert_condition": "abort",
    "assert_func": "int BlueFS::_flush_range(BlueFS::FileWriter*, uint64_t, uint64_t)",
    "assert_file": "/build/ceph-14.2.4/src/os/bluestore/BlueFS.cc",
    "assert_line": 2132,
    "assert_thread_name": "bstore_kv_sync",
    "assert_msg": "/build/ceph-14.2.4/src/os/bluestore/BlueFS.cc: In function 'int BlueFS::_flush_range(BlueFS::FileWriter*, uint64_t, uint64_t)' thread 7f9f0f83b700 time 2019-12-13 04:13:55.919740\n/build/ceph-14.2.4/src/os/bluestore/BlueFS.cc: 2132: ceph_abort_msg(\"bluefs enospc\")\n",
    "backtrace": [
        "(()+0x11390) [0x7f9f1e899390]",
        "(gsignal()+0x38) [0x7f9f1ddc4428]",
        "(abort()+0x16a) [0x7f9f1ddc602a]",
        "(ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x1b0) [0x85cb43]",
        "(BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, unsigned long)+0x1e26) [0xe84d16]",
        "(BlueFS::_flush(BlueFS::FileWriter*, bool)+0x11c) [0xe8506c]",
        "(BlueFS::_fsync(BlueFS::FileWriter*, std::unique_lock<std::mutex>&)+0x4d) [0xe881ad]",
        "(BlueRocksWritableFile::Sync()+0x63) [0xea62c3]",
        "(rocksdb::WritableFileWriter::SyncInternal(bool)+0x3e9) [0x15184b9]",
        "(rocksdb::WritableFileWriter::Sync(bool)+0x376) [0x151b8a6]",
        "(rocksdb::DBImpl::WriteToWAL(rocksdb::WriteThread::WriteGroup const&, rocksdb::log::Writer*, unsigned long*, bool, bool, unsigned long)+0x32c) [0x13866dc]",
        "(rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&, rocksdb::WriteBatch*, rocksdb::WriteCallback*, unsigned long*, unsigned long, bool, unsigned long*, unsigned long, rocksdb::PreReleaseCallback*)+0x245d) [0x138f75d]",
        "(rocksdb::DBImpl::Write(rocksdb::WriteOptions const&, rocksdb::WriteBatch*)+0x30) [0x138fdf0]",
        "(RocksDBStore::submit_common(rocksdb::WriteOptions&, std::shared_ptr<KeyValueDB::TransactionImpl>)+0x81) [0xe0e0d1]",
        "(RocksDBStore::submit_transaction_sync(std::shared_ptr<KeyValueDB::TransactionImpl>)+0x97) [0xe0eab7]",
        "(BlueStore::_kv_sync_thread()+0x1b2a) [0xdc1b8a]",
        "(BlueStore::KVSyncThread::entry()+0xd) [0xde352d]",
        "(()+0x76ba) [0x7f9f1e88f6ba]",
        "(clone()+0x6d) [0x7f9f1de9641d]" 
    ]
}

Actions #4

Updated by Igor Fedotov about 4 years ago

Gerdriaan, sorry, I missed your inquiry.
What I wanted was the output of:
ceph-bluestore-tool --path <path-to-osd> --command bluefs-bdev-sizes

Actions #5

Updated by Gerdriaan Mulder almost 4 years ago

Hi Igor,

Output below for each OSD node:

root@node2:~# ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-0/ --command bluefs-bdev-sizes
inferring bluefs devices from bluestore path
 slot 1 /var/lib/ceph/osd/ceph-0/block -> /dev/dm-0
2020-04-20 10:27:44.709 7f704fd81100 -1 bdev(0x3d5c380 /var/lib/ceph/osd/ceph-0/block) _lock flock failed on /var/lib/ceph/osd/ceph-0/block
2020-04-20 10:27:44.709 7f704fd81100 -1 bdev(0x3d5c380 /var/lib/ceph/osd/ceph-0/block) open failed to lock /var/lib/ceph/osd/ceph-0/block: (11) Resource temporarily unavailable
unable to open /var/lib/ceph/osd/ceph-0/block: (11) Resource temporarily unavailable
root@node3:~# ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-1/ --command bluefs-bdev-sizes
inferring bluefs devices from bluestore path
 slot 1 /var/lib/ceph/osd/ceph-1/block -> /dev/dm-0
2020-04-20 10:28:37.524 7f6b2565e100 -1 bdev(0x408a380 /var/lib/ceph/osd/ceph-1/block) _lock flock failed on /var/lib/ceph/osd/ceph-1/block
2020-04-20 10:28:37.524 7f6b2565e100 -1 bdev(0x408a380 /var/lib/ceph/osd/ceph-1/block) open failed to lock /var/lib/ceph/osd/ceph-1/block: (11) Resource temporarily unavailable
unable to open /var/lib/ceph/osd/ceph-1/block: (11) Resource temporarily unavailable
root@node4:~# ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-2/ --command bluefs-bdev-sizes
inferring bluefs devices from bluestore path
 slot 1 /var/lib/ceph/osd/ceph-2/block -> /dev/dm-0
1 : device size 0x40000000 : own 0x[a00000~1400000,2000000~1400000,3900000~3c700000] = 0x3ef00000 : using 0x3ef00000(1007 MiB)
root@node5:~# ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-3/ --command bluefs-bdev-sizes
inferring bluefs devices from bluestore path
 slot 1 /var/lib/ceph/osd/ceph-3/block -> /dev/dm-0
1 : device size 0x40000000 : own 0x[a00000~1400000,2000000~1400000,3700000~3c900000] = 0x3f100000 : using 0x3f100000(1009 MiB)
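
As a sanity check on those numbers, here is a small sketch (plain Python, assuming the bluefs-bdev-sizes output format shown above) that sums the extents in the "own" list and compares them with the device size:

import re

line = ("1 : device size 0x40000000 : own "
        "0x[a00000~1400000,2000000~1400000,3700000~3c900000] "
        "= 0x3f100000 : using 0x3f100000(1009 MiB)")

device_size = int(re.search(r"device size (0x[0-9a-f]+)", line).group(1), 16)
extents = re.search(r"own 0x\[([0-9a-f~,]+)\]", line).group(1)
owned = sum(int(ext.split("~")[1], 16) for ext in extents.split(","))

print(f"device size: {device_size / 2**20:.0f} MiB")   # 1024 MiB
print(f"bluefs owns: {owned / 2**20:.0f} MiB")         # 1009 MiB
print(f"free:        {(device_size - owned) / 2**20:.0f} MiB")  # 15 MiB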

Current cluster status (I've left it running since January):

root@node2:~# ceph -s
  cluster:
    id:     77b0f639-26c6-4d18-a41d-90599c28ca05
    health: HEALTH_WARN
            Reduced data availability: 128 pgs inactive
            Degraded data redundancy: 1/461 objects degraded (0.217%), 1 pg degraded, 8 pgs undersized
            136 pgs not deep-scrubbed in time
            136 pgs not scrubbed in time

  services:
    mon: 3 daemons, quorum node1,node3,node5 (age 3M)
    mgr: node4(active, since 3M), standbys: node2
    osd: 4 osds: 2 up (since 4M), 2 in (since 4M)
    rgw: 1 daemon active (node1)

  data:
    pools:   7 pools, 296 pgs
    objects: 235 objects, 20 MiB
    usage:   2.1 GiB used, 5.9 GiB / 8 GiB avail
    pgs:     43.243% pgs unknown
             1/461 objects degraded (0.217%)
             160 active+clean
             128 unknown
             7   active+undersized
             1   active+undersized+degraded

root@node2:~# ceph osd tree
ID  CLASS WEIGHT  TYPE NAME               STATUS REWEIGHT PRI-AFF 
 -1       0.00970 root default                                    
-16       0.00970     datacenter nijmegen                         
 -3       0.00388         host node2                              
  0   hdd 0.00388             osd.0           up  1.00000 1.00000 
 -5       0.00388         host node3                              
  1   hdd 0.00388             osd.1           up  1.00000 1.00000 
 -7       0.00098         host node4                              
  2   ssd 0.00098             osd.2         down        0 1.00000 
 -9       0.00098         host node5                              
  3   ssd 0.00098             osd.3         down        0 1.00000 
Actions #6

Updated by Igor Fedotov almost 4 years ago

  • Status changed from New to Rejected

Hi Gerdriaan.
Thanks for the update.
First of all, ceph-bluestore-tool has to be executed against an offline OSD, hence the "Resource temporarily unavailable" error for ceph-0 and ceph-1.
But as I can see for ceph-2 and ceph-3, your OSDs each have a single 1 GB volume, and they are now almost out of space, which prevents them from starting.

See:
1 : device size 0x40000000 : own 0x[a00000~1400000,2000000~1400000,3900000~3c700000] = 0x3ef00000 : using 0x3ef00000(1007 MiB)
or
1 : device size 0x40000000 : own 0x[a00000~1400000,2000000~1400000,3700000~3c900000] = 0x3f100000 : using 0x3f100000(1009 MiB)

As I mentioned in comment #1, a one-gigabyte volume per OSD is terribly low. We at SUSE have a H/W recommendation of 20 GB per OSD at minimum: https://documentation.suse.com/ses/5.5/html/ses-all/storage-bp-hwreq.html#deployment-osd-recommendation
Not to mention the community H/W recommendation of a 1 TB drive for production (https://docs.ceph.com/docs/master/start/hardware-recommendations/).

Going to close the ticket; please feel free to reopen it if you think it's still valid.

Actions #7

Updated by Gerdriaan Mulder almost 4 years ago

Thanks for your reply. Good to know that ceph-bluestore-tool needs to run against an offline OSD; I did not check that beforehand.

This setup was meant for testing purposes, so we did not follow the hardware recommendations. However, it is useful to see that such tiny clusters do not stay operational at all.
