Bug #43370 (closed): OSD crash in function bluefs::_flush_range with ceph_abort_msg "bluefs enospc"

Added by Gerdriaan Mulder over 4 years ago. Updated about 4 years ago.

Status: Rejected
Priority: Normal
Assignee: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Setup:
  • 5x Ubuntu 16.04 VM, 2GB RAM, 16GB root disk
    • 2 VMs have 5GB extra (HDD) disk, 1 OSD per disk
    • 2 VMs have 2GB extra (SSD) disk, 1 OSD per disk
  • 3 monitors (in quorum), 2 managers (active+standby), 1 RGW
  • All nodes run Ceph version 14.2.4 (Nautilus), fresh install
  • OSDs created using ceph-deploy osd create --data /dev/sdb node{2,3,4,5} (commands to inspect the resulting OSDs are sketched below)
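
For context on how little room BlueFS has to work with on the 2GB OSDs, per-OSD capacity and the BlueFS counters can be checked as sketched here (a hedged example: osd.3 is simply the OSD that later crashed, and the second command has to be run on its host via the default admin socket):

$ sudo ceph osd df tree                      # capacity, raw use and PG count per OSD
$ sudo ceph daemon osd.3 perf dump bluefs    # BlueFS counters (db_total_bytes, db_used_bytes, ...) on node5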

The cluster had been in HEALTH_OK for about a week, with no explicit client I/O, when one of the smaller (2GB, SSD-backed) OSDs crashed. From journalctl:

Dec 13 04:13:55 node5 ceph-osd[10184]: 2019-12-13 04:13:55.914 7f9f0f83b700 -1 bluefs _allocate failed to allocate 0x100000 on bdev 1, free 0x0 
Dec 13 04:13:55 node5 ceph-osd[10184]: 2019-12-13 04:13:55.914 7f9f0f83b700 -1 bluefs _flush_range allocated: 0xce00000 offset: 0xcdffe68 length: 0x40d
Dec 13 04:13:55 node5 ceph-osd[10184]: /build/ceph-14.2.4/src/os/bluestore/BlueFS.cc: In function 'int BlueFS::_flush_range(BlueFS::FileWriter*, uint64_t, uint64_t)' thread 7f9f0f83b700 time 2019-12-13 04:13:55.919740
Dec 13 04:13:55 node5 ceph-osd[10184]: /build/ceph-14.2.4/src/os/bluestore/BlueFS.cc: 2132: ceph_abort_msg("bluefs enospc")
Dec 13 04:13:55 node5 ceph-osd[10184]:  ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus (stable)
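
Decoding the hex values, just to make the numbers concrete: BlueFS asked the allocator for 0x100000 bytes (1 MiB) on bdev 1 (the DB device in BlueFS terms), which reports free 0x0, while the file being flushed already had 0xce00000 bytes (about 206 MiB) allocated:

$ printf '%d\n' 0x100000     # 1048576 bytes = 1 MiB requested
$ printf '%d\n' 0xce00000    # 215996416 bytes ≈ 206 MiB already allocated to the file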

The OSD then tries to restart, but fails with:

Dec 13 04:13:58 node5 ceph-osd[33450]: 2019-12-13 04:13:58.834 7f7bfc889f80 -1 Falling back to public interface
Dec 13 04:14:28 node5 ceph-osd[33450]: 2019-12-13 04:14:28.878 7f7bfc889f80 -1 bluefs _allocate failed to allocate 0x400000 on bdev 1, free 0x0
Dec 13 04:14:28 node5 ceph-osd[33450]: /build/ceph-14.2.4/src/os/bluestore/BlueFS.cc: In function 'void BlueFS::_compact_log_async(std::unique_lock<std::mutex>&)' thread 7f7bfc889f80 time 2019-12-13 04:14:28.881756
Dec 13 04:14:28 node5 ceph-osd[33450]: /build/ceph-14.2.4/src/os/bluestore/BlueFS.cc: 1809: FAILED ceph_assert(r == 0)
Dec 13 04:14:28 node5 ceph-osd[33450]:  ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus (stable)

This last error repeats on every restart attempt until systemd's start limit stops the service.
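
For what it's worth, the failed restarts look consistent with the first abort: on startup BlueFS tries to compact its internal log (BlueFS::_compact_log_async in the trace above), that compaction itself needs an allocation on bdev 1, and with free 0x0 the ceph_assert(r == 0) fires every time, so the OSD cannot come back up on its own. A possible recovery sketch, assuming the default data path for osd.3 on node5 and that the underlying 2GB device can actually be grown first:

# with the OSD stopped, show the devices as BlueFS sees them
$ sudo ceph-bluestore-tool bluefs-bdev-sizes --path /var/lib/ceph/osd/ceph-3

# after enlarging the underlying disk/LV, let BlueFS expand into the new space
$ sudo ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-3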

$ sudo ceph -s
  cluster:
    id:     77b0f639-26c6-4d18-a41d-90599c28ca05
    health: HEALTH_WARN
            Degraded data redundancy: 6/473 objects degraded (1.268%), 6 pgs degraded, 129 pgs undersized
            13 pgs not deep-scrubbed in time

  services:
    mon: 3 daemons, quorum node1,node3,node5 (age 2w)
    mgr: node2(active, since 2w), standbys: node4
    osd: 4 osds: 3 up (since 5d), 3 in (since 5d)
    rgw: 1 daemon active (node1)

  data:
    pools:   7 pools, 296 pgs
    objects: 241 objects, 20 MiB
    usage:   3.2 GiB used, 5.8 GiB / 9 GiB avail
    pgs:     6/473 objects degraded (1.268%)
             167 active+clean
             123 active+undersized
             6   active+undersized+degraded
$ sudo ceph -w
[..]
2019-12-04 12:14:29.011074 mon.node1 [INF] Health check cleared: POOL_APP_NOT_ENABLED (was: application not enabled on 1 pool(s))
2019-12-04 12:14:29.012031 mon.node1 [INF] Cluster is now healthy
2019-12-04 13:00:00.000157 mon.node1 [INF] overall HEALTH_OK
2019-12-04 14:00:00.000164 mon.node1 [INF] overall HEALTH_OK
[..]
2019-12-13 03:00:00.000204 mon.node1 [INF] overall HEALTH_OK
2019-12-13 04:00:00.000202 mon.node1 [INF] overall HEALTH_OK
2019-12-13 04:13:56.491781 mon.node1 [INF] osd.3 failed (root=default,datacenter=nijmegen,host=node5) (connection refused reported by osd.1)
2019-12-13 04:13:56.544101 mon.node1 [WRN] Health check failed: 1 osds down (OSD_DOWN)
2019-12-13 04:13:56.544156 mon.node1 [WRN] Health check failed: 1 host (1 osds) down (OSD_HOST_DOWN)
2019-12-13 04:13:59.616177 mon.node1 [WRN] Health check failed: Reduced data availability: 64 pgs inactive, 136 pgs peering (PG_AVAILABILITY)
2019-12-13 04:14:03.014943 mon.node1 [WRN] Health check failed: Degraded data redundancy: 116/473 objects degraded (24.524%), 16 pgs degraded (PG_DEGRADED)
2019-12-13 04:14:03.015008 mon.node1 [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 64 pgs inactive, 136 pgs peering)
2019-12-13 04:14:59.060075 mon.node1 [WRN] Health check update: Degraded data redundancy: 116/473 objects degraded (24.524%), 16 pgs degraded, 144 pgs undersized (PG_DEGRADED)
2019-12-13 04:23:59.966904 mon.node1 [INF] Marking osd.3 out (has been down for 603 seconds)
2019-12-13 04:23:59.967441 mon.node1 [INF] Health check cleared: OSD_DOWN (was: 1 osds down)
2019-12-13 04:23:59.967517 mon.node1 [INF] Health check cleared: OSD_HOST_DOWN (was: 1 host (1 osds) down)
2019-12-13 04:24:01.983211 mon.node1 [WRN] Health check update: Degraded data redundancy: 60/473 objects degraded (12.685%), 13 pgs degraded, 137 pgs undersized (PG_DEGRADED)
2019-12-13 04:24:08.093961 mon.node1 [WRN] Health check update: Degraded data redundancy: 110/473 objects degraded (23.256%), 11 pgs degraded, 129 pgs undersized (PG_DEGRADED)
2019-12-13 04:24:13.601381 mon.node1 [WRN] Health check update: Degraded data redundancy: 62/473 objects degraded (13.108%), 9 pgs degraded, 129 pgs undersized (PG_DEGRADED)
2019-12-13 04:24:19.973129 mon.node1 [WRN] Health check update: Degraded data redundancy: 41/473 objects degraded (8.668%), 8 pgs degraded, 129 pgs undersized (PG_DEGRADED)
2019-12-13 04:24:24.974256 mon.node1 [WRN] Health check update: Degraded data redundancy: 6/473 objects degraded (1.268%), 6 pgs degraded, 129 pgs undersized (PG_DEGRADED)
2019-12-13 05:00:00.000239 mon.node1 [WRN] overall HEALTH_WARN Degraded data redundancy: 6/473 objects degraded (1.268%), 6 pgs degraded, 129 pgs undersized
2019-12-13 06:00:00.000237 mon.node1 [WRN] overall HEALTH_WARN Degraded data redundancy: 6/473 objects degraded (1.268%), 6 pgs degraded, 129 pgs undersized
[..]
$ sudo ceph osd tree
ID  CLASS WEIGHT  TYPE NAME               STATUS REWEIGHT PRI-AFF 
 -1       0.00970 root default                                    
-16       0.00970     datacenter nijmegen                         
 -3       0.00388         host node2                              
  0   hdd 0.00388             osd.0           up  1.00000 1.00000 
 -5       0.00388         host node3                              
  1   hdd 0.00388             osd.1           up  1.00000 1.00000 
 -7       0.00098         host node4                              
  2   ssd 0.00098             osd.2           up  1.00000 1.00000 
 -9       0.00098         host node5                              
  3   ssd 0.00098             osd.3         down        0 1.00000 

Please let me know if you need more information.

The attached log from /var/lib/ceph/crash is truncated due to the attachment size limit; it contains lines from a few minutes before the crash until the end of the file.
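
The crash should also be visible through the mgr crash module (the log came from /var/lib/ceph/crash/posted, so ceph-crash has already posted it). A hedged way to pull the same metadata from the cluster, with the actual crash id elided:

$ sudo ceph crash ls                 # list posted crashes and their ids
$ sudo ceph crash info <crash-id>    # full metadata/backtrace for one crash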


Files

2019-12-18_log_truncated_1000KB.txt (26 KB), from /var/lib/ceph/crash/posted/datetime_uuid/log, uploaded by Gerdriaan Mulder, 12/18/2019 02:33 PM