Bug #43370
OSD crash in function BlueFS::_flush_range with ceph_abort_msg "bluefs enospc"
Status: Rejected
Priority: Normal
Assignee: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
Setup:
- 5x Ubuntu 16.04 VM, 2GB RAM, 16GB root disk
- 2 VMs have 5GB extra (HDD) disk, 1 OSD per disk
- 2 VMs have 2GB extra (SSD) disk, 1 OSD per disk
- 3 monitors (in quorum), 2 managers (active+standby), 1 RGW
- All nodes run Ceph version 14.2.4 (Nautilus), fresh install
- OSDs created using
ceph-deploy osd create --data /dev/sdb node{2,3,4,5}
The cluster had run in HEALTH_OK for about a week, without any explicit I/O, when one of the smaller (SSD-backed) OSDs crashed. From journalctl:
Dec 13 04:13:55 node5 ceph-osd[10184]: 2019-12-13 04:13:55.914 7f9f0f83b700 -1 bluefs _allocate failed to allocate 0x100000 on bdev 1, free 0x0
Dec 13 04:13:55 node5 ceph-osd[10184]: 2019-12-13 04:13:55.914 7f9f0f83b700 -1 bluefs _flush_range allocated: 0xce00000 offset: 0xcdffe68 length: 0x40d
Dec 13 04:13:55 node5 ceph-osd[10184]: /build/ceph-14.2.4/src/os/bluestore/BlueFS.cc: In function 'int BlueFS::_flush_range(BlueFS::FileWriter*, uint64_t, uint64_t)' thread 7f9f0f83b700 time 2019-12-13 04:13:55.919740
Dec 13 04:13:55 node5 ceph-osd[10184]: /build/ceph-14.2.4/src/os/bluestore/BlueFS.cc: 2132: ceph_abort_msg("bluefs enospc")
Dec 13 04:13:55 node5 ceph-osd[10184]: ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus (stable)
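For reference, the hex values in the abort line are self-consistent: the write runs just past the end of the file's allocated extent, so BlueFS must allocate a new extent and finds nothing free. A quick arithmetic check (my own illustration, not part of the original log):

```python
# Hex values copied from the journalctl lines above.
allocated = 0xce00000  # bytes already allocated to the BlueFS file
offset    = 0xcdffe68  # write offset reported by _flush_range
length    = 0x40d      # write length reported by _flush_range

end = offset + length
assert end > allocated       # the write extends past the allocated extent
print(hex(end))              # 0xce00275
print(hex(end - allocated))  # 0x275 bytes beyond the extent
# BlueFS therefore asks the allocator for more space (the 0x100000, i.e.
# 1 MiB, request seen in the log), but bdev 1 reports free 0x0, hence
# the "bluefs enospc" abort.
```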
The OSD then tries to restart, but fails with:
Dec 13 04:13:58 node5 ceph-osd[33450]: 2019-12-13 04:13:58.834 7f7bfc889f80 -1 Falling back to public interface
Dec 13 04:14:28 node5 ceph-osd[33450]: 2019-12-13 04:14:28.878 7f7bfc889f80 -1 bluefs _allocate failed to allocate 0x400000 on bdev 1, free 0x0
Dec 13 04:14:28 node5 ceph-osd[33450]: /build/ceph-14.2.4/src/os/bluestore/BlueFS.cc: In function 'void BlueFS::_compact_log_async(std::unique_lock<std::mutex>&)' thread 7f7bfc889f80 time 2019-12-13 04:14:28.881756
Dec 13 04:14:28 node5 ceph-osd[33450]: /build/ceph-14.2.4/src/os/bluestore/BlueFS.cc: 1809: FAILED ceph_assert(r == 0)
Dec 13 04:14:28 node5 ceph-osd[33450]: ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus (stable)
This last error repeats until systemd's restart limit stops the service.
$ sudo ceph -s
  cluster:
    id:     77b0f639-26c6-4d18-a41d-90599c28ca05
    health: HEALTH_WARN
            Degraded data redundancy: 6/473 objects degraded (1.268%), 6 pgs degraded, 129 pgs undersized
            13 pgs not deep-scrubbed in time

  services:
    mon: 3 daemons, quorum node1,node3,node5 (age 2w)
    mgr: node2(active, since 2w), standbys: node4
    osd: 4 osds: 3 up (since 5d), 3 in (since 5d)
    rgw: 1 daemon active (node1)

  data:
    pools:   7 pools, 296 pgs
    objects: 241 objects, 20 MiB
    usage:   3.2 GiB used, 5.8 GiB / 9 GiB avail
    pgs:     6/473 objects degraded (1.268%)
             167 active+clean
             123 active+undersized
             6   active+undersized+degraded
$ sudo ceph -w
[..]
2019-12-04 12:14:29.011074 mon.node1 [INF] Health check cleared: POOL_APP_NOT_ENABLED (was: application not enabled on 1 pool(s))
2019-12-04 12:14:29.012031 mon.node1 [INF] Cluster is now healthy
2019-12-04 13:00:00.000157 mon.node1 [INF] overall HEALTH_OK
2019-12-04 14:00:00.000164 mon.node1 [INF] overall HEALTH_OK
[..]
2019-12-13 03:00:00.000204 mon.node1 [INF] overall HEALTH_OK
2019-12-13 04:00:00.000202 mon.node1 [INF] overall HEALTH_OK
2019-12-13 04:13:56.491781 mon.node1 [INF] osd.3 failed (root=default,datacenter=nijmegen,host=node5) (connection refused reported by osd.1)
2019-12-13 04:13:56.544101 mon.node1 [WRN] Health check failed: 1 osds down (OSD_DOWN)
2019-12-13 04:13:56.544156 mon.node1 [WRN] Health check failed: 1 host (1 osds) down (OSD_HOST_DOWN)
2019-12-13 04:13:59.616177 mon.node1 [WRN] Health check failed: Reduced data availability: 64 pgs inactive, 136 pgs peering (PG_AVAILABILITY)
2019-12-13 04:14:03.014943 mon.node1 [WRN] Health check failed: Degraded data redundancy: 116/473 objects degraded (24.524%), 16 pgs degraded (PG_DEGRADED)
2019-12-13 04:14:03.015008 mon.node1 [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 64 pgs inactive, 136 pgs peering)
2019-12-13 04:14:59.060075 mon.node1 [WRN] Health check update: Degraded data redundancy: 116/473 objects degraded (24.524%), 16 pgs degraded, 144 pgs undersized (PG_DEGRADED)
2019-12-13 04:23:59.966904 mon.node1 [INF] Marking osd.3 out (has been down for 603 seconds)
2019-12-13 04:23:59.967441 mon.node1 [INF] Health check cleared: OSD_DOWN (was: 1 osds down)
2019-12-13 04:23:59.967517 mon.node1 [INF] Health check cleared: OSD_HOST_DOWN (was: 1 host (1 osds) down)
2019-12-13 04:24:01.983211 mon.node1 [WRN] Health check update: Degraded data redundancy: 60/473 objects degraded (12.685%), 13 pgs degraded, 137 pgs undersized (PG_DEGRADED)
2019-12-13 04:24:08.093961 mon.node1 [WRN] Health check update: Degraded data redundancy: 110/473 objects degraded (23.256%), 11 pgs degraded, 129 pgs undersized (PG_DEGRADED)
2019-12-13 04:24:13.601381 mon.node1 [WRN] Health check update: Degraded data redundancy: 62/473 objects degraded (13.108%), 9 pgs degraded, 129 pgs undersized (PG_DEGRADED)
2019-12-13 04:24:19.973129 mon.node1 [WRN] Health check update: Degraded data redundancy: 41/473 objects degraded (8.668%), 8 pgs degraded, 129 pgs undersized (PG_DEGRADED)
2019-12-13 04:24:24.974256 mon.node1 [WRN] Health check update: Degraded data redundancy: 6/473 objects degraded (1.268%), 6 pgs degraded, 129 pgs undersized (PG_DEGRADED)
2019-12-13 05:00:00.000239 mon.node1 [WRN] overall HEALTH_WARN Degraded data redundancy: 6/473 objects degraded (1.268%), 6 pgs degraded, 129 pgs undersized
2019-12-13 06:00:00.000237 mon.node1 [WRN] overall HEALTH_WARN Degraded data redundancy: 6/473 objects degraded (1.268%), 6 pgs degraded, 129 pgs undersized
[..]
$ sudo ceph osd tree
ID  CLASS WEIGHT  TYPE NAME                   STATUS REWEIGHT PRI-AFF
 -1       0.00970 root default
-16       0.00970     datacenter nijmegen
 -3       0.00388         host node2
  0   hdd 0.00388             osd.0               up  1.00000 1.00000
 -5       0.00388         host node3
  1   hdd 0.00388             osd.1               up  1.00000 1.00000
 -7       0.00098         host node4
  2   ssd 0.00098             osd.2               up  1.00000 1.00000
 -9       0.00098         host node5
  3   ssd 0.00098             osd.3             down        0 1.00000
Please let me know if you need more information.
A truncated log (due to the attachment size limit) from /var/lib/ceph/crash is attached; it contains lines from a few minutes before the crash through the end of the file.