Bug #64966

open

OSDs crash | Assert error | KernelDevice::aio_submit | when backfills 3 replica pool

Added by Victor Kotlyar about 2 months ago. Updated about 2 months ago.

Status: Fix Under Review
Priority: Normal
Assignee: -
Target version: -
% Done: 0%
Source:
Tags:
Backport: squid, reef, quincy
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Dear all,
we have run into a strange error.

When one of the OSDs died, the cluster started remapping/backfilling, but then for one placement group (2.3, 3 replicas) its primary OSDs started to crash on an "Assert error" [1] one by one, taking the PG down.
When I disable backfill on the Ceph cluster, the OSDs go back to normal, but the cluster stays with undersized+degraded PGs.

I have attached a dump of this PG. Notes on the file: osd.27 is the broken disk; osd.191 and osd.146 are the live disks; backfill to osd.213 had started when osd.191 and osd.146 went into a down loop on the "Assert error".
I have also attached the log from osd.146 around the crash. The "Assert error" occurred at Mar 18 16:35:11.

This cluster was set up with Ansible and runs in Docker on Ubuntu 20.04 with kernel '5.4.0-169-generic #187-Ubuntu SMP Thu Nov 23 14:52:28 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux' on SAS disks. Each whole disk is a single BlueStore OSD device on top of LVM.

Note that a remapping process was already running before osd.27 went down.

Is there any way to avoid this assertion failure?
Any help would be appreciated.

Best regards,
Victor Kotlyar

P.S. I have also attached some additional information that I think may be useful. No anomalies were observed in the system/kernel logs.

[1]

Mar 18 16:35:11 storage-3-5 docker[2710998]: ceph-osd: /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.6/rpm/el8/BUILD/ceph-17.2.6/src/blk/kernel/KernelDevice.cc:837: virtual void KernelDevice::aio_submit(IOContext*): Assertion `pending <= std::numeric_limits<uint16_t>::max()' failed.
Mar 18 16:35:11 storage-3-5 docker[2710998]: *** Caught signal (Aborted) **
Mar 18 16:35:11 storage-3-5 docker[2710998]:  in thread 7f42ea171700 thread_name:tp_osd_tp
Mar 18 16:35:11 storage-3-5 docker[2710998]:  ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)
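
For readers of the trace in [1]: the failed condition is `pending <= std::numeric_limits<uint16_t>::max()`, meaning the value named `pending` in `KernelDevice::aio_submit` went above 65535. The sketch below only illustrates that bound. It is not Ceph code, the helper names `fits_uint16` and `batches_needed` are hypothetical, and reading `pending` as a count of queued I/Os that could be split into batches is an assumption, not something the log confirms.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical helper (not part of Ceph): true when `pending` still fits in
// a 16-bit unsigned value, which is the condition the failed assertion checks.
constexpr bool fits_uint16(int pending) {
    return pending >= 0 &&
           pending <= std::numeric_limits<std::uint16_t>::max();
}

// Hypothetical helper (not part of Ceph): number of batches of at most
// 65535 entries needed to cover `pending` entries. This assumes that
// `pending` counts items that can be submitted in several smaller batches.
constexpr int batches_needed(int pending) {
    constexpr int limit = std::numeric_limits<std::uint16_t>::max();
    return pending <= 0 ? 0 : (pending - 1) / limit + 1;
}
```

Under these assumptions, a value of 65535 passes the check, while 65536 fails it and would need two batches.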


Files

pg.2.3.202403181512.json (18.8 KB), Victor Kotlyar, 03/18/2024 12:25 PM
Assert.error.osd.146.202403181528.txt.gz (392 KB), Victor Kotlyar, 03/18/2024 12:34 PM
ceph.status.detail.202403181540.txt (4.01 KB), Victor Kotlyar, 03/18/2024 12:43 PM
ceph.status.202403181540.txt (1.25 KB), Victor Kotlyar, 03/18/2024 12:43 PM
osd.tree.202403181540.txt (20.2 KB), Victor Kotlyar, 03/18/2024 12:43 PM
osd.146.info.txt (655 Bytes), Victor Kotlyar, 03/18/2024 12:45 PM
osd.191.crash.txt.tar (357 KB), Victor Kotlyar, 03/22/2024 06:23 AM
notes.txt (2.91 KB), Victor Kotlyar, 03/22/2024 06:24 AM