Bug #64966
open
OSDs crash | Assert error | KernelDevice::aio_submit | when backfills 3 replica pool
Description
Dear all,
we have run into a strange error.
When one of the OSDs died, the cluster started remapping/backfilling, but then for one PG (2.3, in a 3-replica pool) its primary OSDs started to crash with an "Assert error" [1] one by one, bringing the PG down.
When I disable backfill on the Ceph cluster, the OSDs return to normal, but the cluster is left with undersized+degraded PGs.
I have attached a dump of this PG. Notes on the file: osd.27 is the broken disk, osd.191 and osd.146 are the live disks, and backfill to osd.213 had started when osd.191 and osd.146 went into a crash loop with the "Assert error".
I have also attached the osd.146 log from around the crash. The "Assert error" occurred at Mar 18 16:35:11.
This cluster was deployed with Ansible and runs in Docker on Ubuntu 20.04 with kernel '5.4.0-169-generic #187-Ubuntu SMP Thu Nov 23 14:52:28 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux' on SAS disks. Each whole disk is a single BlueStore OSD device on top of LVM.
Note that a remapping process was already running before osd.27 went down.
Is there any way to avoid this assertion failure?
Any help would be appreciated.
Best regards,
Victor Kotlyar
P.S. I have also added some information that I think is useful. No anomalies were observed in the system or kernel logs.
[1]
Mar 18 16:35:11 storage-3-5 docker[2710998]: ceph-osd: /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.6/rpm/el8/BUILD/ceph-17.2.6/src/blk/kernel/KernelDevice.cc:837: virtual void KernelDevice::aio_submit(IOContext*): Assertion `pending <= std::numeric_limits<uint16_t>::max()' failed.
Mar 18 16:35:11 storage-3-5 docker[2710998]: *** Caught signal (Aborted) **
Mar 18 16:35:11 storage-3-5 docker[2710998]: in thread 7f42ea171700 thread_name:tp_osd_tp
Mar 18 16:35:11 storage-3-5 docker[2710998]: ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)
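
For reference, a minimal sketch of what the condition quoted in [1] means when read literally: the value named `pending` must fit in a `uint16_t`, so it cannot exceed 65535. This is not the Ceph source; the helper name `fits_in_uint16` is hypothetical, and the log does not show the actual value of `pending` at the time of the crash.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>

// Hypothetical helper, NOT part of Ceph. It reproduces only the condition
// quoted in the assertion message:
//   pending <= std::numeric_limits<uint16_t>::max()
constexpr bool fits_in_uint16(std::size_t pending) {
    return pending <= std::numeric_limits<std::uint16_t>::max();
}

// 65535 is the largest value a uint16_t can hold, so it still passes.
static_assert(fits_in_uint16(65535), "65535 fits in uint16_t");
// One more than that would trip an assertion of this form.
static_assert(!fits_in_uint16(65536), "65536 does not fit in uint16_t");
```

Under this reading, the abort would indicate that more than 65535 I/Os were pending in a single submission on the crashing OSD, which may be why disabling backfill makes the crashes stop.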
Files