Bug #64966

OSDs crash | Assert error | KernelDevice::aio_submit | when backfilling a 3-replica pool

Added by Victor Kotlyar about 1 month ago. Updated about 1 month ago.

Status:
Fix Under Review
Priority:
Normal
Assignee:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
squid, reef, quincy
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Dear all,
we have run into a strange error.

When one of the OSDs died, the cluster started to remap/backfill, but then for one placement group (2.3, in a 3-replica pool) the primary OSDs started to crash with an "Assert error" [1] one by one, taking the PG down.
When I disable backfill on the Ceph cluster, the OSDs come back to normal, but the cluster stays with undersized+degraded PGs.
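
For reference, a minimal sketch of how backfill can be paused cluster-wide, assuming the standard nobackfill flag is what was used here (other routes such as lowering osd_max_backfills exist too):

```
# Pause backfill cluster-wide so the crashing primaries stop hitting the assert
ceph osd set nobackfill

# Re-enable backfill once a workaround is in place
ceph osd unset nobackfill
```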

I have attached the PG dump, with comments for the file: osd.27 is the broken disk, osd.191 and osd.146 are alive disks, and backfill started to osd.213 when 191 and 146 went into a down loop caused by the "Assert error".
I have also attached the log around the crash from osd.146. The "Assert error" occurred at Mar 18 16:35:11.

This cluster is an Ansible setup in Docker on Ubuntu 20.04 with kernel '5.4.0-169-generic #187-Ubuntu SMP Thu Nov 23 14:52:28 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux' on SAS disks. Each whole disk is a single BlueStore OSD device over LVM.

Note that a remapping process was already running before osd.27 went down.

Is there any way to avoid such an assert error?
Any help would be appreciated.

Best regards,
Victor Kotlyar

P.S. I have also attached some additional info that I think may be useful. No anomalous logs were observed on the system/kernel side.

[1]

Mar 18 16:35:11 storage-3-5 docker[2710998]: ceph-osd: /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.6/rpm/el8/BUILD/ceph-17.2.6/src/blk/kernel/KernelDevice.cc:837: virtual void KernelDevice::aio_submit(IOContext*): Assertion `pending <= std::numeric_limits<uint16_t>::max()' failed.
Mar 18 16:35:11 storage-3-5 docker[2710998]: *** Caught signal (Aborted) **
Mar 18 16:35:11 storage-3-5 docker[2710998]:  in thread 7f42ea171700 thread_name:tp_osd_tp
Mar 18 16:35:11 storage-3-5 docker[2710998]:  ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)


Files

pg.2.3.202403181512.json (18.8 KB) Victor Kotlyar, 03/18/2024 12:25 PM
Assert.error.osd.146.202403181528.txt.gz (392 KB) Victor Kotlyar, 03/18/2024 12:34 PM
ceph.status.detail.202403181540.txt (4.01 KB) Victor Kotlyar, 03/18/2024 12:43 PM
ceph.status.202403181540.txt (1.25 KB) Victor Kotlyar, 03/18/2024 12:43 PM
osd.tree.202403181540.txt (20.2 KB) Victor Kotlyar, 03/18/2024 12:43 PM
osd.146.info.txt (655 Bytes) Victor Kotlyar, 03/18/2024 12:45 PM
osd.191.crash.txt.tar (357 KB) Victor Kotlyar, 03/22/2024 06:23 AM
notes.txt (2.91 KB) Victor Kotlyar, 03/22/2024 06:24 AM
Actions #1

Updated by Igor Fedotov about 1 month ago

I think the issue is caused by high fragmentation of some onode(s). Looks like there are more than 64K logical non-contiguous extents in such onode(s).

Is this an rbd pool? Does it have compression enabled?

You might want to set debug_bluestore to 2/15 for the crashing OSDs and try to trigger the issue once again. The resulting log might prove the above hypothesis and provide some more insight, e.g. the name of the crashing onode. Please note that this setting can negatively impact performance.
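
A minimal sketch of raising the log level at runtime for one of the crashing OSDs (osd.146 is used purely as an example id):

```
# Raise BlueStore logging to 2/15 on a single OSD at runtime
ceph tell osd.146 config set debug_bluestore 2/15

# Revert once the crash has been captured (1/5 is the usual default)
ceph tell osd.146 config set debug_bluestore 1/5
```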

If my hypothesis is correct, then the only way I can see to work around the issue without patching and rebuilding the code is to export-remove the PG that keeps the "bad" onodes and import it back into its replica OSDs using ceph-objectstore-tool. Given the provided osd.146 log, the first candidate for this operation would be PG 2.3. This operation should "defragment" the objects and hence work around the issue.
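
A rough sketch of that export-remove/import cycle with ceph-objectstore-tool, using osd.146 and PG 2.3 as example targets; the OSD must be stopped while the tool runs, and the data path and export file are illustrative only:

```
# With the OSD stopped: export PG 2.3 to a file and remove it from the OSD
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-146 \
    --pgid 2.3 --op export-remove --file /tmp/pg.2.3.export

# Import the PG back from the file; the data is rewritten contiguously,
# which should "defragment" the over-fragmented onodes
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-146 \
    --op import --file /tmp/pg.2.3.export

# Start the OSD again and let peering/recovery proceed
```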

Actions #2

Updated by Victor Kotlyar about 1 month ago

Dear Igor, many thx!

This is a CephFS pool [1]. Compression was enabled, and then I disabled it since it gave no benefit for the stored data.
In the meantime I was able to bring the broken osd.27 back up, and I am trying to migrate PGs off it.
I see your point, and as soon as the problem returns (it definitely will) I will update the issue.

By the way, if I am able to split pg_num from 128 to 256, will that reduce the high fragmentation and solve the problem? It looks like it should?

Best regards,
Victor Kotlyar

[1]
```
pool 2 'cephfs_data' replicated size 3 min_size 2 crush_rule 1 object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode warn last_change 13461 flags hashpspool stripe_width 0 compression_algorithm lz4 compression_mode none application cephfs
```
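
For reference, such a split would amount to something like the following (hypothetical example only; on recent releases pgp_num is adjusted automatically to follow pg_num):

```
# Double the PG count of the affected pool
ceph osd pool set cephfs_data pg_num 256
```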

Actions #3

Updated by Igor Fedotov about 1 month ago

Hi Victor,
first of all I'd like to confirm my analysis above - I managed to reproduce the issue in my lab.

So the issue occurs when an onode keeps more than 64K non-contiguous extents. For my repro I wrote extents in the following manner (offset~length): 0~1, 4096~1, 8192~1, ... up to offset = 4096 * 65536.

IMO a PG split wouldn't fix that, as it would rather trigger onode remapping/backfilling, which is what issues the faulty read op. The key point is that there are two types of reads from ObjectStore:
1. reading a contiguous extent using an offset/length pair;
2. reading object content in a scattered manner using a set of intervals.

PG remap uses the second one, coupled with the onode's fiemap (i.e. the list of non-empty extents). Hence it gets a huge list of intervals to read, which results in the assertion - the kernel I/O stack is unable to process such a long list in a single shot.

On the other hand, PG export/import uses read option 1, which reads an onode from offset 0 to its full size as a whole, including all the gaps - just a single piece of data - and then writes it back in the same manner, with no gaps. Hence it (hopefully) works around the issue. Potentially this can cause some increase in the space required to keep that PG, but I doubt it would be a large increase.

Actions #4

Updated by Victor Kotlyar about 1 month ago

Hi Igor, many thx!
OK, I will be back with results when I hit this issue again. It may take several days. This is at least the second PG I have observed crashing in our Ceph cluster.

Cheers,
Victor

Actions #5

Updated by Igor Fedotov about 1 month ago

  • Tracker changed from Support to Bug
  • Status changed from New to Fix Under Review
  • Backport set to squid, reef, quincy
  • Regression set to No
  • Severity set to 3 - minor
  • Pull request ID set to 56352

Actions #6

Updated by Victor Kotlyar about 1 month ago

Dear Igor,

The export/import massage of the suspicious PG (2.3) indeed fixed this particular problem!
Many thx!
I have attached the crash log from the primary osd.191 at the moment of the crash, with debug_bluestore set to 2/15 (I hope).
I have also attached my notes on the recovery procedure (in case somebody needs them).

So if it is really a bug, we will have to wait for the fix.
In the meantime, is there any way to find such fragmented PGs (I feel it is only the 3-replica pool in our case), or do we have to wait until the next assert occurs?
Cheers,
Victor

Actions #7

Updated by Igor Fedotov about 1 month ago

Hi Victor,
thanks for the update. Indeed it confirms my hypothesis - there was a very fragmented onode keeping 4-byte data extents every 8 bytes. Here is the relevant log snippet showing the data layout in it:

readv 2.3_head #2:c1173368:::10000403d26.00000000:head# fiemap [224~4,232~4,240~4,248~4,256~4,264~4,272~4,280~4,288~4,296~4,304~4,312~4,320~4 ...

I presume this is caused by a very specific user write pattern.

The only way that comes to mind to find additional "bad" objects is running a deep fsck (--deep 1 switch) using ceph-bluestore-tool for every OSD backing this pool, coupled with setting the bluestore_fsck_read_bytes_cap config parameter to a pretty large value, e.g. 2GB.

This will trigger object reads using a 2GB data block and thus will hopefully reveal (by triggering the same assertion) the problem for other objects, if any. It might take a while though, as deep fsck scans and reads the content of all the objects on an OSD.
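
A sketch of the per-OSD invocation, assuming the OSD is stopped and that the config override can be passed via CEPH_ARGS as with other Ceph utilities (the path and OSD id are examples):

```
# Deep fsck of a stopped OSD; the raised read cap lets object data be read in
# chunks of up to 2GB, so an over-fragmented onode should trip the same assertion
CEPH_ARGS="--bluestore_fsck_read_bytes_cap=2147483648" \
    ceph-bluestore-tool fsck --deep 1 --path /var/lib/ceph/osd/ceph-146
```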

Actions #8

Updated by Victor Kotlyar about 1 month ago

Hi Igor,
thank you very much!

Ok, looks like we have to wait for a fix.

Cheers
Victor
