
Bug #53426

rbd_directory (and some others) corrupted after update from 15.2.10 to 16.2.6

Added by Florian Florensa about 2 months ago. Updated 4 days ago.

Status:
In Progress
Priority:
Normal
Assignee:
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

After an upgrade from 15.2.10 to 16.2.6, I noticed that some images were behaving strangely, i.e., an image appears in rbd ls, but running rbd info on it returns ENOENT:

$> rbd ls | grep scw-disk-47f0955d-ca3a-4d33-addb-f66c78c31ec4
scw-disk-47f0955d-ca3a-4d33-addb-f66c78c31ec4
$> rbd info scw-disk-47f0955d-ca3a-4d33-addb-f66c78c31ec4
2021-11-29T13:49:59.920+0000 7f9bf27fc700 -1 librbd::image::OpenRequest: failed to retrieve initial metadata: (2) No such file or directory
rbd: error opening image scw-disk-47f0955d-ca3a-4d33-addb-f66c78c31ec4: (2) No such file or directory

Inspecting the omap values of rbd_directory, the keys related to this particular image seem strange, as they are prefixed by non-printable characters:

key (34 bytes):
00000000  00 00 00 00 00 00 00 01  00 00 00 00 00 00 12 7d  |...............}|
00000010  2e 69 64 5f 30 37 66 61  36 39 31 66 65 36 65 30  |.id_07fa691fe6e0|
00000020  35 33                                             |53|
00000022

value (49 bytes):
00000000  2d 00 00 00 73 63 77 2d  64 69 73 6b 2d 61 62 34  |-...scw-disk-ab4|
00000010  34 34 65 64 37 2d 63 33  61 61 2d 34 39 35 34 2d  |44ed7-c3aa-4954-|
00000020  38 62 37 30 2d 63 63 30  61 38 36 31 37 66 37 66  |8b70-cc0a8617f7f|
00000030  37                                                |7|
00000031

key (67 bytes):
00000000  00 00 00 00 00 00 00 01  00 00 00 00 00 00 12 7d  |...............}|
00000010  2e 6e 61 6d 65 5f 73 63  77 2d 64 69 73 6b 2d 61  |.name_scw-disk-a|
00000020  62 34 34 34 65 64 37 2d  63 33 61 61 2d 34 39 35  |b444ed7-c3aa-495|
00000030  34 2d 38 62 37 30 2d 63  63 30 61 38 36 31 37 66  |4-8b70-cc0a8617f|
00000040  37 66 37                                          |7f7|
00000043

value (18 bytes):
00000000  0e 00 00 00 30 37 66 61  36 39 31 66 65 36 65 30  |....07fa691fe6e0|
00000010  35 33                                             |53|
00000012
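The bogus bytes sit entirely before the "." separator, so the readable key can be recovered mechanically. A minimal sketch (my own helper, not a Ceph tool) that strips the binary prefix, using the 34-byte key dumped above:

```python
def strip_binary_prefix(key: bytes) -> bytes:
    """If the key starts with non-printable bytes, drop everything up to
    and including the first '.' separator, recovering the logical key."""
    if key[:1].isalpha():
        return key  # already a clean printable key like b"id_..." / b"name_..."
    dot = key.find(b'.')
    if dot == -1:
        raise ValueError("no '.' separator found; key layout not recognized")
    return key[dot + 1:]

# The 34-byte corrupted key from the hex dump above:
# 16 bytes of binary prefix, then ".id_07fa691fe6e053"
corrupted = (bytes.fromhex("0000000000000001000000000000127d")
             + b".id_07fa691fe6e053")
print(strip_binary_prefix(corrupted))  # b'id_07fa691fe6e053'
```

This only recovers the key name; actually rewriting the omap entry (as done manually later in this ticket) would still require rados setomapval/rmomapkey on the affected object.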

I left this image as it was, but on another image that exhibited the same issue, fixing the keys back to name_XXX / id_YYY made rbd info work again.

Also, as far as I know, snapshot-related data should be stored in the rbd_children omap values, but:

$> rados -p rbd listomapvals  rbd_children
error getting omap keys rbd/rbd_children: (2) No such file or directory

I also had a namespace that was not appearing in rbd namespace list, but rbd ls --namespace XXX was working; re-adding the key to the rbd_namespace omap did the trick, though.

rbd_header.07fa691fe6e053.omap - rbd_header of a broken image (3.18 KB) Florian Florensa, 11/30/2021 11:57 AM

rbd_header.6226b8b1561eb8.omap (6.74 KB) Florian Florensa, 11/30/2021 12:07 PM

osd_22.kvstore-tool.output (596 KB) Florian Florensa, 11/30/2021 01:42 PM


Related issues

Related to bluestore - Bug #53062: OMAP upgrade to PER-PG format result in ill-formatted OMAP keys. Resolved

History

#1 Updated by Florian Florensa about 2 months ago

For the missing namespace, its key looked like this:

key (31 bytes):
00000000  00 00 00 00 00 00 00 01  00 00 00 00 01 65 0a de  |.............e..|
00000010  2e 6e 61 6d 65 5f 69 6e  73 74 61 6e 63 65 73     |.name_instances|
0000001f

value (0 bytes):

#2 Updated by Mykola Golub about 2 months ago

Actually this behavior looks expected for a partially removed image.

Some time ago the behavior of the rbd remove command was changed to move the image to the trash as a first step, then proceed with removing objects (truncating), and then removing the metadata. If the removal is interrupted, the image is still in the trash, but it is not shown by the "normal" `rbd trash ls` command.

`rbd ls` was also updated to list images from both `rbd_directory` and the rbd trash (only those in the "removing" state). This was supposed to help users detect such partially removed images and complete the removal. For this you just need to run `rbd rm {image}` again.

The simplest way to check whether this is the case, I suppose, is to run:

rados -p rbd listomapvals rbd_trash

and see if you can find the image in the list there. You can provide the output here for review if unsure.

It looks like the current rbd behavior is rather confusing and could actually be improved. I remember we already made some improvements in the dashboard, so if you use the dashboard you may also want to try listing images there. In newer versions the dashboard should show such images in the "removing" state.

#3 Updated by Florian Florensa about 2 months ago

After checking, there are some images in the trash, but the ones we have issues with in this cluster were not deleted and are not present in the trash.

#4 Updated by Mykola Golub about 2 months ago

Florian Florensa wrote:

After checking, there are some images in the trash, but the ones we have issues with in this cluster were not deleted and are not present in the trash.

Just to make sure, you used `rados -p rbd listomapvals rbd_trash` command, right? (not `rbd trash ls`)

#5 Updated by Mykola Golub about 2 months ago

Ah, sorry, when reading the first time I was too fast and somehow missed your description of the non-printable chars in the key names, and that fixing them fixed the problem. Yes, this does look strange indeed.

I was basing my assumption about a partially removed image on the `rbd ls` and `rbd info` behavior you experienced, which looked the same as if you had a partially removed image. Now I see that this is not the case.

#6 Updated by Mykola Golub about 2 months ago

Is the problem observed for images created before the upgrade or after?

#7 Updated by Florian Florensa about 2 months ago

All the images that have encountered this issue so far were created before the upgrade to Ceph 16.

#8 Updated by Florian Florensa about 2 months ago

After inspecting the rbd_header objects of the broken images, all of their keys exhibit the same issue: they are all prefixed by the same bytes. Please see the attached file; the command used was:

rados -p rbd listomapvals rbd_header.07fa691fe6e053 > rbd_header.07fa691fe6e053.omap

#9 Updated by Florian Florensa about 2 months ago

I found another image that exhibits the same issue as the one above.
It seems there is some kind of prefix that is always the same length but depends on the object name (it is consistently the same prefix for a given object).
I have attached the omapvals for this header.

#10 Updated by Igor Fedotov about 2 months ago

Florian, this 17-byte prefix in the keys makes me think you're bitten by https://tracker.ceph.com/issues/53062

I still don't fully understand why this didn't cause failures at earlier stages, e.g. on OSD startup, but it looks like the keys are broken in a similar manner.
Could you please stop any arbitrary OSD and run:
ceph-kvstore-tool bluestore-kv <path-to-osd> list p

Please share the stdout output, if any.

#11 Updated by Florian Florensa about 2 months ago

Here is the output; the key prefixes in there look similar to what I was seeing in the omapvals.

#12 Updated by Igor Fedotov about 2 months ago

Florian Florensa wrote:

Here is the output; the key prefixes in there look similar to what I was seeing in the omapvals.

Well, there is indeed some mess in the OMAP keys; compare these two records:
p %00%00%00%00%00%00%00%03%9d%1cj%e0%00%00%00%00%00%02%fd%b8.%00%00%00%00%00%00%00%03%00%00%00%00%00%00M%3b.20211028-001726
p %00%00%00%00%00%00%00%03%9d%1cj%e0%00%00%00%00%00%02%fd%b8.20211029-001108

The first one has a redundant substring in the middle:
%00%00%00%00%00%00%00%03%00%00%00%00%00%00M%3b.

And apparently something happened to the cluster on Oct 28/29 2021: every invalid record group ends with 20211028 in the key, and the good ones start with 20211029.

To a great extent this is similar to #53062, but there are still some unclear differences that prevent an exact match to that ticket. And unfortunately, our recovery tools are not applicable for now.
Working on both the root cause analysis and a recovery procedure...
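Based on the two records above, a broken key can be spotted mechanically: a healthy per-PG key has printable text right after the first "." separator, while a broken one has a second binary run injected there. A small sketch of that heuristic (my own helper for scanning ceph-kvstore-tool output, not part of any Ceph tooling), using the two keys quoted above:

```python
from urllib.parse import unquote_to_bytes

def looks_corrupted(encoded_key: str) -> bool:
    """Decode a %-escaped key as printed by `ceph-kvstore-tool ... list p`
    and report True when a second binary run follows the first '.'."""
    raw = unquote_to_bytes(encoded_key)
    dot = raw.find(b'.')
    if dot == -1 or dot + 1 >= len(raw):
        return False  # no separator / nothing after it: cannot tell
    # printable ASCII right after the separator means the key looks healthy
    return not (0x20 <= raw[dot + 1] < 0x7f)

bad = ("%00%00%00%00%00%00%00%03%9d%1cj%e0%00%00%00%00%00%02%fd%b8."
       "%00%00%00%00%00%00%00%03%00%00%00%00%00%00M%3b.20211028-001726")
good = "%00%00%00%00%00%00%00%03%9d%1cj%e0%00%00%00%00%00%02%fd%b8.20211029-001108"
print(looks_corrupted(bad), looks_corrupted(good))  # True False
```

This is only a detection aid for triage; it assumes the corruption pattern shown in these dumps and would misfire on any legitimate key whose binary prefix happens to contain a 0x2e byte.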

#13 Updated by Igor Fedotov about 2 months ago

  • Status changed from New to In Progress
  • Assignee set to Igor Fedotov

#14 Updated by Florian Florensa about 2 months ago

I upgraded the cluster at the end of October, and after crawling through a chat I had with a coworker: the correct record groups starting from the 28th are the ones I reinjected manually myself to restore basic functionality on some images. (I.e., prior to that, a volume that was in use was still usable, but detaching it made it impossible to reattach; I guess most of that metadata is not used after the connection is established.)
Another volume started acting up today; the reason, I guess, is that the VM using it got restarted and was unable to "re-attach" to the volume, because everything that tries to stat(2) it fails.

#15 Updated by Igor Fedotov about 2 months ago

  • Related to Bug #53062: OMAP upgrade to PER-PG format result in ill-formatted OMAP keys. added

#16 Updated by Igor Fedotov about 2 months ago

Here is a summary of another similar case, shared by the affected cluster's admin. I presume it also provides a good enough explanation of how the state in this ticket came about:
"
well, one of my OSDs didn't crash after my upgrade like the other ones did.
After repairing the crashing OSDs (and not repairing the one that didn't
crash), all RBD images were visible in rbd info (and usable as disks), but
after a few hours of resyncing the cluster, some RBD disks weren't accessible
anymore.

My guess is that the OSD that did not crash still had the wrong omap keys,
and the rebuilding spread them over the cluster or somehow activated these
wrong omap keys.
"
So we can probably mark this ticket as a duplicate of #53062 and apparently close it. Fabian, any objections?

#17 Updated by Mykola Golub 4 days ago

  • Project changed from rbd to bluestore
