Bug #40000

open

osds do not bound xattrs and/or aggregate xattr data in pg log

Added by Vaibhav Bhembre almost 5 years ago. Updated over 4 years ago.

Status: New
Priority: High
Assignee: -
Category: Peering
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS): OSD
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Our cluster is currently in a HEALTH_ERR state with 4 inactive PGs (3 are "peering" and the 4th is "activating+degraded"), as shown below.

32.160                                            peering         [395,172,321,335,152,77]        395         [395,172,321,335,152,77]            395
32.756                                            peering          [197,395,50,65,384,369]        197          [197,395,50,65,384,369]            197
32.dd1                                            peering         [276,306,152,40,214,245]        276         [276,306,152,40,214,245]            276
32.1329                               activating+degraded         [306,276,129,43,186,241]        306         [306,276,129,43,186,241]            306
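
When triaging a listing like the one above, it can help to check whether the stuck PGs have OSDs in common. The following is a minimal Python sketch, assuming each row is laid out as PG ID, state, up set, up primary, acting set, acting primary. The column meanings are inferred from the values, since the paste has no header row, and the helper names are hypothetical.

```python
import re
from collections import defaultdict

# Assumed row layout: pgid, state, [up set], up primary, [acting set], acting primary.
PG_LINE = re.compile(
    r"^(?P<pgid>[0-9]+\.[0-9a-f]+)\s+(?P<state>\S+)\s+"
    r"\[(?P<up>[\d,]*)\]\s+(?P<up_primary>-?\d+)\s+"
    r"\[(?P<acting>[\d,]*)\]\s+(?P<acting_primary>-?\d+)\s*$"
)


def _ids(text):
    """Turn '395,172,321' into [395, 172, 321]."""
    return [int(x) for x in text.split(",") if x]


def parse_pg_line(line):
    """Parse one row of the listing; return None if the row does not match."""
    m = PG_LINE.match(line.strip())
    if m is None:
        return None
    return {
        "pgid": m.group("pgid"),
        "state": m.group("state"),
        "up": _ids(m.group("up")),
        "up_primary": int(m.group("up_primary")),
        "acting": _ids(m.group("acting")),
        "acting_primary": int(m.group("acting_primary")),
    }


def osds_in_multiple_pgs(pgs):
    """Map each OSD that appears in more than one acting set to those PG IDs."""
    seen = defaultdict(list)
    for pg in pgs:
        for osd in pg["acting"]:
            seen[osd].append(pg["pgid"])
    return {osd: pgids for osd, pgids in seen.items() if len(pgids) > 1}
```

Applied to the four rows above, this reports OSDs 395, 152, 276 and 306 as each appearing in two of the stuck PGs.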

The primary OSDs all log a steady stream of the following errors, which we think is what keeps these PGs in their inactive state.

2019-05-22 14:08:46.588585 7f47e9ba1700 -1 failed to decode message of type 83 v5: buffer::end_of_buffer
2019-05-22 14:08:48.482542 7f47ea3a2700 -1 failed to decode message of type 83 v5: buffer::end_of_buffer
2019-05-22 14:08:50.412120 7f47e9ba1700 -1 failed to decode message of type 83 v5: buffer::end_of_buffer
2019-05-22 14:08:52.265173 7f47e9ba1700 -1 failed to decode message of type 83 v5: buffer::end_of_buffer
2019-05-22 14:08:54.191465 7f47eaba3700 -1 failed to decode message of type 83 v5: buffer::end_of_buffer
2019-05-22 14:08:56.022749 7f47eaba3700 -1 failed to decode message of type 83 v5: buffer::end_of_buffer
2019-05-22 14:08:57.742697 7f47ea3a2700 -1 failed to decode message of type 83 v5: buffer::end_of_buffer
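
To see how often these failures occur and whether they are confined to one message type, the log lines can be tallied. This is a minimal Python sketch that assumes the line format shown above (timestamp, thread ID, log level, message); the function name and the sample constant are hypothetical, and the sample reuses the first three lines from this report.

```python
import re
from collections import Counter

# Matches lines such as:
# 2019-05-22 14:08:46.588585 7f47e9ba1700 -1 failed to decode message of type 83 v5: buffer::end_of_buffer
DECODE_FAILURE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) "
    r"(?P<thread>[0-9a-f]+)\s+-1 "
    r"failed to decode message of type (?P<type>\d+) v(?P<version>\d+): "
    r"(?P<error>\S+)$"
)


def summarize_decode_failures(lines):
    """Count decode failures per (type, version, error) and per thread."""
    by_message = Counter()
    by_thread = Counter()
    for line in lines:
        m = DECODE_FAILURE.match(line.strip())
        if m is None:
            continue  # not a decode-failure line
        key = (int(m.group("type")), int(m.group("version")), m.group("error"))
        by_message[key] += 1
        by_thread[m.group("thread")] += 1
    return by_message, by_thread


SAMPLE_LOG = [
    "2019-05-22 14:08:46.588585 7f47e9ba1700 -1 failed to decode message of type 83 v5: buffer::end_of_buffer",
    "2019-05-22 14:08:48.482542 7f47ea3a2700 -1 failed to decode message of type 83 v5: buffer::end_of_buffer",
    "2019-05-22 14:08:50.412120 7f47e9ba1700 -1 failed to decode message of type 83 v5: buffer::end_of_buffer",
]

if __name__ == "__main__":
    messages, threads = summarize_decode_failures(SAMPLE_LOG)
    print(dict(messages))
    print(dict(threads))
```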

We injected "ms_dump_corrupt_message_level = -1" to capture a hexdump of the message, which is attached. Could someone look at what might be causing this issue and suggest a possible workaround?

Note that this issue seems to have started after we saw many calls to "refcount.get" hang on certain OSDs. We could see these calls in the "objecter_requests" output of the RGWs, and we traced them to S3 PutObjectCopy requests issued to those RGWs.
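
For anyone trying to reproduce this observation, outstanding calls in an "objecter_requests" dump can be grouped by target OSD. This is only a rough Python sketch: it assumes the dump is JSON with an "ops" list whose entries carry "osd" and "osd_ops" fields, which should be checked against a real dump. The function name is hypothetical, and the sample data is hand-written for illustration, not captured from our cluster.

```python
import json
from collections import Counter


def pending_calls_by_osd(dump_text, method="refcount.get"):
    """Count in-flight ops whose op list mentions `method`, grouped by OSD.

    Assumes a JSON layout like {"ops": [{"osd": ..., "osd_ops": [...]}, ...]}.
    """
    dump = json.loads(dump_text)
    counts = Counter()
    for op in dump.get("ops", []):
        if any(method in osd_op for osd_op in op.get("osd_ops", [])):
            counts[op.get("osd")] += 1
    return counts


# Hand-written sample (assumed field names, invented values).
SAMPLE_DUMP = json.dumps({
    "ops": [
        {"tid": 1, "osd": 10, "osd_ops": ["call refcount.get"]},
        {"tid": 2, "osd": 10, "osd_ops": ["call refcount.get"]},
        {"tid": 3, "osd": 11, "osd_ops": ["read 0~4096"]},
    ]
})

if __name__ == "__main__":
    print(dict(pending_calls_by_osd(SAMPLE_DUMP)))
```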

Also note that all the OSDs currently showing this message are provisioned on Filestore. We have a small number of BlueStore OSDs in the mix, but they do not show this issue (possibly because only 4 PGs are stuck in this state). The pool on these OSDs is erasure-coded.


Files

decode.83.issue.hexdump.txt.tar.gz (177 KB): hexdump for decode 83 failure. Vaibhav Bhembre, 05/22/2019 02:23 PM