Bug #40000
osds do not bound xattrs and/or aggregate xattr data in pg log
Description
Our cluster is currently in a HEALTH_ERR state with 4 PGs inactive (3 of which are "peering" and the 4th "activating+degraded"), as seen below.
32.160 peering [395,172,321,335,152,77] 395 [395,172,321,335,152,77] 395
32.756 peering [197,395,50,65,384,369] 197 [197,395,50,65,384,369] 197
32.dd1 peering [276,306,152,40,214,245] 276 [276,306,152,40,214,245] 276
32.1329 activating+degraded [306,276,129,43,186,241] 306 [306,276,129,43,186,241] 306
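For reference, a listing like the one above can be produced with the stuck-PG query; a sketch follows (it requires admin access to a running cluster, and the exact column layout varies by Ceph release):

```shell
# Show overall health and any PGs stuck in an inactive state.
ceph health detail
ceph pg dump_stuck inactive

# A single problem PG's peering state machine can then be inspected with:
ceph pg 32.160 query
```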
The primary OSDs all show a stream of the following errors, which we think is causing these PGs to stay in their inactive state.
2019-05-22 14:08:46.588585 7f47e9ba1700 -1 failed to decode message of type 83 v5: buffer::end_of_buffer
2019-05-22 14:08:48.482542 7f47ea3a2700 -1 failed to decode message of type 83 v5: buffer::end_of_buffer
2019-05-22 14:08:50.412120 7f47e9ba1700 -1 failed to decode message of type 83 v5: buffer::end_of_buffer
2019-05-22 14:08:52.265173 7f47e9ba1700 -1 failed to decode message of type 83 v5: buffer::end_of_buffer
2019-05-22 14:08:54.191465 7f47eaba3700 -1 failed to decode message of type 83 v5: buffer::end_of_buffer
2019-05-22 14:08:56.022749 7f47eaba3700 -1 failed to decode message of type 83 v5: buffer::end_of_buffer
2019-05-22 14:08:57.742697 7f47ea3a2700 -1 failed to decode message of type 83 v5: buffer::end_of_buffer
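As a quick triage step, the decode failures can be tallied per messenger thread straight from a log excerpt like the one above; a small awk sketch (the scratch file path is ours, not part of Ceph):

```shell
# Write the log excerpt to a scratch file, then count
# "failed to decode" events per messenger thread id (field 3).
cat > /tmp/osd-decode-errors.log <<'EOF'
2019-05-22 14:08:46.588585 7f47e9ba1700 -1 failed to decode message of type 83 v5: buffer::end_of_buffer
2019-05-22 14:08:48.482542 7f47ea3a2700 -1 failed to decode message of type 83 v5: buffer::end_of_buffer
2019-05-22 14:08:50.412120 7f47e9ba1700 -1 failed to decode message of type 83 v5: buffer::end_of_buffer
2019-05-22 14:08:52.265173 7f47e9ba1700 -1 failed to decode message of type 83 v5: buffer::end_of_buffer
2019-05-22 14:08:54.191465 7f47eaba3700 -1 failed to decode message of type 83 v5: buffer::end_of_buffer
2019-05-22 14:08:56.022749 7f47eaba3700 -1 failed to decode message of type 83 v5: buffer::end_of_buffer
2019-05-22 14:08:57.742697 7f47ea3a2700 -1 failed to decode message of type 83 v5: buffer::end_of_buffer
EOF

awk '/failed to decode message/ { count[$3]++ }
     END { for (t in count) print t, count[t] }' /tmp/osd-decode-errors.log
```

In our case the failures are spread across several messenger threads, which suggests the corrupt message is being resent rather than tied to one connection.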
We injected "ms_dump_corrupt_message_level = -1" to capture a hexdump of the corrupt message, which is attached herewith. Would it be possible to determine what is causing this issue, and is there a possible workaround?
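For completeness, the option was injected at runtime along the following lines (a sketch; osd.395, one of the primaries above, stands in for each affected OSD):

```shell
# Enable hexdumps of messages that fail to decode, without restarting
# the OSD. ms_dump_corrupt_message_level controls the log level at
# which the corrupt message body is dumped.
ceph tell osd.395 injectargs '--ms_dump_corrupt_message_level=-1'

# Alternatively, via the OSD's local admin socket:
ceph daemon osd.395 config set ms_dump_corrupt_message_level -1
```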
One thing to note: this issue seems to have started after we saw a large number of "refcount.get" calls hung on certain OSDs. We could see these in the "objecter_requests" output of the RGWs that issued them, and they originated from S3 PutObjectCopy requests sent to those RGWs.
Another thing to note: all of the OSDs currently showing this message are provisioned on Filestore. We have a small number of BlueStore OSDs in the mix, but they are not showing the issue (possibly just because only 4 PGs are stuck in this state). The pool on these OSDs is erasure-coded.
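The objectstore backend of each affected OSD can be confirmed from its metadata; a sketch (osd.395 is again a placeholder for any of the primaries above):

```shell
# Report which objectstore backend (filestore or bluestore) an OSD uses.
ceph osd metadata 395 | grep osd_objectstore

# The erasure-code profile of the affected pool (pool id 32 here) can be
# checked by name once it is looked up from "ceph osd pool ls detail".
ceph osd pool ls detail | grep "pool 32"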