Bug #40000
osds do not bound xattrs and/or aggregate xattr data in pg log
Description
Our cluster is currently in a HEALTH_ERR state with 4 PGs inactive (3 of which are "peering" and the 4th "activating+degraded"), as seen below.
32.160 peering [395,172,321,335,152,77] 395 [395,172,321,335,152,77] 395
32.756 peering [197,395,50,65,384,369] 197 [197,395,50,65,384,369] 197
32.dd1 peering [276,306,152,40,214,245] 276 [276,306,152,40,214,245] 276
32.1329 activating+degraded [306,276,129,43,186,241] 306 [306,276,129,43,186,241] 306
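For reference, a listing like the one above can be produced with the stuck-PG query; a sketch follows (it requires admin access to a running cluster, and the exact column layout varies by Ceph release):

```shell
# Show overall health and any PGs stuck in an inactive state.
ceph health detail
ceph pg dump_stuck inactive

# A single problem PG's peering state machine can then be inspected with:
ceph pg 32.160 query
```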
The primary OSDs all show a stream of the following errors, which we think is causing these PGs to stay in their inactive state.
2019-05-22 14:08:46.588585 7f47e9ba1700 -1 failed to decode message of type 83 v5: buffer::end_of_buffer
2019-05-22 14:08:48.482542 7f47ea3a2700 -1 failed to decode message of type 83 v5: buffer::end_of_buffer
2019-05-22 14:08:50.412120 7f47e9ba1700 -1 failed to decode message of type 83 v5: buffer::end_of_buffer
2019-05-22 14:08:52.265173 7f47e9ba1700 -1 failed to decode message of type 83 v5: buffer::end_of_buffer
2019-05-22 14:08:54.191465 7f47eaba3700 -1 failed to decode message of type 83 v5: buffer::end_of_buffer
2019-05-22 14:08:56.022749 7f47eaba3700 -1 failed to decode message of type 83 v5: buffer::end_of_buffer
2019-05-22 14:08:57.742697 7f47ea3a2700 -1 failed to decode message of type 83 v5: buffer::end_of_buffer
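As a quick triage step, the decode failures can be tallied per messenger thread straight from a log excerpt like the one above; a small awk sketch (the scratch file path is ours, not part of Ceph):

```shell
# Write the log excerpt to a scratch file, then count
# "failed to decode" events per messenger thread id (field 3).
cat > /tmp/osd-decode-errors.log <<'EOF'
2019-05-22 14:08:46.588585 7f47e9ba1700 -1 failed to decode message of type 83 v5: buffer::end_of_buffer
2019-05-22 14:08:48.482542 7f47ea3a2700 -1 failed to decode message of type 83 v5: buffer::end_of_buffer
2019-05-22 14:08:50.412120 7f47e9ba1700 -1 failed to decode message of type 83 v5: buffer::end_of_buffer
2019-05-22 14:08:52.265173 7f47e9ba1700 -1 failed to decode message of type 83 v5: buffer::end_of_buffer
2019-05-22 14:08:54.191465 7f47eaba3700 -1 failed to decode message of type 83 v5: buffer::end_of_buffer
2019-05-22 14:08:56.022749 7f47eaba3700 -1 failed to decode message of type 83 v5: buffer::end_of_buffer
2019-05-22 14:08:57.742697 7f47ea3a2700 -1 failed to decode message of type 83 v5: buffer::end_of_buffer
EOF

awk '/failed to decode message/ { count[$3]++ }
     END { for (t in count) print t, count[t] }' /tmp/osd-decode-errors.log
```

In our case the failures are spread across several messenger threads, which suggests the corrupt message is being resent rather than tied to one connection.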
We injected "ms_dump_corrupt_message_level = -1" to capture a hexdump of the corrupt message, which is attached herewith. Would it be possible to determine what is causing this issue, and is there a possible workaround?
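For completeness, the option was injected at runtime along the following lines (a sketch; osd.395, one of the primaries above, stands in for each affected OSD):

```shell
# Enable hexdumps of messages that fail to decode, without restarting
# the OSD. ms_dump_corrupt_message_level controls the log level at
# which the corrupt message body is dumped.
ceph tell osd.395 injectargs '--ms_dump_corrupt_message_level=-1'

# Alternatively, via the OSD's local admin socket:
ceph daemon osd.395 config set ms_dump_corrupt_message_level -1
```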
One thing to note: this issue seems to have started after we saw a large number of "refcount.get" calls hung on certain OSDs. We could see these in the "objecter_requests" output of the RGWs that issued them, and they originated from S3 PutObjectCopy requests sent to those RGWs.
Another thing to note: all of the OSDs currently showing this message are provisioned on Filestore. We have a small number of BlueStore OSDs in the mix, but they are not showing the issue (possibly just because only 4 PGs are stuck in this state). The pool on these OSDs is erasure-coded.
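The objectstore backend of each affected OSD can be confirmed from its metadata; a sketch (osd.395 is again a placeholder for any of the primaries above):

```shell
# Report which objectstore backend (filestore or bluestore) an OSD uses.
ceph osd metadata 395 | grep osd_objectstore

# The erasure-code profile of the affected pool (pool id 32 here) can be
# checked by name once it is looked up from "ceph osd pool ls detail".
ceph osd pool ls detail | grep "pool 32"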