Project

General

Profile

Actions

Bug #48618

closed

failed to decode message of type 24 v4: End of buffer

Added by Patrick Donnelly over 3 years ago. Updated about 3 years ago.

Status:
Resolved
Priority:
Immediate
Assignee:
Category:
fs/ceph
Target version:
-
% Done:

0%

Source:
Q/A
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
fs
Crash signature (v1):
Crash signature (v2):

Description

/ceph/teuthology-archive/pdonnell-2020-12-14_23:22:08-fs-wip-pdonnell-testing-20201214.211711-distro-basic-smithi/5708397/teuthology.log

and many others from that run.


Files

Actions #1

Updated by Jeff Layton over 3 years ago

Just recently, I saw some of those errors when I was working on the patches to update the ceph_mds_request_head structure. The client kernel in this case also has Ilya's msgr2 overhaul. We might need to bisect to find the culprit.

Actions #2

Updated by Jeff Layton over 3 years ago

  • Severity changed from 3 - minor to 2 - major
Actions #3

Updated by Ilya Dryomov over 3 years ago

  • Priority changed from Normal to Immediate

Type 24 is MClientRequest.

My first thought was that it had something to do msgr2 patches turning on SERVER_NAUTILUS and a couple of other features bits, but a quick look at MClientRequest encoding doesn't support that. The only feature bit dependency seems to be CEPH_FEATURE_FS_BTIME.

If you saw these while working on "ceph: implement updated ceph_mds_request_head structure", then msgr2 is unlikely to be the culprit. Since that patch is what brought the v3 -> v4 change to the kernel client, there is not much to bisect...

Actions #4

Updated by Jeff Layton over 3 years ago

Yeah, that probably is the culprit. I saw some of those when I was initially working on that patch, but once I got the formatting right, I didn't. Not sure why it's cropping up now, but I'll see if I can reproduce it.

Actions #5

Updated by Jeff Layton over 3 years ago

I was able to reproduce this by failing the MDS over to the standby (thanks, Patrick for the suggestion), and I turned up the corrupt packet dumper:

debug 2020-12-16T16:35:50.439+0000 7f18d25ab700 -1 failed to decode message of type 24 v4: buffer::end_of_buffer
debug 2020-12-16T16:35:50.439+0000 7f18d25ab700 -1 dump: 
00000000  01 00 a9 08 00 00 00 00  00 00 05 00 00 00 03 00  |................|
00000010  00 00 02 00 00 00 01 13  00 00 00 00 00 00 00 00  |................|
00000020  00 00 79 a6 00 00 00 01  00 00 41 02 00 00 b6 81  |..y.......A.....|
00000030  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000040  00 00 45 00 00 00 00 00  00 00 00 00 00 00 00 00  |..E.............|
00000050  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000060  00 00 01 77 a3 00 00 00  01 00 00 03 00 00 00 66  |...w...........f|
00000070  35 32 01 00 00 00 00 00  00 00 00 00 00 00 00 db  |52..............|
00000080  35 da 5f 48 36 b3 34                              |5._H6.4|
00000087

AFAICT, this looks legit. There is a set of 3 groups at the end, which is as expected. No idea what else the MDS is expecting to see here.

Actions #6

Updated by Ilya Dryomov over 3 years ago

I think I see the bug. It didn't reproduce for me yet, but I'm pretty sure that is it. Patch attached, I'll send it to the list later.

Actions #7

Updated by Ilya Dryomov over 3 years ago

Probably need to factor out gid_list encoding into a small helper...

Actions #8

Updated by Jeff Layton over 3 years ago

Well spotted! I should probably break all of that encoding out into a separate helper. I'll spin something up. Thanks!

Actions #9

Updated by Ilya Dryomov over 3 years ago

I've already done encode_gid_list() helper.

Actions #10

Updated by Jeff Layton over 3 years ago

Here's what I'm thinking. I'd like to move the timestamp encoding into the helper as well. The rest we can't really do much with since it's conditional.

Ilya, you already pulled the broken patch into ceph-client/master. Would you like me to post a corrected version that you can merge in its place? It'd be best to avoid a regression for anyone bisecting.

Actions #11

Updated by Ilya Dryomov over 3 years ago

No, for now the plan is to merge it as a separate fix. This isn't that easy to trigger, so I doubt anyone would ever run into it.

Actions #12

Updated by Jeff Layton over 3 years ago

What's the rationale for not just fixing the patch so there's no regression? That seems unnecessarily painful.

Actions #13

Updated by Ilya Dryomov over 3 years ago

ceph-client/master is frozen and I'd rather not rebase it.

Actions #14

Updated by Ilya Dryomov over 3 years ago

  • Category set to fs/ceph
  • Status changed from New to Fix Under Review

The fix has been committed to testing. Patrick, feel free to reschedule!

Actions #15

Updated by Patrick Donnelly over 3 years ago

Ilya Dryomov wrote:

The fix has been committed to testing. Patrick, feel free to reschedule!

Issue has gone away after rerun of failures: https://pulpito.ceph.com/pdonnell-2020-12-16_23:01:45-fs-wip-pdonnell-testing-20201214.211711-distro-basic-smithi/

Actions #16

Updated by Jeff Layton over 3 years ago

  • Assignee changed from Jeff Layton to Ilya Dryomov

Reassigning to Ilya since he did the patch.

Actions

Also available in: Atom PDF