Bug #48618
closedfailed to decode message of type 24 v4: End of buffer
0%
Description
/ceph/teuthology-archive/pdonnell-2020-12-14_23:22:08-fs-wip-pdonnell-testing-20201214.211711-distro-basic-smithi/5708397/teuthology.log
and many others from that run.
Files
Updated by Jeff Layton over 3 years ago
Just recently, I saw some of those errors when I was working on the patches to update the ceph_mds_request_head structure. The client kernel in this case also has Ilya's msgr2 overhaul. We might need to bisect to find the culprit.
Updated by Jeff Layton over 3 years ago
- Severity changed from 3 - minor to 2 - major
Updated by Ilya Dryomov over 3 years ago
- Priority changed from Normal to Immediate
Type 24 is MClientRequest.
My first thought was that it had something to do msgr2 patches turning on SERVER_NAUTILUS and a couple of other features bits, but a quick look at MClientRequest encoding doesn't support that. The only feature bit dependency seems to be CEPH_FEATURE_FS_BTIME.
If you saw these while working on "ceph: implement updated ceph_mds_request_head structure", then msgr2 is unlikely to be the culprit. Since that patch is what brought the v3 -> v4 change to the kernel client, there is not much to bisect...
Updated by Jeff Layton over 3 years ago
Yeah, that probably is the culprit. I saw some of those when I was initially working on that patch, but once I got the formatting right, I didn't. Not sure why it's cropping up now, but I'll see if I can reproduce it.
Updated by Jeff Layton over 3 years ago
I was able to reproduce this by failing the MDS over to the standby (thanks, Patrick for the suggestion), and I turned up the corrupt packet dumper:
debug 2020-12-16T16:35:50.439+0000 7f18d25ab700 -1 failed to decode message of type 24 v4: buffer::end_of_buffer debug 2020-12-16T16:35:50.439+0000 7f18d25ab700 -1 dump: 00000000 01 00 a9 08 00 00 00 00 00 00 05 00 00 00 03 00 |................| 00000010 00 00 02 00 00 00 01 13 00 00 00 00 00 00 00 00 |................| 00000020 00 00 79 a6 00 00 00 01 00 00 41 02 00 00 b6 81 |..y.......A.....| 00000030 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| 00000040 00 00 45 00 00 00 00 00 00 00 00 00 00 00 00 00 |..E.............| 00000050 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| 00000060 00 00 01 77 a3 00 00 00 01 00 00 03 00 00 00 66 |...w...........f| 00000070 35 32 01 00 00 00 00 00 00 00 00 00 00 00 00 db |52..............| 00000080 35 da 5f 48 36 b3 34 |5._H6.4| 00000087
AFAICT, this looks legit. There is a set of 3 groups at the end, which is as expected. No idea what else the MDS is expecting to see here.
Updated by Ilya Dryomov over 3 years ago
- File reencode-gids.diff reencode-gids.diff added
I think I see the bug. It didn't reproduce for me yet, but I'm pretty sure that is it. Patch attached, I'll send it to the list later.
Updated by Ilya Dryomov over 3 years ago
Probably need to factor out gid_list encoding into a small helper...
Updated by Jeff Layton over 3 years ago
Well spotted! I should probably break all of that encoding out into a separate helper. I'll spin something up. Thanks!
Updated by Ilya Dryomov over 3 years ago
I've already done encode_gid_list() helper.
Updated by Jeff Layton over 3 years ago
- File 0001-SQUASH-fix-reencoding-of-MClientRequest-for-retransm.patch 0001-SQUASH-fix-reencoding-of-MClientRequest-for-retransm.patch added
Here's what I'm thinking. I'd like to move the timestamp encoding into the helper as well. The rest we can't really do much with since it's conditional.
Ilya, you already pulled the broken patch into ceph-client/master. Would you like me to post a corrected version that you can merge in its place? It'd be best to avoid a regression for anyone bisecting.
Updated by Ilya Dryomov over 3 years ago
No, for now the plan is to merge it as a separate fix. This isn't that easy to trigger, so I doubt anyone would ever run into it.
Updated by Jeff Layton over 3 years ago
What's the rationale for not just fixing the patch so there's no regression? That seems unnecessarily painful.
Updated by Ilya Dryomov over 3 years ago
ceph-client/master is frozen and I'd rather not rebase it.
Updated by Ilya Dryomov over 3 years ago
- Category set to fs/ceph
- Status changed from New to Fix Under Review
The fix has been committed to testing. Patrick, feel free to reschedule!
Updated by Patrick Donnelly over 3 years ago
Ilya Dryomov wrote:
The fix has been committed to testing. Patrick, feel free to reschedule!
Issue has gone away after rerun of failures: https://pulpito.ceph.com/pdonnell-2020-12-16_23:01:45-fs-wip-pdonnell-testing-20201214.211711-distro-basic-smithi/
Updated by Jeff Layton over 3 years ago
- Assignee changed from Jeff Layton to Ilya Dryomov
Reassigning to Ilya since he did the patch.
Updated by Ilya Dryomov about 3 years ago
- Status changed from Fix Under Review to Resolved