Bug #49972
Status: closed
mds: failed to decode message of type 29 v1: void CapInfoPayload::decode
Description
This was seen in pacific backport branch: https://pulpito.ceph.com/vshankar-2021-03-25_05:53:38-fs-wip-cephfs-mirror-pacific-backport-2-20210325-093706-distro-basic-smithi/5997033/
(Not sure if we hit this in master)
Basically, the CapInfoPayload message from the client to the MDS (mds.c) fails to decode:
2021-03-25T06:25:13.694+0000 7fad52b3c700  1 -- [v2:172.21.15.16:6836/2167551815,v1:172.21.15.16:6837/2167551815] --> [v2:172.21.15.16:3300/0,v1:172.21.15.16:6789/0] -- mon_subscribe({osdmap=83}) v3 -- 0x55e8dfe85d40 con 0x55e8df14e800
2021-03-25T06:25:13.938+0000 7fad50337700 10 mds.pinger is_rank_lagging: rank=0
2021-03-25T06:25:13.938+0000 7fad50337700  1 -- [v2:172.21.15.16:6836/2167551815,v1:172.21.15.16:6837/2167551815] --> [v2:172.21.15.16:6832/16896,v1:172.21.15.16:6833/16896] -- mgrreport(unknown.c +0-0 packed 1494) v9 -- 0x55e8dff12380 con 0x55e8df14f800
2021-03-25T06:25:13.966+0000 7fad55341700 -1 failed to decode message of type 29 v1: void CapInfoPayload::decode(ceph::buffer::v15_2_0::list::const_iterator&) decode past end of struct encoding: Malformed input
2021-03-25T06:25:13.966+0000 7fad55341700  1 dump:
00000000 08 00 00 00 00 00 00 00 01 01 18 00 00 00 03 00 |................|
00000010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 01 00 |................|
00000020 00 00 00 00 00 00 01 00 00 00 01 01 08 00 00 00 |................|
00000030 00 00 00 00 00 00 00 00 02 00 00 00 01 01 08 00 |................|
00000040 00 00 00 00 00 00 00 00 00 00 03 00 00 00 01 01 |................|
00000050 08 00 00 00 fa 99 00 00 00 b5 5b 12 04 00 00 00 |..........[.....|
00000060 01 01 18 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000070 00 00 00 00 00 00 01 00 00 00 00 00 00 00 05 00 |................|
00000080 00 00 01 01 10 00 00 00 00 00 00 00 00 00 00 00 |................|
00000090 01 00 00 00 00 00 00 00 06 00 00 00 01 01 10 00 |................|
000000a0 00 00 01 00 00 00 00 00 00 00 01 00 00 00 00 00 |................|
000000b0 00 00 07 00 00 00 01 01 10 00 00 00 00 00 00 00 |................|
000000c0 00 00 00 00 01 00 00 00 00 00 00 00 |............|
000000cc
Updated by Venky Shankar about 3 years ago
Patrick mentioned that this could be related to the testing kernel (Jeff merged some of Xiubo's patches that add various metrics to the client).
Updated by Patrick Donnelly about 3 years ago
I think this might just be because the pacific branch was missing: https://tracker.ceph.com/issues/46865
Updated by Patrick Donnelly about 3 years ago
Patrick Donnelly wrote:
I think this might just be because the pacific branch was missing: https://tracker.ceph.com/issues/49687
Updated by Xiubo Li about 3 years ago
00000000 08 00 00 00 00 00 00 00 01 01 18 00 00 00 03 00 |................|
00000010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 01 00 |................|
00000020 00 00 00 00 00 00 01 00 00 00 01 01 08 00 00 00 |................|
00000030 00 00 00 00 00 00 00 00 02 00 00 00 01 01 08 00 |................|
...
I checked the above binary byte by byte, and it is all correct so far:
The first 4 bytes mean there are 8 items, and the next 4 bytes are the CapInfoPayload's type, which is 0. The version/compat_version are 0x01 and 0x01, and the length of the CapInfoPayload is 0x00000018 (only 24 bytes), which is far less than the bl's total length of 204 bytes.
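That byte-by-byte walk can be reproduced mechanically. A minimal Python sketch, assuming the little-endian layout described above (the field names here are descriptive, not Ceph's actual identifiers):

```python
import struct

# First 14 bytes of the dumped blob, copied from the hex dump above.
head = bytes([
    0x08, 0x00, 0x00, 0x00,  # u32 item count: 8 metric payloads follow
    0x00, 0x00, 0x00, 0x00,  # u32 payload type: 0 (CapInfoPayload)
    0x01, 0x01,              # u8 version, u8 compat_version
    0x18, 0x00, 0x00, 0x00,  # u32 struct_len: 0x18 == 24 bytes
])

n_items, ptype = struct.unpack_from("<II", head, 0)
version, compat, struct_len = struct.unpack_from("<BBI", head, 8)
print(n_items, ptype, version, compat, struct_len)  # prints: 8 0 1 1 24
```

This matches the analysis: 8 items, type 0, version/compat 1/1, and a 24-byte CapInfoPayload body.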
Updated by Xiubo Li about 3 years ago
Locally I have built the origin/pacific ceph, and with the latest origin/testing kclient it works well:
# ./bin/ceph fs perf stats | python3 -m json.tool
{
    "version": 1,
    "global_counters": [
        "cap_hit",
        "read_latency",
        "write_latency",
        "metadata_latency",
        "dentry_lease",
        "opened_files",
        "pinned_icaps",
        "opened_inodes"
    ],
    "counters": [],
    "client_metadata": {
        "client.4292": {
            "IP": "v1:10.72.47.117",
            "hostname": "lxbceph1",
            "root": "/",
            "mount_point": "N/A",
            "valid_metrics": [
                "cap_hit",
                "read_latency",
                "write_latency",
                "metadata_latency",
                "dentry_lease",
                "opened_files",
                "pinned_icaps",
                "opened_inodes"
            ],
            "kernel_version": "5.12.0-rc2+"
        }
    },
    "global_metrics": {
        "client.4292": [
            [34, 4],
            [3744, 18000000],
            [41188, 300000000],
            [63285, 751000000],
            [5, 0],
            [0, 1],
            [1, 1],
            [0, 1]
        ]
    },
    "metrics": {
        "delayed_ranks": [],
        "mds.0": {
            "client.4292": []
        }
    }
}
And I added some debug logging to dump the hex manually:
2021-03-26T12:43:07.850+0800 151893f4b700  0 dump:
00000000 08 00 00 00 00 00 00 00 01 01 18 00 00 00 22 00 |..............".|
00000010 00 00 00 00 00 00 04 00 00 00 00 00 00 00 01 00 |................|
00000020 00 00 00 00 00 00 01 00 00 00 01 01 08 00 00 00 |................|
00000030 a0 0e 00 00 80 a8 12 01 02 00 00 00 01 01 08 00 |................|
00000040 00 00 e4 a0 00 00 00 a3 e1 11 03 00 00 00 01 01 |................|
00000050 08 00 00 00 35 f7 00 00 c0 59 c3 2c 04 00 00 00 |....5....Y.,....|
00000060 01 01 18 00 00 00 05 00 00 00 00 00 00 00 00 00 |................|
00000070 00 00 00 00 00 00 01 00 00 00 00 00 00 00 05 00 |................|
00000080 00 00 01 01 10 00 00 00 00 00 00 00 00 00 00 00 |................|
00000090 01 00 00 00 00 00 00 00 06 00 00 00 01 01 10 00 |................|
000000a0 00 00 01 00 00 00 00 00 00 00 01 00 00 00 00 00 |................|
000000b0 00 00 07 00 00 00 01 01 10 00 00 00 00 00 00 00 |................|
000000c0 00 00 00 00 01 00 00 00 00 00 00 00 |............|
000000cc
2021-03-26T12:43:07.850+0800 151893f4b700 20 mds.0.5 get_session have 0x55b5ce709e00 client.4292 v1:10.72.47.117:0/246353567 state open
2021-03-26T12:43:07.850+0800 151893f4b700 20 handle_client_metrics: mds.metrics: session=0x55b5ce709e00
2021-03-26T12:43:07.850+0800 151893f4b700 20 handle_payload: mds.metrics: type=CAP_INFO, session=0x55b5ce709e00, hits=34, misses=4
2021-03-26T12:43:07.850+0800 151893f4b700 20 handle_payload: mds.metrics: type=READ_LATENCY, session=0x55b5ce709e00, latency=3744.018000
2021-03-26T12:43:07.850+0800 151893f4b700 20 handle_payload: mds.metrics: type=WRITE_LATENCY, session=0x55b5ce709e00, latency=41188.300000
2021-03-26T12:43:07.850+0800 151893f4b700 20 handle_payload: mds.metrics: type=METADATA_LATENCY, session=0x55b5ce709e00, latency=63285.751000
2021-03-26T12:43:07.850+0800 151893f4b700 20 handle_payload: mds.metrics: type=DENTRY_LEASE, session=0x55b5ce709e00, hits=5, misses=0
2021-03-26T12:43:07.850+0800 151893f4b700 20 handle_payload: mds.metrics: type=OPENED_FILES, session=0x55b5ce709e00, opened_files=0, total_inodes=1
2021-03-26T12:43:07.850+0800 151893f4b700 20 handle_payload: mds.metrics: type=PINNED_ICAPS, session=0x55b5ce709e00, pinned_icaps=1, total_inodes=1
2021-03-26T12:43:07.850+0800 151893f4b700 20 handle_payload: mds.metrics: type=OPENED_INODES, session=0x55b5ce709e00, opened_inodes=0, total_inodes=1
They are the same, and all as expected.
I just wonder whether it is possible that on some platforms ceph decodes it in big-endian mode?
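For what it's worth, Ceph's wire encoding is defined as little-endian regardless of host architecture, and a big-endian misread of the length field would not produce a near-miss: it would produce a wildly wrong struct_len. A quick sketch using the struct_len bytes from the dump:

```python
import struct

len_field = bytes([0x18, 0x00, 0x00, 0x00])  # struct_len bytes from the dump
le = struct.unpack("<I", len_field)[0]  # little-endian read: 24
be = struct.unpack(">I", len_field)[0]  # big-endian read: 0x18000000 == 402653184
print(le, be)
```

Since the failing decode reports a plausible struct_len (0x18), an endianness mix-up seems unlikely to be the cause.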
Updated by Venky Shankar about 3 years ago
Patrick Donnelly wrote:
I think this might just be because the pacific branch was missing: https://tracker.ceph.com/issues/46865
If the MDS does not understand a client metric type, it logs `UnknownPayload` and ignores the client message.
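That skip is possible because each payload is length-prefixed: a decoder that does not recognize a type tag can still advance past the body. A hedged Python sketch of the idea (the type table here is an illustrative subset, not Ceph's actual enum):

```python
import struct

KNOWN_TYPES = {0: "CAP_INFO", 1: "READ_LATENCY"}  # illustrative subset only

def decode_payloads(blob):
    """Walk type-tagged, length-prefixed records, skipping unknown types."""
    off = 0
    (count,) = struct.unpack_from("<I", blob, off)
    off += 4
    seen = []
    for _ in range(count):
        (ptype,) = struct.unpack_from("<I", blob, off)
        off += 4
        version, compat, struct_len = struct.unpack_from("<BBI", blob, off)
        off += 6
        seen.append(KNOWN_TYPES.get(ptype, "UnknownPayload"))
        off += struct_len  # the length prefix lets us skip bodies we can't parse
    return seen

# Two records: a known CAP_INFO and an unrecognized type 99.
blob = (struct.pack("<I", 2)
        + struct.pack("<IBBI", 0, 1, 1, 4) + b"\x00" * 4
        + struct.pack("<IBBI", 99, 1, 1, 2) + b"\x00" * 2)
print(decode_payloads(blob))  # ['CAP_INFO', 'UnknownPayload']
```

So an unknown type alone should not cause a Malformed input error; the failure here is a length mismatch, not an unknown tag.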
Updated by Xiubo Li about 3 years ago
@Venky, with your backport patch I can reproduce it locally.
Updated by Venky Shankar about 3 years ago
Xiubo Li wrote:
@Venky, with your backport patch I can reproduce it locally.
Which backport? cephfs-mirror series backports?
Updated by Xiubo Li about 3 years ago
Venky Shankar wrote:
Xiubo Li wrote:
@Venky, with your backport patch I can reproduce it locally.
Which backport? cephfs-mirror series backports?
Pardon for the confusion. More precisely, it is your test branch "wip-cephfs-mirror-pacific-backport-2-20210325-093706" in ceph-ci.
But having checked the binary data, I still cannot figure out why "struct_len > bl.get_remaining()":
2021-03-26T13:53:47.980+0800 1519c9504700 -1 failed to decode message of type 29 v1: void CapInfoPayload::decode(ceph::buffer::v15_2_0::list::const_iterator&) decode past end of struct encoding: Malformed input
2021-03-26T13:53:48.030+0800 1519c9504700  0 dump:
00000000 05 00 00 00 01 00 00 00 01 01 08 00 00 00 00 00 |................|
00000010 00 00 00 00 00 00 02 00 00 00 01 01 08 00 00 00 |................|
00000020 00 00 00 00 00 00 00 00 03 00 00 00 01 01 08 00 |................|
00000030 00 00 00 00 00 00 b7 32 12 00 [00 00 00 00 01 01 |.......2........|
00000040 18 00 00 00 04 00 00 00 00 00 00 00 00 00 00 00 |................|
00000050 00 00 00 00 00 00 00 00 00 00 00 00] 04 00 00 00 |................|
00000060 01 01 18 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000070 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |..............|
0000007e
For the "CapInfoPayload", the struct_len is 0x00000018, which equals the bl's remaining bytes, as marked above in [].
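The check that fires is the standard bounds test in Ceph's versioned-encoding macros: after reading version, compat and length, struct_len must not exceed what remains in the buffer. A simplified Python model of that check (this approximates DECODE_START; it is not the real implementation):

```python
import struct

class MalformedInput(Exception):
    pass

def decode_start(blob, off):
    """Approximate Ceph's DECODE_START: read version, compat and struct_len,
    then refuse to decode past the end of the buffer."""
    version, compat, struct_len = struct.unpack_from("<BBI", blob, off)
    off += 6
    if struct_len > len(blob) - off:  # struct_len > bl.get_remaining()
        raise MalformedInput("decode past end of struct encoding")
    return version, compat, struct_len, off

# A header claiming a 24-byte body while only 16 bytes remain must fail.
short = struct.pack("<BBI", 1, 1, 24) + b"\x00" * 16
try:
    decode_start(short, 0)
    raised = False
except MalformedInput:
    raised = True
print(raised)  # True
```

So for the error to fire, the decoder's read position must already be further along than the payload boundaries imply, pointing at a disagreement about where each payload starts.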
I am adding some logs to print them, to see what their values actually are.
Updated by Xiubo Li about 3 years ago
Venky Shankar wrote:
Xiubo Li wrote:
@Venky, with your backport patch I can reproduce it locally.
Which backport? cephfs-mirror series backports?
I am afraid it was caused by this. I see the metric items are not in the order we expected.
I will fix it.
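That failure mode fits a type-numbering mismatch between sender and receiver: if the two sides disagree about which payload a type number denotes, the receiver applies a decoder whose expected body size does not match the struct_len on the wire. A purely illustrative Python sketch (the size tables are hypothetical, not Ceph's real ones):

```python
import struct

SENDER_BODY_SIZE = {0: 8}     # sender's enum: type 0 carries an 8-byte body
RECEIVER_BODY_SIZE = {0: 24}  # receiver's enum: type 0 means a 24-byte CapInfo-style body

def encode_one(ptype):
    """Encode one record using the sender's idea of the body size."""
    body = b"\x00" * SENDER_BODY_SIZE[ptype]
    return struct.pack("<IBBI", ptype, 1, 1, len(body)) + body

def decode_one(blob):
    """Decode one record using the receiver's idea of the body size."""
    (ptype,) = struct.unpack_from("<I", blob, 0)
    version, compat, struct_len = struct.unpack_from("<BBI", blob, 4)
    if RECEIVER_BODY_SIZE[ptype] > struct_len:
        raise ValueError("decode past end of struct encoding")
    return blob[10:10 + struct_len]

try:
    decode_one(encode_one(0))
    ok = True
except ValueError:
    ok = False
print(ok)  # False: the mismatched type tables make the decode fail
```

With matching tables on both sides the same bytes decode cleanly, which is consistent with the reordering fix proposed below.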
Updated by Xiubo Li about 3 years ago
- Status changed from New to Fix Under Review
- Priority changed from Normal to Urgent
- Pull request ID set to 40427
Updated by Patrick Donnelly about 3 years ago
- Assignee set to Xiubo Li
- Backport set to pacific
- Component(FS) MDS added
- Labels (FS) qa-failure added
Updated by Patrick Donnelly about 3 years ago
- Status changed from Fix Under Review to Pending Backport
Updated by Backport Bot about 3 years ago
- Copied to Backport #50225: pacific: mds: failed to decode message of type 29 v1: void CapInfoPayload::decode added
Updated by Loïc Dachary almost 3 years ago
- Status changed from Pending Backport to Resolved
While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".