Project

General

Profile

Bug #49972

mds: failed to decode message of type 29 v1: void CapInfoPayload::decode

Added by Venky Shankar 27 days ago. Updated 14 days ago.

Status:
Pending Backport
Priority:
Urgent
Assignee:
Category:
Correctness/Safety
Target version:
% Done:

0%

Source:
Community (dev)
Tags:
Backport:
pacific
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
qa-failure
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

This was seen in pacific backport branch: https://pulpito.ceph.com/vshankar-2021-03-25_05:53:38-fs-wip-cephfs-mirror-pacific-backport-2-20210325-093706-distro-basic-smithi/5997033/

(Not sure if we hit this in master)

Basically, CapInfoPayload message from client to the MDS (mds.c) fails decode:

2021-03-25T06:25:13.694+0000 7fad52b3c700  1 -- [v2:172.21.15.16:6836/2167551815,v1:172.21.15.16:6837/2167551815] --> [v2:172.21.15.16:3300/0,v1:172.21.15.16:6789/0] -- mon_subscribe({osdmap=83}) v3 -- 0x55e8dfe85d40 con 0x55e8df14e800
2021-03-25T06:25:13.938+0000 7fad50337700 10 mds.pinger is_rank_lagging: rank=0
2021-03-25T06:25:13.938+0000 7fad50337700  1 -- [v2:172.21.15.16:6836/2167551815,v1:172.21.15.16:6837/2167551815] --> [v2:172.21.15.16:6832/16896,v1:172.21.15.16:6833/16896] -- mgrreport(unknown.c +0-0 packed 1494) v9 -- 0x55e8dff12380 con 0x55e8df14f800
2021-03-25T06:25:13.966+0000 7fad55341700 -1 failed to decode message of type 29 v1: void CapInfoPayload::decode(ceph::buffer::v15_2_0::list::const_iterator&) decode past end of struct encoding: Malformed input
2021-03-25T06:25:13.966+0000 7fad55341700  1 dump: 
00000000  08 00 00 00 00 00 00 00  01 01 18 00 00 00 03 00  |................|
00000010  00 00 00 00 00 00 00 00  00 00 00 00 00 00 01 00  |................|
00000020  00 00 00 00 00 00 01 00  00 00 01 01 08 00 00 00  |................|
00000030  00 00 00 00 00 00 00 00  02 00 00 00 01 01 08 00  |................|
00000040  00 00 00 00 00 00 00 00  00 00 03 00 00 00 01 01  |................|
00000050  08 00 00 00 fa 99 00 00  00 b5 5b 12 04 00 00 00  |..........[.....|
00000060  01 01 18 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000070  00 00 00 00 00 00 01 00  00 00 00 00 00 00 05 00  |................|
00000080  00 00 01 01 10 00 00 00  00 00 00 00 00 00 00 00  |................|
00000090  01 00 00 00 00 00 00 00  06 00 00 00 01 01 10 00  |................|
000000a0  00 00 01 00 00 00 00 00  00 00 01 00 00 00 00 00  |................|
000000b0  00 00 07 00 00 00 01 01  10 00 00 00 00 00 00 00  |................|
000000c0  00 00 00 00 01 00 00 00  00 00 00 00              |............|
000000cc

Related issues

Copied to CephFS - Backport #50225: pacific: mds: failed to decode message of type 29 v1: void CapInfoPayload::decode In Progress

History

#2 Updated by Venky Shankar 27 days ago

Patrick mentioned that this could be related to the testing kernel (as Jeff merged some of Xiubo's patches that adds various metrics in the client).

#3 Updated by Patrick Donnelly 27 days ago

I think this might just be because the pacific branch was missing: https://tracker.ceph.com/issues/46865

#4 Updated by Patrick Donnelly 27 days ago

Patrick Donnelly wrote:

I think this might just be because the pacific branch was missing: https://tracker.ceph.com/issues/49687

#5 Updated by Xiubo Li 27 days ago

00000000  08 00 00 00 00 00 00 00  01 01 18 00 00 00 03 00  |................|
00000010  00 00 00 00 00 00 00 00  00 00 00 00 00 00 01 00  |................|
00000020  00 00 00 00 00 00 01 00  00 00 01 01 08 00 00 00  |................|
00000030  00 00 00 00 00 00 00 00  02 00 00 00 01 01 08 00  |................|
...

Checked the above binary one by one, they are all correct so far:
The first 4 bytes means there have 8 items, and the second 4 bytes is the CapInfoPayload's type, which is 0. And the version/campat_version are 0x01 and 0x01, and the length for CapInfoPayload is 0x00000018(only 24 bytes), which is far less than the bl's total length of 204 bytes.

#6 Updated by Xiubo Li 27 days ago

Locally I have built the origin/pacific ceph and with the lasted origin/testing kclient, it works well:

# ./bin/ceph fs perf stats | python3 -m json.tool
    "version": 1,
    "global_counters": [
        "cap_hit",
        "read_latency",
        "write_latency",
        "metadata_latency",
        "dentry_lease",
        "opened_files",
        "pinned_icaps",
        "opened_inodes" 
    ],
    "counters": [],
    "client_metadata": {
        "client.4292": {
            "IP": "v1:10.72.47.117",
            "hostname": "lxbceph1",
            "root": "/",
            "mount_point": "N/A",
            "valid_metrics": [
                "cap_hit",
                "read_latency",
                "write_latency",
                "metadata_latency",
                "dentry_lease",
                "opened_files",
                "pinned_icaps",
                "opened_inodes" 
            ],
            "kernel_version": "5.12.0-rc2+" 
        }
    },
    "global_metrics": {
        "client.4292": [
            [
                34,
                4
            ],
            [
                3744,
                18000000
            ],
            [
                41188,
                300000000
            ],
            [
                63285,
                751000000
            ],
            [
                5,
                0
            ],
            [
                0,
                1
            ],
            [
                1,
                1
            ],
            [
                0,
                1
            ]
        ]
    },
    "metrics": {
        "delayed_ranks": [],
        "mds.0": {
            "client.4292": []
        }
    }
}

And added some debug log to dump the hex manully:

2021-03-26T12:43:07.850+0800 151893f4b700  0 dump: 
00000000  08 00 00 00 00 00 00 00  01 01 18 00 00 00 22 00  |..............".|
00000010  00 00 00 00 00 00 04 00  00 00 00 00 00 00 01 00  |................|
00000020  00 00 00 00 00 00 01 00  00 00 01 01 08 00 00 00  |................|
00000030  a0 0e 00 00 80 a8 12 01  02 00 00 00 01 01 08 00  |................|
00000040  00 00 e4 a0 00 00 00 a3  e1 11 03 00 00 00 01 01  |................|
00000050  08 00 00 00 35 f7 00 00  c0 59 c3 2c 04 00 00 00  |....5....Y.,....|
00000060  01 01 18 00 00 00 05 00  00 00 00 00 00 00 00 00  |................|
00000070  00 00 00 00 00 00 01 00  00 00 00 00 00 00 05 00  |................|
00000080  00 00 01 01 10 00 00 00  00 00 00 00 00 00 00 00  |................|
00000090  01 00 00 00 00 00 00 00  06 00 00 00 01 01 10 00  |................|
000000a0  00 00 01 00 00 00 00 00  00 00 01 00 00 00 00 00  |................|
000000b0  00 00 07 00 00 00 01 01  10 00 00 00 00 00 00 00  |................|
000000c0  00 00 00 00 01 00 00 00  00 00 00 00              |............|
000000cc

2021-03-26T12:43:07.850+0800 151893f4b700 20 mds.0.5 get_session have 0x55b5ce709e00 client.4292 v1:10.72.47.117:0/246353567 state open
2021-03-26T12:43:07.850+0800 151893f4b700 20 handle_client_metrics: mds.metrics: session=0x55b5ce709e00
2021-03-26T12:43:07.850+0800 151893f4b700 20 handle_payload: mds.metrics: type=CAP_INFO, session=0x55b5ce709e00, hits=34, misses=4
2021-03-26T12:43:07.850+0800 151893f4b700 20 handle_payload: mds.metrics: type=READ_LATENCY, session=0x55b5ce709e00, latency=3744.018000
2021-03-26T12:43:07.850+0800 151893f4b700 20 handle_payload: mds.metrics: type=WRITE_LATENCY, session=0x55b5ce709e00, latency=41188.300000
2021-03-26T12:43:07.850+0800 151893f4b700 20 handle_payload: mds.metrics: type=METADATA_LATENCY, session=0x55b5ce709e00, latenc]y=63285.751000
2021-03-26T12:43:07.850+0800 151893f4b700 20 handle_payload: mds.metrics: type=DENTRY_LEASE, session=0x55b5ce709e00, hits=5, misses=0
2021-03-26T12:43:07.850+0800 151893f4b700 20 handle_payload: mds.metrics: type=OPENED_FILES, session=0x55b5ce709e00, opened_files=0, total_inodes=1
2021-03-26T12:43:07.850+0800 151893f4b700 20 handle_payload: mds.metrics: type=PINNED_ICAPS, session=0x55b5ce709e00, pinned_icaps=1, total_inodes=1
2021-03-26T12:43:07.850+0800 151893f4b700 20 handle_payload: mds.metrics: type=OPENED_INODES, session=0x55b5ce709e00, opened_inodes=0, total_inodes=1

The same and are all as expected.

I just doubt whether is it possible in some platforms the ceph is decoding it in big endian mode ?

#7 Updated by Venky Shankar 27 days ago

Patrick Donnelly wrote:

I think this might just be because the pacific branch was missing: https://tracker.ceph.com/issues/46865

If the MDS does not understand a client metric type, it logs `UnknownPayload` and ignores the client message.

#8 Updated by Xiubo Li 26 days ago

@Venkey, with your backport patch I can reproduce it locally.

#9 Updated by Venky Shankar 26 days ago

Xiubo Li wrote:

@Venkey, with your backport patch I can reproduce it locally.

Which backport? cephfs-mirror series backports?

#10 Updated by Xiubo Li 26 days ago

Venky Shankar wrote:

Xiubo Li wrote:

@Venkey, with your backport patch I can reproduce it locally.

Which backport? cephfs-mirror series backports?

Patten for confusing. More exactly it is your test branch "wip-cephfs-mirror-pacific-backport-2-20210325-093706" in ceph-ci.

But checked the binary data, I still cannot figure out why the "struct_len > bl.get_remaining()":

  895 2021-03-26T13:53:47.980+0800 1519c9504700 -1 failed to decode message of type 29 v1: void CapInfoPayload::decode(ceph::buffer::v15_2      _0::list::const_iterator&) decode past end of struct encoding: Malformed input
  896 2021-03-26T13:53:48.030+0800 1519c9504700  0 dump:
  897 00000000  05 00 00 00 01 00 00 00  01 01 08 00 00 00 00 00  |................|
  898 00000010  00 00 00 00 00 00 02 00  00 00 01 01 08 00 00 00  |................|
  899 00000020  00 00 00 00 00 00 00 00  03 00 00 00 01 01 08 00  |................|
  900 00000030  00 00 00 00 00 00 b7 32  12 00 [00 00 00 00 01 01  |.......2........|
  901 00000040  18 00 00 00 04 00 00 00  00 00 00 00 00 00 00 00  |................|
  902 00000050  00 00 00 00 00 00 00 00  00 00 00 00] 04 00 00 00  |................|
  903 00000060  01 01 18 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
  904 00000070  00 00 00 00 00 00 00 00  00 00 00 00 00 00        |..............|
  905 0000007e

For the "CapInfoPayload", the strcut_len is 0x00000018, which equals to the bl's remaining as above in [].

I am adding some logs to print them to see what exactly the value of them.

#11 Updated by Xiubo Li 26 days ago

Venky Shankar wrote:

Xiubo Li wrote:

@Venkey, with your backport patch I can reproduce it locally.

Which backport? cephfs-mirror series backports?

I am afraid it was cause by this. I saw the metric items are not in the order we expected.

I will fix it.

#12 Updated by Xiubo Li 26 days ago

  • Status changed from New to Fix Under Review
  • Priority changed from Normal to Urgent
  • Pull request ID set to 40427

#13 Updated by Patrick Donnelly 26 days ago

  • Assignee set to Xiubo Li
  • Backport set to pacific
  • Component(FS) MDS added
  • Labels (FS) qa-failure added

#14 Updated by Patrick Donnelly 14 days ago

  • Status changed from Fix Under Review to Pending Backport

#15 Updated by Backport Bot 14 days ago

  • Copied to Backport #50225: pacific: mds: failed to decode message of type 29 v1: void CapInfoPayload::decode added

Also available in: Atom PDF