Project

General

Profile

Actions

Bug #49972

closed

mds: failed to decode message of type 29 v1: void CapInfoPayload::decode

Added by Venky Shankar about 3 years ago. Updated almost 3 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
Correctness/Safety
Target version:
% Done:

0%

Source:
Community (dev)
Tags:
Backport:
pacific
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
qa-failure
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

This was seen in pacific backport branch: https://pulpito.ceph.com/vshankar-2021-03-25_05:53:38-fs-wip-cephfs-mirror-pacific-backport-2-20210325-093706-distro-basic-smithi/5997033/

(Not sure if we hit this in master)

Basically, CapInfoPayload message from client to the MDS (mds.c) fails decode:

2021-03-25T06:25:13.694+0000 7fad52b3c700  1 -- [v2:172.21.15.16:6836/2167551815,v1:172.21.15.16:6837/2167551815] --> [v2:172.21.15.16:3300/0,v1:172.21.15.16:6789/0] -- mon_subscribe({osdmap=83}) v3 -- 0x55e8dfe85d40 con 0x55e8df14e800
2021-03-25T06:25:13.938+0000 7fad50337700 10 mds.pinger is_rank_lagging: rank=0
2021-03-25T06:25:13.938+0000 7fad50337700  1 -- [v2:172.21.15.16:6836/2167551815,v1:172.21.15.16:6837/2167551815] --> [v2:172.21.15.16:6832/16896,v1:172.21.15.16:6833/16896] -- mgrreport(unknown.c +0-0 packed 1494) v9 -- 0x55e8dff12380 con 0x55e8df14f800
2021-03-25T06:25:13.966+0000 7fad55341700 -1 failed to decode message of type 29 v1: void CapInfoPayload::decode(ceph::buffer::v15_2_0::list::const_iterator&) decode past end of struct encoding: Malformed input
2021-03-25T06:25:13.966+0000 7fad55341700  1 dump: 
00000000  08 00 00 00 00 00 00 00  01 01 18 00 00 00 03 00  |................|
00000010  00 00 00 00 00 00 00 00  00 00 00 00 00 00 01 00  |................|
00000020  00 00 00 00 00 00 01 00  00 00 01 01 08 00 00 00  |................|
00000030  00 00 00 00 00 00 00 00  02 00 00 00 01 01 08 00  |................|
00000040  00 00 00 00 00 00 00 00  00 00 03 00 00 00 01 01  |................|
00000050  08 00 00 00 fa 99 00 00  00 b5 5b 12 04 00 00 00  |..........[.....|
00000060  01 01 18 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000070  00 00 00 00 00 00 01 00  00 00 00 00 00 00 05 00  |................|
00000080  00 00 01 01 10 00 00 00  00 00 00 00 00 00 00 00  |................|
00000090  01 00 00 00 00 00 00 00  06 00 00 00 01 01 10 00  |................|
000000a0  00 00 01 00 00 00 00 00  00 00 01 00 00 00 00 00  |................|
000000b0  00 00 07 00 00 00 01 01  10 00 00 00 00 00 00 00  |................|
000000c0  00 00 00 00 01 00 00 00  00 00 00 00              |............|
000000cc

Related issues 1 (0 open1 closed)

Copied to CephFS - Backport #50225: pacific: mds: failed to decode message of type 29 v1: void CapInfoPayload::decodeResolvedsinguliere _Actions
Actions #2

Updated by Venky Shankar about 3 years ago

Patrick mentioned that this could be related to the testing kernel (as Jeff merged some of Xiubo's patches that adds various metrics in the client).

Actions #3

Updated by Patrick Donnelly about 3 years ago

I think this might just be because the pacific branch was missing: https://tracker.ceph.com/issues/46865

Actions #4

Updated by Patrick Donnelly about 3 years ago

Patrick Donnelly wrote:

I think this might just be because the pacific branch was missing: https://tracker.ceph.com/issues/49687

Actions #5

Updated by Xiubo Li about 3 years ago

00000000  08 00 00 00 00 00 00 00  01 01 18 00 00 00 03 00  |................|
00000010  00 00 00 00 00 00 00 00  00 00 00 00 00 00 01 00  |................|
00000020  00 00 00 00 00 00 01 00  00 00 01 01 08 00 00 00  |................|
00000030  00 00 00 00 00 00 00 00  02 00 00 00 01 01 08 00  |................|
...

Checked the above binary one by one, they are all correct so far:
The first 4 bytes means there have 8 items, and the second 4 bytes is the CapInfoPayload's type, which is 0. And the version/campat_version are 0x01 and 0x01, and the length for CapInfoPayload is 0x00000018(only 24 bytes), which is far less than the bl's total length of 204 bytes.

Actions #6

Updated by Xiubo Li about 3 years ago

Locally I have built the origin/pacific ceph and with the lasted origin/testing kclient, it works well:

# ./bin/ceph fs perf stats | python3 -m json.tool
    "version": 1,
    "global_counters": [
        "cap_hit",
        "read_latency",
        "write_latency",
        "metadata_latency",
        "dentry_lease",
        "opened_files",
        "pinned_icaps",
        "opened_inodes" 
    ],
    "counters": [],
    "client_metadata": {
        "client.4292": {
            "IP": "v1:10.72.47.117",
            "hostname": "lxbceph1",
            "root": "/",
            "mount_point": "N/A",
            "valid_metrics": [
                "cap_hit",
                "read_latency",
                "write_latency",
                "metadata_latency",
                "dentry_lease",
                "opened_files",
                "pinned_icaps",
                "opened_inodes" 
            ],
            "kernel_version": "5.12.0-rc2+" 
        }
    },
    "global_metrics": {
        "client.4292": [
            [
                34,
                4
            ],
            [
                3744,
                18000000
            ],
            [
                41188,
                300000000
            ],
            [
                63285,
                751000000
            ],
            [
                5,
                0
            ],
            [
                0,
                1
            ],
            [
                1,
                1
            ],
            [
                0,
                1
            ]
        ]
    },
    "metrics": {
        "delayed_ranks": [],
        "mds.0": {
            "client.4292": []
        }
    }
}

And added some debug log to dump the hex manully:

2021-03-26T12:43:07.850+0800 151893f4b700  0 dump: 
00000000  08 00 00 00 00 00 00 00  01 01 18 00 00 00 22 00  |..............".|
00000010  00 00 00 00 00 00 04 00  00 00 00 00 00 00 01 00  |................|
00000020  00 00 00 00 00 00 01 00  00 00 01 01 08 00 00 00  |................|
00000030  a0 0e 00 00 80 a8 12 01  02 00 00 00 01 01 08 00  |................|
00000040  00 00 e4 a0 00 00 00 a3  e1 11 03 00 00 00 01 01  |................|
00000050  08 00 00 00 35 f7 00 00  c0 59 c3 2c 04 00 00 00  |....5....Y.,....|
00000060  01 01 18 00 00 00 05 00  00 00 00 00 00 00 00 00  |................|
00000070  00 00 00 00 00 00 01 00  00 00 00 00 00 00 05 00  |................|
00000080  00 00 01 01 10 00 00 00  00 00 00 00 00 00 00 00  |................|
00000090  01 00 00 00 00 00 00 00  06 00 00 00 01 01 10 00  |................|
000000a0  00 00 01 00 00 00 00 00  00 00 01 00 00 00 00 00  |................|
000000b0  00 00 07 00 00 00 01 01  10 00 00 00 00 00 00 00  |................|
000000c0  00 00 00 00 01 00 00 00  00 00 00 00              |............|
000000cc

2021-03-26T12:43:07.850+0800 151893f4b700 20 mds.0.5 get_session have 0x55b5ce709e00 client.4292 v1:10.72.47.117:0/246353567 state open
2021-03-26T12:43:07.850+0800 151893f4b700 20 handle_client_metrics: mds.metrics: session=0x55b5ce709e00
2021-03-26T12:43:07.850+0800 151893f4b700 20 handle_payload: mds.metrics: type=CAP_INFO, session=0x55b5ce709e00, hits=34, misses=4
2021-03-26T12:43:07.850+0800 151893f4b700 20 handle_payload: mds.metrics: type=READ_LATENCY, session=0x55b5ce709e00, latency=3744.018000
2021-03-26T12:43:07.850+0800 151893f4b700 20 handle_payload: mds.metrics: type=WRITE_LATENCY, session=0x55b5ce709e00, latency=41188.300000
2021-03-26T12:43:07.850+0800 151893f4b700 20 handle_payload: mds.metrics: type=METADATA_LATENCY, session=0x55b5ce709e00, latenc]y=63285.751000
2021-03-26T12:43:07.850+0800 151893f4b700 20 handle_payload: mds.metrics: type=DENTRY_LEASE, session=0x55b5ce709e00, hits=5, misses=0
2021-03-26T12:43:07.850+0800 151893f4b700 20 handle_payload: mds.metrics: type=OPENED_FILES, session=0x55b5ce709e00, opened_files=0, total_inodes=1
2021-03-26T12:43:07.850+0800 151893f4b700 20 handle_payload: mds.metrics: type=PINNED_ICAPS, session=0x55b5ce709e00, pinned_icaps=1, total_inodes=1
2021-03-26T12:43:07.850+0800 151893f4b700 20 handle_payload: mds.metrics: type=OPENED_INODES, session=0x55b5ce709e00, opened_inodes=0, total_inodes=1

The same and are all as expected.

I just doubt whether is it possible in some platforms the ceph is decoding it in big endian mode ?

Actions #7

Updated by Venky Shankar about 3 years ago

Patrick Donnelly wrote:

I think this might just be because the pacific branch was missing: https://tracker.ceph.com/issues/46865

If the MDS does not understand a client metric type, it logs `UnknownPayload` and ignores the client message.

Actions #8

Updated by Xiubo Li about 3 years ago

@Venkey, with your backport patch I can reproduce it locally.

Actions #9

Updated by Venky Shankar about 3 years ago

Xiubo Li wrote:

@Venkey, with your backport patch I can reproduce it locally.

Which backport? cephfs-mirror series backports?

Actions #10

Updated by Xiubo Li about 3 years ago

Venky Shankar wrote:

Xiubo Li wrote:

@Venkey, with your backport patch I can reproduce it locally.

Which backport? cephfs-mirror series backports?

Patten for confusing. More exactly it is your test branch "wip-cephfs-mirror-pacific-backport-2-20210325-093706" in ceph-ci.

But checked the binary data, I still cannot figure out why the "struct_len > bl.get_remaining()":

  895 2021-03-26T13:53:47.980+0800 1519c9504700 -1 failed to decode message of type 29 v1: void CapInfoPayload::decode(ceph::buffer::v15_2      _0::list::const_iterator&) decode past end of struct encoding: Malformed input
  896 2021-03-26T13:53:48.030+0800 1519c9504700  0 dump:
  897 00000000  05 00 00 00 01 00 00 00  01 01 08 00 00 00 00 00  |................|
  898 00000010  00 00 00 00 00 00 02 00  00 00 01 01 08 00 00 00  |................|
  899 00000020  00 00 00 00 00 00 00 00  03 00 00 00 01 01 08 00  |................|
  900 00000030  00 00 00 00 00 00 b7 32  12 00 [00 00 00 00 01 01  |.......2........|
  901 00000040  18 00 00 00 04 00 00 00  00 00 00 00 00 00 00 00  |................|
  902 00000050  00 00 00 00 00 00 00 00  00 00 00 00] 04 00 00 00  |................|
  903 00000060  01 01 18 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
  904 00000070  00 00 00 00 00 00 00 00  00 00 00 00 00 00        |..............|
  905 0000007e

For the "CapInfoPayload", the strcut_len is 0x00000018, which equals to the bl's remaining as above in [].

I am adding some logs to print them to see what exactly the value of them.

Actions #11

Updated by Xiubo Li about 3 years ago

Venky Shankar wrote:

Xiubo Li wrote:

@Venkey, with your backport patch I can reproduce it locally.

Which backport? cephfs-mirror series backports?

I am afraid it was cause by this. I saw the metric items are not in the order we expected.

I will fix it.

Actions #12

Updated by Xiubo Li about 3 years ago

  • Status changed from New to Fix Under Review
  • Priority changed from Normal to Urgent
  • Pull request ID set to 40427
Actions #13

Updated by Patrick Donnelly about 3 years ago

  • Assignee set to Xiubo Li
  • Backport set to pacific
  • Component(FS) MDS added
  • Labels (FS) qa-failure added
Actions #14

Updated by Patrick Donnelly about 3 years ago

  • Status changed from Fix Under Review to Pending Backport
Actions #15

Updated by Backport Bot about 3 years ago

  • Copied to Backport #50225: pacific: mds: failed to decode message of type 29 v1: void CapInfoPayload::decode added
Actions #16

Updated by Loïc Dachary almost 3 years ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".

Actions

Also available in: Atom PDF