Bug #43599 (closed)

kclient: corrupt message failure on RHEL8 distribution kernel

Added by Patrick Donnelly over 4 years ago. Updated over 4 years ago.

Status: Resolved
Priority: Urgent
Category: -
Target version:
% Done: 0%
Source: Q/A
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): kceph, qa-suite
Labels (FS):
Pull request ID: 32659
Crash signature (v1):
Crash signature (v2):

Description

2020-01-13T09:20:06.126283+00:00 smithi142 kernel: libceph: mon1 172.21.15.120:6789 session established
2020-01-13T09:20:06.134352+00:00 smithi142 kernel: libceph: client4621 fsid 48ce1088-e239-461d-897a-70a91ffd5fc6
2020-01-13T09:20:06.142310+00:00 smithi142 kernel: ceph: mdsc_handle_session corrupt message mds0 len 38
2020-01-13T09:20:06.142350+00:00 smithi142 kernel: header: 00000000: 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
2020-01-13T09:20:06.142369+00:00 smithi142 kernel: header: 00000010: 16 00 7f 00 01 00 26 00 00 00 00 00 00 00 00 00  ......&.........
2020-01-13T09:20:06.142386+00:00 smithi142 kernel: header: 00000020: 00 00 00 00 02 00 00 00 00 00 00 00 00 01 00 00  ................
2020-01-13T09:20:06.142402+00:00 smithi142 kernel: header: 00000030: 00 06 f4 9c 93                                   .....
2020-01-13T09:20:06.142421+00:00 smithi142 kernel: front: 00000000: 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
2020-01-13T09:20:06.142438+00:00 smithi142 kernel: front: 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 01 01 04 00  ................
2020-01-13T09:20:06.142454+00:00 smithi142 kernel: front: 00000020: 00 00 00 00 00 00                                ......
2020-01-13T09:20:06.142472+00:00 smithi142 kernel: footer: 00000000: a0 8f fa 5d 00 00 00 00 00 00 00 00 5d 5f 89 eb  ...]........]_..
2020-01-13T09:20:06.142489+00:00 smithi142 kernel: footer: 00000010: a3 8b f2 88 05                                   .....
2020-01-13T09:21:10.061315+00:00 smithi142 kernel: ceph: mdsc_handle_session corrupt message mds0 len 38
2020-01-13T09:21:10.061384+00:00 smithi142 kernel: header: 00000000: 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
2020-01-13T09:21:10.061408+00:00 smithi142 kernel: header: 00000010: 16 00 7f 00 01 00 26 00 00 00 00 00 00 00 00 00  ......&.........
2020-01-13T09:21:10.061425+00:00 smithi142 kernel: header: 00000020: 00 00 00 00 02 00 00 00 00 00 00 00 00 01 00 00  ................
2020-01-13T09:21:10.061442+00:00 smithi142 kernel: header: 00000030: 00 f1 40 fb de                                   ..@..
2020-01-13T09:21:10.061459+00:00 smithi142 kernel: front: 00000000: 03 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
2020-01-13T09:21:10.061475+00:00 smithi142 kernel: front: 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 01 01 04 00  ................
2020-01-13T09:21:10.061492+00:00 smithi142 kernel: front: 00000020: 00 00 00 00 00 00                                ......
2020-01-13T09:21:10.061508+00:00 smithi142 kernel: footer: 00000000: e0 cf e5 da 00 00 00 00 00 00 00 00 43 18 24 12  ............C.$.
2020-01-13T09:21:10.061525+00:00 smithi142 kernel: footer: 00000010: dc f6 a6 a1 05                                   .....

From: /ceph/teuthology-archive/pdonnell-2020-01-13_01:55:25-kcephfs-wip-pdonnell-testing-20200112.224135-distro-basic-smithi/4661268/remote/smithi142/syslog/kern.log.gz

All of these failures:

Failure: Command failed on smithi124 with status 32: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage /bin/mount -t ceph :/ /home/ubuntu/cephtest/mnt.3 -v -o name=3,norequire_active_mds,conf=/etc/ceph/ceph.conf'
124 jobs
suites intersection: ['conf/{client.yaml', 'log-config.yaml', 'mds.yaml', 'mon.yaml', 'osd-asserts.yaml', 'osd.yaml}', 'overrides/{frag_enable.yaml', 'rhel_8.yaml}', 'whitelist_health.yaml', 'whitelist_wrongly_marked_down.yaml}']
suites union: ['clusters/1-mds-1-client.yaml', 'clusters/1-mds-2-client.yaml', 'clusters/1-mds-4-client.yaml', 'conf/{client.yaml', 'inline/no.yaml', 'inline/yes.yaml', 'kcephfs/cephfs/{begin.yaml', 'kcephfs/mixed-clients/{begin.yaml', 'kcephfs/recovery/{begin.yaml', 'kcephfs/thrash/{begin.yaml', 'kclient-overrides/{distro/rhel/{k-distro.yaml', 'kclient/{mount.yaml', 'log-config.yaml', 'mds.yaml', 'mon.yaml', 'ms-die-on-skipped.yaml}', 'ms-die-on-skipped.yaml}}', 'objectstore-ec/bluestore-bitmap.yaml', 'objectstore-ec/bluestore-comp-ec-root.yaml', 'objectstore-ec/bluestore-comp.yaml', 'objectstore-ec/bluestore-ec-root.yaml', 'objectstore-ec/filestore-xfs.yaml', 'osd-asserts.yaml', 'osd.yaml}', 'overrides/{distro/rhel/{k-distro.yaml', 'overrides/{frag_enable.yaml', 'rhel_8.yaml}', 'tasks/acls-kernel-client.yaml}', 'tasks/auto-repair.yaml}', 'tasks/client-limits.yaml}', 'tasks/client-recovery.yaml}', 'tasks/data-scan.yaml}', 'tasks/failover.yaml}', 'tasks/journal-repair.yaml}', 'tasks/kclient_workunit_direct_io.yaml}', 'tasks/kclient_workunit_kernel_untar_build.yaml}', 'tasks/kclient_workunit_misc.yaml}', 'tasks/kclient_workunit_o_trunc.yaml}', 'tasks/kclient_workunit_snaps.yaml}', 'tasks/kclient_workunit_suites_dbench.yaml}', 'tasks/kclient_workunit_suites_ffsb.yaml}', 'tasks/kclient_workunit_suites_fsstress.yaml}', 'tasks/kclient_workunit_suites_fsx.yaml}', 'tasks/kclient_workunit_suites_fsync.yaml}', 'tasks/kclient_workunit_suites_iozone.yaml}', 'tasks/kclient_workunit_suites_pjd.yaml}', 'tasks/kclient_workunit_trivial_sync.yaml}', 'tasks/kernel_cfuse_workunits_untarbuild_blogbench.yaml}', 'tasks/mds-flush.yaml}', 'tasks/pool-perm.yaml}', 'tasks/sessionmap.yaml}', 'tasks/volume-client.yaml}', 'thrash-health-whitelist.yaml', 'thrashers/default.yaml', 'thrashers/mds.yaml', 'thrashers/mon.yaml', 'whitelist_health.yaml', 'whitelist_wrongly_marked_down.yaml}', 'workloads/kclient_workunit_suites_ffsb.yaml}', 'workloads/kclient_workunit_suites_iozone.yaml}']

The message looks like:

2020-01-13T09:20:06.134+0000 7fe9db520700  1 -- [v2:172.21.15.120:6834/3149285072,v1:172.21.15.120:6835/3149285072] --> v1:172.21.15.142:0/2769041657 -- client_session(open) v4 -- 0x5589aec98b40 con 0x5589adf24900
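
To make the dumps above concrete (a hedged reading based on the mainline kernel's struct layouts; the RHEL8 distro kernel may differ in detail): the error line reports len 38, matching the 38 bytes of the front dump, while the fixed-size struct ceph_mds_session_head the client decodes is only 28 bytes packed. The trailing 10 bytes of the front (01 01 04 00 00 00 00 00 00 00) look like a versioned, empty metric_spec blob. A minimal sketch of the arithmetic:

    /* Sketch only: field layout copied from the mainline kernel's
     * include/linux/ceph/ceph_fs.h; the RHEL8 kernel may differ. */
    #include <stdint.h>
    #include <stdio.h>

    struct ceph_timespec {
            uint32_t tv_sec;
            uint32_t tv_nsec;
    } __attribute__((packed));

    struct ceph_mds_session_head {
            uint32_t op;                /* CEPH_SESSION_OPEN, ... */
            uint64_t seq;
            struct ceph_timespec stamp;
            uint32_t max_caps;
            uint32_t max_leases;
    } __attribute__((packed));          /* 28 bytes */

    int main(void)
    {
            uint32_t front_len = 0x26;  /* header dump, offset 0x16: 26 00 00 00 (LE) */

            printf("struct: %zu bytes, front: %u bytes, extra: %u bytes\n",
                   sizeof(struct ceph_mds_session_head), front_len,
                   front_len - (uint32_t)sizeof(struct ceph_mds_session_head));
            /* prints: struct: 28 bytes, front: 38 bytes, extra: 10 bytes */
            return 0;
    }

So the message itself appears well-formed; it is simply 10 bytes longer than the v1 layout this kernel insists on.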

Related issues: 1 (0 open, 1 closed)

Has duplicate CephFS - Bug #43664: mds: metric_spec is encoded into version 1 MClientSession (Duplicate)

Actions #1

Updated by Jeff Layton over 4 years ago

What kernel is this?

Actions #2

Updated by Patrick Donnelly over 4 years ago

Jeff Layton wrote:

What kernel is this?

2020-01-13T09:09:32.013 INFO:teuthology.orchestra.run.smithi166:> uname -r
2020-01-13T09:09:32.029 INFO:teuthology.orchestra.run.smithi166.stdout:4.18.0-80.11.2.el8_0.x86_64

From: /ceph/teuthology-archive/pdonnell-2020-01-13_01:55:25-kcephfs-wip-pdonnell-testing-20200112.224135-distro-basic-smithi/4661268/teuthology.log

Actions #3

Updated by Patrick Donnelly over 4 years ago

  • Assignee changed from Patrick Donnelly to Jeff Layton
Actions #4

Updated by Jeff Layton over 4 years ago

This is just one of those places where the kernel client never expected a struct to be extended. I suspect this is already fixed for RHEL 8.2. handle_session has this:

        /* decode */
        if (msg->front.iov_len != sizeof(*h))
                goto bad;

...and the struct was extended with a new version in mainline Ceph.
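
A conventional way to keep such a decoder forward-compatible (a sketch of the general technique, not the actual RHEL or mainline patch) is to treat the base struct as a minimum size and ignore versioned extensions appended by newer peers:

    /* Sketch only, in the style of handle_session(); 'msg' is the
     * incoming struct ceph_msg, as in the snippet above. */
    struct ceph_mds_session_head *h = msg->front.iov_base;

    /* Accept any payload at least as large as the base struct and skip
     * trailing extension bytes (such as the metric_spec appended by
     * newer MDS daemons) rather than failing on an exact-size match. */
    if (msg->front.iov_len < sizeof(*h))
            goto bad;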

This may be a regression in the MDS introduced by 55d8fdef68740c4f0a83d55a04ac7a10ff4db15b. Should the encoding of metric_spec also be gated on something that ensures the client can parse it? A sketch of that idea follows.
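
On the sending side, such gating would look roughly like the following (an illustrative sketch with invented helper names, not Ceph's API; the real MDS code is C++ and the actual change landed via the pull request referenced below):

    /* Illustrative only: all helper names here are hypothetical. */
    void encode_client_session(struct msg_buffer *out, const struct session *s)
    {
            encode_session_head(out, s);        /* fixed 28-byte v1 head */

            /* Only append the newer field when the client has advertised
             * a feature bit saying it knows how to decode it. */
            if (client_has_feature(s, FEATURE_METRIC_COLLECT))
                    encode_metric_spec(out, s);
    }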

Actions #5

Updated by Patrick Donnelly over 4 years ago

  • Status changed from New to In Progress
  • Assignee changed from Jeff Layton to Patrick Donnelly
Actions #6

Updated by Patrick Donnelly over 4 years ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 32659
Actions #7

Updated by Patrick Donnelly over 4 years ago

  • Has duplicate Bug #43664: mds: metric_spec is encoded into version 1 MClientSession added
Actions #8

Updated by Patrick Donnelly over 4 years ago

  • Status changed from Fix Under Review to Resolved