Project

General

Profile

Actions

Bug #64471

open

kernel: upgrades from quincy/v18.2.[01]/reef to main|squid fail with kernel oops

Added by Patrick Donnelly 2 months ago. Updated 22 days ago.

Status:
New
Priority:
Urgent
Assignee:
Category:
fs/ceph
Target version:
% Done:

0%

Source:
Q/A
Tags:
Backport:
Regression:
No
Severity:
1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

2024-02-16T02:34:47.515 INFO:teuthology.orchestra.run.smithi032.stdout:{
2024-02-16T02:34:47.515 INFO:teuthology.orchestra.run.smithi032.stdout:    "target_image": "quay.ceph.io/ceph-ci/ceph:f78a58c0ffd401d1493058a1022c35f011d65275",
2024-02-16T02:34:47.515 INFO:teuthology.orchestra.run.smithi032.stdout:    "in_progress": true,
2024-02-16T02:34:47.515 INFO:teuthology.orchestra.run.smithi032.stdout:    "which": "Upgrading all daemon types on all hosts",
2024-02-16T02:34:47.515 INFO:teuthology.orchestra.run.smithi032.stdout:    "services_complete": [],
2024-02-16T02:34:47.515 INFO:teuthology.orchestra.run.smithi032.stdout:    "progress": "",
2024-02-16T02:34:47.515 INFO:teuthology.orchestra.run.smithi032.stdout:    "message": "Doing first pull of quay.ceph.io/ceph-ci/ceph:f78a58c0ffd401d1493058a1022c35f011d65275 image",
2024-02-16T02:34:47.515 INFO:teuthology.orchestra.run.smithi032.stdout:    "is_paused": false
2024-02-16T02:34:47.515 INFO:teuthology.orchestra.run.smithi032.stdout:}
...
2024-02-16T02:34:52.075 INFO:tasks.workunit.client.0.smithi032.stderr:+ pushd fsstress
2024-02-16T02:34:52.076 INFO:tasks.workunit.client.0.smithi032.stdout:~/cephtest/mnt.0/client.0/tmp/fsstress ~/cephtest/mnt.0/client.0/tmp
2024-02-16T02:34:52.076 INFO:tasks.workunit.client.0.smithi032.stderr:+ wget -q -O ltp-full.tgz http://download.ceph.com/qa/ltp-full-20091231.tgz
2024-02-16T02:35:10.364 INFO:journalctl@ceph.osd.2.smithi032.stdout:Feb 16 02:35:09 smithi032 ceph-a2883a70-cc72-11ee-95ba-87774f69a715-osd-2[62304]: 2024-02-16T02:35:09.988+0000 7f5e5529a700 -1 osd.2 46 heartbeat_check: no reply from 172.21.15.196:6804 osd.3 since back 2024-02-16T02:34:45.611069+0000 front 2024-02-16T02:34:45.611050+0000 (oldest deadline 2024-02-16T02:35:09.711083+0000)
2024-02-16T02:35:10.364 INFO:journalctl@ceph.osd.2.smithi032.stdout:Feb 16 02:35:09 smithi032 ceph-a2883a70-cc72-11ee-95ba-87774f69a715-osd-2[62304]: 2024-02-16T02:35:09.988+0000 7f5e5529a700 -1 osd.2 46 heartbeat_check: no reply from 172.21.15.196:6812 osd.4 since back 2024-02-16T02:34:45.611191+0000 front 2024-02-16T02:34:45.611160+0000 (oldest deadline 2024-02-16T02:35:09.711083+0000)
2024-02-16T02:35:10.364 INFO:journalctl@ceph.osd.2.smithi032.stdout:Feb 16 02:35:09 smithi032 ceph-a2883a70-cc72-11ee-95ba-87774f69a715-osd-2[62304]: 2024-02-16T02:35:09.988+0000 7f5e5529a700 -1 osd.2 46 heartbeat_check: no reply from 172.21.15.196:6820 osd.5 since back 2024-02-16T02:34:45.611126+0000 front 2024-02-16T02:34:45.611214+0000 (oldest deadline 2024-02-16T02:35:09.711083+0000)
2024-02-16T02:35:11.316 INFO:journalctl@ceph.osd.2.smithi032.stdout:Feb 16 02:35:10 smithi032 ceph-a2883a70-cc72-11ee-95ba-87774f69a715-osd-2[62304]: 2024-02-16T02:35:10.982+0000 7f5e5529a700 -1 osd.2 46 heartbeat_check: no reply from 172.21.15.196:6804 osd.3 since back 2024-02-16T02:34:45.611069+0000 front 2024-02-16T02:34:45.611050+0000 (oldest deadline 2024-02-16T02:35:09.711083+0000)
2024-02-16T02:35:11.316 INFO:journalctl@ceph.osd.2.smithi032.stdout:Feb 16 02:35:10 smithi032 ceph-a2883a70-cc72-11ee-95ba-87774f69a715-osd-2[62304]: 2024-02-16T02:35:10.982+0000 7f5e5529a700 -1 osd.2 46 heartbeat_check: no reply from 172.21.15.196:6812 osd.4 since back 2024-02-16T02:34:45.611191+0000 front 2024-02-16T02:34:45.611160+0000 (oldest deadline 2024-02-16T02:35:09.711083+0000)
2024-02-16T02:35:11.317 INFO:journalctl@ceph.osd.2.smithi032.stdout:Feb 16 02:35:10 smithi032 ceph-a2883a70-cc72-11ee-95ba-87774f69a715-osd-2[62304]: 2024-02-16T02:35:10.982+0000 7f5e5529a700 -1 osd.2 46 heartbeat_check: no reply from 172.21.15.196:6820 osd.5 since back 2024-02-16T02:34:45.611126+0000 front 2024-02-16T02:34:45.611214+0000 (oldest deadline 2024-02-16T02:35:09.711083+0000)

From: /teuthology/pdonnell-2024-02-16_01:25:08-fs:upgrade:mds_upgrade_sequence-wip-batrick-testing-20240215.160715-distro-default-smithi/7561891/teuthology.log

and many others in that run. It's quite reproducible. I don't think it happens with quincy -> main.

This might be related to #55258.


Related issues 1 (0 open1 closed)

Related to Linux kernel client - Bug #55258: lots of "heartbeat_check: no reply from X.X.X.X" in OSD logsResolvedJeff Layton

Actions
Actions #1

Updated by Patrick Donnelly 2 months ago

  • Related to Bug #55258: lots of "heartbeat_check: no reply from X.X.X.X" in OSD logs added
Actions #2

Updated by Patrick Donnelly 2 months ago

  • Priority changed from Immediate to Normal

This may be related to the MDSMap encoding issues #64440 crashing the kernel possibly. Lowering the severity until I can confirm. This would be most unexpected.

Actions #3

Updated by Radoslaw Zarzynski 2 months ago

I think we were investigating it together with Patrick on CEPH-RADOS. The finding was that the entire node was unresponsive, plausibly because of kernel's oops.

Actions #4

Updated by Xiubo Li about 2 months ago

Patrick,

From console_logs/smithi196.log, the kernel just crashed when copying data from userspace:

Entering kdb (current=0xffff9115d14fb980, pid 61925) on processor 5 Oops: (null)^M
due to oops @ 0xfffffffface3a1d2^M
CPU: 5 PID: 61925 Comm: ld Kdump: loaded Not tainted 5.14.0-421.el9.x86_64 #1^M
Hardware name: Supermicro SYS-5018R-WR/X10SRW-F, BIOS 2.0 12/17/2015^M
RIP: 0010:usercopy_abort+0x74/0x76^M
Code: 14 74 ad 51 48 0f 44 d6 49 c7 c3 cb 9f 73 ad 4c 89 d1 57 48 c7 c6 60 83 75 ad 48 c7 c7 00 83 75 ad 49 0f 44 f3 e8 1b 3b ff ff <0f> 0b 0f b6 d3 4d 89 e0 48 89 e9 31 f6 48 c7 c7 7f 83 75 ad e8 73^M
RSP: 0018:ffffbb97c16af8d0 EFLAGS: 00010246^M
RAX: 0000000000000072 RBX: 0000000000000112 RCX: 0000000000000000^M
RDX: 0000000000000000 RSI: ffff911d1fd60840 RDI: ffff911d1fd60840^M
RBP: 0000000000004000 R08: 80000000ffff84b4 R09: 0000000000ffff0a^M
R10: 0000000000000004 R11: 0000000000000076 R12: ffff9115c0be8b00^M
R13: 0000000000000001 R14: ffff911665df9f68 R15: ffff9115d16be112^M
FS:  00007ff20442eb80(0000) GS:ffff911d1fd40000(0000) knlGS:0000000000000000^M
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033^M
CR2: 00007ff20446142d CR3: 00000001215ec003 CR4: 00000000003706e0^M
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000^M
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400^M
Call Trace:^M
 <TASK>^M
 ? show_trace_log_lvl+0x1c4/0x2df^M
more>

And for the smith023 node it worked well and I didn't see any issue.

So IMO it shouldn't be the MDSMap encoding issues #64440 crashing the kernel. This should be a known issue as Rishabh hit weeks ago before my PTO as I remembered.

Actions #5

Updated by Xiubo Li about 2 months ago

Xiubo Li wrote:

Patrick,

From console_logs/smithi196.log, the kernel just crashed when copying data from userspace:

[...]

And for the smith023 node it worked well and I didn't see any issue.

So IMO it shouldn't be the MDSMap encoding issues #64440 crashing the kernel. This should be a known issue as Rishabh hit weeks ago before my PTO as I remembered.

I found it: https://pulpito.ceph.com/rishabh-2024-01-04_18:43:16-fs-rishabh-2024jan4-testing-default-smithi/7507410/:

In console_logs/smithi018.log

[ 1580.034891] usercopy: Kernel memory exposure attempt detected from SLUB object 'kmalloc-192' (offset 82, size 499712)!^M
[ 1580.045866] ------------[ cut here ]------------^M
[ 1580.050551] kernel BUG at mm/usercopy.c:102!^M
^M
Entering kdb (current=0xffff8881211f5500, pid 172901) on processor 4 Oops: (null)^M
due to oops @ 0xffffffff8138cabd^M
CPU: 4 PID: 172901 Comm: fsstress Tainted: G S                 6.6.0-g623393c9d50c #1^M
Hardware name: Supermicro SYS-5018R-WR/X10SRW-F, BIOS 1.0c 09/07/2015^M
RIP: 0010:usercopy_abort+0x6d/0x80^M
Code: 4c 0f 44 d0 41 53 48 c7 c0 1c e9 13 82 48 c7 c6 71 62 13 82 48 0f 45 f0 48 89 f9 48 c7 c7 f0 6b 1b 82 4c 89 d2 e8 63 2b df ff <0f> 0b 49 c7 c1 44 c8 14 82 4d 89 cb 4d 89 c8 eb a5 66 90 f3 0f 1e^M
RSP: 0018:ffffc90006dfba88 EFLAGS: 00010246^M
RAX: 000000000000006a RBX: 000000000007a000 RCX: 0000000000000000^M
RDX: 0000000000000000 RSI: ffff88885fd1d880 RDI: ffff88885fd1d880^M
RBP: 000000000007a000 R08: 0000000000000000 R09: c0000000ffffdfff^M
R10: 0000000000000001 R11: ffffc90006dfb930 R12: 0000000000000001^M
R13: ffff8882b7bbed12 R14: ffff88827a375830 R15: ffff8882b7b44d12^M
FS:  00007fb24c859500(0000) GS:ffff88885fd00000(0000) knlGS:0000000000000000^M
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033^M
CR2: 000055c2bcf9eb00 CR3: 000000028956c005 CR4: 00000000001706e0^M
Call Trace:^M
 <TASK>^M
 ? kdb_main_loop+0x32c/0xa10^M
 ? kdb_stub+0x216/0x420^M
more>

Actions #6

Updated by Venky Shankar about 2 months ago

Xiubo Li wrote:

Patrick,

From console_logs/smithi196.log, the kernel just crashed when copying data from userspace:

[...]

And for the smith023 node it worked well and I didn't see any issue.

So IMO it shouldn't be the MDSMap encoding issues #64440 crashing the kernel. This should be a known issue as Rishabh hit weeks ago before my PTO as I remembered.

Was the testing kernel rebased from some rc release upstream at that point?

Actions #7

Updated by Patrick Donnelly about 1 month ago

  • Project changed from RADOS to Linux kernel client
  • Subject changed from osd: upgrades from v18.2.[01] to main fail with "heartbeat_check: no reply from" to kernel: upgrades from quincy/v18.2.[01]/reef to main|squid fail with kernel oops
  • Priority changed from Normal to Urgent
  • Target version changed from v19.0.0 to v20.0.0

So the scope of this issue is increasing (updated title). Upgrade from quincy to squid:

/teuthology/teuthology-2024-03-24_21:24:02-fs-squid-distro-default-smithi/7620150/teuthology.log

(after upgrade of mgrs)

https://pulpito.ceph.com/?branch=wip-batrick-testing-20240323.003144-squid

All upgrades from quincy|reef -> squid hit the kernel oops when using the kclient. One odd exception:

https://pulpito.ceph.com/pdonnell-2024-03-23_09:56:53-fs-wip-batrick-testing-20240323.003144-squid-distro-default-smithi/7618431/

The fuse failures are https://tracker.ceph.com/issues/64502

Actions #8

Updated by Xiubo Li about 1 month ago

I am trying to find a way to disable the kdb first. Locally I have tried it in the suite yaml code, and it work. Need to improve it.

Actions #9

Updated by Venky Shankar about 1 month ago

  • Category set to fs/ceph
  • Assignee set to Xiubo Li
  • Severity changed from 3 - minor to 1 - critical

Xiubo, I see you did some initial analysis on this. According to Patrick's latest update (note-7), its pretty reproducible (lately).

Could you plan to spend some cycles on this so see what's going on?

Actions #10

Updated by Xiubo Li about 1 month ago

Venky Shankar wrote:

Xiubo, I see you did some initial analysis on this. According to Patrick's latest update (note-7), its pretty reproducible (lately).

Could you plan to spend some cycles on this so see what's going on?

Yeah, sure, I just forgot to mention today I have already worked on it for a while. First I will try to disable the kdb first to see could we still see it.

Actions #11

Updated by Xiubo Li about 1 month ago

Patrick and Venky,

I raised a PR in teuthology to add a kdb option to allow us disable it when triggering the jobs, please see https://github.com/ceph/teuthology/pull/1929.

Actions #12

Updated by Venky Shankar 26 days ago

Xiubo Li wrote:

Patrick and Venky,

I raised a PR in teuthology to add a kdb option to allow us disable it when triggering the jobs, please see https://github.com/ceph/teuthology/pull/1929.

Looks good. Let'e get that in asap.

Actions #13

Updated by Venky Shankar 22 days ago

Seen again in squid testing. These are backport tests by Yuri so kdb was left enabled and https://github.com/ceph/teuthology/pull/1929 was merged just yesterday. Anyway:

Entering kdb (current=0xffff895569640000, pid 55999) on processor 4 Oops: (null)
due to oops @ 0xffffffff9fc3f12c
CPU: 4 PID: 55999 Comm: ld Kdump: loaded Not tainted 5.14.0-434.el9.x86_64 #1
Hardware name: Supermicro SYS-5018R-WR/X10SRW-F, BIOS 1.0c 09/07/2015
RIP: 0010:usercopy_abort+0x74/0x76
Code: 51 54 a0 51 48 0f 44 d6 49 c7 c3 74 dc 53 a0 4c 89 d1 57 48 c7 c6 80 c0 55 a0 48 c7 c7 20 c0 55 a0 49 0f 44 f3 e8 e9 39 ff ff <0f> 0b 0f b6 d3 4d 89 e0 48 89 e9 31 f6 48 c7 c7 9f c0 55 a0 e8 73
RSP: 0018:ffffb38142d83830 EFLAGS: 00010246
RAX: 000000000000006a RBX: 0000000000000112 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff895b9fd20880 RDI: ffff895b9fd20880
RBP: 0000000000004000 R08: 80000000ffff84e0 R09: 0000000000ffff0a
R10: 0000000000000004 R11: 000000000000006e R12: ffff895440042a00
R13: 0000000000000001 R14: ffff8955a7c75a00 R15: ffff89560662b312
FS:  00007f4abbc0ab80(0000) GS:ffff895b9fd00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00005585ee783000 CR3: 000000014443c003 CR4: 00000000001706f0
Call Trace:
 <TASK>
 ? show_trace_log_lvl+0x1c4/0x2df
 ? show_trace_log_lvl+0x1c4/0x2df
 ? __check_heap_object+0xd5/0x110
Actions

Also available in: Atom PDF