Bug #64471: kernel: upgrades from quincy/v18.2.[01]/reef to main|squid fail with kernel oops - Linux kernel client - Ceph

Actions

Copy link

Bug #64471

open

kernel: upgrades from quincy/v18.2.[01]/reef to main|squid fail with kernel oops

Added by Patrick Donnelly 2 months ago. Updated 22 days ago.

Status:

New

Priority:

Urgent

Assignee:

Xiubo Li

Category:

fs/ceph

Target version:

Ceph - v20.0.0

% Done:

Source:

Q/A

Tags:

Backport:

Regression:

Severity:

1 - critical

Reviewed:

Affected Versions:

ceph-qa-suite:

Crash signature (v1):

Crash signature (v2):

Description

2024-02-16T02:34:47.515 INFO:teuthology.orchestra.run.smithi032.stdout:{
2024-02-16T02:34:47.515 INFO:teuthology.orchestra.run.smithi032.stdout:    "target_image": "quay.ceph.io/ceph-ci/ceph:f78a58c0ffd401d1493058a1022c35f011d65275",
2024-02-16T02:34:47.515 INFO:teuthology.orchestra.run.smithi032.stdout:    "in_progress": true,
2024-02-16T02:34:47.515 INFO:teuthology.orchestra.run.smithi032.stdout:    "which": "Upgrading all daemon types on all hosts",
2024-02-16T02:34:47.515 INFO:teuthology.orchestra.run.smithi032.stdout:    "services_complete": [],
2024-02-16T02:34:47.515 INFO:teuthology.orchestra.run.smithi032.stdout:    "progress": "",
2024-02-16T02:34:47.515 INFO:teuthology.orchestra.run.smithi032.stdout:    "message": "Doing first pull of quay.ceph.io/ceph-ci/ceph:f78a58c0ffd401d1493058a1022c35f011d65275 image",
2024-02-16T02:34:47.515 INFO:teuthology.orchestra.run.smithi032.stdout:    "is_paused": false
2024-02-16T02:34:47.515 INFO:teuthology.orchestra.run.smithi032.stdout:}
...
2024-02-16T02:34:52.075 INFO:tasks.workunit.client.0.smithi032.stderr:+ pushd fsstress
2024-02-16T02:34:52.076 INFO:tasks.workunit.client.0.smithi032.stdout:~/cephtest/mnt.0/client.0/tmp/fsstress ~/cephtest/mnt.0/client.0/tmp
2024-02-16T02:34:52.076 INFO:tasks.workunit.client.0.smithi032.stderr:+ wget -q -O ltp-full.tgz http://download.ceph.com/qa/ltp-full-20091231.tgz
2024-02-16T02:35:10.364 INFO:journalctl@ceph.osd.2.smithi032.stdout:Feb 16 02:35:09 smithi032 ceph-a2883a70-cc72-11ee-95ba-87774f69a715-osd-2[62304]: 2024-02-16T02:35:09.988+0000 7f5e5529a700 -1 osd.2 46 heartbeat_check: no reply from 172.21.15.196:6804 osd.3 since back 2024-02-16T02:34:45.611069+0000 front 2024-02-16T02:34:45.611050+0000 (oldest deadline 2024-02-16T02:35:09.711083+0000)
2024-02-16T02:35:10.364 INFO:journalctl@ceph.osd.2.smithi032.stdout:Feb 16 02:35:09 smithi032 ceph-a2883a70-cc72-11ee-95ba-87774f69a715-osd-2[62304]: 2024-02-16T02:35:09.988+0000 7f5e5529a700 -1 osd.2 46 heartbeat_check: no reply from 172.21.15.196:6812 osd.4 since back 2024-02-16T02:34:45.611191+0000 front 2024-02-16T02:34:45.611160+0000 (oldest deadline 2024-02-16T02:35:09.711083+0000)
2024-02-16T02:35:10.364 INFO:journalctl@ceph.osd.2.smithi032.stdout:Feb 16 02:35:09 smithi032 ceph-a2883a70-cc72-11ee-95ba-87774f69a715-osd-2[62304]: 2024-02-16T02:35:09.988+0000 7f5e5529a700 -1 osd.2 46 heartbeat_check: no reply from 172.21.15.196:6820 osd.5 since back 2024-02-16T02:34:45.611126+0000 front 2024-02-16T02:34:45.611214+0000 (oldest deadline 2024-02-16T02:35:09.711083+0000)
2024-02-16T02:35:11.316 INFO:journalctl@ceph.osd.2.smithi032.stdout:Feb 16 02:35:10 smithi032 ceph-a2883a70-cc72-11ee-95ba-87774f69a715-osd-2[62304]: 2024-02-16T02:35:10.982+0000 7f5e5529a700 -1 osd.2 46 heartbeat_check: no reply from 172.21.15.196:6804 osd.3 since back 2024-02-16T02:34:45.611069+0000 front 2024-02-16T02:34:45.611050+0000 (oldest deadline 2024-02-16T02:35:09.711083+0000)
2024-02-16T02:35:11.316 INFO:journalctl@ceph.osd.2.smithi032.stdout:Feb 16 02:35:10 smithi032 ceph-a2883a70-cc72-11ee-95ba-87774f69a715-osd-2[62304]: 2024-02-16T02:35:10.982+0000 7f5e5529a700 -1 osd.2 46 heartbeat_check: no reply from 172.21.15.196:6812 osd.4 since back 2024-02-16T02:34:45.611191+0000 front 2024-02-16T02:34:45.611160+0000 (oldest deadline 2024-02-16T02:35:09.711083+0000)
2024-02-16T02:35:11.317 INFO:journalctl@ceph.osd.2.smithi032.stdout:Feb 16 02:35:10 smithi032 ceph-a2883a70-cc72-11ee-95ba-87774f69a715-osd-2[62304]: 2024-02-16T02:35:10.982+0000 7f5e5529a700 -1 osd.2 46 heartbeat_check: no reply from 172.21.15.196:6820 osd.5 since back 2024-02-16T02:34:45.611126+0000 front 2024-02-16T02:34:45.611214+0000 (oldest deadline 2024-02-16T02:35:09.711083+0000)

From: /teuthology/pdonnell-2024-02-16_01:25:08-fs:upgrade:mds_upgrade_sequence-wip-batrick-testing-20240215.160715-distro-default-smithi/7561891/teuthology.log

and many others in that run. It's quite reproducible. I don't think it happens with quincy -> main.

This might be related to #55258.

Related issues 1 (0 open — 1 closed)

Actions

Copy link

Updated by Patrick Donnelly 2 months ago

Related to Bug #55258: lots of "heartbeat_check: no reply from X.X.X.X" in OSD logs added

Actions

Copy link

Updated by Patrick Donnelly 2 months ago

Priority changed from Immediate to Normal

This may be related to the MDSMap encoding issues #64440 crashing the kernel possibly. Lowering the severity until I can confirm. This would be most unexpected.

Actions

Copy link

Updated by Radoslaw Zarzynski 2 months ago

I think we were investigating it together with Patrick on CEPH-RADOS. The finding was that the entire node was unresponsive, plausibly because of kernel's oops.

Actions

Copy link

Updated by Xiubo Li about 2 months ago

Patrick,

From console_logs/smithi196.log, the kernel just crashed when copying data from userspace:

Entering kdb (current=0xffff9115d14fb980, pid 61925) on processor 5 Oops: (null)^M
due to oops @ 0xfffffffface3a1d2^M
CPU: 5 PID: 61925 Comm: ld Kdump: loaded Not tainted 5.14.0-421.el9.x86_64 #1^M
Hardware name: Supermicro SYS-5018R-WR/X10SRW-F, BIOS 2.0 12/17/2015^M
RIP: 0010:usercopy_abort+0x74/0x76^M
Code: 14 74 ad 51 48 0f 44 d6 49 c7 c3 cb 9f 73 ad 4c 89 d1 57 48 c7 c6 60 83 75 ad 48 c7 c7 00 83 75 ad 49 0f 44 f3 e8 1b 3b ff ff <0f> 0b 0f b6 d3 4d 89 e0 48 89 e9 31 f6 48 c7 c7 7f 83 75 ad e8 73^M
RSP: 0018:ffffbb97c16af8d0 EFLAGS: 00010246^M
RAX: 0000000000000072 RBX: 0000000000000112 RCX: 0000000000000000^M
RDX: 0000000000000000 RSI: ffff911d1fd60840 RDI: ffff911d1fd60840^M
RBP: 0000000000004000 R08: 80000000ffff84b4 R09: 0000000000ffff0a^M
R10: 0000000000000004 R11: 0000000000000076 R12: ffff9115c0be8b00^M
R13: 0000000000000001 R14: ffff911665df9f68 R15: ffff9115d16be112^M
FS:  00007ff20442eb80(0000) GS:ffff911d1fd40000(0000) knlGS:0000000000000000^M
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033^M
CR2: 00007ff20446142d CR3: 00000001215ec003 CR4: 00000000003706e0^M
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000^M
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400^M
Call Trace:^M
 <TASK>^M
 ? show_trace_log_lvl+0x1c4/0x2df^M
more>

And for the smith023 node it worked well and I didn't see any issue.

So IMO it shouldn't be the MDSMap encoding issues #64440 crashing the kernel. This should be a known issue as Rishabh hit weeks ago before my PTO as I remembered.

Actions

Copy link

Updated by Xiubo Li about 2 months ago

Xiubo Li wrote:

Patrick,

From console_logs/smithi196.log, the kernel just crashed when copying data from userspace:

[...]

And for the smith023 node it worked well and I didn't see any issue.

So IMO it shouldn't be the MDSMap encoding issues #64440 crashing the kernel. This should be a known issue as Rishabh hit weeks ago before my PTO as I remembered.

I found it: https://pulpito.ceph.com/rishabh-2024-01-04_18:43:16-fs-rishabh-2024jan4-testing-default-smithi/7507410/:

In console_logs/smithi018.log

[ 1580.034891] usercopy: Kernel memory exposure attempt detected from SLUB object 'kmalloc-192' (offset 82, size 499712)!^M
[ 1580.045866] ------------[ cut here ]------------^M
[ 1580.050551] kernel BUG at mm/usercopy.c:102!^M
^M
Entering kdb (current=0xffff8881211f5500, pid 172901) on processor 4 Oops: (null)^M
due to oops @ 0xffffffff8138cabd^M
CPU: 4 PID: 172901 Comm: fsstress Tainted: G S                 6.6.0-g623393c9d50c #1^M
Hardware name: Supermicro SYS-5018R-WR/X10SRW-F, BIOS 1.0c 09/07/2015^M
RIP: 0010:usercopy_abort+0x6d/0x80^M
Code: 4c 0f 44 d0 41 53 48 c7 c0 1c e9 13 82 48 c7 c6 71 62 13 82 48 0f 45 f0 48 89 f9 48 c7 c7 f0 6b 1b 82 4c 89 d2 e8 63 2b df ff <0f> 0b 49 c7 c1 44 c8 14 82 4d 89 cb 4d 89 c8 eb a5 66 90 f3 0f 1e^M
RSP: 0018:ffffc90006dfba88 EFLAGS: 00010246^M
RAX: 000000000000006a RBX: 000000000007a000 RCX: 0000000000000000^M
RDX: 0000000000000000 RSI: ffff88885fd1d880 RDI: ffff88885fd1d880^M
RBP: 000000000007a000 R08: 0000000000000000 R09: c0000000ffffdfff^M
R10: 0000000000000001 R11: ffffc90006dfb930 R12: 0000000000000001^M
R13: ffff8882b7bbed12 R14: ffff88827a375830 R15: ffff8882b7b44d12^M
FS:  00007fb24c859500(0000) GS:ffff88885fd00000(0000) knlGS:0000000000000000^M
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033^M
CR2: 000055c2bcf9eb00 CR3: 000000028956c005 CR4: 00000000001706e0^M
Call Trace:^M
 <TASK>^M
 ? kdb_main_loop+0x32c/0xa10^M
 ? kdb_stub+0x216/0x420^M
more>

Actions

Copy link

Updated by Venky Shankar about 2 months ago

Xiubo Li wrote:

Patrick,

From console_logs/smithi196.log, the kernel just crashed when copying data from userspace:

[...]

And for the smith023 node it worked well and I didn't see any issue.

So IMO it shouldn't be the MDSMap encoding issues #64440 crashing the kernel. This should be a known issue as Rishabh hit weeks ago before my PTO as I remembered.

Was the testing kernel rebased from some rc release upstream at that point?

Actions

Copy link

Updated by Patrick Donnelly about 1 month ago

Project changed from RADOS to Linux kernel client
Subject changed from osd: upgrades from v18.2.[01] to main fail with "heartbeat_check: no reply from" to kernel: upgrades from quincy/v18.2.[01]/reef to main|squid fail with kernel oops
Priority changed from Normal to Urgent
Target version changed from v19.0.0 to v20.0.0

So the scope of this issue is increasing (updated title). Upgrade from quincy to squid:

/teuthology/teuthology-2024-03-24_21:24:02-fs-squid-distro-default-smithi/7620150/teuthology.log

(after upgrade of mgrs)

https://pulpito.ceph.com/?branch=wip-batrick-testing-20240323.003144-squid

All upgrades from quincy|reef -> squid hit the kernel oops when using the kclient. One odd exception:

https://pulpito.ceph.com/pdonnell-2024-03-23_09:56:53-fs-wip-batrick-testing-20240323.003144-squid-distro-default-smithi/7618431/

The fuse failures are https://tracker.ceph.com/issues/64502

Actions

Copy link

Updated by Xiubo Li about 1 month ago

I am trying to find a way to disable the kdb first. Locally I have tried it in the suite yaml code, and it work. Need to improve it.

Actions

Copy link

Updated by Venky Shankar about 1 month ago

Category set to fs/ceph
Assignee set to Xiubo Li
Severity changed from 3 - minor to 1 - critical

Xiubo, I see you did some initial analysis on this. According to Patrick's latest update (note-7), its pretty reproducible (lately).

Could you plan to spend some cycles on this so see what's going on?

Actions

Copy link

#10

Updated by Xiubo Li about 1 month ago

Venky Shankar wrote:

Xiubo, I see you did some initial analysis on this. According to Patrick's latest update (note-7), its pretty reproducible (lately).

Could you plan to spend some cycles on this so see what's going on?

Yeah, sure, I just forgot to mention today I have already worked on it for a while. First I will try to disable the kdb first to see could we still see it.

Actions

Copy link

#11

Updated by Xiubo Li about 1 month ago

Patrick and Venky,

I raised a PR in teuthology to add a kdb option to allow us disable it when triggering the jobs, please see https://github.com/ceph/teuthology/pull/1929.

Actions

Copy link

#12

Updated by Venky Shankar 26 days ago

Xiubo Li wrote:

Patrick and Venky,

I raised a PR in teuthology to add a kdb option to allow us disable it when triggering the jobs, please see https://github.com/ceph/teuthology/pull/1929.

Looks good. Let'e get that in asap.

Actions

Copy link

#13

Updated by Venky Shankar 22 days ago

Seen again in squid testing. These are backport tests by Yuri so kdb was left enabled and https://github.com/ceph/teuthology/pull/1929 was merged just yesterday. Anyway:

Entering kdb (current=0xffff895569640000, pid 55999) on processor 4 Oops: (null)
due to oops @ 0xffffffff9fc3f12c
CPU: 4 PID: 55999 Comm: ld Kdump: loaded Not tainted 5.14.0-434.el9.x86_64 #1
Hardware name: Supermicro SYS-5018R-WR/X10SRW-F, BIOS 1.0c 09/07/2015
RIP: 0010:usercopy_abort+0x74/0x76
Code: 51 54 a0 51 48 0f 44 d6 49 c7 c3 74 dc 53 a0 4c 89 d1 57 48 c7 c6 80 c0 55 a0 48 c7 c7 20 c0 55 a0 49 0f 44 f3 e8 e9 39 ff ff <0f> 0b 0f b6 d3 4d 89 e0 48 89 e9 31 f6 48 c7 c7 9f c0 55 a0 e8 73
RSP: 0018:ffffb38142d83830 EFLAGS: 00010246
RAX: 000000000000006a RBX: 0000000000000112 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff895b9fd20880 RDI: ffff895b9fd20880
RBP: 0000000000004000 R08: 80000000ffff84e0 R09: 0000000000ffff0a
R10: 0000000000000004 R11: 000000000000006e R12: ffff895440042a00
R13: 0000000000000001 R14: ffff8955a7c75a00 R15: ffff89560662b312
FS:  00007f4abbc0ab80(0000) GS:ffff895b9fd00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00005585ee783000 CR3: 000000014443c003 CR4: 00000000001706f0
Call Trace:
 <TASK>
 ? show_trace_log_lvl+0x1c4/0x2df
 ? show_trace_log_lvl+0x1c4/0x2df
 ? __check_heap_object+0xd5/0x110

Actions

Copy link

Also available in: Atom PDF

Project

General

Profile

Ceph » Linux kernel client

Custom queries

Bug #64471

kernel: upgrades from quincy/v18.2.[01]/reef to main|squid fail with kernel oops

Updated by Patrick Donnelly 2 months ago

Updated by Patrick Donnelly 2 months ago

Updated by Radoslaw Zarzynski 2 months ago

Updated by Xiubo Li about 2 months ago

Updated by Xiubo Li about 2 months ago

Updated by Venky Shankar about 2 months ago

Updated by Patrick Donnelly about 1 month ago

Updated by Xiubo Li about 1 month ago

Updated by Venky Shankar about 1 month ago

Updated by Xiubo Li about 1 month ago

Updated by Xiubo Li about 1 month ago

Updated by Venky Shankar 26 days ago

Updated by Venky Shankar 22 days ago