Bug #64471
openkernel: upgrades from quincy/v18.2.[01]/reef to main|squid fail with kernel oops
0%
Description
2024-02-16T02:34:47.515 INFO:teuthology.orchestra.run.smithi032.stdout:{ 2024-02-16T02:34:47.515 INFO:teuthology.orchestra.run.smithi032.stdout: "target_image": "quay.ceph.io/ceph-ci/ceph:f78a58c0ffd401d1493058a1022c35f011d65275", 2024-02-16T02:34:47.515 INFO:teuthology.orchestra.run.smithi032.stdout: "in_progress": true, 2024-02-16T02:34:47.515 INFO:teuthology.orchestra.run.smithi032.stdout: "which": "Upgrading all daemon types on all hosts", 2024-02-16T02:34:47.515 INFO:teuthology.orchestra.run.smithi032.stdout: "services_complete": [], 2024-02-16T02:34:47.515 INFO:teuthology.orchestra.run.smithi032.stdout: "progress": "", 2024-02-16T02:34:47.515 INFO:teuthology.orchestra.run.smithi032.stdout: "message": "Doing first pull of quay.ceph.io/ceph-ci/ceph:f78a58c0ffd401d1493058a1022c35f011d65275 image", 2024-02-16T02:34:47.515 INFO:teuthology.orchestra.run.smithi032.stdout: "is_paused": false 2024-02-16T02:34:47.515 INFO:teuthology.orchestra.run.smithi032.stdout:} ... 2024-02-16T02:34:52.075 INFO:tasks.workunit.client.0.smithi032.stderr:+ pushd fsstress 2024-02-16T02:34:52.076 INFO:tasks.workunit.client.0.smithi032.stdout:~/cephtest/mnt.0/client.0/tmp/fsstress ~/cephtest/mnt.0/client.0/tmp 2024-02-16T02:34:52.076 INFO:tasks.workunit.client.0.smithi032.stderr:+ wget -q -O ltp-full.tgz http://download.ceph.com/qa/ltp-full-20091231.tgz 2024-02-16T02:35:10.364 INFO:journalctl@ceph.osd.2.smithi032.stdout:Feb 16 02:35:09 smithi032 ceph-a2883a70-cc72-11ee-95ba-87774f69a715-osd-2[62304]: 2024-02-16T02:35:09.988+0000 7f5e5529a700 -1 osd.2 46 heartbeat_check: no reply from 172.21.15.196:6804 osd.3 since back 2024-02-16T02:34:45.611069+0000 front 2024-02-16T02:34:45.611050+0000 (oldest deadline 2024-02-16T02:35:09.711083+0000) 2024-02-16T02:35:10.364 INFO:journalctl@ceph.osd.2.smithi032.stdout:Feb 16 02:35:09 smithi032 ceph-a2883a70-cc72-11ee-95ba-87774f69a715-osd-2[62304]: 2024-02-16T02:35:09.988+0000 7f5e5529a700 -1 osd.2 46 heartbeat_check: no reply from 172.21.15.196:6812 osd.4 since back 2024-02-16T02:34:45.611191+0000 front 2024-02-16T02:34:45.611160+0000 (oldest deadline 2024-02-16T02:35:09.711083+0000) 2024-02-16T02:35:10.364 INFO:journalctl@ceph.osd.2.smithi032.stdout:Feb 16 02:35:09 smithi032 ceph-a2883a70-cc72-11ee-95ba-87774f69a715-osd-2[62304]: 2024-02-16T02:35:09.988+0000 7f5e5529a700 -1 osd.2 46 heartbeat_check: no reply from 172.21.15.196:6820 osd.5 since back 2024-02-16T02:34:45.611126+0000 front 2024-02-16T02:34:45.611214+0000 (oldest deadline 2024-02-16T02:35:09.711083+0000) 2024-02-16T02:35:11.316 INFO:journalctl@ceph.osd.2.smithi032.stdout:Feb 16 02:35:10 smithi032 ceph-a2883a70-cc72-11ee-95ba-87774f69a715-osd-2[62304]: 2024-02-16T02:35:10.982+0000 7f5e5529a700 -1 osd.2 46 heartbeat_check: no reply from 172.21.15.196:6804 osd.3 since back 2024-02-16T02:34:45.611069+0000 front 2024-02-16T02:34:45.611050+0000 (oldest deadline 2024-02-16T02:35:09.711083+0000) 2024-02-16T02:35:11.316 INFO:journalctl@ceph.osd.2.smithi032.stdout:Feb 16 02:35:10 smithi032 ceph-a2883a70-cc72-11ee-95ba-87774f69a715-osd-2[62304]: 2024-02-16T02:35:10.982+0000 7f5e5529a700 -1 osd.2 46 heartbeat_check: no reply from 172.21.15.196:6812 osd.4 since back 2024-02-16T02:34:45.611191+0000 front 2024-02-16T02:34:45.611160+0000 (oldest deadline 2024-02-16T02:35:09.711083+0000) 2024-02-16T02:35:11.317 INFO:journalctl@ceph.osd.2.smithi032.stdout:Feb 16 02:35:10 smithi032 ceph-a2883a70-cc72-11ee-95ba-87774f69a715-osd-2[62304]: 2024-02-16T02:35:10.982+0000 7f5e5529a700 -1 osd.2 46 heartbeat_check: no reply from 172.21.15.196:6820 osd.5 since back 2024-02-16T02:34:45.611126+0000 front 2024-02-16T02:34:45.611214+0000 (oldest deadline 2024-02-16T02:35:09.711083+0000)
From: /teuthology/pdonnell-2024-02-16_01:25:08-fs:upgrade:mds_upgrade_sequence-wip-batrick-testing-20240215.160715-distro-default-smithi/7561891/teuthology.log
and many others in that run. It's quite reproducible. I don't think it happens with quincy -> main.
This might be related to #55258.
Updated by Patrick Donnelly 2 months ago
- Related to Bug #55258: lots of "heartbeat_check: no reply from X.X.X.X" in OSD logs added
Updated by Patrick Donnelly 2 months ago
- Priority changed from Immediate to Normal
This may be related to the MDSMap encoding issues #64440 crashing the kernel possibly. Lowering the severity until I can confirm. This would be most unexpected.
Updated by Radoslaw Zarzynski 2 months ago
I think we were investigating it together with Patrick on CEPH-RADOS
. The finding was that the entire node was unresponsive, plausibly because of kernel's oops.
Updated by Xiubo Li about 2 months ago
Patrick,
From console_logs/smithi196.log, the kernel just crashed when copying data from userspace:
Entering kdb (current=0xffff9115d14fb980, pid 61925) on processor 5 Oops: (null)^M due to oops @ 0xfffffffface3a1d2^M CPU: 5 PID: 61925 Comm: ld Kdump: loaded Not tainted 5.14.0-421.el9.x86_64 #1^M Hardware name: Supermicro SYS-5018R-WR/X10SRW-F, BIOS 2.0 12/17/2015^M RIP: 0010:usercopy_abort+0x74/0x76^M Code: 14 74 ad 51 48 0f 44 d6 49 c7 c3 cb 9f 73 ad 4c 89 d1 57 48 c7 c6 60 83 75 ad 48 c7 c7 00 83 75 ad 49 0f 44 f3 e8 1b 3b ff ff <0f> 0b 0f b6 d3 4d 89 e0 48 89 e9 31 f6 48 c7 c7 7f 83 75 ad e8 73^M RSP: 0018:ffffbb97c16af8d0 EFLAGS: 00010246^M RAX: 0000000000000072 RBX: 0000000000000112 RCX: 0000000000000000^M RDX: 0000000000000000 RSI: ffff911d1fd60840 RDI: ffff911d1fd60840^M RBP: 0000000000004000 R08: 80000000ffff84b4 R09: 0000000000ffff0a^M R10: 0000000000000004 R11: 0000000000000076 R12: ffff9115c0be8b00^M R13: 0000000000000001 R14: ffff911665df9f68 R15: ffff9115d16be112^M FS: 00007ff20442eb80(0000) GS:ffff911d1fd40000(0000) knlGS:0000000000000000^M CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033^M CR2: 00007ff20446142d CR3: 00000001215ec003 CR4: 00000000003706e0^M DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000^M DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400^M Call Trace:^M <TASK>^M ? show_trace_log_lvl+0x1c4/0x2df^M more>
And for the smith023 node it worked well and I didn't see any issue.
So IMO it shouldn't be the MDSMap encoding issues #64440 crashing the kernel. This should be a known issue as Rishabh hit weeks ago before my PTO as I remembered.
Updated by Xiubo Li about 2 months ago
Xiubo Li wrote:
Patrick,
From console_logs/smithi196.log, the kernel just crashed when copying data from userspace:
[...]
And for the smith023 node it worked well and I didn't see any issue.
So IMO it shouldn't be the MDSMap encoding issues #64440 crashing the kernel. This should be a known issue as Rishabh hit weeks ago before my PTO as I remembered.
I found it: https://pulpito.ceph.com/rishabh-2024-01-04_18:43:16-fs-rishabh-2024jan4-testing-default-smithi/7507410/:
In console_logs/smithi018.log
[ 1580.034891] usercopy: Kernel memory exposure attempt detected from SLUB object 'kmalloc-192' (offset 82, size 499712)!^M [ 1580.045866] ------------[ cut here ]------------^M [ 1580.050551] kernel BUG at mm/usercopy.c:102!^M ^M Entering kdb (current=0xffff8881211f5500, pid 172901) on processor 4 Oops: (null)^M due to oops @ 0xffffffff8138cabd^M CPU: 4 PID: 172901 Comm: fsstress Tainted: G S 6.6.0-g623393c9d50c #1^M Hardware name: Supermicro SYS-5018R-WR/X10SRW-F, BIOS 1.0c 09/07/2015^M RIP: 0010:usercopy_abort+0x6d/0x80^M Code: 4c 0f 44 d0 41 53 48 c7 c0 1c e9 13 82 48 c7 c6 71 62 13 82 48 0f 45 f0 48 89 f9 48 c7 c7 f0 6b 1b 82 4c 89 d2 e8 63 2b df ff <0f> 0b 49 c7 c1 44 c8 14 82 4d 89 cb 4d 89 c8 eb a5 66 90 f3 0f 1e^M RSP: 0018:ffffc90006dfba88 EFLAGS: 00010246^M RAX: 000000000000006a RBX: 000000000007a000 RCX: 0000000000000000^M RDX: 0000000000000000 RSI: ffff88885fd1d880 RDI: ffff88885fd1d880^M RBP: 000000000007a000 R08: 0000000000000000 R09: c0000000ffffdfff^M R10: 0000000000000001 R11: ffffc90006dfb930 R12: 0000000000000001^M R13: ffff8882b7bbed12 R14: ffff88827a375830 R15: ffff8882b7b44d12^M FS: 00007fb24c859500(0000) GS:ffff88885fd00000(0000) knlGS:0000000000000000^M CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033^M CR2: 000055c2bcf9eb00 CR3: 000000028956c005 CR4: 00000000001706e0^M Call Trace:^M <TASK>^M ? kdb_main_loop+0x32c/0xa10^M ? kdb_stub+0x216/0x420^M more>
Updated by Venky Shankar about 2 months ago
Xiubo Li wrote:
Patrick,
From console_logs/smithi196.log, the kernel just crashed when copying data from userspace:
[...]
And for the smith023 node it worked well and I didn't see any issue.
So IMO it shouldn't be the MDSMap encoding issues #64440 crashing the kernel. This should be a known issue as Rishabh hit weeks ago before my PTO as I remembered.
Was the testing kernel rebased from some rc release upstream at that point?
Updated by Patrick Donnelly about 1 month ago
- Project changed from RADOS to Linux kernel client
- Subject changed from osd: upgrades from v18.2.[01] to main fail with "heartbeat_check: no reply from" to kernel: upgrades from quincy/v18.2.[01]/reef to main|squid fail with kernel oops
- Priority changed from Normal to Urgent
- Target version changed from v19.0.0 to v20.0.0
So the scope of this issue is increasing (updated title). Upgrade from quincy to squid:
/teuthology/teuthology-2024-03-24_21:24:02-fs-squid-distro-default-smithi/7620150/teuthology.log
(after upgrade of mgrs)
https://pulpito.ceph.com/?branch=wip-batrick-testing-20240323.003144-squid
All upgrades from quincy|reef -> squid hit the kernel oops when using the kclient. One odd exception:
The fuse failures are https://tracker.ceph.com/issues/64502
Updated by Xiubo Li about 1 month ago
I am trying to find a way to disable the kdb first. Locally I have tried it in the suite yaml code, and it work. Need to improve it.
Updated by Venky Shankar about 1 month ago
- Category set to fs/ceph
- Assignee set to Xiubo Li
- Severity changed from 3 - minor to 1 - critical
Xiubo, I see you did some initial analysis on this. According to Patrick's latest update (note-7), its pretty reproducible (lately).
Could you plan to spend some cycles on this so see what's going on?
Updated by Xiubo Li about 1 month ago
Venky Shankar wrote:
Xiubo, I see you did some initial analysis on this. According to Patrick's latest update (note-7), its pretty reproducible (lately).
Could you plan to spend some cycles on this so see what's going on?
Yeah, sure, I just forgot to mention today I have already worked on it for a while. First I will try to disable the kdb first to see could we still see it.
Updated by Xiubo Li about 1 month ago
Patrick and Venky,
I raised a PR in teuthology to add a kdb option to allow us disable it when triggering the jobs, please see https://github.com/ceph/teuthology/pull/1929.
Updated by Venky Shankar 26 days ago
Xiubo Li wrote:
Patrick and Venky,
I raised a PR in teuthology to add a kdb option to allow us disable it when triggering the jobs, please see https://github.com/ceph/teuthology/pull/1929.
Looks good. Let'e get that in asap.
Updated by Venky Shankar 22 days ago
Seen again in squid testing. These are backport tests by Yuri so kdb was left enabled and https://github.com/ceph/teuthology/pull/1929 was merged just yesterday. Anyway:
Entering kdb (current=0xffff895569640000, pid 55999) on processor 4 Oops: (null) due to oops @ 0xffffffff9fc3f12c CPU: 4 PID: 55999 Comm: ld Kdump: loaded Not tainted 5.14.0-434.el9.x86_64 #1 Hardware name: Supermicro SYS-5018R-WR/X10SRW-F, BIOS 1.0c 09/07/2015 RIP: 0010:usercopy_abort+0x74/0x76 Code: 51 54 a0 51 48 0f 44 d6 49 c7 c3 74 dc 53 a0 4c 89 d1 57 48 c7 c6 80 c0 55 a0 48 c7 c7 20 c0 55 a0 49 0f 44 f3 e8 e9 39 ff ff <0f> 0b 0f b6 d3 4d 89 e0 48 89 e9 31 f6 48 c7 c7 9f c0 55 a0 e8 73 RSP: 0018:ffffb38142d83830 EFLAGS: 00010246 RAX: 000000000000006a RBX: 0000000000000112 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffff895b9fd20880 RDI: ffff895b9fd20880 RBP: 0000000000004000 R08: 80000000ffff84e0 R09: 0000000000ffff0a R10: 0000000000000004 R11: 000000000000006e R12: ffff895440042a00 R13: 0000000000000001 R14: ffff8955a7c75a00 R15: ffff89560662b312 FS: 00007f4abbc0ab80(0000) GS:ffff895b9fd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00005585ee783000 CR3: 000000014443c003 CR4: 00000000001706f0 Call Trace: <TASK> ? show_trace_log_lvl+0x1c4/0x2df ? show_trace_log_lvl+0x1c4/0x2df ? __check_heap_object+0xd5/0x110