Bug #53819

Kernel null pointer dereference during kernel mount fsync on Linux 5.15

Added by Niklas Hambuechen over 2 years ago. Updated 12 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
fs/ceph
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
Yes
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
kcephfs
Crash signature (v1):
Crash signature (v2):

Description

Extracted from https://tracker.ceph.com/issues/53809#note-8:

After upgrading from a 5.10 kernel to a 5.15 kernel today, it crashed all 3 server/OSD nodes of my Ceph cluster simultaneously with a kernel null pointer dereference:

Jan 10 15:23:46 node-4 kernel: BUG: kernel NULL pointer dereference, address: 0000000000000008
Jan 10 15:23:46 node-4 kernel: #PF: supervisor read access in kernel mode
Jan 10 15:23:46 node-4 kernel: #PF: error_code(0x0000) - not-present page

I suspect that the crash is in the Ceph kernel module, because these machines mainly run Ceph, and there wouldn't be much else that could synchronise those machines to crash at exactly the same time.

2 more nodes that are Ceph clients (not servers) were also upgraded to the 5.15 kernel. Those did not crash.

The Ceph servers and OSDs run v16.2.7.

More evidence that it's related to Ceph: there's fsync in the crash trace:

Call Trace:
 <TASK>
 ? __fget_files+0x97/0xc0
 __x64_sys_fsync+0x34/0x60
 do_syscall_64...

The crash appeared approximately 13 hours 12 minutes after the 5.15 kernel booted and mounted the CephFS.

This is a regression, since the 5.10 kernel did not crash.


Actions #1

Updated by Niklas Hambuechen over 2 years ago

Attaching photo of one of the crashed machines' physical screens showing the visible part of the kernel dump.

Actions #2

Updated by Xiubo Li over 2 years ago

Please see whether the following commit will fix it:

commit 708c87168b6121abc74b2a57d0c498baaf70cbea
Author: Dan Carpenter <dan.carpenter@oracle.com>
Date:   Mon Sep 6 12:43:01 2021 +0300

    ceph: fix off by one bugs in unsafe_request_wait()

    The "> max" tests should be ">= max" to prevent an out of bounds access
    on the next lines.

    Fixes: e1a4541ec0b9 ("ceph: flush the mdlog before waiting on unsafe reqs")
    Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
    Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
    Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

Also, please try the latest upstream kernel if possible to see whether you still hit the crash; it works well for me.

Actions #3

Updated by Niklas Hambuechen over 2 years ago

Xiubo Li wrote:

Please see whether the following commit will fix it:

According to https://github.com/torvalds/linux/commit/708c87168b6121abc74b2a57d0c498baaf70cbea, that patch is already in tag v5.15, so the kernel I was running when it crashed should already have it.

Actions #4

Updated by Xiubo Li over 2 years ago

I didn't see any crash when running your test script against https://github.com/ceph/ceph-client. Could you check whether this has been fixed upstream?

Actions #5

Updated by Niklas Hambuechen over 2 years ago

Sorry for the misunderstanding:

The test script I posted in https://tracker.ceph.com/issues/53809 cannot be used to reproduce this crash; it is not what crashed my nodes here.

The slowness of that script was fixed by kernel 5.15. So I deployed 5.15 to my general production cluster that runs all kinds of tasks (including some fsyncs). That cluster crashed after ~13 hours of operation on the 5.15 kernel.

I don't have a reproducer to immediately crash the cluster, it just happened after normal operations. The only reason why I think it is related to https://tracker.ceph.com/issues/53809 is because there is fsync in the stack trace.

Actions #6

Updated by Xiubo Li over 2 years ago

  • Status changed from New to Need More Info

Niklas Hambuechen wrote:

Sorry for the misunderstanding:

The test script I posted in https://tracker.ceph.com/issues/53809 cannot be used to reproduce this crash; it is not what crashed my nodes here.

The slowness of that script was fixed by kernel 5.15. So I deployed 5.15 to my general production cluster that runs all kinds of tasks (including some fsyncs). That cluster crashed after ~13 hours of operation on the 5.15 kernel.

I don't have a reproducer to immediately crash the cluster, it just happened after normal operations. The only reason why I think it is related to https://tracker.ceph.com/issues/53809 is because there is fsync in the stack trace.

Locally I have run the xfstests test cases against the testing branch in the ceph-client tree and couldn't see any issue, nor any crash like this.

Could you try to reproduce it and post more logs or crash call traces, or provide the steps to reproduce it?

Actions #7

Updated by Xiubo Li over 2 years ago

  • Assignee set to Xiubo Li
Actions #8

Updated by Niklas Hambuechen over 2 years ago

So far I couldn't get a setup to reproduce it easily, since the crash happened on my production cluster only, and experimenting with that takes some effort.

Regarding the crash call traces: I posted a trace in the screenshot of the crashed machine, which shows the kernel panic. Is that the type of trace you are looking for?

Actions #9

Updated by Xiubo Li over 2 years ago

  • Status changed from Need More Info to Fix Under Review

Niklas Hambuechen wrote:

So far I couldn't get a setup to reproduce it easily, since the crash happened on my production cluster only, and experimenting with that takes some effort.

On the crash call traces, I posted a trace in the screenshot of the crashed machine which shows the kernel panic. Is that the type of trace you are looking for?

I have one fix in https://patchwork.kernel.org/project/ceph-devel/list/?series=604729; this patch fixes some memory leaks and also one possible ZERO_SIZE_PTR dereference bug, which could cause the NULL pointer crash.

The `mail@...` address in the ceph project is your mail, right? I will ping Jeff to add the 'Reported-by:' tag in that patch.

Actions #10

Updated by Niklas Hambuechen over 2 years ago

Xiubo Li wrote:

The `mail@...` address in the ceph project is your mail, right? I will ping Jeff to add the 'Reported-by:' tag in that patch.

Yes, I'm https://github.com/nh2

Actions #11

Updated by Xiubo Li over 2 years ago

Niklas Hambuechen wrote:

Xiubo Li wrote:

The `mail@...` address in the ceph project is your mail, right? I will ping Jeff to add the 'Reported-by:' tag in that patch.

Yes, I'm https://github.com/nh2

Sure, thanks.

Actions #12

Updated by Xiubo Li over 2 years ago

  • Status changed from Fix Under Review to Resolved
Actions #13

Updated by Niklas Hambuechen 12 months ago

Xiubo Li wrote:

I have one fix in https://patchwork.kernel.org/project/ceph-devel/list/?series=604729; this patch fixes some memory leaks and also one possible ZERO_SIZE_PTR dereference bug, which could cause the NULL pointer crash.

The link is no longer working. Could you provide a stable link to the supposed fix?

Actions #14

Updated by Niklas Hambuechen 12 months ago

I think this is a stable link: https://patchwork.kernel.org/project/ceph-devel/patch/20220112042904.8557-1-xiubli@redhat.com/

The commit message title is:

ceph: put the requests/sessions when it fails to alloc memory
Actions #15

Updated by Xiubo Li 12 months ago

Niklas Hambuechen wrote:

I think this is a stable link: https://patchwork.kernel.org/project/ceph-devel/patch/20220112042904.8557-1-xiubli@redhat.com/

The commit message title is:

[...]

It's just archived and you need to change your filter.

Actions #16

Updated by Niklas Hambuechen 12 months ago

Hey Xiubo,

I deployed a kernel that has the linked fix in it, and the kernel NULL pointer dereference has not occurred since then.

So it does look like your fix resolved it.

Thank you!
