Bug #53819

Kernel null pointer dereference during kernel mount fsync on Linux 5.15

Added by Niklas Hambuechen over 2 years ago. Updated 12 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
fs/ceph
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
Yes
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
kcephfs
Crash signature (v1):
Crash signature (v2):

Description

Extracted from https://tracker.ceph.com/issues/53809#note-8:

After upgrading from a 5.10 kernel to a 5.15 kernel today, it crashed all 3 server/OSD nodes of my Ceph cluster simultaneously with a kernel null pointer dereference:

Jan 10 15:23:46 node-4 kernel: BUG: kernel NULL pointer dereference, address: 0000000000000008
Jan 10 15:23:46 node-4 kernel: #PF: supervisor read access in kernel mode
Jan 10 15:23:46 node-4 kernel: #PF: error_code(0x0000) - not-present page

I suspect that the crash is in the Ceph kernel module, because these machines mainly run Ceph, and there wouldn't be much else that could synchronise those machines to crash at exactly the same time.

2 more nodes that are Ceph clients (not servers) were also upgraded to the 5.15 kernel. Those did not crash.

The Ceph servers and OSDs run v16.2.7.

More evidence that it's related to Ceph: there's fsync in the crash trace:

Call Trace:
 <TASK>
 ? __fget_files+0x97/0xc0
 __x64_sys_fsync+0x34/0x60
 do_syscall_64...

The crash appeared approximately 13 hours 12 minutes after the 5.15 kernel booted and mounted the CephFS.

This is a regression, since the 5.10 kernel did not crash.


Actions #1

Updated by Niklas Hambuechen over 2 years ago

Attaching photo of one of the crashed machines' physical screens showing the visible part of the kernel dump.

Actions #2

Updated by Xiubo Li over 2 years ago

Please see whether the following commit will fix it:

commit 708c87168b6121abc74b2a57d0c498baaf70cbea
Author: Dan Carpenter <dan.carpenter@oracle.com>
Date:   Mon Sep 6 12:43:01 2021 +0300

    ceph: fix off by one bugs in unsafe_request_wait()

    The "> max" tests should be ">= max" to prevent an out of bounds access
    on the next lines.

    Fixes: e1a4541ec0b9 ("ceph: flush the mdlog before waiting on unsafe reqs")
    Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
    Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
    Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

Also, please try the latest upstream kernel if possible to see whether you still hit the crash; it works well for me.

Actions #3

Updated by Niklas Hambuechen over 2 years ago

Xiubo Li wrote:

Please see whether the following commit will fix it:

According to https://github.com/torvalds/linux/commit/708c87168b6121abc74b2a57d0c498baaf70cbea, that patch is already in tag v5.15, so the kernel I was running when it crashed should already have it.

Actions #4

Updated by Xiubo Li over 2 years ago

I didn't see any crash when running your test script against https://github.com/ceph/ceph-client. Could you check whether this has been fixed upstream?

Actions #5

Updated by Niklas Hambuechen over 2 years ago

Sorry for the misunderstanding:

The test script I posted in https://tracker.ceph.com/issues/53809 cannot be used to reproduce this crash; it is not what crashed my nodes here.

The slowness of that script was fixed by kernel 5.15. So I deployed 5.15 to my general production cluster that runs all kinds of tasks (including some fsyncs). That cluster crashed after ~13 hours of operation on the 5.15 kernel.

I don't have a reproducer to immediately crash the cluster, it just happened after normal operations. The only reason why I think it is related to https://tracker.ceph.com/issues/53809 is because there is fsync in the stack trace.

Actions #6

Updated by Xiubo Li over 2 years ago

  • Status changed from New to Need More Info

Niklas Hambuechen wrote:

Sorry for the misunderstanding:

The test script I posted in https://tracker.ceph.com/issues/53809 cannot be used to reproduce this crash; it is not what crashed my nodes here.

The slowness of that script was fixed by kernel 5.15. So I deployed 5.15 to my general production cluster that runs all kinds of tasks (including some fsyncs). That cluster crashed after ~13 hours of operation on the 5.15 kernel.

I don't have a reproducer to immediately crash the cluster, it just happened after normal operations. The only reason why I think it is related to https://tracker.ceph.com/issues/53809 is because there is fsync in the stack trace.

Locally I have run the xfstests test cases against the testing branch in the ceph-client tree and couldn't see any issue, nor any crash like this.

Could you try to reproduce it and post more logs or crash call traces, or provide the steps to reproduce it?

Actions #7

Updated by Xiubo Li over 2 years ago

  • Assignee set to Xiubo Li
Actions #8

Updated by Niklas Hambuechen over 2 years ago

So far I couldn't get a setup to reproduce it easily, since the crash happened on my production cluster only, and experimenting with that takes some effort.

Regarding the crash call traces: I posted a trace in the screenshot of the crashed machine, which shows the kernel panic. Is that the type of trace you are looking for?

Actions #9

Updated by Xiubo Li over 2 years ago

  • Status changed from Need More Info to Fix Under Review

Niklas Hambuechen wrote:

So far I couldn't get a setup to reproduce it easily, since the crash happened on my production cluster only, and experimenting with that takes some effort.

On the crash call traces, I posted a trace in the screenshot of the crashed machine which shows the kernel panic. Is that the type of trace you are looking for?

I have one fix in https://patchwork.kernel.org/project/ceph-devel/list/?series=604729; this patch fixes some memory leaks and also one possible ZERO_SIZE_PTR dereference bug, which could cause the NULL pointer crash.

The `mail@...` address in the ceph project is your mail, right? I will ping Jeff to add the 'Reported-by:' tag in that patch.

Actions #10

Updated by Niklas Hambuechen over 2 years ago

Xiubo Li wrote:

The `mail@...` address in the ceph project is your mail, right? I will ping Jeff to add the 'Reported-by:' tag in that patch.

Yes, I'm https://github.com/nh2

Actions #11

Updated by Xiubo Li over 2 years ago

Niklas Hambuechen wrote:

Xiubo Li wrote:

The `mail@...` address in the ceph project is your mail, right? I will ping Jeff to add the 'Reported-by:' tag in that patch.

Yes, I'm https://github.com/nh2

Sure, thanks.

Actions #12

Updated by Xiubo Li over 2 years ago

  • Status changed from Fix Under Review to Resolved
Actions #13

Updated by Niklas Hambuechen 12 months ago

Xiubo Li wrote:

I have one fix in https://patchwork.kernel.org/project/ceph-devel/list/?series=604729; this patch fixes some memory leaks and also one possible ZERO_SIZE_PTR dereference bug, which could cause the NULL pointer crash.

The link is no longer working. Could you provide a stable link to the supposed fix?

Actions #14

Updated by Niklas Hambuechen 12 months ago

I think this is a stable link: https://patchwork.kernel.org/project/ceph-devel/patch/20220112042904.8557-1-xiubli@redhat.com/

The commit message title is:

ceph: put the requests/sessions when it fails to alloc memory
Actions #15

Updated by Xiubo Li 12 months ago

Niklas Hambuechen wrote:

I think this is a stable link: https://patchwork.kernel.org/project/ceph-devel/patch/20220112042904.8557-1-xiubli@redhat.com/

The commit message title is:

[...]

It's just archived and you need to change your filter.

Actions #16

Updated by Niklas Hambuechen 12 months ago

Hey Xiubo,

I deployed a kernel that has the linked fix in it, and the kernel NULL pointer dereference has not occurred since then.

So it does look like your fix resolved it.

Thank you!
