Bug #53819
closedKernel null pointer derefecence during kernel mount fsync on Linux 5.15
0%
Description
Extracted from https://tracker.ceph.com/issues/53809#note-8:
After upgrading from a 5.10 kernel to 5.15 kernel today, it crashed all 3 server/OSD nodes of my Ceph cluster simulatneously with a kernel null pointer dereference:
Jan 10 15:23:46 node-4 kernel: BUG: kernel NULL pointer dereference, address: 0000000000000008 Jan 10 15:23:46 node-4 kernel: #PF: supervisor read access in kernel mode Jan 10 15:23:46 node-4 kernel: #PF: error_code(0x0000) - not-present page
I suspect that the Crash is in the Ceph kernel module, because these machines mainly run Ceph, and there wouldn't be much else that would be able to synchronise those machines to crash at exactly the same time.
2 more nodes that are Ceph clients (not servers) were also upgraded to the 5.15 kernel. Those did not crash.
The Ceph servers and OSDs run v16.2.7.
More evidence that it's related to Ceph, since there's fsync in the crash trace:
Call Trace: <TASK> ? __fget_files+0x97/0xc0 __x64_sys_fsync+0x34/0x60 do_syscall_64...
The crash appeared approximately 13:12 hours after the 5.15 kernel booted and mounted the CephFS.
This is a regression, since the 5.10 kernel did not crash.
Files