Bug #42288
ceph:: kernel BUG at mm/slub.c:3901!
Status: Closed
Description
Environment
OS: CentOS Linux release 7.3.1611 (Core)
kernel version: 4.19.15
application: kcephfs+nfs
Issue
kernel panic - not syncing: fatal exception
[5452201.213885] ------------[ cut here ]------------
[5452201.213889] kernel BUG at mm/slub.c:3901!
[5452201.213938] invalid opcode: 0000 [#1] SMP PTI
[5452201.213971] CPU: 35 PID: 3037447 Comm: kworker/35:1 Kdump: loaded Not tainted 4.19.15 #1
[5452201.214020] Hardware name: HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 01/22/2018
[5452201.214088] Workqueue: ceph-msgr ceph_con_workfn [libceph]
[5452201.214129] RIP: 0010:kfree+0x15b/0x170
[5452201.214156] Code: 8b 02 f6 c4 80 75 08 49 8b 42 08 a8 01 74 1b 49 8b 02 31 f6 f6 c4 80 74 05 41 0f b6 72 51 5b 5d 41 5c 4c 89 d7 e9 95 03 f9 ff <0f> 0b 48 83 e8 01 e9 01 ff ff ff 49 83 ea 01 e9 e9 fe ff ff 90 0f
[5452201.214262] RSP: 0018:ffffb8c3a0607cb0 EFLAGS: 00010246
[5452201.214296] RAX: ffffeee840000008 RBX: ffff9130c0000000 RCX: 0000000080200016
[5452201.214339] RDX: 00006f0ec0000000 RSI: 0000000000000000 RDI: ffff9130c0000000
[5452201.214383] RBP: ffff91107f823970 R08: 0000000000000001 R09: 0000000000000000
[5452201.214426] R10: ffffeee840000000 R11: 0000000000000001 R12: ffffffffc076c45d
[5452201.214469] R13: ffff91107f823970 R14: ffff91107f8239e0 R15: ffff91107f823900
[5452201.214513] FS: 0000000000000000(0000) GS:ffff9110bfbc0000(0000) knlGS:0000000000000000
[5452201.214562] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[5452201.214598] CR2: 000055993ab29620 CR3: 0000003a1e00a003 CR4: 00000000003606e0
[5452201.214641] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[5452201.214685] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[5452201.214728] Call Trace:
[5452201.214759] ceph_msg_release+0x15d/0x190 [libceph]
[5452201.214811] dispatch+0x66/0xa50 [ceph]
[5452201.214846] try_read+0x7f3/0x11d0 [libceph]
[5452201.214878] ? dequeue_entity+0x37e/0x7e0
[5452201.214907] ? pick_next_task_fair+0x291/0x610
[5452201.214937] ? dequeue_task_fair+0x5d/0x700
[5452201.214966] ? __switch_to+0x8c/0x470
[5452201.214999] ceph_con_workfn+0xa2/0x5b0 [libceph]
[5452201.215033] process_one_work+0x16b/0x370
[5452201.215062] worker_thread+0x49/0x3f0
[5452201.215089] kthread+0xf5/0x130
[5452201.215112] ? max_active_store+0x80/0x80
[5452201.215139] ? kthread_bind+0x10/0x10
[5452201.215167] ret_from_fork+0x1f/0x30
Source Code
3889 void kfree(const void *x)
3890 {
3891 struct page *page;
3892 void *object = (void *)x;
3893
3894 trace_kfree(_RET_IP_, x);
3895
3896 if (unlikely(ZERO_OR_NULL_PTR(x)))
3897 return;
3898
3899 page = virt_to_head_page(x);
3900 if (unlikely(!PageSlab(page))) {
3901 BUG_ON(!PageCompound(page)); <== here
3902 kfree_hook(object);
3903 __free_pages(page, compound_order(page));
3904 return;
3905 }
3906 slab_free(page->slab_cache, page, object, NULL, 1, _RET_IP_);
3907 }
3908 EXPORT_SYMBOL(kfree);
For more detail, see vmcore-dmesg.txt.
Files
Updated by Yanhu Cao over 4 years ago
fs/ceph/mds_client.c
static void dispatch(struct ceph_connection *con, struct ceph_msg *msg)
{
struct ceph_mds_session *s = con->private;
struct ceph_mds_client *mdsc = s->s_mdsc;
int type = le16_to_cpu(msg->hdr.type);
mutex_lock(&mdsc->mutex);
if (__verify_registered_session(mdsc, s) < 0) {
mutex_unlock(&mdsc->mutex);
goto out; <== should this return instead when the mds session does not exist? e.g. https://github.com/gmayyyha/ceph-client/blob/1c789249578895bb14ab62b4327306439b754857/net/ceph/mon_client.c#L1208
}
mutex_unlock(&mdsc->mutex);
switch (type) {
...
default:
pr_err("received unknown message type %d %s\n", type,
ceph_msg_type_name(type));
}
out:
ceph_msg_put(msg);
}
Updated by Patrick Donnelly over 4 years ago
- Project changed from CephFS to Linux kernel client
- Subject changed from cephfs kernel: kernel BUG at mm/slub.c:3901! to ceph:: kernel BUG at mm/slub.c:3901!
- Status changed from New to Fix Under Review
- Start date deleted (10/12/2019)
Updated by Jeff Layton over 4 years ago
Looks like slabcache corruption at first glance, or maybe a bogus pointer got passed to kfree. 4.19.15 is pretty old at this point, do you see this on newer v4.19 kernels too?
Updated by Yanhu Cao over 4 years ago
Jeff Layton wrote:
Looks like slabcache corruption at first glance, or maybe a bogus pointer got passed to kfree. 4.19.15 is pretty old at this point, do you see this on newer v4.19 kernels too?
Yes, the newer v4.19.79 kernel shows the same crash as the old v4.19.15.
Updated by Jeff Layton over 4 years ago
(cc'ing Ilya since he knows this code better)
I don't know the mon code as well, but the monitor dispatch function looks leaky to me too. process_message takes in_msg from the con, and in the case where monc->private is NULL, it just drops that msg onto the floor without releasing the reference. Maybe there is something that mitigates that but I'm not seeing it right offhand.
Updated by Ilya Dryomov over 4 years ago
I think you meant con->private. That check is bogus, con->private is set when ceph_connection is initialized and is never cleared. I'll remove it.
Updated by Jeff Layton over 4 years ago
Ilya Dryomov wrote:
I think you meant con->private. That check is bogus, con->private is set when ceph_connection is initialized and is never cleared. I'll remove it.
Yes, I meant con->private. Great, that was my take too after going over the code.
So to summarize, I think the patch proposed to fix this is not right. We need to put the reference to the message so it doesn't leak. If there's a bogus pointer in the message that's causing this, then we probably need to deal with it another way.
Updated by Yanhu Cao over 4 years ago
ceph_msg_release
  -> ceph_msg_data_destroy        <== for each data item
       -> ceph_pagelist_release   <== data->type == CEPH_MSG_DATA_PAGELIST
            -> kfree              <== crash here?
  -> kvfree                       <== !m->pool
  -> kfree                        <== crash here?
I only have vmcore-dmesg.txt, so I'm not sure which kfree caused the crash.
Updated by Jeff Layton over 4 years ago
Do you have a way to reproduce this at will?
If you have the debug symbols, you may be able to open the module with gdb and determine where it was:
$ gdb path/to/libceph.ko
...
(gdb) list *(ceph_msg_release+0x15d)
That should show you the next instruction after the kfree returns, and you can use that to infer which one it was. That said, this relies on having debug symbols and the same sources available, so ymmv.
Updated by Yanhu Cao over 4 years ago
We can only wait for it to reappear in the production environment.
Reading symbols from /export/linux-4.19.15/net/ceph/libceph.ko...done.
(gdb) list *(ceph_msg_release+0x15d)
0x348d is in ceph_msg_release (net/ceph/messenger.c:3534).
3529
3530 if (m->pool)
3531 ceph_msgpool_put(m->pool, m);
3532 else
3533 ceph_msg_free(m); <== here
3534 }
3535
Updated by Jeff Layton over 4 years ago
Do you have a coredump and a way to analyze it?
If so, then it'd be good to track down the msg and see what you can determine about the pointer that's being freed.
This may be unrelated to MDS traffic as well. It'd be good to figure out what sort of socket the thread was servicing. If you can identify the ceph_connection in some of the earlier stack frames, then you may be able to peek at it and determine the socket address of the peer.
It also wouldn't hurt to try a newer kernel on the off chance that this may already be fixed. I'm not aware of anything specific that might address this though.
So in ceph_msg_free:
static void ceph_msg_free(struct ceph_msg *m)
{
	dout("%s %p\n", __func__, m);
	kvfree(m->front.iov_base);
	kfree(m->data);
	kmem_cache_free(ceph_msg_cache, m);
}
"m" comes from a fixed-size slabcache, so the crash probably happened before it got freed. It'd be interesting to see what "m" looks like at this point.
Updated by Yanhu Cao over 4 years ago
No coredump file.
Reading symbols from /export/linux-4.19.15/fs/ceph/ceph.ko...done.
(gdb) list *(dispatch+0x66)
0x29266 is in dispatch (fs/ceph/mds_client.c:4149).
4144 pr_err("received unknown message type %d %s\n", type,
4145 ceph_msg_type_name(type));
4146 }
4147 out:
4148 ceph_msg_put(msg);
4149 }
Based on the next instruction after the ceph_msg_free return, it should be related to the MDS: dispatch in fs/ceph/mds_client.c is the only dispatch in the ceph module.
struct ceph_msg *ceph_msg_new(int type, int front_len, gfp_t flags,
bool can_fail)
{
struct ceph_msg *m;
m = kmem_cache_zalloc(ceph_msg_cache, flags);
...
/* front */
if (front_len) {
m->front.iov_base = ceph_kvmalloc(front_len, flags);
if (m->front.iov_base == NULL) {
dout("ceph_msg_new can't allocate %d bytes\n",
front_len);
goto out2;
}
} else {
m->front.iov_base = NULL; <== does kvfree(m->front.iov_base) need a NULL check for this case?
}
...
}
Updated by Yanhu Cao over 4 years ago
void *ceph_kvmalloc(size_t size, gfp_t flags)
{
if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) {
void *ptr = kmalloc(size, flags | __GFP_NOWARN);
if (ptr)
return ptr;
}
return __vmalloc(size, flags | __GFP_HIGHMEM, PAGE_KERNEL); <== Why use __GFP_HIGHMEM?
}
The __GFP_HIGHMEM flag indicates that the allocated memory may be located in high memory: the part of physical memory that is not permanently mapped by the kernel's page tables. Because this zone is not mapped into the kernel's virtual address space, the kernel cannot refer to it directly.
Updated by Yanhu Cao over 4 years ago
Yanhu Cao wrote:
[...]
The __GFP_HIGHMEM flag indicates that "the allocated memory may be located in high memory". High Memory is the part of physical memory in a computer which is not directly mapped by the page tables of its operating system kernel.
This zone of memory is not mapped in the kernel's virtual address space, and this prevents the kernel from being capable of directly referring it.
This does not apply to the kernel version used here. Ignore this.
Updated by Jeff Layton over 4 years ago
Yanhu Cao wrote:
No coredump file.
Too bad. That would help expedite this.
It should be related to MDS according to the next instruction returned by ceph_msg_free, and there is only mds_client.c: dispatch in the CEPH module.
Quite right. I missed that that address was in ceph.ko...and yes, x86_64 doesn't really have high memory since it's a 64-bit arch.
I suspect there are two possibilities here:
- some pointer in the msg structure has been double-freed
- some pointer in the msg structure has been scribbled over
Determining that without a core will be difficult. Are you building your own kernels? One possibility is to turn on KASAN. It will impact performance, but it may help detect what's going wrong earlier and provide some clues about what it is.
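For reference, on a self-built tree KASAN can be enabled with a config fragment along these lines (a sketch: on v4.19 CONFIG_KASAN is the main switch, while newer trees split it into CONFIG_KASAN_GENERIC and friends):

```shell
# From the top of the kernel source tree: turn on KASAN, then refresh
# the config.  Expect noticeable slab memory and CPU overhead.
scripts/config --enable CONFIG_KASAN
make olddefconfig
```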
Updated by Ilya Dryomov over 4 years ago
- Status changed from Fix Under Review to Need More Info
Updated by Jeff Layton about 4 years ago
- Status changed from Need More Info to Rejected
No response in 6 months. Please reopen if this is still an issue.