Bug #42288

ceph:: kernel BUG at mm/slub.c:3901!

Added by Yanhu Cao over 1 year ago. Updated about 1 year ago.

Status:
Rejected
Priority:
Normal
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

Environment
OS: CentOS Linux release 7.3.1611 (Core)
kernel version: 4.19.15
application: kcephfs+nfs

Issue
kernel panic - not syncing: fatal exception

[5452201.213885] ------------[ cut here ]------------
[5452201.213889] kernel BUG at mm/slub.c:3901!
[5452201.213938] invalid opcode: 0000 [#1] SMP PTI
[5452201.213971] CPU: 35 PID: 3037447 Comm: kworker/35:1 Kdump: loaded Not tainted 4.19.15 #1
[5452201.214020] Hardware name: HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 01/22/2018
[5452201.214088] Workqueue: ceph-msgr ceph_con_workfn [libceph]
[5452201.214129] RIP: 0010:kfree+0x15b/0x170
[5452201.214156] Code: 8b 02 f6 c4 80 75 08 49 8b 42 08 a8 01 74 1b 49 8b 02 31 f6 f6 c4 80 74 05 41 0f b6 72 51 5b 5d 41 5c 4c 89 d7 e9 95 03 f9 ff <0f> 0b 48 83 e8 01 e9 01 ff ff ff 49 83 ea 01 e9 e9 fe ff ff 90 0f
[5452201.214262] RSP: 0018:ffffb8c3a0607cb0 EFLAGS: 00010246
[5452201.214296] RAX: ffffeee840000008 RBX: ffff9130c0000000 RCX: 0000000080200016
[5452201.214339] RDX: 00006f0ec0000000 RSI: 0000000000000000 RDI: ffff9130c0000000
[5452201.214383] RBP: ffff91107f823970 R08: 0000000000000001 R09: 0000000000000000
[5452201.214426] R10: ffffeee840000000 R11: 0000000000000001 R12: ffffffffc076c45d
[5452201.214469] R13: ffff91107f823970 R14: ffff91107f8239e0 R15: ffff91107f823900
[5452201.214513] FS:  0000000000000000(0000) GS:ffff9110bfbc0000(0000) knlGS:0000000000000000
[5452201.214562] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[5452201.214598] CR2: 000055993ab29620 CR3: 0000003a1e00a003 CR4: 00000000003606e0
[5452201.214641] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[5452201.214685] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[5452201.214728] Call Trace:
[5452201.214759]  ceph_msg_release+0x15d/0x190 [libceph]
[5452201.214811]  dispatch+0x66/0xa50 [ceph]
[5452201.214846]  try_read+0x7f3/0x11d0 [libceph]
[5452201.214878]  ? dequeue_entity+0x37e/0x7e0
[5452201.214907]  ? pick_next_task_fair+0x291/0x610
[5452201.214937]  ? dequeue_task_fair+0x5d/0x700
[5452201.214966]  ? __switch_to+0x8c/0x470
[5452201.214999]  ceph_con_workfn+0xa2/0x5b0 [libceph]
[5452201.215033]  process_one_work+0x16b/0x370
[5452201.215062]  worker_thread+0x49/0x3f0
[5452201.215089]  kthread+0xf5/0x130
[5452201.215112]  ? max_active_store+0x80/0x80
[5452201.215139]  ? kthread_bind+0x10/0x10
[5452201.215167]  ret_from_fork+0x1f/0x30

Source Code

3889 void kfree(const void *x)
3890 {
3891         struct page *page;
3892         void *object = (void *)x;
3893
3894         trace_kfree(_RET_IP_, x);
3895
3896         if (unlikely(ZERO_OR_NULL_PTR(x)))
3897                 return;
3898
3899         page = virt_to_head_page(x);
3900         if (unlikely(!PageSlab(page))) {
3901                 BUG_ON(!PageCompound(page));   <== here
3902                 kfree_hook(object);
3903                 __free_pages(page, compound_order(page));
3904                 return;
3905         }
3906         slab_free(page->slab_cache, page, object, NULL, 1, _RET_IP_);
3907 }
3908 EXPORT_SYMBOL(kfree);

For more detail, see the attached vmcore-dmesg.txt.

vmcore-dmesg.txt View (265 KB) Yanhu Cao, 10/12/2019 02:11 AM

libceph.patch View (638 Bytes) Yanhu Cao, 10/12/2019 03:17 AM

History

#1 Updated by Yanhu Cao over 1 year ago

fs/ceph/mds_client.c

static void dispatch(struct ceph_connection *con, struct ceph_msg *msg)
{
    struct ceph_mds_session *s = con->private;
    struct ceph_mds_client *mdsc = s->s_mdsc;
    int type = le16_to_cpu(msg->hdr.type);

    mutex_lock(&mdsc->mutex);
    if (__verify_registered_session(mdsc, s) < 0) {
        mutex_unlock(&mdsc->mutex);
        goto out;      <== should this return instead when the MDS session does not exist? e.g. https://github.com/gmayyyha/ceph-client/blob/1c789249578895bb14ab62b4327306439b754857/net/ceph/mon_client.c#L1208
    }
    mutex_unlock(&mdsc->mutex);

    switch (type) {
    ...
    default:
        pr_err("received unknown message type %d %s\n", type,
               ceph_msg_type_name(type));
    }
out:
    ceph_msg_put(msg);
}

#2 Updated by Yanhu Cao over 1 year ago

#3 Updated by Patrick Donnelly over 1 year ago

  • Project changed from CephFS to Linux kernel client
  • Subject changed from cephfs kernel: kernel BUG at mm/slub.c:3901! to ceph:: kernel BUG at mm/slub.c:3901!
  • Status changed from New to Fix Under Review
  • Start date deleted (10/12/2019)

#4 Updated by Jeff Layton over 1 year ago

Looks like slabcache corruption at first glance, or maybe a bogus pointer got passed to kfree. 4.19.15 is pretty old at this point; do you see this on newer v4.19 kernels too?

#5 Updated by Yanhu Cao over 1 year ago

Jeff Layton wrote:

Looks like slabcache corruption at first glance, or maybe a bogus pointer got passed to kfree. 4.19.15 is pretty old at this point; do you see this on newer v4.19 kernels too?

Yes, it's the same on the newer v4.19.79 kernel as on the old v4.19.15.

#6 Updated by Jeff Layton over 1 year ago

(cc'ing Ilya since he knows this code better)

I don't know the mon code as well, but the monitor dispatch function looks leaky to me too. process_message takes in_msg from the con, and in the case where monc->private is NULL, it just drops that msg onto the floor without releasing the reference. Maybe there is something that mitigates that but I'm not seeing it right offhand.

#7 Updated by Ilya Dryomov over 1 year ago

I think you meant con->private. That check is bogus, con->private is set when ceph_connection is initialized and is never cleared. I'll remove it.

#8 Updated by Jeff Layton over 1 year ago

Ilya Dryomov wrote:

I think you meant con->private. That check is bogus, con->private is set when ceph_connection is initialized and is never cleared. I'll remove it.

Yes, I meant con->private. Great, that was my take too after going over the code.

So to summarize, I think the patch proposed to fix this is not right. We need to put the reference to the message so it doesn't leak. If there's a bogus pointer in the message that's causing this, then we probably need to deal with it another way.

#9 Updated by Yanhu Cao over 1 year ago

ceph_msg_release
  -> ceph_msg_data_destroy <= foreach
    -> ceph_pagelist_release   <= data->type == CEPH_MSG_DATA_PAGELIST
      -> kfree <== crash here ?
  -> kvfree  <= !m->pool
    -> kfree  <== crash here ? 

I only have vmcore-dmesg.txt, so I'm not sure which kfree caused the crash.

#10 Updated by Jeff Layton over 1 year ago

Do you have a way to reproduce this at will?

If you have the debug symbols, you may be able to open the module with gdb and determine where it was:

$ gdb path/to/libceph.ko
...
gdb> list *(ceph_msg_release+0x15d)

That should show you the next instruction after the kfree returns, and you can use that to infer which call it was. That said, this relies on having debug symbols and the same sources available, so YMMV.

#11 Updated by Jeff Layton over 1 year ago

  • Assignee set to Jeff Layton

#12 Updated by Yanhu Cao over 1 year ago

No. I can only wait for it to reappear in the production environment.

Reading symbols from /export/linux-4.19.15/net/ceph/libceph.ko...done.
(gdb) list *(ceph_msg_release+0x15d)
0x348d is in ceph_msg_release (net/ceph/messenger.c:3534).
3529
3530        if (m->pool)
3531            ceph_msgpool_put(m->pool, m);
3532        else
3533            ceph_msg_free(m);   <== here  
3534    }
3535

#13 Updated by Jeff Layton over 1 year ago

Do you have a coredump and a way to analyze it?

If so, then it'd be good to track down the msg and see what you can determine about the pointer that's being freed.

This may be unrelated to MDS traffic as well. It'd be good to figure out what sort of socket the thread was servicing. If you can identify the ceph_connection in some of the earlier stack frames, then you may be able to peek at it and determine the socket address of the peer.

It also wouldn't hurt to try a newer kernel on the off chance that this may already be fixed. I'm not aware of anything specific that might address this though.

So in ceph_msg_free:

static void ceph_msg_free(struct ceph_msg *m)                                   
{                                                                               
        dout("%s %p\n", __func__, m);                                           
        kvfree(m->front.iov_base);                                              
        kfree(m->data);                                                         
        kmem_cache_free(ceph_msg_cache, m);                                     
}

"m" comes from a fixed-size slabcache, so the crash probably happened before it got freed. It'd be interesting to see what "m" looks like at this point.

#14 Updated by Yanhu Cao over 1 year ago

No coredump file.

Reading symbols from /export/linux-4.19.15/fs/ceph/ceph.ko...done.
(gdb) list *(dispatch+0x66)
0x29266 is in dispatch (fs/ceph/mds_client.c:4149).
4144            pr_err("received unknown message type %d %s\n", type,
4145                   ceph_msg_type_name(type));
4146        }
4147    out:
4148        ceph_msg_put(msg);
4149    }

It should be related to MDS, judging by the instruction following the call site resolved above: dispatch() in mds_client.c is the only dispatch function in the ceph module.

struct ceph_msg *ceph_msg_new(int type, int front_len, gfp_t flags,
                  bool can_fail)
{
    struct ceph_msg *m;

    m = kmem_cache_zalloc(ceph_msg_cache, flags);
    ...
    /* front */
    if (front_len) {
        m->front.iov_base = ceph_kvmalloc(front_len, flags);
        if (m->front.iov_base == NULL) {
            dout("ceph_msg_new can't allocate %d bytes\n",
                 front_len);
            goto out2;
        }
    } else {
        m->front.iov_base = NULL;   <== should this be checked for NULL before kvfree(m->front.iov_base)?
    }
    ...
}

#15 Updated by Yanhu Cao over 1 year ago

void *ceph_kvmalloc(size_t size, gfp_t flags)
{
    if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) {
        void *ptr = kmalloc(size, flags | __GFP_NOWARN);
        if (ptr)
            return ptr;
    }

    return __vmalloc(size, flags | __GFP_HIGHMEM, PAGE_KERNEL);  <== Why use __GFP_HIGHMEM?
}

The __GFP_HIGHMEM flag indicates that "the allocated memory may be located in high memory". High memory is the part of physical memory that is not directly mapped by the page tables of the operating system kernel.

Because this zone is not mapped into the kernel's virtual address space, the kernel cannot reference it directly.

#16 Updated by Yanhu Cao over 1 year ago

Yanhu Cao wrote:

[...]

The __GFP_HIGHMEM flag indicates that "the allocated memory may be located in high memory". High memory is the part of physical memory that is not directly mapped by the page tables of the operating system kernel.

Because this zone is not mapped into the kernel's virtual address space, the kernel cannot reference it directly.

Does not match the kernel version used. Ignore this.

#17 Updated by Jeff Layton over 1 year ago

Yanhu Cao wrote:

No coredump file.

Too bad. That would help expedite this.

It should be related to MDS, judging by the instruction following the call site resolved above: dispatch() in mds_client.c is the only dispatch function in the ceph module.

Quite right. I missed that that address was in ceph.ko...and yes, x86_64 doesn't really have high memory since it's a 64-bit arch.

I suspect there are two possibilities here:

  1. some pointer in the msg structure has been double-freed
  2. some pointer in the msg structure has been scribbled over

Determining that without a core will be difficult. Are you building your own kernels? One possibility is to turn on KASAN. It will impact performance, but it may help detect what's going wrong earlier and provide some clues about the cause.
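
For reference, on a self-built kernel KASAN is a compile-time option; a minimal sketch of the relevant .config fragment (option names from the upstream Kconfig under "Kernel hacking" -> "Memory debugging"; the exact set available depends on kernel version and architecture):

```
CONFIG_KASAN=y
CONFIG_KASAN_INLINE=y    # faster but larger kernel; CONFIG_KASAN_OUTLINE=y is the smaller alternative
CONFIG_SLUB_DEBUG=y      # extra slab consistency checking
```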

#18 Updated by Ilya Dryomov over 1 year ago

  • Status changed from Fix Under Review to Need More Info

#19 Updated by Jeff Layton about 1 year ago

  • Status changed from Need More Info to Rejected

No response in 6 months. Please reopen if this is still an issue.
