Support #15302

Kernel panic when using CEPHFS (Hammer) /w kernel vivid-lts on Ubuntu 14.04

Added by Dennis Kramer almost 4 years ago. Updated over 3 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: fs/ceph
Target version: -
% Done: 0%
Tags:
Reviewed:
Affected Versions:

Description

I now get a kernel panic when I try to mount CephFS using the vivid-lts (3.19.x) kernel on Ubuntu 14.04. The log shows:

[ 27.305173] kernel BUG at /build/linux-lts-vivid-ZTSmDy/linux-lts-vivid-3.19.0/fs/ceph/mds_client.c:1928!
[ 27.307226] invalid opcode: 0000 [#1] SMP
[ 27.307599] Modules linked in: ceph libceph libcrc32c fscache ib_iser rdma_cm iw_cm ib_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi dm_crypt ppdev parport_pc 8250_fintek pvpanic serio_raw joydev parport i2c_piix4 mac_hid hid_generic usbhid hid syscopyarea sysfillrect sysimgblt ttm drm_kms_helper psmouse drm pata_acpi floppy
[ 27.307599] CPU: 0 PID: 118 Comm: kworker/0:2 Not tainted 3.19.0-56-generic #62~14.04.1-Ubuntu
[ 27.307599] Hardware name: OpenStack Foundation OpenStack Nova, BIOS 1.7.5-20150310_111955-batsu 04/01/2014
[ 27.307599] Workqueue: ceph-msgr con_work [libceph]
[ 27.307599] task: ffff88007ca0ce80 ti: ffff88003657c000 task.ti: ffff88003657c000
[ 27.307599] RIP: 0010:[<ffffffffc02e3ed0>] [<ffffffffc02e3ed0>] __prepare_send_request+0x7d0/0x800 [ceph]
[ 27.307599] RSP: 0018:ffff88003657fb48 EFLAGS: 00010283
[ 27.307599] RAX: ffff880077e1bb02 RBX: ffff8800364af000 RCX: 0000000000000000
[ 27.307599] RDX: 0000000038a516d3 RSI: 0000000000000000 RDI: ffff880077e1baf2
[ 27.307599] RBP: ffff88003657fbe8 R08: 0000000000000000 R09: 0000000000000000
[ 27.307599] R10: ffffffffc0275fe6 R11: 0000000000000000 R12: ffff88007873c700
[ 27.307599] R13: ffff880078458000 R14: 0000000000000000 R15: ffff880077e1ba80
[ 27.307599] FS: 0000000000000000(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
[ 27.307599] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 27.307599] CR2: 00007ff6e2973000 CR3: 0000000036387000 CR4: 00000000000006f0
[ 27.307599] Stack:
[ 27.307599] ffff88003657fb8c ffffffffc02e1109 0100000000000015 0000000000000001
[ 27.307599] ffff8800362e0118 ffff880077e1bafa ffff88003657fbe8 0000000000000000
[ 27.307599] 0000000000000000 ffff8800362e0118 0000000000000000 0000000000000001
[ 27.307599] Call Trace:
[ 27.307599] [<ffffffffc02e1109>] ? __choose_mds+0x119/0x470 [ceph]
[ 27.307599] [<ffffffffc02e4130>] __do_request+0x230/0x310 [ceph]
[ 27.307599] [<ffffffffc02e4288>] __wake_requests+0x78/0xb0 [ceph]
[ 27.307599] [<ffffffffc02e7430>] dispatch+0x550/0xad0 [ceph]
[ 27.307599] [<ffffffffc027abbb>] try_read+0x4cb/0x10f0 [libceph]
[ 27.307599] [<ffffffff810ab41e>] ? dequeue_task_fair+0x44e/0x660
[ 27.307599] [<ffffffff810ac171>] ? put_prev_entity+0x31/0x3f0
[ 27.307599] [<ffffffff810a41c8>] ? sched_clock_cpu+0x98/0xc0
[ 27.307599] [<ffffffffc027b899>] con_work+0xb9/0x620 [libceph]
[ 27.307599] [<ffffffff8108dc8f>] process_one_work+0x14f/0x400
[ 27.307599] [<ffffffff8108e428>] worker_thread+0x118/0x510
[ 27.307599] [<ffffffff8108e310>] ? rescuer_thread+0x3d0/0x3d0
[ 27.307599] [<ffffffff81093902>] kthread+0xd2/0xf0
[ 27.307599] [<ffffffff81093830>] ? kthread_create_on_node+0x1c0/0x1c0
[ 27.307599] [<ffffffff817b8b98>] ret_from_fork+0x58/0x90
[ 27.307599] [<ffffffff81093830>] ? kthread_create_on_node+0x1c0/0x1c0
[ 27.307599] Code: e9 27 f9 ff ff 44 89 a3 70 02 00 00 48 89 de 4c 89 ef 44 89 65 88 e8 a0 c3 ff ff 8b 45 88 e9 09 f9 ff ff 4c 63 e0 e9 d9 f9 ff ff <0f> 0b 49 8b 8f 98 fc ff ff 4d 8b 87 a0 fc ff ff 4c 89 fa 48 c7
[ 27.307599] RIP [<ffffffffc02e3ed0>] __prepare_send_request+0x7d0/0x800 [ceph]
[ 27.307599] RSP <ffff88003657fb48>
[ 27.379039] ---[ end trace 1928182fad693b4d ]---
[ 27.379983] BUG: unable to handle kernel paging request at ffffffffffffffd8
[ 27.381275] IP: [<ffffffff81094000>] kthread_data+0x10/0x20
[ 27.382382] PGD 1c19067 PUD 1c1b067 PMD 0
[ 27.383004] Oops: 0000 [#2] SMP
[ 27.383004] Modules linked in: ceph libceph libcrc32c fscache ib_iser rdma_cm iw_cm ib_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi dm_crypt ppdev parport_pc 8250_fintek pvpanic serio_raw joydev parport i2c_piix4 mac_hid hid_generic usbhid hid syscopyarea sysfillrect sysimgblt ttm drm_kms_helper psmouse drm pata_acpi floppy
[ 27.383004] CPU: 0 PID: 118 Comm: kworker/0:2 Tainted: G D 3.19.0-56-generic #62~14.04.1-Ubuntu
[ 27.383004] Hardware name: OpenStack Foundation OpenStack Nova, BIOS 1.7.5-20150310_111955-batsu 04/01/2014
[ 27.383004] task: ffff88007ca0ce80 ti: ffff88003657c000 task.ti: ffff88003657c000
[ 27.383004] RIP: 0010:[<ffffffff81094000>] [<ffffffff81094000>] kthread_data+0x10/0x20
[ 27.383004] RSP: 0018:ffff88003657f7c8 EFLAGS: 00010096
[ 27.383004] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 000000000000000d
[ 27.383004] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88007ca0ce80
[ 27.383004] RBP: ffff88003657f7c8 R08: 0000000000000000 R09: 0000000000000246
[ 27.383004] R10: ffffffff810721a4 R11: ffffea0001dfba00 R12: ffff88007ca0d3a0
[ 27.383004] R13: 0000000000000000 R14: 0000000000000000 R15: ffff88007ca0ce80
[ 27.383004] FS: 0000000000000000(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
[ 27.383004] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 27.383004] CR2: 0000000000000028 CR3: 0000000036387000 CR4: 00000000000006f0
[ 27.383004] Stack:
[ 27.383004] ffff88003657f7e8 ffffffff8108ed05 ffff88003657f7e8 ffff88007fc13e80
[ 27.383004] ffff88003657f858 ffffffff817b45fb ffff88007ca0ce80 0000000000013e80
[ 27.383004] ffff88003657ffd8 0000000000013e80 0000000000000004 ffff88007ca0ce80
[ 27.383004] Call Trace:
[ 27.383004] [<ffffffff8108ed05>] wq_worker_sleeping+0x15/0xa0
[ 27.383004] [<ffffffff817b45fb>] __schedule+0x5bb/0x820
[ 27.383004] [<ffffffff817b4889>] schedule+0x29/0x70
[ 27.383004] [<ffffffff810777ff>] do_exit+0x69f/0xb00
[ 27.383004] [<ffffffff810187f8>] oops_end+0xa8/0x120
[ 27.383004] [<ffffffff81018deb>] die+0x4b/0x70
[ 27.383004] [<ffffffff810153a0>] do_trap+0xb0/0x150
[ 27.383004] [<ffffffff81015a37>] do_error_trap+0x97/0x150
[ 27.383004] [<ffffffffc02e3ed0>] ? __prepare_send_request+0x7d0/0x800 [ceph]
[ 27.383004] [<ffffffff810160c0>] do_invalid_op+0x20/0x30
[ 27.383004] [<ffffffff817ba7be>] invalid_op+0x1e/0x30
[ 27.383004] [<ffffffffc0275fe6>] ? ceph_kvmalloc+0x26/0x50 [libceph]
[ 27.383004] [<ffffffffc02e3ed0>] ? __prepare_send_request+0x7d0/0x800 [ceph]
[ 27.383004] [<ffffffffc02e3a44>] ? __prepare_send_request+0x344/0x800 [ceph]
[ 27.383004] [<ffffffffc02e1109>] ? __choose_mds+0x119/0x470 [ceph]
[ 27.383004] [<ffffffffc02e4130>] __do_request+0x230/0x310 [ceph]
[ 27.383004] [<ffffffffc02e4288>] __wake_requests+0x78/0xb0 [ceph]
[ 27.383004] [<ffffffffc02e7430>] dispatch+0x550/0xad0 [ceph]
[ 27.383004] [<ffffffffc027abbb>] try_read+0x4cb/0x10f0 [libceph]
[ 27.383004] [<ffffffff810ab41e>] ? dequeue_task_fair+0x44e/0x660
[ 27.383004] [<ffffffff810ac171>] ? put_prev_entity+0x31/0x3f0
[ 27.383004] [<ffffffff810a41c8>] ? sched_clock_cpu+0x98/0xc0
[ 27.383004] [<ffffffffc027b899>] con_work+0xb9/0x620 [libceph]
[ 27.383004] [<ffffffff8108dc8f>] process_one_work+0x14f/0x400
[ 27.383004] [<ffffffff8108e428>] worker_thread+0x118/0x510
[ 27.383004] [<ffffffff8108e310>] ? rescuer_thread+0x3d0/0x3d0
[ 27.383004] [<ffffffff81093902>] kthread+0xd2/0xf0
[ 27.383004] [<ffffffff81093830>] ? kthread_create_on_node+0x1c0/0x1c0
[ 27.383004] [<ffffffff817b8b98>] ret_from_fork+0x58/0x90
[ 27.383004] [<ffffffff81093830>] ? kthread_create_on_node+0x1c0/0x1c0
[ 27.383004] Code: 00 48 89 e5 5d 48 8b 40 c8 48 c1 e8 02 83 e0 01 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 8b 87 c8 04 00 00 55 48 89 e5 <48> 8b 40 d8 5d c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00
[ 27.383004] RIP [<ffffffff81094000>] kthread_data+0x10/0x20
[ 27.383004] RSP <ffff88003657f7c8>
[ 27.383004] CR2: ffffffffffffffd8
[ 27.383004] ---[ end trace 1928182fad693b4e ]---
[ 27.383004] Fixing recursive fault but reboot is needed!

History

#1 Updated by Greg Farnum almost 4 years ago

  • Tracker changed from Bug to Support
  • Category set to fs/ceph
  • Assignee set to Zheng Yan

What's the output of "ceph -s" during this time? It kind of looks like all your connections to the cluster have failed.

#2 Updated by Randy Orr almost 4 years ago

I believe I am hitting this issue as well:

Ubuntu 14.04
ceph version 9.2.0 (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299)

ceph -s reports healthy. Mapping and mounting an rbd works without issue, but mounting cephfs will consistently give the error.

I have tested the following kernels:

linux-image-3.19.0-42-generic - mounting cephfs works fine
linux-image-3.19.0-47-generic - mounting cephfs works fine
linux-image-3.19.0-49-generic - mounting cephfs fails with above error
linux-image-3.19.0-51-generic - mounting cephfs fails with above error
linux-image-3.19.0-56-generic - mounting cephfs fails with above error

I can reproduce this on multiple hosts in my environment. Is there any other information I can provide to help debug this issue?

#3 Updated by Zheng Yan almost 4 years ago

ubuntu-vivid.git includes commit

commit ebeec2ef4cd85c971ab78a434a1f824a2fcb0447
Author: Arnd Bergmann <arnd@arndb.de>
Date:   Wed Sep 30 15:04:42 2015 +0200

    ceph: fix message length computation

    BugLink: http://bugs.launchpad.net/bugs/1523652

    commit 777d738a5e58ba3b6f3932ab1543ce93703f4873 upstream.

    create_request_message() computes the maximum length of a message,
    but uses the wrong type for the time stamp: sizeof(struct timespec)
    may be 8 or 16 depending on the architecture, while sizeof(struct
    ceph_timespec) is always 8, and that is what gets put into the
    message.

    Found while auditing the uses of timespec for y2038 problems.

    Fixes: b8e69066d8af ("ceph: include time stamp in every MDS request")
    Signed-off-by: Arnd Bergmann <arnd@arndb.de>
    Signed-off-by: Yan, Zheng <zyan@redhat.com>
    Signed-off-by: Kamal Mostafa <kamal@canonical.com>

but does not contain the backport for commit

commit 1f041a89b4f22cf2e701514f4b8f73a8b1e06a3e
Author: Yan, Zheng <zyan@redhat.com>
Date:   Tue Jan 13 15:20:52 2015 +0800

    ceph: fix request time stamp encoding

    struct timespec uses 'long' to present second and nanosecond. 'long'
    is 64 bits on 64bits machine. ceph MDS expects time stamp to be
    encoded as struct ceph_timespec, which uses 'u32' to present second
    and nanosecond.

    Signed-off-by: Yan, Zheng <zyan@redhat.com>
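
The mismatch described in the two commit messages above can be illustrated with a small sketch. It is written in Python rather than kernel C and only models the sizes involved: `CEPH_TIMESPEC` stands for the two-u32 wire format and `NATIVE_TIMESPEC_64` for `struct timespec` on a 64-bit machine. The function `encode_request`, the 4-byte header, and the length check are hypothetical illustrations, not the kernel's code. The thread does not say which assertion at mds_client.c:1928 fired, so modelling it as a length check is an assumption.

```python
import struct

# Wire format the MDS expects, per the commit message: two u32 fields.
CEPH_TIMESPEC = struct.Struct("<II")
# Stand-in for struct timespec on a 64-bit machine: two 64-bit longs.
NATIVE_TIMESPEC_64 = struct.Struct("<qq")


def encode_request(header, stamp, buf_len, stamp_fmt):
    """Append a (sec, nsec) time stamp to header using stamp_fmt.

    buf_len is the length the message buffer was allocated with.
    Raises OverflowError if the encoded message does not fit, which is
    a hypothetical stand-in for a kernel consistency check.
    """
    body = header + stamp_fmt.pack(*stamp)
    if len(body) > buf_len:
        raise OverflowError(
            "encoded %d bytes into a %d-byte buffer" % (len(body), buf_len)
        )
    return body


if __name__ == "__main__":
    header = b"\x00" * 4  # placeholder header, not the real MDS request head
    # Length computed with the ceph_timespec size (the state after ebeec2ef).
    buf_len = len(header) + CEPH_TIMESPEC.size

    print(CEPH_TIMESPEC.size, NATIVE_TIMESPEC_64.size)  # 8 16

    # Encoding with the matching 8-byte format fits the buffer.
    encode_request(header, (1, 2), buf_len, CEPH_TIMESPEC)

    # Encoding with the 16-byte native layout (no 1f041a89 backport) overruns it.
    try:
        encode_request(header, (1, 2), buf_len, NATIVE_TIMESPEC_64)
    except OverflowError as exc:
        print("overflow:", exc)
```

Under these assumptions, a kernel that carries only the length fix allocates for an 8-byte stamp but still writes 16 bytes, which is consistent with the test kernel in this thread being fixed by adding 1f041a89b4.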

#4 Updated by Zheng Yan almost 4 years ago

  • Status changed from New to 4

sent bug report to

#5 Updated by Kamal Mostafa almost 4 years ago

The Ubuntu Kernel team is now tracking this bug here:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1564950

And ...

I've constructed a test kernel for Ubuntu Vivid (amd64), comprised of 3.19.0-56.62 plus 1f041a89b4 (ceph: fix request time stamp encoding). Please confirm that this test kernel fixes the problem (installing just the linux-image-3.19...deb should be sufficient):

http://kernel.ubuntu.com/~kamal/lp1564950/

#6 Updated by Randy Orr almost 4 years ago

I can confirm that mounting a cephfs filesystem is successful using the provided test kernel.

#7 Updated by Kamal Mostafa almost 4 years ago

Thanks very much Randy. I'll see that the fix gets into the affected Ubuntu and -ckt stable kernels ASAP.

-Kamal

#8 Updated by Dennis Kramer almost 4 years ago

Thank you.
I can also confirm that it's fixed in the latest linux-generic-lts-vivid (3.19.0.58.41), which I installed from the default trusty repository.

#9 Updated by Dennis Kramer almost 4 years ago

Sorry, I was wrong. I'm still getting the same panic with 3.19.0.58.41.

#10 Updated by Kamal Mostafa almost 4 years ago

This fix is still queued and pending in the Ubuntu repos, but has not yet been released (as of 3.19.0-58.x).

In the meantime, here's an updated interim kernel for Ubuntu Vivid (amd64) for use on affected machines, comprised of 3.19.0-58 plus 1f041a89b4 (ceph: fix request time stamp encoding):

http://kernel.ubuntu.com/~kamal/lp1564950-58/

#11 Updated by Ilya Dryomov almost 4 years ago

v3.18.[26-30] are also affected. I've asked Sasha to queue up "ceph: fix request time stamp encoding".

#12 Updated by Zheng Yan over 3 years ago

  • Status changed from 4 to Resolved
