Bug #40339

closed

kernel BUG at fs/ceph/mds_client.c:600! invalid opcode: 0000 [#1] SMP

Added by Xiaoxi Chen almost 5 years ago. Updated about 4 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

This happened on kernel 3.10.0-957.1.3.el7.x86_64.

Sequence of events at the time:
1. An MDS session was stuck in the "opening" state because of an earlier networking issue (that will be tracked in a separate ticket).
2. We tried to recover with umount + mount, but the umount took a significant amount of time.
3. Before the umount finished, we failed over the MDS that the "opening" session was targeting.

The kernel then panicked as shown below. Not sure whether it is related to https://tracker.ceph.com/issues/36299?

[2581794.468273] libceph: mds31 10.199.74.135:6801 connection reset
[2581794.475705] libceph: reset on mds31
[2581794.477068] ceph: mds31 closed our session
[2581794.478554] ceph: mds31 reconnect start
[2581794.480180] ------------[ cut here ]------------
[2581794.481739] kernel BUG at fs/ceph/mds_client.c:600!
[2581794.483300] invalid opcode: 0000 [#1] SMP
[2581794.484787] Modules linked in: tcp_diag udp_diag inet_diag unix_diag af_packet_diag netlink_diag ceph libceph libcrc32c dns_resolver sunrpc ppdev iosf_mbi kvm_intel kvm irqbypass ttm crc32_pclmul parport_pc ghash_clmulni_intel drm_kms_helper parport syscopyarea joydev sysfillrect sysimgblt fb_sys_fops drm aesni_intel lrw gf128mul glue_helper ablk_helper virtio_net cryptd virtio_balloon i2c_piix4 drm_panel_orientation_quirks pcspkr ip_tables ext4 mbcache jbd2 ata_generic pata_acpi virtio_blk floppy ata_piix libata crct10dif_pclmul crct10dif_common crc32c_intel serio_raw virtio_pci virtio_ring virtio
[2581794.501352] CPU: 3 PID: 29878 Comm: kworker/3:2 Not tainted 3.10.0-957.1.3.el7.x86_64 #1
[2581794.504302] Hardware name: OpenStack Foundation OpenStack Nova, BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[2581794.507406] Workqueue: ceph-msgr ceph_con_workfn [libceph]
[2581794.509133] task: ffff95e8188f5140 ti: ffff95e81c38c000 task.ti: ffff95e81c38c000
[2581794.511875] RIP: 0010:[<ffffffffc054f07a>] [<ffffffffc054f07a>] __unregister_request+0x1da/0x1e0 [ceph]
[2581794.514874] RSP: 0018:ffff95e81c38fb58 EFLAGS: 00010246
[2581794.516466] RAX: 0000000000000000 RBX: ffff95e81b7f4c00 RCX: dead000000000200
[2581794.519051] RDX: ffff95e81b7f4f58 RSI: ffff95e1cfbb9100 RDI: ffff95e81b7f4f58
[2581794.521643] RBP: ffff95e81c38fb70 R08: ffff95e81b7f4f58 R09: ffff95e819730f90
[2581794.524263] R10: 0000000000004b29 R11: ffffd9b9606b1e00 R12: ffff95e81b7f4c08
[2581794.526853] R13: ffff95e1cfbb9000 R14: ffff95e1cfbb9000 R15: 0000000000000000
[2581794.529409] FS: 0000000000000000(0000) GS:ffff95e81f2c0000(0000) knlGS:0000000000000000
[2581794.532225] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[2581794.533913] CR2: 0000000001999270 CR3: 000000081327a000 CR4: 00000000001606e0
[2581794.536548] Call Trace:
[2581794.537680] [<ffffffffc05510bc>] __do_request+0xac/0x430 [ceph]
[2581794.539480] [<ffffffffc05514ca>] __wake_requests+0x8a/0xe0 [ceph]
[2581794.541372] [<ffffffffc05519cd>] send_mds_reconnect+0x4ad/0x630 [ceph]
[2581794.543204] [<ffffffffc0551b7e>] peer_reset+0x2e/0x40 [ceph]
[2581794.544965] [<ffffffffc0452311>] try_read+0x791/0x12c0 [libceph]
[2581794.546723] [<ffffffff87adca7e>] ? account_entity_dequeue+0xae/0xd0
[2581794.548485] [<ffffffff87ae060c>] ? dequeue_entity+0x11c/0x5e0
[2581794.550200] [<ffffffff88019417>] ? kernel_sendmsg+0x37/0x50
[2581794.551861] [<ffffffffc0452f24>] ceph_con_workfn+0xe4/0x1530 [libceph]
[2581794.553712] [<ffffffff8816778f>] ? __schedule+0x3ff/0x890
[2581794.555365] [<ffffffff87ab9d4f>] process_one_work+0x17f/0x440
[2581794.557121] [<ffffffff87abade6>] worker_thread+0x126/0x3c0
[2581794.558825] [<ffffffff87abacc0>] ? manage_workers.isra.25+0x2a0/0x2a0
[2581794.560677] [<ffffffff87ac1c31>] kthread+0xd1/0xe0
[2581794.562270] [<ffffffff87ac1b60>] ? insert_kthread_work+0x40/0x40
[2581794.564060] [<ffffffff88174c37>] ret_from_fork_nospec_begin+0x21/0x21
[2581794.565865] [<ffffffff87ac1b60>] ? insert_kthread_work+0x40/0x40
[2581794.567587] Code: 89 85 f8 00 00 00 e9 98 fe ff ff 48 8b 0e 48 89 f2 48 c7 c7 50 21 57 c0 48 c7 c6 b8 41 56 c0 31 c0 e8 eb 2b 85 c7 e9 47 fe ff ff <0f> 0b 0f 1f 40 00 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55
[2581794.579749] RIP [<ffffffffc054f07a>] __unregister_request+0x1da/0x1e0 [ceph]
[2581794.582339] RSP <ffff95e81c38fb58>
[2581794.584374] ---[ end trace 9808db269de46594 ]---

Actions #1

Updated by Zheng Yan almost 5 years ago

Maybe this one:

diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 0c84b74ee34..f324ba4c8c6 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -3666,6 +3669,7 @@ static void wait_requests(struct ceph_mds_client *mdsc)
                while ((req = __get_oldest_req(mdsc))) {
                        dout("wait_requests timed out on tid %llu\n",
                             req->r_tid);
+                       list_del_init(&req->r_wait);
                        __unregister_request(mdsc, req);
                }
        }

Actions #2

Updated by Xiaoxi Chen almost 5 years ago

Thanks Zheng, can you explain more of the background?

Actions #3

Updated by Zheng Yan almost 5 years ago

The request was unregistered twice: once from wait_requests and once from __wake_requests.
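
To illustrate the double unregister, here is a minimal user-space sketch in C. It is not the kernel code: `struct request`, `unregister_request`, `wake_requests` and `simulate_teardown` are hypothetical, simplified stand-ins, and the list helpers are small re-implementations of the kernel's `list_head` primitives. It assumes, based on the call trace in the description, that a request left on a waiting list gets picked up again by the wake-up path and ends up being unregistered a second time. The exact assertion at mds_client.c:600 is not shown in this report, so the sketch only counts unregister calls.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal user-space stand-in for the kernel's struct list_head. */
struct list_head {
    struct list_head *next, *prev;
};

static void INIT_LIST_HEAD(struct list_head *h) { h->next = h; h->prev = h; }

static int list_empty(const struct list_head *h) { return h->next == h; }

static void list_add_tail(struct list_head *n, struct list_head *head)
{
    n->prev = head->prev;
    n->next = head;
    head->prev->next = n;
    head->prev = n;
}

/* Unlink the entry and re-initialise it so it points at itself. */
static void list_del_init(struct list_head *e)
{
    e->prev->next = e->next;
    e->next->prev = e->prev;
    INIT_LIST_HEAD(e);
}

/* Hypothetical, simplified request; not the real struct ceph_mds_request. */
struct request {
    struct list_head r_wait;  /* link on a session's waiting list */
    int unregister_calls;     /* how many times it was unregistered */
};

static void unregister_request(struct request *req)
{
    req->unregister_calls++;
}

/* Models the wake-up path: every request still linked on the waiting
 * list is processed again, which in this scenario unregisters it. */
static void wake_requests(struct list_head *waiting)
{
    while (!list_empty(waiting)) {
        struct request *req = (struct request *)
            ((char *)waiting->next - offsetof(struct request, r_wait));
        list_del_init(&req->r_wait);
        unregister_request(req);
    }
}

/* Returns how many times the single request was unregistered. */
int simulate_teardown(int apply_fix)
{
    struct list_head waiting;
    struct request req = { .unregister_calls = 0 };

    INIT_LIST_HEAD(&waiting);
    INIT_LIST_HEAD(&req.r_wait);
    list_add_tail(&req.r_wait, &waiting);
    assert(!list_empty(&waiting));

    /* Timeout path during umount (modelled on wait_requests). */
    if (apply_fix)
        list_del_init(&req.r_wait);  /* the one-line change in the patch */
    unregister_request(&req);

    /* Later: connection reset, reconnect, waiting requests are woken. */
    wake_requests(&waiting);

    return req.unregister_calls;
}
```

Without the `list_del_init` call, `simulate_teardown(0)` returns 2, because the request is still linked on the waiting list when the wake-up path runs. With it, `simulate_teardown(1)` returns 1, since the waiting list is already empty by then.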

Actions #4

Updated by Jeff Layton almost 5 years ago

  • Assignee set to Zheng Yan

Actions #5

Updated by Zheng Yan almost 5 years ago

Fixed by "ceph: remove request from waiting list before unregister" in the testing branch.

Actions #6

Updated by Zheng Yan almost 5 years ago

  • Status changed from New to 7

Actions #7

Updated by Patrick Donnelly over 4 years ago

  • Status changed from 7 to Fix Under Review

Actions #8

Updated by Zheng Yan about 4 years ago

  • Status changed from Fix Under Review to Resolved