Bug #8568

libceph: kernel BUG at net/ceph/osd_client.c:885

Added by Gut Wielki almost 10 years ago. Updated almost 9 years ago.

Status: Closed
Priority: Normal
Assignee:
Category: -
Target version: -
% Done: 0%
Source: other
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

Kernel panic after killing all OSD daemons at the same time.
Affects all kernels.

Jun 10 10:12:34 client01 kernel: [ 1113.393881] ------------[ cut here ]------------
Jun 10 10:12:34 client01 kernel: [ 1113.393897] kernel BUG at net/ceph/osd_client.c:885!
Jun 10 10:12:34 client01 kernel: [ 1113.393907] invalid opcode: 0000 [#1] SMP
Jun 10 10:12:34 client01 kernel: [ 1113.393925] CPU: 0 PID: 3992 Comm: kworker/0:2 Not tainted 3.10.42 #1
Jun 10 10:12:34 client01 kernel: [ 1113.393936] Hardware name: Supermicro X9DR3-F/X9DR3-F, BIOS 1.1 10/03/2012
Jun 10 10:12:34 client01 kernel: [ 1113.393954] Workqueue: ceph-msgr con_work
Jun 10 10:12:34 client01 kernel: [ 1113.393967] task: ffff880460581cc0 ti: ffff88045e7be000 task.ti: ffff88045e7be000
Jun 10 10:12:34 client01 kernel: [ 1113.394400] RIP: 0010:[<ffffffff8171a751>] [<ffffffff8171a751>] osd_reset+0x12a/0x1c4
Jun 10 10:12:34 client01 kernel: [ 1113.394835] RSP: 0018:ffff88045e7bfd88 EFLAGS: 00010287
Jun 10 10:12:34 client01 kernel: [ 1113.395056] RAX: ffff8806668cc420 RBX: ffff8806668ccaf0 RCX: ffff8806668ccb10
Jun 10 10:12:34 client01 kernel: [ 1113.395281] RDX: ffff8806668ccb40 RSI: ffff880666a98c90 RDI: ffff880666a98c80
Jun 10 10:12:34 client01 kernel: [ 1113.395506] RBP: ffff880447257720 R08: 000000000000000a R09: 00000000fffffff8
Jun 10 10:12:34 client01 kernel: [ 1113.395731] R10: 0000000000000000 R11: ffffffff81e27940 R12: ffff8806668cc440
Jun 10 10:12:34 client01 kernel: [ 1113.395955] R13: ffff880447257730 R14: ffff880447257778 R15: ffff8804472577e0
Jun 10 10:12:34 client01 kernel: [ 1113.396182] FS: 0000000000000000(0000) GS:ffff88046fc00000(0000) knlGS:0000000000000000
Jun 10 10:12:34 client01 kernel: [ 1113.396619] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 10 10:12:34 client01 kernel: [ 1113.396843] CR2: 0000000002178000 CR3: 000000066999e000 CR4: 00000000000407f0
Jun 10 10:12:34 client01 kernel: [ 1113.397069] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jun 10 10:12:34 client01 kernel: [ 1113.397293] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Jun 10 10:12:34 client01 kernel: [ 1113.397519] Stack:
Jun 10 10:12:34 client01 kernel: [ 1113.397733] ffff880666a98c90 0000000000000000 ffff88045e7bfd98 ffff88045e7bfd98
Jun 10 10:12:34 client01 kernel: [ 1113.398175] ffff880666a98830 ffff880666a98801 0000000000000013 ffffffff81e27880
Jun 10 10:12:34 client01 kernel: [ 1113.398615] 0000000000000000 ffff880666a98c10 ffffffff81715bdc ffff88046fc12b40
Jun 10 10:12:34 client01 kernel: [ 1113.399057] Call Trace:
Jun 10 10:12:34 client01 kernel: [ 1113.399279] [<ffffffff81715bdc>] ? con_work+0x37c/0x1970
Jun 10 10:12:34 client01 kernel: [ 1113.399508] [<ffffffff81078072>] ? mmdrop+0xd/0x1c
Jun 10 10:12:34 client01 kernel: [ 1113.399733] [<ffffffff8107892d>] ? finish_task_switch+0x7c/0xaa
Jun 10 10:12:34 client01 kernel: [ 1113.399959] [<ffffffff8106c1e0>] ? process_one_work+0x17a/0x28f
Jun 10 10:12:34 client01 kernel: [ 1113.400184] [<ffffffff8106c451>] ? worker_thread+0x139/0x1de
Jun 10 10:12:34 client01 kernel: [ 1113.400409] [<ffffffff8106c318>] ? process_scheduled_works+0x23/0x23
Jun 10 10:12:34 client01 kernel: [ 1113.400633] [<ffffffff81070fb0>] ? kthread+0x7d/0x85
Jun 10 10:12:34 client01 kernel: [ 1113.400855] [<ffffffff81070000>] ? common_timer_create+0x7/0x11
Jun 10 10:12:34 client01 kernel: [ 1113.401074] [<ffffffff81070f33>] ? __kthread_parkme+0x59/0x59
Jun 10 10:12:34 client01 kernel: [ 1113.401295] [<ffffffff817415ac>] ? ret_from_fork+0x7c/0xb0
Jun 10 10:12:34 client01 kernel: [ 1113.401513] [<ffffffff81070f33>] ? __kthread_parkme+0x59/0x59
Jun 10 10:12:34 client01 kernel: [ 1113.401732] Code: 00 48 89 34 24 48 8d 58 b0 48 8b 00 48 83 e8 50 48 8d 53 50 48 3b 14 24 0f 84 80 00 00 00 4c 8b 63 20 48 8d 4b 20 49 39 cc 74 02 <0f> 0b 48 89 de 48 89 ef 48 89 44 24 08 e8 22 fe ff ff 48 8b 8d
Jun 10 10:12:34 client01 kernel: [ 1113.402556] RIP [<ffffffff8171a751>] osd_reset+0x12a/0x1c4
Jun 10 10:12:34 client01 kernel: [ 1113.402784] RSP <ffff88045e7bfd88>
Jun 10 10:12:34 client01 kernel: [ 1113.403379] ---[ end trace 60793e12025d7abc ]---


Related issues: 2 (0 open, 2 closed)

Related to Linux kernel client - Bug #11960: Kernel panic when deleting a pool which contains a mapped RBD (Closed, Ilya Dryomov, 06/11/2015)

Related to Linux kernel client - Feature #9779: libceph: sync up with objecter (Resolved, Ilya Dryomov, 10/14/2014)

#1

Updated by Josh Durgin almost 10 years ago

  • Priority changed from Normal to High
#2

Updated by Ilya Dryomov over 9 years ago

  • Status changed from New to 12
#3

Updated by Ilya Dryomov over 9 years ago

  • Status changed from 12 to New
  • Priority changed from High to Normal
#4

Updated by Ilya Dryomov over 9 years ago

  • Assignee set to Ilya Dryomov
  • Priority changed from Normal to High
#5

Updated by Ian Colle over 9 years ago

  • Project changed from rbd to Linux kernel client
#6

Updated by Ilya Dryomov over 9 years ago

  • Subject changed from kernel BUG at net/ceph/osd_client.c:885 to libceph: kernel BUG at net/ceph/osd_client.c:885

BUG_ON(!list_empty(&req->r_req_lru_item)) in __kick_osd_requests()

Can't reproduce, but I need to look harder into how this could have happened and see whether the #8806 changes take care of it.
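
For reference, the failing assertion sits in the linger-requeue loop of __kick_osd_requests(). Roughly, paraphrasing the 3.10-era net/ceph/osd_client.c (a sketch, not an exact quote of the source):

/* Paraphrased from 3.10-era net/ceph/osd_client.c, not an exact quote:
 * requeue this OSD's lingering requests after a connection reset. */
list_for_each_entry_safe(req, nreq, &osd->o_linger_requests,
                         r_linger_osd) {
        /*
         * Re-register the request before unregistering the linger so
         * that r_osd is preserved.  A registered lingering request is
         * expected to be off all of the osdc request lists at this
         * point; this BUG_ON is the one in the trace above
         * (osd_client.c:885).
         */
        BUG_ON(!list_empty(&req->r_req_lru_item));
        __register_request(osdc, req);
        list_add(&req->r_req_lru_item, &osdc->req_unsent);
        list_add(&req->r_osd_item, &req->r_osd->o_requests);
        __unregister_linger_request(osdc, req);
}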

#7

Updated by Ilya Dryomov over 9 years ago

  • Priority changed from High to Normal
#8

Updated by Ilya Dryomov almost 9 years ago

  • Status changed from New to Closed
  • Regression set to No

OK, so I still don't understand how this can happen if all OSDs are killed at the same time, but while looking into #11960 I was reminded of this issue. With MON=1 OSD=1 on 3.16 this BUG_ON can be triggered with:

$ cat notarget-outdown-repro.sh
#!/bin/bash
rbd create --size 1 test
rbd map test
ceph osd out 0    # mark the only OSD out ...
ceph osd down 0   # ... and down, so the request has no target
sleep 3
dd if=/dev/rbd0 of=/dev/null count=1 & # will block
pkill ceph-osd    # OSD death resets the connection -> osd_reset()

The only difference between this and http://tracker.ceph.com/issues/11960#note-7 is how we force a registered lingering request onto the notarget list: here it's the out+down, there it's deleting the pool which contains the object.
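
For comparison, a sketch of that pool-delete variant, reconstructed from the #11960 title (pool name and pg count here are arbitrary; see the linked note for the actual reproducer):

#!/bin/bash
# Sketch of the no-pool variant: force the lingering watch request onto
# the notarget list by deleting the pool under the mapped image.
ceph osd pool create foo 8
rbd create --pool foo --size 1 test
rbd map foo/test
ceph osd pool delete foo foo --yes-i-really-really-mean-it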

In newer kernels the BUG_ON has been changed to a WARN_ON, and, starting with 3.18, this reproducer doesn't trigger anything, because after commit a390de0208e7 ("libceph: unlink from o_linger_requests when clearing r_osd") __map_request() does the right thing in the no-osd case. The no-pool case is completely separate: not only do we not unlink in __map_request(), we also do nothing in kick_requests():

                err = __map_request(osdc, req,
                                    force_resend || force_resend_writes);
                dout("__map_request returned %d\n", err);
                if (err < 0)
                        continue;  /* hrm! */

This means the http://tracker.ceph.com/issues/11960#note-7 reproducer will trigger a WARN_ON on 4.1.
In userspace, a registered lingering request gets cancelled if the underlying pool is deleted. Clearly, resolving #9779 is in order.
