Bug #3950 (closed)

krbd: new assertion failure running concurrent rbd test

Added by Alex Elder about 11 years ago. Updated about 11 years ago.

Status:
Can't reproduce
Priority:
Normal
Target version:
-
% Done:
0%
Source:
Development
Severity:
3 - minor

Description

(I think this is a new issue; I haven't investigated it yet.)

I hit an assertion failure while running my new test script.
Maybe it's an old problem; we'll see. It was not expected,
though.

[ 683.573981]
[ 683.573981] Assertion failure in rbd_osd_req_callback() at line 1280:
[ 683.573981]
[ 683.573981] rbd_assert(osd_req == obj_request->osd_req);
[ 683.573981]

Entering kdb (current=0xffff8801bdbede20, pid 4626) on processor 0 Oops: (null)
due to oops @ 0xffffffffa0272a4d
CPU 0 dPid: 4626, comm: kworker/0:2 Not tainted 3.6.0-ceph-00226-g9e08c73 #1 Dell Inc. PowerEdge R410/01V648
dRIP: 0010:[<ffffffffa0272a4d>] [<ffffffffa0272a4d>] rbd_osd_req_callback+0x17d/0x1e0 [rbd]
dRSP: 0018:ffff8801bd4e9b70 EFLAGS: 00010286
dRAX: 000000000000006a RBX: ffff88020cddc500 RCX: 0000000000009291
dRDX: ffff8801bdbede20 RSI: 0000000000000001 RDI: 0000000000000246
dRBP: ffff8801bd4e9b80 R08: 0000000000000002 R09: 0000000000000000
dR10: 0000000000000452 R11: 0000000000005000 R12: 0000000000001e51
dR13: 0000000000000004 R14: ffff88020d992800 R15: 0000000000000000
dFS: 0000000000000000(0000) GS:ffff880227200000(0000) knlGS:0000000000000000
dCS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
dCR2: 00000000018e1058 CR3: 00000002223f7000 CR4: 00000000000007f0
dDR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
dDR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
dProcess kworker/0:2 (pid: 4626, threadinfo ffff8801bd4e8000, task ffff8801bdbede20)
dStack:
ffff88020cddce00 0000000000001e51 ffff8801bd4e9bf0 ffffffffa04049ef
ffff88020b9dd1b0 ffff880200000025 ffff88020cdb2800 ffff88020cff5950
00000000bd4e9be0 ffff88020cff5a28 ffff88020b9dd030 ffff88020b9dd030
dCall Trace:
d [<ffffffffa04049ef>] dispatch+0x26f/0x500 [libceph]
d [<ffffffffa03fb13b>] con_work+0xb7b/0x2fd0 [libceph]
d [<ffffffff8108da40>] ? cpuacct_charge+0xb0/0xf0
d [<ffffffff8108d9b8>] ? cpuacct_charge+0x28/0xf0
d [<ffffffff810848a8>] ? finish_task_switch+0x48/0x110
d [<ffffffff810717e0>] process_one_work+0x1a0/0x5f0
d [<ffffffff81071776>] ? process_one_work+0x136/0x5f0
d [<ffffffffa03fa5c0>] ? ceph_msg_new+0x2e0/0x2e0 [libceph]
d [<ffffffff810735fd>] worker_thread+0x18d/0x4f0
d [<ffffffff81073470>] ? manage_workers+0x320/0x320
d [<ffffffff810791fe>] kthread+0xae/0xc0
d [<ffffffff810b393d>] ? trace_hardirqs_on+0xd/0x10
d [<ffffffff8163f3c4>] kernel_thread_helper+0x4/0x10
d [<ffffffff81635880>] ? _raw_spin_unlock_irq+0x30/0x40
d [<ffffffff816360b0>] ? retint_restore_args+0x13/0x13
d [<ffffffff81079150>] ? flush_kthread_work+0x1a0/0x1a0
d [<ffffffff8163f3c0>] ? gs_change+0x13/0x13
dCode: ff c7 43 60 00 00 00 00 eb 99 48 c7 c1 48 8c 27 a0 ba 00 05 00 00 48 c7 c6 f0 9e 27 a0 48 c7 c7 78 8b 27 a0 31 c0 e8 be 7b

[0]kdb>

Actions #1

Updated by Alex Elder about 11 years ago

Weird. It looks to me like the object request that's
just completing is already done, meaning we got
a callback from an osd request more than once. That
would explain why osd_req->r_priv did not point to
an object request whose osd_req pointer pointed back
to the osd request.

The (r_priv) object request pointer does appear to be
valid (or recently valid), though. But its osd_req
field points to something else, not an osd request.
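
For reference, here is a minimal sketch of the pairing the failing
assertion checks, paraphrasing the rbd.c of this era (the field and
function names are from the driver, but this is not verbatim code):

    static void rbd_osd_req_callback(struct ceph_osd_request *osd_req,
                                     struct ceph_msg *msg)
    {
        /* Each object request owns exactly one osd request, and the
         * osd request's private pointer refers back to its owner. */
        struct rbd_obj_request *obj_request = osd_req->r_priv;

        /* The failing check: the back-pointer must still match. */
        rbd_assert(osd_req == obj_request->osd_req);

        /* ... dispatch on the opcode and complete obj_request ... */
    }

A second callback for the same osd request, arriving after the object
request was completed and its osd_req field reassigned or freed, would
trip exactly this check.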

Actions #2

Updated by Alex Elder about 11 years ago

The object being operated on is the rbd image header, in
this case named "image.5X5ZNB.rbd". The object request type
is "obj_req_nodata", so it's either a CEPH_OSD_OP_WATCH or a
CEPH_OSD_OP_NOTIFY_ACK (see the sketch below). But without
access to the osd request I can't tell which.

This test only uses format 1 images.

There's still a chance this is just a duplicate of #3937.
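
For context, the request type mentioned above refers to the driver's
payload bookkeeping, roughly along these lines (a sketch; the exact
enum in this tree may differ):

    enum obj_request_type {
        OBJ_REQUEST_NODATA,   /* no payload: watch / notify ack */
        OBJ_REQUEST_BIO,      /* block I/O payload */
        OBJ_REQUEST_PAGES,    /* page-array payload */
    };

A request with no data payload against the header object narrows it
to the watch/notify path, which is why only those two opcodes are
candidates.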

Actions #3

Updated by Alex Elder about 11 years ago

OK, I do have the osd request pointer now. It was available
in register R14. And with a little work I can determine that
the op in progress was (little endian):

  op = 0x220f = CEPH_OSD_OP_WATCH
  flags = 0x00000000
  cookie = 0x0000000000000014
  ver = 0x0000000000000000
  flag = 0x00

So the request was shutting down the watch on the
header object, issued from rbd_dev_release().
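
The arithmetic behind that decode, as a standalone check. The
constants follow the rados wire-format opcode encoding of this era
(mode | type | number), assumed from rados.h:

    #include <stdint.h>
    #include <stdio.h>

    /* Opcode encoding: mode bits | type bits | op number. */
    #define CEPH_OSD_OP_MODE_WR    0x2000
    #define CEPH_OSD_OP_TYPE_DATA  0x0200
    #define CEPH_OSD_OP_WATCH      (CEPH_OSD_OP_MODE_WR | CEPH_OSD_OP_TYPE_DATA | 15)

    int main(void)
    {
        uint16_t op = 0x220f;   /* the __le16 op field pulled from R14 */

        /* 0x2000 | 0x0200 | 0x000f == 0x220f, so this prints "match". */
        printf("0x%04x %s\n", op, op == CEPH_OSD_OP_WATCH ? "match" : "no match");
        return 0;
    }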

Is there any way an osd can supply more than one
response to a request (one that has not been marked for linger)?

Actions #4

Updated by Ian Colle about 11 years ago

  • Status changed from In Progress to Can't reproduce

Not seeing this in current code.
