Project

General

Profile

Actions

Bug #855

closed

fsstress kernel opps

Added by Samuel Just about 13 years ago. Updated about 13 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
-
Target version:
% Done:

0%

Source:
Tags:
Backport:
Regression:
Severity:
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

/usr/lib/ltp/testcases/bin/fsstress -d . -l 1 -n 1000 -p 10

Message from syslogd@ceph1 at Wed Mar 2 15:13:50 2011 ...
ceph1 kernel: [15383.315653] last sysfs file: /sys/devices/system/machinecheck/machinecheck0/trigger

Message from syslogd@ceph1 at Wed Mar 2 15:13:50 2011 ...
ceph1 kernel: [15383.315653] Oops: 0000 [#1] PREEMPT SMP
ceph1:/sepia/ceph1#

Message from syslogd@ceph1 at Wed Mar 2 15:13:50 2011 ...
ceph1 kernel: [15383.315653] Stack:
ceph1:/sepia/ceph1#
Message from syslogd@ceph1 at Wed Mar 2 15:13:50 2011 ...
ceph1 kernel: [15383.315653] Call Trace:

Message from syslogd@ceph1 at Wed Mar 2 15:13:50 2011 ...
ceph1 kernel: [15383.315653] Code: 41 5e c9 c3 55 48 89 e5 41 57 41 89 f7 41 56 41 88 d6 41 55 49 89 fd 41 54 45 31 e4 53 48 89 fb 48 83 ec 08 eb 1c 45 84 f6 74 08 <48> 8b 3b e8 c3 b5 ff e0 48 8b 3b 41 ff c4 48 83 c3 08 e8 0a d9

Message from syslogd@ceph1 at Wed Mar 2 15:13:50 2011 ...
ceph1 kernel: [15383.315653] CR2: 0000000000000163

Message from syslogd@ceph1 at Wed Mar 2 15:13:50 2011 ...
ceph1 kernel: [15383.315347] Oops: 0000 [#2] PREEMPT SMP

Message from syslogd@ceph1 at Wed Mar 2 15:13:50 2011 ...
ceph1 kernel: [15383.315347] last sysfs file: /sys/devices/system/machinecheck/machinecheck0/trigger

Message from syslogd@ceph1 at Wed Mar 2 15:13:50 2011 ...
ceph1 kernel: [15383.315347] Stack:

Message from syslogd@ceph1 at Wed Mar 2 15:13:50 2011 ...
ceph1 kernel: [15383.315347] Call Trace:

Message from syslogd@ceph1 at Wed Mar 2 15:13:50 2011 ...
ceph1 kernel: [15383.315347] Code: 41 5e c9 c3 55 48 89 e5 41 57 41 89 f7 41 56 41 88 d6 41 55 49 89 fd 41 54 45 31 e4 53 48 89 fb 48 83 ec 08 eb 1c 45 84 f6 74 08 <48> 8b 3b e8 c3 b5 ff e0 48 8b 3b 41 ff c4 48 83 c3 08 e8 0a d9

Message from syslogd@ceph1 at Wed Mar 2 15:13:50 2011 ...
ceph1 kernel: [15383.315347] CR2: 0000000000000336

Message from syslogd@ceph1 at Wed Mar 2 15:13:51 2011 ...
ceph1 kernel: [15383.315640] Oops: 0000 [#3] PREEMPT SMP

Message from syslogd@ceph1 at Wed Mar 2 15:13:51 2011 ...
ceph1 kernel: [15383.315640] last sysfs file: /sys/devices/system/machinecheck/machinecheck0/trigger

Message from syslogd@ceph1 at Wed Mar 2 15:13:51 2011 ...
ceph1 kernel: [15383.315640] Stack:

Message from syslogd@ceph1 at Wed Mar 2 15:13:51 2011 ...
ceph1 kernel: [15383.315640] Call Trace:

Message from syslogd@ceph1 at Wed Mar 2 15:13:51 2011 ...
ceph1 kernel: [15383.315640] Code: 41 5e c9 c3 55 48 89 e5 41 57 41 89 f7 41 56 41 88 d6 41 55 49 89 fd 41 54 45 31 e4 53 48 89 fb 48 83 ec 08 eb 1c 45 84 f6 74 08 <48> 8b 3b e8 c3 b5 ff e0 48 8b 3b 41 ff c4 48 83 c3 08 e8 0a d9

Message from syslogd@ceph1 at Wed Mar 2 15:13:51 2011 ...
ceph1 kernel: [15383.315640] CR2: 0000000000000273

ceph version: a6167332fb7bff23a2907bf69166779e63fe836c (next)
Kernel client: 7751b01

Actions #1

Updated by Samuel Just about 13 years ago

Forgot to mention: This is on ceph1 with sepia mounted. Also, the kernel output occurred just just after I hit ^C to kill the test (the test appeared to have hung).

Actions #2

Updated by Sage Weil about 13 years ago

somewhat more readable, from teh console:

[15383.558054] ---[ end trace 07ae65cfa15aa1d9 ]---
[15383.315347] NULL pointer dereference at 0000000000000336
[15383.315347] IP: [<ffffffffa0093bc3>] ceph_put_page_vector+0x27/0x57 [libceph]
[15383.315347] PGD 119e2e067 PUD 11937f067 PMD 0 
[15383.315347] Oops: 0000 [#2] PREEMPT SMP 
[15383.315347] last sysfs file: /sys/devices/system/machinecheck/machinecheck0/trigger
[15383.315347] CPU 0 

[15383.315347] Modules linked in: ceph libceph btrfs lzo_decompress lzo_compress
[15383.315347] 
[15383.315347] Pid: 3205, comm: fsstress Tainted: G      D     2.6.38-rc6+ #70 Supermicro PDSMi/PDSMi+
[15383.315347] RIP: 0010:[<ffffffffa0093bc3>]  [<ffffffffa0093bc3>] ceph_put_page_vector+0x27/0x57 [libceph]
[15383.315347] RSP: 0018:ffff88011baefc68  EFLAGS: 00010202
[15383.315347] RAX: 0000000000000000 RBX: 0000000000000336 RCX: 0000000000078210
[15383.315347] RDX: 0000000000000001 RSI: 0000000000000388 RDI: 0000000000000336
[15383.315347] RBP: ffff88011baefc98 R08: ffff880119fa8240 R09: 0000000000000001
[15383.315347] R10: 0000000000000246 R11: ffff88011baefb88 R12: 0000000000000000
[15383.315347] R13: 0000000000000336 R14: 0000000000000001 R15: 0000000000000388
[15383.315347] FS:  00007f8721143700(0000) GS:ffff8800df600000(0000) knlGS:0000000000000000
[15383.315347] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[15383.315347] CR2: 0000000000000336 CR3: 00000001149ae000 CR4: 00000000000006f0

[15383.315347] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[15383.315347] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[15383.315347] Process fsstress (pid: 3205, threadinfo ffff88011baee000, task ffff880114914900)
[15383.315347] Stack:
[15383.315347]  fffffffffffffffe 0000000000387df0 0000000000078210 00000000fffffe00
[15383.315347]  0000000000000000 0000000000000000 ffff88011baefde8 ffffffffa00c4381
[15383.315347]  0000000000000003 0000000000078210 0000000000000336 ffff880100000388
[15383.315347] Call Trace:
[15383.315347]  [<ffffffffa00c4381>] ceph_aio_read+0x749/0x910 [ceph]
[15383.315347]  [<ffffffff8105db1c>] ? put_lock_stats+0xe/0x29
[15383.315347]  [<ffffffff810ba0dd>] do_sync_read+0xcb/0x112
[15383.315347]  [<ffffffff8122d797>] ? security_file_permission+0x81/0x8a
[15383.315347]  [<ffffffff810ba8a1>] vfs_read+0xae/0x134
[15383.315347]  [<ffffffff810babf5>] sys_read+0x4a/0x76
[15383.315347]  [<ffffffff810029bb>] system_call_fastpath+0x16/0x1b
[15383.315347] Code: 41 5e c9 c3 55 48 89 e5 41 57 41 89 f7 41 56 41 88 d6 41 55 49 89 fd 41 54 45 31 e4 53 48 89 fb 48 83 ec 08 eb 1c 45 84 f6 74 08 <48> 8b 3b e8 c3 b5 ff e0 48 8b 3b 41 ff c4 48 83 c3 08 e8 0a d9 

[15383.315347] RIP  [<ffffffffa0093bc3>] ceph_put_page_vector+0x27/0x57 [libceph]
[15383.315347]  RSP <ffff88011baefc68>
[15383.315347] CR2: 0000000000000336
[15383.315640] BUG: unable to handle kernel 
[15383.815681] libceph: get_reply unknown tid 3384032 from osd18
[15383.815695] libceph: get_reply unknown tid 3384030 from osd14
[15383.815706] libceph: get_reply unknown tid 3384035 from osd16
[15383.815717] libceph: get_reply unknown tid 3384036 from osd2
[15383.815841] ---[ end trace 07ae65cfa15aa1da ]---
[15383.817332] libceph: get_reply unknown tid 3384034 from osd18
[15383.315640] NULL pointer dereference at 0000000000000273
[15383.315640] IP: [<ffffffffa0093bc3>] ceph_put_page_vector+0x27/0x57 [libceph]
[15383.315640] PGD 11959e067 PUD 119fd6067 PMD 0 
[15383.315640] Oops: 0000 [#3] PREEMPT SMP 
[15383.315640] last sysfs file: /sys/devices/system/machinecheck/machinecheck0/trigger
[15383.315640] CPU 1 
[15383.315640] Modules linked in: ceph libceph btrfs lzo_decompress lzo_compress
[15383.315640] 

[15383.315640] Pid: 3204, comm: fsstress Tainted: G      D     2.6.38-rc6+ #70 Supermicro PDSMi/PDSMi+
[15383.315640] RIP: 0010:[<ffffffffa0093bc3>]  [<ffffffffa0093bc3>] ceph_put_page_vector+0x27/0x57 [libceph]
[15383.315640] RSP: 0018:ffff880114a1fc68  EFLAGS: 00010202
[15383.315640] RAX: 0000000000000000 RBX: 0000000000000273 RCX: 0000000000111d71
[15383.315640] RDX: 0000000000000001 RSI: 00000000000002ef RDI: 0000000000000273
[15383.315640] RBP: ffff880114a1fc98 R08: ffff880116fe4240 R09: ffff880114a1fba8
[15383.315640] R10: 0000000000000246 R11: ffff880119d08dc0 R12: 0000000000000000
[15383.315640] R13: 0000000000000273 R14: 0000000000000001 R15: 00000000000002ef
[15383.315640] FS:  00007f8721143700(0000) GS:ffff8800df800000(0000) knlGS:0000000000000000
[15383.315640] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[15383.315640] CR2: 0000000000000273 CR3: 0000000119cf4000 CR4: 00000000000006e0

[15383.315640] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000

[15383.315640] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[15383.315640] Process fsstress (pid: 3204, threadinfo ffff880114a1e000, task ffff880114a068c0)
[15383.315640] Stack:
[15383.315640]  fffffffffffffffe 00000000002ee28f 0000000000111d71 00000000fffffe00
[15383.315640]  0000000000000000 0000000000000000 ffff880114a1fde8 ffffffffa00c4381
[15383.315640]  0000000000000001 ffffffffffffffff 0000000000000273 ffff8801000002ef
[15383.315640] Call Trace:
[15383.315640]  [<ffffffffa00c4381>] ceph_aio_read+0x749/0x910 [ceph]
[15383.315640]  [<ffffffff8105db1c>] ? put_lock_stats+0xe/0x29
[15383.315640]  [<ffffffff810ba0dd>] do_sync_read+0xcb/0x112
[15383.315640]  [<ffffffff8122d797>] ? security_file_permission+0x81/0x8a
[15383.315640]  [<ffffffff810ba8a1>] vfs_read+0xae/0x134
[15383.315640]  [<ffffffff810babf5>] sys_read+0x4a/0x76
[15383.315640]  [<ffffffff810029bb>] system_call_fastpath+0x16/0x1b
[15383.315640] Code: 41 5e c9 c3 55 48 89 e5 41 57 41 89 f7 41 56 41 88 d6 41 55 49 89 fd 41 54 45 31 e4 53 48 89 fb 48 83 ec 08 eb 1c 45 84 f6 74 08 <48> 8b 3b e8 c3 b5 ff e0 48 8b 3b 41 ff c4 48 83 c3 08 e8 0a d9 

[15383.315640] RIP  [<ffffffffa0093bc3>] ceph_put_page_vector+0x27/0x57 [libceph]
[15383.315640]  RSP <ffff880114a1fc68>
[15383.315640] CR2: 0000000000000273
[15384.107989] ---[ end trace 07ae65cfa15aa1db ]---

Actions #3

Updated by Sage Weil about 13 years ago

  • Project changed from Ceph to Linux kernel client
Actions #4

Updated by Sage Weil about 13 years ago

  • Priority changed from Normal to Urgent
  • Target version set to v2.6.38
Actions #5

Updated by Sage Weil about 13 years ago

  • Assignee set to Samuel Just

Sam, are you able to reproduce this? I have a possible fix, but want to make sure it's the right thing first. If we print out the message with a positive (or any?) number after the below patch is applied then I'm on the right track...

diff --git a/net/ceph/pagevec.c b/net/ceph/pagevec.c
index 1a040e6..a599773 100644
--- a/net/ceph/pagevec.c
+++ b/net/ceph/pagevec.c
@@ -31,6 +31,7 @@ struct page **ceph_get_direct_page_vector(const char __user *data,
        return pages;

 fail:
+       dout("get_direct_page_vector failed with %d\n", rc);
        ceph_put_page_vector(pages, rc > 0 ? rc : 0, false);
        return ERR_PTR(rc);
 }

Actions #6

Updated by Samuel Just about 13 years ago

New debug output, pages is invalid out of ceph_get_direct_page_vector on O_DIRECT.

[ 1231.684328] ceph: file.c:639 : aio_read ffff8801193e1790 100000001d9.fffffffffffffffe 0~3690496 got cap refs on Fcr
[ 1231.684331] ceph: file.c:383 : sync_read on file ffff880118ee8e40 503808~3690496 O_DIRECT
[ 1231.684334] ceph: file.c:386 : Ceph O_DIRECT branch with ceph_get_direct_page_vector
[ 1231.684971] ceph: file.c:397 : pages is ffff880115900000
[ 1231.684973] ceph: file.c:409 : about to striped_read
[ 1231.686166] ceph: file.c:325 : striped_read 503808~3690496 (read 0) got 0 SHORT
[ 1231.686168] ceph: file.c:364 : striped_read returns 0
[ 1231.686251] ceph: file.c:425 : sync_read result 0
[ 1231.686254] ceph: file.c:651 : aio_read ffff8801193e1790 100000001d9.fffffffffffffffe dropping cap refs on Fcr = 0
[ 1231.686263] ceph: file.c:659 : aio_read sync_read hit hole, reading more
[ 1231.686269] ceph: file.c:639 : aio_read ffff8801193e1790 100000001d9.fffffffffffffffe 0~3690496 got cap refs on Fcr
[ 1231.686272] ceph: file.c:383 : sync_read on file ffff880118ee8e40 503808~3690496 O_DIRECT
[ 1231.686274] ceph: file.c:386 : Ceph O_DIRECT branch with ceph_get_direct_page_vector
[ 1231.686796] ceph: file.c:397 : pages is 00000000000000ef
[ 1231.686798] ceph: file.c:409 : about to striped_read
[ 1231.686878] ceph: file.c:325 : striped_read 503808~3690496 (read 0) got -512
[ 1231.686880] ceph: file.c:364 : striped_read returns -512
[ 1231.686888] BUG: unable to handle kernel NULL pointer dereference at 00000000000000ef
[ 1231.686891] IP: [<ffffffffa0093bc3>] ceph_put_page_vector+0x27/0x57 [libceph]
[ 1231.686910] PGD 1158b7067 PUD 1158b8067 PMD 0
[ 1231.686915] Oops: 0000 [#1] PREEMPT SMP
[ 1231.686918] last sysfs file: /sys/devices/system/machinecheck/machinecheck0/trigger
[ 1231.686922] CPU 0

Actions #7

Updated by Samuel Just about 13 years ago

Annoyingly, I am suddenly having some trouble triggering the bug, but the previous output suggests the get_user_pages in ceph_get_direct_page_vector is returning fewer pages than requested (which is consistent with the documentation in mm/memory.c). The code currently seems to assume that either the correct amount was returned or rc is an error code.

Actions #8

Updated by Samuel Just about 13 years ago

Ah, just noticed that it's fixed in fe7c17ec5ec0851550e32be986788e367b1f8e00.

Actions #9

Updated by Sage Weil about 13 years ago

  • Status changed from New to Resolved

Samuel Just wrote:

Ah, just noticed that it's fixed in commit:fe7c17ec5ec0851550e32be986788e367b1f8e00.

Yeah that bug is... My only worry is that the printk I added before didn't show up? Oh, duh... I did dout("blah") not printk("blah"), and we didn't crank up debugging.. that's why it didn't show up on console.

So, same bug. Yay, I'm closing this out then!

Actions

Also available in: Atom PDF