Project

General

Profile

Bug #855

fsstress kernel opps

Added by Samuel Just about 13 years ago. Updated about 13 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
-
Target version:
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

/usr/lib/ltp/testcases/bin/fsstress -d . -l 1 -n 1000 -p 10

Message from syslogd@ceph1 at Wed Mar 2 15:13:50 2011 ...
ceph1 kernel: [15383.315653] last sysfs file: /sys/devices/system/machinecheck/machinecheck0/trigger

Message from syslogd@ceph1 at Wed Mar 2 15:13:50 2011 ...
ceph1 kernel: [15383.315653] Oops: 0000 [#1] PREEMPT SMP
ceph1:/sepia/ceph1#

Message from syslogd@ceph1 at Wed Mar 2 15:13:50 2011 ...
ceph1 kernel: [15383.315653] Stack:
ceph1:/sepia/ceph1#
Message from syslogd@ceph1 at Wed Mar 2 15:13:50 2011 ...
ceph1 kernel: [15383.315653] Call Trace:

Message from syslogd@ceph1 at Wed Mar 2 15:13:50 2011 ...
ceph1 kernel: [15383.315653] Code: 41 5e c9 c3 55 48 89 e5 41 57 41 89 f7 41 56 41 88 d6 41 55 49 89 fd 41 54 45 31 e4 53 48 89 fb 48 83 ec 08 eb 1c 45 84 f6 74 08 <48> 8b 3b e8 c3 b5 ff e0 48 8b 3b 41 ff c4 48 83 c3 08 e8 0a d9

Message from syslogd@ceph1 at Wed Mar 2 15:13:50 2011 ...
ceph1 kernel: [15383.315653] CR2: 0000000000000163

Message from syslogd@ceph1 at Wed Mar 2 15:13:50 2011 ...
ceph1 kernel: [15383.315347] Oops: 0000 [#2] PREEMPT SMP

Message from syslogd@ceph1 at Wed Mar 2 15:13:50 2011 ...
ceph1 kernel: [15383.315347] last sysfs file: /sys/devices/system/machinecheck/machinecheck0/trigger

Message from syslogd@ceph1 at Wed Mar 2 15:13:50 2011 ...
ceph1 kernel: [15383.315347] Stack:

Message from syslogd@ceph1 at Wed Mar 2 15:13:50 2011 ...
ceph1 kernel: [15383.315347] Call Trace:

Message from syslogd@ceph1 at Wed Mar 2 15:13:50 2011 ...
ceph1 kernel: [15383.315347] Code: 41 5e c9 c3 55 48 89 e5 41 57 41 89 f7 41 56 41 88 d6 41 55 49 89 fd 41 54 45 31 e4 53 48 89 fb 48 83 ec 08 eb 1c 45 84 f6 74 08 <48> 8b 3b e8 c3 b5 ff e0 48 8b 3b 41 ff c4 48 83 c3 08 e8 0a d9

Message from syslogd@ceph1 at Wed Mar 2 15:13:50 2011 ...
ceph1 kernel: [15383.315347] CR2: 0000000000000336

Message from syslogd@ceph1 at Wed Mar 2 15:13:51 2011 ...
ceph1 kernel: [15383.315640] Oops: 0000 [#3] PREEMPT SMP

Message from syslogd@ceph1 at Wed Mar 2 15:13:51 2011 ...
ceph1 kernel: [15383.315640] last sysfs file: /sys/devices/system/machinecheck/machinecheck0/trigger

Message from syslogd@ceph1 at Wed Mar 2 15:13:51 2011 ...
ceph1 kernel: [15383.315640] Stack:

Message from syslogd@ceph1 at Wed Mar 2 15:13:51 2011 ...
ceph1 kernel: [15383.315640] Call Trace:

Message from syslogd@ceph1 at Wed Mar 2 15:13:51 2011 ...
ceph1 kernel: [15383.315640] Code: 41 5e c9 c3 55 48 89 e5 41 57 41 89 f7 41 56 41 88 d6 41 55 49 89 fd 41 54 45 31 e4 53 48 89 fb 48 83 ec 08 eb 1c 45 84 f6 74 08 <48> 8b 3b e8 c3 b5 ff e0 48 8b 3b 41 ff c4 48 83 c3 08 e8 0a d9

Message from syslogd@ceph1 at Wed Mar 2 15:13:51 2011 ...
ceph1 kernel: [15383.315640] CR2: 0000000000000273

ceph version: a6167332fb7bff23a2907bf69166779e63fe836c (next)
Kernel client: 7751b01

History

#1 Updated by Samuel Just about 13 years ago

Forgot to mention: This is on ceph1 with sepia mounted. Also, the kernel output occurred just just after I hit ^C to kill the test (the test appeared to have hung).

#2 Updated by Sage Weil about 13 years ago

somewhat more readable, from teh console:

[15383.558054] ---[ end trace 07ae65cfa15aa1d9 ]---
[15383.315347] NULL pointer dereference at 0000000000000336
[15383.315347] IP: [<ffffffffa0093bc3>] ceph_put_page_vector+0x27/0x57 [libceph]
[15383.315347] PGD 119e2e067 PUD 11937f067 PMD 0 
[15383.315347] Oops: 0000 [#2] PREEMPT SMP 
[15383.315347] last sysfs file: /sys/devices/system/machinecheck/machinecheck0/trigger
[15383.315347] CPU 0 

[15383.315347] Modules linked in: ceph libceph btrfs lzo_decompress lzo_compress
[15383.315347] 
[15383.315347] Pid: 3205, comm: fsstress Tainted: G      D     2.6.38-rc6+ #70 Supermicro PDSMi/PDSMi+
[15383.315347] RIP: 0010:[<ffffffffa0093bc3>]  [<ffffffffa0093bc3>] ceph_put_page_vector+0x27/0x57 [libceph]
[15383.315347] RSP: 0018:ffff88011baefc68  EFLAGS: 00010202
[15383.315347] RAX: 0000000000000000 RBX: 0000000000000336 RCX: 0000000000078210
[15383.315347] RDX: 0000000000000001 RSI: 0000000000000388 RDI: 0000000000000336
[15383.315347] RBP: ffff88011baefc98 R08: ffff880119fa8240 R09: 0000000000000001
[15383.315347] R10: 0000000000000246 R11: ffff88011baefb88 R12: 0000000000000000
[15383.315347] R13: 0000000000000336 R14: 0000000000000001 R15: 0000000000000388
[15383.315347] FS:  00007f8721143700(0000) GS:ffff8800df600000(0000) knlGS:0000000000000000
[15383.315347] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[15383.315347] CR2: 0000000000000336 CR3: 00000001149ae000 CR4: 00000000000006f0

[15383.315347] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[15383.315347] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[15383.315347] Process fsstress (pid: 3205, threadinfo ffff88011baee000, task ffff880114914900)
[15383.315347] Stack:
[15383.315347]  fffffffffffffffe 0000000000387df0 0000000000078210 00000000fffffe00
[15383.315347]  0000000000000000 0000000000000000 ffff88011baefde8 ffffffffa00c4381
[15383.315347]  0000000000000003 0000000000078210 0000000000000336 ffff880100000388
[15383.315347] Call Trace:
[15383.315347]  [<ffffffffa00c4381>] ceph_aio_read+0x749/0x910 [ceph]
[15383.315347]  [<ffffffff8105db1c>] ? put_lock_stats+0xe/0x29
[15383.315347]  [<ffffffff810ba0dd>] do_sync_read+0xcb/0x112
[15383.315347]  [<ffffffff8122d797>] ? security_file_permission+0x81/0x8a
[15383.315347]  [<ffffffff810ba8a1>] vfs_read+0xae/0x134
[15383.315347]  [<ffffffff810babf5>] sys_read+0x4a/0x76
[15383.315347]  [<ffffffff810029bb>] system_call_fastpath+0x16/0x1b
[15383.315347] Code: 41 5e c9 c3 55 48 89 e5 41 57 41 89 f7 41 56 41 88 d6 41 55 49 89 fd 41 54 45 31 e4 53 48 89 fb 48 83 ec 08 eb 1c 45 84 f6 74 08 <48> 8b 3b e8 c3 b5 ff e0 48 8b 3b 41 ff c4 48 83 c3 08 e8 0a d9 

[15383.315347] RIP  [<ffffffffa0093bc3>] ceph_put_page_vector+0x27/0x57 [libceph]
[15383.315347]  RSP <ffff88011baefc68>
[15383.315347] CR2: 0000000000000336
[15383.315640] BUG: unable to handle kernel 
[15383.815681] libceph: get_reply unknown tid 3384032 from osd18
[15383.815695] libceph: get_reply unknown tid 3384030 from osd14
[15383.815706] libceph: get_reply unknown tid 3384035 from osd16
[15383.815717] libceph: get_reply unknown tid 3384036 from osd2
[15383.815841] ---[ end trace 07ae65cfa15aa1da ]---
[15383.817332] libceph: get_reply unknown tid 3384034 from osd18
[15383.315640] NULL pointer dereference at 0000000000000273
[15383.315640] IP: [<ffffffffa0093bc3>] ceph_put_page_vector+0x27/0x57 [libceph]
[15383.315640] PGD 11959e067 PUD 119fd6067 PMD 0 
[15383.315640] Oops: 0000 [#3] PREEMPT SMP 
[15383.315640] last sysfs file: /sys/devices/system/machinecheck/machinecheck0/trigger
[15383.315640] CPU 1 
[15383.315640] Modules linked in: ceph libceph btrfs lzo_decompress lzo_compress
[15383.315640] 

[15383.315640] Pid: 3204, comm: fsstress Tainted: G      D     2.6.38-rc6+ #70 Supermicro PDSMi/PDSMi+
[15383.315640] RIP: 0010:[<ffffffffa0093bc3>]  [<ffffffffa0093bc3>] ceph_put_page_vector+0x27/0x57 [libceph]
[15383.315640] RSP: 0018:ffff880114a1fc68  EFLAGS: 00010202
[15383.315640] RAX: 0000000000000000 RBX: 0000000000000273 RCX: 0000000000111d71
[15383.315640] RDX: 0000000000000001 RSI: 00000000000002ef RDI: 0000000000000273
[15383.315640] RBP: ffff880114a1fc98 R08: ffff880116fe4240 R09: ffff880114a1fba8
[15383.315640] R10: 0000000000000246 R11: ffff880119d08dc0 R12: 0000000000000000
[15383.315640] R13: 0000000000000273 R14: 0000000000000001 R15: 00000000000002ef
[15383.315640] FS:  00007f8721143700(0000) GS:ffff8800df800000(0000) knlGS:0000000000000000
[15383.315640] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[15383.315640] CR2: 0000000000000273 CR3: 0000000119cf4000 CR4: 00000000000006e0

[15383.315640] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000

[15383.315640] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[15383.315640] Process fsstress (pid: 3204, threadinfo ffff880114a1e000, task ffff880114a068c0)
[15383.315640] Stack:
[15383.315640]  fffffffffffffffe 00000000002ee28f 0000000000111d71 00000000fffffe00
[15383.315640]  0000000000000000 0000000000000000 ffff880114a1fde8 ffffffffa00c4381
[15383.315640]  0000000000000001 ffffffffffffffff 0000000000000273 ffff8801000002ef
[15383.315640] Call Trace:
[15383.315640]  [<ffffffffa00c4381>] ceph_aio_read+0x749/0x910 [ceph]
[15383.315640]  [<ffffffff8105db1c>] ? put_lock_stats+0xe/0x29
[15383.315640]  [<ffffffff810ba0dd>] do_sync_read+0xcb/0x112
[15383.315640]  [<ffffffff8122d797>] ? security_file_permission+0x81/0x8a
[15383.315640]  [<ffffffff810ba8a1>] vfs_read+0xae/0x134
[15383.315640]  [<ffffffff810babf5>] sys_read+0x4a/0x76
[15383.315640]  [<ffffffff810029bb>] system_call_fastpath+0x16/0x1b
[15383.315640] Code: 41 5e c9 c3 55 48 89 e5 41 57 41 89 f7 41 56 41 88 d6 41 55 49 89 fd 41 54 45 31 e4 53 48 89 fb 48 83 ec 08 eb 1c 45 84 f6 74 08 <48> 8b 3b e8 c3 b5 ff e0 48 8b 3b 41 ff c4 48 83 c3 08 e8 0a d9 

[15383.315640] RIP  [<ffffffffa0093bc3>] ceph_put_page_vector+0x27/0x57 [libceph]
[15383.315640]  RSP <ffff880114a1fc68>
[15383.315640] CR2: 0000000000000273
[15384.107989] ---[ end trace 07ae65cfa15aa1db ]---

#3 Updated by Sage Weil about 13 years ago

  • Project changed from Ceph to Linux kernel client

#4 Updated by Sage Weil about 13 years ago

  • Priority changed from Normal to Urgent
  • Target version set to v2.6.38

#5 Updated by Sage Weil about 13 years ago

  • Assignee set to Samuel Just

Sam, are you able to reproduce this? I have a possible fix, but want to make sure it's the right thing first. If we print out the message with a positive (or any?) number after the below patch is applied then I'm on the right track...

diff --git a/net/ceph/pagevec.c b/net/ceph/pagevec.c
index 1a040e6..a599773 100644
--- a/net/ceph/pagevec.c
+++ b/net/ceph/pagevec.c
@@ -31,6 +31,7 @@ struct page **ceph_get_direct_page_vector(const char __user *data,
        return pages;

 fail:
+       dout("get_direct_page_vector failed with %d\n", rc);
        ceph_put_page_vector(pages, rc > 0 ? rc : 0, false);
        return ERR_PTR(rc);
 }

#6 Updated by Samuel Just about 13 years ago

New debug output, pages is invalid out of ceph_get_direct_page_vector on O_DIRECT.

[ 1231.684328] ceph: file.c:639 : aio_read ffff8801193e1790 100000001d9.fffffffffffffffe 0~3690496 got cap refs on Fcr
[ 1231.684331] ceph: file.c:383 : sync_read on file ffff880118ee8e40 503808~3690496 O_DIRECT
[ 1231.684334] ceph: file.c:386 : Ceph O_DIRECT branch with ceph_get_direct_page_vector
[ 1231.684971] ceph: file.c:397 : pages is ffff880115900000
[ 1231.684973] ceph: file.c:409 : about to striped_read
[ 1231.686166] ceph: file.c:325 : striped_read 503808~3690496 (read 0) got 0 SHORT
[ 1231.686168] ceph: file.c:364 : striped_read returns 0
[ 1231.686251] ceph: file.c:425 : sync_read result 0
[ 1231.686254] ceph: file.c:651 : aio_read ffff8801193e1790 100000001d9.fffffffffffffffe dropping cap refs on Fcr = 0
[ 1231.686263] ceph: file.c:659 : aio_read sync_read hit hole, reading more
[ 1231.686269] ceph: file.c:639 : aio_read ffff8801193e1790 100000001d9.fffffffffffffffe 0~3690496 got cap refs on Fcr
[ 1231.686272] ceph: file.c:383 : sync_read on file ffff880118ee8e40 503808~3690496 O_DIRECT
[ 1231.686274] ceph: file.c:386 : Ceph O_DIRECT branch with ceph_get_direct_page_vector
[ 1231.686796] ceph: file.c:397 : pages is 00000000000000ef
[ 1231.686798] ceph: file.c:409 : about to striped_read
[ 1231.686878] ceph: file.c:325 : striped_read 503808~3690496 (read 0) got -512
[ 1231.686880] ceph: file.c:364 : striped_read returns -512
[ 1231.686888] BUG: unable to handle kernel NULL pointer dereference at 00000000000000ef
[ 1231.686891] IP: [<ffffffffa0093bc3>] ceph_put_page_vector+0x27/0x57 [libceph]
[ 1231.686910] PGD 1158b7067 PUD 1158b8067 PMD 0
[ 1231.686915] Oops: 0000 [#1] PREEMPT SMP
[ 1231.686918] last sysfs file: /sys/devices/system/machinecheck/machinecheck0/trigger
[ 1231.686922] CPU 0

#7 Updated by Samuel Just about 13 years ago

Annoyingly, I am suddenly having some trouble triggering the bug, but the previous output suggests the get_user_pages in ceph_get_direct_page_vector is returning fewer pages than requested (which is consistent with the documentation in mm/memory.c). The code currently seems to assume that either the correct amount was returned or rc is an error code.

#8 Updated by Samuel Just about 13 years ago

Ah, just noticed that it's fixed in fe7c17ec5ec0851550e32be986788e367b1f8e00.

#9 Updated by Sage Weil about 13 years ago

  • Status changed from New to Resolved

Samuel Just wrote:

Ah, just noticed that it's fixed in commit:fe7c17ec5ec0851550e32be986788e367b1f8e00.

Yeah that bug is... My only worry is that the printk I added before didn't show up? Oh, duh... I did dout("blah") not printk("blah"), and we didn't crank up debugging.. that's why it didn't show up on console.

So, same bug. Yay, I'm closing this out then!

Also available in: Atom PDF