Bug #855
fsstress kernel opps
0%
Description
/usr/lib/ltp/testcases/bin/fsstress -d . -l 1 -n 1000 -p 10
Message from syslogd@ceph1 at Wed Mar 2 15:13:50 2011 ...
ceph1 kernel: [15383.315653] last sysfs file: /sys/devices/system/machinecheck/machinecheck0/trigger
Message from syslogd@ceph1 at Wed Mar 2 15:13:50 2011 ...
ceph1 kernel: [15383.315653] Oops: 0000 [#1] PREEMPT SMP
ceph1:/sepia/ceph1#
Message from syslogd@ceph1 at Wed Mar 2 15:13:50 2011 ...
ceph1 kernel: [15383.315653] Stack:
ceph1:/sepia/ceph1#
Message from syslogd@ceph1 at Wed Mar 2 15:13:50 2011 ...
ceph1 kernel: [15383.315653] Call Trace:
Message from syslogd@ceph1 at Wed Mar 2 15:13:50 2011 ...
ceph1 kernel: [15383.315653] Code: 41 5e c9 c3 55 48 89 e5 41 57 41 89 f7 41 56 41 88 d6 41 55 49 89 fd 41 54 45 31 e4 53 48 89 fb 48 83 ec 08 eb 1c 45 84 f6 74 08 <48> 8b 3b e8 c3 b5 ff e0 48 8b 3b 41 ff c4 48 83 c3 08 e8 0a d9
Message from syslogd@ceph1 at Wed Mar 2 15:13:50 2011 ...
ceph1 kernel: [15383.315653] CR2: 0000000000000163
Message from syslogd@ceph1 at Wed Mar 2 15:13:50 2011 ...
ceph1 kernel: [15383.315347] Oops: 0000 [#2] PREEMPT SMP
Message from syslogd@ceph1 at Wed Mar 2 15:13:50 2011 ...
ceph1 kernel: [15383.315347] last sysfs file: /sys/devices/system/machinecheck/machinecheck0/trigger
Message from syslogd@ceph1 at Wed Mar 2 15:13:50 2011 ...
ceph1 kernel: [15383.315347] Stack:
Message from syslogd@ceph1 at Wed Mar 2 15:13:50 2011 ...
ceph1 kernel: [15383.315347] Call Trace:
Message from syslogd@ceph1 at Wed Mar 2 15:13:50 2011 ...
ceph1 kernel: [15383.315347] Code: 41 5e c9 c3 55 48 89 e5 41 57 41 89 f7 41 56 41 88 d6 41 55 49 89 fd 41 54 45 31 e4 53 48 89 fb 48 83 ec 08 eb 1c 45 84 f6 74 08 <48> 8b 3b e8 c3 b5 ff e0 48 8b 3b 41 ff c4 48 83 c3 08 e8 0a d9
Message from syslogd@ceph1 at Wed Mar 2 15:13:50 2011 ...
ceph1 kernel: [15383.315347] CR2: 0000000000000336
Message from syslogd@ceph1 at Wed Mar 2 15:13:51 2011 ...
ceph1 kernel: [15383.315640] Oops: 0000 [#3] PREEMPT SMP
Message from syslogd@ceph1 at Wed Mar 2 15:13:51 2011 ...
ceph1 kernel: [15383.315640] last sysfs file: /sys/devices/system/machinecheck/machinecheck0/trigger
Message from syslogd@ceph1 at Wed Mar 2 15:13:51 2011 ...
ceph1 kernel: [15383.315640] Stack:
Message from syslogd@ceph1 at Wed Mar 2 15:13:51 2011 ...
ceph1 kernel: [15383.315640] Call Trace:
Message from syslogd@ceph1 at Wed Mar 2 15:13:51 2011 ...
ceph1 kernel: [15383.315640] Code: 41 5e c9 c3 55 48 89 e5 41 57 41 89 f7 41 56 41 88 d6 41 55 49 89 fd 41 54 45 31 e4 53 48 89 fb 48 83 ec 08 eb 1c 45 84 f6 74 08 <48> 8b 3b e8 c3 b5 ff e0 48 8b 3b 41 ff c4 48 83 c3 08 e8 0a d9
Message from syslogd@ceph1 at Wed Mar 2 15:13:51 2011 ...
ceph1 kernel: [15383.315640] CR2: 0000000000000273
ceph version: a6167332fb7bff23a2907bf69166779e63fe836c (next)
Kernel client: 7751b01
History
#1 Updated by Samuel Just about 13 years ago
Forgot to mention: This is on ceph1 with sepia mounted. Also, the kernel output occurred just just after I hit ^C to kill the test (the test appeared to have hung).
#2 Updated by Sage Weil about 13 years ago
somewhat more readable, from teh console:
[15383.558054] ---[ end trace 07ae65cfa15aa1d9 ]--- [15383.315347] NULL pointer dereference at 0000000000000336 [15383.315347] IP: [<ffffffffa0093bc3>] ceph_put_page_vector+0x27/0x57 [libceph] [15383.315347] PGD 119e2e067 PUD 11937f067 PMD 0 [15383.315347] Oops: 0000 [#2] PREEMPT SMP [15383.315347] last sysfs file: /sys/devices/system/machinecheck/machinecheck0/trigger [15383.315347] CPU 0 [15383.315347] Modules linked in: ceph libceph btrfs lzo_decompress lzo_compress [15383.315347] [15383.315347] Pid: 3205, comm: fsstress Tainted: G D 2.6.38-rc6+ #70 Supermicro PDSMi/PDSMi+ [15383.315347] RIP: 0010:[<ffffffffa0093bc3>] [<ffffffffa0093bc3>] ceph_put_page_vector+0x27/0x57 [libceph] [15383.315347] RSP: 0018:ffff88011baefc68 EFLAGS: 00010202 [15383.315347] RAX: 0000000000000000 RBX: 0000000000000336 RCX: 0000000000078210 [15383.315347] RDX: 0000000000000001 RSI: 0000000000000388 RDI: 0000000000000336 [15383.315347] RBP: ffff88011baefc98 R08: ffff880119fa8240 R09: 0000000000000001 [15383.315347] R10: 0000000000000246 R11: ffff88011baefb88 R12: 0000000000000000 [15383.315347] R13: 0000000000000336 R14: 0000000000000001 R15: 0000000000000388 [15383.315347] FS: 00007f8721143700(0000) GS:ffff8800df600000(0000) knlGS:0000000000000000 [15383.315347] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [15383.315347] CR2: 0000000000000336 CR3: 00000001149ae000 CR4: 00000000000006f0 [15383.315347] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [15383.315347] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [15383.315347] Process fsstress (pid: 3205, threadinfo ffff88011baee000, task ffff880114914900) [15383.315347] Stack: [15383.315347] fffffffffffffffe 0000000000387df0 0000000000078210 00000000fffffe00 [15383.315347] 0000000000000000 0000000000000000 ffff88011baefde8 ffffffffa00c4381 [15383.315347] 0000000000000003 0000000000078210 0000000000000336 ffff880100000388 [15383.315347] Call Trace: [15383.315347] [<ffffffffa00c4381>] ceph_aio_read+0x749/0x910 [ceph] [15383.315347] [<ffffffff8105db1c>] ? put_lock_stats+0xe/0x29 [15383.315347] [<ffffffff810ba0dd>] do_sync_read+0xcb/0x112 [15383.315347] [<ffffffff8122d797>] ? security_file_permission+0x81/0x8a [15383.315347] [<ffffffff810ba8a1>] vfs_read+0xae/0x134 [15383.315347] [<ffffffff810babf5>] sys_read+0x4a/0x76 [15383.315347] [<ffffffff810029bb>] system_call_fastpath+0x16/0x1b [15383.315347] Code: 41 5e c9 c3 55 48 89 e5 41 57 41 89 f7 41 56 41 88 d6 41 55 49 89 fd 41 54 45 31 e4 53 48 89 fb 48 83 ec 08 eb 1c 45 84 f6 74 08 <48> 8b 3b e8 c3 b5 ff e0 48 8b 3b 41 ff c4 48 83 c3 08 e8 0a d9 [15383.315347] RIP [<ffffffffa0093bc3>] ceph_put_page_vector+0x27/0x57 [libceph] [15383.315347] RSP <ffff88011baefc68> [15383.315347] CR2: 0000000000000336 [15383.315640] BUG: unable to handle kernel [15383.815681] libceph: get_reply unknown tid 3384032 from osd18 [15383.815695] libceph: get_reply unknown tid 3384030 from osd14 [15383.815706] libceph: get_reply unknown tid 3384035 from osd16 [15383.815717] libceph: get_reply unknown tid 3384036 from osd2 [15383.815841] ---[ end trace 07ae65cfa15aa1da ]--- [15383.817332] libceph: get_reply unknown tid 3384034 from osd18 [15383.315640] NULL pointer dereference at 0000000000000273 [15383.315640] IP: [<ffffffffa0093bc3>] ceph_put_page_vector+0x27/0x57 [libceph] [15383.315640] PGD 11959e067 PUD 119fd6067 PMD 0 [15383.315640] Oops: 0000 [#3] PREEMPT SMP [15383.315640] last sysfs file: /sys/devices/system/machinecheck/machinecheck0/trigger [15383.315640] CPU 1 [15383.315640] Modules linked in: ceph libceph btrfs lzo_decompress lzo_compress [15383.315640] [15383.315640] Pid: 3204, comm: fsstress Tainted: G D 2.6.38-rc6+ #70 Supermicro PDSMi/PDSMi+ [15383.315640] RIP: 0010:[<ffffffffa0093bc3>] [<ffffffffa0093bc3>] ceph_put_page_vector+0x27/0x57 [libceph] [15383.315640] RSP: 0018:ffff880114a1fc68 EFLAGS: 00010202 [15383.315640] RAX: 0000000000000000 RBX: 0000000000000273 RCX: 0000000000111d71 [15383.315640] RDX: 0000000000000001 RSI: 00000000000002ef RDI: 0000000000000273 [15383.315640] RBP: ffff880114a1fc98 R08: ffff880116fe4240 R09: ffff880114a1fba8 [15383.315640] R10: 0000000000000246 R11: ffff880119d08dc0 R12: 0000000000000000 [15383.315640] R13: 0000000000000273 R14: 0000000000000001 R15: 00000000000002ef [15383.315640] FS: 00007f8721143700(0000) GS:ffff8800df800000(0000) knlGS:0000000000000000 [15383.315640] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [15383.315640] CR2: 0000000000000273 CR3: 0000000119cf4000 CR4: 00000000000006e0 [15383.315640] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [15383.315640] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [15383.315640] Process fsstress (pid: 3204, threadinfo ffff880114a1e000, task ffff880114a068c0) [15383.315640] Stack: [15383.315640] fffffffffffffffe 00000000002ee28f 0000000000111d71 00000000fffffe00 [15383.315640] 0000000000000000 0000000000000000 ffff880114a1fde8 ffffffffa00c4381 [15383.315640] 0000000000000001 ffffffffffffffff 0000000000000273 ffff8801000002ef [15383.315640] Call Trace: [15383.315640] [<ffffffffa00c4381>] ceph_aio_read+0x749/0x910 [ceph] [15383.315640] [<ffffffff8105db1c>] ? put_lock_stats+0xe/0x29 [15383.315640] [<ffffffff810ba0dd>] do_sync_read+0xcb/0x112 [15383.315640] [<ffffffff8122d797>] ? security_file_permission+0x81/0x8a [15383.315640] [<ffffffff810ba8a1>] vfs_read+0xae/0x134 [15383.315640] [<ffffffff810babf5>] sys_read+0x4a/0x76 [15383.315640] [<ffffffff810029bb>] system_call_fastpath+0x16/0x1b [15383.315640] Code: 41 5e c9 c3 55 48 89 e5 41 57 41 89 f7 41 56 41 88 d6 41 55 49 89 fd 41 54 45 31 e4 53 48 89 fb 48 83 ec 08 eb 1c 45 84 f6 74 08 <48> 8b 3b e8 c3 b5 ff e0 48 8b 3b 41 ff c4 48 83 c3 08 e8 0a d9 [15383.315640] RIP [<ffffffffa0093bc3>] ceph_put_page_vector+0x27/0x57 [libceph] [15383.315640] RSP <ffff880114a1fc68> [15383.315640] CR2: 0000000000000273 [15384.107989] ---[ end trace 07ae65cfa15aa1db ]---
#3 Updated by Sage Weil about 13 years ago
- Project changed from Ceph to Linux kernel client
#4 Updated by Sage Weil about 13 years ago
- Priority changed from Normal to Urgent
- Target version set to v2.6.38
#5 Updated by Sage Weil about 13 years ago
- Assignee set to Samuel Just
Sam, are you able to reproduce this? I have a possible fix, but want to make sure it's the right thing first. If we print out the message with a positive (or any?) number after the below patch is applied then I'm on the right track...
diff --git a/net/ceph/pagevec.c b/net/ceph/pagevec.c index 1a040e6..a599773 100644 --- a/net/ceph/pagevec.c +++ b/net/ceph/pagevec.c @@ -31,6 +31,7 @@ struct page **ceph_get_direct_page_vector(const char __user *data, return pages; fail: + dout("get_direct_page_vector failed with %d\n", rc); ceph_put_page_vector(pages, rc > 0 ? rc : 0, false); return ERR_PTR(rc); }
#6 Updated by Samuel Just about 13 years ago
New debug output, pages is invalid out of ceph_get_direct_page_vector on O_DIRECT.
[ 1231.684328] ceph: file.c:639 : aio_read ffff8801193e1790 100000001d9.fffffffffffffffe 0~3690496 got cap refs on Fcr
[ 1231.684331] ceph: file.c:383 : sync_read on file ffff880118ee8e40 503808~3690496 O_DIRECT
[ 1231.684334] ceph: file.c:386 : Ceph O_DIRECT branch with ceph_get_direct_page_vector
[ 1231.684971] ceph: file.c:397 : pages is ffff880115900000
[ 1231.684973] ceph: file.c:409 : about to striped_read
[ 1231.686166] ceph: file.c:325 : striped_read 503808~3690496 (read 0) got 0 SHORT
[ 1231.686168] ceph: file.c:364 : striped_read returns 0
[ 1231.686251] ceph: file.c:425 : sync_read result 0
[ 1231.686254] ceph: file.c:651 : aio_read ffff8801193e1790 100000001d9.fffffffffffffffe dropping cap refs on Fcr = 0
[ 1231.686263] ceph: file.c:659 : aio_read sync_read hit hole, reading more
[ 1231.686269] ceph: file.c:639 : aio_read ffff8801193e1790 100000001d9.fffffffffffffffe 0~3690496 got cap refs on Fcr
[ 1231.686272] ceph: file.c:383 : sync_read on file ffff880118ee8e40 503808~3690496 O_DIRECT
[ 1231.686274] ceph: file.c:386 : Ceph O_DIRECT branch with ceph_get_direct_page_vector
[ 1231.686796] ceph: file.c:397 : pages is 00000000000000ef
[ 1231.686798] ceph: file.c:409 : about to striped_read
[ 1231.686878] ceph: file.c:325 : striped_read 503808~3690496 (read 0) got -512
[ 1231.686880] ceph: file.c:364 : striped_read returns -512
[ 1231.686888] BUG: unable to handle kernel NULL pointer dereference at 00000000000000ef
[ 1231.686891] IP: [<ffffffffa0093bc3>] ceph_put_page_vector+0x27/0x57 [libceph]
[ 1231.686910] PGD 1158b7067 PUD 1158b8067 PMD 0
[ 1231.686915] Oops: 0000 [#1] PREEMPT SMP
[ 1231.686918] last sysfs file: /sys/devices/system/machinecheck/machinecheck0/trigger
[ 1231.686922] CPU 0
#7 Updated by Samuel Just about 13 years ago
Annoyingly, I am suddenly having some trouble triggering the bug, but the previous output suggests the get_user_pages in ceph_get_direct_page_vector is returning fewer pages than requested (which is consistent with the documentation in mm/memory.c). The code currently seems to assume that either the correct amount was returned or rc is an error code.
#8 Updated by Samuel Just about 13 years ago
Ah, just noticed that it's fixed in fe7c17ec5ec0851550e32be986788e367b1f8e00.
#9 Updated by Sage Weil about 13 years ago
- Status changed from New to Resolved
Samuel Just wrote:
Ah, just noticed that it's fixed in commit:fe7c17ec5ec0851550e32be986788e367b1f8e00.
Yeah that bug is... My only worry is that the printk I added before didn't show up? Oh, duh... I did dout("blah") not printk("blah"), and we didn't crank up debugging.. that's why it didn't show up on console.
So, same bug. Yay, I'm closing this out then!