Bug #6267

closed

krbd: null deref in __kick_osd_requests+0x15e/0x1b0

Added by geraint jones over 10 years ago. Updated over 10 years ago.

Status: Resolved
Priority: Urgent
Assignee: -
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression:
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

[639680.982539] BUG: unable to handle kernel NULL pointer dereference at 0000000000000498
[639680.986988] IP: [<ffffffffa01cf1ae>] __kick_osd_requests+0x15e/0x1b0 [libceph]
[639680.989983] PGD 10d2139067 PUD 10d546c067 PMD 0
[639680.989983] Oops: 0000 [#1] SMP
[639680.989983] Modules linked in: xt_nat xt_mark iptable_mangle dccp_diag dccp tcp_diag udp_diag inet_diag unix_diag ipt_REJECT xt_LOG xt_limit xt_recent xt_state xt_REDIRECT xt_tcpudp iptable_filter xt_addrtype veth aufs ipt_MASQUERADE iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack ip_tables ebt_ip ebtable_broute ebtable_nat ebtable_filter ebtables x_tables bridge stp llc ip_gre gre dm_crypt rbd libceph psmouse libcrc32c microcode joydev i2c_piix4 mac_hid serio_raw virtio_balloon acpiphp hid_generic usbhid hid crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 xts lrw gf128mul ablk_helper floppy cryptd cirrus syscopyarea sysfillrect sysimgblt ttm drm_kms_helper drm
[639681.016598] CPU 4
[639681.016598] Pid: 27857, comm: kworker/4:1 Tainted: G W 3.9.00generic #4userns5 OpenStack Foundation OpenStack Nova
[639681.016598] RIP: 0010:[<ffffffffa01cf1ae>] [<ffffffffa01cf1ae>] __kick_osd_requests+0x15e/0x1b0 [libceph]
[639681.016598] RSP: 0018:ffff880eb4fc9d28 EFLAGS: 00010206
[639681.016598] RAX: 0000000000000000 RBX: ffff8810d5ff0738 RCX: ffff880ffc440db0
[639681.016598] RDX: ffff8810d5ff07f0 RSI: ffff880ffc440d80 RDI: ffff8810d5ff0738
[639681.016598] RBP: ffff880eb4fc9d78 R08: 000000000000000a R09: 0000000000000000
[639681.045466] R10: 00000000000298f7 R11: 00000000000298f6 R12: ffff880ffc440d80
[639681.045466] R13: ffff880e90298ca0 R14: ffff880ffc440d80 R15: ffff880ffc440da0
[639681.045466] FS: 0000000000000000(0000) GS:ffff881139700000(0000) knlGS:0000000000000000
[639681.045466] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[639681.045466] CR2: 0000000000000498 CR3: 00000010d2122000 CR4: 00000000000006e0
[639681.065324] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[639681.065324] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[639681.065324] Process kworker/4:1 (pid: 27857, threadinfo ffff880eb4fc8000, task ffff880ec5e10000)
[639681.065324] Stack:
[639681.065324] 0000000000000000 ffff8810d5ff07f0 ffff880ebec37640 ffff88103d8072e0
[639681.065324] ffff880eb4fc9d68 ffff8810d5ff0738 ffff8810d5ff0748 ffff8810d5ff0790
[639681.065324] ffff880e90298800 0000000000000000 ffff880eb4fc9da8 ffffffffa01cf24c
[639681.065324] Call Trace:
[639681.065324] [<ffffffffa01cf24c>] osd_reset+0x4c/0x80 [libceph]
[639681.065324] [<ffffffffa01ca576>] con_work+0x126/0x240 [libceph]
[639681.065324] [<ffffffff8107a5db>] process_one_work+0x16b/0x400
[639681.065324] [<ffffffff8107b298>] worker_thread+0x118/0x350
[639681.065324] [<ffffffff8107b180>] ? manage_workers+0x120/0x120
[639681.109419] [<ffffffff810808e0>] kthread+0xc0/0xd0
[639681.109419] [<ffffffff81080820>] ? flush_kthread_worker+0xb0/0xb0
[639681.109419] [<ffffffff8171816c>] ret_from_fork+0x7c/0xb0
[639681.109419] [<ffffffff81080820>] ? flush_kthread_worker+0xb0/0xb0
[639681.125066] Code: 00 00 00 48 8b 55 b8 49 8d 4c 24 30 4c 89 bb c0 00 00 00 4c 89 e6 48 89 df 49 89 44 24 28 49 89 54 24 20 4c 89 38 49 8b 44 24 60 <48> 8b 90 98 04 00 00 48 89 88 98 04 00 00 48 05 90 04 00 00 49
[639681.125066] RIP [<ffffffffa01cf1ae>] __kick_osd_requests+0x15e/0x1b0 [libceph]
[639681.137488] RSP <ffff880eb4fc9d28>
[639681.137488] CR2: 0000000000000498
[639681.167212] [ end trace f707d561f7dc2318 ]
[639681.171811] BUG: unable to handle kernel paging request at ffffffffffffffd8
[639681.174826] IP: [<ffffffff81080d10>] kthread_data+0x10/0x20
[639681.174826] PGD 1c10067 PUD 1c12067 PMD 0
[639681.174826] Oops: 0000 [#2] SMP
[639681.174826] Modules linked in: xt_nat xt_mark iptable_mangle dccp_diag dccp tcp_diag udp_diag inet_diag unix_diag ipt_REJECT xt_LOG xt_limit xt_recent xt_state xt_REDIRECT xt_tcpudp iptable_filter xt_addrtype veth aufs ipt_MASQUERADE iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack ip_tables ebt_ip ebtable_broute ebtable_nat ebtable_filter ebtables x_tables bridge stp llc ip_gre gre dm_crypt rbd libceph psmouse libcrc32c microcode joydev i2c_piix4 mac_hid serio_raw virtio_balloon acpiphp hid_generic usbhid hid crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 xts lrw gf128mul ablk_helper floppy cryptd cirrus syscopyarea sysfillrect sysimgblt ttm drm_kms_helper drm
[639681.174826] CPU 4
[639681.174826] Pid: 27857, comm: kworker/4:1 Tainted: G D W 3.9.00generic #4userns5 OpenStack Foundation OpenStack Nova
[639681.174826] RIP: 0010:[<ffffffff81080d10>] [<ffffffff81080d10>] kthread_data+0x10/0x20
[639681.174826] RSP: 0018:ffff880eb4fc9918 EFLAGS: 00010096
[639681.174826] RAX: 0000000000000000 RBX: 0000000000000004 RCX: ffffffff81ebf020
[639681.174826] RDX: 0000000000000009 RSI: 0000000000000004 RDI: ffff880ec5e10000
[639681.174826] RBP: ffff880eb4fc9918 R08: 0000000000000001 R09: ffffea003729f800
[639681.174826] R10: 0000000000000000 R11: ffff8810d26f1f20 R12: 0000000000000004
[639681.174826] R13: ffff880ec5e10400 R14: 0000000000000001 R15: 0000000000000246
[639681.174826] FS: 0000000000000000(0000) GS:ffff881139700000(0000) knlGS:0000000000000000
[639681.174826] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[639681.174826] CR2: ffffffffffffffd8 CR3: 0000000d5a786000 CR4: 00000000000006e0
[639681.174826] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[639681.174826] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[639681.174826] Process kworker/4:1 (pid: 27857, threadinfo ffff880eb4fc8000, task ffff880ec5e10000)
[639681.174826] Stack:
[639681.174826] ffff880eb4fc9938 ffffffff8107bdd5 ffff880eb4fc9938 ffff881139714240
[639681.174826] ffff880eb4fc99b8 ffffffff8170d6af 0000000000000003 ffffea003729fa00
[639681.174826] ffff880eb4fc9fd8 ffff880eb4fc9fd8 ffff880eb4fc9fd8 0000000000014240
[639681.174826] Call Trace:
[639681.174826] [<ffffffff8107bdd5>] wq_worker_sleeping+0x15/0x80
[639681.174826] [<ffffffff8170d6af>] __schedule+0x5ef/0x6d0
[639681.174826] [<ffffffff8170e269>] schedule+0x29/0x70
[639681.174826] [<ffffffff81060e0b>] do_exit+0x2bb/0x480
[639681.174826] [<ffffffff81710669>] oops_end+0xb9/0x100
[639681.174826] [<ffffffff816f56c6>] no_context+0x1ab/0x1ba
[639681.174826] [<ffffffff816f58a8>] __bad_area_nosemaphore+0x1d3/0x1f2
[639681.174826] [<ffffffff8105b3a7>] ? print_time.part.5+0x67/0x90
[639681.174826] [<ffffffff8105b447>] ? print_prefix+0x77/0xc0
[639681.174826] [<ffffffff816f58da>] bad_area_nosemaphore+0x13/0x15
[639681.174826] [<ffffffff817135e2>] __do_page_fault+0x3c2/0x560
[639681.174826] [<ffffffff8105ce04>] ? wake_up_klogd+0x34/0x40
[639681.174826] [<ffffffff8105d057>] ? console_unlock.part.9+0x247/0x270
[639681.174826] [<ffffffff8105d09a>] ? console_unlock+0x1a/0x30
[639681.174826] [<ffffffff817137ab>] do_page_fault+0x2b/0x50
[639681.174826] [<ffffffff81712e45>] do_async_page_fault+0x35/0xa0
[639681.174826] [<ffffffff8170fa48>] async_page_fault+0x28/0x30
[639681.174826] [<ffffffffa01cf1ae>] ? __kick_osd_requests+0x15e/0x1b0 [libceph]
[639681.174826] [<ffffffffa01cf17f>] ? __kick_osd_requests+0x12f/0x1b0 [libceph]
[639681.174826] [<ffffffffa01cf24c>] osd_reset+0x4c/0x80 [libceph]
[639681.174826] [<ffffffffa01ca576>] con_work+0x126/0x240 [libceph]
[639681.174826] [<ffffffff8107a5db>] process_one_work+0x16b/0x400
[639681.174826] [<ffffffff8107b298>] worker_thread+0x118/0x350
[639681.174826] [<ffffffff8107b180>] ? manage_workers+0x120/0x120
[639681.174826] [<ffffffff810808e0>] kthread+0xc0/0xd0
[639681.174826] [<ffffffff81080820>] ? flush_kthread_worker+0xb0/0xb0
[639681.174826] [<ffffffff8171816c>] ret_from_fork+0x7c/0xb0
[639681.174826] [<ffffffff81080820>] ? flush_kthread_worker+0xb0/0xb0
[639681.174826] Code: 00 48 89 e5 5d 48 8b 40 c8 48 c1 e8 02 83 e0 01 c3 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 48 8b 87 a0 03 00 00 55 48 89 e5 <48> 8b 40 d8 5d c3 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90
[639681.174826] RIP [<ffffffff81080d10>] kthread_data+0x10/0x20
[639681.174826] RSP <ffff880eb4fc9918>
[639681.174826] CR2: ffffffffffffffd8
[639681.174826] [ end trace f707d561f7dc2319 ]
[639681.174826] Fixing recursive fault but reboot is needed!

This is Ubuntu 13.04 with kernel 3.9.00, patched to enable user namespaces.

This has happened twice today under moderate load, with about 150 RBDs mapped.
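
As a cross-check on the two traces above, the instruction at the `<..>` marker in each `Code:` line can be decoded and compared with the reported CR2. The sketch below is not part of the original report: the function names are hypothetical, the decoder handles only the single `mov disp(%base),%reg` form that appears at both markers, and the register values are copied from the register dumps (RAX is 0 in both).

```python
# Code: lines copied verbatim from the two oopses in this report.
OOPS_1_CODE = ("Code: 00 00 00 48 8b 55 b8 49 8d 4c 24 30 4c 89 bb c0 00 00 00 "
               "4c 89 e6 48 89 df 49 89 44 24 28 49 89 54 24 20 4c 89 38 "
               "49 8b 44 24 60 <48> 8b 90 98 04 00 00 48 89 88 98 04 00 00 "
               "48 05 90 04 00 00 49")
OOPS_2_CODE = ("Code: 00 48 89 e5 5d 48 8b 40 c8 48 c1 e8 02 83 e0 01 c3 66 2e "
               "0f 1f 84 00 00 00 00 00 66 66 66 66 90 48 8b 87 a0 03 00 00 "
               "55 48 89 e5 <48> 8b 40 d8 5d c3 66 2e 0f 1f 84 00 00 00 00 00 "
               "66 66 66 66 90")


def faulting_bytes(code_line):
    """Split a Code: line into (bytes before the marker, bytes from the marker on)."""
    tokens = code_line.split("Code:", 1)[1].split()
    idx = next(i for i, t in enumerate(tokens) if t.startswith("<"))
    to_int = lambda t: int(t.strip("<>"), 16)
    return [to_int(t) for t in tokens[:idx]], [to_int(t) for t in tokens[idx:]]


def decode_mov_load(insn):
    """Decode only `REX.W 8B /r` with a plain base register and disp8/disp32.

    Returns (base register number, signed displacement), or None for any
    other encoding (SIB byte, register operand, no displacement).
    """
    if len(insn) < 3 or (insn[0] & 0xF8) != 0x48 or insn[1] != 0x8B:
        return None
    rex, modrm = insn[0], insn[2]
    mod, rm = modrm >> 6, modrm & 7
    if rm == 4 or mod in (0, 3):
        return None
    size = 1 if mod == 1 else 4
    if len(insn) < 3 + size:
        return None
    disp = int.from_bytes(bytes(insn[3:3 + size]), "little", signed=True)
    base = rm | ((rex & 1) << 3)  # REX.B extends the base register number
    return base, disp


def fault_address(code_line, regs):
    """Compute the address the marked instruction reads, given register values."""
    _, insn = faulting_bytes(code_line)
    decoded = decode_mov_load(insn)
    if decoded is None:
        return None
    base, disp = decoded
    return (regs[base] + disp) & 0xFFFFFFFFFFFFFFFF


# Register number 0 is RAX; both register dumps show RAX: 0000000000000000.
print(hex(fault_address(OOPS_1_CODE, {0: 0})))
print(hex(fault_address(OOPS_2_CODE, {0: 0})))
```

Under these assumptions the computed addresses equal the CR2 values in the traces (0x498 and 0xffffffffffffffd8). That is consistent with each fault being a load through a NULL pointer held in RAX at a fixed offset, not a wild pointer.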

Actions #1

Updated by Sage Weil over 10 years ago

  • Subject changed from Kernel null pointer leading to panic to krbd: null deref in __kick_osd_requests+0x15e/0x1b0
  • Status changed from New to Need More Info

Can you try a 3.10 kernel? There was at least one locking fix between 3.9 and 3.10 that could explain this. (Also, the 3.9 kernel is EOL.)

Actions #2

Updated by geraint jones over 10 years ago

Sage Weil wrote:

Can you try a 3.10 kernel? There was at least one locking fix between 3.9 and 3.10 that could explain this. (Also, the 3.9 kernel is EOL.)

I will do my best; these boxes are in production :)

Give me a few days and I will do the update and see if it's still happening.

Actions #3

Updated by Sage Weil over 10 years ago

Any update?

Actions #4

Updated by Sage Weil over 10 years ago

  • Status changed from Need More Info to Resolved