Project

General

Profile

Actions

Bug #1793

closed

NULL pointer dereference at try_write+0x627/0x1060

Added by Josh Durgin over 12 years ago. Updated about 12 years ago.

Status:
Can't reproduce
Priority:
High
Assignee:
Category:
libceph
Target version:
% Done:

0%

Source:
Tags:
Backport:
Regression:
Severity:
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

Found in sepia50's console:

[ 8727.491802] Call Trace:
[ 8727.491802]  [<ffffffff8107f0c5>] wq_worker_sleeping+0x15/0xa0
[ 8727.491802]  [<ffffffff81600763>] __schedule+0x5c3/0x940
[ 8727.491802]  [<ffffffff81066e81>] ? do_exit+0x551/0x880
[ 8727.491802]  [<ffffffff81065d7d>] ? release_task+0x1d/0x470
[ 8727.491802]  [<ffffffff81600e0f>] schedule+0x3f/0x60
[ 8727.491802]  [<ffffffff81066ef7>] do_exit+0x5c7/0x880
[ 8727.491802]  [<ffffffff81063b05>] ? kmsg_dump+0x75/0x140
[ 8727.491802]  [<ffffffff81604ac0>] oops_end+0xb0/0xf0

[ 8727.491802]  [<ffffffff8103f85d>] no_context+0xfd/0x270
[ 8727.491802]  [<ffffffff814edfd5>] ? release_sock+0x35/0x190
[ 8727.491802]  [<ffffffff8103fb15>] __bad_area_nosemaphore+0x145/0x230
[ 8727.491802]  [<ffffffff810509a3>] ? get_parent_ip+0x33/0x50
[ 8727.491802]  [<ffffffff8103fc13>] bad_area_nosemaphore+0x13/0x20
[ 8727.491802]  [<ffffffff8160763e>] do_page_fault+0x34e/0x4b0
[ 8727.491802]  [<ffffffff814ee0e4>] ? release_sock+0x144/0x190
[ 8727.491802]  [<ffffffff815429e5>] ? tcp_sendpage+0xc5/0x590
[ 8727.491802]  [<ffffffff81313fbd>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[ 8727.491802]  [<ffffffff81603ef5>] page_fault+0x25/0x30
[ 8727.491802]  [<ffffffffa00c6a87>] ? try_write+0x627/0x1060 [libceph]
[ 8727.491802]  [<ffffffff814e8db4>] ? kernel_recvmsg+0x44/0x60

[ 8727.491802]  [<ffffffffa00c5748>] ? ceph_tcp_recvmsg+0x48/0x60 [libceph]

[ 8727.491802]  [<ffffffffa00c80d0>] con_work+0xc10/0x1b00 [libceph]

[ 8727.491802]  [<ffffffff81313f7e>] ? trace_hardirqs_on_thunk+0x3a/0x3f

[ 8727.491802]  [<ffffffff81053978>] ? finish_task_switch+0x48/0x120

[ 8727.491802]  [<ffffffff8107fcb6>] process_one_work+0x1a6/0x520

[ 8727.491802]  [<ffffffff8107fc47>] ? process_one_work+0x137/0x520
[ 8727.491802]  [<ffffffffa00c74c0>] ? try_write+0x1060/0x1060 [libceph]
[ 8727.491802]  [<ffffffff81081fe3>] worker_thread+0x173/0x400
[ 8727.491802]  [<ffffffff81081e70>] ? manage_workers+0x210/0x210
[ 8727.491802]  [<ffffffff81087026>] kthread+0xb6/0xc0
[ 8727.491802]  [<ffffffff8160dd44>] kernel_thread_helper+0x4/0x10
[ 8727.491802]  [<ffffffff81603c74>] ? retint_restore_args+0x13/0x13
[ 8727.491802]  [<ffffffff81086f70>] ? __init_kthread_worker+0x70/0x70
[ 8727.491802]  [<ffffffff8160dd40>] ? gs_change+0x13/0x13
[ 8727.491802] Code: 66 66 66 90 65 48 8b 04 25 40 c4 00 00 48 8b 80 48 03 00 00 8b 40 f0 c9 c3 66 90 55 48 89 e5 66 66 66 66 90 48 8b 87 48 03 00 00 
[ 8727.491802]  8b 40 f8 c9 c3 eb 08 90 90 90 90 90 90 90 90 55 48 89 e5 66 
[ 8727.491802] RIP  [<ffffffff81086a10>] kthread_data+0x10/0x20

[ 8727.491802]  RSP <ffff8800455b76f8>
[ 8727.491802] CR2: fffffffffffffff8
[ 8727.491802] ---[ end trace 9654ef6c74784bc5 ]---
[ 8727.491802] Fixing recursive fault but reboot is needed!
[ 8727.491802] BUG: spinlock lockup on CPU#0, kworker/0:2/5057

[ 8727.491802]  lock: ffff8800fbc13b80, .magic: dead4ead, .owner: kworker/0:2/5057, .owner_cpu: 0

[ 8727.491802] Pid: 5057, comm: kworker/0:2 Tainted: G      D     3.1.0-ceph-08936-gb643987 #1

[ 8727.491802] Call Trace:

[ 8727.491802]  [<ffffffff8131a178>] spin_dump+0x78/0xc0
[ 8727.491802]  [<ffffffff8131a39d>] do_raw_spin_lock+0xed/0x120
[ 8727.491802]  [<ffffffff816031e5>] _raw_spin_lock_irq+0x45/0x50
[ 8727.491802]  [<ffffffff81600277>] ? __schedule+0xd7/0x940
[ 8727.491802]  [<ffffffff81600277>] __schedule+0xd7/0x940
[ 8727.491802]  [<ffffffff81600e0f>] schedule+0x3f/0x60
[ 8727.491802]  [<ffffffff81067164>] do_exit+0x834/0x880
[ 8727.491802]  [<ffffffff81063b95>] ? kmsg_dump+0x105/0x140
[ 8727.491802]  [<ffffffff81063b05>] ? kmsg_dump+0x75/0x140
[ 8727.491802]  [<ffffffff81604ac0>] oops_end+0xb0/0xf0
[ 8727.491802]  [<ffffffff8103f85d>] no_context+0xfd/0x270
[ 8727.491802]  [<ffffffff8103fb15>] __bad_area_nosemaphore+0x145/0x230
[ 8727.491802]  [<ffffffff811b3af0>] ? fsnotify_clear_marks_by_inode+0x30/0xf0
[ 8727.491802]  [<ffffffff8103fc13>] bad_area_nosemaphore+0x13/0x20
[ 8727.491802]  [<ffffffff8160763e>] do_page_fault+0x34e/0x4b0
[ 8727.491802]  [<ffffffff81057bc4>] ? cpuacct_charge+0x24/0xb0
[ 8727.491802]  [<ffffffff81057bc4>] ? cpuacct_charge+0x24/0xb0
[ 8727.491802]  [<ffffffff81313fbd>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[ 8727.491802]  [<ffffffff81603ef5>] page_fault+0x25/0x30
[ 8727.491802]  [<ffffffff81086a10>] ? kthread_data+0x10/0x20
[ 8727.491802]  [<ffffffff8107f0c5>] wq_worker_sleeping+0x15/0xa0
[ 8727.491802]  [<ffffffff81600763>] __schedule+0x5c3/0x940
[ 8727.491802]  [<ffffffff81066e81>] ? do_exit+0x551/0x880

[ 8727.491802]  [<ffffffff81065d7d>] ? release_task+0x1d/0x470
[ 8727.491802]  [<ffffffff81600e0f>] schedule+0x3f/0x60
[ 8727.491802]  [<ffffffff81066ef7>] do_exit+0x5c7/0x880
[ 8727.491802]  [<ffffffff81063b05>] ? kmsg_dump+0x75/0x140
[ 8727.491802]  [<ffffffff81604ac0>] oops_end+0xb0/0xf0
[ 8727.491802]  [<ffffffff8103f85d>] no_context+0xfd/0x270
[ 8727.491802]  [<ffffffff814edfd5>] ? release_sock+0x35/0x190
[ 8727.491802]  [<ffffffff8103fb15>] __bad_area_nosemaphore+0x145/0x230
[ 8727.491802]  [<ffffffff810509a3>] ? get_parent_ip+0x33/0x50
[ 8727.491802]  [<ffffffff8103fc13>] bad_area_nosemaphore+0x13/0x20
[ 8727.491802]  [<ffffffff8160763e>] do_page_fault+0x34e/0x4b0
[ 8727.491802]  [<ffffffff814ee0e4>] ? release_sock+0x144/0x190
[ 8727.491802]  [<ffffffff815429e5>] ? tcp_sendpage+0xc5/0x590
[ 8727.491802]  [<ffffffff81313fbd>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[ 8727.491802]  [<ffffffff81603ef5>] page_fault+0x25/0x30
[ 8727.491802]  [<ffffffffa00c6a87>] ? try_write+0x627/0x1060 [libceph]
[ 8727.491802]  [<ffffffff814e8db4>] ? kernel_recvmsg+0x44/0x60
[ 8727.491802]  [<ffffffffa00c5748>] ? ceph_tcp_recvmsg+0x48/0x60 [libceph]
[ 8727.491802]  [<ffffffffa00c80d0>] con_work+0xc10/0x1b00 [libceph]
[ 8727.491802]  [<ffffffff81313f7e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[ 8727.491802]  [<ffffffff81053978>] ? finish_task_switch+0x48/0x120
[ 8727.491802]  [<ffffffff8107fcb6>] process_one_work+0x1a6/0x520
[ 8727.491802]  [<ffffffff8107fc47>] ? process_one_work+0x137/0x520
[ 8727.491802]  [<ffffffffa00c74c0>] ? try_write+0x1060/0x1060 [libceph]
[ 8727.491802]  [<ffffffff81081fe3>] worker_thread+0x173/0x400
[ 8727.491802]  [<ffffffff81081e70>] ? manage_workers+0x210/0x210
[ 8727.491802]  [<ffffffff81087026>] kthread+0xb6/0xc0
[ 8727.491802]  [<ffffffff8160dd44>] kernel_thread_helper+0x4/0x10
[ 8727.491802]  [<ffffffff81603c74>] ? retint_restore_args+0x13/0x13
[ 8727.491802]  [<ffffffff81086f70>] ? __init_kthread_worker+0x70/0x70
[ 8727.491802]  [<ffffffff8160dd40>] ? gs_change+0x13/0x13

Related issues 1 (0 open1 closed)

Has duplicate Linux kernel client - Bug #1866: null pointer dereference after osd went downDuplicate12/29/2011

Actions
Actions #1

Updated by Josh Durgin over 12 years ago

Got the same trace on sepia18 while running mkfs.ext3 on an rbd image.

Actions #2

Updated by Josh Durgin over 12 years ago

A disk error prevented me from getting logs before:

[  226.628240] sd 2:0:0:0: [sda] Unhandled error code
[  226.637615] sd 2:0:0:0: [sda]  Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[  226.649859] sd 2:0:0:0: [sda] CDB: Write(10): 2a 00 39 84 61 f0 00 00 10 00
[  226.661756] end_request: I/O error, dev sda, sector 964977136
[  226.672023] Aborting journal on device sda1-8.
[  226.680985] sd 2:0:0:0: [sda] Unhandled error code
[  226.690213] sd 2:0:0:0: [sda]  Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[  226.702446] sd 2:0:0:0: [sda] CDB: Write(10): 2a 00 39 84 08 00 00 00 08 00
[  226.714316] end_request: I/O error, dev sda, sector 964954112
[  226.724576] quiet_error: 24 callbacks suppressed
[  226.733687] Buffer I/O error on device sda1, logical block 120619008
[  226.744608] lost page write due to I/O error on sda1
[  226.744620] JBD2: I/O error detected when updating journal superblock for sda1-8.

But now I got the full trace of what happens before the page fault handling:

Dec  9 15:44:44 sepia18 kernel: [  126.472614] libceph: client4104 fsid b10de2f1-02b3-4bd7-90cb-544fa7dca7d1
Dec  9 15:44:44 sepia18 kernel: [  126.472802] libceph: mon0 10.3.14.141:6789 session established
Dec  9 15:44:44 sepia18 kernel: [  126.474825]  rbd0: unknown partition table
Dec  9 15:44:44 sepia18 kernel: [  126.474907] rbd: rbd0: added with size 0x500000000
Dec  9 15:44:45 sepia18 kernel: [  127.043581] libceph: osd1 up
Dec  9 15:44:45 sepia18 kernel: [  127.043585] libceph: osd1 weight 0x10000 (in)
Dec  9 15:44:45 sepia18 kernel: [  127.059822] libceph: osd0 10.3.14.141:6800 socket closed
Dec  9 15:44:45 sepia18 kernel: [  127.073590] BUG: unable to handle kernel NULL pointer dereference at 0000000000000048
Dec  9 15:44:45 sepia18 kernel: [  127.081594] IP: [<ffffffffa02c5a87>] try_write+0x627/0x1060 [libceph]
Dec  9 15:44:45 sepia18 kernel: [  127.083535] PGD 37b7c067 PUD 37bbc067 PMD 0 
Dec  9 15:44:45 sepia18 kernel: [  127.083535] Oops: 0000 [#1] SMP 
Dec  9 15:44:45 sepia18 kernel: [  127.083535] CPU 1 
Dec  9 15:44:45 sepia18 kernel: [  127.083535] Modules linked in: cryptd aes_x86_64 aes_generic rbd libceph radeon ttm drm_kms_helper drm lp ppdev parport_pc parport shpchp i2c_algo_bit i3000_edac 
edac_core serio_raw ahci libahci floppy e1000e btrfs zlib_deflate crc32c libcrc32c
Dec  9 15:44:45 sepia18 kernel: [  127.083535] 
Dec  9 15:44:45 sepia18 kernel: [  127.083535] Pid: 46, comm: kworker/1:2 Not tainted 3.1.0-ceph-08936-gfb6ee37 #1 Supermicro PDSMi/PDSMi+
Dec  9 15:44:45 sepia18 kernel: [  127.083535] RIP: 0010:[<ffffffffa02c5a87>]  [<ffffffffa02c5a87>] try_write+0x627/0x1060 [libceph]
Dec  9 15:44:45 sepia18 kernel: [  127.083535] RSP: 0018:ffff88010368bb00  EFLAGS: 00010246
Dec  9 15:44:45 sepia18 kernel: [  127.083535] RAX: 0000000000000000 RBX: ffff880037be3830 RCX: 0000000000000000
Dec  9 15:44:45 sepia18 kernel: [  127.083535] RDX: 0000000000061000 RSI: 0000000000000001 RDI: 0000000000000000
Dec  9 15:44:45 sepia18 kernel: [  127.083535] RBP: ffff88010368bc20 R08: 0000000000000000 R09: 0000000000000000
Dec  9 15:44:45 sepia18 kernel: [  127.083535] R10: 0000000000000000 R11: 0000000000000000 R12: ffffea0002d6b1c0
Dec  9 15:44:45 sepia18 kernel: [  127.083535] R13: 000000000001f000 R14: ffff880037848b00 R15: 0000000000000000
Dec  9 15:44:45 sepia18 kernel: [  127.083535] FS:  0000000000000000(0000) GS:ffff88011fc80000(0000) knlGS:0000000000000000
Dec  9 15:44:45 sepia18 kernel: [  127.083535] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Dec  9 15:44:45 sepia18 kernel: [  127.083535] CR2: 0000000000000048 CR3: 0000000037af4000 CR4: 00000000000006e0
Dec  9 15:44:45 sepia18 kernel: [  127.083535] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Dec  9 15:44:45 sepia18 kernel: [  127.083535] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Dec  9 15:44:45 sepia18 kernel: [  127.083535] Process kworker/1:2 (pid: 46, threadinfo ffff88010368a000, task ffff88010364bf60)
Dec  9 15:44:45 sepia18 kernel: [  127.260013] Stack:
Dec  9 15:44:45 sepia18 kernel: [  127.260013]  0000000100004040 ffff88010319a800 0000000000000000 0000000000000000
Dec  9 15:44:45 sepia18 kernel: [  127.260013]  ffff88010368bbd0 0000000000000000 ffff88010860be91 ffff88010860be00
Dec  9 15:44:45 sepia18 kernel: [  127.260013]  ffff880037be3c10 ffff880037be3a40 ffff880037be3870 ffff880037be3a30
Dec  9 15:44:45 sepia18 kernel: [  127.260013] Call Trace:
Dec  9 15:44:45 sepia18 kernel: [  127.260013]  [<ffffffff814e8db4>] ? kernel_recvmsg+0x44/0x60
Dec  9 15:44:45 sepia18 kernel: [  127.260013]  [<ffffffffa02c4748>] ? ceph_tcp_recvmsg+0x48/0x60 [libceph]
Dec  9 15:44:45 sepia18 kernel: [  127.260013]  [<ffffffffa02c70d0>] con_work+0xc10/0x1b00 [libceph]
Dec  9 15:44:45 sepia18 kernel: [  127.260013]  [<ffffffff81053978>] ? finish_task_switch+0x48/0x120
Dec  9 15:44:45 sepia18 kernel: [  127.260013]  [<ffffffff8107fcb6>] process_one_work+0x1a6/0x520
Dec  9 15:44:45 sepia18 kernel: [  127.260013]  [<ffffffff8107fc47>] ? process_one_work+0x137/0x520
Dec  9 15:44:45 sepia18 kernel: [  127.260013]  [<ffffffffa02c64c0>] ? try_write+0x1060/0x1060 [libceph]
Dec  9 15:44:45 sepia18 kernel: [  127.260013]  [<ffffffff81081fe3>] worker_thread+0x173/0x400
Dec  9 15:44:45 sepia18 kernel: [  127.260013]  [<ffffffff81081e70>] ? manage_workers+0x210/0x210
Dec  9 15:44:45 sepia18 kernel: [  127.260013]  [<ffffffff81087026>] kthread+0xb6/0xc0
Dec  9 15:44:45 sepia18 kernel: [  127.260013]  [<ffffffff8109ee55>] ? trace_hardirqs_on_caller+0x105/0x190
Dec  9 15:44:45 sepia18 kernel: [  127.260013]  [<ffffffff8160dd44>] kernel_thread_helper+0x4/0x10
Dec  9 15:44:45 sepia18 kernel: [  127.260013]  [<ffffffff81603c74>] ? retint_restore_args+0x13/0x13
Dec  9 15:44:45 sepia18 kernel: [  127.260013]  [<ffffffff81086f70>] ? __init_kthread_worker+0x70/0x70
Dec  9 15:44:45 sepia18 kernel: [  127.260013]  [<ffffffff8160dd40>] ? gs_change+0x13/0x13
Dec  9 15:44:45 sepia18 kernel: [  127.260013] Code: 7c fb ff ff 49 83 be 90 00 00 00 00 0f 84 d2 04 00 00 49 63 8e a0 00 00 00 49 8b 86 98 00 00 00 44 8b 9d 5c ff ff ff 48 c1 e1 04 
Dec  9 15:44:45 sepia18 kernel: [  127.260013] RIP  [<ffffffffa02c5a87>] try_write+0x627/0x1060 [libceph]
Dec  9 15:44:45 sepia18 kernel: [  127.260013]  RSP <ffff88010368bb00>
Dec  9 15:44:45 sepia18 kernel: [  127.260013] CR2: 0000000000000048
Dec  9 15:44:45 sepia18 kernel: [  127.264188] ---[ end trace 6f04513fdb7876aa ]---
Dec  9 15:44:45 sepia18 kernel: [  127.264332] BUG: unable to handle kernel paging request at fffffffffffffff8
Dec  9 15:44:45 sepia18 kernel: [  127.264334] IP: [<ffffffff81086a10>] kthread_data+0x10/0x20
Dec  9 15:44:45 sepia18 kernel: [  127.264339] PGD 1c07067 PUD 1c08067 PMD 0 
Dec  9 15:44:45 sepia18 kernel: [  127.264342] Oops: 0000 [#2] SMP 
Dec  9 15:44:45 sepia18 kernel: [  127.264344] CPU 1 
Dec  9 15:44:45 sepia18 kernel: [  127.264345] Modules linked in: cryptd aes_x86_64 aes_generic rbd libceph radeon ttm drm_kms_helper drm lp ppdev parport_pc parport shpchp i2c_algo_bit i3000_edac edac_core serio_raw ahci libahci floppy e1000e btrfs zlib_deflate crc32c libcrc32c
Dec  9 15:44:45 sepia18 kernel: [  127.264359] 
Dec  9 15:44:45 sepia18 kernel: [  127.264361] Pid: 46, comm: kworker/1:2 Tainted: G      D     3.1.0-ceph-08936-gfb6ee37 #1 Supermicro PDSMi/PDSMi+
Dec  9 15:44:45 sepia18 kernel: [  127.264364] RIP: 0010:[<ffffffff81086a10>]  [<ffffffff81086a10>] kthread_data+0x10/0x20
Dec  9 15:44:45 sepia18 kernel: [  127.264368] RSP: 0018:ffff88010368b6f8  EFLAGS: 00010092
Dec  9 15:44:45 sepia18 kernel: [  127.264370] RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000001
Dec  9 15:44:45 sepia18 kernel: [  127.264372] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff88010364bf60
Dec  9 15:44:45 sepia18 kernel: [  127.264374] RBP: ffff88010368b6f8 R08: 0000000000989680 R09: 0000000000000001
Dec  9 15:44:45 sepia18 kernel: [  127.264376] R10: 0000000000000400 R11: 0000000000000000 R12: ffff88010364c300
Dec  9 15:44:45 sepia18 kernel: [  127.264378] R13: 0000000000000001 R14: 0000000000000001 R15: ffff88010368b820
Dec  9 15:44:45 sepia18 kernel: [  127.264381] FS:  0000000000000000(0000) GS:ffff88011fc80000(0000) knlGS:0000000000000000
Dec  9 15:44:45 sepia18 kernel: [  127.264383] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Dec  9 15:44:45 sepia18 kernel: [  127.264385] CR2: fffffffffffffff8 CR3: 0000000037af4000 CR4: 00000000000006e0
Dec  9 15:44:45 sepia18 kernel: [  127.264387] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Dec  9 15:44:45 sepia18 kernel: [  127.264389] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Dec  9 15:44:45 sepia18 kernel: [  127.264391] Process kworker/1:2 (pid: 46, threadinfo ffff88010368a000, task ffff88010364bf60)
Dec  9 15:44:45 sepia18 kernel: [  127.264393] Stack:
Dec  9 15:44:45 sepia18 kernel: [  127.264394]  ffff88010368b718 ffffffff8107f0c5 ffff88010368b718 ffff88011fc93b80
Dec  9 15:44:45 sepia18 kernel: [  127.264398]  ffff88010368b7b8 ffffffff81600763 0000000000000000 ffffffff81066e81
Dec  9 15:44:45 sepia18 kernel: [  127.264401]  ffff88010364bf60 0000000000013b80 ffff88010368bfd8 ffff88010368a010
Dec  9 15:44:45 sepia18 kernel: [  127.264404] Call Trace:
Dec  9 15:44:45 sepia18 kernel: [  127.264407]  [<ffffffff8107f0c5>] wq_worker_sleeping+0x15/0xa0
Dec  9 15:44:45 sepia18 kernel: [  127.264410]  [<ffffffff81600763>] __schedule+0x5c3/0x940
Dec  9 15:44:45 sepia18 kernel: [  127.264414]  [<ffffffff81066e81>] ? do_exit+0x551/0x880
Dec  9 15:44:45 sepia18 kernel: [  127.264416]  [<ffffffff81065d7d>] ? release_task+0x1d/0x470
Dec  9 15:44:45 sepia18 kernel: [  127.264419]  [<ffffffff81600e0f>] schedule+0x3f/0x60
Dec  9 15:44:45 sepia18 kernel: [  127.264421]  [<ffffffff81066ef7>] do_exit+0x5c7/0x880
Dec  9 15:44:45 sepia18 kernel: [  127.264424]  [<ffffffff81063b05>] ? kmsg_dump+0x75/0x140
Dec  9 15:44:45 sepia18 kernel: [  127.264427]  [<ffffffff81604ac0>] oops_end+0xb0/0xf0
Dec  9 15:44:45 sepia18 kernel: [  127.264431]  [<ffffffff8103f85d>] no_context+0xfd/0x270
Dec  9 15:44:45 sepia18 kernel: [  127.264434]  [<ffffffff814edfd5>] ? release_sock+0x35/0x190
Dec  9 15:44:45 sepia18 kernel: [  127.264437]  [<ffffffff8103fb15>] __bad_area_nosemaphore+0x145/0x230
Dec  9 15:44:45 sepia18 kernel: [  127.264440]  [<ffffffff810509a3>] ? get_parent_ip+0x33/0x50
Dec  9 15:44:45 sepia18 kernel: [  127.264443]  [<ffffffff8103fc13>] bad_area_nosemaphore+0x13/0x20
Dec  9 15:44:45 sepia18 kernel: [  127.264446]  [<ffffffff8160763e>] do_page_fault+0x34e/0x4b0
Dec  9 15:44:45 sepia18 kernel: [  127.264449]  [<ffffffff814ee0e4>] ? release_sock+0x144/0x190
Dec  9 15:44:45 sepia18 kernel: [  127.264453]  [<ffffffff815429e5>] ? tcp_sendpage+0xc5/0x590
Dec  9 15:44:45 sepia18 kernel: [  127.264457]  [<ffffffff81313fbd>] ? trace_hardirqs_off_thunk+0x3a/0x3c
Dec  9 15:44:45 sepia18 kernel: [  127.264460]  [<ffffffff81603ef5>] page_fault+0x25/0x30
Dec  9 15:44:45 sepia18 kernel: [  127.264467]  [<ffffffffa02c5a87>] ? try_write+0x627/0x1060 [libceph]
Dec  9 15:44:45 sepia18 kernel: [  127.264470]  [<ffffffff814e8db4>] ? kernel_recvmsg+0x44/0x60
Dec  9 15:44:45 sepia18 kernel: [  127.264476]  [<ffffffffa02c4748>] ? ceph_tcp_recvmsg+0x48/0x60 [libceph]
Dec  9 15:44:45 sepia18 kernel: [  127.264483]  [<ffffffffa02c70d0>] con_work+0xc10/0x1b00 [libceph]
Dec  9 15:44:45 sepia18 kernel: [  127.264486]  [<ffffffff81053978>] ? finish_task_switch+0x48/0x120
Dec  9 15:44:45 sepia18 kernel: [  127.264489]  [<ffffffff8107fcb6>] process_one_work+0x1a6/0x520
Dec  9 15:44:45 sepia18 kernel: [  127.264491]  [<ffffffff8107fc47>] ? process_one_work+0x137/0x520
Dec  9 15:44:45 sepia18 kernel: [  127.264498]  [<ffffffffa02c64c0>] ? try_write+0x1060/0x1060 [libceph]
Dec  9 15:44:45 sepia18 kernel: [  127.264501]  [<ffffffff81081fe3>] worker_thread+0x173/0x400
Dec  9 15:44:45 sepia18 kernel: [  127.264504]  [<ffffffff81081e70>] ? manage_workers+0x210/0x210
Dec  9 15:44:45 sepia18 kernel: [  127.264506]  [<ffffffff81087026>] kthread+0xb6/0xc0
Dec  9 15:44:45 sepia18 kernel: [  127.264510]  [<ffffffff8109ee55>] ? trace_hardirqs_on_caller+0x105/0x190
Dec  9 15:44:45 sepia18 kernel: [  127.264513]  [<ffffffff8160dd44>] kernel_thread_helper+0x4/0x10
Dec  9 15:44:45 sepia18 kernel: [  127.264516]  [<ffffffff81603c74>] ? retint_restore_args+0x13/0x13
Dec  9 15:44:45 sepia18 kernel: [  127.264519]  [<ffffffff81086f70>] ? __init_kthread_worker+0x70/0x70
Dec  9 15:44:45 sepia18 kernel: [  127.264522]  [<ffffffff8160dd40>] ? gs_change+0x13/0x13
Dec  9 15:44:45 sepia18 kernel: [  127.264523] Code: 66 66 66 90 65 48 8b 04 25 40 c4 00 00 48 8b 80 48 03 00 00 8b 40 f0 c9 c3 66 90 55 48 89 e5 66 66 66 66 90 48 8b 87 48 03 00 00 
Dec  9 15:44:45 sepia18 kernel: [  127.264546] RIP  [<ffffffff81086a10>] kthread_data+0x10/0x20
Dec  9 15:44:45 sepia18 kernel: [  127.264549]  RSP <ffff88010368b6f8>
Dec  9 15:44:45 sepia18 kernel: [  127.264551] CR2: fffffffffffffff8
Dec  9 15:44:45 sepia18 kernel: [  127.264553] ---[ end trace 6f04513fdb7876ab ]---
Dec  9 15:44:45 sepia18 kernel: [  127.264554] Fixing recursive fault but reboot is needed!

Actions #3

Updated by Sage Weil over 12 years ago

  • Category set to libceph
  • Priority changed from Normal to High
  • Target version set to v3.2
Actions #4

Updated by Josh Durgin over 12 years ago

This happened again on sepia70 during the kernel untar build workunit on rbd.

Actions #5

Updated by Sage Weil over 12 years ago

  • Assignee set to Alex Elder
Actions #6

Updated by Josh Durgin over 12 years ago

This happened again with the same workload on sepia81.

Actions #7

Updated by Sage Weil over 12 years ago

looks like a null pointer dereference.. look for a struct member that's 0x48 bytes in?

Actions #8

Updated by Sage Weil over 12 years ago

  • Target version deleted (v3.2)
Actions #9

Updated by Sage Weil over 12 years ago

  • Translation missing: en.field_position set to 1
Actions #10

Updated by Sage Weil over 12 years ago

  • Target version set to v3.3
  • Translation missing: en.field_position deleted (2)
  • Translation missing: en.field_position set to 699
Actions #11

Updated by Sage Weil over 12 years ago

  • Subject changed from spinlock lockup to NULL pointer dereference at try_write+0x627/0x1060
Actions #12

Updated by Alex Elder about 12 years ago

Is there a core file for this problem anywhere?

It would really be nice to poke around in the message, or the
connection structure, or basically anything. I was unable to
reproduce the problem before when trying to repeatedly run
the kernel_untar_build workunit, which was mentioned in bug
1866 as what was running when it happened there.

Actions #13

Updated by Sage Weil about 12 years ago

Hmm.. yeah, I don't think we have anything beyond these console dumps. And we don't capture any kind of kernel core dumps (can/should we?).

Actually, I don't think we've seen this in more than a month. :/

Actions #14

Updated by Sage Weil about 12 years ago

  • Status changed from New to Need More Info
Actions #15

Updated by Alex Elder about 12 years ago

Bugs 1793 and 2081 have a signature of a page fault/bad memory reference
from process_one_work() -> con_work(), and they may well be the same issue.
So I have been treating them as such, and have been attempting to reproduce
them by running the kernel_untar_build.sh workunit repeatedly.

The last day or two I have been testing with the latest version of the
ceph-client code (c666601a93). One time I hit a kernel failure that had a
signature different from either 1793 or 2081--it reported:

INFO: rcu_sched detected stalls on CPUs/tasks: { 0} (detected by 4, t=492980 jiffies)

I believe this is bug 2174, and am adding an update to that to record
this occurrence.

I restarted testing and completed 50 more iterations of kernel_untar_build.sh
without error. I will try a bit more, I think we should keep these two bugs
on a sort of inactive status (1793 is already "Need More Info" but 2081 was
not until now) until we start seeing occurrences again. A kernel core dump
would also be helpful, once we get that in place (see bug 2127).

Actions #16

Updated by Alex Elder about 12 years ago

Another 100 iterations of kernel_untar_build.sh using the current
master branch (c666601a935b94cc0f3310339411b6940de751ba) without
error. I'm going to quit testing this now and wait to see if
our regular automated testing produces it again.

Actions #17

Updated by Alex Elder about 12 years ago

  • Status changed from Need More Info to Can't reproduce

Marking this Can't Reproduce. Will reopen if it shows up again.

Actions

Also available in: Atom PDF