Bug #1793
NULL pointer dereference at try_write+0x627/0x1060
Description
Found in sepia50's console:
[ 8727.491802] Call Trace:
[ 8727.491802] [<ffffffff8107f0c5>] wq_worker_sleeping+0x15/0xa0
[ 8727.491802] [<ffffffff81600763>] __schedule+0x5c3/0x940
[ 8727.491802] [<ffffffff81066e81>] ? do_exit+0x551/0x880
[ 8727.491802] [<ffffffff81065d7d>] ? release_task+0x1d/0x470
[ 8727.491802] [<ffffffff81600e0f>] schedule+0x3f/0x60
[ 8727.491802] [<ffffffff81066ef7>] do_exit+0x5c7/0x880
[ 8727.491802] [<ffffffff81063b05>] ? kmsg_dump+0x75/0x140
[ 8727.491802] [<ffffffff81604ac0>] oops_end+0xb0/0xf0
[ 8727.491802] [<ffffffff8103f85d>] no_context+0xfd/0x270
[ 8727.491802] [<ffffffff814edfd5>] ? release_sock+0x35/0x190
[ 8727.491802] [<ffffffff8103fb15>] __bad_area_nosemaphore+0x145/0x230
[ 8727.491802] [<ffffffff810509a3>] ? get_parent_ip+0x33/0x50
[ 8727.491802] [<ffffffff8103fc13>] bad_area_nosemaphore+0x13/0x20
[ 8727.491802] [<ffffffff8160763e>] do_page_fault+0x34e/0x4b0
[ 8727.491802] [<ffffffff814ee0e4>] ? release_sock+0x144/0x190
[ 8727.491802] [<ffffffff815429e5>] ? tcp_sendpage+0xc5/0x590
[ 8727.491802] [<ffffffff81313fbd>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[ 8727.491802] [<ffffffff81603ef5>] page_fault+0x25/0x30
[ 8727.491802] [<ffffffffa00c6a87>] ? try_write+0x627/0x1060 [libceph]
[ 8727.491802] [<ffffffff814e8db4>] ? kernel_recvmsg+0x44/0x60
[ 8727.491802] [<ffffffffa00c5748>] ? ceph_tcp_recvmsg+0x48/0x60 [libceph]
[ 8727.491802] [<ffffffffa00c80d0>] con_work+0xc10/0x1b00 [libceph]
[ 8727.491802] [<ffffffff81313f7e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[ 8727.491802] [<ffffffff81053978>] ? finish_task_switch+0x48/0x120
[ 8727.491802] [<ffffffff8107fcb6>] process_one_work+0x1a6/0x520
[ 8727.491802] [<ffffffff8107fc47>] ? process_one_work+0x137/0x520
[ 8727.491802] [<ffffffffa00c74c0>] ? try_write+0x1060/0x1060 [libceph]
[ 8727.491802] [<ffffffff81081fe3>] worker_thread+0x173/0x400
[ 8727.491802] [<ffffffff81081e70>] ? manage_workers+0x210/0x210
[ 8727.491802] [<ffffffff81087026>] kthread+0xb6/0xc0
[ 8727.491802] [<ffffffff8160dd44>] kernel_thread_helper+0x4/0x10
[ 8727.491802] [<ffffffff81603c74>] ? retint_restore_args+0x13/0x13
[ 8727.491802] [<ffffffff81086f70>] ? __init_kthread_worker+0x70/0x70
[ 8727.491802] [<ffffffff8160dd40>] ? gs_change+0x13/0x13
[ 8727.491802] Code: 66 66 66 90 65 48 8b 04 25 40 c4 00 00 48 8b 80 48 03 00 00 8b 40 f0 c9 c3 66 90 55 48 89 e5 66 66 66 66 90 48 8b 87 48 03 00 00
[ 8727.491802] 8b 40 f8 c9 c3 eb 08 90 90 90 90 90 90 90 90 55 48 89 e5 66
[ 8727.491802] RIP [<ffffffff81086a10>] kthread_data+0x10/0x20
[ 8727.491802] RSP <ffff8800455b76f8>
[ 8727.491802] CR2: fffffffffffffff8
[ 8727.491802] ---[ end trace 9654ef6c74784bc5 ]---
[ 8727.491802] Fixing recursive fault but reboot is needed!
[ 8727.491802] BUG: spinlock lockup on CPU#0, kworker/0:2/5057
[ 8727.491802] lock: ffff8800fbc13b80, .magic: dead4ead, .owner: kworker/0:2/5057, .owner_cpu: 0
[ 8727.491802] Pid: 5057, comm: kworker/0:2 Tainted: G D 3.1.0-ceph-08936-gb643987 #1
[ 8727.491802] Call Trace:
[ 8727.491802] [<ffffffff8131a178>] spin_dump+0x78/0xc0
[ 8727.491802] [<ffffffff8131a39d>] do_raw_spin_lock+0xed/0x120
[ 8727.491802] [<ffffffff816031e5>] _raw_spin_lock_irq+0x45/0x50
[ 8727.491802] [<ffffffff81600277>] ? __schedule+0xd7/0x940
[ 8727.491802] [<ffffffff81600277>] __schedule+0xd7/0x940
[ 8727.491802] [<ffffffff81600e0f>] schedule+0x3f/0x60
[ 8727.491802] [<ffffffff81067164>] do_exit+0x834/0x880
[ 8727.491802] [<ffffffff81063b95>] ? kmsg_dump+0x105/0x140
[ 8727.491802] [<ffffffff81063b05>] ? kmsg_dump+0x75/0x140
[ 8727.491802] [<ffffffff81604ac0>] oops_end+0xb0/0xf0
[ 8727.491802] [<ffffffff8103f85d>] no_context+0xfd/0x270
[ 8727.491802] [<ffffffff8103fb15>] __bad_area_nosemaphore+0x145/0x230
[ 8727.491802] [<ffffffff811b3af0>] ? fsnotify_clear_marks_by_inode+0x30/0xf0
[ 8727.491802] [<ffffffff8103fc13>] bad_area_nosemaphore+0x13/0x20
[ 8727.491802] [<ffffffff8160763e>] do_page_fault+0x34e/0x4b0
[ 8727.491802] [<ffffffff81057bc4>] ? cpuacct_charge+0x24/0xb0
[ 8727.491802] [<ffffffff81057bc4>] ? cpuacct_charge+0x24/0xb0
[ 8727.491802] [<ffffffff81313fbd>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[ 8727.491802] [<ffffffff81603ef5>] page_fault+0x25/0x30
[ 8727.491802] [<ffffffff81086a10>] ? kthread_data+0x10/0x20
[ 8727.491802] [<ffffffff8107f0c5>] wq_worker_sleeping+0x15/0xa0
[ 8727.491802] [<ffffffff81600763>] __schedule+0x5c3/0x940
[ 8727.491802] [<ffffffff81066e81>] ? do_exit+0x551/0x880
[ 8727.491802] [<ffffffff81065d7d>] ? release_task+0x1d/0x470
[ 8727.491802] [<ffffffff81600e0f>] schedule+0x3f/0x60
[ 8727.491802] [<ffffffff81066ef7>] do_exit+0x5c7/0x880
[ 8727.491802] [<ffffffff81063b05>] ? kmsg_dump+0x75/0x140
[ 8727.491802] [<ffffffff81604ac0>] oops_end+0xb0/0xf0
[ 8727.491802] [<ffffffff8103f85d>] no_context+0xfd/0x270
[ 8727.491802] [<ffffffff814edfd5>] ? release_sock+0x35/0x190
[ 8727.491802] [<ffffffff8103fb15>] __bad_area_nosemaphore+0x145/0x230
[ 8727.491802] [<ffffffff810509a3>] ? get_parent_ip+0x33/0x50
[ 8727.491802] [<ffffffff8103fc13>] bad_area_nosemaphore+0x13/0x20
[ 8727.491802] [<ffffffff8160763e>] do_page_fault+0x34e/0x4b0
[ 8727.491802] [<ffffffff814ee0e4>] ? release_sock+0x144/0x190
[ 8727.491802] [<ffffffff815429e5>] ? tcp_sendpage+0xc5/0x590
[ 8727.491802] [<ffffffff81313fbd>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[ 8727.491802] [<ffffffff81603ef5>] page_fault+0x25/0x30
[ 8727.491802] [<ffffffffa00c6a87>] ? try_write+0x627/0x1060 [libceph]
[ 8727.491802] [<ffffffff814e8db4>] ? kernel_recvmsg+0x44/0x60
[ 8727.491802] [<ffffffffa00c5748>] ? ceph_tcp_recvmsg+0x48/0x60 [libceph]
[ 8727.491802] [<ffffffffa00c80d0>] con_work+0xc10/0x1b00 [libceph]
[ 8727.491802] [<ffffffff81313f7e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[ 8727.491802] [<ffffffff81053978>] ? finish_task_switch+0x48/0x120
[ 8727.491802] [<ffffffff8107fcb6>] process_one_work+0x1a6/0x520
[ 8727.491802] [<ffffffff8107fc47>] ? process_one_work+0x137/0x520
[ 8727.491802] [<ffffffffa00c74c0>] ? try_write+0x1060/0x1060 [libceph]
[ 8727.491802] [<ffffffff81081fe3>] worker_thread+0x173/0x400
[ 8727.491802] [<ffffffff81081e70>] ? manage_workers+0x210/0x210
[ 8727.491802] [<ffffffff81087026>] kthread+0xb6/0xc0
[ 8727.491802] [<ffffffff8160dd44>] kernel_thread_helper+0x4/0x10
[ 8727.491802] [<ffffffff81603c74>] ? retint_restore_args+0x13/0x13
[ 8727.491802] [<ffffffff81086f70>] ? __init_kthread_worker+0x70/0x70
[ 8727.491802] [<ffffffff8160dd40>] ? gs_change+0x13/0x13
Related issues
History
#1 Updated by Josh Durgin almost 12 years ago
Got the same trace on sepia18 while running mkfs.ext3 on an rbd image.
#2 Updated by Josh Durgin almost 12 years ago
A disk error prevented me from getting logs before:
[ 226.628240] sd 2:0:0:0: [sda] Unhandled error code
[ 226.637615] sd 2:0:0:0: [sda] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[ 226.649859] sd 2:0:0:0: [sda] CDB: Write(10): 2a 00 39 84 61 f0 00 00 10 00
[ 226.661756] end_request: I/O error, dev sda, sector 964977136
[ 226.672023] Aborting journal on device sda1-8.
[ 226.680985] sd 2:0:0:0: [sda] Unhandled error code
[ 226.690213] sd 2:0:0:0: [sda] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[ 226.702446] sd 2:0:0:0: [sda] CDB: Write(10): 2a 00 39 84 08 00 00 00 08 00
[ 226.714316] end_request: I/O error, dev sda, sector 964954112
[ 226.724576] quiet_error: 24 callbacks suppressed
[ 226.733687] Buffer I/O error on device sda1, logical block 120619008
[ 226.744608] lost page write due to I/O error on sda1
[ 226.744620] JBD2: I/O error detected when updating journal superblock for sda1-8.
But now I got the full trace of what happens before the page fault handling:
Dec 9 15:44:44 sepia18 kernel: [ 126.472614] libceph: client4104 fsid b10de2f1-02b3-4bd7-90cb-544fa7dca7d1
Dec 9 15:44:44 sepia18 kernel: [ 126.472802] libceph: mon0 10.3.14.141:6789 session established
Dec 9 15:44:44 sepia18 kernel: [ 126.474825] rbd0: unknown partition table
Dec 9 15:44:44 sepia18 kernel: [ 126.474907] rbd: rbd0: added with size 0x500000000
Dec 9 15:44:45 sepia18 kernel: [ 127.043581] libceph: osd1 up
Dec 9 15:44:45 sepia18 kernel: [ 127.043585] libceph: osd1 weight 0x10000 (in)
Dec 9 15:44:45 sepia18 kernel: [ 127.059822] libceph: osd0 10.3.14.141:6800 socket closed
Dec 9 15:44:45 sepia18 kernel: [ 127.073590] BUG: unable to handle kernel NULL pointer dereference at 0000000000000048
Dec 9 15:44:45 sepia18 kernel: [ 127.081594] IP: [<ffffffffa02c5a87>] try_write+0x627/0x1060 [libceph]
Dec 9 15:44:45 sepia18 kernel: [ 127.083535] PGD 37b7c067 PUD 37bbc067 PMD 0
Dec 9 15:44:45 sepia18 kernel: [ 127.083535] Oops: 0000 [#1] SMP
Dec 9 15:44:45 sepia18 kernel: [ 127.083535] CPU 1
Dec 9 15:44:45 sepia18 kernel: [ 127.083535] Modules linked in: cryptd aes_x86_64 aes_generic rbd libceph radeon ttm drm_kms_helper drm lp ppdev parport_pc parport shpchp i2c_algo_bit i3000_edac edac_core serio_raw ahci libahci floppy e1000e btrfs zlib_deflate crc32c libcrc32c
Dec 9 15:44:45 sepia18 kernel: [ 127.083535]
Dec 9 15:44:45 sepia18 kernel: [ 127.083535] Pid: 46, comm: kworker/1:2 Not tainted 3.1.0-ceph-08936-gfb6ee37 #1 Supermicro PDSMi/PDSMi+
Dec 9 15:44:45 sepia18 kernel: [ 127.083535] RIP: 0010:[<ffffffffa02c5a87>] [<ffffffffa02c5a87>] try_write+0x627/0x1060 [libceph]
Dec 9 15:44:45 sepia18 kernel: [ 127.083535] RSP: 0018:ffff88010368bb00 EFLAGS: 00010246
Dec 9 15:44:45 sepia18 kernel: [ 127.083535] RAX: 0000000000000000 RBX: ffff880037be3830 RCX: 0000000000000000
Dec 9 15:44:45 sepia18 kernel: [ 127.083535] RDX: 0000000000061000 RSI: 0000000000000001 RDI: 0000000000000000
Dec 9 15:44:45 sepia18 kernel: [ 127.083535] RBP: ffff88010368bc20 R08: 0000000000000000 R09: 0000000000000000
Dec 9 15:44:45 sepia18 kernel: [ 127.083535] R10: 0000000000000000 R11: 0000000000000000 R12: ffffea0002d6b1c0
Dec 9 15:44:45 sepia18 kernel: [ 127.083535] R13: 000000000001f000 R14: ffff880037848b00 R15: 0000000000000000
Dec 9 15:44:45 sepia18 kernel: [ 127.083535] FS: 0000000000000000(0000) GS:ffff88011fc80000(0000) knlGS:0000000000000000
Dec 9 15:44:45 sepia18 kernel: [ 127.083535] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Dec 9 15:44:45 sepia18 kernel: [ 127.083535] CR2: 0000000000000048 CR3: 0000000037af4000 CR4: 00000000000006e0
Dec 9 15:44:45 sepia18 kernel: [ 127.083535] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Dec 9 15:44:45 sepia18 kernel: [ 127.083535] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Dec 9 15:44:45 sepia18 kernel: [ 127.083535] Process kworker/1:2 (pid: 46, threadinfo ffff88010368a000, task ffff88010364bf60)
Dec 9 15:44:45 sepia18 kernel: [ 127.260013] Stack:
Dec 9 15:44:45 sepia18 kernel: [ 127.260013] 0000000100004040 ffff88010319a800 0000000000000000 0000000000000000
Dec 9 15:44:45 sepia18 kernel: [ 127.260013] ffff88010368bbd0 0000000000000000 ffff88010860be91 ffff88010860be00
Dec 9 15:44:45 sepia18 kernel: [ 127.260013] ffff880037be3c10 ffff880037be3a40 ffff880037be3870 ffff880037be3a30
Dec 9 15:44:45 sepia18 kernel: [ 127.260013] Call Trace:
Dec 9 15:44:45 sepia18 kernel: [ 127.260013] [<ffffffff814e8db4>] ? kernel_recvmsg+0x44/0x60
Dec 9 15:44:45 sepia18 kernel: [ 127.260013] [<ffffffffa02c4748>] ? ceph_tcp_recvmsg+0x48/0x60 [libceph]
Dec 9 15:44:45 sepia18 kernel: [ 127.260013] [<ffffffffa02c70d0>] con_work+0xc10/0x1b00 [libceph]
Dec 9 15:44:45 sepia18 kernel: [ 127.260013] [<ffffffff81053978>] ? finish_task_switch+0x48/0x120
Dec 9 15:44:45 sepia18 kernel: [ 127.260013] [<ffffffff8107fcb6>] process_one_work+0x1a6/0x520
Dec 9 15:44:45 sepia18 kernel: [ 127.260013] [<ffffffff8107fc47>] ? process_one_work+0x137/0x520
Dec 9 15:44:45 sepia18 kernel: [ 127.260013] [<ffffffffa02c64c0>] ? try_write+0x1060/0x1060 [libceph]
Dec 9 15:44:45 sepia18 kernel: [ 127.260013] [<ffffffff81081fe3>] worker_thread+0x173/0x400
Dec 9 15:44:45 sepia18 kernel: [ 127.260013] [<ffffffff81081e70>] ? manage_workers+0x210/0x210
Dec 9 15:44:45 sepia18 kernel: [ 127.260013] [<ffffffff81087026>] kthread+0xb6/0xc0
Dec 9 15:44:45 sepia18 kernel: [ 127.260013] [<ffffffff8109ee55>] ? trace_hardirqs_on_caller+0x105/0x190
Dec 9 15:44:45 sepia18 kernel: [ 127.260013] [<ffffffff8160dd44>] kernel_thread_helper+0x4/0x10
Dec 9 15:44:45 sepia18 kernel: [ 127.260013] [<ffffffff81603c74>] ? retint_restore_args+0x13/0x13
Dec 9 15:44:45 sepia18 kernel: [ 127.260013] [<ffffffff81086f70>] ? __init_kthread_worker+0x70/0x70
Dec 9 15:44:45 sepia18 kernel: [ 127.260013] [<ffffffff8160dd40>] ? gs_change+0x13/0x13
Dec 9 15:44:45 sepia18 kernel: [ 127.260013] Code: 7c fb ff ff 49 83 be 90 00 00 00 00 0f 84 d2 04 00 00 49 63 8e a0 00 00 00 49 8b 86 98 00 00 00 44 8b 9d 5c ff ff ff 48 c1 e1 04
Dec 9 15:44:45 sepia18 kernel: [ 127.260013] RIP [<ffffffffa02c5a87>] try_write+0x627/0x1060 [libceph]
Dec 9 15:44:45 sepia18 kernel: [ 127.260013] RSP <ffff88010368bb00>
Dec 9 15:44:45 sepia18 kernel: [ 127.260013] CR2: 0000000000000048
Dec 9 15:44:45 sepia18 kernel: [ 127.264188] ---[ end trace 6f04513fdb7876aa ]---
Dec 9 15:44:45 sepia18 kernel: [ 127.264332] BUG: unable to handle kernel paging request at fffffffffffffff8
Dec 9 15:44:45 sepia18 kernel: [ 127.264334] IP: [<ffffffff81086a10>] kthread_data+0x10/0x20
Dec 9 15:44:45 sepia18 kernel: [ 127.264339] PGD 1c07067 PUD 1c08067 PMD 0
Dec 9 15:44:45 sepia18 kernel: [ 127.264342] Oops: 0000 [#2] SMP
Dec 9 15:44:45 sepia18 kernel: [ 127.264344] CPU 1
Dec 9 15:44:45 sepia18 kernel: [ 127.264345] Modules linked in: cryptd aes_x86_64 aes_generic rbd libceph radeon ttm drm_kms_helper drm lp ppdev parport_pc parport shpchp i2c_algo_bit i3000_edac edac_core serio_raw ahci libahci floppy e1000e btrfs zlib_deflate crc32c libcrc32c
Dec 9 15:44:45 sepia18 kernel: [ 127.264359]
Dec 9 15:44:45 sepia18 kernel: [ 127.264361] Pid: 46, comm: kworker/1:2 Tainted: G D 3.1.0-ceph-08936-gfb6ee37 #1 Supermicro PDSMi/PDSMi+
Dec 9 15:44:45 sepia18 kernel: [ 127.264364] RIP: 0010:[<ffffffff81086a10>] [<ffffffff81086a10>] kthread_data+0x10/0x20
Dec 9 15:44:45 sepia18 kernel: [ 127.264368] RSP: 0018:ffff88010368b6f8 EFLAGS: 00010092
Dec 9 15:44:45 sepia18 kernel: [ 127.264370] RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000001
Dec 9 15:44:45 sepia18 kernel: [ 127.264372] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff88010364bf60
Dec 9 15:44:45 sepia18 kernel: [ 127.264374] RBP: ffff88010368b6f8 R08: 0000000000989680 R09: 0000000000000001
Dec 9 15:44:45 sepia18 kernel: [ 127.264376] R10: 0000000000000400 R11: 0000000000000000 R12: ffff88010364c300
Dec 9 15:44:45 sepia18 kernel: [ 127.264378] R13: 0000000000000001 R14: 0000000000000001 R15: ffff88010368b820
Dec 9 15:44:45 sepia18 kernel: [ 127.264381] FS: 0000000000000000(0000) GS:ffff88011fc80000(0000) knlGS:0000000000000000
Dec 9 15:44:45 sepia18 kernel: [ 127.264383] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Dec 9 15:44:45 sepia18 kernel: [ 127.264385] CR2: fffffffffffffff8 CR3: 0000000037af4000 CR4: 00000000000006e0
Dec 9 15:44:45 sepia18 kernel: [ 127.264387] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Dec 9 15:44:45 sepia18 kernel: [ 127.264389] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Dec 9 15:44:45 sepia18 kernel: [ 127.264391] Process kworker/1:2 (pid: 46, threadinfo ffff88010368a000, task ffff88010364bf60)
Dec 9 15:44:45 sepia18 kernel: [ 127.264393] Stack:
Dec 9 15:44:45 sepia18 kernel: [ 127.264394] ffff88010368b718 ffffffff8107f0c5 ffff88010368b718 ffff88011fc93b80
Dec 9 15:44:45 sepia18 kernel: [ 127.264398] ffff88010368b7b8 ffffffff81600763 0000000000000000 ffffffff81066e81
Dec 9 15:44:45 sepia18 kernel: [ 127.264401] ffff88010364bf60 0000000000013b80 ffff88010368bfd8 ffff88010368a010
Dec 9 15:44:45 sepia18 kernel: [ 127.264404] Call Trace:
Dec 9 15:44:45 sepia18 kernel: [ 127.264407] [<ffffffff8107f0c5>] wq_worker_sleeping+0x15/0xa0
Dec 9 15:44:45 sepia18 kernel: [ 127.264410] [<ffffffff81600763>] __schedule+0x5c3/0x940
Dec 9 15:44:45 sepia18 kernel: [ 127.264414] [<ffffffff81066e81>] ? do_exit+0x551/0x880
Dec 9 15:44:45 sepia18 kernel: [ 127.264416] [<ffffffff81065d7d>] ? release_task+0x1d/0x470
Dec 9 15:44:45 sepia18 kernel: [ 127.264419] [<ffffffff81600e0f>] schedule+0x3f/0x60
Dec 9 15:44:45 sepia18 kernel: [ 127.264421] [<ffffffff81066ef7>] do_exit+0x5c7/0x880
Dec 9 15:44:45 sepia18 kernel: [ 127.264424] [<ffffffff81063b05>] ? kmsg_dump+0x75/0x140
Dec 9 15:44:45 sepia18 kernel: [ 127.264427] [<ffffffff81604ac0>] oops_end+0xb0/0xf0
Dec 9 15:44:45 sepia18 kernel: [ 127.264431] [<ffffffff8103f85d>] no_context+0xfd/0x270
Dec 9 15:44:45 sepia18 kernel: [ 127.264434] [<ffffffff814edfd5>] ? release_sock+0x35/0x190
Dec 9 15:44:45 sepia18 kernel: [ 127.264437] [<ffffffff8103fb15>] __bad_area_nosemaphore+0x145/0x230
Dec 9 15:44:45 sepia18 kernel: [ 127.264440] [<ffffffff810509a3>] ? get_parent_ip+0x33/0x50
Dec 9 15:44:45 sepia18 kernel: [ 127.264443] [<ffffffff8103fc13>] bad_area_nosemaphore+0x13/0x20
Dec 9 15:44:45 sepia18 kernel: [ 127.264446] [<ffffffff8160763e>] do_page_fault+0x34e/0x4b0
Dec 9 15:44:45 sepia18 kernel: [ 127.264449] [<ffffffff814ee0e4>] ? release_sock+0x144/0x190
Dec 9 15:44:45 sepia18 kernel: [ 127.264453] [<ffffffff815429e5>] ? tcp_sendpage+0xc5/0x590
Dec 9 15:44:45 sepia18 kernel: [ 127.264457] [<ffffffff81313fbd>] ? trace_hardirqs_off_thunk+0x3a/0x3c
Dec 9 15:44:45 sepia18 kernel: [ 127.264460] [<ffffffff81603ef5>] page_fault+0x25/0x30
Dec 9 15:44:45 sepia18 kernel: [ 127.264467] [<ffffffffa02c5a87>] ? try_write+0x627/0x1060 [libceph]
Dec 9 15:44:45 sepia18 kernel: [ 127.264470] [<ffffffff814e8db4>] ? kernel_recvmsg+0x44/0x60
Dec 9 15:44:45 sepia18 kernel: [ 127.264476] [<ffffffffa02c4748>] ? ceph_tcp_recvmsg+0x48/0x60 [libceph]
Dec 9 15:44:45 sepia18 kernel: [ 127.264483] [<ffffffffa02c70d0>] con_work+0xc10/0x1b00 [libceph]
Dec 9 15:44:45 sepia18 kernel: [ 127.264486] [<ffffffff81053978>] ? finish_task_switch+0x48/0x120
Dec 9 15:44:45 sepia18 kernel: [ 127.264489] [<ffffffff8107fcb6>] process_one_work+0x1a6/0x520
Dec 9 15:44:45 sepia18 kernel: [ 127.264491] [<ffffffff8107fc47>] ? process_one_work+0x137/0x520
Dec 9 15:44:45 sepia18 kernel: [ 127.264498] [<ffffffffa02c64c0>] ? try_write+0x1060/0x1060 [libceph]
Dec 9 15:44:45 sepia18 kernel: [ 127.264501] [<ffffffff81081fe3>] worker_thread+0x173/0x400
Dec 9 15:44:45 sepia18 kernel: [ 127.264504] [<ffffffff81081e70>] ? manage_workers+0x210/0x210
Dec 9 15:44:45 sepia18 kernel: [ 127.264506] [<ffffffff81087026>] kthread+0xb6/0xc0
Dec 9 15:44:45 sepia18 kernel: [ 127.264510] [<ffffffff8109ee55>] ? trace_hardirqs_on_caller+0x105/0x190
Dec 9 15:44:45 sepia18 kernel: [ 127.264513] [<ffffffff8160dd44>] kernel_thread_helper+0x4/0x10
Dec 9 15:44:45 sepia18 kernel: [ 127.264516] [<ffffffff81603c74>] ? retint_restore_args+0x13/0x13
Dec 9 15:44:45 sepia18 kernel: [ 127.264519] [<ffffffff81086f70>] ? __init_kthread_worker+0x70/0x70
Dec 9 15:44:45 sepia18 kernel: [ 127.264522] [<ffffffff8160dd40>] ? gs_change+0x13/0x13
Dec 9 15:44:45 sepia18 kernel: [ 127.264523] Code: 66 66 66 90 65 48 8b 04 25 40 c4 00 00 48 8b 80 48 03 00 00 8b 40 f0 c9 c3 66 90 55 48 89 e5 66 66 66 66 90 48 8b 87 48 03 00 00
Dec 9 15:44:45 sepia18 kernel: [ 127.264546] RIP [<ffffffff81086a10>] kthread_data+0x10/0x20
Dec 9 15:44:45 sepia18 kernel: [ 127.264549] RSP <ffff88010368b6f8>
Dec 9 15:44:45 sepia18 kernel: [ 127.264551] CR2: fffffffffffffff8
Dec 9 15:44:45 sepia18 kernel: [ 127.264553] ---[ end trace 6f04513fdb7876ab ]---
Dec 9 15:44:45 sepia18 kernel: [ 127.264554] Fixing recursive fault but reboot is needed!
#3 Updated by Sage Weil almost 12 years ago
- Category set to libceph
- Priority changed from Normal to High
- Target version set to v3.2
#4 Updated by Josh Durgin almost 12 years ago
This happened again on sepia70 during the kernel untar build workunit on rbd.
#5 Updated by Sage Weil almost 12 years ago
- Assignee set to Alex Elder
#6 Updated by Josh Durgin almost 12 years ago
This happened again with the same workload on sepia81.
#7 Updated by Sage Weil almost 12 years ago
Looks like a NULL pointer dereference. Look for a struct member that's 0x48 bytes in?
#8 Updated by Sage Weil almost 12 years ago
- Target version deleted (v3.2)
#9 Updated by Sage Weil almost 12 years ago
- Position set to 1
#10 Updated by Sage Weil almost 12 years ago
- Target version set to v3.3
- Position deleted (2)
- Position set to 699
#11 Updated by Sage Weil almost 12 years ago
- Subject changed from spinlock lockup to NULL pointer dereference at try_write+0x627/0x1060
#12 Updated by Alex Elder almost 12 years ago
Is there a core file for this problem anywhere?
It would really be nice to poke around in the message, or the
connection structure, or basically anything. I was unable to
reproduce the problem before when trying to repeatedly run
the kernel_untar_build workunit, which was mentioned in bug
1866 as what was running when it happened there.
#13 Updated by Sage Weil almost 12 years ago
Hmm.. yeah, I don't think we have anything beyond these console dumps. And we don't capture any kind of kernel core dumps (can/should we?).
Actually, I don't think we've seen this in more than a month. :/
#14 Updated by Sage Weil almost 12 years ago
- Status changed from New to Need More Info
#15 Updated by Alex Elder over 11 years ago
Bugs 1793 and 2081 have a signature of a page fault/bad memory reference
from process_one_work() -> con_work(), and they may well be the same issue.
So I have been treating them as such, and have been attempting to reproduce
them by running the kernel_untar_build.sh workunit repeatedly.
The last day or two I have been testing with the latest version of the
ceph-client code (c666601a93). One time I hit a kernel failure that had a
signature different from either 1793 or 2081--it reported:
INFO: rcu_sched detected stalls on CPUs/tasks: { 0} (detected by 4, t=492980 jiffies)
I believe this is bug 2174, and am adding an update to that to record
this occurrence.
I restarted testing and completed 50 more iterations of kernel_untar_build.sh
without error. I will try a bit more, but I think we should keep these two bugs
on a sort of inactive status (1793 is already "Need More Info" but 2081 was
not until now) until we start seeing occurrences again. A kernel core dump
would also be helpful, once we get that in place (see bug 2127).
#16 Updated by Alex Elder over 11 years ago
Another 100 iterations of kernel_untar_build.sh using the current
master branch (c666601a935b94cc0f3310339411b6940de751ba) completed
without error. I'm going to quit testing this now and wait to see if
our regular automated testing produces it again.
#17 Updated by Alex Elder over 11 years ago
- Status changed from Need More Info to Can't reproduce
Marking this Can't Reproduce. Will reopen if it shows up again.