Bug #1793: NULL pointer dereference at try_write+0x627/0x1060 - Linux kernel client - Ceph

Bug #1793

closed

NULL pointer dereference at try_write+0x627/0x1060

Added by Josh Durgin over 12 years ago. Updated about 12 years ago.

Status: Can't reproduce
Priority: High
Assignee: Alex Elder
Category: libceph
Target version: v3.3

Description

Found in sepia50's console:

[ 8727.491802] Call Trace:
[ 8727.491802]  [<ffffffff8107f0c5>] wq_worker_sleeping+0x15/0xa0
[ 8727.491802]  [<ffffffff81600763>] __schedule+0x5c3/0x940
[ 8727.491802]  [<ffffffff81066e81>] ? do_exit+0x551/0x880
[ 8727.491802]  [<ffffffff81065d7d>] ? release_task+0x1d/0x470
[ 8727.491802]  [<ffffffff81600e0f>] schedule+0x3f/0x60
[ 8727.491802]  [<ffffffff81066ef7>] do_exit+0x5c7/0x880
[ 8727.491802]  [<ffffffff81063b05>] ? kmsg_dump+0x75/0x140
[ 8727.491802]  [<ffffffff81604ac0>] oops_end+0xb0/0xf0
[ 8727.491802]  [<ffffffff8103f85d>] no_context+0xfd/0x270
[ 8727.491802]  [<ffffffff814edfd5>] ? release_sock+0x35/0x190
[ 8727.491802]  [<ffffffff8103fb15>] __bad_area_nosemaphore+0x145/0x230
[ 8727.491802]  [<ffffffff810509a3>] ? get_parent_ip+0x33/0x50
[ 8727.491802]  [<ffffffff8103fc13>] bad_area_nosemaphore+0x13/0x20
[ 8727.491802]  [<ffffffff8160763e>] do_page_fault+0x34e/0x4b0
[ 8727.491802]  [<ffffffff814ee0e4>] ? release_sock+0x144/0x190
[ 8727.491802]  [<ffffffff815429e5>] ? tcp_sendpage+0xc5/0x590
[ 8727.491802]  [<ffffffff81313fbd>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[ 8727.491802]  [<ffffffff81603ef5>] page_fault+0x25/0x30
[ 8727.491802]  [<ffffffffa00c6a87>] ? try_write+0x627/0x1060 [libceph]
[ 8727.491802]  [<ffffffff814e8db4>] ? kernel_recvmsg+0x44/0x60
[ 8727.491802]  [<ffffffffa00c5748>] ? ceph_tcp_recvmsg+0x48/0x60 [libceph]
[ 8727.491802]  [<ffffffffa00c80d0>] con_work+0xc10/0x1b00 [libceph]
[ 8727.491802]  [<ffffffff81313f7e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[ 8727.491802]  [<ffffffff81053978>] ? finish_task_switch+0x48/0x120
[ 8727.491802]  [<ffffffff8107fcb6>] process_one_work+0x1a6/0x520
[ 8727.491802]  [<ffffffff8107fc47>] ? process_one_work+0x137/0x520
[ 8727.491802]  [<ffffffffa00c74c0>] ? try_write+0x1060/0x1060 [libceph]
[ 8727.491802]  [<ffffffff81081fe3>] worker_thread+0x173/0x400
[ 8727.491802]  [<ffffffff81081e70>] ? manage_workers+0x210/0x210
[ 8727.491802]  [<ffffffff81087026>] kthread+0xb6/0xc0
[ 8727.491802]  [<ffffffff8160dd44>] kernel_thread_helper+0x4/0x10
[ 8727.491802]  [<ffffffff81603c74>] ? retint_restore_args+0x13/0x13
[ 8727.491802]  [<ffffffff81086f70>] ? __init_kthread_worker+0x70/0x70
[ 8727.491802]  [<ffffffff8160dd40>] ? gs_change+0x13/0x13
[ 8727.491802] Code: 66 66 66 90 65 48 8b 04 25 40 c4 00 00 48 8b 80 48 03 00 00 8b 40 f0 c9 c3 66 90 55 48 89 e5 66 66 66 66 90 48 8b 87 48 03 00 00 
[ 8727.491802]  8b 40 f8 c9 c3 eb 08 90 90 90 90 90 90 90 90 55 48 89 e5 66 
[ 8727.491802] RIP  [<ffffffff81086a10>] kthread_data+0x10/0x20
[ 8727.491802]  RSP <ffff8800455b76f8>
[ 8727.491802] CR2: fffffffffffffff8
[ 8727.491802] ---[ end trace 9654ef6c74784bc5 ]---
[ 8727.491802] Fixing recursive fault but reboot is needed!
[ 8727.491802] BUG: spinlock lockup on CPU#0, kworker/0:2/5057
[ 8727.491802]  lock: ffff8800fbc13b80, .magic: dead4ead, .owner: kworker/0:2/5057, .owner_cpu: 0
[ 8727.491802] Pid: 5057, comm: kworker/0:2 Tainted: G      D     3.1.0-ceph-08936-gb643987 #1
[ 8727.491802] Call Trace:
[ 8727.491802]  [<ffffffff8131a178>] spin_dump+0x78/0xc0
[ 8727.491802]  [<ffffffff8131a39d>] do_raw_spin_lock+0xed/0x120
[ 8727.491802]  [<ffffffff816031e5>] _raw_spin_lock_irq+0x45/0x50
[ 8727.491802]  [<ffffffff81600277>] ? __schedule+0xd7/0x940
[ 8727.491802]  [<ffffffff81600277>] __schedule+0xd7/0x940
[ 8727.491802]  [<ffffffff81600e0f>] schedule+0x3f/0x60
[ 8727.491802]  [<ffffffff81067164>] do_exit+0x834/0x880
[ 8727.491802]  [<ffffffff81063b95>] ? kmsg_dump+0x105/0x140
[ 8727.491802]  [<ffffffff81063b05>] ? kmsg_dump+0x75/0x140
[ 8727.491802]  [<ffffffff81604ac0>] oops_end+0xb0/0xf0
[ 8727.491802]  [<ffffffff8103f85d>] no_context+0xfd/0x270
[ 8727.491802]  [<ffffffff8103fb15>] __bad_area_nosemaphore+0x145/0x230
[ 8727.491802]  [<ffffffff811b3af0>] ? fsnotify_clear_marks_by_inode+0x30/0xf0
[ 8727.491802]  [<ffffffff8103fc13>] bad_area_nosemaphore+0x13/0x20
[ 8727.491802]  [<ffffffff8160763e>] do_page_fault+0x34e/0x4b0
[ 8727.491802]  [<ffffffff81057bc4>] ? cpuacct_charge+0x24/0xb0
[ 8727.491802]  [<ffffffff81057bc4>] ? cpuacct_charge+0x24/0xb0
[ 8727.491802]  [<ffffffff81313fbd>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[ 8727.491802]  [<ffffffff81603ef5>] page_fault+0x25/0x30
[ 8727.491802]  [<ffffffff81086a10>] ? kthread_data+0x10/0x20
[ 8727.491802]  [<ffffffff8107f0c5>] wq_worker_sleeping+0x15/0xa0
[ 8727.491802]  [<ffffffff81600763>] __schedule+0x5c3/0x940
[ 8727.491802]  [<ffffffff81066e81>] ? do_exit+0x551/0x880
[ 8727.491802]  [<ffffffff81065d7d>] ? release_task+0x1d/0x470
[ 8727.491802]  [<ffffffff81600e0f>] schedule+0x3f/0x60
[ 8727.491802]  [<ffffffff81066ef7>] do_exit+0x5c7/0x880
[ 8727.491802]  [<ffffffff81063b05>] ? kmsg_dump+0x75/0x140
[ 8727.491802]  [<ffffffff81604ac0>] oops_end+0xb0/0xf0
[ 8727.491802]  [<ffffffff8103f85d>] no_context+0xfd/0x270
[ 8727.491802]  [<ffffffff814edfd5>] ? release_sock+0x35/0x190
[ 8727.491802]  [<ffffffff8103fb15>] __bad_area_nosemaphore+0x145/0x230
[ 8727.491802]  [<ffffffff810509a3>] ? get_parent_ip+0x33/0x50
[ 8727.491802]  [<ffffffff8103fc13>] bad_area_nosemaphore+0x13/0x20
[ 8727.491802]  [<ffffffff8160763e>] do_page_fault+0x34e/0x4b0
[ 8727.491802]  [<ffffffff814ee0e4>] ? release_sock+0x144/0x190
[ 8727.491802]  [<ffffffff815429e5>] ? tcp_sendpage+0xc5/0x590
[ 8727.491802]  [<ffffffff81313fbd>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[ 8727.491802]  [<ffffffff81603ef5>] page_fault+0x25/0x30
[ 8727.491802]  [<ffffffffa00c6a87>] ? try_write+0x627/0x1060 [libceph]
[ 8727.491802]  [<ffffffff814e8db4>] ? kernel_recvmsg+0x44/0x60
[ 8727.491802]  [<ffffffffa00c5748>] ? ceph_tcp_recvmsg+0x48/0x60 [libceph]
[ 8727.491802]  [<ffffffffa00c80d0>] con_work+0xc10/0x1b00 [libceph]
[ 8727.491802]  [<ffffffff81313f7e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[ 8727.491802]  [<ffffffff81053978>] ? finish_task_switch+0x48/0x120
[ 8727.491802]  [<ffffffff8107fcb6>] process_one_work+0x1a6/0x520
[ 8727.491802]  [<ffffffff8107fc47>] ? process_one_work+0x137/0x520
[ 8727.491802]  [<ffffffffa00c74c0>] ? try_write+0x1060/0x1060 [libceph]
[ 8727.491802]  [<ffffffff81081fe3>] worker_thread+0x173/0x400
[ 8727.491802]  [<ffffffff81081e70>] ? manage_workers+0x210/0x210
[ 8727.491802]  [<ffffffff81087026>] kthread+0xb6/0xc0
[ 8727.491802]  [<ffffffff8160dd44>] kernel_thread_helper+0x4/0x10
[ 8727.491802]  [<ffffffff81603c74>] ? retint_restore_args+0x13/0x13
[ 8727.491802]  [<ffffffff81086f70>] ? __init_kthread_worker+0x70/0x70
[ 8727.491802]  [<ffffffff8160dd40>] ? gs_change+0x13/0x13

Related issues 1 (0 open — 1 closed)

Updated by Josh Durgin over 12 years ago

Got the same trace on sepia18 while running mkfs.ext3 on an rbd image.

Updated by Josh Durgin over 12 years ago

A disk error prevented me from getting logs before:

[  226.628240] sd 2:0:0:0: [sda] Unhandled error code
[  226.637615] sd 2:0:0:0: [sda]  Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[  226.649859] sd 2:0:0:0: [sda] CDB: Write(10): 2a 00 39 84 61 f0 00 00 10 00
[  226.661756] end_request: I/O error, dev sda, sector 964977136
[  226.672023] Aborting journal on device sda1-8.
[  226.680985] sd 2:0:0:0: [sda] Unhandled error code
[  226.690213] sd 2:0:0:0: [sda]  Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[  226.702446] sd 2:0:0:0: [sda] CDB: Write(10): 2a 00 39 84 08 00 00 00 08 00
[  226.714316] end_request: I/O error, dev sda, sector 964954112
[  226.724576] quiet_error: 24 callbacks suppressed
[  226.733687] Buffer I/O error on device sda1, logical block 120619008
[  226.744608] lost page write due to I/O error on sda1
[  226.744620] JBD2: I/O error detected when updating journal superblock for sda1-8.

But now I got the full trace of what happens before the page fault handling:

Dec  9 15:44:44 sepia18 kernel: [  126.472614] libceph: client4104 fsid b10de2f1-02b3-4bd7-90cb-544fa7dca7d1
Dec  9 15:44:44 sepia18 kernel: [  126.472802] libceph: mon0 10.3.14.141:6789 session established
Dec  9 15:44:44 sepia18 kernel: [  126.474825]  rbd0: unknown partition table
Dec  9 15:44:44 sepia18 kernel: [  126.474907] rbd: rbd0: added with size 0x500000000
Dec  9 15:44:45 sepia18 kernel: [  127.043581] libceph: osd1 up
Dec  9 15:44:45 sepia18 kernel: [  127.043585] libceph: osd1 weight 0x10000 (in)
Dec  9 15:44:45 sepia18 kernel: [  127.059822] libceph: osd0 10.3.14.141:6800 socket closed
Dec  9 15:44:45 sepia18 kernel: [  127.073590] BUG: unable to handle kernel NULL pointer dereference at 0000000000000048
Dec  9 15:44:45 sepia18 kernel: [  127.081594] IP: [<ffffffffa02c5a87>] try_write+0x627/0x1060 [libceph]
Dec  9 15:44:45 sepia18 kernel: [  127.083535] PGD 37b7c067 PUD 37bbc067 PMD 0 
Dec  9 15:44:45 sepia18 kernel: [  127.083535] Oops: 0000 [#1] SMP 
Dec  9 15:44:45 sepia18 kernel: [  127.083535] CPU 1 
Dec  9 15:44:45 sepia18 kernel: [  127.083535] Modules linked in: cryptd aes_x86_64 aes_generic rbd libceph radeon ttm drm_kms_helper drm lp ppdev parport_pc parport shpchp i2c_algo_bit i3000_edac edac_core serio_raw ahci libahci floppy e1000e btrfs zlib_deflate crc32c libcrc32c
Dec  9 15:44:45 sepia18 kernel: [  127.083535] 
Dec  9 15:44:45 sepia18 kernel: [  127.083535] Pid: 46, comm: kworker/1:2 Not tainted 3.1.0-ceph-08936-gfb6ee37 #1 Supermicro PDSMi/PDSMi+
Dec  9 15:44:45 sepia18 kernel: [  127.083535] RIP: 0010:[<ffffffffa02c5a87>]  [<ffffffffa02c5a87>] try_write+0x627/0x1060 [libceph]
Dec  9 15:44:45 sepia18 kernel: [  127.083535] RSP: 0018:ffff88010368bb00  EFLAGS: 00010246
Dec  9 15:44:45 sepia18 kernel: [  127.083535] RAX: 0000000000000000 RBX: ffff880037be3830 RCX: 0000000000000000
Dec  9 15:44:45 sepia18 kernel: [  127.083535] RDX: 0000000000061000 RSI: 0000000000000001 RDI: 0000000000000000
Dec  9 15:44:45 sepia18 kernel: [  127.083535] RBP: ffff88010368bc20 R08: 0000000000000000 R09: 0000000000000000
Dec  9 15:44:45 sepia18 kernel: [  127.083535] R10: 0000000000000000 R11: 0000000000000000 R12: ffffea0002d6b1c0
Dec  9 15:44:45 sepia18 kernel: [  127.083535] R13: 000000000001f000 R14: ffff880037848b00 R15: 0000000000000000
Dec  9 15:44:45 sepia18 kernel: [  127.083535] FS:  0000000000000000(0000) GS:ffff88011fc80000(0000) knlGS:0000000000000000
Dec  9 15:44:45 sepia18 kernel: [  127.083535] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Dec  9 15:44:45 sepia18 kernel: [  127.083535] CR2: 0000000000000048 CR3: 0000000037af4000 CR4: 00000000000006e0
Dec  9 15:44:45 sepia18 kernel: [  127.083535] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Dec  9 15:44:45 sepia18 kernel: [  127.083535] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Dec  9 15:44:45 sepia18 kernel: [  127.083535] Process kworker/1:2 (pid: 46, threadinfo ffff88010368a000, task ffff88010364bf60)
Dec  9 15:44:45 sepia18 kernel: [  127.260013] Stack:
Dec  9 15:44:45 sepia18 kernel: [  127.260013]  0000000100004040 ffff88010319a800 0000000000000000 0000000000000000
Dec  9 15:44:45 sepia18 kernel: [  127.260013]  ffff88010368bbd0 0000000000000000 ffff88010860be91 ffff88010860be00
Dec  9 15:44:45 sepia18 kernel: [  127.260013]  ffff880037be3c10 ffff880037be3a40 ffff880037be3870 ffff880037be3a30
Dec  9 15:44:45 sepia18 kernel: [  127.260013] Call Trace:
Dec  9 15:44:45 sepia18 kernel: [  127.260013]  [<ffffffff814e8db4>] ? kernel_recvmsg+0x44/0x60
Dec  9 15:44:45 sepia18 kernel: [  127.260013]  [<ffffffffa02c4748>] ? ceph_tcp_recvmsg+0x48/0x60 [libceph]
Dec  9 15:44:45 sepia18 kernel: [  127.260013]  [<ffffffffa02c70d0>] con_work+0xc10/0x1b00 [libceph]
Dec  9 15:44:45 sepia18 kernel: [  127.260013]  [<ffffffff81053978>] ? finish_task_switch+0x48/0x120
Dec  9 15:44:45 sepia18 kernel: [  127.260013]  [<ffffffff8107fcb6>] process_one_work+0x1a6/0x520
Dec  9 15:44:45 sepia18 kernel: [  127.260013]  [<ffffffff8107fc47>] ? process_one_work+0x137/0x520
Dec  9 15:44:45 sepia18 kernel: [  127.260013]  [<ffffffffa02c64c0>] ? try_write+0x1060/0x1060 [libceph]
Dec  9 15:44:45 sepia18 kernel: [  127.260013]  [<ffffffff81081fe3>] worker_thread+0x173/0x400
Dec  9 15:44:45 sepia18 kernel: [  127.260013]  [<ffffffff81081e70>] ? manage_workers+0x210/0x210
Dec  9 15:44:45 sepia18 kernel: [  127.260013]  [<ffffffff81087026>] kthread+0xb6/0xc0
Dec  9 15:44:45 sepia18 kernel: [  127.260013]  [<ffffffff8109ee55>] ? trace_hardirqs_on_caller+0x105/0x190
Dec  9 15:44:45 sepia18 kernel: [  127.260013]  [<ffffffff8160dd44>] kernel_thread_helper+0x4/0x10
Dec  9 15:44:45 sepia18 kernel: [  127.260013]  [<ffffffff81603c74>] ? retint_restore_args+0x13/0x13
Dec  9 15:44:45 sepia18 kernel: [  127.260013]  [<ffffffff81086f70>] ? __init_kthread_worker+0x70/0x70
Dec  9 15:44:45 sepia18 kernel: [  127.260013]  [<ffffffff8160dd40>] ? gs_change+0x13/0x13
Dec  9 15:44:45 sepia18 kernel: [  127.260013] Code: 7c fb ff ff 49 83 be 90 00 00 00 00 0f 84 d2 04 00 00 49 63 8e a0 00 00 00 49 8b 86 98 00 00 00 44 8b 9d 5c ff ff ff 48 c1 e1 04 
Dec  9 15:44:45 sepia18 kernel: [  127.260013] RIP  [<ffffffffa02c5a87>] try_write+0x627/0x1060 [libceph]
Dec  9 15:44:45 sepia18 kernel: [  127.260013]  RSP <ffff88010368bb00>
Dec  9 15:44:45 sepia18 kernel: [  127.260013] CR2: 0000000000000048
Dec  9 15:44:45 sepia18 kernel: [  127.264188] ---[ end trace 6f04513fdb7876aa ]---
Dec  9 15:44:45 sepia18 kernel: [  127.264332] BUG: unable to handle kernel paging request at fffffffffffffff8
Dec  9 15:44:45 sepia18 kernel: [  127.264334] IP: [<ffffffff81086a10>] kthread_data+0x10/0x20
Dec  9 15:44:45 sepia18 kernel: [  127.264339] PGD 1c07067 PUD 1c08067 PMD 0 
Dec  9 15:44:45 sepia18 kernel: [  127.264342] Oops: 0000 [#2] SMP 
Dec  9 15:44:45 sepia18 kernel: [  127.264344] CPU 1 
Dec  9 15:44:45 sepia18 kernel: [  127.264345] Modules linked in: cryptd aes_x86_64 aes_generic rbd libceph radeon ttm drm_kms_helper drm lp ppdev parport_pc parport shpchp i2c_algo_bit i3000_edac edac_core serio_raw ahci libahci floppy e1000e btrfs zlib_deflate crc32c libcrc32c
Dec  9 15:44:45 sepia18 kernel: [  127.264359] 
Dec  9 15:44:45 sepia18 kernel: [  127.264361] Pid: 46, comm: kworker/1:2 Tainted: G      D     3.1.0-ceph-08936-gfb6ee37 #1 Supermicro PDSMi/PDSMi+
Dec  9 15:44:45 sepia18 kernel: [  127.264364] RIP: 0010:[<ffffffff81086a10>]  [<ffffffff81086a10>] kthread_data+0x10/0x20
Dec  9 15:44:45 sepia18 kernel: [  127.264368] RSP: 0018:ffff88010368b6f8  EFLAGS: 00010092
Dec  9 15:44:45 sepia18 kernel: [  127.264370] RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000001
Dec  9 15:44:45 sepia18 kernel: [  127.264372] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff88010364bf60
Dec  9 15:44:45 sepia18 kernel: [  127.264374] RBP: ffff88010368b6f8 R08: 0000000000989680 R09: 0000000000000001
Dec  9 15:44:45 sepia18 kernel: [  127.264376] R10: 0000000000000400 R11: 0000000000000000 R12: ffff88010364c300
Dec  9 15:44:45 sepia18 kernel: [  127.264378] R13: 0000000000000001 R14: 0000000000000001 R15: ffff88010368b820
Dec  9 15:44:45 sepia18 kernel: [  127.264381] FS:  0000000000000000(0000) GS:ffff88011fc80000(0000) knlGS:0000000000000000
Dec  9 15:44:45 sepia18 kernel: [  127.264383] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Dec  9 15:44:45 sepia18 kernel: [  127.264385] CR2: fffffffffffffff8 CR3: 0000000037af4000 CR4: 00000000000006e0
Dec  9 15:44:45 sepia18 kernel: [  127.264387] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Dec  9 15:44:45 sepia18 kernel: [  127.264389] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Dec  9 15:44:45 sepia18 kernel: [  127.264391] Process kworker/1:2 (pid: 46, threadinfo ffff88010368a000, task ffff88010364bf60)
Dec  9 15:44:45 sepia18 kernel: [  127.264393] Stack:
Dec  9 15:44:45 sepia18 kernel: [  127.264394]  ffff88010368b718 ffffffff8107f0c5 ffff88010368b718 ffff88011fc93b80
Dec  9 15:44:45 sepia18 kernel: [  127.264398]  ffff88010368b7b8 ffffffff81600763 0000000000000000 ffffffff81066e81
Dec  9 15:44:45 sepia18 kernel: [  127.264401]  ffff88010364bf60 0000000000013b80 ffff88010368bfd8 ffff88010368a010
Dec  9 15:44:45 sepia18 kernel: [  127.264404] Call Trace:
Dec  9 15:44:45 sepia18 kernel: [  127.264407]  [<ffffffff8107f0c5>] wq_worker_sleeping+0x15/0xa0
Dec  9 15:44:45 sepia18 kernel: [  127.264410]  [<ffffffff81600763>] __schedule+0x5c3/0x940
Dec  9 15:44:45 sepia18 kernel: [  127.264414]  [<ffffffff81066e81>] ? do_exit+0x551/0x880
Dec  9 15:44:45 sepia18 kernel: [  127.264416]  [<ffffffff81065d7d>] ? release_task+0x1d/0x470
Dec  9 15:44:45 sepia18 kernel: [  127.264419]  [<ffffffff81600e0f>] schedule+0x3f/0x60
Dec  9 15:44:45 sepia18 kernel: [  127.264421]  [<ffffffff81066ef7>] do_exit+0x5c7/0x880
Dec  9 15:44:45 sepia18 kernel: [  127.264424]  [<ffffffff81063b05>] ? kmsg_dump+0x75/0x140
Dec  9 15:44:45 sepia18 kernel: [  127.264427]  [<ffffffff81604ac0>] oops_end+0xb0/0xf0
Dec  9 15:44:45 sepia18 kernel: [  127.264431]  [<ffffffff8103f85d>] no_context+0xfd/0x270
Dec  9 15:44:45 sepia18 kernel: [  127.264434]  [<ffffffff814edfd5>] ? release_sock+0x35/0x190
Dec  9 15:44:45 sepia18 kernel: [  127.264437]  [<ffffffff8103fb15>] __bad_area_nosemaphore+0x145/0x230
Dec  9 15:44:45 sepia18 kernel: [  127.264440]  [<ffffffff810509a3>] ? get_parent_ip+0x33/0x50
Dec  9 15:44:45 sepia18 kernel: [  127.264443]  [<ffffffff8103fc13>] bad_area_nosemaphore+0x13/0x20
Dec  9 15:44:45 sepia18 kernel: [  127.264446]  [<ffffffff8160763e>] do_page_fault+0x34e/0x4b0
Dec  9 15:44:45 sepia18 kernel: [  127.264449]  [<ffffffff814ee0e4>] ? release_sock+0x144/0x190
Dec  9 15:44:45 sepia18 kernel: [  127.264453]  [<ffffffff815429e5>] ? tcp_sendpage+0xc5/0x590
Dec  9 15:44:45 sepia18 kernel: [  127.264457]  [<ffffffff81313fbd>] ? trace_hardirqs_off_thunk+0x3a/0x3c
Dec  9 15:44:45 sepia18 kernel: [  127.264460]  [<ffffffff81603ef5>] page_fault+0x25/0x30
Dec  9 15:44:45 sepia18 kernel: [  127.264467]  [<ffffffffa02c5a87>] ? try_write+0x627/0x1060 [libceph]
Dec  9 15:44:45 sepia18 kernel: [  127.264470]  [<ffffffff814e8db4>] ? kernel_recvmsg+0x44/0x60
Dec  9 15:44:45 sepia18 kernel: [  127.264476]  [<ffffffffa02c4748>] ? ceph_tcp_recvmsg+0x48/0x60 [libceph]
Dec  9 15:44:45 sepia18 kernel: [  127.264483]  [<ffffffffa02c70d0>] con_work+0xc10/0x1b00 [libceph]
Dec  9 15:44:45 sepia18 kernel: [  127.264486]  [<ffffffff81053978>] ? finish_task_switch+0x48/0x120
Dec  9 15:44:45 sepia18 kernel: [  127.264489]  [<ffffffff8107fcb6>] process_one_work+0x1a6/0x520
Dec  9 15:44:45 sepia18 kernel: [  127.264491]  [<ffffffff8107fc47>] ? process_one_work+0x137/0x520
Dec  9 15:44:45 sepia18 kernel: [  127.264498]  [<ffffffffa02c64c0>] ? try_write+0x1060/0x1060 [libceph]
Dec  9 15:44:45 sepia18 kernel: [  127.264501]  [<ffffffff81081fe3>] worker_thread+0x173/0x400
Dec  9 15:44:45 sepia18 kernel: [  127.264504]  [<ffffffff81081e70>] ? manage_workers+0x210/0x210
Dec  9 15:44:45 sepia18 kernel: [  127.264506]  [<ffffffff81087026>] kthread+0xb6/0xc0
Dec  9 15:44:45 sepia18 kernel: [  127.264510]  [<ffffffff8109ee55>] ? trace_hardirqs_on_caller+0x105/0x190
Dec  9 15:44:45 sepia18 kernel: [  127.264513]  [<ffffffff8160dd44>] kernel_thread_helper+0x4/0x10
Dec  9 15:44:45 sepia18 kernel: [  127.264516]  [<ffffffff81603c74>] ? retint_restore_args+0x13/0x13
Dec  9 15:44:45 sepia18 kernel: [  127.264519]  [<ffffffff81086f70>] ? __init_kthread_worker+0x70/0x70
Dec  9 15:44:45 sepia18 kernel: [  127.264522]  [<ffffffff8160dd40>] ? gs_change+0x13/0x13
Dec  9 15:44:45 sepia18 kernel: [  127.264523] Code: 66 66 66 90 65 48 8b 04 25 40 c4 00 00 48 8b 80 48 03 00 00 8b 40 f0 c9 c3 66 90 55 48 89 e5 66 66 66 66 90 48 8b 87 48 03 00 00 
Dec  9 15:44:45 sepia18 kernel: [  127.264546] RIP  [<ffffffff81086a10>] kthread_data+0x10/0x20
Dec  9 15:44:45 sepia18 kernel: [  127.264549]  RSP <ffff88010368b6f8>
Dec  9 15:44:45 sepia18 kernel: [  127.264551] CR2: fffffffffffffff8
Dec  9 15:44:45 sepia18 kernel: [  127.264553] ---[ end trace 6f04513fdb7876ab ]---
Dec  9 15:44:45 sepia18 kernel: [  127.264554] Fixing recursive fault but reboot is needed!
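
For reading the two oopses above, it can help to pull apart the `symbol+0xOFFSET/0xSIZE [module]` locations and to view the fault addresses as signed values. The following is a minimal sketch, not part of the original report; the helper names `parse_location` and `as_signed64` are made up for illustration. It shows that the first fault address (0x48) is a small positive offset from a NULL base, while the second (fffffffffffffff8) is -8, which fits a small negative offset from a zero base (RAX is 0 in that register dump).

```python
import re

# Matches the "symbol+0xOFFSET/0xSIZE [module]" form seen in the traces.
# The "[module]" part is optional (core kernel symbols have none).
LOCATION_RE = re.compile(
    r"(?P<symbol>[A-Za-z_][\w.]*)\+0x(?P<offset>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
    r"(?:\s+\[(?P<module>[^\]]+)\])?"
)


def parse_location(line):
    """Extract symbol, offset, function size and module from an oops line."""
    m = LOCATION_RE.search(line)
    if m is None:
        return None
    return {
        "symbol": m.group("symbol"),
        "offset": int(m.group("offset"), 16),
        "size": int(m.group("size"), 16),
        "module": m.group("module"),
    }


def as_signed64(addr):
    """Reinterpret an unsigned 64-bit address as a signed 64-bit integer."""
    return addr - (1 << 64) if addr & (1 << 63) else addr


first = parse_location("IP: [<ffffffffa02c5a87>] try_write+0x627/0x1060 [libceph]")
second = parse_location("IP: [<ffffffff81086a10>] kthread_data+0x10/0x20")
print(first)
print(second)
print(as_signed64(0x0000000000000048))   # CR2 of the first oops
print(as_signed64(0xFFFFFFFFFFFFFFF8))   # CR2 of the second oops
```

The second oops happens while the worker thread is exiting after the first one, so the try_write fault is the one to chase.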

Updated by Sage Weil over 12 years ago

Category set to libceph
Priority changed from Normal to High
Target version set to v3.2

Updated by Josh Durgin over 12 years ago

This happened again on sepia70 during the kernel untar build workunit on rbd.

Updated by Sage Weil over 12 years ago

Assignee set to Alex Elder

Updated by Josh Durgin over 12 years ago

This happened again with the same workload on sepia81.

Updated by Sage Weil over 12 years ago

Looks like a NULL pointer dereference. Look for a struct member that's 0x48 bytes in?
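
The idea behind that suggestion: when the base pointer is NULL, the faulting address equals the byte offset of the member being accessed, so 0x48 points at whichever member sits 0x48 bytes into some structure. Here is a rough sketch of that lookup using `ctypes`. The layout below (`HypotheticalStruct` and its field names) is invented purely for illustration and is NOT a real libceph structure; real offsets would have to come from the actual kernel headers or debug info (for example via a tool such as pahole).

```python
import ctypes


class HypotheticalStruct(ctypes.Structure):
    """Made-up layout for illustration only; not a real libceph structure."""
    _fields_ = [
        ("header", ctypes.c_uint8 * 0x40),  # bytes 0x00..0x3f
        ("field_a", ctypes.c_uint64),       # offset 0x40
        ("field_b", ctypes.c_uint64),       # offset 0x48
        ("field_c", ctypes.c_uint64),       # offset 0x50
    ]


def field_at_offset(struct_type, fault_addr):
    """Return the name of the field whose byte range covers fault_addr.

    Assumes the structure pointer was NULL, so the faulting address is
    the same as the member's offset within the structure.
    """
    for entry in struct_type._fields_:
        name = entry[0]
        desc = getattr(struct_type, name)  # field descriptor with .offset/.size
        if desc.offset <= fault_addr < desc.offset + desc.size:
            return name
    return None


print(field_at_offset(HypotheticalStruct, 0x48))
```

Running the same kind of lookup over the structures try_write touches (the connection, the outgoing message, and so on) would narrow down which pointer was NULL.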

Updated by Sage Weil over 12 years ago

Target version deleted (v3.2)

#10

Updated by Sage Weil over 12 years ago

Target version set to v3.3

#11

Updated by Sage Weil over 12 years ago

Subject changed from spinlock lockup to NULL pointer dereference at try_write+0x627/0x1060

#12

Updated by Alex Elder about 12 years ago

Is there a core file for this problem anywhere?

It would really be nice to poke around in the message, or the
connection structure, or basically anything. I was unable to
reproduce the problem before when trying to repeatedly run
the kernel_untar_build workunit, which was mentioned in bug
1866 as what was running when it happened there.

#13

Updated by Sage Weil about 12 years ago

Hmm.. yeah, I don't think we have anything beyond these console dumps. And we don't capture any kind of kernel core dumps (can/should we?).

Actually, I don't think we've seen this in more than a month. :/

#14

Updated by Sage Weil about 12 years ago

Status changed from New to Need More Info

#15

Updated by Alex Elder about 12 years ago

Bugs 1793 and 2081 have a signature of a page fault/bad memory reference
from process_one_work() -> con_work(), and they may well be the same issue.
So I have been treating them as such, and have been attempting to reproduce
them by running the kernel_untar_build.sh workunit repeatedly.

The last day or two I have been testing with the latest version of the
ceph-client code (c666601a93). One time I hit a kernel failure that had a
signature different from either 1793 or 2081--it reported:

INFO: rcu_sched detected stalls on CPUs/tasks: { 0} (detected by 4, t=492980 jiffies)

I believe this is bug 2174, and am adding an update to that to record
this occurrence.

I restarted testing and completed 50 more iterations of kernel_untar_build.sh
without error. I will try a bit more, but I think we should keep these two bugs
in a sort of inactive status (1793 is already "Need More Info" but 2081 was
not until now) until we start seeing occurrences again. A kernel core dump
would also be helpful, once we get that in place (see bug 2127).

#16

Updated by Alex Elder about 12 years ago

Another 100 iterations of kernel_untar_build.sh using the current
master branch (c666601a935b94cc0f3310339411b6940de751ba) without
error. I'm going to quit testing this now and wait to see if
our regular automated testing produces it again.
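
For anyone wanting to repeat this kind of soak test, a loop along the following lines would do. This is only a sketch under assumptions: `run_repeatedly` is a made-up helper, and the demo uses a trivial Python child process as a stand-in for the real workunit; the actual command line for kernel_untar_build.sh depends on the test setup and is not shown here.

```python
import subprocess
import sys


def run_repeatedly(cmd, iterations):
    """Run cmd up to `iterations` times, stopping at the first failure.

    Returns (successful_runs, failing_iteration); failing_iteration is
    None when every run exits with status 0.
    """
    for i in range(1, iterations + 1):
        result = subprocess.run(cmd)
        if result.returncode != 0:
            return i - 1, i
    return iterations, None


# Stand-in command for the demo; substitute the real workunit invocation.
ok, failed_at = run_repeatedly([sys.executable, "-c", "pass"], 3)
print(ok, failed_at)
```

Note that a kernel oops like the one in this bug would more likely hang or reboot the machine than produce a non-zero exit status, so the console output still needs to be captured separately.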

#17

Updated by Alex Elder about 12 years ago

Status changed from Need More Info to Can't reproduce

Marking this Can't Reproduce. Will reopen if it shows up again.
