Bug #144: GPF at con_close_socket+0x40/0x9f - Linux kernel client - Ceph

closed

Added by Sage Weil almost 14 years ago. Updated over 13 years ago.

Status: Can't reproduce

Priority: Normal

Target version: v2.6.35

Description

[12834.543677] general protection fault: 0000 [#1] PREEMPT SMP 
[12834.547396] last sysfs file: /sys/kernel/uevent_seqnum
[12834.547396] CPU 1 
[12834.547396] Modules linked in: ceph aes_x86_64 aes_generic fan ac battery psmouse ehci_hcd ohci_hcd ide_pci_generic thermal button processor [last unloaded: ceph]
[12834.547396] 
[12834.547396] Pid: 2661, comm: ceph-msgr/1 Not tainted 2.6.34 #29 H8SSL-I2/H8SSL-I2
[12834.547396] RIP: 0010:[<ffffffffa0102f1f>]  [<ffffffffa0102f1f>] con_close_socket+0x40/0x9f [ceph]
[12834.547396] RSP: 0018:ffff88010c5add50  EFLAGS: 00010206
[12834.547396] RAX: 00000000ffffffff RBX: 0000000000000000 RCX: ffff88010da4b2e0
[12834.547396] RDX: 0000000000000001 RSI: 0000000000000002 RDI: 6b6b6b6b6b6b6b6b
[12834.547396] RBP: ffff88010c5add60 R08: 0000000000000000 R09: 0000000000000002
[12834.547396] R10: 0000000000000000 R11: ffff88010c4517b8 R12: ffff88010bbcd128
[12834.547396] R13: ffff88010bbcd128 R14: ffff88010bbcd340 R15: ffff88010bbcd330
[12834.547396] FS:  00007ff94b2926e0(0000) GS:ffff880002800000(0000) knlGS:0000000000000000
[12834.547396] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[12834.547396] CR2: 00007f98afe6c000 CR3: 000000010da04000 CR4: 00000000000006e0
[12834.547396] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[12834.547396] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[12834.547396] Process ceph-msgr/1 (pid: 2661, threadinfo ffff88010c5ac000, task ffff88010da4ac80)
[12834.547396] Stack:
[12834.547396]  ffffffffa012c313 0000000000000001 ffff88010c5addb0 ffffffffa0106920
[12834.547396] <0> ffff88010bbcd6a8 ffff88010bbcd298 ffff88010bbcd168 ffff88010bbcd6b0
[12834.547396] <0> ffff8800029d9500 ffff88010bbcd6a8 ffff88010da4ac80 ffff88010da4ac80
[12834.547396] Call Trace:
[12834.547396]  [<ffffffffa0106920>] con_work+0x105/0x6bc [ceph]
[12834.547396]  [<ffffffff8104786b>] worker_thread+0x1e8/0x2fa
[12834.547396]  [<ffffffff81047812>] ? worker_thread+0x18f/0x2fa
[12834.547396]  [<ffffffffa010681b>] ? con_work+0x0/0x6bc [ceph]
[12834.547396]  [<ffffffff8104a990>] ? autoremove_wake_function+0x0/0x38
[12834.547396]  [<ffffffff81047683>] ? worker_thread+0x0/0x2fa
[12834.547396]  [<ffffffff8104a65e>] kthread+0x7d/0x85
[12834.547396]  [<ffffffff810037d4>] kernel_thread_helper+0x4/0x10
[12834.547396]  [<ffffffff81429380>] ? restore_args+0x0/0x30
[12834.547396]  [<ffffffff8104a5e1>] ? kthread+0x0/0x85
[12834.547396]  [<ffffffff810037d0>] ? kernel_thread_helper+0x0/0x10
[12834.547396] Code: ca 54 37 e2 02 74 09 80 3d 7d 8f 03 00 00 75 40 31 db 49 83 7c 24 20 00 74 69 f0 41 80 4c 24 29 08 49 8b 7c 24 20 be 02 00 00 00 <48> 8b 47 78 ff 50 60 49 8b 7c 24 20 89 c3 e8 59 6a 28 e1 49 c7 
[12834.547396] RIP  [<ffffffffa0102f1f>] con_close_socket+0x40/0x9f [ceph]
[12834.547396]  RSP <ffff88010c5add50>
[12834.796895] ---[ end trace 1432bc2d2c7624aa ]---
[12836.053594] Slab corruption: size-2048 start=ffff88010bbcd0f8, len=2048
[12836.060294] Redzone: 0x9f911029d74e35b/0x9f911029d74e35b.
[12836.065773] Last user: [<ffffffffa01106b9>](put_osd+0x3f/0x82 [ceph])
[12836.072468] 050: 6b 6b 6b 6b 6b 6b 6b 6b 4b 4b 6b 6b 6b 6b 6b 6b
[12836.085814] Prev obj: start=ffff88010bbcc8e0, len=2048
[12836.091027] Redzone: 0xd84156c5635688c0/0xd84156c5635688c0.
[12836.096679] Last user: [<ffffffff810a8488>](__kmalloc_node_track_caller+0x24/0x29)
[12836.104492] 000: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
[12836.111761] 010: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a

Updated by Yehuda Sadeh almost 14 years ago

What was the specific scenario? Can it be reproduced?

Updated by Sage Weil almost 14 years ago

Yeah, I think this is related to #163, but I still don't know how that would cause this problem. The basic issue is that con refs are taken when work is queued on the msgr workqueue, and ceph_con structs are embedded in caller structs (ceph_osd, etc.), so ->put is an op implemented by the caller. When the caller shuts down, there may still be work queued on the connection, so the final ->put may come after things have been torn down. And when the final put_osd looks at client->monc->auth->... there's a problem. The fix was to stop monc last, after flushing the workqueue.
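
To make the lifetime issue above concrete, here is a minimal userspace sketch, not the actual ceph code. All names (struct conn, struct osd, queue_con, run_con_work, demo_late_put) are hypothetical stand-ins; the only things taken from the description are that the connection is embedded in its owner, that queueing work takes a ref through the owner's op, and that the final put can therefore run after the owner has dropped its own ref.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct conn;

/* Ref ops are supplied by whoever embeds the connection (hypothetical). */
struct conn_ops {
    void (*get)(struct conn *con);
    void (*put)(struct conn *con);
};

struct conn {
    const struct conn_ops *ops;
};

/* Stand-in for an owner struct such as ceph_osd: the con is embedded. */
struct osd {
    int refs;
    struct conn con;
};

int osds_freed; /* counts frees so the ordering can be observed */

static struct osd *con_to_osd(struct conn *con)
{
    return (struct osd *)((char *)con - offsetof(struct osd, con));
}

static void osd_get(struct conn *con)
{
    con_to_osd(con)->refs++;
}

static void osd_put(struct conn *con)
{
    struct osd *o = con_to_osd(con);

    assert(o->refs > 0);
    if (--o->refs == 0) {
        /* Anything the final put touches here must still be alive. */
        free(o);
        osds_freed++;
    }
}

static const struct conn_ops osd_con_ops = { osd_get, osd_put };

static struct osd *osd_create(void)
{
    struct osd *o = calloc(1, sizeof(*o));

    if (!o)
        return NULL;
    o->refs = 1;              /* the owner's own reference */
    o->con.ops = &osd_con_ops;
    return o;
}

/* Queueing work pins the con (and so its owner) until the work runs. */
static struct conn *queue_con(struct conn *con)
{
    con->ops->get(con);
    return con;
}

/* The worker drops the ref taken at queue time when it finishes. */
static void run_con_work(struct conn *con)
{
    con->ops->put(con);
}

/* Returns how many osds were freed before the queued work ran. */
int demo_late_put(void)
{
    struct osd *o = osd_create();
    struct conn *pending;
    int freed_before_work;

    if (!o)
        return -1;
    pending = queue_con(&o->con);   /* refs: 2 */
    osd_put(&o->con);               /* owner shuts down, refs: 1 */
    freed_before_work = osds_freed;
    run_con_work(pending);          /* final put happens here, later */
    return freed_before_work;
}
```

In this sketch the object is only freed inside run_con_work(), so whatever the final put dereferences has to outlive the workqueue flush, which is consistent with stopping monc last.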

What does the 'last user' in

[12836.053594] Slab corruption: size-2048 start=ffff88010bbcd0f8, len=2048
[12836.060294] Redzone: 0x9f911029d74e35b/0x9f911029d74e35b.
[12836.065773] Last user: [<ffffffffa01106b9>](put_osd+0x3f/0x82 [ceph])

really mean? This error makes it look more like the ref counting on the con itself was wrong. :/
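
As a hedged aid for reading the dump, the sketch below classifies the bytes from the "050:" row using the slab poison values from include/linux/poison.h (POISON_INUSE 0x5a, POISON_FREE 0x6b, POISON_END 0xa5). The helper names are hypothetical. As far as I understand the slab debug code, "Last user" is the caller address recorded at the most recent alloc or free of the object, so on an object filled with 0x6b it would name whoever freed it; treat that reading as an assumption rather than a confirmed answer.

```c
#include <stddef.h>

/* Poison values as defined in include/linux/poison.h. */
#define POISON_INUSE 0x5a /* allocated but not yet initialised */
#define POISON_FREE  0x6b /* object has been freed */
#define POISON_END   0xa5 /* last byte of a freed object */

/* The "050:" row of the corrupted object, copied from the log above. */
const unsigned char row_050[16] = {
    0x6b, 0x6b, 0x6b, 0x6b, 0x6b, 0x6b, 0x6b, 0x6b,
    0x4b, 0x4b, 0x6b, 0x6b, 0x6b, 0x6b, 0x6b, 0x6b,
};

/* Describe a single byte of a slab debug dump (hypothetical helper). */
const char *poison_name(unsigned char b)
{
    switch (b) {
    case POISON_INUSE: return "in use, uninitialised";
    case POISON_FREE:  return "freed";
    case POISON_END:   return "end of freed object";
    default:           return "modified";
    }
}

/* Count bytes that no longer hold the free poison, i.e. were written
 * after the object was freed. */
size_t count_not_free_poison(const unsigned char *row, size_t len)
{
    size_t i, n = 0;

    for (i = 0; i < len; i++)
        if (row[i] != POISON_FREE)
            n++;
    return n;
}
```

Two bytes in that row differ from the free poison, and 0x6b ^ 0x4b is 0x20, so each differs by a single cleared bit. The RDI value 6b6b6b6b6b6b6b6b in the oops is the same free poison pattern, which points at a pointer loaded from already freed memory.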
