Bug #144

closed

GPF at con_close_socket+0x40/0x9f

Added by Sage Weil almost 14 years ago. Updated over 13 years ago.

Status: Can't reproduce
Priority: Normal
Assignee: -
Category: -
Target version:
% Done: 0%


Description

[12834.543677] general protection fault: 0000 [#1] PREEMPT SMP 
[12834.547396] last sysfs file: /sys/kernel/uevent_seqnum
[12834.547396] CPU 1 
[12834.547396] Modules linked in: ceph aes_x86_64 aes_generic fan ac battery psmouse ehci_hcd ohci_hcd ide_pci_generic thermal button processor [last unloaded: ceph]
[12834.547396] 
[12834.547396] Pid: 2661, comm: ceph-msgr/1 Not tainted 2.6.34 #29 H8SSL-I2/H8SSL-I2
[12834.547396] RIP: 0010:[<ffffffffa0102f1f>]  [<ffffffffa0102f1f>] con_close_socket+0x40/0x9f [ceph]
[12834.547396] RSP: 0018:ffff88010c5add50  EFLAGS: 00010206
[12834.547396] RAX: 00000000ffffffff RBX: 0000000000000000 RCX: ffff88010da4b2e0
[12834.547396] RDX: 0000000000000001 RSI: 0000000000000002 RDI: 6b6b6b6b6b6b6b6b
[12834.547396] RBP: ffff88010c5add60 R08: 0000000000000000 R09: 0000000000000002
[12834.547396] R10: 0000000000000000 R11: ffff88010c4517b8 R12: ffff88010bbcd128
[12834.547396] R13: ffff88010bbcd128 R14: ffff88010bbcd340 R15: ffff88010bbcd330
[12834.547396] FS:  00007ff94b2926e0(0000) GS:ffff880002800000(0000) knlGS:0000000000000000
[12834.547396] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[12834.547396] CR2: 00007f98afe6c000 CR3: 000000010da04000 CR4: 00000000000006e0
[12834.547396] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[12834.547396] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[12834.547396] Process ceph-msgr/1 (pid: 2661, threadinfo ffff88010c5ac000, task ffff88010da4ac80)
[12834.547396] Stack:
[12834.547396]  ffffffffa012c313 0000000000000001 ffff88010c5addb0 ffffffffa0106920
[12834.547396] <0> ffff88010bbcd6a8 ffff88010bbcd298 ffff88010bbcd168 ffff88010bbcd6b0
[12834.547396] <0> ffff8800029d9500 ffff88010bbcd6a8 ffff88010da4ac80 ffff88010da4ac80
[12834.547396] Call Trace:
[12834.547396]  [<ffffffffa0106920>] con_work+0x105/0x6bc [ceph]
[12834.547396]  [<ffffffff8104786b>] worker_thread+0x1e8/0x2fa
[12834.547396]  [<ffffffff81047812>] ? worker_thread+0x18f/0x2fa
[12834.547396]  [<ffffffffa010681b>] ? con_work+0x0/0x6bc [ceph]
[12834.547396]  [<ffffffff8104a990>] ? autoremove_wake_function+0x0/0x38
[12834.547396]  [<ffffffff81047683>] ? worker_thread+0x0/0x2fa
[12834.547396]  [<ffffffff8104a65e>] kthread+0x7d/0x85
[12834.547396]  [<ffffffff810037d4>] kernel_thread_helper+0x4/0x10
[12834.547396]  [<ffffffff81429380>] ? restore_args+0x0/0x30
[12834.547396]  [<ffffffff8104a5e1>] ? kthread+0x0/0x85
[12834.547396]  [<ffffffff810037d0>] ? kernel_thread_helper+0x0/0x10
[12834.547396] Code: ca 54 37 e2 02 74 09 80 3d 7d 8f 03 00 00 75 40 31 db 49 83 7c 24 20 00 74 69 f0 41 80 4c 24 29 08 49 8b 7c 24 20 be 02 00 00 00 <48> 8b 47 78 ff 50 60 49 8b 7c 24 20 89 c3 e8 59 6a 28 e1 49 c7 
[12834.547396] RIP  [<ffffffffa0102f1f>] con_close_socket+0x40/0x9f [ceph]
[12834.547396]  RSP <ffff88010c5add50>
[12834.796895] ---[ end trace 1432bc2d2c7624aa ]---
[12836.053594] Slab corruption: size-2048 start=ffff88010bbcd0f8, len=2048
[12836.060294] Redzone: 0x9f911029d74e35b/0x9f911029d74e35b.
[12836.065773] Last user: [<ffffffffa01106b9>](put_osd+0x3f/0x82 [ceph])
[12836.072468] 050: 6b 6b 6b 6b 6b 6b 6b 6b 4b 4b 6b 6b 6b 6b 6b 6b
[12836.085814] Prev obj: start=ffff88010bbcc8e0, len=2048
[12836.091027] Redzone: 0xd84156c5635688c0/0xd84156c5635688c0.
[12836.096679] Last user: [<ffffffff810a8488>](__kmalloc_node_track_caller+0x24/0x29)
[12836.104492] 000: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
[12836.111761] 010: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
Actions #1

Updated by Yehuda Sadeh almost 14 years ago

What was the specific scenario? Can it be reproduced?

Actions #2

Updated by Sage Weil almost 14 years ago

Yeah, I think this is related to #163, but I still don't know how that would cause this problem. The basic issue is that con refs are taken when work is queued on the msgr workqueue, and ceph_con structs are embedded in caller structs (ceph_osd, etc.), so the ->put is an op that calls into the caller. When the caller shuts down, there may still be work queued on the connection, so the final ->put may come after things have been torn down. And when the final put_osd looks at client->monc->auth->... there's a problem. The fix was to stop monc last, after flushing the workqueue.
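
To make the ordering problem above concrete, here is a minimal user-space sketch of it. It is not the kernel code: every `model_*` name, the `osd_con_ops` table, and the `late_puts` counter are hypothetical stand-ins invented for illustration. It only models the three ingredients described above: a con embedded in its owner, a ref taken when work is queued, and a final put that consults shared client state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model; names loosely follow the ticket, not the kernel source. */
struct model_con;

struct model_con_ops {
    void (*get)(struct model_con *con);
    void (*put)(struct model_con *con);
};

struct model_con {
    const struct model_con_ops *ops;
    bool work_queued;          /* true while a work item holds a ref */
};

struct model_client {
    bool monc_alive;           /* stands in for client->monc->auth */
    int late_puts;             /* final puts that ran after monc was stopped */
};

struct model_osd {
    struct model_client *client;
    int refs;
    bool freed;
    struct model_con con;      /* embedded: shares the owner's lifetime */
};

#define model_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* The con's get/put ops forward to the owner's refcount. */
static void osd_get(struct model_con *con)
{
    model_container_of(con, struct model_osd, con)->refs++;
}

static void osd_put(struct model_con *con)
{
    struct model_osd *osd = model_container_of(con, struct model_osd, con);

    if (--osd->refs > 0)
        return;
    /* The final put looks at shared client state before freeing. */
    if (!osd->client->monc_alive)
        osd->client->late_puts++;   /* would be a use-after-free in the kernel */
    osd->freed = true;
}

static const struct model_con_ops osd_con_ops = { osd_get, osd_put };

/* Queueing work takes a ref on behalf of the pending work item. */
static void queue_con(struct model_con *con)
{
    if (con->work_queued)
        return;
    con->ops->get(con);
    con->work_queued = true;
}

/* Flushing runs the pending work, which drops its ref when done. */
static void flush_work(struct model_con *con)
{
    if (!con->work_queued)
        return;
    con->work_queued = false;
    con->ops->put(con);
}

/* Returns how many final puts observed an already-stopped monc. */
int model_shutdown(bool flush_before_monc_stop)
{
    struct model_client client = { true, 0 };
    struct model_osd osd = { &client, 1, false, { &osd_con_ops, false } };

    queue_con(&osd.con);            /* work item now holds a ref */
    osd.con.ops->put(&osd.con);     /* owner drops its own ref */

    if (flush_before_monc_stop) {
        flush_work(&osd.con);       /* final put runs while monc is alive */
        client.monc_alive = false;
    } else {
        client.monc_alive = false;  /* monc torn down first */
        flush_work(&osd.con);       /* final put arrives too late */
    }
    assert(osd.freed);
    return client.late_puts;
}
```

In this model, `model_shutdown(true)` (flush, then stop monc) sees no late put, while `model_shutdown(false)` sees one. That mirrors the ordering the fix is described as enforcing. It does not say anything about whether the con's own ref counting was also wrong.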

What does the 'last user' in

[12836.053594] Slab corruption: size-2048 start=ffff88010bbcd0f8, len=2048
[12836.060294] Redzone: 0x9f911029d74e35b/0x9f911029d74e35b.
[12836.065773] Last user: [<ffffffffa01106b9>](put_osd+0x3f/0x82 [ceph])

really mean? This error makes it look more like the ref counting on the con itself was wrong. :/
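
For reading the slab dump, it may help to check the bytes against the kernel's poison values. As far as I know, `include/linux/poison.h` defines `POISON_FREE` as 0x6b (freed object), `POISON_INUSE` as 0x5a (allocated but uninitialized), and `POISON_END` as 0xa5. My understanding, which I have not verified against the 2.6.34 source, is that "Last user" is the stored caller address of the most recent allocation or free of that object; for an object still filled with 0x6b that would be whoever freed it. The sketch below is a user-space helper, not kernel code. The `poison_report` struct and both function names are hypothetical. It scans the `050:` row from the report for bytes that differ from the free poison.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Values as defined in the kernel's include/linux/poison.h. */
#define POISON_INUSE 0x5a
#define POISON_FREE  0x6b
#define POISON_END   0xa5

/* Hypothetical summary of one scanned region. */
struct poison_report {
    size_t mismatches;     /* bytes that are not POISON_FREE */
    size_t first_offset;   /* offset of the first mismatch, or len if none */
    uint8_t flipped_bits;  /* OR of (byte XOR POISON_FREE) over mismatches */
};

/* Scan a region that is expected to be a freed, poisoned object. */
struct poison_report check_free_poison(const uint8_t *buf, size_t len)
{
    struct poison_report r = { 0, len, 0 };

    for (size_t i = 0; i < len; i++) {
        if (buf[i] == POISON_FREE)
            continue;
        if (r.mismatches == 0)
            r.first_offset = i;
        r.mismatches++;
        r.flipped_bits |= (uint8_t)(buf[i] ^ POISON_FREE);
    }
    return r;
}

/* The "050:" row copied from the slab corruption report above. */
static const uint8_t row_050[16] = {
    0x6b, 0x6b, 0x6b, 0x6b, 0x6b, 0x6b, 0x6b, 0x6b,
    0x4b, 0x4b, 0x6b, 0x6b, 0x6b, 0x6b, 0x6b, 0x6b,
};

struct poison_report check_logged_row(void)
{
    return check_free_poison(row_050, sizeof(row_050));
}
```

On the logged row this reports two modified bytes, at offsets 0x58 and 0x59 of the object, each differing from 0x6b only in bit 0x20. That pattern would be consistent with something writing to the object after put_osd freed it. The same goes for RDI holding 6b6b6b6b6b6b6b6b at the fault, which suggests a pointer loaded from freed memory. The dump alone does not show which code path did the write.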

Actions #3

Updated by Sage Weil almost 14 years ago

  • Status changed from New to Can't reproduce