Bug #12160

kernel BUG at fs/ceph/caps.c:2307, ceph_put_wrbuffer_cap_refs

Added by Markus Blank-Burian almost 9 years ago. Updated almost 8 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: other
Tags: -
Backport: -
Regression: No
Severity: 2 - major
Reviewed: -
Affected Versions: -
ceph-qa-suite: -
Crash signature (v1): -
Crash signature (v2): -

Description

I am seeing this bug again:

2015-06-25T14:11:52+02:00 kaa-29 kernel: kernel BUG at fs/ceph/caps.c:2307!
2015-06-25T14:11:52+02:00 kaa-29 kernel: invalid opcode: 0000 [#2] SMP 
2015-06-25T14:11:52+02:00 kaa-29 kernel: Modules linked in: cbc ceph libceph ipmi_watchdog adm1026 w83795 w83793 hwmon_vid jc42 8021q garp mrp stp llc autofs4 xfs ipmi_devintf mgag200 syscopyarea sysfillrect sysimgblt ttm x86_pkg_temp_thermal drm_kms_helper coretemp iTCO_wdt drm iTCO_vendor_support microcode pcspkr sb_edac evdev lpc_ich edac_core ipmi_si i2c_i801 mfd_core rtc_cmos ipmi_msghandler processor thermal_sys ioatdma button dca rpcsec_gss_krb5 fuse nfsv4 nfs af_packet hid_generic usbhid hid bonding sd_mod ehci_pci ehci_hcd crc32c_intel ahci libahci usbcore libata usb_common ipv6 dm_mirror dm_region_hash dm_log dm_mod unix
2015-06-25T14:11:52+02:00 kaa-29 kernel: CPU: 25 PID: 698 Comm: kworker/25:1 Tainted: P      D W  O   3.18.10-gentoo #1
2015-06-25T14:11:52+02:00 kaa-29 kernel: Hardware name: Supermicro X9DRT-HF+/X9DRT-HF+, BIOS 3.0c 05/21/2014
2015-06-25T14:11:52+02:00 kaa-29 kernel: Workqueue: ceph-msgr con_work [libceph]
2015-06-25T14:11:52+02:00 kaa-29 kernel: task: ffff88101dc4e400 ti: ffff88101ef70000 task.ti: ffff88101ef70000
2015-06-25T14:11:52+02:00 kaa-29 kernel: RIP: 0010:[<ffffffffa05233a7>]  [<ffffffffa05233a7>] ceph_put_wrbuffer_cap_refs+0x252/0x26e [ceph]
2015-06-25T14:11:52+02:00 kaa-29 kernel: RSP: 0018:ffff88101ef73b98  EFLAGS: 00010246
2015-06-25T14:11:52+02:00 kaa-29 kernel: RAX: ffff880b621e2d30 RBX: ffff880b621e2b80 RCX: ffff880b621e2d30
2015-06-25T14:11:52+02:00 kaa-29 kernel: RDX: 000000000000132b RSI: 0000000000000020 RDI: 0000000000000000
2015-06-25T14:11:52+02:00 kaa-29 kernel: RBP: ffff88101ef73c08 R08: ffff88107ffc5468 R09: 0000000000013c84
2015-06-25T14:11:52+02:00 kaa-29 kernel: R10: 000000000000001f R11: 0000000000013c28 R12: ffff88101de54fa0
2015-06-25T14:11:52+02:00 kaa-29 kernel: R13: ffff8820241a0000 R14: ffff880b621e2d20 R15: 0000000000000000
2015-06-25T14:11:52+02:00 kaa-29 kernel: FS:  0000000000000000(0000) GS:ffff88103fba0000(0000) knlGS:0000000000000000
2015-06-25T14:11:52+02:00 kaa-29 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2015-06-25T14:11:52+02:00 kaa-29 kernel: CR2: 00007ff4564430f0 CR3: 0000000001811000 CR4: 00000000001407e0
2015-06-25T14:11:52+02:00 kaa-29 kernel: Stack:
2015-06-25T14:11:52+02:00 kaa-29 kernel:  0000000000000282 ffff88103f419400 00ffea001e075cc0 ffff8820241a0110
2015-06-25T14:11:52+02:00 kaa-29 kernel:  ffff88101ef73be8 ffffea000f36d280 ffff880b621e2d30 ffff880b621e2eb0
2015-06-25T14:11:52+02:00 kaa-29 kernel:  0000000000000000 ffff880b621e2eb0 ffff8820241a0110 ffff8820241a0000
2015-06-25T14:11:52+02:00 kaa-29 kernel: Call Trace:
2015-06-25T14:11:52+02:00 kaa-29 kernel:  [<ffffffffa051c41a>] writepages_finish+0x20a/0x26c [ceph]
2015-06-25T14:11:52+02:00 kaa-29 kernel:  [<ffffffffa0148c0e>] dispatch+0x5a0/0x8af [libceph]
2015-06-25T14:11:53+02:00 kaa-29 kernel:  [<ffffffffa01410c3>] con_work+0x1087/0x239d [libceph]
2015-06-25T14:11:53+02:00 kaa-29 kernel:  [<ffffffff8105a238>] ? update_next_balance.constprop.66+0x15/0x42
2015-06-25T14:11:53+02:00 kaa-29 kernel:  [<ffffffff81008fe6>] ? native_sched_clock+0x35/0x37
2015-06-25T14:11:53+02:00 kaa-29 kernel:  [<ffffffff810577fc>] ? sched_clock_cpu+0x1b/0xa4
2015-06-25T14:11:53+02:00 kaa-29 kernel:  [<ffffffff81058087>] ? arch_vtime_task_switch+0x6b/0x70
2015-06-25T14:11:53+02:00 kaa-29 kernel:  [<ffffffff81051a57>] ? finish_task_switch+0x95/0xea
2015-06-25T14:11:53+02:00 kaa-29 kernel:  [<ffffffff810496a9>] process_one_work+0x154/0x213
2015-06-25T14:11:53+02:00 kaa-29 kernel:  [<ffffffff81049bf0>] worker_thread+0x1c2/0x299
2015-06-25T14:11:53+02:00 kaa-29 kernel:  [<ffffffff81049a2e>] ? cancel_delayed_work_sync+0x10/0x10
2015-06-25T14:11:53+02:00 kaa-29 kernel:  [<ffffffff8104d8cd>] kthread+0xa0/0xa8
2015-06-25T14:11:53+02:00 kaa-29 kernel:  [<ffffffff8104d82d>] ? __kthread_parkme+0x5c/0x5c
2015-06-25T14:11:53+02:00 kaa-29 kernel:  [<ffffffff813ecd48>] ret_from_fork+0x58/0x90
2015-06-25T14:11:53+02:00 kaa-29 kernel:  [<ffffffff8104d82d>] ? __kthread_parkme+0x5c/0x5c
2015-06-25T14:11:53+02:00 kaa-29 kernel: Code: df e8 49 e0 ff ff 48 8d bb 70 01 00 00 31 c9 31 d2 be 03 00 00 00 e8 dd fb b3 e0 45 85 e4 74 18 48 8b 7d c8 e8 6e 4a bf e0 eb 0d <0f> 0b 45 31 e4 45 31 ed e9 ee fe ff ff 48 83 c4 48 5b 41 5c 41 
2015-06-25T14:11:53+02:00 kaa-29 kernel: RIP  [<ffffffffa05233a7>] ceph_put_wrbuffer_cap_refs+0x252/0x26e [ceph]
2015-06-25T14:11:53+02:00 kaa-29 kernel:  RSP <ffff88101ef73b98>
2015-06-25T14:11:53+02:00 kaa-29 kernel: ---[ end trace 3649ad6510e9703c ]---
2015-06-25T14:11:53+02:00 kaa-29 kernel: BUG: unable to handle kernel paging request at ffffffffffffffd8
2015-06-25T14:11:53+02:00 kaa-29 kernel: IP: [<ffffffff8104dabe>] kthread_data+0xb/0x11
2015-06-25T14:11:53+02:00 kaa-29 kernel: PGD 1812067 PUD 1814067 PMD 0 
2015-06-25T14:11:53+02:00 kaa-29 kernel: Oops: 0000 [#3] SMP 
2015-06-25T14:11:53+02:00 kaa-29 kernel: Modules linked in: cbc ceph libceph ipmi_watchdog adm1026 w83795 w83793 hwmon_vid jc42 8021q garp mrp stp llc autofs4 xfs ipmi_devintf mgag200 syscopyarea sysfillrect sysimgblt ttm x86_pkg_temp_thermal drm_kms_helper coretemp iTCO_wdt drm iTCO_vendor_support microcode pcspkr sb_edac evdev lpc_ich edac_core ipmi_si i2c_i801 mfd_core rtc_cmos ipmi_msghandler processor thermal_sys ioatdma button dca rpcsec_gss_krb5 fuse nfsv4 nfs af_packet hid_generic usbhid hid bonding sd_mod ehci_pci ehci_hcd crc32c_intel ahci libahci usbcore libata usb_common ipv6 dm_mirror dm_region_hash dm_log dm_mod unix
2015-06-25T14:11:53+02:00 kaa-29 kernel: CPU: 25 PID: 698 Comm: kworker/25:1 Tainted: P      D W  O   3.18.10-gentoo #1
2015-06-25T14:11:53+02:00 kaa-29 kernel: Hardware name: Supermicro X9DRT-HF+/X9DRT-HF+, BIOS 3.0c 05/21/2014
2015-06-25T14:11:53+02:00 kaa-29 kernel: task: ffff88101dc4e400 ti: ffff88101ef70000 task.ti: ffff88101ef70000
2015-06-25T14:11:53+02:00 kaa-29 kernel: RIP: 0010:[<ffffffff8104dabe>]  [<ffffffff8104dabe>] kthread_data+0xb/0x11
2015-06-25T14:11:53+02:00 kaa-29 kernel: RSP: 0018:ffff88101ef73838  EFLAGS: 00010002
2015-06-25T14:11:53+02:00 kaa-29 kernel: RAX: 0000000000000000 RBX: ffff88103fbb0780 RCX: ffff88103fbb07f8
2015-06-25T14:11:53+02:00 kaa-29 kernel: RDX: 0000000000000000 RSI: 0000000000000019 RDI: ffff88101dc4e400
2015-06-25T14:11:53+02:00 kaa-29 kernel: RBP: ffff88101ef73838 R08: ffffffff819a6dc0 R09: 000000000000000f
2015-06-25T14:11:53+02:00 kaa-29 kernel: R10: 00000000ffffffff R11: 000000000000b80e R12: 0000000000000019
2015-06-25T14:11:53+02:00 kaa-29 kernel: R13: ffff88101dc4e960 R14: ffff881029378000 R15: ffff88101dc4e400
2015-06-25T14:11:53+02:00 kaa-29 kernel: FS:  0000000000000000(0000) GS:ffff88103fba0000(0000) knlGS:0000000000000000
2015-06-25T14:11:53+02:00 kaa-29 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2015-06-25T14:11:53+02:00 kaa-29 kernel: CR2: 0000000000000028 CR3: 0000000001811000 CR4: 00000000001407e0
2015-06-25T14:11:53+02:00 kaa-29 kernel: Stack:
2015-06-25T14:11:53+02:00 kaa-29 kernel:  ffff88101ef73858 ffffffff81049f42 ffff88103fbb0780 0000000000000019
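
For context on where this trips: the assertion that fires is the "no matching snap context" check in ceph_put_wrbuffer_cap_refs(), reached from writepages_finish() when writeback of dirty pages completes. The following is a paraphrased sketch of the 3.18-era fs/ceph/caps.c logic, not a verbatim copy; the exact line numbered 2307 varies between kernel versions:

/*
 * Sketch, paraphrased from 3.18-era fs/ceph/caps.c. writepages_finish()
 * drops the write-buffer references taken when the pages were dirtied,
 * passing the snap context they were dirtied under. That context must
 * match either the inode's current head snapc or one of its queued cap
 * snaps; otherwise the reference accounting is inconsistent and the
 * kernel BUGs.
 */
void ceph_put_wrbuffer_cap_refs(struct ceph_inode_info *ci, int nr,
                                struct ceph_snap_context *snapc)
{
        struct ceph_cap_snap *capsnap;
        int found = 0;

        spin_lock(&ci->i_ceph_lock);
        ci->i_wrbuffer_ref -= nr;
        if (ci->i_head_snapc == snapc) {
                /* ordinary (non-snap) writeback */
                ci->i_wrbuffer_ref_head -= nr;
        } else {
                /* writeback under an older snap context */
                list_for_each_entry(capsnap, &ci->i_cap_snaps, ci_item) {
                        if (capsnap->context == snapc) {
                                found = 1;
                                break;
                        }
                }
                BUG_ON(!found);         /* <-- the BUG at caps.c:2307 */
        }
        spin_unlock(&ci->i_ceph_lock);
        /* ... wake waiters, release completed cap snaps, etc. ... */
}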

Previously, I saw this bug on kernel 3.14.y, but the patch "ceph: introduce global empty snap context" seemed to have solved it. After upgrading to Ceph 0.94.2 and kernel 3.18.10, the problem reappeared today on one machine, despite the following patches being applied:

ceph: exclude setfilelock requests when calculating oldest tid
e8a7b8b12b13831467c6158c1e82801e25b5dd98

ceph: fix dentry leaks
5cba372c0fe78d24e83d9e0556ecbeb219625c15

libceph: kfree() in put_osd() shouldn't depend on authorizer
b28ec2f37e6a2bbd0bdf74b39cb89c74e4ad17f3

libceph: request a new osdmap if lingering request maps to no osd
b0494532214bdfbf241e94fabab5dd46f7b82631

ceph: re-send requests when MDS enters reconnecting stage
3de22be6771353241eaec237fe594dfea3daf30f

Revert "libceph: clear r_req_lru_item in __unregister_linger_request()" 
521a04d06a729e5971cdee7f84080387ed320527

ceph: introduce global empty snap context
97c85a828f36bbfffe9d77b977b65a5872b6cad4
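
The last patch in this list is the one aimed directly at this BUG_ON. A rough sketch of its idea, paraphrased from commit 97c85a82:

/*
 * Paraphrased from commit 97c85a82 ("ceph: introduce global empty snap
 * context"): allocate one shared empty snap context at module init and
 * hand it out wherever an empty context is needed, instead of allocating
 * per-realm copies. Two separately allocated empty contexts are logically
 * identical but compare unequal by pointer, which is exactly what the
 * pointer comparisons in ceph_put_wrbuffer_cap_refs() are sensitive to.
 */
struct ceph_snap_context *ceph_empty_snapc;

int __init ceph_snap_init(void)
{
        ceph_empty_snapc = ceph_create_snap_context(0, GFP_NOFS);
        if (!ceph_empty_snapc)
                return -ENOMEM;
        ceph_empty_snapc->seq = 1;
        return 0;
}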

Is there any other fix I could try?


#1

Updated by Zheng Yan almost 9 years ago

Have you ever used snapshots on the filesystem?

#2

Updated by Markus Blank-Burian almost 9 years ago

No, we never made a snapshot.

#3

Updated by Zheng Yan almost 9 years ago

Did this oops happen when restarting the MDS? Please send the full kernel messages to me.

#5

Updated by Markus Blank-Burian almost 9 years ago

The standby MDS took over approximately 1.5 hours before this bug. In the MDS log, I find:

2015-06-25T12:50:37+02:00 bagheera ceph-mds[19442]: 2015-06-25 12:50:37.422135 7f8ddffff700  1 mds.0.server reconnect gave up on client.7001452 192.168.1.28:0/1906398027
2015-06-25T12:50:37+02:00 bagheera ceph-mds[19442]: 2015-06-25 12:50:37.422210 7f8ddffff700  1 mds.0.server reconnect gave up on client.7001458 192.168.1.46:0/42480723
2015-06-25T12:50:37+02:00 bagheera ceph-mds[19442]: 2015-06-25 12:50:37.422215 7f8ddffff700  1 mds.0.server reconnect gave up on client.7001462 192.168.1.43:0/2294373093
2015-06-25T12:50:37+02:00 bagheera ceph-mds[19442]: 2015-06-25 12:50:37.422219 7f8ddffff700  1 mds.0.server reconnect gave up on client.7001976 192.168.1.31:0/3819345903
2015-06-25T12:50:37+02:00 bagheera ceph-mds[19442]: 2015-06-25 12:50:37.422223 7f8ddffff700  1 mds.0.server reconnect gave up on client.7007324 192.168.1.48:0/799265129
2015-06-25T12:50:37+02:00 bagheera ceph-mds[19442]: 2015-06-25 12:50:37.422227 7f8ddffff700  1 mds.0.server reconnect gave up on client.7007352 192.168.1.29:0/1924916090
2015-06-25T12:50:37+02:00 bagheera ceph-mds[19442]: 2015-06-25 12:50:37.422232 7f8ddffff700  1 mds.0.332 reconnect_done
2015-06-25T12:50:37+02:00 bagheera ceph-mds[19442]: 2015-06-25 12:50:37.788797 7f8de4282700  1 mds.0.332 handle_mds_map i am now mds.0.332
2015-06-25T12:50:37+02:00 bagheera ceph-mds[19442]: 2015-06-25 12:50:37.788804 7f8de4282700  1 mds.0.332 handle_mds_map state change up:reconnect --> up:rejoin
2015-06-25T12:50:37+02:00 bagheera ceph-mds[19442]: 2015-06-25 12:50:37.788807 7f8de4282700  1 mds.0.332 rejoin_start
2015-06-25T12:50:37+02:00 bagheera ceph-mds[19442]: 2015-06-25 12:50:37.791365 7f8de4282700  1 mds.0.332 rejoin_joint_start
2015-06-25T12:50:38+02:00 bagheera ceph-mds[19442]: 2015-06-25 12:50:38.899838 7f8ddeffd700  1 mds.0.332 rejoin_done
2015-06-25T12:50:40+02:00 bagheera ceph-mds[19442]: 2015-06-25 12:50:40.280516 7f8de4282700  1 mds.0.332 handle_mds_map i am now mds.0.332
2015-06-25T12:50:40+02:00 bagheera ceph-mds[19442]: 2015-06-25 12:50:40.280523 7f8de4282700  1 mds.0.332 handle_mds_map state change up:rejoin --> up:clientreplay
2015-06-25T12:50:40+02:00 bagheera ceph-mds[19442]: 2015-06-25 12:50:40.280526 7f8de4282700  1 mds.0.332 recovery_done -- successful recovery!
2015-06-25T12:50:40+02:00 bagheera ceph-mds[19442]: 2015-06-25 12:50:40.282925 7f8de4282700  1 mds.0.332 clientreplay_start
2015-06-25T12:50:42+02:00 bagheera ceph-mds[19442]: 2015-06-25 12:50:42.427408 7f8ddffff700  0 log_channel(cluster) log [INF] : closing stale session client.7007330 192.168.1.81:0/1344128320 after 6214.257132
2015-06-25T12:50:42+02:00 bagheera ceph-mds[19442]: 2015-06-25 12:50:42.427436 7f8ddffff700  0 log_channel(cluster) log [INF] : closing stale session client.7001466 192.168.1.47:0/4123295554 after 6214.257090
2015-06-25T12:50:42+02:00 bagheera ceph-mds[19442]: 2015-06-25 12:50:42.427445 7f8ddffff700  0 log_channel(cluster) log [INF] : closing stale session client.7007320 192.168.1.102:0/1261991612 after 6214.068432
2015-06-25T12:50:42+02:00 bagheera ceph-mds[19442]: 2015-06-25 12:50:42.427457 7f8ddffff700  0 log_channel(cluster) log [INF] : closing stale session client.6883268 192.168.1.21:0/8818 after 6214.068427
2015-06-25T12:50:42+02:00 bagheera ceph-mds[19442]: 2015-06-25 12:50:42.427465 7f8ddffff700  0 log_channel(cluster) log [INF] : closing stale session client.7001442 192.168.1.35:0/3903382741 after 6214.068414
2015-06-25T12:50:42+02:00 bagheera ceph-mds[19442]: 2015-06-25 12:50:42.427472 7f8ddffff700  0 log_channel(cluster) log [INF] : closing stale session client.7001440 192.168.1.39:0/2973986646 after 6214.068419
2015-06-25T12:50:42+02:00 bagheera ceph-mds[19442]: 2015-06-25 12:50:42.427479 7f8ddffff700  0 log_channel(cluster) log [INF] : closing stale session client.7007262 192.168.1.71:0/3764231419 after 6214.068410
2015-06-25T12:50:42+02:00 bagheera ceph-mds[19442]: 2015-06-25 12:50:42.427486 7f8ddffff700  0 log_channel(cluster) log [INF] : closing stale session client.7001436 192.168.1.34:0/1747240724 after 6214.068407
2015-06-25T12:50:42+02:00 bagheera ceph-mds[19442]: 2015-06-25 12:50:42.427493 7f8ddffff700  0 log_channel(cluster) log [INF] : closing stale session client.6913066 192.168.1.8:0/29724 after 6214.068403
2015-06-25T12:50:42+02:00 bagheera ceph-mds[19442]: 2015-06-25 12:50:42.427501 7f8ddffff700  0 log_channel(cluster) log [INF] : closing stale session client.7007342 192.168.1.66:0/960604257 after 6214.068398
2015-06-25T12:50:42+02:00 bagheera ceph-mds[19442]: 2015-06-25 12:50:42.427508 7f8ddffff700  0 log_channel(cluster) log [INF] : closing stale session client.7001448 192.168.1.15:0/3928381201 after 6213.960858
2015-06-25T12:50:42+02:00 bagheera ceph-mds[19442]: 2015-06-25 12:50:42.427516 7f8ddffff700  0 log_channel(cluster) log [INF] : closing stale session client.6883274 192.168.1.23:0/8924 after 6213.960853
2015-06-25T12:50:42+02:00 bagheera ceph-mds[19442]: 2015-06-25 12:50:42.427524 7f8ddffff700  0 log_channel(cluster) log [INF] : closing stale session client.6930102 192.168.1.12:0/8216 after 6213.960845
2015-06-25T12:50:42+02:00 bagheera ceph-mds[19442]: 2015-06-25 12:50:42.427537 7f8ddffff700  0 log_channel(cluster) log [INF] : closing stale session client.3664188 192.168.1.253:0/2779307766 after 6213.960838
2015-06-25T12:50:42+02:00 bagheera ceph-mds[19442]: 2015-06-25 12:50:42.427544 7f8ddffff700  0 log_channel(cluster) log [INF] : closing stale session client.6883276 192.168.1.24:0/9096 after 6213.960834
2015-06-25T12:50:42+02:00 bagheera ceph-mds[19442]: 2015-06-25 12:50:42.427560 7f8ddffff700  0 log_channel(cluster) log [INF] : closing stale session client.6883262 192.168.1.18:0/8666 after 6213.960827
2015-06-25T12:50:42+02:00 bagheera ceph-mds[19442]: 2015-06-25 12:50:42.427575 7f8ddffff700  0 log_channel(cluster) log [INF] : closing stale session client.7001468 192.168.1.17:0/1093602686 after 6213.591832
2015-06-25T12:50:42+02:00 bagheera ceph-mds[19442]: 2015-06-25 12:50:42.427583 7f8ddffff700  0 log_channel(cluster) log [INF] : closing stale session client.7007336 192.168.1.104:0/3079450156 after 6213.403195
2015-06-25T12:50:42+02:00 bagheera ceph-mds[19442]: 2015-06-25 12:50:42.427590 7f8ddffff700  0 log_channel(cluster) log [INF] : closing stale session client.7001454 192.168.1.30:0/19419193 after 6213.403175
2015-06-25T12:50:42+02:00 bagheera ceph-mds[19442]: 2015-06-25 12:50:42.427600 7f8ddffff700  0 log_channel(cluster) log [INF] : closing stale session client.6913056 192.168.1.7:0/30112 after 6213.403181
2015-06-25T12:50:42+02:00 bagheera ceph-mds[19442]: 2015-06-25 12:50:42.427607 7f8ddffff700  0 log_channel(cluster) log [INF] : closing stale session client.7007308 192.168.1.69:0/4214157904 after 6213.403190
2015-06-25T12:50:42+02:00 bagheera ceph-mds[19442]: 2015-06-25 12:50:42.427616 7f8ddffff700  0 log_channel(cluster) log [INF] : closing stale session client.6883272 192.168.1.22:0/8920 after 6213.323038
2015-06-25T12:50:42+02:00 bagheera ceph-mds[19442]: 2015-06-25 12:50:42.427658 7f8ddffff700  0 log_channel(cluster) log [INF] : closing stale session client.7007314 192.168.1.84:0/3451170179 after 6213.181279
2015-06-25T12:50:42+02:00 bagheera ceph-mds[19442]: 2015-06-25 12:50:42.427666 7f8ddffff700  0 log_channel(cluster) log [INF] : closing stale session client.6883266 192.168.1.20:0/8809 after 6213.063169
2015-06-25T12:50:42+02:00 bagheera ceph-mds[19442]: 2015-06-25 12:50:42.427673 7f8ddffff700  0 log_channel(cluster) log [INF] : closing stale session client.7001460 192.168.1.44:0/4156243384 after 6213.063150
2015-06-25T12:50:42+02:00 bagheera ceph-mds[19442]: 2015-06-25 12:50:42.427718 7f8ddffff700  0 log_channel(cluster) log [WRN] : 2 slow requests, 2 included below; oldest blocked for > 50.653845 secs
2015-06-25T12:50:42+02:00 bagheera ceph-mds[19442]: 2015-06-25 12:50:42.427723 7f8ddffff700  0 log_channel(cluster) log [WRN] : slow request 50.653845 seconds old, received at 2015-06-25 12:49:51.773847: client_request(client.6883272:23399 create #10000554c3b/state0_step50437440.cpt 2015-06-25 12:48:15.403334 RETRY=1 REPLAY) currently killing request
2015-06-25T12:50:42+02:00 bagheera ceph-mds[19442]: 2015-06-25 12:50:42.427731 7f8ddffff700  0 log_channel(cluster) log [WRN] : slow request 50.653763 seconds old, received at 2015-06-25 12:49:51.773928: client_request(client.6883272:23404 create #10000554c3b/state1_step50437440.cpt 2015-06-25 12:48:15.411736 RETRY=1 REPLAY) currently killing request
2015-06-25T12:50:52+02:00 bagheera ceph-mds[19442]: 2015-06-25 12:50:52.422407 7f8ddffff700  0 log_channel(cluster) log [WRN] : 2 slow requests, 2 included below; oldest blocked for > 60.648521 secs
2015-06-25T12:50:52+02:00 bagheera ceph-mds[19442]: 2015-06-25 12:50:52.422418 7f8ddffff700  0 log_channel(cluster) log [WRN] : slow request 60.648521 seconds old, received at 2015-06-25 12:49:51.773847: client_request(client.6883272:23399 create #10000554c3b/state0_step50437440.cpt 2015-06-25 12:48:15.403334 RETRY=1 REPLAY) currently killing request
2015-06-25T12:50:52+02:00 bagheera ceph-mds[19442]: 2015-06-25 12:50:52.422424 7f8ddffff700  0 log_channel(cluster) log [WRN] : slow request 60.648439 seconds old, received at 2015-06-25 12:49:51.773928: client_request(client.6883272:23404 create #10000554c3b/state1_step50437440.cpt 2015-06-25 12:48:15.411736 RETRY=1 REPLAY) currently killing request
2015-06-25T12:50:56+02:00 bagheera ceph-mds[19442]: 2015-06-25 12:50:56.132330 7f8de4282700  1 mds.0.332 clientreplay_done
2015-06-25T12:50:56+02:00 bagheera ceph-mds[19442]: 2015-06-25 12:50:56.755431 7f8dbd9a5700  0 -- 192.168.1.254:6800/19440 >> 192.168.1.31:0/3819345903 pipe(0x7f8dd3888000 sd=54 :6800 s=0 pgs=0 cs=0 l=0 c=0x7f8dda037600).accept peer addr is really 192.168.1.31:0/3819345903 (socket is 192.168.1.31:38634/0)
2015-06-25T12:50:56+02:00 bagheera ceph-mds[19442]: 2015-06-25 12:50:56.755893 7f8dbd6a2700  0 -- 192.168.1.254:6800/19440 >> 192.168.1.43:0/2294373093 pipe(0x7f8dd3904000 sd=73 :6800 s=0 pgs=0 cs=0 l=0 c=0x7f8dda037780).accept peer addr is really 192.168.1.43:0/2294373093 (socket is 192.168.1.43:58386/0)
2015-06-25T12:50:56+02:00 bagheera ceph-mds[19442]: 2015-06-25 12:50:56.756437 7f8dc08d4700  0 -- 192.168.1.254:6800/19440 >> 192.168.1.29:0/1924916090 pipe(0x7f8dd1406000 sd=125 :6800 s=0 pgs=0 cs=0 l=0 c=0x7f8dda037900).accept peer addr is really 192.168.1.29:0/1924916090 (socket is 192.168.1.29:58134/0)
2015-06-25T12:50:56+02:00 bagheera ceph-mds[19442]: 2015-06-25 12:50:56.756558 7f8dbf4c0700  0 -- 192.168.1.254:6800/19440 >> 192.168.1.48:0/799265129 pipe(0x7f8dd140b000 sd=126 :6800 s=0 pgs=0 cs=0 l=0 c=0x7f8dda037a80).accept peer addr is really 192.168.1.48:0/799265129 (socket is 192.168.1.48:55010/0)
2015-06-25T12:50:56+02:00 bagheera ceph-mds[19442]: 2015-06-25 12:50:56.757069 7f8db6f3b700  0 -- 192.168.1.254:6800/19440 >> 192.168.1.46:0/42480723 pipe(0x7f8dd1410000 sd=127 :6800 s=0 pgs=0 cs=0 l=0 c=0x7f8dda037c00).accept peer addr is really 192.168.1.46:0/42480723 (socket is 192.168.1.46:59810/0)
2015-06-25T12:50:56+02:00 bagheera ceph-mds[19442]: 2015-06-25 12:50:56.757157 7f8db6e3a700  0 -- 192.168.1.254:6800/19440 >> 192.168.1.28:0/1906398027 pipe(0x7f8dd1415000 sd=128 :6800 s=0 pgs=0 cs=0 l=0 c=0x7f8dda037d80).accept peer addr is really 192.168.1.28:0/1906398027 (socket is 192.168.1.28:46650/0)
2015-06-25T12:50:56+02:00 bagheera ceph-mds[19442]: 2015-06-25 12:50:56.778319 7f8de4282700  1 mds.0.332 handle_mds_map i am now mds.0.332
2015-06-25T12:50:56+02:00 bagheera ceph-mds[19442]: 2015-06-25 12:50:56.778326 7f8de4282700  1 mds.0.332 handle_mds_map state change up:clientreplay --> up:active

And just before the crash, everything looked "normal":

2015-06-25T14:07:47+02:00 bagheera ceph-mds[19442]: 2015-06-25 14:07:47.561304 7f8ddffff700  0 log_channel(cluster) log [WRN] : slow request 3842.439722 seconds old, received at 2015-06-25 13:03:45.121527: client_request(client.7001450:432210 setfilelockrule 1, type 2, owner 9618881894372880279, pid 11612, start 0, length 0, wait 1 #100005687c8 2015-06-25 13:03:45.000000) currently acquired locks
2015-06-25T14:08:12+02:00 bagheera ceph-mds[19442]: 2015-06-25 14:08:12.561787 7f8ddffff700  0 log_channel(cluster) log [WRN] : 12 slow requests, 1 included below; oldest blocked for > 3996.158670 secs
2015-06-25T14:08:12+02:00 bagheera ceph-mds[19442]: 2015-06-25 14:08:12.561804 7f8ddffff700  0 log_channel(cluster) log [WRN] : slow request 3842.176972 seconds old, received at 2015-06-25 13:04:10.384773: client_request(client.7001450:432265 setfilelockrule 1, type 2, owner 9618881894375997079, pid 11681, start 0, length 0, wait 1 #100005683e0 2015-06-25 13:04:10.000000) currently acquired locks
2015-06-25T14:08:17+02:00 bagheera ceph-mds[19442]: 2015-06-25 14:08:17.561888 7f8ddffff700  0 log_channel(cluster) log [WRN] : 12 slow requests, 3 included below; oldest blocked for > 4001.158751 secs
2015-06-25T14:08:17+02:00 bagheera ceph-mds[19442]: 2015-06-25 14:08:17.561900 7f8ddffff700  0 log_channel(cluster) log [WRN] : slow request 3844.508751 seconds old, received at 2015-06-25 13:04:13.053075: client_request(client.7001450:432273 setfilelockrule 1, type 1, owner 9618881894373326871, pid 11639, start 0, length 0, wait 1 #100005687d7 2015-06-25 13:04:13.000000) currently acquired locks
2015-06-25T14:08:17+02:00 bagheera ceph-mds[19442]: 2015-06-25 14:08:17.561906 7f8ddffff700  0 log_channel(cluster) log [WRN] : slow request 3844.411097 seconds old, received at 2015-06-25 13:04:13.150729: client_request(client.7001450:432275 setfilelockrule 1, type 2, owner 9618881894372872599, pid 11421, start 0, length 0, wait 1 #100005687ba 2015-06-25 13:04:13.000000) currently acquired locks
2015-06-25T14:08:17+02:00 bagheera ceph-mds[19442]: 2015-06-25 14:08:17.561913 7f8ddffff700  0 log_channel(cluster) log [WRN] : slow request 3842.281180 seconds old, received at 2015-06-25 13:04:15.280645: client_request(client.7001450:432277 setfilelockrule 1, type 2, owner 9618881894379384343, pid 11585, start 0, length 0, wait 1 #100005687d1 2015-06-25 13:04:15.000000) currently acquired locks
2015-06-25T14:14:17+02:00 bagheera ceph-mds[19442]: 2015-06-25 14:14:17.570133 7f8ddffff700  0 log_channel(cluster) log [INF] : closing stale session client.7007352 192.168.1.29:0/1924916090 after 301.242718

I will try these patches and report back if it happens again.

#6

Updated by Zheng Yan almost 8 years ago

  • Status changed from New to Resolved