Bug #9928
closedkernel BUG at fs/ceph/caps.c:2307!
0%
Description
Client's view of its operations:
91573685 mds0 create (unsafe) #10004d513be/mon.b.tgz (teuthology-archive/teuthology-2014-10-27_23:02:01-rgw-giant-distro-basic-multi/575474/data/mon.b.tgz) 92401836 mds0 readdir #10004d490dc 92401837 mds0 getattr #1 92401838 mds0 getattr #1 92401839 mds0 getattr #1 92401840 mds0 getattr #1 92401841 mds0 getattr #1 92401842 mds0 getattr #1 92401843 mds0 getattr #1 92401844 mds0 getattr #1 92401845 mds0 getattr #1 92401846 mds0 getattr #1 92401847 mds0 getattr #1 92401848 mds0 getattr #1 92401849 mds0 getattr #1 92401850 mds0 getattr #1 92401851 mds0 getattr #1 92401852 mds0 getattr #1 92401853 mds0 getattr #1 92401854 mds0 getattr #1 92401855 mds0 getattr #1 92401856 mds0 getattr #1 92401857 mds0 getattr #1 92401858 mds0 getattr #10004d46cba 92401859 mds0 getattr #1 92401860 mds0 getattr #1 92401861 mds0 getattr #1 92401862 mds0 getattr #1 92401863 mds0 getattr #1 ...
Diffed that list over one minute and an extra getattr had appeared, so those were steadily accumulating presumably.
Client (teuthology) became unresponsive to SSH login, and soft lockup traces were being printed to the serial console (see attached).
Files
Updated by John Spray over 9 years ago
MDS cache dump at ~/jcsp/9928/cachedump.1870.mds0 on teuthology.
This was taken at around 0800 local, long after the MDS had evicted the client session for the bad client, so probably isn't all that useful.
2014-10-28 23:31:39.133091 7f0113b71700 0 log_channel(default) log [INF] : closing stale session client.2755727 10.214.137.23:0/3139049599 after 303.924754
<< some 500000 lines of log from removing each client cap >>
2014-10-28 23:31:45.106346 7f0114473700 10 mds.0.sessionmap remove_session s=0x59195dc0 name=client.2755727
The slice of the MDS log immediately leading up to the loss of contact is ~/jcsp/9928/mds.log. Last few responses are:
2014-10-28 23:26:42.411135 7f0116c78700 10 mds.0.server reply_request 0 ((0) Success) client_request(client.2755727:92401830 lookup #10004d490dc/575409 2014-10-28 23:26:42.000000) v2 2014-10-28 23:26:42.411981 7f0116c78700 10 mds.0.server reply_request 0 ((0) Success) client_request(client.2755727:92401831 lookup #10004d490dc/575410 2014-10-28 23:26:42.000000) v2 2014-10-28 23:26:42.412862 7f0116c78700 10 mds.0.server reply_request 0 ((0) Success) client_request(client.2755727:92401832 lookup #10004d490dc/575411 2014-10-28 23:26:42.000000) v2 2014-10-28 23:26:42.413707 7f0116c78700 10 mds.0.server reply_request 0 ((0) Success) client_request(client.2755727:92401833 lookup #10004d490dc/575412 2014-10-28 23:26:42.000000) v2 2014-10-28 23:26:42.414580 7f0116c78700 10 mds.0.server reply_request 0 ((0) Success) client_request(client.2755727:92401834 lookup #10004d490dc/575413 2014-10-28 23:26:42.000000) v2 2014-10-28 23:26:42.415427 7f0116c78700 10 mds.0.server reply_request 0 ((0) Success) client_request(client.2755727:92401835 lookup #10004d490dc/575414 2014-10-28 23:26:42.000000) v2 2014-10-28 23:27:37.338019 7f0113b71700 10 mds.0.server reply_request 0 ((0) Success) client_request(client.2755727:92401836 readdir #10004d490dc 2014-10-28 23:26:52.000000) v2
Updated by Sage Weil over 9 years ago
2014-10-28 21:21:25.394832 mon.0 10.214.133.104:6789/0 5 : cluster [INF] mdsmap e1830: 1/1/1 up {0=burnupi21=up:active} 2014-10-28 21:21:26.524753 mon.0 10.214.133.104:6789/0 522 : cluster [INF] mds.0 10.214.134.10:6801/17218 up:active 2014-10-28 21:21:26.524826 mon.0 10.214.133.104:6789/0 523 : cluster [INF] mdsmap e1831: 1/1/1 up {0=burnupi21=up:active} 2014-10-28 21:23:48.427099 mon.0 10.214.133.104:6789/0 1122 : cluster [INF] mds.? 10.214.134.10:6800/48685 up:boot 2014-10-28 21:23:48.427970 mon.0 10.214.133.104:6789/0 1123 : cluster [INF] mdsmap e1832: 0/1/1 up, 1 up:standby, 1 failed 2014-10-28 21:23:48.430976 mon.0 10.214.133.104:6789/0 1124 : cluster [INF] mdsmap e1833: 1/1/1 up {0=burnupi21=up:replay} 2014-10-28 21:23:52.852302 mon.0 10.214.133.104:6789/0 1135 : cluster [INF] mds.0 10.214.134.10:6800/48685 up:reconnect 2014-10-28 21:23:52.852398 mon.0 10.214.133.104:6789/0 1136 : cluster [INF] mdsmap e1834: 1/1/1 up {0=burnupi21=up:reconnect} 2014-10-28 21:23:54.944394 mon.0 10.214.133.104:6789/0 1138 : cluster [INF] mds.0 10.214.134.10:6800/48685 up:rejoin 2014-10-28 21:23:54.944487 mon.0 10.214.133.104:6789/0 1139 : cluster [INF] mdsmap e1835: 1/1/1 up {0=burnupi21=up:rejoin} 2014-10-28 21:26:02.693663 mon.0 10.214.133.104:6789/0 3148 : cluster [INF] mds.? 10.214.134.10:6801/49913 up:boot 2014-10-28 21:26:02.693704 mon.0 10.214.133.104:6789/0 3149 : cluster [INF] mdsmap e1836: 1/1/1 up {0=burnupi21=up:replay} 2014-10-28 21:26:06.293923 mon.0 10.214.133.104:6789/0 3208 : cluster [INF] mds.0 10.214.134.10:6801/49913 up:reconnect 2014-10-28 21:26:06.293979 mon.0 10.214.133.104:6789/0 3209 : cluster [INF] mdsmap e1837: 1/1/1 up {0=burnupi21=up:reconnect} 2014-10-28 21:26:09.361988 mon.0 10.214.133.104:6789/0 3225 : cluster [INF] mds.0 10.214.134.10:6801/49913 up:rejoin 2014-10-28 21:26:09.362069 mon.0 10.214.133.104:6789/0 3226 : cluster [INF] mdsmap e1838: 1/1/1 up {0=burnupi21=up:rejoin} 2014-10-28 21:35:12.659906 mon.0 10.214.133.104:6789/0 3731 : cluster [INF] mds.0 10.214.134.10:6801/49913 up:clientreplay 2014-10-28 21:35:12.659967 mon.0 10.214.133.104:6789/0 3732 : cluster [INF] mdsmap e1839: 1/1/1 up {0=burnupi21=up:clientreplay} 2014-10-28 21:35:30.579366 mon.0 10.214.133.104:6789/0 3759 : cluster [INF] mds.0 10.214.134.10:6801/49913 up:active 2014-10-28 21:35:30.579436 mon.0 10.214.133.104:6789/0 3760 : cluster [INF] mdsmap e1840: 1/1/1 up {0=burnupi21=up:active}
Updated by Sage Weil over 9 years ago
- Priority changed from Normal to Urgent
- Source changed from other to Development
Updated by Zheng Yan over 9 years ago
- Subject changed from Soft lockup in ceph_caps_revoking/ceph_put_fmode to kernel BUG at fs/ceph/caps.c:2307!
The very first error message is:
Oct 28 23:26:17 teuthology kernel: [3457610.583885] kernel BUG at /srv/autobuild-ceph/gitbuilder.git/build/fs/ceph/caps.c:2307! Oct 28 23:26:17 teuthology kernel: [3457610.592106] invalid opcode: 0000 [#1] SMP Oct 28 23:26:17 teuthology kernel: [3457610.596470] Modules linked in: ipmi_devintf(E) ipmi_si(E) ipmi_msghandler(E) ip6table_filter(E) ip6_tables(E) ebtable_nat(E) ebtables(E) ipt_MASQUERADE(E) iptable_nat(E) nf_nat_ipv4(E) nf_nat(E) nf_conntrack_ipv4(E) nf_defrag_ipv4(E) xt_state(E) nf_conntrack(E) ipt_REJECT(E) xt_CHECKSUM(E) iptable_mangle(E) xt_tcpudp(E) iptable_filter(E) ip_tables(E) x_tables(E) bridge(E) stp(E) llc(E) ceph(E) gpio_ich(E) i7core_edac(E) libceph(E) edac_core(E) psmouse(E) serio_raw(E) joydev(E) fscache(E) lpc_ich(E) tpm_infineon(E) tpm_tis(E) xfs(E) lp(E) parport(E) hid_generic(E) usbhid(E) hid(E) btrfs(E) e1000e(E) ptp(E) raid6_pq(E) ahci(E) libahci(E) pps_core(E) arcmsr(E) xor(E) libcrc32c(E) Oct 28 23:26:17 teuthology kernel: [3457610.658400] CPU: 2 PID: 19790 Comm: teuthology Tainted: G E 3.16.3-ceph-00306-g76c7fd1 #1 Oct 28 23:26:17 teuthology kernel: [3457610.667893] Hardware name: Supermicro X8SIL/X8SIL, BIOS 1.1 05/27/2010 Oct 28 23:26:17 teuthology kernel: [3457610.674741] task: ffff8803e1f621f0 ti: ffff8803e4ae4000 task.ti: ffff8803e4ae4000 Oct 28 23:26:17 teuthology kernel: [3457610.682510] RIP: 0010:[<ffffffffa0390f40>] [<ffffffffa0390f40>] ceph_put_wrbuffer_cap_refs+0x120/0x360 [ceph] Oct 28 23:26:17 teuthology kernel: [3457610.692771] RSP: 0018:ffff8803e4ae79c8 EFLAGS: 00010246 Oct 28 23:26:17 teuthology kernel: [3457610.698331] RAX: ffff8800049b5a90 RBX: ffff8800049b5a80 RCX: 0000000089178916 Oct 28 23:26:17 teuthology kernel: [3457610.705724] RDX: ffff8800049b5a90 RSI: 0000000000000001 RDI: ffff8804265b0740 Oct 28 23:26:17 teuthology kernel: [3457610.713111] RBP: ffff8803e4ae7a48 R08: 0000000000000001 R09: 0000000000000000 Oct 28 23:26:17 teuthology kernel: [3457610.720649] R10: 57ffbb806e9fe740 R11: 0000000000000000 R12: ffff8800049b5820 Oct 28 23:26:17 teuthology kernel: [3457610.727993] R13: ffff88025f590a40 R14: ffff8800049b5830 R15: 0000000000000001 Oct 28 23:26:17 teuthology kernel: [3457610.735358] FS: 00007f47587f4700(0000) GS:ffff88043fc80000(0000) knlGS:0000000000000000 Oct 28 23:26:17 teuthology kernel: [3457610.743658] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Oct 28 23:26:17 teuthology kernel: [3457610.749605] CR2: 00000000032fd000 CR3: 000000035890f000 CR4: 00000000000007e0 Oct 28 23:26:17 teuthology kernel: [3457610.756951] Stack: Oct 28 23:26:17 teuthology kernel: [3457610.759160] ffff8803e4ae7a28 ffffffff811688a3 ffffffff81168798 ffffffff00002201 Oct 28 23:26:17 teuthology kernel: [3457610.766939] ffffffff00000024 00ff88025f590a40 ffff880400000001 ffffea0000000001 Oct 28 23:26:17 teuthology kernel: [3457610.774675] ffff8800049b5d50 ffff8800049b5820 ffffea000b9fe740 ffff8800049b5d50 Oct 28 23:26:17 teuthology kernel: [3457610.782525] Call Trace: Oct 28 23:26:17 teuthology kernel: [3457610.785294] [<ffffffff811688a3>] ? test_clear_page_writeback+0x143/0x260 Oct 28 23:26:17 teuthology kernel: [3457610.792429] [<ffffffff81168798>] ? test_clear_page_writeback+0x38/0x260 Oct 28 23:26:17 teuthology kernel: [3457610.799493] [<ffffffffa0389854>] writepage_nounlock+0x2a4/0x450 [ceph] Oct 28 23:26:17 teuthology kernel: [3457610.806314] [<ffffffff81193d60>] ? SyS_msync+0x230/0x230 Oct 28 23:26:17 teuthology kernel: [3457610.811958] [<ffffffffa0389b50>] ceph_update_writeable_page+0x150/0x480 [ceph] Oct 28 23:26:17 teuthology kernel: [3457610.819487] [<ffffffff81166536>] ? wait_for_stable_page+0x16/0x50 Oct 28 23:26:17 teuthology kernel: [3457610.825875] [<ffffffffa0389eec>] ceph_write_begin+0x6c/0xc0 [ceph] Oct 28 23:26:17 teuthology kernel: [3457610.832337] [<ffffffff8115a9e6>] generic_perform_write+0xc6/0x1e0 Oct 28 23:26:17 teuthology kernel: [3457610.839856] [<ffffffffa03856b0>] ceph_write_iter+0x890/0xa30 [ceph] Oct 28 23:26:17 teuthology kernel: [3457610.846480] [<ffffffff817295fb>] ? _raw_spin_unlock+0x2b/0x40 Oct 28 23:26:17 teuthology kernel: [3457610.852507] [<ffffffff81187454>] ? do_wp_page+0x374/0x740 Oct 28 23:26:17 teuthology kernel: [3457610.858215] [<ffffffff81189efc>] ? handle_mm_fault+0x86c/0x1060 Oct 28 23:26:17 teuthology kernel: [3457610.864411] [<ffffffff81046a19>] ? __do_page_fault+0x159/0x580 Oct 28 23:26:17 teuthology kernel: [3457610.870561] [<ffffffff811cdc13>] ? vfs_write+0x1d3/0x1f0 Oct 28 23:26:17 teuthology kernel: [3457610.876205] [<ffffffff811cdc13>] ? vfs_write+0x1d3/0x1f0 Oct 28 23:26:17 teuthology kernel: [3457610.881793] [<ffffffff811ccdcb>] new_sync_write+0x7b/0xb0 Oct 28 23:26:17 teuthology kernel: [3457610.887502] [<ffffffff811cdb07>] vfs_write+0xc7/0x1f0 Oct 28 23:26:17 teuthology kernel: [3457610.892897] [<ffffffff811ce012>] SyS_write+0x52/0xb0 Oct 28 23:26:17 teuthology kernel: [3457610.898139] [<ffffffff81729cd6>] system_call_fastpath+0x1a/0x1f Oct 28 23:26:17 teuthology kernel: [3457610.904357] Code: 0f 85 c5 00 00 00 45 85 ed 74 09 48 8b 7d c0 e8 e7 94 e5 e0 48 83 c4 58 5b 41 5c 41 5d 41 5e 41 5f 5d c3 0f 1f 84 00 00 00 00 00 <0f> 0b 66 0f 1f 44 00 00 4c 89 e7 31 d2 be 02 00 00 00 e8 f9 db Oct 28 23:26:17 teuthology kernel: [3457610.924816] RIP [<ffffffffa0390f40>] ceph_put_wrbuffer_cap_refs+0x120/0x360 [ceph] Oct 28 23:26:17 teuthology kernel: [3457610.932745] RSP <ffff8803e4ae79c8>
Updated by Zheng Yan over 9 years ago
- Status changed from New to 7
fixed by "ceph: introduce global empty snap context"
Updated by Ilya Dryomov over 9 years ago
- Status changed from 7 to Resolved
This patch has been in testing for a while.
Updated by Markus Blank-Burian about 9 years ago
I found the same trace in my logfiles but also a similar with a NULL dereference instead (kernel 3.14.28 stable + some fixes). I guess this is also fixed by this commit?
2015-01-29T11:08:00+01:00 kaa-17 kernel: [1162446.857936] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008 2015-01-29T11:08:00+01:00 kaa-17 kernel: [1162446.862038] IP: [<ffffffffa01bc40b>] ceph_update_writeable_page+0xda/0x319 [ceph] 2015-01-29T11:08:00+01:00 kaa-17 kernel: [1162446.862077] PGD f3aa85067 PUD 1007756067 PMD 0 2015-01-29T11:08:00+01:00 kaa-17 kernel: [1162446.862080] Oops: 0000 [#1] SMP 2015-01-29T11:08:00+01:00 kaa-17 kernel: [1162446.862130] Modules linked in: cbc ceph libceph ipmi_watchdog w83627ehf adm1026 w83795 w83793 hwmon_vid jc42 80 21q garp stp mrp llc autofs4 cpufreq_ondemand xfs ipmi_si ipmi_devintf ipmi_msghandler sr_mod cdrom mgag200 syscopyarea sysfillrect sysimgblt kvm_amd ttm kvm drm_kms_helper usb_storage drm microcode evdev amd64_edac_mod sp5100_tco edac_mce_amd acpi_cpufreq rtc_cmos i2c_piix4 edac_core button k10temp processor psm ouse thermal_sys pcspkr rpcsec_gss_krb5 fuse nfsv4 nfs af_packet hid_generic usbhid hid bonding sd_mod ohci_pci ehci_pci ohci_hcd ehci_hcd usbcore ahci libah ci libata usb_common ipv6 dm_mirror dm_region_hash dm_log dm_mod unix 2015-01-29T11:08:00+01:00 kaa-17 kernel: [1162446.862136] CPU: 10 PID: 20086 Comm: slurmstepd Tainted: P W O 3.14.28-gentoo #2 2015-01-29T11:08:00+01:00 kaa-17 kernel: [1162446.862137] Hardware name: Supermicro H8DGU/H8DGU, BIOS 1.0c 10/14/10 2015-01-29T11:08:00+01:00 kaa-17 kernel: [1162446.862140] task: ffff8810076f98f0 ti: ffff880fd6f50000 task.ti: ffff880fd6f50000 2015-01-29T11:08:00+01:00 kaa-17 kernel: [1162446.862164] RIP: 0010:[<ffffffffa01bc40b>] [<ffffffffa01bc40b>] ceph_update_writeable_page+0xda/0x319 [ceph] 2015-01-29T11:08:00+01:00 kaa-17 kernel: [1162446.862166] RSP: 0018:ffff880fd6f51bb0 EFLAGS: 00010257 2015-01-29T11:08:00+01:00 kaa-17 kernel: [1162446.862167] RAX: 0000000000000000 RBX: ffffea0035f4e700 RCX: 00000000000093c1 2015-01-29T11:08:00+01:00 kaa-17 kernel: [1162446.862169] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff8805234fac90 2015-01-29T11:08:00+01:00 kaa-17 kernel: [1162446.862171] RBP: ffff880fd6f51c38 R08: ffffea0035f4e700 R09: ffff880fd6f51ce8 2015-01-29T11:08:00+01:00 kaa-17 kernel: [1162446.862173] R10: 0000000000001000 R11: ffff880fd6f51e98 R12: ffff88043edb59b0 2015-01-29T11:08:00+01:00 kaa-17 kernel: [1162446.862174] R13: ffff88080a12f060 R14: 0000000000000c74 R15: 0000000000000063 2015-01-29T11:08:00+01:00 kaa-17 kernel: [1162446.862177] FS: 00007fd6b1902700(0000) GS:ffff88101fc80000(0000) knlGS:0000000000000000 2015-01-29T11:08:00+01:00 kaa-17 kernel: [1162446.862179] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 2015-01-29T11:08:00+01:00 kaa-17 kernel: [1162446.862180] CR2: 0000000000000008 CR3: 0000001008329000 CR4: 00000000000007e0 2015-01-29T11:08:00+01:00 kaa-17 kernel: [1162446.862181] Stack: 2015-01-29T11:08:00+01:00 kaa-17 kernel: [1162446.862185] 0000000000000000 0000000000000cd7 0000000000000000 0000000000000063 2015-01-29T11:08:00+01:00 kaa-17 kernel: [1162446.862188] ffff8800da0a1a00 00000cd7010200da ffff8805234fac90 ffffffff810b89e4 2015-01-29T11:08:00+01:00 kaa-17 kernel: [1162446.862191] ffffea0035f4e700 ffff880fd6f51c38 ffffffff810b1cab 0000000000000063 2015-01-29T11:08:00+01:00 kaa-17 kernel: [1162446.862192] Call Trace: 2015-01-29T11:08:00+01:00 kaa-17 kernel: [1162446.862203] [<ffffffff810b89e4>] ? wait_for_stable_page+0xd/0x25 2015-01-29T11:08:00+01:00 kaa-17 kernel: [1162446.862209] [<ffffffff810b1cab>] ? grab_cache_page_write_begin+0xa7/0xb6 2015-01-29T11:08:00+01:00 kaa-17 kernel: [1162446.862230] [<ffffffffa01bc6d7>] ceph_write_begin+0x8d/0xa8 [ceph] 2015-01-29T11:08:00+01:00 kaa-17 kernel: [1162446.862235] [<ffffffff810b1164>] generic_file_buffered_write+0xf7/0x1fe 2015-01-29T11:08:00+01:00 kaa-17 kernel: [1162446.862255] [<ffffffffa01b97ce>] ceph_aio_write+0xac2/0xc01 [ceph] 2015-01-29T11:08:00+01:00 kaa-17 kernel: [1162446.862261] [<ffffffff810cf68f>] ? handle_mm_fault+0x354/0x98e 2015-01-29T11:08:00+01:00 kaa-17 kernel: [1162446.862267] [<ffffffff810febb9>] do_sync_write+0x54/0x73 2015-01-29T11:08:00+01:00 kaa-17 kernel: [1162446.862271] [<ffffffff810ff230>] vfs_write+0xaf/0xf9 2015-01-29T11:08:00+01:00 kaa-17 kernel: [1162446.862274] [<ffffffff810ff942>] SyS_write+0x48/0x73 2015-01-29T11:08:00+01:00 kaa-17 kernel: [1162446.862280] [<ffffffff813cc61b>] tracesys+0xdd/0xe2 2015-01-29T11:08:00+01:00 kaa-17 kernel: [1162446.862307] Code: 90 fe ff ff 0f 84 4b 01 00 00 31 f6 4c 89 e7 e8 fa de ff ff 48 8b 7d a8 48 89 85 78 ff ff ff e8 62 ff ea e0 48 8b 85 78 ff ff ff <48> 8b 50 08 49 39 55 08 48 89 c7 0f 86 d1 00 00 00 e8 7e 54 fc 2015-01-29T11:08:00+01:00 kaa-17 kernel: [1162446.862327] RIP [<ffffffffa01bc40b>] ceph_update_writeable_page+0xda/0x319 [ceph] 2015-01-29T11:08:00+01:00 kaa-17 kernel: [1162446.862327] RSP <ffff880fd6f51bb0> 2015-01-29T11:08:00+01:00 kaa-17 kernel: [1162446.862328] CR2: 0000000000000008 2015-01-29T11:08:00+01:00 kaa-17 kernel: [1162446.862345] ---[ end trace 53dcb85c1131d4f5 ]---