Project

General

Profile

Actions

Bug #9928

closed

kernel BUG at fs/ceph/caps.c:2307!

Added by John Spray over 9 years ago. Updated about 9 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Development
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

Client's view of its operations:

91573685    mds0    create    (unsafe) #10004d513be/mon.b.tgz (teuthology-archive/teuthology-2014-10-27_23:02:01-rgw-giant-distro-basic-multi/575474/data/mon.b.tgz)
92401836    mds0    readdir     #10004d490dc
92401837    mds0    getattr     #1
92401838    mds0    getattr     #1
92401839    mds0    getattr     #1
92401840    mds0    getattr     #1
92401841    mds0    getattr     #1
92401842    mds0    getattr     #1
92401843    mds0    getattr     #1
92401844    mds0    getattr     #1
92401845    mds0    getattr     #1
92401846    mds0    getattr     #1
92401847    mds0    getattr     #1
92401848    mds0    getattr     #1
92401849    mds0    getattr     #1
92401850    mds0    getattr     #1
92401851    mds0    getattr     #1
92401852    mds0    getattr     #1
92401853    mds0    getattr     #1
92401854    mds0    getattr     #1
92401855    mds0    getattr     #1
92401856    mds0    getattr     #1
92401857    mds0    getattr     #1
92401858    mds0    getattr     #10004d46cba
92401859    mds0    getattr     #1
92401860    mds0    getattr     #1
92401861    mds0    getattr     #1
92401862    mds0    getattr     #1
92401863    mds0    getattr     #1
...

Diffed that list over one minute and an extra getattr had appeared, so those were steadily accumulating presumably.

Client (teuthology) became unresponsive to SSH login, and soft lockup traces were being printed to the serial console (see attached).


Files

client_console.log (371 KB) client_console.log John Spray, 10/29/2014 07:55 AM
Actions #1

Updated by John Spray over 9 years ago

MDS cache dump at ~/jcsp/9928/cachedump.1870.mds0 on teuthology.

This was taken at around 0800 local, long after the MDS had evicted the client session for the bad client, so probably isn't all that useful.

2014-10-28 23:31:39.133091 7f0113b71700 0 log_channel(default) log [INF] : closing stale session client.2755727 10.214.137.23:0/3139049599 after 303.924754
<< some 500000 lines of log from removing each client cap >>
2014-10-28 23:31:45.106346 7f0114473700 10 mds.0.sessionmap remove_session s=0x59195dc0 name=client.2755727

The slice of the MDS log immediately leading up to the loss of contact is ~/jcsp/9928/mds.log. Last few responses are:

2014-10-28 23:26:42.411135 7f0116c78700 10 mds.0.server reply_request 0 ((0) Success) client_request(client.2755727:92401830 lookup #10004d490dc/575409 2014-10-28 23:26:42.000000) v2
2014-10-28 23:26:42.411981 7f0116c78700 10 mds.0.server reply_request 0 ((0) Success) client_request(client.2755727:92401831 lookup #10004d490dc/575410 2014-10-28 23:26:42.000000) v2
2014-10-28 23:26:42.412862 7f0116c78700 10 mds.0.server reply_request 0 ((0) Success) client_request(client.2755727:92401832 lookup #10004d490dc/575411 2014-10-28 23:26:42.000000) v2
2014-10-28 23:26:42.413707 7f0116c78700 10 mds.0.server reply_request 0 ((0) Success) client_request(client.2755727:92401833 lookup #10004d490dc/575412 2014-10-28 23:26:42.000000) v2
2014-10-28 23:26:42.414580 7f0116c78700 10 mds.0.server reply_request 0 ((0) Success) client_request(client.2755727:92401834 lookup #10004d490dc/575413 2014-10-28 23:26:42.000000) v2
2014-10-28 23:26:42.415427 7f0116c78700 10 mds.0.server reply_request 0 ((0) Success) client_request(client.2755727:92401835 lookup #10004d490dc/575414 2014-10-28 23:26:42.000000) v2
2014-10-28 23:27:37.338019 7f0113b71700 10 mds.0.server reply_request 0 ((0) Success) client_request(client.2755727:92401836 readdir #10004d490dc 2014-10-28 23:26:52.000000) v2

Actions #2

Updated by Sage Weil over 9 years ago

2014-10-28 21:21:25.394832 mon.0 10.214.133.104:6789/0 5 : cluster [INF] mdsmap e1830: 1/1/1 up {0=burnupi21=up:active}
2014-10-28 21:21:26.524753 mon.0 10.214.133.104:6789/0 522 : cluster [INF] mds.0 10.214.134.10:6801/17218 up:active
2014-10-28 21:21:26.524826 mon.0 10.214.133.104:6789/0 523 : cluster [INF] mdsmap e1831: 1/1/1 up {0=burnupi21=up:active}
2014-10-28 21:23:48.427099 mon.0 10.214.133.104:6789/0 1122 : cluster [INF] mds.? 10.214.134.10:6800/48685 up:boot
2014-10-28 21:23:48.427970 mon.0 10.214.133.104:6789/0 1123 : cluster [INF] mdsmap e1832: 0/1/1 up, 1 up:standby, 1 failed
2014-10-28 21:23:48.430976 mon.0 10.214.133.104:6789/0 1124 : cluster [INF] mdsmap e1833: 1/1/1 up {0=burnupi21=up:replay}
2014-10-28 21:23:52.852302 mon.0 10.214.133.104:6789/0 1135 : cluster [INF] mds.0 10.214.134.10:6800/48685 up:reconnect
2014-10-28 21:23:52.852398 mon.0 10.214.133.104:6789/0 1136 : cluster [INF] mdsmap e1834: 1/1/1 up {0=burnupi21=up:reconnect}
2014-10-28 21:23:54.944394 mon.0 10.214.133.104:6789/0 1138 : cluster [INF] mds.0 10.214.134.10:6800/48685 up:rejoin
2014-10-28 21:23:54.944487 mon.0 10.214.133.104:6789/0 1139 : cluster [INF] mdsmap e1835: 1/1/1 up {0=burnupi21=up:rejoin}
2014-10-28 21:26:02.693663 mon.0 10.214.133.104:6789/0 3148 : cluster [INF] mds.? 10.214.134.10:6801/49913 up:boot
2014-10-28 21:26:02.693704 mon.0 10.214.133.104:6789/0 3149 : cluster [INF] mdsmap e1836: 1/1/1 up {0=burnupi21=up:replay}
2014-10-28 21:26:06.293923 mon.0 10.214.133.104:6789/0 3208 : cluster [INF] mds.0 10.214.134.10:6801/49913 up:reconnect
2014-10-28 21:26:06.293979 mon.0 10.214.133.104:6789/0 3209 : cluster [INF] mdsmap e1837: 1/1/1 up {0=burnupi21=up:reconnect}
2014-10-28 21:26:09.361988 mon.0 10.214.133.104:6789/0 3225 : cluster [INF] mds.0 10.214.134.10:6801/49913 up:rejoin
2014-10-28 21:26:09.362069 mon.0 10.214.133.104:6789/0 3226 : cluster [INF] mdsmap e1838: 1/1/1 up {0=burnupi21=up:rejoin}
2014-10-28 21:35:12.659906 mon.0 10.214.133.104:6789/0 3731 : cluster [INF] mds.0 10.214.134.10:6801/49913 up:clientreplay
2014-10-28 21:35:12.659967 mon.0 10.214.133.104:6789/0 3732 : cluster [INF] mdsmap e1839: 1/1/1 up {0=burnupi21=up:clientreplay}
2014-10-28 21:35:30.579366 mon.0 10.214.133.104:6789/0 3759 : cluster [INF] mds.0 10.214.134.10:6801/49913 up:active
2014-10-28 21:35:30.579436 mon.0 10.214.133.104:6789/0 3760 : cluster [INF] mdsmap e1840: 1/1/1 up {0=burnupi21=up:active}

Actions #3

Updated by Sage Weil over 9 years ago

  • Priority changed from Normal to Urgent
  • Source changed from other to Development
Actions #4

Updated by Zheng Yan over 9 years ago

  • Subject changed from Soft lockup in ceph_caps_revoking/ceph_put_fmode to kernel BUG at fs/ceph/caps.c:2307!

The very first error message is:

Oct 28 23:26:17 teuthology kernel: [3457610.583885] kernel BUG at /srv/autobuild-ceph/gitbuilder.git/build/fs/ceph/caps.c:2307!
Oct 28 23:26:17 teuthology kernel: [3457610.592106] invalid opcode: 0000 [#1] SMP
Oct 28 23:26:17 teuthology kernel: [3457610.596470] Modules linked in: ipmi_devintf(E) ipmi_si(E) ipmi_msghandler(E) ip6table_filter(E) ip6_tables(E) ebtable_nat(E) ebtables(E) ipt_MASQUERADE(E) iptable_nat(E) nf_nat_ipv4(E) nf_nat(E) nf_conntrack_ipv4(E) nf_defrag_ipv4(E) xt_state(E) nf_conntrack(E) ipt_REJECT(E) xt_CHECKSUM(E) iptable_mangle(E) xt_tcpudp(E) iptable_filter(E) ip_tables(E) x_tables(E) bridge(E) stp(E) llc(E) ceph(E) gpio_ich(E) i7core_edac(E) libceph(E) edac_core(E) psmouse(E) serio_raw(E) joydev(E) fscache(E) lpc_ich(E) tpm_infineon(E) tpm_tis(E) xfs(E) lp(E) parport(E) hid_generic(E) usbhid(E) hid(E) btrfs(E) e1000e(E) ptp(E) raid6_pq(E) ahci(E) libahci(E) pps_core(E) arcmsr(E) xor(E) libcrc32c(E)
Oct 28 23:26:17 teuthology kernel: [3457610.658400] CPU: 2 PID: 19790 Comm: teuthology Tainted: G            E 3.16.3-ceph-00306-g76c7fd1 #1
Oct 28 23:26:17 teuthology kernel: [3457610.667893] Hardware name: Supermicro X8SIL/X8SIL, BIOS 1.1 05/27/2010
Oct 28 23:26:17 teuthology kernel: [3457610.674741] task: ffff8803e1f621f0 ti: ffff8803e4ae4000 task.ti: ffff8803e4ae4000
Oct 28 23:26:17 teuthology kernel: [3457610.682510] RIP: 0010:[<ffffffffa0390f40>]  [<ffffffffa0390f40>] ceph_put_wrbuffer_cap_refs+0x120/0x360 [ceph]
Oct 28 23:26:17 teuthology kernel: [3457610.692771] RSP: 0018:ffff8803e4ae79c8  EFLAGS: 00010246
Oct 28 23:26:17 teuthology kernel: [3457610.698331] RAX: ffff8800049b5a90 RBX: ffff8800049b5a80 RCX: 0000000089178916
Oct 28 23:26:17 teuthology kernel: [3457610.705724] RDX: ffff8800049b5a90 RSI: 0000000000000001 RDI: ffff8804265b0740
Oct 28 23:26:17 teuthology kernel: [3457610.713111] RBP: ffff8803e4ae7a48 R08: 0000000000000001 R09: 0000000000000000
Oct 28 23:26:17 teuthology kernel: [3457610.720649] R10: 57ffbb806e9fe740 R11: 0000000000000000 R12: ffff8800049b5820
Oct 28 23:26:17 teuthology kernel: [3457610.727993] R13: ffff88025f590a40 R14: ffff8800049b5830 R15: 0000000000000001
Oct 28 23:26:17 teuthology kernel: [3457610.735358] FS:  00007f47587f4700(0000) GS:ffff88043fc80000(0000) knlGS:0000000000000000
Oct 28 23:26:17 teuthology kernel: [3457610.743658] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 28 23:26:17 teuthology kernel: [3457610.749605] CR2: 00000000032fd000 CR3: 000000035890f000 CR4: 00000000000007e0
Oct 28 23:26:17 teuthology kernel: [3457610.756951] Stack:
Oct 28 23:26:17 teuthology kernel: [3457610.759160]  ffff8803e4ae7a28 ffffffff811688a3 ffffffff81168798 ffffffff00002201
Oct 28 23:26:17 teuthology kernel: [3457610.766939]  ffffffff00000024 00ff88025f590a40 ffff880400000001 ffffea0000000001
Oct 28 23:26:17 teuthology kernel: [3457610.774675]  ffff8800049b5d50 ffff8800049b5820 ffffea000b9fe740 ffff8800049b5d50
Oct 28 23:26:17 teuthology kernel: [3457610.782525] Call Trace:
Oct 28 23:26:17 teuthology kernel: [3457610.785294]  [<ffffffff811688a3>] ? test_clear_page_writeback+0x143/0x260
Oct 28 23:26:17 teuthology kernel: [3457610.792429]  [<ffffffff81168798>] ? test_clear_page_writeback+0x38/0x260
Oct 28 23:26:17 teuthology kernel: [3457610.799493]  [<ffffffffa0389854>] writepage_nounlock+0x2a4/0x450 [ceph]
Oct 28 23:26:17 teuthology kernel: [3457610.806314]  [<ffffffff81193d60>] ? SyS_msync+0x230/0x230
Oct 28 23:26:17 teuthology kernel: [3457610.811958]  [<ffffffffa0389b50>] ceph_update_writeable_page+0x150/0x480 [ceph]
Oct 28 23:26:17 teuthology kernel: [3457610.819487]  [<ffffffff81166536>] ? wait_for_stable_page+0x16/0x50
Oct 28 23:26:17 teuthology kernel: [3457610.825875]  [<ffffffffa0389eec>] ceph_write_begin+0x6c/0xc0 [ceph]
Oct 28 23:26:17 teuthology kernel: [3457610.832337]  [<ffffffff8115a9e6>] generic_perform_write+0xc6/0x1e0
Oct 28 23:26:17 teuthology kernel: [3457610.839856]  [<ffffffffa03856b0>] ceph_write_iter+0x890/0xa30 [ceph]
Oct 28 23:26:17 teuthology kernel: [3457610.846480]  [<ffffffff817295fb>] ? _raw_spin_unlock+0x2b/0x40
Oct 28 23:26:17 teuthology kernel: [3457610.852507]  [<ffffffff81187454>] ? do_wp_page+0x374/0x740
Oct 28 23:26:17 teuthology kernel: [3457610.858215]  [<ffffffff81189efc>] ? handle_mm_fault+0x86c/0x1060
Oct 28 23:26:17 teuthology kernel: [3457610.864411]  [<ffffffff81046a19>] ? __do_page_fault+0x159/0x580
Oct 28 23:26:17 teuthology kernel: [3457610.870561]  [<ffffffff811cdc13>] ? vfs_write+0x1d3/0x1f0
Oct 28 23:26:17 teuthology kernel: [3457610.876205]  [<ffffffff811cdc13>] ? vfs_write+0x1d3/0x1f0
Oct 28 23:26:17 teuthology kernel: [3457610.881793]  [<ffffffff811ccdcb>] new_sync_write+0x7b/0xb0
Oct 28 23:26:17 teuthology kernel: [3457610.887502]  [<ffffffff811cdb07>] vfs_write+0xc7/0x1f0
Oct 28 23:26:17 teuthology kernel: [3457610.892897]  [<ffffffff811ce012>] SyS_write+0x52/0xb0
Oct 28 23:26:17 teuthology kernel: [3457610.898139]  [<ffffffff81729cd6>] system_call_fastpath+0x1a/0x1f
Oct 28 23:26:17 teuthology kernel: [3457610.904357] Code: 0f 85 c5 00 00 00 45 85 ed 74 09 48 8b 7d c0 e8 e7 94 e5 e0 48 83 c4 58 5b 41 5c 41 5d 41 5e 41 5f 5d c3 0f 1f 84 00 00 00 00 00 <0f> 0b 66 0f 1f 44 00 00 4c 89 e7 31 d2 be 02 00 00 00 e8 f9 db
Oct 28 23:26:17 teuthology kernel: [3457610.924816] RIP  [<ffffffffa0390f40>] ceph_put_wrbuffer_cap_refs+0x120/0x360 [ceph]
Oct 28 23:26:17 teuthology kernel: [3457610.932745]  RSP <ffff8803e4ae79c8>
Actions #5

Updated by Sage Weil over 9 years ago

  • Assignee set to Zheng Yan
Actions #6

Updated by Zheng Yan over 9 years ago

  • Status changed from New to 7

fixed by "ceph: introduce global empty snap context"

Actions #7

Updated by Ilya Dryomov over 9 years ago

  • Status changed from 7 to Resolved

This patch has been in testing for a while.

Actions #8

Updated by Markus Blank-Burian about 9 years ago

I found the same trace in my logfiles but also a similar with a NULL dereference instead (kernel 3.14.28 stable + some fixes). I guess this is also fixed by this commit?

2015-01-29T11:08:00+01:00 kaa-17 kernel: [1162446.857936] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
2015-01-29T11:08:00+01:00 kaa-17 kernel: [1162446.862038] IP: [<ffffffffa01bc40b>] ceph_update_writeable_page+0xda/0x319 [ceph]
2015-01-29T11:08:00+01:00 kaa-17 kernel: [1162446.862077] PGD f3aa85067 PUD 1007756067 PMD 0
2015-01-29T11:08:00+01:00 kaa-17 kernel: [1162446.862080] Oops: 0000 [#1] SMP
2015-01-29T11:08:00+01:00 kaa-17 kernel: [1162446.862130] Modules linked in: cbc ceph libceph ipmi_watchdog w83627ehf adm1026 w83795 w83793 hwmon_vid jc42 80
21q garp stp mrp llc autofs4 cpufreq_ondemand xfs ipmi_si ipmi_devintf ipmi_msghandler sr_mod cdrom mgag200 syscopyarea sysfillrect sysimgblt kvm_amd ttm kvm
 drm_kms_helper usb_storage drm microcode evdev amd64_edac_mod sp5100_tco edac_mce_amd acpi_cpufreq rtc_cmos i2c_piix4 edac_core button k10temp processor psm
ouse thermal_sys pcspkr rpcsec_gss_krb5 fuse nfsv4 nfs af_packet hid_generic usbhid hid bonding sd_mod ohci_pci ehci_pci ohci_hcd ehci_hcd usbcore ahci libah
ci libata usb_common ipv6 dm_mirror dm_region_hash dm_log dm_mod unix
2015-01-29T11:08:00+01:00 kaa-17 kernel: [1162446.862136] CPU: 10 PID: 20086 Comm: slurmstepd Tainted: P        W  O 3.14.28-gentoo #2
2015-01-29T11:08:00+01:00 kaa-17 kernel: [1162446.862137] Hardware name: Supermicro H8DGU/H8DGU, BIOS 1.0c       10/14/10
2015-01-29T11:08:00+01:00 kaa-17 kernel: [1162446.862140] task: ffff8810076f98f0 ti: ffff880fd6f50000 task.ti: ffff880fd6f50000
2015-01-29T11:08:00+01:00 kaa-17 kernel: [1162446.862164] RIP: 0010:[<ffffffffa01bc40b>]  [<ffffffffa01bc40b>] ceph_update_writeable_page+0xda/0x319 [ceph]
2015-01-29T11:08:00+01:00 kaa-17 kernel: [1162446.862166] RSP: 0018:ffff880fd6f51bb0  EFLAGS: 00010257
2015-01-29T11:08:00+01:00 kaa-17 kernel: [1162446.862167] RAX: 0000000000000000 RBX: ffffea0035f4e700 RCX: 00000000000093c1
2015-01-29T11:08:00+01:00 kaa-17 kernel: [1162446.862169] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff8805234fac90
2015-01-29T11:08:00+01:00 kaa-17 kernel: [1162446.862171] RBP: ffff880fd6f51c38 R08: ffffea0035f4e700 R09: ffff880fd6f51ce8
2015-01-29T11:08:00+01:00 kaa-17 kernel: [1162446.862173] R10: 0000000000001000 R11: ffff880fd6f51e98 R12: ffff88043edb59b0
2015-01-29T11:08:00+01:00 kaa-17 kernel: [1162446.862174] R13: ffff88080a12f060 R14: 0000000000000c74 R15: 0000000000000063
2015-01-29T11:08:00+01:00 kaa-17 kernel: [1162446.862177] FS:  00007fd6b1902700(0000) GS:ffff88101fc80000(0000) knlGS:0000000000000000
2015-01-29T11:08:00+01:00 kaa-17 kernel: [1162446.862179] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2015-01-29T11:08:00+01:00 kaa-17 kernel: [1162446.862180] CR2: 0000000000000008 CR3: 0000001008329000 CR4: 00000000000007e0
2015-01-29T11:08:00+01:00 kaa-17 kernel: [1162446.862181] Stack:
2015-01-29T11:08:00+01:00 kaa-17 kernel: [1162446.862185]  0000000000000000 0000000000000cd7 0000000000000000 0000000000000063
2015-01-29T11:08:00+01:00 kaa-17 kernel: [1162446.862188]  ffff8800da0a1a00 00000cd7010200da ffff8805234fac90 ffffffff810b89e4
2015-01-29T11:08:00+01:00 kaa-17 kernel: [1162446.862191]  ffffea0035f4e700 ffff880fd6f51c38 ffffffff810b1cab 0000000000000063
2015-01-29T11:08:00+01:00 kaa-17 kernel: [1162446.862192] Call Trace:
2015-01-29T11:08:00+01:00 kaa-17 kernel: [1162446.862203]  [<ffffffff810b89e4>] ? wait_for_stable_page+0xd/0x25
2015-01-29T11:08:00+01:00 kaa-17 kernel: [1162446.862209]  [<ffffffff810b1cab>] ? grab_cache_page_write_begin+0xa7/0xb6
2015-01-29T11:08:00+01:00 kaa-17 kernel: [1162446.862230]  [<ffffffffa01bc6d7>] ceph_write_begin+0x8d/0xa8 [ceph]
2015-01-29T11:08:00+01:00 kaa-17 kernel: [1162446.862235]  [<ffffffff810b1164>] generic_file_buffered_write+0xf7/0x1fe
2015-01-29T11:08:00+01:00 kaa-17 kernel: [1162446.862255]  [<ffffffffa01b97ce>] ceph_aio_write+0xac2/0xc01 [ceph]
2015-01-29T11:08:00+01:00 kaa-17 kernel: [1162446.862261]  [<ffffffff810cf68f>] ? handle_mm_fault+0x354/0x98e
2015-01-29T11:08:00+01:00 kaa-17 kernel: [1162446.862267]  [<ffffffff810febb9>] do_sync_write+0x54/0x73
2015-01-29T11:08:00+01:00 kaa-17 kernel: [1162446.862271]  [<ffffffff810ff230>] vfs_write+0xaf/0xf9
2015-01-29T11:08:00+01:00 kaa-17 kernel: [1162446.862274]  [<ffffffff810ff942>] SyS_write+0x48/0x73
2015-01-29T11:08:00+01:00 kaa-17 kernel: [1162446.862280]  [<ffffffff813cc61b>] tracesys+0xdd/0xe2
2015-01-29T11:08:00+01:00 kaa-17 kernel: [1162446.862307] Code: 90 fe ff ff 0f 84 4b 01 00 00 31 f6 4c 89 e7 e8 fa de ff ff 48 8b 7d a8 48 89 85 78 ff ff ff
e8 62 ff ea e0 48 8b 85 78 ff ff ff <48> 8b 50 08 49 39 55 08 48 89 c7 0f 86 d1 00 00 00 e8 7e 54 fc
2015-01-29T11:08:00+01:00 kaa-17 kernel: [1162446.862327] RIP  [<ffffffffa01bc40b>] ceph_update_writeable_page+0xda/0x319 [ceph]
2015-01-29T11:08:00+01:00 kaa-17 kernel: [1162446.862327]  RSP <ffff880fd6f51bb0>
2015-01-29T11:08:00+01:00 kaa-17 kernel: [1162446.862328] CR2: 0000000000000008
2015-01-29T11:08:00+01:00 kaa-17 kernel: [1162446.862345] ---[ end trace 53dcb85c1131d4f5 ]---

Actions

Also available in: Atom PDF