
Bug #41519

kclient: crashed due to dentry still in use on umount

Added by Erqi Chen over 4 years ago. Updated almost 4 years ago.

Status: Rejected
Priority: Normal
Assignee:
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

A BUG() was triggered during cephfs umount because a dentry was still in use.
There are MDS session reconnect messages in the log history (see below).

The kernel version is 3.10.0-693.5.2.el7.x86_64, with the latest net/ceph and fs/ceph patches backported from linux-3.10.0-957.27.2.el7.

fs/dcache.c:
static void shrink_dcache_for_umount_subtree(struct dentry *dentry)
{
        /* at umount time every dentry in the subtree must have a zero refcount */
        if (dentry->d_lockref.count != 0) {
                printk(KERN_ERR
                       "BUG: Dentry %p{i=%lx,n=%s}"
                       " still in use (%d)"
                       " [unmount of %s %s]\n",
                       dentry,
                       dentry->d_inode ?
                       dentry->d_inode->i_ino : 0UL,
                       dentry->d_name.name,
                       dentry->d_lockref.count,
                       dentry->d_sb->s_type->name,
                       dentry->d_sb->s_id);
                BUG();
        }
}
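
For illustration only (this is not code from the ceph client): any path that takes a dentry reference and never drops it leaves d_lockref.count elevated, which is exactly the state the check above catches at umount. A minimal hypothetical pattern:

/*
 * Hypothetical example of the failure mode -- not taken from fs/ceph.
 * An unbalanced dget() leaves d_lockref.count > 0, so the check in
 * shrink_dcache_for_umount_subtree() fires when the fs is unmounted.
 */
static struct dentry *cached_dentry;

static void cache_dentry(struct dentry *d)
{
        cached_dentry = dget(d);        /* takes a reference: count goes up */
}

static void forget_cached_dentry(void)
{
        /* bug pattern: dropping the pointer without dput() leaks the count */
        cached_dentry = NULL;           /* correct cleanup: dput(cached_dentry) first */
}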

[4034502.618100] docker0: port 3(vetha87bc74) entered disabled state
[4034502.650797] docker0: port 3(vetha87bc74) entered disabled state
[4034502.651642] device vetha87bc74 left promiscuous mode
[4034502.652228] docker0: port 3(vetha87bc74) entered disabled state
[4034504.876161] BUG: Dentry ffff882e0fa38000{i=1008b32ff3c,n=images} still in use (105) [unmount of ceph ceph]
[4034504.876856] ------------[ cut here ]------------
[4034504.877453] kernel BUG at fs/dcache.c:970!
[4034504.878036] invalid opcode: 0000 [#1] SMP
[4034504.878613] Modules linked in: ipmi_watchdog ipmi_poweroff tcp_diag inet_diag nvidia_modeset(POE) nvidia_uvm(OE) veth ceph libceph dns_resolver nvidia(POE) br_netfilter bridge stp llc overlay(T) nf_conntrack_netlink xt_statistic xt_nat ipt_MASQUERADE nf_nat_masquerade_ipv4 ipt_REJECT nf_reject_ipv4 xt_mark vxlan xt_comment xt_conntrack ip6_udp_tunnel udp_tunnel ip_set nfnetlink iptable_filter xt_addrtype iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 ip_vs_sh ip_vs_wrr ip_vs_rr ip_vs nf_conntrack binfmt_misc xfs libcrc32c sb_edac edac_core intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd iTCO_wdt iTCO_vendor_support ipmi_si ipmi_devintf ioatdma sg pcspkr ipmi_msghandler mei_me
[4034504.882626] acpi_pad shpchp lpc_ich mei i2c_i801 acpi_power_meter ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic ast drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm mxm_wmi drm ixgbe igb ahci libahci crct10dif_pclmul crct10dif_common mdio libata crc32c_intel ptp pps_core i2c_algo_bit megaraid_sas dca i2c_core wmi
[4034504.885644] CPU: 1 PID: 18757 Comm: umount Tainted: P OE ------------ T 3.10.0-1.1.0.el7.x86_64 #1
[4034504.886443] Hardware name: Supermicro SYS-4028GR-TR2/X10DRG-O+-CPU, BIOS 2.0a 04/24/2017
[4034504.887233] task: ffff883df573af70 ti: ffff8834925ec000 task.ti: ffff8834925ec000
[4034504.888033] RIP: 0010:[<ffffffff8120abac>] [<ffffffff8120abac>] shrink_dcache_for_umount_subtree+0x1bc/0x1d0
[4034504.888867] RSP: 0018:ffff8834925efdf0 EFLAGS: 00010246
[4034504.889691] RAX: 000000000000005e RBX: ffff882e0fa38000 RCX: 0000000000000000
[4034504.890521] RDX: 0000000000000000 RSI: ffff881fff64f438 RDI: ffff881fff64f438
[4034504.891347] RBP: ffff8834925efe10 R08: 0000000000000000 R09: ffff881ff90cef00
[4034504.892180] R10: 0000000000004dd7 R11: 0000000000000001 R12: ffff882e0fa38000
[4034504.893008] R13: 0000000000000311 R14: ffffffff81e4f950 R15: ffff883df573b740
[4034504.893833] FS: 00007feb69f1a880(0000) GS:ffff881fff640000(0000) knlGS:0000000000000000
[4034504.894662] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[4034504.895485] CR2: 00007feb69adc074 CR3: 0000000f316c6000 CR4: 00000000003407e0
[4034504.896322] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[4034504.897163] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[4034504.898006] Stack:
[4034504.898839] ffff883444b1db20 ffff883444b1d800 ffff883444b1d800 ffffffffc074f260
[4034504.899694] ffff8834925efe28 ffffffff8120c44f ffff883444b1d800 ffff8834925efe50
[4034504.900544] ffffffff811f50e1 0000000000000051 ffff883444b1d800 ffff883df573af70
[4034504.901396] Call Trace:
[4034504.902254] [<ffffffff8120c44f>] shrink_dcache_for_umount+0x2f/0x60
[4034504.903110] [<ffffffff811f50e1>] generic_shutdown_super+0x21/0x100
[4034504.903957] [<ffffffff811f5502>] kill_anon_super+0x12/0x20
[4034504.904814] [<ffffffffc0720130>] ceph_kill_sb+0x30/0x80 [ceph]
[4034504.905665] [<ffffffff811f58b9>] deactivate_locked_super+0x49/0x60
[4034504.906513] [<ffffffff811f6026>] deactivate_super+0x46/0x60
[4034504.907353] [<ffffffff81212fff>] cleanup_mnt+0x3f/0x80
[4034504.908185] [<ffffffff81213092>] __cleanup_mnt+0x12/0x20
[4034504.909009] [<ffffffff810a67c5>] task_work_run+0xc5/0xf0
[4034504.909836] [<ffffffff8102ab62>] do_notify_resume+0x92/0xb0
[4034504.910664] [<ffffffff816931bd>] int_signal+0x12/0x17
[4034504.911497] Code: 00 00 48 8b 40 28 4c 8b 08 48 8b 43 30 48 85 c0 74 1b 48 8b 50 40 48 89 34 24 48 c7 c7 c8 4e 8e 81 48 89 de 31 c0 e8 d6 11 47 00 <0f> 0b 31 d2 eb e5 0f 0b 66 90 66 2e 0f 1f 84 00 00 00 00 00 0f
[4034504.913292] RIP [<ffffffff8120abac>] shrink_dcache_for_umount_subtree+0x1bc/0x1d0
[4034504.914177] RSP <ffff8834925efdf0>

[3176288.274928] libceph: client1611665 fsid 9932812f-7fbd-49d5-8233-c9ca90bfafa6
[3192219.180200] libceph: mon2 10.48.178.47:6789 socket closed (con state CONNECTING)
[3192219.722315] libceph: mon2 10.48.178.47:6789 socket closed (con state CONNECTING)
[3192220.752249] libceph: mon2 10.48.178.47:6789 socket closed (con state CONNECTING)
[3192222.816147] libceph: mon2 10.48.178.47:6789 socket closed (con state CONNECTING)
[3192222.864963] libceph: mon1 10.48.175.48:6789 session established
[3192222.867586] libceph: client2731459 fsid de0707c1-402e-4e63-8144-4ccd7f2d8204
[3196058.342902] libceph: mon2 10.48.156.148:6789 session established
[3196058.344410] libceph: client1615574 fsid 9932812f-7fbd-49d5-8233-c9ca90bfafa6
[3196077.633893] libceph: osd246 down
[3196077.635152] libceph: osd246 up
[3199117.829434] libceph: mon0 10.48.154.172:6789 session established
[3199117.832125] libceph: client1620912 fsid 9932812f-7fbd-49d5-8233-c9ca90bfafa6
[3257221.415213] ceph: mds0 reconnect start
[3257221.437386] ceph: mds0 reconnect denied
[3257221.437986] ceph: dropping dirty Fw state for ffff88261aa00e60 1101375001894
[3257276.925737] libceph: mds0 10.48.175.48:6800 socket closed (con state OPEN)
[3257323.780682] ceph: mds0 reconnect start
[3257323.805229] ceph: mds0 reconnect denied
[3257323.805821] ceph: dropping dirty+flushing Fw state for ffff88261aa00e60 1101375001894
[3261747.432203] libceph: mon1 10.48.155.148:6789 session established
[3261747.433206] libceph: mon1 10.48.175.48:6789 session established
[3261747.434054] libceph: client1619131 fsid 9932812f-7fbd-49d5-8233-c9ca90bfafa6
[3261747.435783] libceph: client2741092 fsid de0707c1-402e-4e63-8144-4ccd7f2d8204
[3267877.754541] ceph: mds0 caps stale
[3267897.754518] ceph: mds0 caps stale
[3267923.101182] libceph: mds0 10.48.154.172:6801 socket closed (con state OPEN)
[3267923.718610] libceph: mds0 10.48.154.172:6801 socket closed (con state CONNECTING)
[3267924.742677] libceph: wrong peer, want 10.48.154.172:6801/-2077849026, got 10.48.154.172:6801/772766424
[3267924.743419] libceph: mds0 10.48.154.172:6801 wrong peer at address
[3267926.650687] libceph: wrong peer, want 10.48.154.172:6801/-2077849026, got 10.48.154.172:6801/772766424
[3267926.651363] libceph: mds0 10.48.154.172:6801 wrong peer at address
[3267932.483401] ceph: mds0 reconnect start
[3267932.678877] ceph: mds0 reconnect denied
[3267932.679556] ceph: dropping unsafe request 2322621
[3267932.680172] ceph: dropping unsafe request 2322622
[3267932.680785] ceph: dropping unsafe request 2322623
[3267932.681366] ceph: dropping unsafe request 2322624
[3267932.681936] ceph: dropping unsafe request 2322625
[3267932.682502] ceph: dropping unsafe request 2322626
[3267932.683059] ceph: dropping unsafe request 2322627
[3267932.683610] ceph: dropping unsafe request 2322628
[3267932.684141] ceph: dropping unsafe request 2322629
[3267932.684663] ceph: dropping unsafe request 2322630
[3267932.685759] ceph: dropping dirty+flushing Fw state for ffff8832e3adbae0 1101797044822
[3267932.686293] ceph: dropping dirty+flushing Fw state for ffff8832e3adc070 1101797044823
[3267932.686829] ceph: dropping dirty+flushing Fw state for ffff8832e3adc600 1101797044824
[3267932.687347] ceph: dropping dirty+flushing Fw state for ffff8835c12e13f0 1101797044825
[3267932.687917] ceph: dropping dirty+flushing Fw state for ffff8835c12e1980 1101797044826
[3267932.688428] ceph: dropping dirty+flushing Fw state for ffff8835c12e1f10 1101797044827
[3267932.688953] ceph: dropping dirty+flushing Fw state for ffff8835c12e24a0 1101797044828
[3267932.689445] ceph: dropping dirty+flushing Fw state for ffff8829edb9a4a0 1101797044829
[3267932.689933] ceph: dropping dirty+flushing Fw state for ffff8821f073aa30 1101797044830
[3268977.753104] ceph: mds1 caps stale
[3268997.753093] ceph: mds1 caps stale
[3269152.773911] ceph: mds1 reconnect start
[3269152.784220] ceph: mds1 reconnect denied

History

#1 Updated by Jeff Layton over 4 years ago

This is not something I've seen.

With the lockref count being so high, I have to wonder if the dentry is just corrupt, though it could be that we have a dentry reference leak in an error codepath of some sort. The d_name seems to be legit, though. Mostly, we clean up dentry references when the last reference to an MDS request is put, so it's possible that we have an MDS request reference leak somewhere too.

What you may need to do is get a coredump when this occurs and analyze it. Determine what state the dentry and inode are in, and whether there might be MDS requests that still hold outstanding references to them.
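
For reference, the dentry references held by an MDS request are only dropped when the request's last reference is put. A rough sketch of that cleanup path, paraphrased from fs/ceph/mds_client.c of that era (field names may differ slightly in this RHEL-based kernel):

void ceph_mdsc_release_request(struct kref *kref)
{
        struct ceph_mds_request *req =
                container_of(kref, struct ceph_mds_request, r_kref);

        /* dentry references taken at request setup are put here */
        if (req->r_dentry)
                dput(req->r_dentry);
        if (req->r_old_dentry)
                dput(req->r_old_dentry);
        /* ... inode references, message buffers, etc. are also released ... */
        kfree(req);
}

If a request reference were leaked somewhere, these dput() calls would never run and the dentry would keep a nonzero d_lockref.count right up to the umount BUG above.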

#2 Updated by Jeff Layton over 4 years ago

This kernel is missing a whole pile of dentry handling fixes that got backported in -701.el7. I'd recommend updating to a newer kernel and seeing if this problem is now fixed.

#3 Updated by Jeff Layton over 4 years ago

  • Assignee set to Jeff Layton

#4 Updated by Jeff Layton almost 4 years ago

  • Status changed from New to Rejected

No response in several months. Please reopen if this is still reproducible.
