Bug #45562

soft lockup stuck for 22s! in ceph.ko and code stack is 'destroy_inode->ceph_destroy_inode->__ceph_remove_cap->_raw_spin_lock'

Added by joe h almost 4 years ago. Updated over 2 years ago.

Status:
Can't reproduce
Priority:
Normal
Assignee:
Category:
fs/ceph
Target version:
% Done:

10%

Source:
Tags:
Backport:
Regression:
No
Severity:
1 - critical
Reviewed:
05/15/2020
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

【Description】
Recently I have hit the same bug (CPU soft lockup stuck for 22s!) many times while the cluster was running a data-deletion workload for 12 hours. My cluster kernel version is 4.14.0. The only backtrace information in /var/log/messages is as follows.
I also searched for the same "CPU soft lockup 22s!" on www.tracker.ceph.com; there is only one issue, and its code stack is different from mine.

【Backtrace】
kernel: watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [kswapd0:290]
kernel: Modules linked in: rpcsec_gss_krb5(OE) iptable_filter tcp_diag inet_diag rpcrdma(OE) nfsd(OE) auth_rpcgss(OE) nfs_acl(OE) lockd(OE) grace(OE) fscache sunrpc(OE) ceph(OE) libceph(OE) dns_resolver dev_pmc_scsi(OE) flashcache(OE) rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_ib(OE) ib_uverbs(OE) mlx5_core(OE) mlxfw(OE) mlx4_en(OE) mlx4_ib(OE) ib_core(OE) mlx4_core(OE) devlink mlx_compat(OE) ip_vs nf_conntrack sr_mod vfat fat cdrom dm_mirror dm_region_hash dm_log dm_mod intel_rapl x86_pkg_temp_thermal ext4 intel_powerclamp mbcache jbd2 coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel uas ses crypto_simd enclosure usb_storage glue_helper cryptd sg intel_cstate iTCO_wdt iTCO_vendor_support intel_uncore
kernel: intel_rapl_perf pcspkr joydev ioatdma mei_me i2c_i801 mei lpc_ich shpchp ipmi_si wmi ipmi_devintf ipmi_msghandler nfit acpi_power_meter libnvdimm acpi_pad ip_tables xfs libcrc32c sd_mod ast drm_kms_helper syscopyarea sysfillrect crc32c_intel sysimgblt fb_sys_fops ixgbe ttm igb ahci mdio smartpqi drm libahci ptp scsi_transport_sas i2c_algo_bit dca libata pps_core i2c_core [last unloaded: mlxfw]
kernel: CPU: 5 PID: 290 Comm: kswapd0 Kdump: loaded Tainted: G W OE ------------ 4.14.0.xxx.x86_64 #1
kernel: task: ffff9ff33beb0000 task.stack: ffffb2cf0eb3c000
kernel: RIP: 0010:native_queued_spin_lock_slowpath+0x117/0x1a0
kernel: RSP: 0018:ffffb2cf0eb3fa70 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
kernel: RAX: 0000000000000000 RBX: ffff9fdc1b1672c0 RCX: 0000000000180000
kernel: RDX: ffffffffab84b740 RSI: 0000000079383530 RDI: ffffa00379383530
kernel: RBP: ffffb2cf0eb3fa70 R08: ffff9ff33e15c740 R09: 0000000000000000
kernel: R10: ffffd2bebe948e50 R11: ffff9fe680d54410 R12: ffff9fd9397576c8
kernel: R13: ffffa00379383000 R14: 0000000000000001 R15: ffff9fdfd254a840
kernel: FS: 0000000000000000(0000) GS:ffff9ff33e140000(0000) knlGS:0000000000000000
kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: 00007fc320005738 CR3: 0000002379a09005 CR4: 00000000007606e0
kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
kernel: PKRU: 55555554
kernel: Call Trace:
kernel: queued_spin_lock_slowpath+0xb/0x13
kernel: _raw_spin_lock+0x20/0x30
kernel: __ceph_remove_cap+0x52/0x250 [ceph]
kernel: ceph_queue_caps_release+0x50/0x70 [ceph]
kernel: ceph_destroy_inode+0x2d/0x1c0 [ceph]
kernel: destroy_inode+0x3b/0x60
kernel: evict+0x142/0x1a0
kernel: iput+0x17d/0x1d0
kernel: dentry_unlink_inode+0xb9/0xf0
kernel: __dentry_kill+0xc7/0x170
kernel: shrink_dentry_list+0x122/0x280
kernel: prune_dcache_sb+0x5a/0x80
kernel: super_cache_scan+0x107/0x190
kernel: shrink_slab+0x26b/0x480
kernel: shrink_node+0x2f7/0x310
kernel: kswapd+0x2cf/0x730
kernel: kthread+0x109/0x140
kernel: ? mem_cgroup_shrink_node+0x180/0x180
kernel: ? kthread_park+0x60/0x60
kernel: ret_from_fork+0x2a/0x40
kernel: Code: c1 e8 12 48 c1 ea 0c 83 e8 01 83 e2 30 48 98 48 81 c2 40 c7 01 00 48 03 14 c5 20 a4 50 ab 4c 89 02 41 8b 40 08 85 c0 75 0a f3 90 <41> 8b 40 08 85 c0 74 f6 4d 8b 08 4d 85 c9 74 08 41 0f 0d 09 eb

【Related issues】
There is only one other "CPU soft lockup stuck for 22s" report (tracker.ceph.com/issues/18130); unfortunately, its code stack differs from mine.

History

#1 Updated by Jeff Layton almost 4 years ago

Where did this kernel come from? I don't recognize this version string: 4.14.0.xxx.x86_64

Assuming that it's v4.14-based, it's probably quite old and is missing a lot of upstream fixes. Can you test this on something closer to a mainline kernel and let us know if it's still an issue?

#2 Updated by Jeff Layton over 3 years ago

  • Status changed from New to Need More Info
  • Assignee set to Jeff Layton

No response in a month. Please reopen if you are able to supply the requested info.

#3 Updated by joe h over 3 years ago

Jeff Layton wrote:

No response in a month. Please reopen if you are able to supply the requested info.

First, thanks very much, Jeff Layton.
Our cluster's base kernel version is 4.14.0, and regrettably the kernel will not be upgraded in the near future.
However, I found a regular pattern: when the physical node's memory usage is very high (beyond 95%) and an 'rm -rf /share/data/*' operation is executing, the probability of these problems (BUG #45562 and BUG #45563) increases.
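For reference, the trigger conditions described above can be sketched as a small check script. This is illustrative only: the /share/data path and the ~95% threshold come from the report itself, and the destructive rm is left commented out.

```shell
#!/bin/sh
# Check physical memory usage before running the deletion workload.
mem_used_pct=$(free | awk '/^Mem:/ { printf "%d", $3 / $2 * 100 }')
echo "memory used: ${mem_used_pct}%"

if [ "$mem_used_pct" -gt 95 ]; then
    echo "high memory pressure: lockup reportedly more likely in this state"
fi
# rm -rf /share/data/*   # the deletion workload from the report (destructive)
```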

#4 Updated by Jeff Layton over 3 years ago

  • Status changed from Need More Info to Can't reproduce

joe h wrote:

However, I found a regular pattern: when the physical node's memory usage is very high (beyond 95%) and an 'rm -rf /share/data/*' operation is executing, the probability of these problems (BUG #45562 and BUG #45563) increases.

Again, that doesn't tell me much. Memory pressure is often a trigger for certain types of bugs. In this case, the softlockup is due to not being able to acquire a spinlock. Either it wasn't released properly, or the holder of it was stuck for some reason and couldn't release it.

I haven't seen this with modern kernels at all, and there have been several bugs fixed in this area since v4.14 shipped. I doubt there's much we can do here. I'll close this back out unless you can reproduce on a more recent kernel.

#5 Updated by Yaarit Hatuka over 2 years ago

  • Crash signature (v1) updated (diff)

'Crash signature (v1)' should hold the value of the 'stack_sig' key (which is missing in this case). Moving its content into a note:

soft lockup stuck for 22s! in ceph.ko and code stack is 'destroy_inode->ceph_destroy_inode->__ceph_remove_cap->_raw_spin_lock'
