Bug #46671

nautilus:tasks/cfuse_workunit_suites_fsstress: "kernel: watchdog: BUG: soft lockup - CPU#1 stuck for 22s!"

Added by Ramana Raja over 3 years ago. Updated about 2 years ago.

Status:
Duplicate
Priority:
Normal
Assignee:
Jeff Layton
Category:
Performance/Resource Usage
Target version:
-
% Done:
0%

Source:
Q/A
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
multimds
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Saw this error in Yuri's nautilus backport testing in the multimds suite,
https://pulpito.ceph.com/yuriw-2020-07-20_15:25:01-multimds-wip-yuri3-testing-2020-07-17-1802-nautilus-distro-basic-smithi/5244060/

Failure: '/home/ubuntu/cephtest/archive/syslog/kern.log:2020-07-20T19:09:56.618000+00:00 smithi166 kernel: watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [kworker/1:1:85986]
' in syslog
1 jobs: ['5244060']
suites: ['centos_latest', 'clusters/9-mds', 'conf/{client', 'mds', 'mon', 'mon-debug', 'mount', 'mount/kclient/{kernel-testing', 'ms-die-on-skipped}', 'multimds/verify/{begin', 'objectstore-ec/filestore-xfs', 'osd}', 'overrides/{fuse-default-perm-no', 'tasks/cfuse_workunit_suites_fsstress', 'validater/valgrind}', 'verify/{frag_enable', 'whitelist_health', 'whitelist_wrongly_marked_down}}']

And in the client machine's kernel log, /a/yuriw-2020-07-20_15:25:01-multimds-wip-yuri3-testing-2020-07-17-1802-nautilus-distro-basic-smithi/5244060/remote/smithi166/syslog/kern.log.gz:

2020-07-20T19:09:56.620071+00:00 smithi166 kernel: Call Trace:
2020-07-20T19:09:56.620150+00:00 smithi166 kernel: ? __cap_is_valid+0x22/0xd0 [ceph]
2020-07-20T19:09:56.620228+00:00 smithi166 kernel: ? __lock_acquire+0x4e7/0x2000
2020-07-20T19:09:56.620308+00:00 smithi166 kernel: _raw_spin_lock+0x35/0x50
2020-07-20T19:09:56.620407+00:00 smithi166 kernel: ? __cap_is_valid+0x22/0xd0 [ceph]
2020-07-20T19:09:56.620489+00:00 smithi166 kernel: __cap_is_valid+0x22/0xd0 [ceph]
2020-07-20T19:09:56.620568+00:00 smithi166 kernel: ? ceph_check_caps+0x6e7/0xbe0 [ceph]
2020-07-20T19:09:56.620646+00:00 smithi166 kernel: __ceph_caps_issued+0x52/0xf0 [ceph]
2020-07-20T19:09:56.620740+00:00 smithi166 kernel: ceph_check_caps+0xf2/0xbe0 [ceph]
2020-07-20T19:09:56.620819+00:00 smithi166 kernel: ? trace_hardirqs_on_thunk+0x1a/0x1c
2020-07-20T19:09:56.620905+00:00 smithi166 kernel: ? lockdep_hardirqs_on+0x144/0x1d0
2020-07-20T19:09:56.620988+00:00 smithi166 kernel: ? trace_hardirqs_on_thunk+0x1a/0x1c
2020-07-20T19:09:56.621067+00:00 smithi166 kernel: ? __lock_acquire+0x4e7/0x2000
2020-07-20T19:09:56.621145+00:00 smithi166 kernel: ? igrab+0x19/0x50
2020-07-20T19:09:56.621222+00:00 smithi166 kernel: ? find_held_lock+0x2d/0x90
2020-07-20T19:09:56.621300+00:00 smithi166 kernel: ? find_held_lock+0x2d/0x90
2020-07-20T19:09:56.621402+00:00 smithi166 kernel: ? ceph_check_delayed_caps+0x90/0x140 [ceph]
2020-07-20T19:09:56.621485+00:00 smithi166 kernel: ceph_check_delayed_caps+0xaa/0x140 [ceph]
2020-07-20T19:09:56.621573+00:00 smithi166 kernel: delayed_work+0x8e/0x2b0 [ceph]
2020-07-20T19:09:56.621651+00:00 smithi166 kernel: process_one_work+0x2b7/0x540
2020-07-20T19:09:56.621735+00:00 smithi166 kernel: ? process_one_work+0x1b1/0x540
2020-07-20T19:09:56.621819+00:00 smithi166 kernel: worker_thread+0x225/0x3f0
2020-07-20T19:09:56.621897+00:00 smithi166 kernel: kthread+0x12e/0x140
2020-07-20T19:09:56.621974+00:00 smithi166 kernel: ? process_one_work+0x540/0x540
2020-07-20T19:09:56.622053+00:00 smithi166 kernel: ? kthread_insert_work_sanity_check+0x60/0x60
2020-07-20T19:09:56.622132+00:00 smithi166 kernel: ret_from_fork+0x24/0x30
2020-07-20T19:10:24.613177+00:00 smithi166 kernel: watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [kworker/1:1:85986]

History

#1 Updated by Ramana Raja over 3 years ago

  • Description updated (diff)

#2 Updated by Patrick Donnelly over 3 years ago

  • Assignee set to Jeff Layton

This is a new issue in the kernel testing branch. I vaguely recall raising a similar issue but can't find it.

#3 Updated by Jeff Layton over 3 years ago

  • Status changed from New to Need More Info

Sorry for the long delay. This one slipped through the cracks.

It looks like this is probably stuck waiting on the session->s_gen_ttl_lock. That lock has a pretty small footprint and never has other locks nested inside it, so I have to wonder whether this is indicative of some sort of memory corruption (in other words, the spinlock itself may have been corrupted in memory).

The logs here seem to be long gone; do we know what kernel this was running at the time?
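
To illustrate the pattern being described, here is a minimal userspace sketch, not the kernel code: the names used (session, gen_ttl_lock, cap_gen, cap_ttl, cap_is_valid) are made-up stand-ins for the kernel client's per-session s_gen_ttl_lock / s_cap_gen / s_cap_ttl fields. The point is that a __cap_is_valid()-style check only takes the small per-session lock long enough to snapshot a generation and TTL, so a caller should never spin on it for 22 seconds unless the lock itself is wedged or corrupted.

/* Minimal userspace sketch (assumed names; not the kernel code).
 * Build with: cc -o cap_gen_ttl cap_gen_ttl.c -lpthread
 */
#include <pthread.h>
#include <stdio.h>
#include <time.h>

struct session {                         /* stand-in for ceph_mds_session */
    pthread_spinlock_t gen_ttl_lock;     /* stand-in for s_gen_ttl_lock */
    unsigned int cap_gen;                /* stand-in for s_cap_gen */
    time_t cap_ttl;                      /* stand-in for s_cap_ttl */
};

struct cap {                             /* stand-in for ceph_cap */
    struct session *session;
    unsigned int cap_gen;                /* generation when the cap was issued */
};

/* Roughly the shape of a __cap_is_valid()-style check: take the small
 * per-session lock only long enough to snapshot gen/ttl, then compare.
 * If the lock word were corrupted so that it appeared permanently held,
 * this spin would never return -- and from a kworker that is exactly
 * what the soft-lockup watchdog flags. */
static int cap_is_valid(struct cap *cap)
{
    unsigned int gen;
    time_t ttl;

    pthread_spin_lock(&cap->session->gen_ttl_lock);
    gen = cap->session->cap_gen;
    ttl = cap->session->cap_ttl;
    pthread_spin_unlock(&cap->session->gen_ttl_lock);

    /* cap is valid if it was issued in the current generation and the
     * session's cap TTL has not yet expired */
    return cap->cap_gen >= gen && time(NULL) < ttl;
}

int main(void)
{
    struct session s;
    struct cap c = { .session = &s, .cap_gen = 1 };

    pthread_spin_init(&s.gen_ttl_lock, PTHREAD_PROCESS_PRIVATE);
    s.cap_gen = 1;
    s.cap_ttl = time(NULL) + 60;

    printf("cap valid: %d\n", cap_is_valid(&c));

    pthread_spin_destroy(&s.gen_ttl_lock);
    return 0;
}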

#4 Updated by Jeff Layton about 2 years ago

  • Status changed from Need More Info to Duplicate
  • Parent task set to #46284

#5 Updated by Jeff Layton about 2 years ago

  • Parent task deleted (#46284)
