Bug #46671
nautilus:tasks/cfuse_workunit_suites_fsstress: "kernel: watchdog: BUG: soft lockup - CPU#1 stuck for 22s!"
Description
Saw the error in Yuri's nautilus backport testing in the multimds suite:
https://pulpito.ceph.com/yuriw-2020-07-20_15:25:01-multimds-wip-yuri3-testing-2020-07-17-1802-nautilus-distro-basic-smithi/5244060/
Failure: '/home/ubuntu/cephtest/archive/syslog/kern.log:2020-07-20T19:09:56.618000+00:00 smithi166 kernel: watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [kworker/1:1:85986] ' in syslog
1 jobs: ['5244060']
suites: ['centos_latest', 'clusters/9-mds', 'conf/{client', 'mds', 'mon', 'mon-debug', 'mount', 'mount/kclient/{kernel-testing', 'ms-die-on-skipped}', 'multimds/verify/{begin', 'objectstore-ec/filestore-xfs', 'osd}', 'overrides/{fuse-default-perm-no', 'tasks/cfuse_workunit_suites_fsstress', 'validater/valgrind}', 'verify/{frag_enable', 'whitelist_health', 'whitelist_wrongly_marked_down}}']
And in the client machine log, /a/yuriw-2020-07-20_15:25:01-multimds-wip-yuri3-testing-2020-07-17-1802-nautilus-distro-basic-smithi/5244060/remote/smithi166/syslog/kern.log.gz
2020-07-20T19:09:56.620071+00:00 smithi166 kernel: Call Trace:
2020-07-20T19:09:56.620150+00:00 smithi166 kernel:  ? __cap_is_valid+0x22/0xd0 [ceph]
2020-07-20T19:09:56.620228+00:00 smithi166 kernel:  ? __lock_acquire+0x4e7/0x2000
2020-07-20T19:09:56.620308+00:00 smithi166 kernel:  _raw_spin_lock+0x35/0x50
2020-07-20T19:09:56.620407+00:00 smithi166 kernel:  ? __cap_is_valid+0x22/0xd0 [ceph]
2020-07-20T19:09:56.620489+00:00 smithi166 kernel:  __cap_is_valid+0x22/0xd0 [ceph]
2020-07-20T19:09:56.620568+00:00 smithi166 kernel:  ? ceph_check_caps+0x6e7/0xbe0 [ceph]
2020-07-20T19:09:56.620646+00:00 smithi166 kernel:  __ceph_caps_issued+0x52/0xf0 [ceph]
2020-07-20T19:09:56.620740+00:00 smithi166 kernel:  ceph_check_caps+0xf2/0xbe0 [ceph]
2020-07-20T19:09:56.620819+00:00 smithi166 kernel:  ? trace_hardirqs_on_thunk+0x1a/0x1c
2020-07-20T19:09:56.620905+00:00 smithi166 kernel:  ? lockdep_hardirqs_on+0x144/0x1d0
2020-07-20T19:09:56.620988+00:00 smithi166 kernel:  ? trace_hardirqs_on_thunk+0x1a/0x1c
2020-07-20T19:09:56.621067+00:00 smithi166 kernel:  ? __lock_acquire+0x4e7/0x2000
2020-07-20T19:09:56.621145+00:00 smithi166 kernel:  ? igrab+0x19/0x50
2020-07-20T19:09:56.621222+00:00 smithi166 kernel:  ? find_held_lock+0x2d/0x90
2020-07-20T19:09:56.621300+00:00 smithi166 kernel:  ? find_held_lock+0x2d/0x90
2020-07-20T19:09:56.621402+00:00 smithi166 kernel:  ? ceph_check_delayed_caps+0x90/0x140 [ceph]
2020-07-20T19:09:56.621485+00:00 smithi166 kernel:  ceph_check_delayed_caps+0xaa/0x140 [ceph]
2020-07-20T19:09:56.621573+00:00 smithi166 kernel:  delayed_work+0x8e/0x2b0 [ceph]
2020-07-20T19:09:56.621651+00:00 smithi166 kernel:  process_one_work+0x2b7/0x540
2020-07-20T19:09:56.621735+00:00 smithi166 kernel:  ? process_one_work+0x1b1/0x540
2020-07-20T19:09:56.621819+00:00 smithi166 kernel:  worker_thread+0x225/0x3f0
2020-07-20T19:09:56.621897+00:00 smithi166 kernel:  kthread+0x12e/0x140
2020-07-20T19:09:56.621974+00:00 smithi166 kernel:  ? process_one_work+0x540/0x540
2020-07-20T19:09:56.622053+00:00 smithi166 kernel:  ? kthread_insert_work_sanity_check+0x60/0x60
2020-07-20T19:09:56.622132+00:00 smithi166 kernel:  ret_from_fork+0x24/0x30
2020-07-20T19:10:24.613177+00:00 smithi166 kernel: watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [kworker/1:1:85986]
History
#1 Updated by Ramana Raja over 3 years ago
- Description updated (diff)
#2 Updated by Patrick Donnelly over 3 years ago
- Assignee set to Jeff Layton
This is a new issue in the kernel testing branch. I vaguely recall raising a similar issue but can't find it.
#3 Updated by Jeff Layton over 3 years ago
- Status changed from New to Need More Info
Sorry for the long delay. This one slipped through the cracks.
It looks like this is probably stuck waiting on the session->s_gen_ttl_lock. That lock has a pretty small footprint and never has other locks nested inside it, so I have to wonder if this is indicative of some sort of memory corruption (in other words, the spinlock itself may have been corrupted in memory).
The logs here seem to be long gone. Do we know what kernel this was running at the time?
#4 Updated by Jeff Layton about 2 years ago
- Status changed from Need More Info to Duplicate
- Parent task set to #46284
#5 Updated by Jeff Layton about 2 years ago
- Parent task deleted (#46284)