Bug #63324

open

kclient: watchdog: BUG: soft lockup - CPU#11 stuck for 25738s! [kworker/11:1:1724873]

Added by Xiubo Li 6 months ago.

Status:
In Progress
Priority:
Normal
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

The kworker seems to fall into an infinite loop when iterating over the mdsc->snap_flush_list list:

56233 2023-10-25T14:15:38.879533+0000 kernel: watchdog: BUG: soft lockup - CPU#11 stuck for 25712s! [kworker/11:1:1724873]
56234 2023-10-25T14:15:38.879724+0000 kernel: Modules linked in: ceph fscache netfs xt_TCPMSS xt_set ip_set_hash_net ip_set nbd rbd libceph dns_resolver xt_addrtype veth xt_mark nf_conntrack_netlink geneve ip6_udp_tunnel udp_tunnel ipt_REJECT nf_reject_ipv4 xt_nat nft_chain_nat xt_CT xt_conntrack xt_comment nft_compat nft_counter nf_tables nfnetlink_cttimeout nfnetlink rfkill openvswitch nf_conncount nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 overlay vsock_loopback vmw_vsock_virtio_transport_common vmw_vsock_vmci_transport vsock ext4 mbcache jbd2 intel_rapl_msr intel_rapl_common intel_uncore_frequency_common vmwgfx isst_if_mbox_msr isst_if_common drm_ttm_helper ttm nfit libnvdimm drm_kms_helper rapl vmw_balloon syscopyarea sysfillrect sysimgblt pcspkr vmw_vmci fb_sys_fops i2c_piix4 joydev drm ip_tables xfs libcrc32c dm_multipath ata_generic sd_mod nvme_tcp sg nvme_fabrics nvme ata_piix crct10dif_pclmul nvme_core crc32_pclmul crc32c_intel libata nvme_common vmxnet3 ghash_clmulni_intel t10_pi vmw_pvscsi serio_raw
56235 2023-10-25T14:15:38.898655+0000 kernel:  dm_mirror dm_region_hash dm_log dm_mod fuse
56236 2023-10-25T14:15:38.898743+0000 kernel: CPU: 11 PID: 1724873 Comm: kworker/11:1 Tainted: G             L   --------  ---  5.14.0-284.36.1.el9_2.x86_64 #1                                                                        
56237 2023-10-25T14:15:38.898774+0000 kernel: Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 11/12/2020
56238 2023-10-25T14:15:38.898799+0000 kernel: Workqueue: ceph-msgr ceph_con_workfn [libceph]
56239 2023-10-25T14:15:38.898825+0000 kernel: RIP: 0010:_raw_spin_lock+0x17/0x30
56240 2023-10-25T14:15:38.898863+0000 kernel: Code: cc cc e9 bc f1 62 ff 66 66 2e 0f 1f 84 00 00 00 00 00 90 0f 1f 44 00 00 65 ff 05 34 24 4e 58 31 c0 ba 01 00 00 00 f0 0f b1 17 <75> 05 c3 cc cc cc cc 89 c6 e8 2b ed 62 ff 66 90 c3 cc cc cc cc 0f
56241 2023-10-25T14:15:38.898892+0000 kernel: RSP: 0018:ffffb90d1c25fbf8 EFLAGS: 00000246
56242 2023-10-25T14:15:38.898921+0000 kernel: RAX: 0000000000000000 RBX: ffff973c71dc5f80 RCX: 00000000802a0028
56243 2023-10-25T14:15:38.898952+0000 kernel: RDX: 0000000000000001 RSI: ffffb90d1c25fc78 RDI: ffff973c71dc6230
56244 2023-10-25T14:15:38.898982+0000 kernel: RBP: 0000000000000000 R08: ffff973c71dc6518 R09: ffffb90d1c25fbe0
56245 2023-10-25T14:15:38.899029+0000 kernel: R10: 0000000000000000 R11: 0000000000002e4c R12: ffff973c71dc6230
56246 2023-10-25T14:15:38.899077+0000 kernel: R13: ffff97366aae4808 R14: ffff97366aae4800 R15: 0000000000000002
56247 2023-10-25T14:15:38.899148+0000 kernel: FS:  0000000000000000(0000) GS:ffff973e1fcc0000(0000) knlGS:0000000000000000
56248 2023-10-25T14:15:38.899176+0000 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
56249 2023-10-25T14:15:38.899208+0000 kernel: CR2: 000000c0008c9ae0 CR3: 0000000140264003 CR4: 00000000007706e0
56250 2023-10-25T14:15:38.899234+0000 kernel: PKRU: 55555554
56251 2023-10-25T14:15:38.899257+0000 kernel: Call Trace:
56252 2023-10-25T14:15:38.899283+0000 kernel:  <TASK>
56253 2023-10-25T14:15:38.899305+0000 kernel:  ceph_flush_snaps+0x4f/0x1e0 [ceph]
56254 2023-10-25T14:15:38.899330+0000 kernel:  ceph_handle_snap+0x2c2/0x4f0 [ceph]
56255 2023-10-25T14:15:38.899365+0000 kernel:  ? tcp_recvmsg_locked+0x200/0x900
56256 2023-10-25T14:15:38.899394+0000 kernel:  mds_dispatch+0x19d/0x1e0 [ceph]
56257 2023-10-25T14:15:38.899416+0000 kernel:  ceph_con_process_message+0x77/0x150 [libceph]
56258 2023-10-25T14:15:38.899440+0000 kernel:  ? release_sock+0x40/0x90
56259 2023-10-25T14:15:38.899465+0000 kernel:  process_message+0xf/0x110 [libceph]
56260 2023-10-25T14:15:38.899497+0000 kernel:  populate_in_iter+0x45e/0x660 [libceph]
56261 2023-10-25T14:15:38.899522+0000 kernel:  ? inet_recvmsg+0x5c/0x140
56262 2023-10-25T14:15:38.899551+0000 kernel:  ceph_con_v2_try_read+0x101/0x210 [libceph]
56263 2023-10-25T14:15:38.899606+0000 kernel:  ? ceph_con_v2_try_write+0x1aa/0x4c0 [libceph]
56264 2023-10-25T14:15:38.899632+0000 kernel:  ceph_con_workfn+0x175/0x440 [libceph]
56265 2023-10-25T14:15:38.899657+0000 kernel:  process_one_work+0x1e2/0x3b0
56266 2023-10-25T14:15:38.899681+0000 kernel:  ? rescuer_thread+0x390/0x390
56267 2023-10-25T14:15:38.899714+0000 kernel:  worker_thread+0x50/0x3a0
56268 2023-10-25T14:15:38.899739+0000 kernel:  ? rescuer_thread+0x390/0x390
56269 2023-10-25T14:15:38.899762+0000 kernel:  kthread+0xd6/0x100
56270 2023-10-25T14:15:38.899793+0000 kernel:  ? kthread_complete_and_exit+0x20/0x20
56271 2023-10-25T14:15:38.899819+0000 kernel:  ret_from_fork+0x1f/0x30
56272 2023-10-25T14:15:38.899842+0000 kernel:  </TASK>
56273 2023-10-25T14:16:06.879512+0000 kernel: watchdog: BUG: soft lockup - CPU#11 stuck for 25738s! [kworker/11:1:1724873]
56274 2023-10-25T14:16:06.879675+0000 kernel: Modules linked in: ceph fscache netfs xt_TCPMSS xt_set ip_set_hash_net ip_set nbd rbd libceph dns_resolver xt_addrtype veth xt_mark nf_conntrack_netlink geneve ip6_udp_tunnel udp_tunnel ipt_REJECT nf_reject_ipv4 xt_nat nft_chain_nat xt_CT xt_conntrack xt_comment nft_compat nft_counter nf_tables nfnetlink_cttimeout nfnetlink rfkill openvswitch nf_conncount nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 overlay vsock_loopback vmw_vsock_virtio_transport_common vmw_vsock_vmci_transport vsock ext4 mbcache jbd2 intel_rapl_msr intel_rapl_common intel_uncore_frequency_common vmwgfx isst_if_mbox_msr isst_if_common drm_ttm_helper ttm nfit libnvdimm drm_kms_helper rapl vmw_balloon syscopyarea sysfillrect sysimgblt pcspkr vmw_vmci fb_sys_fops i2c_piix4 joydev drm ip_tables xfs libcrc32c dm_multipath ata_generic sd_mod nvme_tcp sg nvme_fabrics nvme ata_piix crct10dif_pclmul nvme_core crc32_pclmul crc32c_intel libata nvme_common vmxnet3 ghash_clmulni_intel t10_pi vmw_pvscsi serio_raw
56275 2023-10-25T14:16:06.879774+0000 kernel:  dm_mirror dm_region_hash dm_log dm_mod fuse
56276 2023-10-25T14:16:06.879822+0000 kernel: CPU: 11 PID: 1724873 Comm: kworker/11:1 Tainted: G             L   --------  ---  5.14.0-284.36.1.el9_2.x86_64 #1 
56277 2023-10-25T14:16:06.879850+0000 kernel: Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 11/12/2020
56278 2023-10-25T14:16:06.879871+0000 kernel: Workqueue: ceph-msgr ceph_con_workfn [libceph]
56279 2023-10-25T14:16:06.879895+0000 kernel: RIP: 0010:_raw_spin_unlock+0x1f/0x30
56280 2023-10-25T14:16:06.879919+0000 kernel: Code: e9 26 7e 5c ff 66 0f 1f 44 00 00 0f 1f 44 00 00 c6 07 00 0f 1f 40 00 65 ff 0d 7d 23 4e 58 74 05 c3 cc cc cc cc 0f 1f 44 00 00 <c3> cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 90 0f 1f 44 00 00
56281 2023-10-25T14:16:06.879947+0000 kernel: RSP: 0018:ffffb90d1c25fc38 EFLAGS: 00000246
56282 2023-10-25T14:16:06.879970+0000 kernel: RAX: 0000000000000002 RBX: ffff973749586000 RCX: 00000000802a0028 
56283 2023-10-25T14:16:06.879997+0000 kernel: RDX: 0000000000000001 RSI: ffff973c71dc6008 RDI: ffff9737495861c8
56284 2023-10-25T14:16:06.880022+0000 kernel: RBP: ffff973c71dc5f80 R08: ffff973c71dc6518 R09: ffffb90d1c25fbe0
56285 2023-10-25T14:16:06.880059+0000 kernel: R10: 0000000000000000 R11: 0000000000002e4c R12: ffff9737495861c8
56286 2023-10-25T14:16:06.880098+0000 kernel: R13: ffff9737495861b8 R14: ffff973645342600 R15: 0000000000000002
56287 2023-10-25T14:16:06.880137+0000 kernel: FS:  0000000000000000(0000) GS:ffff973e1fcc0000(0000) knlGS:0000000000000000
56288 2023-10-25T14:16:06.880177+0000 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
56289 2023-10-25T14:16:06.880204+0000 kernel: CR2: 000000c0008c9ae0 CR3: 0000000140264003 CR4: 00000000007706e0
56290 2023-10-25T14:16:06.880235+0000 kernel: PKRU: 55555554
56291 2023-10-25T14:16:06.880263+0000 kernel: Call Trace:
56292 2023-10-25T14:16:06.880306+0000 kernel:  <TASK>
56293 2023-10-25T14:16:06.880331+0000 kernel:  ceph_handle_snap+0x2b5/0x4f0 [ceph]
56294 2023-10-25T14:16:06.880352+0000 kernel:  ? tcp_recvmsg_locked+0x200/0x900
56295 2023-10-25T14:16:06.880378+0000 kernel:  mds_dispatch+0x19d/0x1e0 [ceph]
56296 2023-10-25T14:16:06.880401+0000 kernel:  ceph_con_process_message+0x77/0x150 [libceph]
56297 2023-10-25T14:16:06.880423+0000 kernel:  ? release_sock+0x40/0x90
56298 2023-10-25T14:16:06.880444+0000 kernel:  process_message+0xf/0x110 [libceph]
56299 2023-10-25T14:16:06.880468+0000 kernel:  populate_in_iter+0x45e/0x660 [libceph]
56300 2023-10-25T14:16:06.880492+0000 kernel:  ? inet_recvmsg+0x5c/0x140
56301 2023-10-25T14:16:06.880517+0000 kernel:  ceph_con_v2_try_read+0x101/0x210 [libceph]
56302 2023-10-25T14:16:06.880539+0000 kernel:  ? ceph_con_v2_try_write+0x1aa/0x4c0 [libceph]
56303 2023-10-25T14:16:06.880575+0000 kernel:  ceph_con_workfn+0x175/0x440 [libceph]
56304 2023-10-25T14:16:06.880611+0000 kernel:  process_one_work+0x1e2/0x3b0
56305 2023-10-25T14:16:06.880635+0000 kernel:  ? rescuer_thread+0x390/0x390
56306 2023-10-25T14:16:06.880659+0000 kernel:  worker_thread+0x50/0x3a0
56307 2023-10-25T14:16:06.880679+0000 kernel:  ? rescuer_thread+0x390/0x390
56308 2023-10-25T14:16:06.880700+0000 kernel:  kthread+0xd6/0x100
56309 2023-10-25T14:16:06.880722+0000 kernel:  ? kthread_complete_and_exit+0x20/0x20
56310 2023-10-25T14:16:06.880742+0000 kernel:  ret_from_fork+0x1f/0x30
56311 2023-10-25T14:16:06.880763+0000 kernel:  </TASK>  

The code:

/*
 * Send any cap_snaps that are queued for flush.  Try to carry
 * s_mutex across multiple snap flushes to avoid locking overhead.
 *
 * Caller holds no locks.
 */
static void flush_snaps(struct ceph_mds_client *mdsc)
{
        struct ceph_client *cl = mdsc->fsc->client;
        struct ceph_inode_info *ci;
        struct inode *inode;
        struct ceph_mds_session *session = NULL;

        doutc(cl, "begin\n");
        spin_lock(&mdsc->snap_flush_lock);
        while (!list_empty(&mdsc->snap_flush_list)) {
                ci = list_first_entry(&mdsc->snap_flush_list,
                                struct ceph_inode_info, i_snap_flush_item);
                inode = &ci->netfs.inode;
                ihold(inode);
                spin_unlock(&mdsc->snap_flush_lock);
                ceph_flush_snaps(ci, &session);
                iput(inode);
                spin_lock(&mdsc->snap_flush_lock);
        }
        spin_unlock(&mdsc->snap_flush_lock);

        ceph_put_mds_session(session);
        doutc(cl, "done\n");
}

This is similar to the issue in https://tracker.ceph.com/issues/63298.
