Bug #56693
Updated by Jeff Layton almost 2 years ago
Opening as a kernel bug, but this may be an MDS problem. I hit this warning today while testing the current testing branch (commit 7515a529973fc): <pre>
[ 511.413581] WARNING: CPU: 3 PID: 96 at fs/ceph/inode.c:745 ceph_fill_file_size+0x284/0x290 [ceph]
[ 511.419352] Modules linked in: ceph(E) libceph(E) nft_fib_inet(E) nft_fib_ipv4(E) nft_fib_ipv6(E) nft_fib(E) nft_reject_inet(E) nf_reject_ipv4(E) nf_reject_ipv6(E) nft_reject(E) nft_ct(E) nft_chain_nat(E) nf_nat(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) ip_set(E) rfkill(E) nf_tables(E) nfnetlink(E) cachefiles(E) fscache(E) netfs(E) sunrpc(E) i2c_i801(E) i2c_smbus(E) joydev(E) virtio_balloon(E) iTCO_wdt(E) intel_pmc_bxt(E) iTCO_vendor_support(E) lpc_ich(E) intel_rapl_msr(E) intel_rapl_common(E) fuse(E) zram(E) xfs(E) crct10dif_pclmul(E) crc32_pclmul(E) crc32c_intel(E) ghash_clmulni_intel(E) serio_raw(E) virtio_gpu(E) virtio_dma_buf(E) drm_shmem_helper(E) drm_kms_helper(E) virtio_net(E) net_failover(E) virtio_blk(E) virtio_console(E) failover(E) drm(E) qemu_fw_cfg(E)
[ 511.453023] CPU: 3 PID: 96 Comm: kworker/3:1 Tainted: G            E     5.19.0-rc6+ #320
[ 511.461188] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.0-1.fc36 04/01/2014
[ 511.469479] Workqueue: ceph-msgr ceph_con_workfn [libceph]
[ 511.475692] RIP: 0010:ceph_fill_file_size+0x284/0x290 [ceph]
[ 511.480806] Code: 8b 44 24 04 e9 ed fe ff ff 48 8d bb a8 07 00 00 e8 91 58 8d c9 8b 83 a8 07 00 00 85 c0 0f 85 93 fe ff ff 31 c0 e9 a4 fe ff ff <0f> 0b e9 52 fe ff ff 0f 1f 44 00 00 0f 1f 44 00 00 55 48 89 e5 41
[ 511.491849] RSP: 0018:ffff888101787518 EFLAGS: 00010206
[ 511.496261] RAX: 0000000000000000 RBX: ffff88815e236680 RCX: dffffc0000000000
[ 511.500979] RDX: 0000000000000003 RSI: ffffffffc0c6dc78 RDI: ffff88815e236ec8
[ 511.505890] RBP: 0000000088bb3f83 R08: 0000000088bb3f83 R09: ffff88811b3f8cb1
[ 511.510918] R10: ffffed102367f196 R11: 0000000000000001 R12: 0000000000000002
[ 511.515367] R13: 0000000088bb3f83 R14: ffff88815e236ec8 R15: ffff88815e2366d0
[ 511.519732] FS:  0000000000000000(0000) GS:ffff88841d380000(0000) knlGS:0000000000000000
[ 511.524423] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 511.528664] CR2: 00000000020d58d8 CR3: 000000010d002000 CR4: 00000000003506e0
[ 511.533051] Call Trace:
[ 511.536873]  <TASK>
[ 511.540734]  ceph_fill_inode+0x6ae/0x1eb0 [ceph]
[ 511.544925]  ? __lock_acquire+0x841/0x2d00
[ 511.549004]  ? ceph_queue_inode_work+0x100/0x100 [ceph]
[ 511.553267]  ? lock_acquire+0x157/0x3d0
[ 511.557351]  ? lock_acquire+0x167/0x3d0
[ 511.561373]  ceph_fill_trace+0x474/0x1510 [ceph]
[ 511.565495]  ? rcu_read_lock_sched_held+0x3b/0x70
[ 511.569683]  ? trace_contention_end+0xbc/0x110
[ 511.573717]  ? preempt_count_sub+0x14/0xc0
[ 511.577614]  ? __mutex_lock+0x1bf/0xc00
[ 511.581557]  ? ceph_fill_inode+0x1eb0/0x1eb0 [ceph]
[ 511.585605]  ? mds_dispatch+0xb05/0x24e0 [ceph]
[ 511.589539]  ? lock_release+0x430/0x430
[ 511.593380]  ? mutex_lock_io_nested+0xae0/0xae0
[ 511.597371]  ? ceph_ino_compare+0x60/0x60 [ceph]
[ 511.601493]  ? iget5_locked+0x2a/0xa0
[ 511.605914]  ? downgrade_write+0xc6/0x240
[ 511.612667]  ? rwsem_mark_wake+0x4f0/0x4f0
[ 511.618237]  ? down_write_killable+0x120/0x120
[ 511.626448]  mds_dispatch+0xb50/0x24e0 [ceph]
[ 511.632358]  ? ceph_mdsc_handle_mdsmap+0x3a0/0x3a0 [ceph]
[ 511.638367]  ? wait_for_completion_io_timeout+0x20/0x20
[ 511.644583]  ? tcp_recv_timestamp+0x350/0x350
[ 511.650333]  ceph_con_process_message+0xd3/0x220 [libceph]
[ 511.656150]  ? inet_recvmsg+0xcf/0x310
[ 511.661772]  process_message+0x1b/0x1d0 [libceph]
[ 511.667354]  ceph_con_v2_try_read+0x1a2b/0x2ae0 [libceph]
[ 511.673029]  ? populate_out_iter+0x1850/0x1850 [libceph]
[ 511.678532]  ? lock_is_held_type+0xe3/0x140
[ 511.683466]  ? rcu_read_lock_sched_held+0x3b/0x70
[ 511.688375]  ? trace_contention_end+0xbc/0x110
[ 511.693147]  ? preempt_count_sub+0x14/0xc0
[ 511.697805]  ? __mutex_lock+0x1bf/0xc00
[ 511.702350]  ? reacquire_held_locks+0x270/0x270
[ 511.706933]  ? ceph_con_workfn+0x35/0x8c0 [libceph]
[ 511.710987]  ? lockdep_hardirqs_on_prepare+0x220/0x220
[ 511.714359]  ? mutex_lock_io_nested+0xae0/0xae0
[ 511.717437]  ? lock_downgrade+0x380/0x380
[ 511.720385]  ceph_con_workfn+0x53e/0x8c0 [libceph]
[ 511.723670]  process_one_work+0x515/0x980
[ 511.726819]  ? pwq_dec_nr_in_flight+0x100/0x100
[ 511.729886]  ? rwlock_bug.part.0+0x60/0x60
[ 511.732879]  worker_thread+0x8c/0x610
[ 511.735768]  ? process_one_work+0x980/0x980
[ 511.738687]  kthread+0x167/0x1a0
[ 511.741470]  ? kthread_complete_and_exit+0x20/0x20
[ 511.744490]  ret_from_fork+0x22/0x30
[ 511.747410]  </TASK>
[ 511.749893] irq event stamp: 972595
[ 511.752690] hardirqs last  enabled at (972609): [<ffffffff8a2033de>] __up_console_sem+0x5e/0x70
[ 511.756306] hardirqs last disabled at (972622): [<ffffffff8a2033c3>] __up_console_sem+0x43/0x70
[ 511.759924] softirqs last  enabled at (972292): [<ffffffff8a13b8f5>] __irq_exit_rcu+0x135/0x1b0
[ 511.763559] softirqs last disabled at (972287): [<ffffffff8a13b8f5>] __irq_exit_rcu+0x135/0x1b0
[ 511.767131] ---[ end trace 0000000000000000 ]---
</pre> That is: <pre>
	/* the MDS should have revoked these caps */
	WARN_ON_ONCE(issued & (CEPH_CAP_FILE_EXCL |
			       CEPH_CAP_FILE_RD |
			       CEPH_CAP_FILE_WR |
			       CEPH_CAP_FILE_LAZYIO));
</pre> So it looks like the MDS changed the truncate_seq without first revoking those caps?
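For anyone reading along, here is a minimal standalone sketch of the invariant that warning enforces: when the client sees a new truncate_seq from the MDS, none of the FILE_EXCL/RD/WR/LAZYIO cap bits should still be issued to it. The cap bit values below are built with the same shift scheme as <code>include/linux/ceph/ceph_fs.h</code> (copied here for illustration; double-check against the header), and <code>caps_should_have_been_revoked()</code> is a hypothetical helper name, not a kernel function: <pre>
#include &lt;assert.h&gt;
#include &lt;stdbool.h&gt;
#include &lt;stdio.h&gt;

/* Per-group cap bits and the FILE group shift, mirroring ceph_fs.h */
#define CEPH_CAP_GEXCL        2	/* client has exclusive caps */
#define CEPH_CAP_GRD          8	/* client can read */
#define CEPH_CAP_GWR         16	/* client can write */
#define CEPH_CAP_GLAZYIO    128	/* client can perform lazy io */
#define CEPH_CAP_SFILE        8	/* shift for the FILE cap group */

#define CEPH_CAP_FILE_EXCL   (CEPH_CAP_GEXCL   &lt;&lt; CEPH_CAP_SFILE)
#define CEPH_CAP_FILE_RD     (CEPH_CAP_GRD     &lt;&lt; CEPH_CAP_SFILE)
#define CEPH_CAP_FILE_WR     (CEPH_CAP_GWR     &lt;&lt; CEPH_CAP_SFILE)
#define CEPH_CAP_FILE_LAZYIO (CEPH_CAP_GLAZYIO &lt;&lt; CEPH_CAP_SFILE)

/* Same mask the WARN_ON_ONCE in ceph_fill_file_size() tests */
static bool caps_should_have_been_revoked(int issued)
{
	return issued &amp; (CEPH_CAP_FILE_EXCL | CEPH_CAP_FILE_RD |
			 CEPH_CAP_FILE_WR | CEPH_CAP_FILE_LAZYIO);
}

int main(void)
{
	/* No file caps issued: a truncate_seq bump is fine */
	assert(!caps_should_have_been_revoked(0));
	/* Any of RD/WR/EXCL/LAZYIO still issued would trip the warning */
	assert(caps_should_have_been_revoked(CEPH_CAP_FILE_RD));
	assert(caps_should_have_been_revoked(CEPH_CAP_FILE_WR |
					     CEPH_CAP_FILE_EXCL));
	printf("mask checks ok\n");
	return 0;
}
</pre> So the warning firing means the client still held at least one of those four bits at the moment the size/truncate_seq update arrived, which is why this smells like a missed revoke on the MDS side.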