Bug #56693

MDS changed truncate_seq with client caps still outstanding

Added by Jeff Layton almost 2 years ago. Updated 8 months ago.

Status: Fix Under Review
Priority: Normal
Assignee:
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

Opening this as a kernel bug, though it may be an MDS problem. I hit this warning today while testing with the current testing branch (commit 7515a529973fc):

[  511.413581] WARNING: CPU: 3 PID: 96 at fs/ceph/inode.c:745 ceph_fill_file_size+0x284/0x290 [ceph]
[  511.419352] Modules linked in: ceph(E) libceph(E) nft_fib_inet(E) nft_fib_ipv4(E) nft_fib_ipv6(E) nft_fib(E) nft_reject_inet(E) nf_reject_ipv4(E) nf_reject_ipv6(E) nft_reject(E) nft_ct(E) nft_chain_nat(E) nf_nat(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) ip_set(E) rfkill(E) nf_tables(E) nfnetlink(E) cachefiles(E) fscache(E) netfs(E) sunrpc(E) i2c_i801(E) i2c_smbus(E) joydev(E) virtio_balloon(E) iTCO_wdt(E) intel_pmc_bxt(E) iTCO_vendor_support(E) lpc_ich(E) intel_rapl_msr(E) intel_rapl_common(E) fuse(E) zram(E) xfs(E) crct10dif_pclmul(E) crc32_pclmul(E) crc32c_intel(E) ghash_clmulni_intel(E) serio_raw(E) virtio_gpu(E) virtio_dma_buf(E) drm_shmem_helper(E) drm_kms_helper(E) virtio_net(E) net_failover(E) virtio_blk(E) virtio_console(E) failover(E) drm(E) qemu_fw_cfg(E)
[  511.453023] CPU: 3 PID: 96 Comm: kworker/3:1 Tainted: G            E     5.19.0-rc6+ #320
[  511.461188] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.0-1.fc36 04/01/2014
[  511.469479] Workqueue: ceph-msgr ceph_con_workfn [libceph]
[  511.475692] RIP: 0010:ceph_fill_file_size+0x284/0x290 [ceph]
[  511.480806] Code: 8b 44 24 04 e9 ed fe ff ff 48 8d bb a8 07 00 00 e8 91 58 8d c9 8b 83 a8 07 00 00 85 c0 0f 85 93 fe ff ff 31 c0 e9 a4 fe ff ff <0f> 0b e9 52 fe ff ff 0f 1f 44 00 00 0f 1f 44 00 00 55 48 89 e5 41
[  511.491849] RSP: 0018:ffff888101787518 EFLAGS: 00010206
[  511.496261] RAX: 0000000000000000 RBX: ffff88815e236680 RCX: dffffc0000000000
[  511.500979] RDX: 0000000000000003 RSI: ffffffffc0c6dc78 RDI: ffff88815e236ec8
[  511.505890] RBP: 0000000088bb3f83 R08: 0000000088bb3f83 R09: ffff88811b3f8cb1
[  511.510918] R10: ffffed102367f196 R11: 0000000000000001 R12: 0000000000000002
[  511.515367] R13: 0000000088bb3f83 R14: ffff88815e236ec8 R15: ffff88815e2366d0
[  511.519732] FS:  0000000000000000(0000) GS:ffff88841d380000(0000) knlGS:0000000000000000
[  511.524423] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  511.528664] CR2: 00000000020d58d8 CR3: 000000010d002000 CR4: 00000000003506e0
[  511.533051] Call Trace:
[  511.536873]  <TASK>
[  511.540734]  ceph_fill_inode+0x6ae/0x1eb0 [ceph]
[  511.544925]  ? __lock_acquire+0x841/0x2d00
[  511.549004]  ? ceph_queue_inode_work+0x100/0x100 [ceph]
[  511.553267]  ? lock_acquire+0x157/0x3d0
[  511.557351]  ? lock_acquire+0x167/0x3d0
[  511.561373]  ceph_fill_trace+0x474/0x1510 [ceph]
[  511.565495]  ? rcu_read_lock_sched_held+0x3b/0x70
[  511.569683]  ? trace_contention_end+0xbc/0x110
[  511.573717]  ? preempt_count_sub+0x14/0xc0
[  511.577614]  ? __mutex_lock+0x1bf/0xc00
[  511.581557]  ? ceph_fill_inode+0x1eb0/0x1eb0 [ceph]
[  511.585605]  ? mds_dispatch+0xb05/0x24e0 [ceph]
[  511.589539]  ? lock_release+0x430/0x430
[  511.593380]  ? mutex_lock_io_nested+0xae0/0xae0
[  511.597371]  ? ceph_ino_compare+0x60/0x60 [ceph]
[  511.601493]  ? iget5_locked+0x2a/0xa0
[  511.605914]  ? downgrade_write+0xc6/0x240
[  511.612667]  ? rwsem_mark_wake+0x4f0/0x4f0
[  511.618237]  ? down_write_killable+0x120/0x120
[  511.626448]  mds_dispatch+0xb50/0x24e0 [ceph]
[  511.632358]  ? ceph_mdsc_handle_mdsmap+0x3a0/0x3a0 [ceph]
[  511.638367]  ? wait_for_completion_io_timeout+0x20/0x20
[  511.644583]  ? tcp_recv_timestamp+0x350/0x350
[  511.650333]  ceph_con_process_message+0xd3/0x220 [libceph]
[  511.656150]  ? inet_recvmsg+0xcf/0x310
[  511.661772]  process_message+0x1b/0x1d0 [libceph]
[  511.667354]  ceph_con_v2_try_read+0x1a2b/0x2ae0 [libceph]
[  511.673029]  ? populate_out_iter+0x1850/0x1850 [libceph]
[  511.678532]  ? lock_is_held_type+0xe3/0x140
[  511.683466]  ? rcu_read_lock_sched_held+0x3b/0x70
[  511.688375]  ? trace_contention_end+0xbc/0x110
[  511.693147]  ? preempt_count_sub+0x14/0xc0
[  511.697805]  ? __mutex_lock+0x1bf/0xc00
[  511.702350]  ? reacquire_held_locks+0x270/0x270
[  511.706933]  ? ceph_con_workfn+0x35/0x8c0 [libceph]
[  511.710987]  ? lockdep_hardirqs_on_prepare+0x220/0x220
[  511.714359]  ? mutex_lock_io_nested+0xae0/0xae0
[  511.717437]  ? lock_downgrade+0x380/0x380
[  511.720385]  ceph_con_workfn+0x53e/0x8c0 [libceph]
[  511.723670]  process_one_work+0x515/0x980
[  511.726819]  ? pwq_dec_nr_in_flight+0x100/0x100
[  511.729886]  ? rwlock_bug.part.0+0x60/0x60
[  511.732879]  worker_thread+0x8c/0x610
[  511.735768]  ? process_one_work+0x980/0x980
[  511.738687]  kthread+0x167/0x1a0
[  511.741470]  ? kthread_complete_and_exit+0x20/0x20
[  511.744490]  ret_from_fork+0x22/0x30
[  511.747410]  </TASK>
[  511.749893] irq event stamp: 972595
[  511.752690] hardirqs last  enabled at (972609): [<ffffffff8a2033de>] __up_console_sem+0x5e/0x70
[  511.756306] hardirqs last disabled at (972622): [<ffffffff8a2033c3>] __up_console_sem+0x43/0x70
[  511.759924] softirqs last  enabled at (972292): [<ffffffff8a13b8f5>] __irq_exit_rcu+0x135/0x1b0
[  511.763559] softirqs last disabled at (972287): [<ffffffff8a13b8f5>] __irq_exit_rcu+0x135/0x1b0
[  511.767131] ---[ end trace 0000000000000000 ]---

That warning corresponds to this check in ceph_fill_file_size():

      /* the MDS should have revoked these caps */
      WARN_ON_ONCE(issued & (CEPH_CAP_FILE_EXCL |
                             CEPH_CAP_FILE_RD |
                             CEPH_CAP_FILE_WR |
                             CEPH_CAP_FILE_LAZYIO));

So it looks like the MDS changed the truncate_seq without properly revoking these caps first?
