Project

General

Profile

Bug #46419

ceph: ceph_add_cap: couldn't find snap realm 110

Added by Jeff Layton about 2 years ago. Updated 3 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
fs/ceph
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

Mark Nelson saw this pop in some testing, which looks like a case where we caught a cap IMPORT with a snap realm that that the client was not aware of (yet):

[ 8663.340065] ceph: ceph_add_cap: couldn't find snap realm 110
[ 8663.347293] ------------[ cut here ]------------
[ 8663.353864] WARNING: CPU: 10 PID: 33414 at fs/ceph/caps.c:707 ceph_add_cap.cold+0x14/0x1b [ceph]
[ 8663.366633] Modules linked in: ceph libceph fscache xfs libcrc32c iscsi_target_mod target_core_mod intel_rapl_msr intel_rapl_common nfit libnvdimm ppdev parport_pc intel_rapl_perf ena parport i2c_piix4 drm ip_tables crct10dif_pclmul crc32_pclmul crc32c_intel nvme nvme_core ghash_clmulni_intel serio_raw
[ 8663.396387] CPU: 10 PID: 33414 Comm: kworker/10:2 Not tainted 5.7.6-201.fc32.x86_64 #1
[ 8663.408770] Hardware name: Amazon EC2 i3en.3xlarge/, BIOS 1.0 10/16/2017
[ 8663.416409] Workqueue: ceph-msgr ceph_con_workfn [libceph]
[ 8663.423487] RIP: 0010:ceph_add_cap.cold+0x14/0x1b [ceph]
[ 8663.430461] Code: 74 24 10 48 8b 6c 24 18 41 8d 0c 04 e8 22 20 99 c1 e9 b6 75 fe ff 48 8b b4 24 90 00 00 00 48 c7 c7 d8 24 7c c0 e8 09 20 99 c1 <0f> 0b e9 2f 86 fe ff 48 8b 55 08 4c 8b 43 30 48 c7 c7 28 26 7c c0
[ 8663.452325] RSP: 0018:ffffa84a8229bb58 EFLAGS: 00010282
[ 8663.459270] RAX: 0000000000000030 RBX: 0000000000000001 RCX: 0000000000000000
[ 8663.467196] RDX: ffff94c7d8ea7cc0 RSI: ffff94c7d8e99cc8 RDI: ffff94c7d8e99cc8
[ 8663.475009] RBP: 0000000000000045 R08: 0000000000000410 R09: 0000000000000019
[ 8663.482842] R10: 0000000000000730 R11: 0000000000000000 R12: ffff94c719be9a60
[ 8663.490640] R13: ffff94c65edb3528 R14: ffff94c7ca261508 R15: 0000000000000010
[ 8663.498496] FS:  0000000000000000(0000) GS:ffff94c7d8e80000(0000) knlGS:0000000000000000
[ 8663.511091] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 8663.518307] CR2: 000055da16b73000 CR3: 000000178d7ac006 CR4: 00000000007606e0
[ 8663.526130] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 8663.533945] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 8663.541790] PKRU: 55555554
[ 8663.547492] Call Trace:
[ 8663.553101]  ? kmem_cache_alloc+0x168/0x220
[ 8663.559513]  ? __cap_is_valid+0x1c/0xb0 [ceph]
[ 8663.566058]  ceph_handle_caps+0x90e/0x19b0 [ceph]
[ 8663.572725]  dispatch+0x283/0x1400 [ceph]
[ 8663.579035]  ? inet_recvmsg+0x4d/0xd0
[ 8663.585188]  ? ceph_tcp_recvmsg+0x62/0x80 [libceph]
[ 8663.591895]  ceph_con_workfn+0x19f7/0x2800 [libceph]
[ 8663.598697]  ? __switch_to_asm+0x40/0x70
[ 8663.604952]  ? __switch_to_asm+0x34/0x70
[ 8663.611203]  ? __switch_to_asm+0x40/0x70
[ 8663.617434]  ? __switch_to_asm+0x34/0x70
[ 8663.623670]  ? __switch_to_asm+0x40/0x70
[ 8663.629956]  ? __switch_to_asm+0x40/0x70
[ 8663.636235]  ? __switch_to_asm+0x34/0x70
[ 8663.642499]  ? __switch_to+0x80/0x420
[ 8663.648621]  ? __switch_to_asm+0x34/0x70
[ 8663.654868]  process_one_work+0x1b4/0x380
[ 8663.661178]  worker_thread+0x53/0x3e0
[ 8663.667333]  ? process_one_work+0x380/0x380
[ 8663.673709]  kthread+0x115/0x140
[ 8663.679605]  ? __kthread_bind_mask+0x60/0x60
[ 8663.686019]  ret_from_fork+0x35/0x40
[ 8663.692129] ---[ end trace 12f255404f1ebfed ]---
[ 8663.698783] ------------[ cut here ]------------
[ 8663.705411] refcount_t: underflow; use-after-free.
[ 8663.712156] WARNING: CPU: 10 PID: 33414 at lib/refcount.c:28 refcount_warn_saturate+0xa6/0xf0
[ 8663.724800] Modules linked in: ceph libceph fscache xfs libcrc32c iscsi_target_mod target_core_mod intel_rapl_msr intel_rapl_common nfit libnvdimm ppdev parport_pc intel_rapl_perf ena parport i2c_piix4 drm ip_tables crct10dif_pclmul crc32_pclmul crc32c_intel nvme nvme_core ghash_clmulni_intel serio_raw
[ 8663.754526] CPU: 10 PID: 33414 Comm: kworker/10:2 Tainted: G        W         5.7.6-201.fc32.x86_64 #1
[ 8663.767549] Hardware name: Amazon EC2 i3en.3xlarge/, BIOS 1.0 10/16/2017
[ 8663.775165] Workqueue: ceph-msgr ceph_con_workfn [libceph]
[ 8663.782180] RIP: 0010:refcount_warn_saturate+0xa6/0xf0
[ 8663.789022] Code: 05 7d a3 4f 01 01 e8 20 50 bc ff 0f 0b c3 80 3d 6b a3 4f 01 00 75 95 48 c7 c7 20 d6 3b 83 c6 05 5b a3 4f 01 01 e8 01 50 bc ff <0f> 0b c3 80 3d 4a a3 4f 01 00 0f 85 72 ff ff ff 48 c7 c7 78 d6 3b
[ 8663.810286] RSP: 0018:ffffa84a8229bba0 EFLAGS: 00010286
[ 8663.816418] RAX: 0000000000000026 RBX: ffff94c7829bed28 RCX: 0000000000000000
[ 8663.823359] RDX: ffff94c7d8ea7cc0 RSI: ffff94c7d8e99cc8 RDI: ffff94c7d8e99cc8
[ 8663.830276] RBP: ffff94c7cd2cb500 R08: 000000000000043e R09: 0000000000000019
[ 8663.837210] R10: 000000000000072e R11: 0000000000000000 R12: ffff94c65ea90800
[ 8663.844163] R13: ffff94c65ea908d0 R14: ffff94c7ca261000 R15: ffff94c65edb3528
[ 8663.851072] FS:  0000000000000000(0000) GS:ffff94c7d8e80000(0000) knlGS:0000000000000000
[ 8663.862170] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 8663.868527] CR2: 000055da16b73000 CR3: 000000178d7ac006 CR4: 00000000007606e0
[ 8663.875445] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 8663.882348] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 8663.889237] PKRU: 55555554
[ 8663.894351] Call Trace:
[ 8663.899337]  __destroy_snap_realm+0xc2/0x100 [ceph]
[ 8663.905326]  ceph_put_snap_realm+0x71/0xf0 [ceph]
[ 8663.911346]  ceph_handle_caps+0x96a/0x19b0 [ceph]
[ 8663.917460]  dispatch+0x283/0x1400 [ceph]
[ 8663.923103]  ? inet_recvmsg+0x4d/0xd0
[ 8663.928622]  ? ceph_tcp_recvmsg+0x62/0x80 [libceph]
[ 8663.934672]  ceph_con_workfn+0x19f7/0x2800 [libceph]
[ 8663.940697]  ? __switch_to_asm+0x40/0x70
[ 8663.946302]  ? __switch_to_asm+0x34/0x70
[ 8663.951892]  ? __switch_to_asm+0x40/0x70
[ 8663.957529]  ? __switch_to_asm+0x34/0x70
[ 8663.963121]  ? __switch_to_asm+0x40/0x70
[ 8663.968740]  ? __switch_to_asm+0x40/0x70
[ 8663.974355]  ? __switch_to_asm+0x34/0x70
[ 8663.979935]  ? __switch_to+0x80/0x420
[ 8663.985519]  ? __switch_to_asm+0x34/0x70
[ 8663.991134]  process_one_work+0x1b4/0x380
[ 8663.996780]  worker_thread+0x53/0x3e0
[ 8664.002300]  ? process_one_work+0x380/0x380
[ 8664.008024]  kthread+0x115/0x140
[ 8664.013343]  ? __kthread_bind_mask+0x60/0x60
[ 8664.019104]  ret_from_fork+0x35/0x40
[ 8664.024550] ---[ end trace 12f255404f1ebfee ]---
[ 8664.030525] ------------[ cut here ]------------
[ 8664.036545] kernel BUG at mm/slub.c:304!
[ 8664.042135] invalid opcode: 0000 [#1] SMP PTI
[ 8664.047921] CPU: 10 PID: 33414 Comm: kworker/10:2 Tainted: G        W         5.7.6-201.fc32.x86_64 #1
[ 8664.059542] Hardware name: Amazon EC2 i3en.3xlarge/, BIOS 1.0 10/16/2017
[ 8664.066301] Workqueue: ceph-msgr ceph_con_workfn [libceph]
[ 8664.072561] RIP: 0010:kfree+0x1ef/0x200
[ 8664.078147] Code: 5b 5d 41 5c 41 5d e9 d0 2d fd ff 4d 89 e1 41 b8 01 00 00 00 4c 89 d1 48 89 da 48 89 ee 4c 89 ef e8 26 fa ff ff e9 2a ff ff ff <0f> 0b 0f 0b 48 8b 05 16 1e 53 01 e9 43 fe ff ff 90 0f 1f 44 00 00
[ 8664.097449] RSP: 0018:ffffa84a8229bb90 EFLAGS: 00010246
[ 8664.103577] RAX: ffff94c7cd2cb500 RBX: ffff94c7cd2cb500 RCX: ffff94c7cd2cb580
[ 8664.110510] RDX: 0000000001351b7e RSI: ffffa84a8229bb90 RDI: ffff94c7d2806f40
[ 8664.117389] RBP: fffff858de34b280 R08: 000000000000043e R09: 0000000000000019
[ 8664.124293] R10: ffff94c7cd2cb500 R11: 0000000000000000 R12: ffffffffc07a7381
[ 8664.131234] R13: ffff94c7d2806f40 R14: ffff94c7ca261000 R15: ffff94c65edb3528
[ 8664.138144] FS:  0000000000000000(0000) GS:ffff94c7d8e80000(0000) knlGS:0000000000000000
[ 8664.149261] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 8664.155612] CR2: 000055da16b73000 CR3: 000000178d7ac006 CR4: 00000000007606e0
[ 8664.162581] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 8664.169482] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 8664.176453] PKRU: 55555554
[ 8664.181526] Call Trace:
[ 8664.186516]  ceph_put_snap_realm+0x71/0xf0 [ceph]
[ 8664.192425]  ceph_handle_caps+0x96a/0x19b0 [ceph]
[ 8664.198348]  dispatch+0x283/0x1400 [ceph]
[ 8664.203949]  ? inet_recvmsg+0x4d/0xd0
[ 8664.209418]  ? ceph_tcp_recvmsg+0x62/0x80 [libceph]
[ 8664.215402]  ceph_con_workfn+0x19f7/0x2800 [libceph]
[ 8664.221427]  ? __switch_to_asm+0x40/0x70
[ 8664.227059]  ? __switch_to_asm+0x34/0x70
[ 8664.232662]  ? __switch_to_asm+0x40/0x70
[ 8664.238253]  ? __switch_to_asm+0x34/0x70
[ 8664.243856]  ? __switch_to_asm+0x40/0x70
[ 8664.249442]  ? __switch_to_asm+0x40/0x70
[ 8664.255083]  ? __switch_to_asm+0x34/0x70
[ 8664.260743]  ? __switch_to+0x80/0x420
[ 8664.266319]  ? __switch_to_asm+0x34/0x70
[ 8664.272222]  process_one_work+0x1b4/0x380
[ 8664.277970]  worker_thread+0x53/0x3e0
[ 8664.283552]  ? process_one_work+0x380/0x380
[ 8664.289378]  kthread+0x115/0x140
[ 8664.294683]  ? __kthread_bind_mask+0x60/0x60
[ 8664.300420]  ret_from_fork+0x35/0x40
[ 8664.305872] Modules linked in: ceph libceph fscache xfs libcrc32c iscsi_target_mod target_core_mod intel_rapl_msr intel_rapl_common nfit libnvdimm ppdev parport_pc intel_rapl_perf ena parport i2c_piix4 drm ip_tables crct10dif_pclmul crc32_pclmul crc32c_intel nvme nvme_core ghash_clmulni_intel serio_raw
[ 8664.332441] ---[ end trace 12f255404f1ebfef ]---
[ 8664.338411] RIP: 0010:kfree+0x1ef/0x200
[ 8664.343944] Code: 5b 5d 41 5c 41 5d e9 d0 2d fd ff 4d 89 e1 41 b8 01 00 00 00 4c 89 d1 48 89 da 48 89 ee 4c 89 ef e8 26 fa ff ff e9 2a ff ff ff <0f> 0b 0f 0b 48 8b 05 16 1e 53 01 e9 43 fe ff ff 90 0f 1f 44 00 00
[ 8664.363345] RSP: 0018:ffffa84a8229bb90 EFLAGS: 00010246
[ 8664.369595] RAX: ffff94c7cd2cb500 RBX: ffff94c7cd2cb500 RCX: ffff94c7cd2cb580
[ 8664.376579] RDX: 0000000001351b7e RSI: ffffa84a8229bb90 RDI: ffff94c7d2806f40
[ 8664.383517] RBP: fffff858de34b280 R08: 000000000000043e R09: 0000000000000019
[ 8664.390448] R10: ffff94c7cd2cb500 R11: 0000000000000000 R12: ffffffffc07a7381
[ 8664.506212] R13: ffff94c7d2806f40 R14: ffff94c7ca261000 R15: ffff94c65edb3528
[ 8664.513158] FS:  0000000000000000(0000) GS:ffff94c7d8e80000(0000) knlGS:0000000000000000
[ 8664.524246] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 8664.530612] CR2: 000055da16b73000 CR3: 000000178d7ac006 CR4: 00000000007606e0
[ 8664.537534] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 8664.544428] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 8664.551361] PKRU: 55555554
[ 8959.725126] libceph: mds1 (1)10.0.1.10:6811 socket closed (con state OPEN)
[ 8959.727768] libceph: mds3 (1)10.0.1.9:6811 socket closed (con state OPEN)
[ 8959.727815] libceph: mds6 (1)10.0.1.8:6809 socket closed (con state OPEN)
[ 8959.728119] libceph: mds17 (1)10.0.1.2:6811 socket closed (con state OPEN)
[ 8959.728484] libceph: mds15 (1)10.0.1.3:6811 socket closed (con state OPEN)
[ 8959.728638] libceph: mds12 (1)10.0.1.5:6809 socket closed (con state OPEN)
[ 8959.728663] libceph: mds14 (1)10.0.1.4:6809 socket closed (con state OPEN)
[ 8959.728825] libceph: mds2 (1)10.0.1.10:6809 socket closed (con state OPEN)
[ 8959.728924] libceph: mds7 (1)10.0.1.7:6811 socket closed (con state OPEN)
[ 8959.728971] libceph: mds11 (1)10.0.1.5:6811 socket closed (con state OPEN)
[ 8959.729221] libceph: mds13 (1)10.0.1.4:6811 socket closed (con state OPEN)
[ 8959.730620] libceph: mds5 (1)10.0.1.8:6811 socket closed (con state OPEN)
[ 8959.730716] libceph: mds9 (1)10.0.1.6:6811 socket closed (con state OPEN)
[ 8959.731398] libceph: mds20 (1)10.0.1.1:6811 socket closed (con state OPEN)
[ 8959.732099] libceph: mds8 (1)10.0.1.7:6809 socket closed (con state OPEN)
[ 8959.738976] libceph: mds19 (1)10.0.1.1:6813 socket closed (con state OPEN)

History

#1 Updated by Jeff Layton about 2 years ago

Zheng says:

inodes 0x100 to 0x1ff are mdsdir inode. this can happen when the inode
is stray dir. I think the fix is create dummy inode and snaprealm for
0x100 ~ 0x1ff. The dummy snaprealm has seq == 1, all of the rest fields
are set to 0.

He also mentioned creating just one entry for the entire range. I'll have to see if that is possible.

#2 Updated by Jeff Layton over 1 year ago

I'm still trying to wrap my brain around the locking in this code, but it's pretty clear that it's badly broken (or at least, not working as advertised). In particular, ceph_get_snap_realm is documented as requiring the caller hold the snap_rwsem for write, but it's not held for that in the ceph_add_cap codepath.

While adding the dummy realms that Zheng mentioned in the comment above might be part of the eventual fix, I think we're probably encountering some races in this code due to inconsistent locking that may be explain some of the refcounting issues.

#3 Updated by Jeff Layton over 1 year ago

The snaprealm locking is not as broken as I had originally feared. Some of the comments were definitely wrong, but they mostly just claimed you needed to hold the snap_rwsem for write when holding it for read is actually OK. I'm working some patches to fix those (and correct the comments).

I'll still need to spin up some code to handle creating this dummy snaprealm, but I'm not quite to where I understand this enough to do it yet.

The refcounting bug looks different however. I think that is actually indicative of a refcounting bug in the snap_context handling (not to be confused with a snap_realm). I suspect the BUG_ON in slub.c is showing the result of a double free, of one of those objects.

#4 Updated by Nigel Williams about 1 year ago

We have had two hangs today we think due to this bug, any thoughts on whether this will arrive in Octopus soon? thanks.

Jul 6 00:50:19 rosalind-ceph-gw kernel: [ 5091.796634] ------------[ cut here ]------------
Jul 6 00:50:19 rosalind-ceph-gw kernel: [ 5091.796635] kernel BUG at mm/slub.c:306!
Jul 6 00:50:19 rosalind-ceph-gw kernel: [ 5091.796661] invalid opcode: 0000 [#1] SMP NOPTI
Jul 6 00:50:19 rosalind-ceph-gw kernel: [ 5091.796680] CPU: 58 PID: 6015 Comm: kworker/58:14 Tainted: G W 5.4.0-77-generic #86-Ubuntu
Jul 6 00:50:19 rosalind-ceph-gw kernel: [ 5091.796711] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-1 04/01/2014
Jul 6 00:50:19 rosalind-ceph-gw kernel: [ 5091.796748] Workqueue: ceph-msgr ceph_con_workfn [libceph]
Jul 6 00:50:19 rosalind-ceph-gw kernel: [ 5091.796774] RIP: 0010:kfree+0x236/0x250
Jul 6 00:50:19 rosalind-ceph-gw kernel: [ 5091.796795] Code: e7 e8 ae 75 fd ff e9 ef fe ff ff 4d 89 f1 41 b8 01 00 00 00 48 89 d9 48 89 da 4c 89 e6 4c 89 ef e8 9f fa ff ff e9 d0 fe ff ff <0f> 0b 48 8b 05 31 c8 37 01 e9 ff fd ff ff 66 66 2e 0f 1f 84 00 00
Jul 6 00:50:19 rosalind-ceph-gw kernel: [ 5091.796862] RSP: 0018:ffffb8a0018e3c40 EFLAGS: 00010246
Jul 6 00:50:19 rosalind-ceph-gw kernel: [ 5091.796886] RAX: ffff8a66bfb81800 RBX: ffff8a66bfb81800 RCX: ffff8a66bfb81800
Jul 6 00:50:19 rosalind-ceph-gw kernel: [ 5091.796905] RDX: 000000000007f3eb RSI: ffff8a66bf9988f8 RDI: ffff8a66bfb81800
Jul 6 00:50:19 rosalind-ceph-gw kernel: [ 5091.796923] RBP: ffffb8a0018e3c60 R08: ffff8a615edb3e18 R09: ffff8a6686b70818
Jul 6 00:50:19 rosalind-ceph-gw kernel: [ 5091.796941] R10: 0000000000000000 R11: 0000000000000001 R12: ffffe4ef5dfee000
Jul 6 00:50:19 rosalind-ceph-gw kernel: [ 5091.796962] R13: ffff8a66d2806f40 R14: ffffffffc0b0a194 R15: ffff8a666b027c00
Jul 6 00:50:19 rosalind-ceph-gw kernel: [ 5091.796985] FS: 0000000000000000(0000) GS:ffff8a66dfa80000(0000) knlGS:0000000000000000
Jul 6 00:50:19 rosalind-ceph-gw kernel: [ 5091.797009] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 6 00:50:19 rosalind-ceph-gw kernel: [ 5091.797033] CR2: 00007f8791321490 CR3: 0000002780024000 CR4: 00000000003406e0
Jul 6 00:50:19 rosalind-ceph-gw kernel: [ 5091.797055] Call Trace:
Jul 6 00:50:19 rosalind-ceph-gw kernel: [ 5091.797072] __destroy_snap_realm+0xa4/0x110 [ceph]
Jul 6 00:50:19 rosalind-ceph-gw kernel: [ 5091.797104] ceph_put_snap_realm+0x6f/0xe0 [ceph]
Jul 6 00:50:19 rosalind-ceph-gw kernel: [ 5091.797127] handle_reply+0x2e4/0x6a0 [ceph]
Jul 6 00:50:19 rosalind-ceph-gw kernel: [ 5091.797146] dispatch+0xe9/0x610 [ceph]
Jul 6 00:50:19 rosalind-ceph-gw kernel: [ 5091.797167] try_read+0x566/0x8c0 [libceph]
Jul 6 00:50:19 rosalind-ceph-gw kernel: [ 5091.797185] ? __switch_to_asm+0x40/0x70
Jul 6 00:50:19 rosalind-ceph-gw kernel: [ 5091.797198] ? __switch_to_asm+0x34/0x70
Jul 6 00:50:19 rosalind-ceph-gw kernel: [ 5091.797210] ? __switch_to_asm+0x40/0x70
Jul 6 00:50:19 rosalind-ceph-gw kernel: [ 5091.797222] ? __switch_to+0x68/0x470
Jul 6 00:50:19 rosalind-ceph-gw kernel: [ 5091.797233] ? __switch_to_asm+0x34/0x70
Jul 6 00:50:19 rosalind-ceph-gw kernel: [ 5091.797251] ceph_con_workfn+0x130/0x620 [libceph]
Jul 6 00:50:19 rosalind-ceph-gw kernel: [ 5091.797270] process_one_work+0x1eb/0x3b0
Jul 6 00:50:19 rosalind-ceph-gw kernel: [ 5091.797286] worker_thread+0x4d/0x400
Jul 6 00:50:19 rosalind-ceph-gw kernel: [ 5091.797301] kthread+0x104/0x140
Jul 6 00:50:19 rosalind-ceph-gw kernel: [ 5091.797315] ? process_one_work+0x3b0/0x3b0
Jul 6 00:50:19 rosalind-ceph-gw kernel: [ 5091.797332] ? kthread_park+0x90/0x90
Jul 6 00:50:19 rosalind-ceph-gw kernel: [ 5091.797348] ret_from_fork+0x22/0x40
Jul 6 00:50:19 rosalind-ceph-gw kernel: [ 5091.797364] Modules linked in: xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype iptable_filter iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 bpfilter br_netfilter bridge stp llc nfsv3 rpcsec_gss_krb5 nfsv4 nfs aufs ceph libceph fscache overlay dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua snd_hda_codec_generic ledtrig_audio snd_hda_intel snd_intel_dspcfg snd_hda_codec snd_hda_core snd_hwdep snd_pcm kvm_amd joydev ccp snd_timer snd soundcore input_leds kvm mac_hid serio_raw qemu_fw_cfg nfsd auth_rpcgss nfs_acl lockd sch_fq_codel grace sunrpc ip_tables x_tables autofs4 xfs btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear hid_generic qxl ttm usbhid crct10dif_pclmul drm_kms_helper crc32_pclmul syscopyarea ghash_clmulni_intel sysfillrect aesni_intel virtio_net crypto_simd sysimgblt cryptd fb_sys_fops net_failover hid glue_helper psmouse
Jul 6 00:50:19 rosalind-ceph-gw kernel: [ 5091.797379] virtio_blk drm failover i2c_piix4 floppy pata_acpi
Jul 6 00:50:19 rosalind-ceph-gw kernel: [ 5091.798295] ---[ end trace a5b8b4af5b4054bb ]---
Jul 6 00:50:19 rosalind-ceph-gw kernel: [ 5091.798952] RIP: 0010:kfree+0x236/0x250
Jul 6 00:50:19 rosalind-ceph-gw kernel: [ 5091.799625] Code: e7 e8 ae 75 fd ff e9 ef fe ff ff 4d 89 f1 41 b8 01 00 00 00 48 89 d9 48 89 da 4c 89 e6 4c 89 ef e8 9f fa ff ff e9 d0 fe ff ff <0f> 0b 48 8b 05 31 c8 37 01 e9 ff fd ff ff 66 66 2e 0f 1f 84 00 00
Jul 6 00:50:19 rosalind-ceph-gw kernel: [ 5091.800763] RSP: 0018:ffffb8a0018e3c40 EFLAGS: 00010246
Jul 6 00:50:19 rosalind-ceph-gw kernel: [ 5091.801335] RAX: ffff8a66bfb81800 RBX: ffff8a66bfb81800 RCX: ffff8a66bfb81800
Jul 6 00:50:19 rosalind-ceph-gw kernel: [ 5091.801914] RDX: 000000000007f3eb RSI: ffff8a66bf9988f8 RDI: ffff8a66bfb81800
Jul 6 00:50:19 rosalind-ceph-gw kernel: [ 5091.802461] RBP: ffffb8a0018e3c60 R08: ffff8a615edb3e18 R09: ffff8a6686b70818
Jul 6 00:50:19 rosalind-ceph-gw kernel: [ 5091.803000] R10: 0000000000000000 R11: 0000000000000001 R12: ffffe4ef5dfee000
Jul 6 00:50:19 rosalind-ceph-gw kernel: [ 5091.803547] R13: ffff8a66d2806f40 R14: ffffffffc0b0a194 R15: ffff8a666b027c00
Jul 6 00:50:19 rosalind-ceph-gw kernel: [ 5091.804086] FS: 0000000000000000(0000) GS:ffff8a66dfa80000(0000) knlGS:0000000000000000
Jul 6 00:50:19 rosalind-ceph-gw kernel: [ 5091.804616] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 6 00:50:19 rosalind-ceph-gw kernel: [ 5091.805146] CR2: 00007f8791321490 CR3: 0000002780024000 CR4: 00000000003406e0
Jul 6 00:50:45 rosalind-ceph-gw kernel: [ 5117.949546] libceph: mon1 (1)172.27.51.22:6789 session lost, hunting for new mon
Jul 6 00:50:45 rosalind-ceph-gw kernel: [ 5117.950444] libceph: mon1 (1)172.27.51.22:6789 session lost, hunting for new mon
Jul 6 00:50:45 rosalind-ceph-gw kernel: [ 5117.951025] libceph: mon1 (1)172.27.51.22:6789 session lost, hunting for new mon
Jul 6 00:50:45 rosalind-ceph-gw kernel: [ 5117.951608] libceph: mon0 (1)172.27.51.20:6789 session lost, hunting for new mon
Jul 6 00:50:45 rosalind-ceph-gw kernel: [ 5117.952148] libceph: mon2 (1)172.27.51.23:6789 session lost, hunting for new mon
Jul 6 00:51:12 rosalind-ceph-gw kernel: [ 5144.829632] ceph: mds0 caps stale
Jul 6 00:51:12 rosalind-ceph-gw kernel: [ 5144.830229] ceph: mds0 caps stale
Jul 6 00:51:12 rosalind-ceph-gw kernel: [ 5144.830727] ceph: mds0 caps stale
Jul 6 00:51:12 rosalind-ceph-gw kernel: [ 5144.831210] ceph: mds0 caps stale

#5 Updated by Ilya Dryomov about 1 year ago

  • Category set to fs/ceph

#6 Updated by Jeff Layton about 1 year ago

This is a kernel bug, so it wouldn't land in a particular ceph release. That said, there are no patches for this yet, so the ETA is not yet clear.

#7 Updated by Jeff Layton about 1 year ago

I pulled down the kernel debuginfo for the kernel from Mark's original report. The stack trace that shows the refcount_t warning is putting the snaprealm at the end of the cap import process:

        case CEPH_CAP_OP_IMPORT:
                realm = NULL;
                if (snaptrace_len) {
                        down_write(&mdsc->snap_rwsem);
                        ceph_update_snap_trace(mdsc, snaptrace,
                                               snaptrace + snaptrace_len,
                                               false, &realm);
                        downgrade_write(&mdsc->snap_rwsem);
                } else {
                        down_read(&mdsc->snap_rwsem);
                }
                handle_cap_import(mdsc, inode, h, peer, session,
                                  &cap, &extra_info.issued);
                handle_cap_grant(inode, session, cap,
                                 h, msg->middle, &extra_info);
                if (realm)
                        ceph_put_snap_realm(mdsc, realm);                 <<<< refcount_t underput here
                goto done_unlocked;

#8 Updated by Jeff Layton about 1 year ago

Nigel Williams wrote:

We have had two hangs today we think due to this bug, any thoughts on whether this will arrive in Octopus soon? thanks.

Nigel, I'm not certain this is the same bug as the one Mark saw. It's possible that it's something different. You may want to try a newer kernel if you're having issues -- 5.4 is quite old at this point.

#9 Updated by Jeff Layton about 1 year ago

After going over this code in some pretty heavy detail, I think I've spotted a potential race in ceph_put_snap_realm:

void ceph_put_snap_realm(struct ceph_mds_client *mdsc,
                         struct ceph_snap_realm *realm)
{
        dout("put_snap_realm %llx %p %d -> %d\n", realm->ino, realm,
             atomic_read(&realm->nref), atomic_read(&realm->nref)-1);
        if (!atomic_dec_and_test(&realm->nref))
                return;

        if (down_write_trylock(&mdsc->snap_rwsem)) {
                __destroy_snap_realm(mdsc, realm);
                up_write(&mdsc->snap_rwsem);
        } else {
                spin_lock(&mdsc->snap_empty_lock);
                list_add(&realm->empty_item, &mdsc->snap_empty);
                spin_unlock(&mdsc->snap_empty_lock);
        }
}

The problem here is that a ceph_get_snap_realm could race with this, and we end up putting the thing on the snap_empty list when we shouldn't. What we probably need to do is atomic_dec_and_lock the snap_empty_lock here and then just drop it if we can get the mutex. There are a couple of other places that change the nref though that probably also need to hold the spinlock too.

Longer term, we probably need to re-think the lifecycle of the snaprealm objects altogether. The locking around them is really complicated, and the code currently makes heavy use of recursion (which is generally frowned-upon in the kernel). It could use a substantial rework.

#10 Updated by Jeff Layton about 1 year ago

Ok, patch posted here:

https://lore.kernel.org/ceph-devel/20210803175126.29165-1-jlayton@kernel.org/T/#u

I suspect this may be the cause of the panic that Nigel saw as well. I'll plan to put this into the testing kernel for now, but we'll probably want this in v5.14 and stable kernels.

#11 Updated by Jeff Layton about 1 year ago

  • Status changed from New to Fix Under Review

#12 Updated by Jeff Layton 3 months ago

  • Status changed from Fix Under Review to Resolved

This seems to be fixed with patches merged upstream.

Also available in: Atom PDF