Bug #63586
Updated by Xiubo Li 6 months ago
https://pulpito.ceph.com/yuriw-2023-11-15_23:07:34-fs-wip-yuri-testing-2023-11-14-0743-reef-distro-default-smithi/7459949/
From remote/smithi204/syslog/kern.log.gz:
<pre>
2023-11-16T02:47:20.751447+00:00 smithi204 kernel: [ 1895.456434] libceph: client5322 fsid 1325a234-ab9e-4632-9a6e-98afbef8ae76
2023-11-16T02:47:20.751448+00:00 smithi204 kernel: [ 1895.456483] ceph: test_dummy_encryption mode enabled
2023-11-16T02:50:12.194680+00:00 smithi204 kernel: [ 2066.894411] ------------[ cut here ]------------
2023-11-16T02:50:12.194695+00:00 smithi204 kernel: [ 2066.894484] WARNING: CPU: 7 PID: 112988 at mm/page_alloc.c:4402 __alloc_pages+0x1e7/0x270
2023-11-16T02:50:12.194696+00:00 smithi204 kernel: [ 2066.894518] Modules linked in: ceph libceph fscache netfs veth xfs xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xfrm_user xfrm_algo xt_addrtype nft_compat nf_tables libcrc32c nfnetlink br_netfilter bridge stp llc rdma_ucm ib_uverbs rdma_cm iw_cm ib_cm ib_core overlay ipmi_ssif intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp mei_me coretemp crct10dif_pclmul ghash_clmulni_intel sha512_ssse3 aesni_intel mei crypto_simd cryptd joydev wmi ioatdma acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter acpi_pad kvm_intel kvm irqbypass sch_fq_codel scsi_transport_iscsi lp parport ramoops reed_solomon nfsd efi_pstore configfs auth_rpcgss nfs_acl lockd grace sunrpc ip_tables x_tables autofs4 hid_generic usbhid hid igb ixgbe i2c_algo_bit dca crc32_pclmul nvme i2c_i801 ahci ptp i2c_smbus lpc_ich libahci pps_core nvme_core mdio
2023-11-16T02:50:12.194699+00:00 smithi204 kernel: [ 2066.894671] CPU: 7 PID: 112988 Comm: kworker/7:0 Not tainted 6.6.0-rc5-g3b796cfe2969 #1
2023-11-16T02:50:12.194699+00:00 smithi204 kernel: [ 2066.894674] Hardware name: Supermicro SYS-5018R-WR/X10SRW-F, BIOS 2.0a 08/02/2016
2023-11-16T02:50:12.194730+00:00 smithi204 kernel: [ 2066.894676] Workqueue: ceph-msgr ceph_con_workfn [libceph]
2023-11-16T02:50:12.194732+00:00 smithi204 kernel: [ 2066.894701] RIP: 0010:__alloc_pages+0x1e7/0x270
2023-11-16T02:50:12.194732+00:00 smithi204 kernel: [ 2066.894704] Code: ff ff 00 0f 84 19 ff ff ff 80 ce 01 e9 11 ff ff ff 83 fe 0a 0f 86 9d fe ff ff 80 3d 80 f4 53 01 00 75 09 c6 05 77 f4 53 01 01 <0f> 0b 45 31 ed e9 57 ff ff ff e8 9a f3 e0 ff 84 c0 0f 85 4a ff ff
2023-11-16T02:50:12.194733+00:00 smithi204 kernel: [ 2066.894706] RSP: 0018:ffffbcae88d5bc58 EFLAGS: 00010246
2023-11-16T02:50:12.194734+00:00 smithi204 kernel: [ 2066.894709] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
2023-11-16T02:50:12.194734+00:00 smithi204 kernel: [ 2066.894711] RDX: 0000000000000000 RSI: 0000000000000018 RDI: 0000000000000000
2023-11-16T02:50:12.194742+00:00 smithi204 kernel: [ 2066.894713] RBP: 0000000000040c00 R08: ffff9ab023c70f00 R09: 0000000000000000
2023-11-16T02:50:12.194742+00:00 smithi204 kernel: [ 2066.894715] R10: ffff9aaf00ebf030 R11: ffff9ab023c6ee00 R12: 0000000000000018
2023-11-16T02:50:12.194743+00:00 smithi204 kernel: [ 2066.894717] R13: 0000000000000018 R14: ffffffffc1235c85 R15: ffff9aaf00ebf030
2023-11-16T02:50:12.194744+00:00 smithi204 kernel: [ 2066.894718] FS: 0000000000000000(0000) GS:ffff9ab65fdc0000(0000) knlGS:0000000000000000
2023-11-16T02:50:12.194744+00:00 smithi204 kernel: [ 2066.894720] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2023-11-16T02:50:12.194745+00:00 smithi204 kernel: [ 2066.894722] CR2: 00007f8a8a03c180 CR3: 000000005a63e002 CR4: 00000000003706e0
2023-11-16T02:50:12.194750+00:00 smithi204 kernel: [ 2066.894724] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
2023-11-16T02:50:12.194750+00:00 smithi204 kernel: [ 2066.894726] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
2023-11-16T02:50:12.194751+00:00 smithi204 kernel: [ 2066.894728] Call Trace:
2023-11-16T02:50:12.194751+00:00 smithi204 kernel: [ 2066.894730] <TASK>
2023-11-16T02:50:12.194751+00:00 smithi204 kernel: [ 2066.894732] ? __warn+0x84/0x170
2023-11-16T02:50:12.194752+00:00 smithi204 kernel: [ 2066.894736] ? __alloc_pages+0x1e7/0x270
2023-11-16T02:50:12.194752+00:00 smithi204 kernel: [ 2066.894740] ? report_bug+0x191/0x1a0
2023-11-16T02:50:12.194756+00:00 smithi204 kernel: [ 2066.894747] ? handle_bug+0x42/0x70
2023-11-16T02:50:12.194756+00:00 smithi204 kernel: [ 2066.894751] ? exc_invalid_op+0x18/0x70
2023-11-16T02:50:12.194757+00:00 smithi204 kernel: [ 2066.894754] ? asm_exc_invalid_op+0x1a/0x20
2023-11-16T02:50:12.194757+00:00 smithi204 kernel: [ 2066.894760] ? osd_sparse_read+0x115/0x670 [libceph]
2023-11-16T02:50:12.194757+00:00 smithi204 kernel: [ 2066.894791] ? __alloc_pages+0x1e7/0x270
2023-11-16T02:50:12.194758+00:00 smithi204 kernel: [ 2066.894798] __kmalloc_large_node+0x88/0x140
2023-11-16T02:50:12.194762+00:00 smithi204 kernel: [ 2066.894804] __kmalloc+0xdf/0x150
2023-11-16T02:50:12.194762+00:00 smithi204 kernel: [ 2066.894819] osd_sparse_read+0x115/0x670 [libceph]
2023-11-16T02:50:12.194762+00:00 smithi204 kernel: [ 2066.894834] ? read_partial_message_chunk+0x8b/0x90 [libceph]
2023-11-16T02:50:12.194763+00:00 smithi204 kernel: [ 2066.894850] ceph_con_v1_try_read+0x17cc/0x1e00 [libceph]
2023-11-16T02:50:12.194763+00:00 smithi204 kernel: [ 2066.894881] ? ceph_con_workfn+0x2e/0x730 [libceph]
2023-11-16T02:50:12.194763+00:00 smithi204 kernel: [ 2066.894894] ? lock_acquire+0xcc/0x2a0
2023-11-16T02:50:12.194764+00:00 smithi204 kernel: [ 2066.894911] ? process_one_work+0x176/0x4c0
2023-11-16T02:50:12.194768+00:00 smithi204 kernel: [ 2066.894915] ceph_con_workfn+0x383/0x730 [libceph]
2023-11-16T02:50:12.194768+00:00 smithi204 kernel: [ 2066.894929] ? process_one_work+0x176/0x4c0
2023-11-16T02:50:12.194768+00:00 smithi204 kernel: [ 2066.894932] process_one_work+0x1dd/0x4c0
2023-11-16T02:50:12.194769+00:00 smithi204 kernel: [ 2066.894939] worker_thread+0x1df/0x3e0
2023-11-16T02:50:12.194769+00:00 smithi204 kernel: [ 2066.894943] ? drain_workqueue+0x120/0x120
2023-11-16T02:50:12.194769+00:00 smithi204 kernel: [ 2066.894947] kthread+0xf4/0x130
2023-11-16T02:50:12.194774+00:00 smithi204 kernel: [ 2066.894959] ? kthread_complete_and_exit+0x20/0x20
2023-11-16T02:50:12.194774+00:00 smithi204 kernel: [ 2066.894964] ret_from_fork+0x31/0x50
2023-11-16T02:50:12.194774+00:00 smithi204 kernel: [ 2066.894967] ? kthread_complete_and_exit+0x20/0x20
2023-11-16T02:50:12.194775+00:00 smithi204 kernel: [ 2066.894970] ret_from_fork_asm+0x11/0x20
2023-11-16T02:50:12.194775+00:00 smithi204 kernel: [ 2066.894980] </TASK>
2023-11-16T02:50:12.194775+00:00 smithi204 kernel: [ 2066.894982] irq event stamp: 4219761
2023-11-16T02:50:12.194779+00:00 smithi204 kernel: [ 2066.894984] hardirqs last enabled at (4219767): [<ffffffffa213778d>] console_unlock+0x11d/0x150
2023-11-16T02:50:12.194780+00:00 smithi204 kernel: [ 2066.894988] hardirqs last disabled at (4219772): [<ffffffffa2137772>] console_unlock+0x102/0x150
2023-11-16T02:50:12.194780+00:00 smithi204 kernel: [ 2066.894991] softirqs last enabled at (4219726): [<ffffffffa20a5777>] __irq_exit_rcu+0x77/0xa0
2023-11-16T02:50:12.194780+00:00 smithi204 kernel: [ 2066.894996] softirqs last disabled at (4219709): [<ffffffffa20a5777>] __irq_exit_rcu+0x77/0xa0
2023-11-16T02:50:12.194781+00:00 smithi204 kernel: [ 2066.894998] ---[ end trace 0000000000000000 ]---
2023-11-16T02:50:12.194781+00:00 smithi204 kernel: [ 2066.895001] libceph: osd_sparse_read: failed to allocate 2208169984 sparse read extents
2023-11-16T02:50:12.194788+00:00 smithi204 kernel: [ 2066.903155] libceph: osd6 (1)172.21.15.70:6812 socket error on read
2023-11-16T02:50:12.218371+00:00 smithi204 kernel: [ 2066.908743] libceph: osd_sparse_read: failed to allocate 2208169984 sparse read extents
2023-11-16T02:50:12.218406+00:00 smithi204 kernel: [ 2066.917112] libceph: osd6 (1)172.21.15.70:6812 socket error on read
2023-11-16T02:50:12.218407+00:00 smithi204 kernel: [ 2066.918810] libceph: osd_sparse_read: failed to allocate 2208169984 sparse read extents
2023-11-16T02:50:12.219434+00:00 smithi204 kernel: [ 2066.926987] libceph: osd6 (1)172.21.15.70:6812 socket error on read
...
2023-11-16T14:13:46.983459+00:00 smithi204 kernel: [43082.250808] libceph: osd6 (1)172.21.15.70:6812 socket error on read
2023-11-16T14:13:47.002106+00:00 smithi204 kernel: [43082.252340] libceph: osd_sparse_read: failed to allocate 2208169984 sparse read extents
2023-11-16T14:13:47.002112+00:00 smithi204 kernel: [43082.260503] libceph: osd6 (1)172.21.15.70:6812 socket error on read
</pre>
The Rados just returned a very large extents *2208169984* and *kernel libceph* couldn't allocate enough memory for it and reset the connection and keep retrying the osd requests.
While in the Rados side, the *osd.6*, the read was succeeful:
<pre>
2023-11-16T02:50:12.200+0000 7fec2cc41640 15 osd.6 pg_epoch: 33 pg[5.11( v 33'9174 (0'0,33'9174] local-lis/les=31/32 n=4 ec=31/31 lis/c=31/31 les/c/f=32/32/0 sis=31) [6,4] r=0 lpr=31 crt=33'9174 lcod 33'9173 mlcod 33'9173 active+clean] log_op_stats osd_op(client.5322.0:160938 5.11 5:8b7d215f:::1000000985f.00000000:head [sparse-read 3756032~65536 out=65560b] snapc 0=[] RETRY=1 ondisk+retry+read+known_if_redirected e33) v8 inb 0 outb 65560 lat 0.000235
2023-11-16T02:50:12.200+0000 7fec28438640 10 osd.6 pg_epoch: 33 pg[5.2( v 33'5683 (0'0,33'5683] local-lis/les=31/32 n=1 ec=31/31 lis/c=31/31 les/c/f=32/32/0 sis=31) [6,0] r=0 lpr=31 crt=33'5683 lcod 33'5682 mlcod 33'5682 active+clean] do_op osd_op(client.5322.0:160962 5.2 5:411858ae:::10000009858.00000001:head [sparse-read 1630208~65536 [1@-1]] snapc 0=[] RETRY=1 ondisk+retry+read+known_if_redirected e33) v8 may_read -> read-ordered flags ondisk+retry+read+known_if_redirected
2023-11-16T02:50:12.200+0000 7fec2cc41640 20 osd.6 pg_epoch: 33 pg[5.11( v 33'9174 (0'0,33'9174] local-lis/les=31/32 n=4 ec=31/31 lis/c=31/31 les/c/f=32/32/0 sis=31) [6,4] r=0 lpr=31 crt=33'9174 lcod 33'9173 mlcod 33'9173 active+clean] prepare_stats_for_publish reporting purged_snaps []
2023-11-16T02:50:12.200+0000 7fec2cc41640 15 osd.6 pg_epoch: 33 pg[5.11( v 33'9174 (0'0,33'9174] local-lis/les=31/32 n=4 ec=31/31 lis/c=31/31 les/c/f=32/32/0 sis=31) [6,4] r=0 lpr=31 crt=33'9174 lcod 33'9173 mlcod 33'9173 active+clean] publish_stats_to_osd 33:9995
2023-11-16T02:50:12.200+0000 7fec28438640 10 osd.6 pg_epoch: 33 pg[5.2( v 33'5683 (0'0,33'5683] local-lis/les=31/32 n=1 ec=31/31 lis/c=31/31 les/c/f=32/32/0 sis=31) [6,0] r=0 lpr=31 crt=33'5683 lcod 33'5682 mlcod 33'5682 active+clean] get_object_context: found obc in cache: obc(5:411858ae:::10000009858.00000001:head rwstate(none n=0 w=0))
2023-11-16T02:50:12.200+0000 7fec2cc41640 1 -- [v2:172.21.15.70:6809/38829,v1:172.21.15.70:6812/38829] --> v1:172.21.15.204:0/484060394 -- osd_op_reply(160938 1000000985f.00000000 [sparse-read 3756032~65536 out=65560b] v0'0 uv9173 ondisk = 0) v8 -- 0x55b9f7dd4b40 con 0x55ba006e0400
</pre>