Bug #21420
ceph_osdc_writepages(): pre-allocated osdc->msgpool_op messages vs large number of snapshots
Description
https://www.spinics.net/lists/ceph-devel/msg38197.html:
kernel BUG at /home/kernel/COD/linux/net/ceph/osd_client.c:1554!
invalid opcode: 0000 [#1] SMP
[Wed Aug 30 14:17:04 2017] Modules linked in: binfmt_misc ipmi_devintf ceph libceph libcrc32c fscache ipmi_ssif intel_powerclamp coretemp kvm_intel kvm gpio_ich input_leds ipmi_si serio_raw irqbypass intel_cstate shpchp i7core_edac hpilo lpc_ich edac_core acpi_power_meter ipmi_msghandler mac_hid 8021q garp mrp stp llc bonding nfsd auth_rpcgss nfs_acl lp lockd grace parport sunrpc autofs4 btrfs xor raid6_pq mlx4_en ptp pps_core hid_generic i2c_algo_bit ttm drm_kms_helper usbhid syscopyarea sysfillrect sysimgblt fb_sys_fops hid mlx4_core hpsa psmouse drm pata_acpi bnx2 devlink scsi_transport_sas fjes
CPU: 18 PID: 471071 Comm: vsftpd Tainted: G I 4.9.44-040944-generic #201708161731
Hardware name: HP ProLiant DL360 G6, BIOS P64 08/16/2015
task: ffff9268331bc080 task.stack: ffffab4a96988000
RIP: 0010:[<ffffffffc09c44f7>] [<ffffffffc09c44f7>] send_request+0xa27/0xab0 [libceph]
RSP: 0018:ffffab4a9698b8e8 EFLAGS: 00010293
RAX: 0000000000000000 RBX: 0000000000002201 RCX: ffff925e48490000
RDX: ffff926242fcf553 RSI: 0000000000001295 RDI: 0000000000002201
RBP: ffffab4a9698b958 R08: ffff92685f95c9e0 R09: 0000000000000000
R10: 0000000000000000 R11: ffff926842265680 R12: ffff92684078c610
R13: 0000000000000001 R14: ffff926242fc608b R15: ffff92684078c610
FS: 00007f5b59fd5700(0000) GS:ffff92685f940000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fec2bb2f003 CR3: 000000068ef59000 CR4: 00000000000006e0
Stack:
 01ffffffc09b9cd7 ffff92685f959300 0000000000000067 ffff9268331bc080
 ffff926242fc7000 ffff926241a3e000 ffff926840215200 00000e8100002201
 0000000000000000 ffff92684078c610 ffff926841ddf7c0 0000000000000000
Call Trace:
 [<ffffffffc09c815a>] __submit_request+0x20a/0x2f0 [libceph]
 [<ffffffffc09c826b>] submit_request+0x2b/0x30 [libceph]
 [<ffffffffc09c8c14>] ceph_osdc_writepages+0x104/0x1a0 [libceph]
 [<ffffffffc0a0f4b1>] writepage_nounlock+0x2c1/0x470 [ceph]
 [<ffffffffa65f120a>] ? page_mkclean+0x6a/0xb0
 [<ffffffffa65ef3b0>] ? __page_check_address+0x1c0/0x1c0
 [<ffffffffc0a11f9c>] ceph_update_writeable_page+0xdc/0x4a0 [ceph]
 [<ffffffffa65a974d>] ? pagecache_get_page+0x17d/0x2a0
 [<ffffffffc0a123ca>] ceph_write_begin+0x6a/0x120 [ceph]
 [<ffffffffa65a89b8>] generic_perform_write+0xc8/0x1c0
 [<ffffffffa66592ee>] ? file_update_time+0x5e/0x110
 [<ffffffffc0a0c402>] ceph_write_iter+0xba2/0xbe0 [ceph]
 [<ffffffffa6b6238c>] ? release_sock+0x8c/0xa0
 [<ffffffffa6bce0b9>] ? tcp_recvmsg+0x4c9/0xb50
 [<ffffffffa6b5d65d>] ? sock_recvmsg+0x3d/0x50
 [<ffffffffa663ad45>] __vfs_write+0xe5/0x160
 [<ffffffffa663bfe5>] vfs_write+0xb5/0x1a0
 [<ffffffffa663d465>] SyS_write+0x55/0xc0
 [<ffffffffa6c9b9bb>] entry_SYSCALL_64_fastpath+0x1e/0xad
Code: fb ab e5 e9 de f6 ff ff ba 14 00 00 00 e9 42 f7 ff ff 49 c7 46 08 00 00 00 00 41 c7 46 10 00 00 00 00 49 8d 56 14 e9 6d fb ff ff <0f> 0b 0f 0b be 8f 05 00 00 48 c7 c7 d8 0c 9e c0 e8 b4 fb ab e5
RIP [<ffffffffc09c44f7>] send_request+0xa27/0xab0 [libceph]
RSP <ffffab4a9698b8e8>
---[ end trace 5c55854998e663dc ]---
You have quite a lot of snapshots -- 4758 of them? send_request() attempted to encode an 8 + 4 + 4758*8 = ~38k snap context into a 4k buffer. Normally it's fine because the snap context is taken into account when allocating a message buffer. However, this particular code path (... ceph_osdc_writepages()) uses pre-allocated messages, which are always 4k in size. I think it's a known bug^Wlimitation. As a short-term fix, we can probably increase that pre-allocated size from 4k to something bigger. A proper resolution would take a considerable amount of time. Until then I'd recommend a much more aggressive snapshot rotation schedule, which is a good idea anyway -- your writes will transmit faster!
ceph_writepages_start() (to a lesser extent) and ceph_osdc_writepages() code paths are affected.
Related issues
History
#1 Updated by Ilya Dryomov over 6 years ago
- Description updated (diff)
#2 Updated by Wyllys Ingersoll over 6 years ago
We are seeing this on other systems now that have fewer snapshots; it appears to happen around the time the snapshots are taken.
Ceph 10.2.7 on a 4.9.52 kernel.
Has this been looked at more closely or fixed in any recent kernels?
Oct 19 17:01:43 gw02 kernel: [ 4219.940277] kernel BUG at /home/kernel/COD/linux/net/ceph/osd_client.c:1554!
Oct 19 17:01:43 gw02 kernel: [ 4219.948143] invalid opcode: 0000 [#1] SMP
Oct 19 17:01:43 gw02 kernel: [ 4219.952624] Modules linked in: iptable_filter ip_tables x_tables tcp_diag inet_diag binfmt_misc ceph libceph fscache ipmi_ssif ipmi_devintf intel_rapl sb_edac edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm dcdbas irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw glue_helper ablk_helper cryptd intel_cstate intel_rapl_perf ipmi_si mei_me mei lpc_ich ipmi_msghandler shpchp acpi_power_meter mac_hid 8021q garp mrp stp llc bonding sunrpc autofs4 btrfs xor raid6_pq mlx4_en bnx2x tg3 mlx4_core ptp mxm_wmi mdio megaraid_sas devlink pata_acpi pps_core libcrc32c fjes wmi
Oct 19 17:01:43 gw02 kernel: [ 4220.013544] CPU: 10 PID: 152 Comm: kswapd0 Not tainted 4.9.52-040952-generic #201710131251
Oct 19 17:01:43 gw02 kernel: [ 4220.022768] Hardware name: Dell Inc. PowerEdge FC630/0R10KJ, BIOS 2.1.7 06/17/2016
Oct 19 17:01:43 gw02 kernel: [ 4220.031217] task: ffff8979b41216c0 task.stack: ffff9b6ccd4a0000
Oct 19 17:01:43 gw02 kernel: [ 4220.037822] RIP: 0010:[<ffffffffc06904f7>] [<ffffffffc06904f7>] send_request+0xa27/0xab0 [libceph]
Oct 19 17:01:43 gw02 kernel: [ 4220.047946] RSP: 0018:ffff9b6ccd4a3788 EFLAGS: 00010297
Oct 19 17:01:43 gw02 kernel: [ 4220.053872] RAX: 0000000000000000 RBX: 0000000000002201 RCX: ffff897684f76000
Oct 19 17:01:43 gw02 kernel: [ 4220.061834] RDX: ffff8969a4690923 RSI: 000000000000030f RDI: 0000000000002201
Oct 19 17:01:43 gw02 kernel: [ 4220.069798] RBP: ffff9b6ccd4a37f8 R08: ffff8969bfa9c9e0 R09: 0000000000000000
Oct 19 17:01:43 gw02 kernel: [ 4220.077760] R10: 0000000000000000 R11: ffff8969a3dcba78 R12: ffff8969b68e2890
Oct 19 17:01:43 gw02 kernel: [ 4220.085721] R13: 0000000000000001 R14: ffff8969a468f08b R15: ffff8969b68e2890
Oct 19 17:01:43 gw02 kernel: [ 4220.093684] FS: 0000000000000000(0000) GS:ffff8969bfa80000(0000) knlGS:0000000000000000
Oct 19 17:01:43 gw02 kernel: [ 4220.102714] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 19 17:01:43 gw02 kernel: [ 4220.109125] CR2: 00007f139916f000 CR3: 0000001f1d407000 CR4: 00000000003406e0
Oct 19 17:01:43 gw02 kernel: [ 4220.117088] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Oct 19 17:01:43 gw02 kernel: [ 4220.125049] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Oct 19 17:01:43 gw02 kernel: [ 4220.133012] Stack:
Oct 19 17:01:43 gw02 kernel: [ 4220.135255] ffff896900000009 ffff8969b46b47c0 0000000002400040 0000000000001000
Oct 19 17:01:43 gw02 kernel: [ 4220.143548] ffff8969a4690000 ffff89699ad6e800 ffff89699b86b100 0000100000002201
Oct 19 17:01:43 gw02 kernel: [ 4220.151834] 0000000000000000 ffff8969b68e2890 ffff8969b46b47c0 0000000000000001
Oct 19 17:01:43 gw02 kernel: [ 4220.160119] Call Trace:
Oct 19 17:01:43 gw02 kernel: [ 4220.162866] [<ffffffffc069415a>] __submit_request+0x20a/0x2f0 [libceph]
Oct 19 17:01:43 gw02 kernel: [ 4220.170353] [<ffffffffc069426b>] submit_request+0x2b/0x30 [libceph]
Oct 19 17:01:43 gw02 kernel: [ 4220.177451] [<ffffffffc0694c14>] ceph_osdc_writepages+0x104/0x1a0 [libceph]
Oct 19 17:01:43 gw02 kernel: [ 4220.185331] [<ffffffffc072b4b1>] writepage_nounlock+0x2c1/0x470 [ceph]
Oct 19 17:01:43 gw02 kernel: [ 4220.192716] [<ffffffffbadf0b05>] ? try_to_unmap_one+0x2b5/0x650
Oct 19 17:01:43 gw02 kernel: [ 4220.199411] [<ffffffffbadf19ca>] ? page_mkclean+0x6a/0xb0
Oct 19 17:01:43 gw02 kernel: [ 4220.205532] [<ffffffffbadefb70>] ? __page_check_address+0x1c0/0x1c0
Oct 19 17:01:43 gw02 kernel: [ 4220.212638] [<ffffffffc072b733>] ceph_writepage+0x33/0x60 [ceph]
Oct 19 17:01:43 gw02 kernel: [ 4220.219440] [<ffffffffbadbfba9>] pageout.isra.41+0x189/0x2a0
Oct 19 17:01:43 gw02 kernel: [ 4220.225845] [<ffffffffbadc2a75>] shrink_page_list+0x7d5/0x9a0
Oct 19 17:01:43 gw02 kernel: [ 4220.232355] [<ffffffffbadc3481>] shrink_inactive_list+0x231/0x560
Oct 19 17:01:43 gw02 kernel: [ 4220.239252] [<ffffffffbadc414b>] shrink_node_memcg+0x5fb/0x7d0
Oct 19 17:01:43 gw02 kernel: [ 4220.245856] [<ffffffffbadc4401>] shrink_node+0xe1/0x320
Oct 19 17:01:43 gw02 kernel: [ 4220.251780] [<ffffffffbadc53d6>] kswapd+0x2f6/0x710
Oct 19 17:01:43 gw02 kernel: [ 4220.257317] [<ffffffffbadc50e0>] ? mem_cgroup_shrink_node+0x170/0x170
Oct 19 17:01:43 gw02 kernel: [ 4220.264600] [<ffffffffbaca56b6>] kthread+0xe6/0x100
Oct 19 17:01:43 gw02 kernel: [ 4220.270138] [<ffffffffbaca55d0>] ? kthread_park+0x60/0x60
Oct 19 17:01:43 gw02 kernel: [ 4220.276259] [<ffffffffbb49cb35>] ret_from_fork+0x25/0x30
Oct 19 17:01:43 gw02 kernel: [ 4220.282282] Code: 3c 5f fa e9 de f6 ff ff ba 14 00 00 00 e9 42 f7 ff ff 49 c7 46 08 00 00 00 00 41 c7 46 10 00 00 00 00 49 8d 56 14 e9 6d fb ff ff <0f> 0b 0f 0b be 8f 05 00 00 48 c7 c7 d8 cc 6a c0 e8 e4 3b 5f fa
Oct 19 17:01:43 gw02 kernel: [ 4220.303922] RIP [<ffffffffc06904f7>] send_request+0xa27/0xab0 [libceph]
Oct 19 17:01:43 gw02 kernel: [ 4220.311421] RSP <ffff9b6ccd4a3788>
Oct 19 17:01:43 gw02 kernel: [ 4220.315851] ---[ end trace 154cde58ce00766e ]---
#3 Updated by Ilya Dryomov over 6 years ago
No, not yet. Increasing the size of pre-allocated messages would be a one-line patch, but seems a bit dumb -- it's still going to be a hard limit.
It appears you now have ~780 snapshots? If you get it down to 400, you won't see this crash.
#4 Updated by Wyllys Ingersoll over 6 years ago
What is the correct upper bound on the number of snapshots -- is it 400 or 4000? In the original description, you mentioned 4K, but in this later one it's 400.
#5 Updated by Ilya Dryomov over 6 years ago
400 is correct. 4k is 4096 bytes -- the size of the pre-allocated buffer.
I've updated the docs in https://github.com/ceph/ceph/pull/18579.
#6 Updated by Luis Henriques over 4 years ago
I was wondering if this bug could be closed, as it seems that it has already been fixed by the "libceph: preallocate message data items" patchset:
98c4bfe9d89b libceph: check reply num_data_items in setup_request_data()
0d9c1ab3be4c libceph: preallocate message data items
26f887e0a3c4 libceph, rbd, ceph: move ceph_osdc_alloc_messages() calls
39e58c3425b1 libceph: introduce alloc_watch_request()
81c65213d73c libceph: assign cookies in linger_submit()
3b83f60da6dd libceph: enable fallback to ceph_msg_new() in ceph_msgpool_get()
61d2f855042c ceph: num_ops is off by one in ceph_aio_retry_work()
41a264e1b30c libceph: no need to call osd_req_opcode_valid() in osd_req_encode_op()
668028844174 ceph: set timeout conditionally in __cap_delay_requeue
894868330a1e libceph: don't consume a ref on pagelist in ceph_msg_data_add_pagelist()
33165d472310 libceph: introduce ceph_pagelist_alloc()
#7 Updated by Ilya Dryomov over 4 years ago
No, I don't think so. It now falls back to a regular allocation that can fail, while that code path is specifically set up to use a memory pool allocation that isn't supposed to fail, in order to guarantee forward progress. Note that I added a WARN for that, so this ticket is still valid. It's just that I made it less of a problem because we no longer crash in this case.
#8 Updated by Luis Henriques over 4 years ago
Ilya Dryomov wrote:
No, I don't think so. It now falls back to a regular allocation that can fail, while that code path is specifically set up to use a memory pool allocation that isn't supposed to fail, in order to guarantee forward progress. Note that I added a WARN for that, so this ticket is still valid. It's just that I made it less of a problem because we no longer crash in this case.
Right, thanks for clarifying this Ilya. I guess I'll need to go re-read these patches then :-)
#9 Updated by Chris Dunlop over 4 years ago
Is the 400 snapshot limit per file system, or per directory, or per something else?
#10 Updated by Zheng Yan over 4 years ago
per filesystem
#11 Updated by Patrick Donnelly over 4 years ago
- Status changed from 12 to New
#12 Updated by Patrick Donnelly over 3 years ago
- Related to Feature #46074: mds: provide alternatives to increase the total cephfs subvolume snapshot counts to greater than the current 400 across a Cephfs volume added