Ceph Linux kernel client - Bug #21420: ceph_osdc_writepages(): pre-allocated osdc->msgpool_op messages vs large number of snapshots
https://tracker.ceph.com/issues/21420

Ilya Dryomov wrote on 2017-09-18T15:29:33Z:
Description updated.

Wyllys Ingersoll (wyllys.ingersoll@keepertech.com) wrote on 2017-10-26T17:10:54Z:
We are seeing this on other systems now that have fewer snapshots; it appears to happen around the time that the snapshots are taken. Ceph 10.2.7 on a 4.9.52 kernel.

Has this been looked at more closely or fixed in any recent kernels?
<pre>
Oct 19 17:01:43 gw02 kernel: [ 4219.940277] kernel BUG at /home/kernel/COD/linux/net/ceph/osd_client.c:1554!
Oct 19 17:01:43 gw02 kernel: [ 4219.948143] invalid opcode: 0000 [#1] SMP
Oct 19 17:01:43 gw02 kernel: [ 4219.952624] Modules linked in: iptable_filter ip_tables x_tables tcp_diag inet_diag binfmt_misc ceph libceph fscache ipmi_ssif ipmi_devintf intel_rapl sb_edac edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm dcdbas irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw glue_helper ablk_helper cryptd intel_cstate intel_rapl_perf ipmi_si mei_me mei lpc_ich ipmi_msghandler shpchp acpi_power_meter mac_hid 8021q garp mrp stp llc bonding sunrpc autofs4 btrfs xor raid6_pq mlx4_en bnx2x tg3 mlx4_core ptp mxm_wmi mdio megaraid_sas devlink pata_acpi pps_core libcrc32c fjes wmi
Oct 19 17:01:43 gw02 kernel: [ 4220.013544] CPU: 10 PID: 152 Comm: kswapd0 Not tainted 4.9.52-040952-generic #201710131251
Oct 19 17:01:43 gw02 kernel: [ 4220.022768] Hardware name: Dell Inc. PowerEdge FC630/0R10KJ, BIOS 2.1.7 06/17/2016
Oct 19 17:01:43 gw02 kernel: [ 4220.031217] task: ffff8979b41216c0 task.stack: ffff9b6ccd4a0000
Oct 19 17:01:43 gw02 kernel: [ 4220.037822] RIP: 0010:[<ffffffffc06904f7>] [<ffffffffc06904f7>] send_request+0xa27/0xab0 [libceph]
Oct 19 17:01:43 gw02 kernel: [ 4220.047946] RSP: 0018:ffff9b6ccd4a3788 EFLAGS: 00010297
Oct 19 17:01:43 gw02 kernel: [ 4220.053872] RAX: 0000000000000000 RBX: 0000000000002201 RCX: ffff897684f76000
Oct 19 17:01:43 gw02 kernel: [ 4220.061834] RDX: ffff8969a4690923 RSI: 000000000000030f RDI: 0000000000002201
Oct 19 17:01:43 gw02 kernel: [ 4220.069798] RBP: ffff9b6ccd4a37f8 R08: ffff8969bfa9c9e0 R09:
0000000000000000
Oct 19 17:01:43 gw02 kernel: [ 4220.077760] R10: 0000000000000000 R11: ffff8969a3dcba78 R12: ffff8969b68e2890
Oct 19 17:01:43 gw02 kernel: [ 4220.085721] R13: 0000000000000001 R14: ffff8969a468f08b R15: ffff8969b68e2890
Oct 19 17:01:43 gw02 kernel: [ 4220.093684] FS: 0000000000000000(0000) GS:ffff8969bfa80000(0000) knlGS:0000000000000000
Oct 19 17:01:43 gw02 kernel: [ 4220.102714] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 19 17:01:43 gw02 kernel: [ 4220.109125] CR2: 00007f139916f000 CR3: 0000001f1d407000 CR4: 00000000003406e0
Oct 19 17:01:43 gw02 kernel: [ 4220.117088] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Oct 19 17:01:43 gw02 kernel: [ 4220.125049] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Oct 19 17:01:43 gw02 kernel: [ 4220.133012] Stack:
Oct 19 17:01:43 gw02 kernel: [ 4220.135255] ffff896900000009 ffff8969b46b47c0 0000000002400040 0000000000001000
Oct 19 17:01:43 gw02 kernel: [ 4220.143548] ffff8969a4690000 ffff89699ad6e800 ffff89699b86b100 0000100000002201
Oct 19 17:01:43 gw02 kernel: [ 4220.151834] 0000000000000000 ffff8969b68e2890 ffff8969b46b47c0 0000000000000001
Oct 19 17:01:43 gw02 kernel: [ 4220.160119] Call Trace:
Oct 19 17:01:43 gw02 kernel: [ 4220.162866] [<ffffffffc069415a>] __submit_request+0x20a/0x2f0 [libceph]
Oct 19 17:01:43 gw02 kernel: [ 4220.170353] [<ffffffffc069426b>] submit_request+0x2b/0x30 [libceph]
Oct 19 17:01:43 gw02 kernel: [ 4220.177451] [<ffffffffc0694c14>] ceph_osdc_writepages+0x104/0x1a0 [libceph]
Oct 19 17:01:43 gw02 kernel: [ 4220.185331] [<ffffffffc072b4b1>] writepage_nounlock+0x2c1/0x470 [ceph]
Oct 19 17:01:43 gw02 kernel: [ 4220.192716] [<ffffffffbadf0b05>] ? try_to_unmap_one+0x2b5/0x650
Oct 19 17:01:43 gw02 kernel: [ 4220.199411] [<ffffffffbadf19ca>] ? page_mkclean+0x6a/0xb0
Oct 19 17:01:43 gw02 kernel: [ 4220.205532] [<ffffffffbadefb70>] ? __page_check_address+0x1c0/0x1c0
Oct 19 17:01:43 gw02 kernel: [ 4220.212638] [<ffffffffc072b733>] ceph_writepage+0x33/0x60 [ceph]
Oct 19 17:01:43 gw02 kernel: [ 4220.219440] [<ffffffffbadbfba9>] pageout.isra.41+0x189/0x2a0
Oct 19 17:01:43 gw02 kernel: [ 4220.225845] [<ffffffffbadc2a75>] shrink_page_list+0x7d5/0x9a0
Oct 19 17:01:43 gw02 kernel: [ 4220.232355] [<ffffffffbadc3481>] shrink_inactive_list+0x231/0x560
Oct 19 17:01:43 gw02 kernel: [ 4220.239252] [<ffffffffbadc414b>] shrink_node_memcg+0x5fb/0x7d0
Oct 19 17:01:43 gw02 kernel: [ 4220.245856] [<ffffffffbadc4401>] shrink_node+0xe1/0x320
Oct 19 17:01:43 gw02 kernel: [ 4220.251780] [<ffffffffbadc53d6>] kswapd+0x2f6/0x710
Oct 19 17:01:43 gw02 kernel: [ 4220.257317] [<ffffffffbadc50e0>] ? mem_cgroup_shrink_node+0x170/0x170
Oct 19 17:01:43 gw02 kernel: [ 4220.264600] [<ffffffffbaca56b6>] kthread+0xe6/0x100
Oct 19 17:01:43 gw02 kernel: [ 4220.270138] [<ffffffffbaca55d0>] ? kthread_park+0x60/0x60
Oct 19 17:01:43 gw02 kernel: [ 4220.276259] [<ffffffffbb49cb35>] ret_from_fork+0x25/0x30
Oct 19 17:01:43 gw02 kernel: [ 4220.282282] Code: 3c 5f fa e9 de f6 ff ff ba 14 00 00 00 e9 42 f7 ff ff 49 c7 46 08 00 00 00 00 41 c7 46 10 00 00 00 00 49 8d 56 14 e9 6d fb ff ff <0f> 0b 0f 0b be 8f 05 00 00 48 c7 c7 d8 cc 6a c0 e8 e4 3b 5f fa
Oct 19 17:01:43 gw02 kernel: [ 4220.303922] RIP [<ffffffffc06904f7>] send_request+0xa27/0xab0 [libceph]
Oct 19 17:01:43 gw02 kernel: [ 4220.311421] RSP <ffff9b6ccd4a3788>
Oct 19 17:01:43 gw02 kernel: [ 4220.315851] ---[ end trace 154cde58ce00766e ]---
</pre>

Ilya Dryomov wrote on 2017-10-27T07:10:04Z:
No, not yet. Increasing the size of pre-allocated messages would be a one-line patch, but seems a bit dumb -- it's still going to be a hard limit.
It appears you now have ~780 snapshots? If you get it down to 400, you won't see this crash.

Wyllys Ingersoll wrote on 2017-10-31T15:49:03Z:
What is the correct upper bound on the number of snapshots: is it 400 or 4000? In the original description you mentioned 4K, but in this later comment it's 400.

Ilya Dryomov wrote on 2017-10-31T15:54:37Z:
400 is correct. 4k is 4096 bytes -- the size of the pre-allocated buffer.
I've updated the docs in https://github.com/ceph/ceph/pull/18579.
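A rough back-of-the-envelope check of where that ~400 figure could come from, assuming each snapshot id in the snap context is encoded as a 64-bit value and the rest of the request encoding takes up on the order of a kilobyte of the 4096-byte front buffer; the names and the overhead constant below are illustrative assumptions, not the actual libceph symbols:

<pre>
/*
 * Illustrative arithmetic only -- not the actual libceph code.
 * Assumptions: the pre-allocated request message has a 4096-byte
 * front buffer, each snapshot id in the snap context costs 8 bytes
 * (a 64-bit value), and the rest of the request encoding (ops,
 * object name, flags, ...) needs roughly 900 bytes.
 */
#include <stdio.h>

int main(void)
{
	const unsigned front_len = 4096;  /* pre-allocated front buffer (4k)         */
	const unsigned per_snap  = 8;     /* bytes per encoded snapshot id           */
	const unsigned overhead  = 900;   /* assumed size of the rest of the request */

	printf("roughly %u snapshots fit\n", (front_len - overhead) / per_snap);
	return 0;
}
</pre>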
Luis Henriques wrote on 2019-07-10T10:28:56Z:

I was wondering if this bug could be closed, as it seems that it has already been fixed by the "libceph: preallocate message data items" patchset:

<pre>
98c4bfe9d89b libceph: check reply num_data_items in setup_request_data()
0d9c1ab3be4c libceph: preallocate message data items
26f887e0a3c4 libceph, rbd, ceph: move ceph_osdc_alloc_messages() calls
39e58c3425b1 libceph: introduce alloc_watch_request()
81c65213d73c libceph: assign cookies in linger_submit()
3b83f60da6dd libceph: enable fallback to ceph_msg_new() in ceph_msgpool_get()
61d2f855042c ceph: num_ops is off by one in ceph_aio_retry_work()
41a264e1b30c libceph: no need to call osd_req_opcode_valid() in osd_req_encode_op()
668028844174 ceph: set timeout conditionally in __cap_delay_requeue
894868330a1e libceph: don't consume a ref on pagelist in ceph_msg_data_add_pagelist()
33165d472310 libceph: introduce ceph_pagelist_alloc()
</pre>

Ilya Dryomov wrote on 2019-07-10T18:51:16Z:
No, I don't think so. It now falls back to a regular allocation that can fail, while that code path is specifically set up to use a memory pool allocation that isn't supposed to fail, in order to guarantee forward progress. Note that I added a WARN for that, so this ticket is still valid. It's just that I made it less of a problem because we no longer crash in this case.
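A conceptual sketch of the fallback behaviour described above, using hypothetical names rather than the real ceph_msgpool_get()/ceph_msg_new() signatures: the pool hands out pre-allocated, fixed-size messages so writeback can always make forward progress, and a request that does not fit (e.g. one carrying a very large snap context) now falls back to a regular allocation that may fail under the same memory pressure that triggered the writeback, hence the WARN instead of the old BUG:

<pre>
/*
 * Conceptual sketch only -- hypothetical names, not the libceph API.
 * The message pool pre-allocates fixed-size messages so that dirty
 * page writeback can always make forward progress.  If the encoded
 * request is larger than the pre-allocated front buffer, fall back
 * to a regular allocation, which can fail under memory pressure.
 */
#include <stdio.h>
#include <stdlib.h>

#define POOL_FRONT_LEN 4096                 /* size of pre-allocated messages */

static char pool_msg[POOL_FRONT_LEN];       /* stands in for the msgpool */

static void *get_request_buffer(size_t front_len)
{
	if (front_len <= POOL_FRONT_LEN)
		return pool_msg;             /* no-fail memory pool path */

	/* Fallback path: may return NULL, so the caller can no longer
	 * rely on this allocation always succeeding. */
	fprintf(stderr, "WARN: request does not fit a pre-allocated message\n");
	return malloc(front_len);
}

int main(void)
{
	printf("small request: %p\n", get_request_buffer(1024));
	printf("huge request:  %p\n", get_request_buffer(64 * 1024));
	return 0;
}
</pre>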
Luis Henriques wrote on 2019-07-11T09:50:03Z:

Ilya Dryomov wrote:
> No, I don't think so. It now falls back to a regular allocation that can fail, while that code path is specifically set up to use a memory pool allocation that isn't supposed to fail, in order to guarantee forward progress. Note that I added a WARN for that, so this ticket is still valid. It's just that I made it less of a problem because we no longer crash in this case.
<p>Right, thanks for clarifying this Ilya. I guess I'll need to go re-read these patches then :-)</p> Linux kernel client - Bug #21420: ceph_osdc_writepages(): pre-allocated osdc->msgpool_op messages vs large number of snapshotshttps://tracker.ceph.com/issues/21420?journal_id=1495402019-10-25T02:31:28ZChris Dunlopchris@onthe.net.au
Is the 400 snapshot limit per file system, or per directory, or per something else?

Zheng Yan (ukernel@gmail.com) wrote on 2019-10-25T02:34:49Z:
Per filesystem.

Patrick Donnelly (pdonnell@redhat.com) wrote on 2019-12-05T21:36:02Z:
Status changed from 12 to New.

Patrick Donnelly (pdonnell@redhat.com) wrote on 2020-07-29T18:11:31Z:
Related to Feature #46074 (https://tracker.ceph.com/issues/46074): mds: provide alternatives to increase the total cephfs subvolume snapshot counts to greater than the current 400 across a CephFS volume (added).