Bug #21420

ceph_osdc_writepages(): pre-allocated osdc->msgpool_op messages vs large number of snapshots

Added by Ilya Dryomov over 6 years ago. Updated over 4 years ago.

Status: New
Priority: Normal
Assignee: -
Category: fs/ceph
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

https://www.spinics.net/lists/ceph-devel/msg38197.html:

kernel BUG at /home/kernel/COD/linux/net/ceph/osd_client.c:1554!
invalid opcode: 0000 [#1] SMP
[Wed Aug 30 14:17:04 2017] Modules linked in: binfmt_misc ipmi_devintf ceph libceph libcrc32c fscache ipmi_ssif intel_powerclamp coretemp kvm_intel kvm gpio_ich input_leds ipmi_si serio_raw irqbypass intel_cstate shpchp i7core_edac hpilo lpc_ich edac_core acpi_power_meter ipmi_msghandler mac_hid 8021q garp mrp stp llc bonding nfsd auth_rpcgss nfs_acl lp lockd grace parport sunrpc autofs4 btrfs xor raid6_pq mlx4_en ptp pps_core hid_generic i2c_algo_bit ttm drm_kms_helper usbhid syscopyarea sysfillrect sysimgblt fb_sys_fops hid mlx4_core hpsa psmouse drm pata_acpi bnx2 devlink scsi_transport_sas fjes
CPU: 18 PID: 471071 Comm: vsftpd Tainted: G I     4.9.44-040944-generic #201708161731
Hardware name: HP ProLiant DL360 G6, BIOS P64 08/16/2015
task: ffff9268331bc080 task.stack: ffffab4a96988000
RIP: 0010:[<ffffffffc09c44f7>] [<ffffffffc09c44f7>] send_request+0xa27/0xab0 [libceph]
RSP: 0018:ffffab4a9698b8e8  EFLAGS: 00010293
RAX: 0000000000000000 RBX: 0000000000002201 RCX: ffff925e48490000
RDX: ffff926242fcf553 RSI: 0000000000001295 RDI: 0000000000002201
RBP: ffffab4a9698b958 R08: ffff92685f95c9e0 R09: 0000000000000000
R10: 0000000000000000 R11: ffff926842265680 R12: ffff92684078c610
R13: 0000000000000001 R14: ffff926242fc608b R15: ffff92684078c610
FS:  00007f5b59fd5700(0000) GS:ffff92685f940000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fec2bb2f003 CR3: 000000068ef59000 CR4: 00000000000006e0
Stack:
 01ffffffc09b9cd7 ffff92685f959300 0000000000000067 ffff9268331bc080
 ffff926242fc7000 ffff926241a3e000 ffff926840215200 00000e8100002201
 0000000000000000 ffff92684078c610 ffff926841ddf7c0 0000000000000000
Call Trace:
 [<ffffffffc09c815a>] __submit_request+0x20a/0x2f0 [libceph]
 [<ffffffffc09c826b>] submit_request+0x2b/0x30 [libceph]
 [<ffffffffc09c8c14>] ceph_osdc_writepages+0x104/0x1a0 [libceph]
 [<ffffffffc0a0f4b1>] writepage_nounlock+0x2c1/0x470 [ceph]
 [<ffffffffa65f120a>] ? page_mkclean+0x6a/0xb0
 [<ffffffffa65ef3b0>] ?  __page_check_address+0x1c0/0x1c0
 [<ffffffffc0a11f9c>] ceph_update_writeable_page+0xdc/0x4a0 [ceph]
 [<ffffffffa65a974d>] ?  pagecache_get_page+0x17d/0x2a0
 [<ffffffffc0a123ca>] ceph_write_begin+0x6a/0x120 [ceph]
 [<ffffffffa65a89b8>] generic_perform_write+0xc8/0x1c0
 [<ffffffffa66592ee>] ? file_update_time+0x5e/0x110
 [<ffffffffc0a0c402>] ceph_write_iter+0xba2/0xbe0 [ceph]
 [<ffffffffa6b6238c>] ? release_sock+0x8c/0xa0
 [<ffffffffa6bce0b9>] ? tcp_recvmsg+0x4c9/0xb50
 [<ffffffffa6b5d65d>] ? sock_recvmsg+0x3d/0x50
 [<ffffffffa663ad45>] __vfs_write+0xe5/0x160
 [<ffffffffa663bfe5>] vfs_write+0xb5/0x1a0
 [<ffffffffa663d465>] SyS_write+0x55/0xc0
 [<ffffffffa6c9b9bb>] entry_SYSCALL_64_fastpath+0x1e/0xad
Code: fb ab e5 e9 de f6 ff ff ba 14 00 00 00 e9 42 f7 ff ff 49 c7 46 08 00 00 00 00 41 c7 46 10 00 00 00 00 49 8d 56 14 e9 6d fb ff ff <0f> 0b 0f 0b be 8f 05 00 00 48 c7 c7 d8 0c 9e c0 e8 b4 fb ab e5
RIP  [<ffffffffc09c44f7>] send_request+0xa27/0xab0 [libceph]
 RSP <ffffab4a9698b8e8>
---[ end trace 5c55854998e663dc ]---
You have quite a lot of snapshots -- 4758 of them?  send_request()
attempted to encode an 8 + 4 + 4758*8 = ~38k snap context into a 4k
buffer.  Normally it's fine because the snap context is taken into
account when allocating a message buffer.  However, this particular
code path (... ceph_osdc_writepages()) uses pre-allocated messages,
which are always 4k in size.

I think it's a known bug^Wlimitation.  As a short-term fix, we can
probably increase that pre-allocated size from 4k to something bigger.
A proper resolution would take a considerable amount of time.  Until
then I'd recommend a much more aggressive snapshot rotation schedule,
which is a good idea anyway -- your writes will transmit faster!

Both the ceph_writepages_start() (to a lesser extent) and the ceph_osdc_writepages() code paths are affected.
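
For illustration, here is a minimal self-contained C sketch of the size arithmetic from the analysis above (this is not libceph code; POOL_MSG_FRONT_LEN is just a stand-in for the 4k pre-allocated message front):

/*
 * Illustrative userspace sketch, not libceph code: the snap context is
 * encoded as an 8-byte seq, a 4-byte snapshot count and one 8-byte id per
 * snapshot, so 4758 snapshots need ~38k -- far more than the 4k front of
 * a pre-allocated msgpool message.
 */
#include <stdio.h>
#include <stdint.h>

#define POOL_MSG_FRONT_LEN 4096    /* stand-in for the 4k pool message */

static size_t snapc_encoded_len(uint32_t num_snaps)
{
    return 8 + 4 + (size_t)num_snaps * 8;    /* seq + count + ids */
}

int main(void)
{
    uint32_t num_snaps = 4758;               /* from the report above */
    size_t need = snapc_encoded_len(num_snaps);

    printf("snap context: %zu bytes, pool message front: %d bytes\n",
           need, POOL_MSG_FRONT_LEN);
    if (need > POOL_MSG_FRONT_LEN)
        printf("-> encoding would overrun the buffer (the BUG above)\n");
    return 0;
}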


Related issues

Related to CephFS - Feature #46074: mds: provide alternatives to increase the total cephfs subvolume snapshot counts to greater than the current 400 across a CephFS volume Resolved

History

#1 Updated by Ilya Dryomov over 6 years ago

  • Description updated (diff)

#2 Updated by Wyllys Ingersoll over 6 years ago

We are seeing this on other systems now that have fewer snapshots; it appears to happen around the time the snapshots are taken.
Ceph 10.2.7 on a 4.9.52 kernel.

Has this been looked at more closely or fixed in any recent kernels?

Oct 19 17:01:43 gw02 kernel: [ 4219.940277] kernel BUG at /home/kernel/COD/linux/net/ceph/osd_client.c:1554!
Oct 19 17:01:43 gw02 kernel: [ 4219.948143] invalid opcode: 0000 [#1] SMP
Oct 19 17:01:43 gw02 kernel: [ 4219.952624] Modules linked in: iptable_filter ip_tables x_tables tcp_diag inet_diag binfmt_misc ceph libceph fscache ipmi_ssif ipmi_devintf intel_rapl sb_edac edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm dcdbas irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw glue_helper ablk_helper cryptd intel_cstate intel_rapl_perf ipmi_si mei_me mei lpc_ich ipmi_msghandler shpchp acpi_power_meter mac_hid 8021q garp mrp stp llc bonding sunrpc autofs4 btrfs xor raid6_pq mlx4_en bnx2x tg3 mlx4_core ptp mxm_wmi mdio megaraid_sas devlink pata_acpi pps_core libcrc32c fjes wmi
Oct 19 17:01:43 gw02 kernel: [ 4220.013544] CPU: 10 PID: 152 Comm: kswapd0 Not tainted 4.9.52-040952-generic #201710131251
Oct 19 17:01:43 gw02 kernel: [ 4220.022768] Hardware name: Dell Inc. PowerEdge FC630/0R10KJ, BIOS 2.1.7 06/17/2016
Oct 19 17:01:43 gw02 kernel: [ 4220.031217] task: ffff8979b41216c0 task.stack: ffff9b6ccd4a0000
Oct 19 17:01:43 gw02 kernel: [ 4220.037822] RIP: 0010:[<ffffffffc06904f7>]  [<ffffffffc06904f7>] send_request+0xa27/0xab0 [libceph]
Oct 19 17:01:43 gw02 kernel: [ 4220.047946] RSP: 0018:ffff9b6ccd4a3788  EFLAGS: 00010297
Oct 19 17:01:43 gw02 kernel: [ 4220.053872] RAX: 0000000000000000 RBX: 0000000000002201 RCX: ffff897684f76000
Oct 19 17:01:43 gw02 kernel: [ 4220.061834] RDX: ffff8969a4690923 RSI: 000000000000030f RDI: 0000000000002201
Oct 19 17:01:43 gw02 kernel: [ 4220.069798] RBP: ffff9b6ccd4a37f8 R08: ffff8969bfa9c9e0 R09:
 0000000000000000
Oct 19 17:01:43 gw02 kernel: [ 4220.077760] R10: 0000000000000000 R11: ffff8969a3dcba78 R12: ffff8969b68e2890
Oct 19 17:01:43 gw02 kernel: [ 4220.085721] R13: 0000000000000001 R14: ffff8969a468f08b R15: ffff8969b68e2890
Oct 19 17:01:43 gw02 kernel: [ 4220.093684] FS:  0000000000000000(0000) GS:ffff8969bfa80000(0000) knlGS:0000000000000000
Oct 19 17:01:43 gw02 kernel: [ 4220.102714] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 19 17:01:43 gw02 kernel: [ 4220.109125] CR2: 00007f139916f000 CR3: 0000001f1d407000 CR4: 00000000003406e0
Oct 19 17:01:43 gw02 kernel: [ 4220.117088] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Oct 19 17:01:43 gw02 kernel: [ 4220.125049] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Oct 19 17:01:43 gw02 kernel: [ 4220.133012] Stack:
Oct 19 17:01:43 gw02 kernel: [ 4220.135255]  ffff896900000009 ffff8969b46b47c0 0000000002400040 0000000000001000
Oct 19 17:01:43 gw02 kernel: [ 4220.143548]  ffff8969a4690000 ffff89699ad6e800 ffff89699b86b100 0000100000002201
Oct 19 17:01:43 gw02 kernel: [ 4220.151834]  0000000000000000 ffff8969b68e2890 ffff8969b46b47c0 0000000000000001
Oct 19 17:01:43 gw02 kernel: [ 4220.160119] Call Trace:
Oct 19 17:01:43 gw02 kernel: [ 4220.162866]  [<ffffffffc069415a>] __submit_request+0x20a/0x2f0 [libceph]
Oct 19 17:01:43 gw02 kernel: [ 4220.170353]  [<ffffffffc069426b>] submit_request+0x2b/0x30 [libceph]
Oct 19 17:01:43 gw02 kernel: [ 4220.177451]  [<ffffffffc0694c14>] ceph_osdc_writepages+0x104/0x1a0 [libceph]
Oct 19 17:01:43 gw02 kernel: [ 4220.185331]  [<ffffffffc072b4b1>] writepage_nounlock+0x2c1/0x470 [ceph]
Oct 19 17:01:43 gw02 kernel: [ 4220.192716]  [<ffffffffbadf0b05>] ? try_to_unmap_one+0x2b5/0x650
Oct 19 17:01:43 gw02 kernel: [ 4220.199411]  [<ffffffffbadf19ca>] ? page_mkclean+0x6a/0xb0
Oct 19 17:01:43 gw02 kernel: [ 4220.205532]  [<ffffffffbadefb70>] ? __page_check_address+0x1c0/0x1c0
Oct 19 17:01:43 gw02 kernel: [ 4220.212638]  [<ffffffffc072b733>] ceph_writepage+0x33/0x60 [ceph]
Oct 19 17:01:43 gw02 kernel: [ 4220.219440]  [<ffffffffbadbfba9>] pageout.isra.41+0x189/0x2a0
Oct 19 17:01:43 gw02 kernel: [ 4220.225845]  [<ffffffffbadc2a75>] shrink_page_list+0x7d5/0x9a0
Oct 19 17:01:43 gw02 kernel: [ 4220.232355]  [<ffffffffbadc3481>] shrink_inactive_list+0x231/0x560
Oct 19 17:01:43 gw02 kernel: [ 4220.239252]  [<ffffffffbadc414b>] shrink_node_memcg+0x5fb/0x7d0
Oct 19 17:01:43 gw02 kernel: [ 4220.245856]  [<ffffffffbadc4401>] shrink_node+0xe1/0x320
Oct 19 17:01:43 gw02 kernel: [ 4220.251780]  [<ffffffffbadc53d6>] kswapd+0x2f6/0x710
Oct 19 17:01:43 gw02 kernel: [ 4220.257317]  [<ffffffffbadc50e0>] ? mem_cgroup_shrink_node+0x170/0x170
Oct 19 17:01:43 gw02 kernel: [ 4220.264600]  [<ffffffffbaca56b6>] kthread+0xe6/0x100
Oct 19 17:01:43 gw02 kernel: [ 4220.270138]  [<ffffffffbaca55d0>] ? kthread_park+0x60/0x60
Oct 19 17:01:43 gw02 kernel: [ 4220.276259]  [<ffffffffbb49cb35>] ret_from_fork+0x25/0x30
Oct 19 17:01:43 gw02 kernel: [ 4220.282282] Code: 3c 5f fa e9 de f6 ff ff ba 14 00 00 00 e9 42 f7 ff ff 49 c7 46 08 00 00 00 00 41 c7 46 10 00 00 00 00 49 8d 56 14 e9 6d fb ff ff <0f> 0b 0f 0b be 8f 05 00 00 48 c7 c7 d8 cc 6a c0 e8 e4 3b 5f fa
Oct 19 17:01:43 gw02 kernel: [ 4220.303922] RIP  [<ffffffffc06904f7>] send_request+0xa27/0xab0 [libceph]
Oct 19 17:01:43 gw02 kernel: [ 4220.311421]  RSP <ffff9b6ccd4a3788>
Oct 19 17:01:43 gw02 kernel: [ 4220.315851] ---[ end trace 154cde58ce00766e ]---

#3 Updated by Ilya Dryomov over 6 years ago

No, not yet. Increasing the size of pre-allocated messages would be a one-line patch, but seems a bit dumb -- it's still going to be a hard limit.

It appears you now have ~780 snapshots? If you get it down to 400, you won't see this crash.

#4 Updated by Wyllys Ingersoll over 6 years ago

What is the correct upper bound on the number of snapshots - is it 400 or 4000? In the original description, you mentioned 4K, but in this later one it's 400.

#5 Updated by Ilya Dryomov over 6 years ago

400 is correct. 4k is 4096 bytes -- the size of the pre-allocated buffer.
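
As a rough back-of-the-envelope check, using the same 8 + 4 + 8-bytes-per-snapshot encoding described above and assuming the rest of the OSD request encoding has to share the same 4096-byte front: 400 snapshots encode to 8 + 4 + 400*8 = 3212 bytes, which still leaves ~884 bytes of headroom, whereas 500 snapshots already need 8 + 4 + 500*8 = 4012 bytes, leaving almost nothing for the rest of the request.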

I've updated the docs in https://github.com/ceph/ceph/pull/18579.

#6 Updated by Luis Henriques over 4 years ago

I was wondering if this bug could be closed, as it seems that it has already been fixed by the "libceph: preallocate message data items" patchset:

98c4bfe9d89b libceph: check reply num_data_items in setup_request_data()
0d9c1ab3be4c libceph: preallocate message data items
26f887e0a3c4 libceph, rbd, ceph: move ceph_osdc_alloc_messages() calls
39e58c3425b1 libceph: introduce alloc_watch_request()
81c65213d73c libceph: assign cookies in linger_submit()
3b83f60da6dd libceph: enable fallback to ceph_msg_new() in ceph_msgpool_get()
61d2f855042c ceph: num_ops is off by one in ceph_aio_retry_work()
41a264e1b30c libceph: no need to call osd_req_opcode_valid() in osd_req_encode_op()
668028844174 ceph: set timeout conditionally in __cap_delay_requeue
894868330a1e libceph: don't consume a ref on pagelist in ceph_msg_data_add_pagelist()
33165d472310 libceph: introduce ceph_pagelist_alloc()

#7 Updated by Ilya Dryomov over 4 years ago

No, I don't think so. It now falls back to a regular allocation that can fail, while that code path is specifically set up to use a memory pool allocation that isn't supposed to fail, in order to guarantee forward progress. Note that I added a WARN for that, so this ticket is still valid. It's just that I made it less of a problem because we no longer crash in this case.
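
For illustration, a minimal userspace model of that behaviour -- the names pool_get_msg(), msg_new() and MSG_FRONT_LEN are hypothetical stand-ins, not the actual libceph API: the pool hands out a fixed-size message and never fails, while the fallback path allocates from the heap, which can fail under memory pressure, hence the WARN.

/*
 * Hypothetical model of the fallback described above; pool_get_msg(),
 * msg_new() and MSG_FRONT_LEN are illustrative names, not the libceph API.
 */
#include <stdio.h>
#include <stdlib.h>

#define MSG_FRONT_LEN 4096                 /* fixed pool message size */

struct msg {
    size_t front_len;
    char *front;
    int from_pool;                         /* 1 if taken from the pool */
};

static char pool_buf[MSG_FRONT_LEN];       /* stand-in for the mempool */
static struct msg pool_msg = { MSG_FRONT_LEN, pool_buf, 1 };

/* Regular allocation: can fail, especially under memory pressure. */
static struct msg *msg_new(size_t front_len)
{
    struct msg *m = malloc(sizeof(*m));
    if (!m)
        return NULL;
    m->front = malloc(front_len);
    if (!m->front) {
        free(m);
        return NULL;
    }
    m->front_len = front_len;
    m->from_pool = 0;
    return m;
}

/*
 * Pool allocation: never fails as long as the request fits in the fixed
 * 4k front.  If it does not fit, warn and fall back to a fallible heap
 * allocation -- which is why this code path can no longer guarantee
 * forward progress, as noted above.
 */
static struct msg *pool_get_msg(size_t front_len)
{
    if (front_len <= MSG_FRONT_LEN)
        return &pool_msg;
    fprintf(stderr, "WARN: need %zu bytes, pool message front is %d\n",
            front_len, MSG_FRONT_LEN);
    return msg_new(front_len);
}

int main(void)
{
    struct msg *small = pool_get_msg(512);     /* served from the pool */
    struct msg *big = pool_get_msg(38076);     /* ~4758 snapshots, see above */

    printf("small: from_pool=%d; big: %s\n", small->from_pool,
           big ? (big->from_pool ? "from pool" : "heap-allocated")
               : "allocation failed");
    if (big && !big->from_pool) {
        free(big->front);
        free(big);
    }
    return 0;
}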

#8 Updated by Luis Henriques over 4 years ago

Ilya Dryomov wrote:

No, I don't think so. It now falls back to a regular allocation that can fail, while that code path is specifically set up to use a memory pool allocation that isn't supposed to fail, in order to guarantee forward progress. Note that I added a WARN for that, so this ticket is still valid. It's just that I made it less of a problem because we no longer crash in this case.

Right, thanks for clarifying this Ilya. I guess I'll need to go re-read these patches then :-)

#9 Updated by Chris Dunlop over 4 years ago

Is the 400 snapshot limit per file system, or per directory, or per something else?

#10 Updated by Zheng Yan over 4 years ago

per filesystem

#11 Updated by Patrick Donnelly over 4 years ago

  • Status changed from 12 to New

#12 Updated by Patrick Donnelly over 3 years ago

  • Related to Feature #46074: mds: provide alternatives to increase the total cephfs subvolume snapshot counts to greater than the current 400 across a CephFS volume added