Bug #22702

closed

cephfs crashed under high memory pressure due to reserved caps number mismatch

Added by Zhi Zhang over 6 years ago. Updated about 6 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: fs/ceph
Target version: -
% Done: 0%
Source: Community (dev)
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

The following crash happened under very high memory pressure; in some cases the kernel had already reported OOM. Although we run an old kernel, 3.10.104, we have continuously backported hundreds of ceph kernel patches from the 4.x kernels.

I checked the latest kernel code and the logic is almost the same, so I think the same issue can happen there as well.

The crash is in caps.c, in ceph_get_cap(...), at the following assertion:

BUG_ON(ctx->count > mdsc->caps_reserve_count);
[1016714.124990] SLUB: Unable to allocate memory on node -1 (gfp=0x50)
[1016714.124992]   cache: ceph_cap(164:b8feb8c8f623dd321de0e18a0b312a0512f212c55488f03b863c8309d7976439), object size: 120, buffer size: 120, default order: 1, min order: 0
[1016714.124993]   node 0: slabs: 88, objs: 5882, free: 0
[1016714.124993]   node 1: slabs: 301, objs: 20366, free: 0
[1016714.124995] ceph: reserve caps ctx=ffff8820048ec3a4 ENOMEM need=2 got=1
[1016714.125513] ------------[ cut here ]------------
[1016714.125514] kernel BUG at /usr/src/kernels/3.10.104-1-tlinux2-0041.tl1/fs/ceph/caps.c:252!
[1016714.125516] invalid opcode: 0000 [#1] SMP

[1016714.125533] CPU: 33 PID: 50359 Comm: kworker/33:0 Tainted: G           O 3.10.104-1-tlinux2-0041.tl1 #1
[1016714.125533] Hardware name: Dell Inc. PowerEdge R730XD/072T6D, BIOS 2.5.4 07/07/2017
[1016714.125545] Workqueue: ceph-msgr con_work [libceph]
[1016714.125546] task: ffff882018934b00 ti: ffff881ea6660000 task.ti: ffff881ea6660000
[1016714.125560] RIP: 0010:[<ffffffffa032f4c2>]  [<ffffffffa032f4c2>] ceph_get_cap+0x152/0x170 [ceph]
[1016714.125561] RSP: 0018:ffff881ea6661a08  EFLAGS: 00010202
[1016714.125562] RAX: 0000000000000002 RBX: ffff880f84728400 RCX: ffff88158dcd9518
[1016714.125562] RDX: 00000000000006a0 RSI: ffff8820048ec3a4 RDI: ffff880f847285f0
[1016714.125563] RBP: ffff881ea6661a28 R08: ffff881b592ee800 R09: 000000010f273b0a
[1016714.125563] R10: ffff880fbce44780 R11: 0000000000000001 R12: ffff8820048ec3a4
[1016714.125564] R13: ffff880e675ed0f0 R14: ffff881b592ee800 R15: ffff8820048ec170
[1016714.125565] FS:  0000000000000000(0000) GS:ffff88203ec00000(0000) knlGS:0000000000000000
[1016714.125565] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1016714.125566] CR2: 000000000516d308 CR3: 0000001ecc576000 CR4: 00000000001407e0
[1016714.125567] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[1016714.125567] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[1016714.125567] Stack:
[1016714.125571]  0000000118934b00 0000000000000000 ffff88158dcd9413 ffff880e675ed480
[1016714.125574]  ffff881ea6661b28 ffffffffa0321022 0000000000000000 ffff881ea6661c28
[1016714.125576]  ffffffffffffffff ffff881ea6661c38 ffffffffffffffff 000000000000000d
[1016714.125577] Call Trace:
[1016714.125582]  [<ffffffffa0321022>] fill_inode+0x9a2/0xb30 [ceph]
[1016714.125591]  [<ffffffff81891794>] ? sock_recvmsg+0x84/0xb0
[1016714.125595]  [<ffffffffa0321384>] ceph_fill_trace+0xa4/0x9a0 [ceph]
[1016714.125601]  [<ffffffffa034090e>] handle_reply+0x21e/0x6c0 [ceph]
[1016714.125607]  [<ffffffffa0342273>] dispatch+0xc3/0x180 [ceph]
[1016714.125612]  [<ffffffffa0128841>] process_message+0x91/0x170 [libceph]
[1016714.125617]  [<ffffffffa012b8a6>] ? read_partial_message+0x176/0x450 [libceph]
[1016714.125620]  [<ffffffff81891804>] ? kernel_recvmsg+0x44/0x60
[1016714.125624]  [<ffffffffa0128e28>] ? ceph_tcp_recvmsg+0x48/0x60 [libceph]
[1016714.125628]  [<ffffffffa012c97b>] try_read+0x30b/0x7f0 [libceph]
[1016714.125633]  [<ffffffffa012cf2b>] con_work+0xcb/0x370 [libceph]
[1016714.125637]  [<ffffffff81065bcd>] process_one_work+0x17d/0x4c0
[1016714.125639]  [<ffffffff810670cf>] worker_thread+0x11f/0x3a0
[1016714.125641]  [<ffffffff81066fb0>] ? manage_workers+0x120/0x120
[1016714.125645]  [<ffffffff8106cc6e>] kthread+0xce/0xe0
[1016714.125648]  [<ffffffff8106cba0>] ? kthread_freezable_should_stop+0x70/0x70
[1016714.125653]  [<ffffffff81ac72b8>] ret_from_fork+0x58/0x90
[1016714.125655]  [<ffffffff8106cba0>] ? kthread_freezable_should_stop+0x70/0x70
[1016714.125666] Code: c7 68 c6 35 a0 89 44 24 08 8b 83 10 02 00 00 89 04 24 41 8b 0c 24 31 c0 e8 5c a0 fe e0 e9 ef fe ff ff 0f 0b eb fe 0f 0b 90 eb fd <0f> 0b eb fe 0f 0b 0f 1f 84 00 00 00 00 00 eb f6 66 66 66 66 66
[1016714.125671] RIP  [<ffffffffa032f4c2>] ceph_get_cap+0x152/0x170 [ceph]
[1016714.125671]  RSP <ffff881ea6661a08>
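
The log shows a reservation that needed 2 caps and got 1 (ENOMEM), followed by the BUG_ON on ctx->count > mdsc->caps_reserve_count. The following is a minimal user-space sketch of how such a mismatch could arise, assuming a partial allocation failure leaves the context count at the requested value while the pool only accounts for what was actually allocated. All toy_* names are hypothetical and are not the kernel's structures or functions; this is not a confirmed root cause.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy stand-ins for illustration only; these are NOT the kernel structures. */
struct toy_pool {
    int reserve_count; /* caps set aside across all reservations */
    int free_objects;  /* objects the toy allocator can still hand out */
};

struct toy_ctx {
    int count;         /* caps this reservation believes it owns */
};

/* Allocate up to `need` objects; returns how many were obtained. */
static int toy_alloc(struct toy_pool *pool, int need)
{
    int got = need < pool->free_objects ? need : pool->free_objects;
    pool->free_objects -= got;
    return got;
}

/* Hypothetical faulty path: the context records `need` even though the
 * pool only accounts for `got`, so the two counters can disagree. */
static int toy_reserve_unchecked(struct toy_pool *pool, struct toy_ctx *ctx,
                                 int need)
{
    assert(need >= 0);
    int got = toy_alloc(pool, need);
    pool->reserve_count += got;
    ctx->count = need;
    return got < need ? -ENOMEM : 0;
}

/* Safer variant: on a partial allocation, give the objects back and leave
 * the context empty so the caller can bail out on -ENOMEM. */
static int toy_reserve_checked(struct toy_pool *pool, struct toy_ctx *ctx,
                               int need)
{
    assert(need >= 0);
    int got = toy_alloc(pool, need);
    if (got < need) {
        pool->free_objects += got;
        ctx->count = 0;
        return -ENOMEM;
    }
    pool->reserve_count += need;
    ctx->count = need;
    return 0;
}

/* The condition the BUG_ON guards: a context never claims more caps
 * than the pool has reserved. */
static bool toy_invariant_holds(const struct toy_pool *pool,
                                const struct toy_ctx *ctx)
{
    return ctx->count <= pool->reserve_count;
}
```

With one free object and need=2, toy_reserve_unchecked leaves ctx.count at 2 and reserve_count at 1, which is the condition the BUG_ON rejects. toy_reserve_checked returns -ENOMEM with both counters consistent, so a caller that checks the return value never reaches the assertion.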