Bug #52494
Client servers lock up if Ceph cluster has problems
Status: Closed
Description
Hey,
We have a Ceph cluster and a Kubernetes cluster whose pods use CephFS for storing files. Whenever our Ceph cluster ran into performance or other problems, some of our Kubernetes nodes would become unresponsive, and the only solution was to hard-reset them. This looked like bug https://tracker.ceph.com/issues/46284
We are running Debian 10 with a 5.10 kernel from backports, which is based on 5.10.46 and doesn't include the fix for #46284. So I pulled down the upstream kernel git repo, found the commit with this fix in the linux-5.10.y branch, and made a patch from it. I then applied the patch to the Debian 5.10 kernel source, compiled it, and installed it on half of our Kubernetes nodes.
Recently we had jobs running on Kubernetes that made a lot of file changes on CephFS and put a high load on the MDSs. During that time, 2 of our Kubernetes nodes (out of 200 in total) locked up, and both were running with the #46284 patch. This doesn't seem like the same bug: I couldn't find any mentions of ceph_check_delayed_caps in the traces; instead they usually mention ceph_flush_dirty_caps. I've attached a file with some logs from /var/log/messages on both nodes.
Now I don't know whether it's a problem with our configuration, whether I made mistakes compiling the kernel, or whether we've hit a new bug.
Any insight would be appreciated.
Updated by Ilya Dryomov over 2 years ago
- Category set to fs/ceph
- Assignee set to Jeff Layton
Updated by Jeff Layton over 2 years ago
The first stack trace shows a warning popping for removing an auth cap while it's still dirty. The later ones I can't make much sense of: most of them were copy/pasted starting from the "Modules linked in:" line, which leaves out the log lines that print the actual cause of the stack trace. Can you re-attach a more complete set of stack traces?
Updated by David B over 2 years ago
- File ceph-errors-node039 added
Sorry about that, I tried to cut out some lines about pods being killed because of reaching their memory limit.
I attached a continuous log from one of the nodes.
Updated by Jeff Layton over 2 years ago
Thanks, that helps.
The first warning looks like it came about because the client got blocklisted. That warning is silenced in later kernels. I think we can ignore that. After that are a number of cgroup OOM kills. Those I think we can ignore too.
The next trace is this one:
Sep 1 07:57:11 core-k8s-node039 kernel: [501198.414663] NMI backtrace for cpu 6
[501198.415671] CPU: 6 PID: 15156 Comm: umount Tainted: G W E 5.10.46-ceph #2
[501198.417182] Hardware name: Hetzner vServer, BIOS 20171111 11/11/2017
[501198.418517] Call Trace:
[501198.419410] <IRQ>
[501198.420225] dump_stack+0x6d/0x88
[501198.421195] nmi_cpu_backtrace.cold.9+0x30/0x65
[501198.422300] ? lapic_can_unplug_cpu+0x80/0x80
[501198.423414] nmi_trigger_cpumask_backtrace+0xde/0xe0
[501198.424603] rcu_dump_cpu_stacks+0xab/0xd9
[501198.425685] rcu_sched_clock_irq.cold.93+0x109/0x365
[501198.427078] ? trigger_load_balance+0x5c/0x240
[501198.428450] ? tick_sched_do_timer+0x60/0x60
[501198.429771] update_process_times+0x55/0x80
[501198.431095] tick_sched_handle.isra.22+0x1f/0x60
[501198.432494] tick_sched_timer+0x65/0x80
[501198.433740] __hrtimer_run_queues+0x101/0x250
[501198.435052] hrtimer_interrupt+0x100/0x210
[501198.436353] ? check_preempt_curr+0x51/0x60
[501198.437639] __sysvec_apic_timer_interrupt+0x5d/0xd0
[501198.439023] asm_call_irq_on_stack+0x12/0x20
[501198.440309] </IRQ>
[501198.441291] sysvec_apic_timer_interrupt+0x73/0x80
[501198.442630] asm_sysvec_apic_timer_interrupt+0x12/0x20
[501198.444057] RIP: 0010:ceph_check_caps+0x5/0xa60 [ceph]
[501198.445441] Code: 68 02 00 00 31 c0 48 39 f2 76 0c 48 01 c9 48 01 f2 48 39 d1 0f 93 c0 c3 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 0f 1f 44 00 00 <41> 57 41 56 49 89 d6 41 55 41 54 55 89 f5 53 48 89 fb 48 81 ec 70
[501198.449353] RSP: 0018:ffffa0d90534fe18 EFLAGS: 00000246
[501198.450810] RAX: 0000000000000002 RBX: ffff8e98c989c1ec RCX: 0000000000000000
[501198.452530] RDX: 0000000000000000 RSI: 0000000000000002 RDI: ffff8e948339e780
[501198.454246] RBP: ffff8e98c989e800 R08: 0000000000000000 R09: 0000000000000000
[501198.455936] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8e948339eae0
[501198.457352] R13: ffff8e98c989c008 R14: ffff8e98c989c000 R15: 0000000000000008
[501198.458785] ceph_flush_dirty_caps+0xd9/0x190 [ceph]
[501198.459989] ceph_mdsc_pre_umount+0xa0/0x210 [ceph]
[501198.461172] ceph_kill_sb+0x1f/0x80 [ceph]
[501198.462263] deactivate_locked_super+0x2f/0x70
[501198.463423] cleanup_mnt+0xb8/0x140
[501198.464442] task_work_run+0x76/0xb0
[501198.465454] exit_to_user_mode_prepare+0x129/0x130
[501198.466618] syscall_exit_to_user_mode+0x28/0x140
[501198.467760] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[501198.468920] RIP: 0033:0x7f167d3ae507
[501198.469903] Code: 19 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 31 f6 e9 09 00 00 00 66 0f 1f 84 00 00 00 00 00 b8 a6 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 59 19 0c 00 f7 d8 64 89 01 48
[501198.473208] RSP: 002b:00007ffd59f6cd78 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[501198.474665] RAX: 0000000000000000 RBX: 0000557959cb8970 RCX: 00007f167d3ae507
[501198.476079] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000557959cb8b80
[501198.477476] RBP: 0000000000000000 R08: 0000557959cb8bf0 R09: 00007f167d42fe80
[501198.478870] R10: 0000000000000000 R11: 0000000000000246 R12: 0000557959cb8b80
[501198.480262] R13: 00007f167d4d41c4 R14: 0000557959cb8a68 R15: 0000000000000000
So I guess an NMI fired because this CPU was stuck?
In any case, this does look a lot like #46284, but I'm at a loss as to why the patches wouldn't have helped you. It might be interesting to see the contents of /sys/kernel/debug/ceph/<fsid>/caps during this period (and maybe also the contents of mdsc and osdc files under there too).
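Since a wedged node can no longer be inspected after the fact, one workaround is to snapshot those debugfs files to persistent storage continuously while the problem develops, so the last samples survive a hard reset. A rough sketch (paths, interval, and the helper name are assumptions, not part of any Ceph tooling; reading these files requires root):

```python
import glob
import os
import time


def snapshot_ceph_debug(src_glob="/sys/kernel/debug/ceph/*",
                        out_dir="/var/log/ceph-debug",
                        names=("caps", "mdsc", "osdc")):
    """Append the current contents of each Ceph debugfs file to a
    per-name log file under out_dir, prefixed with a timestamp."""
    os.makedirs(out_dir, exist_ok=True)
    stamp = time.strftime("%Y-%m-%dT%H:%M:%S")
    written = []
    for fsid_dir in glob.glob(src_glob):
        for name in names:
            path = os.path.join(fsid_dir, name)
            if not os.path.isfile(path):
                continue
            try:
                with open(path) as src:
                    data = src.read()
            except OSError:
                continue  # reads can fail once the client is already wedged
            dest = os.path.join(out_dir, name + ".log")
            with open(dest, "a") as dst:
                dst.write("--- %s %s ---\n%s" % (stamp, path, data))
            written.append(dest)
    return written
```

Run it in a loop (e.g. every few seconds from a systemd service or cron) so that the last snapshots before a lockup are on disk after the reset.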
Updated by David B over 2 years ago
Usually it's not possible to inspect /sys when this happens, because the systems become unresponsive and require a reset.
One thing that may be of note: the nodes connecting to Ceph run on shared CPUs (the Ceph nodes themselves are physical machines, though). One theory I had earlier was that Ceph wasn't the problem at all: if we scale up our deployments on k8s, or someone else running on the same Hetzner servers starts using more CPU, we might get somewhat starved of CPU time.
I later dismissed that, because whenever the k8s nodes froze it was at the same time our Ceph cluster was having problems - someone running too many operations on CephFS, or a hardware failure on one of our Ceph nodes. Also, if CPU starvation were the cause, I'd expect us to eventually get some CPU time again and the nodes to unfreeze, but they stay stuck for hours or days and the only thing you can do is reset the VM.
Is it still possible for the whole VM to lock up if a Ceph process gets scheduled on a CPU that's unavailable at that moment?
Updated by Jeff Layton over 2 years ago
David B wrote:
Is it still possible that if a Ceph process would get scheduled on a CPU that's unavailable at that time to lockup the whole VM?
That sounds pretty unlikely. It seems more likely that this task is ending up livelocked somehow. ceph_flush_dirty_caps just walks the list of dirty caps in the session and calls ceph_check_caps on each, but there is no guarantee that the cap will be removed from the list by that call, so it could keep continually cycling over a set of stuck caps that can't be cleaned for some reason. We probably need a stronger guarantee of forward progress there (or better warnings about why we can't make progress).
I'll have to look at revamping that code to at least warn us if we're continually cycling over the same set of caps.
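The loop described above can be sketched as a toy model (purely illustrative: these are stand-in functions that mimic the pattern, not the kernel code, and `check_caps`/`flush_dirty_caps` are names invented for this sketch):

```python
# Toy model of the suspected livelock: the flusher walks the dirty list
# calling a per-cap check, but a "stuck" cap is never removed from the
# list, so without a progress guarantee the walk can revisit it forever.
from collections import deque


def check_caps(cap, stuck):
    """Stand-in for the per-cap check: succeeds unless the cap is stuck."""
    return cap not in stuck  # True means the cap was flushed and dropped


def flush_dirty_caps(dirty, stuck, max_passes=1000):
    """Stand-in for the flush walk, with the guard the real loop lacks."""
    passes = 0
    while dirty:
        passes += 1
        if passes > max_passes:  # detect that we are spinning in place
            return "livelock: %s never flushed" % sorted(dirty)
        cap = dirty.popleft()
        if not check_caps(cap, stuck):
            dirty.append(cap)  # stuck cap goes right back on the list


    return "all caps flushed"


print(flush_dirty_caps(deque(["ino1", "ino2"]), stuck=set()))
# -> all caps flushed
print(flush_dirty_caps(deque(["ino1", "ino2"]), stuck={"ino2"}))
# -> livelock: ['ino2'] never flushed
```

With a single permanently stuck cap, the second call cycles until the pass counter trips; the real code has no such counter, which would match a CPU pinned in ceph_check_caps long enough to trigger the RCU stall and NMI backtrace above.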
Updated by Jeff Layton over 2 years ago
BTW: what mount options are you using?
It's also possible that this is a result of #52283. It looks like the patch for that went into the v5.10.62 stable series. Note that there is a follow-on patch for it too, so you may want to pull in 05a444d3f90a3 as well if you try that kernel.
Updated by David B over 2 years ago
The mount options are (rw,relatime,name=storage,secret=<hidden>,acl).
I'll take a look at those other patches, thanks.
Updated by Jeff Layton almost 2 years ago
- Status changed from New to Resolved
No info in some time. I'll assume that a newer kernel helped? Please reopen if this is still an issue.