Bug #52494

Client servers lock up if Ceph cluster has problems

Added by David B over 2 years ago. Updated almost 2 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
fs/ceph
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

Hey,

We have a Ceph cluster and a Kubernetes cluster whose pods use CephFS for storing files. Whenever our Ceph cluster ran into performance or other problems, some of our Kubernetes nodes would become unresponsive and the only remedy was to hard-reset them. This looked like bug https://tracker.ceph.com/issues/46284

We are running Debian 10 with a 5.10 kernel from backports, which is based on 5.10.46 and doesn't have the fix for #46284, so I pulled down the upstream kernel git repo, found the commit with that fix on the linux-5.10.y branch, and made a patch from it. I then applied the patch to the Debian 5.10 kernel source, compiled it, and installed it on half of our Kubernetes nodes.

Recently we had jobs running on Kubernetes that made a lot of file changes on CephFS and put a high load on the MDSs. During that time 2 of our Kubernetes nodes (out of 200) locked up, and both had the #46284 patch applied. This doesn't seem like the same bug: I couldn't find any mentions of ceph_check_delayed_caps; instead the traces usually mention ceph_flush_dirty_caps. I've attached a file with some logs from /var/log/messages of both nodes.

I don't know whether it's a problem with our configuration, whether I made mistakes compiling the kernel, or whether we've run into a new bug.

Any insight would be appreciated.


Files

ceph-errors (38.4 KB) ceph-errors David B, 09/02/2021 01:11 PM
ceph-errors-node039 (94.8 KB) ceph-errors-node039 David B, 09/07/2021 12:47 PM
Actions #1

Updated by Ilya Dryomov over 2 years ago

  • Category set to fs/ceph
  • Assignee set to Jeff Layton
Actions #2

Updated by Jeff Layton over 2 years ago

The first stack trace shows a warning popping for removing an auth cap while it's still dirty. The later ones I can't make much sense of: most of them were copied starting with the "Modules linked in:" line, which leaves out the log lines that print the actual cause of the stack trace. Can you re-attach a more complete set of stack traces?

Actions #3

Updated by David B over 2 years ago

Sorry about that; I had tried to cut out some lines about pods being killed for reaching their memory limits.

I attached a continuous log from one of the nodes.

Actions #4

Updated by Jeff Layton over 2 years ago

Thanks, that helps.

The first warning looks like it came about because the client got blocklisted. That warning is silenced in later kernels. I think we can ignore that. After that are a number of cgroup OOM kills. Those I think we can ignore too.

The next trace is this one:

Sep  1 07:57:11 core-k8s-node039 kernel: [501198.414663] NMI backtrace for cpu 6
Sep  1 07:57:11 core-k8s-node039 kernel: [501198.415671] CPU: 6 PID: 15156 Comm: umount Tainted: G        W   E     5.10.46-ceph #2
Sep  1 07:57:11 core-k8s-node039 kernel: [501198.417182] Hardware name: Hetzner vServer, BIOS 20171111 11/11/2017
Sep  1 07:57:11 core-k8s-node039 kernel: [501198.418517] Call Trace:
Sep  1 07:57:11 core-k8s-node039 kernel: [501198.419410]  <IRQ>
Sep  1 07:57:11 core-k8s-node039 kernel: [501198.420225]  dump_stack+0x6d/0x88
Sep  1 07:57:11 core-k8s-node039 kernel: [501198.421195]  nmi_cpu_backtrace.cold.9+0x30/0x65
Sep  1 07:57:11 core-k8s-node039 kernel: [501198.422300]  ? lapic_can_unplug_cpu+0x80/0x80
Sep  1 07:57:11 core-k8s-node039 kernel: [501198.423414]  nmi_trigger_cpumask_backtrace+0xde/0xe0
Sep  1 07:57:11 core-k8s-node039 kernel: [501198.424603]  rcu_dump_cpu_stacks+0xab/0xd9
Sep  1 07:57:11 core-k8s-node039 kernel: [501198.425685]  rcu_sched_clock_irq.cold.93+0x109/0x365
Sep  1 07:57:11 core-k8s-node039 kernel: [501198.427078]  ? trigger_load_balance+0x5c/0x240
Sep  1 07:57:11 core-k8s-node039 kernel: [501198.428450]  ? tick_sched_do_timer+0x60/0x60
Sep  1 07:57:11 core-k8s-node039 kernel: [501198.429771]  update_process_times+0x55/0x80
Sep  1 07:57:11 core-k8s-node039 kernel: [501198.431095]  tick_sched_handle.isra.22+0x1f/0x60
Sep  1 07:57:11 core-k8s-node039 kernel: [501198.432494]  tick_sched_timer+0x65/0x80
Sep  1 07:57:11 core-k8s-node039 kernel: [501198.433740]  __hrtimer_run_queues+0x101/0x250
Sep  1 07:57:11 core-k8s-node039 kernel: [501198.435052]  hrtimer_interrupt+0x100/0x210
Sep  1 07:57:11 core-k8s-node039 kernel: [501198.436353]  ? check_preempt_curr+0x51/0x60
Sep  1 07:57:11 core-k8s-node039 kernel: [501198.437639]  __sysvec_apic_timer_interrupt+0x5d/0xd0
Sep  1 07:57:11 core-k8s-node039 kernel: [501198.439023]  asm_call_irq_on_stack+0x12/0x20
Sep  1 07:57:11 core-k8s-node039 kernel: [501198.440309]  </IRQ>
Sep  1 07:57:11 core-k8s-node039 kernel: [501198.441291]  sysvec_apic_timer_interrupt+0x73/0x80
Sep  1 07:57:11 core-k8s-node039 kernel: [501198.442630]  asm_sysvec_apic_timer_interrupt+0x12/0x20
Sep  1 07:57:11 core-k8s-node039 kernel: [501198.444057] RIP: 0010:ceph_check_caps+0x5/0xa60 [ceph]
Sep  1 07:57:11 core-k8s-node039 kernel: [501198.445441] Code: 68 02 00 00 31 c0 48 39 f2 76 0c 48 01 c9 48 01 f2 48 39 d1 0f 93 c0 c3 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 0f 1f 44 00 00 <41> 57 41 56 49 89 d6 41 55 41 54 55 89 f5 53 48 89 fb 48 81 ec 70
Sep  1 07:57:11 core-k8s-node039 kernel: [501198.449353] RSP: 0018:ffffa0d90534fe18 EFLAGS: 00000246
Sep  1 07:57:11 core-k8s-node039 kernel: [501198.450810] RAX: 0000000000000002 RBX: ffff8e98c989c1ec RCX: 0000000000000000
Sep  1 07:57:11 core-k8s-node039 kernel: [501198.452530] RDX: 0000000000000000 RSI: 0000000000000002 RDI: ffff8e948339e780
Sep  1 07:57:11 core-k8s-node039 kernel: [501198.454246] RBP: ffff8e98c989e800 R08: 0000000000000000 R09: 0000000000000000
Sep  1 07:57:11 core-k8s-node039 kernel: [501198.455936] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8e948339eae0
Sep  1 07:57:11 core-k8s-node039 kernel: [501198.457352] R13: ffff8e98c989c008 R14: ffff8e98c989c000 R15: 0000000000000008
Sep  1 07:57:11 core-k8s-node039 kernel: [501198.458785]  ceph_flush_dirty_caps+0xd9/0x190 [ceph]
Sep  1 07:57:11 core-k8s-node039 kernel: [501198.459989]  ceph_mdsc_pre_umount+0xa0/0x210 [ceph]
Sep  1 07:57:11 core-k8s-node039 kernel: [501198.461172]  ceph_kill_sb+0x1f/0x80 [ceph]
Sep  1 07:57:11 core-k8s-node039 kernel: [501198.462263]  deactivate_locked_super+0x2f/0x70
Sep  1 07:57:11 core-k8s-node039 kernel: [501198.463423]  cleanup_mnt+0xb8/0x140
Sep  1 07:57:11 core-k8s-node039 kernel: [501198.464442]  task_work_run+0x76/0xb0
Sep  1 07:57:11 core-k8s-node039 kernel: [501198.465454]  exit_to_user_mode_prepare+0x129/0x130
Sep  1 07:57:11 core-k8s-node039 kernel: [501198.466618]  syscall_exit_to_user_mode+0x28/0x140
Sep  1 07:57:11 core-k8s-node039 kernel: [501198.467760]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Sep  1 07:57:11 core-k8s-node039 kernel: [501198.468920] RIP: 0033:0x7f167d3ae507
Sep  1 07:57:11 core-k8s-node039 kernel: [501198.469903] Code: 19 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 31 f6 e9 09 00 00 00 66 0f 1f 84 00 00 00 00 00 b8 a6 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 59 19 0c 00 f7 d8 64 89 01 48
Sep  1 07:57:11 core-k8s-node039 kernel: [501198.473208] RSP: 002b:00007ffd59f6cd78 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
Sep  1 07:57:11 core-k8s-node039 kernel: [501198.474665] RAX: 0000000000000000 RBX: 0000557959cb8970 RCX: 00007f167d3ae507
Sep  1 07:57:11 core-k8s-node039 kernel: [501198.476079] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000557959cb8b80
Sep  1 07:57:11 core-k8s-node039 kernel: [501198.477476] RBP: 0000000000000000 R08: 0000557959cb8bf0 R09: 00007f167d42fe80
Sep  1 07:57:11 core-k8s-node039 kernel: [501198.478870] R10: 0000000000000000 R11: 0000000000000246 R12: 0000557959cb8b80
Sep  1 07:57:11 core-k8s-node039 kernel: [501198.480262] R13: 00007f167d4d41c4 R14: 0000557959cb8a68 R15: 0000000000000000

So I guess an NMI fired because this CPU was stuck?

In any case, this does look a lot like #46284, but I'm at a loss as to why the patches wouldn't have helped you. It might be interesting to see the contents of /sys/kernel/debug/ceph/<fsid>/caps during this period (and maybe also the contents of mdsc and osdc files under there too).

Actions #5

Updated by David B over 2 years ago

Usually it's not possible to inspect /sys when this happens, because the systems become unresponsive and require a reset.

One thing that may be of note is that the nodes connecting to Ceph run on shared CPUs (the Ceph nodes themselves are physical machines, though). One theory I had earlier was that Ceph wasn't the problem at all: if we scale up our deployments on k8s, or someone else running on the same Hetzner servers starts using more CPU, we might simply get starved of CPU time.

I later dismissed that theory because the k8s nodes only froze at the same times our Ceph cluster was having problems, e.g. someone running too many operations on CephFS or a hardware failure on one of our Ceph nodes. Also, if CPU starvation were the cause I would expect the nodes to eventually get some CPU time and unfreeze, but they stay stuck for hours or days and the only thing you can do is reset the VM.

Is it still possible that a Ceph process getting scheduled on a CPU that's unavailable at that moment could lock up the whole VM?

Actions #6

Updated by Jeff Layton over 2 years ago

David B wrote:

Is it still possible that if a Ceph process would get scheduled on a CPU that's unavailable at that time to lockup the whole VM?

That sounds pretty unlikely. It seems more likely that this task is ending up livelocked somehow. ceph_flush_dirty_caps just walks the list of dirty caps in the session and calls ceph_check_caps on each, but there is no guarantee that the cap will be removed from the list when that function is called, so it could just keep continually cycling over a set of stuck caps that can't be cleaned for some reason. Probably we need a stronger guarantee of forward progress there (or better warnings as to why we can't progress).

I'll have to look at revamping that code to at least warn us if we're continually cycling over the same set of caps.
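
To make the livelock scenario described above concrete, here is a minimal user-space C sketch of the pattern, not the actual fs/ceph code; the names (dirty_list, check_cap, mds_acks_flush) and the pass-counter warning are purely illustrative assumptions:

#include <stdbool.h>
#include <stdio.h>

struct cap {
    struct cap *next;
    bool dirty;                       /* cleared only once the flush is acked */
};

static struct cap *dirty_list;        /* head of the "dirty caps" list */

/* Hypothetical stand-in for ceph_check_caps(): it may or may not manage to
 * flush the cap, so it may or may not clear the dirty flag. */
static bool mds_acks_flush;           /* pretend MDS response */

static void check_cap(struct cap *cap)
{
    if (mds_acks_flush)
        cap->dirty = false;
}

/* Analogous to the flush loop described above: keep taking the first dirty
 * entry and checking it.  Nothing guarantees the check removes the entry
 * from the list, so a stuck cap is revisited forever. */
static void flush_dirty_caps(void)
{
    unsigned long passes = 0;

    while (dirty_list) {
        struct cap *cap = dirty_list;

        check_cap(cap);
        if (!cap->dirty) {
            dirty_list = cap->next;   /* forward progress: entry leaves the list */
        } else if (++passes > 1000000) {
            /* the kind of "we keep cycling over the same caps" warning
             * suggested above; the real loop has no such bound */
            fprintf(stderr, "livelock: stuck flushing dirty caps\n");
            break;
        }
    }
}

int main(void)
{
    struct cap stuck = { .next = NULL, .dirty = true };

    dirty_list = &stuck;
    mds_acks_flush = false;           /* simulate an MDS that never acks the flush */
    flush_dirty_caps();               /* prints the livelock warning and exits */
    return 0;
}

With mds_acks_flush left false, the sketch spins for a bounded number of passes and then prints a warning; the real loop, as described above, has no such bound and would keep cycling, which matches the RCU stall / NMI backtrace seen in the attached logs.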

Actions #7

Updated by Jeff Layton over 2 years ago

BTW: what mount options are you using?

It's also possible that this is a result of #52283. It looks like the patch for that went into v5.10.62 stable series kernels. Note that there is a follow-on patch for that too, so you may want to pull in 05a444d3f90a3 as well if you try that kernel.

Actions #8

Updated by David B over 2 years ago

The mount options are (rw,relatime,name=storage,secret=<hidden>,acl).

I'll take a look at those other patches, thanks.

Actions #9

Updated by Jeff Layton almost 2 years ago

  • Status changed from New to Resolved

No info in some time. I'll assume that a newer kernel helped? Please reopen if this is still an issue.
