doc: Explain how loopback mounts (using kclient, ceph-fuse should be immune) or RBD can cause deadlock
Currently, consuming CephFS/RBD services provided by the same machine can lead to deadlock. People in general don't realize this. The cause has nothing to do with Ceph, so we're probably not going to fix it either. Create more awareness.
The slide link in the email referred to here is what explained this best to me; also of note is the Red Hat bug where they said they won't fix this because it's too hard.
---------- Forwarded message ----------
From: Tommi Virtanen <email@example.com>
Date: Tue, May 29, 2012 at 12:18 PM
Subject: Re: OSD deadlock with cephfs client and OSD on same machine
To: Amon Ott <firstname.lastname@example.org>
On Tue, May 29, 2012 at 12:44 AM, Amon Ott <email@example.com> wrote:
> On Linux, if you run OSD on ext4 filesystem, have a cephfs kernel client mount
> on the same system and no syncfs system call (as to be expected with libc6 <
> 2.14 or kernel < 2.6.39), OSD deadlocks in sys_sync(). Only reboot recovers
This is the classic issue of memory pressure needing free memory to be
relieved. While syncfs(2) may make the hang less common, I do not
think having syncfs(2) is enough; nothing short of having a reserved
memory pool guaranteed to be big enough to handle the request will,
and maintaining that solution is hideously complex.
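
For anyone wanting to check whether a host falls in the "no syncfs" case
from the report, here is a minimal sketch. It only encodes the two version
thresholds quoted above (libc6 2.14, kernel 2.6.39); the function names are
made up for illustration, and a passing check does not mean the host is
safe from the deadlock, as explained above.

```python
import re

# Thresholds taken from the report: syncfs is not expected with
# libc6 < 2.14 or kernel < 2.6.39.
MIN_GLIBC = (2, 14)
MIN_KERNEL = (2, 6, 39)


def parse_version(text):
    """Return the leading dotted-numeric part of a version string as ints.

    "3.2.0-23-generic" -> (3, 2, 0)
    """
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"no version number in {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def syncfs_expected(libc_version, kernel_version):
    """True if both versions meet the thresholds quoted in the report."""
    return (parse_version(libc_version) >= MIN_GLIBC
            and parse_version(kernel_version) >= MIN_KERNEL)
```

On a glibc system the inputs could come from `platform.libc_ver()[1]` and
`platform.release()` in the Python standard library.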
Loopback NFS suffers from the exact same thing.
Apparently using ceph-fuse is enough to move so much of the processing
to user space, that the pageability of userspace memory allows the
system to recover.
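
A rough sketch of how a host could be checked for the risky combination
described here: a local OSD plus a kernel client mount or a mounted RBD
device. It assumes kernel CephFS mounts appear in /proc/mounts with
filesystem type "ceph", ceph-fuse mounts with type "fuse.ceph-fuse", and
kernel RBD devices under /dev/rbd*; the function names are hypothetical.
ceph-fuse mounts are not flagged, on the strength of the observation above
that they apparently let the system recover.

```python
import glob


def classify_ceph_mounts(mounts_text):
    """Map each Ceph-related mount point to 'kernel', 'fuse', or 'rbd'.

    mounts_text is expected in /proc/mounts format:
    device mountpoint fstype options dump pass
    """
    found = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        device, mountpoint, fstype = fields[:3]
        if fstype == "ceph":                 # assumed: kernel CephFS client
            found[mountpoint] = "kernel"
        elif fstype == "fuse.ceph-fuse":     # assumed: ceph-fuse client
            found[mountpoint] = "fuse"
        elif device.startswith("/dev/rbd"):  # assumed: kernel RBD device
            found[mountpoint] = "rbd"
    return found


def local_osd_running():
    """Best-effort check for a ceph-osd process on this host (Linux only)."""
    for comm_path in glob.glob("/proc/[0-9]*/comm"):
        try:
            with open(comm_path) as f:
                if f.read().strip() == "ceph-osd":
                    return True
        except OSError:
            continue  # process exited while we were scanning
    return False


def deadlock_prone_mounts(mounts_text, osd_on_host):
    """Return mount points that combine a local OSD with kernel CephFS/RBD."""
    if not osd_on_host:
        return []
    mounts = classify_ceph_mounts(mounts_text)
    return sorted(mp for mp, kind in mounts.items() if kind in ("kernel", "rbd"))
```

Typical use would be to read /proc/mounts and call
`deadlock_prone_mounts(text, local_osd_running())`; an empty list means the
combination was not detected, not that the host cannot deadlock.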
Here's a fragment of the earlier conversation on this topic. Apologies
for gmane/mail clients breaking the thread; anything with that subject
line is part of the conversation:
doc: Added verbiage to describe single host deadlocks.
Signed-off-by: John Wilkins <firstname.lastname@example.org>
#1 Updated by Josh Durgin over 10 years ago
The discussion in this thread https://lkml.org/lkml/2004/7/26/68 is interesting for a more in-depth discussion of the problem and why various things won't help.
#2 Updated by Sage Weil over 10 years ago
- Priority changed from Normal to High
this definitely qualifies as a faq
#3 Updated by John Wilkins over 10 years ago
- Assignee set to John Wilkins
#4 Updated by John Wilkins over 10 years ago
- Status changed from New to Resolved
Added new section to the FAQ providing details. Provided links in quick start admonitions to the FAQ.