Documentation #3076

doc: Explain how loopback mounts (using kclient, ceph-fuse should be immune) or RBD can cause deadlock

Added by Anonymous over 7 years ago. Updated about 7 years ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Target version: -
% Done: 0%
Tags:
Backport:
Reviewed:
Affected Versions:
Pull request ID:
Description

Currently, consuming CephFS/RBD services on the same machine that provides them can lead to deadlock. Most people don't realize this. The cause has nothing to do with Ceph, so we're probably not going to fix it either. The documentation should create more awareness.

The slides linked from the email referred to here are what helped me understand this best; also of note is the Red Hat bug where they said they just won't bother fixing this because it's too hard.

---------- Forwarded message ----------
From: Tommi Virtanen <>
Date: Tue, May 29, 2012 at 12:18 PM
Subject: Re: OSD deadlock with cephfs client and OSD on same machine
To: Amon Ott <>
Cc:

On Tue, May 29, 2012 at 12:44 AM, Amon Ott <> wrote:

> On Linux, if you run OSD on ext4 filesystem, have a cephfs kernel client mount
> on the same system and no syncfs system call (as to be expected with libc6 <
> 2.14 or kernel < 2.6.39), OSD deadlocks in sys_sync(). Only reboot recovers
> the system.

This is the classic issue of memory pressure needing free memory to be
relieved. While syncfs(2) may make the hang less common, I do not
think having syncfs(2) is enough; nothing short of having a reserved
memory pool guaranteed to be big enough to handle the request will,
and maintaining that solution is hideously complex.

Loopback NFS suffers from the exact same thing.

Apparently using ceph-fuse is enough to move so much of the processing
to user space, that the pageability of userspace memory allows the
system to recover.
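
To make the kernel-client versus ceph-fuse distinction concrete, here is a minimal sketch that sorts Ceph mounts by filesystem type from text in the /proc/mounts layout. It assumes kernel client mounts show the type "ceph" and ceph-fuse mounts show a type beginning with "fuse.ceph"; verify both against your own system. The function name classify_ceph_mounts is hypothetical and not part of Ceph.

```python
def classify_ceph_mounts(mounts_text):
    """Split Ceph mounts into kernel-client and FUSE mount points.

    mounts_text is expected in the /proc/mounts layout:
    device mountpoint fstype options dump pass
    """
    kernel_mounts = []
    fuse_mounts = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        mountpoint, fstype = fields[1], fields[2]
        if fstype == "ceph":
            # Assumption: the kernel client reports fstype "ceph".
            kernel_mounts.append(mountpoint)
        elif fstype.startswith("fuse.ceph"):
            # Assumption: ceph-fuse reports a "fuse.ceph..." fstype.
            fuse_mounts.append(mountpoint)
    return kernel_mounts, fuse_mounts
```

On a host that also runs an OSD, a non-empty first list would point to the configuration described in this thread. On Linux the input could come from reading /proc/mounts.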

Here's a fragment of the earlier conversation on this topic. Apologies
for gmane/mail clients breaking the thread, anything with that subject
line is part of the conversation:

http://thread.gmane.org/gmane.comp.file-systems.ceph.devel/1673
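
The version thresholds quoted in the report (libc6 < 2.14 or kernel < 2.6.39) can be turned into a rough check. The sketch below only compares version strings against those two numbers; it does not probe for syncfs(2) itself, and, as the reply notes, having syncfs(2) may not be enough to avoid the hang. The helper names parse_version and syncfs_expected are hypothetical.

```python
import re


def parse_version(text):
    """Return the leading dotted numeric part of a version string as a tuple.

    For example, a kernel release such as "2.6.32-5-amd64" yields (2, 6, 32).
    """
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError("no leading version number in %r" % text)
    return tuple(int(part) for part in match.group(0).split("."))


def syncfs_expected(kernel_release, glibc_version):
    """True if both versions meet the thresholds quoted in the report.

    Thresholds taken from the report: kernel 2.6.39 and libc6 2.14.
    """
    kernel_ok = parse_version(kernel_release) >= (2, 6, 39)
    glibc_ok = parse_version(glibc_version) >= (2, 14)
    return kernel_ok and glibc_ok
```

On a live system, platform.release() and platform.libc_ver()[1] from the Python standard library could supply the two arguments, assuming the C library is glibc.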

Associated revisions

Revision 516935bc (diff)
Added by John Wilkins about 7 years ago

doc: Added verbiage to describe single host deadlocks.

fixes: #3076

Signed-off-by: John Wilkins <>

History

#1 Updated by Josh Durgin over 7 years ago

The thread at https://lkml.org/lkml/2004/7/26/68 offers a more in-depth discussion of the problem and why various things won't help.

#2 Updated by Sage Weil over 7 years ago

  • Priority changed from Normal to High

this definitely qualifies as a faq

#3 Updated by John Wilkins about 7 years ago

  • Assignee set to John Wilkins

#4 Updated by John Wilkins about 7 years ago

  • Status changed from New to Resolved

Added new section to the FAQ providing details. Provided links in quick start admonitions to the FAQ.
