Documentation #3076

closed

doc: Explain how loopback mounts (using kclient, ceph-fuse should be immune) or RBD can cause deadlock

Added by Anonymous over 11 years ago. Updated about 11 years ago.

Status:

Resolved

Priority:

High

Assignee:

John Wilkins


Description

Currently, consuming CephFS/RBD services on the same machine that provides them (for example, a kernel-client mount or RBD device on a host that also runs an OSD) can lead to deadlock. Most people don't realize this. The cause is not specific to Ceph, so we're probably not going to fix it either. The documentation should make users aware of it.

The slide link in the email forwarded below is what explains this best for me; also of note is the Red Hat bug in which they said they won't fix this because it is too hard.

---------- Forwarded message ----------
From: Tommi Virtanen <tv@inktank.com>
Date: Tue, May 29, 2012 at 12:18 PM
Subject: Re: OSD deadlock with cephfs client and OSD on same machine
To: Amon Ott <a.ott@m-privacy.de>
Cc: ceph-devel@vger.kernel.org

On Tue, May 29, 2012 at 12:44 AM, Amon Ott <a.ott@m-privacy.de> wrote:

> On Linux, if you run OSD on ext4 filesystem, have a cephfs kernel client mount
> on the same system and no syncfs system call (as to be expected with libc6 <
> 2.14 or kernel < 2.6.39), OSD deadlocks in sys_sync(). Only reboot recovers
> the system.

This is the classic issue of memory pressure needing free memory to be
relieved. While syncfs(2) may make the hang less common, I do not
think having syncfs(2) is enough; nothing short of having a reserved
memory pool guaranteed to be big enough to handle the request will,
and maintaining that solution is hideously complex.

Loopback NFS suffers from the exact same thing.

Apparently using ceph-fuse is enough to move so much of the processing
to user space, that the pageability of userspace memory allows the
system to recover.

Here's a fragment of the earlier conversation on this topic. Apologies
for gmane/mail clients breaking the thread, anything with that subject
line is part of the conversation:

http://thread.gmane.org/gmane.comp.file-systems.ceph.devel/1673
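
As a rough aid for spotting the risky configuration described above, here is a minimal sketch that separates kernel-client CephFS mounts from ceph-fuse mounts in a mount table. The function name is hypothetical, and the filesystem type strings ("ceph" for the kernel client, a "fuse.ceph" prefix for ceph-fuse) are assumptions about how these mounts appear in /proc/mounts; verify them on your own system. It does not check whether an OSD runs on the same host.

```python
def classify_ceph_mounts(mount_table: str) -> dict:
    """Split CephFS entries of a /proc/mounts-style table by client type.

    Assumption: kernel-client mounts report fstype "ceph" and
    ceph-fuse mounts report an fstype starting with "fuse.ceph".
    """
    result = {"kernel": [], "fuse": []}
    for line in mount_table.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        mount_point, fstype = fields[1], fields[2]
        if fstype == "ceph":
            # Kernel client: the case reported to deadlock alongside a local OSD.
            result["kernel"].append(mount_point)
        elif fstype.startswith("fuse.ceph"):
            # ceph-fuse: reported above to recover under memory pressure.
            result["fuse"].append(mount_point)
    return result
```

To use it, pass in the contents of /proc/mounts on the host in question. Any entries under "kernel" on a machine that also runs an OSD are the combination the email warns about.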

Updated by Josh Durgin over 11 years ago

The thread at https://lkml.org/lkml/2004/7/26/68 covers the problem in more depth and explains why various things won't help.
