Bug #8798

The kernel of a server with Ceph hangs

Added by AltScale Inc over 9 years ago. Updated over 9 years ago.

Status:
Won't Fix
Priority:
High
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

We have two separate Ceph installations with five servers. Each machine
has four disks. Two of the disks on each machine are dedicated entirely
to Ceph. The other two disks on each machine are partitioned, with the
OS installed in software RAID on a small portion of them; the rest of
the free space is dedicated to Ceph. So we have two whole disks and two
partitions per machine dedicated to Ceph. On the same machines that hold
the Ceph OSDs/disks, an RBD block device is mapped and mounted as a
distributed filesystem (OCFS2).
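
For reference, this is roughly how each node maps the RBD image and
mounts the shared OCFS2 filesystem on top of it (a minimal sketch; the
pool name, image name and mount point below are placeholders, not our
real configuration):

    #!/usr/bin/env python
    # Sketch of the per-node RBD + OCFS2 setup described above. The
    # pool, image and mount point names are placeholders.
    import subprocess

    POOL = "rbd"             # placeholder pool name
    IMAGE = "shared-fs"      # placeholder RBD image name
    MOUNTPOINT = "/mnt/shared"

    # Map the RBD image through the kernel rbd client on this node.
    subprocess.check_call(["rbd", "map", "%s/%s" % (POOL, IMAGE)])

    # Mount the (already formatted) OCFS2 filesystem on the image.
    device = "/dev/rbd/%s/%s" % (POOL, IMAGE)
    subprocess.check_call(["mount", "-t", "ocfs2", device, MOUNTPOINT])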

For months the servers have been rebooting by themselves at random
intervals. Usually there are kernel errors from the libceph module
before the restart. We had enabled kernel crash logging, but so far
nothing had been logged. On the latest restart there was a crash log
related to Ceph; it is attached to this bug report. A sample of the
errors returned by the libceph module is also attached.

Our checks show that we do not have defective hardware or network
issues. We use ECC memory. The network connectivity that Ceph uses runs
over bonded interfaces comprising four network adapters.

Recently one of the machines simply hung. Our check showed that the
kernel hung when a Ceph OSD process crashed. A screenshot of the IPMI
console taken during the crash is attached. We have no additional
information on the issue.

We've upgraded from Cuttlefish to Dumpling and later to Emperor without
any effect on the issue.

We have stopped automatic deep scrubbing and run it via cron instead,
distributed over OSDs and days, because the system becomes unresponsive
when it is enabled.
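
A minimal sketch of the kind of cron-driven script we mean (the
id-modulo-weekday distribution shown here is illustrative, not our
exact production script; it assumes the nodeep-scrub flag has been set
with "ceph osd set nodeep-scrub" so deep scrubs only run when triggered
explicitly):

    #!/usr/bin/env python
    # Illustrative sketch: trigger deep scrubs from cron, spread over
    # OSDs and weekdays, instead of letting the cluster schedule them.
    import datetime
    import subprocess

    # List all OSD ids known to the cluster.
    out = subprocess.check_output(["ceph", "osd", "ls"])
    osd_ids = [int(line) for line in out.decode().split()]

    # Deep-scrub only the OSDs whose id falls on today's slot (0-6,
    # Monday-Sunday), so each OSD is deep-scrubbed roughly once a week.
    today = datetime.date.today().weekday()
    for osd_id in osd_ids:
        if osd_id % 7 == today:
            subprocess.check_call(["ceph", "osd", "deep-scrub",
                                   "osd.%d" % osd_id])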

OS: Ubuntu 12.04
Ceph version: cuttlefish, dumpling, emperor
Kernel version: 3.5.x

ceph-kernel-hang.png (46.1 KB) AltScale Inc, 07/10/2014 03:51 AM

_usr_sbin_ceph-disk.0-2014-07-07-01-00-to-01-02.crash (13.1 KB) AltScale Inc, 07/10/2014 03:51 AM

ceph-libceph-sample-errors.txt (3.94 KB) AltScale Inc, 07/10/2014 03:51 AM

History

#1 Updated by Zheng Yan over 9 years ago

You are running the OSD daemon and a kernel client on the same machine, which is deadlock prone. Read the following link for more information:

http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/6648

#2 Updated by AltScale Inc over 9 years ago

Thank you for the reply.

We use XFS for the OSD filesystem and do not use CephFS, but OCFS2 over RBD (the kernel rbd client). Is this still relevant?

#3 Updated by Zheng Yan over 9 years ago

Yes, kernel rbd clients suffer from the same deadlock.

#4 Updated by Zheng Yan over 9 years ago

  • Status changed from New to Won't Fix
