Bug #2192: ceph-mon hangs consuming 100% CPU - Ceph - Ceph

closed

Added by Vladimir Kulev about 12 years ago. Updated almost 12 years ago.

Status: Won't Fix
Priority: High
Category: Monitor
Source: Community (user)

Description

I have a test setup of two nodes, each running 0.43 mds, mon, and osd. I mount the Ceph kernel filesystem at /srv/ceph on both nodes and run this simple script on node A:
while true; do dd if=/dev/sda of=/srv/ceph/data count=500000; done
After a couple of iterations, all operations involving this filesystem (including umount -f -l) begin to hang in state D+, ceph health hangs in state Sl+, and rados df gives a connection timeout. This happens on both nodes.
Also, on node A (where the script was run), the ceph-mon process starts to consume 100% CPU, and remains a zombie (<defunct>) at 100% CPU after being killed.

Updated by Sage Weil about 12 years ago

Category set to Monitor

Is this reproducible? Are you able to connect to the ceph-mon process with gdb?

Updated by Vladimir Kulev about 12 years ago

It was reproducible every time, on 0.44 as well. After I reconfigured the cluster to have only one monitor, the problem went away. (Un)fortunately, after adding the second monitor back (by changing ceph.conf and restarting services), the problem did not reappear, so I cannot reproduce it anymore.

Updated by Sage Weil about 12 years ago

Status changed from New to Need More Info

Updated by Greg Farnum almost 12 years ago

I missed this when it came in, and I don't know where the 100% CPU usage is coming from, but the hung filesystem sounds like our typical issue with running clients and OSDs on the same boxes, and the Monitor problems might be related.
So which kernel was in use, and which backing filesystems?

Updated by Vladimir Kulev almost 12 years ago

It was some 3.0.0 Ubuntu kernel, backed by btrfs.

Updated by Sage Weil almost 12 years ago

Status changed from Need More Info to Won't Fix

Yep, this sounds like the writeback sync deadlock:

- ceph-mon calls sync
- the kernel client flushes its dirty data, but is waiting for an osdmap from the monitor
- -> deadlock

Don't mount the kernel client on the same machine that the monitor is running on. Or use a newer kernel + glibc that support the new syncfs(2) syscall.
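
To check whether a given machine has the "newer kernel + glibc" mentioned above, something like the following shell sketch may help. It assumes the thresholds documented in the syncfs(2) man page (syscall added in Linux 2.6.39, library wrapper added in glibc 2.14) and assumes GNU `sort -V` is available. The helper name `version_ge` is made up for this example. It says nothing about whether the installed ceph-mon build actually uses syncfs.

```shell
#!/bin/sh
# Sketch only: thresholds are taken from the syncfs(2) man page,
# not from anything stated in this ticket.

# version_ge A B: succeeds when version A >= version B.
# Relies on GNU sort -V; the smaller version sorts first.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Strip the distro suffix, e.g. "3.0.0-12-generic" -> "3.0.0".
kernel=$(uname -r | cut -d- -f1)
# Prints e.g. "glibc 2.35" on glibc systems; empty elsewhere.
glibc=$(getconf GNU_LIBC_VERSION 2>/dev/null | awk '{print $2}')

if version_ge "$kernel" 2.6.39; then
    echo "kernel $kernel: syncfs syscall should be available"
else
    echo "kernel $kernel: too old for syncfs"
fi

if [ -z "$glibc" ]; then
    echo "glibc version unknown (not a glibc system?)"
elif version_ge "$glibc" 2.14; then
    echo "glibc $glibc: syncfs() wrapper should be available"
else
    echo "glibc $glibc: too old for the syncfs() wrapper"
fi
```

Both conditions need to hold. A 3.0.0 kernel passes the kernel check, but the glibc version still has to be checked separately.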
