Bug #2192 (closed)

ceph-mon hangs consuming 100% CPU

Added by Vladimir Kulev about 12 years ago. Updated almost 12 years ago.

Status:
Won't Fix
Priority:
High
Assignee:
-
Category:
Monitor
Target version:
-
% Done:
0%

Source:
Community (user)

Description

I have a test setup of two nodes, each running a 0.43 mds, mon, and osd. I mount the ceph kernel filesystem at /srv/ceph on both nodes and run this simple script on node A:
while true; do dd if=/dev/sda of=/srv/ceph/data count=500000; done
After a couple of iterations, all operations involving this filesystem (including umount -f -l) begin to hang in state D+, ceph health hangs in state Sl+, and rados df gives a connection timeout. This happens on both nodes.
Also, on node A (where the script was run) the ceph-mon process starts to consume 100% CPU, and it remains a zombie (<defunct>) at 100% CPU after being killed.

Actions #1

Updated by Sage Weil about 12 years ago

  • Category set to Monitor

Is this reproducible? Are you able to connect to the ceph-mon process with gdb?

Actions #2

Updated by Vladimir Kulev about 12 years ago

It was reproducible every time, on 0.44 as well. After I adjusted the cluster to have only one monitor, the problem went away. (Un)fortunately, after adding the second monitor back (I changed ceph.conf and restarted the services) the problem did not reappear, so I cannot reproduce it anymore.

Actions #3

Updated by Sage Weil about 12 years ago

  • Status changed from New to Need More Info

Actions #4

Updated by Greg Farnum almost 12 years ago

I missed this when it came in, and I don't know where the 100% CPU usage is coming from, but the hung filesystem sounds like our typical issue with running clients and OSDs on the same boxes, and the monitor problems might be related.
So what kernel was in use, and what backing filesystems?

Actions #5

Updated by Vladimir Kulev almost 12 years ago

It was some 3.0.0 Ubuntu kernel, backed by btrfs.

Actions #6

Updated by Sage Weil almost 12 years ago

  • Status changed from Need More Info to Won't Fix

Yep, this sounds like the writeback sync deadlock:

- ceph-mon calls sync
- the kernel client flushes its dirty data, but is waiting for an osdmap from the monitor
-> deadlock

Don't mount the kernel client on the same machine the monitor is running on. Or, use a newer kernel + glibc that support the new syncfs(2) syscall.
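For illustration (not from this ticket), here is a minimal C sketch of the difference: sync(2) flushes dirty data on every mounted filesystem, so a monitor calling it can block behind the ceph kernel mount, while syncfs(2) flushes only the filesystem containing a given file descriptor. The /var/lib/ceph/mon path below is a hypothetical monitor data directory; syncfs(2) requires Linux >= 2.6.39 and glibc >= 2.14.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* Hypothetical monitor data directory; adjust to your setup. */
    int fd = open("/var/lib/ceph/mon", O_RDONLY | O_DIRECTORY);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    /* syncfs() flushes only the filesystem containing fd. sync()
     * would also flush the ceph kernel mount, which can deadlock
     * when that mount is waiting on this very monitor. */
    if (syncfs(fd) < 0) {
        perror("syncfs");
        close(fd);
        return 1;
    }
    close(fd);
    return 0;
}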
