Bug #2357 (closed): mds takes down ceph

Added by Jörg Ebeling about 12 years ago. Updated over 11 years ago.

Status: Can't reproduce
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Development

Description

Hi guys!
Really impressive DFS... nice performance... cool stuff!
But as I started using it live, I unfortunately ran into some issues.

I configured a two-node Ceph cluster with 2 OSDs (on btrfs), 2 MDSs and 2+1 MONs.

Everything looked good and the throughput was/is impressive.

Failover testing wasn't as good as hoped (long 1-2 minute freezes), but that's probably due to the early stage of development.

However, my problem is a different one.
When I start actually working with the FS, I get some strange issues.
When I copy, e.g., a bunch of files (not many, approx. 200 files, 20 MB in total) to the FS (ceph kernel mount), the copy may stop (freeze), and after that no access to the ceph mounts works from either of the two machines.
Looking at the logs, the only thing I can find is that the other MDS starts logging:

Apr 28 19:27:14 bert kernel: [67150.554815] ceph: mds1 caps went stale, renewing
Apr 28 19:27:14 bert kernel: [67150.559135] ceph: mds1 caps renewed

every minute.

(I also get a lot of "btrfs: free space inode generation (0) did not match free space cache generation" messages in between the MDS messages, but this seems to be a known btrfs logging issue.)

After approx. 10-15 minutes I get:
Apr 28 19:33:32 bert kernel: [67527.439583] libceph: mds0 10.83.79.2:6800 socket closed
Apr 28 19:33:34 bert kernel: [67529.789290] libceph: mds0 10.83.79.2:6800 connection failed
Apr 28 19:33:34 bert kernel: [67529.863170] libceph: mon0 10.83.79.1:6789 socket closed
Apr 28 19:33:34 bert kernel: [67529.863199] libceph: mon0 10.83.79.1:6789 session lost, hunting for new mon
Apr 28 19:33:34 bert kernel: [67529.864059] libceph: mon1 10.83.79.2:6789 session established
Apr 28 19:33:34 bert kernel: [67529.879931] btrfs: free space inode generation (0) did not match free space cache generation (193672)
Apr 28 19:33:34 bert kernel: [67529.879985] btrfs: free space inode generation (0) did not match free space cache generation (211950)
Apr 28 19:33:34 bert kernel: [67529.882764] libceph: mon1 10.83.79.2:6789 socket closed
Apr 28 19:33:34 bert kernel: [67529.882793] libceph: mon1 10.83.79.2:6789 session lost, hunting for new mon
Apr 28 19:33:34 bert kernel: [67529.882837] libceph: mon1 10.83.79.2:6789 connection failed
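
For reference, the reproduction looks roughly like this (monitor address is from my setup; mount point, secretfile path and source directory are just examples):

# ceph kernel client mount (secretfile path is an example)
mount -t ceph 10.83.79.1:6789:/ /mnt/ceph -o name=admin,secretfile=/etc/ceph/admin.secret
# copy a small test set (~200 files, ~20 MB) onto the mount
cp -a /some/test/files /mnt/ceph/
# once the copy freezes, cluster state checked from another shell
ceph -s
ceph health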

Kernel 3.2.0-2-amd64
Ceph 0.45-1 (from git)

ceph.conf:
[global]
auth supported = cephx
keyring = /etc/ceph/keyring.admin
max open files = 131072
log file = /var/log/ceph/$name.log
pid file = /var/run/ceph/$name.pid
[mon]
mon data = /srv/ceph-data/$name
[mon.erni]
host = erni
mon addr = 10.83.79.1:6789
[mon.bert]
host = bert
mon addr = 10.83.79.2:6789
[mon.cluster]
; tie-breaker
host = cluster
mon addr = 178.xx.xx.xx:6789
[mds]
keyring = /srv/ceph-data/keyring.$name
[mds.erni]
host = erni
[mds.bert]
host = bert
[osd]
osd data = /srv/ceph-data/$name
keyring = /etc/ceph/keyring.$name
osd journal = /srv/ceph-data/$name/journal
osd journal size = 1000 ; journal size, in megabytes
sudo = true
[osd.0]
host = erni
btrfs devs = /dev/mapper/vg01-ceph0
cluster addr = 10.83.79.1
public addr = 188.xx.xx.112
[osd.1]
host = bert
btrfs devs = /dev/mapper/vg01-ceph0
cluster addr = 10.83.79.2
public addr = 188.xx.xx.75
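
In case it matters, the cluster was created and started the usual way for this version; roughly the following (the admin keyring path is an assumption on my part):

# create and start the cluster on both nodes
mkcephfs -a -c /etc/ceph/ceph.conf -k /etc/ceph/keyring.admin
service ceph -a start
# quick sanity check
ceph -s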

Any suggestions?

Jörg
