Bug #16234 (closed): Ceph osd daemon keeps running even after disk has been pulled

Added by David Peraza almost 8 years ago. Updated almost 7 years ago.

Status: Rejected
Priority: Normal
Assignee: -
Category: OSD
Target version: -
% Done: 0%
Source: other
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

During our system test at Cisco one of the things we do is simulate disk failure. We do this by taking the disk offline from the CIMC by marking the disk as unconfigured, and also by pulling the disk out. In our test we have 4 data disks and one SSD set aside for journaling.

These are the mount points:

/dev/sdb1 838G 86M 838G 1% /var/lib/ceph/osd/ceph-22
/dev/sdd1 838G 46M 838G 1% /var/lib/ceph/osd/ceph-24
/dev/sda1 838G 46M 838G 1% /var/lib/ceph/osd/ceph-21
/dev/sdc1 838G 78M 838G 1% /var/lib/ceph/osd/ceph-23

After pulling the disk, ceph-disk no longer shows the disk that was taken offline (in my case sdc1):

[root@david_server-4 ~]# ceph-disk list
/dev/sda :
/dev/sda1 ceph data, active, cluster ceph, osd.21, journal /dev/sde1
/dev/sdb :
/dev/sdb1 ceph data, active, cluster ceph, osd.22, journal /dev/sde2
/dev/sdd :
/dev/sdd1 ceph data, active, cluster ceph, osd.24, journal /dev/sde4
/dev/sde :
/dev/sde1 ceph journal, for /dev/sda1
/dev/sde2 ceph journal, for /dev/sdb1
/dev/sde3 ceph journal
/dev/sde4 ceph journal, for /dev/sdd1
/dev/sdf :
/dev/sdf1 other, ext4, mounted on /boot
/dev/sdf2 other, LVM2_member

The mountpoint still looks OK:

[root@david_server-4 ~]# ls /var/lib/ceph/osd/ceph-23
activate.monmap active ceph_fsid current fsid journal journal_uuid keyring magic ready store_version superblock sysvinit whoami

But sdc1 is not there:

[root@david_server-4 ~]# ls /dev/sdc1
ls: cannot access /dev/sdc1: No such file or directory

Obviously the ceph osd daemon for ceph-23 still thinks everything is OK:

[root@david_server-4 ~]# service ceph status osd.23
=== osd.23 ===
osd.23: running {"version":"0.94.5-9.el7cp"}

And ceph monitors never get notified that something is wrong:
[ceph@david_server-2 /]$ ceph osd stat
osdmap e166: 25 osds: 25 up, 25 in

It could stay like this for days, as long as there is no i/o on that particular OSD.

I looked around to check if there is a config option that would make the osd daemon check the status of its disk, but I can't find anything.

We would like to know that a disk is bad within a reasonable amount of time, and let things rebalance even before we try to do i/o on the bad disk.
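
To illustrate the kind of check being asked for, here is a rough sketch of an out-of-band watchdog (a hypothetical script, not something Ceph ships; it assumes the standard /var/lib/ceph/osd/ceph-<id> mount layout and uses only the stock ceph CLI):

#!/bin/bash
# Sketch: flag OSDs whose backing block device has disappeared.
# Assumes each OSD data dir is a mount of its own data partition.
for osd_dir in /var/lib/ceph/osd/ceph-*; do
    id=${osd_dir##*-}
    # device the OSD data dir is mounted from (empty if not a mount point)
    dev=$(findmnt -n -o SOURCE "$osd_dir")
    if [ -n "$dev" ] && [ ! -b "$dev" ]; then
        echo "backing device $dev for osd.$id is gone"
        # mark the OSD down and out so the cluster starts rebalancing now
        ceph osd down "$id"
        ceph osd out "$id"
    fi
done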

Here is my ceph info:

[ceph@david_server-2 /]$ ceph version
ceph version 0.94.5-9.el7cp (deef183a81111fa5e128ec88c90a32c9587c615d)

[root@david_server-4 ~]# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.2 (Maipo)

[ceph@david_server-2 /]$ ceph status
cluster 1af8284d-d1e1-42a0-af36-c1076115c853
health HEALTH_OK
monmap e1: 3 mons at {ceph-david_server-1=20.0.0.6:6789/0,ceph-david_server-2=20.0.0.5:6789/0,ceph-david_server-3=20.0.0.7:6789/0}
election epoch 16, quorum 0,1,2 ceph-david_server-2,ceph-david_server-1,ceph-david_server-3
osdmap e166: 25 osds: 25 up, 25 in
pgmap v1864: 1024 pgs, 5 pools, 406 MB data, 54 objects
2249 MB used, 61525 GB / 61527 GB avail
1024 active+clean

#1

Updated by Josh Durgin almost 7 years ago

  • Status changed from New to Rejected

No cluster should be entirely idle. At the very least scrubbing should be running, and it will get an error from the disk at that point.
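
For reference, a scrub can also be requested by hand to surface the disk error sooner, and the background scrub cadence is tunable; the interval values below are only examples:

# ask osd.23 to scrub (or deep-scrub) its placement groups now
ceph osd scrub 23
ceph osd deep-scrub 23

# background scrub cadence, in ceph.conf under [osd] (example values, in seconds)
osd scrub min interval = 86400
osd scrub max interval = 604800
osd deep scrub interval = 604800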
