Support #22132
closed
OSDs stuck in "booting" state after catastrophic data loss
Added by Maxim Manuylov over 6 years ago.
Updated over 6 years ago.
Description
Hi,
I have a Ceph cluster with 3 MONs+MGRs and 5 OSDs with attached disks. All nodes run CoreOS as the host OS, and the Ceph daemons run in Docker (Luminous, Ubuntu 16.04, BlueStore, etcd). Ceph reports HEALTH_OK and I have some data stored on the OSDs.
Now imagine that (intentionally or due to some failure) I destroy all the cluster nodes almost simultaneously, keeping only the OSD disks. I then want to recreate the cluster while keeping all my OSD data. I recreate all the nodes, and each of them gets a new IP address. The MONs and MGRs start without any problem and form a quorum (I see all mons in "ceph -s"), and I see the "/ceph-config/ceph/monSetupComplete" flag in etcd (etcd is redeployed as well, so the flag is definitely newly added). The OSDs, however, fail to start, getting stuck at the "start_boot" step, and "ceph osd tree" shows all the OSDs in the "down" state.
What are the correct steps to bootstrap a Ceph cluster with existing (prepared and activated) OSD disks? One thing I figured out is that I should not regenerate the cluster "fsid", so I patched the "ceph/daemon" image to pass my own "fsid". But that does not seem to be enough, because the OSDs are still stuck. My guess is that every OSD tries to connect to its peers using the previous osdmap and the old OSD IP addresses. If so, is there a way to reset the osdmap while keeping the data stored on the OSDs? What else could I do?
So far this blocks me from using Ceph in production, since I cannot be sure my data will survive a cluster failure. Feel free to ask if you need any logs or other details.
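One way to test that guess (a sketch only; the data path and the tool's availability inside the container are assumptions here) is to dump the osdmap that a stopped OSD last persisted and inspect the epoch and addresses it records:

```shell
# Hedged sketch: inspect the osdmap a down OSD has on disk.
# Assumes BlueStore data at /var/lib/ceph/osd/ceph-0 (an assumed path)
# and that the OSD daemon is stopped while the tool runs.
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 \
    --op get-osdmap --file /tmp/osdmap.bin

# Print the map: its epoch and the per-OSD addresses it recorded.
osdmaptool --print /tmp/osdmap.bin
```

If the printed addresses are the old node IPs, that would support the stale-osdmap theory.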
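For what it's worth, upstream documents a related recovery path: rebuilding the monitor store from the copies of the cluster maps that the OSDs carry, rather than starting the new MONs with an empty store. A hedged sketch, assuming the default OSD data paths and the usual admin keyring location (both assumptions, not taken from this ticket):

```shell
# Hedged sketch of the documented "recovery using OSDs" procedure:
# accumulate map data from every OSD on this host into a fresh mon store,
# then rebuild it. Paths and keyring location are assumptions.
ms=/tmp/mon-store
mkdir -p "$ms"

# Each OSD contributes the osdmap epochs it has seen.
for osd in /var/lib/ceph/osd/ceph-*; do
    ceph-objectstore-tool --data-path "$osd" \
        --op update-mon-db --mon-store-path "$ms"
done

# Rebuild auth and cluster maps; the keyring must hold the admin/mon keys.
ceph-monstore-tool "$ms" rebuild -- --keyring /etc/ceph/ceph.client.admin.keyring

# The rebuilt store would then replace the monitor's store.db before start.
```

With the mon store rebuilt this way, the new MONs would start from the last osdmap epoch the OSDs know about rather than from an empty map, so the OSDs' boot handshake should be able to succeed.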
core@mm-ceph-mon-0 ~ $ ceph -s
  cluster:
    id:     ecf1b1ee-d10f-741d-4e01-5124fb84ec4b
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum mm-ceph-mon-2,mm-ceph-mon-0,mm-ceph-mon-1
    mgr: mm-ceph-mon-1(active), standbys: mm-ceph-mon-2, mm-ceph-mon-0
    osd: 5 osds: 0 up, 0 in

  data:
    pools:   0 pools, 0 pgs
    objects: 0 objects, 0 bytes
    usage:   0 kB used, 0 kB / 0 kB avail
    pgs:
core@mm-ceph-mon-0 ~ $ ceph osd tree
ID CLASS WEIGHT  TYPE NAME              STATUS REWEIGHT PRI-AFF
-1       0.30945 root default
-4       0.06189     host mm-ceph-osd-0
 0       0.06189         osd.0            down        0 1.00000
-3       0.06189     host mm-ceph-osd-1
 4       0.06189         osd.4            down        0 1.00000
-5       0.06189     host mm-ceph-osd-2
 3       0.06189         osd.3            down        0 1.00000
-6       0.06189     host mm-ceph-osd-3
 2       0.06189         osd.2            down        0 1.00000
-2       0.06189     host mm-ceph-osd-4
 1       0.06189         osd.1            down        0 1.00000
core@mm-ceph-mon-0 ~ $ ceph osd dump
epoch 6
fsid ecf1b1ee-d10f-741d-4e01-5124fb84ec4b
created 2017-11-15 15:56:37.832653
modified 2017-11-15 15:56:45.402958
flags sortbitwise,recovery_deletes,purged_snapdirs
crush_version 5
full_ratio 0.95
backfillfull_ratio 0.9
nearfull_ratio 0.85
require_min_compat_client jewel
min_compat_client jewel
require_osd_release luminous
max_osd 5
osd.0 down out weight 0 up_from 0 up_thru 0 down_at 0 last_clean_interval [0,0) - - - - exists,new fc9f64c3-5301-4981-9668-96fbb3d2b606
osd.1 down out weight 0 up_from 0 up_thru 0 down_at 0 last_clean_interval [0,0) - - - - exists,new 2252cf36-ccda-469e-9836-6dcb55891517
osd.2 down out weight 0 up_from 0 up_thru 0 down_at 0 last_clean_interval [0,0) - - - - exists,new cdc1a1c6-7016-4470-ab1a-0cce2809092e
osd.3 down out weight 0 up_from 0 up_thru 0 down_at 0 last_clean_interval [0,0) - - - - exists,new 1dbb0120-850b-4c59-bbce-c43bef2161d8
osd.4 down out weight 0 up_from 0 up_thru 0 down_at 0 last_clean_interval [0,0) - - - - exists,new 1780ee97-2f46-441b-80bb-714dc5cd2f1b
core@mm-ceph-mon-0 ~ $
Attaching one of the OSD logs.
- Tracker changed from Bug to Support
- Project changed from Ceph to RADOS
- Subject changed from OSDs stuck in "booting" state after entire cluster redeploy to OSDs stuck in "booting" state after catastrophic data loss
- Category deleted (OSD)
- Status changed from New to Resolved