Support #22132

closed

OSDs stuck in "booting" state after catastrophic data loss

Added by Maxim Manuylov over 6 years ago. Updated over 6 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: -
Target version:
% Done: 0%
Tags:
Reviewed:
Affected Versions:
Component(RADOS):
Pull request ID:

Description

Hi,

I have a Ceph cluster with 3 MONs+MGRs and 5 OSDs with attached disks. All nodes run CoreOS as the host OS, and the Ceph daemons run in Docker (luminous, ubuntu 16.04, bluestore, etcd). Ceph reports HEALTH_OK and I have some data stored on the OSDs.

Now imagine that (intentionally or due to some failure) I destroy all the cluster nodes almost simultaneously, keeping just the OSD disks. I want to recreate the cluster while keeping all my OSD data. I recreate all the nodes, and each of them gets a new IP address. The MONs and MGRs start without any problem and form a quorum (I see all mons in "ceph -s"), and I see the "/ceph-config/ceph/monSetupComplete" flag in etcd (etcd is redeployed as well, so the flag is definitely newly added). The OSDs, however, fail to start and get stuck at the "start_boot" step, and "ceph osd tree" shows all the OSDs as "down".
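For reference, here is a minimal diagnostic sketch of how a stuck OSD's own view can be compared with what the new monitors report. It assumes the OSD admin socket is reachable inside the container; the container name and osd id below are placeholders:

    # Ask a stuck OSD for its own view via the admin socket; "state" shows
    # "booting", and cluster_fsid / newest_map come from its local store.
    docker exec <osd-container> ceph daemon osd.0 status

    # Compare with the fsid and OSD states the new monitors report.
    ceph fsid
    ceph osd tree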

What are the correct steps to bootstrap a Ceph cluster from existing (prepared and activated) OSD disks? One thing I figured out is that I should not regenerate the cluster "fsid", so I patched the "ceph/daemon" image to pass my own "fsid". But it seems that is not enough, because the OSDs are still stuck. My guess is that every OSD tries to connect to its peers using the previous osdmap and the old OSD IP addresses. If so, is there any way to reset the osdmap while keeping the data stored on the OSDs? What else could I do?
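In case it helps to be concrete, here is a minimal sketch of how I understand the original fsid can be read back from a surviving OSD disk instead of being regenerated (the device path and mount point below are placeholders for one of my kept disks):

    # Read the BlueStore label (osd uuid and metadata) from one of the
    # kept data devices.
    ceph-bluestore-tool show-label --dev /dev/sdb

    # The original cluster fsid is also stored as a plain file in the small
    # OSD data partition, if it is mounted at the usual ceph-disk location:
    cat /var/lib/ceph/osd/ceph-0/ceph_fsid

    # That value is what should end up in ceph.conf (or be passed to the
    # ceph/daemon image) on the new nodes before the MONs are bootstrapped:
    #   [global]
    #   fsid = <value of ceph_fsid>

Regarding the osdmap, as far as I understand, the documented "recovery using OSDs" procedure rebuilds the monitor store from the osdmap copies kept on the OSDs (via "ceph-objectstore-tool --op update-mon-db") rather than resetting the map, but I am not sure whether that is the intended path here.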

So far this blocks me from using Ceph in production, since I cannot be sure my data will survive a cluster failure. Feel free to ask if you need any logs or other details.


Files

osd.log (38.9 KB) - OSD log, Maxim Manuylov, 11/15/2017 04:05 PM
