Support #47455

closed

How to recover cluster that lost its quorum?

Added by Gunther Heinrich over 3 years ago. Updated almost 3 years ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Category:
Monitor
Target version:
% Done:
0%

Tags:
Reviewed:
Affected Versions:
Pull request ID:

Description

This relates to the documentation issue I posted yesterday here: https://tracker.ceph.com/issues/47436

Does anyone know how I can recover a cluster running Octopus that has lost its quorum? I attempted to follow the documentation linked above, but the results weren't good.
Since then I have tried to follow the instructions by entering the monitor container via cephadm (cephadm enter --fsid ... --name ...), but when I tried to extract the monmap inside the container, ceph-mon gave me the following error:

ceph-mon -i host-id --extract-monmap /tmp/monmap
2020-09-15T10:00:15.726+0000 7f26a753b700 -1 rocksdb: IO error: While lock file: /var/lib/ceph/mon/host-id/store.db/LOCK: Resource temporarily unavailable
2020-09-15T10:00:15.726+0000 7f26a753b700 -1 error opening mon data directory at '/var/lib/ceph/mon/host-id': (22) Invalid argument

I could try to remove the lock but I don't want to tinker too much at the moment.
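
For reference, the procedure from the documentation boils down to the sketch below. This is only a sketch: host-id and dead-mon-id are placeholders, the monitor daemon has to be stopped first, and in a cephadm deployment the store on the host lives under /var/lib/ceph/<fsid>/mon.<host> rather than the path the container sees.

# extract the current monmap from the surviving monitor's store
ceph-mon -i host-id --extract-monmap /tmp/monmap
# remove each monitor that no longer exists from the map
monmaptool /tmp/monmap --rm dead-mon-id
# inject the trimmed map back into the surviving monitor
ceph-mon -i host-id --inject-monmap /tmp/monmap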

Thanks for your help!

#1

Updated by Gunther Heinrich over 3 years ago

Update:
To find a way to extract the monmap from the running mon, I tried several variants of the ceph-mon extraction command: without -i, and with -i set once to mon.host-id and once to host-id. All commands below were run directly on the host, not inside the monitor container running on it:

sudo ceph-mon -d --extract-monmap /tmp/monmap.bin

sometimes results in the following error:
*** Caught signal (Aborted) **
 in thread 7fd3dc642700 thread_name:rocksdb:pst_st
ceph-mon: /build/ceph-UmdLL8/ceph-15.2.3/src/rocksdb/db/db_impl.cc:749: void rocksdb::DBImpl::DumpStats(): Assertion `cf_property_info != nullptr' failed.
Aborted

...and at other times the following error is logged:
7f452590a580  4 rocksdb: [db/version_set.cc:3747] Recovered from manifest file:/var/lib/ceph/mon/ceph-admin/store.db/MANIFEST-000023 succeeded,
manifest_file_number is 23, next_file_number is 25, last_sequence is 0, log_number is 22,prev_log_number is 0,max_column_family is 0,min_log_number_to_keep is 0

7f452590a580  4 rocksdb: [db/version_set.cc:3763] Column family [default] (ID 0), log number is 22

7f452590a580  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1601465666634552, "job": 1, "event": "recovery_started", "log_files": [24]}
7f452590a580  4 rocksdb: [db/db_impl_open.cc:581] Recovering log #24 mode 2
7f452590a580  4 rocksdb: [db/version_set.cc:3035] Creating manifest 26

7f452590a580  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1601465666712288, "job": 1, "event": "recovery_finished"}
7f452590a580  4 rocksdb: DB pointer 0x555bb957e000
7f452590a580 -1 unable to read magic from mon data
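
The ceph-admin path in that log suggests ceph-mon fell back to the default id (admin) because no -i was given, so it was opening the wrong directory. One way to check which data directory a given monitor name resolves to, assuming the /etc/ceph/ceph.conf on the host is the one the monitor uses (mon.host-id is a placeholder):

# print the mon_data path that would be used for this monitor name
ceph-conf --name mon.host-id --show-config-value mon_data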

The command

sudo ceph-mon -d -i mon.host-id --extract-monmap /tmp/monmap.bin

results in this error:
7f10baf9d580 -1 monitor data directory at '/var/lib/ceph/mon/ceph-mon.host-id' does not exist: have you run 'mkfs'?
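
The path in that error shows ceph-mon composes the directory as $cluster-$id, so passing mon.host-id as the id doubles the "mon." prefix. A possible sketch for pointing it at the actual store instead (the --mon-data override and the cephadm host path are assumptions on my part; <fsid> is a placeholder):

# point ceph-mon at the store that cephadm keeps on the host
sudo ceph-mon -d -i host-id --mon-data /var/lib/ceph/<fsid>/mon.host-id --extract-monmap /tmp/monmap.bin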

Using the command
sudo ceph-mon -d -i host-id --extract-monmap /tmp/monmap.bin

results in nothing happening at all, though this may be because the mon daemon was not stopped: the monitor runs in a container and I had no success in stopping it.
The command "sudo ceph-mon --mkfs" on the host machine does nothing either. When run inside the monitor container, it logs that the folder already exists.

#2

Updated by Greg Farnum almost 3 years ago

  • Status changed from New to Closed