Support #47455

closed

How to recover cluster that lost its quorum?

Added by Gunther Heinrich over 3 years ago. Updated almost 3 years ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Category:
Monitor
Target version:
% Done:
0%

Tags:
Reviewed:
Affected Versions:
Pull request ID:

Description

This relates to the documentation issue I posted yesterday here: https://tracker.ceph.com/issues/47436

Does anyone know how I can recover a cluster running Octopus that has lost its quorum? I attempted to follow the documentation linked above, but the results weren't good.
Since then I have tried to follow the instructions by entering the monitor container via cephadm (cephadm enter --fsid ... --name ...), but when I tried to extract the monmap inside the container, ceph-mon gave me the following error:

ceph-mon -i host-id --extract-monmap /tmp/monmap
2020-09-15T10:00:15.726+0000 7f26a753b700 -1 rocksdb: IO error: While lock file: /var/lib/ceph/mon/host-id/store.db/LOCK: Resource temporarily unavailable
2020-09-15T10:00:15.726+0000 7f26a753b700 -1 error opening mon data directory at '/var/lib/ceph/mon/host-id': (22) Invalid argument

I could try to remove the lock but I don't want to tinker too much at the moment.
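
For reference, the procedure from the documentation boils down to the sketch below. This is only a sketch: host-id and dead-mon-id are placeholders, the monitor daemon has to be stopped first, and in a cephadm deployment the store on the host lives under /var/lib/ceph/<fsid>/mon.<host> rather than the path the container sees.

# extract the current monmap from the surviving monitor's store
ceph-mon -i host-id --extract-monmap /tmp/monmap
# remove each monitor that no longer exists from the map
monmaptool /tmp/monmap --rm dead-mon-id
# inject the trimmed map back into the surviving monitor
ceph-mon -i host-id --inject-monmap /tmp/monmap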

Thanks for your help!

#1

Updated by Gunther Heinrich over 3 years ago

Update:
To find a way to extract the monmap from the running mon, I tried several variants of the ceph-mon extraction command: without -i, and with -i set once to mon.host-id and once to host-id. All commands below were run directly on the host, not inside the monitor container running on it:

sudo ceph-mon -d --extract-monmap /tmp/monmap.bin

sometimes results in the following error:
*** Caught signal (Aborted) **
 in thread 7fd3dc642700 thread_name:rocksdb:pst_st
ceph-mon: /build/ceph-UmdLL8/ceph-15.2.3/src/rocksdb/db/db_impl.cc:749: void rocksdb::DBImpl::DumpStats(): Assertion `cf_property_info != nullptr' failed.
Aborted

...and at other times the following error is logged:
7f452590a580  4 rocksdb: [db/version_set.cc:3747] Recovered from manifest file:/var/lib/ceph/mon/ceph-admin/store.db/MANIFEST-000023 succeeded,
manifest_file_number is 23, next_file_number is 25, last_sequence is 0, log_number is 22,prev_log_number is 0,max_column_family is 0,min_log_number_to_keep is 0

7f452590a580  4 rocksdb: [db/version_set.cc:3763] Column family [default] (ID 0), log number is 22

7f452590a580  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1601465666634552, "job": 1, "event": "recovery_started", "log_files": [24]}
7f452590a580  4 rocksdb: [db/db_impl_open.cc:581] Recovering log #24 mode 2
7f452590a580  4 rocksdb: [db/version_set.cc:3035] Creating manifest 26

7f452590a580  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1601465666712288, "job": 1, "event": "recovery_finished"}
7f452590a580  4 rocksdb: DB pointer 0x555bb957e000
7f452590a580 -1 unable to read magic from mon data
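
The ceph-admin path in that log suggests ceph-mon fell back to the default id (admin) because no -i was given, so it was opening the wrong directory. One way to check which data directory a given monitor name resolves to, assuming the /etc/ceph/ceph.conf on the host is the one the monitor uses (mon.host-id is a placeholder):

# print the mon_data path that would be used for this monitor name
ceph-conf --name mon.host-id --show-config-value mon_data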

The command

sudo ceph-mon -d -i mon.host-id --extract-monmap /tmp/monmap.bin

results in this error:
7f10baf9d580 -1 monitor data directory at '/var/lib/ceph/mon/ceph-mon.host-id' does not exist: have you run 'mkfs'?
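
The path in that error shows ceph-mon composes the directory as $cluster-$id, so passing mon.host-id as the id doubles the "mon." prefix. A possible sketch for pointing it at the actual store instead (the --mon-data override and the cephadm host path are assumptions on my part; <fsid> is a placeholder):

# point ceph-mon at the store that cephadm keeps on the host
sudo ceph-mon -d -i host-id --mon-data /var/lib/ceph/<fsid>/mon.host-id --extract-monmap /tmp/monmap.bin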

Using the command
sudo ceph-mon -d -i host-id --extract-monmap /tmp/monmap.bin

results in nothing happening at all, though this may be because the mon daemon was not stopped: the monitor runs in a container and I had no success in stopping it.
The command "sudo ceph-mon --mkfs" on the host machine does nothing either. When run inside the monitor container, it logs that the folder already exists.

#2

Updated by Greg Farnum almost 3 years ago

  • Status changed from New to Closed