Bug #11313
Monitors not forming quorum : flapping : Cluster unreachable
Status: Closed
Added by karan singh about 9 years ago. Updated about 9 years ago.
Description
Today I encountered a weird problem with my Ceph cluster, which was working fine before.
I was troubleshooting my PGs which are incomplete, so I tried marking an OSD as lost.
As soon as I marked the OSD as lost (# ceph osd lost 115 --yes-i-really-mean-it), the cluster became unreachable. All the monitors lost quorum and I was not able to connect to the cluster again.
I tried restarting services several times, but no luck.
Below are the logs, which show that the problem started exactly after # ceph osd lost 115:
2015-04-02 13:44:47.280207 7f376cce1700 0 mon.pouta-s01@0(leader) e3 handle_command mon_command({"prefix": "status"} v 0) v1
2015-04-02 13:45:09.118663 7f376d6e2700 0 mon.pouta-s01@0(leader).data_health(1920) update_stats avail 80% total 82932 MB, used 12329 MB, avail 66390 MB
2015-04-02 13:46:12.709808 7f376d6e2700 0 mon.pouta-s01@0(leader).data_health(1920) update_stats avail 80% total 82932 MB, used 12328 MB, avail 66391 MB
2015-04-02 13:47:14.233436 7f376d6e2700 0 mon.pouta-s01@0(leader).data_health(1920) update_stats avail 80% total 82932 MB, used 12327 MB, avail 66392 MB
2015-04-02 13:48:18.824272 7f376d6e2700 0 mon.pouta-s01@0(leader).data_health(1920) update_stats avail 80% total 82932 MB, used 12326 MB, avail 66392 MB
2015-04-02 13:48:19.059478 7f376cce1700 0 mon.pouta-s01@0(leader) e3 handle_command mon_command({"prefix": "osd lost", "sure": "--yes-i-really-mean-it", "id": 115} v 0) v1
2015-04-02 13:48:38.403011 7f376cce1700 0 log [INF] : mon.pouta-s01 calling new monitor election
2015-04-02 13:48:38.406063 7f376cce1700 0 log [INF] : mon.pouta-s01@0 won leader election with quorum 0,1,2
2015-04-02 13:48:49.455605 7f376cce1700 0 log [INF] : monmap e3: 3 mons at {pouta-s01=10.100.50.1:6789/0,pouta-s02=10.100.50.2:6789/0,pouta-s03=10.100.50.3:6789/0}
2015-04-02 13:48:49.456310 7f376cce1700 0 log [INF] : pgmap v598458: 18432 pgs: 7 down+incomplete, 18412 active+clean, 13 incomplete; 2338 GB data, 19068 GB used, 853 TB / 871 TB avail; 133/328874 unfound (0.040%)
2015-04-02 13:48:49.456409 7f376cce1700 0 log [INF] : mdsmap e1: 0/0/1 up
2015-04-02 13:48:49.457323 7f376cce1700 0 log [INF] : osdmap e262821: 240 osds: 240 up, 240 in
2015-04-02 13:48:49.665436 7f376cce1700 0 log [INF] : mon.pouta-s01 calling new monitor election
2015-04-02 13:48:49.668347 7f376cce1700 0 log [INF] : mon.pouta-s01@0 won leader election with quorum 0,1,2
2015-04-02 13:49:00.525321 7f376cce1700 0 log [INF] : monmap e3: 3 mons at {pouta-s01=10.100.50.1:6789/0,pouta-s02=10.100.50.2:6789/0,pouta-s03=10.100.50.3:6789/0}
2015-04-02 13:49:00.532675 7f376cce1700 0 log [INF] : pgmap v598458: 18432 pgs: 7 down+incomplete, 18412 active+clean, 13 incomplete; 2338 GB data, 19068 GB used, 853 TB / 871 TB avail; 133/328874 unfound (0.040%)
2015-04-02 13:49:00.532980 7f376cce1700 0 log [INF] : mdsmap e1: 0/0/1 up
2015-04-02 13:49:00.538582 7f376cce1700 0 log [INF] : osdmap e262822: 240 osds: 240 up, 240 in
2015-04-02 13:49:00.782187 7f376cce1700 0 log [INF] : mon.pouta-s01 calling new monitor election
2015-04-02 13:49:01.002140 7f376cce1700 0 log [INF] : mon.pouta-s01@0 won leader election with quorum 0,1,2
More logs : http://paste.ubuntu.com/10724004/
You can observe a monitor entering quorum and then losing quorum: http://paste.ubuntu.com/10723915/
Cluster unreachable : debug ms 1 : http://paste.ubuntu.com/10723862/
Monitor logs : debug mon 10 : http://paste.ubuntu.com/10723840/
I tried upgrading to 0.80.9 as well, but no luck.
This is a production cluster and needs to be fixed really soon (before the Easter holidays); any help from the development team would be really appreciated.
Updated by Joao Eduardo Luis about 9 years ago
Looks like the leader keeps on calling elections after not getting acks for paxos leases.
This may be a case of overloaded monitors, clock skews or something totally different.
Logs with 'debug paxos = 10' would help.
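For reference, the extra logging could be enabled with something like the following in the [mon] section of ceph.conf on each monitor (a sketch; option names as used in this release line, with a monitor restart assumed):

```ini
[mon]
    debug mon = 10
    debug paxos = 10
    debug ms = 1
```

If the admin socket is still responsive, the same settings can usually be applied without a restart via ceph daemon mon.&lt;id&gt; config set debug_paxos 10 (treat that exact invocation as an assumption for your version).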
Updated by Joao Eduardo Luis about 9 years ago
Also, logs from the other monitors would definitely help too.
Updated by karan singh about 9 years ago
Hi
1# The monitors are not overloaded.
2# NTP is set up on all the monitors, and to be on the safe side, mon clock drift allowed = 10 has been set just for troubleshooting.
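To double-check the NTP side, something along these lines can be run from an admin node (hostnames illustrative; assumes ssh access and ntpq installed on the monitor hosts):

```shell
# Show each monitor's NTP peers and current offsets (in ms);
# large or growing offsets would point at real clock skew
for h in pouta-s01 pouta-s02 pouta-s03; do
    echo "== $h =="
    ssh "$h" ntpq -p
done
```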
Logs with debug paxos = 10
Monitor 1 : http://paste.ubuntu.com/10726112/
Monitor 2 : http://paste.ubuntu.com/10726096/
Monitor 3 : http://paste.ubuntu.com/10726070/
I have also tried restarting all the monitor machines and checking quorum_status; the monitors still form quorum for a moment and then drop out of quorum.
[root@XXXXX-s02 ceph]# ceph daemon mon.XXXXX-s02 mon_status
{ "name": "XXXXX-s02", "rank": 1, "state": "electing", "election_epoch": 3227, "quorum": [], "outside_quorum": [], "extra_probe_peers": [], "sync_provider": [], "monmap": { "epoch": 3, "fsid": "2bd3283d-67ef-4316-8b7e-d8f4747eae33", "modified": "2014-03-09 05:34:22.667665", "created": "2014-03-09 05:33:48.846437", "mons": [ { "rank": 0, "name": "XXXXX-s01", "addr": "10.100.50.1:6789\/0"}, { "rank": 1, "name": "XXXXX-s02", "addr": "10.100.50.2:6789\/0"}, { "rank": 2, "name": "XXXXX-s03", "addr": "10.100.50.3:6789\/0"}]}}

[root@XXXXX-s02 ceph]# ceph daemon mon.XXXXX-s02 mon_status
{ "name": "XXXXX-s02", "rank": 1, "state": "peon", "election_epoch": 3232, "quorum": [ 0, 1, 2], "outside_quorum": [], "extra_probe_peers": [], "sync_provider": [], "monmap": { "epoch": 3, "fsid": "2bd3283d-67ef-4316-8b7e-d8f4747eae33", "modified": "2014-03-09 05:34:22.667665", "created": "2014-03-09 05:33:48.846437", "mons": [ { "rank": 0, "name": "XXXXX-s01", "addr": "10.100.50.1:6789\/0"}, { "rank": 1, "name": "XXXXX-s02", "addr": "10.100.50.2:6789\/0"}, { "rank": 2, "name": "XXXXX-s03", "addr": "10.100.50.3:6789\/0"}]}}

[root@XXXXX-s02 ceph]# ceph daemon mon.XXXXX-s02 mon_status
{ "name": "XXXXX-s02", "rank": 1, "state": "electing", "election_epoch": 3233, "quorum": [], "outside_quorum": [], "extra_probe_peers": [], "sync_provider": [], "monmap": { "epoch": 3, "fsid": "2bd3283d-67ef-4316-8b7e-d8f4747eae33", "modified": "2014-03-09 05:34:22.667665", "created": "2014-03-09 05:33:48.846437", "mons": [ { "rank": 0, "name": "XXXXX-s01", "addr": "10.100.50.1:6789\/0"}, { "rank": 1, "name": "XXXXX-s02", "addr": "10.100.50.2:6789\/0"}, { "rank": 2, "name": "XXXXX-s03", "addr": "10.100.50.3:6789\/0"}]}}

[root@XXXXX-s01 ceph]# ceph daemon mon.XXXXX-s01 mon_status
{ "name": "XXXXX-s01", "rank": 0, "state": "leader", "election_epoch": 3252, "quorum": [ 0, 1, 2], "outside_quorum": [], "extra_probe_peers": [], "sync_provider": [], "monmap": { "epoch": 3, "fsid": "2bd3283d-67ef-4316-8b7e-d8f4747eae33", "modified": "2014-03-09 05:34:22.667665", "created": "2014-03-09 05:33:48.846437", "mons": [ { "rank": 0, "name": "XXXXX-s01", "addr": "10.100.50.1:6789\/0"}, { "rank": 1, "name": "XXXXX-s02", "addr": "10.100.50.2:6789\/0"}, { "rank": 2, "name": "XXXXX-s03", "addr": "10.100.50.3:6789\/0"}]}}

[root@XXXXX-s01 ceph]# ceph daemon mon.XXXXX-s01 mon_status
{ "name": "XXXXX-s01", "rank": 0, "state": "electing", "election_epoch": 3255, "quorum": [], "outside_quorum": [], "extra_probe_peers": [], "sync_provider": [], "monmap": { "epoch": 3, "fsid": "2bd3283d-67ef-4316-8b7e-d8f4747eae33", "modified": "2014-03-09 05:34:22.667665", "created": "2014-03-09 05:33:48.846437", "mons": [ { "rank": 0, "name": "XXXXX-s01", "addr": "10.100.50.1:6789\/0"}, { "rank": 1, "name": "XXXXX-s02", "addr": "10.100.50.2:6789\/0"}, { "rank": 2, "name": "XXXXX-s03", "addr": "10.100.50.3:6789\/0"}]}}

[root@XXXXX-s01 ceph]# ceph daemon mon.XXXXX-s01 mon_status
{ "name": "XXXXX-s01", "rank": 0, "state": "probing", "election_epoch": 3258, "quorum": [], "outside_quorum": [ "XXXXX-s01"], "extra_probe_peers": [], "sync_provider": [], "monmap": { "epoch": 3, "fsid": "2bd3283d-67ef-4316-8b7e-d8f4747eae33", "modified": "2014-03-09 05:34:22.667665", "created": "2014-03-09 05:33:48.846437", "mons": [ { "rank": 0, "name": "XXXXX-s01", "addr": "10.100.50.1:6789\/0"}, { "rank": 1, "name": "XXXXX-s02", "addr": "10.100.50.2:6789\/0"}, { "rank": 2, "name": "XXXXX-s03", "addr": "10.100.50.3:6789\/0"}]}}
Updated by karan singh about 9 years ago
It would be great if you could extend your help over IRC (#CEPH) [ksingh].
Updated by Joao Eduardo Luis about 9 years ago
I can't really troubleshoot this on IRC now, but you should consider that increasing the allowed clock drift value will do nothing but mask any issues you may have. That option allows the monitor to report a warning IF there is a clock drift that is considered harmful; increasing it does not make the monitor any more tolerant of clock drift.
In any case, as I suspected, the culprit seems to be the paxos lease_timeout being triggered:
2015-04-02 22:51:24.331736 7f747f4df700 1 mon.pouta-s03@2(peon).paxos(paxos active c 1474627..1475338) lease_timeout -- calling new election
You probably want to try increasing 'mon_lease' and 'mon_lease_ack_timeout' instead of 'mon_clock_drift_allowed'. :)
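In ceph.conf terms, that suggestion would look something like this in the [mon] section (the values are illustrative starting points, not recommendations; the stock defaults are lower):

```ini
[mon]
    # extend the paxos lease, and how long the leader waits for lease acks
    mon lease = 20
    mon lease ack timeout = 20
```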
Updated by karan singh about 9 years ago
Thanks Joao
After adding
mon lease = 20
mon lease ack timeout = 20
At least the monitors are now forming quorum and things seem to be working. However, the cluster seems to respond slowly to ceph commands.
Ceph commands like ceph -s are taking 40-50 seconds to complete.
Also, some of the OSDs are down, and when I try to bring them up they throw timeout errors because the cluster is responding so slowly.
[root@pouta-s04 ~]# service ceph start
=== osd.95 ===
failed: 'timeout 30 /usr/bin/ceph -c /etc/ceph/ceph.conf --name=osd.95 --keyring=/var/lib/ceph/osd/ceph-95/keyring osd crush create-or-move -- 95 3.63 host=pouta-s04 root=default'
[root@pouta-s04 ~]#
[root@pouta-s04 ~]# time ceph -s
    cluster 2bd3283d-67ef-4316-8b7e-d8f4747eae33
     health HEALTH_WARN 503 pgs degraded; 7 pgs down; 19 pgs incomplete; 3013 pgs peering; 5754 pgs stale; 3032 pgs stuck inactive; 3213 pgs stuck stale; 3535 pgs stuck unclean; 8 requests are blocked > 32 sec; recovery 11577/986841 objects degraded (1.173%); 339/328947 unfound (0.103%); 120/240 in osds are down; clock skew detected on mon.pouta-s02, mon.pouta-s03
     monmap e3: 3 mons at {pouta-s01=10.100.50.1:6789/0,pouta-s02=10.100.50.2:6789/0,pouta-s03=10.100.50.3:6789/0}, election epoch 3938, quorum 0,1,2 pouta-s01,pouta-s02,pouta-s03
     osdmap e262854: 240 osds: 120 up, 240 in
      pgmap v598497: 18432 pgs, 13 pools, 2339 GB data, 321 kobjects
            19054 GB used, 853 TB / 871 TB avail
            11577/986841 objects degraded (1.173%); 339/328947 unfound (0.103%)
                  42 stale+peering
                5708 stale+active+clean
                 503 active+degraded
                   3 down+incomplete
                2970 peering
                   1 stale+incomplete
                9189 active+clean
                   1 down+peering
                  12 incomplete
                   3 stale+down+incomplete

real    0m50.211s
user    0m0.293s
sys     0m0.056s
[root@pouta-s04 ~]#
Is there anything I can set in ceph.conf to make the cluster respond as quickly as it did before this problem appeared, so that the down OSDs can be started?
Since you are the expert, could you please tell me what made the monitors throw the "paxos(paxos active c 1474627..1475338) lease_timeout -- calling new election"
messages? I have not changed anything on the monitors, and this problem appeared only after marking an OSD lost.
The default values of lease and lease_ack_timeout should work, since they did before this problem appeared.
Thanks a ton for your help.
Updated by karan singh about 9 years ago
Hi
Sorry, I just checked again and the quorum-loss problem is still there. Here are the logs after adding
mon lease = 30
mon lease ack timeout = 20
Monitor1 : http://paste.ubuntu.com/10727076/
Monitor2 : http://paste.ubuntu.com/10727084/
It takes 1m 35s to get responses to ceph commands.
[root@pouta-s04 ~]# time ceph -s
    cluster 2bd3283d-67ef-4316-8b7e-d8f4747eae33
     health HEALTH_WARN 2480 pgs degraded; 8 pgs down; 18 pgs incomplete; 2728 pgs peering; 6 pgs recovering; 1 pgs recovery_wait; 8909 pgs stale; 2754 pgs stuck inactive; 8909 pgs stuck stale; 5367 pgs stuck unclean; 9 requests are blocked > 32 sec; recovery 78400/963132 objects degraded (8.140%); 300/321044 unfound (0.093%); 64/152 in osds are down; nodown,noout,norecover flag(s) set; clock skew detected on mon.pouta-s02
     monmap e3: 3 mons at {pouta-s01=10.100.50.1:6789/0,pouta-s02=10.100.50.2:6789/0,pouta-s03=10.100.50.3:6789/0}, election epoch 4052, quorum 0,1,2 pouta-s01,pouta-s02,pouta-s03
     osdmap e262887: 240 osds: 88 up, 152 in
            flags nodown,noout,norecover
      pgmap v598538: 18432 pgs, 13 pools, 2282 GB data, 313 kobjects
            12405 GB used, 539 TB / 552 TB avail
            78400/963132 objects degraded (8.140%); 300/321044 unfound (0.093%)
                   3 active+recovering+remapped
                   8 inactive
                8878 stale+active+clean
                   2 down+incomplete
                2650 peering
                   7 active
                 154 active+degraded+remapped
                  12 stale+incomplete
                4187 active+clean
                2322 active+degraded
                   1 active+recovery_wait+degraded
                  61 remapped+peering
                 123 active+remapped
                   2 down+peering
                   4 stale+down+incomplete
                  15 stale+peering
                   3 active+recovering+degraded

real    1m35.404s
user    0m0.344s
sys     0m0.083s
[root@pouta-s04 ~]#
Updated by Sage Weil about 9 years ago
- Priority changed from Immediate to Urgent
- You'll need to clear nodown to tell whether PGs aren't peering because of OSDs that are not running.
- Run 'perf top' on a mon. If you see lots of time spent in Snappy decompression, disable leveldb compression and force a compaction of the mon store (ceph-mon --compact ...).
- Otherwise, we need mon logs to see why the mons are slow to respond.
- Run ceph pg $pgid query on the down or incomplete PGs to see why. Usually ceph osd pool set $pool min_size 1 is all that is needed to let them recover (be sure to set it back to 2 when the cluster is healthy again).
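The checklist above could look roughly like this on the command line (a sketch only: $pgid and $pool are placeholders, the mon id is illustrative, and ceph tell mon.&lt;id&gt; compact is one way to trigger the store compaction; all of this assumes a reachable cluster):

```shell
# 1. Let dead OSDs actually be marked down so peering can make progress
ceph osd unset nodown

# 2. On a monitor host, watch where CPU time goes while commands are slow
perf top

# 3. If Snappy decompression dominates, compact the monitor store
#    (online variant; the mon id is illustrative)
ceph tell mon.pouta-s01 compact

# 4. Inspect a stuck PG, then temporarily relax min_size if appropriate
ceph pg $pgid query
ceph osd pool set $pool min_size 1
# ...and restore it once the cluster is healthy again
ceph osd pool set $pool min_size 2
```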
Updated by karan singh about 9 years ago
Hi Sage
The incomplete PGs are the original issue, and I was troubleshooting them. I intentionally marked an OSD as lost, and then suddenly, BOOM, all the monitors fell out of quorum.
First I need to fix the monitor quorum / slow response problem; then I can concentrate on the incomplete PG issue.
The monitors are taking a hell of a long time, 110 seconds, to respond to ceph -s.
[root@pouta-s04 ~]# time ceph -s
    cluster 2bd3283d-67ef-4316-8b7e-d8f4747eae33
     health HEALTH_WARN 3233 pgs degraded; 12 pgs down; 10 pgs incomplete; 5334 pgs peering; 73 pgs recovering; 574 pgs recovery_wait; 4518 pgs stale; 6055 pgs stuck inactive; 4518 pgs stuck stale; 10070 pgs stuck unclean; 12 requests are blocked > 32 sec; recovery 152451/930087 objects degraded (16.391%); 337/310029 unfound (0.109%); 47/177 in osds are down; nodown,noout,norecover flag(s) set; 1 mons down, quorum 0,2 pouta-s01,pouta-s03
     monmap e3: 3 mons at {pouta-s01=10.100.50.1:6789/0,pouta-s02=10.100.50.2:6789/0,pouta-s03=10.100.50.3:6789/0}, election epoch 4164, quorum 0,2 pouta-s01,pouta-s03
     osdmap e262984: 240 osds: 130 up, 177 in
            flags nodown,noout,norecover
      pgmap v599316: 19456 pgs, 14 pools, 2202 GB data, 302 kobjects
            14558 GB used, 628 TB / 642 TB avail
            152451/930087 objects degraded (16.391%); 337/310029 unfound (0.109%)
                 108 inactive
                 535 creating
                  20 active
                 162 active+recovery_wait+remapped
                4878 active+clean
                   3 stale+incomplete
                5134 peering
                   6 active+recovering+remapped
                 136 active+recovery_wait+degraded
                 184 active+degraded+remapped
                   2 incomplete
                  45 active+recovering+degraded
                   2 stale+down+incomplete
                   7 down+peering
                  68 remapped
                 202 active+recovery_wait
                 379 active+remapped
                  13 active+recovering
                 188 remapped+peering
                   3 down+incomplete
                2785 active+degraded
                  74 active+recovery_wait+degraded+remapped
                4508 stale+active+clean
                   5 stale+peering
                   9 active+recovering+degraded+remapped

real    1m46.120s
user    0m0.388s
sys     0m0.102s
[root@pouta-s04 ~]#

Yesterday I added the entries below to ceph.conf; the monitors started to form quorum but keep responding very slowly.
mon lease = 50
mon lease renew interval = 30
mon lease ack timeout = 100
Below are logs (debug mon / paxos 10) showing that the monitor reaches the leader state, then goes probing --> electing, and after a while repeats the cycle.
Full logs here : https://www.dropbox.com/s/fjtun72d66n46gn/ceph-mon.pouta-s01.log?dl=0
Logs extracted below.
2015-04-07 16:57:15.363270 7f339ca33700 10 mon.pouta-s01@0(leader).paxosservice(osdmap 254763..262972) dispatch osd_failure(failed osd.235 10.100.50.3:6880/1051330 for 1sec e262972 v262972) v3 from osd.69 10.100.50.4:6800/39385
2015-04-07 16:57:15.363282 7f339ca33700 5 mon.pouta-s01@0(leader).paxos(paxos updating c 1477137..1477789) is_readable now=2015-04-07 16:57:15.363283 lease_expire=2015-04-07 16:57:34.723796 has v0 lc 1477789
2015-04-07 16:57:15.363292 7f339ca33700 10 mon.pouta-s01@0(leader).osd e262972 preprocess_query osd_failure(failed osd.235 10.100.50.3:6880/1051330 for 1sec e262972 v262972) v3 from osd.69 10.100.50.4:6800/39385
2015-04-07 16:57:15.363304 7f339ca33700 5 mon.pouta-s01@0(leader).osd e262972 can_mark_down NODOWN flag set, will not mark osd.235 down
2015-04-07 16:57:15.363307 7f339ca33700 5 mon.pouta-s01@0(leader).osd e262972 preprocess_failure ignoring report of osd.235 10.100.50.3:6880/1051330 from osd.69 10.100.50.4:6800/39385
2015-04-07 16:57:15.363368 7f339ca33700 10 mon.pouta-s01@0(leader).paxosservice(osdmap 254763..262972) dispatch osd_failure(failed osd.237 10.100.50.3:6895/1053322 for 1sec e262972 v262972) v3 from osd.69 10.100.50.4:6800/39385
2015-04-07 16:57:15.363379 7f339ca33700 5 mon.pouta-s01@0(leader).paxos(paxos updating c 1477137..1477789) is_readable now=2015-04-07 16:57:15.363380 lease_expire=2015-04-07 16:57:34.723796 has v0 lc 1477789
2015-04-07 16:57:15.363389 7f339ca33700 10 mon.pouta-s01@0(leader).osd e262972 preprocess_query osd_failure(failed osd.237 10.100.50.3:6895/1053322 for 1sec e262972 v262972) v3 from osd.69 10.100.50.4:6800/39385
2015-04-07 16:57:15.363426 7f339ca33700 5 mon.pouta-s01@0(leader).osd e262972 can_mark_down NODOWN flag set, will not mark osd.237 down
2015-04-07 16:57:15.363429 7f339ca33700 5 mon.pouta-s01@0(leader).osd e262972 preprocess_failure ignoring report of osd.237 10.100.50.3:6895/1053322 from osd.69 10.100.50.4:6800/39385
2015-04-07 16:57:15.363441 7f339ca33700 10 mon.pouta-s01@0(leader).paxosservice(pgmap 598691..599303) dispatch pg_stats(0 pgs tid 2037 v 0) v1 from osd.69 10.100.50.4:6800/39385
2015-04-07 16:57:15.363449 7f339ca33700 5 mon.pouta-s01@0(leader).paxos(paxos updating c 1477137..1477789) is_readable now=2015-04-07 16:57:15.363450 lease_expire=2015-04-07 16:57:34.723796 has v0 lc 1477789
2015-04-07 16:57:15.363459 7f339ca33700 10 mon.pouta-s01@0(leader).pg v599303 preprocess_query pg_stats(0 pgs tid 2037 v 0) v1 from osd.69 10.100.50.4:6800/39385
2015-04-07 16:57:15.363468 7f339ca33700 5 mon.pouta-s01@0(leader).paxos(paxos updating c 1477137..1477789) is_readable now=2015-04-07 16:57:15.363469 lease_expire=2015-04-07 16:57:34.723796 has v0 lc 1477789
2015-04-07 16:57:15.363477 7f339ca33700 10 mon.pouta-s01@0(leader).pg v599303 prepare_update pg_stats(0 pgs tid 2037 v 0) v1 from osd.69 10.100.50.4:6800/39385
2015-04-07 16:57:15.363484 7f339ca33700 10 mon.pouta-s01@0(leader).pg v599303 prepare_pg_stats pg_stats(0 pgs tid 2037 v 0) v1 from osd.69
2015-04-07 16:57:15.363499 7f339ca33700 10 mon.pouta-s01@0(leader).pg v599303 message contains no new osd|pg stats
2015-04-07 16:57:15.363522 7f339ca33700 10 mon.pouta-s01@0(leader) e3 handle_subscribe mon_subscribe({monmap=4+,osd_pg_creates=0}) v2
2015-04-07 16:57:15.363532 7f339ca33700 10 mon.pouta-s01@0(leader) e3 check_sub monmap next 4 have 3
2015-04-07 16:57:15.380448 7f339e495700 1 mon.pouta-s01@0(leader).paxos(paxos updating c 1477137..1477789) accept timeout, calling fresh election
2015-04-07 16:57:15.380483 7f339e495700 10 mon.pouta-s01@0(leader) e3 bootstrap
2015-04-07 16:57:15.380488 7f339e495700 10 mon.pouta-s01@0(leader) e3 sync_reset_requester
2015-04-07 16:57:15.380491 7f339e495700 10 mon.pouta-s01@0(leader) e3 unregister_cluster_logger
2015-04-07 16:57:15.380499 7f339e495700 10 mon.pouta-s01@0(leader) e3 cancel_probe_timeout (none scheduled)
2015-04-07 16:57:15.380503 7f339e495700 10 mon.pouta-s01@0(probing) e3 _reset
2015-04-07 16:57:15.380506 7f339e495700 10 mon.pouta-s01@0(probing) e3 cancel_probe_timeout (none scheduled)
2015-04-07 16:57:15.380508 7f339e495700 10 mon.pouta-s01@0(probing) e3 timecheck_finish
2015-04-07 16:57:15.380518 7f339e495700 10 mon.pouta-s01@0(probing) e3 scrub_reset
2015-04-07 16:57:15.380521 7f339e495700 10 mon.pouta-s01@0(probing).paxos(paxos updating c 1477137..1477789) restart -- canceling timeouts
2015-04-07 16:57:15.380566 7f339e495700 10 mon.pouta-s01@0(probing).paxosservice(pgmap 598691..599303) restart
2015-04-07 16:57:15.380589 7f339e495700 10 mon.pouta-s01@0(probing).paxosservice(mdsmap 1..1) restart
2015-04-07 16:57:15.380593 7f339e495700 10 mon.pouta-s01@0(probing).paxosservice(osdmap 254763..262972) restart
2015-04-07 16:57:15.380595 7f339e495700 10 mon.pouta-s01@0(probing).paxosservice(logm 599085..599743) restart
2015-04-07 16:57:15.380601 7f339e495700 7 mon.pouta-s01@0(probing).log v599743 _updated_log for mon.0 10.100.50.1:6789/0
2015-04-07 16:57:15.380637 7f339e495700 7 mon.pouta-s01@0(probing).log v599743 _updated_log for mon.0 10.100.50.1:6789/0
2015-04-07 16:57:15.380651 7f339e495700 7 mon.pouta-s01@0(probing).log v599743 _updated_log for mon.0 10.100.50.1:6789/0
2015-04-07 16:57:15.380662 7f339e495700 7 mon.pouta-s01@0(probing).log v599743 _updated_log for mon.0 10.100.50.1:6789/0
2015-04-07 16:57:15.380680 7f339e495700 7 mon.pouta-s01@0(probing).log v599743 _updated_log for mon.0 10.100.50.1:6789/0
2015-04-07 16:57:15.380713 7f339e495700 7 mon.pouta-s01@0(probing).log v599743 _updated_log for mon.0 10.100.50.1:6789/0
2015-04-07 16:57:15.380724 7f339e495700 7 mon.pouta-s01@0(probing).log v599743 _updated_log for mon.0 10.100.50.1:6789/0
2015-04-07 16:57:15.380733 7f339e495700 7 mon.pouta-s01@0(probing).log v599743 _updated_log for mon.0 10.100.50.1:6789/0
2015-04-07 16:57:15.380744 7f339e495700 7 mon.pouta-s01@0(probing).log v599743 _updated_log for mon.2 10.100.50.3:6789/0
2015-04-07 16:57:15.380773 7f339e495700 7 mon.pouta-s01@0(probing).log v599743 _updated_log for mon.0 10.100.50.1:6789/0
2015-04-07 16:57:15.380785 7f339e495700 7 mon.pouta-s01@0(probing).log v599743 _updated_log for mon.0 10.100.50.1:6789/0
2015-04-07 16:57:15.380796 7f339e495700 10 mon.pouta-s01@0(probing).paxosservice(monmap 1..3) restart
2015-04-07 16:57:15.380800 7f339e495700 10 mon.pouta-s01@0(probing).paxosservice(auth 4501..4587) restart
2015-04-07 16:57:15.380807 7f339e495700 10 mon.pouta-s01@0(probing) e3 cancel_probe_timeout (none scheduled)
2015-04-07 16:57:15.380812 7f339e495700 10 mon.pouta-s01@0(probing) e3 reset_probe_timeout 0x367d380 after 2 seconds
2015-04-07 16:57:15.380827 7f339e495700 10 mon.pouta-s01@0(probing) e3 probing other monitors
2015-04-07 16:57:15.381949 7f339ca33700 10 mon.pouta-s01@0(probing) e3 handle_probe mon_probe(reply 2bd3283d-67ef-4316-8b7e-d8f4747eae33 name pouta-s03 quorum 0,1,2 paxos( fc 1477137 lc 1477789 )) v6
2015-04-07 16:57:15.381985 7f339ca33700 10 mon.pouta-s01@0(probing) e3 handle_probe_reply mon.2 10.100.50.3:6789/0mon_probe(reply 2bd3283d-67ef-4316-8b7e-d8f4747eae33 name pouta-s03 quorum 0,1,2 paxos( fc 1477137 lc 1477789 )) v6
2015-04-07 16:57:15.382003 7f339ca33700 10 mon.pouta-s01@0(probing) e3 monmap is e3: 3 mons at {pouta-s01=10.100.50.1:6789/0,pouta-s02=10.100.50.2:6789/0,pouta-s03=10.100.50.3:6789/0}
2015-04-07 16:57:15.382089 7f339ca33700 10 mon.pouta-s01@0(probing) e3 peer name is pouta-s03
2015-04-07 16:57:15.382094 7f339ca33700 10 mon.pouta-s01@0(probing) e3 existing quorum 0,1,2
2015-04-07 16:57:15.382097 7f339ca33700 10 mon.pouta-s01@0(probing) e3 peer paxos version 1477789 vs my version 1477789 (ok)
2015-04-07 16:57:15.382103 7f339ca33700 10 mon.pouta-s01@0(probing) e3 start_election
2015-04-07 16:57:15.382106 7f339ca33700 10 mon.pouta-s01@0(electing) e3 _reset
2015-04-07 16:57:15.382109 7f339ca33700 10 mon.pouta-s01@0(electing) e3 cancel_probe_timeout 0x367d380
2015-04-07 16:57:15.382115 7f339ca33700 10 mon.pouta-s01@0(electing) e3 timecheck_finish
2015-04-07 16:57:15.382120 7f339ca33700 10 mon.pouta-s01@0(electing) e3 scrub_reset
2015-04-07 16:57:15.382123 7f339ca33700 10 mon.pouta-s01@0(electing).paxos(paxos recovering c 1477137..1477789) restart -- canceling timeouts
2015-04-07 16:57:15.382130 7f339ca33700 10 mon.pouta-s01@0(electing).paxosservice(pgmap 598691..599303) restart
2015-04-07 16:57:15.382135 7f339ca33700 10 mon.pouta-s01@0(electing).paxosservice(mdsmap 1..1) restart
2015-04-07 16:57:15.382139 7f339ca33700 10 mon.pouta-s01@0(electing).paxosservice(osdmap 254763..262972) restart
2015-04-07 16:57:15.382149 7f339ca33700 10 mon.pouta-s01@0(electing).paxosservice(logm 599085..599743) restart
2015-04-07 16:57:15.382152 7f339ca33700 10 mon.pouta-s01@0(electing).paxosservice(monmap 1..3) restart
2015-04-07 16:57:15.382154 7f339ca33700 10 mon.pouta-s01@0(electing).paxosservice(auth 4501..4587) restart
2015-04-07 16:57:15.382160 7f339ca33700 10 mon.pouta-s01@0(electing) e3 cancel_probe_timeout (none scheduled)
2015-04-07 16:57:15.382171 7f339ca33700 0 log [INF] : mon.pouta-s01 calling new monitor election
2015-04-07 16:57:15.382185 7f339ca33700 5 mon.pouta-s01@0(electing).elector(4146) start -- can i be leader?
2015-04-07 16:57:15.382363 7f339ca33700 1 mon.pouta-s01@0(electing).elector(4146) init, last seen epoch 4146
2015-04-07 16:57:15.382371 7f339ca33700 10 mon.pouta-s01@0(electing).elector(4146) bump_epoch 4146 to 4147
2015-04-07 16:57:15.384198 7f339ca33700 10 mon.pouta-s01@0(electing) e3 join_election
2015-04-07 16:57:15.384217 7f339ca33700 10 mon.pouta-s01@0(electing) e3 _reset
2015-04-07 16:57:15.384221 7f339ca33700 10 mon.pouta-s01@0(electing) e3 cancel_probe_timeout (none scheduled)
2015-04-07 16:57:15.384224 7f339ca33700 10 mon.pouta-s01@0(electing) e3 timecheck_finish
2015-04-07 16:57:15.384226 7f339ca33700 10 mon.pouta-s01@0(electing) e3 scrub_reset
2015-04-07 16:57:15.384229 7f339ca33700 10 mon.pouta-s01@0(electing).paxos(paxos recovering c 1477137..1477789) restart -- canceling timeouts
2015-04-07 16:57:15.384237 7f339ca33700 10 mon.pouta-s01@0(electing).paxosservice(pgmap 598691..599303) restart
2015-04-07 16:57:15.384241 7f339ca33700 10 mon.pouta-s01@0(electing).paxosservice(mdsmap 1..1) restart
2015-04-07 16:57:15.384244 7f339ca33700 10 mon.pouta-s01@0(electing).paxosservice(osdmap 254763..262972) restart
2015-04-07 16:57:15.384246 7f339ca33700 10 mon.pouta-s01@0(electing).paxosservice(logm 599085..599743) restart
2015-04-07 16:57:15.384248 7f339ca33700 10 mon.pouta-s01@0(electing).paxosservice(monmap 1..3) restart
2015-04-07 16:57:15.384250 7f339ca33700 10 mon.pouta-s01@0(electing).paxosservice(auth 4501..4587) restart
2015-04-07 16:57:15.384316 7f339ca33700 10 mon.pouta-s01@0(electing).paxosservice(logm 599085..599743) dispatch log(1 entries) v1 from mon.0 10.100.50.1:6789/0
When I leave the cluster untouched for a few hours, the monitors respond almost instantly (ceph -s output in under 1 sec); then, as soon as I start OSDs, the cluster starts responding dead slow and the monitors again cycle through probing --> electing --> leader.
Updated by Sage Weil about 9 years ago
- Status changed from New to Can't reproduce
2015-04-07 16:57:15.380448 7f339e495700 1 mon.pouta-s01@0(leader).paxos(paxos updating c 1477137..1477789) accept timeout, calling fresh election
is the key line. One of the peons is taking too long to commit things. Check 'perf top' on the slow peon and see why it isn't responding quickly enough. In the past I've seen this triggered by Snappy decompression in leveldb...
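If Snappy decompression does show up in 'perf top', the leveldb-related knobs implied here would look roughly like this in ceph.conf (a sketch; option names as used in this release line, so verify them against your version before relying on them):

```ini
[mon]
    # stop leveldb from compressing the mon store with Snappy
    leveldb compression = false
    # compact the store each time the monitor starts
    mon compact on start = true
```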