Bug #7626

closed

After updating ceph from 0.75 to 0.77 one of the three monitors can't start

Added by Jasper Siero about 10 years ago. Updated about 10 years ago.

Status:
Closed
Priority:
Urgent
Category:
Monitor
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We are running Ceph 0.77 on CentOS 6.5.

After updating to 0.77 one monitor can't start:

cluster 554b69c8-bdca-4d63-b45c-f5fd16b5a836
health HEALTH_WARN 1 mons down, quorum 0,2 ceph-mon01,ceph-mon03
monmap e27: 3 mons at {ceph-mon01=10.1.2.1:6789/0,ceph-mon02=10.1.2.2:6789/0,ceph-mon03=10.1.2.3:6789/0}, election epoch 1718, quorum 0,2 ceph-mon01,ceph-mon03
mdsmap e5058: 1/1/1 up {0=ceph-mon01=up:active}, 1 up:standby
osdmap e9467: 12 osds: 12 up, 12 in
pgmap v351412: 992 pgs, 5 pools, 26478 MB data, 392 kobjects
58538 MB used, 290 GB / 347 GB avail
992 active+clean
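
The monmap line above lists three monitors but only two in quorum. As a minimal sketch (the helper name `mons_out_of_quorum` is hypothetical and the parsing assumes the exact line layout shown in this status output), the missing monitor can be picked out like this:

```python
import re

def mons_out_of_quorum(monmap_line):
    """Return monitor names listed in the monmap but absent from the quorum."""
    # Entries inside the {...} block look like name=ip:port/nonce.
    members = re.search(r"\{(.*?)\}", monmap_line).group(1)
    all_mons = [entry.split("=", 1)[0] for entry in members.split(",")]
    # After "quorum <ranks> " comes a comma-separated list of monitor names.
    quorum = re.search(r"quorum [\d,]+ (\S+)", monmap_line).group(1).split(",")
    return [name for name in all_mons if name not in quorum]

line = ("monmap e27: 3 mons at {ceph-mon01=10.1.2.1:6789/0,"
        "ceph-mon02=10.1.2.2:6789/0,ceph-mon03=10.1.2.3:6789/0}, "
        "election epoch 1718, quorum 0,2 ceph-mon01,ceph-mon03")
missing = mons_out_of_quorum(line)
print(missing)
```

For the line quoted in this report, the only monitor outside the quorum is ceph-mon02.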

service ceph start mon
=== mon.ceph-mon02 ===
Starting Ceph mon.ceph-mon02 on ceph-mon02...
[1327]: (33) Numerical argument out of domain
failed: 'ulimit -n 32768; /usr/bin/ceph-mon -i ceph-mon02 --pid-file /var/run/ceph/mon.ceph-mon02.pid -c /etc/ceph/ceph.conf '
Starting ceph-create-keys on ceph-mon02...
=== mon.ceph-mon02 ===
Starting Ceph mon.ceph-mon02 on ceph-mon02...
[1417]: (33) Numerical argument out of domain
failed: 'ulimit -n 32768; /usr/bin/ceph-mon -i ceph-mon02 --pid-file /var/run/ceph/mon.ceph-mon02.pid -c /etc/ceph/ceph.conf '
Starting ceph-create-keys on ceph-mon02...
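
The "(33)" in the output above looks like a standard errno value. As a small sketch, assuming the number is a plain POSIX errno as reported on Linux, it can be looked up with Python's standard library:

```python
import errno
import os

code = 33  # the number printed in parentheses in the startup output
name = errno.errorcode.get(code, "unknown")
# os.strerror returns the platform's message for this errno value.
print(name, "-", os.strerror(code))
```

On Linux this maps to EDOM, whose message matches the "Numerical argument out of domain" text in the output. The errno alone does not say what failed inside ceph-mon, which is why the monitor log is needed.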

I attached the log of the failing monitor to this issue: ceph-mon.ceph-mon02.log


Files

ceph-mon.ceph-mon02.log (13.6 KB) Jasper Siero, 03/06/2014 03:27 AM
ceph-mon.ceph-mon02.log-20140313 (20.3 KB) log after adding debug mon = 10 to the ceph.conf, Jasper Siero, 03/13/2014 05:09 AM
ceph-ceph-mon02.tar.gz (7.43 MB) Jasper Siero, 03/14/2014 02:28 AM
Actions #1

Updated by Samuel Just about 10 years ago

  • Priority changed from High to Urgent
Actions #2

Updated by Joao Eduardo Luis about 10 years ago

Can you please rerun the monitor with 'debug mon = 10' and attach the resulting log?

Actions #3

Updated by Sage Weil about 10 years ago

  • Status changed from New to Need More Info
Actions #4

Updated by Jasper Siero about 10 years ago

I submitted the new log with debug mon = 10 added to the ceph.conf.
The two processes below also keep running after I tried to start the monitor:
root 26185 0.1 0.1 51236 5792 pts/0 S 13:04 0:00 python /usr/sbin/ceph-create-keys -i ceph-mon02
root 26997 0.1 0.1 51236 5792 pts/0 S 13:05 0:00 python /usr/sbin/ceph-create-keys -i ceph-mon02

Actions #5

Updated by Sage Weil about 10 years ago

Jasper Siero wrote:

I submitted the new log with debug mon = 10 added to the ceph.conf.
The two processes below also keep running after I tried to start the monitor:
root 26185 0.1 0.1 51236 5792 pts/0 S 13:04 0:00 python /usr/sbin/ceph-create-keys -i ceph-mon02
root 26997 0.1 0.1 51236 5792 pts/0 S 13:05 0:00 python /usr/sbin/ceph-create-keys -i ceph-mon02

Can you attach a tarball of the mon data directory? (/var/lib/ceph/mon/*). If the cluster is on a public network (it shouldn't be!) use ceph-post-file instead of attaching it to the bug. Thanks!

Actions #7

Updated by Ian Colle about 10 years ago

  • Status changed from Need More Info to New
  • Target version deleted (v0.77)
Actions #8

Updated by Sage Weil about 10 years ago

  • Status changed from New to 12
  • Source changed from other to Community (user)
Actions #9

Updated by Joao Eduardo Luis about 10 years ago

  • Status changed from 12 to In Progress
  • Assignee set to Joao Eduardo Luis
Actions #10

Updated by Joao Eduardo Luis about 10 years ago

The store attached to the ticket shows that the latest 7 full osdmaps cannot be decoded, which would explain why the monitor is unable to start. OSDMaps prior to that decode just fine, and so do all incrementals.

I also noticed that the latest osdmap in this monitor is off by 400 versions from the one initially posted in the 'ceph -s' output in the ticket's description. I wondered whether this was a sync gone wrong, but that seems unlikely: the mon store shows no evidence of a store sync being in progress or interrupted.

It is not yet clear why the maps won't decode.

Actions #11

Updated by Sage Weil about 10 years ago

This sounds like it could be 14ea8157eb2883b9f53c234044fe002153212ef8

Actions #12

Updated by Sage Weil about 10 years ago

Yes, I'm pretty sure it is. This bug affected 0.77 and was fixed for 0.78. If I remember correctly, the full osdmaps are intact on the leader and healthy peons, but any mon that caught up via sync is affected, which matches this case. I think this one mon just needs to be blown away.

(And, upgrade to 0.78, where this bug is fixed!)
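
A rough sketch of what blowing away and recreating the one monitor might look like, following Ceph's documented manual procedure for removing and adding a monitor. The ticket does not record the exact commands that were used, so the sequence, the temporary file paths, and the helper name `rebuild_mon_plan` are assumptions. The code only prints the plan and executes nothing:

```python
import shlex

def rebuild_mon_plan(mon_id, addr, data_dir, tmp="/tmp"):
    """Build (but do not run) a command sequence to recreate one monitor."""
    keyring = f"{tmp}/mon.keyring"   # hypothetical scratch path
    monmap = f"{tmp}/monmap"         # hypothetical scratch path
    return [
        ["service", "ceph", "stop", f"mon.{mon_id}"],
        # Drop the broken monitor from the cluster's monmap.
        ["ceph", "mon", "remove", mon_id],
        # Keep the old store aside instead of deleting it outright.
        ["mv", data_dir, data_dir + ".broken"],
        ["mkdir", data_dir],
        # Fetch the mon keyring and current monmap from the healthy quorum.
        ["ceph", "auth", "get", "mon.", "-o", keyring],
        ["ceph", "mon", "getmap", "-o", monmap],
        # Initialise a fresh store for the monitor.
        ["ceph-mon", "-i", mon_id, "--mkfs",
         "--monmap", monmap, "--keyring", keyring],
        ["ceph", "mon", "add", mon_id, addr],
        ["service", "ceph", "start", f"mon.{mon_id}"],
    ]

plan = rebuild_mon_plan("ceph-mon02", "10.1.2.2:6789",
                        "/var/lib/ceph/mon/ceph-ceph-mon02")
for argv in plan:
    print(shlex.join(argv))
```

The monitor name and address come from the monmap in the description; the data directory assumes the default mon data location. Check each step against the documentation for the installed release before running anything.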

Joao, if that jibes with what you saw, let's close this as resolved!

Actions #13

Updated by Jasper Siero about 10 years ago

I updated all nodes to 0.88 and removed the monitor and created a new one

Actions #14

Updated by Jasper Siero about 10 years ago

Jasper Siero wrote:

I updated all nodes to 0.78 (not 0.88) and removed the monitor and created a new one ;-)

Actions #15

Updated by Joao Eduardo Luis about 10 years ago

  • Status changed from In Progress to Closed

As far as I can tell, Sage is right. Nothing else seems off. Closing the ticket.
