Bug #7626 (Closed)
After updating ceph from 0.75 to 0.77 one of the three monitors can't start
Description
We are using CentOS 6.5 and Ceph 0.77.
After updating to 0.77, one monitor can't start:
cluster 554b69c8-bdca-4d63-b45c-f5fd16b5a836
health HEALTH_WARN 1 mons down, quorum 0,2 ceph-mon01,ceph-mon03
monmap e27: 3 mons at {ceph-mon01=10.1.2.1:6789/0,ceph-mon02=10.1.2.2:6789/0,ceph-mon03=10.1.2.3:6789/0}, election epoch 1718, quorum 0,2 ceph-mon01,ceph-mon03
mdsmap e5058: 1/1/1 up {0=ceph-mon01=up:active}, 1 up:standby
osdmap e9467: 12 osds: 12 up, 12 in
pgmap v351412: 992 pgs, 5 pools, 26478 MB data, 392 kobjects
58538 MB used, 290 GB / 347 GB avail
992 active+clean
service ceph start mon
=== mon.ceph-mon02 ===
Starting Ceph mon.ceph-mon02 on ceph-mon02...
[1327]: (33) Numerical argument out of domain
failed: 'ulimit -n 32768; /usr/bin/ceph-mon -i ceph-mon02 --pid-file /var/run/ceph/mon.ceph-mon02.pid -c /etc/ceph/ceph.conf '
Starting ceph-create-keys on ceph-mon02...
=== mon.ceph-mon02 ===
Starting Ceph mon.ceph-mon02 on ceph-mon02...
[1417]: (33) Numerical argument out of domain
failed: 'ulimit -n 32768; /usr/bin/ceph-mon -i ceph-mon02 --pid-file /var/run/ceph/mon.ceph-mon02.pid -c /etc/ceph/ceph.conf '
Starting ceph-create-keys on ceph-mon02...
I attached the log of the problematic monitor to this issue: ceph-mon.ceph-mon02.log
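The "(33) Numerical argument out of domain" in the output above is simply errno 33 (EDOM on Linux) rendered through strerror(); a quick way to confirm the mapping, assuming a Linux host with python3 available:

```shell
# errno 33 on Linux is EDOM; this is the same string strerror(3) produces,
# and matches what the init script printed when ceph-mon exited.
python3 -c 'import errno, os; print(errno.errorcode[33], "-", os.strerror(33))'
```

This only explains how the message was produced, not why ceph-mon exited with it.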
Updated by Joao Eduardo Luis about 10 years ago
Can you please rerun the monitor with 'debug mon = 10' and attach the resulting log?
Updated by Sage Weil about 10 years ago
- Status changed from New to Need More Info
Updated by Jasper Siero about 10 years ago
I submitted the new log with 'debug mon = 10' added to ceph.conf.
The two processes below also keep running after I tried to start the monitor:
root 26185 0.1 0.1 51236 5792 pts/0 S 13:04 0:00 python /usr/sbin/ceph-create-keys -i ceph-mon02
root 26997 0.1 0.1 51236 5792 pts/0 S 13:05 0:00 python /usr/sbin/ceph-create-keys -i ceph-mon02
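ceph-create-keys loops waiting for its monitor to come up and join quorum, which is why a new copy lingers after every failed start attempt. A small sketch for spotting (and, once the cause is understood, stopping) the leftovers; the `pkill` line is deliberately left commented out:

```shell
# List leftover ceph-create-keys processes; the [c] in the pattern keeps
# grep from matching its own command line in the ps output.
ps aux | grep '[c]eph-create-keys' || echo "no leftover key-creation processes"
# pkill -f ceph-create-keys   # cleanup, once the failing mon is dealt with
```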
Updated by Sage Weil about 10 years ago
Jasper Siero wrote:
I submitted the new log with 'debug mon = 10' added to ceph.conf.
The two processes below also keep running after I tried to start the monitor:
root 26185 0.1 0.1 51236 5792 pts/0 S 13:04 0:00 python /usr/sbin/ceph-create-keys -i ceph-mon02
root 26997 0.1 0.1 51236 5792 pts/0 S 13:05 0:00 python /usr/sbin/ceph-create-keys -i ceph-mon02
Can you attach a tarball of the mon data directory? (/var/lib/ceph/mon/*). If the cluster is on a public network (it shouldn't be!) use ceph-post-file instead of attaching it to the bug. Thanks!
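A minimal way to produce the requested tarball on the affected node; the path and archive name follow the ticket, and the guard only matters if the snippet is run on a host without that mon directory:

```shell
# Pack the mon data directory for attachment; as noted above, prefer
# ceph-post-file over a ticket attachment if the cluster is on a public network.
mon_dir=/var/lib/ceph/mon/ceph-ceph-mon02
if [ -d "$mon_dir" ]; then
    tar czf ceph-ceph-mon02.tar.gz -C "$(dirname "$mon_dir")" "$(basename "$mon_dir")"
else
    echo "mon data directory not present on this host"
fi
```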
Updated by Jasper Siero about 10 years ago
- File ceph-ceph-mon02.tar.gz added
Updated by Ian Colle about 10 years ago
- Status changed from Need More Info to New
- Target version deleted (v0.77)
Updated by Sage Weil about 10 years ago
- Status changed from New to 12
- Source changed from other to Community (user)
Updated by Joao Eduardo Luis about 10 years ago
- Status changed from 12 to In Progress
- Assignee set to Joao Eduardo Luis
Updated by Joao Eduardo Luis about 10 years ago
The store attached to the ticket shows that the latest 7 full osdmaps fail to decode, which would explain the monitor being unable to start. OSDMaps prior to those decode just fine, and so do all the incrementals.
I also noticed that the latest osdmap in this monitor is off by 400 versions from the one shown in the 'ceph -s' output in the ticket's description. I wondered whether this was a sync gone wrong, but that seems unlikely, as there's no evidence in the mon store of a sync being in progress or interrupted.
It is not yet clear why the maps won't decode.
Updated by Sage Weil about 10 years ago
This sounds like it could be commit 14ea8157eb2883b9f53c234044fe002153212ef8.
Updated by Sage Weil about 10 years ago
Yes, I'm pretty sure it is. This bug affected 0.77 and was fixed for 0.78. If I remember correctly, the full osdmaps are intact on the leader and on healthy peons, but any mon that caught up via sync is affected, which matches this case. I think this one mon just needs to be blown away.
(And, upgrade to 0.78, where this bug is fixed!)
Joao, if that jives with what you saw, let's close this as resolved!
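The "blown away" step maps onto the standard remove-and-re-add monitor procedure; a sketch follows, with the commands shown as comments rather than executed. Hostnames and store paths are taken from this ticket; the keyring path is a placeholder:

```shell
# Sketch only; run on ceph-mon02, after upgrading, with mon01/mon03 in quorum.
# ceph mon remove ceph-mon02                # drop the bad mon from the monmap
# rm -rf /var/lib/ceph/mon/ceph-ceph-mon02  # discard its damaged store
# ceph mon getmap -o /tmp/monmap            # fetch the current monmap
# ceph-mon -i ceph-mon02 --mkfs --monmap /tmp/monmap --keyring <mon-keyring>
# service ceph start mon.ceph-mon02
```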
Updated by Jasper Siero about 10 years ago
I updated all nodes to 0.88 and removed the monitor and created a new one
Updated by Jasper Siero about 10 years ago
Jasper Siero wrote:
I updated all nodes to 0.88 and removed the monitor and created a new one
Correction: that should be 0.78, not 0.88 ;-)
Updated by Joao Eduardo Luis about 10 years ago
- Status changed from In Progress to Closed
As far as I can tell, Sage is right. Nothing else seems off. Closing the ticket.