Bug #7626 (Closed)
After updating ceph from 0.75 to 0.77 one of the three monitors can't start
Description
We are using CentOS 6.5 and Ceph 0.77.
After updating to 0.77, one monitor can't start:
cluster 554b69c8-bdca-4d63-b45c-f5fd16b5a836
health HEALTH_WARN 1 mons down, quorum 0,2 ceph-mon01,ceph-mon03
monmap e27: 3 mons at {ceph-mon01=10.1.2.1:6789/0,ceph-mon02=10.1.2.2:6789/0,ceph-mon03=10.1.2.3:6789/0}, election epoch 1718, quorum 0,2 ceph-mon01,ceph-mon03
mdsmap e5058: 1/1/1 up {0=ceph-mon01=up:active}, 1 up:standby
osdmap e9467: 12 osds: 12 up, 12 in
pgmap v351412: 992 pgs, 5 pools, 26478 MB data, 392 kobjects
58538 MB used, 290 GB / 347 GB avail
992 active+clean
service ceph start mon
=== mon.ceph-mon02 ===
Starting Ceph mon.ceph-mon02 on ceph-mon02...
[1327]: (33) Numerical argument out of domain
failed: 'ulimit -n 32768; /usr/bin/ceph-mon -i ceph-mon02 --pid-file /var/run/ceph/mon.ceph-mon02.pid -c /etc/ceph/ceph.conf '
Starting ceph-create-keys on ceph-mon02...
=== mon.ceph-mon02 ===
Starting Ceph mon.ceph-mon02 on ceph-mon02...
[1417]: (33) Numerical argument out of domain
failed: 'ulimit -n 32768; /usr/bin/ceph-mon -i ceph-mon02 --pid-file /var/run/ceph/mon.ceph-mon02.pid -c /etc/ceph/ceph.conf '
Starting ceph-create-keys on ceph-mon02...
I attached the log of the problematic monitor to this issue: ceph-mon.ceph-mon02.log
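The "(33) Numerical argument out of domain" in the output above is simply errno 33 (EDOM on Linux) rendered through strerror(); a quick way to confirm the mapping, assuming a Linux host with python3 available:

```shell
# errno 33 on Linux is EDOM; this is the same string strerror(3) produces,
# and matches what the init script printed when ceph-mon exited.
python3 -c 'import errno, os; print(errno.errorcode[33], "-", os.strerror(33))'
```

This only explains how the message was produced, not why ceph-mon exited with it.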
Updated by Joao Eduardo Luis about 10 years ago
Can you please rerun the monitor with 'debug mon = 10' and attach the resulting log?
Updated by Sage Weil about 10 years ago
- Status changed from New to Need More Info
Updated by Jasper Siero about 10 years ago
I submitted the new log with 'debug mon = 10' added to ceph.conf.
The two processes below also keep running after I tried to start the monitor:
root 26185 0.1 0.1 51236 5792 pts/0 S 13:04 0:00 python /usr/sbin/ceph-create-keys -i ceph-mon02
root 26997 0.1 0.1 51236 5792 pts/0 S 13:05 0:00 python /usr/sbin/ceph-create-keys -i ceph-mon02
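ceph-create-keys loops waiting for its monitor to come up and join quorum, which is why a new copy lingers after every failed start attempt. A small sketch for spotting (and, once the cause is understood, stopping) the leftovers; the `pkill` line is deliberately left commented out:

```shell
# List leftover ceph-create-keys processes; the [c] in the pattern keeps
# grep from matching its own command line in the ps output.
ps aux | grep '[c]eph-create-keys' || echo "no leftover key-creation processes"
# pkill -f ceph-create-keys   # cleanup, once the failing mon is dealt with
```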
Updated by Sage Weil about 10 years ago
Jasper Siero wrote:
I submitted the new log with 'debug mon = 10' added to ceph.conf.
The two processes below also keep running after I tried to start the monitor:
root 26185 0.1 0.1 51236 5792 pts/0 S 13:04 0:00 python /usr/sbin/ceph-create-keys -i ceph-mon02
root 26997 0.1 0.1 51236 5792 pts/0 S 13:05 0:00 python /usr/sbin/ceph-create-keys -i ceph-mon02
Can you attach a tarball of the mon data directory? (/var/lib/ceph/mon/*). If the cluster is on a public network (it shouldn't be!) use ceph-post-file instead of attaching it to the bug. Thanks!
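A minimal way to produce the requested tarball on the affected node; the path and archive name follow the ticket, and the guard only matters if the snippet is run on a host without that mon directory:

```shell
# Pack the mon data directory for attachment; as noted above, prefer
# ceph-post-file over a ticket attachment if the cluster is on a public network.
mon_dir=/var/lib/ceph/mon/ceph-ceph-mon02
if [ -d "$mon_dir" ]; then
    tar czf ceph-ceph-mon02.tar.gz -C "$(dirname "$mon_dir")" "$(basename "$mon_dir")"
else
    echo "mon data directory not present on this host"
fi
```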
Updated by Jasper Siero about 10 years ago
- File ceph-ceph-mon02.tar.gz added
Updated by Ian Colle about 10 years ago
- Status changed from Need More Info to New
- Target version deleted (v0.77)
Updated by Sage Weil about 10 years ago
- Status changed from New to 12
- Source changed from other to Community (user)
Updated by Joao Eduardo Luis about 10 years ago
- Status changed from 12 to In Progress
- Assignee set to Joao Eduardo Luis
Updated by Joao Eduardo Luis about 10 years ago
The store attached to the ticket shows that the latest 7 full osdmaps fail to decode, which would explain the monitor being unable to start. OSDMaps prior to those decode just fine, and so do all the incrementals.
I also noticed that the latest osdmap in this monitor is off by 400 versions from the one shown in the 'ceph -s' output in the ticket's description. I wondered whether this was a sync gone wrong, but that seems unlikely, as there's no evidence in the mon store of a sync being in progress or interrupted.
It is not yet clear why the maps won't decode.
Updated by Sage Weil about 10 years ago
This sounds like it could be commit 14ea8157eb2883b9f53c234044fe002153212ef8.
Updated by Sage Weil about 10 years ago
Yes, I'm pretty sure it is. This bug affected 0.77 and was fixed for 0.78. If I remember correctly, the full osdmaps are intact on the leader and on healthy peons, but any mon that caught up via sync is affected, which matches this case. I think this one mon just needs to be blown away.
(And, upgrade to 0.78, where this bug is fixed!)
Joao, if that jives with what you saw, let's close this as resolved!
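The "blown away" step maps onto the standard remove-and-re-add monitor procedure; a sketch follows, with the commands shown as comments rather than executed. Hostnames and store paths are taken from this ticket; the keyring path is a placeholder:

```shell
# Sketch only; run on ceph-mon02, after upgrading, with mon01/mon03 in quorum.
# ceph mon remove ceph-mon02                # drop the bad mon from the monmap
# rm -rf /var/lib/ceph/mon/ceph-ceph-mon02  # discard its damaged store
# ceph mon getmap -o /tmp/monmap            # fetch the current monmap
# ceph-mon -i ceph-mon02 --mkfs --monmap /tmp/monmap --keyring <mon-keyring>
# service ceph start mon.ceph-mon02
```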
Updated by Jasper Siero about 10 years ago
I updated all nodes to 0.88 and removed the monitor and created a new one
Updated by Jasper Siero about 10 years ago
Jasper Siero wrote:
I updated all nodes to 0.88 and removed the monitor and created a new one
Correction: that should be 0.78, not 0.88 ;-)
Updated by Joao Eduardo Luis about 10 years ago
- Status changed from In Progress to Closed
As far as I can tell, Sage is right. Nothing else seems off. Closing the ticket.