Bug #5062
mon: 0.61.2 asserts on AuthMonitor during monitor start
Status: Closed
Done: 0%
Description
Florian from Smart Weblications bumped into this crash on one of his monitors roughly two hours after upgrading from 0.56.6 to 0.61.2:
2013-05-14 17:40:59.458008 7f230f0bb780  0 ceph version 0.61.2 (fea782543a844bb277ae94d3391788b76c5bee60), process ceph-mon, pid 1471
2013-05-14 17:40:59.902657 7f230f0bb780  1 mon.0@-1(probing) e4 preinit fsid 2ff06d61-7b8e-4b10-b39d-138cfcc1ee43
2013-05-14 17:40:59.910734 7f230f0bb780 -1 mon/AuthMonitor.cc: In function 'virtual void AuthMonitor::update_from_paxos()' thread 7f230f0bb780 time 2013-05-14 17:40:59.910363
mon/AuthMonitor.cc: 147: FAILED assert(ret == 0)
 ceph version 0.61.2 (fea782543a844bb277ae94d3391788b76c5bee60)
 1: (AuthMonitor::update_from_paxos()+0x25ad) [0x56f44d]
 2: (Monitor::init_paxos()+0xdf) [0x48292f]
 3: (Monitor::preinit()+0x679) [0x490a39]
 4: (main()+0x2b61) [0x478361]
 5: (__libc_start_main()+0xfd) [0x7f230fd22c8d]
 6: /usr/bin/ceph-mon() [0x474289]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
I have this monitor's store in my possession for further inquiry. Florian is, however, removing this monitor from the cluster (quorum was maintained) and adding a new one.
Full log attached.
Updated by Joao Eduardo Luis almost 11 years ago
- Status changed from New to In Progress
- Assignee set to Joao Eduardo Luis
Updated by Joao Eduardo Luis almost 11 years ago
- Status changed from In Progress to Need More Info
Waiting on Florian to provide me a copy of a healthy monitor's data dir, to assess whether the "corrupted" state is common to all the monitors or the result of some borked sync.
The source of the problem is a missing incremental on the AuthMonitor. As to why, I can only speculate, but my guess is that the monitor went through a sync process and ended up with a missing version; it was then elected the leader and went about business as usual, proposing a new version -- during update_from_paxos() it asserted due to the missing version.
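To illustrate why a single missing version is fatal at startup: update_from_paxos() replays every stored incremental between the service's last applied version and the latest committed version, and asserts if any read fails. The sketch below is illustrative only — FakeStore, replay_incrementals(), and the error code are stand-ins, not Ceph's actual store API.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

using version_t = uint64_t;

// Hypothetical stand-in for the monitor's key/value store: each service's
// state is persisted as contiguous incrementals keyed by version number.
struct FakeStore {
  std::map<version_t, std::string> incrementals;
  // returns 0 on success, -2 (ENOENT-like) when the version is missing
  int get(version_t v, std::string &out) const {
    auto it = incrementals.find(v);
    if (it == incrementals.end()) return -2;
    out = it->second;
    return 0;
  }
};

// Sketch of the update_from_paxos() replay pattern: apply every incremental
// from (applied, committed]. A single gap makes get() fail; the real code
// does `assert(ret == 0)` at that point and the monitor crashes on start.
int replay_incrementals(const FakeStore &store,
                        version_t applied, version_t committed) {
  for (version_t v = applied + 1; v <= committed; ++v) {
    std::string bl;
    int ret = store.get(v, bl);
    if (ret != 0)
      return ret;  // the real monitor asserts here instead of returning
    // ... decode and apply the incremental ...
  }
  return 0;
}
```

A store that synced the latest committed version but skipped an intermediate one reproduces exactly this failure mode.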
Given that no other monitor asserted, and that Florian's cluster has been running smoothly since he removed that monitor and added a new one, I'm leaning towards the above explanation.
Updated by Greg Farnum almost 11 years ago
Any idea how sync could have missed a version?
Updated by Joao Eduardo Luis almost 11 years ago
By finishing a sync before the current proposal is fully committed (and now that I think of it, we might still have the paxos version to corroborate that! I'll check and report back).
Anyway, this is why we rely on traditional paxos recovery to catch up with the cluster after joining quorum, and this might be an indicator of some other bug in there, which is what I've been looking into.
Updated by Joao Eduardo Luis almost 11 years ago
There has been another iteration of this bug happening on PGMonitor (from user saaby on IRC):
root@ceph1-cph1c16-mon1:~# ceph-mon -i ceph1-cph1c16-mon1 --debug-osd 20 -d
2013-05-29 11:02:44.273284 7fd3a3d02780  0 ceph version 0.61.2-26-g1071736 (1071736d3b6611b6c5edeb9b225f32b4e9afdc6d), process ceph-mon, pid 14473
starting mon.ceph1-cph1c16-mon1 rank 0 at 10.81.16.11:6789/0 mon_data /var/lib/ceph/mon/ceph-ceph1-cph1c16-mon1 fsid ab804c03-24c1-4532-9fad-f7c1a2606aa5
2013-05-29 11:02:44.309123 7fd3a3d02780  1 mon.ceph1-cph1c16-mon1@-1(probing) e3 preinit fsid ab804c03-24c1-4532-9fad-f7c1a2606aa5
mon/PGMonitor.cc: In function 'virtual void PGMonitor::update_from_paxos()' thread 7fd3a3d02780 time 2013-05-29 11:02:44.800806
mon/PGMonitor.cc: 173: FAILED assert(err == 0)
 ceph version 0.61.2-26-g1071736 (1071736d3b6611b6c5edeb9b225f32b4e9afdc6d)
 1: (PGMonitor::update_from_paxos()+0xe02) [0x556012]
 2: (Monitor::init_paxos()+0x97) [0x4ac217]
 3: (Monitor::preinit()+0x5ea) [0x4cc71a]
 4: (main()+0x171d) [0x4a111d]
 5: (__libc_start_main()+0xfd) [0x7fd3a1f81ead]
 6: ceph-mon() [0x4a3609]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2013-05-29 11:02:44.802189 7fd3a3d02780 -1 mon/PGMonitor.cc: In function 'virtual void PGMonitor::update_from_paxos()' thread 7fd3a3d02780 time 2013-05-29 11:02:44.800806
mon/PGMonitor.cc: 173: FAILED assert(err == 0)
 ceph version 0.61.2-26-g1071736 (1071736d3b6611b6c5edeb9b225f32b4e9afdc6d)
 1: (PGMonitor::update_from_paxos()+0xe02) [0x556012]
 2: (Monitor::init_paxos()+0x97) [0x4ac217]
 3: (Monitor::preinit()+0x5ea) [0x4cc71a]
 4: (main()+0x171d) [0x4a111d]
 5: (__libc_start_main()+0xfd) [0x7fd3a1f81ead]
 6: ceph-mon() [0x4a3609]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Updated by Sage Weil almost 11 years ago
could this simply be:
- start sync
- sync last_committed
- crash before reaching osdmap_$lastcommitted
- mon restarts
- init_paxos
- tries to load last_committed, fails
? If so, simply adding an is_sync_in_progress() check to init_paxos() is all that is needed. And I'm pretty sure we need that anyway...
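A rough sketch of what such a guard might look like. Note this is hypothetical — MonSketch and its flags are stand-ins; the real change would live in Monitor::init_paxos() and would resume the interrupted store sync rather than just set a flag:

```cpp
// Sketch of the proposed guard: if the store records that a sync was
// interrupted mid-way (last_committed copied, but not all the versions
// below it), skip the paxos replay at startup and resume the sync,
// instead of hitting the missing-version assert.
struct MonSketch {
  bool sync_in_progress = false;   // hypothetical persisted flag, set when sync starts
  bool paxos_initialized = false;  // set once replay is known to be safe
  bool sync_resumed = false;       // stands in for restarting the store sync

  void init_paxos() {
    if (sync_in_progress) {
      // The store only has last_committed so far, not the full history;
      // replaying incrementals would fail, so resume syncing instead.
      sync_resumed = true;
      return;
    }
    paxos_initialized = true;  // safe to replay committed versions
  }
};
```

The key property is that a half-synced store is never treated as a complete one.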
Updated by Sage Weil almost 11 years ago
Do we have any recent occurrences of this bug to go on, or mon logs of it happening?
If not, I think this is a 'can't reproduce'.
Updated by Sage Weil almost 11 years ago
- Status changed from Need More Info to Can't reproduce
Updated by Joao Eduardo Luis over 10 years ago
This bug resurfaced in QA, in a v0.61.7 cluster. After looking into the store of the affected monitor and some logs, I could determine that the cause was the leader skipping an 'auth' version after going through the paxos recovery process. This looks a lot like the symptoms from #5750, which were fixed in 05b6c7e8645081f405c616735238ae89602d3cc6 and backported to 0.61.8 in 17aa2d6d16c77028bae1d2a77903cdfd81efa096.