Bug #5062 (closed)

mon: 0.61.2 asserts on AuthMonitor during monitor start

Added by Joao Eduardo Luis almost 11 years ago. Updated over 10 years ago.

Status: Can't reproduce
Priority: Urgent
Category: Monitor
Target version: -
% Done: 0%
Source: other
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Florian from Smart Weblications bumped into this crash on one of his monitors roughly two hours after upgrading from 0.56.6 to 0.61.2.

2013-05-14 17:40:59.458008 7f230f0bb780  0 ceph version 0.61.2 (fea782543a844bb277ae94d3391788b76c5bee60), process ceph-mon, pid 1471
2013-05-14 17:40:59.902657 7f230f0bb780  1 mon.0@-1(probing) e4 preinit fsid 2ff06d61-7b8e-4b10-b39d-138cfcc1ee43
2013-05-14 17:40:59.910734 7f230f0bb780 -1 mon/AuthMonitor.cc: In function 'virtual void AuthMonitor::update_from_paxos()' thread 7f230f0bb780 time 2013-05-14 17:40:59.910363
mon/AuthMonitor.cc: 147: FAILED assert(ret == 0)

 ceph version 0.61.2 (fea782543a844bb277ae94d3391788b76c5bee60)
 1: (AuthMonitor::update_from_paxos()+0x25ad) [0x56f44d]
 2: (Monitor::init_paxos()+0xdf) [0x48292f]
 3: (Monitor::preinit()+0x679) [0x490a39]
 4: (main()+0x2b61) [0x478361]
 5: (__libc_start_main()+0xfd) [0x7f230fd22c8d]
 6: /usr/bin/ceph-mon() [0x474289]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

I have this monitor's store in my possession for further inquiry. Florian is, however, removing this monitor from the cluster (quorum was maintained) and adding a new one.

Full log attached.


Files

XBqSDkrd.txt (45.1 KB) - Joao Eduardo Luis, 05/14/2013 09:01 AM
Actions #1

Updated by Joao Eduardo Luis almost 11 years ago

  • Status changed from New to In Progress
  • Assignee set to Joao Eduardo Luis
Actions #2

Updated by Ian Colle almost 11 years ago

  • Priority changed from High to Urgent
Actions #3

Updated by Joao Eduardo Luis almost 11 years ago

  • Description updated (diff)
Actions #4

Updated by Joao Eduardo Luis almost 11 years ago

  • Status changed from In Progress to Need More Info

Waiting on Florian to provide a copy of a healthy monitor's data dir, to assess whether the "corrupted" state is common to all the monitors or the result of some borked sync.

The source of the problem is a missing incremental on the AuthMonitor. As to why, I can only speculate, but my guess is that the monitor itself went through a sync process and might have ended up with a missing version; it was then elected leader and went about business as usual, proposing a new version, and during update_from_paxos() it asserted due to the missing version.

Given no other monitor asserted, and Florian's cluster has been running smoothly since he removed and added a new monitor, I'm leaning towards the above explanation.
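
For context, a minimal sketch of the shape of the failing path, not the actual AuthMonitor code: the Store type, its get() method, and the function signature below are simplified assumptions. The point is that update_from_paxos() replays committed incrementals one version at a time, so a single skipped version makes the read fail and trips the assert(ret == 0) seen in the backtrace.

// Simplified sketch of the failing pattern; not the real AuthMonitor code.
// `Store` and its get() method are hypothetical stand-ins for the monitor's
// key/value store interface.
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

using version_t = uint64_t;

struct Store {
  std::map<version_t, std::string> versions;  // committed incrementals by version
  int get(version_t v, std::string *out) const {
    auto it = versions.find(v);
    if (it == versions.end())
      return -1;               // version missing from the store
    *out = it->second;
    return 0;
  }
};

// Replay every incremental between the last applied version and what paxos
// reports as committed. A single hole aborts startup, which is what the
// FAILED assert(ret == 0) in the report corresponds to.
void update_from_paxos(const Store &store, version_t applied, version_t last_committed) {
  for (version_t v = applied + 1; v <= last_committed; ++v) {
    std::string bl;
    int ret = store.get(v, &bl);
    assert(ret == 0);          // fires if a sync left a version behind
    // ... decode and apply the incremental ...
  }
}

int main() {
  Store s;
  s.versions = {{141, "inc"}, {142, "inc"}, {144, "inc"}};  // 143 is missing
  update_from_paxos(s, 140, 144);  // aborts at version 143, mirroring the crash
  return 0;
}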

Actions #5

Updated by Greg Farnum almost 11 years ago

Any idea how sync could have missed a version?

Actions #6

Updated by Joao Eduardo Luis almost 11 years ago

By finishing a sync before the current proposal is fully committed (and now that I think of it, we might still have the paxos version to corroborate that! I'll check and report back).

Anyway, this is why we rely on traditional paxos recovery to catch up with the cluster after joining quorum, and this might be an indicator of some other bug in there, which is what I've been looking into.

Actions #7

Updated by Joao Eduardo Luis almost 11 years ago

There has been another occurrence of this bug, this time on PGMonitor (reported by user saaby on IRC):

root@ceph1-cph1c16-mon1:~# ceph-mon -i ceph1-cph1c16-mon1 --debug-osd 20 -d
2013-05-29 11:02:44.273284 7fd3a3d02780  0 ceph version 0.61.2-26-g1071736 (1071736d3b6611b6c5edeb9b225f32b4e9afdc6d), process ceph-mon, pid 14473
starting mon.ceph1-cph1c16-mon1 rank 0 at 10.81.16.11:6789/0 mon_data /var/lib/ceph/mon/ceph-ceph1-cph1c16-mon1 fsid ab804c03-24c1-4532-9fad-f7c1a2606aa5
2013-05-29 11:02:44.309123 7fd3a3d02780  1 mon.ceph1-cph1c16-mon1@-1(probing) e3 preinit fsid ab804c03-24c1-4532-9fad-f7c1a2606aa5
mon/PGMonitor.cc: In function 'virtual void PGMonitor::update_from_paxos()' thread 7fd3a3d02780 time 2013-05-29 11:02:44.800806
mon/PGMonitor.cc: 173: FAILED assert(err == 0)
 ceph version 0.61.2-26-g1071736 (1071736d3b6611b6c5edeb9b225f32b4e9afdc6d)
 1: (PGMonitor::update_from_paxos()+0xe02) [0x556012]
 2: (Monitor::init_paxos()+0x97) [0x4ac217]
 3: (Monitor::preinit()+0x5ea) [0x4cc71a]
 4: (main()+0x171d) [0x4a111d]
 5: (__libc_start_main()+0xfd) [0x7fd3a1f81ead]
 6: ceph-mon() [0x4a3609]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2013-05-29 11:02:44.802189 7fd3a3d02780 -1 mon/PGMonitor.cc: In function 'virtual void PGMonitor::update_from_paxos()' thread 7fd3a3d02780 time 2013-05-29 11:02:44.800806
mon/PGMonitor.cc: 173: FAILED assert(err == 0)

 ceph version 0.61.2-26-g1071736 (1071736d3b6611b6c5edeb9b225f32b4e9afdc6d)
 1: (PGMonitor::update_from_paxos()+0xe02) [0x556012]
 2: (Monitor::init_paxos()+0x97) [0x4ac217]
 3: (Monitor::preinit()+0x5ea) [0x4cc71a]
 4: (main()+0x171d) [0x4a111d]
 5: (__libc_start_main()+0xfd) [0x7fd3a1f81ead]
 6: ceph-mon() [0x4a3609]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Actions #8

Updated by Sage Weil almost 11 years ago

Could this simply be the following sequence?

- start sync
- sync last_committed
- crash before reaching osdmap_$lastcommitted
- mon restarts
- init_paxos
- tries to load last_committed, fails

If so, simply adding an is_sync_in_progress() check to init_paxos() is all that is needed. And I'm pretty sure we need that anyway...
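
A minimal sketch of what that guard might look like, assuming a much-simplified Monitor: apart from the is_sync_in_progress() idea mentioned above, the class layout and resume_sync() are hypothetical, not the real Ceph code.

// Hypothetical, stripped-down Monitor illustrating the suggested check; this
// is not the real Ceph Monitor class.
#include <vector>

struct PaxosService {
  virtual void update_from_paxos() = 0;  // loads committed state; asserts on a gap
  virtual ~PaxosService() = default;
};

struct Monitor {
  bool sync_in_progress = false;         // set while a store sync is underway
  std::vector<PaxosService*> services;

  bool is_sync_in_progress() const { return sync_in_progress; }
  void resume_sync() { /* restart the interrupted store sync (hypothetical) */ }

  void init_paxos() {
    // If we crashed mid-sync, the store may record a last_committed version
    // without actually containing every incremental up to it, so loading
    // service state now would hit the assert from the report. Finish the
    // sync first instead.
    if (is_sync_in_progress()) {
      resume_sync();
      return;
    }
    for (auto *svc : services)
      svc->update_from_paxos();
  }
};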

Actions #9

Updated by Sage Weil almost 11 years ago

Do we have any recent occurrences of this bug to go on, or mon logs of it happening?

If not, I think this is a 'can't reproduce'.

Actions #10

Updated by Sage Weil almost 11 years ago

  • Status changed from Need More Info to Can't reproduce
Actions #11

Updated by Joao Eduardo Luis over 10 years ago

This bug resurfaced in QA, in a v0.61.7 cluster. After looking into the affected monitor's store and some logs, I was able to determine that the cause is that we skipped an 'auth' version on the leader after it went through the paxos recovery process. This looks a lot like the symptoms from #5750, which were fixed in 05b6c7e8645081f405c616735238ae89602d3cc6 and backported to 0.61.8 in 17aa2d6d16c77028bae1d2a77903cdfd81efa096.
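
For illustration, a hypothetical helper for spotting the kind of hole described here: given the sorted list of a service's versions actually present in a monitor's store, it reports the first skipped one. No real MonitorDBStore layout or API is assumed; the input is just a plain list.

// Hypothetical gap check; not tied to the real monitor store API.
#include <cstdint>
#include <cstdio>
#include <vector>

using version_t = uint64_t;

// Given a sorted list of versions present in the store, return the first
// missing one, or 0 if the run is contiguous.
version_t first_gap(const std::vector<version_t> &present) {
  for (size_t i = 1; i < present.size(); ++i)
    if (present[i] != present[i - 1] + 1)
      return present[i - 1] + 1;
  return 0;
}

int main() {
  // Example: 'auth' versions 140..147 with 145 skipped -- the situation that
  // makes AuthMonitor::update_from_paxos() assert at startup.
  std::vector<version_t> auth_versions = {140, 141, 142, 143, 144, 146, 147};
  if (version_t gap = first_gap(auth_versions))
    std::printf("missing auth incremental: version %llu\n",
                (unsigned long long)gap);
  return 0;
}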
