Bug #1789

mon: failed assert(paxosv == pg_map.version)

Added by Josh Durgin over 12 years ago. Updated about 12 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Monitor
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

From teuthology:~/log/mon.2.log.gz:

mon/PGMonitor.cc: In function 'virtual bool PGMonitor::update_from_paxos()', in thread '7f9f51567700'
mon/PGMonitor.cc: 165: FAILED assert(paxosv == pg_map.version)
 ceph version 0.38-278-g0622871 (commit:06228716e345a81ee2c93055a6a6133c540fbada)
 1: (PGMonitor::update_from_paxos()+0x1145) [0x514fd5]
 2: (PGMonitor::tick()+0x5e) [0x5071ae]
 3: (Monitor::tick()+0x65) [0x473705]
 4: (C_Mon_Tick::finish(int)+0x15) [0x496eb5]
 5: (SafeTimer::timer_thread()+0x4b0) [0x5fa1d0]
 6: (SafeTimerThread::entry()+0x15) [0x5fe475]
 7: (Thread::_entry_func(void*)+0x12) [0x582852]
 8: (()+0x7971) [0x7f9f557dd971]
 9: (clone()+0x6d) [0x7f9f5406c92d]

last_committed - From mon.c (6 Bytes) Matthew Roy, 02/27/2012 10:14 AM

first_committed - From mon.c (6 Bytes) Matthew Roy, 02/27/2012 10:14 AM

latest - From mon.c (121 KB) Matthew Roy, 02/27/2012 10:14 AM

mon.c.log.head - First 1200 lines of the mon.c log (273 KB) Matthew Roy, 02/27/2012 10:14 AM

core.monAssert1435.gz - Core dump for crash. (478 KB) Matthew Roy, 02/27/2012 12:35 PM

Associated revisions

Revision d10e1f46 (diff)
Added by Greg Farnum about 12 years ago

mon: fix slurp_latest to fill in any missing incrementals

Fixes #1789.

Signed-off-by: Greg Farnum <>

History

#1 Updated by Sage Weil over 12 years ago

  • Priority changed from Normal to High

#3 Updated by Sage Weil over 12 years ago

  • Status changed from New to Need More Info

Have the core, but no matching binary. Not clear from code inspection what happened.

#5 Updated by Sage Weil about 12 years ago

  • Priority changed from High to Normal

#6 Updated by Sage Weil about 12 years ago

  • Target version deleted (v0.40)

#7 Updated by Anonymous about 12 years ago

We only saw this once, but we believe the bug is real and want to keep it open.

#8 Updated by Matthew Roy about 12 years ago

The crash occurred on the third monitor when it was started after being down for several hours, shortly after cluster creation. It's unclear to me whether this monitor ever came up after the cluster was initially created; I suspect it did not. This cluster later had a bunch of authorization problems.

The assert occurs at line 965 in the attached log.

#9 Updated by Matthew Roy about 12 years ago

Core dump attached. Dumb thought: could this be related to http://tracker.newdream.net/issues/2110? They happened within 5 minutes of each other on this cluster, but on different servers.

#10 Updated by Greg Farnum about 12 years ago

Shouldn't be related: this is a problem with a single monitor daemon, and the other is a write problem that an MDS is getting kicked back from an OSD.

#11 Updated by Greg Farnum about 12 years ago

  • Status changed from Need More Info to In Progress
  • Assignee set to Greg Farnum

Iiiinteresting. This assert is the post-update check, after loading and running through all the incrementals. (Meaning, it passed the pre-update checks.) And it's being called from slurp(). I wonder if we missed a problem there.

#12 Updated by Greg Farnum about 12 years ago

  • Status changed from In Progress to 4

Okay, figured it out. Our current slurp code pulls in all the incrementals, then sends off a request for latest_stashed. But it's possible (especially with the pgmap state) that latest_stashed is newer than the incrementals we already pulled in, or possibly even discontiguous with them.
That means that when we pull in the latest stashed map, we get a map without all the surrounding incrementals. Then, when we call update_from_paxos(), we load that map but don't set up the rest of the Paxos state properly, and we fail this assert.

Fixed it by just adding any missing incrementals to the slurp_latest response, which we should have done in the first place since there's already space for it and everything.
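
To make the race concrete, here is a toy model in Python. It is only a sketch of the failure mode as described in this comment, not Ceph's actual code: ToyStore, commit, make_peer, slurp, the fill_missing flag, and the peer_commits_between parameter are all hypothetical names, and the model assumes the peer re-stashes a full map on every commit.

```python
from dataclasses import dataclass, field


@dataclass
class ToyStore:
    """Toy per-monitor state. Every name here is hypothetical, not Ceph's."""
    last_committed: int = 0                           # Paxos version ("paxosv")
    incrementals: dict = field(default_factory=dict)  # version -> payload
    stashed_version: int = 0                          # latest stashed full map, 0 = none
    map_version: int = 0                              # in-memory map version


def commit(store):
    """Commit one new version; the toy peer also re-stashes a full map each time."""
    v = store.last_committed + 1
    store.incrementals[v] = f"inc-{v}"
    store.last_committed = v
    store.stashed_version = v
    store.map_version = v


def make_peer(n):
    """Build a peer that has already committed n versions."""
    peer = ToyStore()
    for _ in range(n):
        commit(peer)
    return peer


def slurp(local, peer, fill_missing, peer_commits_between=0):
    # Step 1: pull every incremental the peer has right now.
    pulled_to = peer.last_committed
    for v in range(local.last_committed + 1, pulled_to + 1):
        local.incrementals[v] = peer.incrementals[v]
    local.last_committed = pulled_to

    # The peer keeps committing while the slurp is in flight.
    for _ in range(peer_commits_between):
        commit(peer)

    # Step 2: fetch the latest stashed full map, which may now be newer.
    local.stashed_version = peer.stashed_version
    if fill_missing:
        # The fix as described: ship the incrementals between what we already
        # pulled and the stashed map along with the stashed map itself.
        for v in range(local.last_committed + 1, peer.stashed_version + 1):
            local.incrementals[v] = peer.incrementals[v]
        local.last_committed = max(local.last_committed, peer.stashed_version)


def update_from_paxos(store):
    paxosv = store.last_committed
    if paxosv == store.map_version:
        return
    # Pre-update check: the in-memory map must not be ahead of Paxos.
    assert paxosv >= store.map_version
    # Jump to the stashed full map if it is ahead of the in-memory map.
    if store.stashed_version > store.map_version:
        store.map_version = store.stashed_version
    # Replay any remaining incrementals on top of the map.
    while paxosv > store.map_version:
        nxt = store.map_version + 1
        assert nxt in store.incrementals, "missing incremental"
        store.map_version = nxt
    # Post-update check: the toy analogue of the assert in this bug.
    assert paxosv == store.map_version, "paxosv != map version"
```

In this model, slurping from make_peer(3) with fill_missing=False and peer_commits_between=2 leaves the local store at Paxos version 3 with a stashed map at version 5, so update_from_paxos() passes the pre-update check and then trips the post-update assert. With fill_missing=True the same sequence ends with both versions at 5.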

Did basic testing on wip-1789, appears to work.

#13 Updated by Greg Farnum about 12 years ago

  • Status changed from 4 to Resolved
