Bug #4434

closed

looping waiting for quorum after upgrade

Added by Ken Franklin about 11 years ago. Updated about 11 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Q/A
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

How we got here:
Bobtail 0.56 installed on burnupi60 failed its daily upgrade due to new gitbuilder keys.
Updated the key.
Upgraded to 0.57 and then 0.58; apt-get upgrade failed (#4424), so wip-4424 was installed.
On ceph restart, no ceph-mds is started - ceph-mds.log attached.

ps -elf | grep ceph gives the following - no mds

0 S ubuntu 27465 26879 0 80 0 - 2026 pipe_w 13:50 pts/3 00:00:00 grep --color=auto ceph
1 S root 40490 1 0 80 0 - 42390 futex_ 11:07 ? 00:00:14 /usr/bin/ceph-mon -i a --pid-file /var/run/ceph/mon.a.pid -c /etc/ceph/ceph.conf
0 S root 40504 1 0 80 0 - 11315 poll_s 11:07 pts/0 00:00:08 /usr/bin/python /usr/sbin/ceph-create-keys -i a
1 S root 40958 1 0 80 0 - 156037 futex_ 11:07 ? 00:00:19 /usr/bin/ceph-osd -i 0 --pid-file /var/run/ceph/osd.0.pid -c /etc/ceph/ceph.conf
1 S root 41193 1 0 80 0 - 158900 futex_ 11:07 ? 00:00:19 /usr/bin/ceph-osd -i 1 --pid-file /var/run/ceph/osd.1.pid -c /etc/ceph/ceph.conf


Files

ceph-mds.a.log (162 KB) Ken Franklin, 03/13/2013 02:36 PM
Actions #1

Updated by Sam Lang about 11 years ago

Part of the bug appears to be in ceph, where the following returns an error, causing an infinite loop in get_key():

sudo ceph --cluster=ceph --name=mon. --keyring=/var/lib/ceph/mon/ceph-a/keyring auth get-or-create client.admin mon allow * osd allow * mds allow
access denied
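
To illustrate the failure mode described here: a loop that retries a command until it exits with 0 will spin forever if the command fails for a permanent reason such as "access denied". The following is a minimal sketch of a bounded retry, not the actual ceph-create-keys code; the helper name `get_key_bounded` and its parameters are hypothetical.

```python
import subprocess
import time


def get_key_bounded(cmd, max_attempts=5, delay=1.0, run=subprocess.call):
    """Run cmd until it exits with 0, giving up after max_attempts.

    Hypothetical helper for illustration only. `run` is injectable so the
    loop can be exercised without a real cluster.
    Returns the number of the attempt that succeeded.
    """
    returncode = None
    for attempt in range(1, max_attempts + 1):
        returncode = run(cmd)
        if returncode == 0:
            return attempt
        # A permanent error (e.g. access denied) never clears on its own,
        # so the attempt limit is what prevents an endless loop.
        time.sleep(delay)
    raise RuntimeError(
        "command still failing after %d attempts (last exit code %s)"
        % (max_attempts, returncode)
    )
```

With an unbounded `while True` in place of the `for` loop, a command that always returns a nonzero exit code would never let the caller make progress, which matches the symptom of waiting indefinitely.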

Actions #2

Updated by Ken Franklin about 11 years ago

  • Project changed from devops to CephFS

changed the project

Actions #3

Updated by Greg Farnum about 11 years ago

More details? I'm not sure how the title relates to the bug description or MDS log. The log is crashing on the SessionMap decode, and my money is on all the encoding changes.
I'm hoping that this system just followed an unusual upgrade path that customers won't actually see and got busted up on the weird encoding versions we did. Can you provide more precise details on the upgrades and exact versions this system ran as?

Actions #4

Updated by Ken Franklin about 11 years ago

It's quite possible the upgrade was corrupted somewhere along the line. Prior to the issues the system was on 0.56.3, then 0.57.667, and is now on 0.58.500.

I moved from bobtail release to the master branch and ran into the issue in 4424. Gary fixed that in wip-4424 and now I am seeing this issue. Sam has taken a look at it and the title reflects what he saw so I filed the bug to track it.

I am not seeing this issue on the system tracking along the NEXT branch or on a fresh install of Master. It may very well be something no one will ever create in the real world. My concern is that it started from a "released" distribution and might happen between bobtail and cuttlefish.

Actions #5

Updated by Greg Farnum about 11 years ago

Just to make sure I'm tracking these upgrades correctly:
It was created on v0.56.3? (Not a branch.) Then it moved to master branch when it was following .57 (I can't translate .57.667 into a commit or a release)? And then you saw the issue when you upgraded to a fresher master following 0.58?

This path would indicate that you ran into the encoding stuff I'm hoping for. If you can get me the precise post-.57 commit then I can make sure.

Actions #6

Updated by Ken Franklin about 11 years ago

This is what was captured at the time the test was run successfully: Ceph Version: 0.57-667-g6a9cda7
The next instance installed from master was: Ceph Version: 0.58-500-gaf3b163
The current installation is sitting at: ceph version 0.58-501-g66be33a (wip-4424)

Is this what you are looking for or is there something else we should be capturing in the test log?
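
For reference, the version strings above appear to follow the `git describe` layout: the release, the number of commits since that release, and `g` followed by the abbreviated commit hash. A minimal sketch of splitting them apart, assuming that layout holds (the function name `parse_version` is hypothetical):

```python
import re

# Assumed layout: <release>-<commits since release>-g<abbreviated hash>
DESCRIBE_RE = re.compile(
    r"^(?P<release>\d+(?:\.\d+)+)-(?P<commits>\d+)-g(?P<sha>[0-9a-f]+)$"
)


def parse_version(version):
    """Split a git-describe-style version into (release, commits, sha)."""
    match = DESCRIBE_RE.match(version)
    if match is None:
        raise ValueError("not a git-describe-style version: %r" % version)
    return match.group("release"), int(match.group("commits")), match.group("sha")


print(parse_version("0.57-667-g6a9cda7"))  # ('0.57', 667, '6a9cda7')
print(parse_version("0.58-500-gaf3b163"))  # ('0.58', 500, 'af3b163')
```

A dotted form such as 0.57.667 drops the hash, which is why it cannot be mapped back to a commit; the sketch rejects it with a `ValueError`.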

Actions #7

Updated by Greg Farnum about 11 years ago

Yep! This says that you ran a branch which included an unreleased set of encoding rules on the MDS which would have collided horribly with what ended up in 0.58. (af3b163 includes the merge of "wip-mds-encode-rebased" but not any of the follow-on conflict cleanup which came as a result of our needing to backport some stuff to bobtail.)

Hurray, not a problem for users, and not an unexpected collision of doom. :)

Actions #8

Updated by Greg Farnum about 11 years ago

  • Status changed from New to Resolved

Whoops!
