Bug #3657

closed

rbd: crash mapping image

Added by Alex Elder over 11 years ago. Updated over 11 years ago.

Status: Resolved
Priority: Normal
Category: -
Target version: -
% Done: 0%
Source: Development

Description

I'm just creating this to track some activity from someone
on the mailing list reporting kernel crashes when attempting
to map rbd images.

Here's a link to the e-mail chain:
http://www.spinics.net/lists/ceph-devel/msg11184.html
The message subject is:
rbd kernel module crashes with different kernels
Some summary information, though:
- Kernel is running in a Xen-based OracleVM guest
- Found using a number of kernel versions, including 3.7.1
- "rbd ls" shows the image to be mapped in the rbd pool
- map command looks OK to me, but reportedly results in an
  immediate crash
- Crash info--stack trace, etc.--is not really available.
  An image was provided but it only captured the last part
  of a double-fault panic or something.

Assigning this to myself for now.

Actions #1

Updated by Alex Elder over 11 years ago

Ugis supplied two more images containing captured crash
stack traces. Both contained lines like this:

[   32.978290] kernel BUG at net/ceph/messenger.c:2366!

And that line is this, in ceph_fault():
    BUG_ON(con->state != CON_STATE_CONNECTING &&
           con->state != CON_STATE_NEGOTIATING &&
           con->state != CON_STATE_OPEN);

The first problem with this is that we now recognize that
ceph_fault() can be called while the connection is in any
state. So the condition being asserted here is bogus.

The second problem is that we should really just be doing
a WARN_ON() call here (as well as other places asserting
things about the connection state). We shouldn't crash the
whole machine if a connection state is bad; odds are good
it will reset itself anyway and recover from the problem.
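
To illustrate the second point only, here is a minimal userspace sketch of
"warn and recover" instead of "assert and crash". It is not the patch that
was sent: warn_on(), handle_fault(), struct connection, and the CLOSED state
are hypothetical stand-ins, and the kernel's real WARN_ON() prints a stack
trace rather than a single line.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical states; the last three mirror the names in the BUG_ON(). */
enum con_state {
    CON_STATE_CLOSED,
    CON_STATE_CONNECTING,
    CON_STATE_NEGOTIATING,
    CON_STATE_OPEN,
};

struct connection {
    enum con_state state;
    unsigned int warnings;   /* number of warnings reported so far */
};

/* Userspace stand-in for WARN_ON(): report the problem and keep running. */
static bool warn_on(bool cond, const char *what, struct connection *con)
{
    if (cond) {
        fprintf(stderr, "WARNING: %s (state=%d)\n", what, (int)con->state);
        con->warnings++;
    }
    return cond;
}

/*
 * Illustrative fault handler. An unexpected state is reported, but the
 * connection is still reset so it can be re-established later.
 * Returns true if the state was unexpected.
 */
static bool handle_fault(struct connection *con)
{
    assert(con != NULL);

    bool unexpected = warn_on(con->state != CON_STATE_CONNECTING &&
                              con->state != CON_STATE_NEGOTIATING &&
                              con->state != CON_STATE_OPEN,
                              "unexpected connection state", con);

    con->state = CON_STATE_CLOSED;   /* recover by resetting */
    return unexpected;
}
```

With this shape, a fault on a connection in an unexpected state costs one
log line, where the BUG_ON() above takes the whole machine down.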

I sent two patches that address this problem, and hope to
hear back that the system is no longer crashing.

Actions #2

Updated by Alex Elder over 11 years ago

There is another thing that came from the two crash logs Ugis
just supplied. They both contained lines like this:

[   32.978013] libceph: mon2 10.3.3.3:6789 feature set mismatch, \
my 40002 < server's 40002, missing 0

This gets reported while negotiating a connection, when the
remote end (mon2 in this case) responds with a "FEATURES" tag.
There are two pieces of code on the server side that can cause
this.

The first indicates that the other end requires certain features
that the local end does not advertise it provides.

    feat_missing = policy.features_required & ~(uint64_t)connect.features;
    if (feat_missing) {
      ldout(msgr->cct,1) << "peer missing required features "
                         << std::hex << feat_missing << std::dec << dendl;
      reply.tag = CEPH_MSGR_TAG_FEATURES;
      msgr->lock.Unlock();
      goto reply;
    }

The second has to do with requiring the client to sign messages:

    // If the server supports signing session messages, and it is configured
    // to require the client to sign, and the client can't sign, bail out. PLR
    if ((policy.features_supported & CEPH_FEATURE_MSG_AUTH) &&
        msgr->cct->_conf->cephx_require_signatures &&
        !(connect.features & CEPH_FEATURE_MSG_AUTH)) {
      ldout(msgr->cct,1) << "Client can't sign messages." << dendl;
      reply.tag = CEPH_MSGR_TAG_FEATURES;
      msgr->lock.Unlock();
      goto reply;
    }

I'm pretty sure that, given the message to the user indicated
no missing features, it's this second one that is producing the
error for the user. (It would be nice to do a better job of
distinguishing this; that message is a bit confusing.)

Actions #3

Updated by Sage Weil over 11 years ago

hmm. yeah, it probably means we should set the required features during negotiation to include MSG_AUTH instead of doing an after-the-fact check.

first, though, let's confirm that cephx_require_signatures is true for him?

Actions #4

Updated by Alex Elder over 11 years ago

  • Status changed from New to In Progress

I got a response from Ugis. The patches I supplied to him
did stop the crashes he was seeing. So we'll want to get
these into our stable trees. I'll look at the current
testing branch also and will provide a comparable patch
to that if necessary.

He also reported that all of his nodes had the config option
"cephx require signatures = true" and when he turned those
off he was successful getting his rbd images to map.

So overall he is now a very happy customer...

Actions #5

Updated by Alex Elder over 11 years ago

I'm currently testing two patches related to this bug, and
while I haven't pushed them to the testing branch yet I
expect I will shortly.

122070a2 libceph: WARN, don't BUG on unexpected connection states
    This is a patch I supplied to stop the crashes that Ugis was seeing.
    The other patch I supplied is already present in the testing branch.

0fa6ebc6 libceph: fix protocol feature mismatch failure path
    This is a patch Sage put together. The panic that was occurring was
    due to ceph_fault() getting called while the connection was in an
    unexpected state. Sage determined the state was indeed not to be
    expected, and this patch fixes the problem.

So this bug is ready to be closed, pending completion of some
final sanity tests and pushing the result to the testing branch.

Actions #6

Updated by Alex Elder over 11 years ago

  • Status changed from In Progress to Resolved

Done, pushed to master, and soon to be included in a pull request
to Linus for 3.8.
