Bug #8178


0.79: feature set mismatch, my 4a042a42 < server's 104a042a42, missing 1000000000

Added by Dmitry Smirnov almost 10 years ago. Updated almost 10 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression:
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

For some weeks I had no trouble with RBD clients on Linux-3.13.10 x86_64.
Today, after I created a new erasure pool, all RBD clients suddenly stopped working:

libceph: mon1 {IP:6789} feature set mismatch, my 4a042a42 < server's 104a042a42, missing 1000000000

My attempt to recover by using the previously working "ceph osd crush tunables default" was unsuccessful.
Removing the new pool did not help either.

For the moment I was able to recover some clients by upgrading the Linux kernel to version 3.14. This upgrade was undesirable, and I wish I could have recovered by other means (how?). Is it possible?
Other Linux-3.13 clients are still unable to use RBD devices from the replicated pool.

As my unfortunate experience shows, it is surprisingly easy to cause significant downtime as a result of an isolated experiment with an erasure pool on a cluster whose RBD clients are working on other pool(s).

If possible, please prevent this kind of outcome.
Please advise how to recover without upgrading to Linux-3.14.

If my reading of "src/include/ceph_features.h" is correct, the missing feature 1000000000 translates to "CEPH_FEATURE_CRUSH_V2 (1ULL<<36)", right?
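
For reference, the missing bit can be derived from the two masks in the error above with plain shell arithmetic (nothing Ceph-specific here, just bit math):

$ printf '%#x\n' $(( 0x104a042a42 & ~0x4a042a42 ))
0x1000000000
$ printf '%#x\n' $(( 1 << 36 ))
0x1000000000

i.e. the missing feature is bit 36, matching the CRUSH_V2 definition quoted above.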

Thanks.

#1

Updated by Ilya Dryomov almost 10 years ago

Hi Dmitry,

I'm assuming what you did is: you created an EC pool, tried to map an image out of the replicated pool, that failed, you removed the EC pool, and mapping an image out of the replicated pool still fails with "missing 1000000000", right? If that's so, you need to do

$ ceph osd crush rule ls

It should give you something like

"replicated_ruleset",
"erasure-code"

and then remove "erasure-code" with

$ ceph osd crush rule rm erasure-code
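
Once the rule is gone, a quick sanity check is to list the rules again and retry the mapping from the replicated pool (the pool/image names below are just placeholders):

$ ceph osd crush rule ls
$ sudo rbd map rbd/someimage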

The problem is that the erasure compat flag is currently cluster-wide. What this means is that you can't simultaneously have clients that know how to handle erasure pools and clients that don't, i.e. older clients will stop working once you create an erasure pool, even if all they do is talk to replicated pools.

As for the 'crush rule rm' part, that's necessary simply because the 3.13 kernel is practically 5 months old, whereas 0.79 was released a couple of weeks ago. You have to be careful and follow the release notes for each Ceph release if you want to mix and match like that.
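
As a rough pre-flight check before enabling something new on a cluster that still has kernel clients, it helps to compare what both sides are running, e.g. (standard commands; exact output depends on the release):

$ ceph --version                 # on an admin/monitor node
$ ceph osd crush show-tunables   # current CRUSH tunables in effect
$ uname -r                       # on each kernel RBD client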

#2

Updated by Dmitry Smirnov almost 10 years ago

Dear Ilya,

You got the right impression, but I hadn't even mapped anything from the new erasure pool when the connected RBD clients stopped working. Thank you very much for confirming the root cause and pointing out the solution. Indeed, after deleting the erasure pool I had forgotten to remove the rule set.

Removing the erasure rule set was the key to restoring RBD connectivity from Linux-3.13 clients.
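
For anyone who hits the same situation, the cleanup that worked here boils down to the following (the pool name is only an example; the rule name is the one from this cluster):

$ ceph osd pool delete ecpool ecpool --yes-i-really-really-mean-it
$ ceph osd crush rule ls
$ ceph osd crush rule rm erasure-code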

Linux kernel 3.13.10 is less than one month old. I understand the desire to use the latest "bleeding edge" kernels, but please don't dismiss previous releases so quickly. In practice it is not feasible to upgrade to a kernel that has not yet propagated to Debian "testing" or to "backports". Adopting the latest kernel takes time, DKMS drivers need to catch up, etc. Upgrading all clients at once is troublesome and inconvenient. At least for RBD (which is supposed to be production-ready), support for at least one previous kernel is crucial.

From a usability perspective, I'd appreciate more user-space warnings for operations that affect client compatibility cluster-wide. "Feature mismatch" errors should be accompanied by documentation about the features and their compatibility matrix. As far as I'm aware this information is not documented yet (I couldn't find anything helpful regarding my situation).

Besides, the release notes do not mention required kernel versions (did I miss this information somewhere?). I had already been running 0.79 for some time, but the problems started with the creation of the erasure pool. Perhaps information about this problem and the steps to resolve it is worth including in the release notes. There is potential for downtime, so users had better be warned.

Thanks again for the very useful explanation and advice.

#3

Updated by Ilya Dryomov almost 10 years ago

In terms of features, 3.13 is almost 6 months old (3.13-rc1 was released 5 months ago). But yeah, we should definitely be better at documenting which kernel supports what and the possible pitfalls, like the one you hit. I have opened #8196 to that end.

#4

Updated by Ilya Dryomov almost 10 years ago

  • Status changed from New to Resolved
