Bug #15411

Added new OSDs with 0.94.6 disabled RBD access

Added by Bosse Klykken about 8 years ago. Updated about 7 years ago.

Status: Won't Fix
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We've added two new nodes running 0.94.6, with a total of 60 OSDs, to a running cluster on 0.94.3, accidentally skipping the documented step of upgrading the existing cluster to the newer version before adding the nodes. After a short time, the RBD mounts stopped working and couldn't be remounted. The RBD client logs contained messages like:

libceph: osd44 10.xx.xx.xx:6820 feature set mismatch, my 2b84a842a42 < server's 102b84a842a42, missing 1000000000000
libceph: osd44 10.xx.xx.xx:6820 socket error on read
libceph: corrupt inc osdmap (-22) epoch 76574 off 60 (ffffc9002e58a058 of ffffc9002e58a01c-ffffc9002e58c1bc)
<and a lot of hex dumps filling the logs>
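For reference, the missing bit can be read directly off the two feature masks in the log. A quick shell sketch, using the values copied from the messages above:

# XOR the server's feature mask with the client's to isolate the missing bit
printf 'missing: 0x%x\n' $(( 0x102b84a842a42 ^ 0x2b84a842a42 ))
# prints: missing: 0x1000000000000, i.e. bit 48, which matches
# CEPH_FEATURE_CRUSH_V4 (1ULL << 48) in src/include/ceph_features.h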

As far as we've been able to determine, the feature set mismatch refers to CEPH_FEATURE_CRUSH_V4. Looking at the crush map, we noticed that the new OSDs were placed in buckets using alg straw2, while the remainder were straw.
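One way to confirm which buckets use straw2 is to dump and decompile the crush map with the standard tools (the file paths below are just examples):

# grab the compiled crush map from the cluster and decompile it to text
ceph osd getcrushmap -o /tmp/crushmap.bin
crushtool -d /tmp/crushmap.bin -o /tmp/crushmap.txt

# list the bucket algorithms; any "alg straw2" bucket makes the cluster
# require CEPH_FEATURE_CRUSH_V4 from all clients
grep -n 'alg ' /tmp/crushmap.txt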

We've stopped the newly added OSDs, but the problem persists:

# rbd map rbd-volume
rbd: sysfs write failed
rbd: map failed: (5) Input/output error

Attached is the compressed ceph report. After generating this report, we tried removing the newly added OSDs from the crush map in order to upgrade the cluster before adding them back, but the problem persists: we can't map RBD volumes, and we still get the feature set mismatch messages mentioned above.
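A possible workaround sketch for this situation, assuming the only missing client feature is CEPH_FEATURE_CRUSH_V4: convert the straw2 buckets back to straw and inject the edited map, which should drop that requirement until the kernel clients can be upgraded. Paths are examples, review the edit before applying, and expect some data movement when bucket algorithms change:

# decompile the current crush map
ceph osd getcrushmap -o /tmp/crushmap.bin
crushtool -d /tmp/crushmap.bin -o /tmp/crushmap.txt

# rewrite every straw2 bucket back to straw
sed -i 's/alg straw2/alg straw/' /tmp/crushmap.txt

# recompile and inject the edited map; placements may shift and trigger backfill
crushtool -c /tmp/crushmap.txt -o /tmp/crushmap.new
ceph osd setcrushmap -i /tmp/crushmap.new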
