Bug #16211

closed

Some rbd images inaccessible after upgrade to jewel (error reading immutable metadata)

Added by David Hedberg almost 8 years ago. Updated almost 6 years ago.

Status: Won't Fix
Priority: Normal
Assignee:
Category: -
Target version: -
% Done: 0%
Source: other
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We're running a small cluster with three nodes. Each node runs one monitor and three osds on XFS-formatted disks, each osd with an SSD-backed journal (~1GB each).

I recently upgraded these from (Ubuntu) 12.04/0.94.6-1precise through 12.04/0.94.7-1precise through 14.04/0.94.7-1trusty to 14.04/10.2.1-1trusty, which left me with a bunch of inaccessible rbd images. "rbd info" fails on 16 out of ~80 images (some of them presumably due to their unreadable parent).
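
A count like the one above can be reproduced by running "rbd info" over every image in the pool and collecting the names that fail. The following is a minimal sketch, not the exact procedure used for this report; the function name list_broken_images is illustrative and the pool name defaults to "rbd" as an assumption.

```shell
#!/bin/sh
# Sketch: print the names of images in a pool for which `rbd info` fails.
# The pool defaults to "rbd"; pass another pool name as the first argument.
list_broken_images() {
    pool="${1:-rbd}"
    rbd ls -p "$pool" | while IFS= read -r image; do
        # Only the exit status matters; stdin is redirected so that
        # `rbd info` cannot consume the image list being read by the loop.
        if ! rbd info "$pool/$image" </dev/null >/dev/null 2>&1; then
            printf '%s\n' "$image"
        fi
    done
}
```

Calling list_broken_images rbd | wc -l then gives the number of images that cannot be opened. Clones whose parent is unreadable are reported as well, so the list does not distinguish between the two cases.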

Example:
-------

# rbd info ubuntu1204_base
2016-06-09 10:16:26.376164 7ffb8b3bb700 -1 librbd::image::OpenRequest: failed to retreive immutable metadata: (2) No such file or directory
rbd: error opening image ubuntu1204_base: (2) No such file or directory

# rbd info ubuntu1204_base --debug-ms 1
...
2016-06-09 10:16:57.720301 7fa6a2dfb700  1 -- 10.56.5.19:0/1307685403 <== osd.4 10.56.5.19:6802/5976 1 ==== osd_op_reply(3 rbd_header.4080512ae8944a [call,call] v0'0 uv156486030 ondisk = -2 ((2) No such file or directory)) v7 ==== 187+0+0 (1510277227 0 0) 0x7fa680000a60 con 0x7fa6880060e0
...

# rados -p rbd listomapvals rbd_header.4080512ae8944a
(no output)
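
The empty omap listing above suggests a way to look for other affected images directly at the object level: list the rbd_header.* objects in the pool and report those for which "rados listomapvals" prints nothing. This is a sketch under assumptions: the function name list_empty_headers is illustrative, the pool defaults to "rbd", and only images that use rbd_header.<id> objects are covered.

```shell
#!/bin/sh
# Sketch: print rbd_header.* objects whose omap listing is empty.
# The pool defaults to "rbd"; pass another pool name as the first argument.
list_empty_headers() {
    pool="${1:-rbd}"
    rados -p "$pool" ls | grep '^rbd_header\.' | while IFS= read -r obj; do
        # Skip objects for which the command itself fails, so that an
        # error is not mistaken for an empty omap.
        vals=$(rados -p "$pool" listomapvals "$obj" </dev/null 2>/dev/null) || continue
        if [ -z "$vals" ]; then
            printf '%s\n' "$obj"
        fi
    done
}
```

Listing every object in a pool can take a while on a larger cluster, so this is better suited to a one-off check than to regular monitoring.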

Details:
-------

I suspect the upgrade to jewel is the source of the problem. All upgrades were rolling: I installed the new packages, restarted the monitors one by one and then restarted the osds, node by node. osd noout was set throughout all the upgrades. When upgrading 12.04 -> 14.04, I installed the trusty packages once the distribution upgrade had finished and before rebooting the servers.

(When upgrading from 0.94.6-1precise to 0.94.7-1precise I hit a presumably unrelated bug that kept segfaulting the monitors. This seems to have been related to an old version of radosgw running on another server. The monitors became stable when I turned it off, but now radosgw itself seems to segfault in the same way and won't start. I have not looked into this further.)

When upgrading to jewel there was a significant period of time when both hammer and jewel osds were active, as I took some time to run chown on the files. I don't know when the problem started, but I think that the number of inaccessible rbd images went up during this procedure.
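
Whether osds of both releases are running at the same time can be checked by collecting the version string each osd reports. The sketch below assumes that "ceph tell osd.* version" is available and that each reply contains a string of the form "ceph version X.Y.Z"; the function name osd_versions is illustrative.

```shell
#!/bin/sh
# Sketch: print the distinct Ceph versions reported by the running osds.
# More than one line of output means a mixed-version cluster.
osd_versions() {
    ceph tell 'osd.*' version \
        | grep -oE 'ceph version [0-9]+\.[0-9]+\.[0-9]+' \
        | sort -u
}
```

Only osds that are up and answering are included, so an osd that is down during the upgrade does not show up in the result.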

The images are all used in virtual machines (qemu-kvm) and the clients should all be firefly, hammer or jewel.

The tunables were set to argonaut (if I understood the output correctly), but I later updated them to firefly. I did this after the problem had occurred, however.

Issue #15561 might be related.


Files

rbd_debug.txt (6.88 KB), output of: rbd info ubuntu1204_base --debug-rbd=20 --debug-ms=1, David Hedberg, 06/09/2016 01:04 PM
emptyomapvals.txt (2.64 KB), David Hedberg, 06/09/2016 01:44 PM