Bug #64785: RBD persistent error corruption - rbd - Ceph

Bug #64785

open

RBD persistent error corruption

Added by Jacobus Erasmus about 2 months ago. Updated about 2 months ago.

Status: New

Priority: Normal

Severity: 3 - minor

Affected Versions: Ceph - v18.2.1

Description

When a virtual machine is set up with rbd_persistent_cache_mode=ssd and rbd_plugin=pwl_cache, and the virtual host runs out of memory, the RBD persistent cache gets damaged and the RBD images become inaccessible until "rbd persistent-cache invalidate" is run.

(In practice, the virtual machine is unbootable until an invalidate is run.)
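
For reference, the recovery step mentioned above could be scripted roughly as follows. This is a minimal sketch, not a fix for the underlying problem: it only builds (and optionally runs) the "rbd persistent-cache invalidate" command for one image. The helper names and the pool/image names are hypothetical. Invalidating discards the cache, so any writes that had not yet been flushed to the cluster are presumably lost.

```python
import subprocess
from typing import List


def invalidate_cache_argv(pool: str, image: str) -> List[str]:
    """Build the argv for `rbd persistent-cache invalidate <pool>/<image>`."""
    if not pool or not image:
        raise ValueError("pool and image must be non-empty")
    return ["rbd", "persistent-cache", "invalidate", f"{pool}/{image}"]


def invalidate_cache(pool: str, image: str) -> int:
    """Run the command and return its exit status.

    Needs the rbd CLI and access to the cluster, so it is not called below.
    """
    return subprocess.run(invalidate_cache_argv(pool, image), check=False).returncode


if __name__ == "__main__":
    # Hypothetical pool and image names; only prints the command.
    print(" ".join(invalidate_cache_argv("vms", "vm-disk-1")))
```

The command is built as an argv list rather than a shell string so that unusual image names do not need quoting.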

Updated by Ilya Dryomov about 2 months ago

Jacobus Erasmus wrote:

> If a virtual machine is set up with rbd_persistent_cache_mode=ssd and rbd_plugin=pwl_cache
>
> When the virtual host runs out of memory the rbd_persistent_cache gets damaged so that the rbd images become inaccessible until "rbd persistent-cache invalidate" is run

Hi Jacobus,

What exactly does "virtual host runs out of memory" amount to -- QEMU processes getting axed by the OOM killer or something else/worse?

> (Basically the virtual machine is unbootable until an invalidate is run).

What errors were observed when trying to boot the VM, if any? Does the VM seemingly hang, or does the QEMU process actually quit?

Was RBD logging enabled and if so at what "debug rbd" level?

Updated by Jacobus Erasmus about 2 months ago

Ilya Dryomov wrote:

> What exactly does "virtual host runs out of memory" amount to -- QEMU processes getting axed by the OOM killer or something else/worse?

The QEMU process gets axed by the OOM killer.
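
One way to confirm which processes the OOM killer took out is to scan the host's kernel log (for example the output of dmesg) for its kill messages. The sketch below assumes the usual "Out of memory: Killed process <pid> (<name>)" wording, which varies somewhat between kernel versions; the function name and the sample log lines are made up for illustration and are not taken from the affected host.

```python
import re
from typing import Iterable, List, Tuple

# Matches "Killed process" (newer kernels) and "Kill process" (older ones),
# and also the "Memory cgroup out of memory: ..." variant.
OOM_RE = re.compile(
    r"out of memory: Kill(?:ed)? process (\d+) \(([^)]+)\)", re.IGNORECASE
)


def find_oom_kills(log_lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (pid, process name) for every OOM-kill line found."""
    kills = []
    for line in log_lines:
        match = OOM_RE.search(line)
        if match:
            kills.append((int(match.group(1)), match.group(2)))
    return kills


# Illustrative lines only, not real output from the reporter's system.
SAMPLE_LOG = [
    "[12345.678901] qemu-system-x86 invoked oom-killer: gfp_mask=0x100cca, order=0",
    "[12345.679999] Out of memory: Killed process 4242 (qemu-system-x86) "
    "total-vm:8388608kB, anon-rss:4194304kB",
]

if __name__ == "__main__":
    for pid, name in find_oom_kills(SAMPLE_LOG):
        print(f"OOM killer terminated pid {pid} ({name})")
```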

>> (Basically the virtual machine is unbootable until an invalidate is run).

> What errors were observed when trying to boot the VM, if any? Does the VM seemingly hang, or does the QEMU process quit?

The VM boots, but the attached RBD image is just not available. If it's a boot drive, it will boot up to the boot sequence. If it's a data drive, it will simply not be accessible.

> Was RBD logging enabled and if so at what "debug rbd" level?

Sorry, no RBD logging. I do get a kernel panic when I try to run 'rbd persistent_cache flush'.

kill -9 or anything else does not seem to be a problem; this happens only under out-of-memory conditions.

Updated by Ilya Dryomov about 2 months ago

Jacobus Erasmus wrote:

> Ilya Dryomov wrote:
>
>> What errors were observed when trying to boot the VM, if any? Does the VM seemingly hang, or does the QEMU process quit?
>
> The VM boots, but the attached RBD image is just not available. If it's a boot drive, it will boot up to the boot sequence. If it's a data drive, it will simply not be accessible.

I suspect QEMU just masks the rbd_open() error by making the disk inaccessible.

Did you attempt to open the corresponding image in some other way, even something as simple as running "rbd info <imagename>" at that point? I'm looking for at least some log or error message.
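
A minimal sketch of how such an attempt could be captured for the ticket: it runs "rbd info" with RBD debug logging raised and keeps the exit status and both output streams. It assumes the common Ceph convention that "debug rbd" can be overridden on the command line as --debug-rbd=<level>; where the log output actually ends up depends on the client's logging configuration. The helper names and the image spec are hypothetical.

```python
import subprocess
import sys
from typing import List, Tuple


def rbd_info_argv(image_spec: str, debug_level: int = 20) -> List[str]:
    """Build `rbd info <image-spec>` with the "debug rbd" level raised."""
    return ["rbd", "info", image_spec, f"--debug-rbd={debug_level}"]


def run_and_capture(argv: List[str]) -> Tuple[int, str, str]:
    """Run a command; return (exit status, stdout, stderr) without raising on failure."""
    proc = subprocess.run(argv, capture_output=True, text=True, check=False)
    return proc.returncode, proc.stdout, proc.stderr


if __name__ == "__main__":
    # Hypothetical image spec unless one is given; only prints the command.
    spec = sys.argv[1] if len(sys.argv) > 1 else "vms/vm-disk-1"
    print(" ".join(rbd_info_argv(spec)))
```

Passing the result of rbd_info_argv() to run_and_capture() on the VM host would give the error text to attach here, even when the command fails.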

>> Was RBD logging enabled and if so at what "debug rbd" level?
>
> Sorry, no RBD logging. I do get a kernel panic when I try to run 'rbd persistent_cache flush'.

A kernel panic in the VM or on the host? Do you have any of the snippets captured?

Also, by "when I try 'rbd persistent_cache flush'" perhaps you meant "after I run 'rbd persistent_cache flush' and try to boot the VM again"? I'm trying to establish the sequence of events.

> kill -9 or anything else does not seem to be a problem; this happens only under out-of-memory conditions.

This is weird because from the QEMU process perspective it shouldn't be different to kill -9. Do you co-locate OSDs or anything else related to Ceph on VM hosts?
