Bug #391 (closed)

snap create/delete caused corruption

Added by Andrew F over 13 years ago. Updated almost 12 years ago.

Status:
Closed
Priority:
Normal
Assignee:
Target version:
-
% Done:

0%


Description

Yesterday, I created and deleted an RBD snapshot; this morning, I was greeted by a number of dire warnings in dmesg:

[1795870.647296] attempt to access beyond end of device
[1795870.648590] vda: rw=0, want=14413400712, limit=20971520

as well as all sorts of I/O errors. Upon shutting down the VM and mounting its disk in another instance, though, the errors disappeared.
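Those numbers are 512-byte sector counts, so the limit matches the 10 GiB size of vda while the requested sector lies several terabytes past the end of the device. A quick sanity check in the shell, using the values from the dmesg line above:

# limit: 20971520 sectors * 512 bytes = 10 GiB (the size of vda)
echo $(( 20971520 * 512 / 1024 / 1024 / 1024 ))      # prints 10
# want: 14413400712 sectors * 512 bytes is roughly 6.7 TiB into the device
echo $(( 14413400712 * 512 / 1024 / 1024 / 1024 ))   # prints 6872, i.e. ~6.7 TiB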

I haven't tried to reproduce this yet, so I'm not sure if the snapshot is what really caused it, but it seems likely.

Actions #1

Updated by Sage Weil over 13 years ago

  • Status changed from New to Can't reproduce

this is old

Actions #2

Updated by Sage Weil over 13 years ago

  • Project changed from 3 to 6
  • Category deleted (9)
Actions #3

Updated by Yehuda Sadeh about 13 years ago

  • Status changed from Can't reproduce to In Progress

reopening

Actions #4

Updated by Andrew F about 13 years ago

Managed to reproduce this by running the following script (which creates and deletes snapshots) on the host while there was disk activity on the guest:

#!/bin/bash
# Repeatedly create and delete a libvirt snapshot of the guest, then remove
# the corresponding RBD snapshot from each of the listed disk images.
set -e
guest=$1
shift
disks="$*"
if [ -z "$guest" ]; then
        echo "Need a guest name"
        exit 1
fi
while true ; do
        date
        echo virsh snapshot-create "$guest"
        out="$( virsh snapshot-create "$guest" )"
        echo "snap: $out"
        # Pull the numeric snapshot name out of virsh's output.
        snap=$( echo "$out" | egrep -o '[0-9]+' )
        sleep 1
        echo virsh snapshot-delete "$guest" "$snap"
        virsh snapshot-delete "$guest" "$snap"
        sleep 1
        for d in $disks ; do
                echo rbd snap rm "$d" --snap "$snap"
                rbd snap rm "$d" --snap "$snap"
        done
        echo
done

As few as three cycles of snapshot creation/destruction are sufficient to cause filesystem corruption.
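For reference, the script takes the libvirt guest name as its first argument and the RBD image names (as passed to rbd snap rm) after it; an invocation would look something like the following, where snap-stress.sh, testvm and testvm-disk are placeholder names:

./snap-stress.sh testvm testvm-disk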

Actions #5

Updated by Josh Durgin about 13 years ago

  • Assignee set to Josh Durgin
Actions #6

Updated by Andrew F about 13 years ago

Using ext3freezer on the guest during the snapshotting doesn't help either — as far as I can tell, simply taking/removing the snapshot is enough to cause corruption.

Actions #7

Updated by Josh Durgin about 13 years ago

I haven't been able to reproduce this with the latest ceph and qemu-rbd. I'd like to upgrade the kvmtest cluster and see if it can be reproduced there.

Side note: virsh snapshot-delete does nothing (#390)

Actions #8

Updated by Andrew F about 13 years ago

I haven't been able to reproduce this with the latest ceph and qemu-rbd. I'd like to upgrade the kvmtest cluster and see if it can be reproduced there.

Were you able to reproduce it with the versions running on kvmtest? If so, go ahead and upgrade... if not, you may simply not have been hitting the bug. The corruption isn't always in the metadata, so it may help to check content integrity as well as running fsck.

Side note: virsh snapshot-delete does nothing (#390)

Right, hence the manual rbd snap rm.
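A minimal sketch of that manual cleanup, using vm-disk and 1295463456 as placeholder image and snapshot names; rbd snap ls is a quick way to confirm the snapshot is actually gone afterwards:

rbd snap ls vm-disk                     # list snapshots still present on the image
rbd snap rm vm-disk --snap 1295463456   # remove the stale snapshot by name
rbd snap ls vm-disk                     # verify it no longer appears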

Actions #9

Updated by Yehuda Sadeh about 13 years ago

Oh, I missed the external rbd tool call. That might have caused the problem: running rbd against the image while the VM is still using it is a likely cause of corruption. The whole point of the newer version is to make exactly that safe (via a new notification mechanism).
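If external snapshot operations have to happen on the older versions, shutting the guest down around them sidesteps the problem, since nothing then has the image open while it is being modified. A rough sketch of that (my assumption, not something tested against this bug; guest/disk/snap are placeholders):

virsh shutdown "$guest"              # request shutdown; wait for the domain to power off
rbd snap rm "$disk" --snap "$snap"   # safe now: no client has the image open
virsh start "$guest"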

Actions #10

Updated by Sage Weil over 12 years ago

  • Status changed from In Progress to Closed
Actions #11

Updated by Sage Weil almost 12 years ago

  • Project changed from 6 to rbd