Bug #19187

Delete/discard operations initiated by a qemu/kvm guest get stuck

Added by Adam Wolfe Gordon about 7 years ago. Updated almost 7 years ago.

Status: Closed
Priority: High
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We are frequently seeing delete/discard operations get stuck on rbd devices attached to qemu/kvm VMs. In the guest, the issue presents itself as stuck I/O, with any additional I/O issued to the device also getting stuck. The two columns of /sys/block/sda/inflight are reads and writes in flight, so here two writes are stuck:

guest# cat /sys/block/sda/inflight
       0        2
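
To confirm the I/O really is stuck rather than just slow, one can sample the counters a few times and check that they don't change (just a sketch; the device name and interval are arbitrary):

guest# for i in 1 2 3; do cat /sys/block/sda/inflight; sleep 10; done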

On the host we can see the stuck delete operation in the objecter_requests output from the client's admin socket:

host# ceph --admin-daemon /var/run/ceph/rbd-ceph-client.hypervisor01.104686.140551808376528.asok objecter_requests
{
    "ops": [
        {
            "tid": 2827,
            "pg": "4.b7604fd9",
            "osd": 708,
            "object_id": "rbd_data.3d31d74bb5ef91.0000000000000341",
            "object_locator": "@4",
            "target_object_id": "rbd_data.3d31d74bb5ef91.0000000000000341",
            "target_object_locator": "@4",
            "paused": 0,
            "used_replica": 0,
            "precalc_pgid": 0,
            "last_sent": "713765s",
            "attempts": 1,
            "snapid": "head",
            "snap_context": "0=[]",
            "mtime": "2017-03-03 18:46:08.0.612098s",
            "osd_ops": [
                "delete" 
            ]
        }
    ],
    "linger_ops": [
        {
            "linger_id": 1,
            "pg": "4.ccf574ff",
            "osd": 205,
            "object_id": "rbd_header.3d31d74bb5ef91",
            "object_locator": "@4",
            "target_object_id": "rbd_header.3d31d74bb5ef91",
            "target_object_locator": "@4",
            "paused": 0,
            "used_replica": 0,
            "precalc_pgid": 0,
            "snapid": "head",
            "registered": "1" 
        }
    ],
    "pool_ops": [],
    "pool_stat_ops": [],
    "statfs_ops": [],
    "command_ops": []
}

host# date
Fri Mar  3 20:02:25 UTC 2017
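
The op's mtime above is 18:46:08 and the current time is 20:02:25, so the delete has been outstanding for well over an hour with only a single attempt. One way to check whether the op is stuck on the client or on the OSD side would be to dump the in-flight and recent ops on the target OSD (just a sketch, assuming access to the host running osd.708 and that its admin socket is available):

osd-host# ceph daemon osd.708 dump_ops_in_flight | grep -B2 -A10 rbd_data.3d31d74bb5ef91
osd-host# ceph daemon osd.708 dump_historic_ops | grep -B2 -A10 rbd_data.3d31d74bb5ef91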

We have been able to reproduce this both with mkfs.ext4 (with its default discard setting) and by attaching an rbd device to the VM and then running:

mkfs.ext4 -E nodiscard -F /dev/sda
mount -o nodiscard /dev/sda /mnt
dd if=/dev/urandom of=/mnt/big-file bs=1M count=200 oflag=sync
dd if=/dev/zero of=/mnt/big-file bs=1M count=200 oflag=sync
fstrim /mnt

The discard doesn't get stuck 100% of the time, but it happens often enough that we can reproduce the issue at will. (Presumably the stuck op shows up as a plain delete in osd_ops because a discard that covers an entire object is issued as an object delete.)

Version info:

host# sudo ceph --version
ceph version 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)
host# uname -a
Linux nbg1node863 3.13.0-110-generic #157-Ubuntu SMP Mon Feb 20 11:54:05 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
host# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 14.04.5 LTS
Release:        14.04
Codename:       trusty

I've attached client logs; we have debug rbd and debug rados set to 10.
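
(For anyone trying to reproduce: the debug levels can also be bumped on an already-running client through its admin socket; just a sketch, reusing the asok path from above:)

host# ceph --admin-daemon /var/run/ceph/rbd-ceph-client.hypervisor01.104686.140551808376528.asok config set debug_rbd 10
host# ceph --admin-daemon /var/run/ceph/rbd-ceph-client.hypervisor01.104686.140551808376528.asok config set debug_rados 10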


Files

logs.txt (515 KB) - Adam Wolfe Gordon, 03/03/2017 08:37 PM
rados-log.txt (91.4 KB) - rados delete with stuck client - Adam Wolfe Gordon, 03/07/2017 09:35 PM
rados-log-after-kill.txt (61.1 KB) - rados delete after killing stuck client - Adam Wolfe Gordon, 03/07/2017 09:36 PM
ops.txt (59.2 KB) - Adam Wolfe Gordon, 03/09/2017 09:34 PM