Bug #14522

RBD Snapshot locks Libvirt Disk With Active Writes

Added by Tyler Wilson about 8 years ago. Updated about 8 years ago.

Status:
Duplicate
Priority:
High
Assignee:
Jason Dillaman
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

On the latest EL7 release of Libvirt (with RDO OpenStack Kilo), issuing a simple snapshot of a VM with active writes locks up its RBD disk. The VM console shows a kernel I/O timeout message, and libvirt eventually reports a lock error: "Cannot start job (modify, none) for domain instance-000007ae; current job is (modify, none) owned by (42651 qemuProcessReconnect, 0 <null>) for (1193s, 0s) Timed out during operation: cannot acquire state change lock (held by qemuProcessReconnect)". The only fix is to hard reboot the VM. This is reproducible 100% of the time.

$ ceph -v
ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)

$ rpm -qa | grep libvirt
libvirt-client-1.2.17-13.el7_2.2.x86_64
libvirt-daemon-driver-interface-1.2.17-13.el7_2.2.x86_64
libvirt-daemon-driver-nodedev-1.2.17-13.el7_2.2.x86_64
libvirt-daemon-driver-network-1.2.17-13.el7_2.2.x86_64
libvirt-daemon-kvm-1.2.17-13.el7_2.2.x86_64
libvirt-daemon-driver-qemu-1.2.17-13.el7_2.2.x86_64
libvirt-daemon-driver-storage-1.2.17-13.el7_2.2.x86_64
libvirt-python-1.2.17-2.el7.x86_64
libvirt-daemon-driver-secret-1.2.17-13.el7_2.2.x86_64
libvirt-daemon-1.2.17-13.el7_2.2.x86_64
libvirt-daemon-driver-nwfilter-1.2.17-13.el7_2.2.x86_64

$ rpm -qa | grep rbd
python-rbd-0.94.5-0.el7.centos.x86_64
librbd1-0.94.5-0.el7.centos.x86_64

Snapshot Command I am doing:
$ rbd snap create compute/73ce4399-a757-4c0e-b1f4-471a83c15424_disk@testsnap

Snapshot created fine:
$ rbd snap ls compute/73ce4399-a757-4c0e-b1f4-471a83c15424_disk
SNAPID NAME SIZE
2252 testsnap 200 GB

Attachment: gdb_output.txt (176 KB) - Tyler Wilson, 01/27/2016 11:36 PM

History

#1 Updated by Samuel Just about 8 years ago

  • Project changed from Ceph to rbd

#2 Updated by Jason Dillaman about 8 years ago

  • Status changed from New to In Progress
  • Assignee set to Jason Dillaman

#3 Updated by Jason Dillaman about 8 years ago

Tyler, would it be possible for you to attach the librbd log from qemu with "debug rbd = 20" from shortly before creating a snapshot? I will try to reproduce locally, but having the log might make it faster to find the cause.

#4 Updated by Tyler Wilson about 8 years ago

Hello Jason,

Thanks for taking assignment of this. Here is the output from creating a snapshot with debug enabled;

$ rbd snap create compute/73ce4399-a757-4c0e-b1f4-471a83c15424_disk@testsnap
2016-01-27 18:32:09.662351 7f7b4f1647c0 20 librbd::ImageCtx: enabling caching...
2016-01-27 18:32:09.662427 7f7b4f1647c0 20 librbd::ImageCtx: Initial cache settings: size=67108864 num_objects=10 max_dirty=25165824 target_dirty=16777216 max_dirty_age=1
2016-01-27 18:32:09.662750 7f7b4f1647c0 20 librbd: open_image: ictx = 0x3107500 name = '73ce4399-a757-4c0e-b1f4-471a83c15424_disk' id = '' snap_name = ''
2016-01-27 18:32:09.666224 7f7b4f1647c0 20 librbd: detect format of 73ce4399-a757-4c0e-b1f4-471a83c15424_disk : new
2016-01-27 18:32:09.672814 7f7b4f1647c0 10 librbd::ImageCtx:  cache bytes 67108864 -> about 2000 objects
2016-01-27 18:32:09.672820 7f7b4f1647c0 10 librbd::ImageCtx: init_layout stripe_unit 4194304 stripe_count 1 object_size 4194304 prefix rbd_data.10ebd8f1a0c5d4b format rbd_data.10ebd8f1a0c5d4b.%016llx
2016-01-27 18:32:09.672868 7f7b4f1647c0 10 librbd::ImageWatcher: 0x3118a70 registering image watcher
2016-01-27 18:32:09.677398 7f7b4f1647c0 20 librbd: ictx_refresh 0x3107500
2016-01-27 18:32:09.680607 7f7b4f1647c0 20 librbd::ImageCtx: enabling caching...
2016-01-27 18:32:09.680633 7f7b4f1647c0 20 librbd::ImageCtx: Initial cache settings: size=67108864 num_objects=10 max_dirty=25165824 target_dirty=16777216 max_dirty_age=1
2016-01-27 18:32:09.680676 7f7b4f1647c0 20 librbd: open_image: ictx = 0x311a5c0 name = '' id = '10968c12ae8944a' snap_name = ''
2016-01-27 18:32:09.685589 7f7b4f1647c0 10 librbd::ImageCtx:  cache bytes 67108864 -> about 2000 objects
2016-01-27 18:32:09.685604 7f7b4f1647c0 10 librbd::ImageCtx: init_layout stripe_unit 4194304 stripe_count 1 object_size 4194304 prefix rbd_data.10968c12ae8944a format rbd_data.10968c12ae8944a.%016llx
2016-01-27 18:32:09.685607 7f7b4f1647c0 20 librbd: ictx_refresh 0x311a5c0
2016-01-27 18:32:09.694336 7f7b4f1647c0 20 librbd: new snapshot id=189 name=snap size=53687091200 features=1
2016-01-27 18:32:09.694356 7f7b4f1647c0 20 librbd::ImageCtx: finished flushing cache
2016-01-27 18:32:09.694372 7f7b4f1647c0 20 librbd: snap_create 0x3107500 testsnap
2016-01-27 18:32:09.694373 7f7b4f1647c0 20 librbd: ictx_check 0x3107500
2016-01-27 18:32:09.694377 7f7b4f1647c0 20 librbd: snap_create_helper 0x3107500 testsnap
2016-01-27 18:32:09.694378 7f7b4f1647c0 20 librbd: ictx_check 0x3107500
2016-01-27 18:32:09.694379 7f7b4f1647c0 20 librbd::ImageCtx: finished flushing cache
2016-01-27 18:32:10.154027 7f7b4f1647c0 20 librbd: notify_change refresh_seq = 0 last_refresh = 0
2016-01-27 18:32:10.154593 7f7b41869700 10 librbd::ImageWatcher: 0x3118a70 image header updated
2016-01-27 18:32:10.155820 7f7b4f1647c0 20 librbd: close_image 0x3107500
2016-01-27 18:32:10.155828 7f7b4f1647c0 10 librbd::ImageCtx: canceling async requests: count=0
2016-01-27 18:32:10.156043 7f7b4f1647c0 20 librbd: close_image 0x311a5c0
2016-01-27 18:32:10.156046 7f7b4f1647c0 10 librbd::ImageCtx: canceling async requests: count=0
2016-01-27 18:32:10.156170 7f7b4f1647c0 10 librbd::ImageWatcher: 0x3118a70 unregistering image watcher

I can confirm the disk is now locked directly after (console is unresponsive).

#5 Updated by Jason Dillaman about 8 years ago

Thanks -- I was actually hoping to get the librbd logs from the qemu process itself before it locks up due to the snap_create (if at all possible).

#6 Updated by Tyler Wilson about 8 years ago

Hello Jason,

I can't seem to get the logs you are looking for; where would they be written? Here is my current ceph.conf on the hardware node housing the VM (and colocated OSDs):

[global]
fsid = $fsid
mon_initial_members = cpl001, cpl002, cpl003
mon_host = 10.0.0.1,10.0.0.2,10.0.0.4
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
osd_pool_default_size = 2
osd_pool_default_min_size = 1
osd_pool_default_pg_num = 128
public_network = 10.0.0.0/23
osd_journal_size = 2048
auth_supported = cephx
osd_pool_default_pgp_num = 128
osd_mkfs_type = xfs
cluster_network = 10.2.0.0/15
osd_mkfs_options_xfs = -i size=2048
osd_mount_options_xfs = noatime,inode64
rbd default format = 2
debug rbd = 20

[osd]
osd_journal_size = 2048
osd_mkfs_type = xfs
osd_mkfs_options_xfs = -f -i size=2048
osd_mount_options_xfs = noatime,inode64
osd_op_threads = 8
max_open_files = 327680
filestore_wbthrottle_xfs_bytes_start_flusher = 419430400
filestore_wbthrottle_xfs_bytes_hard_limit = 4194304000
filestore_wbthrottle_xfs_ios_start_flusher = 5000
filestore_wbthrottle_xfs_ios_hard_limit = 50000
filestore_wbthrottle_xfs_inodes_start_flusher = 5000
filestore_wbthrottle_xfs_inodes_hard_limit = 50000

[client]
rbd_cache = true
rbd_cache_size = 67108864
debug rbd = 20
debug librbd = 20
log file = /var/log/ceph/client.log

/var/log/ceph/client.log only has the same info as above when I run the command.

#7 Updated by Jason Dillaman about 8 years ago

Hmm -- do you have SELinux enabled? You might be experiencing an SELinux denial that prevents qemu from writing to /var/log/ceph/client.log due to a mismatched context.

#8 Updated by Tyler Wilson about 8 years ago

SELinux is enabled, however no denials appear in the audit log. I disabled SELinux and retried; same result.

#9 Updated by Jason Dillaman about 8 years ago

Try updating your ceph.conf as described here: http://docs.ceph.com/docs/master/rbd/rbd-openstack/#configuring-nova

#10 Updated by Tyler Wilson about 8 years ago

Updated ceph.conf to that document;

$ cat /etc/ceph/ceph.conf
[global]
fsid = $fsid
mon_initial_members = cpl001, cpl002, cpl003
mon_host = 10.0.0.1,10.0.0.2,10.0.0.4
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
osd_pool_default_size = 2
osd_pool_default_min_size = 1
osd_pool_default_pg_num = 128
public_network = 10.0.0.0/23
osd_journal_size = 2048
auth_supported = cephx
osd_pool_default_pgp_num = 128
osd_mkfs_type = xfs
cluster_network = 10.2.0.0/15
osd_mkfs_options_xfs = -i size=2048
osd_mount_options_xfs = noatime,inode64
rbd default format = 2
debug rbd = 20

[osd]
osd_journal_size = 2048
osd_mkfs_type = xfs
osd_mkfs_options_xfs = -f -i size=2048
osd_mount_options_xfs = noatime,inode64
osd_op_threads = 8
max_open_files = 327680
filestore_wbthrottle_xfs_bytes_start_flusher = 419430400
filestore_wbthrottle_xfs_bytes_hard_limit = 4194304000
filestore_wbthrottle_xfs_ios_start_flusher = 5000
filestore_wbthrottle_xfs_ios_hard_limit = 50000
filestore_wbthrottle_xfs_inodes_start_flusher = 5000
filestore_wbthrottle_xfs_inodes_hard_limit = 50000

[client]
rbd_cache = true
rbd_cache_size = 67108864
rbd cache writethrough until flush = true
admin socket = /var/run/ceph/guests/$cluster-$type.$id.$pid.$cctid.asok
log file = /var/log/qemu/qemu-guest-$pid.log
rbd concurrent management ops = 20
debug rbd = 20
debug librbd = 20

Restarted ceph/libvirtd and the VM in question. Re-ran the snapshot command, and it looks like the same data was returned:

$ cat /var/log/qemu/qemu-guest-297421.log
2016-01-27 21:23:39.688611 7f501fdcd7c0 20 librbd::ImageCtx: enabling caching...
2016-01-27 21:23:39.688642 7f501fdcd7c0 20 librbd::ImageCtx: Initial cache settings: size=67108864 num_objects=10 max_dirty=25165824 target_dirty=16777216 max_dirty_age=1
2016-01-27 21:23:39.688718 7f501fdcd7c0 20 librbd: open_image: ictx = 0x2d877d0 name = '73ce4399-a757-4c0e-b1f4-471a83c15424_disk' id = '' snap_name = ''
2016-01-27 21:23:39.691494 7f501fdcd7c0 20 librbd: detect format of 73ce4399-a757-4c0e-b1f4-471a83c15424_disk : new
2016-01-27 21:23:39.697437 7f501fdcd7c0 10 librbd::ImageCtx:  cache bytes 67108864 -> about 2000 objects
2016-01-27 21:23:39.697443 7f501fdcd7c0 10 librbd::ImageCtx: init_layout stripe_unit 4194304 stripe_count 1 object_size 4194304 prefix rbd_data.10ebd8f1a0c5d4b format rbd_data.10ebd8f1a0c5d4b.%016llx
2016-01-27 21:23:39.697479 7f501fdcd7c0 10 librbd::ImageWatcher: 0x2d98ed0 registering image watcher
2016-01-27 21:23:39.702395 7f501fdcd7c0 20 librbd: ictx_refresh 0x2d877d0
2016-01-27 21:23:39.710301 7f501fdcd7c0 20 librbd: new snapshot id=2278 name=test_snapshot size=214748364800 features=1
2016-01-27 21:23:39.710355 7f501fdcd7c0 20 librbd::ImageCtx: enabling caching...
2016-01-27 21:23:39.710387 7f501fdcd7c0 20 librbd::ImageCtx: Initial cache settings: size=67108864 num_objects=10 max_dirty=25165824 target_dirty=16777216 max_dirty_age=1
2016-01-27 21:23:39.710424 7f501fdcd7c0 20 librbd: open_image: ictx = 0x2da42c0 name = '' id = '10968c12ae8944a' snap_name = ''
2016-01-27 21:23:39.715799 7f501fdcd7c0 10 librbd::ImageCtx:  cache bytes 67108864 -> about 2000 objects
2016-01-27 21:23:39.715816 7f501fdcd7c0 10 librbd::ImageCtx: init_layout stripe_unit 4194304 stripe_count 1 object_size 4194304 prefix rbd_data.10968c12ae8944a format rbd_data.10968c12ae8944a.%016llx
2016-01-27 21:23:39.715819 7f501fdcd7c0 20 librbd: ictx_refresh 0x2da42c0
2016-01-27 21:23:39.725199 7f501fdcd7c0 20 librbd: new snapshot id=189 name=snap size=53687091200 features=1
2016-01-27 21:23:39.725227 7f501fdcd7c0 20 librbd::ImageCtx: finished flushing cache
2016-01-27 21:23:39.725243 7f501fdcd7c0 20 librbd::ImageCtx: finished flushing cache
2016-01-27 21:23:39.725249 7f501fdcd7c0 20 librbd: snap_create 0x2d877d0 test_snapshot2
2016-01-27 21:23:39.725250 7f501fdcd7c0 20 librbd: ictx_check 0x2d877d0
2016-01-27 21:23:39.725254 7f501fdcd7c0 20 librbd: snap_create_helper 0x2d877d0 test_snapshot2
2016-01-27 21:23:39.725255 7f501fdcd7c0 20 librbd: ictx_check 0x2d877d0
2016-01-27 21:23:39.725256 7f501fdcd7c0 20 librbd::ImageCtx: finished flushing cache
2016-01-27 21:23:40.666100 7f501fdcd7c0 20 librbd: notify_change refresh_seq = 0 last_refresh = 0
2016-01-27 21:23:40.666606 7f500d7fa700 10 librbd::ImageWatcher: 0x2d98ed0 image header updated
2016-01-27 21:23:40.667308 7f501fdcd7c0 20 librbd: close_image 0x2d877d0
2016-01-27 21:23:40.667316 7f501fdcd7c0 10 librbd::ImageCtx: canceling async requests: count=0
2016-01-27 21:23:40.667397 7f501fdcd7c0 20 librbd: close_image 0x2da42c0
2016-01-27 21:23:40.667400 7f501fdcd7c0 10 librbd::ImageCtx: canceling async requests: count=0
2016-01-27 21:23:40.667485 7f501fdcd7c0 10 librbd::ImageWatcher: 0x2d98ed0 unregistering image watcher

Tried dumping the Ceph log for the socket as well, no dice;

$ ceph daemon /var/run/ceph/guests/ceph-client.compute.290624.140088947728384.asok log dump
{}

#11 Updated by Jason Dillaman about 8 years ago

Odd -- no file named /var/log/qemu/qemu-guest-290624.log was created (based on the PID of your admin socket)? Can you describe the I/O load of the guest VM when the snapshot occurs? Is it something simple that I can repeat in my environment?

#12 Updated by Tyler Wilson about 8 years ago

Correct;

$ ls -la  /var/log/qemu/qemu-guest-290624.log
ls: cannot access /var/log/qemu/qemu-guest-290624.log: No such file or directory
$ ls -la /var/run/ceph/guests/*
srwxrwxr-x. 1 qemu qemu 0 Jan 27 21:19 /var/run/ceph/guests/ceph-client.compute.290624.140088947728384.asok

Unfortunately this is hard to reproduce on test instances: the previously affected instances can reliably reproduce the issue, but new instances created for testing purposes don't seem to have the problem. One common trait among the affected instances is that they have lots of open files (web servers, database servers); however, I do not think this is a factor, as I was able to reproduce the lock on a VM booted single-user (init=/bin/bash) with a read-only disk.

#13 Updated by Jason Dillaman about 8 years ago

Another option: install the ceph-debuginfo package, attach gdb to the hung qemu process, run "thread apply all bt", and paste the resulting traces to this ticket -- or alternatively, run gcore against the process to generate a core dump and upload it via ceph-post-file.

#14 Updated by Tyler Wilson about 8 years ago

Hello Jason,

I got the instance to lock once more and ran gdb attach $pid, then "thread apply all bt". Attached is the output.

#15 Updated by Tyler Wilson about 8 years ago

Tyler Wilson wrote:

Hello Jason,

I got the instance to lock once more and ran gdb attach $pid, then "thread apply all bt". Attached is the output.

Uploaded the core dump as well; ceph-post-file: 0311e037-5bdc-4c27-816c-1a67cce8041b

#16 Updated by Jason Dillaman about 8 years ago

Perfect, thanks!

#17 Updated by Jason Dillaman about 8 years ago

Is this only occurring on VMs where the RBD cache is disabled?

#18 Updated by Tyler Wilson about 8 years ago

Hello,

The RBD cache is enabled in ceph.conf (rbd_cache = true, rbd_cache_size = 67108864) for the client on all hypervisors, and the following disk definition is common to the libvirt XML of every instance:

    <disk type='network' device='disk'>
      <driver name='qemu' type='raw' cache='none'/>
      <auth username='compute'>
        <secret type='ceph' uuid='$secret'/>
      </auth>
      <source protocol='rbd' name='compute/73ce4399-a757-4c0e-b1f4-471a83c15424_disk'>
        <host name='10.0.0.1' port='6789'/>
        <host name='10.0.0.2' port='6789'/>
        <host name='10.0.0.3' port='6789'/>
        <host name='10.0.0.4' port='6789'/>
      </source>
      <backingStore/>
      <target dev='vda' bus='virtio'/>
      <iotune>
        <total_bytes_sec>614400000</total_bytes_sec>
        <read_iops_sec>3000</read_iops_sec>
        <write_iops_sec>1500</write_iops_sec>
      </iotune>
      <alias name='virtio-disk0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
    </disk>

#19 Updated by Jason Dillaman about 8 years ago

cache='none' in the libvirt XML is disabling the cache, overriding rbd_cache = true in ceph.conf.
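For reference, enabling the cache for this guest would mean changing the driver line of the disk XML quoted above; a minimal sketch, with only the cache attribute differing:

```xml
<driver name='qemu' type='raw' cache='writeback'/>
```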

#20 Updated by Jason Dillaman about 8 years ago

I might have (semi-)good news for you: I believe this is the same issue as tracked under #13755. The fix is already incorporated into the forthcoming 0.94.6 release.

In the meantime, you should be able to work around the issue by adding "rbd_non_blocking_aio = false" to your ceph.conf, or by enabling the librbd cache (change cache='none' to cache='writeback' or cache='writethrough' in the <driver> element of your libvirt XML).
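As a sketch, the ceph.conf workaround translates to a fragment like the following (the [client] section name is taken from the configuration posted in comment #10):

```ini
[client]
# Workaround for the issue tracked under #13755, until 0.94.6 lands:
# fall back to blocking AIO handling in librbd
rbd_non_blocking_aio = false
```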

#21 Updated by Josh Durgin about 8 years ago

For OpenStack, to enable rbd caching you'll need to put this in the [libvirt] section of /etc/nova/nova.conf on each compute node:

disk_cachemodes="network=writeback" 
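In context, that is a nova.conf fragment like this (other [libvirt] options elided); the value tells nova to request writeback caching for network-backed disks such as rbd:

```ini
[libvirt]
disk_cachemodes = "network=writeback"
```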

#22 Updated by Tyler Wilson about 8 years ago

Hello Jason/Josh,

I can confirm both of those changes will fix the issue. Thanks for your assistance!

#23 Updated by Jason Dillaman about 8 years ago

  • Status changed from In Progress to Duplicate

Excellent -- thanks for all your help with debugging. Closing this ticket as a duplicate of issue #13755.
