Bug #16360 (closed): Image not cleanly deleted with rbd on bluestore

Added by Alexandre Becholey almost 8 years ago. Updated about 7 years ago.

Status: Can't reproduce
Priority: Normal
Assignee:
Category: -
Target version: -
% Done: 0%
Source: other
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Issue

I was playing with rbd to create, delete and snapshot images so that I can keep VM disks on Ceph. I have three images (Debian, Ubuntu Trusty and Xenial cloud images) that were snapshotted and protected so that I can clone them (a sketch of the full cycle follows the list below).

For each VM test, I did the following:

  • clone a disk with rbd clone libvirt/_IMG_trusty-server-cloudimg-amd64@snap libvirt/test1
  • spin the VM and do some tests
  • delete the VM
  • delete the disk with rbd rm libvirt/test1
  • repeat
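
For reference, a minimal sketch of the full cycle mentioned above, assuming the base image was imported from the cloud image file before being snapshotted and protected (the import command and source file name are assumptions; the clone and rm lines match what was actually run in the loop):

# rbd import trusty-server-cloudimg-amd64-disk1.img libvirt/_IMG_trusty-server-cloudimg-amd64
# rbd snap create libvirt/_IMG_trusty-server-cloudimg-amd64@snap
# rbd snap protect libvirt/_IMG_trusty-server-cloudimg-amd64@snap
# rbd clone libvirt/_IMG_trusty-server-cloudimg-amd64@snap libvirt/test1
# rbd rm libvirt/test1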

Everything was fine until one iteration failed to cleanly delete test1:

# rbd rm libvirt/test1; rbd clone libvirt/_IMG_trusty-server-cloudimg-amd64@snap libvirt/test1 && rbd ls -l libvirt
2016-06-17 10:31:32.353948 7fbfc3b7e100 -1 WARNING: the following dangerous and experimental features are enabled: *
2016-06-17 10:31:32.353988 7fbfc3b7e100 -1 WARNING: the following dangerous and experimental features are enabled: *
2016-06-17 10:31:32.355525 7fbfc3b7e100 -1 WARNING: the following dangerous and experimental features are enabled: *
Removing image: 100% complete...done.
2016-06-17 10:31:32.454006 7efca05e2100 -1 WARNING: the following dangerous and experimental features are enabled: *
2016-06-17 10:31:32.454049 7efca05e2100 -1 WARNING: the following dangerous and experimental features are enabled: *
2016-06-17 10:31:32.455920 7efca05e2100 -1 WARNING: the following dangerous and experimental features are enabled: *
2016-06-17 10:31:32.494994 7efca05e2100 -1 librbd: error adding image to directory: (17) File exists
2016-06-17 10:31:32.497393 7efca05e2100 -1 librbd: error creating child: (17) File exists
rbd: clone error: (17) File exists

Is it possible to purge test1 without having to delete the pools and/or the OSDs?
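
A possible way to purge the leftover directory entry by hand, shown only as a sketch (the name_*/id_* omap key naming of the rbd_directory object is assumed from the RBD format 2 layout and has not been verified here):

# rados -p libvirt listomapvals rbd_directory
# rados -p libvirt rmomapkey rbd_directory name_test1
# rados -p libvirt rmomapkey rbd_directory id_<id reported for test1 in the listing above>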

Details

NB: the experimental-feature warnings have been removed from the console outputs below.

I have a small lab cluster with 3 nodes. Each has:

  • a MON
  • a MDS
  • 12 bluestore OSDs created with:
    ceph-disk -v prepare --bluestore --dmcrypt /dev/sda
    Each OSD has an entry in ceph.conf to store its RocksDB on the SSDs (which were previously used for journals):
      [osd.0]
      host = storage0
      bluestore block db path = /dev/mapper/vgCephJournal_md125-ceph_block_db_0
      bluestore block wal path = /dev/mapper/vgCephJournal_md125-ceph_block_wal_0
      

    It automatically created the symlinks:
    # ls -l /var/lib/ceph/osd/ceph-0/block*
    lrwxrwxrwx 1 ceph ceph 48 Jun 16 15:20 /var/lib/ceph/osd/ceph-0/block -> /dev/mapper/2a0d2cb0-afc2-4cb8-a64d-6c558a341318
    lrwxrwxrwx 1 ceph ceph 47 Jun 16 15:20 /var/lib/ceph/osd/ceph-0/block.db -> /dev/mapper/vgCephJournal_md125-ceph_block_db_0
    lrwxrwxrwx 1 ceph ceph 58 Jun 16 15:20 /var/lib/ceph/osd/ceph-0/block_dmcrypt -> /dev/disk/by-partuuid/2a0d2cb0-afc2-4cb8-a64d-6c558a341318
    -rw-r--r-- 1 ceph ceph 37 Jun 16 15:20 /var/lib/ceph/osd/ceph-0/block_uuid
    lrwxrwxrwx 1 ceph ceph 48 Jun 16 15:20 /var/lib/ceph/osd/ceph-0/block.wal -> /dev/mapper/vgCephJournal_md125-ceph_block_wal_0
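
To double-check that a running OSD actually picked up the db/wal overrides, the admin socket on the OSD's host can be queried; this is only a suggested verification (the underscore option names correspond to the ceph.conf keys above):

# ceph daemon osd.0 config show | grep bluestore_block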
    

I have a fourth node which also hosts a MON.

All 4 machines are up-to-date Ubuntu Xenial with the following Ceph version:

# ceph --version
ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)

The cluster has 3 pools:

# ceph df
GLOBAL:
    SIZE       AVAIL      RAW USED     %RAW USED 
    23265G     20283G        2982G         12.82 
POOLS:
    NAME         ID     USED       %USED     MAX AVAIL     OBJECTS 
    libvirt      4       3344M      0.04         5548G         850 
    data         5        679G      8.76         5548G      173977 
    metadata     6      25278k         0         5548G          26

libvirt and data each have 512 PGs, and metadata has 128.
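
For completeness, pools with these PG counts would be created with commands of the following form (the exact original commands are not recorded here; the pg_num/pgp_num values are just the ones reported above):

# ceph osd pool create libvirt 512 512
# ceph osd pool create data 512 512
# ceph osd pool create metadata 128 128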

The cluster is healthy:

# ceph -s
    cluster 2127ffaa-3d77-4f88-87ee-8e2278e4c074
     health HEALTH_OK
     monmap e10: 4 mons at {compute0=172.16.1.10:6789/0,storage0=172.16.1.20:6789/0,storage1=172.16.1.21:6789/0,storage2=172.16.1.22:6789/0}
            election epoch 78, quorum 0,1,2,3 compute0,storage0,storage1,storage2
      fsmap e75: 1/1/1 up {0=storage1.public=up:active}, 2 up:standby
     osdmap e645: 36 osds: 36 up, 36 in
            flags sortbitwise
      pgmap v8395: 1152 pgs, 3 pools, 682 GB data, 170 kobjects
            2982 GB used, 20283 GB / 23265 GB avail
                1152 active+clean

Now, when I try to get information about test1, I get the following:

# rbd ls libvirt
_IMG_debian-8.5.0-openstack-amd64
_IMG_trusty-server-cloudimg-amd64
_IMG_xenial-server-cloudimg-amd64
test1
# rbd ls -l libvirt
NAME                                    SIZE PARENT FMT PROT LOCK 
_IMG_debian-8.5.0-openstack-amd64      2048M          2           
_IMG_debian-8.5.0-openstack-amd64@snap 2048M          2 yes       
_IMG_trusty-server-cloudimg-amd64      2252M          2           
_IMG_trusty-server-cloudimg-amd64@snap 2252M          2 yes       
_IMG_xenial-server-cloudimg-amd64      2252M          2           
_IMG_xenial-server-cloudimg-amd64@snap 2252M          2 yes       
rbd: list: (2) No such file or directory
# rbd info libvirt/test1
rbd: error opening image test1: (2) No such file or directory
# ceph osd map libvirt test1
osdmap e645 pool 'libvirt' (4) object 'test1' -> pg 4.bddbf0b9 (4.b9) -> up ([26,16,9], p26) acting ([26,16,9], p26)
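
Note that ceph osd map only computes the placement of a name and does not check that the object actually exists; a direct existence check would be (shown as a suggestion):

# rados -p libvirt stat test1
# rados -p libvirt stat rbd_id.test1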

I searched for test1 with rados and removed it, but that didn't change anything (there was no rbd_id.test1 object):

# rados -p libvirt ls | grep test1
test1
# rados -p libvirt rm test1

When I retry deleting the image, nothing changes:

# rbd rm libvirt/test1
Removing image: 100% complete...done.
# rbd info libvirt/test1
rbd: error opening image test1: (2) No such file or directory

Attached you will find the logs of all four rbd commands run with an increased debug level.
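
For reference, these logs were captured by raising the client-side debug levels on the rbd command line, roughly like this (the --log-file option is an assumption about how the output was written to a file):

# rbd rm libvirt/test1 --debug-rbd=20 --debug-ms=1 --log-file=rm.log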

I remain at your disposal for further information.


Files

ls.log (4.19 KB): rbd ls --debug-rbd=20 --debug-ms=1 (Alexandre Becholey, 06/17/2016 10:35 AM)
info.log (6.16 KB): rbd info ubuntu1204_base --debug-rbd=20 --debug-ms=1 (Alexandre Becholey, 06/17/2016 10:35 AM)
ls-l.log (43.8 KB): rbd ls -l --debug-rbd=20 --debug-ms=1 (Alexandre Becholey, 06/17/2016 10:35 AM)
rm.log (10.7 KB): rbd rm libvirt/test1 --debug-rbd=20 --debug-ms=1 (Alexandre Becholey, 06/17/2016 10:47 AM)
#1

Updated by Alexandre Becholey almost 8 years ago

Furthermore, I can't create new images:

# rbd clone libvirt/_IMG_trusty-server-cloudimg-amd64@snap libvirt/test
# rbd ls libvirt
_IMG_debian-8.5.0-openstack-amd64
_IMG_trusty-server-cloudimg-amd64
_IMG_xenial-server-cloudimg-amd64
test1
# rbd clone libvirt/_IMG_trusty-server-cloudimg-amd64@snap libvirt/test
2016-06-17 12:55:27.660513 7f0994c4c100 -1 librbd: rbd image test already exists
rbd: clone error: (17) File exists
root@compute0:~# rbd ls libvirt
_IMG_debian-8.5.0-openstack-amd64
_IMG_trusty-server-cloudimg-amd64
_IMG_xenial-server-cloudimg-amd64
test1

#2

Updated by Jason Dillaman almost 8 years ago

  • Assignee set to Sage Weil
#3

Updated by Alexandre Becholey almost 8 years ago

The cluster is not healthy any more:

    cluster 2127ffaa-3d77-4f88-87ee-8e2278e4c074
     health HEALTH_ERR
            3 pgs inconsistent
            3 scrub errors
     monmap e10: 4 mons at {compute0=172.16.1.10:6789/0,storage0=172.16.1.20:6789/0,storage1=172.16.1.21:6789/0,storage2=172.16.1.22:6789/0}
            election epoch 78, quorum 0,1,2,3 compute0,storage0,storage1,storage2
      fsmap e75: 1/1/1 up {0=storage1.public=up:active}, 2 up:standby
     osdmap e645: 36 osds: 36 up, 36 in
            flags sortbitwise
      pgmap v10745: 1152 pgs, 3 pools, 682 GB data, 170 kobjects
            2982 GB used, 20283 GB / 23265 GB avail
                1149 active+clean
                   3 active+clean+inconsistent
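
To find out exactly which PGs and objects are affected, something along these lines can be used (the list-inconsistent commands are available in jewel; <pgid> is a placeholder):

# ceph health detail
# rados list-inconsistent-pg libvirt
# rados list-inconsistent-obj <pgid> --format=json-pretty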

#4

Updated by Alexandre Becholey almost 8 years ago

After repairing the 3 PGs, two of them keep failing:

2016-06-21 11:26:19.905700 mon.1 [INF] from='client.? 172.16.1.10:0/3698790893' entity='client.admin' cmd=[{"prefix": "pg repair", "pgid": "4.1c"}]: dispatch
2016-06-21 11:26:20.131239 mon.1 [INF] from='client.? 172.16.1.10:0/3358205872' entity='client.admin' cmd=[{"prefix": "pg repair", "pgid": "4.18b"}]: dispatch
2016-06-21 11:26:22.716280 mon.0 [INF] pgmap v12439: 1152 pgs: 1 active+clean+inconsistent, 1151 active+clean; 4596 MB data, 946 GB used, 22318 GB / 23265 GB avail
2016-06-21 11:26:24.750069 mon.0 [INF] pgmap v12440: 1152 pgs: 1152 active+clean; 4596 MB data, 946 GB used, 22318 GB / 23265 GB avail; 0 B/s, 0 keys/s, 0 objects/s recovering
2016-06-21 11:26:20.643613 osd.28 [INF] 4.1c repair starts
2016-06-21 11:26:20.828711 osd.28 [ERR] 4.1c shard 28: soid 4:3831950c:::rbd_directory:head omap_digest 0xffffffff != best guess omap_digest 0x918c395a from auth shard 6
2016-06-21 11:26:20.828764 osd.28 [ERR] 4.1c repair 0 missing, 1 inconsistent objects
2016-06-21 11:26:20.828780 osd.28 [ERR] 4.1c repair 1 errors, 1 fixed
2016-06-21 11:26:21.018938 osd.1 [INF] 4.18b repair starts
2016-06-21 11:26:21.147409 osd.1 [ERR] 4.18b shard 28 missing 4:d1f82166:::rbd_header.119a279e2a9e3:head
2016-06-21 11:26:21.147465 osd.1 [ERR] 4.18b repair 1 missing, 0 inconsistent objects
2016-06-21 11:26:21.147481 osd.1 [ERR] 4.18b repair 1 errors, 1 fixed
2016-06-21 11:26:50.754169 mon.0 [INF] HEALTH_OK

2016-06-21 11:27:11.853995 mon.1 [INF] from='client.? 172.16.1.10:0/3807871970' entity='client.admin' cmd=[{"prefix": "pg deep-scrub", "pgid": "4.1c"}]: dispatch
2016-06-21 11:27:12.073902 mon.1 [INF] from='client.? 172.16.1.10:0/2018208149' entity='client.admin' cmd=[{"prefix": "pg deep-scrub", "pgid": "4.18b"}]: dispatch
2016-06-21 11:27:12.653499 osd.28 [INF] 4.1c deep-scrub starts
2016-06-21 11:27:12.841558 osd.28 [ERR] 4.1c shard 28: soid 4:3831950c:::rbd_directory:head omap_digest 0xffffffff != best guess omap_digest 0x918c395a from auth shard 6
2016-06-21 11:27:12.841662 osd.28 [ERR] 4.1c deep-scrub 0 missing, 1 inconsistent objects
2016-06-21 11:27:12.841668 osd.28 [ERR] 4.1c deep-scrub 1 errors
2016-06-21 11:27:18.068798 mon.0 [INF] pgmap v12441: 1152 pgs: 1 active+clean+inconsistent, 1151 active+clean; 4596 MB data, 946 GB used, 22318 GB / 23265 GB avail
2016-06-21 11:27:14.027263 osd.1 [INF] 4.18b deep-scrub starts
2016-06-21 11:27:14.151964 osd.1 [ERR] 4.18b shard 28 missing 4:d1f82166:::rbd_header.119a279e2a9e3:head
2016-06-21 11:27:14.152072 osd.1 [ERR] 4.18b deep-scrub 1 missing, 0 inconsistent objects
2016-06-21 11:27:14.152077 osd.1 [ERR] 4.18b deep-scrub 1 errors
2016-06-21 11:27:19.089094 mon.0 [INF] pgmap v12442: 1152 pgs: 2 active+clean+inconsistent, 1150 active+clean; 4596 MB data, 946 GB used, 22318 GB / 23265 GB avail
2016-06-21 11:27:50.754768 mon.0 [INF] HEALTH_ERR; 2 pgs inconsistent; 2 scrub errors

Here is the log of the PG that has been successfully repaired:

2016-06-21 09:28:12.438673 osd.2 172.16.1.20:6804/27826 321 : cluster [INF] 4.78 repair starts
2016-06-21 09:28:12.565706 osd.2 172.16.1.20:6804/27826 322 : cluster [ERR] 4.78 shard 19 missing 4:1e2ef459:::rbd_id.test:head
2016-06-21 09:28:12.565757 osd.2 172.16.1.20:6804/27826 323 : cluster [ERR] 4.78 repair 1 missing, 0 inconsistent objects
2016-06-21 09:28:12.565773 osd.2 172.16.1.20:6804/27826 324 : cluster [ERR] 4.78 repair 1 errors, 1 fixed

2016-06-21 11:31:24.348320 mon.1 [INF] from='client.? 172.16.1.10:0/2068510417' entity='client.admin' cmd=[{"prefix": "pg deep-scrub", "pgid": "4.78"}]: dispatch
2016-06-21 11:31:25.822731 osd.2 [INF] 4.78 deep-scrub starts
2016-06-21 11:31:25.955523 osd.2 [INF] 4.78 deep-scrub ok
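
For reference, the repair and verification cycle shown above corresponds to the commands visible in the mon dispatch lines:

# ceph pg repair 4.1c
# ceph pg repair 4.18b
# ceph pg deep-scrub 4.1c
# ceph pg deep-scrub 4.18b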

#5

Updated by Jason Dillaman almost 8 years ago

  • Project changed from rbd to Ceph
#6

Updated by Sage Weil about 7 years ago

  • Status changed from New to Can't reproduce

If you see anything like this on kraken or later, please reopen!
