Bug #48060

data loss in EC pool

Added by Hannes Tamme about 2 years ago. Updated over 1 year ago.

Priority: 1 - critical


We have data LOSS in our EC pool k4m2.
The pool is used for RBD volumes; 15 RBD volumes have broken objects.
A broken object contains shards with different versions. ALL OSDs are UP and IN. No SMART errors.
Network latency between components is no more than 0.2 ms (10-40 Gbit bonded network interfaces).
OSD hosts have more than 20 GB of RAM per OSD and more than 4 dedicated cores per OSD.
This is a production system.

root@osd-host:~# uname -a
Linux osd-host 5.4.0-42-generic #46~18.04.1-Ubuntu SMP Fri Jul 10 07:21:24 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

root@ik01:~# ceph version
ceph version 15.2.5 (2c93eff00150f0cc5f106a559557a58d3d7b6f1f) octopus (stable)

root@ik01:~# apt-cache policy ceph
Installed: 15.2.5-1bionic
Candidate: 15.2.5-1bionic
Version table:
 *** 15.2.5-1bionic 1001
        1001 bionic/main amd64 Packages

root@mon-host:~# ceph osd dump | grep cinder-data
pool 30 'cinder-data' erasure profile k4m2 size 6 min_size 5 crush_rule 4 object_hash rjenkins pg_num 512 pgp_num 512 autoscale_mode warn last_change 377788 flags hashpspool,ec_overwrites,selfmanaged_snaps stripe_width 16384 application rbd
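For context on the numbers in that dump: size 6 is k+m, min_size 5 is k+1, and stripe_width 16384 split across the k=4 data shards gives 4 KiB per shard per stripe. A quick sanity check of that arithmetic, with values taken from the dump above:

```shell
# EC pool geometry from the 'ceph osd dump' output above (profile k4m2)
k=4; m=2; stripe_width=16384
echo $((k + m))               # 6    -> pool 'size' (shards per object)
echo $((k + 1))               # 5    -> pool 'min_size'
echo $((stripe_width / k))    # 4096 -> bytes per data shard per stripe
```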

root@mon-host:~# ceph osd erasure-code-profile get k4m2

root@mon-host:~# ceph health detail
pg 30.e6 is stuck undersized for 44h, current state active+recovery_unfound+undersized+degraded+remapped, last acting [45,2147483647,6,2147483647,42,22]
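The 2147483647 entries in the acting set are not OSD ids: 2147483647 (0x7fffffff) is CRUSH_ITEM_NONE, meaning no OSD is currently mapped to that shard. A small sketch counting the unmapped shards in the acting set above:

```shell
# count CRUSH_ITEM_NONE placeholders (2147483647 = 0x7fffffff) in an acting set
acting="45,2147483647,6,2147483647,42,22"
missing=$(echo "$acting" | tr ',' '\n' | grep -c '^2147483647$')
echo "$missing"               # 2 -> two of the six shards have no OSD assigned
```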

root@ik01:~# ceph pg 30.e6 list_unfound| egrep '(rbd|need|have)'
"oid": "rbd_data.29.4ed8dede74eb6f.0000000000000224",
"need": "364963'36484992",
"have": "0'0",

We dumped all shards (0 through 5) with ceph-objectstore-tool and found that they have different versions:

"version": "363510'36481974",
"version": "364963'36484992",
"version": "364963'36484992",
"version": "364963'36484992",
"version": "363579'36482218",
"version": "359913'36472321",
"version": "359913'36472321",
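Each string is an eversion of the form epoch'version; the shards whose eversion matches the "need" value from list_unfound are the up-to-date ones. A minimal sketch for picking the newest eversion out of a dump like the one above:

```shell
# pick the newest eversion (epoch'version) among dumped shard versions
versions="363510'36481974
364963'36484992
363579'36482218
359913'36472321"
newest=$(printf '%s\n' "$versions" | sort -t "'" -k2,2 -n | tail -n 1)
echo "$newest"                # 364963'36484992, matching the "need" value
```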

We found more than 6 OSDs that contained pieces of the missing object.
So we rebooted all OSDs, one at a time. No extra healthy shards were found.

Are there any commands to "glue" the broken pieces together and put them back in?
In our test system we tried to put a wrong shard version back in, and the result was:
'/build/ceph-15.2.5/src/osd/ 5698: FAILED ceph_assert(clone_size.count(clone))'

Currently IO hangs when we try to do any IO with the broken objects (rbd import/export or rados get/put).

We have approx. 12 hours before we accept permanent data loss with the following command:
root@mon-host:~# ceph pg 30.e6 mark_unfound_lost delete

trimmed_ceph-osd.22.log View (335 KB) Hannes Tamme, 11/02/2020 05:16 AM

trimmed_ceph-osd.34.log View (291 KB) Hannes Tamme, 11/02/2020 05:16 AM

trimmed_ceph-osd.43.log View (331 KB) Hannes Tamme, 11/02/2020 05:16 AM

Related issues

Related to RADOS - Bug #51024: OSD - FAILED ceph_assert(clone_size.count(clone), keeps on restarting after one host reboot New


#1 Updated by Hannes Tamme about 2 years ago

We accepted data loss and executed the command:

root@ik01:~# ceph pg 30.17 mark_unfound_lost delete
pg has 1 objects unfound and apparently lost marking

Now we have three down OSDs (and they won't start):
/build/ceph-15.2.5/src/osd/ 5698: FAILED ceph_assert(clone_size.count(clone))'

I attached the three crashed OSD logs.

Currently our production Ceph cluster is down.

#2 Updated by Hannes Tamme about 2 years ago

When osd.22, osd.34 and osd.43 were down, ceph -w gave us endless lines of:

2020-11-02T07:18:00.464219+0200 osd.45 [WRN] slow request osd_op(client.1590808960.0:3265229 30.e6s0 30.d3bbf4e6 (undecoded) ondisk+retry+write+known_if_redirected e379620) initiated 2020-11-02T07:08:28.561045+0200 currently delayed

When I leave osd.34 down I can start osd.22 and osd.43.

Also we have a NEW unfound object:
pg 30.14e has 1 unfound objects

We added the flags noscrub,nodeep-scrub since we are recovering.
The current situation is that we have 1 down OSD and no scrubbing.

#3 Updated by Hannes Tamme about 2 years ago

During the day the unfound object list grew. The current state is:
[ERR] PG_DAMAGED: Possible data damage: 5 pgs recovery_unfound
pg 30.44 is active+recovery_unfound+undersized+degraded+remapped, acting [2,42,44,2147483647,22,26], 2 unfound
pg is active+recovery_unfound+undersized+degraded+remapped, acting [42,26,2147483647,20,44,2147483647], 2 unfound
pg 30.e6 is active+recovery_unfound+undersized+degraded+remapped, acting [45,2147483647,6,2147483647,42,22], 15 unfound
pg 30.ff is active+recovery_unfound+undersized+degraded+remapped, acting [41,44,2147483647,42,35,2147483647], 3 unfound
pg 30.14e is active+recovery_unfound+undersized+degraded+remapped, acting [0,22,19,2147483647,2147483647,18], 1 unfound

#4 Updated by Hannes Tamme about 2 years ago

osd.22 crashed every minute with '/build/ceph-15.2.5/src/osd/ 5698: FAILED ceph_assert(clone_size.count(clone))':
2020-11-03T00:15:17.927+0200 7fbafff30700 -1 *** Caught signal (Aborted) **
2020-11-03T08:49:08.561+0200 7f8336f31700 -1 *** Caught signal (Aborted) **
2020-11-03T08:49:51.750+0200 7fe801b14700 -1 *** Caught signal (Aborted) **

We decided to remove all PGs from osd.22:
root@ik01:~/pg# ceph osd crush reweight osd.22 0

Since one object in PG 30.17 was marked unfound-deleted but still existed on the OSD, we decided to delete it.
Exported objects:
root@ik08:~/pg2# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-22 --pgid 30.17s2 '{"oid":"rbd_data.29.041c376b8b4567.0000000000000203","key":"","snapid":78053,"hash":3343369239,"max":0,"pool":30,"namespace":"","shard_id":2,"max":0}' get-bytes > snapid_78053_rbd_data.29.041c376b8b4567.0000000000000203

root@ik08:~/pg2# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-22 --pgid 30.17s2 '{"oid":"rbd_data.29.041c376b8b4567.0000000000000203","key":"","snapid":-2,"hash":3343369239,"max":0,"pool":30,"namespace":"","shard_id":2,"max":0}' get-bytes > snapid_2_rbd_data.29.041c376b8b4567.0000000000000203

root@ik08:~/pg2# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-22 --pgid 30.17s2 '{"oid":"rbd_data.29.041c376b8b4567.0000000000000203","key":"","snapid":-2,"hash":3343369239,"max":0,"pool":30,"namespace":"","shard_id":2,"max":0}' remove
remove 2#30:e823e2e3:::rbd_data.29.041c376b8b4567.0000000000000203:head#

root@ik08:~/pg2# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-22 --pgid 30.17s2 '{"oid":"rbd_data.29.041c376b8b4567.0000000000000203","key":"","snapid":78053,"hash":3343369239,"max":0,"pool":30,"namespace":"","shard_id":2,"max":0}' remove
remove 2#30:e823e2e3:::rbd_data.29.041c376b8b4567.0000000000000203:130e5#

Now osd.22 is up but it seems to hold 4 PGs permanently:
root@ik01:~/pg# ceph osd df | grep ^22
22 ssd 0 1.00000 3.5 TiB 1.6 TiB 1.6 TiB 19 MiB 2.6 GiB 1.9 TiB 46.22 1.01 4 up

The unfound-deleted PG is stuck undersized without any apparent reason:
pg 30.17 is stuck undersized for 12m, current state active+undersized+remapped+backfilling, last acting [2147483647,2147483647,22,44,26,42]

Client I/O seems to flow:
client: 3.2 MiB/s rd, 5.8 MiB/s wr, 1.28k op/s rd, 431 op/s wr

#5 Updated by Hannes Tamme about 2 years ago

We ended up decommissioning Ceph. For unlucky 'incomplete PG' googlers - here are the steps to migrate OpenStack VM storage away from Ceph RBD (a normal retype hangs).

#find broken object location inside RBD image
for pref in $(rbd info cinder-metadata/volume-dd4c799a-8eae-4d0b-8411-1b82928a5568 | awk '/pref/ {print $2}');do for pg in 30.14e 30.44 30.ff 30.e6;do ceph pg $pg list_unfound | grep $pref; done;done
"oid": "rbd_data.29.e1844759adea3d.00000000000263f9",
echo $((16#263f9))

#make non-thin volume space
dd if=/dev/zero of=/gpfs/space/cloud/backup/_volume-dd4c799a-8eae-4d0b-8411-1b82928a5568 iflag=fullblock bs=1G count=2048

#stop I/O and map rbd image
nova stop ea504c59-feb9-4570-83cc-8d5f69192889
rbd-nbd map cinder-metadata/volume-dd4c799a-8eae-4d0b-8411-1b82928a5568

#prevent accidentally reading broken object
echo 0 > /sys/devices/virtual/block/nbd2/queue/read_ahead_kb

#dump out everything you can
dd if=/dev/nbd2 of=/gpfs/space/cloud/backup/_volume-dd4c799a-8eae-4d0b-8411-1b82928a5568 bs=4M count=156665
dd if=/dev/nbd2 of=/gpfs/space/cloud/backup/_volume-dd4c799a-8eae-4d0b-8411-1b82928a5568 skip=156666 seek=156666 bs=4M
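The dd arithmetic above hinges on the object index computed earlier: the broken object's name suffix 0x263f9 is 156665 in decimal, so with the default 4 MiB RBD object size the first dd copies blocks 0..156664 (count=156665) and the second resumes at block 156666 (skip=seek=156666), leaving exactly the one broken object as a hole. A check of that arithmetic:

```shell
# verify the block math for skipping the broken 4 MiB object
idx=$((16#263f9))                  # object index from the rbd_data name suffix
echo "$idx"                        # 156665 -> dd count for the first pass
echo "$((idx + 1))"                # 156666 -> dd skip/seek for the second pass
echo "$((idx * 4 * 1024 * 1024))"  # byte offset of the damaged region
```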

#map as block device
qemu-nbd --connect=/dev/nbd2 -f raw /gpfs/space/cloud/backup/_volume-dd4c799a-8eae-4d0b-8411-1b82928a5568

#hope for the best and fix the volume filesystem
e2fsck -y /dev/nbd2
e2fsck 1.44.1 (24-Mar-2018)
/dev/nbd2: clean, 278075/134217728 files, 462958482/536870912 blocks
qemu-nbd --disconnect /dev/nbd2
/dev/nbd2 disconnected
mount -o loop /gpfs/space/cloud/backup/_volume-dd4c799a-8eae-4d0b-8411-1b82928a5568 /mnt/dd4c799a-8eae-4d0b-8411-1b82928a5568
ls /mnt/dd4c799a-8eae-4d0b-8411-1b82928a5568/lost+found/
umount /mnt/dd4c799a-8eae-4d0b-8411-1b82928a5568

#move the volume to its correct place if needed
mv /gpfs/space/cloud/backup/_volume-dd4c799a-8eae-4d0b-8411-1b82928a5568 /gpfs/space/cloud/openstack/cinder/volumes/volume-dd4c799a-8eae-4d0b-8411-1b82928a5568

#update the openstack cinder database with the VM's new storage location/driver (record the old DB values first if needed)
select * from cinder.volumes where id='dd4c799a-8eae-4d0b-8411-1b82928a5568'\G
update cinder.volumes set cluster_name='ceph@prod#prod', host='ik02@prod#prod', volume_type_id='5f84ccb5-78aa-4237-b46f-0174b39b269c', status = 'in-use' where id='dd4c799a-8eae-4d0b-8411-1b82928a5568' limit 1;

select * from cinder.volume_attachment where volume_id='dd4c799a-8eae-4d0b-8411-1b82928a5568' and deleted='0'\G
select * from nova.block_device_mapping where volume_id='dd4c799a-8eae-4d0b-8411-1b82928a5568' and deleted_at is NULL\G

update nova.block_device_mapping set connection_info='{"driver_volume_type": "gpfs", "data": {"name": "volume-dd4c799a-8eae-4d0b-8411-1b82928a5568", "device_path": "/gpfs/space/cloud/openstack/cinder/volumes/volume-dd4c799a-8eae-4d0b-8411-1b82928a5568", "qos_specs": null, "access_mode": "rw", "encrypted": false}, "connector": {"platform": "x86_64", "os_type": "linux", "ip": "", "host": "hypervisor-hostname", "multipath": false, "initiator": "", "do_local_attach": false, "system uuid": "00000000-0000-0000-0000-1b82928a5568"}, "serial": "dd4c799a-8eae-4d0b-8411-1b82928a5568"}' where deleted_at is NULL and volume_id='dd4c799a-8eae-4d0b-8411-1b82928a5568' limit 1;

#start VM
nova reboot --hard ea504c59-feb9-4570-83cc-8d5f69192889

#monitor ceph cluster IO to make sure the ceph storage is no longer used (wait ~30 seconds)
rbd perf image iotop

#6 Updated by Mykola Golub about 2 years ago

  • Project changed from Ceph to RADOS
  • Category deleted (OSD)

#7 Updated by Jeremi A over 1 year ago

I have had exactly the same issue with my cluster, without even having lost any disks or hosts.

#8 Updated by Neha Ojha over 1 year ago

  • Related to Bug #51024: OSD - FAILED ceph_assert(clone_size.count(clone), keeps on restarting after one host reboot added
