Bug #48060 (open): data loss in EC pool

Added by Hannes Tamme over 3 years ago. Updated almost 3 years ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version:
% Done: 0%
Source: other
Tags:
Backport:
Regression: No
Severity: 1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We have data LOSS in our EC pool k4m2.
The pool is used for RBD volumes; 15 RBD volumes have broken objects.
The broken objects contain shards with different versions. All OSDs are up and in. No SMART errors.
Network latency between components is no more than 0.2 ms (10-40 Gbit bonded network interfaces).
OSD hosts have more than 20 GB of RAM per OSD and more than 4 dedicated cores per OSD.
This is a production system.

root@osd-host:~# uname -a
Linux osd-host 5.4.0-42-generic #46~18.04.1-Ubuntu SMP Fri Jul 10 07:21:24 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

root@ik01:~# ceph version
ceph version 15.2.5 (2c93eff00150f0cc5f106a559557a58d3d7b6f1f) octopus (stable)

root@ik01:~# apt-cache policy ceph
ceph:
Installed: 15.2.5-1bionic
Candidate: 15.2.5-1bionic
Version table:
  *** 15.2.5-1bionic 1001
      1001 https://download.ceph.com/debian-octopus bionic/main amd64 Packages

root@mon-host:~# ceph osd dump | grep cinder-data
pool 30 'cinder-data' erasure profile k4m2 size 6 min_size 5 crush_rule 4 object_hash rjenkins pg_num 512 pgp_num 512 autoscale_mode warn last_change 377788 flags hashpspool,ec_overwrites,selfmanaged_snaps stripe_width 16384 application rbd

root@mon-host:~# ceph osd erasure-code-profile get k4m2
crush-device-class=ssd
crush-failure-domain=host
crush-root=default
jerasure-per-chunk-alignment=false
k=4
m=2
plugin=jerasure
technique=reed_sol_van
w=8

root@mon-host:~# ceph health detail
pg 30.e6 is stuck undersized for 44h, current state active+recovery_unfound+undersized+degraded+remapped, last acting [45,2147483647,6,2147483647,42,22]

root@ik01:~# ceph pg 30.e6 list_unfound| egrep '(rbd|need|have)'
"oid": "rbd_data.29.4ed8dede74eb6f.0000000000000224",
"need": "364963'36484992",
"have": "0'0",

We dumped out (with ceph-objectstore-tool) all shards from 0 to 5 and found that they have different versions:

"version": "363510'36481974",
"version": "364963'36484992",
"version": "364963'36484992",
"version": "364963'36484992",
"version": "363579'36482218",
"version": "359913'36472321",
"version": "359913'36472321",

We found more than 6 OSDs that contained pieces of the missing object.
So we rebooted all OSDs, one at a time. No extra healthy shards were found.

Are there any commands to "glue" the broken pieces back together and put them back in?
In our test system we tried to put a wrong shard version back in, and the result was:
'/build/ceph-15.2.5/src/osd/osd_types.cc: 5698: FAILED ceph_assert(clone_size.count(clone))'
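To illustrate what we mean by putting a shard back in, a minimal sketch with ceph-objectstore-tool (not the exact commands we ran; OSD ids, data paths and shard suffixes are illustrative, and both OSDs are stopped while the tool runs):

root@test-osd:~# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-6 --pgid 30.e6s2 rbd_data.29.4ed8dede74eb6f.0000000000000224 get-bytes /tmp/shard.bin
root@test-osd:~# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-42 --pgid 30.e6s4 rbd_data.29.4ed8dede74eb6f.0000000000000224 set-bytes /tmp/shard.bin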

Currently IO operations hang when we try to do any IO on the broken objects (rbd import/export or rados get/put).
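For example, both of the following currently block forever (pool and object name as above; the RBD volume name is just a placeholder):

root@client:~# rados -p cinder-data get rbd_data.29.4ed8dede74eb6f.0000000000000224 /tmp/obj.bin
root@client:~# rbd export cinder-data/volume-XXXX /tmp/volume.img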

We have approx. 12 hours before we accept permanent data loss with the following command:
root@mon-host:~# ceph pg 30.e6 mark_unfound_lost delete


Files

trimmed_ceph-osd.22.log (335 KB) trimmed_ceph-osd.22.log Hannes Tamme, 11/02/2020 05:16 AM
trimmed_ceph-osd.34.log (291 KB) trimmed_ceph-osd.34.log Hannes Tamme, 11/02/2020 05:16 AM
trimmed_ceph-osd.43.log (331 KB) trimmed_ceph-osd.43.log Hannes Tamme, 11/02/2020 05:16 AM

Related issues (1 open, 0 closed)

Related to RADOS - Bug #51024: OSD - FAILED ceph_assert(clone_size.count(clone)), keeps on restarting after one host reboot (New)
