Bug #16605: Can't fix PG stats after manual shard deletion

Added by Bryan Apperson almost 8 years ago. Updated almost 7 years ago.

Status: New
Priority: Low
Assignee: -
Category: EC Pools
Target version: -
% Done: 0%
Source: Community (user)
Regression: No
Severity: 2 - major
Component(RADOS): OSD

Description

This issue was encountered while trying to work around:

http://tracker.ceph.com/issues/14154
http://tracker.ceph.com/issues/12200 (resolved) <---- this is where the Ceph community handled the assertion in Infernalis
http://tracker.ceph.com/issues/10044
http://tracker.ceph.com/issues/10018
http://tracker.ceph.com/issues/9537

This was observed on Ceph Hammer.

Reproduction steps:

1. Create a pool named ectest with an erasure-coded profile using k=3, m=2 and a host failure domain.
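
For reference, a minimal sketch of this step, assuming a profile named ectest-profile (not given in the report) and the 128 PGs seen in the scrub log below, using Hammer-era syntax:

ceph osd erasure-code-profile set ectest-profile k=3 m=2 ruleset-failure-domain=host
ceph osd pool create ectest 128 128 erasure ectest-profile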

2. Write some objects into it:
rados bench -p ectest 120 write -t 1 --no-cleanup

3. List the objects and find the PG for one of them:

[root@mon1 ~]# ceph osd map ectest benchmark_data_mon1_20977_object505
osdmap e32 pool 'ectest' (1) object 'benchmark_data_mon1_20977_object505' -> pg 1.7307abff (1.3f) -> up ([12,17,9,1,5], p12) acting ([12,17,9,1,5], p12)

4. Delete all shards of the object from the OSDs in the PG's acting set.
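
The report does not show how the shards were deleted; a rough sketch for one acting OSD (osd.12, which holds shard s0), assuming Hammer's filestore layout and the non-default /var/lib/ceph/osd/osdN data path used elsewhere in this ticket. Filestore mangles underscores in on-disk file names, so a loose pattern is used, and the OSD should be stopped first:

service ceph stop osd.12
find /var/lib/ceph/osd/osd12/current/1.3fs0_head -name '*object505*' -delete

The same would be repeated for shards s1..s4 on osd.17, osd.9, osd.1 and osd.5.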

5. Scrub the PG:
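
A (deep-)scrub can be requested with the standard CLI, using the PG id from step 3:

ceph pg scrub 1.3f
ceph pg deep-scrub 1.3f

The resulting cluster log: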

2016-06-30 22:14:40.359231 mon.0 192.168.57.2:6789/0 6176 : cluster [INF] pgmap v207: 128 pgs: 127 active+clean, 1 active+clean+inconsistent; 2334 MB data, 4585 MB used, 395 GB / 399 GB avail
2016-06-30 23:01:11.147291 osd.12 192.168.57.8:6802/22047 1 : cluster [INF] 1.3f scrub starts
2016-06-30 23:01:11.254492 osd.12 192.168.57.8:6802/22047 2 : cluster [ERR] 1.3fs0 scrub stat mismatch, got 9/10 objects, 0/0 clones, 9/10 dirty, 0/0 omap, 0/0 hit_set_archive, 0/0 whiteouts, 37783584/41981760 bytes,0/0 hit_set_archive bytes.
2016-06-30 23:01:11.254501 osd.12 192.168.57.8:6802/22047 3 : cluster [ERR] 1.3f scrub 1 errors
2016-06-30 22:15:00.800910 mon.0 192.168.57.2:6789/0 6183 : cluster [INF] pgmap v208: 128 pgs: 127 active+clean, 1 active+clean+inconsistent; 2334 MB data, 4585 MB used, 395 GB / 399 GB avail
2016-06-30 23:01:28.150423 osd.12 192.168.57.8:6802/22047 4 : cluster [INF] 1.3f deep-scrub starts
2016-06-30 23:01:28.282228 osd.12 192.168.57.8:6802/22047 5 : cluster [ERR] 1.3fs0 deep-scrub stat mismatch, got 9/10 objects, 0/0 clones, 9/10 dirty, 0/0 omap, 0/0 hit_set_archive, 0/0 whiteouts, 37783584/41981760 bytes,0/0 hit_set_archive bytes.
2016-06-30 23:01:28.282236 osd.12 192.168.57.8:6802/22047 6 : cluster [ERR] 1.3f deep-scrub 1 errors

Resolutions attempted:

rados rm:
rados rm does not see the object as present, so the metadata that is throwing the counters off cannot be deleted this way.
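
For reference, the attempted removal would look like this (pool and object name from the reproduction above), presumably failing with ENOENT since the primary no longer sees the object:

rados -p ectest rm benchmark_data_mon1_20977_object505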

Placing a new object of the same name and then deleting it does not resolve the issue either; the counters, whether in the omap or the pgmap, remain incorrect.
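
Presumably something along these lines, with /tmp/payload standing in for arbitrary data:

rados -p ectest put benchmark_data_mon1_20977_object505 /tmp/payload
rados -p ectest rm benchmark_data_mon1_20977_object505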

ceph-objectstore-tool:
[root@data5 meta]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/osd17 --journal-path /dev/disk/by-id/ata-VBOX_HARDDISK_VB936bc28c-5399b440-part2 benchmark_data_mon1_20977_object505 remove

This was performed on every OSD in the inconsistent PG (with each OSD stopped, as ceph-objectstore-tool requires) to remove the exact object. It did not resolve the issue.

Actions #1

Updated by Bryan Apperson almost 8 years ago

I also wrote a leveldb client to go in and inspect the omap; it looks like data for the object is still present.
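
As a rough alternative to a custom leveldb client, the ceph-kvstore-tool shipped with Ceph could presumably be pointed at the filestore omap store while the OSD is stopped; a sketch only, not verified against Hammer's tool syntax:

ceph-kvstore-tool /var/lib/ceph/osd/osd17/current/omap list | grep object505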

Actions #3

Updated by Bryan Apperson over 7 years ago

A quick fix would be at L11680 of ReplicatedPG.cc.

In our particular case we can comment out the repair conditional below so that the stats are updated regardless, and swap the patched binary in, but the scrub process should have a graceful way of correcting these stats when they mismatch:

if (repair) {
  ++scrubber.fixed;
  info.stats.stats = scrub_cstat;  // overwrite the PG's stats with what scrub actually counted
  info.stats.dirty_stats_invalid = false;
  info.stats.omap_stats_invalid = false;
  info.stats.hitset_stats_invalid = false;
  info.stats.hitset_bytes_stats_invalid = false;
  publish_stats_to_osd();          // report the corrected stats to the monitors
  share_pg_info();                 // propagate the corrected pg_info to the other shards
}
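
For context, this branch runs during a repair scrub, i.e. when the repair flag is set; that can be requested with:

ceph pg repair 1.3f
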
Actions #4

Updated by Loïc Dachary over 7 years ago

  • Target version deleted (v0.94.8)
Actions #5

Updated by Greg Farnum almost 7 years ago

  • Subject changed from EC Pool PG Inconsistent after manual shard deletion to Can't fix PG stats after manual shard deletion
  • Priority changed from Normal to Low
Actions #6

Updated by Greg Farnum almost 7 years ago

  • Project changed from Ceph to RADOS
  • Category changed from OSD to EC Pools
  • Component(RADOS) OSD added