Bug #8752

closed

firefly: scrub/repair stat mismatch

Added by Dmitry Smirnov almost 10 years ago. Updated over 8 years ago.

Status:
Can't reproduce
Priority:
High
Assignee:
-
Category:
-
Target version:
-
% Done:
0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Two dozen PGs are in the "active+clean+inconsistent" state.
Attempting "ceph pg repair" reports fixed error(s), but the next scrub or deep-scrub reveals the same (or a similar) problem.
All 12 OSDs were replaced, but the problem does not go away, and it is not clear what to expect (corruption?) or how to recover. Please advise.

2014-07-06 09:44:42.632205 osd.1 [ERR] 20.e deep-scrub stat mismatch, got 3280/3280 objects, 0/0 clones, 1634/1634 dirty, 0/0 omap, 4/4 hit_set_archive, 1871/1871 whiteouts, 893452192/893452181 bytes.                                  
2014-07-06 09:44:42.632212 osd.1 [ERR] 20.e deep-scrub 1 errors 

2014-07-06 09:53:10.496110 osd.1 [ERR] 20.e repair stat mismatch, got 3281/3281 objects, 0/0 clones, 1635/1635 dirty, 0/0 omap, 4/4 hit_set_archive, 1874/1874 whiteouts, 889179125/889179115 bytes.                                      
2014-07-06 09:53:10.496176 osd.1 [ERR] 20.e repair 1 errors, 1 fixed 

2014-07-06 16:24:06.753233 osd.1 [ERR] 20.e scrub stat mismatch, got 3330/3330 objects, 0/0 clones, 1635/1635 dirty, 0/0 omap, 4/4 hit_set_archive, 1751/1751 whiteouts, 1994079525/1994079526 bytes.                                     
2014-07-06 16:24:06.753237 osd.1 [ERR] 20.e scrub 1 errors 

2014-07-06 16:32:03.587865 osd.1 [ERR] 20.e repair stat mismatch, got 3333/3333 objects, 0/0 clones, 1635/1635 dirty, 0/0 omap, 4/4 hit_set_archive, 1751/1751 whiteouts, 2006662452/2006662455 bytes.                                    
2014-07-06 16:32:03.587944 osd.1 [ERR] 20.e repair 1 errors, 1 fixed 

2014-07-06 17:07:09.114170 osd.1 [ERR] 20.e scrub stat mismatch, got 3333/3333 objects, 0/0 clones, 1635/1635 dirty, 0/0 omap, 4/4 hit_set_archive, 1751/1751 whiteouts, 2006662455/2006662452 bytes.                                     
2014-07-06 17:07:09.114176 osd.1 [ERR] 20.e scrub 1 errors 
2014-07-06 17:10:26.163036 osd.1 [ERR] 20.e repair stat mismatch, got 3333/3333 objects, 0/0 clones, 1635/1635 dirty, 0/0 omap, 4/4 hit_set_archive, 1751/1751 whiteouts, 2006662455/2006662452 bytes.                                    
2014-07-06 17:10:26.163211 osd.1 [ERR] 20.e repair 1 errors, 1 fixed 

2014-07-06 19:55:31.549075 osd.1 [ERR] 20.e scrub stat mismatch, got 3334/3334 objects, 0/0 clones, 1635/1635 dirty, 0/0 omap, 4/4 hit_set_archive, 1729/1729 whiteouts, 2201876424/2201876433 bytes.                                     
2014-07-06 19:55:31.549079 osd.1 [ERR] 20.e scrub 1 errors 

2014-07-06 20:16:19.560180 osd.1 [ERR] 20.e repair stat mismatch, got 3315/3315 objects, 0/0 clones, 1635/1635 dirty, 0/0 omap, 4/4 hit_set_archive, 1725/1725 whiteouts, 2159604413/2159604425 bytes.                                    
2014-07-06 20:16:19.560267 osd.1 [ERR] 20.e repair 1 errors, 1 fixed

2014-07-06 20:23:14.901710 osd.1 [ERR] 20.e repair stat mismatch, got 3222/3222 objects, 0/0 clones, 1635/1635 dirty, 0/0 omap, 4/4 hit_set_archive, 1719/1719 whiteouts, 1832294131/1832294124 bytes. 
2014-07-06 20:23:14.901808 osd.1 [ERR] 20.e repair 1 errors, 1 fixed 
# ceph pg map 20.e
osdmap e53790 pg 20.e (20.e) -> up [1,8,12,4] acting [1,8,12,4]

All affected PGs seem to belong to a replicated pool.
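
For reference, the scrub/repair cycle above was driven with the standard per-PG commands, roughly as in this sketch (20.e taken from the log; the exact `ceph health detail` output format varies by release):

ceph health detail | grep inconsistent    # lists PGs in active+clean+inconsistent
ceph pg deep-scrub 20.e                   # re-runs the deep scrub that reports the stat mismatch
ceph pg repair 20.e                       # logs "1 errors, 1 fixed", yet a later scrub flags 20.e again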

Actions #1

Updated by Dmitry Smirnov almost 10 years ago

No improvement with 0.80.3. I'm getting ~20 inconsistent PGs after every full "deep-scrub" cycle (i.e. `ceph osd deep-scrub \*`).
It is always one error per PG, "mismatch in NNN/MMM bytes".
It looks like a false positive, but I'm not sure how to confirm it.
"repair" seems to help for some time, but eventually the errors reappear.
All cluster components are up all the time; RBDs are used extensively by kernel clients and libvirt/KVM.
OSDs occasionally crash on "deep-scrub" as described in #8747.
Could this be related to #8830?
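
For the record, the full pass plus follow-up repairs can be scripted roughly like this (a sketch; picking PG IDs out of `ceph health detail` assumes its usual one-PG-per-line format):

ceph osd deep-scrub \*                       # deep-scrub every OSD, as above
# after the scrubs complete, repair whatever came back inconsistent
ceph health detail | awk '/^pg .*inconsistent/ {print $2}' | while read pg; do
    ceph pg repair "$pg"
done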

Actions #2

Updated by Sage Weil almost 10 years ago

  • Status changed from New to Duplicate

Almost certainly a dup of #8830. The fix will hit the firefly branch shortly!

Actions #3

Updated by Greg Farnum almost 10 years ago

  • Status changed from Duplicate to New

Actually, maybe not. The naive interpretation doesn't have #8830 causing differences in file sizes...but maybe it could if you're using different local filesystems to back different OSDs?

Actions #4

Updated by Samuel Just almost 10 years ago

  • Priority changed from Normal to High

This appears unrelated to 8830. Probably a stat miscounting bug somewhere in the cache/tiering code.

Actions #5

Updated by Dmitry Smirnov almost 10 years ago

If #8830 affects only XFS-based OSDs, it is definitely not my case. All my OSDs are on Btrfs...
Objects from the affected PGs (20.*) belong to a replicated cache pool on top of an EC pool...
Pool 20 is used for CephFS.
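
For context, a writeback cache tier of this shape is normally wired up with the tiering commands, sketched here using the pool names that appear in the dump further down this ticket:

ceph osd tier add data_ec datac              # attach the replicated cache pool to the EC base pool
ceph osd tier cache-mode datac writeback     # matches the cache_mode shown in the pool dump below
ceph osd tier set-overlay data_ec datac      # route client I/O for data_ec through datac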

Actions #6

Updated by Samuel Just almost 10 years ago

Just fyi, this is a relatively harmless stat counting error. It shouldn't cause corruption. Not that I know how to fix it yet...

Actions #7

Updated by Dmitry Smirnov almost 10 years ago

Samuel Just wrote:

Just fyi, this is a relatively harmless stat counting error. It shouldn't cause corruption.

Thank you for confirming that. Very reassuring.

I've checked the integrity of ~200 GiB of data several times (before and after repair) and all of it is OK, no corruption whatsoever. This problem looks scarier than it is (e.g. "HEALTH_ERR 12 pgs inconsistent; 12 scrub errors").
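
One way to do such a before/after check from the client side, as a sketch (assuming the data is reachable through a CephFS mount, here at the illustrative path /mnt/cephfs):

find /mnt/cephfs -type f -print0 | xargs -0 sha256sum | sort -k 2 > before.sums
# ... run "ceph pg repair" on the inconsistent PGs and wait for the next scrub ...
find /mnt/cephfs -type f -print0 | xargs -0 sha256sum | sort -k 2 > after.sums
diff before.sums after.sums                  # empty output means nothing changed client-side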

Not that I know how to fix it yet...

No worries, I'm sure you'll think of something eventually.
Good luck and thank you.

Actions #8

Updated by Dmitry Smirnov over 9 years ago

Upgraded the cluster to 0.80.4, restarted all components (previously an MDS running 0.80.2 could still have been up), and copied some data to CephFS. Now the mismatches have magically disappeared, and three passes of deep-scrub found no problems whatsoever.
Frankly, I don't understand what fixed it -- could it be that some old problematic objects were evicted from the caching pool when I copied more data? No other ideas...

Actions #9

Updated by Sage Weil over 9 years ago

  • Status changed from New to Resolved
Actions #10

Updated by Sage Weil over 9 years ago

  • Status changed from Resolved to Can't reproduce
Actions #11

Updated by Dmitry Smirnov over 9 years ago

  • Status changed from Can't reproduce to New

This problem manifests only on caching pools.
I have two EC pools with the following settings:

directory=/usr/lib/x86_64-linux-gnu/ceph/erasure-code
k=2
m=3
plugin=jerasure
ruleset-failure-domain=host
technique=reed_sol_van

Here is the configuration of the relevant pools:

pool 17 'rbd_ec' erasure size 5 min_size 3 crush_ruleset 4 object_hash rjenkins pg_num 32 pgp_num 32 last_change 53650 flags hashpspool tiers 19 read_tier 19 write_tier 19 stripe_width 4096
pool 18 'data_ec' erasure size 5 min_size 3 crush_ruleset 5 object_hash rjenkins pg_num 32 pgp_num 32 last_change 53649 flags hashpspool tiers 20 read_tier 20 write_tier 20 stripe_width 4096
pool 19 'rbdc' replicated size 4 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 32 pgp_num 32 last_change 57125 flags hashpspool tier_of 17 cache_mode writeback target_bytes 1111111111111 target_objects 32768 hit_set bloom{false_positive_probability: 0.05, target_size: 0, seed: 0} 60s x4 stripe_width 0
pool 20 'datac' replicated size 4 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 32 pgp_num 32 last_change 53649 flags hashpspool tier_of 18 cache_mode writeback target_bytes 1111111111111 target_objects 131072 hit_set bloom{false_positive_probability: 0.05, target_size: 0, seed: 0} 60s x4 stripe_width 0

Inconsistencies manifest only on pools 19 ("rbdc") and 20 ("datac"), used for RBD and CephFS respectively.
I'm not sure about the exact I/O pattern that produces the inconsistencies, but with some activity they reliably appear.

On this cluster I have three other non-caching replicated pools that do not exhibit any inconsistencies.
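
The profile and pool dumps above correspond to the output of the usual inspection commands, sketched here with "myprofile" standing in for the actual EC profile name (not shown in this ticket):

ceph osd erasure-code-profile ls              # list the defined EC profiles
ceph osd erasure-code-profile get myprofile   # prints k/m/plugin/technique as quoted above
ceph osd dump | grep '^pool '                 # prints the pool definitions as quoted above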

Actions #12

Updated by Dmitry Smirnov over 9 years ago

On 0.80.5, inconsistencies disappeared from pool 20 (the CephFS caching pool), although I also stopped using kernel FS clients in favour of the FUSE client.

Pool 19 (the RBD caching pool) is still affected. All kernel RBD clients were upgraded to 3.16.3.
Occasionally `ceph pg scrub` clears the "inconsistent" state from an affected PG;
sometimes scrubbing marks some PGs from pool 19 as "inconsistent".
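
For what it's worth, re-scrubbing just pool 19's PGs can be scripted along these lines (a sketch; filtering `ceph pg dump` like this assumes its first column is the PG ID):

ceph pg dump 2>/dev/null | awk '$1 ~ /^19\./ {print $1}' | while read pg; do
    ceph pg deep-scrub "$pg"
done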

Actions #13

Updated by Sage Weil over 9 years ago

Dmitry Smirnov wrote:

On 0.80.5, inconsistencies disappeared from pool 20 (the CephFS caching pool), although I also stopped using kernel FS clients in favour of the FUSE client.

Pool 19 (the RBD caching pool) is still affected. All kernel RBD clients were upgraded to 3.16.3.
Occasionally `ceph pg scrub` clears the "inconsistent" state from an affected PG;
sometimes scrubbing marks some PGs from pool 19 as "inconsistent".

Is it possible the inconsistencies are correlated with the kernel (vs userspace) client? That would be quite odd. Perhaps a difference in the hints being sent...

Actions #14

Updated by Dmitry Smirnov over 9 years ago

Sage Weil wrote:

Is it possible the inconsistencies are correlated with the kernel (vs userspace) client? That would be quite odd. Perhaps a difference in the hints being sent...

I can't be 100% sure about it, but as far as I can tell the inconsistencies disappeared from pool 20 some time after the upgrade to 0.80.5 but before switching to FUSE clients. I mentioned the change of FS clients just in case, to avoid missing something relevant...

Actions #15

Updated by Dmitry Smirnov over 9 years ago

I think I found where it is happening. For a while I was using Btrfs-based OSDs with journals on SSD-based ext4: for example, if I had two OSDs on a Ceph node, I would create an ext4 partition on the SSD, move the "journal" files from the OSDs to that partition, and leave symlinks to the journal files on the OSDs.

Lately I decided to try moving all journals back to their OSDs, and to my surprise all inconsistencies disappeared. I reproduced the issue on 0.80.6, where it affected only the replicated caching pool in front of the RBD erasure pool.
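
For completeness, the usual way to move a journal back into its OSD directory looks roughly like this (a sketch; osd.2, the sysvinit service invocation, and the default /var/lib/ceph path are illustrative, not my exact setup):

service ceph stop osd.2
ceph-osd -i 2 --flush-journal                # drain the external journal before removing it
rm /var/lib/ceph/osd/ceph-2/journal          # drop the symlink pointing at the SSD partition
ceph-osd -i 2 --mkjournal                    # recreate the journal inside the OSD directory
service ceph start osd.2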

Actions #16

Updated by Dmitry Smirnov over 9 years ago

After upgrading to 0.87 I've noticed inconsistencies on all PGs of all caching pools again, during and after deletion of data from RBD and CephFS. All OSDs are Btrfs-based with internal journals and snapshotting enabled. No kernel clients; OSDs are running on kernel 3.16.5.

Actions #17

Updated by Loïc Dachary almost 9 years ago

  • Regression set to No
Actions #18

Updated by A G over 8 years ago

I have 0.80.9. I've set up caching pools and now I see the same errors on the caching OSDs.
All my OSDs are XFS with local journals (a file on the same OSD). It is not related to #8830; the parameter "filestore_xfs_extsize" is set to false on all OSDs.
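
One way to double-check that value on a running OSD is via the admin socket (osd.0 is just an example id):

ceph daemon osd.0 config get filestore_xfs_extsize
ceph daemon osd.0 config show | grep filestore_xfs_extsize   # or grep it out of the full running config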

Any suggestions?

Actions #19

Updated by Samuel Just over 8 years ago

Exactly what stat mismatch are you seeing?

Actions #20

Updated by Samuel Just over 8 years ago

  • Status changed from New to Can't reproduce