Bug #7350
osd: scrub does not detect recently touched and then renamed backend files
Description
This is on Dumpling (0.67.5-1precise).
Steps to reproduce:
Create a single-byte RADOS object and read it back:
root@ubuntu-ceph1:~# rados -p test put onebyte - <<< 'A'
root@ubuntu-ceph1:~# rados -p test get onebyte -
A
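A small nit about the transcript above: a bash here-string appends a trailing newline, so the "single-byte" object is actually two bytes as stored, which matches the "2 bytes data" figure in the pgmap output further down. A quick sketch:

```shell
# Here-strings ('<<<') append a newline, so 'A' is stored as "A\n" (two bytes).
printf '%s' 'A' | wc -c   # no newline: 1 byte
wc -c <<< 'A'             # here-string: 2 bytes
```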
Figure out where it's stored:
root@ubuntu-ceph1:~# ceph osd map test onebyte
osdmap e120 pool 'test' (3) object 'onebyte' -> pg 3.ed47d009 (3.1) -> up [0,2] acting [0,2]
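The on-disk filename used in the next step can be read straight off this mapping: under the FileStore layout seen here on Dumpling, the object lives in .../current/&lt;pgid&gt;_head/ and its filename encodes the object name, the snapshot ("head"), the object hash from the pg mapping (uppercased), and the pool id. A rough sketch of the format, as observed in this report:

```shell
# Reconstruct the FileStore filename from the `ceph osd map` output above.
# Observed format on Dumpling: <name>__head_<HASH>__<poolid>
obj=onebyte
pool_id=3
hash=ed47d009   # from "pg 3.ed47d009" above
printf '%s__head_%s__%s\n' "$obj" "$(printf '%s' "$hash" | tr a-z A-Z)" "$pool_id"
# -> onebyte__head_ED47D009__3
```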
Simulate data corruption/bitrot: open the file in vi and replace 'A' with 'B'.
root@ubuntu-ceph1:~# vi /var/lib/ceph/osd/ceph-0/current/3.1_head/onebyte__head_ED47D009__3
Read the object back:
root@ubuntu-ceph1:~# rados -p test get onebyte -
A
Interesting. We modified the primary, yet it silently reads back the (correct) copy from the replica.
OK, let's scrub. After all, the mtime was modified.
root@ubuntu-ceph1:~# ceph pg scrub 3.1; ceph -w
instructing pg 3.1 on osd.0 to scrub
cluster bd70ea39-58fc-4117-ade1-03a4d429cb49
health HEALTH_OK
monmap e4: 3 mons at {ubuntu-ceph1=192.168.122.201:6789/0,ubuntu-ceph2=192.168.122.202:6789/0,ubuntu-ceph3=192.168.122.203:6789/0}, election epoch 168, quorum 0,1,2 ubuntu-ceph1,ubuntu-ceph2,ubuntu-ceph3
osdmap e120: 3 osds: 3 up, 3 in
pgmap v893: 200 pgs: 200 active+clean; 2 bytes data, 108 MB used, 15218 MB / 15326 MB avail; 34B/s rd, 0op/s
mdsmap e1: 0/0/1 up
2014-02-06 12:55:25.815616 mon.0 [INF] pgmap v893: 200 pgs: 200 active+clean; 2 bytes data, 108 MB used, 15218 MB / 15326 MB avail; 34B/s rd, 0op/s
2014-02-06 12:55:45.827785 mon.0 [INF] pgmap v894: 200 pgs: 200 active+clean; 2 bytes data, 108 MB used, 15218 MB / 15326 MB avail; 20B/s rd, 0op/s
2014-02-06 12:55:42.768414 osd.0 [INF] 3.1 scrub ok
Interesting again. Scrub completes OK, no errors. Let's read it back again:
root@ubuntu-ceph1:~# rados -p test get onebyte -
A
This, despite the primary copy on disk being different:
root@ubuntu-ceph1:~# cat /var/lib/ceph/osd/ceph-0/current/3.1_head/onebyte__head_ED47D009__3
B
All right, let's deep scrub.
root@ubuntu-ceph1:~# ceph pg deep-scrub 3.1; ceph -w
instructing pg 3.1 on osd.0 to deep-scrub
cluster bd70ea39-58fc-4117-ade1-03a4d429cb49
health HEALTH_OK
monmap e4: 3 mons at {ubuntu-ceph1=192.168.122.201:6789/0,ubuntu-ceph2=192.168.122.202:6789/0,ubuntu-ceph3=192.168.122.203:6789/0}, election epoch 168, quorum 0,1,2 ubuntu-ceph1,ubuntu-ceph2,ubuntu-ceph3
osdmap e120: 3 osds: 3 up, 3 in
pgmap v896: 200 pgs: 200 active+clean; 2 bytes data, 108 MB used, 15218 MB / 15326 MB avail; 40B/s rd, 0op/s
mdsmap e1: 0/0/1 up
2014-02-06 12:56:11.218615 mon.0 [INF] pgmap v896: 200 pgs: 200 active+clean; 2 bytes data, 108 MB used, 15218 MB / 15326 MB avail; 40B/s rd, 0op/s
2014-02-06 12:56:15.842525 mon.0 [INF] pgmap v897: 200 pgs: 200 active+clean; 2 bytes data, 108 MB used, 15218 MB / 15326 MB avail
2014-02-06 12:56:14.788259 osd.0 [INF] 3.1 deep-scrub ok
No errors? Really?
root@ubuntu-ceph1:~# ceph health detail
HEALTH_OK
Yes, really. Odd.
Let's stop the OSD and see if we get the same results.
root@ubuntu-ceph1:~# sudo stop ceph-osd id=0
ceph-osd stop/waiting
root@ubuntu-ceph1:~# rados -p test get onebyte -
A
Hmmm, OK. Well, what happens if we bring the OSD back up?
root@ubuntu-ceph1:~# sudo start ceph-osd id=0
ceph-osd (ceph/0) start/running, process 11150
root@ubuntu-ceph1:~# rados -p test get onebyte -
error getting test/onebyte: No such file or directory
Yeah, ouch. What's our health status?
root@ubuntu-ceph1:~# ceph health detail
HEALTH_OK
Really? I don't think so. What's the status of our PGs?
root@ubuntu-ceph1:~# ceph -w
cluster bd70ea39-58fc-4117-ade1-03a4d429cb49
health HEALTH_OK
monmap e4: 3 mons at {ubuntu-ceph1=192.168.122.201:6789/0,ubuntu-ceph2=192.168.122.202:6789/0,ubuntu-ceph3=192.168.122.203:6789/0}, election epoch 168, quorum 0,1,2 ubuntu-ceph1,ubuntu-ceph2,ubuntu-ceph3
osdmap e124: 3 osds: 3 up, 3 in
pgmap v904: 200 pgs: 200 active+clean; 2 bytes data, 110 MB used, 15216 MB / 15326 MB avail
mdsmap e1: 0/0/1 up
2014-02-06 12:57:08.263928 mon.0 [INF] pgmap v904: 200 pgs: 200 active+clean; 2 bytes data, 110 MB used, 15216 MB / 15326 MB avail
All PGs active and clean? I have a hard time believing that.
Hmm, so let's see. What was my PG again?
root@ubuntu-ceph1:~# ceph osd map test onebyte
osdmap e124 pool 'test' (3) object 'onebyte' -> pg 3.ed47d009 (3.1) -> up [0,2] acting [0,2]
Maybe I'll repair it, even though my cluster tells me all is well?
root@ubuntu-ceph1:~# ceph pg repair 3.1
instructing pg 3.1 on osd.0 to repair
root@ubuntu-ceph1:~# rados -p test get onebyte -
A
That did it. But how Joe Average User is supposed to figure out that this is the fix, I have no idea. :)
History
#1 Updated by Florian Haas about 10 years ago
- Subject changed from Corrupted object undetected by both scrub and deep-scrub, appers lost when restarting primary OSD to Corrupted object undetected by both scrub and deep-scrub, appears lost when restarting primary OSD
#2 Updated by Sage Weil about 10 years ago
- Subject changed from Corrupted object undetected by both scrub and deep-scrub, appears lost when restarting primary OSD to osd: scrub does not detect recently renamed backend files
- Status changed from New to 12
The problem is that vi is renaming the file, and we cache recently opened files. Use echo asdf >> file or similar to modify the same file/inode, or make the fdcache flush that entry by generating some load in between, e.g. with rados bench.
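Sage's point can be demonstrated outside Ceph: an in-place write reuses the file's inode, while a vi-style save (write a temp file, then rename it over the original) produces a new inode, so an OSD holding the old file descriptor in its fdcache keeps reading the stale inode. A minimal sketch using a plain temp file rather than actual OSD data:

```shell
# Demonstrate in-place write vs. write-then-rename (what vi does on save).
f=$(mktemp)
printf 'A' > "$f"
inode_orig=$(stat -c %i "$f")

# In-place overwrite (echo/printf redirection): the inode is reused,
# so a process with the file already open sees the new contents.
printf 'B' > "$f"
inode_inplace=$(stat -c %i "$f")

# vi-style save: write a new file, then rename it over the original.
# The path now points at a different inode; old fds still see the old one.
printf 'C' > "$f.new"
mv "$f.new" "$f"
inode_renamed=$(stat -c %i "$f")

[ "$inode_orig" = "$inode_inplace" ] && echo "in-place write: same inode"
[ "$inode_orig" != "$inode_renamed" ] && echo "rename-style save: new inode"
rm -f "$f"
```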
#3 Updated by Sage Weil about 10 years ago
- Subject changed from osd: scrub does not detect recently renamed backend files to osd: scrub does not detect recently touched and then renamed backend files
#4 Updated by Florian Haas about 10 years ago
- Severity changed from 3 - minor to 4 - irritation
Thanks Sage -- I can confirm that the issue does not appear when echoing directly into the file, so evidently it was indeed vi doing a rename; sorry for the noise. Since a rename would hardly be the result of bitrot -- and, for good measure, I checked whether anything failed when I flipped bits in the file's xattrs instead of the file content (nothing did) -- I'm downgrading this one to "irritation".
#5 Updated by Sage Weil almost 10 years ago
- Status changed from 12 to Won't Fix