Bug #16567

Ceph raises scrub errors on pgs with objects being overwritten

Added by Goncalo Borges almost 8 years ago. Updated almost 7 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
Correctness/Safety
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
FileStore, OSD
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I am submitting this bug following the advice of the Ceph experts on the users mailing list.

1./ In Jewel 10.2.2 I got some scrub errors:

  # ceph health detail
    HEALTH_ERR 2 pgs inconsistent; 2 scrub errors; crush map has legacy tunables (require bobtail, min is firefly)
    pg 6.39c is active+clean+inconsistent, acting [2,60,32]
    pg 6.263 is active+clean+inconsistent, acting [56,39,6]
    2 scrub errors
    crush map has legacy tunables (require bobtail, min is firefly); see http://ceph.com/docs/master/rados/operations/crush-map/#tunables
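
As a side note, Jewel also provides rados subcommands that list exactly which pgs and objects the scrubber flagged; the sketch below assumes they are run against the pool backing pg 6.263 (<pool-name> is a placeholder):

  # rados list-inconsistent-pg <pool-name>
  # rados list-inconsistent-obj 6.263 --format=json-pretty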

--- * ---

2./ I started by looking at pg 6.263. Errors appeared only in the osd.56 logs, not in the logs of the other OSDs. The log reports a mismatch between the object's real on-disk size and the size Ceph thinks it should have:

  # cat ceph-osd.56.log-20160629 | grep -Hn 'ERR'
    (standard input):8569:2016-06-29 08:09:50.952397 7fd023322700 -1 log_channel(cluster) log [ERR] : scrub 6.263 6:c645f18e:::100002a343d.00000000:head on disk size (1836) does not match object info size (41242) adjusted for ondisk to (41242)
    (standard input):8602:2016-06-29 08:11:11.227865 7fd023322700 -1 log_channel(cluster) log [ERR] : 6.263 scrub 1 errors
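
The same comparison can be made outside the scrubber by asking librados for the size recorded in the object info and comparing it with the file FileStore keeps on disk. This is only a sketch; <cephfs-data-pool> stands for whatever pool backs pg 6.263:

  # rados -p <cephfs-data-pool> stat 100002a343d.00000000
  # ls -l /var/lib/ceph/osd/ceph-56/current/6.263_head/DIR_3/DIR_6/DIR_2/DIR_A/100002a343d.00000000__head_718FA263__6

If the object really is being overwritten continuously, both values keep moving, so comparing them is only meaningful if both are read at the same instant.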

--- * ---

3./ I ran a 'ceph pg repair 6.263'. Eventually, that pg went back to 'active+clean':

  # ceph pg dump | grep ^6.263
    dumped all in format plain
    6.263 10845 0 0 0 0 39592671010 3037 3037 active+clean 2016-06-30 02:13:00.455293 1005'2126237 1005:2795768 [56,39,6] 56 [56,39,6] 56 1005'2076134 2016-06-30 02:13:00.455256 1005'2076134 2016-06-30 02:13:00.455256

However, in the logs I found:

2016-06-30 02:03:03.992240 osd.56 192.231.127.226:6801/21569 278 : cluster [INF] 6.263 repair starts
2016-06-30 02:13:00.455237 osd.56 192.231.127.226:6801/21569 279 : cluster [INF] 6.263 repair ok, 0 fixed
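
To double-check that the pg really is clean again after the repair, one could force another deep scrub and then look at the scrub stamps; this is just a sketch of the commands, not output from my cluster:

  # ceph pg deep-scrub 6.263
  # ceph pg 6.263 query | grep -E 'last_(deep_)?scrub_stamp'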

--- * ---

4./ Inspecting a bit more, I found that the object inside the pg is changing size on all involved OSDs. I think this is because the object is being overwritten by a user application. Note how the size of the object changes between consecutive 'ls -l' commands on osd.56 (the same is true on osd.39 and osd.6); a small loop that automates this sampling is sketched after the listing:

  # ls -l /var/lib/ceph/osd/ceph-56/current/6.263_head/DIR_3/DIR_6/DIR_2/DIR_A/100002a343d.00000000__head_718FA263__6
    -rw-r--r-- 1 ceph ceph 8602 Jun 30 02:53 /var/lib/ceph/osd/ceph-56/current/6.263_head/DIR_3/DIR_6/DIR_2/DIR_A/100002a343d.00000000__head_718FA263__6
  [root@rccephosd8 ceph]# ls -l /var/lib/ceph/osd/ceph-56/current/6.263_head/DIR_3/DIR_6/DIR_2/DIR_A/100002a343d.00000000__head_718FA263__6
    -rw-r--r-- 1 ceph ceph 170 Jun 30 02:53 /var/lib/ceph/osd/ceph-56/current/6.263_head/DIR_3/DIR_6/DIR_2/DIR_A/100002a343d.00000000__head_718FA263__6
  # ls -l /var/lib/ceph/osd/ceph-56/current/6.263_head/DIR_3/DIR_6/DIR_2/DIR_A/100002a343d.00000000__head_718FA263__6
    -rw-r--r-- 1 ceph ceph 15436 Jun 30 02:53 /var/lib/ceph/osd/ceph-56/current/6.263_head/DIR_3/DIR_6/DIR_2/DIR_A/100002a343d.00000000__head_718FA263__6
  # ls -l /var/lib/ceph/osd/ceph-56/current/6.263_head/DIR_3/DIR_6/DIR_2/DIR_A/100002a343d.00000000__head_718FA263__6
    -rw-r--r-- 1 ceph ceph 26044 Jun 30 02:53 /var/lib/ceph/osd/ceph-56/current/6.263_head/DIR_3/DIR_6/DIR_2/DIR_A/100002a343d.00000000__head_718FA263__6
  # ls -l /var/lib/ceph/osd/ceph-56/current/6.263_head/DIR_3/DIR_6/DIR_2/DIR_A/100002a343d.00000000__head_718FA263__6
    -rw-r--r-- 1 ceph ceph 0 Jun 30 02:53 /var/lib/ceph/osd/ceph-56/current/6.263_head/DIR_3/DIR_6/DIR_2/DIR_A/100002a343d.00000000__head_718FA263__6
  # ls -l /var/lib/ceph/osd/ceph-56/current/6.263_head/DIR_3/DIR_6/DIR_2/DIR_A/100002a343d.00000000__head_718FA263__6
    -rw-r--r-- 1 ceph ceph 14076 Jun 30 02:53 /var/lib/ceph/osd/ceph-56/current/6.263_head/DIR_3/DIR_6/DIR_2/DIR_A/100002a343d.00000000__head_718FA263__6
  # ls -l /var/lib/ceph/osd/ceph-56/current/6.263_head/DIR_3/DIR_6/DIR_2/DIR_A/100002a343d.00000000__head_718FA263__6
    -rw-r--r-- 1 ceph ceph 31110 Jun 30 02:53 /var/lib/ceph/osd/ceph-56/current/6.263_head/DIR_3/DIR_6/DIR_2/DIR_A/100002a343d.00000000__head_718FA263__6
  # ls -l /var/lib/ceph/osd/ceph-56/current/6.263_head/DIR_3/DIR_6/DIR_2/DIR_A/100002a343d.00000000__head_718FA263__6
    -rw-r--r-- 1 ceph ceph 0 Jun 30 02:53 /var/lib/ceph/osd/ceph-56/current/6.263_head/DIR_3/DIR_6/DIR_2/DIR_A/100002a343d.00000000__head_718FA263__6
  # ls -l /var/lib/ceph/osd/ceph-56/current/6.263_head/DIR_3/DIR_6/DIR_2/DIR_A/100002a343d.00000000__head_718FA263__6
    -rw-r--r-- 1 ceph ceph 20230 Jun 30 02:53 /var/lib/ceph/osd/ceph-56/current/6.263_head/DIR_3/DIR_6/DIR_2/DIR_A/100002a343d.00000000__head_718FA263__6
  # ls -l /var/lib/ceph/osd/ceph-56/current/6.263_head/DIR_3/DIR_6/DIR_2/DIR_A/100002a343d.00000000__head_718FA263__6
    -rw-r--r-- 1 ceph ceph 23392 Jun 30 02:53 /var/lib/ceph/osd/ceph-56/current/6.263_head/DIR_3/DIR_6/DIR_2/DIR_A/100002a343d.00000000__head_718FA263__6
  # ls -l /var/lib/ceph/osd/ceph-56/current/6.263_head/DIR_3/DIR_6/DIR_2/DIR_A/100002a343d.00000000__head_718FA263__6
    -rw-r--r-- 1 ceph ceph 0 Jun 30 02:53 /var/lib/ceph/osd/ceph-56/current/6.263_head/DIR_3/DIR_6/DIR_2/DIR_A/100002a343d.00000000__head_718FA263__6
  # ls -l /var/lib/ceph/osd/ceph-56/current/6.263_head/DIR_3/DIR_6/DIR_2/DIR_A/100002a343d.00000000__head_718FA263__6
    -rw-r--r-- 1 ceph ceph 0 Jun 30 02:53 /var/lib/ceph/osd/ceph-56/current/6.263_head/DIR_3/DIR_6/DIR_2/DIR_A/100002a343d.00000000__head_718FA263__6
  # ls -l /var/lib/ceph/osd/ceph-56/current/6.263_head/DIR_3/DIR_6/DIR_2/DIR_A/100002a343d.00000000__head_718FA263__6
    -rw-r--r-- 1 ceph ceph 41412 Jun 30 02:53 /var/lib/ceph/osd/ceph-56/current/6.263_head/DIR_3/DIR_6/DIR_2/DIR_A/100002a343d.00000000__head_718FA263__6
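
A simple way to sample the size automatically instead of re-running ls by hand (sketch only; adjust the count and interval as needed):

  # for i in $(seq 1 20); do stat -c '%s %y' /var/lib/ceph/osd/ceph-56/current/6.263_head/DIR_3/DIR_6/DIR_2/DIR_A/100002a343d.00000000__head_718FA263__6; sleep 1; done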

--- * ---

5./ I am wondering whether the scrub error I got is really just a consequence of the scrub procedure not properly handling objects in a pg that are changing very fast (because they are being overwritten).
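
If it helps, here is a rough sketch of how I imagine this could be reproduced: keep overwriting a single object with payloads of different sizes while deep-scrubbing its pg, and watch the primary OSD's log for the size-mismatch error. The pool name, object name and payload sizes below are arbitrary placeholders, not values from my cluster:

  # create a test object and find out which pg it maps to
  rados -p <test-pool> put overwrite-test /etc/hosts
  ceph osd map <test-pool> overwrite-test

  # keep overwriting it with randomly sized payloads in the background...
  while true; do
      dd if=/dev/urandom of=/tmp/payload bs=1k count=$((RANDOM % 64)) 2>/dev/null
      rados -p <test-pool> put overwrite-test /tmp/payload
  done &

  # ...while forcing deep scrubs of its pg (repeat a few times) and grepping
  # the primary OSD's log for "does not match object info size"
  ceph pg deep-scrub <pgid-from-osd-map>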
