Bug #16997 (closed)

zfs: lfn attr not present causing osd backfilling to not progress

Added by Brian Felton over 7 years ago. Updated over 7 years ago.

Status: Can't reproduce
Priority: Low
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: other
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Problem: After removing (out + crush remove + auth del + osd rm) and eventually replacing three OSDs on a single host, I have five pgs that, after three weeks of recovery, are stuck in the state active+undersized+degraded+remapped+backfilling.
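For reference, the removal shorthand above expands to roughly the following commands; this is a minimal sketch using a hypothetical OSD id 123:

ceph osd out 123                 # stop mapping new data to the OSD
stop ceph-osd id=123             # stop the daemon on its host (Ubuntu 14.04 / upstart)
ceph osd crush remove osd.123    # remove it from the CRUSH map
ceph auth del osd.123            # delete its cephx key
ceph osd rm 123                  # remove it from the OSD map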

Cluster details:
- 9 hosts (32 cores, 256 GB RAM, Ubuntu 14.04 3.16.0-77-generic, 72 6TB SAS2 drives per host, collocated journals)
- Hammer (ceph version 0.94.6-2 (f870be457b16e4ff56ced74ed3a3c9a4c781f281) -- this is a custom build on top of 0.94.6 that includes two Yehuda patches for issues 15745 and 15886)
- object storage use only
- erasure coded (k=7, m=2) .rgw.buckets pool (8192 pgs); see the pool-creation sketch after this list
- failure domain of host
- cluster is currently storing 178TB over 260 MObjects (5-6% utilization per OSD)
- all 5 stuck pgs belong to .rgw.buckets
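For context, a pool with this geometry would be created roughly as follows; the profile name is hypothetical, and the failure domain matches the crush rule shown next:

ceph osd erasure-code-profile set ec-k7-m2 k=7 m=2 ruleset-failure-domain=host   # profile name is hypothetical
ceph osd pool create .rgw.buckets 8192 8192 erasure ec-k7-m2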

The relevant section of our crushmap:

rule .rgw.buckets {
        ruleset 1
        type erasure
        min_size 7
        max_size 9
        step set_chooseleaf_tries 5
        step set_choose_tries 250
        step take default
        step chooseleaf indep 0 type host
        step emit
}
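To check whether this rule can produce a complete 9-OSD mapping for every pg, a crushtool test along these lines can be run against the compiled crush map (file name is illustrative):

ceph osd getcrushmap -o crush.bin
crushtool -i crush.bin --test --rule 1 --num-rep 9 --show-bad-mappings
# any output from --show-bad-mappings is an input for which CRUSH could not
# pick 9 distinct OSDs, which shows up as the 2147483647 entries below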

Dump of stuck pgs:

ceph pg dump_stuck
ok
pg_stat state   up      up_primary      acting  acting_primary
33.151d active+undersized+degraded+remapped+backfilling [424,546,273,167,471,631,155,38,47]     424     [424,546,273,167,471,631,155,38,2147483647]     424
33.6c1  active+undersized+degraded+remapped+backfilling [453,86,565,266,338,580,297,577,404]    453     [453,86,565,266,338,2147483647,297,577,404]     453
33.150d active+undersized+degraded+remapped+backfilling [555,452,511,550,643,431,141,329,486]   555     [555,2147483647,511,550,643,431,141,329,486]    555
33.13a8 active+undersized+degraded+remapped+backfilling [507,317,276,617,565,28,471,200,382]    507     [507,2147483647,276,617,565,28,471,200,382]     507
33.4c1  active+undersized+degraded+remapped+backfilling [413,440,464,129,641,416,295,266,431]   413     [413,440,2147483647,129,641,416,295,266,431]    413
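The 2147483647 entries in the acting sets are CRUSH's "none" placeholder (2^31 - 1), i.e. no OSD is currently acting for that shard. Per-pg detail for any of these can be pulled with a query like the following, using a pg id taken from the list above:

ceph pg 33.151d query > pg-33.151d-query.json
# the recovery_state section of the output shows the backfill targets and
# why backfill is blocked or waiting for that pg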

This problem initially arose when we suffered three successive disk failures on the same host and removed the OSDs from the cluster ('out' + 'crush remove' + 'auth del' + 'osd rm'). We reached this state after a day of rebalancing and were stuck here for a few weeks. We replaced the disks last week and allowed the cluster to backfill and rebalance data to the new OSDs, but we became stuck on the same five pgs. All of the 'stuck' backfills are against OSDs on the host that lost then regained disks (node 07).

What I've tried:

- Increased set_choose_tries from 50 to 250 in steps of 50 (crushtool testing showed mapping issues that should be resolved at 100, but we continued raising the value in case this was a mapping error); the usual decompile/edit/inject cycle for this is sketched after this list
- Restarted the backfilling OSDs on node 07
- Restarted the leader OSDs
- Marked the leader OSDs down to force peering
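For reference, a set_choose_tries change like the one above is typically made by decompiling, editing, and re-injecting the crush map (file names are illustrative); the crushtool test shown earlier can be re-run against the recompiled map before injecting it:

ceph osd getcrushmap -o crush.bin
crushtool -d crush.bin -o crush.txt      # decompile to editable text
# edit "step set_choose_tries ..." in the .rgw.buckets rule in crush.txt
crushtool -c crush.txt -o crush.new      # recompile
ceph osd setcrushmap -i crush.new        # inject the updated map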

On the advice of Sam Just, I cranked up logging (--debug_ms 1 --debug_osd 20 --debug_filestore 20) on all OSDs across the five pgs and marked the leaders down. To keep noise to a minimum, I stopped all RadosGW instances prior to turning up debugging. I kept debugging on until the cluster returned to all pgs active+clean except for the five above.
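One way to apply those debug settings to a running OSD without restarting it is injectargs; a sketch, using osd.424 as an example (the same options can also be passed on the daemon command line at start):

ceph tell osd.424 injectargs '--debug_ms 1 --debug_osd 20 --debug_filestore 20'
# lower the levels again once the pgs have re-peered and the logs are collected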

Due to the number of logs, I am providing links here:

http://canada.os.ctl.io/osd-logs/ceph-osd.129.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.141.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.155.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.167.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.200.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.266.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.273.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.276.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.28.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.295.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.297.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.317.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.329.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.338.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.382.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.38.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.404.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.413.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.416.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.424.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.431.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.440.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.452.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.453.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.464.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.471.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.47.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.486.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.507.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.511.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.546.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.550.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.555.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.565.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.577.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.580.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.617.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.631.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.641.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.643.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.86.log.1.gz

As I understand things, there is no reason that the OSDs on node 07 should not be able to accept the backfills from the rest of the cluster for the pgs involved. It is entirely possible that I have something misconfigured that is causing/exacerbating this issue, but I am no stranger to losing/replacing disks and have not encountered this issue on any of my other clusters (five and growing).
