Bug #16997
zfs: lfn attr not present causing osd backfilling to not progress
Status: Closed
Description
Problem: After removing ('out' + 'crush remove' + 'auth del' + 'osd rm') and eventually replacing three OSDs on a single host, I have five pgs that, after 3 weeks of recovery, are stuck in active+undersized+degraded+remapped+backfilling.
Cluster details:
- 9 hosts (32 cores, 256 GB RAM, Ubuntu 14.04 3.16.0-77-generic, 72 6TB SAS2 drives per host, collocated journals)
- Hammer (ceph version 0.94.6-2 (f870be457b16e4ff56ced74ed3a3c9a4c781f281) -- this is a custom build on top of 0.94.6 that includes two Yehuda patches for issues 15745 and 15886)
- object storage use only
- erasure coded (k=7, m=2) .rgw.buckets pool (8192 pgs)
- failure domain of host
- cluster is currently storing 178TB over 260 MObjects (5-6% utilization per OSD)
- all 5 stuck pgs belong to .rgw.buckets
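For context on the numbers above, a back-of-envelope sketch (plain Python, no Ceph required) of what the k=7, m=2 profile implies: each pg is split into 9 shards spread across 9 hosts, which is why every up/acting set below lists 9 OSDs and why the crush rule uses min_size 7 / max_size 9.

```python
# Back-of-envelope numbers for the erasure-coded .rgw.buckets pool
# described above (k=7 data shards, m=2 coding shards).
k, m = 7, 2

shards_per_pg = k + m          # 9 OSDs in each up/acting set
min_shards_to_serve = k        # matches min_size 7 in the crush rule
tolerated_failures = m         # a pg stays recoverable with 2 shards lost
raw_overhead = (k + m) / k     # raw bytes stored per logical byte

print(f"shards per pg:       {shards_per_pg}")
print(f"min shards to serve: {min_shards_to_serve}")
print(f"tolerated failures:  {tolerated_failures}")
print(f"storage overhead:    {raw_overhead:.2f}x")
```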
The relevant section of our crushmap:
rule .rgw.buckets {
        ruleset 1
        type erasure
        min_size 7
        max_size 9
        step set_chooseleaf_tries 5
        step set_choose_tries 250
        step take default
        step chooseleaf indep 0 type host
        step emit
}
Dump of stuck pgs:
ceph pg dump_stuck
ok
pg_stat  state                                            up                                     up_primary  acting                                         acting_primary
33.151d  active+undersized+degraded+remapped+backfilling  [424,546,273,167,471,631,155,38,47]    424         [424,546,273,167,471,631,155,38,2147483647]    424
33.6c1   active+undersized+degraded+remapped+backfilling  [453,86,565,266,338,580,297,577,404]   453         [453,86,565,266,338,2147483647,297,577,404]    453
33.150d  active+undersized+degraded+remapped+backfilling  [555,452,511,550,643,431,141,329,486]  555         [555,2147483647,511,550,643,431,141,329,486]   555
33.13a8  active+undersized+degraded+remapped+backfilling  [507,317,276,617,565,28,471,200,382]   507         [507,2147483647,276,617,565,28,471,200,382]    507
33.4c1   active+undersized+degraded+remapped+backfilling  [413,440,464,129,641,416,295,266,431]  413         [413,440,2147483647,129,641,416,295,266,431]   413
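A minimal sketch of how to read these sets: 2147483647 is 2^31 - 1 (CRUSH_ITEM_NONE), the sentinel Ceph prints for a shard slot that no OSD currently holds. Comparing the up set (where CRUSH wants the shards) with the acting set (where they actually are) shows which shard of each pg is waiting on backfill and which OSD is the target. The helper name below is my own, not a Ceph API.

```python
# Locate the unfilled shard in each stuck pg by comparing 'up' with
# 'acting'. 2147483647 == 2**31 - 1 is CRUSH_ITEM_NONE, Ceph's sentinel
# for "no OSD holds this shard".
CRUSH_ITEM_NONE = 2**31 - 1  # 2147483647

stuck_pgs = {
    # pg_id: (up, acting) -- values copied from 'ceph pg dump_stuck' above
    "33.151d": ([424, 546, 273, 167, 471, 631, 155, 38, 47],
                [424, 546, 273, 167, 471, 631, 155, 38, CRUSH_ITEM_NONE]),
    "33.6c1":  ([453, 86, 565, 266, 338, 580, 297, 577, 404],
                [453, 86, 565, 266, 338, CRUSH_ITEM_NONE, 297, 577, 404]),
    "33.150d": ([555, 452, 511, 550, 643, 431, 141, 329, 486],
                [555, CRUSH_ITEM_NONE, 511, 550, 643, 431, 141, 329, 486]),
    "33.13a8": ([507, 317, 276, 617, 565, 28, 471, 200, 382],
                [507, CRUSH_ITEM_NONE, 276, 617, 565, 28, 471, 200, 382]),
    "33.4c1":  ([413, 440, 464, 129, 641, 416, 295, 266, 431],
                [413, 440, CRUSH_ITEM_NONE, 129, 641, 416, 295, 266, 431]),
}

def missing_shards(up, acting):
    """Return (shard_index, backfill_target_osd) pairs for shard slots
    that are empty in the acting set but have a target in the up set."""
    return [(i, u) for i, (u, a) in enumerate(zip(up, acting))
            if a == CRUSH_ITEM_NONE]

for pg, (up, acting) in stuck_pgs.items():
    for shard, target in missing_shards(up, acting):
        print(f"pg {pg}: shard {shard} needs backfill to osd.{target}")
```

Per the description below, the backfill targets this surfaces are exactly the OSDs on the host that lost and regained its disks (node 07).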
This problem initially arose when we suffered three successive disk failures on the same host and removed the OSDs from the cluster ('out' + 'crush remove' + 'auth del' + 'osd rm'). We reached this state after a day of rebalancing and were stuck here for a few weeks. We replaced the disks last week and allowed the cluster to backfill and rebalance data to the new OSDs, but we became stuck on the same five pgs. All of the 'stuck' backfills are against OSDs on the host that lost then regained disks (node 07).
What I've tried:
- Increased set_choose_tries from 50 to 250 in steps of 50 (crushtool testing showed the mapping issues should be resolved at 100, but we kept raising the value in case this was still a mapping error)
- Restarted the backfilling OSDs on node 07
- Restarted the leader OSDs
- Marked the leader OSDs down to force peering
On the advice of Sam Just, I cranked up logging (--debug_ms 1 --debug_osd 20 --debug_filestore 20) on all OSDs across the five pgs and marked the leaders down. To keep noise to a minimum, I stopped all RadosGW instances prior to turning up debugging. I kept debugging on until the cluster returned to all pgs active+clean except for the five above.
Due to the number of logs, I am providing links here:
http://canada.os.ctl.io/osd-logs/ceph-osd.129.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.141.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.155.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.167.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.200.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.266.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.273.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.276.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.28.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.295.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.297.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.317.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.329.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.338.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.382.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.38.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.404.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.413.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.416.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.424.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.431.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.440.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.452.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.453.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.464.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.471.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.47.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.486.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.507.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.511.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.546.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.550.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.555.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.565.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.577.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.580.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.617.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.631.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.641.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.643.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.86.log.1.gz
As I understand things, there is no reason that the OSDs on node 07 should not be able to accept the backfills from the rest of the cluster for the pgs involved. It is entirely possible that I have something misconfigured that is causing/exacerbating this issue, but I am no stranger to losing/replacing disks and have not encountered this issue on any of my other clusters (five and growing).