Bug #16997

closed

zfs: lfn attr not present causing osd backfilling to not progress

Added by Brian Felton over 7 years ago. Updated over 7 years ago.

Status:
Can't reproduce
Priority:
Low
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Problem: After removing (out + crush remove + auth del + osd rm) and eventually replacing three osds on a single host, I have five pgs that, after 3 weeks of recovery, are stuck in a state of active+undersized+degraded+remapped+backfilling.
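
For reference, a sketch of that removal sequence as it is usually issued; the OSD id here is a placeholder, not one of the actual OSDs involved:

ceph osd out 123
ceph osd crush remove osd.123
ceph auth del osd.123
ceph osd rm 123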

Cluster details:
- 9 hosts (32 cores, 256 GB RAM, Ubuntu 14.04 3.16.0-77-generic, 72 6TB SAS2 drives per host, collocated journals)
- Hammer (ceph version 0.94.6-2 (f870be457b16e4ff56ced74ed3a3c9a4c781f281) -- this is a custom build on top of 0.94.6 that includes two Yehuda patches for issues 15745 and 15886)
- object storage use only
- erasure coded (k=7, m=2) .rgw.buckets pool (8192 pgs)
- failure domain of host
- cluster is currently storing 178TB over 260 MObjects (5-6% utilization per OSD)
- all 5 stuck pgs belong to .rgw.buckets

The relevant section of our crushmap:

rule .rgw.buckets {
        ruleset 1
        type erasure
        min_size 7
        max_size 9
        step set_chooseleaf_tries 5
        step set_choose_tries 250
        step take default
        step chooseleaf indep 0 type host
        step emit
}

Dump of stuck pgs:

ceph pg dump_stuck
ok
pg_stat state   up      up_primary      acting  acting_primary
33.151d active+undersized+degraded+remapped+backfilling [424,546,273,167,471,631,155,38,47]     424     [424,546,273,167,471,631,155,38,2147483647]     424
33.6c1  active+undersized+degraded+remapped+backfilling [453,86,565,266,338,580,297,577,404]    453     [453,86,565,266,338,2147483647,297,577,404]     453
33.150d active+undersized+degraded+remapped+backfilling [555,452,511,550,643,431,141,329,486]   555     [555,2147483647,511,550,643,431,141,329,486]    555
33.13a8 active+undersized+degraded+remapped+backfilling [507,317,276,617,565,28,471,200,382]    507     [507,2147483647,276,617,565,28,471,200,382]     507
33.4c1  active+undersized+degraded+remapped+backfilling [413,440,464,129,641,416,295,266,431]   413     [413,440,2147483647,129,641,416,295,266,431]    413

This problem initially arose when we suffered three successive disk failures on the same host and removed the OSDs from the cluster ('out' + 'crush remove' + 'auth del' + 'osd rm'). We reached this state after a day of rebalancing and were stuck here for a few weeks. We replaced the disks last week and allowed the cluster to backfill and rebalance data to the new OSDs, but we became stuck on the same five pgs. All of the 'stuck' backfills are against OSDs on the host that lost then regained disks (node 07).

What I've tried:

  • Increased set_choose_tries from 50 to 250 in steps of 50 (crushtool testing showed mapping issues that should have been resolved at 100, but we kept raising the value in case it was a mapping error); a crushtool sketch follows this list
  • Restarted the backfilling OSDs on node 07
  • Restarted the leader OSDs
  • Marked the leader OSDs down to force peering
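
A sketch of the kind of crushtool check referenced in the first item above; the crushmap filename is a placeholder, while rule 1 and num-rep 9 come from the .rgw.buckets rule shown earlier:

ceph osd getcrushmap -o crushmap.bin
crushtool -i crushmap.bin --test --rule 1 --num-rep 9 --show-bad-mappings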

On the advice of Sam Just, I cranked up logging (--debug_ms 1 --debug_osd 20 --debug_filestore 20) on all OSDs across the five pgs and marked the leaders down. To keep noise to a minimum, I stopped all RadosGW instances prior to turning up debugging. I kept debugging on until the cluster returned to all pgs active+clean except for the five above.
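
A sketch of how those debug levels can be raised at runtime via injectargs; osd.424, one of the primaries above, is just an example:

ceph tell osd.424 injectargs '--debug_ms 1 --debug_osd 20 --debug_filestore 20'
ceph osd down 424    # mark the primary down to force repeering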

Due to the number of logs, I am providing links here:

http://canada.os.ctl.io/osd-logs/ceph-osd.129.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.141.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.155.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.167.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.200.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.266.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.273.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.276.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.28.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.295.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.297.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.317.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.329.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.338.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.382.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.38.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.404.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.413.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.416.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.424.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.431.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.440.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.452.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.453.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.464.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.471.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.47.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.486.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.507.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.511.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.546.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.550.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.555.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.565.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.577.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.580.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.617.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.631.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.641.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.643.log.1.gz
http://canada.os.ctl.io/osd-logs/ceph-osd.86.log.1.gz

As I understand things, there is no reason that the OSDs on node 07 should not be able to accept the backfills from the rest of the cluster for the pgs involved. It is entirely possible that I have something misconfigured that is causing/exacerbating this issue, but I am no stranger to losing/replacing disks and have not encountered this issue on any of my other clusters (five and growing).

Actions #1

Updated by Samuel Just over 7 years ago

  • Priority changed from Normal to Urgent
Actions #2

Updated by Brian Felton over 7 years ago

I also wanted to add that I've instructed the pgs to scrub, deep-scrub, and repair, but the cluster never takes these actions. I'm not sure this is relevant, but I wanted to be thorough.
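
For reference, a sketch of the commands used to request those operations, using one of the five pgids above:

ceph pg scrub 33.151d
ceph pg deep-scrub 33.151d
ceph pg repair 33.151d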

Actions #3

Updated by Samuel Just over 7 years ago

I need some additional information:
1) Please confirm via the admin socket that all of the osds are running the same version, and that version is the one you mentioned above (a sketch of this check follows below).
2) What versions has this cluster run in the past?
3) Please push a branch matching that sha1 to somewhere I can fetch it.
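
A sketch of the admin-socket check asked for in item 1, assuming the default socket path on the OSD hosts:

# run on each OSD host
for sock in /var/run/ceph/ceph-osd.*.asok; do
    ceph --admin-daemon "$sock" version
done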

Actions #4

Updated by Samuel Just over 7 years ago

In addition to the above information, can you add a recursive ls of the 33.151d collection on osd 47?

Actions #5

Updated by Samuel Just over 7 years ago

Also, are you using zfs?

Actions #6

Updated by Brian Felton over 7 years ago

  1. All OSDs are running the same version -- confirmed (I can cut and paste the Ansible-y goodness if you'd like to verify yourself)
  2. The cluster has previously run 0.94.3. It was upgraded from there to 0.94.6-2
  3. https://github.com/bjfelton/as-ceph/tree/clc_hammer_patch
  4. http://canada.os.ctl.io/osd-logs/33.151ds.tgz
  5. Yes, we are using ZFS.
Actions #7

Updated by Samuel Just over 7 years ago

Are there no subdirectories in that directory? I need to see what subdirectories the files in that collection are in.

Actions #8

Updated by Samuel Just over 7 years ago

That github link gives a 404.

Actions #9

Updated by Brian Felton over 7 years ago

Sam,

My apologies. Here is an update, public link: https://github.com/bjfelton/ceph/tree/clc_hammer_patch

Also, there are no subdirectories:

root@osd07:/osd/47/current/33.151ds8_head# find . -type d
.
root@osd07:/osd/47/current/33.151ds8_head#

We have clusters paired together with one as the DR site. We have found that directory splitting on the OSDs causes significant performance problems due to our collocated journals, so we set one cluster in the pair to split on default boundaries and one to split much, much later in life. In our testing with ZFS, we didn't find this to be a problem, and this ensures we have a failover target whenever splits hit us.
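
For context, a sketch of the filestore options that govern that split behaviour in Hammer; the values shown are the stock defaults, not the ones used on these clusters:

[osd]
    # splitting kicks in once a collection subdirectory exceeds
    # roughly 16 * merge_threshold * split_multiple objects
    filestore merge threshold = 10
    filestore split multiple = 2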

Actions #10

Updated by Samuel Just over 7 years ago

It looks to me like HashIndex::list_by_hash is erroring out. We'll need more debugging there to figure out why.

Actions #11

Updated by Samuel Just over 7 years ago

https://github.com/athanatos/ceph/commit/6c9cffca444f7611bebd1be37e42bab2de12ba0b is based on v0.94.6 and adds some debugging (I have not even run this code, so use at your own risk). You'll want to cherry-pick it and run it on osd.47 and capture the same debugging levels on 47 and 424.
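
A sketch of how that commit could be pulled into the existing 0.94.6-based tree; the remote name is a placeholder:

git remote add athanatos https://github.com/athanatos/ceph.git
git fetch athanatos
git cherry-pick 6c9cffca444f7611bebd1be37e42bab2de12ba0b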

Actions #12

Updated by Brian Felton over 7 years ago

Sam,

I have built and deployed the patch, restarted OSDs 47 and 424, set debugging, downed 424, and let things settle. The following logs are the result:

http://osd-logs.canada.os.ctl.io/ceph-osd.47.log.20160816.tgz
http://osd-logs.canada.os.ctl.io/ceph-osd.424.log.20160816.tgz

I am reviewing as well to better understand the code path here, but your help is very much appreciated.

Brian

Actions #13

Updated by Samuel Just over 7 years ago

  • Subject changed from OSD Backfilling Cannot Progress to zfs: lfn attr not present causing osd backfilling to not progress
  • Priority changed from Urgent to Low

2016-08-16 17:08:06.991156 7f853ead3700 20 LFNIndex(/osd/47/current/33.151ds8_head) list_objects: lfn_translate returned: -61 for short_name default.284344.63\uNOMS\sda4fc86c-05c7-4ae6-82f4-a9ad90f5f601\sbackups\s20160429165633\sp1\snomis\scustomers\sbeforeUpgrade\supgrade\spatch\sNOMIS\uNODE\snomisserver\sstandalone\sdeployments\sproductdocs.war\sWEB-INF\sweb.xml___2b1cae32b58740dd2fb3_0_long

The lfn xattr is missing. This is probably a disconnect between how zfs handles xattrs and how xfs/ext4/btrfs do. If you want to investigate why this is happening, you'll have to dig into LFNIndex.* and HashIndex.* to see how we deal with long file names.

Actions #14

Updated by Samuel Just over 7 years ago

Also, I bet the primary (at least) is ok since the scan didn't error out. From that, I'm guessing that whatever happened to osd.47 (and, I bet, to the targets of all of the other stuck backfills) is unusual. You can probably work around the stall on osd.47 by using ceph-objectstore-tool to remove the partially backfilled shard for that pg on osd.47 and letting backfill restart. If the same problem happens again, then it should be relatively simple for you to modify the LFNIndex code to add debugging to work out why.
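
A heavily hedged sketch of that workaround, run with osd.47 stopped; the journal path is assumed from the data-path layout seen earlier in this thread, and the shard suffix is taken from the 33.151ds8_head directory name, so double-check both before running anything destructive:

ceph-objectstore-tool --data-path /osd/47 --journal-path /osd/47/journal \
    --op remove --pgid 33.151ds8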

Actions #15

Updated by Brian Felton over 7 years ago

Sam,

First, I cannot thank you enough for your assistance here. The last bits of debugging pointed me in the right direction, and I've got my cluster in a healthy state again.

tl;dr -- there is no problem with Ceph or ZFS. Please close.

If you're still with me...

The problem was not ZFS. It is perfectly capable of handling Ceph's xattrs (both in normal cases and for lfn cases), although Ceph does appear to artificially limit the xattr value size it will use on filesystems other than xfs/btrfs:

// max xattr value size
OPTION(filestore_max_xattr_value_size, OPT_U32, 0)      //Override
OPTION(filestore_max_xattr_value_size_xfs, OPT_U32, 64<<10)
OPTION(filestore_max_xattr_value_size_btrfs, OPT_U32, 64<<10)
// ext4 allows 4k xattrs total including some smallish extra fields and the
// keys.  We're allowing 2 512 inline attrs in addition to some filestore
// replay attrs.  After accounting for those, we still need to fit up to
// two attrs of this value.  That means we need this value to be around 1k
// to be safe.  This is hacky, but it's not worth complicating the code
// to work around ext4's total xattr limit.
OPTION(filestore_max_xattr_value_size_other, OPT_U32, 1<<10)
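
Purely for illustration, a sketch of how the override option above could be set in ceph.conf; the value is a placeholder matching the xfs/btrfs defaults, and nothing here establishes that raising it on ZFS is advisable:

[osd]
    filestore max xattr value size = 65536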

Since we've been running these clusters on ZFS for 18 or so months now, since we're currently storing hundreds of millions of objects, and since we're no strangers to disk replacements and backfills, it seemed unlikely that Ceph was failing to interact properly with ZFS in the general case. But the logs were certainly clear that we were missing an xattr here, so I did some digging. Using getfattr, I checked the file above on OSDs 47 and 424. On 424, I found a file with four attrs. On 47, however, the file was missing its xattrs entirely.
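
The xattr comparison described above, sketched; the osd.424 path assumes the same layout as osd.47, and the object filename and shard directory are placeholders:

getfattr -d -m '.' /osd/47/current/33.151ds8_head/<object file>      # missing attrs
getfattr -d -m '.' /osd/424/current/<pg shard dir>/<object file>     # four attrs present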

After taking a backup, I removed the file, cranked up debugging, and forced peering. And the file was no longer in the logs as problematic (although many others were). Now, at this point, I had two options -- do surgery on the files in the logs (i.e. remove them or manually set the attrs), or nuke the pg's contents on 47 and start the backfill over. Given the four other pgs in this state and the amount of time it would have taken to write the needed tooling, I stopped the OSD, took a backup of the contents, nuked everything except the __head* file, and let 'er rip.
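
Roughly, the per-pg surgery described above as Ubuntu 14.04 upstart-era commands; the backup destination and the find pattern are illustrative, and this is destructive, so it is shown only to make the sequence concrete:

stop ceph-osd id=47
tar czf /root/33.151ds8_head.backup.tgz /osd/47/current/33.151ds8_head
find /osd/47/current/33.151ds8_head -type f ! -name '*__head*' -delete
start ceph-osd id=47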

Lather, rinse, repeat with the other four pgs, and a few hours later, I have a healthy cluster.

I'm not sure how I ended up with files with no xattrs, but I'd wager the thrashing caused by our collocated journal setup was the primary culprit. I'm not sure if being able to scrub/repair the pg would have solved this, but it's a moot point now.

Thanks again for your assistance.

Actions #16

Updated by Samuel Just over 7 years ago

  • Status changed from New to Can't reproduce

Scrub/repair won't run when the pg isn't clean, so no help there. I'm very concerned that those xattrs were missing. Both xfs and ext4 require a lot of coddling to handle our xattrs properly (hence the artificial limits). I would not be surprised if the OSD were doing something that exposes an issue with zfs's xattrs (or merely with our assumptions about how they work), since we don't test zfs. The good news is that scrub would have hung in this case as well, so as long as you leave scrub enabled, you'll get an early warning if this crops up again.

I'm closing this for now since zfs isn't a priority, but if you get any more information, feel free to reopen.
