Bug #18162

osd/ReplicatedPG.cc: recover_replicas: object added to missing set for backfill, but is not in recovering, error!

Added by Aaron Ten Clay over 2 years ago. Updated about 1 year ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
EC Pools
Target version:
-
Start date:
12/06/2016
Due date:
% Done:

0%

Source:
Community (user)
Tags:
Backport:
jewel,luminous
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
OSD
Pull request ID:

Description

I encountered the bug in #13937. I wanted to help test PR12088, and may have encountered an unrelated bug as a result.

     0> 2016-12-06 14:33:35.773259 7f3357278700 -1 osd/ReplicatedPG.cc: In function 'int ReplicatedPG::recover_replicas(int, ThreadPool::TPHandle&)' thread 7f3357278700 time 2016-12-06 14:33:35.758593
osd/ReplicatedPG.cc: 10740: FAILED assert(0)

 ceph version 10.2.3-366-g289696d (289696d533038c2248c1fe0c8ee03adad343cfa9)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x5619aedc4af0]
 2: (ReplicatedPG::recover_replicas(int, ThreadPool::TPHandle&)+0xa3f) [0x5619ae87843f]
 3: (ReplicatedPG::start_recovery_ops(int, ThreadPool::TPHandle&, int*)+0xc2e) [0x5619ae87ffee]
 4: (OSD::do_recovery(PG*, ThreadPool::TPHandle&)+0x372) [0x5619ae6f3d72]
 5: (ThreadPool::WorkQueue<PG>::_void_process(void*, ThreadPool::TPHandle&)+0x20) [0x5619ae742090]
 6: (ThreadPool::worker(ThreadPool::WorkThread*)+0xdb1) [0x5619aedb6cc1]
 7: (ThreadPool::WorkThread::entry()+0x10) [0x5619aedb7dc0]
 8: (()+0x770a) [0x7f33819c970a]
 9: (clone()+0x6d) [0x7f337fa4282d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

After applying the PR at https://github.com/ceph/ceph/pull/12088, I built ceph version 10.2.3-366-g289696d (289696d533038c2248c1fe0c8ee03adad343cfa9) on both Ubuntu 14.04 and 16.04 using the steps at http://docs.ceph.com/docs/jewel/install/build-ceph/.

I started the OSDs which had been marked "out" as per discussion related to #13937. Fairly shortly thereafter, the same OSDs which were crashing on 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b) started crashing again, but with a new error. Attaching the entire log from one such OSD below with debug settings at 0/20.

As usual, please let me know what other information I can provide or tests I can run to help troubleshoot :)

ceph-osd.3.log.bz2 (442 KB) Aaron Ten Clay, 12/07/2016 12:14 AM

ec-handle-error-create-loc-list.patch View (5.17 KB) Alexandre Oliva, 01/03/2017 01:34 AM

ec-handle-error-in-backfill-read.patch View (10.8 KB) Alexandre Oliva, 01/06/2017 02:24 AM

adjust.patch View (542 Bytes) Alexandre Oliva, 01/13/2017 10:11 AM

retrying-while-recovering.patch View (1017 Bytes) Alexandre Oliva, 01/22/2017 02:30 AM


Related issues

Related to RADOS - Bug #18178: Unfound objects lost after OSD daemons restarted Won't Fix 12/07/2016
Copied to RADOS - Backport #22013: jewel: osd/ReplicatedPG.cc: recover_replicas: object added to missing set for backfill, but is not in recovering, error! Resolved
Copied to RADOS - Backport #22069: luminous: osd/ReplicatedPG.cc: recover_replicas: object added to missing set for backfill, but is not in recovering, error! Resolved

History

#1 Updated by David Zafman over 2 years ago

There is an object with peer missing:

peer osd.20(0) missing {2:d0cc2acb:::10000080236.0000007d:head=0'0}

An EIO from the 46(6) shard:

2016-12-06 14:33:35.757133 7f335c282700 10 osd.3 pg_epoch: 110014 pg[2.bs2( v 94305'214276 (78823'204254,94305'214276] local-les=110014 n=137180 ec=136 les/c/f 110014/79592/109085 110010/110010/110010) [20,2147483647,3,39,60,7,53,48,55,10,26,43,59,37]/[2147483647,2147483647,3,32,37,19,46,33,40,11,28,21,44,15] r=2 lpr=110010 pi=63136-110009/3155 rops=1 bft=7(5),10(9),20(0),26(10),37(13),39(3),43(11),48(7),53(6),55(8),59(12),60(4) crt=91692'214272 lcod 0'0 mlcod 0'0 active+undersized+degraded+remapped+backfilling NIBBLEWISE] _failed_push: Read error 2:d0cc2acb:::10000080236.0000007d:head r=0 errors={46(6)=-5}

The _failed_push() removed the object from recovering, but it is after last_backfill, so it is expected to be there.

#2 Updated by David Zafman over 2 years ago

Another commit has been added to PR12088 to fix a problem. The scenario in this bug might not be impacted, but I thought I'd mention it.

#3 Updated by Aaron Ten Clay over 2 years ago

David Zafman wrote:

Another commit has been added to PR12088 to fix a problem. The scenario in this bug might not be impacted, but I thought I'd mention it.

I got the impression this additional commit fixes an unrelated issue, but I will compile again and test soon.

#4 Updated by Aaron Ten Clay over 2 years ago

David Zafman wrote:

Another commit has been added to PR12088 to fix a problem. The scenario in this bug might not be impacted, but I thought I'd mention it.

No joy. I tried both building the PR itself (based on 10.2.3) and manually merging the PR changeset onto the tag 10.2.4. In both cases, the behavior is the same as this bug report. I think this issue is distinct from what PR12088 is addressing.

Is there anything I can try to mitigate the problem without a code change? You mentioned the order of operations is leading to an expectation being violated - can I reset the expectation somehow?

My cluster has been operationally down for over a month now due to these two issues :)

#5 Updated by David Zafman over 2 years ago

Find all the "_failed_push: Read error" messages in the logs of the crashing OSDs and note the errors={osd#(shard#)=-5} entries. See the log above, which indicates an EIO from the disk for osd 46. The shard# doesn't matter. Hopefully, there is only a limited number of bad disks.

So for osd.46, for example, set noout and take osd.46 down. Replace the drive and bring osd.46 back. This should repopulate the new osd.46 disk. There are probably detailed instructions on how to replace a disk in a ceph how-to somewhere. Don't forget to unset noout once you've replaced all the bad drives.
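The log search above can be scripted. A rough sketch, assuming the default log location /var/log/ceph (adjust the path for your deployment):

```shell
# List the distinct OSD ids reported as read-error sources in
# "_failed_push: Read error ... errors={osd(shard)=-5}" messages.
# These are the candidate bad disks to replace.
grep -h '_failed_push: Read error' /var/log/ceph/ceph-osd.*.log \
  | grep -o '[0-9]\+([0-9]\+)=-5' \
  | cut -d'(' -f1 \
  | sort -un
```

For the log line quoted in comment #1, this prints `46`. If an errors={} map lists several shards, each failing OSD id is reported once.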

#6 Updated by Aaron Ten Clay over 2 years ago

David Zafman wrote:

Find all the "_failed_push: Read error" messages in the logs of the crashing OSDs and note the errors={osd#(shard#)=-5} entries. See the log above, which indicates an EIO from the disk for osd 46. The shard# doesn't matter. Hopefully, there is only a limited number of bad disks.

So for osd.46, for example, set noout and take osd.46 down. Replace the drive and bring osd.46 back. This should repopulate the new osd.46 disk. There are probably detailed instructions on how to replace a disk in a ceph how-to somewhere. Don't forget to unset noout once you've replaced all the bad drives.

I'm not entirely sure I follow - are you suggesting replacing the OSD's backing store without removing the OSD from the CRUSH map? Or the "normal" process of killing the process, removing the OSD from the CRUSH map, then creating a new OSD, adding to CRUSH map, and starting the daemon?

I am entirely certain (from experimentation) that too many OSDs are hitting read errors for replacing every affected OSD to bring all PGs back up. I don't know how my cluster got into this state, but there are a few missing objects here and there across 20+ OSDs in a 65-OSD cluster, with k=10, m=4 erasure coding on the pool in question. Based on the log message above, it seems only one of 10 OSDs is missing the object, so I don't understand what's stopping the erasure code from recovering the chunk.
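For reference, an object in a k=10, m=4 pool stays recoverable as long as at least k of its k+m shards are readable, so a single missing shard should indeed be repairable. A quick sanity check of that arithmetic (the shard counts are just the profile mentioned above):

```shell
# An EC object can be rebuilt from any k of its k+m shards, so up to m
# unreadable shards per object are tolerable. Profile for this pool: k=10, m=4.
k=10; m=4
for unreadable in 1 4 5; do
  if [ $((k + m - unreadable)) -ge "$k" ]; then
    echo "unreadable=$unreadable: recoverable"
  else
    echo "unreadable=$unreadable: lost"
  fi
done
```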

Any advice or pointers to in-depth documentation would be appreciated. I'll happily go through and manually resurrect objects as necessary, if need be, but I have no idea how to manually recover a missing EC chunk.

Thanks!

#7 Updated by Aaron Ten Clay over 2 years ago

I'm running an experiment against the current kraken branch, preventing the ceph_abort() call which produces this error. Will update when I have results.

https://work.techtonium.com/bitbucket/users/aaron/repos/ceph/commits/aa6226d22e2c5aa1265bd2d15c2488a8f49c5c41

#8 Updated by Aaron Ten Clay over 2 years ago

Aaron Ten Clay wrote:

I'm running an experiment against the current kraken branch, preventing the ceph_abort() call which produces this error. Will update when I have results.

https://work.techtonium.com/bitbucket/users/aaron/repos/ceph/commits/aa6226d22e2c5aa1265bd2d15c2488a8f49c5c41

Not unexpectedly, this hack did not resolve the issue. It led to a crash in PrimaryLogPG::_clear_recovery_state() because the map of recovering PGs was not empty. I'm not familiar enough with Ceph internals to go any further and would greatly appreciate pointers. Not afraid to compile and run proposed fixes :)

Thanks,
-Aaron

#9 Updated by Alexandre Oliva about 2 years ago

I've been trying to address the problem in jewel that backfills abort when they encounter a read error or a corrupt file. Having applied David Zafman's patch for 13937 and Kefu Chai's for 17857, I still hit the problem described in this bug.

Using the attached patch, I no longer get that problem. Backfill continues and often completes, in spite of read errors. However:

- sometimes, backfill will not complete; it will backfill remaining objects (at least with larger-than-default osd recovery max active and osd recovery max single start), and then get stuck logging the message "recovery ending with missing replicas" until the primary is restarted, at which point the backfilled PG advances to active+clean.

- other times, backfill will complete, but the object that faced the read failure will NOT be recovered even though backfill suggests it was; that would only be fixed after scrub and repair.

With an earlier version of the patch, I had more verbose logging of recovery retries, and I fixed a problem where we'd remove the source of the failed read from an empty missing_loc set and then retry it over and over. I have NOT, however, observed successful retries with the attached version of the patch, which underwent only cosmetic changes such as moving the location and log level of the retry messages.

Which is all to say that this patch is not a fully-functional solution, but at least it does allow recovery to progress (though at times with manual intervention) even in the presence of read errors, instead of having the primary restart over and over, even when the read error occurs in another EC member. I'm still trying to figure out what I'm missing for it to complete the recovery reliably, but maybe someone more familiar with the recovery machinery will be able to figure it out and beat me to it.

I'll be glad to try patches or to implement suggestions to try and fix this for good.

#10 Updated by Alexandre Oliva about 2 years ago

Nevermind the patch; clearing backfills_in_flight seems to be wrong and dangerous; I think it might even make the object seem to disappear, or even actually disappear.

I am now trying to leave it alone, and have recover_replicas behave in a more useful way if it finds the object already in backfills_in_flight. I'll let you know how that goes once I've implemented it and given it at least some testing.

#11 Updated by Alexandre Oliva about 2 years ago

Here's a patch for jewel that, on top of David Zafman's patch for 13937 and Kefu Chai's for 17857, enables OSDs to recover from read failures during EC backfills. Unlike my earlier patch, recover_replicas is no longer responsible for the retries: recover_backfill is. The former just leaves retry-pending objects alone now, and the latter will stop backfill_info.objects from getting trimmed past the earliest pending retry. It's working beautifully!

#12 Updated by Alexandre Oliva about 2 years ago

The previous patch was a bit too conservative when constructing the 'have' set from missing_loc in ECBackend::get_min_avail_to_read_shards: it even prevented recovery from fixing the missing-but-should-be-there objects caused by an earlier, broken version of the patch I posted last week.

Here's an incremental patch that allows this sort of inconsistency to be fixed without crashing.

#13 Updated by Alexandre Oliva about 2 years ago

Here's another incremental patch that fixes a problem in which multiple read errors in the same PG could cause the backfill interval to advance past some failures, discarding the information required to retry them; recovery would then reach the end of the backfill and claim to complete with missing replicas, while actually keeping a backfill slot busy. Fixing it required removing the assumption that the recovery set is empty when recover_backfill is called with pending retries. While at it, I arranged for the test guarding the verbose logging of retries to be skipped when we know there are no pending retries.

Now, is there anyone watching this space, that could integrate these patches, or should I post them elsewhere?

#14 Updated by Aaron Ten Clay about 2 years ago

Alexandre Oliva wrote:

Here's another incremental patch that fixes a problem in which multiple read errors in the same PG could cause the backfill interval to advance past some failures, discarding the information required to retry them; recovery would then reach the end of the backfill and claim to complete with missing replicas, while actually keeping a backfill slot busy. Fixing it required removing the assumption that the recovery set is empty when recover_backfill is called with pending retries. While at it, I arranged for the test guarding the verbose logging of retries to be skipped when we know there are no pending retries.

Now, is there anyone watching this space, that could integrate these patches, or should I post them elsewhere?

I'd love to see these patches make their way into 11.x, which I upgraded to hoping for a resolution, but found none. Of course, now my OSDs can't be downgraded :D I hope some maintainers are watching this.

#15 Updated by Nathan Cutler about 2 years ago

Now, is there anyone watching this space, that could integrate these patches, or should I post them elsewhere?

You should post your patches as a GitHub pull request against the master branch of ceph/ceph.git. Make sure the commit description includes the following two lines at the end:

Fixes: http://tracker.ceph.com/issues/18162
Signed-off-by: Your Name <me@example.com>

For detailed instructions on submitting patches, see https://github.com/ceph/ceph/blob/master/SubmittingPatches.rst
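The suggested trailers fit a standard git workflow; a sketch, where the branch name and commit subject are placeholders (`git commit -s` appends the Signed-off-by line from the user.name/user.email in your git config):

```shell
# Hypothetical topic branch for the fix.
git checkout -b wip-18162-ec-backfill-read-errors origin/master

# -s adds "Signed-off-by: Your Name <me@example.com>" automatically;
# the Fixes: line goes at the end of the commit message body.
git commit -s -m "osd: retry EC backfill reads after read errors

Fixes: http://tracker.ceph.com/issues/18162"
```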

#16 Updated by Nathan Cutler about 2 years ago

  • Backport set to kraken

#17 Updated by Nathan Cutler about 2 years ago

  • Backport changed from kraken to jewel,kraken

#18 Updated by Alexandre Oliva about 2 years ago

Sorry, github is off-limits for me (it tries to run non-Free Software on my browser, and it refuses to work if I don't let it). I'll look into posting the patches to the ceph-devel list.

That said, I'm afraid there's still something wrong even with all of them installed, at least with non-default settings that allow starting multiple reads in a single operation. If we do so and get errors from different OSDs, the retry for both will insist on one of the failing OSDs and fail again and again. I haven't been able to fix that yet. Should I submit the patchset to the list despite this caveat?

#19 Updated by David Zafman almost 2 years ago

  • Assignee set to David Zafman

I've created https://github.com/ceph/ceph/pull/14054 to track Alexandre's changes.

I'm working on handling out-of-space conditions during backfill, where I'm seeing this same crash. I'm modifying the code to abort the backfill upon an out-of-space condition, and I'm on master, so it doesn't necessarily relate directly to this report.

#20 Updated by Greg Farnum almost 2 years ago

  • Project changed from Ceph to RADOS
  • Category changed from OSD to EC Pools
  • Component(RADOS) OSD added

#21 Updated by Alastair Dewhurst over 1 year ago

Hi

My institute has a large cluster running Kraken 11.2.1-0 with EC 8+3, and we believe we have run into this bug. I was wondering if any progress had been made on this?

I should also point out that I think the severity is set incorrectly (it should be much higher). The only solution to this problem we have found is to delete and recreate PGs, which obviously causes data loss. You cannot run a production-quality service using EC pools with this bug.

I am afraid I am not in a position to write/commit code but if there are other ways I could help with the resolution of the problem, I would certainly be willing to try.

Alastair

#22 Updated by David Zafman over 1 year ago

  • Backport changed from jewel,kraken to jewel,luminous
  • Severity changed from 3 - minor to 2 - major

#23 Updated by David Zafman over 1 year ago

  • Status changed from New to In Progress

#24 Updated by David Zafman over 1 year ago

  • Status changed from In Progress to Testing

#25 Updated by David Zafman over 1 year ago

  • Status changed from Testing to Need Review

#26 Updated by David Zafman over 1 year ago

  • Copied to Backport #22013: jewel: osd/ReplicatedPG.cc: recover_replicas: object added to missing set for backfill, but is not in recovering, error! added

#27 Updated by Alastair Dewhurst over 1 year ago

Is there a reason this hasn't been marked for back porting to Luminous yet? (Apologies if this is just because you haven't had time).

#28 Updated by David Zafman over 1 year ago

  • Status changed from Need Review to Pending Backport

We had an immediate need for this to be backported to jewel. Been busy on other things. Setting this to pending backport.

#29 Updated by David Zafman over 1 year ago

  • Copied to Backport #22069: luminous: osd/ReplicatedPG.cc: recover_replicas: object added to missing set for backfill, but is not in recovering, error! added

#30 Updated by David Zafman about 1 year ago

  • Status changed from Pending Backport to Resolved

#31 Updated by David Zafman about 1 year ago

  • Related to Bug #18178: Unfound objects lost after OSD daemons restarted added
