Bug #18162 (closed)

osd/ReplicatedPG.cc: recover_replicas: object added to missing set for backfill, but is not in recovering, error!

Added by Aaron T over 7 years ago. Updated over 6 years ago.

Status: Resolved
Priority: High
Assignee: David Zafman
Category: EC Pools
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport: jewel, luminous
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS): OSD
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I encountered the bug described in #13937. I wanted to help test PR 12088, and may have encountered an unrelated bug as a result.

     0> 2016-12-06 14:33:35.773259 7f3357278700 -1 osd/ReplicatedPG.cc: In function 'int ReplicatedPG::recover_replicas(int, ThreadPool::TPHandle&)' thread 7f3357278700 time 2016-12-06 14:33:35.758593
osd/ReplicatedPG.cc: 10740: FAILED assert(0)

 ceph version 10.2.3-366-g289696d (289696d533038c2248c1fe0c8ee03adad343cfa9)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x5619aedc4af0]
 2: (ReplicatedPG::recover_replicas(int, ThreadPool::TPHandle&)+0xa3f) [0x5619ae87843f]
 3: (ReplicatedPG::start_recovery_ops(int, ThreadPool::TPHandle&, int*)+0xc2e) [0x5619ae87ffee]
 4: (OSD::do_recovery(PG*, ThreadPool::TPHandle&)+0x372) [0x5619ae6f3d72]
 5: (ThreadPool::WorkQueue<PG>::_void_process(void*, ThreadPool::TPHandle&)+0x20) [0x5619ae742090]
 6: (ThreadPool::worker(ThreadPool::WorkThread*)+0xdb1) [0x5619aedb6cc1]
 7: (ThreadPool::WorkThread::entry()+0x10) [0x5619aedb7dc0]
 8: (()+0x770a) [0x7f33819c970a]
 9: (clone()+0x6d) [0x7f337fa4282d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
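
For context, the failed assert corresponds to a sanity check in ReplicatedPG::recover_replicas: when a replica's missing object lies beyond that replica's backfill horizon (last_backfill), the primary expects the object to already be tracked in its set of in-flight recovery operations, and aborts otherwise. Below is a minimal, self-contained sketch of that invariant, not the actual Ceph source; the types and names (PeerInfo, recover_replicas_check, the string stand-in for hobject_t) are illustrative assumptions reconstructed only from the assert message and stack trace above.

    // Simplified model of the failing invariant in ReplicatedPG::recover_replicas
    // (jewel era). All types below are illustrative stand-ins, not Ceph's real
    // definitions; only the shape of the check is inferred from the assert.
    #include <cassert>
    #include <iostream>
    #include <map>
    #include <set>
    #include <string>

    using hobject_t = std::string;  // stand-in for Ceph's object identifier

    struct PeerInfo {
      hobject_t last_backfill;      // backfill progress marker for this peer
      std::set<hobject_t> missing;  // objects this peer is missing
    };

    // Objects the primary currently has recovery ops in flight for.
    std::map<hobject_t, int> recovering;

    void recover_replicas_check(const PeerInfo& peer) {
      for (const hobject_t& soid : peer.missing) {
        if (soid > peer.last_backfill) {
          // The object is past the peer's backfill horizon, so it should only
          // appear in the missing set because a recovery op is already queued.
          if (recovering.count(soid) == 0) {
            std::cerr << "recover_replicas: object added to missing set for "
                         "backfill, but is not in recovering, error!\n";
            assert(0);  // the abort reported in this ticket fires here
          }
          continue;  // already being recovered; nothing more to do
        }
        // ... normal replica-recovery path for objects within last_backfill ...
      }
    }

    int main() {
      // "z" sorts after last_backfill "m" and nothing is in `recovering`,
      // so this models the state that triggers the abort.
      PeerInfo peer{/*last_backfill=*/"m", /*missing=*/{"z"}};
      recover_replicas_check(peer);
    }

In other words, the crash indicates the primary found a backfill-tracked missing object with no corresponding in-flight recovery op, a state the code treats as impossible.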

After applying the PR at https://github.com/ceph/ceph/pull/12088, I built ceph version 10.2.3-366-g289696d (289696d533038c2248c1fe0c8ee03adad343cfa9) on both Ubuntu 14.04 and 16.04 using the steps at http://docs.ceph.com/docs/jewel/install/build-ceph/.

I then started the OSDs which had been marked "out", as per the discussion related to #13937. Fairly shortly thereafter, the same OSDs that had been crashing on 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b) started crashing again, but with a new error. I am attaching the entire log from one such OSD below, captured with debug settings at 0/20.

As usual, please let me know what other information I can provide or tests I can run to help troubleshoot :)


Files

ceph-osd.3.log.bz2 (442 KB), Aaron T, 12/07/2016 12:14 AM
ec-handle-error-create-loc-list.patch (5.17 KB), Alexandre Oliva, 01/03/2017 01:34 AM
ec-handle-error-in-backfill-read.patch (10.8 KB), Alexandre Oliva, 01/06/2017 02:24 AM
adjust.patch (542 Bytes), Alexandre Oliva, 01/13/2017 10:11 AM
retrying-while-recovering.patch (1017 Bytes), Alexandre Oliva, 01/22/2017 02:30 AM

Related issues 3 (0 open, 3 closed)

Related to RADOS - Bug #18178: Unfound objects lost after OSD daemons restarted (Won't Fix, David Zafman, 12/07/2016)
Copied to RADOS - Backport #22013: jewel: osd/ReplicatedPG.cc: recover_replicas: object added to missing set for backfill, but is not in recovering, error! (Resolved, David Zafman)
Copied to RADOS - Backport #22069: luminous: osd/ReplicatedPG.cc: recover_replicas: object added to missing set for backfill, but is not in recovering, error! (Resolved, David Zafman)
