Bug #10524

FAILED assert(peer_missing.count(fromshard))

Added by Loïc Dachary over 9 years ago. Updated about 9 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: -
Target version: -
% Done: 0%
Source: other
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

http://qa-proxy.ceph.com/teuthology/samuelj-2015-01-07_13:21:03-rados-wip-sam-testing-wip-testing-vanilla-fixes-basic-multi/689640/

   -22> 2015-01-07 23:46:39.374580 7f1b2fb28700  5 -- op tracker -- seq: 501, time: 2015-01-07 23:46:39.374436, event: throttled, op: MRecoveryReserve GRANT  pgid: 0.17, query_epoch: 6
   -21> 2015-01-07 23:46:39.374584 7f1b2fb28700  5 -- op tracker -- seq: 501, time: 2015-01-07 23:46:39.374497, event: all_read, op: MRecoveryReserve GRANT  pgid: 0.17, query_epoch: 6
   -20> 2015-01-07 23:46:39.374587 7f1b2fb28700  5 -- op tracker -- seq: 501, time: 2015-01-07 23:46:39.374564, event: dispatched, op: MRecoveryReserve GRANT  pgid: 0.17, query_epoch: 6
   -19> 2015-01-07 23:46:39.374592 7f1b2fb28700  5 -- op tracker -- seq: 501, time: 2015-01-07 23:46:39.374592, event: waiting_for_osdmap, op: MRecoveryReserve GRANT  pgid: 0.17, query_epoch: 6
   -18> 2015-01-07 23:46:39.374596 7f1b2fb28700 15 osd.0 6 require_same_or_newer_map 6 (i am 6) 0x4bb5c20
   -17> 2015-01-07 23:46:39.374618 7f1b2fb28700  5 -- op tracker -- seq: 501, time: 2015-01-07 23:46:39.374618, event: done, op: MRecoveryReserve GRANT  pgid: 0.17, query_epoch: 6
   -16> 2015-01-07 23:46:39.374629 7f1b2fb28700 10 osd.0 6 do_waiters -- start
   -15> 2015-01-07 23:46:39.374630 7f1b2fb28700 10 osd.0 6 do_waiters -- finish
   -14> 2015-01-07 23:46:39.374634 7f1b27b18700 10 osd.0 pg_epoch: 6 pg[0.17( v 6'15 (0'0,6'15] local-les=5 n=12 ec=1 les/c 5/6 4/4/2) [0,2] r=0 lpr=4 crt=6'13 lcod 6'14 mlcod 6'14 active+recovery_wait m=1] handle_peering_event: epoch_sent: 6 epoch_requested: 6 RemoteRecoveryReserved
   -13> 2015-01-07 23:46:39.374679 7f1b27b18700  5 osd.0 pg_epoch: 6 pg[0.17( v 6'15 (0'0,6'15] local-les=5 n=12 ec=1 les/c 5/6 4/4/2) [0,2] r=0 lpr=4 crt=6'13 lcod 6'14 mlcod 6'14 active+recovery_wait m=1] exit Started/Primary/Active/WaitRemoteRecoveryReserved 0.001485 1 0.000118
   -12> 2015-01-07 23:46:39.374700 7f1b27b18700  5 osd.0 pg_epoch: 6 pg[0.17( v 6'15 (0'0,6'15] local-les=5 n=12 ec=1 les/c 5/6 4/4/2) [0,2] r=0 lpr=4 crt=6'13 lcod 6'14 mlcod 6'14 active+recovery_wait m=1] enter Started/Primary/Active/Recovering
   -11> 2015-01-07 23:46:39.374726 7f1b27b18700 10 osd.0 6 queue_for_recovery queued pg[0.17( v 6'15 (0'0,6'15] local-les=5 n=12 ec=1 les/c 5/6 4/4/2) [0,2] r=0 lpr=4 crt=6'13 lcod 6'14 mlcod 6'14 active+recovering m=1]
   -10> 2015-01-07 23:46:39.374742 7f1b27b18700 10 log is not dirty
    -9> 2015-01-07 23:46:39.374767 7f1b2230d700 10 osd.0 6 do_recovery can start 5 (0/15 rops)
    -8> 2015-01-07 23:46:39.374771 7f1b2230d700 10 osd.0 6 do_recovery starting 5 pg[0.17( v 6'15 (0'0,6'15] local-les=5 n=12 ec=1 les/c 5/6 4/4/2) [0,2] r=0 lpr=4 crt=6'13 lcod 6'14 mlcod 6'14 active+recovering m=1]
    -7> 2015-01-07 23:46:39.374789 7f1b2230d700 10 osd.0 pg_epoch: 6 pg[0.17( v 6'15 (0'0,6'15] local-les=5 n=12 ec=1 les/c 5/6 4/4/2) [0,2] r=0 lpr=4 crt=6'13 lcod 6'14 mlcod 6'14 active+recovering m=1] recover_primary recovering 0 in pg
    -6> 2015-01-07 23:46:39.374800 7f1b2230d700 10 osd.0 pg_epoch: 6 pg[0.17( v 6'15 (0'0,6'15] local-les=5 n=12 ec=1 les/c 5/6 4/4/2) [0,2] r=0 lpr=4 crt=6'13 lcod 6'14 mlcod 6'14 active+recovering m=1] recover_primary missing(1)
    -5> 2015-01-07 23:46:39.374814 7f1b2230d700 10 osd.0 pg_epoch: 6 pg[0.17( v 6'15 (0'0,6'15] local-les=5 n=12 ec=1 les/c 5/6 4/4/2) [0,2] r=0 lpr=4 crt=6'13 lcod 6'14 mlcod 6'14 active+recovering m=1] recover_primary 190db197/benchmark_data_burnupi59_7159_object114/head//0 6'14 (missing) (missing head)
    -4> 2015-01-07 23:46:39.374830 7f1b2230d700 10 osd.0 pg_epoch: 6 pg[0.17( v 6'15 (0'0,6'15] local-les=5 n=12 ec=1 les/c 5/6 4/4/2) [0,2] r=0 lpr=4 crt=6'13 lcod 6'14 mlcod 6'14 active+recovering m=1] start_recovery_op 190db197/benchmark_data_burnupi59_7159_object114/head//0
    -3> 2015-01-07 23:46:39.374883 7f1b2230d700 10 osd.0 6 start_recovery_op pg[0.17( v 6'15 (0'0,6'15] local-les=5 n=12 ec=1 les/c 5/6 4/4/2) [0,2] r=0 lpr=4 rops=1 crt=6'13 lcod 6'14 mlcod 6'14 active+recovering m=1] 190db197/benchmark_data_burnupi59_7159_object114/head//0 (5/15 rops)
    -2> 2015-01-07 23:46:39.374895 7f1b2230d700 10 osd.0 pg_epoch: 6 pg[0.17( v 6'15 (0'0,6'15] local-les=5 n=12 ec=1 les/c 5/6 4/4/2) [0,2] r=0 lpr=4 rops=1 crt=6'13 lcod 6'14 mlcod 6'14 active+recovering m=1] recover_object: 190db197/benchmark_data_burnupi59_7159_object114/head//0
    -1> 2015-01-07 23:46:39.374910 7f1b2230d700  7 osd.0 pg_epoch: 6 pg[0.17( v 6'15 (0'0,6'15] local-les=5 n=12 ec=1 les/c 5/6 4/4/2) [0,2] r=0 lpr=4 rops=1 crt=6'13 lcod 6'14 mlcod 6'14 active+recovering m=1] pull 190db197/benchmark_data_burnupi59_7159_object114/head//0 v 6'14 on osds 0 from osd.0
     0> 2015-01-07 23:46:39.378442 7f1b2230d700 -1 osd/ReplicatedPG.cc: In function 'void ReplicatedBackend::prepare_pull(eversion_t, const hobject_t&, ObjectContextRef, ReplicatedBackend::RPGHandle*)' thread 7f1b2230d700 time 2015-01-07 23:46:39.374934
osd/ReplicatedPG.cc: 8552: FAILED assert(peer_missing.count(fromshard))

 ceph version 0.90-793-g5f48d50 (5f48d505ab8a08832a65f449c7b927047c910cf9)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0xba5f3b]
 2: (ReplicatedBackend::prepare_pull(eversion_t, hobject_t const&, std::tr1::shared_ptr<ObjectContext>, ReplicatedBackend::RPGHandle*)+0xf2f) [0x85bb2f]
 3: (ReplicatedBackend::recover_object(hobject_t const&, eversion_t, std::tr1::shared_ptr<ObjectContext>, std::tr1::shared_ptr<ObjectContext>, PGBackend::RecoveryHandle*)+0x2ee) [0xa1309e]
 4: (ReplicatedPG::recover_missing(hobject_t const&, eversion_t, int, PGBackend::RecoveryHandle*)+0x5d2) [0x86d222]
 5: (ReplicatedPG::recover_primary(int, ThreadPool::TPHandle&)+0x139e) [0x8742ae]
 6: (ReplicatedPG::start_recovery_ops(int, PG::RecoveryCtx*, ThreadPool::TPHandle&, int*)+0x54b) [0x8a729b]
 7: (OSD::do_recovery(PG*, ThreadPool::TPHandle&)+0x293) [0x69b843]
 8: (OSD::RecoveryWQ::_process(PG*, ThreadPool::TPHandle&)+0x17) [0x6fc497]
 9: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa46) [0xb970e6]
 10: (ThreadPool::WorkThread::entry()+0x10) [0xb98190]
 11: (()+0x8182) [0x7f1b42384182]
 12: (clone()+0x6d) [0x7f1b408f038d]


Files

ceph-osd.0-bad.log.gz (711 KB): log of primary osd.0, recover from the primary: fail (Loïc Dachary, 01/15/2015 12:54 PM)
ceph-osd.0-good.log.gz (247 KB): log of primary osd.0, recover from the replica: ok (Loïc Dachary, 01/15/2015 12:54 PM)

Related issues: 1 (0 open, 1 closed)

Has duplicate: Ceph - Bug #10566: osd/ReplicatedPG.cc: 8729: FAILED assert(peer_missing.count(fromshard)) from scrub_test.yaml (Duplicate, 01/18/2015)

#1 Updated by Loïc Dachary over 9 years ago

  • Status changed from New to In Progress

Running https://github.com/ceph/ceph-qa-suite/blob/master/tasks/scrub_test.py against ceph master succeeds when it is run alone (no thrashing).

#2 Updated by Loïc Dachary over 9 years ago

  • Status changed from In Progress to Need More Info

I'm not able to reproduce it on master or even using the exact same commit. Will wait for it to show up again.

#3 Updated by Samuel Just over 9 years ago

ubuntu@teuthology:/a/samuelj-2015-01-12_21:02:00-rados-wip-sam-testing-wip-testing-vanilla-fixes-basic-multi/700477

#4 Updated by Samuel Just over 9 years ago

  • Status changed from Need More Info to 12

#5 Updated by Loïc Dachary over 9 years ago

The bad peer is selected as a good peer because PGBackend::be_select_auth_object thinks the bad peer is an authoritative candidate:

  -224> 2015-01-07 23:46:39.366423 7f1b21b0c700 10 osd.0 pg_epoch: 6 pg[0.17( v 6'14 (0'0,6'14] local-les=5 n=12 ec=1 les/c 5/6 4/4/2) [0,2] r=0 lpr=4 crt=6'12 lcod 6'13 mlcod 6'13 active+clean+scrubbing+deep+inconsistent+repair] be_select_auth_object: selecting osd 0 for obj 190db197/benchmark_data_burnupi59_7159_object114/head//0
  -223> 2015-01-07 23:46:39.366463 7f1b21b0c700 10 osd.0 pg_epoch: 6 pg[0.17( v 6'14 (0'0,6'14] local-les=5 n=12 ec=1 les/c 5/6 4/4/2) [0,2] r=0 lpr=4 crt=6'12 lcod 6'13 mlcod 6'13 active+clean+scrubbing+deep+inconsistent+repair] be_select_auth_object: selecting osd 2 for obj 190db197/benchmark_data_burnupi59_7159_object114/head//0
  -222> 2015-01-07 23:46:39.366508 7f1b21b0c700 20 osd.0 pg_epoch: 6 pg[0.17( v 6'14 (0'0,6'14] local-les=5 n=12 ec=1 les/c 5/6 4/4/2) [0,2] r=0 lpr=4 crt=6'12 lcod 6'13 mlcod 6'13 active+clean+scrubbing+deep+inconsistent+repair] be_compare_scrubmaps noting missing digest on 190db197/benchmark_data_burnupi59_7159_object114/head//0

#6 Updated by Loïc Dachary over 9 years ago

  • Status changed from 12 to In Progress

ReplicatedBackend::prepare_pull will try to pull from any location added to missing_loc, which can unfortunately include the bad peer in some cases, because PGBackend::be_select_auth_object may add to the authoritative list a peer that is not good according to a different logic.

In this case the digest of the bad peer is found to be incorrect, but be_select_auth_object knows nothing about digests and selects it as an authoritative peer. I think the proper fix is to make PGBackend::be_select_auth_object not select as authoritative an osd that is not valid.
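
A minimal standalone sketch of the fix proposed above, assuming hypothetical simplified types (this is not Ceph's ScrubMap or be_select_auth_object API): a shard whose scrubbed data digest contradicts its recorded object-info digest is rejected up front instead of being kept as an authoritative candidate. The two digest values are the ones seen in the logs quoted in note #9 below.

// Hypothetical, simplified sketch of the proposed fix (not Ceph's real
// ScrubMap / be_select_auth_object code): a shard whose scrub digest
// contradicts the recorded object-info digest is rejected up front instead
// of being kept as an authoritative candidate.
#include <cstdint>
#include <iostream>
#include <map>
#include <optional>

struct ShardScrubInfo {
  uint32_t data_digest;               // digest computed by deep scrub
  std::optional<uint32_t> oi_digest;  // digest recorded in the object info, if any
};

// Return the first shard whose scrubbed digest is consistent with its
// object-info digest; inconsistent shards must never become pull sources.
std::optional<int> select_auth_shard(const std::map<int, ShardScrubInfo>& shards) {
  for (const auto& [osd, info] : shards) {
    if (info.oi_digest && *info.oi_digest != info.data_digest) {
      std::cout << "rejecting osd " << osd << ": data digest mismatch\n";
      continue;
    }
    std::cout << "selecting osd " << osd << "\n";
    return osd;
  }
  return std::nullopt;  // no usable copy: the object is unfound
}

int main() {
  std::map<int, ShardScrubInfo> shards = {
    {0, {0xb4356e27u, 0x2ddbf8f5u}},  // corrupted copy: digests disagree
    {2, {0x2ddbf8f5u, 0x2ddbf8f5u}},  // clean replica copy
  };
  if (auto auth = select_auth_shard(shards))
    std::cout << "authoritative shard: osd." << *auth << "\n";
}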

#7 Updated by Loïc Dachary over 9 years ago

Maybe the problem is that recovery should not try to recover an object that has already been repaired?

#8 Updated by Loïc Dachary over 9 years ago

The difference between the bad and the good log is that the object to be repaired is on the primary in the bad case and on a replica in the good case. When repairing on the primary, here is what we have:

2015-01-07 23:46:39.367556 7f1b21b0c700 10 osd.0 pg_epoch: 6 pg[0.17( v 6'14 (0'0,6'14] local-les=5 n=12 ec=1 les/c 5/6 4/4/2) [0,2] r=0 lpr=4 crt=6'12 lcod 6'13 mlcod 6'13 active+clean+scrubbing+deep+inconsistent+repair] _scrub recording digests for 190db197/benchmark_data_burnupi59_7159_object114/head//0
2015-01-07 23:46:39.367584 7f1b21b0c700 15 filestore(/var/lib/ceph/osd/ceph-0) getattr 0.17_head/190db197/benchmark_data_burnupi59_7159_object114/head//0 '_'
2015-01-07 23:46:39.367666 7f1b21b0c700 10 filestore(/var/lib/ceph/osd/ceph-0) getattr 0.17_head/190db197/benchmark_data_burnupi59_7159_object114/head//0 '_' = 274
2015-01-07 23:46:39.367703 7f1b21b0c700 15 filestore(/var/lib/ceph/osd/ceph-0) getattr 0.17_head/190db197/benchmark_data_burnupi59_7159_object114/head//0 'snapset'
2015-01-07 23:46:39.367733 7f1b21b0c700 10 filestore(/var/lib/ceph/osd/ceph-0) getattr 0.17_head/190db197/benchmark_data_burnupi59_7159_object114/head//0 'snapset' = 31

Could it be that this repair also has the side effect of removing osd.0 from the peer_missing list, which would be logical since there is nothing left to repair, and that this explains why recovery later fails with
2015-01-07 23:46:39.374910 7f1b2230d700  7 osd.0 pg_epoch: 6 pg[0.17( v 6'15 (0'0,6'15] local-les=5 n=12 ec=1 les/c 5/6 4/4/2) [0,2] r=0 lpr=4 rops=1 crt=6'13 lcod 6'14 mlcod 6'14 active+recovering m=1] pull 190db197/benchmark_data_burnupi59_7159_object114/head//0 v 6'14 on osds 0 from osd.0
2015-01-07 23:46:39.378442 7f1b2230d700 -1 osd/ReplicatedPG.cc: In function 'void ReplicatedBackend::prepare_pull(eversion_t, const hobject_t&, ObjectContextRef, ReplicatedBackend::RPGHandle*)' thread 7f1b2230d700 time 2015-01-07 23:46:39.374934
osd/ReplicatedPG.cc: 8552: FAILED assert(peer_missing.count(fromshard))

because osd.0 is no longer in the peer_missing list.
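
For illustration, a minimal reconstruction, under the hypothesis above, of the invariant the failing assert enforces (this is not the real ReplicatedBackend::prepare_pull; the types are deliberately simplified): a pull source is expected to have a peer_missing entry, so a source that was dropped from, or never had, one trips the assert.

// Illustrative reconstruction of the invariant behind the assert (simplified
// types, not the real ReplicatedBackend::prepare_pull): every shard we pull
// from is expected to have an entry in peer_missing, because the primary only
// pulls objects it does not have itself.
#include <cassert>
#include <map>
#include <set>
#include <string>

using MissingSet = std::set<std::string>;  // objects a shard is known to lack

void prepare_pull(int fromshard,
                  const std::string& oid,
                  const std::map<int, MissingSet>& peer_missing) {
  // If fromshard has no peer_missing entry (for example because it is the
  // primary, or because repair removed it), this is the observed abort.
  assert(peer_missing.count(fromshard));
  (void)oid;  // the real code would build the pull operation here
}

int main() {
  std::map<int, MissingSet> peer_missing;  // no entry for osd.0
  prepare_pull(/*fromshard=*/0, "benchmark_data_burnupi59_7159_object114",
               peer_missing);              // asserts, as in the log above
}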

#9 Updated by Loïc Dachary over 9 years ago

This is what should happen:

2015-01-15 22:55:18.006470 7fce91a44700 10 osd.1 pg_epoch: 17 pg[1.3( v 13'1 (0'0,13'1] local-les=17 n=1 ec=3 les/c 17/17 16/16/16) [1,0] r=0 lpr=16 crt=0'0 lcod 0'0 mlcod 0'0 active+clean+scrubbing+deep+repair] be_select_auth_object: selecting osd 0 for obj 847441d7/SOMETHING/head//1
2015-01-15 22:55:18.006540 7fce91a44700 10 osd.1 pg_epoch: 17 pg[1.3( v 13'1 (0'0,13'1] local-les=17 n=1 ec=3 les/c 17/17 16/16/16) [1,0] r=0 lpr=16 crt=0'0 lcod 0'0 mlcod 0'0 active+clean+scrubbing+deep+repair] be_select_auth_object: rejecting osd 1 for obj 847441d7/SOMETHING/head//1, data digest mismatch 0xb4356e27 != 0x2ddbf8f5

i.e. the osd that has the wrong content / bad digest is rejected. But this is what we see instead:
2015-01-07 23:46:39.366423 7f1b21b0c700 10 osd.0 pg_epoch: 6 pg[0.17( v 6'14 (0'0,6'14] local-les=5 n=12 ec=1 les/c 5/6 4/4/2) [0,2] r=0 lpr=4 crt=6'12 lcod 6'13 mlcod 6'13 active+clean+scrubbing+deep+inconsistent+repair] be_select_auth_object: selecting osd 0 for obj 190db197/benchmark_data_burnupi59_7159_object114/head//0
2015-01-07 23:46:39.366463 7f1b21b0c700 10 osd.0 pg_epoch: 6 pg[0.17( v 6'14 (0'0,6'14] local-les=5 n=12 ec=1 les/c 5/6 4/4/2) [0,2] r=0 lpr=4 crt=6'12 lcod 6'13 mlcod 6'13 active+clean+scrubbing+deep+inconsistent+repair] be_select_auth_object: selecting osd 2 for obj 190db197/benchmark_data_burnupi59_7159_object114/head//0

i.e. both osds are selected as authoritative, followed by:
2015-01-07 23:46:39.366508 7f1b21b0c700 20 osd.0 pg_epoch: 6 pg[0.17( v 6'14 (0'0,6'14] local-les=5 n=12 ec=1 les/c 5/6 4/4/2) [0,2] r=0 lpr=4 crt=6'12 lcod 6'13 mlcod 6'13 active+clean+scrubbing+deep+inconsistent+repair] be_compare_scrubmaps noting missing digest on 190db197/benchmark_data_burnupi59_7159_object114/head//0

#10 Updated by Loïc Dachary over 9 years ago

  • The file content changes while the OSD is running
  • deep-scrub / repair notices something is wrong but neither rejects the OSD with the problem nor adds it to peer_missing
  • the OSD with the problem is among the authoritative locations from which the object can be retrieved
  • during recovery the OSD with the problem is selected and hits the assert because it does not show up in peer_missing

#11 Updated by Loïc Dachary over 9 years ago

https://github.com/ceph/ceph/pull/3389 will provide more information about the problem the next time it happens. I unsuccessfully tried to reproduce it.

#12 Updated by Loïc Dachary over 9 years ago

  • Status changed from In Progress to Need More Info

#13 Updated by Loïc Dachary over 9 years ago

  • Status changed from Need More Info to In Progress

#14 Updated by Loïc Dachary over 9 years ago

<sjustwork> loicd: ok, so what is your current theory?
<loicd> sjustwork: that when the primary is recovering it uses a good peer that turns out to be itself although it is a bad peer, because it was not rejected as it should have been
<loicd> the problem with that theory is that I can't figure out how it happens, although it seems to be consistent with the logs
<sjustwork> /a/samuelj-2015-01-12_21:02:00-rados-wip-sam-testing-wip-testing-vanilla-fixes-basic-multi/700477/remote/ceph-osd.sorted.filtered
<sjustwork> and search backwards from the end for 
-*- loicd looking
<sjustwork> 6c48b997/benchmark_data_plana79_6637_object16/head//0
<sjustwork> from ceph-osd.0.log in that directory, that is our problem object
<sjustwork> what is the bad peer, and where did we attempt to pull from?
<loicd> the bad peer would be the primary for 6c48b997/benchmark_data_plana79_6637_object16/head
<loicd> i.e. osd.0
<sjustwork> ok, where did we attempt to pull from?
<loicd> from osd.0 itself
<loicd> benchmark_data_plana79_6637_object16/head//0 v 5'12 on osds 0 from osd.0
<loicd> on osd.0
<sjustwork> oh, but only after we failed to pull from osd 2
<sjustwork> that's interesting
<loicd> that is the primary tries to pull from itself
<loicd> is it ? 
-*- loicd looking
-*- loicd browsing /a/samuelj-2015-01-12_21:02:00-rados-wip-sam-testing-wip-testing-vanilla-fixes-basic-multi/700477/remote/ceph-osd.0.log back from the assert
<sjustwork> just a few lines up
<sjustwork> ceph-osd.sorted.filtered contains all of the lines from all osd logs
<sjustwork> sorted by time
<sjustwork> with that pg grepped out
<loicd> ah, indeed
<sjustwork> btw: 
<sjustwork> for i in ceph-osd.?.log; do grep '^2015.*' $i > $i.filtered & done; wait; sort -m *.filtered > ceph-osd.sorted
<loicd> cool
<loicd> is there a logic that tries to pull from another good peer if a previous good peer has failed ? 
<sjustwork> yes, look through ReplicatedBackend and ReplicatedPG for something like _failed_push
<sjustwork> loicd: what is peer_missing for?
<loicd> contrary to what we had before 9406b7f71f91f2f0d6825b5acbc00d7994aeeefd, there can be more than one location for a given missing_loc
<sjustwork> that is normal
<sjustwork> in normal recovery, there can be more than one location
<sjustwork> crucial for EC pools
<loicd> yes. I mean that before 9406b7f71f91f2f0d6825b5acbc00d7994aeeefd repair would not add more than one 
<loicd> I'm not sure what you're asking about peer_missing ? 
<loicd> it's going to be used when pulling from the peer that contains what the primary needs
<loicd> I'm however unclear on how / when it is updated. I tried to go over the places where it is set and unset but it's still blurry
<sjustwork> loicd: ok, first thing is that it does not contain an entry for the primary
<sjustwork> though I think the peer_missing we pass into the PGBackend has that difference abstracted out
<loicd> which makes sense to me since it is the bad peer ? 
<sjustwork> loicd: ignore scrub/repair for a moment
<loicd> ok
<sjustwork> loicd: peer_missing is core to peering and recovery in general
<sjustwork> we have an entry for every osd in the acting set and every osd we may want to recover from
<sjustwork> because
<sjustwork> for a given osd, the info + the missing set determines which objects are (should be) current on the peer
<loicd> oh, not just the ones that have information about missing / inconsistent objects! interesting
<sjustwork> that is, the osd has all object from [hobject_t::min(), info.last_backfill] *except* for the entries in its peer_missing set
<sjustwork> *all objects
<sjustwork> *in its missing set
<sjustwork> thus, for all osds which we may need to ask the question "does this osd have this object" 
<sjustwork> we need its missing set, and we need its info
<sjustwork> thus, we keep a peer_missing and a peer_info map for all acting and recovery source osds
<sjustwork> that's the background
<loicd> ok, so not having peer_missing is not normal, at all, that's a good lead
<sjustwork> crucially, we do not keep one for the primary
<loicd> ah, ahum :-)
<sjustwork> because that would just duplicate the primary state
<sjustwork> *but*
<sjustwork> we usually hide that bit from the backend
<sjustwork> so, go take a look at how peer_missing gets passed between ReplicatedPG and PGBackend
-*- loicd looks
<sjustwork> a few other things I am noticing
<sjustwork> prepare_pull() assumes that the fromshard is in peer_missing
<sjustwork> which makes sense
<sjustwork> because we do not pull objects which exist on the primary
<loicd> yes
<loicd> I don't see where peer_missing goes from replicatedpg and pgbackend
<sjustwork> loicd: PGBackend.h
<sjustwork> defines listener methods which pass that information
<sjustwork> *which the backend can use to query that information from the listener
<sjustwork> but I think that's a red herring at the moment
<sjustwork> in repair_object
<sjustwork> in the else branch
<sjustwork> nvm, I see where it is being added to the primary missing set
<loicd> ok, maybe_get_shard_missing / get_shard_missing
<sjustwork> yeah, but I don't think that's related to this bug actually
<sjustwork> ugh, now I'm trying to figure out why it failed to pull from 2
<sjustwork> you'll want to find that code path
<sjustwork> it's a bit hacky
<sjustwork> but there is  PGListener callback so that the backend can inform the PG that an object location is actually flaky
<loicd> ok
<loicd> I thought it was ReplicatedPG::failed_push but no, the message does not show in the logs
<sjustwork> I...thought it was that also, one sec
<sjustwork> yes it does
<sjustwork> 2015-01-13 01:09:41.992776 7f2d06841700  0 osd.0 pg_epoch: 5 pg[0.17( v 5'13 (0'0,5'13] local-les=5 n=10 ec=1 les/c 5/5 4/4/3) [0,2] r=0 lpr=4 rops=1 crt=5'11 lcod 5'12 mlcod 5'12 active+recovering m=1] _failed_push 6c48b997/benchmark_data_plana79_6637_object16/head//0 from shard 2, reps on 0 unfound? 0
<loicd> hum
<loicd> I was looking at another log
-*- loicd comparing http://tracker.ceph.com/attachments/download/1600/ceph-osd.0-bad.log.gz with /a/samuelj-2015-01-12_21:02:00-rados-wip-sam-testing-wip-testing-vanilla-fixes-basic-multi/700477/remote/ceph-osd.0.log around the _failed_push (that does not show in the redmine archived log)
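
A toy model of the peer_missing bookkeeping Sam describes in the transcript above, with hypothetical simplified types (not Ceph's pg_info_t / pg_missing_t): every acting-set or recovery-source osd except the primary has an info and a missing set, and "does this osd have this object" is answered from those two; the primary's own state is kept separately.

// Toy model of the bookkeeping described above (hypothetical simplified
// types, not Ceph's pg_info_t / pg_missing_t). The PG keeps peer_info and
// peer_missing for every acting-set or recovery-source osd except the
// primary; an osd has an object iff the object sorts <= last_backfill and is
// not in that osd's missing set.
#include <iostream>
#include <map>
#include <set>
#include <string>

struct PeerInfo {
  std::string last_backfill;  // stand-in for the hobject_t last_backfill bound
};

struct PG {
  int primary = 0;
  std::map<int, PeerInfo> peer_info;                  // no entry for the primary
  std::map<int, std::set<std::string>> peer_missing;  // no entry for the primary
  std::set<std::string> local_missing;                // the primary's own missing set

  bool osd_has(int osd, const std::string& oid) const {
    if (osd == primary)            // primary state is kept separately, not in the maps
      return !local_missing.count(oid);
    const PeerInfo& info = peer_info.at(osd);
    return oid <= info.last_backfill && !peer_missing.at(osd).count(oid);
  }
};

int main() {
  PG pg;
  pg.peer_info[2] = {"~"};          // "~" as a stand-in for the max bound: fully backfilled
  pg.peer_missing[2] = {};          // osd.2 is missing nothing
  pg.local_missing = {"object16"};  // the primary is missing the corrupt object
  std::cout << "osd.2 has object16: " << pg.osd_has(2, "object16") << "\n";  // 1
  std::cout << "osd.0 has object16: " << pg.osd_has(0, "object16") << "\n";  // 0
}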

#16 Updated by Samuel Just about 9 years ago

First, we probably need to not update the oi digest if we are going to repair the object.
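
A hypothetical sketch of that idea, with made-up names (this is not the actual Ceph scrub/repair code): when an object has been flagged for repair, skip recording the freshly computed digest in its object info so the corrupt copy is not later treated as consistent.

// Hypothetical sketch (made-up names, not the actual Ceph scrub code): if an
// object is queued for repair, do not record the digest computed by deep
// scrub in the object info.
#include <cstdint>
#include <optional>
#include <set>
#include <string>

struct ObjectInfo {
  std::optional<uint32_t> data_digest;
};

void maybe_record_digest(const std::string& oid,
                         uint32_t scrub_digest,
                         const std::set<std::string>& objects_to_repair,
                         ObjectInfo& oi) {
  if (objects_to_repair.count(oid))
    return;                        // object will be repaired: leave oi alone
  oi.data_digest = scrub_digest;   // otherwise record the digest as usual
}

int main() {
  ObjectInfo oi;
  std::set<std::string> to_repair = {"benchmark_data_burnupi59_7159_object114"};
  maybe_record_digest("benchmark_data_burnupi59_7159_object114", 0xb4356e27u,
                      to_repair, oi);
  return oi.data_digest.has_value();  // 0: the digest was not recorded
}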

#17 Updated by Samuel Just about 9 years ago

Yeah, the bug appears to be a race between the oi update and recover_primary.

#19 Updated by Sage Weil about 9 years ago

  • Status changed from In Progress to Resolved