Bug #4257

osd: clearing recovery state on pg removal races with applying pushes

Added by Sage Weil about 11 years ago. Updated about 11 years ago.

Status: Resolved
Priority: Urgent
Assignee: Samuel Just
Category: OSD
Target version: -
% Done: 0%
Source: Q/A
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

The synchronous on_removal() call to clear_recovery() conflicts with push transactions that are still in flight to the filestore; those completions assert later:

2013-02-24 08:20:26.828218 7f3931f91700 10 osd.3 pg_epoch: 357 pg[97.3( v 357'24 lc 355'16 (0'0,357'24] local-les=357 n=8 ec=344 les/c 357/355 356/356/356) [3,4] r=0 lpr=356 pi=354-355/1 luod=0'0 mlcod 0'0 active+recovering m=6 u=6] _applied_recovered_object obc(6648b783/44.obj/head//97)
2013-02-24 08:20:26.828255 7f3931f91700 10 osd.3 pg_epoch: 357 pg[97.3( v 357'24 lc 355'16 (0'0,357'24] local-les=357 n=8 ec=344 les/c 357/355 356/356/356) [3,4] r=0 lpr=356 pi=354-355/1 luod=0'0 mlcod 0'0 active+recovering m=6 u=6] put_object_context 0x207e500 6648b783/44.obj/head//97 2 -> 1
2013-02-24 08:20:26.828281 7f3931f91700 10 osd.3 pg_epoch: 357 pg[97.3( v 357'24 lc 355'16 (0'0,357'24] local-les=357 n=8 ec=344 les/c 357/355 356/356/356) [3,4] r=0 lpr=356 pi=354-355/1 luod=0'0 mlcod 0'0 active+recovering m=6 u=6] finish_recovery_op 6648b783/44.obj/head//97
2013-02-24 08:20:26.830310 7f3931f91700 -1 osd/PG.cc: In function 'void PG::finish_recovery_op(const hobject_t&, bool)' thread 7f3931f91700 time 2013-02-24 08:20:26.828302
osd/PG.cc: 1943: FAILED assert(recovery_ops_active > 0)

 ceph version 0.57-493-g704db85 (704db850131643b26bafe6594946cacce483c171)
 1: (PG::finish_recovery_op(hobject_t const&, bool)+0x13d) [0x68229d]
 2: (ReplicatedPG::C_OSD_CompletedPull::finish(int)+0x5b) [0x5d374b]
 3: (Context::complete(int)+0xa) [0x5cff2a]
 4: (std::tr1::_Sp_counted_base_impl<RunOnDelete*, std::tr1::_Sp_deleter<RunOnDelete>, (__gnu_cxx::_Lock_policy)2>::_M_dispose()+0x1a) [0x73d92a]
 5: (Wrapper<std::tr1::shared_ptr<RunOnDelete> >::~Wrapper()+0x72) [0x73d9e2]
 6: (Finisher::finisher_thread_entry()+0x1ce) [0x7a687e]
 7: (()+0x7e9a) [0x7f393cf13e9a]
 8: (clone()+0x6d) [0x7f393b2af4bd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

I think the solution is to queue a pg peering operation, have the normal flush/reset machinery kick in, and have that workqueue do the pg deletion.
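To make the race concrete, here is a minimal standalone sketch (plain C++ with std::thread; FakePG and its members are hypothetical stand-ins for the PG recovery bookkeeping, not Ceph code). A push completion queued behind the store fires after the removal path has synchronously cleared the recovery counters, so it sees recovery_ops_active == 0 and trips the same assert as the backtrace above:

#include <cassert>
#include <chrono>
#include <mutex>
#include <thread>

struct FakePG {
  std::mutex lock;
  int recovery_ops_active = 0;

  void start_recovery_op() {            // push submitted to the store
    std::lock_guard<std::mutex> g(lock);
    ++recovery_ops_active;
  }
  void clear_recovery_state() {         // what the sync on_removal() path does
    std::lock_guard<std::mutex> g(lock);
    recovery_ops_active = 0;
  }
  void finish_recovery_op() {           // what the in-flight push completion does
    std::lock_guard<std::mutex> g(lock);
    assert(recovery_ops_active > 0);    // FAILED assert(recovery_ops_active > 0)
    --recovery_ops_active;
  }
};

int main() {
  FakePG pg;
  pg.start_recovery_op();               // a push is now "in flight"

  // This thread stands in for the Finisher applying the push transaction.
  std::thread finisher([&] {
    std::this_thread::sleep_for(std::chrono::milliseconds(50));
    pg.finish_recovery_op();            // fires after removal: asserts
  });

  pg.clear_recovery_state();            // PG removal clears state synchronously
  finisher.join();
}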

#1

Updated by Sage Weil about 11 years ago

Actually, I think this might be a bug in the non-removal case too. We call cancel_recovery() from start_peering_interval(), in the Reset state, but the flush doesn't happen until Started, which means we can cancel pushes that are en route to disk.
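A sketch of why the ordering matters, under the same simplifying assumptions as above (RecoveryState is hypothetical, not the actual Ceph state machine): if the flush that drains in-flight pushes runs before the cancel, no completion can arrive after the bookkeeping is torn down.

#include <condition_variable>
#include <mutex>
#include <thread>

struct RecoveryState {
  std::mutex lock;
  std::condition_variable cond;
  int in_flight = 0;

  void push_submitted() { std::lock_guard<std::mutex> g(lock); ++in_flight; }
  void push_applied()   { std::lock_guard<std::mutex> g(lock); --in_flight; cond.notify_all(); }

  // Drain everything already queued to the store; only after this returns
  // is it safe to run cancel_recovery()-style teardown.
  void flush() {
    std::unique_lock<std::mutex> g(lock);
    cond.wait(g, [&] { return in_flight == 0; });
  }
};

int main() {
  RecoveryState rs;
  rs.push_submitted();                           // a push is en route to disk

  std::thread store([&] { rs.push_applied(); }); // the store applies it eventually

  rs.flush();                                    // flush *before* canceling
  // ... canceling recovery here cannot race a late completion
  store.join();
}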

#2

Updated by Sage Weil about 11 years ago

  • Status changed from 12 to 7
  • Assignee set to Sage Weil

Nevermind... just need to call start_peering_interval() to set last_peering_interval. Testing wip-4257.
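The fix seems to hinge on the peering-interval bump letting stale completions be recognized. A hypothetical sketch of such a guard (illustrative only, not the wip-4257 patch; last_peering_reset mirrors a Ceph naming convention): completions capture the interval at submit time and are dropped if the PG has since been reset, so they never touch the cleared counters.

#include <cassert>
#include <iostream>

struct FakePG {
  unsigned last_peering_reset = 1;  // bumped when a new interval starts
  int recovery_ops_active = 0;

  // Returns the interval the op was submitted in.
  unsigned start_op() { ++recovery_ops_active; return last_peering_reset; }

  void start_peering_interval() {   // reset/removal path: new interval
    ++last_peering_reset;
    recovery_ops_active = 0;
  }

  void on_push_applied(unsigned submitted_in) {
    if (submitted_in != last_peering_reset) {
      std::cout << "stale completion from interval " << submitted_in
                << ", dropped\n";   // no assert, no underflow
      return;
    }
    assert(recovery_ops_active > 0);
    --recovery_ops_active;
  }
};

int main() {
  FakePG pg;
  unsigned i = pg.start_op();       // push submitted in interval 1
  pg.start_peering_interval();      // PG reset: interval becomes 2
  pg.on_push_applied(i);            // late completion recognized and ignored
}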

#3

Updated by Sage Weil about 11 years ago

  • Assignee changed from Sage Weil to Samuel Just

#4

Updated by Sage Weil about 11 years ago

  • Status changed from 7 to Resolved