Bug #4257
Status: closed
osd: clearing recovery state on pg removal races with applying pushes
% Done: 0%
Source: Q/A
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
The synchronous on_removal() call to clear_recovery() races with push transactions that are still in flight to the filestore. When those transactions are applied, their completions assert later in PG::finish_recovery_op():
2013-02-24 08:20:26.828218 7f3931f91700 10 osd.3 pg_epoch: 357 pg[97.3( v 357'24 lc 355'16 (0'0,357'24] local-les=357 n=8 ec=344 les/c 357/355 356/356/356) [3,4] r=0 lpr=356 pi=354-355/1 luod=0'0 mlcod 0'0 active+recovering m=6 u=6] _applied_recovered_object obc(6648b783/44.obj/head//97)
2013-02-24 08:20:26.828255 7f3931f91700 10 osd.3 pg_epoch: 357 pg[97.3( v 357'24 lc 355'16 (0'0,357'24] local-les=357 n=8 ec=344 les/c 357/355 356/356/356) [3,4] r=0 lpr=356 pi=354-355/1 luod=0'0 mlcod 0'0 active+recovering m=6 u=6] put_object_context 0x207e500 6648b783/44.obj/head//97 2 -> 1
2013-02-24 08:20:26.828281 7f3931f91700 10 osd.3 pg_epoch: 357 pg[97.3( v 357'24 lc 355'16 (0'0,357'24] local-les=357 n=8 ec=344 les/c 357/355 356/356/356) [3,4] r=0 lpr=356 pi=354-355/1 luod=0'0 mlcod 0'0 active+recovering m=6 u=6] finish_recovery_op 6648b783/44.obj/head//97
2013-02-24 08:20:26.830310 7f3931f91700 -1 osd/PG.cc: In function 'void PG::finish_recovery_op(const hobject_t&, bool)' thread 7f3931f91700 time 2013-02-24 08:20:26.828302
osd/PG.cc: 1943: FAILED assert(recovery_ops_active > 0)

 ceph version 0.57-493-g704db85 (704db850131643b26bafe6594946cacce483c171)
 1: (PG::finish_recovery_op(hobject_t const&, bool)+0x13d) [0x68229d]
 2: (ReplicatedPG::C_OSD_CompletedPull::finish(int)+0x5b) [0x5d374b]
 3: (Context::complete(int)+0xa) [0x5cff2a]
 4: (std::tr1::_Sp_counted_base_impl<RunOnDelete*, std::tr1::_Sp_deleter<RunOnDelete>, (__gnu_cxx::_Lock_policy)2>::_M_dispose()+0x1a) [0x73d92a]
 5: (Wrapper<std::tr1::shared_ptr<RunOnDelete> >::~Wrapper()+0x72) [0x73d9e2]
 6: (Finisher::finisher_thread_entry()+0x1ce) [0x7a687e]
 7: (()+0x7e9a) [0x7f393cf13e9a]
 8: (clone()+0x6d) [0x7f393b2af4bd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
I think the solution is to queue a pg peering operation, have the normal flush/reset machinery kick in, and have that workqueue do the pg deletion.
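To make the ordering problem concrete, here is a toy model of the race and of the proposed ordering fix. This is only a sketch and not Ceph code: ToyPG, ToyQueue, and both removal_* functions are hypothetical names invented for illustration, the single FIFO queue stands in for whatever flush/reset machinery the real fix would use, and the failed assert is modeled as a flag so the example does not abort.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <utility>

// Hypothetical stand-in for a PG. Only the counter named in the failed
// assert (recovery_ops_active) is modeled.
struct ToyPG {
    int recovery_ops_active = 0;
    bool underflow = false;  // set where the real code would hit the assert
    bool removed = false;

    void start_recovery_op() { ++recovery_ops_active; }

    void finish_recovery_op() {
        if (recovery_ops_active <= 0) {  // real code: assert(recovery_ops_active > 0)
            underflow = true;
            return;
        }
        --recovery_ops_active;
    }

    void clear_recovery() { recovery_ops_active = 0; }
};

// Hypothetical single-threaded FIFO of deferred work. It plays the role of
// both the in-flight transaction completions and the work queue.
struct ToyQueue {
    std::deque<std::function<void()>> items;

    void push(std::function<void()> f) { items.push_back(std::move(f)); }

    void drain() {
        while (!items.empty()) {
            auto f = std::move(items.front());
            items.pop_front();
            f();
        }
    }
};

// Racy ordering: removal clears recovery state synchronously while a push
// completion is still queued, so the completion later finds the counter at 0.
bool removal_sync_hits_assert() {
    ToyPG pg;
    ToyQueue q;
    pg.start_recovery_op();
    q.push([&pg] { pg.finish_recovery_op(); });  // in-flight push completion
    pg.clear_recovery();                         // synchronous on_removal()
    pg.removed = true;
    q.drain();                                   // completion runs too late
    return pg.underflow;
}

// Ordered variant: removal is queued behind the in-flight completion, so the
// counter is decremented before the recovery state is cleared.
bool removal_queued_hits_assert() {
    ToyPG pg;
    ToyQueue q;
    pg.start_recovery_op();
    q.push([&pg] { pg.finish_recovery_op(); });
    q.push([&pg] { pg.clear_recovery(); pg.removed = true; });
    q.drain();
    return pg.underflow;
}
```

Under these assumptions, removal_sync_hits_assert() returns true and removal_queued_hits_assert() returns false. In the real OSD the completions run on a separate finisher thread, so simple queue ordering is presumably not enough on its own; that is why the suggestion above relies on the flush/reset machinery before the deletion runs.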