Bug #8123
OSD: received operation against clone which was not backfilled (but HEAD was)
Status:
Can't reproduce
Priority:
Urgent
Assignee:
-
Category:
OSD
Target version:
-
% Done:
0%
Source:
Development
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
Let's start by saying I saw this on my wip-fast-dispatch branch, so it could be a bug there rather than in master. But given the timescales involved it'd be a pretty weird one, and I can't figure out how it would have happened. It looks to me like an interaction between cache pools, snapshots, and backfilling.
From /a/gregf-2014-04-15_14:09:51-rados:thrash-wip-fast-dispatch-testing-basic-plana/193070, a selection of entries from osd.4's log:
2014-04-15 15:28:52.875829 7ff530c37700 20 merge_log 518'32 (0'0) promote 15b2515a/foo/head/test-rados-api-plana25-3664-21/85 by osd.5.0:1222 2014-04-15 15:28:25.770968 2014-04-15 15:28:52.875836 7ff530c37700 20 merge_log 518'33 (518'32) modify 15b2515a/foo/head/test-rados-api-plana25-3664-21/85 by client.4282.0:1076 2014-04-15 15:28:25.696410 2014-04-15 15:28:52.875845 7ff530c37700 20 merge_log 519'34 (518'33) clone 15b2515a/foo/a/test-rados-api-plana25-3664-21/85 by unknown.0.0:0 2014-04-15 15:28:25.696410 snaps [a] 2014-04-15 15:28:52.875853 7ff530c37700 20 merge_log 519'35 (518'33) modify 15b2515a/foo/head/test-rados-api-plana25-3664-21/85 by client.4282.0:1078 2014-04-15 15:28:26.872343 2014-04-15 15:28:52.875862 7ff530c37700 20 merge_log 520'36 (519'35) clone 15b2515a/foo/b/test-rados-api-plana25-3664-21/85 by unknown.0.0:0 2014-04-15 15:28:26.872343 snaps [b] 2014-04-15 15:28:52.875871 7ff530c37700 20 merge_log 520'37 (519'35) modify 15b2515a/foo/head/test-rados-api-plana25-3664-21/85 by client.4282.0:1080 2014-04-15 15:28:28.066434 2014-04-15 15:28:52.875898 7ff530c37700 10 merge_log result log((0'0,520'37], crt=520'37) missing(34) changed=1
2014-04-15 15:28:54.564323 7ff527ff8700 1 -- 10.214.131.14:6808/22773 <== osd.5 10.214.131.14:6807/23869 1319 ==== MOSDPGPush(85.2 529 [PushOp(15b2515a/foo/head/test-rados-api-plana25-3664-21/85, version: 520'37, data_included: [0~1], data_size: 1, omap_header_size: 0, omap_entries_size: 0, attrset_size: 2, recovery_info: ObjectRecoveryInfo(15b2515a/foo/head/test-rados-api-plana25-3664-21/85@520'37, copy_subset: [0~1], clone_subset: {}), after_progress: ObjectRecoveryProgress(!first, data_recovered_to:1, data_complete:true, omap_recovered_to:, omap_complete:true), before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:false))]) v2 ==== 1074+0+0 (1295830409 0 0) 0x36bca00 con 0x24b3000 2014-04-15 15:28:54.564405 7ff527ff8700 10 osd.4 529 handle_replica_op MOSDPGPush(85.2 529 [PushOp(15b2515a/foo/head/test-rados-api-plana25-3664-21/85, version: 520'37, data_included: [0~1], data_size: 1, omap_header_size: 0, omap_entries_size: 0, attrset_size: 2, recovery_info: ObjectRecoveryInfo(15b2515a/foo/head/test-rados-api-plana25-3664-21/85@520'37, copy_subset: [0~1], clone_subset: {}), after_progress: ObjectRecoveryProgress(!first, data_recovered_to:1, data_complete:true, omap_recovered_to:, omap_complete:true), before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:false))]) v2 epoch 529 2014-04-15 15:28:54.564418 7ff527ff8700 20 osd.4 529 should_share_map osd.5 10.214.131.14:6807/23869 529 2014-04-15 15:28:54.564425 7ff527ff8700 15 osd.4 529 enqueue_op 0x3339780 prio 63 cost 1001 latency 0.000158 MOSDPGPush(85.2 529 [PushOp(15b2515a/foo/head/test-rados-api-plana25-3664-21/85, version: 520'37, data_included: [0~1], data_size: 1, omap_header_size: 0, omap_entries_size: 0, attrset_size: 2, recovery_info: ObjectRecoveryInfo(15b2515a/foo/head/test-rados-api-plana25-3664-21/85@520'37, copy_subset: [0~1], clone_subset: {}), after_progress: ObjectRecoveryProgress(!first, data_recovered_to:1, data_complete:true, omap_recovered_to:, omap_complete:true), before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:false))]) v2 ... 2014-04-15 15:28:54.629084 7ff531438700 10 osd.4 pg_epoch: 529 pg[85.2( v 520'37 lc 0'0 (0'0,520'37] local-les=527 n=34 ec=439 les/c 527/481 526/526/444) [5,4] r=1 lpr=527 pi=477-525/1 luod=0'0 crt=520'37 active m=34] on_local_recover: 15b2515a/foo/head/test-rados-api-plana25-3664-21/85 2014-04-15 15:28:54.629101 7ff531438700 10 osd.4 pg_epoch: 529 pg[85.2( v 520'37 lc 0'0 (0'0,520'37] local-les=527 n=34 ec=439 les/c 527/481 526/526/444) [5,4] r=1 lpr=527 pi=477-525/1 luod=0'0 crt=520'37 active m=34] got missing 15b2515a/foo/head/test-rados-api-plana25-3664-21/85 v 520'37 ... 2014-04-15 15:28:57.919684 7ff53dd7f700 1 -- 10.214.131.14:6808/22773 --> 10.214.131.14:6807/23869 -- MOSDPGPushReply(85.2 529 [PushReplyOp(15b2515a/foo/head/test-rados-api-plana25-3664-21/85)]) v2 -- ?+0 0x3466000 con 0x24b3000
2014-04-15 15:28:58.002222 7ff527ff8700 1 -- 10.214.131.14:6808/22773 <== osd.5 10.214.131.14:6807/23869 1326 ==== osd_sub_op(client.4282.0:1099 85.2 15b2515a/foo/a/test-rados-api-plana25-3664-21/85 [] v 529'38 snapset=0=[]:[] snapc=0=[]) v9 ==== 1327+0+414 (2329021886 0 1387132702) 0x3d79800 con 0x24b3000 2014-04-15 15:28:58.002283 7ff527ff8700 10 osd.4 529 handle_replica_op osd_sub_op(client.4282.0:1099 85.2 15b2515a/foo/a/test-rados-api-plana25-3664-21/85 [] v 529'38 snapset=0=[]:[] snapc=0=[]) v9 epoch 529 2014-04-15 15:28:58.002294 7ff527ff8700 20 osd.4 529 should_share_map osd.5 10.214.131.14:6807/23869 529 2014-04-15 15:28:58.002301 7ff527ff8700 15 osd.4 529 enqueue_op 0x2438780 prio 127 cost 414 latency 0.000144 osd_sub_op(client.4282.0:1099 85.2 15b2515a/foo/a/test-rados-api-plana25-3664-21/85 [] v 529'38 snapset=0=[]:[] snapc=0=[]) v9 2014-04-15 15:28:58.002364 7ff531438700 10 osd.4 529 dequeue_op 0x2438780 prio 127 cost 414 latency 0.000207 osd_sub_op(client.4282.0:1099 85.2 15b2515a/foo/a/test-rados-api-plana25-3664-21/85 [] v 529'38 snapset=0=[]:[] snapc=0=[]) v9 pg pg[85.2( v 520'37 lc 470'15 (0'0,520'37] local-les=527 n=34 ec=439 les/c 527/481 526/526/444) [5,4] r=1 lpr=527 pi=477-525/1 luod=0'0 crt=520'37 lcod 470'5 active m=18] 2014-04-15 15:28:58.002423 7ff531438700 10 osd.4 pg_epoch: 529 pg[85.2( v 520'37 lc 470'15 (0'0,520'37] local-les=527 n=34 ec=439 les/c 527/481 526/526/444) [5,4] r=1 lpr=527 pi=477-525/1 luod=0'0 crt=520'37 lcod 470'5 active m=18] handle_message: 0x2438780 2014-04-15 15:28:58.002448 7ff531438700 10 osd.4 pg_epoch: 529 pg[85.2( v 520'37 lc 470'15 (0'0,520'37] local-les=527 n=34 ec=439 les/c 527/481 526/526/444) [5,4] r=1 lpr=527 pi=477-525/1 luod=0'0 crt=520'37 lcod 470'5 active m=18] sub_op_modify trans 15b2515a/foo/a/test-rados-api-plana25-3664-21/85 v 529'38 (transaction) 181 ... 2014-04-15 15:28:58.017139 7ff531438700 -1 osd/ReplicatedPG.cc: In function 'void ReplicatedBackend::sub_op_modify(OpRequestRef)' thread 7ff531438700 time 2014-04-15 15:28:58.002483 osd/ReplicatedPG.cc: 7283: FAILED assert(!parent->get_log().get_missing().is_missing(soid)) ceph version 0.79-261-ga5bbaf0 (a5bbaf079c7a7d2bb18a96c112dcd7b3a6e2c1da) 1: (ReplicatedBackend::sub_op_modify(std::tr1::shared_ptr<OpRequest>)+0x11f5) [0x7e10d5] 2: (ReplicatedBackend::handle_message(std::tr1::shared_ptr<OpRequest>)+0x55c) [0x9116ec] 3: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x1ee) [0x7c2d4e] 4: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x1ca) [0x64fe1a] 5: (OSD::OpWQ::_process(boost::intrusive_ptr<PG>, ThreadPool::TPHandle&)+0x5c8) [0x650858] 6: (ThreadPool::WorkQueueVal<std::pair<boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest> >, boost::intrusive_ptr<PG> >::_void_process(void*, ThreadPool::TPHandle&)+0x9c) [0x67fe8c] 7: (ThreadPool::worker(ThreadPool::WorkThread*)+0x4e6) [0xa5c516] 8: (ThreadPool::WorkThread::entry()+0x10) [0xa5e320] 9: (()+0x7e9a) [0x7ff548ce4e9a] 10: (clone()+0x6d) [0x7ff5472a53fd] NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
From osd.5, the context around the subop message send is:
2014-04-15 15:28:58.001739 7f60afaff700 10 osd.5 pg_epoch: 529 pg[85.2( v 520'37 (0'0,520'37] local-les=527 n=34 ec=439 les/c 527/481 526/526/444) [5,4] r=0 lpr=526 pi=477-525/1 rops=10 crt=0'0 lcod 519'35 mlcod 0'0 active+recovering] finish_flush 15b2515a/foo/a/test-rados-api-plana25-3664-21/85 tid 175 (0) Success 2014-04-15 15:28:58.001784 7f60afaff700 20 osd.5 pg_epoch: 529 pg[85.2( v 520'37 (0'0,520'37] local-les=527 n=34 ec=439 les/c 527/481 526/526/444) [5,4] r=0 lpr=526 pi=477-525/1 rops=10 crt=0'0 lcod 519'35 mlcod 0'0 active+recovering] try_flush_mark_clean already holding write lock: rwstate(write n=1 w=0) 2014-04-15 15:28:58.001800 7f60afaff700 10 osd.5 pg_epoch: 529 pg[85.2( v 520'37 (0'0,520'37] local-les=527 n=34 ec=439 les/c 527/481 526/526/444) [5,4] r=0 lpr=526 pi=477-525/1 rops=10 crt=0'0 lcod 519'35 mlcod 0'0 active+recovering] try_flush_mark_clean clearing DIRTY flag for 15b2515a/foo/a/test-rados-api-plana25-3664-21/85 2014-04-15 15:28:58.001819 7f60afaff700 10 osd.5 pg_epoch: 529 pg[85.2( v 520'37 (0'0,520'37] local-les=527 n=34 ec=439 les/c 527/481 526/526/444) [5,4] r=0 lpr=526 pi=477-525/1 rops=10 crt=0'0 lcod 519'35 mlcod 0'0 active+recovering] new_repop rep_tid 1226 on osd_op(client.4282.0:1099 test-rados-api-plana25-3664-21/foo@a [cache-flush] 85.15b2515a ack+read+ignore_cache e529) v4 2014-04-15 15:28:58.001842 7f60afaff700 20 osd.5 pg_epoch: 529 pg[85.2( v 520'37 (0'0,520'37] local-les=527 n=34 ec=439 les/c 527/481 526/526/444) [5,4] r=0 lpr=526 pi=477-525/1 rops=10 crt=0'0 lcod 519'35 mlcod 0'0 active+recovering] finish_ctx 15b2515a/foo/a/test-rados-api-plana25-3664-21/85 0x3dfe680 op clean 2014-04-15 15:28:58.001860 7f60afaff700 10 osd.5 pg_epoch: 529 pg[85.2( v 520'37 (0'0,520'37] local-les=527 n=34 ec=439 les/c 527/481 526/526/444) [5,4] r=0 lpr=526 pi=477-525/1 rops=10 crt=0'0 lcod 519'35 mlcod 0'0 active+recovering] mtime unchanged at 2014-04-15 15:28:25.696410 2014-04-15 15:28:58.001886 7f60afaff700 10 osd.5 pg_epoch: 529 pg[85.2( v 520'37 (0'0,520'37] local-les=527 n=34 ec=439 les/c 527/481 526/526/444) [5,4] r=0 lpr=526 pi=477-525/1 rops=10 crt=0'0 lcod 519'35 mlcod 0'0 active+recovering] no snapset (this is a clone) 2014-04-15 15:28:58.001902 7f60afaff700 20 osd.5 pg_epoch: 529 pg[85.2( v 520'37 (0'0,520'37] local-les=527 n=34 ec=439 les/c 527/481 526/526/444) [5,4] r=0 lpr=526 pi=477-525/1 rops=10 crt=0'0 lcod 519'35 mlcod 0'0 active+recovering] finish_ctx encoding snaps [a] 2014-04-15 15:28:58.001918 7f60afaff700 20 osd.5 pg_epoch: 529 pg[85.2( v 520'37 (0'0,520'37] local-les=527 n=34 ec=439 les/c 527/481 526/526/444) [5,4] r=0 lpr=526 pi=477-525/1 rops=10 crt=0'0 lcod 519'35 mlcod 0'0 active+recovering] simple_repop_submit 0x3eb9e40 2014-04-15 15:28:58.001930 7f60afaff700 7 osd.5 pg_epoch: 529 pg[85.2( v 520'37 (0'0,520'37] local-les=527 n=34 ec=439 les/c 527/481 526/526/444) [5,4] r=0 lpr=526 pi=477-525/1 rops=10 crt=0'0 lcod 519'35 mlcod 0'0 active+recovering] issue_repop rep_tid 1226 o 15b2515a/foo/a/test-rados-api-plana25-3664-21/85 2014-04-15 15:28:58.001976 7f60afaff700 20 osd.5 529 _share_map_outgoing 0x2bf89a0 already has epoch 529 2014-04-15 15:28:58.001980 7f60afaff700 1 -- 10.214.131.14:6807/23869 --> osd.4 10.214.131.14:6808/22773 -- osd_sub_op(client.4282.0:1099 85.2 15b2515a/foo/a/test-rados-api-plana25-3664-21/85 [] v 529'38 snapset=0=[]:[] snapc=0=[]) v9 -- ?+414 0x4188c00 2014-04-15 15:28:58.002002 7f60afaff700 10 osd.5 pg_epoch: 529 pg[85.2( v 520'37 (0'0,520'37] local-les=527 n=34 ec=439 les/c 527/481 526/526/444) [5,4] r=0 lpr=526 pi=477-525/1 rops=10 crt=0'0 lcod 519'35 mlcod 0'0 active+recovering] append_log log((0'0,520'37], crt=0'0) [529'38 (519'34) clean 15b2515a/foo/a/test-rados-api-plana25-3664-21/85 by client.4282.0:1099 0.000000 snaps [a]] 2014-04-15 15:28:58.002027 7f60afaff700 10 osd.5 pg_epoch: 529 pg[85.2( v 529'38 (0'0,529'38] local-les=527 n=34 ec=439 les/c 527/481 526/526/444) [5,4] r=0 lpr=526 pi=477-525/1 luod=520'37 rops=10 crt=0'0 lcod 519'35 mlcod 0'0 active+recovering] add_log_entry 529'38 (519'34) clean 15b2515a/foo/a/test-rados-api-plana25-3664-21/85 by client.4282.0:1099 0.000000 snaps [a]
History
#1 Updated by Samuel Just almost 10 years ago
- Priority changed from High to Urgent
#2 Updated by Samuel Just almost 10 years ago
- Status changed from New to Can't reproduce