Project

General

Profile

Bug #8123

OSD: received operation against clone which was not backfilled (but HEAD was)

Added by Greg Farnum almost 10 years ago. Updated almost 10 years ago.

Status:
Can't reproduce
Priority:
Urgent
Assignee:
-
Category:
OSD
Target version:
-
% Done:

0%

Source:
Development
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Let's start by saying I saw this on my wip-fast-dispatch branch, so it could be a bug there rather than in master. But given the timescales involved it'd be a pretty weird one, and I can't figure out how it would have happened. It looks to me like an interaction between cache pools, snapshots, and backfilling.

From /a/gregf-2014-04-15_14:09:51-rados:thrash-wip-fast-dispatch-testing-basic-plana/193070, a selection of entries from osd.4's log:

2014-04-15 15:28:52.875829 7ff530c37700 20 merge_log 518'32 (0'0) promote  15b2515a/foo/head/test-rados-api-plana25-3664-21/85 by osd.5.0:1222 2014-04-15 15:28:25.770968
2014-04-15 15:28:52.875836 7ff530c37700 20 merge_log 518'33 (518'32) modify   15b2515a/foo/head/test-rados-api-plana25-3664-21/85 by client.4282.0:1076 2014-04-15 15:28:25.696410
2014-04-15 15:28:52.875845 7ff530c37700 20 merge_log 519'34 (518'33) clone    15b2515a/foo/a/test-rados-api-plana25-3664-21/85 by unknown.0.0:0 2014-04-15 15:28:25.696410 snaps [a]
2014-04-15 15:28:52.875853 7ff530c37700 20 merge_log 519'35 (518'33) modify   15b2515a/foo/head/test-rados-api-plana25-3664-21/85 by client.4282.0:1078 2014-04-15 15:28:26.872343
2014-04-15 15:28:52.875862 7ff530c37700 20 merge_log 520'36 (519'35) clone    15b2515a/foo/b/test-rados-api-plana25-3664-21/85 by unknown.0.0:0 2014-04-15 15:28:26.872343 snaps [b]
2014-04-15 15:28:52.875871 7ff530c37700 20 merge_log 520'37 (519'35) modify   15b2515a/foo/head/test-rados-api-plana25-3664-21/85 by client.4282.0:1080 2014-04-15 15:28:28.066434
2014-04-15 15:28:52.875898 7ff530c37700 10 merge_log result log((0'0,520'37], crt=520'37) missing(34) changed=1
2014-04-15 15:28:54.564323 7ff527ff8700  1 -- 10.214.131.14:6808/22773 <== osd.5 10.214.131.14:6807/23869 1319 ==== MOSDPGPush(85.2 529 [PushOp(15b2515a/foo/head/test-rados-api-plana25-3664-21/85, version: 520'37, data_included: [0~1], data_size: 1, omap_header_size: 0, omap_entries_size: 0, attrset_size: 2, recovery_info: ObjectRecoveryInfo(15b2515a/foo/head/test-rados-api-plana25-3664-21/85@520'37, copy_subset: [0~1], clone_subset: {}), after_progress: ObjectRecoveryProgress(!first, data_recovered_to:1, data_complete:true, omap_recovered_to:, omap_complete:true), before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:false))]) v2 ==== 1074+0+0 (1295830409 0 0) 0x36bca00 con 0x24b3000
2014-04-15 15:28:54.564405 7ff527ff8700 10 osd.4 529 handle_replica_op MOSDPGPush(85.2 529 [PushOp(15b2515a/foo/head/test-rados-api-plana25-3664-21/85, version: 520'37, data_included: [0~1], data_size: 1, omap_header_size: 0, omap_entries_size: 0, attrset_size: 2, recovery_info: ObjectRecoveryInfo(15b2515a/foo/head/test-rados-api-plana25-3664-21/85@520'37, copy_subset: [0~1], clone_subset: {}), after_progress: ObjectRecoveryProgress(!first, data_recovered_to:1, data_complete:true, omap_recovered_to:, omap_complete:true), before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:false))]) v2 epoch 529
2014-04-15 15:28:54.564418 7ff527ff8700 20 osd.4 529 should_share_map osd.5 10.214.131.14:6807/23869 529
2014-04-15 15:28:54.564425 7ff527ff8700 15 osd.4 529 enqueue_op 0x3339780 prio 63 cost 1001 latency 0.000158 MOSDPGPush(85.2 529 [PushOp(15b2515a/foo/head/test-rados-api-plana25-3664-21/85, version: 520'37, data_included: [0~1], data_size: 1, omap_header_size: 0, omap_entries_size: 0, attrset_size: 2, recovery_info: ObjectRecoveryInfo(15b2515a/foo/head/test-rados-api-plana25-3664-21/85@520'37, copy_subset: [0~1], clone_subset: {}), after_progress: ObjectRecoveryProgress(!first, data_recovered_to:1, data_complete:true, omap_recovered_to:, omap_complete:true), before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:false))]) v2
...
2014-04-15 15:28:54.629084 7ff531438700 10 osd.4 pg_epoch: 529 pg[85.2( v 520'37 lc 0'0 (0'0,520'37] local-les=527 n=34 ec=439 les/c 527/481 526/526/444) [5,4] r=1 lpr=527 pi=477-525/1 luod=0'0 crt=520'37 active m=34] on_local_recover: 15b2515a/foo/head/test-rados-api-plana25-3664-21/85
2014-04-15 15:28:54.629101 7ff531438700 10 osd.4 pg_epoch: 529 pg[85.2( v 520'37 lc 0'0 (0'0,520'37] local-les=527 n=34 ec=439 les/c 527/481 526/526/444) [5,4] r=1 lpr=527 pi=477-525/1 luod=0'0 crt=520'37 active m=34] got missing 15b2515a/foo/head/test-rados-api-plana25-3664-21/85 v 520'37
...
2014-04-15 15:28:57.919684 7ff53dd7f700  1 -- 10.214.131.14:6808/22773 --> 10.214.131.14:6807/23869 -- MOSDPGPushReply(85.2 529 [PushReplyOp(15b2515a/foo/head/test-rados-api-plana25-3664-21/85)]) v2 -- ?+0 0x3466000 con 0x24b3000
2014-04-15 15:28:58.002222 7ff527ff8700  1 -- 10.214.131.14:6808/22773 <== osd.5 10.214.131.14:6807/23869 1326 ==== osd_sub_op(client.4282.0:1099 85.2 15b2515a/foo/a/test-rados-api-plana25-3664-21/85 [] v 529'38 snapset=0=[]:[] snapc=0=[]) v9 ==== 1327+0+414 (2329021886 0 1387132702) 0x3d79800 con 0x24b3000
2014-04-15 15:28:58.002283 7ff527ff8700 10 osd.4 529 handle_replica_op osd_sub_op(client.4282.0:1099 85.2 15b2515a/foo/a/test-rados-api-plana25-3664-21/85 [] v 529'38 snapset=0=[]:[] snapc=0=[]) v9 epoch 529
2014-04-15 15:28:58.002294 7ff527ff8700 20 osd.4 529 should_share_map osd.5 10.214.131.14:6807/23869 529
2014-04-15 15:28:58.002301 7ff527ff8700 15 osd.4 529 enqueue_op 0x2438780 prio 127 cost 414 latency 0.000144 osd_sub_op(client.4282.0:1099 85.2 15b2515a/foo/a/test-rados-api-plana25-3664-21/85 [] v 529'38 snapset=0=[]:[] snapc=0=[]) v9
2014-04-15 15:28:58.002364 7ff531438700 10 osd.4 529 dequeue_op 0x2438780 prio 127 cost 414 latency 0.000207 osd_sub_op(client.4282.0:1099 85.2 15b2515a/foo/a/test-rados-api-plana25-3664-21/85 [] v 529'38 snapset=0=[]:[] snapc=0=[]) v9 pg pg[85.2( v 520'37 lc 470'15 (0'0,520'37] local-les=527 n=34 ec=439 les/c 527/481 526/526/444) [5,4] r=1 lpr=527 pi=477-525/1 luod=0'0 crt=520'37 lcod 470'5 active m=18]
2014-04-15 15:28:58.002423 7ff531438700 10 osd.4 pg_epoch: 529 pg[85.2( v 520'37 lc 470'15 (0'0,520'37] local-les=527 n=34 ec=439 les/c 527/481 526/526/444) [5,4] r=1 lpr=527 pi=477-525/1 luod=0'0 crt=520'37 lcod 470'5 active m=18] handle_message: 0x2438780
2014-04-15 15:28:58.002448 7ff531438700 10 osd.4 pg_epoch: 529 pg[85.2( v 520'37 lc 470'15 (0'0,520'37] local-les=527 n=34 ec=439 les/c 527/481 526/526/444) [5,4] r=1 lpr=527 pi=477-525/1 luod=0'0 crt=520'37 lcod 470'5 active m=18] sub_op_modify trans 15b2515a/foo/a/test-rados-api-plana25-3664-21/85 v 529'38 (transaction) 181
...
2014-04-15 15:28:58.017139 7ff531438700 -1 osd/ReplicatedPG.cc: In function 'void ReplicatedBackend::sub_op_modify(OpRequestRef)' thread 7ff531438700 time 2014-04-15 15:28:58.002483
osd/ReplicatedPG.cc: 7283: FAILED assert(!parent->get_log().get_missing().is_missing(soid))

 ceph version 0.79-261-ga5bbaf0 (a5bbaf079c7a7d2bb18a96c112dcd7b3a6e2c1da)
 1: (ReplicatedBackend::sub_op_modify(std::tr1::shared_ptr<OpRequest>)+0x11f5) [0x7e10d5]
 2: (ReplicatedBackend::handle_message(std::tr1::shared_ptr<OpRequest>)+0x55c) [0x9116ec]
 3: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x1ee) [0x7c2d4e]
 4: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x1ca) [0x64fe1a]
 5: (OSD::OpWQ::_process(boost::intrusive_ptr<PG>, ThreadPool::TPHandle&)+0x5c8) [0x650858]
 6: (ThreadPool::WorkQueueVal<std::pair<boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest> >, boost::intrusive_ptr<PG> >::_void_process(void*, ThreadPool::TPHandle&)+0x9c) [0x67fe8c]
 7: (ThreadPool::worker(ThreadPool::WorkThread*)+0x4e6) [0xa5c516]
 8: (ThreadPool::WorkThread::entry()+0x10) [0xa5e320]
 9: (()+0x7e9a) [0x7ff548ce4e9a]
 10: (clone()+0x6d) [0x7ff5472a53fd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

From osd.5, the context around the subop message send is:

2014-04-15 15:28:58.001739 7f60afaff700 10 osd.5 pg_epoch: 529 pg[85.2( v 520'37 (0'0,520'37] local-les=527 n=34 ec=439 les/c 527/481 526/526/444) [5,4] r=0 lpr=526 pi=477-525/1 rops=10 crt=0'0 lcod 519'35 mlcod 0'0 active+recovering] finish_flush 15b2515a/foo/a/test-rados-api-plana25-3664-21/85 tid 175 (0) Success
2014-04-15 15:28:58.001784 7f60afaff700 20 osd.5 pg_epoch: 529 pg[85.2( v 520'37 (0'0,520'37] local-les=527 n=34 ec=439 les/c 527/481 526/526/444) [5,4] r=0 lpr=526 pi=477-525/1 rops=10 crt=0'0 lcod 519'35 mlcod 0'0 active+recovering] try_flush_mark_clean already holding write lock: rwstate(write n=1 w=0)
2014-04-15 15:28:58.001800 7f60afaff700 10 osd.5 pg_epoch: 529 pg[85.2( v 520'37 (0'0,520'37] local-les=527 n=34 ec=439 les/c 527/481 526/526/444) [5,4] r=0 lpr=526 pi=477-525/1 rops=10 crt=0'0 lcod 519'35 mlcod 0'0 active+recovering] try_flush_mark_clean clearing DIRTY flag for 15b2515a/foo/a/test-rados-api-plana25-3664-21/85
2014-04-15 15:28:58.001819 7f60afaff700 10 osd.5 pg_epoch: 529 pg[85.2( v 520'37 (0'0,520'37] local-les=527 n=34 ec=439 les/c 527/481 526/526/444) [5,4] r=0 lpr=526 pi=477-525/1 rops=10 crt=0'0 lcod 519'35 mlcod 0'0 active+recovering] new_repop rep_tid 1226 on osd_op(client.4282.0:1099 test-rados-api-plana25-3664-21/foo@a [cache-flush] 85.15b2515a ack+read+ignore_cache e529) v4
2014-04-15 15:28:58.001842 7f60afaff700 20 osd.5 pg_epoch: 529 pg[85.2( v 520'37 (0'0,520'37] local-les=527 n=34 ec=439 les/c 527/481 526/526/444) [5,4] r=0 lpr=526 pi=477-525/1 rops=10 crt=0'0 lcod 519'35 mlcod 0'0 active+recovering] finish_ctx 15b2515a/foo/a/test-rados-api-plana25-3664-21/85 0x3dfe680 op clean   
2014-04-15 15:28:58.001860 7f60afaff700 10 osd.5 pg_epoch: 529 pg[85.2( v 520'37 (0'0,520'37] local-les=527 n=34 ec=439 les/c 527/481 526/526/444) [5,4] r=0 lpr=526 pi=477-525/1 rops=10 crt=0'0 lcod 519'35 mlcod 0'0 active+recovering]  mtime unchanged at 2014-04-15 15:28:25.696410
2014-04-15 15:28:58.001886 7f60afaff700 10 osd.5 pg_epoch: 529 pg[85.2( v 520'37 (0'0,520'37] local-les=527 n=34 ec=439 les/c 527/481 526/526/444) [5,4] r=0 lpr=526 pi=477-525/1 rops=10 crt=0'0 lcod 519'35 mlcod 0'0 active+recovering]  no snapset (this is a clone)
2014-04-15 15:28:58.001902 7f60afaff700 20 osd.5 pg_epoch: 529 pg[85.2( v 520'37 (0'0,520'37] local-les=527 n=34 ec=439 les/c 527/481 526/526/444) [5,4] r=0 lpr=526 pi=477-525/1 rops=10 crt=0'0 lcod 519'35 mlcod 0'0 active+recovering] finish_ctx encoding snaps [a]
2014-04-15 15:28:58.001918 7f60afaff700 20 osd.5 pg_epoch: 529 pg[85.2( v 520'37 (0'0,520'37] local-les=527 n=34 ec=439 les/c 527/481 526/526/444) [5,4] r=0 lpr=526 pi=477-525/1 rops=10 crt=0'0 lcod 519'35 mlcod 0'0 active+recovering] simple_repop_submit 0x3eb9e40
2014-04-15 15:28:58.001930 7f60afaff700  7 osd.5 pg_epoch: 529 pg[85.2( v 520'37 (0'0,520'37] local-les=527 n=34 ec=439 les/c 527/481 526/526/444) [5,4] r=0 lpr=526 pi=477-525/1 rops=10 crt=0'0 lcod 519'35 mlcod 0'0 active+recovering] issue_repop rep_tid 1226 o 15b2515a/foo/a/test-rados-api-plana25-3664-21/85
2014-04-15 15:28:58.001976 7f60afaff700 20 osd.5 529 _share_map_outgoing 0x2bf89a0 already has epoch 529
2014-04-15 15:28:58.001980 7f60afaff700  1 -- 10.214.131.14:6807/23869 --> osd.4 10.214.131.14:6808/22773 -- osd_sub_op(client.4282.0:1099 85.2 15b2515a/foo/a/test-rados-api-plana25-3664-21/85 [] v 529'38 snapset=0=[]:[] snapc=0=[]) v9 -- ?+414 0x4188c00
2014-04-15 15:28:58.002002 7f60afaff700 10 osd.5 pg_epoch: 529 pg[85.2( v 520'37 (0'0,520'37] local-les=527 n=34 ec=439 les/c 527/481 526/526/444) [5,4] r=0 lpr=526 pi=477-525/1 rops=10 crt=0'0 lcod 519'35 mlcod 0'0 active+recovering] append_log log((0'0,520'37], crt=0'0) [529'38 (519'34) clean    15b2515a/foo/a/test-rados-api-plana25-3664-21/85 by client.4282.0:1099 0.000000 snaps [a]]
2014-04-15 15:28:58.002027 7f60afaff700 10 osd.5 pg_epoch: 529 pg[85.2( v 529'38 (0'0,529'38] local-les=527 n=34 ec=439 les/c 527/481 526/526/444) [5,4] r=0 lpr=526 pi=477-525/1 luod=520'37 rops=10 crt=0'0 lcod 519'35 mlcod 0'0 active+recovering] add_log_entry 529'38 (519'34) clean    15b2515a/foo/a/test-rados-api-plana25-3664-21/85 by client.4282.0:1099 0.000000 snaps [a]

History

#1 Updated by Samuel Just almost 10 years ago

  • Priority changed from High to Urgent

#2 Updated by Samuel Just almost 10 years ago

  • Status changed from New to Can't reproduce

Also available in: Atom PDF