Project

General

Profile

Bug #40721

backfill caught in loop from block

Added by Sage Weil over 4 years ago. Updated over 3 years ago.

Status:
Can't reproduce
Priority:
High
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

ing mbc={0={},1={},2={}} trimq=[2d7~1,2da~1,2dc~1,2e1~2,2e4~3,2e8~1,2ea~1,2ed~2,2f1~2,2f4~2,2f7~1,2f9~2,2fd~2,301~1,307~2,30b~2]] recover_replicas(10)
ing mbc={0={},1={},2={}} trimq=[2d7~1,2da~1,2dc~1,2e1~2,2e4~3,2e8~1,2ea~1,2ed~2,2f1~2,2f4~2,2f7~1,2f9~2,2fd~2,301~1,307~2,30b~2]] recover_primary recovering 0 in pg, missing missing(0 may_include_deletes = 1)
ing mbc={0={},1={},2={}} trimq=[2d7~1,2da~1,2dc~1,2e1~2,2e4~3,2e8~1,2ea~1,2ed~2,2f1~2,2f4~2,2f7~1,2f9~2,2fd~2,301~1,307~2,30b~2]] recover_primary {}
ing mbc={0={},1={},2={}} trimq=[2d7~1,2da~1,2dc~1,2e1~2,2e4~3,2e8~1,2ea~1,2ed~2,2f1~2,2f4~2,2f7~1,2f9~2,2fd~2,301~1,307~2,30b~2]] recover_backfill (10) bft=7(2) last_backfill_started 2:f7b829c8:::smithi16613012-16:2db
ing mbc={0={},1={},2={}} trimq=[2d7~1,2da~1,2dc~1,2e1~2,2e4~3,2e8~1,2ea~1,2ed~2,2f1~2,2f4~2,2f7~1,2f9~2,2fd~2,301~1,307~2,30b~2]] peer osd.7(2) info 2.fs2( v 874'1021 lc 856'1013 (464'631,874'1021] lb 2:f7b829c8:::smithi16613012-16:2db local-lis/les=833/834 n=6 ec=20/20 lis/c 833/680 les/c/f 834/681/0 833/833/619) 
ing mbc={0={},1={},2={}} trimq=[2d7~1,2da~1,2dc~1,2e1~2,2e4~3,2e8~1,2ea~1,2ed~2,2f1~2,2f4~2,2f7~1,2f9~2,2fd~2,301~1,307~2,30b~2]] update_range: bi is current 
ing mbc={0={},1={},2={}} trimq=[2d7~1,2da~1,2dc~1,2e1~2,2e4~3,2e8~1,2ea~1,2ed~2,2f1~2,2f4~2,2f7~1,2f9~2,2fd~2,301~1,307~2,30b~2]]    my backfill interval BackfillInfo(2:f7b829c8:::smithi16613012-16:head-MAX 1 objects {2:f7b829c8:::smithi16613012-16:head=824'1009})
ing mbc={0={},1={},2={}} trimq=[2d7~1,2da~1,2dc~1,2e1~2,2e4~3,2e8~1,2ea~1,2ed~2,2f1~2,2f4~2,2f7~1,2f9~2,2fd~2,301~1,307~2,30b~2]]  peer shard 7(2) backfill BackfillInfo(MAX-MAX 0 objects)
ing mbc={0={},1={},2={}} trimq=[2d7~1,2da~1,2dc~1,2e1~2,2e4~3,2e8~1,2ea~1,2ed~2,2f1~2,2f4~2,2f7~1,2f9~2,2fd~2,301~1,307~2,30b~2]] get_object_context: found obc in cache: 0x55b491e44940
ing mbc={0={},1={},2={}} trimq=[2d7~1,2da~1,2dc~1,2e1~2,2e4~3,2e8~1,2ea~1,2ed~2,2f1~2,2f4~2,2f7~1,2f9~2,2fd~2,301~1,307~2,30b~2]] get_object_context: 0x55b491e44940 2:f7b829c8:::smithi16613012-16:head rwstate(write n=1 w=0) oi: 2:f7b829c8:::smithi16613012-16:head(824'1009 client.4308.0:8771 dirty|data_digest s 1386
ing mbc={0={},1={},2={}} trimq=[2d7~1,2da~1,2dc~1,2e1~2,2e4~3,2e8~1,2ea~1,2ed~2,2f1~2,2f4~2,2f7~1,2f9~2,2fd~2,301~1,307~2,30b~2]] backfill blocking on 2:f7b829c8:::smithi16613012-16:head; could not get rw_manager lock
ing mbc={0={},1={},2={}} trimq=[2d7~1,2da~1,2dc~1,2e1~2,2e4~3,2e8~1,2ea~1,2ed~2,2f1~2,2f4~2,2f7~1,2f9~2,2fd~2,301~1,307~2,30b~2]] backfill_pos is 2:f7b829c8:::smithi16613012-16:head
ing mbc={0={},1={},2={}} trimq=[2d7~1,2da~1,2dc~1,2e1~2,2e4~3,2e8~1,2ea~1,2ed~2,2f1~2,2f4~2,2f7~1,2f9~2,2fd~2,301~1,307~2,30b~2]] starting new_last_backfill at 2:f7b829c8:::smithi16613012-16:2db
ing mbc={0={},1={},2={}} trimq=[2d7~1,2da~1,2dc~1,2e1~2,2e4~3,2e8~1,2ea~1,2ed~2,2f1~2,2f4~2,2f7~1,2f9~2,2fd~2,301~1,307~2,30b~2]] possible new_last_backfill at 2:f7b829c8:::smithi16613012-16:2db
ing mbc={0={},1={},2={}} trimq=[2d7~1,2da~1,2dc~1,2e1~2,2e4~3,2e8~1,2ea~1,2ed~2,2f1~2,2f4~2,2f7~1,2f9~2,2fd~2,301~1,307~2,30b~2]] final new_last_backfill at 2:f7b829c8:::smithi16613012-16:2db
ing mbc={0={},1={},2={}} trimq=[2d7~1,2da~1,2dc~1,2e1~2,2e4~3,2e8~1,2ea~1,2ed~2,2f1~2,2f4~2,2f7~1,2f9~2,2fd~2,301~1,307~2,30b~2]]  started 0
...
~1,2da~1,2dc~1,2e1~2,2e4~3,2e8~1,2ea~1,2ed~2,2f1~2,2f4~2,2f7~1,2f9~2,2fd~2,301~1,307~2,30b~2]] recover_replicas(10)
~1,2da~1,2dc~1,2e1~2,2e4~3,2e8~1,2ea~1,2ed~2,2f1~2,2f4~2,2f7~1,2f9~2,2fd~2,301~1,307~2,30b~2]] recover_primary recovering 0 in pg, missing missing(0 may_include_deletes = 1)
~1,2da~1,2dc~1,2e1~2,2e4~3,2e8~1,2ea~1,2ed~2,2f1~2,2f4~2,2f7~1,2f9~2,2fd~2,301~1,307~2,30b~2]] recover_primary {}
~1,2da~1,2dc~1,2e1~2,2e4~3,2e8~1,2ea~1,2ed~2,2f1~2,2f4~2,2f7~1,2f9~2,2fd~2,301~1,307~2,30b~2]] recover_backfill (10) bft=7(2) last_backfill_started 2:f7b829c8:::smithi16613012-16:2db

and it loops.
/a/sage-2019-07-10_01:52:27-rados-wip-sage-testing-2019-07-09-1801-distro-basic-smithi/4107563 osd.4 pg 2.f

History

#1 Updated by Sage Weil over 4 years ago

actually, this retry is triggered on every osdmap.

#2 Updated by Sage Weil over 4 years ago

original blocked request is

2d7~1,2da~1,2dc~1,2e1~1] ps=36] get_object_context: creating obc from disk: 0x55b491e44940
2d7~1,2da~1,2dc~1,2e1~1] ps=36] get_object_context: 0x55b491e44940 2:f7b829c8:::smithi16613012-16:head rwstate(none n=0 w=0) oi: 2:f7b829c8:::smithi16613012-16:head(824'1009 client.4308.0:8771 dirty|data_digest s 13869056 uv 1220 dd 97d4a27b alloc_hint [0 0 0]) exists: 1 ssc: 0x55b492ade2c0 snapset: 2db=[]:{2db=[2d
2d7~1,2da~1,2dc~1,2e1~1] ps=36] find_object_context 2:f7b829c8:::smithi16613012-16:head @head oi=2:f7b829c8:::smithi16613012-16:head(824'1009 client.4308.0:8771 dirty|data_digest s 13869056 uv 1220 dd 97d4a27b alloc_hint [0 0 0])
2d7~1,2da~1,2dc~1,2e1~1] ps=36] do_op oi 2:f7b829c8:::smithi16613012-16:head(824'1009 client.4308.0:8771 dirty|data_digest s 13869056 uv 1220 dd 97d4a27b alloc_hint [0 0 0])
2d7~1,2da~1,2dc~1,2e1~1] ps=36] do_op obc obc(2:f7b829c8:::smithi16613012-16:head rwstate(write n=1 w=0))
2d7~1,2da~1,2dc~1,2e1~1] ps=36] execute_ctx 0x55b493246900
2d7~1,2da~1,2dc~1,2e1~1] ps=36] execute_ctx 2:f7b829c8:::smithi16613012-16:head [copy-from ver 1207] ov 824'1009 av 834'1012 snapc 2e2=[2e2] snapset 2db=[]:{2db=[2da,2d7]}
2d7~1,2da~1,2dc~1,2e1~1] ps=36] do_osd_op 2:f7b829c8:::smithi16613012-16:head [copy-from ver 1207]
2d7~1,2da~1,2dc~1,2e1~1] ps=36] do_osd_op  copy-from ver 1207
2d7~1,2da~1,2dc~1,2e1~1] ps=36] start_copy 2:f7b829c8:::smithi16613012-16:head from 2:4854f1b1:::smithi16613012-25:head @2 v1207 flags 0
2d7~1,2da~1,2dc~1,2e1~1] ps=36] _copy_some obc(2:f7b829c8:::smithi16613012-16:head rwstate(write n=1 w=0)) 0x55b49482ed10

#3 Updated by Greg Farnum over 4 years ago

  • Assignee set to Samuel Just

#4 Updated by Samuel Just over 4 years ago

Unfortunately, I think the job number is wrong -- I don't see that object in the log (smithi19817795-* objects are included instead).

#5 Updated by Samuel Just over 4 years ago

Or, I guess the directory is probably correct in that the teuthology.log output is consistent with the above, but the ceph-osd.?.log files appear to be from something else possibly.

#6 Updated by Samuel Just over 4 years ago

The copy_from on that object lasted until the end of the test. It did succeed, but presumably during shutdown once the interval changed. Need to find a way for lb=2:f7b829c8:::smithi16613012-16:2db to block the completion of the copyfrom.

#7 Updated by Samuel Just over 4 years ago

Based on the snapcontext, make_writeable should have created a clone.

#8 Updated by Samuel Just over 4 years ago

  • Priority changed from Urgent to High

I don't think I can make further progress without more logs, I'm marking this need more info for the time being. As it's a problem with copyfrom and apparently not occuring all that often, I'm also dropping it to high.

#9 Updated by Samuel Just over 4 years ago

  • Status changed from 12 to Need More Info

#10 Updated by Neha Ojha over 3 years ago

  • Status changed from Need More Info to Can't reproduce

Also available in: Atom PDF