Bug #9253
ECBackend::continue_recovery_op assert when not enough shards
Status: Duplicate
Priority: Normal
Assignee: -
Category: OSD
Target version: -
% Done: 0%
Source: Development
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
The object is missing on too many shards to be recovered (missing_on_shards=1,2,3,4,5,6,7). When minimum_to_decode returns an error, the caller may assert:
-14> 2014-08-27 13:10:39.588092 7fa2f555b700 10 osd.4 pg_epoch: 63 pg[1.13s0( v 25'136 (0'0,25'136] local-les=63 n=17 ec=7 les/c 63/45 62/62/21) [4,6,12,2,1,7,0,9] r=0 lpr=62 pi=44-61/1 rops=5 crt=25'129 lcod 0'0 mlcod 0'0 active+recovering+degraded] continue_recovery_op: continuing RecoveryOp(hoid=3164fad3/vpm03021113-47/head//1 v=8'6 missing_on=0(6),1(4),2(3),6(1),7(5),9(7),12(2) missing_on_shards=1,2,3,4,5,6,7 recovery_info=ObjectRecoveryInfo(3164fad3/vpm03021113-47/head//1@8'6, copy_subset: [], clone_subset: {}) recovery_progress=ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:true) pending_read=0 obc refcount=3 state=IDLE waiting_on_pushes= extent_requested=0,0
-13> 2014-08-27 13:10:39.588109 7fa2f555b700 10 osd.4 pg_epoch: 63 pg[1.13s0( v 25'136 (0'0,25'136] local-les=63 n=17 ec=7 les/c 63/45 62/62/21) [4,6,12,2,1,7,0,9] r=0 lpr=62 pi=44-61/1 rops=5 crt=25'129 lcod 0'0 mlcod 0'0 active+recovering+degraded] get_min_avail_to_read_shards: checking acting 0(6)
-12> 2014-08-27 13:10:39.588119 7fa2f555b700 10 osd.4 pg_epoch: 63 pg[1.13s0( v 25'136 (0'0,25'136] local-les=63 n=17 ec=7 les/c 63/45 62/62/21) [4,6,12,2,1,7,0,9] r=0 lpr=62 pi=44-61/1 rops=5 crt=25'129 lcod 0'0 mlcod 0'0 active+recovering+degraded] get_min_avail_to_read_shards: checking acting 1(4)
-11> 2014-08-27 13:10:39.588129 7fa2f555b700 10 osd.4 pg_epoch: 63 pg[1.13s0( v 25'136 (0'0,25'136] local-les=63 n=17 ec=7 les/c 63/45 62/62/21) [4,6,12,2,1,7,0,9] r=0 lpr=62 pi=44-61/1 rops=5 crt=25'129 lcod 0'0 mlcod 0'0 active+recovering+degraded] get_min_avail_to_read_shards: checking acting 2(3)
-10> 2014-08-27 13:10:39.588138 7fa2f555b700 10 osd.4 pg_epoch: 63 pg[1.13s0( v 25'136 (0'0,25'136] local-les=63 n=17 ec=7 les/c 63/45 62/62/21) [4,6,12,2,1,7,0,9] r=0 lpr=62 pi=44-61/1 rops=5 crt=25'129 lcod 0'0 mlcod 0'0 active+recovering+degraded] get_min_avail_to_read_shards: checking acting 4(0)
-9> 2014-08-27 13:10:39.588148 7fa2f555b700 10 osd.4 pg_epoch: 63 pg[1.13s0( v 25'136 (0'0,25'136] local-les=63 n=17 ec=7 les/c 63/45 62/62/21) [4,6,12,2,1,7,0,9] r=0 lpr=62 pi=44-61/1 rops=5 crt=25'129 lcod 0'0 mlcod 0'0 active+recovering+degraded] get_min_avail_to_read_shards: checking acting 6(1)
-8> 2014-08-27 13:10:39.588157 7fa2f555b700 10 osd.4 pg_epoch: 63 pg[1.13s0( v 25'136 (0'0,25'136] local-les=63 n=17 ec=7 les/c 63/45 62/62/21) [4,6,12,2,1,7,0,9] r=0 lpr=62 pi=44-61/1 rops=5 crt=25'129 lcod 0'0 mlcod 0'0 active+recovering+degraded] get_min_avail_to_read_shards: checking acting 7(5)
-7> 2014-08-27 13:10:39.588166 7fa2f555b700 10 osd.4 pg_epoch: 63 pg[1.13s0( v 25'136 (0'0,25'136] local-les=63 n=17 ec=7 les/c 63/45 62/62/21) [4,6,12,2,1,7,0,9] r=0 lpr=62 pi=44-61/1 rops=5 crt=25'129 lcod 0'0 mlcod 0'0 active+recovering+degraded] get_min_avail_to_read_shards: checking acting 9(7)
-6> 2014-08-27 13:10:39.588175 7fa2f555b700 10 osd.4 pg_epoch: 63 pg[1.13s0( v 25'136 (0'0,25'136] local-les=63 n=17 ec=7 les/c 63/45 62/62/21) [4,6,12,2,1,7,0,9] r=0 lpr=62 pi=44-61/1 rops=5 crt=25'129 lcod 0'0 mlcod 0'0 active+recovering+degraded] get_min_avail_to_read_shards: checking acting 12(2)
-5> 2014-08-27 13:10:39.588184 7fa2f555b700 10 osd.4 pg_epoch: 63 pg[1.13s0( v 25'136 (0'0,25'136] local-les=63 n=17 ec=7 les/c 63/45 62/62/21) [4,6,12,2,1,7,0,9] r=0 lpr=62 pi=44-61/1 rops=5 crt=25'129 lcod 0'0 mlcod 0'0 active+recovering+degraded] get_min_avail_to_read_shards: checking missing_loc 1(1)
-4> 2014-08-27 13:10:39.588194 7fa2f555b700 10 osd.4 pg_epoch: 63 pg[1.13s0( v 25'136 (0'0,25'136] local-les=63 n=17 ec=7 les/c 63/45 62/62/21) [4,6,12,2,1,7,0,9] r=0 lpr=62 pi=44-61/1 rops=5 crt=25'129 lcod 0'0 mlcod 0'0 active+recovering+degraded] get_min_avail_to_read_shards: checking missing_loc 3(3)
-3> 2014-08-27 13:10:39.588203 7fa2f555b700 10 osd.4 pg_epoch: 63 pg[1.13s0( v 25'136 (0'0,25'136] local-les=63 n=17 ec=7 les/c 63/45 62/62/21) [4,6,12,2,1,7,0,9] r=0 lpr=62 pi=44-61/1 rops=5 crt=25'129 lcod 0'0 mlcod 0'0 active+recovering+degraded] get_min_avail_to_read_shards: checking missing_loc 4(0)
-2> 2014-08-27 13:10:39.588212 7fa2f555b700 10 osd.4 pg_epoch: 63 pg[1.13s0( v 25'136 (0'0,25'136] local-les=63 n=17 ec=7 les/c 63/45 62/62/21) [4,6,12,2,1,7,0,9] r=0 lpr=62 pi=44-61/1 rops=5 crt=25'129 lcod 0'0 mlcod 0'0 active+recovering+degraded] get_min_avail_to_read_shards: checking missing_loc 9(5)
-1> 2014-08-27 13:10:39.588222 7fa2f555b700 10 osd.4 pg_epoch: 63 pg[1.13s0( v 25'136 (0'0,25'136] local-les=63 n=17 ec=7 les/c 63/45 62/62/21) [4,6,12,2,1,7,0,9] r=0 lpr=62 pi=44-61/1 rops=5 crt=25'129 lcod 0'0 mlcod 0'0 active+recovering+degraded] get_min_avail_to_read_shards: checking missing_loc 10(2)
0> 2014-08-27 13:10:39.594817 7fa2f555b700 -1 osd/ECBackend.cc: In function 'void ECBackend::continue_recovery_op(ECBackend::RecoveryOp&, RecoveryMessages*)' thread 7fa2f555b700 time 2014-08-27 13:10:39.588243
osd/ECBackend.cc: 478: FAILED assert(!op.recovery_progress.first)

 ceph version 0.84-713-g95be35a (95be35abd30bb2374accea5a72ffd26a6c25635a)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0xb78d7b]
 2: (ECBackend::continue_recovery_op(ECBackend::RecoveryOp&, RecoveryMessages*)+0x1eea) [0xa0360a]
 3: (ECBackend::run_recovery_op(PGBackend::RecoveryHandle*, int)+0x285) [0xa04c35]
 4: (ReplicatedPG::recover_replicas(int, ThreadPool::TPHandle&)+0x3d6) [0x860a36]
 5: (ReplicatedPG::start_recovery_ops(int, PG::RecoveryCtx*, ThreadPool::TPHandle&, int*)+0x5b3) [0x88e073]
 6: (OSD::do_recovery(PG*, ThreadPool::TPHandle&)+0x28b) [0x689b9b]
 7: (OSD::RecoveryWQ::_process(PG*, ThreadPool::TPHandle&)+0x17) [0x6e8d37]
 8: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa46) [0xb69f06]
 9: (ThreadPool::WorkThread::entry()+0x10) [0xb6afb0]
 10: (()+0x8182) [0x7fa315653182]
 11: (clone()+0x6d) [0x7fa313bbf38d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
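The failure mode described above can be illustrated with a small Python sketch. It is not the Ceph code: it assumes a simplified model in which any k available chunks are enough to decode (the LRC plugin used in this job has more involved rules), and the function `continue_recovery`, its return values, and the example shard sets are hypothetical. It only shows the difference between asserting on an error from minimum_to_decode and reporting the object as unrecoverable.

```python
import errno


def minimum_to_decode(want, available, k):
    """Simplified model (assumption): any k available chunks can decode.

    Returns (0, chunks_to_read) on success or (-EIO, empty set) when
    fewer than k chunks are available.
    """
    if want <= available:
        # Everything wanted can be read directly, no decoding needed.
        return 0, set(want)
    if len(available) < k:
        return -errno.EIO, set()
    return 0, set(sorted(available)[:k])


def continue_recovery(want, available, k):
    """Hypothetical caller that handles the error instead of asserting."""
    r, to_read = minimum_to_decode(want, available, k)
    if r != 0:
        # Not enough shards: surface the error so the object can be
        # treated as unrecoverable for now, rather than crashing.
        return ("unrecoverable", r)
    return ("read", to_read)


# Illustrative inputs only: seven shards wanted, a single shard available.
print(continue_recovery({1, 2, 3, 4, 5, 6, 7}, {0}, k=4))
```

With enough chunks available, for example `continue_recovery({1}, {0, 2, 3, 4}, k=4)`, the same caller returns the set of chunks to read instead of an error.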
Updated by Loïc Dachary over 9 years ago
Using 95be35abd30bb2374accea5a72ffd26a6c25635a
os_type: ubuntu
os_version: '14.04'
overrides:
  ceph:
    conf:
      global:
        osd heartbeat grace: 40
      mon:
        debug mon: 20
        debug ms: 1
        debug paxos: 20
        mon warn on legacy crush tunables: false
      osd:
        debug filestore: 20
        debug journal: 20
        debug ms: 1
        debug osd: 20
    log-whitelist:
    - slow request
    - scrub mismatch
    - ScrubResult
  ceph-deploy:
    branch:
      dev: next
    conf:
      client:
        log file: /var/log/ceph/ceph-$name.$pid.log
      mon:
        debug mon: 1
        debug ms: 20
        debug paxos: 20
        osd default pool size: 2
  install:
    ceph:
      branch: wip-7238-lrc-plugin
roles:
- - mon.a
  - osd.0
  - osd.1
  - osd.2
  - osd.3
- - mon.b
  - mon.c
  - osd.4
  - osd.5
  - osd.6
  - osd.7
- - client.0
  - osd.8
  - osd.9
  - osd.10
  - osd.11
  - osd.12
suite_path: /home/loic/software/ceph/ceph-qa-suite
tasks:
- install:
    branch: wip-7238-lrc-plugin
- ceph:
    fs: xfs
- thrashosds:
    chance_pgnum_grow: 1
    chance_pgpnum_fix: 1
    timeout: 1200
- rados:
    clients: [client.0]
    ops: 4000
    objects: 500
    ec_pool: true
    erasure_code_profile:
      name: LRCprofile
      plugin: LRC
      k: 4
      m: 2
      l: 3
      ruleset-failure-domain: osd
    op_weights:
      read: 45
      write: 0
      append: 45
      delete: 10