Bug #9253 (closed)

ECBackend::continue_recovery_op assert when not enough shards

Added by Loïc Dachary over 9 years ago. Updated over 9 years ago.

Status: Duplicate
Priority: Normal
Assignee: -
Category: OSD
Target version: -
% Done: 0%
Source: Development
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

The object is missing on too many shards to recover (missing_on_shards=1,2,3,4,5,6,7), and when minimum_to_decode returns an error, the caller may assert. A simplified sketch of this path follows the backtrace below.

   -14> 2014-08-27 13:10:39.588092 7fa2f555b700 10 osd.4 pg_epoch: 63 pg[1.13s0( v 25'136 (0'0,25'136] local-les=63 n=17 ec=7 les/c 63/45 62/62/21) [4,6,12,2,1,7,0,9] r=0 lpr=62 pi=44-61/1 rops=5 crt=25'129 lcod 0'0 mlcod 0'0 active+recovering+degraded] continue_recovery_op: continuing RecoveryOp(hoid=3164fad3/vpm03021113-47/head//1 v=8'6 missing_on=0(6),1(4),2(3),6(1),7(5),9(7),12(2) missing_on_shards=1,2,3,4,5,6,7 recovery_info=ObjectRecoveryInfo(3164fad3/vpm03021113-47/head//1@8'6, copy_subset: [], clone_subset: {}) recovery_progress=ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:true) pending_read=0 obc refcount=3 state=IDLE waiting_on_pushes= extent_requested=0,0
   -13> 2014-08-27 13:10:39.588109 7fa2f555b700 10 osd.4 pg_epoch: 63 pg[1.13s0( v 25'136 (0'0,25'136] local-les=63 n=17 ec=7 les/c 63/45 62/62/21) [4,6,12,2,1,7,0,9] r=0 lpr=62 pi=44-61/1 rops=5 crt=25'129 lcod 0'0 mlcod 0'0 active+recovering+degraded] get_min_avail_to_read_shards: checking acting 0(6)
   -12> 2014-08-27 13:10:39.588119 7fa2f555b700 10 osd.4 pg_epoch: 63 pg[1.13s0( v 25'136 (0'0,25'136] local-les=63 n=17 ec=7 les/c 63/45 62/62/21) [4,6,12,2,1,7,0,9] r=0 lpr=62 pi=44-61/1 rops=5 crt=25'129 lcod 0'0 mlcod 0'0 active+recovering+degraded] get_min_avail_to_read_shards: checking acting 1(4)
   -11> 2014-08-27 13:10:39.588129 7fa2f555b700 10 osd.4 pg_epoch: 63 pg[1.13s0( v 25'136 (0'0,25'136] local-les=63 n=17 ec=7 les/c 63/45 62/62/21) [4,6,12,2,1,7,0,9] r=0 lpr=62 pi=44-61/1 rops=5 crt=25'129 lcod 0'0 mlcod 0'0 active+recovering+degraded] get_min_avail_to_read_shards: checking acting 2(3)
   -10> 2014-08-27 13:10:39.588138 7fa2f555b700 10 osd.4 pg_epoch: 63 pg[1.13s0( v 25'136 (0'0,25'136] local-les=63 n=17 ec=7 les/c 63/45 62/62/21) [4,6,12,2,1,7,0,9] r=0 lpr=62 pi=44-61/1 rops=5 crt=25'129 lcod 0'0 mlcod 0'0 active+recovering+degraded] get_min_avail_to_read_shards: checking acting 4(0)
    -9> 2014-08-27 13:10:39.588148 7fa2f555b700 10 osd.4 pg_epoch: 63 pg[1.13s0( v 25'136 (0'0,25'136] local-les=63 n=17 ec=7 les/c 63/45 62/62/21) [4,6,12,2,1,7,0,9] r=0 lpr=62 pi=44-61/1 rops=5 crt=25'129 lcod 0'0 mlcod 0'0 active+recovering+degraded] get_min_avail_to_read_shards: checking acting 6(1)
    -8> 2014-08-27 13:10:39.588157 7fa2f555b700 10 osd.4 pg_epoch: 63 pg[1.13s0( v 25'136 (0'0,25'136] local-les=63 n=17 ec=7 les/c 63/45 62/62/21) [4,6,12,2,1,7,0,9] r=0 lpr=62 pi=44-61/1 rops=5 crt=25'129 lcod 0'0 mlcod 0'0 active+recovering+degraded] get_min_avail_to_read_shards: checking acting 7(5)
    -7> 2014-08-27 13:10:39.588166 7fa2f555b700 10 osd.4 pg_epoch: 63 pg[1.13s0( v 25'136 (0'0,25'136] local-les=63 n=17 ec=7 les/c 63/45 62/62/21) [4,6,12,2,1,7,0,9] r=0 lpr=62 pi=44-61/1 rops=5 crt=25'129 lcod 0'0 mlcod 0'0 active+recovering+degraded] get_min_avail_to_read_shards: checking acting 9(7)
    -6> 2014-08-27 13:10:39.588175 7fa2f555b700 10 osd.4 pg_epoch: 63 pg[1.13s0( v 25'136 (0'0,25'136] local-les=63 n=17 ec=7 les/c 63/45 62/62/21) [4,6,12,2,1,7,0,9] r=0 lpr=62 pi=44-61/1 rops=5 crt=25'129 lcod 0'0 mlcod 0'0 active+recovering+degraded] get_min_avail_to_read_shards: checking acting 12(2)
    -5> 2014-08-27 13:10:39.588184 7fa2f555b700 10 osd.4 pg_epoch: 63 pg[1.13s0( v 25'136 (0'0,25'136] local-les=63 n=17 ec=7 les/c 63/45 62/62/21) [4,6,12,2,1,7,0,9] r=0 lpr=62 pi=44-61/1 rops=5 crt=25'129 lcod 0'0 mlcod 0'0 active+recovering+degraded] get_min_avail_to_read_shards: checking missing_loc 1(1)
    -4> 2014-08-27 13:10:39.588194 7fa2f555b700 10 osd.4 pg_epoch: 63 pg[1.13s0( v 25'136 (0'0,25'136] local-les=63 n=17 ec=7 les/c 63/45 62/62/21) [4,6,12,2,1,7,0,9] r=0 lpr=62 pi=44-61/1 rops=5 crt=25'129 lcod 0'0 mlcod 0'0 active+recovering+degraded] get_min_avail_to_read_shards: checking missing_loc 3(3)
    -3> 2014-08-27 13:10:39.588203 7fa2f555b700 10 osd.4 pg_epoch: 63 pg[1.13s0( v 25'136 (0'0,25'136] local-les=63 n=17 ec=7 les/c 63/45 62/62/21) [4,6,12,2,1,7,0,9] r=0 lpr=62 pi=44-61/1 rops=5 crt=25'129 lcod 0'0 mlcod 0'0 active+recovering+degraded] get_min_avail_to_read_shards: checking missing_loc 4(0)
    -2> 2014-08-27 13:10:39.588212 7fa2f555b700 10 osd.4 pg_epoch: 63 pg[1.13s0( v 25'136 (0'0,25'136] local-les=63 n=17 ec=7 les/c 63/45 62/62/21) [4,6,12,2,1,7,0,9] r=0 lpr=62 pi=44-61/1 rops=5 crt=25'129 lcod 0'0 mlcod 0'0 active+recovering+degraded] get_min_avail_to_read_shards: checking missing_loc 9(5)
    -1> 2014-08-27 13:10:39.588222 7fa2f555b700 10 osd.4 pg_epoch: 63 pg[1.13s0( v 25'136 (0'0,25'136] local-les=63 n=17 ec=7 les/c 63/45 62/62/21) [4,6,12,2,1,7,0,9] r=0 lpr=62 pi=44-61/1 rops=5 crt=25'129 lcod 0'0 mlcod 0'0 active+recovering+degraded] get_min_avail_to_read_shards: checking missing_loc 10(2)
     0> 2014-08-27 13:10:39.594817 7fa2f555b700 -1 osd/ECBackend.cc: In function 'void ECBackend::continue_recovery_op(ECBackend::RecoveryOp&, RecoveryMessages*)' thread 7fa2f555b700 time 2014-08-27 13:10:39.588243
osd/ECBackend.cc: 478: FAILED assert(!op.recovery_progress.first)

 ceph version 0.84-713-g95be35a (95be35abd30bb2374accea5a72ffd26a6c25635a)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0xb78d7b]
 2: (ECBackend::continue_recovery_op(ECBackend::RecoveryOp&, RecoveryMessages*)+0x1eea) [0xa0360a]
 3: (ECBackend::run_recovery_op(PGBackend::RecoveryHandle*, int)+0x285) [0xa04c35]
 4: (ReplicatedPG::recover_replicas(int, ThreadPool::TPHandle&)+0x3d6) [0x860a36]
 5: (ReplicatedPG::start_recovery_ops(int, PG::RecoveryCtx*, ThreadPool::TPHandle&, int*)+0x5b3) [0x88e073]
 6: (OSD::do_recovery(PG*, ThreadPool::TPHandle&)+0x28b) [0x689b9b]
 7: (OSD::RecoveryWQ::_process(PG*, ThreadPool::TPHandle&)+0x17) [0x6e8d37]
 8: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa46) [0xb69f06]
 9: (ThreadPool::WorkThread::entry()+0x10) [0xb6afb0]
 10: (()+0x8182) [0x7fa315653182]
 11: (clone()+0x6d) [0x7fa313bbf38d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
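
For readers of this ticket, here is a simplified, hypothetical sketch of the kind of control flow the log and backtrace suggest; it is not the actual osd/ECBackend.cc code, and the names RecoveryOpSketch / continue_recovery_op_sketch are illustrative only. The point is that when minimum_to_decode cannot find enough shards, the error comes back to a caller that still holds an op whose recovery_progress is in its initial ("first") state, and an invariant like assert(!op.recovery_progress.first) can then fire.

// Hypothetical sketch only -- not the actual osd/ECBackend.cc code.
#include <cassert>
#include <cerrno>

struct RecoveryOpSketch {
  bool first = true;                      // stands in for op.recovery_progress.first
  enum { IDLE, READING } state = IDLE;    // stands in for RecoveryOp::state
};

// Returns 0 on success, -EIO when too few shards are available to decode.
int continue_recovery_op_sketch(RecoveryOpSketch &op,
                                int available_shards, int needed_shards) {
  if (op.state == RecoveryOpSketch::IDLE) {
    if (available_shards < needed_shards)
      return -EIO;        // minimum_to_decode() failed: the error is returned,
                          // but op.first is still true
    op.first = false;     // normal path: the first read has been issued
    op.state = RecoveryOpSketch::READING;
    return 0;
  }
  // Later states assume the first read already happened; if the error above is
  // not handled and the op is driven again, an invariant of this shape is what
  // fires (compare osd/ECBackend.cc:478 in the backtrace).
  assert(!op.first);
  return 0;
}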

Related issues: 1 (0 open, 1 closed)

Related to Ceph - Bug #9263: erasure-code: ECBackend crashes when mapping fails (Rejected, 2014-08-28)

#1 - Updated by Loïc Dachary over 9 years ago

Using 95be35abd30bb2374accea5a72ffd26a6c25635a

os_type: ubuntu
os_version: '14.04'
overrides:
  ceph:
    conf:
      global:
        osd heartbeat grace: 40
      mon:
        debug mon: 20
        debug ms: 1
        debug paxos: 20
        mon warn on legacy crush tunables: false
      osd:
        debug filestore: 20
        debug journal: 20
        debug ms: 1
        debug osd: 20
    log-whitelist:
    - slow request
    - scrub mismatch
    - ScrubResult
  ceph-deploy:
    branch:
      dev: next
    conf:
      client:
        log file: /var/log/ceph/ceph-$name.$pid.log
      mon:
        debug mon: 1
        debug ms: 20
        debug paxos: 20
        osd default pool size: 2
  install:
    ceph:
      branch: wip-7238-lrc-plugin
roles:
- - mon.a
  - osd.0
  - osd.1
  - osd.2
  - osd.3
- - mon.b
  - mon.c
  - osd.4
  - osd.5
  - osd.6
  - osd.7
- - client.0
  - osd.8
  - osd.9
  - osd.10
  - osd.11
  - osd.12
suite_path: /home/loic/software/ceph/ceph-qa-suite
tasks:
- install:
    branch: wip-7238-lrc-plugin
- ceph:
    fs: xfs
- thrashosds:
    chance_pgnum_grow: 1
    chance_pgpnum_fix: 1
    timeout: 1200
- rados:
    clients: [client.0]
    ops: 4000
    objects: 500
    ec_pool: true
    erasure_code_profile:
      name: LRCprofile
      plugin: LRC
      k: 4
      m: 2
      l: 3
      ruleset-failure-domain: osd
    op_weights:
      read: 45
      write: 0
      append: 45
      delete: 10
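
For context on the shard counts in the log: a minimal sketch, assuming the simple LRC layout adds one local parity chunk per group of l chunks (my reading of the plugin, not stated in this ticket), of why this profile yields 8 shards and why losing 7 of them is unrecoverable.

// Back-of-the-envelope check of the shard count implied by the LRC profile
// above (k=4, m=2, l=3). Assumption: total = k + m + (k + m) / l.
#include <cassert>

int lrc_chunk_count(int k, int m, int l) {
  return k + m + (k + m) / l;
}

int main() {
  // 4 + 2 + 6/3 = 8 shards, matching the 8-OSD acting set [4,6,12,2,1,7,0,9]
  // in the log; with missing_on_shards=1,2,3,4,5,6,7 only one shard remains,
  // far fewer than the k=4 needed to decode.
  assert(lrc_chunk_count(4, 2, 3) == 8);
  return 0;
}
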
#2 - Updated by Loïc Dachary over 9 years ago

  • Status changed from New to Duplicate

Duplicate of #9263
