Bug #9715 (closed)

assert(want_acting_backfill.size() - want_backfill.size() == num_want_acting) firefly

Added by Yuri Weinstein over 9 years ago. Updated over 9 years ago.

Status:
Duplicate
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Q/A
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Logs are in http://qa-proxy.ceph.com/teuthology/teuthology-2014-10-08_19:30:01-upgrade:dumpling-firefly-x:stress-split-giant-distro-basic-multi/534732/

Coredump in ceph-osd.0.log.gz

ceph-osd.0.log.gz:     0> 2014-10-09 05:38:13.647225 7fb6c4cd2700 -1 *** Caught signal (Aborted) **
ceph-osd.0.log.gz: in thread 7fb6c4cd2700
ceph-osd.0.log.gz:
ceph-osd.0.log.gz: ceph version 0.80.6-56-gfd20a1d (fd20a1d01bde67fb1edc6058e38435af9d5d6abc)
ceph-osd.0.log.gz: 1: ceph-osd() [0x980baf]
ceph-osd.0.log.gz: 2: (()+0x10340) [0x7fb6dabb4340]
ceph-osd.0.log.gz: 3: (gsignal()+0x39) [0x7fb6d925bf89]
ceph-osd.0.log.gz: 4: (abort()+0x148) [0x7fb6d925f398]
ceph-osd.0.log.gz: 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7fb6d9b676b5]
ceph-osd.0.log.gz: 6: (()+0x5e836) [0x7fb6d9b65836]
ceph-osd.0.log.gz: 7: (()+0x5e863) [0x7fb6d9b65863]
ceph-osd.0.log.gz: 8: (()+0x5eaa2) [0x7fb6d9b65aa2]
ceph-osd.0.log.gz: 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1f2) [0xa63752]
ceph-osd.0.log.gz: 10: (PG::choose_acting(pg_shard_t&)+0x1366) [0x750cc6]
ceph-osd.0.log.gz: 11: (PG::RecoveryState::GetLog::GetLog(boost::statechart::state<PG::RecoveryState::GetLog, PG::RecoveryState::Peering, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::my_context)+0x11e) [0x750f3e]
ceph-osd.0.log.gz: 12: (boost::statechart::detail::safe_reaction_result boost::statechart::simple_state<PG::RecoveryState::GetInfo, PG::RecoveryState::Peering, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::transit_impl<PG::RecoveryState::GetLog, PG::RecoveryState::RecoveryMachine, boost::statechart::detail::no_transition_function>(boost::statechart::detail::no_transition_function const&)+0xb8) [0x797618]
ceph-osd.0.log.gz: 13: (boost::statechart::simple_state<PG::RecoveryState::GetInfo, PG::RecoveryState::Peering, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x13a) [0x79797a]
ceph-osd.0.log.gz: 14: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base const&)+0x5b) [0x78246b]
ceph-osd.0.log.gz: 15: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_queued_events()+0xd4) [0x7825e4]
ceph-osd.0.log.gz: 16: (PG::handle_peering_event(std::tr1::shared_ptr<PG::CephPeeringEvt>, PG::RecoveryCtx*)+0x1d1) [0x731771]
ceph-osd.0.log.gz: 17: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x294) [0x6484e4]
archive_path: /var/lib/teuthworker/archive/teuthology-2014-10-08_19:30:01-upgrade:dumpling-firefly-x:stress-split-giant-distro-basic-multi/534732
branch: giant
description: upgrade:dumpling-firefly-x:stress-split/{00-cluster/start.yaml 01-dumpling-install/dumpling.yaml
  02-partial-upgrade-firefly/firsthalf.yaml 03-thrash/default.yaml 04-mona-upgrade-firefly/mona.yaml
  05-workload/rbd-cls.yaml 06-monb-upgrade-firefly/monb.yaml 07-workload/rbd_api.yaml
  08-monc-upgrade-firefly/monc.yaml 09-workload/{rbd-python.yaml rgw-s3tests.yaml}
  10-osds-upgrade-firefly/secondhalf.yaml 11-workload/snaps-few-objects.yaml 12-partial-upgrade-x/first.yaml
  13-thrash/default.yaml 14-mona-upgrade-x/mona.yaml 15-workload/rbd-import-export.yaml
  16-monb-upgrade-x/monb.yaml 17-workload/readwrite.yaml 18-monc-upgrade-x/monc.yaml
  19-workload/radosbench.yaml 20-osds-upgrade-x/osds_secondhalf.yaml 21-final-workload/rados_stress_watch.yaml
  distros/ubuntu_14.04.yaml}
email: ceph-qa@ceph.com
job_id: '534732'
kernel: &id001
  kdb: true
  sha1: distro
last_in_suite: false
machine_type: plana,burnupi,mira
name: teuthology-2014-10-08_19:30:01-upgrade:dumpling-firefly-x:stress-split-giant-distro-basic-multi
nuke-on-error: true
os_type: ubuntu
os_version: '14.04'
overrides:
  admin_socket:
    branch: giant
  ceph:
    conf:
      mon:
        debug mon: 20
        debug ms: 1
        debug paxos: 20
        mon warn on legacy crush tunables: false
      osd:
        debug filestore: 20
        debug journal: 20
        debug ms: 1
        debug osd: 20
    log-whitelist:
    - slow request
    - wrongly marked me down
    - objects unfound and apparently lost
    - log bound mismatch
    - wrongly marked me down
    - objects unfound and apparently lost
    - log bound mismatch
    sha1: 3bfb5fab41b6247259183c3f52c786e35beb3b01
  ceph-deploy:
    branch:
      dev: giant
    conf:
      client:
        log file: /var/log/ceph/ceph-$name.$pid.log
      mon:
        debug mon: 1
        debug ms: 20
        debug paxos: 20
        osd default pool size: 2
  install:
    ceph:
      sha1: 3bfb5fab41b6247259183c3f52c786e35beb3b01
  s3tests:
    branch: giant
  workunit:
    sha1: 3bfb5fab41b6247259183c3f52c786e35beb3b01
owner: scheduled_teuthology@teuthology
priority: 1000
roles:
- - mon.a
  - mon.b
  - mds.a
  - osd.0
  - osd.1
  - osd.2
  - mon.c
- - osd.3
  - osd.4
  - osd.5
- - client.0
suite: upgrade:dumpling-firefly-x:stress-split
suite_branch: master
suite_path: /var/lib/teuthworker/src/ceph-qa-suite_master
targets:
  ubuntu@plana25.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC85wgMM7frtcFeCfatlKcc0Ru1HB4X/557M//6iIT2hQExLRtPADyOpfZdZmhP4Nh/mP6C9oB5gYH82sHnuVbCboq9J9OzBK0STFo4OIToRgbLJCTRfNuKt0VX0WCpvneQfA4SKmAsO527HgDcY/yyhzg67rWIel4LilQpFbPTe+rB9wBjO/DpbhxoF7d8vQUwtt2dYv6BXOvYPHCvgydTCAMIOgHIP/UqQceJgj/I3u85851yllYnBNE7LaJRXB96FlRtO25ZV7F7pFYLxyCsm+vGfRmp5YqdP+Qw72UaXuMpan+dQDwzpfRklLvolrq9jOLLYIvwnzd+GQgbRR87
  ubuntu@plana29.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDgHl5M5raEaWWKwH7thuC+Glhc6UlVLHpvIMDGMezlVOELXUB197ZAyurxpWurUteDEaqevUqjUgi08lEL0dHW4G9ulst8B3FqHrAvIvCmhKwGs1znrorZx2Bq+/8mreE4ocMWfT/R8sFzBsM3npgdgSqCAdDBSgI78S92WSlHGAqUz1A0iJoGdwiRjTKhCrI/tuIYyXUU9z/2vIR9bJTp0fwt2S3z4LnpKdsMU8BGpJwm8CPrVj/odRDZ/epgEFBySb4OUq68QDXlg8RHnL1D+zFVsoi/uU5o9rW55EAo7KnPDcdNCQSsZmaurwRaZxVSqhomF0kIU9oL1wlD+s7N
  ubuntu@plana68.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDN/p55jmz08gik2BX9h++ylasFj74ZysGxYsNDeeg4olDWNVz++6cDkRqoR+8SE7yZqk7iSZryr+Y3bQibjXK1PFeeiUtuJntIjIXIU7s9z3FC2EM3aJYB2wWW9IOuuFplEhg+QJAfxnzFLe0WJ9y6PzEITYejDD+pxzpS5fi0+D0WmNTwKlGBGMUz+6yAFZ0QrPxvSWkxIuZC1PRUefL3UUV2xCEmNge8PygeiRhdcn8iB8Ib1Bj+yyWUTFZ2RGbz6Y7sCVqckFGXIrhu6wjfYXpaYBYUbVg7R2qtwld6qybTcp+1RI8cc5RMyutX4PCC54Pjpbni7Kv/CusfXsSL
tasks:
- internal.lock_machines:
  - 3
  - plana,burnupi,mira
- internal.save_config: null
- internal.check_lock: null
- internal.connect: null
- internal.push_inventory: null
- internal.serialize_remote_roles: null
- internal.check_conflict: null
- internal.check_ceph_data: null
- internal.vm_setup: null
- kernel: *id001
- internal.base: null
- internal.archive: null
- internal.coredump: null
- internal.sudo: null
- internal.syslog: null
- internal.timer: null
- chef: null
- clock.check: null
- install:
    branch: dumpling
- ceph:
    fs: xfs
- install.upgrade:
    osd.0:
      branch: firefly
- ceph.restart:
    daemons:
    - osd.0
    - osd.1
    - osd.2
- thrashosds:
    chance_pgnum_grow: 1
    chance_pgpnum_fix: 1
    thrash_primary_affinity: false
    timeout: 1200
- ceph.restart:
    daemons:
    - mon.a
    wait-for-healthy: false
    wait-for-osds-up: true
- workunit:
    branch: dumpling
    clients:
      client.0:
      - cls/test_cls_rbd.sh
- ceph.restart:
    daemons:
    - mon.b
    wait-for-healthy: false
    wait-for-osds-up: true
- workunit:
    clients:
      client.0:
      - rbd/test_librbd.sh
- install.upgrade:
    mon.c: null
- ceph.restart:
    daemons:
    - mon.c
    wait-for-healthy: false
    wait-for-osds-up: true
- ceph.wait_for_mon_quorum:
  - a
  - b
  - c
- workunit:
    clients:
      client.0:
      - rbd/test_librbd_python.sh
- rgw:
    client.0: null
    default_idle_timeout: 300
- s3tests:
    client.0:
      rgw_server: client.0
- install.upgrade:
    osd.3:
      branch: firefly
- ceph.restart:
    daemons:
    - osd.3
    - osd.4
    - osd.5
- rados:
    clients:
    - client.0
    objects: 50
    op_weights:
      delete: 50
      read: 100
      rollback: 50
      snap_create: 50
      snap_remove: 50
      write: 100
    ops: 4000
- install.upgrade:
    osd.0: null
- ceph.restart:
    daemons:
    - osd.0
    - osd.1
    - osd.2
- thrashosds:
    chance_pgnum_grow: 1
    chance_pgpnum_fix: 1
    thrash_primary_affinity: false
    timeout: 1200
- ceph.restart:
    daemons:
    - mon.a
    wait-for-healthy: false
    wait-for-osds-up: true
- workunit:
    clients:
      client.0:
      - rbd/import_export.sh
    env:
      RBD_CREATE_ARGS: --new-format
- ceph.restart:
    daemons:
    - mon.b
    wait-for-healthy: false
    wait-for-osds-up: true
- rados:
    clients:
    - client.0
    objects: 500
    op_weights:
      delete: 10
      read: 45
      write: 45
    ops: 4000
- ceph.restart:
    daemons:
    - mon.c
    wait-for-healthy: false
    wait-for-osds-up: true
- ceph.wait_for_mon_quorum:
  - a
  - b
  - c
- radosbench:
    clients:
    - client.0
    time: 1800
- install.upgrade:
    osd.3: null
- ceph.restart:
    daemons:
    - osd.3
    - osd.4
    - osd.5
- workunit:
    clients:
      client.0:
      - rados/stress_watch.sh
teuthology_branch: master
tube: multi
verbose: true
worker_log: /var/lib/teuthworker/archive/worker_logs/worker.multi.3114
description: upgrade:dumpling-firefly-x:stress-split/{00-cluster/start.yaml 01-dumpling-install/dumpling.yaml
  02-partial-upgrade-firefly/firsthalf.yaml 03-thrash/default.yaml 04-mona-upgrade-firefly/mona.yaml
  05-workload/rbd-cls.yaml 06-monb-upgrade-firefly/monb.yaml 07-workload/rbd_api.yaml
  08-monc-upgrade-firefly/monc.yaml 09-workload/{rbd-python.yaml rgw-s3tests.yaml}
  10-osds-upgrade-firefly/secondhalf.yaml 11-workload/snaps-few-objects.yaml 12-partial-upgrade-x/first.yaml
  13-thrash/default.yaml 14-mona-upgrade-x/mona.yaml 15-workload/rbd-import-export.yaml
  16-monb-upgrade-x/monb.yaml 17-workload/readwrite.yaml 18-monc-upgrade-x/monc.yaml
  19-workload/radosbench.yaml 20-osds-upgrade-x/osds_secondhalf.yaml 21-final-workload/rados_stress_watch.yaml
  distros/ubuntu_14.04.yaml}
duration: 12110.70418214798
failure_reason: 'Command failed on plana29 with status 124: ''mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp
  && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1
  CEPH_REF=3bfb5fab41b6247259183c3f52c786e35beb3b01 TESTDIR="/home/ubuntu/cephtest" 
  CEPH_ID="0" adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage
  timeout 3h /home/ubuntu/cephtest/workunit.client.0/rbd/test_librbd_python.sh'''
flavor: basic
owner: scheduled_teuthology@teuthology
success: false

Related issues 1 (0 open, 1 closed)

Is duplicate of Ceph - Bug #9696: Upgrade from 0.80.5 to 0.80.6 causes OSDs to go down with FAILED assert(want_acting_backfill.size() - want_backfill.size() == num_want_acting) (Resolved, Samuel Just)

#1

Updated by Yuri Weinstein over 9 years ago

  • Subject changed from Coredumo in upgrade:dumpling-firefly-x:stress-split-giant-distro-basic-multi run to Coredump in upgrade:dumpling-firefly-x:stress-split-giant-distro-basic-multi run
#2

Updated by Loïc Dachary over 9 years ago

assert(want_acting_backfill.size() - want_backfill.size() == num_want_acting); fails

58/59/59) [3,4]/[0,3,4] r=0 lpr=59 pi=1-58/8 crt=0'0 lcod 0'0 mlcod 0'0 remapped+peering]  got osd.3 1.1b( empty local-les=0 n=0 ec=1 les/c 31/31 58/59/59)
  -113> 2014-10-09 05:38:13.535657 7fb6c4cd2700 10 osd.0 pg_epoch: 59 pg[1.1b( v 5'1 (5'1,5'1] local-les=31 n=1 ec=1 les/c 31/31 58/59/59) [3,4]/[0,3,4] r=0 lpr=59 pi=1-58/8 crt=0'0 lcod 0'0 mlcod 0'0 remapped+peering] update_heartbeat_peers 0,3,4 unchanged
  -112> 2014-10-09 05:38:13.535671 7fb6c4cd2700  5 osd.0 pg_epoch: 59 pg[1.1b( v 5'1 (5'1,5'1] local-les=31 n=1 ec=1 les/c 31/31 58/59/59) [3,4]/[0,3,4] r=0 lpr=59 pi=1-58/8 crt=0'0 lcod 0'0 mlcod 0'0 remapped+peering] exit Started/Primary/Peering/GetInfo 0.006107 3 0.000435
  -111> 2014-10-09 05:38:13.535684 7fb6c4cd2700  5 osd.0 pg_epoch: 59 pg[1.1b( v 5'1 (5'1,5'1] local-les=31 n=1 ec=1 les/c 31/31 58/59/59) [3,4]/[0,3,4] r=0 lpr=59 pi=1-58/8 crt=0'0 lcod 0'0 mlcod 0'0 remapped+peering] enter Started/Primary/Peering/GetLog
  -110> 2014-10-09 05:38:13.535697 7fb6c4cd2700 10 osd.0 pg_epoch: 59 pg[1.1b( v 5'1 (5'1,5'1] local-les=31 n=1 ec=1 les/c 31/31 58/59/59) [3,4]/[0,3,4] r=0 lpr=59 pi=1-58/8 crt=0'0 lcod 0'0 mlcod 0'0 remapped+peering] calc_acting osd.0 1.1b( v 5'1 (5'1,5'1] local-les=31 n=1 ec=1 les/c 31/31 58/59/59)
  -109> 2014-10-09 05:38:13.535710 7fb6c4cd2700 10 osd.0 pg_epoch: 59 pg[1.1b( v 5'1 (5'1,5'1] local-les=31 n=1 ec=1 les/c 31/31 58/59/59) [3,4]/[0,3,4] r=0 lpr=59 pi=1-58/8 crt=0'0 lcod 0'0 mlcod 0'0 remapped+peering] calc_acting osd.3 1.1b( empty local-les=0 n=0 ec=1 les/c 31/31 58/59/59)
  -108> 2014-10-09 05:38:13.535721 7fb6c4cd2700 10 osd.0 pg_epoch: 59 pg[1.1b( v 5'1 (5'1,5'1] local-les=31 n=1 ec=1 les/c 31/31 58/59/59) [3,4]/[0,3,4] r=0 lpr=59 pi=1-58/8 crt=0'0 lcod 0'0 mlcod 0'0 remapped+peering] calc_acting osd.4 1.1b( v 5'1 (5'1,5'1] local-les=31 n=1 ec=1 les/c 31/31 58/59/59)
  -107> 2014-10-09 05:38:13.535748 7fb6c4cd2700 10 osd.0 pg_epoch: 59 pg[1.1b( v 5'1 (5'1,5'1] local-les=31 n=1 ec=1 les/c 31/31 58/59/59) [3,4]/[0,3,4] r=0 lpr=59 pi=1-58/8 crt=0'0 lcod 0'0 mlcod 0'0 remapped+peering] calc_acting newest update on osd.0 with 1.1b( v 5'1 (5'1,5'1] local-les=31 n=1 ec=1 les/c 31/31 58/59/59)
up[0] needs backfill, osd.0 selected as primary instead
calc_acting primary is osd.0 with 1.1b( v 5'1 (5'1,5'1] local-les=31 n=1 ec=1 les/c 31/31 58/59/59)
 shard 3 (up) backfill 1.1b( empty local-les=0 n=0 ec=1 les/c 31/31 58/59/59)
 osd.4 (up) accepted 1.1b( v 5'1 (5'1,5'1] local-les=31 n=1 ec=1 les/c 31/31 58/59/59)
....
   -21> 2014-10-09 05:38:13.539457 7fb6c4cd2700 -1 osd/PG.cc: In function 'bool PG::choose_acting(pg_shard_t&)' thread 7fb6c4cd2700 time 2014-10-09 05:38:13.535760

osd/PG.cc: 1284: FAILED assert(want_acting_backfill.size() - want_backfill.size() == num_want_acting)
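
For context, here is a minimal standalone sketch of the invariant this assert encodes. The variable names mirror the assert message, but this is not the firefly osd/PG.cc code, and the values are chosen to resemble the PG in the log above (acting [0,3,4], osd.3 being backfilled) under the assumption, spelled out in the later comments, that the backfill peer can also end up in the acting set.

// Minimal, self-contained illustration of the failing invariant; the names
// mirror the assert message, but this is NOT the actual osd/PG.cc code.
#include <cassert>
#include <cstddef>
#include <iostream>
#include <set>

int main() {
  // Disjoint case: osd.0 and osd.4 are wanted in the acting set,
  // osd.3 is wanted for backfill only.
  std::set<int> want_acting   = {0, 4};
  std::set<int> want_backfill = {3};

  // want_acting_backfill is the union of the two sets.
  std::set<int> want_acting_backfill = want_acting;
  want_acting_backfill.insert(want_backfill.begin(), want_backfill.end());

  std::size_t num_want_acting = want_acting.size();

  // |acting union backfill| - |backfill| == |acting| holds only while the
  // two sets are disjoint.
  assert(want_acting_backfill.size() - want_backfill.size() == num_want_acting);

  // Compatibility scenario described in the comments below: the backfill
  // peer (osd.3) is also kept in the acting set, so the sets overlap and
  // the same arithmetic gives 3 - 1 == 2 != 3, i.e. the assert would fail.
  want_acting.insert(3);
  num_want_acting = want_acting.size();
  std::cout << ((want_acting_backfill.size() - want_backfill.size()
                 == num_want_acting)
                    ? "compat case: assert would hold\n"
                    : "compat case: assert would fail\n");
  return 0;
}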

#3

Updated by Loïc Dachary over 9 years ago

  • Subject changed from Coredump in upgrade:dumpling-firefly-x:stress-split-giant-distro-basic-multi run to assert(want_acting_backfill.size() - want_backfill.size() == num_want_acting) firefly
#4

Updated by Loïc Dachary over 9 years ago

  • Status changed from New to Duplicate
#5

Updated by David Zafman over 9 years ago

I see that the change (92cfd370) which added the assert didn't consider compat_mode. Older OSDs only support one backfill at a time, so during an upgrade, while not all OSDs have been updated yet, compat_mode == true. In calc_replicated_acting(), after the first backfill is added, no more will be. We could make the assert just be assert(!compat_mode && …), but what about the test "if (num_want_acting < pool.info.min_size)"? Is that fine as is?
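
A minimal sketch of how such a guard could look, reading the suggestion as "only enforce the disjointness check when compat_mode is off". This is an interpretation for illustration, not the committed fix; the helper and parameter names are hypothetical rather than taken from the firefly source.

// Hypothetical illustration only, not the actual firefly patch.
#include <cassert>
#include <cstddef>

// Skip the disjointness check when compat_mode is set, since in that mode
// the backfill peer is intentionally also a member of the acting set.
void check_acting_invariant(bool compat_mode,
                            std::size_t want_acting_backfill_size,
                            std::size_t want_backfill_size,
                            std::size_t num_want_acting) {
  assert(compat_mode ||
         want_acting_backfill_size - want_backfill_size == num_want_acting);
}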

#6

Updated by Tamilarasi muthamizhan over 9 years ago

  • Parent task set to #9696
#7

Updated by Loïc Dachary over 9 years ago

sjust: I think it's due to the compatibility thing where we include the backfill peer in the acting set if there are old osds. I think the assert is just not valid since the backfill and acting sets are not disjoint in that case.
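
As an illustration of that description (a sketch under the assumptions stated in these comments, not the real PG::calc_replicated_acting() code): with old OSDs present, only one backfill target is taken and it is also placed in the acting set, which is exactly why the acting and backfill sets stop being disjoint.

#include <set>
#include <vector>

// Illustrative only; not the real PG::calc_replicated_acting().
// With old OSDs in the cluster (compat_mode), at most one backfill
// candidate is taken and it is also kept in the acting set.
void pick_backfill(bool compat_mode,
                   const std::vector<int>& backfill_candidates,
                   std::set<int>& want_acting,
                   std::set<int>& want_backfill) {
  for (int osd : backfill_candidates) {
    if (compat_mode && !want_backfill.empty())
      break;                    // old OSDs: only one backfill at a time
    want_backfill.insert(osd);
    if (compat_mode)
      want_acting.insert(osd);  // backfill peer also ends up in acting
  }
}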

#8

Updated by Tamilarasi muthamizhan over 9 years ago

  • Parent task deleted (#9696)