Bug #9715
Status: Closed
Subject: assert(want_acting_backfill.size() - want_backfill.size() == num_want_acting) firefly
Description
Coredump in ceph-osd.0.log.gz
From ceph-osd.0.log.gz:

     0> 2014-10-09 05:38:13.647225 7fb6c4cd2700 -1 *** Caught signal (Aborted) **
 in thread 7fb6c4cd2700

 ceph version 0.80.6-56-gfd20a1d (fd20a1d01bde67fb1edc6058e38435af9d5d6abc)
 1: ceph-osd() [0x980baf]
 2: (()+0x10340) [0x7fb6dabb4340]
 3: (gsignal()+0x39) [0x7fb6d925bf89]
 4: (abort()+0x148) [0x7fb6d925f398]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7fb6d9b676b5]
 6: (()+0x5e836) [0x7fb6d9b65836]
 7: (()+0x5e863) [0x7fb6d9b65863]
 8: (()+0x5eaa2) [0x7fb6d9b65aa2]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1f2) [0xa63752]
 10: (PG::choose_acting(pg_shard_t&)+0x1366) [0x750cc6]
 11: (PG::RecoveryState::GetLog::GetLog(boost::statechart::state<PG::RecoveryState::GetLog, PG::RecoveryState::Peering, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::my_context)+0x11e) [0x750f3e]
 12: (boost::statechart::detail::safe_reaction_result boost::statechart::simple_state<PG::RecoveryState::GetInfo, PG::RecoveryState::Peering, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::transit_impl<PG::RecoveryState::GetLog, PG::RecoveryState::RecoveryMachine, boost::statechart::detail::no_transition_function>(boost::statechart::detail::no_transition_function const&)+0xb8) [0x797618]
 13: (boost::statechart::simple_state<PG::RecoveryState::GetInfo, PG::RecoveryState::Peering, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x13a) [0x79797a]
 14: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base const&)+0x5b) [0x78246b]
 15: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_queued_events()+0xd4) [0x7825e4]
 16: (PG::handle_peering_event(std::tr1::shared_ptr<PG::CephPeeringEvt>, PG::RecoveryCtx*)+0x1d1) [0x731771]
 17: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x294) [0x6484e4]
archive_path: /var/lib/teuthworker/archive/teuthology-2014-10-08_19:30:01-upgrade:dumpling-firefly-x:stress-split-giant-distro-basic-multi/534732
branch: giant
description: upgrade:dumpling-firefly-x:stress-split/{00-cluster/start.yaml 01-dumpling-install/dumpling.yaml 02-partial-upgrade-firefly/firsthalf.yaml 03-thrash/default.yaml 04-mona-upgrade-firefly/mona.yaml 05-workload/rbd-cls.yaml 06-monb-upgrade-firefly/monb.yaml 07-workload/rbd_api.yaml 08-monc-upgrade-firefly/monc.yaml 09-workload/{rbd-python.yaml rgw-s3tests.yaml} 10-osds-upgrade-firefly/secondhalf.yaml 11-workload/snaps-few-objects.yaml 12-partial-upgrade-x/first.yaml 13-thrash/default.yaml 14-mona-upgrade-x/mona.yaml 15-workload/rbd-import-export.yaml 16-monb-upgrade-x/monb.yaml 17-workload/readwrite.yaml 18-monc-upgrade-x/monc.yaml 19-workload/radosbench.yaml 20-osds-upgrade-x/osds_secondhalf.yaml 21-final-workload/rados_stress_watch.yaml distros/ubuntu_14.04.yaml}
email: ceph-qa@ceph.com
job_id: '534732'
kernel: &id001
  kdb: true
  sha1: distro
last_in_suite: false
machine_type: plana,burnupi,mira
name: teuthology-2014-10-08_19:30:01-upgrade:dumpling-firefly-x:stress-split-giant-distro-basic-multi
nuke-on-error: true
os_type: ubuntu
os_version: '14.04'
overrides:
  admin_socket:
    branch: giant
  ceph:
    conf:
      mon:
        debug mon: 20
        debug ms: 1
        debug paxos: 20
        mon warn on legacy crush tunables: false
      osd:
        debug filestore: 20
        debug journal: 20
        debug ms: 1
        debug osd: 20
    log-whitelist:
    - slow request
    - wrongly marked me down
    - objects unfound and apparently lost
    - log bound mismatch
    - wrongly marked me down
    - objects unfound and apparently lost
    - log bound mismatch
    sha1: 3bfb5fab41b6247259183c3f52c786e35beb3b01
  ceph-deploy:
    branch:
      dev: giant
    conf:
      client:
        log file: /var/log/ceph/ceph-$name.$pid.log
      mon:
        debug mon: 1
        debug ms: 20
        debug paxos: 20
        osd default pool size: 2
  install:
    ceph:
      sha1: 3bfb5fab41b6247259183c3f52c786e35beb3b01
  s3tests:
    branch: giant
  workunit:
    sha1: 3bfb5fab41b6247259183c3f52c786e35beb3b01
owner: scheduled_teuthology@teuthology
priority: 1000
roles:
- - mon.a
  - mon.b
  - mds.a
  - osd.0
  - osd.1
  - osd.2
  - mon.c
- - osd.3
  - osd.4
  - osd.5
- - client.0
suite: upgrade:dumpling-firefly-x:stress-split
suite_branch: master
suite_path: /var/lib/teuthworker/src/ceph-qa-suite_master
targets:
  ubuntu@plana25.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC85wgMM7frtcFeCfatlKcc0Ru1HB4X/557M//6iIT2hQExLRtPADyOpfZdZmhP4Nh/mP6C9oB5gYH82sHnuVbCboq9J9OzBK0STFo4OIToRgbLJCTRfNuKt0VX0WCpvneQfA4SKmAsO527HgDcY/yyhzg67rWIel4LilQpFbPTe+rB9wBjO/DpbhxoF7d8vQUwtt2dYv6BXOvYPHCvgydTCAMIOgHIP/UqQceJgj/I3u85851yllYnBNE7LaJRXB96FlRtO25ZV7F7pFYLxyCsm+vGfRmp5YqdP+Qw72UaXuMpan+dQDwzpfRklLvolrq9jOLLYIvwnzd+GQgbRR87
  ubuntu@plana29.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDgHl5M5raEaWWKwH7thuC+Glhc6UlVLHpvIMDGMezlVOELXUB197ZAyurxpWurUteDEaqevUqjUgi08lEL0dHW4G9ulst8B3FqHrAvIvCmhKwGs1znrorZx2Bq+/8mreE4ocMWfT/R8sFzBsM3npgdgSqCAdDBSgI78S92WSlHGAqUz1A0iJoGdwiRjTKhCrI/tuIYyXUU9z/2vIR9bJTp0fwt2S3z4LnpKdsMU8BGpJwm8CPrVj/odRDZ/epgEFBySb4OUq68QDXlg8RHnL1D+zFVsoi/uU5o9rW55EAo7KnPDcdNCQSsZmaurwRaZxVSqhomF0kIU9oL1wlD+s7N
  ubuntu@plana68.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDN/p55jmz08gik2BX9h++ylasFj74ZysGxYsNDeeg4olDWNVz++6cDkRqoR+8SE7yZqk7iSZryr+Y3bQibjXK1PFeeiUtuJntIjIXIU7s9z3FC2EM3aJYB2wWW9IOuuFplEhg+QJAfxnzFLe0WJ9y6PzEITYejDD+pxzpS5fi0+D0WmNTwKlGBGMUz+6yAFZ0QrPxvSWkxIuZC1PRUefL3UUV2xCEmNge8PygeiRhdcn8iB8Ib1Bj+yyWUTFZ2RGbz6Y7sCVqckFGXIrhu6wjfYXpaYBYUbVg7R2qtwld6qybTcp+1RI8cc5RMyutX4PCC54Pjpbni7Kv/CusfXsSL
tasks:
- internal.lock_machines:
  - 3
  - plana,burnupi,mira
- internal.save_config: null
- internal.check_lock: null
- internal.connect: null
- internal.push_inventory: null
- internal.serialize_remote_roles: null
- internal.check_conflict: null
- internal.check_ceph_data: null
- internal.vm_setup: null
- kernel: *id001
- internal.base: null
- internal.archive: null
- internal.coredump: null
- internal.sudo: null
- internal.syslog: null
- internal.timer: null
- chef: null
- clock.check: null
- install:
    branch: dumpling
- ceph:
    fs: xfs
- install.upgrade:
    osd.0:
      branch: firefly
- ceph.restart:
    daemons:
    - osd.0
    - osd.1
    - osd.2
- thrashosds:
    chance_pgnum_grow: 1
    chance_pgpnum_fix: 1
    thrash_primary_affinity: false
    timeout: 1200
- ceph.restart:
    daemons:
    - mon.a
    wait-for-healthy: false
    wait-for-osds-up: true
- workunit:
    branch: dumpling
    clients:
      client.0:
      - cls/test_cls_rbd.sh
- ceph.restart:
    daemons:
    - mon.b
    wait-for-healthy: false
    wait-for-osds-up: true
- workunit:
    clients:
      client.0:
      - rbd/test_librbd.sh
- install.upgrade:
    mon.c: null
- ceph.restart:
    daemons:
    - mon.c
    wait-for-healthy: false
    wait-for-osds-up: true
- ceph.wait_for_mon_quorum:
  - a
  - b
  - c
- workunit:
    clients:
      client.0:
      - rbd/test_librbd_python.sh
- rgw:
    client.0: null
    default_idle_timeout: 300
- s3tests:
    client.0:
      rgw_server: client.0
- install.upgrade:
    osd.3:
      branch: firefly
- ceph.restart:
    daemons:
    - osd.3
    - osd.4
    - osd.5
- rados:
    clients:
    - client.0
    objects: 50
    op_weights:
      delete: 50
      read: 100
      rollback: 50
      snap_create: 50
      snap_remove: 50
      write: 100
    ops: 4000
- install.upgrade:
    osd.0: null
- ceph.restart:
    daemons:
    - osd.0
    - osd.1
    - osd.2
- thrashosds:
    chance_pgnum_grow: 1
    chance_pgpnum_fix: 1
    thrash_primary_affinity: false
    timeout: 1200
- ceph.restart:
    daemons:
    - mon.a
    wait-for-healthy: false
    wait-for-osds-up: true
- workunit:
    clients:
      client.0:
      - rbd/import_export.sh
    env:
      RBD_CREATE_ARGS: --new-format
- ceph.restart:
    daemons:
    - mon.b
    wait-for-healthy: false
    wait-for-osds-up: true
- rados:
    clients:
    - client.0
    objects: 500
    op_weights:
      delete: 10
      read: 45
      write: 45
    ops: 4000
- ceph.restart:
    daemons:
    - mon.c
    wait-for-healthy: false
    wait-for-osds-up: true
- ceph.wait_for_mon_quorum:
  - a
  - b
  - c
- radosbench:
    clients:
    - client.0
    time: 1800
- install.upgrade:
    osd.3: null
- ceph.restart:
    daemons:
    - osd.3
    - osd.4
    - osd.5
- workunit:
    clients:
      client.0:
      - rados/stress_watch.sh
teuthology_branch: master
tube: multi
verbose: true
worker_log: /var/lib/teuthworker/archive/worker_logs/worker.multi.3114
description: upgrade:dumpling-firefly-x:stress-split/{00-cluster/start.yaml 01-dumpling-install/dumpling.yaml 02-partial-upgrade-firefly/firsthalf.yaml 03-thrash/default.yaml 04-mona-upgrade-firefly/mona.yaml 05-workload/rbd-cls.yaml 06-monb-upgrade-firefly/monb.yaml 07-workload/rbd_api.yaml 08-monc-upgrade-firefly/monc.yaml 09-workload/{rbd-python.yaml rgw-s3tests.yaml} 10-osds-upgrade-firefly/secondhalf.yaml 11-workload/snaps-few-objects.yaml 12-partial-upgrade-x/first.yaml 13-thrash/default.yaml 14-mona-upgrade-x/mona.yaml 15-workload/rbd-import-export.yaml 16-monb-upgrade-x/monb.yaml 17-workload/readwrite.yaml 18-monc-upgrade-x/monc.yaml 19-workload/radosbench.yaml 20-osds-upgrade-x/osds_secondhalf.yaml 21-final-workload/rados_stress_watch.yaml distros/ubuntu_14.04.yaml}
duration: 12110.70418214798
failure_reason: 'Command failed on plana29 with status 124: ''mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=3bfb5fab41b6247259183c3f52c786e35beb3b01 TESTDIR="/home/ubuntu/cephtest" CEPH_ID="0" adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/workunit.client.0/rbd/test_librbd_python.sh'''
flavor: basic
owner: scheduled_teuthology@teuthology
success: false
Updated by Yuri Weinstein over 9 years ago
- Subject changed from Coredumo in upgrade:dumpling-firefly-x:stress-split-giant-distro-basic-multi run to Coredump in upgrade:dumpling-firefly-x:stress-split-giant-distro-basic-multi run
Updated by Loïc Dachary over 9 years ago
assert(want_acting_backfill.size() - want_backfill.size() == num_want_acting); fails
58/59/59) [3,4]/[0,3,4] r=0 lpr=59 pi=1-58/8 crt=0'0 lcod 0'0 mlcod 0'0 remapped+peering] got osd.3 1.1b( empty local-les=0 n=0 ec=1 les/c 31/31 58/59/59)
-113> 2014-10-09 05:38:13.535657 7fb6c4cd2700 10 osd.0 pg_epoch: 59 pg[1.1b( v 5'1 (5'1,5'1] local-les=31 n=1 ec=1 les/c 31/31 58/59/59) [3,4]/[0,3,4] r=0 lpr=59 pi=1-58/8 crt=0'0 lcod 0'0 mlcod 0'0 remapped+peering] update_heartbeat_peers 0,3,4 unchanged
-112> 2014-10-09 05:38:13.535671 7fb6c4cd2700 5 osd.0 pg_epoch: 59 pg[1.1b( v 5'1 (5'1,5'1] local-les=31 n=1 ec=1 les/c 31/31 58/59/59) [3,4]/[0,3,4] r=0 lpr=59 pi=1-58/8 crt=0'0 lcod 0'0 mlcod 0'0 remapped+peering] exit Started/Primary/Peering/GetInfo 0.006107 3 0.000435
-111> 2014-10-09 05:38:13.535684 7fb6c4cd2700 5 osd.0 pg_epoch: 59 pg[1.1b( v 5'1 (5'1,5'1] local-les=31 n=1 ec=1 les/c 31/31 58/59/59) [3,4]/[0,3,4] r=0 lpr=59 pi=1-58/8 crt=0'0 lcod 0'0 mlcod 0'0 remapped+peering] enter Started/Primary/Peering/GetLog
-110> 2014-10-09 05:38:13.535697 7fb6c4cd2700 10 osd.0 pg_epoch: 59 pg[1.1b( v 5'1 (5'1,5'1] local-les=31 n=1 ec=1 les/c 31/31 58/59/59) [3,4]/[0,3,4] r=0 lpr=59 pi=1-58/8 crt=0'0 lcod 0'0 mlcod 0'0 remapped+peering] calc_acting osd.0 1.1b( v 5'1 (5'1,5'1] local-les=31 n=1 ec=1 les/c 31/31 58/59/59)
-109> 2014-10-09 05:38:13.535710 7fb6c4cd2700 10 osd.0 pg_epoch: 59 pg[1.1b( v 5'1 (5'1,5'1] local-les=31 n=1 ec=1 les/c 31/31 58/59/59) [3,4]/[0,3,4] r=0 lpr=59 pi=1-58/8 crt=0'0 lcod 0'0 mlcod 0'0 remapped+peering] calc_acting osd.3 1.1b( empty local-les=0 n=0 ec=1 les/c 31/31 58/59/59)
-108> 2014-10-09 05:38:13.535721 7fb6c4cd2700 10 osd.0 pg_epoch: 59 pg[1.1b( v 5'1 (5'1,5'1] local-les=31 n=1 ec=1 les/c 31/31 58/59/59) [3,4]/[0,3,4] r=0 lpr=59 pi=1-58/8 crt=0'0 lcod 0'0 mlcod 0'0 remapped+peering] calc_acting osd.4 1.1b( v 5'1 (5'1,5'1] local-les=31 n=1 ec=1 les/c 31/31 58/59/59)
-107> 2014-10-09 05:38:13.535748 7fb6c4cd2700 10 osd.0 pg_epoch: 59 pg[1.1b( v 5'1 (5'1,5'1] local-les=31 n=1 ec=1 les/c 31/31 58/59/59) [3,4]/[0,3,4] r=0 lpr=59 pi=1-58/8 crt=0'0 lcod 0'0 mlcod 0'0 remapped+peering] calc_acting newest update on osd.0 with 1.1b( v 5'1 (5'1,5'1] local-les=31 n=1 ec=1 les/c 31/31 58/59/59)
up[0] needs backfill, osd.0 selected as primary instead
calc_acting primary is osd.0 with 1.1b( v 5'1 (5'1,5'1] local-les=31 n=1 ec=1 les/c 31/31 58/59/59)
shard 3 (up) backfill 1.1b( empty local-les=0 n=0 ec=1 les/c 31/31 58/59/59)
osd.4 (up) accepted 1.1b( v 5'1 (5'1,5'1] local-les=31 n=1 ec=1 les/c 31/31 58/59/59)
....
-21> 2014-10-09 05:38:13.539457 7fb6c4cd2700 -1 osd/PG.cc: In function 'bool PG::choose_acting(pg_shard_t&)' thread 7fb6c4cd2700 time 2014-10-09 05:38:13.535760
osd/PG.cc: 1284: FAILED assert(want_acting_backfill.size() - want_backfill.size() == num_want_acting)
Updated by Loïc Dachary over 9 years ago
- Subject changed from Coredump in upgrade:dumpling-firefly-x:stress-split-giant-distro-basic-multi run to assert(want_acting_backfill.size() - want_backfill.size() == num_want_acting) firefly
Updated by David Zafman over 9 years ago
I see that the change (92cfd370) that added the assert didn't consider "compat_mode". Older OSDs only support one backfill at a time, so during an upgrade, while not all OSDs have been updated yet, compat_mode == true. In calc_replicated_acting(), once the first backfill is added, no more will be. We could make the assert just be assert( !compat_mode && …), but what about the test "if (num_want_acting < pool.info.min_size) {": is that fine as is?
Updated by Loïc Dachary over 9 years ago
sjust: I think it's due to the compatibility path where we include the backfill peer in the acting set when old OSDs are present. I think the assert is simply not valid, since the backfill and acting sets are not disjoint in that case.