Bug #8042
mon: crash decoding incremental osdmap on split firefly/dumpling
% Done: 0%
Source: Q/A
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
2014-04-07T23:19:49.226 DEBUG:teuthology.orchestra.run:Running [10.214.138.91]: 'adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph osd out 2'
2014-04-07T23:19:50.300 INFO:teuthology.orchestra.run.err:[10.214.138.91]: marked out osd.2.
2014-04-07T23:19:50.468 INFO:teuthology.run_tasks:Running task ceph.restart...
2014-04-07T23:19:50.468 DEBUG:teuthology.task.ceph.mon.c:waiting for process to exit
2014-04-07T23:19:50.468 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):
  File "/home/teuthworker/teuthology-firefly/teuthology/run_tasks.py", line 41, in run_tasks
    manager.__enter__()
  File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/home/teuthworker/teuthology-firefly/teuthology/task/ceph.py", line 1307, in restart
    ctx.daemons.get_daemon(type_, id_).stop()
  File "/home/teuthworker/teuthology-firefly/teuthology/task/ceph.py", line 57, in stop
    run.wait([self.proc])
  File "/home/teuthworker/teuthology-firefly/teuthology/orchestra/run.py", line 356, in wait
    proc.exitstatus.get()
  File "/usr/lib/python2.7/dist-packages/gevent/event.py", line 207, in get
    raise self._exception
CommandFailedError: Command failed on 10.214.138.145 with status 1: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage daemon-helper kill ceph-mon -f -i c'
2014-04-07T23:19:50.527 ERROR:teuthology.run_tasks: Sentry event: http://sentry.ceph.com/inktank/teuthology/search?q=d852633d236c4d3c94e4c160a95e0325
CommandFailedError: Command failed on 10.214.138.145 with status 1: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage daemon-helper kill ceph-mon -f -i c'
archive_path: /var/lib/teuthworker/archive/teuthology-2014-04-07_22:35:16-upgrade:dumpling-x:stress-split-firefly-distro-basic-vps/177714
description: upgrade/dumpling-x/stress-split/{0-cluster/start.yaml 1-dumpling-install/dumpling.yaml 2-partial-upgrade/firsthalf.yaml 3-thrash/default.yaml 4-mon/mona.yaml 5-workload/rbd-cls.yaml 6-next-mon/monb.yaml 7-workload/rbd_api.yaml 8-next-mon/monc.yaml 9-workload/{rados_api_tests.yaml rbd-python.yaml rgw-s3tests.yaml snaps-many-objects.yaml} distros/rhel_6.4.yaml}
email: null
job_id: '177714'
kernel: &id001
  kdb: true
  sha1: distro
last_in_suite: false
machine_type: vps
name: teuthology-2014-04-07_22:35:16-upgrade:dumpling-x:stress-split-firefly-distro-basic-vps
nuke-on-error: true
os_type: rhel
os_version: '6.4'
overrides:
  admin_socket:
    branch: firefly
  ceph:
    conf:
      mon:
        debug mon: 20
        debug ms: 1
        debug paxos: 20
        mon warn on legacy crush tunables: false
      osd:
        debug filestore: 20
        debug journal: 20
        debug ms: 1
        debug osd: 20
    log-whitelist:
    - slow request
    - wrongly marked me down
    - objects unfound and apparently lost
    - log bound mismatch
    sha1: 010dff12c38882238591bb042f8e497a1f7ba020
  ceph-deploy:
    branch:
      dev: firefly
    conf:
      client:
        log file: /var/log/ceph/ceph-$name.$pid.log
      mon:
        debug mon: 1
        debug ms: 20
        debug paxos: 20
        osd default pool size: 2
  install:
    ceph:
      sha1: 010dff12c38882238591bb042f8e497a1f7ba020
  s3tests:
    branch: master
  workunit:
    sha1: 010dff12c38882238591bb042f8e497a1f7ba020
owner: scheduled_teuthology@teuthology
roles:
- - mon.a
  - mon.b
  - mds.a
  - osd.0
  - osd.1
  - osd.2
- - osd.3
  - osd.4
  - osd.5
  - mon.c
- - client.0
targets:
  ubuntu@vpm060.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAzWI4wJknBQtX6yXeHF7up6AyEpe//rgYIFMEap/9yycLd7DmL5hTt1jZYgFqBBaWe2lr1KoaK/UFWGrMtA387skmebyYBC3pKywWkdVs8s29uGh3X4y6R0Rb7a/2r5QoRwcnMuZcvuCS56iWyFOZ4gSKIUs2Ctnn3B91PsZYtP70FBHIkb5m++xlEuG9Z7xkF3R+m4PrcKy3joOc5kBg9vFMW7MwzE4RP3YcMxgUA8BFwAbUaNq8zMpUTnXsLSuUN6d5cyYYJyg9VuzKXD5aNX4GatR0IrSs0MqxaaoNK2x9y0j91L2EhxeaYmkCVxU2LqJbrgFTq0RW035pFahP+Q==
  ubuntu@vpm061.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEA1qp1J4cUqLuVuLujPQ04qCAP6D/nTJqGnaF5JMCOwQtauPYarc30Z1OqVebgT7Qo8eBx2FzazIiGSbLJNrTtzXtvQNrR4mriZo71orXegKJgDyUSl3fVLMmr2rYF7XlQU+rATCm4BF0+Vdtzd0EEiFCcZJPVWpS/FGoKilsje32Y7t8FTNtX7bLLbpAvWnVkxjVXT+byHoZIUWej2MYEscgQzek4sF78HWPi/mcEcn3mayJdaIe2PaGusGfRjCMu0FTmUYtoY6MqmHWnszkIqzxy7bAqUytOh1O79o8WMDmkxzjkmiONdXEU9+CKKQxXzh4sRzG95SNQFHHDpX1rKQ==
  ubuntu@vpm062.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAoyLPghb+RdivtlTeH6T5GIomIUcF3/as4485o1Jdx9+0ArO+keS9RNZrdNSoy5NeL2HDURfsYS1miWUg1yore6wfVe30pLXxnzOPsT5hxt2e+eqOWMv1lzclYNCd7X9Ni63pU+KI5uWpfUwRFchPalXaAHu/8EMGusS45o2vm11gbLpV0YfJ7y08b0GvGfhrFVkJrJSSPU8dpYGI5BbfDAaLtFwAOxW7BV8zLSczuUNBCg2MVbpx780I9WYZaBS1BoZuI48KBr7JyKBrYh6+Udpi+HpRrqzamHW+/oKMmCtgMtIQ3H2K3o6v9yW6uGpMeBzrokicdLBgulO+9cfU6Q==
tasks:
- internal.lock_machines:
  - 3
  - vps
- internal.save_config: null
- internal.check_lock: null
- internal.connect: null
- internal.check_conflict: null
- internal.check_ceph_data: null
- internal.vm_setup: null
- kernel: *id001
- internal.base: null
- internal.archive: null
- internal.coredump: null
- internal.sudo: null
- internal.syslog: null
- internal.timer: null
- chef: null
- clock.check: null
- install:
    branch: dumpling
- ceph:
    fs: xfs
- install.upgrade:
    osd.0: null
- ceph.restart:
    daemons:
    - osd.0
    - osd.1
    - osd.2
- thrashosds:
    chance_pgnum_grow: 1
    chance_pgpnum_fix: 1
    thrash_primary_affinity: false
    timeout: 1200
- ceph.restart:
    daemons:
    - mon.a
    wait-for-healthy: false
    wait-for-osds-up: true
- workunit:
    branch: dumpling
    clients:
      client.0:
      - cls/test_cls_rbd.sh
- ceph.restart:
    daemons:
    - mon.b
    wait-for-healthy: false
    wait-for-osds-up: true
- workunit:
    branch: dumpling
    clients:
      client.0:
      - rbd/test_librbd.sh
- install.upgrade:
    mon.c: null
- ceph.restart:
    daemons:
    - mon.c
    wait-for-healthy: false
    wait-for-osds-up: true
- ceph.wait_for_mon_quorum:
  - a
  - b
  - c
- workunit:
    branch: dumpling
    clients:
      client.0:
      - rados/test-upgrade-firefly.sh
- workunit:
    branch: dumpling
    clients:
      client.0:
      - rbd/test_librbd_python.sh
- rgw:
    client.0:
      idle_timeout: 120
- swift:
    client.0:
      rgw_server: client.0
- rados:
    clients:
    - client.0
    objects: 500
    op_weights:
      delete: 50
      read: 100
      rollback: 50
      snap_create: 50
      snap_remove: 50
      write: 100
    ops: 4000
teuthology_branch: firefly
verbose: true
worker_log: /var/lib/teuthworker/archive/worker_logs/worker.vps.17036
description: upgrade/dumpling-x/stress-split/{0-cluster/start.yaml 1-dumpling-install/dumpling.yaml 2-partial-upgrade/firsthalf.yaml 3-thrash/default.yaml 4-mon/mona.yaml 5-workload/rbd-cls.yaml 6-next-mon/monb.yaml 7-workload/rbd_api.yaml 8-next-mon/monc.yaml 9-workload/{rados_api_tests.yaml rbd-python.yaml rgw-s3tests.yaml snaps-many-objects.yaml} distros/rhel_6.4.yaml}
duration: 2959.741597175598
failure_reason: 'Command failed on 10.214.138.145 with status 1: ''sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage daemon-helper kill ceph-mon -f -i c'''
flavor: basic
owner: scheduled_teuthology@teuthology
sentry_event: http://sentry.ceph.com/inktank/teuthology/search?q=d852633d236c4d3c94e4c160a95e0325
success: false
Associated revisions
mon/Elector: ignore ACK from peers without required features
If an old peer gets a PROPOSE from us, we need to be sure to ignore their
ACK. Ignoring their PROPOSEs isn't sufficient to keep them out of a
quorum.
Fixes: #8042
Signed-off-by: Sage Weil <sage@inktank.com>
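In other words, the fix gates election ACKs on the peer's feature bits: an ACK from a monitor that lacks the quorum's required features is dropped, so an old (dumpling) mon cannot end up inside a firefly quorum whose maps it cannot decode. Below is a minimal, self-contained C++ sketch of that idea only; the names (Ack, handle_ack, REQUIRED_FEATURES) are invented for illustration and are not the actual mon/Elector.cc code.

// Sketch: feature-gating election ACKs. All names are illustrative;
// the real logic lives in Ceph's mon/Elector.cc.
#include <cstdint>
#include <iostream>
#include <vector>

// Hypothetical feature bit: peer understands the newer osdmap encoding.
constexpr uint64_t FEATURE_OSDMAP_ENC = 1ULL << 0;
// Features a peer must advertise before we count its ACK.
constexpr uint64_t REQUIRED_FEATURES = FEATURE_OSDMAP_ENC;

struct Ack {
    int from;           // rank of the acking monitor
    uint64_t features;  // feature bits advertised on the peer's connection
};

// Count an ACK only if the peer supports everything the quorum requires.
// Ignoring only the peer's PROPOSEs is not enough: an old peer can still
// ACK our PROPOSE and thereby land in the quorum.
bool handle_ack(const Ack& ack, std::vector<int>& acked_by) {
    if ((ack.features & REQUIRED_FEATURES) != REQUIRED_FEATURES) {
        std::cout << "ignoring ack from mon." << ack.from
                  << " without required features\n";
        return false;
    }
    acked_by.push_back(ack.from);
    return true;
}

int main() {
    std::vector<int> acked_by;
    handle_ack({1, REQUIRED_FEATURES}, acked_by);  // new peer: counted
    handle_ack({2, 0}, acked_by);                  // old peer: dropped
    std::cout << "quorum acks: " << acked_by.size() << "\n";  // prints 1
}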
History
#1 Updated by Sage Weil almost 10 years ago
- Subject changed from "err... marked out osd.2" in upgrade:dumpling-x:stress-split-firefly-distro-basic-vps suite to mon: crash decoding incremental osdmap on split firefly/dumpling
- Category set to Monitor
- Priority changed from Normal to Urgent
- Source changed from other to Q/A
    -7> 2014-04-08 02:12:06.476065 7fca82375700 10 mon.c@1(peon).pg v413 update_logger
    -6> 2014-04-08 02:12:06.476104 7fca82375700 10 mon.c@1(peon).paxosservice(mdsmap 1..5) refresh
    -5> 2014-04-08 02:12:06.476138 7fca82375700 10 mon.c@1(peon).paxosservice(osdmap 1..238) refresh
    -4> 2014-04-08 02:12:06.476141 7fca82375700 15 mon.c@1(peon).osd e232 update_from_paxos paxos e 238, my e 232
    -3> 2014-04-08 02:12:06.476179 7fca82375700 7 mon.c@1(peon).osd e232 update_from_paxos applying incremental 233
    -2> 2014-04-08 02:12:06.675305 7fca8889a700 1 -- 10.214.138.145:6789/0 >> :/0 pipe(0x3ffa800 sd=23 :6789 s=0 pgs=0 cs=0 l=0 c=0x51b27e0).accept sd=23 10.214.138.145:39852/0
    -1> 2014-04-08 02:12:06.675364 7fca8889a700 10 mon.c@1(peon) e1 ms_verify_authorizer 10.214.138.145:6804/3756 osd protocol 0
     0> 2014-04-08 02:12:11.880935 7fca82375700 -1 *** Caught signal (Aborted) **
 in thread 7fca82375700

 ceph version 0.67.7-66-g051a17e (051a17eb008d75aa6b0737873318a2e7273501ab)
 1: ceph-mon() [0x6497b1]
 2: (()+0xf500) [0x7fca87f24500]
 3: (gsignal()+0x35) [0x7fca869328a5]
 4: (abort()+0x175) [0x7fca86934085]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x12d) [0x7fca871eba5d]
 6: (()+0xbcbe6) [0x7fca871e9be6]
 7: (()+0xbcc13) [0x7fca871e9c13]
 8: (()+0xbcd0e) [0x7fca871e9d0e]
 9: ceph-mon() [0x78712f]
 10: (OSDMap::Incremental::decode(ceph::buffer::list::iterator&)+0x1c9) [0x69f3d9]
 11: (OSDMap::Incremental::Incremental(ceph::buffer::list&)+0x4a3) [0x5c9853]
 12: (OSDMonitor::update_from_paxos(bool*)+0x1006) [0x5a7c46]
 13: (PaxosService::refresh(bool*)+0x18c) [0x58baec]
 14: (Monitor::refresh_from_paxos(bool*)+0x57) [0x531317]
 15: (Paxos::do_refresh()+0x36) [0x57a566]
 16: (Paxos::handle_commit(MMonPaxos*)+0x21a) [0x584bfa]
 17: (Paxos::dispatch(PaxosServiceMessage*)+0x24b) [0x58601b]
 18: (Monitor::_ms_dispatch(Message*)+0x104d) [0x560ded]
 19: (Monitor::ms_dispatch(Message*)+0x32) [0x578f32]
 20: (DispatchQueue::entry()+0x5a2) [0x7e5122]
 21: (DispatchQueue::DispatchThread::entry()+0xd) [0x7c047d]
 22: (()+0x7851) [0x7fca87f1c851]
 23: (clone()+0x6d) [0x7fca869e890d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
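Reading the trace: mon.c, still running dumpling (0.67.7) in this split firefly/dumpling cluster, refreshes from paxos at osdmap e232 and applies incremental 233, which was encoded in the newer firefly format. The decoder hits an encoding it does not understand and throws; the uncaught exception reaches the C++ terminate handler (frames 5-8) and aborts the process at frame 10, OSDMap::Incremental::decode. A minimal standalone sketch of that version-gated decode pattern follows, loosely modeled on Ceph's ENCODE_START/DECODE_START convention; the type names and version numbers are invented for illustration.

// Sketch: why an old reader aborts on a newer encoding. Names and
// version numbers are invented; real code uses Ceph's DECODE_START.
#include <cstdint>
#include <iostream>
#include <stdexcept>

struct buffer_error : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// The newest encoding version this (old) reader understands.
constexpr uint8_t MY_MAX_VERSION = 6;

struct IncrementalSketch {
    uint8_t struct_v = 0;

    // If the writer declares a compat version newer than anything we
    // support, we must throw rather than misinterpret the bytes.
    void decode(uint8_t encoded_v, uint8_t compat_v) {
        if (compat_v > MY_MAX_VERSION)
            throw buffer_error("malformed input: unsupported version");
        struct_v = encoded_v;
        // ... decode fields present in versions <= struct_v ...
    }
};

int main() {
    IncrementalSketch inc;
    try {
        // Pretend a newer writer encoded version 7/7 (invented numbers).
        inc.decode(/*encoded_v=*/7, /*compat_v=*/7);
    } catch (const buffer_error& e) {
        // A dumpling ceph-mon has no handler on this path, so the
        // exception escapes and the daemon aborts, as in the trace above.
        std::cout << "decode failed: " << e.what() << "\n";
    }
}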
#2 Updated by Ian Colle almost 10 years ago
- Assignee set to Joao Eduardo Luis
#3 Updated by Sage Weil almost 10 years ago
- Status changed from New to Fix Under Review
#4 Updated by Greg Farnum almost 10 years ago
- Status changed from Fix Under Review to 7
- Assignee changed from Joao Eduardo Luis to Sage Weil
#5 Updated by Sage Weil almost 10 years ago
- Status changed from 7 to Fix Under Review
#6 Updated by Sage Weil almost 10 years ago
- Assignee changed from Sage Weil to Greg Farnum
#7 Updated by Sage Weil almost 10 years ago
- Status changed from Fix Under Review to Resolved