Bug #9657
closed
MMDSBeacon: failure to decode; compat_version = 3 on Firefly monitor
Description
2014-10-03T10:28:14.138 DEBUG:teuthology.misc:Ceph health: HEALTH_OK
2014-10-03T10:28:14.138 INFO:teuthology.task.sequential:In sequential, running task sleep...
2014-10-03T10:28:14.138 INFO:teuthology.task.sleep:Sleeping for 60
2014-10-03T10:28:21.723 INFO:tasks.ceph.mon.b.plana25.stderr:*** Caught signal (Segmentation fault) **
2014-10-03T10:28:21.723 INFO:tasks.ceph.mon.b.plana25.stderr: in thread 7fee597ae700
2014-10-03T10:28:21.740 INFO:tasks.ceph.mon.b.plana25.stderr: ceph version 0.80.6-27-g711a7e6 (711a7e6f81983ff2091caa0f232af914a04a041c)
2014-10-03T10:28:21.740 INFO:tasks.ceph.mon.b.plana25.stderr: 1: ceph-mon() [0x86933f]
2014-10-03T10:28:21.741 INFO:tasks.ceph.mon.b.plana25.stderr: 2: (()+0x10340) [0x7fee5f449340]
2014-10-03T10:28:21.741 INFO:tasks.ceph.mon.b.plana25.stderr: 3: (Message::get_source_inst() const+0x20) [0x57c2e0]
2014-10-03T10:28:21.741 INFO:tasks.ceph.mon.b.plana25.stderr: 4: (Monitor::handle_forward(MForward*)+0x2fe) [0x571d6e]
2014-10-03T10:28:21.741 INFO:tasks.ceph.mon.b.plana25.stderr: 5: (Monitor::dispatch(MonSession*, Message*, bool)+0x360) [0x572a60]
2014-10-03T10:28:21.742 INFO:tasks.ceph.mon.b.plana25.stderr: 6: (Monitor::_ms_dispatch(Message*)+0x215) [0x573045]
2014-10-03T10:28:21.742 INFO:tasks.ceph.mon.b.plana25.stderr: 7: (Monitor::ms_dispatch(Message*)+0x20) [0x590da0]
2014-10-03T10:28:21.742 INFO:tasks.ceph.mon.b.plana25.stderr: 8: (DispatchQueue::entry()+0x57a) [0x836bfa]
2014-10-03T10:28:21.742 INFO:tasks.ceph.mon.b.plana25.stderr: 9: (DispatchQueue::DispatchThread::entry()+0xd) [0x74e67d]
2014-10-03T10:28:21.743 INFO:tasks.ceph.mon.b.plana25.stderr: 10: (()+0x8182) [0x7fee5f441182]
2014-10-03T10:28:21.743 INFO:tasks.ceph.mon.b.plana25.stderr: 11: (clone()+0x6d) [0x7fee5dbb538d]
2014-10-03T10:28:21.743 INFO:tasks.ceph.mon.b.plana25.stderr:2014-10-03 10:28:21.739553 7fee597ae700 -1 *** Caught signal (Segmentation fault) **
2014-10-03T10:28:21.743 INFO:tasks.ceph.mon.b.plana25.stderr: in thread 7fee597ae700
2014-10-03T10:28:21.744 INFO:tasks.ceph.mon.b.plana25.stderr:
2014-10-03T10:28:21.744 INFO:tasks.ceph.mon.b.plana25.stderr: ceph version 0.80.6-27-g711a7e6 (711a7e6f81983ff2091caa0f232af914a04a041c)
2014-10-03T10:28:21.744 INFO:tasks.ceph.mon.b.plana25.stderr: 1: ceph-mon() [0x86933f]
2014-10-03T10:28:21.744 INFO:tasks.ceph.mon.b.plana25.stderr: 2: (()+0x10340) [0x7fee5f449340]
2014-10-03T10:28:21.745 INFO:tasks.ceph.mon.b.plana25.stderr: 3: (Message::get_source_inst() const+0x20) [0x57c2e0]
2014-10-03T10:28:21.745 INFO:tasks.ceph.mon.b.plana25.stderr: 4: (Monitor::handle_forward(MForward*)+0x2fe) [0x571d6e]
2014-10-03T10:28:21.745 INFO:tasks.ceph.mon.b.plana25.stderr: 5: (Monitor::dispatch(MonSession*, Message*, bool)+0x360) [0x572a60]
2014-10-03T10:28:21.745 INFO:tasks.ceph.mon.b.plana25.stderr: 6: (Monitor::_ms_dispatch(Message*)+0x215) [0x573045]
2014-10-03T10:28:21.746 INFO:tasks.ceph.mon.b.plana25.stderr: 7: (Monitor::ms_dispatch(Message*)+0x20) [0x590da0]
2014-10-03T10:28:21.746 INFO:tasks.ceph.mon.b.plana25.stderr: 8: (DispatchQueue::entry()+0x57a) [0x836bfa]
2014-10-03T10:28:21.746 INFO:tasks.ceph.mon.b.plana25.stderr: 9: (DispatchQueue::DispatchThread::entry()+0xd) [0x74e67d]
2014-10-03T10:28:21.746 INFO:tasks.ceph.mon.b.plana25.stderr: 10: (()+0x8182) [0x7fee5f441182]
2014-10-03T10:28:21.747 INFO:tasks.ceph.mon.b.plana25.stderr: 11: (clone()+0x6d) [0x7fee5dbb538d]
2014-10-03T10:28:21.747 INFO:tasks.ceph.mon.b.plana25.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2014-10-03T10:28:21.747 INFO:tasks.ceph.mon.b.plana25.stderr:
2014-10-03T10:28:21.836 INFO:tasks.ceph.mon.b.plana25.stderr: 0> 2014-10-03 10:28:21.739553 7fee597ae700 -1 *** Caught signal (Segmentation fault) **
2014-10-03T10:28:21.837 INFO:tasks.ceph.mon.b.plana25.stderr: in thread 7fee597ae700
2014-10-03T10:28:21.837 INFO:tasks.ceph.mon.b.plana25.stderr:
2014-10-03T10:28:21.837 INFO:tasks.ceph.mon.b.plana25.stderr: ceph version 0.80.6-27-g711a7e6 (711a7e6f81983ff2091caa0f232af914a04a041c)
2014-10-03T10:28:21.837 INFO:tasks.ceph.mon.b.plana25.stderr: 1: ceph-mon() [0x86933f]
2014-10-03T10:28:21.837 INFO:tasks.ceph.mon.b.plana25.stderr: 2: (()+0x10340) [0x7fee5f449340]
2014-10-03T10:28:21.838 INFO:tasks.ceph.mon.b.plana25.stderr: 3: (Message::get_source_inst() const+0x20) [0x57c2e0]
2014-10-03T10:28:21.838 INFO:tasks.ceph.mon.b.plana25.stderr: 4: (Monitor::handle_forward(MForward*)+0x2fe) [0x571d6e]
2014-10-03T10:28:21.838 INFO:tasks.ceph.mon.b.plana25.stderr: 5: (Monitor::dispatch(MonSession*, Message*, bool)+0x360) [0x572a60]
2014-10-03T10:28:21.838 INFO:tasks.ceph.mon.b.plana25.stderr: 6: (Monitor::_ms_dispatch(Message*)+0x215) [0x573045]
2014-10-03T10:28:21.839 INFO:tasks.ceph.mon.b.plana25.stderr: 7: (Monitor::ms_dispatch(Message*)+0x20) [0x590da0]
2014-10-03T10:28:21.839 INFO:tasks.ceph.mon.b.plana25.stderr: 8: (DispatchQueue::entry()+0x57a) [0x836bfa]
2014-10-03T10:28:21.839 INFO:tasks.ceph.mon.b.plana25.stderr: 9: (DispatchQueue::DispatchThread::entry()+0xd) [0x74e67d]
2014-10-03T10:28:21.839 INFO:tasks.ceph.mon.b.plana25.stderr: 10: (()+0x8182) [0x7fee5f441182]
2014-10-03T10:28:21.840 INFO:tasks.ceph.mon.b.plana25.stderr: 11: (clone()+0x6d) [0x7fee5dbb538d]
2014-10-03T10:28:21.840 INFO:tasks.ceph.mon.b.plana25.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
archive_path: /var/lib/teuthworker/archive/teuthology-2014-10-01_19:20:01-upgrade:firefly-x-giant-distro-basic-multi/525516 branch: giant description: upgrade:firefly-x/parallel/{0-cluster/start.yaml 1-firefly-install/firefly.yaml 2-workload/test_rbd_python.yaml 3-upgrade-sequence/upgrade-mon-osd-mds.yaml 4-final-upgrade/client.yaml 5-final-workload/rgw_s3tests.yaml distros/ubuntu_14.04.yaml} email: ceph-qa@ceph.com job_id: '525516' kernel: &id001 kdb: true sha1: distro last_in_suite: false machine_type: plana,burnupi,mira name: teuthology-2014-10-01_19:20:01-upgrade:firefly-x-giant-distro-basic-multi nuke-on-error: true os_type: ubuntu os_version: '14.04' overrides: admin_socket: branch: giant ceph: conf: mon: debug mon: 20 debug ms: 1 debug paxos: 20 mon warn on legacy crush tunables: false osd: debug filestore: 20 debug journal: 20 debug ms: 1 debug osd: 20 log-whitelist: - slow request - scrub mismatch - ScrubResult sha1: b1ca1f23ff47857a27f6196c5a050f83f3acc9fc ceph-deploy: branch: dev: giant conf: client: log file: /var/log/ceph/ceph-$name.$pid.log mon: debug mon: 1 debug ms: 20 debug paxos: 20 osd default pool size: 2 install: ceph: sha1: b1ca1f23ff47857a27f6196c5a050f83f3acc9fc s3tests: branch: giant workunit: sha1: b1ca1f23ff47857a27f6196c5a050f83f3acc9fc owner: scheduled_teuthology@teuthology priority: 1000 roles: - - mon.a - mds.a - osd.0 - osd.1 - - mon.b - mon.c - osd.2 - osd.3 - - client.0 - client.1 suite: upgrade:firefly-x suite_branch: master suite_path: /var/lib/teuthworker/src/ceph-qa-suite_master targets: ubuntu@burnupi22.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCvcs33PS2MQMxrCXm6jPOz1tz2i0bGA+8uVmTWe7Iu2W+fsHonGTMHJxPVYRAMe8WDulEYtkwK64s9D/Ph38kRK0o62SA599NKVIPvh1LzZadkWCX6aKlLv6cQYvQxOaUBuAlOIKQ5h0IKxu2lBPCHg6CIRHLUCYmTcR2PzXahMS9ToGMq+NS36+/4HmYPQ80lJcf1D9J+m6ETVMZ+DDcV4B6DWyImczwfNvXJY/Mj10bh4ZyEt5MSTVqfFKqNa3K1fWDzWUfsx8G3QGnyXNwuhUfRBslkIn4bFM2oJvGpFqZOSPGEskjM3IZaqhcoydnYDTTIWHG/8K5WJ70ZBhdx 
ubuntu@burnupi34.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDVHu1u8/oxlx4Gs/CzuGsF6R5obvz8zBIZJ2oW6ZlWn3da3ybaWDEY3rmRtCEpmFIXK5UKFRFEqlKcbDVbl3OB53a4SUcgLgH0YcVgab3zy4rp7SDdBXzGJK7aM7hhGiKY73O7pKpFLX8thRxNIzRBR1Rr49Re41WXfb/45fDl2tiGNMX0QgorKUtMCkeKv4C/NhG4g+pk0j2kur4QCUfFGGzcYJNlpGzmyBoe0g8UYtLAPKOBjpUHY4iDwe2hB36ifiW1T9WvJ3f7/axcZpFuFosdMEJJ3mrIOAeko4CpcV7lJVCT3S/Kj9KsyklLt682ni999dQ/RRHDQkqd0Qth ubuntu@plana25.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC85wgMM7frtcFeCfatlKcc0Ru1HB4X/557M//6iIT2hQExLRtPADyOpfZdZmhP4Nh/mP6C9oB5gYH82sHnuVbCboq9J9OzBK0STFo4OIToRgbLJCTRfNuKt0VX0WCpvneQfA4SKmAsO527HgDcY/yyhzg67rWIel4LilQpFbPTe+rB9wBjO/DpbhxoF7d8vQUwtt2dYv6BXOvYPHCvgydTCAMIOgHIP/UqQceJgj/I3u85851yllYnBNE7LaJRXB96FlRtO25ZV7F7pFYLxyCsm+vGfRmp5YqdP+Qw72UaXuMpan+dQDwzpfRklLvolrq9jOLLYIvwnzd+GQgbRR87 tasks: - internal.lock_machines: - 3 - plana,burnupi,mira - internal.save_config: null - internal.check_lock: null - internal.connect: null - internal.push_inventory: null - internal.serialize_remote_roles: null - internal.check_conflict: null - internal.check_ceph_data: null - internal.vm_setup: null - kernel: *id001 - internal.base: null - internal.archive: null - internal.coredump: null - internal.sudo: null - internal.syslog: null - internal.timer: null - chef: null - clock.check: null - install: branch: firefly - print: '**** done installing firefly' - ceph: fs: xfs - print: '**** done ceph' - parallel: - workload - upgrade-sequence - print: '**** done parallel' - install.upgrade: client.0: null - print: '**** done install.upgrade' - rgw: - client.1 - s3tests: client.1: branch: dumpling rgw_server: client.1 teuthology_branch: master tube: multi upgrade-sequence: sequential: - install.upgrade: mon.a: null - print: '**** done install.upgrade mon.a to the version from teuthology-suite arg' - ceph.restart: daemons: - mon.a wait-for-healthy: true - sleep: duration: 60 - ceph.restart: daemons: - osd.0 - osd.1 wait-for-healthy: true - sleep: 
duration: 60 - ceph.restart: - mds.a - sleep: duration: 60 - print: '**** running mixed versions of osds and mons' - exec: mon.b: - ceph osd crush tunables firefly - install.upgrade: mon.b: null - print: '**** done install.upgrade mon.b to the version from teuthology-suite arg' - ceph.restart: daemons: - mon.b - mon.c wait-for-healthy: true - sleep: duration: 60 - ceph.restart: daemons: - osd.2 - osd.3 wait-for-healthy: true - sleep: duration: 60 verbose: true worker_log: /var/lib/teuthworker/archive/worker_logs/worker.multi.3171 workload: sequential: - workunit: branch: firefly clients: client.0: - rbd/test_librbd_python.sh
description: upgrade:firefly-x/parallel/{0-cluster/start.yaml 1-firefly-install/firefly.yaml 2-workload/test_rbd_python.yaml 3-upgrade-sequence/upgrade-mon-osd-mds.yaml 4-final-upgrade/client.yaml 5-final-workload/rgw_s3tests.yaml distros/ubuntu_14.04.yaml} duration: 754.9805121421814 failure_reason: 'Command failed on plana25 with status 1: ''sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage daemon-helper kill ceph-mon -f -i b''' flavor: basic owner: scheduled_teuthology@teuthology success: false
Updated by Samuel Just over 9 years ago
- Assignee set to Greg Farnum
- Priority changed from Normal to Urgent
Updated by Greg Farnum over 9 years ago
- Status changed from New to In Progress
Well, good news and bad news:
This is not a monitor bug, and my initial guess is that it will only affect clusters running Giant MDSes and Firefly monitors. (That is the case for this specific instance of the bug, but I haven't quite figured out how it happened in the common code paths yet.)
But the problem is that somehow the MMDSBeacon::HEAD_VERSION value of 3 in Giant (it was 2 in Firefly) is getting transmuted into header.compat_version rather than header.version. I haven't tracked that down yet and it scares me.
The actual crash is because of an unchecked dereference in Monitor::handle_forward, in which we assume that the MForward message actually has a forwarded message associated. But here we don't, because the message being forwarded is an MMDSBeacon, and decoding failed. We can see an instance of it earlier when the MDS tried to connect directly to us:
2014-10-03 10:28:13.721812 7fee57ca8700 0 will not decode message of type 100 version 3 because compat_version 3 > supported version 2
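A rough sketch of the two pieces described above, for illustration only: a receiver that refuses to decode when compat_version exceeds the version it supports, and a forward handler that checks for a missing inner message instead of dereferencing it. The types and functions here (Header, DecodedMessage, Forward, try_decode, handle_forward_guarded) are hypothetical stand-ins, not Ceph's actual classes or the actual fix.

```cpp
#include <cstdint>
#include <memory>

// Hypothetical stand-in for the relevant message header fields.
struct Header {
  uint16_t type;
  uint16_t version;
  uint16_t compat_version;
};

// Hypothetical stand-in for a successfully decoded message.
struct DecodedMessage {
  Header header;
};

// Returns nullptr when the sender's compat_version is newer than what this
// receiver supports, as in "compat_version 3 > supported version 2".
std::unique_ptr<DecodedMessage> try_decode(const Header& h,
                                           uint16_t supported_version) {
  if (h.compat_version > supported_version) {
    return nullptr;  // refuse to decode
  }
  return std::make_unique<DecodedMessage>(DecodedMessage{h});
}

// Hypothetical stand-in for a forwarded-message wrapper.
struct Forward {
  std::unique_ptr<DecodedMessage> inner;  // null if decoding failed
};

// Guarded handler: returns false rather than touching a null inner message.
bool handle_forward_guarded(const Forward& fwd) {
  if (!fwd.inner) {
    return false;  // nothing to dispatch; an unchecked access here would crash
  }
  // Safe to read fields of the inner message from this point on.
  return fwd.inner->header.version >= fwd.inner->header.compat_version;
}
```

With a type-100 message carrying version 3 and compat_version 3, try_decode against a supported version of 2 yields a null pointer, and the guarded handler returns false instead of faulting.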
Updated by Greg Farnum over 9 years ago
Okay, it's because Message::encode() transmutes a compat_version of 0 into compat_version == HEAD_VERSION, and we aren't explicitly setting a compat version. That was easy.
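The defaulting described above can be modelled in a few lines. This is a simplified sketch, not the real Message::encode(); encode_header and old_receiver_accepts are hypothetical names, and the explicit compat value of 2 used below is an assumption chosen only to show the contrast.

```cpp
#include <cstdint>

// Hypothetical stand-in for the two version fields in a message header.
struct Header {
  uint16_t version = 0;
  uint16_t compat_version = 0;
};

// Models the behaviour described in the comment: a compat_version of 0
// (never set explicitly) is replaced by the head version at encode time.
Header encode_header(uint16_t head_version, uint16_t compat_version) {
  Header h;
  h.version = head_version;
  h.compat_version = compat_version ? compat_version : head_version;
  return h;
}

// An older receiver accepts a message only if it supports the compat version.
bool old_receiver_accepts(const Header& h, uint16_t supported_version) {
  return h.compat_version <= supported_version;
}
```

Under this model, encode_header(3, 0) produces compat_version 3, which a receiver supporting only version 2 rejects, while encode_header(3, 2) stays decodable by that receiver.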
Updated by Greg Farnum over 9 years ago
- Subject changed from "Segmentation fault" in upgrade:firefly-x-giant-distro-basic-multi run to MMDSBeacon: failure to decode; compat_version = 3 on Firefly monitor
- Status changed from In Progress to Fix Under Review
Updated by Greg Farnum over 9 years ago
- Assignee changed from Greg Farnum to Tamilarasi muthamizhan
https://github.com/ceph/ceph/pull/2640
Tamil will put it through the upgrade suite.
Updated by Sage Weil over 9 years ago
- Status changed from Fix Under Review to Pending Backport
fix looks right. merged it into giant branch
Updated by Tamilarasi muthamizhan over 9 years ago
tested with wip-9657, fix works fine.
logs are copied to vpm102.front.sepia.ceph.com:/home/ubuntu/wip-9657
Updated by Tamilarasi muthamizhan over 9 years ago
- Assignee changed from Tamilarasi muthamizhan to Greg Farnum
Updated by Greg Farnum over 9 years ago
- Status changed from Pending Backport to Resolved
No backport is needed; this is done. (25bcc39bb809e2d13beea1529e4ab92d1b61fa5b)