Bug #9657

closed

MMDSBeacon: failure to decode; compat_version = 3 on Firefly monitor

Added by Yuri Weinstein over 9 years ago. Updated over 9 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Q/A
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Logs are in http://qa-proxy.ceph.com/teuthology/teuthology-2014-10-01_19:20:01-upgrade:firefly-x-giant-distro-basic-multi/525516/

2014-10-03T10:28:14.138 DEBUG:teuthology.misc:Ceph health: HEALTH_OK
2014-10-03T10:28:14.138 INFO:teuthology.task.sequential:In sequential, running task sleep...
2014-10-03T10:28:14.138 INFO:teuthology.task.sleep:Sleeping for 60
2014-10-03T10:28:21.723 INFO:tasks.ceph.mon.b.plana25.stderr:*** Caught signal (Segmentation fault) **
2014-10-03T10:28:21.723 INFO:tasks.ceph.mon.b.plana25.stderr: in thread 7fee597ae700
2014-10-03T10:28:21.740 INFO:tasks.ceph.mon.b.plana25.stderr: ceph version 0.80.6-27-g711a7e6 (711a7e6f81983ff2091caa0f232af914a04a041c)
2014-10-03T10:28:21.740 INFO:tasks.ceph.mon.b.plana25.stderr: 1: ceph-mon() [0x86933f]
2014-10-03T10:28:21.741 INFO:tasks.ceph.mon.b.plana25.stderr: 2: (()+0x10340) [0x7fee5f449340]
2014-10-03T10:28:21.741 INFO:tasks.ceph.mon.b.plana25.stderr: 3: (Message::get_source_inst() const+0x20) [0x57c2e0]
2014-10-03T10:28:21.741 INFO:tasks.ceph.mon.b.plana25.stderr: 4: (Monitor::handle_forward(MForward*)+0x2fe) [0x571d6e]
2014-10-03T10:28:21.741 INFO:tasks.ceph.mon.b.plana25.stderr: 5: (Monitor::dispatch(MonSession*, Message*, bool)+0x360) [0x572a60]
2014-10-03T10:28:21.742 INFO:tasks.ceph.mon.b.plana25.stderr: 6: (Monitor::_ms_dispatch(Message*)+0x215) [0x573045]
2014-10-03T10:28:21.742 INFO:tasks.ceph.mon.b.plana25.stderr: 7: (Monitor::ms_dispatch(Message*)+0x20) [0x590da0]
2014-10-03T10:28:21.742 INFO:tasks.ceph.mon.b.plana25.stderr: 8: (DispatchQueue::entry()+0x57a) [0x836bfa]
2014-10-03T10:28:21.742 INFO:tasks.ceph.mon.b.plana25.stderr: 9: (DispatchQueue::DispatchThread::entry()+0xd) [0x74e67d]
2014-10-03T10:28:21.743 INFO:tasks.ceph.mon.b.plana25.stderr: 10: (()+0x8182) [0x7fee5f441182]
2014-10-03T10:28:21.743 INFO:tasks.ceph.mon.b.plana25.stderr: 11: (clone()+0x6d) [0x7fee5dbb538d]
2014-10-03T10:28:21.743 INFO:tasks.ceph.mon.b.plana25.stderr:2014-10-03 10:28:21.739553 7fee597ae700 -1 *** Caught signal (Segmentation fault) **
2014-10-03T10:28:21.743 INFO:tasks.ceph.mon.b.plana25.stderr: in thread 7fee597ae700
2014-10-03T10:28:21.744 INFO:tasks.ceph.mon.b.plana25.stderr:
2014-10-03T10:28:21.744 INFO:tasks.ceph.mon.b.plana25.stderr: ceph version 0.80.6-27-g711a7e6 (711a7e6f81983ff2091caa0f232af914a04a041c)
2014-10-03T10:28:21.744 INFO:tasks.ceph.mon.b.plana25.stderr: 1: ceph-mon() [0x86933f]
2014-10-03T10:28:21.744 INFO:tasks.ceph.mon.b.plana25.stderr: 2: (()+0x10340) [0x7fee5f449340]
2014-10-03T10:28:21.745 INFO:tasks.ceph.mon.b.plana25.stderr: 3: (Message::get_source_inst() const+0x20) [0x57c2e0]
2014-10-03T10:28:21.745 INFO:tasks.ceph.mon.b.plana25.stderr: 4: (Monitor::handle_forward(MForward*)+0x2fe) [0x571d6e]
2014-10-03T10:28:21.745 INFO:tasks.ceph.mon.b.plana25.stderr: 5: (Monitor::dispatch(MonSession*, Message*, bool)+0x360) [0x572a60]
2014-10-03T10:28:21.745 INFO:tasks.ceph.mon.b.plana25.stderr: 6: (Monitor::_ms_dispatch(Message*)+0x215) [0x573045]
2014-10-03T10:28:21.746 INFO:tasks.ceph.mon.b.plana25.stderr: 7: (Monitor::ms_dispatch(Message*)+0x20) [0x590da0]
2014-10-03T10:28:21.746 INFO:tasks.ceph.mon.b.plana25.stderr: 8: (DispatchQueue::entry()+0x57a) [0x836bfa]
2014-10-03T10:28:21.746 INFO:tasks.ceph.mon.b.plana25.stderr: 9: (DispatchQueue::DispatchThread::entry()+0xd) [0x74e67d]
2014-10-03T10:28:21.746 INFO:tasks.ceph.mon.b.plana25.stderr: 10: (()+0x8182) [0x7fee5f441182]
2014-10-03T10:28:21.747 INFO:tasks.ceph.mon.b.plana25.stderr: 11: (clone()+0x6d) [0x7fee5dbb538d]
2014-10-03T10:28:21.747 INFO:tasks.ceph.mon.b.plana25.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2014-10-03T10:28:21.747 INFO:tasks.ceph.mon.b.plana25.stderr:
2014-10-03T10:28:21.836 INFO:tasks.ceph.mon.b.plana25.stderr:     0> 2014-10-03 10:28:21.739553 7fee597ae700 -1 *** Caught signal (Segmentation fault) **
2014-10-03T10:28:21.837 INFO:tasks.ceph.mon.b.plana25.stderr: in thread 7fee597ae700
2014-10-03T10:28:21.837 INFO:tasks.ceph.mon.b.plana25.stderr:
2014-10-03T10:28:21.837 INFO:tasks.ceph.mon.b.plana25.stderr: ceph version 0.80.6-27-g711a7e6 (711a7e6f81983ff2091caa0f232af914a04a041c)
2014-10-03T10:28:21.837 INFO:tasks.ceph.mon.b.plana25.stderr: 1: ceph-mon() [0x86933f]
2014-10-03T10:28:21.837 INFO:tasks.ceph.mon.b.plana25.stderr: 2: (()+0x10340) [0x7fee5f449340]
2014-10-03T10:28:21.838 INFO:tasks.ceph.mon.b.plana25.stderr: 3: (Message::get_source_inst() const+0x20) [0x57c2e0]
2014-10-03T10:28:21.838 INFO:tasks.ceph.mon.b.plana25.stderr: 4: (Monitor::handle_forward(MForward*)+0x2fe) [0x571d6e]
2014-10-03T10:28:21.838 INFO:tasks.ceph.mon.b.plana25.stderr: 5: (Monitor::dispatch(MonSession*, Message*, bool)+0x360) [0x572a60]
2014-10-03T10:28:21.838 INFO:tasks.ceph.mon.b.plana25.stderr: 6: (Monitor::_ms_dispatch(Message*)+0x215) [0x573045]
2014-10-03T10:28:21.839 INFO:tasks.ceph.mon.b.plana25.stderr: 7: (Monitor::ms_dispatch(Message*)+0x20) [0x590da0]
2014-10-03T10:28:21.839 INFO:tasks.ceph.mon.b.plana25.stderr: 8: (DispatchQueue::entry()+0x57a) [0x836bfa]
2014-10-03T10:28:21.839 INFO:tasks.ceph.mon.b.plana25.stderr: 9: (DispatchQueue::DispatchThread::entry()+0xd) [0x74e67d]
2014-10-03T10:28:21.839 INFO:tasks.ceph.mon.b.plana25.stderr: 10: (()+0x8182) [0x7fee5f441182]
2014-10-03T10:28:21.840 INFO:tasks.ceph.mon.b.plana25.stderr: 11: (clone()+0x6d) [0x7fee5dbb538d]
2014-10-03T10:28:21.840 INFO:tasks.ceph.mon.b.plana25.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
archive_path: /var/lib/teuthworker/archive/teuthology-2014-10-01_19:20:01-upgrade:firefly-x-giant-distro-basic-multi/525516
branch: giant
description: upgrade:firefly-x/parallel/{0-cluster/start.yaml 1-firefly-install/firefly.yaml
  2-workload/test_rbd_python.yaml 3-upgrade-sequence/upgrade-mon-osd-mds.yaml 4-final-upgrade/client.yaml
  5-final-workload/rgw_s3tests.yaml distros/ubuntu_14.04.yaml}
email: ceph-qa@ceph.com
job_id: '525516'
kernel: &id001
  kdb: true
  sha1: distro
last_in_suite: false
machine_type: plana,burnupi,mira
name: teuthology-2014-10-01_19:20:01-upgrade:firefly-x-giant-distro-basic-multi
nuke-on-error: true
os_type: ubuntu
os_version: '14.04'
overrides:
  admin_socket:
    branch: giant
  ceph:
    conf:
      mon:
        debug mon: 20
        debug ms: 1
        debug paxos: 20
        mon warn on legacy crush tunables: false
      osd:
        debug filestore: 20
        debug journal: 20
        debug ms: 1
        debug osd: 20
    log-whitelist:
    - slow request
    - scrub mismatch
    - ScrubResult
    sha1: b1ca1f23ff47857a27f6196c5a050f83f3acc9fc
  ceph-deploy:
    branch:
      dev: giant
    conf:
      client:
        log file: /var/log/ceph/ceph-$name.$pid.log
      mon:
        debug mon: 1
        debug ms: 20
        debug paxos: 20
        osd default pool size: 2
  install:
    ceph:
      sha1: b1ca1f23ff47857a27f6196c5a050f83f3acc9fc
  s3tests:
    branch: giant
  workunit:
    sha1: b1ca1f23ff47857a27f6196c5a050f83f3acc9fc
owner: scheduled_teuthology@teuthology
priority: 1000
roles:
- - mon.a
  - mds.a
  - osd.0
  - osd.1
- - mon.b
  - mon.c
  - osd.2
  - osd.3
- - client.0
  - client.1
suite: upgrade:firefly-x
suite_branch: master
suite_path: /var/lib/teuthworker/src/ceph-qa-suite_master
targets:
  ubuntu@burnupi22.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCvcs33PS2MQMxrCXm6jPOz1tz2i0bGA+8uVmTWe7Iu2W+fsHonGTMHJxPVYRAMe8WDulEYtkwK64s9D/Ph38kRK0o62SA599NKVIPvh1LzZadkWCX6aKlLv6cQYvQxOaUBuAlOIKQ5h0IKxu2lBPCHg6CIRHLUCYmTcR2PzXahMS9ToGMq+NS36+/4HmYPQ80lJcf1D9J+m6ETVMZ+DDcV4B6DWyImczwfNvXJY/Mj10bh4ZyEt5MSTVqfFKqNa3K1fWDzWUfsx8G3QGnyXNwuhUfRBslkIn4bFM2oJvGpFqZOSPGEskjM3IZaqhcoydnYDTTIWHG/8K5WJ70ZBhdx
  ubuntu@burnupi34.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDVHu1u8/oxlx4Gs/CzuGsF6R5obvz8zBIZJ2oW6ZlWn3da3ybaWDEY3rmRtCEpmFIXK5UKFRFEqlKcbDVbl3OB53a4SUcgLgH0YcVgab3zy4rp7SDdBXzGJK7aM7hhGiKY73O7pKpFLX8thRxNIzRBR1Rr49Re41WXfb/45fDl2tiGNMX0QgorKUtMCkeKv4C/NhG4g+pk0j2kur4QCUfFGGzcYJNlpGzmyBoe0g8UYtLAPKOBjpUHY4iDwe2hB36ifiW1T9WvJ3f7/axcZpFuFosdMEJJ3mrIOAeko4CpcV7lJVCT3S/Kj9KsyklLt682ni999dQ/RRHDQkqd0Qth
  ubuntu@plana25.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC85wgMM7frtcFeCfatlKcc0Ru1HB4X/557M//6iIT2hQExLRtPADyOpfZdZmhP4Nh/mP6C9oB5gYH82sHnuVbCboq9J9OzBK0STFo4OIToRgbLJCTRfNuKt0VX0WCpvneQfA4SKmAsO527HgDcY/yyhzg67rWIel4LilQpFbPTe+rB9wBjO/DpbhxoF7d8vQUwtt2dYv6BXOvYPHCvgydTCAMIOgHIP/UqQceJgj/I3u85851yllYnBNE7LaJRXB96FlRtO25ZV7F7pFYLxyCsm+vGfRmp5YqdP+Qw72UaXuMpan+dQDwzpfRklLvolrq9jOLLYIvwnzd+GQgbRR87
tasks:
- internal.lock_machines:
  - 3
  - plana,burnupi,mira
- internal.save_config: null
- internal.check_lock: null
- internal.connect: null
- internal.push_inventory: null
- internal.serialize_remote_roles: null
- internal.check_conflict: null
- internal.check_ceph_data: null
- internal.vm_setup: null
- kernel: *id001
- internal.base: null
- internal.archive: null
- internal.coredump: null
- internal.sudo: null
- internal.syslog: null
- internal.timer: null
- chef: null
- clock.check: null
- install:
    branch: firefly
- print: '**** done installing firefly'
- ceph:
    fs: xfs
- print: '**** done ceph'
- parallel:
  - workload
  - upgrade-sequence
- print: '**** done parallel'
- install.upgrade:
    client.0: null
- print: '**** done install.upgrade'
- rgw:
  - client.1
- s3tests:
    client.1:
      branch: dumpling
      rgw_server: client.1
teuthology_branch: master
tube: multi
upgrade-sequence:
  sequential:
  - install.upgrade:
      mon.a: null
  - print: '**** done install.upgrade mon.a to the version from teuthology-suite arg'
  - ceph.restart:
      daemons:
      - mon.a
      wait-for-healthy: true
  - sleep:
      duration: 60
  - ceph.restart:
      daemons:
      - osd.0
      - osd.1
      wait-for-healthy: true
  - sleep:
      duration: 60
  - ceph.restart:
    - mds.a
  - sleep:
      duration: 60
  - print: '**** running mixed versions of osds and mons'
  - exec:
      mon.b:
      - ceph osd crush tunables firefly
  - install.upgrade:
      mon.b: null
  - print: '**** done install.upgrade mon.b to the version from teuthology-suite arg'
  - ceph.restart:
      daemons:
      - mon.b
      - mon.c
      wait-for-healthy: true
  - sleep:
      duration: 60
  - ceph.restart:
      daemons:
      - osd.2
      - osd.3
      wait-for-healthy: true
  - sleep:
      duration: 60
verbose: true
worker_log: /var/lib/teuthworker/archive/worker_logs/worker.multi.3171
workload:
  sequential:
  - workunit:
      branch: firefly
      clients:
        client.0:
        - rbd/test_librbd_python.sh
description: upgrade:firefly-x/parallel/{0-cluster/start.yaml 1-firefly-install/firefly.yaml
  2-workload/test_rbd_python.yaml 3-upgrade-sequence/upgrade-mon-osd-mds.yaml 4-final-upgrade/client.yaml
  5-final-workload/rgw_s3tests.yaml distros/ubuntu_14.04.yaml}
duration: 754.9805121421814
failure_reason: 'Command failed on plana25 with status 1: ''sudo adjust-ulimits ceph-coverage
  /home/ubuntu/cephtest/archive/coverage daemon-helper kill ceph-mon -f -i b'''
flavor: basic
owner: scheduled_teuthology@teuthology
success: false
Actions #1

Updated by Samuel Just over 9 years ago

  • Assignee set to Greg Farnum
  • Priority changed from Normal to Urgent
Actions #2

Updated by Greg Farnum over 9 years ago

  • Status changed from New to In Progress

Well, good news and bad news:
This is not a monitor bug, and my initial guess is that it will only affect clusters running Giant MDSes and Firefly monitors. (That is the case for this specific instance of the bug, but I haven't quite figured out how it happened in the common code paths yet.)

But the problem is that somehow the MMDSBeacon::HEAD_VERSION value of 3 in Giant (it was 2 in Firefly) is getting transmuted into header.compat_version rather than header.version. I haven't tracked that down yet and it scares me.

The actual crash is caused by an unchecked dereference in Monitor::handle_forward, which assumes that the MForward message actually has a forwarded message attached. Here it doesn't, because the message being forwarded is an MMDSBeacon and decoding it failed. We can see an instance of the same decode failure earlier, when the MDS tried to connect directly to us:

2014-10-03 10:28:13.721812 7fee57ca8700  0 will not decode message of type 100 version 3 because compat_version 3 > supported version 2
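
To make the failure mode above concrete, here is a minimal, self-contained sketch. It is not the Ceph source: `Header`, `Msg`, `Forward`, `decode_message`, and this `handle_forward` are hypothetical stand-ins, and the source name "mds.a" is only a placeholder. It shows a decode step that refuses a message whose compat_version exceeds what the receiver supports, and a forward handler that checks for a missing inner message instead of dereferencing it.

```cpp
#include <cassert>
#include <iostream>
#include <memory>
#include <string>

// Hypothetical, simplified stand-in for the on-wire message header.
struct Header {
  int type;
  int version;
  int compat_version;
};

// Hypothetical decoded message; `source` stands in for whatever
// get_source_inst() would read from a real message.
struct Msg {
  int type;
  std::string source;
};

// Returns nullptr when the sender's compat_version is newer than what this
// receiver supports, mirroring the "will not decode" log line.
std::unique_ptr<Msg> decode_message(const Header& h, int supported_version) {
  if (h.compat_version > supported_version) {
    std::cerr << "will not decode message of type " << h.type
              << " version " << h.version
              << " because compat_version " << h.compat_version
              << " > supported version " << supported_version << "\n";
    return nullptr;
  }
  return std::make_unique<Msg>(Msg{h.type, "mds.a"});
}

// Stand-in for a forwarding wrapper whose inner message may have failed
// to decode.
struct Forward {
  std::unique_ptr<Msg> inner;
};

// Guarded handler: verifies the inner message exists before touching it.
bool handle_forward(const Forward& f) {
  if (!f.inner) {
    return false;  // drop the forward rather than dereference a null pointer
  }
  std::cout << "forwarded message from " << f.inner->source << "\n";
  return true;
}

// Exercises the failing case from this ticket: a version-3 beacon with
// compat_version 3 arriving at a receiver that supports only version 2.
void demo() {
  Header giant_beacon{100, 3, 3};
  Forward f{decode_message(giant_beacon, 2)};
  assert(!handle_forward(f));
}
```

Without the `if (!f.inner)` check, the handler would read through a null pointer, which matches the shape of the segfault in the backtrace. Whether the real monitor should drop, log, or reject such a forward is a design decision this sketch does not settle.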

Actions #3

Updated by Greg Farnum over 9 years ago

Okay, it's because Message::encode() transmutes a compat_version of 0 into compat_version == HEAD_VERSION, and we aren't explicitly setting a compat version. That was easy.
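
The defaulting behaviour described above can be modelled in a few lines. This is a simplified sketch under stated assumptions, not the actual Message class: `SketchHeader`, `SketchMessage`, and `receiver_can_decode` are hypothetical names, and the explicit compat value of 2 in the second case is an illustrative assumption about a remedy, not a description of the merged patch.

```cpp
#include <cassert>

// Hypothetical, simplified header carrying the two version fields.
struct SketchHeader {
  int version = 0;
  int compat_version = 0;
};

// Hypothetical message model: compat_version defaults to 0 ("unset")
// unless the caller supplies one.
struct SketchMessage {
  SketchHeader header;

  explicit SketchMessage(int head_version, int compat_version = 0) {
    header.version = head_version;
    header.compat_version = compat_version;
  }

  // Models the behaviour described in the comment above: an unset (0)
  // compat_version is replaced by the head version at encode time.
  void encode() {
    if (header.compat_version == 0) {
      header.compat_version = header.version;
    }
  }
};

// Receiver-side gate: a message is decodable only if its compat_version
// does not exceed the newest version the receiver understands.
bool receiver_can_decode(const SketchHeader& h, int supported_version) {
  return h.compat_version <= supported_version;
}

void demo_encode() {
  // Version bumped to 3 with no explicit compat: encode() yields compat 3,
  // so a receiver that supports only version 2 refuses the message.
  SketchMessage implicit_compat(3);
  implicit_compat.encode();
  assert(implicit_compat.header.compat_version == 3);
  assert(!receiver_can_decode(implicit_compat.header, 2));

  // Same version with an explicit compat of 2 (assumed value): the older
  // receiver accepts it.
  SketchMessage explicit_compat(3, 2);
  explicit_compat.encode();
  assert(explicit_compat.header.compat_version == 2);
  assert(receiver_can_decode(explicit_compat.header, 2));
}
```

The point of the sketch is that bumping the head version alone silently raises the compatibility floor, unless a compat version is set explicitly.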

Actions #4

Updated by Greg Farnum over 9 years ago

  • Subject changed from "Segmentation fault" in upgrade:firefly-x-giant-distro-basic-multi run to MMDSBeacon: failure to decode; compat_version = 3 on Firefly monitor
  • Status changed from In Progress to Fix Under Review
Actions #5

Updated by Greg Farnum over 9 years ago

  • Assignee changed from Greg Farnum to Tamilarasi muthamizhan

https://github.com/ceph/ceph/pull/2640

Tamil will put it through the upgrade suite.

Actions #6

Updated by Sage Weil over 9 years ago

  • Status changed from Fix Under Review to Pending Backport

Fix looks right. Merged it into the giant branch.

Actions #7

Updated by Tamilarasi muthamizhan over 9 years ago

Tested with wip-9657; the fix works fine.

Logs are copied to vpm102.front.sepia.ceph.com:/home/ubuntu/wip-9657

Actions #8

Updated by Tamilarasi muthamizhan over 9 years ago

  • Assignee changed from Tamilarasi muthamizhan to Greg Farnum
Actions #9

Updated by Greg Farnum over 9 years ago

  • Status changed from Pending Backport to Resolved

No backport is needed; this is done. (25bcc39bb809e2d13beea1529e4ab92d1b61fa5b)
