Bug #8042

mon: crash decoding incremental osdmap on split firefly/dumpling

Added by Yuri Weinstein almost 10 years ago. Updated almost 10 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
Monitor
Target version:
-
% Done:

0%

Source:
Q/A
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Logs are in http://qa-proxy.ceph.com/teuthology/teuthology-2014-04-07_22:35:16-upgrade:dumpling-x:stress-split-firefly-distro-basic-vps/177714/

2014-04-07T23:19:49.226 DEBUG:teuthology.orchestra.run:Running [10.214.138.91]: 'adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph osd out 2'
2014-04-07T23:19:50.300 INFO:teuthology.orchestra.run.err:[10.214.138.91]: marked out osd.2.
2014-04-07T23:19:50.468 INFO:teuthology.run_tasks:Running task ceph.restart...
2014-04-07T23:19:50.468 DEBUG:teuthology.task.ceph.mon.c:waiting for process to exit
2014-04-07T23:19:50.468 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):
  File "/home/teuthworker/teuthology-firefly/teuthology/run_tasks.py", line 41, in run_tasks
    manager.__enter__()
  File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/home/teuthworker/teuthology-firefly/teuthology/task/ceph.py", line 1307, in restart
    ctx.daemons.get_daemon(type_, id_).stop()
  File "/home/teuthworker/teuthology-firefly/teuthology/task/ceph.py", line 57, in stop
    run.wait([self.proc])
  File "/home/teuthworker/teuthology-firefly/teuthology/orchestra/run.py", line 356, in wait
    proc.exitstatus.get()
  File "/usr/lib/python2.7/dist-packages/gevent/event.py", line 207, in get
    raise self._exception
CommandFailedError: Command failed on 10.214.138.145 with status 1: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage daemon-helper kill ceph-mon -f -i c'
2014-04-07T23:19:50.527 ERROR:teuthology.run_tasks: Sentry event: http://sentry.ceph.com/inktank/teuthology/search?q=d852633d236c4d3c94e4c160a95e0325
CommandFailedError: Command failed on 10.214.138.145 with status 1: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage daemon-helper kill ceph-mon -f -i c'
archive_path: /var/lib/teuthworker/archive/teuthology-2014-04-07_22:35:16-upgrade:dumpling-x:stress-split-firefly-distro-basic-vps/177714
description: upgrade/dumpling-x/stress-split/{0-cluster/start.yaml 1-dumpling-install/dumpling.yaml
  2-partial-upgrade/firsthalf.yaml 3-thrash/default.yaml 4-mon/mona.yaml 5-workload/rbd-cls.yaml
  6-next-mon/monb.yaml 7-workload/rbd_api.yaml 8-next-mon/monc.yaml 9-workload/{rados_api_tests.yaml
  rbd-python.yaml rgw-s3tests.yaml snaps-many-objects.yaml} distros/rhel_6.4.yaml}
email: null
job_id: '177714'
kernel: &id001
  kdb: true
  sha1: distro
last_in_suite: false
machine_type: vps
name: teuthology-2014-04-07_22:35:16-upgrade:dumpling-x:stress-split-firefly-distro-basic-vps
nuke-on-error: true
os_type: rhel
os_version: '6.4'
overrides:
  admin_socket:
    branch: firefly
  ceph:
    conf:
      mon:
        debug mon: 20
        debug ms: 1
        debug paxos: 20
        mon warn on legacy crush tunables: false
      osd:
        debug filestore: 20
        debug journal: 20
        debug ms: 1
        debug osd: 20
    log-whitelist:
    - slow request
    - wrongly marked me down
    - objects unfound and apparently lost
    - log bound mismatch
    sha1: 010dff12c38882238591bb042f8e497a1f7ba020
  ceph-deploy:
    branch:
      dev: firefly
    conf:
      client:
        log file: /var/log/ceph/ceph-$name.$pid.log
      mon:
        debug mon: 1
        debug ms: 20
        debug paxos: 20
        osd default pool size: 2
  install:
    ceph:
      sha1: 010dff12c38882238591bb042f8e497a1f7ba020
  s3tests:
    branch: master
  workunit:
    sha1: 010dff12c38882238591bb042f8e497a1f7ba020
owner: scheduled_teuthology@teuthology
roles:
- - mon.a
  - mon.b
  - mds.a
  - osd.0
  - osd.1
  - osd.2
- - osd.3
  - osd.4
  - osd.5
  - mon.c
- - client.0
targets:
  ubuntu@vpm060.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAzWI4wJknBQtX6yXeHF7up6AyEpe//rgYIFMEap/9yycLd7DmL5hTt1jZYgFqBBaWe2lr1KoaK/UFWGrMtA387skmebyYBC3pKywWkdVs8s29uGh3X4y6R0Rb7a/2r5QoRwcnMuZcvuCS56iWyFOZ4gSKIUs2Ctnn3B91PsZYtP70FBHIkb5m++xlEuG9Z7xkF3R+m4PrcKy3joOc5kBg9vFMW7MwzE4RP3YcMxgUA8BFwAbUaNq8zMpUTnXsLSuUN6d5cyYYJyg9VuzKXD5aNX4GatR0IrSs0MqxaaoNK2x9y0j91L2EhxeaYmkCVxU2LqJbrgFTq0RW035pFahP+Q==
  ubuntu@vpm061.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEA1qp1J4cUqLuVuLujPQ04qCAP6D/nTJqGnaF5JMCOwQtauPYarc30Z1OqVebgT7Qo8eBx2FzazIiGSbLJNrTtzXtvQNrR4mriZo71orXegKJgDyUSl3fVLMmr2rYF7XlQU+rATCm4BF0+Vdtzd0EEiFCcZJPVWpS/FGoKilsje32Y7t8FTNtX7bLLbpAvWnVkxjVXT+byHoZIUWej2MYEscgQzek4sF78HWPi/mcEcn3mayJdaIe2PaGusGfRjCMu0FTmUYtoY6MqmHWnszkIqzxy7bAqUytOh1O79o8WMDmkxzjkmiONdXEU9+CKKQxXzh4sRzG95SNQFHHDpX1rKQ==
  ubuntu@vpm062.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAoyLPghb+RdivtlTeH6T5GIomIUcF3/as4485o1Jdx9+0ArO+keS9RNZrdNSoy5NeL2HDURfsYS1miWUg1yore6wfVe30pLXxnzOPsT5hxt2e+eqOWMv1lzclYNCd7X9Ni63pU+KI5uWpfUwRFchPalXaAHu/8EMGusS45o2vm11gbLpV0YfJ7y08b0GvGfhrFVkJrJSSPU8dpYGI5BbfDAaLtFwAOxW7BV8zLSczuUNBCg2MVbpx780I9WYZaBS1BoZuI48KBr7JyKBrYh6+Udpi+HpRrqzamHW+/oKMmCtgMtIQ3H2K3o6v9yW6uGpMeBzrokicdLBgulO+9cfU6Q==
tasks:
- internal.lock_machines:
  - 3
  - vps
- internal.save_config: null
- internal.check_lock: null
- internal.connect: null
- internal.check_conflict: null
- internal.check_ceph_data: null
- internal.vm_setup: null
- kernel: *id001
- internal.base: null
- internal.archive: null
- internal.coredump: null
- internal.sudo: null
- internal.syslog: null
- internal.timer: null
- chef: null
- clock.check: null
- install:
    branch: dumpling
- ceph:
    fs: xfs
- install.upgrade:
    osd.0: null
- ceph.restart:
    daemons:
    - osd.0
    - osd.1
    - osd.2
- thrashosds:
    chance_pgnum_grow: 1
    chance_pgpnum_fix: 1
    thrash_primary_affinity: false
    timeout: 1200
- ceph.restart:
    daemons:
    - mon.a
    wait-for-healthy: false
    wait-for-osds-up: true
- workunit:
    branch: dumpling
    clients:
      client.0:
      - cls/test_cls_rbd.sh
- ceph.restart:
    daemons:
    - mon.b
    wait-for-healthy: false
    wait-for-osds-up: true
- workunit:
    branch: dumpling
    clients:
      client.0:
      - rbd/test_librbd.sh
- install.upgrade:
    mon.c: null
- ceph.restart:
    daemons:
    - mon.c
    wait-for-healthy: false
    wait-for-osds-up: true
- ceph.wait_for_mon_quorum:
  - a
  - b
  - c
- workunit:
    branch: dumpling
    clients:
      client.0:
      - rados/test-upgrade-firefly.sh
- workunit:
    branch: dumpling
    clients:
      client.0:
      - rbd/test_librbd_python.sh
- rgw:
    client.0:
      idle_timeout: 120
- swift:
    client.0:
      rgw_server: client.0
- rados:
    clients:
    - client.0
    objects: 500
    op_weights:
      delete: 50
      read: 100
      rollback: 50
      snap_create: 50
      snap_remove: 50
      write: 100
    ops: 4000
teuthology_branch: firefly
verbose: true
worker_log: /var/lib/teuthworker/archive/worker_logs/worker.vps.17036
description: upgrade/dumpling-x/stress-split/{0-cluster/start.yaml 1-dumpling-install/dumpling.yaml
  2-partial-upgrade/firsthalf.yaml 3-thrash/default.yaml 4-mon/mona.yaml 5-workload/rbd-cls.yaml
  6-next-mon/monb.yaml 7-workload/rbd_api.yaml 8-next-mon/monc.yaml 9-workload/{rados_api_tests.yaml
  rbd-python.yaml rgw-s3tests.yaml snaps-many-objects.yaml} distros/rhel_6.4.yaml}
duration: 2959.741597175598
failure_reason: 'Command failed on 10.214.138.145 with status 1: ''sudo adjust-ulimits
  ceph-coverage /home/ubuntu/cephtest/archive/coverage daemon-helper kill ceph-mon
  -f -i c'''
flavor: basic
owner: scheduled_teuthology@teuthology
sentry_event: http://sentry.ceph.com/inktank/teuthology/search?q=d852633d236c4d3c94e4c160a95e0325
success: false

Associated revisions

Revision b3b502f1 (diff)
Added by Sage Weil almost 10 years ago

mon/Elector: ignore ACK from peers without required features

If an old peer gets a PROPOSE from us, we need to be sure to ignore their
ACK. Ignoring their PROPOSEs isn't sufficient to keep them out of a
quorum.

Fixes: #8042
Signed-off-by: Sage Weil <>
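The commit message above can be illustrated with a minimal sketch: when tallying election ACKs, the elector checks the peer's advertised feature bits against the quorum's required features and silently drops ACKs from peers that lack them, so an old (dumpling) monitor can never be voted into a quorum whose maps it cannot decode. All names and the feature-bit value here are illustrative, not Ceph's actual `Elector` API.

```cpp
#include <cassert>
#include <cstdint>
#include <set>

// Hypothetical required-feature mask; stands in for the real
// quorum feature bits (e.g. the newer osdmap encoding feature).
constexpr uint64_t REQUIRED_FEATURES = 1ull << 5;

struct Elector {
  std::set<int> acked_peers;  // peers whose ACKs were counted

  // Handle an election ACK from `peer` advertising `peer_features`.
  // Returns true if the ACK was counted toward the election.
  bool handle_ack(int peer, uint64_t peer_features) {
    if ((peer_features & REQUIRED_FEATURES) != REQUIRED_FEATURES) {
      // The fix: ignore the ACK outright. Ignoring only the peer's
      // own PROPOSEs is not enough, because its ACK to *our* PROPOSE
      // could still pull it into the quorum.
      return false;
    }
    acked_peers.insert(peer);
    return true;
  }
};
```

With this check, a feature-deficient peer is excluded from the quorum rather than crashing later while decoding an incremental osdmap it does not understand.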

History

#1 Updated by Sage Weil almost 10 years ago

  • Subject changed from "err... marked out osd.2" in upgrade:dumpling-x:stress-split-firefly-distro-basic-vps suite to mon: crash decoding incremental osdmap on split firefly/dumpling
  • Category set to Monitor
  • Priority changed from Normal to Urgent
  • Source changed from other to Q/A
    -7> 2014-04-08 02:12:06.476065 7fca82375700 10 mon.c@1(peon).pg v413 update_logger
    -6> 2014-04-08 02:12:06.476104 7fca82375700 10 mon.c@1(peon).paxosservice(mdsmap 1..5) refresh
    -5> 2014-04-08 02:12:06.476138 7fca82375700 10 mon.c@1(peon).paxosservice(osdmap 1..238) refresh
    -4> 2014-04-08 02:12:06.476141 7fca82375700 15 mon.c@1(peon).osd e232 update_from_paxos paxos e 238, my e 232
    -3> 2014-04-08 02:12:06.476179 7fca82375700  7 mon.c@1(peon).osd e232 update_from_paxos  applying incremental 233
    -2> 2014-04-08 02:12:06.675305 7fca8889a700  1 -- 10.214.138.145:6789/0 >> :/0 pipe(0x3ffa800 sd=23 :6789 s=0 pgs=0 cs=0 l=0 c=0x51b27e0).accept sd=23 10.214.138.145:39852/0
    -1> 2014-04-08 02:12:06.675364 7fca8889a700 10 mon.c@1(peon) e1 ms_verify_authorizer 10.214.138.145:6804/3756 osd protocol 0
     0> 2014-04-08 02:12:11.880935 7fca82375700 -1 *** Caught signal (Aborted) **
 in thread 7fca82375700

 ceph version 0.67.7-66-g051a17e (051a17eb008d75aa6b0737873318a2e7273501ab)
 1: ceph-mon() [0x6497b1]
 2: (()+0xf500) [0x7fca87f24500]
 3: (gsignal()+0x35) [0x7fca869328a5]
 4: (abort()+0x175) [0x7fca86934085]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x12d) [0x7fca871eba5d]
 6: (()+0xbcbe6) [0x7fca871e9be6]
 7: (()+0xbcc13) [0x7fca871e9c13]
 8: (()+0xbcd0e) [0x7fca871e9d0e]
 9: ceph-mon() [0x78712f]
 10: (OSDMap::Incremental::decode(ceph::buffer::list::iterator&)+0x1c9) [0x69f3d9]
 11: (OSDMap::Incremental::Incremental(ceph::buffer::list&)+0x4a3) [0x5c9853]
 12: (OSDMonitor::update_from_paxos(bool*)+0x1006) [0x5a7c46]
 13: (PaxosService::refresh(bool*)+0x18c) [0x58baec]
 14: (Monitor::refresh_from_paxos(bool*)+0x57) [0x531317]
 15: (Paxos::do_refresh()+0x36) [0x57a566]
 16: (Paxos::handle_commit(MMonPaxos*)+0x21a) [0x584bfa]
 17: (Paxos::dispatch(PaxosServiceMessage*)+0x24b) [0x58601b]
 18: (Monitor::_ms_dispatch(Message*)+0x104d) [0x560ded]
 19: (Monitor::ms_dispatch(Message*)+0x32) [0x578f32]
 20: (DispatchQueue::entry()+0x5a2) [0x7e5122]
 21: (DispatchQueue::DispatchThread::entry()+0xd) [0x7c047d]
 22: (()+0x7851) [0x7fca87f1c851]
 23: (clone()+0x6d) [0x7fca869e890d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
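The backtrace above (frames 5-10) shows `OSDMap::Incremental::decode` throwing on an encoding it does not understand, with the exception escaping to `__gnu_cxx::__verbose_terminate_handler` and aborting the process. A minimal sketch of that failure mode, with an invented version check and ceiling rather than Ceph's actual encoding logic:

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Toy stand-in for a versioned incremental-map decoder. A 0.67.x
// (dumpling) daemon has a fixed ceiling on the struct versions it
// knows; a payload encoded by a newer (firefly) peer exceeds it.
struct Incremental {
  uint8_t struct_v = 0;
  static constexpr uint8_t MAX_UNDERSTOOD_V = 6;  // hypothetical ceiling

  void decode(const std::vector<uint8_t>& buf) {
    struct_v = buf.at(0);  // first byte: encoding version
    if (struct_v > MAX_UNDERSTOOD_V)
      // In the real monitor this throw, left uncaught on the dispatch
      // path, reaches std::terminate -> abort (the crash above).
      throw std::runtime_error("unknown incremental encoding version");
    // ... decode the fields defined for struct_v ...
  }
};
```

This is why the crash appears only on the mixed firefly/dumpling split: the payload is valid, but the old peon cannot parse it, and the elector fix above prevents that peon from being in the quorum at all.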

#2 Updated by Ian Colle almost 10 years ago

  • Assignee set to Joao Eduardo Luis

#3 Updated by Sage Weil almost 10 years ago

  • Status changed from New to Fix Under Review

#4 Updated by Greg Farnum almost 10 years ago

  • Status changed from Fix Under Review to 7
  • Assignee changed from Joao Eduardo Luis to Sage Weil

#5 Updated by Sage Weil almost 10 years ago

  • Status changed from 7 to Fix Under Review

#6 Updated by Sage Weil almost 10 years ago

  • Assignee changed from Sage Weil to Greg Farnum

#7 Updated by Sage Weil almost 10 years ago

  • Status changed from Fix Under Review to Resolved
