Bug #8725

closed

mds crashed in upgrade:dumpling-x:stress-split-master-testing-basic-plana

Added by Yuri Weinstein almost 10 years ago. Updated over 9 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: -
Target version: -
% Done: 0%
Source: Q/A
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Logs are in http://qa-proxy.ceph.com/teuthology/ubuntu-2014-07-01_11:38:37-upgrade:dumpling-x:stress-split-master-testing-basic-plana/337407/

Coredump info from remote/plana57/log/ceph-mds.a.log.gz:

2014-07-01 12:05:07.682897 7f81a917f700  0 mds.0.cache creating system inode with ino:200
2014-07-01 12:05:22.660113 7f81a917f700  1 mds.0.1 creating_done
2014-07-01 12:05:22.797318 7f81a917f700  1 mds.0.1 handle_mds_map i am now mds.0.1
2014-07-01 12:05:22.797331 7f81a917f700  1 mds.0.1 handle_mds_map state change up:creating --> up:active
2014-07-01 12:05:22.797335 7f81a917f700  1 mds.0.1 active_start
2014-07-01 12:06:36.026362 7f81a566e700  0 -- 10.214.132.21:6812/4830 >> 10.214.132.21:6808/4560 pipe(0x2d6e780 sd=22 :0 s=1 pgs=0 cs=0 l=1 c=0x2d3d2c0).fault
2014-07-01 12:06:42.028396 7f81a5870700  0 -- 10.214.132.21:6812/4830 >> 10.214.132.21:6800/4558 pipe(0x2cfac80 sd=22 :0 s=1 pgs=0 cs=0 l=1 c=0x2d3d160).fault
2014-07-01 12:06:48.048396 7f81a566e700  0 -- 10.214.132.21:6812/4830 >> 10.214.132.21:6804/4559 pipe(0x2cfaa00 sd=21 :0 s=1 pgs=0 cs=0 l=1 c=0x2d3d000).fault
2014-07-01 12:12:39.418941 7f81a917f700  0 monclient: hunting for new mon
2014-07-01 12:12:40.538635 7f81a917f700 -1 *** Caught signal (Aborted) **
 in thread 7f81a917f700

 ceph version 0.67.9-20-g583e6e3 (583e6e3ef7f28bf34fe038e8a2391f9325a69adf)
 1: ceph-mds() [0x98ea4a]
 2: (()+0xfcb0) [0x7f81acefccb0]
 3: (gsignal()+0x35) [0x7f81ab3d3425]
 4: (abort()+0x17b) [0x7f81ab3d6b8b]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f81abd2669d]
 6: (()+0xb5846) [0x7f81abd24846]
 7: (()+0xb5873) [0x7f81abd24873]
 8: (()+0xb596e) [0x7f81abd2496e]
 9: (MDSMap::decode(ceph::buffer::list::iterator&)+0xe53) [0x801f43]
 10: (MDS::handle_mds_map(MMDSMap*)+0x5aa) [0x588f8a]
 11: (MDS::handle_core_message(Message*)+0x5bb) [0x58d67b]
 12: (MDS::_dispatch(Message*)+0x2f) [0x58ddaf]
 13: (MDS::ms_dispatch(Message*)+0x1d3) [0x58f843]
 14: (DispatchQueue::entry()+0x549) [0x95abe9]
 15: (DispatchQueue::DispatchThread::entry()+0xd) [0x87b50d]
 16: (()+0x7e9a) [0x7f81acef4e9a]
 17: (clone()+0x6d) [0x7f81ab4913fd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
 -1018> 2014-07-01 12:05:07.551553 7f81ad318780  5 asok(0x2cda000) register_command perfcounters_dump hook 0x2ccf010
 -1017> 2014-07-01 12:05:07.551587 7f81ad318780  5 asok(0x2cda000) register_command 1 hook 0x2ccf010
 -1016> 2014-07-01 12:05:07.551598 7f81ad318780  5 asok(0x2cda000) register_command perf dump hook 0x2ccf010
 -1015> 2014-07-01 12:05:07.551605 7f81ad318780  5 asok(0x2cda000) register_command perfcounters_schema hook 0x2ccf010
 -1014> 2014-07-01 12:05:07.551608 7f81ad318780  5 asok(0x2cda000) register_command 2 hook 0x2ccf010
 -1013> 2014-07-01 12:05:07.551611 7f81ad318780  5 asok(0x2cda000) register_command perf schema hook 0x2ccf010
 -1012> 2014-07-01 12:05:07.551614 7f81ad318780  5 asok(0x2cda000) register_command config show hook 0x2ccf010
--
   -30> 2014-07-01 12:12:37.436373 7f81a5870700  2 -- 10.214.132.21:6812/4830 >> 10.214.132.21:6800/4558 pipe(0x2cfac80 sd=18 :0 s=1 pgs=0 cs=0 l=1 c=0x2d3d160).connect error 10.214.132.21:6800/4558, 111: Connection refused
   -29> 2014-07-01 12:12:37.436424 7f81a5870700  2 -- 10.214.132.21:6812/4830 >> 10.214.132.21:6800/4558 pipe(0x2cfac80 sd=18 :0 s=1 pgs=0 cs=0 l=1 c=0x2d3d160).fault 111: Connection refused
   -28> 2014-07-01 12:12:37.572533 7f81a707a700  5 mds.0.1 is_laggy 18.893766 > 15 since last acked beacon
   -27> 2014-07-01 12:12:37.572557 7f81a707a700  5 mds.0.1 tick bailing out since we seem laggy
   -26> 2014-07-01 12:12:38.679548 7f81a707a700 10 monclient: _send_mon_message to mon.c at 10.214.131.19:6789/0
   -25> 2014-07-01 12:12:38.679574 7f81a707a700  1 -- 10.214.132.21:6812/4830 --> 10.214.131.19:6789/0 -- mdsbeacon(4107/a up:active seq 114 v5) v2 -- ?+0 0x2d4a340 con 0x2cf3580
   -24> 2014-07-01 12:12:39.418625 7f81a717b700  2 -- 10.214.132.21:6812/4830 >> 10.214.131.19:6789/0 pipe(0x2cfa500 sd=8 :36066 s=2 pgs=8 cs=1 l=1 c=0x2cf3580).reader couldn't read tag, Success
   -23> 2014-07-01 12:12:39.418683 7f81a717b700  2 -- 10.214.132.21:6812/4830 >> 10.214.131.19:6789/0 pipe(0x2cfa500 sd=8 :36066 s=2 pgs=8 cs=1 l=1 c=0x2cf3580).fault 0: Success
   -22> 2014-07-01 12:12:39.418920 7f81a917f700 10 monclient: ms_handle_reset current mon 10.214.131.19:6789/0
   -21> 2014-07-01 12:12:39.418941 7f81a917f700  0 monclient: hunting for new mon
   -20> 2014-07-01 12:12:39.418944 7f81a917f700 10 monclient: _reopen_session rank -1 name 
   -19> 2014-07-01 12:12:39.418950 7f81a917f700  1 -- 10.214.132.21:6812/4830 mark_down 0x2cf3580 -- pipe dne
   -18> 2014-07-01 12:12:39.419013 7f81a917f700 10 monclient: picked mon.b con 0x2d3d580 addr 10.214.132.21:6790/0
   -17> 2014-07-01 12:12:39.419039 7f81a917f700 10 monclient(hunting): _send_mon_message to mon.b at 10.214.132.21:6790/0
   -16> 2014-07-01 12:12:39.419047 7f81a917f700  1 -- 10.214.132.21:6812/4830 --> 10.214.132.21:6790/0 -- auth(proto 0 26 bytes epoch 1) v1 -- ?+0 0x2d646c0 con 0x2d3d580
   -15> 2014-07-01 12:12:39.419063 7f81a917f700 10 monclient(hunting): renew_subs
   -14> 2014-07-01 12:12:39.419068 7f81a917f700  5 mds.0.1 ms_handle_reset on 10.214.131.19:6789/0
   -13> 2014-07-01 12:12:39.420144 7f81a917f700  5 mds.0.1 ms_handle_connect on 10.214.132.21:6790/0
   -12> 2014-07-01 12:12:39.421258 7f81a917f700  1 -- 10.214.132.21:6812/4830 <== mon.2 10.214.132.21:6790/0 1 ==== auth_reply(proto 2 0 Success) v1 ==== 33+0+0 (2748226891 0 0) 0x2cdf600 con 0x2d3d580
   -11> 2014-07-01 12:12:39.421363 7f81a917f700 10 monclient(hunting): _send_mon_message to mon.b at 10.214.132.21:6790/0
   -10> 2014-07-01 12:12:39.421372 7f81a917f700  1 -- 10.214.132.21:6812/4830 --> 10.214.132.21:6790/0 -- auth(proto 2 128 bytes epoch 0) v1 -- ?+0 0x2d0c000 con 0x2d3d580
    -9> 2014-07-01 12:12:39.422600 7f81a917f700  1 -- 10.214.132.21:6812/4830 <== mon.2 10.214.132.21:6790/0 2 ==== auth_reply(proto 2 0 Success) v1 ==== 225+0+0 (310302456 0 0) 0x2d5b400 con 0x2d3d580
    -8> 2014-07-01 12:12:39.422676 7f81a917f700  1 monclient(hunting): found mon.b
    -7> 2014-07-01 12:12:39.422688 7f81a917f700 10 monclient: _send_mon_message to mon.b at 10.214.132.21:6790/0
    -6> 2014-07-01 12:12:39.422703 7f81a917f700  1 -- 10.214.132.21:6812/4830 --> 10.214.132.21:6790/0 -- mon_subscribe({mdsmap=6+,monmap=2+}) v2 -- ?+0 0x2d51000 con 0x2d3d580
    -5> 2014-07-01 12:12:39.422738 7f81a917f700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2014-07-01 12:12:09.422737)
    -4> 2014-07-01 12:12:39.423907 7f81a917f700  1 -- 10.214.132.21:6812/4830 <== mon.2 10.214.132.21:6790/0 3 ==== mon_subscribe_ack(300s) v1 ==== 20+0+0 (131601739 0 0) 0x2cfbe00 con 0x2d3d580
    -3> 2014-07-01 12:12:39.423940 7f81a917f700 10 monclient: handle_subscribe_ack sent 2014-07-01 12:12:39.419065 renew after 2014-07-01 12:15:09.419065
    -2> 2014-07-01 12:12:40.535756 7f81a917f700  1 -- 10.214.132.21:6812/4830 <== mon.2 10.214.132.21:6790/0 4 ==== mdsmap(e 6) v1 ==== 598+0+0 (1390759494 0 0) 0x2d5ba00 con 0x2d3d580
    -1> 2014-07-01 12:12:40.535799 7f81a917f700  5 mds.0.1 handle_mds_map epoch 6 from mon.2
     0> 2014-07-01 12:12:40.538635 7f81a917f700 -1 *** Caught signal (Aborted) **
 in thread 7f81a917f700

 ceph version 0.67.9-20-g583e6e3 (583e6e3ef7f28bf34fe038e8a2391f9325a69adf)
 1: ceph-mds() [0x98ea4a]
 2: (()+0xfcb0) [0x7f81acefccb0]
 3: (gsignal()+0x35) [0x7f81ab3d3425]
 4: (abort()+0x17b) [0x7f81ab3d6b8b]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f81abd2669d]
 6: (()+0xb5846) [0x7f81abd24846]
 7: (()+0xb5873) [0x7f81abd24873]
 8: (()+0xb596e) [0x7f81abd2496e]
 9: (MDSMap::decode(ceph::buffer::list::iterator&)+0xe53) [0x801f43]
 10: (MDS::handle_mds_map(MMDSMap*)+0x5aa) [0x588f8a]
 11: (MDS::handle_core_message(Message*)+0x5bb) [0x58d67b]
 12: (MDS::_dispatch(Message*)+0x2f) [0x58ddaf]
 13: (MDS::ms_dispatch(Message*)+0x1d3) [0x58f843]
 14: (DispatchQueue::entry()+0x549) [0x95abe9]
 15: (DispatchQueue::DispatchThread::entry()+0xd) [0x87b50d]
 16: (()+0x7e9a) [0x7f81acef4e9a]
 17: (clone()+0x6d) [0x7f81ab4913fd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2014-07-01T13:44:23.549 INFO:teuthology.orchestra.run.plana57.stderr:dumped all in format json
2014-07-01T13:44:23.659 INFO:teuthology.misc:Shutting down mds daemons...
2014-07-01T13:44:23.659 ERROR:teuthology.misc:Saw exception from mds.a
Traceback (most recent call last):
  File "/home/teuthworker/teuthology-master/teuthology/misc.py", line 1093, in stop_daemons_of_type
    daemon.stop()
  File "/home/teuthworker/teuthology-master/teuthology/task/ceph.py", line 61, in stop
    run.wait([self.proc], timeout=timeout)
  File "/home/teuthworker/teuthology-master/teuthology/orchestra/run.py", line 424, in wait
    proc.wait()
  File "/home/teuthworker/teuthology-master/teuthology/orchestra/run.py", line 102, in wait
    exitstatus=status, node=self.hostname)
CommandFailedError: Command failed on plana57 with status 1: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage daemon-helper kill ceph-mds -f -i a'
2014-07-01T13:44:23.684 INFO:teuthology.misc:Shutting down osd daemons...
archive_path: /var/lib/teuthworker/archive/ubuntu-2014-07-01_11:38:37-upgrade:dumpling-x:stress-split-master-testing-basic-plana/337407
branch: master
description: upgrade/dumpling-x/stress-split/{0-cluster/start.yaml 1-dumpling-install/dumpling.yaml
  2-partial-upgrade/firsthalf.yaml 3-thrash/default.yaml 4-mon/mona.yaml 5-workload/rbd-cls.yaml
  6-next-mon/monb.yaml 7-workload/rados_api_tests.yaml 8-next-mon/monc.yaml 9-workload/{rados_api_tests.yaml
  rbd-python.yaml rgw-s3tests.yaml snaps-many-objects.yaml} distros/ubuntu_14.04.yaml}
email: null
job_id: '337407'
kernel: &id001
  kdb: true
  sha1: 8362a1290d075f376ba68521ffb3b42ecaaecfea
last_in_suite: false
machine_type: plana
name: ubuntu-2014-07-01_11:38:37-upgrade:dumpling-x:stress-split-master-testing-basic-plana
nuke-on-error: true
os_type: ubuntu
os_version: '14.04'
overrides:
  admin_socket:
    branch: master
  ceph:
    conf:
      mon:
        debug mon: 20
        debug ms: 1
        debug paxos: 20
        mon warn on legacy crush tunables: false
      osd:
        debug filestore: 20
        debug journal: 20
        debug ms: 1
        debug osd: 20
    log-whitelist:
    - slow request
    - wrongly marked me down
    - objects unfound and apparently lost
    - log bound mismatch
    sha1: 1eca89df3586e07409773ff6797095bfc6ec2dcc
  ceph-deploy:
    branch:
      dev: master
    conf:
      client:
        log file: /var/log/ceph/ceph-$name.$pid.log
      mon:
        debug mon: 1
        debug ms: 20
        debug paxos: 20
        osd default pool size: 2
  install:
    ceph:
      sha1: 1eca89df3586e07409773ff6797095bfc6ec2dcc
  s3tests:
    branch: master
  workunit:
    sha1: 1eca89df3586e07409773ff6797095bfc6ec2dcc
owner: yuriw
priority: 10
roles:
- - mon.a
  - mon.b
  - mds.a
  - osd.0
  - osd.1
  - osd.2
- - osd.3
  - osd.4
  - osd.5
  - mon.c
- - client.0
suite: upgrade:dumpling-x:stress-split
targets:
  ubuntu@plana21.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDVbQbk5qpDD2687Wu8iZt7sHmIQbrLr4Esj4NzOzmkkwvhj0p7GmO825mdpu/YP25mXQkrvuKlfuKHZ9QyxfyiCy051FeuPqhSk0IqYYaTVRslrvQ9uSa+IhqE23LxFhWQt7Kgl9DqG7377qqgEXTqBCj/LMD2ix4ugXYRTVFQIXibvZlTjEsNlcPD61R80ZcWa6Jd1jm4XPtqKlr5Sfe4DfWb/VomgHC/frSdmAQTRwikaMpHOonLAo2Hx6WQ/6TOgeDfXgla7wZzIVD3aHTAXVFzkqVb/V6brLn7hMP2Ok3dpo1nDFRTY7Q3/PTJFKeqVkZZgYv0GMTzFDD4NNKN
  ubuntu@plana34.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDhl4nIoX/Xy21FdSNkHIKvw+VnRxEXBW+XW4ES9FJSNkAQ3fwmZxHyA71PIjzb/pFLyWKroR/QlcQth76U9Kj3OmU1zdPgtgTHeMY6nXoY+4moEbFPJdkQyJq3oarBc1J2UXl5msnQAsK0k0AjOwLEDcpdAVuzztKry6hIKGiNlGs8Eueo0MFfI710HJZGB6HyDr51NmMfP8SqS6KAonacLyxwd8F71ygT0Y9p4LE1dPPVkS8bJ9eov6qx401O9ZvCVC2wjce9g7p15wHbQroPVRz2gm/GeIeCSTCvmm+08BOxuKS3gSEoUZOJO00BxmMWbJMyregrVNcMt563swgp
  ubuntu@plana57.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDFHCeGWMPGOLyScKFkduv7aJL9bpMUPZQATO9lxpWu1NtzYndPJtWcyUxgWlItu75SJwpXx/l2GhPYcDKrR1Nl37+dbgs5TeDTbr9YdQBuLPbkbIZMQqO4GqUjurEwLU3vFUZ0X7PTlUqn6qwpT+I2YJua19eF2cRQFIGYVZMzaezm47uh67cdKFh0RTA1pSJ2qM/WMn91boRWcsRQrmn4BeOzfpGfSPDRjrHXHiPx3Br4zcOi/3lOxNFcEeoBrA47PMxvxVIlbmxKDfNjHpQQT18VFWb+qcTAzf+zdBy3iDRFFS45fPrqlWjGn9sK74EbRQanDrZlrFkg2a/HIe5T
tasks:
- internal.lock_machines:
  - 3
  - plana
- internal.save_config: null
- internal.check_lock: null
- internal.connect: null
- internal.serialize_remote_roles: null
- internal.check_conflict: null
- internal.check_ceph_data: null
- internal.vm_setup: null
- kernel: *id001
- internal.base: null
- internal.archive: null
- internal.coredump: null
- internal.sudo: null
- internal.syslog: null
- internal.timer: null
- chef: null
- clock.check: null
- install:
    branch: dumpling
- ceph:
    fs: xfs
- install.upgrade:
    osd.0: null
- ceph.restart:
    daemons:
    - osd.0
    - osd.1
    - osd.2
- thrashosds:
    chance_pgnum_grow: 1
    chance_pgpnum_fix: 1
    thrash_primary_affinity: false
    timeout: 1200
- ceph.restart:
    daemons:
    - mon.a
    wait-for-healthy: false
    wait-for-osds-up: true
- workunit:
    branch: dumpling
    clients:
      client.0:
      - cls/test_cls_rbd.sh
- ceph.restart:
    daemons:
    - mon.b
    wait-for-healthy: false
    wait-for-osds-up: true
- workunit:
    branch: dumpling
    clients:
      client.0:
      - rados/test-upgrade-firefly.sh
- install.upgrade:
    mon.c: null
- ceph.restart:
    daemons:
    - mon.c
    wait-for-healthy: false
    wait-for-osds-up: true
- ceph.wait_for_mon_quorum:
  - a
  - b
  - c
- workunit:
    branch: dumpling
    clients:
      client.0:
      - rados/test-upgrade-firefly.sh
- workunit:
    branch: dumpling
    clients:
      client.0:
      - rbd/test_librbd_python.sh
- rgw:
    client.0:
      idle_timeout: 300
- swift:
    client.0:
      rgw_server: client.0
- rados:
    clients:
    - client.0
    objects: 500
    op_weights:
      delete: 50
      read: 100
      rollback: 50
      snap_create: 50
      snap_remove: 50
      write: 100
    ops: 4000
teuthology_branch: master
tube: plana
verbose: false
worker_log: /var/lib/teuthworker/archive/worker_logs/worker.plana.12155
client.0-kernel-sha1: 8362a1290d075f376ba68521ffb3b42ecaaecfea
description: upgrade/dumpling-x/stress-split/{0-cluster/start.yaml 1-dumpling-install/dumpling.yaml
  2-partial-upgrade/firsthalf.yaml 3-thrash/default.yaml 4-mon/mona.yaml 5-workload/rbd-cls.yaml
  6-next-mon/monb.yaml 7-workload/rados_api_tests.yaml 8-next-mon/monc.yaml 9-workload/{rados_api_tests.yaml
  rbd-python.yaml rgw-s3tests.yaml snaps-many-objects.yaml} distros/ubuntu_14.04.yaml}
duration: 6957.407259941101
failure_reason: 'Command failed on plana57 with status 1: ''sudo adjust-ulimits ceph-coverage
  /home/ubuntu/cephtest/archive/coverage daemon-helper kill ceph-mds -f -i a'''
flavor: basic
mon.a-kernel-sha1: 8362a1290d075f376ba68521ffb3b42ecaaecfea
osd.3-kernel-sha1: 8362a1290d075f376ba68521ffb3b42ecaaecfea
owner: yuriw
success: false
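The abort originates in `MDSMap::decode` (frame 9 of the backtrace): the dumpling-era MDS receives an MDSMap encoded by a newer peer, the decoder rejects the unrecognized struct version by throwing `ceph::buffer::malformed_input` (see the `what(): ... unknown encoding version > 4` line in comment #4 below), and since nothing on the dispatch thread catches it, `__verbose_terminate_handler` runs and the process aborts. A minimal sketch of that version guard, with hypothetical names rather than Ceph's actual macros:

```python
# Hypothetical stand-in for the DECODE_START guard: a decoder built
# against struct version N must refuse payloads encoded as version > N,
# because it cannot know the layout of fields it has never seen.

MAX_UNDERSTOOD_VERSION = 4  # what a dumpling-era MDSMap decoder knows


class MalformedInput(Exception):
    """Stands in for ceph::buffer::malformed_input."""


def decode_mdsmap(struct_v, payload):
    if struct_v > MAX_UNDERSTOOD_VERSION:
        raise MalformedInput(
            "unknown encoding version > %d" % MAX_UNDERSTOOD_VERSION)
    # ... decode the fields defined for struct_v ...
    return {"version": struct_v, "payload": payload}


# A dumpling-encoded map decodes fine; a newer (e.g. firefly) one does not.
decode_mdsmap(4, b"...")
try:
    decode_mdsmap(5, b"...")
except MalformedInput:
    # In the crashing MDS this exception propagated uncaught out of the
    # dispatch thread, so the C++ runtime called terminate() -> SIGABRT,
    # matching frames 3-8 of the backtrace above.
    pass
```

This is why the crash appears only during mixed-version upgrades: a homogeneous cluster never hands an old decoder a newer encoding.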
Actions #1

Updated by Sage Weil almost 10 years ago

  • Project changed from Ceph to CephFS
  • Priority changed from Normal to High
Actions #2

Updated by Loïc Dachary over 9 years ago

Looks like a similar problem at upgrade:firefly-x:stress-split

2014-08-08T10:17:01.182 INFO:tasks.radosbench.radosbench.0.vpm182.stdout:   305      16        84        68  0.729339         0         -   24.8856
2014-08-08T10:17:03.110 INFO:tasks.radosbench.radosbench.0.vpm182.stdout:   306      16        84        68  0.725589         0         -   24.8856
2014-08-08T10:17:03.616 INFO:tasks.thrashosds.ceph_manager:no progress seen, keeping timeout for now
2014-08-08T10:17:03.617 INFO:teuthology.orchestra.run.vpm180:Running: 'adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph pg dump --format=json'
2014-08-08T10:17:04.179 INFO:tasks.radosbench.radosbench.0.vpm182.stdout:   307      16        84        68  0.723525         0         -   24.8856
2014-08-08T10:17:05.179 INFO:tasks.radosbench.radosbench.0.vpm182.stdout:   308      16        84        68  0.721606         0         -   24.8856
2014-08-08T10:17:06.210 INFO:tasks.radosbench.radosbench.0.vpm182.stdout:   309      16        84        68  0.719637         0         -   24.8856
2014-08-08T10:17:06.830 INFO:teuthology.orchestra.run.vpm180.stderr:dumped all in format json
2014-08-08T10:17:07.754 INFO:tasks.radosbench.radosbench.0.vpm182.stdout:   310      16        84        68  0.716708         0         -   24.8856
2014-08-08T10:17:08.777 INFO:tasks.ceph.mon.c.vpm181.stderr:     0> 2014-08-08 17:16:39.477530 7f903999a700 -1 *** Caught signal (Aborted) **
2014-08-08T10:17:08.778 INFO:tasks.ceph.mon.c.vpm181.stderr: in thread 7f903999a700
2014-08-08T10:17:08.778 INFO:tasks.ceph.mon.c.vpm181.stderr:
2014-08-08T10:17:08.778 INFO:tasks.ceph.mon.c.vpm181.stderr: ceph version 0.80.5-9-gb65cef6 (b65cef678777c1b87d25385595bf0df96168703e)
2014-08-08T10:17:08.778 INFO:tasks.ceph.mon.c.vpm181.stderr: 1: ceph-mon() [0x862b0f]
2014-08-08T10:17:08.779 INFO:tasks.ceph.mon.c.vpm181.stderr: 2: (()+0x10340) [0x7f903f127340]
2014-08-08T10:17:08.779 INFO:tasks.ceph.mon.c.vpm181.stderr: 3: (gsignal()+0x39) [0x7f903d7cef89]
2014-08-08T10:17:08.779 INFO:tasks.ceph.mon.c.vpm181.stderr: 4: (abort()+0x148) [0x7f903d7d2398]
2014-08-08T10:17:08.779 INFO:tasks.ceph.mon.c.vpm181.stderr: 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f903e0da6b5]
2014-08-08T10:17:08.779 INFO:tasks.ceph.mon.c.vpm181.stderr: 6: (()+0x5e836) [0x7f903e0d8836]
2014-08-08T10:17:08.779 INFO:tasks.ceph.mon.c.vpm181.stderr: 7: (()+0x5e863) [0x7f903e0d8863]
2014-08-08T10:17:08.779 INFO:tasks.ceph.mon.c.vpm181.stderr: 8: (()+0x5eaa2) [0x7f903e0d8aa2]
2014-08-08T10:17:08.779 INFO:tasks.ceph.mon.c.vpm181.stderr: 9: (MDSMap::decode(ceph::buffer::list::iterator&)+0xc7c) [0x6c354c]
2014-08-08T10:17:08.780 INFO:tasks.ceph.mon.c.vpm181.stderr: 10: (MDSMonitor::update_from_paxos(bool*)+0x645) [0x5fbb35]
2014-08-08T10:17:08.780 INFO:tasks.ceph.mon.c.vpm181.stderr: 11: (PaxosService::refresh(bool*)+0x19a) [0x5a2ada]
2014-08-08T10:17:08.780 INFO:tasks.ceph.mon.c.vpm181.stderr: 12: (Monitor::refresh_from_paxos(bool*)+0x6f) [0x54168f]
2014-08-08T10:17:08.780 INFO:tasks.ceph.mon.c.vpm181.stderr: 13: (Paxos::do_refresh()+0x24) [0x590354]
2014-08-08T10:17:08.780 INFO:tasks.ceph.mon.c.vpm181.stderr: 14: (Paxos::handle_commit(MMonPaxos*)+0x1b9) [0x5966f9]
2014-08-08T10:17:08.780 INFO:tasks.ceph.mon.c.vpm181.stderr: 15: (Paxos::dispatch(PaxosServiceMessage*)+0x1cb) [0x59d8ab]
2014-08-08T10:17:08.781 INFO:tasks.ceph.mon.c.vpm181.stderr: 16: (Monitor::dispatch(MonSession*, Message*, bool)+0x5a6) [0x571736]
2014-08-08T10:17:08.781 INFO:tasks.ceph.mon.c.vpm181.stderr: 17: (Monitor::_ms_dispatch(Message*)+0x215) [0x571ad5]
2014-08-08T10:17:08.781 INFO:tasks.ceph.mon.c.vpm181.stderr: 18: (Monitor::ms_dispatch(Message*)+0x20) [0x58f780]
2014-08-08T10:17:08.781 INFO:tasks.ceph.mon.c.vpm181.stderr: 19: (DispatchQueue::entry()+0x57a) [0x830eaa]
2014-08-08T10:17:08.781 INFO:tasks.ceph.mon.c.vpm181.stderr: 20: (DispatchQueue::DispatchThread::entry()+0xd) [0x74937d]
2014-08-08T10:17:08.781 INFO:tasks.ceph.mon.c.vpm181.stderr: 21: (()+0x8182) [0x7f903f11f182]
2014-08-08T10:17:08.781 INFO:tasks.ceph.mon.c.vpm181.stderr: 22: (clone()+0x6d) [0x7f903d89338d]
2014-08-08T10:17:08.781 INFO:tasks.ceph.mon.c.vpm181.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2014-08-08T10:17:08.782 INFO:tasks.ceph.mon.c.vpm181.stderr:
2014-08-08T10:17:09.297 INFO:tasks.radosbench.radosbench.0.vpm182.stdout:   311      16        84        68  0.713808         0         -   24.8856

Actions #3

Updated by Loïc Dachary over 9 years ago

And the same trace at upgrade:firefly-x:stress-split

2014-08-08T10:31:55.917 INFO:tasks.ceph.mds.a.vpm130.stderr: ceph version 0.80.5-9-gb65cef6 (b65cef678777c1b87d25385595bf0df96168703e)
2014-08-08T10:31:55.917 INFO:tasks.ceph.mds.a.vpm130.stderr: 1: ceph-mds() [0x7f777f]
2014-08-08T10:31:55.917 INFO:tasks.ceph.mds.a.vpm130.stderr: 2: (()+0x10340) [0x7f40b45c8340]
2014-08-08T10:31:55.917 INFO:tasks.ceph.mds.a.vpm130.stderr: 3: (gsignal()+0x39) [0x7f40b2e73f89]
2014-08-08T10:31:55.917 INFO:tasks.ceph.mds.a.vpm130.stderr: 4: (abort()+0x148) [0x7f40b2e77398]
2014-08-08T10:31:55.917 INFO:tasks.ceph.mds.a.vpm130.stderr: 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f40b377f6b5]
2014-08-08T10:31:55.917 INFO:tasks.ceph.mds.a.vpm130.stderr: 6: (()+0x5e836) [0x7f40b377d836]
2014-08-08T10:31:55.918 INFO:tasks.ceph.mds.a.vpm130.stderr: 7: (()+0x5e863) [0x7f40b377d863]
2014-08-08T10:31:55.918 INFO:tasks.ceph.mds.a.vpm130.stderr: 8: (()+0x5eaa2) [0x7f40b377daa2]
2014-08-08T10:31:55.918 INFO:tasks.ceph.mds.a.vpm130.stderr: 9: (MDSMap::decode(ceph::buffer::list::iterator&)+0xc7c) [0x83247c]
2014-08-08T10:31:55.918 INFO:tasks.ceph.mds.a.vpm130.stderr: 10: (MDS::handle_mds_map(MMDSMap*)+0x2f2) [0x58aaa2]
2014-08-08T10:31:55.918 INFO:tasks.ceph.mds.a.vpm130.stderr: 11: (MDS::handle_core_message(Message*)+0xb03) [0x58eed3]
2014-08-08T10:31:55.918 INFO:tasks.ceph.mds.a.vpm130.stderr: 12: (MDS::_dispatch(Message*)+0x32) [0x58f0f2]
2014-08-08T10:31:55.918 INFO:tasks.ceph.mds.a.vpm130.stderr: 13: (MDS::ms_dispatch(Message*)+0xa3) [0x590ad3]
2014-08-08T10:31:55.918 INFO:tasks.ceph.mds.a.vpm130.stderr: 14: (DispatchQueue::entry()+0x57a) [0x99d62a]
2014-08-08T10:31:55.919 INFO:tasks.ceph.mds.a.vpm130.stderr: 15: (DispatchQueue::DispatchThread::entry()+0xd) [0x8beb5d]
2014-08-08T10:31:55.919 INFO:tasks.ceph.mds.a.vpm130.stderr: 16: (()+0x8182) [0x7f40b45c0182]
2014-08-08T10:31:55.919 INFO:tasks.ceph.mds.a.vpm130.stderr: 17: (clone()+0x6d) [0x7f40b2f3838d]
2014-08-08T10:31:55.919 INFO:tasks.ceph.mds.a.vpm130.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Actions #4

Updated by Loïc Dachary over 9 years ago

Another similar crash

2014-08-08T10:04:38.098 INFO:tasks.rados.rados.0.vpm184.stdout:190:  writing vpm1845876-190 from 600791 to 600792 tid 1
2014-08-08T10:04:38.098 INFO:tasks.rados.rados.0.vpm184.stdout: waiting on 16
2014-08-08T10:04:38.428 INFO:tasks.rados.rados.0.vpm184.stdout:186:  finishing write tid 1 to vpm1845876-186
2014-08-08T10:04:38.428 INFO:tasks.rados.rados.0.vpm184.stdout:186:  finishing write tid 2 to vpm1845876-186
2014-08-08T10:04:38.432 INFO:tasks.rados.rados.0.vpm184.stdout:186:  finishing write tid 3 to vpm1845876-186
2014-08-08T10:04:38.433 INFO:tasks.rados.rados.0.vpm184.stdout:186:  finishing write tid 5 to vpm1845876-186
2014-08-08T10:04:39.211 INFO:tasks.ceph.mon.c.vpm183.stderr:terminate called after throwing an instance of 'ceph::buffer::malformed_input'
2014-08-08T10:04:39.211 INFO:tasks.ceph.mon.c.vpm183.stderr:  what():  buffer::malformed_input: __PRETTY_FUNCTION__ unknown encoding version > 4
2014-08-08T10:04:39.212 INFO:tasks.ceph.mon.c.vpm183.stderr:*** Caught signal (Aborted) **
2014-08-08T10:04:39.212 INFO:tasks.ceph.mon.c.vpm183.stderr: in thread 7f5aaa140700
2014-08-08T10:04:39.441 INFO:tasks.rados.rados.0.vpm184.stdout:175:  finishing write tid 3 to vpm1845876-175
2014-08-08T10:04:39.915 INFO:tasks.ceph.mon.c.vpm183.stderr: ceph version 0.80.5-9-gb65cef6 (b65cef678777c1b87d25385595bf0df96168703e)
2014-08-08T10:04:39.915 INFO:tasks.ceph.mon.c.vpm183.stderr: 1: ceph-mon() [0x862b0f]
2014-08-08T10:04:39.915 INFO:tasks.ceph.mon.c.vpm183.stderr: 2: (()+0x10340) [0x7f5aaf8cd340]
2014-08-08T10:04:39.915 INFO:tasks.ceph.mon.c.vpm183.stderr: 3: (gsignal()+0x39) [0x7f5aadf74f89]
2014-08-08T10:04:39.916 INFO:tasks.ceph.mon.c.vpm183.stderr: 4: (abort()+0x148) [0x7f5aadf78398]
2014-08-08T10:04:39.916 INFO:tasks.ceph.mon.c.vpm183.stderr: 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f5aae8806b5]
2014-08-08T10:04:39.916 INFO:tasks.ceph.mon.c.vpm183.stderr: 6: (()+0x5e836) [0x7f5aae87e836]
2014-08-08T10:04:39.916 INFO:tasks.ceph.mon.c.vpm183.stderr: 7: (()+0x5e863) [0x7f5aae87e863]
2014-08-08T10:04:39.916 INFO:tasks.ceph.mon.c.vpm183.stderr: 8: (()+0x5eaa2) [0x7f5aae87eaa2]
2014-08-08T10:04:39.916 INFO:tasks.ceph.mon.c.vpm183.stderr: 9: (MDSMap::decode(ceph::buffer::list::iterator&)+0xc7c) [0x6c354c]
2014-08-08T10:04:39.916 INFO:tasks.ceph.mon.c.vpm183.stderr: 10: (MDSMonitor::update_from_paxos(bool*)+0x645) [0x5fbb35]
2014-08-08T10:04:39.916 INFO:tasks.ceph.mon.c.vpm183.stderr: 11: (PaxosService::refresh(bool*)+0x19a) [0x5a2ada]
2014-08-08T10:04:39.917 INFO:tasks.ceph.mon.c.vpm183.stderr: 12: (Monitor::refresh_from_paxos(bool*)+0x6f) [0x54168f]
2014-08-08T10:04:39.917 INFO:tasks.ceph.mon.c.vpm183.stderr: 13: (Paxos::do_refresh()+0x24) [0x590354]
2014-08-08T10:04:39.917 INFO:tasks.ceph.mon.c.vpm183.stderr: 14: (Paxos::handle_commit(MMonPaxos*)+0x1b9) [0x5966f9]
2014-08-08T10:04:39.917 INFO:tasks.ceph.mon.c.vpm183.stderr: 15: (Paxos::dispatch(PaxosServiceMessage*)+0x1cb) [0x59d8ab]
2014-08-08T10:04:39.917 INFO:tasks.ceph.mon.c.vpm183.stderr: 16: (Monitor::dispatch(MonSession*, Message*, bool)+0x5a6) [0x571736]
2014-08-08T10:04:39.917 INFO:tasks.ceph.mon.c.vpm183.stderr: 17: (Monitor::_ms_dispatch(Message*)+0x215) [0x571ad5]
2014-08-08T10:04:39.917 INFO:tasks.ceph.mon.c.vpm183.stderr: 18: (Monitor::ms_dispatch(Message*)+0x20) [0x58f780]
2014-08-08T10:04:39.917 INFO:tasks.ceph.mon.c.vpm183.stderr: 19: (DispatchQueue::entry()+0x57a) [0x830eaa]
2014-08-08T10:04:39.918 INFO:tasks.ceph.mon.c.vpm183.stderr: 20: (DispatchQueue::DispatchThread::entry()+0xd) [0x74937d]
2014-08-08T10:04:39.918 INFO:tasks.ceph.mon.c.vpm183.stderr: 21: (()+0x8182) [0x7f5aaf8c5182]
2014-08-08T10:04:39.918 INFO:tasks.ceph.mon.c.vpm183.stderr: 22: (clone()+0x6d) [0x7f5aae03938d]
2014-08-08T10:04:39.918 INFO:tasks.ceph.mon.c.vpm183.stderr:2014-08-08 17:04:39.915353 7f5aaa140700 -1 *** Caught signal (Aborted) **
2014-08-08T10:04:39.918 INFO:tasks.ceph.mon.c.vpm183.stderr: in thread 7f5aaa140700

Actions #6

Updated by Sage Weil over 9 years ago

  • Priority changed from High to Urgent
Actions #7

Updated by Sage Weil over 9 years ago

We probably have to do a re-encoding trick like we do in MOSDMap?
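The idea behind that trick, sketched below with hypothetical names (this is not Ceph's actual code): the sender inspects the peer's negotiated feature bits before transmitting a map, and for peers lacking the new feature it re-encodes the map in the legacy format, so an old decoder never sees a struct version it cannot handle.

```python
# Hypothetical feature bit advertised by binaries that understand the
# newer MDSMap encoding; old daemons do not set it.
FEATURE_NEW_MDSMAP = 1 << 10


def encode_mdsmap_for_peer(mdsmap, peer_features):
    """Pick an encoding the peer can decode, MOSDMap-style."""
    if peer_features & FEATURE_NEW_MDSMAP:
        # Peer understands the current format: send everything.
        return ("v5", dict(mdsmap))
    # Old peer: drop fields the legacy format cannot carry and emit
    # the older struct version, trading information for compatibility.
    legacy = {k: v for k, v in mdsmap.items() if k != "new_fields"}
    return ("v4", legacy)
```

Under this scheme the upgraded mon would have sent the dumpling MDS a v4-encoded map instead of the v5 one that tripped the `unknown encoding version` check.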

Actions #8

Updated by Sage Weil over 9 years ago

  • Status changed from New to Fix Under Review
  • Assignee set to John Spray
Actions #9

Updated by Sage Weil over 9 years ago

  • Status changed from Fix Under Review to Resolved