Bug #9788


"Assertion: common/HeartbeatMap.cc: 79" placeholder for "hit suicide timeout" issues

Added by Yuri Weinstein over 9 years ago. Updated about 9 years ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Q/A
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Logs are in http://qa-proxy.ceph.com/teuthology/teuthology-2014-10-13_19:30:01-upgrade:dumpling-firefly-x:stress-split-giant-distro-basic-multi/546345/

Error from 'scrape':

Assertion: common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")
ceph version 0.67.11-22-gddc8a82 (ddc8a827d1baabc0bcb1df9ded37edc9820d8cac)
 1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x107) [0x816bb7]
 2: (ceph::HeartbeatMap::reset_timeout(ceph::heartbeat_handle_d*, long, long)+0x8e) [0x81705e]
 3: (ThreadPool::worker(ThreadPool::WorkThread*)+0x471) [0x8b6ae1]
 4: (ThreadPool::WorkThread::entry()+0x10) [0x8b8b70]
 5: (()+0x7e9a) [0x7f8b876b5e9a]
 6: (clone()+0x6d) [0x7f8b859a53fd]
['546345']
2014-10-15T00:16:27.453 ERROR:teuthology.run_tasks:Manager failed: radosbench
Traceback (most recent call last):
  File "/home/teuthworker/src/teuthology_master/teuthology/run_tasks.py", line 117, in run_tasks
    suppress = manager.__exit__(*exc_info)
  File "/usr/lib/python2.7/contextlib.py", line 35, in __exit__
    self.gen.throw(type, value, traceback)
  File "/var/lib/teuthworker/src/ceph-qa-suite_giant/tasks/radosbench.py", line 92, in task
    run.wait(radosbench.itervalues(), timeout=timeout)
  File "/home/teuthworker/src/teuthology_master/teuthology/orchestra/run.py", line 381, in wait
    check_time()
  File "/home/teuthworker/src/teuthology_master/teuthology/contextutil.py", line 127, in __call__
    raise MaxWhileTries(error_msg)
MaxWhileTries: reached maximum tries (1500) after waiting for 9000 seconds
archive_path: /var/lib/teuthworker/archive/teuthology-2014-10-13_19:30:01-upgrade:dumpling-firefly-x:stress-split-giant-distro-basic-multi/546345
branch: giant
description: upgrade:dumpling-firefly-x:stress-split/{00-cluster/start.yaml 01-dumpling-install/dumpling.yaml
  02-partial-upgrade-firefly/firsthalf.yaml 03-thrash/default.yaml 04-mona-upgrade-firefly/mona.yaml
  05-workload/rbd-cls.yaml 06-monb-upgrade-firefly/monb.yaml 07-workload/radosbench.yaml
  08-monc-upgrade-firefly/monc.yaml 09-workload/{rbd-python.yaml rgw-s3tests.yaml}
  10-osds-upgrade-firefly/secondhalf.yaml 11-workload/snaps-few-objects.yaml 12-partial-upgrade-x/first.yaml
  13-thrash/default.yaml 14-mona-upgrade-x/mona.yaml 15-workload/rbd-import-export.yaml
  16-monb-upgrade-x/monb.yaml 17-workload/readwrite.yaml 18-monc-upgrade-x/monc.yaml
  19-workload/radosbench.yaml 20-osds-upgrade-x/osds_secondhalf.yaml 21-final-workload/rados_stress_watch.yaml
  distros/ubuntu_12.04.yaml}
email: ceph-qa@ceph.com
job_id: '546345'
kernel: &id001
  kdb: true
  sha1: distro
last_in_suite: false
machine_type: plana,burnupi,mira
name: teuthology-2014-10-13_19:30:01-upgrade:dumpling-firefly-x:stress-split-giant-distro-basic-multi
nuke-on-error: true
os_type: ubuntu
os_version: '12.04'
overrides:
  admin_socket:
    branch: giant
  ceph:
    conf:
      mon:
        debug mon: 20
        debug ms: 1
        debug paxos: 20
        mon warn on legacy crush tunables: false
      osd:
        debug filestore: 20
        debug journal: 20
        debug ms: 1
        debug osd: 20
    log-whitelist:
    - slow request
    - wrongly marked me down
    - objects unfound and apparently lost
    - log bound mismatch
    - wrongly marked me down
    - objects unfound and apparently lost
    - log bound mismatch
    sha1: 674781960b8856ae684520c3b0e9a6b8c2bc7bec
  ceph-deploy:
    branch:
      dev: giant
    conf:
      client:
        log file: /var/log/ceph/ceph-$name.$pid.log
      mon:
        debug mon: 1
        debug ms: 20
        debug paxos: 20
        osd default pool size: 2
  install:
    ceph:
      sha1: 674781960b8856ae684520c3b0e9a6b8c2bc7bec
  s3tests:
    branch: giant
  workunit:
    sha1: 674781960b8856ae684520c3b0e9a6b8c2bc7bec
owner: scheduled_teuthology@teuthology
priority: 1000
roles:
- - mon.a
  - mon.b
  - mds.a
  - osd.0
  - osd.1
  - osd.2
  - mon.c
- - osd.3
  - osd.4
  - osd.5
- - client.0
suite: upgrade:dumpling-firefly-x:stress-split
suite_branch: giant
suite_path: /var/lib/teuthworker/src/ceph-qa-suite_giant
targets:
  ubuntu@mira076.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCnEuRqBgd2DrRVNhCSPfldQUUJ4HeQKbPVLUtbC8wNRrv2Nk9ZuVUn5cb1LBJQreJM/p17q4fIO8bZyApZ6RZu+Q9pW70WIE3U+Z6xtINgi9xq6/mqnMauuqkDYiePhR9CDCbVVfBp/zVDOJVeCdV9TG5AZ0Xt2YciQkaVmmvxdRr4v5zhdw6vDumnfZsI5K+J0p2hII8e2HUrUkMTVKO0mu1rXzIqGQFOSArPTfCLAOgQfUG5s/e6QMC4NI+BOy2cVp/8yCzKv6FPDDvdEknmLh9tQ9HbS8SyOGPtdj9wfoIKo7UbOnJiDSu2KOliyljEB3YUTrzNClM7W/pWpobV
  ubuntu@plana15.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDA0GHi/HAXxKVnAdyh6NHBoqEq2Qk7z6Hb3SFt5+mljUWThTAkVPTf4QdpSshH/D+5v4VJHXp7lHYhZZJCS50z3w+af8cmREqwUgnA0zEjKKaXaVIdkAfDkh7LH3vllIGah3PlMPKF6njfvuocJ1pr1QneCLTmbHVCYsdWTGgRW7te1fn7vhXDJbGZMumHL5k/HO7iRDaw9cNuozWuqI5/d8UwdvQ/rhhbSKNef3w2hh2C4CU/nCkOGXFVyJZdo2pSJ2k/jBcPWSh+V3qNtIpthDqzTDmmpD8BFdW9MXxO5pfFRDsInWdgTsxZOrWtPuQy9+an20KbU2N5F4JoQX6N
  ubuntu@plana78.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC8m+86JHGSyRkSWj9p/K6JUbRcPjB7TtLZ9OBudXAGZNgReiOJoCU5kkpwejl0uXXCOHe/DB/bH81JCQbqY3XCJjU5JZ1wBsL/owaErPSfbbaouNV2k1FQjiSXYtPzx+qwEOeOZtEBPQ4p04npai6NzPLX43OGx/UiAwpyEGfVxZedmci0VBtC7QdCQkP3sNJqSxFYdoVGjU5jv6BarPqV8LM4v00f8TmD1GdP51bfLGSKii6UU1IKXXR78ifb+9QUX4p/Clkl6Qgz8CJ70Iu+mcBZclJaGoAyuoKBhXE2oi2W1cQVquPqloxbN+VbbjoOL5OHbGg2euxyohZhgJaF
tasks:
- internal.lock_machines:
  - 3
  - plana,burnupi,mira
- internal.save_config: null
- internal.check_lock: null
- internal.connect: null
- internal.push_inventory: null
- internal.serialize_remote_roles: null
- internal.check_conflict: null
- internal.check_ceph_data: null
- internal.vm_setup: null
- kernel: *id001
- internal.base: null
- internal.archive: null
- internal.coredump: null
- internal.sudo: null
- internal.syslog: null
- internal.timer: null
- chef: null
- clock.check: null
- install:
    branch: dumpling
- ceph:
    fs: xfs
- install.upgrade:
    osd.0:
      branch: firefly
- ceph.restart:
    daemons:
    - osd.0
    - osd.1
    - osd.2
- thrashosds:
    chance_pgnum_grow: 1
    chance_pgpnum_fix: 1
    thrash_primary_affinity: false
    timeout: 1200
- ceph.restart:
    daemons:
    - mon.a
    wait-for-healthy: false
    wait-for-osds-up: true
- workunit:
    branch: dumpling
    clients:
      client.0:
      - cls/test_cls_rbd.sh
- ceph.restart:
    daemons:
    - mon.b
    wait-for-healthy: false
    wait-for-osds-up: true
- radosbench:
    clients:
    - client.0
    time: 1800
- install.upgrade:
    mon.c: null
- ceph.restart:
    daemons:
    - mon.c
    wait-for-healthy: false
    wait-for-osds-up: true
- ceph.wait_for_mon_quorum:
  - a
  - b
  - c
- workunit:
    clients:
      client.0:
      - rbd/test_librbd_python.sh
- rgw:
    client.0: null
    default_idle_timeout: 300
- s3tests:
    client.0:
      rgw_server: client.0
- install.upgrade:
    osd.3:
      branch: firefly
- ceph.restart:
    daemons:
    - osd.3
    - osd.4
    - osd.5
- rados:
    clients:
    - client.0
    objects: 50
    op_weights:
      delete: 50
      read: 100
      rollback: 50
      snap_create: 50
      snap_remove: 50
      write: 100
    ops: 4000
- install.upgrade:
    osd.0: null
- ceph.restart:
    daemons:
    - osd.0
    - osd.1
    - osd.2
- thrashosds:
    chance_pgnum_grow: 1
    chance_pgpnum_fix: 1
    thrash_primary_affinity: false
    timeout: 1200
- ceph.restart:
    daemons:
    - mon.a
    wait-for-healthy: false
    wait-for-osds-up: true
- workunit:
    clients:
      client.0:
      - rbd/import_export.sh
    env:
      RBD_CREATE_ARGS: --new-format
- ceph.restart:
    daemons:
    - mon.b
    wait-for-healthy: false
    wait-for-osds-up: true
- rados:
    clients:
    - client.0
    objects: 500
    op_weights:
      delete: 10
      read: 45
      write: 45
    ops: 4000
- ceph.restart:
    daemons:
    - mon.c
    wait-for-healthy: false
    wait-for-osds-up: true
- ceph.wait_for_mon_quorum:
  - a
  - b
  - c
- radosbench:
    clients:
    - client.0
    time: 1800
- install.upgrade:
    osd.3: null
- ceph.restart:
    daemons:
    - osd.3
    - osd.4
    - osd.5
- workunit:
    clients:
      client.0:
      - rados/stress_watch.sh
teuthology_branch: master
tube: multi
verbose: true
worker_log: /var/lib/teuthworker/archive/worker_logs/worker.multi.3124
description: upgrade:dumpling-firefly-x:stress-split/{00-cluster/start.yaml 01-dumpling-install/dumpling.yaml
  02-partial-upgrade-firefly/firsthalf.yaml 03-thrash/default.yaml 04-mona-upgrade-firefly/mona.yaml
  05-workload/rbd-cls.yaml 06-monb-upgrade-firefly/monb.yaml 07-workload/radosbench.yaml
  08-monc-upgrade-firefly/monc.yaml 09-workload/{rbd-python.yaml rgw-s3tests.yaml}
  10-osds-upgrade-firefly/secondhalf.yaml 11-workload/snaps-few-objects.yaml 12-partial-upgrade-x/first.yaml
  13-thrash/default.yaml 14-mona-upgrade-x/mona.yaml 15-workload/rbd-import-export.yaml
  16-monb-upgrade-x/monb.yaml 17-workload/readwrite.yaml 18-monc-upgrade-x/monc.yaml
  19-workload/radosbench.yaml 20-osds-upgrade-x/osds_secondhalf.yaml 21-final-workload/rados_stress_watch.yaml
  distros/ubuntu_12.04.yaml}
duration: 20570.046264886856
failure_reason: 'Command failed on plana78 with status 124: ''mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp
  && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1
  CEPH_REF=674781960b8856ae684520c3b0e9a6b8c2bc7bec TESTDIR="/home/ubuntu/cephtest" 
  CEPH_ID="0" adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage
  timeout 3h /home/ubuntu/cephtest/workunit.client.0/rbd/test_librbd_python.sh'''
flavor: basic
owner: scheduled_teuthology@teuthology
success: false
Actions #1

Updated by Yuri Weinstein over 9 years ago

suite:upgrade:dumpling
run: http://pulpito.front.sepia.ceph.com/teuthology-2014-10-14_17:00:01-upgrade:dumpling-dumpling-distro-basic-vps/

Job: http://qa-proxy.ceph.com/teuthology/teuthology-2014-10-14_17:00:01-upgrade:dumpling-dumpling-distro-basic-vps/548219/

Assertion: common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")
ceph version 0.67.11-22-gddc8a82 (ddc8a827d1baabc0bcb1df9ded37edc9820d8cac)
 1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x107) [0x816bb7]
 2: (ceph::HeartbeatMap::is_healthy()+0xa7) [0x817567]
 3: (ceph::HeartbeatMap::check_touch_file()+0x23) [0x817b13]
 4: (CephContextServiceThread::entry()+0x55) [0x8d3e65]
 5: (()+0x7e9a) [0x7fa3343cde9a]
 6: (clone()+0x6d) [0x7fa3326d631d]
['548219']
Actions #2

Updated by Samuel Just over 9 years ago

  • Status changed from New to Rejected

Two OSDs, both on mira076, timed out:
osd.5: a stat in the op_tp took 3 minutes (completed, surprisingly, right before the suicide)

2014-10-14 19:12:42.233398 - 2014-10-14 19:15:17.213734

osd.3: a flush took 3 minutes (also, weirdly, completed right before the suicide)

2014-10-14 19:12:33.548599 - 2014-10-14 19:15:17.352885

I think it's safe to blame this on something environmental.
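To make the failure mode concrete, here is a minimal, self-contained C++ sketch of the worker-side pattern implied by the first backtrace (_check <- reset_timeout <- ThreadPool::worker). The names mirror Ceph's HeartbeatMap for readability, but the code and timings are illustrative assumptions, not the real implementation: a worker re-arms its handle before each work item, so a stall like the 3-minute stat above is only detected at the next re-arm, in the worker's own stack.

#include <cassert>
#include <chrono>
#include <thread>

using Clock = std::chrono::steady_clock;

// One handle per worker thread; only the hard (suicide) deadline is modelled.
struct heartbeat_handle_d {
    Clock::time_point suicide_timeout{};   // unset until the worker arms it
};

// Rough analogue of HeartbeatMap::_check() in the backtraces: if the deadline
// is armed and has already passed, abort the whole process.
void check(const heartbeat_handle_d& h, Clock::time_point now) {
    if (h.suicide_timeout != Clock::time_point{} && now > h.suicide_timeout)
        assert(0 == "hit suicide timeout");   // common/HeartbeatMap.cc:79 in Ceph
}

// Rough analogue of HeartbeatMap::reset_timeout(): called by the worker before
// each work item, so a stall in the *previous* item trips the assert here.
void reset_timeout(heartbeat_handle_d& h, std::chrono::seconds suicide_grace) {
    auto now = Clock::now();
    check(h, now);
    h.suicide_timeout = now + suicide_grace;
}

int main() {
    heartbeat_handle_d h;
    // Tiny grace so the demo finishes quickly; real OSD op threads use values
    // on the order of minutes.
    reset_timeout(h, std::chrono::seconds(2));

    // Stand-in for a stat/flush that blocks for longer than the grace:
    std::this_thread::sleep_for(std::chrono::seconds(3));

    // The next re-arm notices the expired deadline and aborts.
    reset_timeout(h, std::chrono::seconds(2));
}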

Actions #3

Updated by Yuri Weinstein over 9 years ago

  • Status changed from Rejected to New

suite:upgrade:firefly-x
next

Run http://pulpito.front.sepia.ceph.com/teuthology-2014-11-03_17:18:01-upgrade:firefly-x-next-distro-basic-vps/

Same issues on two jobs ['584644', '584647']:

Assertion: common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")
ceph version 0.80.7-82-g1a9d000 (1a9d000bb679a7392b9dd115373c3827c9626694)
 1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x12b) [0xa116ab]
 2: (ceph::HeartbeatMap::is_healthy()+0xa7) [0xa11fd7]
 3: (OSD::handle_osd_ping(MOSDPing*)+0x7e8) [0x63c568]
 4: (OSD::heartbeat_dispatch(Message*)+0x563) [0x64e353]
 5: (DispatchQueue::entry()+0x5a2) [0xb64212]
 6: (DispatchQueue::DispatchThread::entry()+0xd) [0xb2fb2d]
 7: (()+0x79d1) [0x7f84148059d1]
 8: (clone()+0x6d) [0x7f841379586d]
['584644', '584647']
Actions #4

Updated by Yuri Weinstein over 9 years ago

  • Assignee set to Samuel Just
Actions #5

Updated by Yuri Weinstein over 9 years ago

Also seeing in run http://pulpito.front.sepia.ceph.com/teuthology-2014-11-04_19:00:01-rados-dumpling-distro-basic-multi/

Job http://pulpito.front.sepia.ceph.com/teuthology-2014-11-04_19:00:01-rados-dumpling-distro-basic-multi/586835/

Assertion: common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")
ceph version 0.67.11-28-gea73bf5 (ea73bf5b6f8d9f2ec04bd2eb9809b62011fd66e0)
1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x107) [0x816bb7]
2: (ceph::HeartbeatMap::reset_timeout(ceph::heartbeat_handle_d*, long, long)+0x8e) [0x81705e]
3: (ThreadPool::worker(ThreadPool::WorkThread*)+0x8bc) [0x8b6f2c]
4: (ThreadPool::WorkThread::entry()+0x10) [0x8b8b70]
5: (()+0x7e9a) [0x7fa23dd06e9a]
6: (clone()+0x6d) [0x7fa23bff53fd]
['586835']

Actions #6

Updated by Samuel Just over 9 years ago

584644 and 584647 were both stuck in sync; probably environmental.

Actions #7

Updated by Samuel Just over 9 years ago

  • Status changed from New to Rejected

2014-11-05 09:29:31.507827 7fa236d5b700 10 filestore(/var/lib/ceph/osd/ceph-3) sync_entry commit took 150.696754, interval was 154.277379
2014-11-05 09:29:31.507842 7fa236d5b700 10 journal commit_finish thru 6538
2014-11-05 09:29:31.507844 7fa236d5b700 5 journal committed_thru 6538 (last_committed_seq 6505)
2014-11-05 09:29:31.507848 7fa236d5b700 10 journal header: block_size 4096 alignment 4096 max_size 104857600
2014-11-05 09:29:31.507850 7fa236d5b700 10 journal header: start 79454208
2014-11-05 09:29:31.507852 7fa236d5b700 10 journal write_pos 101711872
2014-11-05 09:29:31.507853 7fa236d5b700 10 journal committed_thru done
2014-11-05 09:29:31.507901 7fa236d5b700 15 filestore(/var/lib/ceph/osd/ceph-3) sync_entry committed to op_seq 6538
2014-11-05 09:29:31.507904 7fa232d53700 10 filestore(/var/lib/ceph/osd/ceph-3) _set_replay_guard 6553.0.3 done
2014-11-05 09:29:31.507915 7fa236d5b700 20 filestore(/var/lib/ceph/osd/ceph-3) sync_entry waiting for max_interval 5.000000

Another sync took >2 min -- long enough to hit the suicide timeout on threads waiting for the filestore.
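
A second, hedged C++ sketch of the cross-thread side of the same mechanism (again illustrative, not the real HeartbeatMap): when an op thread is blocked behind a long filestore sync like the one above, it never re-arms its handle, and a different thread -- the periodic checker seen as check_touch_file -> is_healthy -> _check in several backtraces here, or the heartbeat dispatcher -- walks the registered handles and trips the assert. That is why the aborting stack varies between reports even though the root cause is the same stall.

#include <cassert>
#include <chrono>
#include <mutex>
#include <thread>
#include <vector>

using Clock = std::chrono::steady_clock;

struct Handle {
    Clock::time_point suicide_timeout{};
};

std::mutex g_lock;
std::vector<Handle*> g_handles;   // all registered worker handles

// Rough analogue of HeartbeatMap::is_healthy() walking the map.
void is_healthy() {
    std::lock_guard<std::mutex> l(g_lock);
    auto now = Clock::now();
    for (Handle* h : g_handles) {
        if (h->suicide_timeout != Clock::time_point{} && now > h->suicide_timeout)
            assert(0 == "hit suicide timeout");   // fires in the checker thread
    }
}

int main() {
    Handle op_thread;
    // The op thread armed a 2 s grace (standing in for minutes in the OSD) and
    // then blocked waiting on the filestore, so it never re-arms it.
    op_thread.suicide_timeout = Clock::now() + std::chrono::seconds(2);
    {
        std::lock_guard<std::mutex> l(g_lock);
        g_handles.push_back(&op_thread);
    }

    // Periodic service thread, like CephContextServiceThread touching the map.
    std::thread watchdog([] {
        for (;;) {
            std::this_thread::sleep_for(std::chrono::seconds(1));
            is_healthy();
        }
    });

    // Stand-in for the >2 minute sync_entry commit from the log above
    // (scaled down); the watchdog aborts the process before this returns.
    std::this_thread::sleep_for(std::chrono::seconds(5));
    watchdog.join();
}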

Actions #8

Updated by Yuri Weinstein over 9 years ago

  • Status changed from Rejected to New

Logs are in http://pulpito.front.sepia.ceph.com/teuthology-2014-11-13_17:33:44-upgrade:giant-x-next-distro-basic-vps/600134/

Assertion: common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")
ceph version 0.87-27-gccfd241 (ccfd2414c68afda55bf4cefa2441ea6d53d87cc6)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0xb8249b]
 2: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x2a9) [0xac0f99]
 3: (ceph::HeartbeatMap::is_healthy()+0xd6) [0xac1826]
 4: (ceph::HeartbeatMap::check_touch_file()+0x17) [0xac1f07]
 5: (CephContextServiceThread::entry()+0x154) [0xb969e4]
 6: (()+0x8182) [0x7fd903a1b182]
 7: (clone()+0x6d) [0x7fd901f85fbd]
['600134']

Actions #9

Updated by Samuel Just over 9 years ago

  • Status changed from New to Rejected

I think this one is the giant messenger deadlock (#9921); updated 9921 and closing this ticket again.

Actions #10

Updated by Yuri Weinstein over 9 years ago

  • Subject changed from "Assertion: common/HeartbeatMap.cc: 79" in upgrade:dumpling-firefly-x:stress-split-giant-distro-basic-multi run to "Assertion: common/HeartbeatMap.cc: 79" placeholder for "hit suicide timeout" issues
  • Status changed from Rejected to New

Logs are in http://qa-proxy.ceph.com/teuthology/teuthology-2014-11-24_17:18:03-upgrade:firefly-x-next-distro-basic-vps/619498/

Assertion: common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")
ceph version 0.88-230-g9ba17a3 (9ba17a321db06d3d76c9295e411c76842194b25c)
 1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x307) [0xa9a5d7]
 2: (ceph::HeartbeatMap::is_healthy()+0xbf) [0xa9ae3f]
 3: (OSD::handle_osd_ping(MOSDPing*)+0x751) [0x66e9b1]
 4: (OSD::heartbeat_dispatch(Message*)+0x42b) [0x66fb5b]
 5: (DispatchQueue::entry()+0x4fa) [0xae387a]
 6: (DispatchQueue::DispatchThread::entry()+0xd) [0xacf62d]
 7: (()+0x79d1) [0x7f86cbc3b9d1]
 8: (clone()+0x6d) [0x7f86ca9c3b6d]
['619498']
Actions #11

Updated by Yuri Weinstein over 9 years ago

One more in run http://pulpito.ceph.com/teuthology-2014-12-01_18:18:01-upgrade:firefly-x-giant-distro-basic-vps/

Logs are in http://qa-proxy.ceph.com/teuthology/teuthology-2014-12-01_18:18:01-upgrade:firefly-x-giant-distro-basic-vps/630056/teuthology.log

Assertion: common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")
ceph version 0.87-40-g65f6814 (65f6814847fe8644f5d77a9021fbf13043b76dbe)
 1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x307) [0xa58007]
 2: (ceph::HeartbeatMap::is_healthy()+0xbf) [0xa5886f]
 3: (ceph::HeartbeatMap::check_touch_file()+0x28) [0xa58e58]
 4: (CephContextServiceThread::entry()+0x136) [0xaa5a76]
 5: (()+0x79d1) [0x7fa6fd07b9d1]
 6: (clone()+0x6d) [0x7fa6fc00b86d]
Actions #12

Updated by Yuri Weinstein about 9 years ago

On VPS again:

Run: http://pulpito.ceph.com/teuthology-2015-01-28_17:05:01-upgrade:giant-x-next-distro-basic-vps/
Job: 728083
Logs: http://qa-proxy.ceph.com/teuthology/teuthology-2015-01-28_17:05:01-upgrade:giant-x-next-distro-basic-vps/728083/

2015-01-28T22:24:44.499 INFO:tasks.rados.rados.0.vpm101.stdout:update_object_version oid 492 v 288 (ObjNum 1343 snap 0 seq_num 1343) dirty exists
2015-01-28T22:24:44.500 INFO:tasks.rados.rados.0.vpm101.stdout:2050:  expect (ObjNum 1106 snap 0 seq_num 1106)
2015-01-28T22:24:45.555 INFO:tasks.ceph.osd.4.vpm011.stderr:common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)' thread 7fbcbd374700 time 2015-01-29 01:24:38.219770
2015-01-28T22:24:45.555 INFO:tasks.ceph.osd.4.vpm011.stderr:common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")
2015-01-28T22:24:46.936 INFO:tasks.rados.rados.0.vpm101.stdout:2047: done (8 left)
2015-01-28T22:24:46.937 INFO:tasks.rados.rados.0.vpm101.stdout:2049: done (7 left)
2015-01-28T22:24:46.937 INFO:tasks.rados.rados.0.vpm101.stdout:2050: done (6 left)
2015-01-28T22:24:46.937 INFO:tasks.rados.rados.0.vpm101.stdout:2051: done (5 left)
2015-01-28T22:24:46.937 INFO:tasks.rados.rados.0.vpm101.stdout:2056: read oid 108 snap -1
2015-01-28T22:24:46.937 INFO:tasks.rados.rados.0.vpm101.stdout:2057: write oid 166 current snap is 0
2015-01-28T22:24:46.937 INFO:tasks.rados.rados.0.vpm101.stdout:2057:  seq_num 1346 ranges {634782=550181,1734880=618064,2634781=1}
2015-01-28T22:24:46.965 INFO:tasks.rados.rados.0.vpm101.stdout:2057:  writing vpm1015289-166 from 634782 to 1184963 tid 1
2015-01-28T22:24:46.983 INFO:tasks.ceph.osd.4.vpm011.stderr: ceph version 0.91-388-g5064787 (50647876971a2fe65a02e4de3c0bc62fec4887c4)
2015-01-28T22:24:46.983 INFO:tasks.ceph.osd.4.vpm011.stderr: 1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x307) [0xbefd97]
2015-01-28T22:24:46.983 INFO:tasks.ceph.osd.4.vpm011.stderr: 2: (ceph::HeartbeatMap::is_healthy()+0xbf) [0xbf05ff]
2015-01-28T22:24:46.984 INFO:tasks.ceph.osd.4.vpm011.stderr: 3: (ceph::HeartbeatMap::check_touch_file()+0x28) [0xbf0be8]
2015-01-28T22:24:46.984 INFO:tasks.ceph.osd.4.vpm011.stderr: 4: (CephContextServiceThread::entry()+0x136) [0xa3ff06]
2015-01-28T22:24:46.984 INFO:tasks.ceph.osd.4.vpm011.stderr: 5: (()+0x79d1) [0x7fbcc1e9a9d1]
2015-01-28T22:24:46.984 INFO:tasks.ceph.osd.4.vpm011.stderr: 6: (clone()+0x6d) [0x7fbcc0c22b6d]
2015-01-28T22:24:46.984 INFO:tasks.ceph.osd.4.vpm011.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2015-01-28T22:24:46.984 INFO:tasks.ceph.osd.4.vpm011.stderr:2015-01-29 01:24:46.975908 7fbcbd374700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)' thread 7fbcbd374700 time 2015-01-29 01:24:38.219770
2015-01-28T22:24:46.984 INFO:tasks.ceph.osd.4.vpm011.stderr:common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")
2015-01-28T22:24:46.984 INFO:tasks.ceph.osd.4.vpm011.stderr:
2015-01-28T22:24:46.985 INFO:tasks.ceph.osd.4.vpm011.stderr: ceph version 0.91-388-g5064787 (50647876971a2fe65a02e4de3c0bc62fec4887c4)
2015-01-28T22:24:46.985 INFO:tasks.ceph.osd.4.vpm011.stderr: 1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x307) [0xbefd97]
2015-01-28T22:24:46.985 INFO:tasks.ceph.osd.4.vpm011.stderr: 2: (ceph::HeartbeatMap::is_healthy()+0xbf) [0xbf05ff]
2015-01-28T22:24:46.985 INFO:tasks.ceph.osd.4.vpm011.stderr: 3: (ceph::HeartbeatMap::check_touch_file()+0x28) [0xbf0be8]
2015-01-28T22:24:46.985 INFO:tasks.ceph.osd.4.vpm011.stderr: 4: (CephContextServiceThread::entry()+0x136) [0xa3ff06]
2015-01-28T22:24:46.985 INFO:tasks.ceph.osd.4.vpm011.stderr: 5: (()+0x79d1) [0x7fbcc1e9a9d1]
2015-01-28T22:24:46.985 INFO:tasks.ceph.osd.4.vpm011.stderr: 6: (clone()+0x6d) [0x7fbcc0c22b6d]
2015-01-28T22:24:46.985 INFO:tasks.ceph.osd.4.vpm011.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2015-01-28T22:24:46.986 INFO:tasks.ceph.osd.4.vpm011.stderr:
2015-01-28T22:24:46.993 INFO:tasks.rados.rados.0.vpm101.stdout:2057:  writing vpm1015289-166 from 1734880 to 2352944 tid 2
2015-01-28T22:24:46.998 INFO:tasks.rados.rados.0.vpm101.stdout:2057:  writing vpm1015289-166 from 2634781 to 2634782 tid 3
2015-01-28T22:24:46.999 INFO:tasks.rados.rados.0.vpm101.stdout:2053:  expect (ObjNum 1301 snap 0 seq_num 1301)
2015-01-28T22:24:47.097 INFO:tasks.ceph.osd.4.vpm011.stderr:     0> 2015-01-29 01:24:46.975908 7fbcbd374700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)' thread 7fbcbd374700 time 2015-01-29 01:24:38.219770
2015-01-28T22:24:47.097 INFO:tasks.ceph.osd.4.vpm011.stderr:common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")
2015-01-28T22:24:47.097 INFO:tasks.ceph.osd.4.vpm011.stderr:
2015-01-28T22:24:47.097 INFO:tasks.ceph.osd.4.vpm011.stderr: ceph version 0.91-388-g5064787 (50647876971a2fe65a02e4de3c0bc62fec4887c4)
2015-01-28T22:24:47.097 INFO:tasks.ceph.osd.4.vpm011.stderr: 1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x307) [0xbefd97]
2015-01-28T22:24:47.097 INFO:tasks.ceph.osd.4.vpm011.stderr: 2: (ceph::HeartbeatMap::is_healthy()+0xbf) [0xbf05ff]
2015-01-28T22:24:47.097 INFO:tasks.ceph.osd.4.vpm011.stderr: 3: (ceph::HeartbeatMap::check_touch_file()+0x28) [0xbf0be8]
2015-01-28T22:24:47.097 INFO:tasks.ceph.osd.4.vpm011.stderr: 4: (CephContextServiceThread::entry()+0x136) [0xa3ff06]
2015-01-28T22:24:47.098 INFO:tasks.ceph.osd.4.vpm011.stderr: 5: (()+0x79d1) [0x7fbcc1e9a9d1]
2015-01-28T22:24:47.098 INFO:tasks.ceph.osd.4.vpm011.stderr: 6: (clone()+0x6d) [0x7fbcc0c22b6d]
2015-01-28T22:24:47.098 INFO:tasks.ceph.osd.4.vpm011.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2015-01-28T22:24:47.098 INFO:tasks.ceph.osd.4.vpm011.stderr:
2015-01-28T22:24:47.622 INFO:tasks.ceph.osd.4.vpm011.stderr:terminate called after throwing an instance of 'ceph::FailedAssertion'
2015-01-28T22:24:47.959 INFO:tasks.ceph.osd.4.vpm011.stderr:*** Caught signal (Aborted) **
2015-01-28T22:24:47.959 INFO:tasks.ceph.osd.4.vpm011.stderr: in thread 7fbcbd374700
2015-01-28T22:24:49.058 INFO:tasks.rados.rados.0.vpm101.stdout:2053: done (6 left)
2015-01-28T22:24:49.058 INFO:tasks.rados.rados.0.vpm101.stdout:2058: delete oid 478 current snap is 0
2015-01-28T22:24:49.058 INFO:tasks.rados.rados.0.vpm101.stdout:2052:  expect (ObjNum 847 snap 0 seq_num 847)
2015-01-28T22:24:49.358 INFO:tasks.ceph.osd.4.vpm011.stderr: ceph version 0.91-388-g5064787 (50647876971a2fe65a02e4de3c0bc62fec4887c4)
2015-01-28T22:24:49.358 INFO:tasks.ceph.osd.4.vpm011.stderr: 1: ceph-osd() [0xa39a55]
2015-01-28T22:24:49.359 INFO:tasks.ceph.osd.4.vpm011.stderr: 2: (()+0xf710) [0x7fbcc1ea2710]
2015-01-28T22:24:49.359 INFO:tasks.ceph.osd.4.vpm011.stderr: 3: (gsignal()+0x35) [0x7fbcc0b6c925]
2015-01-28T22:24:49.359 INFO:tasks.ceph.osd.4.vpm011.stderr: 4: (abort()+0x175) [0x7fbcc0b6e105]
2015-01-28T22:24:49.359 INFO:tasks.ceph.osd.4.vpm011.stderr: 5: (__gnu_cxx::__verbose_terminate_handler()+0x12d) [0x7fbcc1426a5d]
2015-01-28T22:24:49.360 INFO:tasks.ceph.osd.4.vpm011.stderr: 6: (()+0xbcbe6) [0x7fbcc1424be6]
2015-01-28T22:24:49.360 INFO:tasks.ceph.osd.4.vpm011.stderr: 7: (()+0xbcc13) [0x7fbcc1424c13]
2015-01-28T22:24:49.360 INFO:tasks.ceph.osd.4.vpm011.stderr: 8: (()+0xbcd0e) [0x7fbcc1424d0e]
2015-01-28T22:24:49.360 INFO:tasks.ceph.osd.4.vpm011.stderr: 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x57a) [0xb270fa]
2015-01-28T22:24:49.361 INFO:tasks.ceph.osd.4.vpm011.stderr: 10: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x307) [0xbefd97]
2015-01-28T22:24:49.361 INFO:tasks.ceph.osd.4.vpm011.stderr: 11: (ceph::HeartbeatMap::is_healthy()+0xbf) [0xbf05ff]
2015-01-28T22:24:49.361 INFO:tasks.ceph.osd.4.vpm011.stderr: 12: (ceph::HeartbeatMap::check_touch_file()+0x28) [0xbf0be8]
2015-01-28T22:24:49.361 INFO:tasks.ceph.osd.4.vpm011.stderr: 13: (CephContextServiceThread::entry()+0x136) [0xa3ff06]
2015-01-28T22:24:49.362 INFO:tasks.ceph.osd.4.vpm011.stderr: 14: (()+0x79d1) [0x7fbcc1e9a9d1]
2015-01-28T22:24:49.362 INFO:tasks.ceph.osd.4.vpm011.stderr: 15: (clone()+0x6d) [0x7fbcc0c22b6d]
2015-01-28T22:24:49.676 INFO:tasks.ceph.osd.4.vpm011.stderr:2015-01-29 01:24:49.355266 7fbcbd374700 -1 *** Caught signal (Aborted) **
2015-01-28T22:24:49.676 INFO:tasks.ceph.osd.4.vpm011.stderr: in thread 7fbcbd374700
2015-01-28T22:24:49.676 INFO:tasks.ceph.osd.4.vpm011.stderr:
2015-01-28T22:24:49.676 INFO:tasks.ceph.osd.4.vpm011.stderr: ceph version 0.91-388-g5064787 (50647876971a2fe65a02e4de3c0bc62fec4887c4)
2015-01-28T22:24:49.676 INFO:tasks.ceph.osd.4.vpm011.stderr: 1: ceph-osd() [0xa39a55]
2015-01-28T22:24:49.676 INFO:tasks.ceph.osd.4.vpm011.stderr: 2: (()+0xf710) [0x7fbcc1ea2710]
2015-01-28T22:24:49.677 INFO:tasks.ceph.osd.4.vpm011.stderr: 3: (gsignal()+0x35) [0x7fbcc0b6c925]
2015-01-28T22:24:49.677 INFO:tasks.ceph.osd.4.vpm011.stderr: 4: (abort()+0x175) [0x7fbcc0b6e105]
2015-01-28T22:24:49.677 INFO:tasks.ceph.osd.4.vpm011.stderr: 5: (__gnu_cxx::__verbose_terminate_handler()+0x12d) [0x7fbcc1426a5d]
2015-01-28T22:24:49.677 INFO:tasks.ceph.osd.4.vpm011.stderr: 6: (()+0xbcbe6) [0x7fbcc1424be6]
2015-01-28T22:24:49.677 INFO:tasks.ceph.osd.4.vpm011.stderr: 7: (()+0xbcc13) [0x7fbcc1424c13]
2015-01-28T22:24:49.677 INFO:tasks.ceph.osd.4.vpm011.stderr: 8: (()+0xbcd0e) [0x7fbcc1424d0e]
2015-01-28T22:24:49.677 INFO:tasks.ceph.osd.4.vpm011.stderr: 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x57a) [0xb270fa]
2015-01-28T22:24:49.677 INFO:tasks.ceph.osd.4.vpm011.stderr: 10: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x307) [0xbefd97]
2015-01-28T22:24:49.678 INFO:tasks.ceph.osd.4.vpm011.stderr: 11: (ceph::HeartbeatMap::is_healthy()+0xbf) [0xbf05ff]
2015-01-28T22:24:49.678 INFO:tasks.ceph.osd.4.vpm011.stderr: 12: (ceph::HeartbeatMap::check_touch_file()+0x28) [0xbf0be8]
2015-01-28T22:24:49.678 INFO:tasks.ceph.osd.4.vpm011.stderr: 13: (CephContextServiceThread::entry()+0x136) [0xa3ff06]
2015-01-28T22:24:49.678 INFO:tasks.ceph.osd.4.vpm011.stderr: 14: (()+0x79d1) [0x7fbcc1e9a9d1]
2015-01-28T22:24:49.678 INFO:tasks.ceph.osd.4.vpm011.stderr: 15: (clone()+0x6d) [0x7fbcc0c22b6d]
2015-01-28T22:24:49.678 INFO:tasks.ceph.osd.4.vpm011.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Actions #13

Updated by Sage Weil about 9 years ago

  • Status changed from New to Closed

If you ever see this on VPS, it is generally the VM's fault. Let's only reopen this if we see it on bare metal.
