Bug #2784

closed

osd hit suicide timeout

Added by Tamilarasi muthamizhan almost 12 years ago. Updated over 11 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
-
Category:
-
Target version:
-
% Done:
0%

Source:
Q/A

Description

Log: ubuntu@teuthology:/a/teuthology-2012-07-12_19:00:15-regression-master-testing-gcov/10615

ubuntu@teuthology:/a/teuthology-2012-07-12_19:00:15-regression-master-testing-gcov/10615$ zcat /log/osd.2.log.gz

-1> 2012-07-12 20:09:15.281759 7fc217006700  1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7fc20e7f5700' had suicide timed out after 180
0> 2012-07-12 20:09:15.282794 7fc217006700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)' thread 7fc217006700 time 2012-07-12 20:09:15.281772
common/HeartbeatMap.cc: 78: FAILED assert(0 == "hit suicide timeout")
ceph version 0.48argonaut-358-gbcfa573 (commit:bcfa573f5f615f3403ff71da0212cd1cee7e7d9c)
1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x433) [0x9d3ae3]
2: (ceph::HeartbeatMap::is_healthy()+0x8f) [0x9d48cf]
3: (ceph::HeartbeatMap::check_touch_file()+0x2b) [0x9d4cbb]
4: (CephContextServiceThread::entry()+0x6d) [0x92461d]
5: (Thread::_entry_func(void*)+0x12) [0x8f95b2]
6: (()+0x7e9a) [0x7fc21955ee9a]
7: (clone()+0x6d) [0x7fc217b134bd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- end dump of recent events ---
2012-07-12 20:09:15.285177 7fc217006700 -1 *** Caught signal (Aborted) **
in thread 7fc217006700

ceph version 0.48argonaut-358-gbcfa573 (commit:bcfa573f5f615f3403ff71da0212cd1cee7e7d9c)
1: /tmp/cephtest/binary/usr/local/bin/ceph-osd() [0x80eb9a]
2: (()+0xfcb0) [0x7fc219566cb0]
3: (gsignal()+0x35) [0x7fc217a57445]
4: (abort()+0x17b) [0x7fc217a5abab]
5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fc2183a569d]
6: (()+0xb5846) [0x7fc2183a3846]
7: (()+0xb5873) [0x7fc2183a3873]
8: (()+0xb596e) [0x7fc2183a396e]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x385) [0x912275]
10: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x433) [0x9d3ae3]
11: (ceph::HeartbeatMap::is_healthy()+0x8f) [0x9d48cf]
12: (ceph::HeartbeatMap::check_touch_file()+0x2b) [0x9d4cbb]
13: (CephContextServiceThread::entry()+0x6d) [0x92461d]
14: (Thread::_entry_func(void*)+0x12) [0x8f95b2]
15: (()+0x7e9a) [0x7fc21955ee9a]
16: (clone()+0x6d) [0x7fc217b134bd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
0> 2012-07-12 20:09:15.285177 7fc217006700 -1 *** Caught signal (Aborted) **
in thread 7fc217006700

ceph version 0.48argonaut-358-gbcfa573 (commit:bcfa573f5f615f3403ff71da0212cd1cee7e7d9c)
1: /tmp/cephtest/binary/usr/local/bin/ceph-osd() [0x80eb9a]
2: (()+0xfcb0) [0x7fc219566cb0]
3: (gsignal()+0x35) [0x7fc217a57445]
4: (abort()+0x17b) [0x7fc217a5abab]
5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fc2183a569d]
6: (()+0xb5846) [0x7fc2183a3846]
7: (()+0xb5873) [0x7fc2183a3873]
8: (()+0xb596e) [0x7fc2183a396e]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x385) [0x912275]
10: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x433) [0x9d3ae3]
11: (ceph::HeartbeatMap::is_healthy()+0x8f) [0x9d48cf]
12: (ceph::HeartbeatMap::check_touch_file()+0x2b) [0x9d4cbb]
13: (CephContextServiceThread::entry()+0x6d) [0x92461d]
14: (Thread::_entry_func(void*)+0x12) [0x8f95b2]
15: (()+0x7e9a) [0x7fc21955ee9a]
16: (clone()+0x6d) [0x7fc217b134bd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- end dump of recent events ---
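
For readers unfamiliar with the assertion, the check that fires in the trace above can be modelled roughly as follows. This is a simplified Python sketch, not Ceph's C++ implementation: the class and function names are illustrative, and the 60-second grace period in the usage note is an assumed value. Only the 180-second suicide timeout appears in the log.

```python
from dataclasses import dataclass


class SuicideTimeout(Exception):
    """Stands in for the assert that aborts the daemon."""


@dataclass
class Handle:
    # One handle per worker thread (illustrative model, not Ceph's struct).
    name: str
    grace: float = 0.0
    suicide_grace: float = 0.0
    grace_deadline: float = 0.0    # 0 means "not armed"
    suicide_deadline: float = 0.0  # 0 means "not armed"


def check(h, who, now):
    """Return False if past the grace deadline; raise if past the suicide deadline."""
    healthy = True
    if h.grace_deadline and h.grace_deadline < now:
        print(f"heartbeat_map {who} '{h.name}' had timed out after {h.grace:g}")
        healthy = False
    if h.suicide_deadline and h.suicide_deadline < now:
        print(f"heartbeat_map {who} '{h.name}' had suicide timed out after {h.suicide_grace:g}")
        raise SuicideTimeout("hit suicide timeout")
    return healthy


def reset_timeout(h, grace, suicide_grace, now):
    """Called by the worker each time it makes progress: check, then re-arm."""
    check(h, "reset_timeout", now)
    h.grace, h.suicide_grace = grace, suicide_grace
    h.grace_deadline = now + grace
    h.suicide_deadline = now + suicide_grace if suicide_grace else 0.0


def is_healthy(handles, now):
    """Called periodically by a service thread; checks every handle."""
    return all([check(h, "is_healthy", now) for h in handles])
```

In this model, a worker that calls `reset_timeout(h, 60, 180, now=0)` and then stalls is reported unhealthy by `is_healthy` once `now` passes 60, and the process aborts once `now` passes 180, which corresponds to the `had suicide timed out after 180` line followed by the failed assert.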

ubuntu@teuthology:/a/teuthology-2012-07-12_19:00:15-regression-master-testing-gcov/10615$ cat config.yaml
kernel: &id001
  kdb: true
  sha1: ea18acf27e2f7cee4ac9d01719564414d2cd64b5
nuke-on-error: true
overrides:
  ceph:
    coverage: true
    fs: btrfs
    log-whitelist:
    - slow request
    sha1: bcfa573f5f615f3403ff71da0212cd1cee7e7d9c
roles:
- - mon.a
  - osd.0
  - osd.1
  - osd.2
- - mds.a
  - osd.3
  - osd.4
  - osd.5
- - client.0
targets:
: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDx6LtrVRWV3GjiTYgw3lIpuljK4ObcgjFcitU2ZIkJZzBK3DokJ5AvTNhbYOWpo0bJqaiheFM82UEiiqCs6ChgRbSbd++RSfVT9PejPpmPipLB8Bj24xzdrdCqUaQoMNr6J5+h7xcWiCeW/8sDBJIyWVOSO0AGGQnc88HwVExSIuzsfM9ergnQaQcmrCqf6PTrpVZWeBQPqOnWAy+fCka/vD+omclH9cYyLeK/tVTIYHWBns4nLzL0FeQFe8e3uyoCtFzHvOC4ziIKVRv/WpHt+fZq8M1IGUNaQAR1v3x/eKF0ut2EUYTCfR03MAGEjhv0IMX6XloaNjCFeMXsrHh7
: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDo+Kh24vRxeTQ6/n5PIIGuxrPHPRO/xMQlwoLHi7mR01cIXJMG5wet7mp2om3/5SZSDcLBHduDKrdWL142Sg5fC0zZPUggbxS7nz/UCjYBzMsOtHEUAU5Gs0KFopOCHXNEveK95ezsroMAD5+jS/IEpiooYCkrR3H+NSvUU0Ae352PlXqV0vamkYzyQyEMmhFE50ALhUXbKMve3d2mxJee5sqVZSBmQTbze9RKUA96t9iiwiheflXbN1i9WHlbBOIue5pZ5fM3/vqPWgaShfFpa0pT56QKJfjyFcDeCLOislo23E5qKAJOi5vn5BoYVtG3niNQpt/YbYGfDEHVeqt9
: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDEwyNlwC9Utqf3PCjL2JR4wwDkzpdEJuW93DOW82vYVisYEGod454JwXeNkjqzTUk6tXeRoUM9f/C6sZS3LFgHcMYt6m0sxP8DC4qU+q0YxCw9zLY8bXKe4DDjijM62h/SnyqyOWIh9amGT7wRwZEHBV1BKvZbNxQIJ7ESkuKsk/tJfWKhq7dSw6E/+MZ4yQtXvTyaJ3pK96Hq2uoUkawv+FxXBrzG3FtTTYA8gqA1SIiV3erEIQuBK/WD74i5yK4rwpfGTo7jNc0V6wrwO1BKFj/OGjSC+2LSAkBgf8WLe6UL/dHr3bBEyzm0V4xMf5Iqb8JGvkaXNEfbFqzKC2Wv
tasks:
- internal.lock_machines: 3
- internal.save_config: null
- internal.check_lock: null
- internal.connect: null
- internal.check_conflict: null
- kernel: *id001
- internal.base: null
- internal.archive: null
- internal.coredump: null
- internal.syslog: null
- internal.timer: null
- chef: null
- clock: null
- ceph:
    log-whitelist:
    - wrongly marked me down
    - objects unfound and apparently lost
- thrashosds:
    timeout: 1200
- rados:
    clients:
    - client.0
    objects: 50
    op_weights:
      delete: 50
      read: 100
      snap_create: 50
      snap_remove: 50
      snap_rollback: 50
      write: 100
    ops: 4000
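
As a rough guide to what this workload exercises: assuming `op_weights` are relative weights from which each of the `ops` operations is drawn at random (an assumption about the rados task; this ticket does not spell it out), the expected mix can be worked out with a short Python sketch. The helper name `expected_mix` is illustrative.

```python
# Values copied from the rados task in the config above.
op_weights = {
    "delete": 50,
    "read": 100,
    "snap_create": 50,
    "snap_remove": 50,
    "snap_rollback": 50,
    "write": 100,
}
ops = 4000


def expected_mix(weights, total_ops):
    """Expected count per operation if ops are drawn in proportion to their weights."""
    total = sum(weights.values())
    return {op: total_ops * w / total for op, w in weights.items()}


mix = expected_mix(op_weights, ops)
for op, n in mix.items():
    print(f"{op:14s} {n:6.0f}")
```

Under that assumption, the three snapshot operations together would account for 150 of the 400 weight units, or about 37.5% of the run.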
ubuntu@teuthology:/a/teuthology-2012-07-12_19:00:15-regression-master-testing-gcov/10615$ cat summary.yaml
ceph-sha1: bcfa573f5f615f3403ff71da0212cd1cee7e7d9c
description: collection:thrash clusters:6-osd-3-machine.yaml fs:btrfs.yaml thrashers:default.yaml
  workloads:snaps-few-objects.yaml
duration: 2186.8045082092285
failure_reason: 'Command failed with status 1: ''/tmp/cephtest/enable-coredump /tmp/cephtest/binary/usr/local/bin/ceph-coverage
  /tmp/cephtest/archive/coverage /tmp/cephtest/daemon-helper term /tmp/cephtest/binary/usr/local/bin/ceph-osd
  -f -i 2 -c /tmp/cephtest/ceph.conf'''
flavor: gcov
owner: scheduled_teuthology@teuthology
success: false


Files

ceph-osd.69.log.gz (1.04 MB) Joao Eduardo Luis, 12/18/2012 09:46 AM
#1

Updated by Sage Weil over 11 years ago

  • Status changed from New to Can't reproduce
#2

Updated by Tamilarasi muthamizhan over 11 years ago

This test hung in the nightlies.

Logs: ubuntu@teuthology:/a/teuthology-2012-08-22_00:00:07-regression-next-testing-basic/6240

ubuntu@plana03:/tmp/cephtest/archive/log/osd.4.log
common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)' thread 8345700 time 2012-08-22 02:27:45.138146
common/HeartbeatMap.cc: 78: FAILED assert(0 == "hit suicide timeout")

 ceph version 0.50-116-g4a0704e (commit:4a0704e64a733b7bb14fb4103cd1cd54e4e7da8a)
 1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x107) [0x73c277]
 2: (ceph::HeartbeatMap::is_healthy()+0x87) [0x73cc07]
 3: (ceph::HeartbeatMap::check_touch_file()+0x23) [0x73ce53]
 4: (CephContextServiceThread::entry()+0x55) [0x7ecf65]
 5: (()+0x7e9a) [0x503be9a]
 6: (clone()+0x6d) [0x6a604bd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

ubuntu@plana08:/tmp/cephtest/archive/log/osd.2.log
2012-08-22 03:01:49.981735 10b56700  1 2012-08-22 03:01:50.259373 8345700 -1 heartbeat_map reset_timeout 'FileStore::op_tp thread 0x10b56700' had timed out after 4*** Caught signal (Aborted) **
 in thread 8345700

 ceph version 0.50-116-g4a0704e (commit:4a0704e64a733b7bb14fb4103cd1cd54e4e7da8a)
 1: /tmp/cephtest/binary/usr/local/bin/ceph-osd() [0x72295a]
 2: (()+0xfcb0) [0x5043cb0]
 3: (gsignal()+0x35) [0x69a4445]
 4: (abort()+0x17b) [0x69a7bab]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x621569d]
 6: (()+0xb5846) [0x6213846]
 7: (()+0xb5873) [0x6213873]
 8: (()+0xb596e) [0x621396e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1df) [0x7e16ff]
 10: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x107) [0x73c277]
 11: (ceph::HeartbeatMap::is_healthy()+0x87) [0x73cc07]
 12: (ceph::HeartbeatMap::check_touch_file()+0x23) [0x73ce53]
 13: (CephContextServiceThread::entry()+0x55) [0x7ecf65]
 14: (()+0x7e9a) [0x503be9a]
 15: (clone()+0x6d) [0x6a604bd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
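
When scanning OSD logs like the ones above for the thread that stalled, a small filter helps. The sketch below is a hypothetical helper, not part of Ceph or teuthology: the function names and the regex are illustrative and fitted only to the `heartbeat_map` line format seen in this ticket.

```python
import gzip
import re

# Matches e.g.:
#   heartbeat_map is_healthy 'FileStore::op_tp thread 0x...' had suicide timed out after 180
HEARTBEAT_RE = re.compile(
    r"heartbeat_map (?P<caller>\w+) "
    r"'(?P<thread>[^']+)' had "
    r"(?P<kind>suicide timed out|timed out) after (?P<seconds>\d+)"
)


def find_heartbeat_timeouts(lines):
    """Yield (caller, thread, kind, seconds) for each heartbeat timeout message.

    Interleaved output from several threads can cut a number short, so the
    seconds value is only as reliable as the log line itself.
    """
    for line in lines:
        for m in HEARTBEAT_RE.finditer(line):
            yield m["caller"], m["thread"], m["kind"], int(m["seconds"])


def scan_gzip_log(path):
    """Collect heartbeat timeout messages from a gzip-compressed log file."""
    with gzip.open(path, "rt", errors="replace") as f:
        return list(find_heartbeat_timeouts(f))
```

Entries whose kind is `timed out` only mark a thread as unhealthy; the `suicide timed out` entry is the one that precedes the failed assert.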

ubuntu@teuthology:/a/teuthology-2012-08-22_00:00:07-regression-next-testing-basic/6240$ cat config.yaml 
kernel: &id001
  kdb: true
  sha1: 938838c46ed6da8d59b8e3b8143b588fe84584f5
nuke-on-error: true
overrides:
  ceph:
    fs: btrfs
    log-whitelist:
    - slow request
    sha1: 4a0704e64a733b7bb14fb4103cd1cd54e4e7da8a
    valgrind:
      mds:
      - --tool=memcheck
      mon:
      - --tool=memcheck
      osd:
      - --tool=memcheck
  workunit:
    sha1: 4a0704e64a733b7bb14fb4103cd1cd54e4e7da8a
roles:
- - mon.a
  - mon.c
  - osd.0
  - osd.1
  - osd.2
- - mon.b
  - mds.a
  - osd.3
  - osd.4
  - osd.5
- - client.0
targets:
  ubuntu@plana02.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCtjMpSkaJhFqFtpo5AEe3KHygR+ueaWU+gYrrRzPa8YvmR0TCapw0kz77y1Fjcfh8rkTapnevpaYgQSMrMs0Yc34kF5XtNRuQXkpTwrhS8isZJBeNSc1W5XeKjj4KB/UuzBywJq0h/0KbH1DrMy72cGISOzdiP9CMA5KUvJo0m31wv1+MPcPn/5AhZgoWPStfaZdb4TaJUrNLrws0oRXa0yQbUa6WmUBsYhHsw4K1ukJAcJwVjcgAAv1N+GnyuWLVs+pvknBO3Whv1RhjY6EDGjun1MDPw+OE3wJsJX7BRr8eZv2Avi7pRlseWeWJwgsHMJ/j0yhf+SCy1+oSPrD2b
  ubuntu@plana03.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDHsojB9j3Wcz/TMv8Jh9mMIXxe3sUT1XaU+r9rVt0vY8UBwmxLmF2+AfrS4KX8nWZeY581cPGTb658GmcxPGiGSSEUEKRbDsav9fwFP2bT3KVDRYO3hz2UBGfQnC2xva5cE2PChHH50bcckc2O6jeZuks8rtfMG05z+gWesLCGKwXfxgcaYXZtMKWcDEaEvrkDdCFqZB7jBNvNbujHuRrL1eUAuh5xk34psFlpDqjA9TOLJpcTMKxvG2TssSMEbcthGFNAclM86S8jpzb2v4es8DFpOQqJnnHzoDeoMAL8W48H9ijBdvsaJNEPqA6Tj2Bsx21ybIxI/Q9lz03nmxIL
  ubuntu@plana08.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCqqzUa6OfNvqirlipO2jY32KanSfk2mSstJJdakPuKVijDoaDHvC7L14A9iNMChI/+OgVJRpBF4DI+QwrO00yj4jSGSuq/dtp6vxkX5fw1+g21uqZG7Sl6S89ytRkRs+NFoNY1jhWR0Qo4opEim9qApVSxlouG61L++IEv7zhZ62ogpknTMQhkgpHJ4w146silaCh6vnmoNsTBt+eIuVE/7vhMQep4REpw5uVZR6PVUBsJDJAkJuyAkNu3Xva7KC4W22sjzSqHKWqzNeAmPxQ8Ywvu5PWQulOtA/LF9gAVsJjbKE7+ZsXVYvTfpZtEKfOduss8dB3lP8Xez6CbUzsv
tasks:
- internal.lock_machines: 3
- internal.save_config: null
- internal.check_lock: null
- internal.connect: null
- internal.check_conflict: null
- kernel: *id001
- internal.base: null
- internal.archive: null
- internal.coredump: null
- internal.syslog: null
- internal.timer: null
- chef: null
- clock: null
- ceph: null
- ceph-fuse: null
- workunit:
    clients:
      all:
      - suites/fsstress.sh
#3

Updated by Tamilarasi muthamizhan over 11 years ago

Recent logs: /a/teuthology-2012-09-08_04:00:03-regression-stable-master-basic/19039

#4

Updated by Joao Eduardo Luis over 11 years ago

This bug popped up again on v0.55.1.

renzhi on IRC hit it after upgrading from v0.48.2 and has been unable to bring most of his OSDs back up (only about 22 of 75 came back).

Log is attached.

#5

Updated by Samuel Just over 11 years ago

  • Status changed from 12 to Resolved

Not actually a bug in the renzhi case.
