Bug #2784
closed
osd hit suicide timeout
Description
Log: ubuntu@teuthology:/a/teuthology-2012-07-12_19:00:15-regression-master-testing-gcov/10615
ubuntu@teuthology:/a/teuthology-2012-07-12_19:00:15-regression-master-testing-gcov/10615$ zcat remote/ubuntu@plana39.front.sepia.ceph.com/log/osd.2.log.gz
-1> 2012-07-12 20:09:15.281759 7fc217006700 1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7fc20e7f5700' had suicide timed out after 180
0> 2012-07-12 20:09:15.282794 7fc217006700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)' thread 7fc217006700 time 2012-07-12 20:09:15.281772
common/HeartbeatMap.cc: 78: FAILED assert(0 == "hit suicide timeout")
ceph version 0.48argonaut-358-gbcfa573 (commit:bcfa573f5f615f3403ff71da0212cd1cee7e7d9c)
1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x433) [0x9d3ae3]
2: (ceph::HeartbeatMap::is_healthy()+0x8f) [0x9d48cf]
3: (ceph::HeartbeatMap::check_touch_file()+0x2b) [0x9d4cbb]
4: (CephContextServiceThread::entry()+0x6d) [0x92461d]
5: (Thread::_entry_func(void*)+0x12) [0x8f95b2]
6: (()+0x7e9a) [0x7fc21955ee9a]
7: (clone()+0x6d) [0x7fc217b134bd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- end dump of recent events ---
2012-07-12 20:09:15.285177 7fc217006700 -1 *** Caught signal (Aborted) **
in thread 7fc217006700
ceph version 0.48argonaut-358-gbcfa573 (commit:bcfa573f5f615f3403ff71da0212cd1cee7e7d9c)
1: /tmp/cephtest/binary/usr/local/bin/ceph-osd() [0x80eb9a]
2: (()+0xfcb0) [0x7fc219566cb0]
3: (gsignal()+0x35) [0x7fc217a57445]
4: (abort()+0x17b) [0x7fc217a5abab]
5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fc2183a569d]
6: (()+0xb5846) [0x7fc2183a3846]
7: (()+0xb5873) [0x7fc2183a3873]
8: (()+0xb596e) [0x7fc2183a396e]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x385) [0x912275]
10: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x433) [0x9d3ae3]
11: (ceph::HeartbeatMap::is_healthy()+0x8f) [0x9d48cf]
12: (ceph::HeartbeatMap::check_touch_file()+0x2b) [0x9d4cbb]
13: (CephContextServiceThread::entry()+0x6d) [0x92461d]
14: (Thread::_entry_func(void*)+0x12) [0x8f95b2]
15: (()+0x7e9a) [0x7fc21955ee9a]
16: (clone()+0x6d) [0x7fc217b134bd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- begin dump of recent events ---
0> 2012-07-12 20:09:15.285177 7fc217006700 -1 *** Caught signal (Aborted) **
in thread 7fc217006700
ceph version 0.48argonaut-358-gbcfa573 (commit:bcfa573f5f615f3403ff71da0212cd1cee7e7d9c)
1: /tmp/cephtest/binary/usr/local/bin/ceph-osd() [0x80eb9a]
2: (()+0xfcb0) [0x7fc219566cb0]
3: (gsignal()+0x35) [0x7fc217a57445]
4: (abort()+0x17b) [0x7fc217a5abab]
5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fc2183a569d]
6: (()+0xb5846) [0x7fc2183a3846]
7: (()+0xb5873) [0x7fc2183a3873]
8: (()+0xb596e) [0x7fc2183a396e]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x385) [0x912275]
10: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x433) [0x9d3ae3]
11: (ceph::HeartbeatMap::is_healthy()+0x8f) [0x9d48cf]
12: (ceph::HeartbeatMap::check_touch_file()+0x2b) [0x9d4cbb]
13: (CephContextServiceThread::entry()+0x6d) [0x92461d]
14: (Thread::_entry_func(void*)+0x12) [0x8f95b2]
15: (()+0x7e9a) [0x7fc21955ee9a]
16: (clone()+0x6d) [0x7fc217b134bd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- end dump of recent events ---
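For triage, heartbeat_map lines like the one at the top of this log can be pulled out of a plain or gzipped OSD log with a short script. This is a minimal sketch: the regular expression is inferred only from the log lines quoted in this ticket, and the names `HEARTBEAT_RE`, `find_heartbeat_timeouts` and `scan_log` are hypothetical helpers, not part of Ceph or teuthology.

```python
import gzip
import re

# Matches lines such as:
#   heartbeat_map is_healthy 'FileStore::op_tp thread 0x7fc20e7f5700' had suicide timed out after 180
# Assumption: the pattern is inferred from the log excerpts in this ticket,
# not taken from the Ceph source.
HEARTBEAT_RE = re.compile(
    r"heartbeat_map (?P<check>\w+) '(?P<worker>[^']+)' "
    r"had (?P<kind>suicide timed out|timed out) after (?P<seconds>\d+)"
)


def find_heartbeat_timeouts(lines):
    """Yield (check, worker, kind, seconds) for each heartbeat timeout line."""
    for line in lines:
        match = HEARTBEAT_RE.search(line)
        if match:
            yield (
                match.group("check"),
                match.group("worker"),
                match.group("kind"),
                int(match.group("seconds")),
            )


def scan_log(path):
    """Scan a plain-text or .gz log file and return all timeout records."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", errors="replace") as handle:
        return list(find_heartbeat_timeouts(handle))
```

Calling `scan_log("osd.2.log.gz")` on a local copy of the log would return one tuple per matching line, which makes it easy to see which thread pool tripped the timeout and whether it was the ordinary or the suicide threshold.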
ubuntu@teuthology:/a/teuthology-2012-07-12_19:00:15-regression-master-testing-gcov/10615$ cat config.yaml
kernel: &id001
kdb: true
sha1: ea18acf27e2f7cee4ac9d01719564414d2cd64b5
nuke-on-error: true
overrides:
ceph:
coverage: true
fs: btrfs
log-whitelist:
- slow request
sha1: bcfa573f5f615f3403ff71da0212cd1cee7e7d9c
roles:
- - mon.a
- osd.0
- osd.1
- osd.2
- - mds.a
- osd.3
- osd.4
- osd.5
- - client.0
targets:
ubuntu@plana09.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDx6LtrVRWV3GjiTYgw3lIpuljK4ObcgjFcitU2ZIkJZzBK3DokJ5AvTNhbYOWpo0bJqaiheFM82UEiiqCs6ChgRbSbd++RSfVT9PejPpmPipLB8Bj24xzdrdCqUaQoMNr6J5+h7xcWiCeW/8sDBJIyWVOSO0AGGQnc88HwVExSIuzsfM9ergnQaQcmrCqf6PTrpVZWeBQPqOnWAy+fCka/vD+omclH9cYyLeK/tVTIYHWBns4nLzL0FeQFe8e3uyoCtFzHvOC4ziIKVRv/WpHt+fZq8M1IGUNaQAR1v3x/eKF0ut2EUYTCfR03MAGEjhv0IMX6XloaNjCFeMXsrHh7
ubuntu@plana39.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDo+Kh24vRxeTQ6/n5PIIGuxrPHPRO/xMQlwoLHi7mR01cIXJMG5wet7mp2om3/5SZSDcLBHduDKrdWL142Sg5fC0zZPUggbxS7nz/UCjYBzMsOtHEUAU5Gs0KFopOCHXNEveK95ezsroMAD5+jS/IEpiooYCkrR3H+NSvUU0Ae352PlXqV0vamkYzyQyEMmhFE50ALhUXbKMve3d2mxJee5sqVZSBmQTbze9RKUA96t9iiwiheflXbN1i9WHlbBOIue5pZ5fM3/vqPWgaShfFpa0pT56QKJfjyFcDeCLOislo23E5qKAJOi5vn5BoYVtG3niNQpt/YbYGfDEHVeqt9
ubuntu@plana40.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDEwyNlwC9Utqf3PCjL2JR4wwDkzpdEJuW93DOW82vYVisYEGod454JwXeNkjqzTUk6tXeRoUM9f/C6sZS3LFgHcMYt6m0sxP8DC4qU+q0YxCw9zLY8bXKe4DDjijM62h/SnyqyOWIh9amGT7wRwZEHBV1BKvZbNxQIJ7ESkuKsk/tJfWKhq7dSw6E/+MZ4yQtXvTyaJ3pK96Hq2uoUkawv+FxXBrzG3FtTTYA8gqA1SIiV3erEIQuBK/WD74i5yK4rwpfGTo7jNc0V6wrwO1BKFj/OGjSC+2LSAkBgf8WLe6UL/dHr3bBEyzm0V4xMf5Iqb8JGvkaXNEfbFqzKC2Wv
tasks:
- internal.lock_machines: 3
- internal.save_config: null
- internal.check_lock: null
- internal.connect: null
- internal.check_conflict: null
- kernel: *id001
- internal.base: null
- internal.archive: null
- internal.coredump: null
- internal.syslog: null
- internal.timer: null
- chef: null
- clock: null
- ceph:
log-whitelist:
- wrongly marked me down
- objects unfound and apparently lost
- thrashosds:
timeout: 1200
- rados:
clients:
- client.0
objects: 50
op_weights:
delete: 50
read: 100
snap_create: 50
snap_remove: 50
snap_rollback: 50
write: 100
ops: 4000
ubuntu@teuthology:/a/teuthology-2012-07-12_19:00:15-regression-master-testing-gcov/10615$ cat summary.yaml
ceph-sha1: bcfa573f5f615f3403ff71da0212cd1cee7e7d9c
description: collection:thrash clusters:6-osd-3-machine.yaml fs:btrfs.yaml thrashers:default.yaml
workloads:snaps-few-objects.yaml
duration: 2186.8045082092285
failure_reason: 'Command failed with status 1: ''/tmp/cephtest/enable-coredump /tmp/cephtest/binary/usr/local/bin/ceph-coverage
/tmp/cephtest/archive/coverage /tmp/cephtest/daemon-helper term /tmp/cephtest/binary/usr/local/bin/ceph-osd
-f -i 2 -c /tmp/cephtest/ceph.conf'''
flavor: gcov
owner: scheduled_teuthology@teuthology
success: false
Files
Updated by Sage Weil over 11 years ago
- Status changed from New to Can't reproduce
Updated by Tamilarasi muthamizhan over 11 years ago
This test hung in the nightlies.
Logs: ubuntu@teuthology:/a/teuthology-2012-08-22_00:00:07-regression-next-testing-basic/6240
ubuntu@plana03:/tmp/cephtest/archive/log/osd.4.log
common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)' thread 8345700 time 2012-08-22 02:27:45.138146
common/HeartbeatMap.cc: 78: FAILED assert(0 == "hit suicide timeout")
ceph version 0.50-116-g4a0704e (commit:4a0704e64a733b7bb14fb4103cd1cd54e4e7da8a)
1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x107) [0x73c277]
2: (ceph::HeartbeatMap::is_healthy()+0x87) [0x73cc07]
3: (ceph::HeartbeatMap::check_touch_file()+0x23) [0x73ce53]
4: (CephContextServiceThread::entry()+0x55) [0x7ecf65]
5: (()+0x7e9a) [0x503be9a]
6: (clone()+0x6d) [0x6a604bd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
ubuntu@plana08:/tmp/cephtest/archive/log/osd.2.log
2012-08-22 03:01:49.981735 10b56700 1 2012-08-22 03:01:50.259373 8345700 -1 heartbeat_map reset_timeout 'FileStore::op_tp thread 0x10b56700' had timed out after 4*** Caught signal (Aborted) **
in thread 8345700
ceph version 0.50-116-g4a0704e (commit:4a0704e64a733b7bb14fb4103cd1cd54e4e7da8a)
1: /tmp/cephtest/binary/usr/local/bin/ceph-osd() [0x72295a]
2: (()+0xfcb0) [0x5043cb0]
3: (gsignal()+0x35) [0x69a4445]
4: (abort()+0x17b) [0x69a7bab]
5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x621569d]
6: (()+0xb5846) [0x6213846]
7: (()+0xb5873) [0x6213873]
8: (()+0xb596e) [0x621396e]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1df) [0x7e16ff]
10: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x107) [0x73c277]
11: (ceph::HeartbeatMap::is_healthy()+0x87) [0x73cc07]
12: (ceph::HeartbeatMap::check_touch_file()+0x23) [0x73ce53]
13: (CephContextServiceThread::entry()+0x55) [0x7ecf65]
14: (()+0x7e9a) [0x503be9a]
15: (clone()+0x6d) [0x6a604bd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
ubuntu@teuthology:/a/teuthology-2012-08-22_00:00:07-regression-next-testing-basic/6240$ cat config.yaml
kernel: &id001
kdb: true
sha1: 938838c46ed6da8d59b8e3b8143b588fe84584f5
nuke-on-error: true
overrides:
ceph:
fs: btrfs
log-whitelist:
- slow request
sha1: 4a0704e64a733b7bb14fb4103cd1cd54e4e7da8a
valgrind:
mds:
- --tool=memcheck
mon:
- --tool=memcheck
osd:
- --tool=memcheck
workunit:
sha1: 4a0704e64a733b7bb14fb4103cd1cd54e4e7da8a
roles:
- - mon.a
- mon.c
- osd.0
- osd.1
- osd.2
- - mon.b
- mds.a
- osd.3
- osd.4
- osd.5
- - client.0
targets:
ubuntu@plana02.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCtjMpSkaJhFqFtpo5AEe3KHygR+ueaWU+gYrrRzPa8YvmR0TCapw0kz77y1Fjcfh8rkTapnevpaYgQSMrMs0Yc34kF5XtNRuQXkpTwrhS8isZJBeNSc1W5XeKjj4KB/UuzBywJq0h/0KbH1DrMy72cGISOzdiP9CMA5KUvJo0m31wv1+MPcPn/5AhZgoWPStfaZdb4TaJUrNLrws0oRXa0yQbUa6WmUBsYhHsw4K1ukJAcJwVjcgAAv1N+GnyuWLVs+pvknBO3Whv1RhjY6EDGjun1MDPw+OE3wJsJX7BRr8eZv2Avi7pRlseWeWJwgsHMJ/j0yhf+SCy1+oSPrD2b
ubuntu@plana03.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDHsojB9j3Wcz/TMv8Jh9mMIXxe3sUT1XaU+r9rVt0vY8UBwmxLmF2+AfrS4KX8nWZeY581cPGTb658GmcxPGiGSSEUEKRbDsav9fwFP2bT3KVDRYO3hz2UBGfQnC2xva5cE2PChHH50bcckc2O6jeZuks8rtfMG05z+gWesLCGKwXfxgcaYXZtMKWcDEaEvrkDdCFqZB7jBNvNbujHuRrL1eUAuh5xk34psFlpDqjA9TOLJpcTMKxvG2TssSMEbcthGFNAclM86S8jpzb2v4es8DFpOQqJnnHzoDeoMAL8W48H9ijBdvsaJNEPqA6Tj2Bsx21ybIxI/Q9lz03nmxIL
ubuntu@plana08.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCqqzUa6OfNvqirlipO2jY32KanSfk2mSstJJdakPuKVijDoaDHvC7L14A9iNMChI/+OgVJRpBF4DI+QwrO00yj4jSGSuq/dtp6vxkX5fw1+g21uqZG7Sl6S89ytRkRs+NFoNY1jhWR0Qo4opEim9qApVSxlouG61L++IEv7zhZ62ogpknTMQhkgpHJ4w146silaCh6vnmoNsTBt+eIuVE/7vhMQep4REpw5uVZR6PVUBsJDJAkJuyAkNu3Xva7KC4W22sjzSqHKWqzNeAmPxQ8Ywvu5PWQulOtA/LF9gAVsJjbKE7+ZsXVYvTfpZtEKfOduss8dB3lP8Xez6CbUzsv
tasks:
- internal.lock_machines: 3
- internal.save_config: null
- internal.check_lock: null
- internal.connect: null
- internal.check_conflict: null
- kernel: *id001
- internal.base: null
- internal.archive: null
- internal.coredump: null
- internal.syslog: null
- internal.timer: null
- chef: null
- clock: null
- ceph: null
- ceph-fuse: null
- workunit:
clients:
all:
- suites/fsstress.sh
Updated by Tamilarasi muthamizhan over 11 years ago
Recent logs: /a/teuthology-2012-09-08_04:00:03-regression-stable-master-basic/19039
Updated by Joao Eduardo Luis over 11 years ago
- File ceph-osd.69.log.gz added
- Status changed from Can't reproduce to 12
This bug popped up again on v0.55.1.
renzhi on IRC hit it after upgrading from v0.48.2 and has been unable to bring most of his OSDs back up (only about 22 of 75 came back).
The log is attached.
Updated by Samuel Just over 11 years ago
- Status changed from 12 to Resolved
Not actually a bug in the renzhi case.