Bug #9554
"FAILED assert(0 == "hit suicide timeout")" in upgrade:firefly-firefly-testing-basic-vps run
Description
Assert in */remote/vpm094/log/ceph-osd.5.log.gz
2014-09-21 00:54:07.875038 7f8a22953700 0 -- 10.214.138.153:6807/10046 >> 10.214.138.153:6815/9759 pipe(0x2740280 sd=46 :54506 s=2 pgs=146 cs=253 l=0 c=0x2748160).fault, initiating reconnect
2014-09-21 00:54:08.125001 7f8a25376700 10 osd.5 69 OSD::ms_get_authorizer type=osd
2014-09-21 00:54:08.125740 7f8a3dbcc700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)' thread 7f8a3dbcc700 time 2014-09-21 00:54:07.864249
common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")
 ceph version 0.80.5-229-g2675b0d (2675b0d159ff0608bc6e607be3727ed23e7e2976)
 1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x107) [0x9c3b77]
 2: (ceph::HeartbeatMap::is_healthy()+0xa7) [0x9c4527]
 3: (ceph::HeartbeatMap::check_touch_file()+0x23) [0x9c4ad3]
 4: (CephContextServiceThread::entry()+0x133) [0xa90a43]
 5: (()+0x7e9a) [0x7f8a40eb5e9a]
 6: (clone()+0x6d) [0x7f8a3f86731d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
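For context, the assert comes from the OSD's internal watchdog: each worker thread periodically "touches" a heartbeat handle, and a service thread (CephContextServiceThread in the backtrace) walks every handle and aborts the whole daemon if some thread has been stuck past its suicide grace. Below is a minimal C++ sketch of that check, simplified from common/HeartbeatMap.cc of this era; locking, atomics, and Ceph's logging are omitted and the struct fields are paraphrased, so treat it as an illustration rather than the exact code.

#include <cassert>
#include <ctime>
#include <iostream>

// One handle per watched worker thread; the thread advances "timeout"
// and "suicide_timeout" each time it makes progress (heartbeat touch).
struct heartbeat_handle_d {
  const char *name;        // e.g. "OSD::osd_tp thread 0x..."
  time_t grace;            // warn threshold (15s for the op threads below)
  time_t suicide_grace;    // abort threshold (150s below)
  time_t timeout;          // last touch + grace
  time_t suicide_timeout;  // last touch + suicide_grace
};

// Called periodically via is_healthy(), as in the backtrace above.
bool heartbeat_check(heartbeat_handle_d *h, const char *who, time_t now) {
  bool healthy = true;
  if (h->timeout && h->timeout < now) {
    // produces: "heartbeat_map is_healthy '...' had timed out after 15"
    std::cerr << who << " '" << h->name << "' had timed out after "
              << h->grace << "\n";
    healthy = false;
  }
  if (h->suicide_timeout && h->suicide_timeout < now) {
    // a thread stuck past suicide_grace takes down the whole daemon
    assert(0 == "hit suicide timeout");
  }
  return healthy;
}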
archive_path: /var/lib/teuthworker/archive/teuthology-2014-09-20_15:08:15-upgrade:firefly-firefly-testing-basic-vps/500976
branch: firefly
description: upgrade:firefly/older/{0-cluster/start.yaml 1-install/v0.80.2.yaml 2-workload/blogbench.yaml 3-upgrade-sequence/upgrade-mon-osd-mds.yaml 4-final/{monthrash.yaml osdthrash.yaml testrgw.yaml} distros/ubuntu_12.04.yaml}
email: ceph-qa@ceph.com
job_id: '500976'
kernel: &id001
  kdb: true
  sha1: 3ed0e301a4806e64958a50b28a34abc093b953ab
last_in_suite: false
machine_type: vps
name: teuthology-2014-09-20_15:08:15-upgrade:firefly-firefly-testing-basic-vps
nuke-on-error: true
os_type: ubuntu
os_version: '12.04'
overrides:
  admin_socket:
    branch: firefly
  ceph:
    conf:
      global:
        osd heartbeat grace: 100
      mon:
        debug mon: 20
        debug ms: 1
        debug paxos: 20
      osd:
        debug filestore: 20
        debug journal: 20
        debug ms: 1
        debug osd: 20
    fs: xfs
    log-whitelist:
    - slow request
    - scrub
    - wrongly marked me down
    - objects unfound and apparently lost
    - log bound mismatch
    sha1: 2675b0d159ff0608bc6e607be3727ed23e7e2976
  ceph-deploy:
    branch:
      dev: firefly
    conf:
      client:
        log file: /var/log/ceph/ceph-$name.$pid.log
      mon:
        debug mon: 1
        debug ms: 20
        debug paxos: 20
        osd default pool size: 2
  install:
    ceph:
      sha1: 2675b0d159ff0608bc6e607be3727ed23e7e2976
  rgw:
    default_idle_timeout: 1200
  s3tests:
    branch: firefly
    idle_timeout: 1200
  workunit:
    sha1: 2675b0d159ff0608bc6e607be3727ed23e7e2976
owner: scheduled_teuthology@teuthology
priority: 1000
roles:
- - mon.a
  - mds.a
  - osd.0
  - osd.1
  - osd.2
- - mon.b
  - mon.c
  - osd.3
  - osd.4
  - osd.5
  - client.0
  - client.1
suite: upgrade:firefly
suite_branch: firefly
suite_path: /var/lib/teuthworker/src/ceph-qa-suite_firefly
targets:
  ubuntu@vpm094.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDovlOblFaaeTfgD+3efJrUi4TZLYO2gyLf4WtnHh80GMiqDLb8d+t6qesiD+4ZtkbSXQITK1i2FcQpwGqFEWzzdVNZA7w0CVNGW76QxH5n2LOEGxksAdzfJdDbhdDAqYgdye7E+HvNJtg+rLt/QTfykIlB0QqDpFYkAbFlYexMf3qjNBAbnYaH3T7Hz0C/1hfw3PQVnQ62ElmsdI2XmYO74jQP8DX2OqnEjlL+jCwnqAXz7inxdvgJyJsDs8lISXEkV/JmX7VA7/pUyTfn7ELYLJZjYEu1bWpO7c8dtJjGw/nNDEcUJp7k6E+D33+J0Tea2q7PEwr2h0k5LbDSxWe5
  ubuntu@vpm143.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDA46JFVM8gN6P87utvoKa6ybXTBnaXuX2QKRw3WNiqFRARnHW1gY4naM4ohbJGEzODdtDORO0LXfHbOAv6s0E3VNqlaQgtLUt+Ft8joDdIF/lL7zd9c5VzRgg9XZ0221I7q4E6EG2zo2jSvufgniO1f7oXjz6TOgMJShxV0/hV1pXGDXPY9U9i7l5mRtRschUl1uvni1CxftMuWbqiBXOgk6qqU2cWJffnBO4k38W6DMInqikUJdS1QAwzWEX6/OTtsMeHlXnepW0GJfC0/vSwFf48+jJBxq38NwJB9F4dWIU0jLauK90a7X+6sye5lnRYdnR8O5QcASUbYiJFTW35
tasks:
- internal.lock_machines:
  - 2
  - vps
- internal.save_config: null
- internal.check_lock: null
- internal.connect: null
- internal.push_inventory: null
- internal.serialize_remote_roles: null
- internal.check_conflict: null
- internal.check_ceph_data: null
- internal.vm_setup: null
- kernel: *id001
- internal.base: null
- internal.archive: null
- internal.coredump: null
- internal.sudo: null
- internal.syslog: null
- internal.timer: null
- chef: null
- clock.check: null
- install:
    tag: v0.80.2
- ceph:
    log-whitelist:
    - scrub mismatch
    - ScrubResult
- parallel:
  - workload
  - upgrade-sequence
- sequential:
  - mon_thrash:
      revive_delay: 20
      thrash_delay: 1
  - ceph-fuse: null
  - workunit:
      clients:
        client.0:
        - suites/dbench.sh
- sequential:
  - thrashosds:
      chance_pgnum_grow: 1
      chance_pgpnum_fix: 1
      timeout: 1200
  - ceph-fuse:
    - client.0
  - workunit:
      clients:
        client.0:
        - suites/iogen.sh
- sequential:
  - rgw:
    - client.1
  - s3tests:
      client.1:
        rgw_server: client.1
teuthology_branch: master
tube: vps
upgrade-sequence:
  sequential:
  - install.upgrade:
      all:
        branch: firefly
  - ceph.restart:
    - mon.a
  - sleep:
      duration: 60
  - ceph.restart:
    - mon.b
  - sleep:
      duration: 60
  - ceph.restart:
    - mon.c
  - sleep:
      duration: 60
  - ceph.restart:
    - mds.a
  - sleep:
      duration: 60
  - ceph.restart:
    - osd.0
  - sleep:
      duration: 30
  - ceph.restart:
    - osd.1
  - sleep:
      duration: 30
  - ceph.restart:
    - osd.2
  - sleep:
      duration: 30
  - ceph.restart:
    - osd.3
  - sleep:
      duration: 30
  - ceph.restart:
    - osd.4
  - sleep:
      duration: 30
  - ceph.restart:
    - osd.5
verbose: true
worker_log: /var/lib/teuthworker/archive/worker_logs/worker.vps.3019
workload:
  workunit:
    clients:
      client.0:
      - suites/blogbench.sh
description: upgrade:firefly/older/{0-cluster/start.yaml 1-install/v0.80.2.yaml 2-workload/blogbench.yaml 3-upgrade-sequence/upgrade-mon-osd-mds.yaml 4-final/{monthrash.yaml osdthrash.yaml testrgw.yaml} distros/ubuntu_12.04.yaml}
duration: 5435.529995203018
failure_reason: 'Command failed on vpm094 with status 1: ''sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage daemon-helper kill ceph-osd -f -i 5'''
flavor: basic
mon.a-kernel-sha1: 3ed0e301a4806e64958a50b28a34abc093b953ab
mon.b-kernel-sha1: 3ed0e301a4806e64958a50b28a34abc093b953ab
owner: scheduled_teuthology@teuthology
success: false
History
#1 Updated by Samuel Just over 9 years ago
- Assignee set to David Zafman
- Priority changed from Normal to Urgent
#2 Updated by David Zafman over 9 years ago
The crashed osd.5 was on a node that had a load average of 38. During this period of high load, osd.1 saw the pings that osd.5 initiated but did not see any ping replies from it.
osd.5 is the only OSD that reported the loadavg, because it was trying to schedule scrubs (see the sketch below). One of its 4 scrubs (pg 0.e) took a very long time to finish, and that overlapped the period of high load average.
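For reference, the scrub scheduler is the only path that samples and logs the load average, which explains why only osd.5 reported it. A minimal sketch of the check behind the "load too high" line in the log below; this is a simplification, and the threshold variable stands in for the osd_scrub_load_threshold config option (shown as "max 5" in this log).

#include <cstdlib>   // getloadavg(3)
#include <iostream>

// Hypothetical stand-in for the osd_scrub_load_threshold config option.
static const double scrub_load_threshold = 5.0;

bool scrub_should_schedule() {
  double loadavg;
  if (getloadavg(&loadavg, 1) != 1)
    return false;  // cannot read the load average; do not start a scrub
  if (loadavg >= scrub_load_threshold) {
    // matches: "scrub_should_schedule loadavg 38 >= max 5 = no, load too high"
    std::cerr << "scrub_should_schedule loadavg " << loadavg
              << " >= max " << scrub_load_threshold
              << " = no, load too high\n";
    return false;
  }
  return true;  // load is acceptable; a new scrub may start
}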
osd.5:
2014-09-21 00:45:52.501620 7f8a29c80700 10 osd.5 pg_epoch: 48 pg[0.a( v 24'779 (0'0,24'779] local-les=47 n=0 ec=1 les/c 47/48 22/42/22) [5,1] r=0 lpr=42 crt=24'776 lcod 24'778 mlcod 0'0 active+clean+scrubbing] starting a new chunky scrub
2014-09-21 00:46:00.228866 7f8a29c80700 10 osd.5 pg_epoch: 50 pg[0.1a( v 24'779 (0'0,24'779] local-les=48 n=0 ec=1 les/c 48/48 22/42/22) [5,1] r=0 lpr=42 crt=24'776 lcod 0'0 mlcod 0'0 active+clean+scrubbing] starting a new chunky scrub
2014-09-21 00:48:29.935088 7f8a29c80700 10 osd.5 pg_epoch: 66 pg[0.1e( v 24'507 (0'0,24'507] local-les=61 n=0 ec=1 les/c 61/64 54/54/22) [5,3] r=0 lpr=54 crt=24'503 lcod 0'0 mlcod 0'0 active+clean+scrubbing] starting a new chunky scrub
2014-09-21 00:49:04.491443 7f8a29c80700 10 osd.5 pg_epoch: 68 pg[0.e( v 68'515 (0'0,68'515] local-les=62 n=3 ec=1 les/c 62/65 54/54/22) [5,3] r=0 lpr=54 luod=65'513 crt=65'512 lcod 65'512 mlcod 65'512 active+clean+scrubbing] starting a new chunky scrub
2014-09-21 00:50:50.774347 7f8a3abc6700 20 osd.5 68 scrub_should_schedule loadavg 38 >= max 5 = no, load too high
2014-09-21 00:50:58.612390 7f8a3abc6700 0 log [WRN] : 9 slow requests, 1 included below; oldest blocked for > 120.206632 secs
2014-09-21 00:51:06.237568 7f8a3abc6700 0 log [WRN] : 9 slow requests, 1 included below; oldest blocked for > 127.831814 secs
2014-09-21 00:51:10.168962 7f8a3abc6700 0 log [WRN] : 10 slow requests, 5 included below; oldest blocked for > 131.763179 secs
2014-09-21 00:51:18.938596 7f8a3abc6700 0 log [WRN] : 10 slow requests, 1 included below; oldest blocked for > 140.532841 secs
2014-09-21 00:51:21.123826 7f8a3abc6700 0 log [WRN] : 11 slow requests, 1 included below; oldest blocked for > 142.718069 secs
2014-09-21 00:51:31.478292 7f8a3abc6700 0 log [WRN] : 11 slow requests, 1 included below; oldest blocked for > 153.072539 secs
osd.1:
2014-09-21 00:53:02.452857 7ff18acfa700 -1 osd.1 72 heartbeat_check: no reply from osd.5 since back 2014-09-21 00:51:21.826360 front 2014-09-21 00:51:21.826360 (cutoff 2014-09-21 00:51:22.452857)
osd.5:
2014-09-21 00:53:23.807286 7f8a2fc8c700 1 -- 10.214.138.153:0/10046 <== osd.1 10.214.139.170:6820/7323 666 ==== osd_ping(you_died e75 stamp 2014-09-21 00:53:23.767985) v2 ==== 47+0+0 (2473526350 0 0) 0x38758c0 con 0x2748c60
2014-09-21 00:53:57.287920 7f8a29c80700 10 osd.5 pg_epoch: 68 pg[0.e( v 68'515 (0'0,68'515] local-les=62 n=3 ec=1 les/c 62/65 54/54/22) [5,3] r=0 lpr=54 crt=65'512 lcod 68'514 mlcod 68'514 active+clean+scrubbing] _scrub (scrub) finish
Other OSDs also responded with you_died.
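For context, osd.1's heartbeat_check line above is the standard failure test: the cutoff is simply now minus the configured heartbeat grace, which this job overrides to 100 (osd heartbeat grace: 100 in the job YAML). That gives 00:53:02.452857 - 100s = 00:51:22.452857, exactly the logged cutoff, and osd.5's last reply at 00:51:21.826360 falls just outside it, so osd.1 declared osd.5 dead and answered its subsequent pings with you_died. A minimal sketch of that test follows; the field and function names here are illustrative, not the real ones (the actual code compares per-peer utime_t stamps).

// Simplified from the spirit of OSD::heartbeat_check().
struct PeerHeartbeat {
  double last_rx_back;   // last ping reply seen on the back network
  double last_rx_front;  // last ping reply seen on the front network
};

bool peer_is_down(const PeerHeartbeat &p, double now, double grace) {
  double cutoff = now - grace;  // grace = 100 in this job's overrides
  // triggers: "no reply from osd.5 since back ... front ... (cutoff ...)"
  return p.last_rx_back < cutoff || p.last_rx_front < cutoff;
}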
#3 Updated by David Zafman over 9 years ago
- Status changed from New to Can't reproduce
Looks like just an overloaded node.
#4 Updated by Sahana Lokeshappa over 9 years ago
It's reproducible in Ceph 0.84 (customized build):
2014-11-04 15:26:39.388499 7f377cac3700 0 -- 10.242.42.172:7241/2527458 >> 10.242.42.166:6838/2037987 pipe(0x17eb7180 sd=190 :7241 s=2 pgs=13061 cs=1 l=0 c=0xaa272e0).fault with nothing to send, going to standby
2014-11-04 15:26:40.756493 7f37b4c9c700 1 heartbeat_map is_healthy 'OSD::osd_tp thread 0x7f374bbbe700' had timed out after 15
2014-11-04 15:26:40.756509 7f37b4c9c700 1 heartbeat_map is_healthy 'OSD::osd_tp thread 0x7f375100b700' had timed out after 15
2014-11-04 15:26:40.756511 7f37b4c9c700 1 heartbeat_map is_healthy 'OSD::osd_tp thread 0x7f3759792700' had timed out after 15
2014-11-04 15:26:40.756513 7f37b4c9c700 1 heartbeat_map is_healthy 'OSD::osd_tp thread 0x7f379e744700' had timed out after 15
2014-11-04 15:26:40.756515 7f37b4c9c700 1 heartbeat_map is_healthy 'OSD::osd_tp thread 0x7f379df43700' had timed out after 15
2014-11-04 15:26:45.756582 7f37b4c9c700 1 heartbeat_map is_healthy 'OSD::osd_tp thread 0x7f374bbbe700' had timed out after 15
2014-11-04 15:26:45.756596 7f37b4c9c700 1 heartbeat_map is_healthy 'OSD::osd_tp thread 0x7f375100b700' had timed out after 15
2014-11-04 15:26:45.756598 7f37b4c9c700 1 heartbeat_map is_healthy 'OSD::osd_tp thread 0x7f3759792700' had timed out after 15
2014-11-04 15:26:45.756600 7f37b4c9c700 1 heartbeat_map is_healthy 'OSD::osd_tp thread 0x7f379e744700' had timed out after 15
2014-11-04 15:26:45.756601 7f37b4c9c700 1 heartbeat_map is_healthy 'OSD::osd_tp thread 0x7f379df43700' had timed out after 15
2014-11-04 15:26:49.083657 7f378c2ba700 0 -- 10.242.42.172:7241/2527458 >> 10.242.42.168:7276/1638190 pipe(0x228da2c0 sd=289 :7241 s=2 pgs=10112 cs=103 l=0 c=0x1609b9c0).fault with nothing to send, going to standby
2014-11-04 15:26:49.399301 7f378f6ee700 0 -- 10.242.42.172:7241/2527458 >> 10.242.42.164:6821/2002429 pipe(0x17eb63c0 sd=118 :7241 s=2 pgs=17295 cs=103 l=0 c=0x19501080).fault with nothing to send, going to standby
2014-11-04 15:26:50.756670 7f37b4c9c700 1 heartbeat_map is_healthy 'OSD::osd_tp thread 0x7f374bbbe700' had timed out after 15
2014-11-04 15:26:50.756687 7f37b4c9c700 1 heartbeat_map is_healthy 'OSD::osd_tp thread 0x7f375100b700' had timed out after 15
2014-11-04 15:26:50.756694 7f37b4c9c700 1 heartbeat_map is_healthy 'OSD::osd_tp thread 0x7f3759792700' had timed out after 15
2014-11-04 15:26:50.756696 7f37b4c9c700 1 heartbeat_map is_healthy 'OSD::osd_tp thread 0x7f379e744700' had timed out after 15
2014-11-04 15:26:50.756697 7f37b4c9c700 1 heartbeat_map is_healthy 'OSD::osd_tp thread 0x7f379df43700' had timed out after 15
2014-11-04 15:26:54.063102 7f378372f700 0 -- 10.242.42.172:7241/2527458 >> 10.242.42.170:7160/3030701 pipe(0x246a0c00 sd=121 :34593 s=2 pgs=77727 cs=105 l=0 c=0x18e112e0).fault with nothing to send, going to standby
2014-11-04 15:26:55.756764 7f37b4c9c700 1 heartbeat_map is_healthy 'OSD::osd_tp thread 0x7f374bbbe700' had timed out after 15
2014-11-04 15:26:55.756780 7f37b4c9c700 1 heartbeat_map is_healthy 'OSD::osd_tp thread 0x7f374bbbe700' had suicide timed out after 150
2014-11-04 15:26:55.930586 7f37b4c9c700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)' thread 7f37b4c9c700 time 2014-11-04 15:26:55.756793
common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0xb7c05b]
2: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x2a9) [0xabb509]
3: (ceph::HeartbeatMap::is_healthy()+0xd6) [0xabbd96]
4: (ceph::HeartbeatMap::check_touch_file()+0x17) [0xabc477]
5: (CephContextServiceThread::entry()+0x154) [0xb90514]
6: (()+0x8182) [0x7f37b8419182]
7: (clone()+0x6d) [0x7f37b67fefbd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
After upgrading to Giant, I will check again.
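For what it's worth, the threads timing out above belong to the OSD op thread pool (OSD::osd_tp), whose warn and abort thresholds come from the osd_op_thread_timeout (15 here) and osd_op_thread_suicide_timeout (150 here) options. If the node is merely overloaded rather than deadlocked, one possible mitigation (an untested assumption, not a verified fix for this report) is to raise them in the conf overrides, the same way this suite already raises the heartbeat grace:

overrides:
  ceph:
    conf:
      osd:
        osd op thread timeout: 60            # example value; 15 in this log
        osd op thread suicide timeout: 600   # example value; 150 in this log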
#5 Updated by Sahana Lokeshappa over 9 years ago
Yes, it reproduced in Giant too.