Bug #9554

"FAILED assert(0 == "hit suicide timeout")" in upgrade:firefly-firefly-testing-basic-vps run

Added by Yuri Weinstein over 9 years ago. Updated over 9 years ago.

Status:
Can't reproduce
Priority:
Urgent
Assignee:
David Zafman
Category:
-
Target version:
-
% Done:
0%

Source:
Q/A
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Logs are in http://qa-proxy.ceph.com/teuthology/teuthology-2014-09-20_15:08:15-upgrade:firefly-firefly-testing-basic-vps/500976/

Assert in */remote/vpm094/log/ceph-osd.5.log.gz:

2014-09-21 00:54:07.875038 7f8a22953700  0 -- 10.214.138.153:6807/10046 >> 10.214.138.153:6815/9759 pipe(0x2740280 sd=46 :54506 s=2 pgs=146 cs=253 l=0 c=0x2748160).fault, initiating reconnect
2014-09-21 00:54:08.125001 7f8a25376700 10 osd.5 69 OSD::ms_get_authorizer type=osd
2014-09-21 00:54:08.125740 7f8a3dbcc700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)' thread 7f8a3dbcc700 time 2014-09-21 00:54:07.864249
common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")

 ceph version 0.80.5-229-g2675b0d (2675b0d159ff0608bc6e607be3727ed23e7e2976)
 1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x107) [0x9c3b77]
 2: (ceph::HeartbeatMap::is_healthy()+0xa7) [0x9c4527]
 3: (ceph::HeartbeatMap::check_touch_file()+0x23) [0x9c4ad3]
 4: (CephContextServiceThread::entry()+0x133) [0xa90a43]
 5: (()+0x7e9a) [0x7f8a40eb5e9a]
 6: (clone()+0x6d) [0x7f8a3f86731d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
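For context, here is a minimal sketch of what the failing check amounts to, pieced together from the assert message and the "had timed out after 15" / "had suicide timed out after 150" warnings that show up later in this ticket; the type and function names are simplified assumptions, not the actual HeartbeatMap source. Each worker thread registers a handle carrying a soft deadline (grace) and a much longer hard deadline (suicide grace); a service thread periodically re-checks every handle, only logging a warning when the soft deadline is missed but aborting the whole daemon when the hard one is, which is the assert above.

#include <cassert>
#include <cstdio>
#include <ctime>

// Simplified sketch (assumed names, not the real Ceph code) of the
// heartbeat check that produces "hit suicide timeout".
struct heartbeat_handle_t {   // hypothetical, pared-down handle
  const char *name;           // e.g. "OSD::osd_tp thread 0x..."
  time_t timeout;             // soft deadline: warn if missed
  time_t suicide_timeout;     // hard deadline: abort if missed
};

// Returns false (and warns) on a soft timeout; aborts on a hard one.
bool check_handle(const heartbeat_handle_t *h, time_t now)
{
  bool healthy = true;
  if (h->timeout && now > h->timeout) {
    std::printf("heartbeat_map is_healthy '%s' had timed out\n", h->name);
    healthy = false;
  }
  if (h->suicide_timeout && now > h->suicide_timeout) {
    // Killing the daemon lets the rest of the cluster re-peer around it
    // instead of waiting indefinitely on a wedged or starved thread.
    assert(0 == "hit suicide timeout");
  }
  return healthy;
}

The remainder of the description is the teuthology job configuration and result summary.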
archive_path: /var/lib/teuthworker/archive/teuthology-2014-09-20_15:08:15-upgrade:firefly-firefly-testing-basic-vps/500976
branch: firefly
description: upgrade:firefly/older/{0-cluster/start.yaml 1-install/v0.80.2.yaml 2-workload/blogbench.yaml
  3-upgrade-sequence/upgrade-mon-osd-mds.yaml 4-final/{monthrash.yaml osdthrash.yaml
  testrgw.yaml} distros/ubuntu_12.04.yaml}
email: ceph-qa@ceph.com
job_id: '500976'
kernel: &id001
  kdb: true
  sha1: 3ed0e301a4806e64958a50b28a34abc093b953ab
last_in_suite: false
machine_type: vps
name: teuthology-2014-09-20_15:08:15-upgrade:firefly-firefly-testing-basic-vps
nuke-on-error: true
os_type: ubuntu
os_version: '12.04'
overrides:
  admin_socket:
    branch: firefly
  ceph:
    conf:
      global:
        osd heartbeat grace: 100
      mon:
        debug mon: 20
        debug ms: 1
        debug paxos: 20
      osd:
        debug filestore: 20
        debug journal: 20
        debug ms: 1
        debug osd: 20
    fs: xfs
    log-whitelist:
    - slow request
    - scrub
    - wrongly marked me down
    - objects unfound and apparently lost
    - log bound mismatch
    sha1: 2675b0d159ff0608bc6e607be3727ed23e7e2976
  ceph-deploy:
    branch:
      dev: firefly
    conf:
      client:
        log file: /var/log/ceph/ceph-$name.$pid.log
      mon:
        debug mon: 1
        debug ms: 20
        debug paxos: 20
        osd default pool size: 2
  install:
    ceph:
      sha1: 2675b0d159ff0608bc6e607be3727ed23e7e2976
  rgw:
    default_idle_timeout: 1200
  s3tests:
    branch: firefly
    idle_timeout: 1200
  workunit:
    sha1: 2675b0d159ff0608bc6e607be3727ed23e7e2976
owner: scheduled_teuthology@teuthology
priority: 1000
roles:
- - mon.a
  - mds.a
  - osd.0
  - osd.1
  - osd.2
- - mon.b
  - mon.c
  - osd.3
  - osd.4
  - osd.5
  - client.0
  - client.1
suite: upgrade:firefly
suite_branch: firefly
suite_path: /var/lib/teuthworker/src/ceph-qa-suite_firefly
targets:
  ubuntu@vpm094.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDovlOblFaaeTfgD+3efJrUi4TZLYO2gyLf4WtnHh80GMiqDLb8d+t6qesiD+4ZtkbSXQITK1i2FcQpwGqFEWzzdVNZA7w0CVNGW76QxH5n2LOEGxksAdzfJdDbhdDAqYgdye7E+HvNJtg+rLt/QTfykIlB0QqDpFYkAbFlYexMf3qjNBAbnYaH3T7Hz0C/1hfw3PQVnQ62ElmsdI2XmYO74jQP8DX2OqnEjlL+jCwnqAXz7inxdvgJyJsDs8lISXEkV/JmX7VA7/pUyTfn7ELYLJZjYEu1bWpO7c8dtJjGw/nNDEcUJp7k6E+D33+J0Tea2q7PEwr2h0k5LbDSxWe5
  ubuntu@vpm143.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDA46JFVM8gN6P87utvoKa6ybXTBnaXuX2QKRw3WNiqFRARnHW1gY4naM4ohbJGEzODdtDORO0LXfHbOAv6s0E3VNqlaQgtLUt+Ft8joDdIF/lL7zd9c5VzRgg9XZ0221I7q4E6EG2zo2jSvufgniO1f7oXjz6TOgMJShxV0/hV1pXGDXPY9U9i7l5mRtRschUl1uvni1CxftMuWbqiBXOgk6qqU2cWJffnBO4k38W6DMInqikUJdS1QAwzWEX6/OTtsMeHlXnepW0GJfC0/vSwFf48+jJBxq38NwJB9F4dWIU0jLauK90a7X+6sye5lnRYdnR8O5QcASUbYiJFTW35
tasks:
- internal.lock_machines:
  - 2
  - vps
- internal.save_config: null
- internal.check_lock: null
- internal.connect: null
- internal.push_inventory: null
- internal.serialize_remote_roles: null
- internal.check_conflict: null
- internal.check_ceph_data: null
- internal.vm_setup: null
- kernel: *id001
- internal.base: null
- internal.archive: null
- internal.coredump: null
- internal.sudo: null
- internal.syslog: null
- internal.timer: null
- chef: null
- clock.check: null
- install:
    tag: v0.80.2
- ceph:
    log-whitelist:
    - scrub mismatch
    - ScrubResult
- parallel:
  - workload
  - upgrade-sequence
- sequential:
  - mon_thrash:
      revive_delay: 20
      thrash_delay: 1
  - ceph-fuse: null
  - workunit:
      clients:
        client.0:
        - suites/dbench.sh
- sequential:
  - thrashosds:
      chance_pgnum_grow: 1
      chance_pgpnum_fix: 1
      timeout: 1200
  - ceph-fuse:
    - client.0
  - workunit:
      clients:
        client.0:
        - suites/iogen.sh
- sequential:
  - rgw:
    - client.1
  - s3tests:
      client.1:
        rgw_server: client.1
teuthology_branch: master
tube: vps
upgrade-sequence:
  sequential:
  - install.upgrade:
      all:
        branch: firefly
  - ceph.restart:
    - mon.a
  - sleep:
      duration: 60
  - ceph.restart:
    - mon.b
  - sleep:
      duration: 60
  - ceph.restart:
    - mon.c
  - sleep:
      duration: 60
  - ceph.restart:
    - mds.a
  - sleep:
      duration: 60
  - ceph.restart:
    - osd.0
  - sleep:
      duration: 30
  - ceph.restart:
    - osd.1
  - sleep:
      duration: 30
  - ceph.restart:
    - osd.2
  - sleep:
      duration: 30
  - ceph.restart:
    - osd.3
  - sleep:
      duration: 30
  - ceph.restart:
    - osd.4
  - sleep:
      duration: 30
  - ceph.restart:
    - osd.5
verbose: true
worker_log: /var/lib/teuthworker/archive/worker_logs/worker.vps.3019
workload:
  workunit:
    clients:
      client.0:
      - suites/blogbench.sh
description: upgrade:firefly/older/{0-cluster/start.yaml 1-install/v0.80.2.yaml 2-workload/blogbench.yaml
  3-upgrade-sequence/upgrade-mon-osd-mds.yaml 4-final/{monthrash.yaml osdthrash.yaml
  testrgw.yaml} distros/ubuntu_12.04.yaml}
duration: 5435.529995203018
failure_reason: 'Command failed on vpm094 with status 1: ''sudo adjust-ulimits ceph-coverage
  /home/ubuntu/cephtest/archive/coverage daemon-helper kill ceph-osd -f -i 5'''
flavor: basic
mon.a-kernel-sha1: 3ed0e301a4806e64958a50b28a34abc093b953ab
mon.b-kernel-sha1: 3ed0e301a4806e64958a50b28a34abc093b953ab
owner: scheduled_teuthology@teuthology
success: false

History

#1 Updated by Samuel Just over 9 years ago

  • Assignee set to David Zafman
  • Priority changed from Normal to Urgent

#2 Updated by David Zafman over 9 years ago

The crashed osd.5 was on a node with a load average of 38. osd.1 did not see ping responses from osd.5 during this period of high load, although it did see pings that osd.5 initiated.

osd.5 is the only OSD that logged the load average, and only because it was trying to schedule scrubs (scrub_should_schedule checks loadavg). One of its 4 scrubs (pg 0.e) took a very long time to finish, and it ran during the period of high load average. The heartbeat cutoff check that osd.1 applied is worked through after the log excerpts below.

osd.5:
2014-09-21 00:45:52.501620 7f8a29c80700 10 osd.5 pg_epoch: 48 pg[0.a( v 24'779 (0'0,24'779] local-les=47 n=0 ec=1 les/c 47/48 22/42/22) [5,1] r=0 lpr=42 crt=24'776 lcod 24'778 mlcod 0'0 active+clean+scrubbing] starting a new chunky scrub
2014-09-21 00:46:00.228866 7f8a29c80700 10 osd.5 pg_epoch: 50 pg[0.1a( v 24'779 (0'0,24'779] local-les=48 n=0 ec=1 les/c 48/48 22/42/22) [5,1] r=0 lpr=42 crt=24'776 lcod 0'0 mlcod 0'0 active+clean+scrubbing] starting a new chunky scrub
2014-09-21 00:48:29.935088 7f8a29c80700 10 osd.5 pg_epoch: 66 pg[0.1e( v 24'507 (0'0,24'507] local-les=61 n=0 ec=1 les/c 61/64 54/54/22) [5,3] r=0 lpr=54 crt=24'503 lcod 0'0 mlcod 0'0 active+clean+scrubbing] starting a new chunky scrub
2014-09-21 00:49:04.491443 7f8a29c80700 10 osd.5 pg_epoch: 68 pg[0.e( v 68'515 (0'0,68'515] local-les=62 n=3 ec=1 les/c 62/65 54/54/22) [5,3] r=0 lpr=54 luod=65'513 crt=65'512 lcod 65'512 mlcod 65'512 active+clean+scrubbing] starting a new chunky scrub
2014-09-21 00:50:50.774347 7f8a3abc6700 20 osd.5 68 scrub_should_schedule loadavg 38 >= max 5 = no, load too high
2014-09-21 00:50:58.612390 7f8a3abc6700 0 log [WRN] : 9 slow requests, 1 included below; oldest blocked for > 120.206632 secs
2014-09-21 00:51:06.237568 7f8a3abc6700 0 log [WRN] : 9 slow requests, 1 included below; oldest blocked for > 127.831814 secs
2014-09-21 00:51:10.168962 7f8a3abc6700 0 log [WRN] : 10 slow requests, 5 included below; oldest blocked for > 131.763179 secs
2014-09-21 00:51:18.938596 7f8a3abc6700 0 log [WRN] : 10 slow requests, 1 included below; oldest blocked for > 140.532841 secs
2014-09-21 00:51:21.123826 7f8a3abc6700 0 log [WRN] : 11 slow requests, 1 included below; oldest blocked for > 142.718069 secs
2014-09-21 00:51:31.478292 7f8a3abc6700 0 log [WRN] : 11 slow requests, 1 included below; oldest blocked for > 153.072539 secs

osd.1:
2014-09-21 00:53:02.452857 7ff18acfa700 -1 osd.1 72 heartbeat_check: no reply from osd.5 since back 2014-09-21 00:51:21.826360 front 2014-09-21 00:51:21.826360 (cutoff 2014-09-21 00:51:22.452857)

osd.5:
2014-09-21 00:53:23.807286 7f8a2fc8c700 1 -- 10.214.138.153:0/10046 <== osd.1 10.214.139.170:6820/7323 666 ==== osd_ping(you_died e75 stamp 2014-09-21 00:53:23.767985) v2 ==== 47+0+0 (2473526350 0 0) 0x38758c0 con 0x2748c60
2014-09-21 00:53:57.287920 7f8a29c80700 10 osd.5 pg_epoch: 68 pg[0.e( v 68'515 (0'0,68'515] local-les=62 n=3 ec=1 les/c 62/65 54/54/22) [5,3] r=0 lpr=54 crt=65'512 lcod 68'514 mlcod 68'514 active+clean+scrubbing] _scrub (scrub) finish

Other OSDs also responded with you_died.
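A quick worked check of the heartbeat_check line from osd.1 above. Assuming the cutoff is simply the current time minus the osd heartbeat grace of 100 seconds set in this job's overrides (an assumption, but it reproduces the logged value exactly):

#include <cstdio>

int main()
{
  const double grace      = 100.0;       // "osd heartbeat grace" from the job overrides
  const double now        = 3182.452857; // 00:53:02.452857 as seconds past 00:00
  const double last_reply = 3081.826360; // 00:51:21.826360, last ping reply seen from osd.5
  const double cutoff     = now - grace; // 3082.452857 == 00:51:22.452857, as logged

  if (last_reply < cutoff)
    std::printf("no reply from osd.5 since %.6f (cutoff %.6f) -> report osd.5 failed\n",
                last_reply, cutoff);
  return 0;
}

osd.5's last reply landed about 0.6 seconds before the cutoff, so osd.1 reported it failed, which fits the picture of a node too overloaded to answer pings in time.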

#3 Updated by David Zafman over 9 years ago

  • Status changed from New to Can't reproduce

Looks like just an overloaded node.

#4 Updated by Sahana Lokeshappa over 9 years ago

It's reproducible in Ceph 0.84 (a customized build):

2014-11-04 15:26:39.388499 7f377cac3700 0 -- 10.242.42.172:7241/2527458 >> 10.242.42.166:6838/2037987 pipe(0x17eb7180 sd=190 :7241 s=2 pgs=13061 cs=1 l=0 c=0xaa272e0).fault with nothing to send, going to standby
2014-11-04 15:26:40.756493 7f37b4c9c700 1 heartbeat_map is_healthy 'OSD::osd_tp thread 0x7f374bbbe700' had timed out after 15
2014-11-04 15:26:40.756509 7f37b4c9c700 1 heartbeat_map is_healthy 'OSD::osd_tp thread 0x7f375100b700' had timed out after 15
2014-11-04 15:26:40.756511 7f37b4c9c700 1 heartbeat_map is_healthy 'OSD::osd_tp thread 0x7f3759792700' had timed out after 15
2014-11-04 15:26:40.756513 7f37b4c9c700 1 heartbeat_map is_healthy 'OSD::osd_tp thread 0x7f379e744700' had timed out after 15
2014-11-04 15:26:40.756515 7f37b4c9c700 1 heartbeat_map is_healthy 'OSD::osd_tp thread 0x7f379df43700' had timed out after 15
2014-11-04 15:26:45.756582 7f37b4c9c700 1 heartbeat_map is_healthy 'OSD::osd_tp thread 0x7f374bbbe700' had timed out after 15
2014-11-04 15:26:45.756596 7f37b4c9c700 1 heartbeat_map is_healthy 'OSD::osd_tp thread 0x7f375100b700' had timed out after 15
2014-11-04 15:26:45.756598 7f37b4c9c700 1 heartbeat_map is_healthy 'OSD::osd_tp thread 0x7f3759792700' had timed out after 15
2014-11-04 15:26:45.756600 7f37b4c9c700 1 heartbeat_map is_healthy 'OSD::osd_tp thread 0x7f379e744700' had timed out after 15
2014-11-04 15:26:45.756601 7f37b4c9c700 1 heartbeat_map is_healthy 'OSD::osd_tp thread 0x7f379df43700' had timed out after 15
2014-11-04 15:26:49.083657 7f378c2ba700 0 -- 10.242.42.172:7241/2527458 >> 10.242.42.168:7276/1638190 pipe(0x228da2c0 sd=289 :7241 s=2 pgs=10112 cs=103 l=0 c=0x1609b9c0).fault with nothing to send, going to standby
2014-11-04 15:26:49.399301 7f378f6ee700 0 -- 10.242.42.172:7241/2527458 >> 10.242.42.164:6821/2002429 pipe(0x17eb63c0 sd=118 :7241 s=2 pgs=17295 cs=103 l=0 c=0x19501080).fault with nothing to send, going to standby
2014-11-04 15:26:50.756670 7f37b4c9c700 1 heartbeat_map is_healthy 'OSD::osd_tp thread 0x7f374bbbe700' had timed out after 15
2014-11-04 15:26:50.756687 7f37b4c9c700 1 heartbeat_map is_healthy 'OSD::osd_tp thread 0x7f375100b700' had timed out after 15
2014-11-04 15:26:50.756694 7f37b4c9c700 1 heartbeat_map is_healthy 'OSD::osd_tp thread 0x7f3759792700' had timed out after 15
2014-11-04 15:26:50.756696 7f37b4c9c700 1 heartbeat_map is_healthy 'OSD::osd_tp thread 0x7f379e744700' had timed out after 15
2014-11-04 15:26:50.756697 7f37b4c9c700 1 heartbeat_map is_healthy 'OSD::osd_tp thread 0x7f379df43700' had timed out after 15
2014-11-04 15:26:54.063102 7f378372f700 0 -- 10.242.42.172:7241/2527458 >> 10.242.42.170:7160/3030701 pipe(0x246a0c00 sd=121 :34593 s=2 pgs=77727 cs=105 l=0 c=0x18e112e0).fault with nothing to send, going to standby
2014-11-04 15:26:55.756764 7f37b4c9c700 1 heartbeat_map is_healthy 'OSD::osd_tp thread 0x7f374bbbe700' had timed out after 15
2014-11-04 15:26:55.756780 7f37b4c9c700 1 heartbeat_map is_healthy 'OSD::osd_tp thread 0x7f374bbbe700' had suicide timed out after 150
2014-11-04 15:26:55.930586 7f37b4c9c700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)' thread 7f37b4c9c700 time 2014-11-04 15:26:55.756793
common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")

1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0xb7c05b]
2: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x2a9) [0xabb509]
3: (ceph::HeartbeatMap::is_healthy()+0xd6) [0xabbd96]
4: (ceph::HeartbeatMap::check_touch_file()+0x17) [0xabc477]
5: (CephContextServiceThread::entry()+0x154) [0xb90514]
6: (()+0x8182) [0x7f37b8419182]
7: (clone()+0x6d) [0x7f37b67fefbd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

After upgrading to Giant, I will check again.
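For reference, the 15-second and 150-second thresholds in the trace above look like the stock OSD op thread pool timeouts of that era (osd op thread timeout = 15, osd op thread suicide timeout = 150); treat that mapping as an assumption. If the goal were to ride out transient overload rather than crash, raising them in the job overrides, in the same style as the existing osd heartbeat grace: 100, would be one option. The values below are purely illustrative:

overrides:
  ceph:
    conf:
      osd:
        osd heartbeat grace: 100
        osd op thread timeout: 60            # illustrative; default believed to be 15
        osd op thread suicide timeout: 600   # illustrative; default believed to be 150

Whether raising these actually helps depends on whether the op threads are genuinely wedged or merely starved by host load.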

#5 Updated by Sahana Lokeshappa over 9 years ago

Yes, it reproduced in Giant too.
