Bug #8320


heartbeat timeouts too low for vps machines

Added by Yuri Weinstein almost 10 years ago. Updated almost 10 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Q/A
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

There are several of those failures in this suite/run, and valgrind does not seem to be enabled in the orig.config.yaml.

Logs are in http://qa-proxy.ceph.com/teuthology/teuthology-2014-05-08_19:33:19-upgrade:dumpling-x:parallel-firefly---basic-vps/244048/

archive_path: /var/lib/teuthworker/archive/teuthology-2014-05-08_19:33:19-upgrade:dumpling-x:parallel-firefly---basic-vps/244048
branch: firefly
description: upgrade/dumpling-x/parallel/{0-cluster/start.yaml 1-dumpling-install/cuttlefish-dumpling.yaml
  2-workload/rados_loadgenbig.yaml 3-upgrade-sequence/upgrade-mon-osd-mds.yaml 4-final-upgrade/client.yaml
  5-final-workload/rbd_cls.yaml distros/rhel_6.4.yaml}
email: null
job_id: '244048'
last_in_suite: false
machine_type: vps
name: teuthology-2014-05-08_19:33:19-upgrade:dumpling-x:parallel-firefly---basic-vps
nuke-on-error: true
os_type: rhel
os_version: '6.4'
overrides:
  admin_socket:
    branch: firefly
  ceph:
    conf:
      mon:
        debug mon: 20
        debug ms: 1
        debug paxos: 20
        mon warn on legacy crush tunables: false
      osd:
        debug filestore: 20
        debug journal: 20
        debug ms: 1
        debug osd: 20
    log-whitelist:
    - slow request
    - scrub mismatch
    - ScrubResult
    sha1: db8873b69c73b40110bf1512c114e4a0395671ab
  ceph-deploy:
    branch:
      dev: firefly
    conf:
      client:
        log file: /var/log/ceph/ceph-$name.$pid.log
      mon:
        debug mon: 1
        debug ms: 20
        debug paxos: 20
        osd default pool size: 2
  install:
    ceph:
      sha1: db8873b69c73b40110bf1512c114e4a0395671ab
  s3tests:
    branch: master
  workunit:
    sha1: db8873b69c73b40110bf1512c114e4a0395671ab
owner: scheduled_teuthology@teuthology
roles:
- - mon.a
  - mds.a
  - osd.0
  - osd.1
- - mon.b
  - mon.c
  - osd.2
  - osd.3
- - client.0
  - client.1
suite: upgrade:dumpling-x:parallel
targets:
  ubuntu@vpm115.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAszawQx1jNRmtPq2Gj3fqH1SfmgAYOBuSowujH1lTj1aXGprZh+mVaxty2o6gutS5bAbK7PwNXiSwcvR7dB9OwbioUcTMYCjOd1t5+I68Q9iYMP0bAH5DPr94LkVSkbLyI5sJEEjGs/fS0YhgTP79w7IQW8YeGuhst+P/BiV4+jbFqAUEgxqakfGhE4PgyN+GpAweRubGkIp1deDyKfhQJHcuuoAVey9MDRe9/4WmCYKcU3DQjMCKgUoYYV8Czdulmo883MHKTfS7v1aN6KEjOg6As9rsBb79LxYYtZkjjB6pV8WquPayaeaBcQu5zk0WA9ask0vbgasVlbkAgt8zWQ==
  ubuntu@vpm116.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAmdfTdH1+YVKSWTzmjeCdoQaPsdWO57KaVPUp+bz7HsB5YZcxhp1TJ8sPRHfcOCUlHF4SOMtZPTWGmYAPiZHchI78utbaNQtyY6jY64QXRXUag+j+FCoGk7fYlHSX9grDe6gY71I49ueVF+691ii4k3uYE+cCLP6DuOaXlFwo94zM0anNag9eyNdxS6uzm6/e4vUIUchUUUojZRUPLdBQNIw4bQNpG2K4n+mCqkO1NGlVgchXzkGWrCImguc+DUnXHGwsVEjwdf564x5fqpSv73pRZe3GAVQIbV3HAL0BnHefG3Dzdx2iZZQ5UleHvW1PpQWiXZCpniVSQB/crSUOiw==
  ubuntu@vpm117.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAvG0Msn4lUl1VCBvILd9Kto1yRa98FSoMUh3wH9pBeQVEhVnWLJzfn03zcmI4n7BvyJRnabC5VvVlt30BPNrQlHBWVZGGSRjxYh6QYvLO+NVtj+8ooJJ4SckdZ+hyUlTKYYvhq7cy/p4K9KmYfX6drdghcH7vdO0GcizyAr6BwF6779tBs6dZ+JUo8efFRc+pmNJfp1OJxpyVouijlV2FqtLCmz79G7/poZZXgllOHSONgOJq2zzxMf2dxEbYNa/1cn1i3iSes83J6xgn/oH8vlKpvgDbsiX3hTL21QQamK3bsX4JITtE6qYyrereWMtwCL4L5+PP9qIk1HZLRIEVwQ==
tasks:
- internal.lock_machines:
  - 3
  - vps
- internal.save_config: null
- internal.check_lock: null
- internal.connect: null
- internal.serialize_remote_roles: null
- internal.check_conflict: null
- internal.check_ceph_data: null
- internal.vm_setup: null
- internal.base: null
- internal.archive: null
- internal.coredump: null
- internal.sudo: null
- internal.syslog: null
- internal.timer: null
- chef: null
- clock.check: null
- install:
    branch: cuttlefish
- print: '**** done cuttlefish install'
- ceph:
    fs: xfs
- print: '**** done ceph'
- install.upgrade:
    all:
      branch: dumpling
- ceph.restart: null
- parallel:
  - workload
  - upgrade-sequence
- print: '**** done parallel'
- install.upgrade:
    client.0: null
- print: '**** done install.upgrade'
- workunit:
    clients:
      client.1:
      - cls/test_cls_rbd.sh
teuthology_branch: firefly
upgrade-sequence:
  sequential:
  - install.upgrade:
      mon.a: null
      mon.b: null
  - ceph.restart:
      daemons:
      - mon.a
      wait-for-healthy: false
      wait-for-osds-up: true
  - sleep:
      duration: 60
  - ceph.restart:
      daemons:
      - mon.b
      wait-for-healthy: false
      wait-for-osds-up: true
  - sleep:
      duration: 60
  - ceph.restart:
    - mon.c
  - sleep:
      duration: 60
  - ceph.restart:
    - osd.0
  - sleep:
      duration: 60
  - ceph.restart:
    - osd.1
  - sleep:
      duration: 60
  - ceph.restart:
    - osd.2
  - sleep:
      duration: 60
  - ceph.restart:
    - osd.3
  - sleep:
      duration: 60
  - ceph.restart:
    - mds.a
verbose: true
worker_log: /var/lib/teuthworker/archive/worker_logs/worker.vps.19330
workload:
  sequential:
  - workunit:
      branch: dumpling
      clients:
        client.0:
        - rados/load-gen-big.sh
description: upgrade/dumpling-x/parallel/{0-cluster/start.yaml 1-dumpling-install/cuttlefish-dumpling.yaml
  2-workload/rados_loadgenbig.yaml 3-upgrade-sequence/upgrade-mon-osd-mds.yaml 4-final-upgrade/client.yaml
  5-final-workload/rbd_cls.yaml distros/rhel_6.4.yaml}
duration: 3378.6129529476166
failure_reason: '"2014-05-09 03:27:49.176216 osd.0 10.214.138.182:6808/6617 305 :
  [WRN] map e11 wrongly marked me down" in cluster log'
flavor: basic
owner: scheduled_teuthology@teuthology
success: false
Actions #1

Updated by Sage Weil almost 10 years ago

  • Status changed from New to 12
  • Source changed from other to Q/A

From the logs it looks like the OSD just stalls and does nothing. I'm chalking it up to limited RAM on the VPS nodes and swapping.
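For context, the "wrongly marked me down" message appears when peers stop receiving an OSD's heartbeats and report it down to the monitors. A sketch of the relevant OSD options, written in the same teuthology `conf` style as the job YAML above (the values shown are the upstream defaults of that era, for reference; they are not set anywhere in this run):

```yaml
conf:
  osd:
    # seconds between heartbeat pings to peer OSDs
    osd heartbeat interval: 6
    # seconds a peer may miss heartbeats before it is
    # reported down to the monitors
    osd heartbeat grace: 20
```

On a swapping VPS node, a 20-second stall is enough to trip the grace period even though the daemon is still alive.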

Actions #2

Updated by Sage Weil almost 10 years ago

  • Subject changed from "[WRN] map e11 wrongly marked me down" in upgrade:dumpling-x:parallel-firefly---basic-vps suite to heartbeat timeouts too low for vps machines
  • Priority changed from Normal to Urgent

Making this bug about increasing the timeouts when running on vps.

Actions #3

Updated by Sage Weil almost 10 years ago

  • Status changed from 12 to Resolved

Added ~teuthology/vps.yaml and added it as an argument for all the vps-scheduled suites. It sets the heartbeat grace to 40s (from the default of 20s).
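The ticket does not show the file's contents, but based on the description, a teuthology overrides fragment along these lines would achieve the stated effect (a hypothetical sketch; the actual vps.yaml may differ in section placement and exact keys):

```yaml
overrides:
  ceph:
    conf:
      global:
        # Double the default 20s grace so swapping VPS nodes
        # are not spuriously reported down by their peers.
        osd heartbeat grace: 40
```

Putting the option under `global` (rather than `osd`) would make both the OSDs and the monitors agree on the longer grace period.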
