Bug #8320


heartbeat timeouts too low for vps machines

Added by Yuri Weinstein almost 10 years ago. Updated almost 10 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Q/A
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

There are several of those failures in this suite/run, and valgrind does not seem to be enabled in the orig.config.yaml.

Logs are in http://qa-proxy.ceph.com/teuthology/teuthology-2014-05-08_19:33:19-upgrade:dumpling-x:parallel-firefly---basic-vps/244048/

archive_path: /var/lib/teuthworker/archive/teuthology-2014-05-08_19:33:19-upgrade:dumpling-x:parallel-firefly---basic-vps/244048
branch: firefly
description: upgrade/dumpling-x/parallel/{0-cluster/start.yaml 1-dumpling-install/cuttlefish-dumpling.yaml
  2-workload/rados_loadgenbig.yaml 3-upgrade-sequence/upgrade-mon-osd-mds.yaml 4-final-upgrade/client.yaml
  5-final-workload/rbd_cls.yaml distros/rhel_6.4.yaml}
email: null
job_id: '244048'
last_in_suite: false
machine_type: vps
name: teuthology-2014-05-08_19:33:19-upgrade:dumpling-x:parallel-firefly---basic-vps
nuke-on-error: true
os_type: rhel
os_version: '6.4'
overrides:
  admin_socket:
    branch: firefly
  ceph:
    conf:
      mon:
        debug mon: 20
        debug ms: 1
        debug paxos: 20
        mon warn on legacy crush tunables: false
      osd:
        debug filestore: 20
        debug journal: 20
        debug ms: 1
        debug osd: 20
    log-whitelist:
    - slow request
    - scrub mismatch
    - ScrubResult
    sha1: db8873b69c73b40110bf1512c114e4a0395671ab
  ceph-deploy:
    branch:
      dev: firefly
    conf:
      client:
        log file: /var/log/ceph/ceph-$name.$pid.log
      mon:
        debug mon: 1
        debug ms: 20
        debug paxos: 20
        osd default pool size: 2
  install:
    ceph:
      sha1: db8873b69c73b40110bf1512c114e4a0395671ab
  s3tests:
    branch: master
  workunit:
    sha1: db8873b69c73b40110bf1512c114e4a0395671ab
owner: scheduled_teuthology@teuthology
roles:
- - mon.a
  - mds.a
  - osd.0
  - osd.1
- - mon.b
  - mon.c
  - osd.2
  - osd.3
- - client.0
  - client.1
suite: upgrade:dumpling-x:parallel
targets:
  ubuntu@vpm115.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAszawQx1jNRmtPq2Gj3fqH1SfmgAYOBuSowujH1lTj1aXGprZh+mVaxty2o6gutS5bAbK7PwNXiSwcvR7dB9OwbioUcTMYCjOd1t5+I68Q9iYMP0bAH5DPr94LkVSkbLyI5sJEEjGs/fS0YhgTP79w7IQW8YeGuhst+P/BiV4+jbFqAUEgxqakfGhE4PgyN+GpAweRubGkIp1deDyKfhQJHcuuoAVey9MDRe9/4WmCYKcU3DQjMCKgUoYYV8Czdulmo883MHKTfS7v1aN6KEjOg6As9rsBb79LxYYtZkjjB6pV8WquPayaeaBcQu5zk0WA9ask0vbgasVlbkAgt8zWQ==
  ubuntu@vpm116.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAmdfTdH1+YVKSWTzmjeCdoQaPsdWO57KaVPUp+bz7HsB5YZcxhp1TJ8sPRHfcOCUlHF4SOMtZPTWGmYAPiZHchI78utbaNQtyY6jY64QXRXUag+j+FCoGk7fYlHSX9grDe6gY71I49ueVF+691ii4k3uYE+cCLP6DuOaXlFwo94zM0anNag9eyNdxS6uzm6/e4vUIUchUUUojZRUPLdBQNIw4bQNpG2K4n+mCqkO1NGlVgchXzkGWrCImguc+DUnXHGwsVEjwdf564x5fqpSv73pRZe3GAVQIbV3HAL0BnHefG3Dzdx2iZZQ5UleHvW1PpQWiXZCpniVSQB/crSUOiw==
  ubuntu@vpm117.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAvG0Msn4lUl1VCBvILd9Kto1yRa98FSoMUh3wH9pBeQVEhVnWLJzfn03zcmI4n7BvyJRnabC5VvVlt30BPNrQlHBWVZGGSRjxYh6QYvLO+NVtj+8ooJJ4SckdZ+hyUlTKYYvhq7cy/p4K9KmYfX6drdghcH7vdO0GcizyAr6BwF6779tBs6dZ+JUo8efFRc+pmNJfp1OJxpyVouijlV2FqtLCmz79G7/poZZXgllOHSONgOJq2zzxMf2dxEbYNa/1cn1i3iSes83J6xgn/oH8vlKpvgDbsiX3hTL21QQamK3bsX4JITtE6qYyrereWMtwCL4L5+PP9qIk1HZLRIEVwQ==
tasks:
- internal.lock_machines:
  - 3
  - vps
- internal.save_config: null
- internal.check_lock: null
- internal.connect: null
- internal.serialize_remote_roles: null
- internal.check_conflict: null
- internal.check_ceph_data: null
- internal.vm_setup: null
- internal.base: null
- internal.archive: null
- internal.coredump: null
- internal.sudo: null
- internal.syslog: null
- internal.timer: null
- chef: null
- clock.check: null
- install:
    branch: cuttlefish
- print: '**** done cuttlefish install'
- ceph:
    fs: xfs
- print: '**** done ceph'
- install.upgrade:
    all:
      branch: dumpling
- ceph.restart: null
- parallel:
  - workload
  - upgrade-sequence
- print: '**** done parallel'
- install.upgrade:
    client.0: null
- print: '**** done install.upgrade'
- workunit:
    clients:
      client.1:
      - cls/test_cls_rbd.sh
teuthology_branch: firefly
upgrade-sequence:
  sequential:
  - install.upgrade:
      mon.a: null
      mon.b: null
  - ceph.restart:
      daemons:
      - mon.a
      wait-for-healthy: false
      wait-for-osds-up: true
  - sleep:
      duration: 60
  - ceph.restart:
      daemons:
      - mon.b
      wait-for-healthy: false
      wait-for-osds-up: true
  - sleep:
      duration: 60
  - ceph.restart:
    - mon.c
  - sleep:
      duration: 60
  - ceph.restart:
    - osd.0
  - sleep:
      duration: 60
  - ceph.restart:
    - osd.1
  - sleep:
      duration: 60
  - ceph.restart:
    - osd.2
  - sleep:
      duration: 60
  - ceph.restart:
    - osd.3
  - sleep:
      duration: 60
  - ceph.restart:
    - mds.a
verbose: true
worker_log: /var/lib/teuthworker/archive/worker_logs/worker.vps.19330
workload:
  sequential:
  - workunit:
      branch: dumpling
      clients:
        client.0:
        - rados/load-gen-big.sh
description: upgrade/dumpling-x/parallel/{0-cluster/start.yaml 1-dumpling-install/cuttlefish-dumpling.yaml
  2-workload/rados_loadgenbig.yaml 3-upgrade-sequence/upgrade-mon-osd-mds.yaml 4-final-upgrade/client.yaml
  5-final-workload/rbd_cls.yaml distros/rhel_6.4.yaml}
duration: 3378.6129529476166
failure_reason: '"2014-05-09 03:27:49.176216 osd.0 10.214.138.182:6808/6617 305 :
  [WRN] map e11 wrongly marked me down" in cluster log'
flavor: basic
owner: scheduled_teuthology@teuthology
success: false
Actions #1

Updated by Sage Weil almost 10 years ago

  • Status changed from New to 12
  • Source changed from other to Q/A

From the logs it looks like the OSD just stalls and does nothing. I'm chalking it up to limited RAM on the VPS nodes and swapping.
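For context, the "wrongly marked me down" message appears when peers stop receiving an OSD's heartbeats and report it down to the monitors. A sketch of the relevant OSD options, written in the same teuthology `conf` style as the job YAML above (the values shown are the upstream defaults of that era, for reference; they are not set anywhere in this run):

```yaml
conf:
  osd:
    # seconds between heartbeat pings to peer OSDs
    osd heartbeat interval: 6
    # seconds a peer may miss heartbeats before it is
    # reported down to the monitors
    osd heartbeat grace: 20
```

On a swapping VPS node, a 20-second stall is enough to trip the grace period even though the daemon is still alive.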

Actions #2

Updated by Sage Weil almost 10 years ago

  • Subject changed from "[WRN] map e11 wrongly marked me down" in upgrade:dumpling-x:parallel-firefly---basic-vps suite to heartbeat timeouts too low for vps machines
  • Priority changed from Normal to Urgent

Making this bug about increasing the timeouts when running on vps.

Actions #3

Updated by Sage Weil almost 10 years ago

  • Status changed from 12 to Resolved

Added ~teuthology/vps.yaml and added it as an argument for all the vps-scheduled suites. It sets the heartbeat grace to 40s (from the default of 20s).
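The ticket does not show the file's contents, but based on the description, a teuthology overrides fragment along these lines would achieve the stated effect (a hypothetical sketch; the actual vps.yaml may differ in section placement and exact keys):

```yaml
overrides:
  ceph:
    conf:
      global:
        # Double the default 20s grace so swapping VPS nodes
        # are not spuriously reported down by their peers.
        osd heartbeat grace: 40
```

Putting the option under `global` (rather than `osd`) would make both the OSDs and the monitors agree on the longer grace period.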
