Bug #9258

closed

"Floating point exception" in upgrade:dumpling-firefly-x-master-distro-basic-vps suite

Added by Yuri Weinstein over 9 years ago. Updated over 9 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
-
Category:
-
Target version:
-
% Done:
0%

Source:
Q/A
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Logs are in http://qa-proxy.ceph.com/teuthology/teuthology-2014-08-27_14:00:01-upgrade:dumpling-firefly-x-master-distro-basic-vps/455607/

Coredump in */455607/remote/vpm020/log/ceph-mds.a.log.gz:

ceph-mds.a.log.gz:2014-08-27 21:50:05.519812 7fa0e6063700 -1 *** Caught signal (Floating point exception) **
ceph-mds.a.log.gz: in thread 7fa0e6063700
ceph-mds.a.log.gz:
ceph-mds.a.log.gz: ceph version 0.80.5-202-g8e3120f (8e3120fcb379a00d370e4c04d34af35e596e2de9)
ceph-mds.a.log.gz: 1: ceph-mds() [0x8d45a2]
ceph-mds.a.log.gz: 2: (()+0xf030) [0x7fa0ea6e1030]
ceph-mds.a.log.gz: 3: (Locker::calc_new_client_ranges(CInode*, unsigned long, std::map<client_t, client_writeable_range_t, std::less<client_t>, std::allocator<std::pair<client_t const, client_writeable_range_t> > >&)+0x4e) [0x7d667e]
ceph-mds.a.log.gz: 4: (Locker::check_inode_max_size(CInode*, bool, bool, unsigned long, bool, unsigned long, utime_t)+0xb1) [0x7e0341]
ceph-mds.a.log.gz: 5: (MDCache::start_files_to_recover(std::vector<CInode*, std::allocator<CInode*> >&, std::vector<CInode*, std::allocator<CInode*> >&)+0x63) [0x70ec83]
ceph-mds.a.log.gz: 6: (MDCache::open_snap_parents()+0xc9f) [0x731eff]
ceph-mds.a.log.gz: 7: (MDCache::rejoin_gather_finish()+0x1a8) [0x771f68]
ceph-mds.a.log.gz: 8: (MDCache::rejoin_send_rejoins()+0x2a59) [0x77e1a9]
ceph-mds.a.log.gz: 9: (MDS::rejoin_joint_start()+0x6c) [0x65237c]
ceph-mds.a.log.gz: 10: (MDS::handle_mds_map(MMDSMap*)+0x22c2) [0x663062]
ceph-mds.a.log.gz: 11: (MDS::handle_core_message(Message*)+0xc5b) [0x667e3b]
ceph-mds.a.log.gz: 12: (MDS::_dispatch(Message*)+0x33) [0x667f73]
ceph-mds.a.log.gz: 13: (MDS::ms_dispatch(Message*)+0xc2) [0x669d42]
ceph-mds.a.log.gz: 14: (DispatchQueue::entry()+0x4eb) [0xa6d3fb]
ceph-mds.a.log.gz: 15: (DispatchQueue::DispatchThread::entry()+0xd) [0x99053d]
ceph-mds.a.log.gz: 16: (()+0x6b50) [0x7fa0ea6d8b50]
ceph-mds.a.log.gz: 17: (clone()+0x6d) [0x7fa0e9500a7d]
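
Side note: the signal name is misleading. SIGFPE here almost certainly comes from integer arithmetic (a division or modulo by a zero value) rather than from floating-point math, since frame 3 is Locker::calc_new_client_ranges(). The following is a minimal, hypothetical C++ sketch of that mechanism only; all names are invented and it merely demonstrates how a zero divisor turns into the "Caught signal (Floating point exception)" line above:

// sigfpe_sketch.cpp -- illustration only; names are invented and do not come
// from the Ceph source tree. Shows how an integer division by zero raises
// SIGFPE (reported as "Floating point exception"), which a fatal-signal
// handler like the one ceph-mds installs can then catch and report.
#include <csignal>
#include <cstdio>
#include <cstdlib>

static void handle_fatal_signal(int signum) {
  // Roughly the shape of the "*** Caught signal (...) **" lines in the log:
  // report the signal, then exit instead of returning into the faulting code.
  std::fprintf(stderr, "*** Caught signal %d (Floating point exception) **\n", signum);
  std::_Exit(1);
}

// Hypothetical stand-in for rounding a size up to a layout increment, the
// kind of arithmetic done when recomputing client writeable ranges.
static unsigned long round_up_to(unsigned long n, unsigned long increment) {
  return ((n + increment - 1) / increment) * increment;  // faults if increment == 0
}

int main() {
  std::signal(SIGFPE, handle_fatal_signal);
  volatile unsigned long layout_increment = 0;  // e.g. a bogus/zeroed layout value
  unsigned long new_max = round_up_to(4096, layout_increment);
  std::printf("new_max = %lu\n", new_max);      // never reached
  return 0;
}
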
2014-08-27T15:07:45.320 INFO:teuthology.orchestra.run.vpm020:Running: 'adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph health'
2014-08-27T15:07:47.487 DEBUG:teuthology.misc:Ceph health: HEALTH_WARN mds cluster is degraded; mds a is laggy
2014-08-27T15:07:48.487 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):
  File "/home/teuthworker/src/teuthology_master/teuthology/run_tasks.py", line 53, in run_tasks
    manager.__enter__()
  File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/var/lib/teuthworker/src/ceph-qa-suite_master/tasks/ceph.py", line 1097, in restart
    def stop(ctx, config):
  File "/var/lib/teuthworker/src/ceph-qa-suite_master/tasks/ceph.py", line 994, in healthy
    ctx,
  File "/home/teuthworker/src/teuthology_master/teuthology/misc.py", line 820, in wait_until_healthy
    while proceed():
  File "/home/teuthworker/src/teuthology_master/teuthology/contextutil.py", line 127, in __call__
    raise MaxWhileTries(error_msg)
MaxWhileTries: 'wait_until_healthy'reached maximum tries (150) after waiting for 900 seconds
2014-08-27T15:07:48.488 DEBUG:teuthology.run_tasks:Unwinding manager install.upgrade
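
The teuthology failure itself is a consequence rather than a separate problem: wait_until_healthy polls 'ceph health' in a bounded loop, and with the mds down the cluster never left HEALTH_WARN, so the loop gave up after its 150-try / 900-second budget. Below is a rough sketch of that kind of bounded poll; the real logic is teuthology's Python wait_until_healthy/safe_while, and the C++ here, including the ~6-second interval inferred from 900 s / 150 tries, is only illustrative:

// bounded_poll_sketch.cpp -- illustration only; the real logic is Python, in
// teuthology/misc.py (wait_until_healthy) and teuthology/contextutil.py
// (safe_while / MaxWhileTries). Names and the stubbed health output are
// placeholders.
#include <chrono>
#include <cstdio>
#include <stdexcept>
#include <string>
#include <thread>

// Stand-in for running "ceph health" and reading its status line. Here it is
// pinned to the degraded state seen in this run, so the loop never succeeds.
static std::string get_ceph_health() {
  return "HEALTH_WARN mds cluster is degraded; mds a is laggy";
}

// Poll until the cluster reports HEALTH_OK, or give up after max_tries.
// 150 tries at ~6 s apiece matches the 900-second budget in the error above.
static void wait_until_healthy(int max_tries = 150,
                               std::chrono::seconds interval = std::chrono::seconds(6)) {
  for (int attempt = 1; attempt <= max_tries; ++attempt) {
    if (get_ceph_health().rfind("HEALTH_OK", 0) == 0)
      return;  // healthy, done
    std::this_thread::sleep_for(interval);
  }
  throw std::runtime_error(
      "'wait_until_healthy' reached maximum tries (150) after waiting for 900 seconds");
}

int main() {
  try {
    wait_until_healthy();  // with the stub above, this runs the full budget
  } catch (const std::exception& e) {
    std::fprintf(stderr, "%s\n", e.what());
    return 1;
  }
  return 0;
}
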
archive_path: /var/lib/teuthworker/archive/teuthology-2014-08-27_14:00:01-upgrade:dumpling-firefly-x-master-distro-basic-vps/455607
branch: master
description: upgrade:dumpling-firefly-x/parallel/{0-cluster/start.yaml 1-dumpling-install/dumpling.yaml
  2-workload/{rados_api.yaml rados_loadgenbig.yaml test_rbd_api.yaml test_rbd_python.yaml}
  3-firefly-upgrade/firefly.yaml 4-workload/{rados_api.yaml rados_loadgenbig.yaml
  test_rbd_api.yaml test_rbd_python.yaml} 5-upgrade-sequence/upgrade-by-type.yaml
  6-final-workload/{ec-readwrite.yaml rados-snaps-few-objects.yaml rados_loadgenmix.yaml
  rados_mon_thrash.yaml rbd_cls.yaml rbd_import_export.yaml rgw_s3tests.yaml rgw_swift.yaml}
  distros/debian_7.0.yaml}
email: ceph-qa@ceph.com
job_id: '455607'
kernel: &id001
  kdb: true
  sha1: distro
last_in_suite: false
machine_type: vps
name: teuthology-2014-08-27_14:00:01-upgrade:dumpling-firefly-x-master-distro-basic-vps
nuke-on-error: true
os_type: debian
os_version: '7.0'
overrides:
  admin_socket:
    branch: master
  ceph:
    conf:
      global:
        osd heartbeat grace: 100
      mon:
        debug mon: 20
        debug ms: 1
        debug paxos: 20
        mon warn on legacy crush tunables: false
      osd:
        debug filestore: 20
        debug journal: 20
        debug ms: 1
        debug osd: 20
    log-whitelist:
    - slow request
    - scrub mismatch
    - ScrubResult
    sha1: f25bca313629725b195bf6be43ae2236084064a3
  ceph-deploy:
    branch:
      dev: master
    conf:
      client:
        log file: /var/log/ceph/ceph-$name.$pid.log
      mon:
        debug mon: 1
        debug ms: 20
        debug paxos: 20
        osd default pool size: 2
  install:
    ceph:
      sha1: f25bca313629725b195bf6be43ae2236084064a3
  rgw:
    default_idle_timeout: 1200
  s3tests:
    branch: master
    idle_timeout: 1200
  workunit:
    sha1: f25bca313629725b195bf6be43ae2236084064a3
owner: scheduled_teuthology@teuthology
priority: 1000
roles:
- - mon.a
  - mds.a
  - osd.0
  - osd.1
- - mon.b
  - mon.c
  - osd.2
  - osd.3
- - client.0
  - client.1
suite: upgrade:dumpling-firefly-x
suite_branch: master
suite_path: /var/lib/teuthworker/src/ceph-qa-suite_master
targets:
  ubuntu@vpm020.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDP9/gmqf4z8WkyglFVyMZQev/K+q9RmyzNETpbvgyT0YTvTaSo/OPlwJIBqPlUlikaSJFmV/PJjbw8zHhxFgZyzoMKkCSypC3q9HqzCJa7vb9mvBwWKq8veAYTibAydfTkJHqLROZJNzCTfzwwOFTVUnVEEJrk8T0eaIla8nEP29CroJOOSIp+x5ITFGohu8ucDQKQLdf16n4rPDM+ZXLSeFn7Eb26sN6dp+n/iKQYnqR39vZSkosjd9vnuf5HhvrnRtyC1nBkTnoaeohA+F1biaXCwVANVPhx131PzL0xmdBjAXYtcZ6gMcfmKBeiVQsxledZoGqF3RHjD16r/NQD
  ubuntu@vpm053.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDmVdGKt+JC1/BSsczXWcd9v0cp0O4cvOUdArP+l+Hq8LlCGcfkoEaUIqm7ebFJgA7ouOP/N/oWPIzmp12Hk6tWcUPrRFgQuQGLpOeKQG9GHqYTK8vyAd1jg7Qq4wS9krliSwUfmKt5UgZLqzfyHVvL5EnYEvQ6zopOsKLgeBwknR2ZfN1mMQs30mYqRna/FsMlYZ2gDw6SwyGO7aLJxq4Ej3dfIPrSkOUFMq3mlTb9eck0UIQOjuSUpHch2upp8cMooxWiyoy0Frw1xm0X/WNWNEKPYm1Z46J0apLtu9xM6J7EahxDzXN0JNuO4lqR2sevnvMo4RWkZ+k8NUB6i1tP
  ubuntu@vpm087.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCnMLtjGS5BGN8NEpha11cIusObD9Mqysnh0eD6oYf+0otRSqDFGY8aVIRtSuPu4w5OOuLcTHYZWBol+HJkCjYV6JRdrBRnq1Ex1cFiCEL8e/nDii7DxtpHZElS0Bkf7cIf/hiLSQFP2QkTJipl0tbuuGEfQS3LemiYIkmZqZ0lXeL8ok47tW7H0YynNUirHrm1zk/l+vFpfNTl0FgU3RUXpUJKCk+xrLWsb0whU50rP3au+BxIVmUbI7VC7r9XHjNUDK0w+qfrgOa6Gok/rCHF12t4fzKTjys9rnMGQGi0+irsJ5bpjoTzk90+MXs2yFXXn7W5GdeyxMfS0im+PiAV
tasks:
- internal.lock_machines:
  - 3
  - vps
- internal.save_config: null
- internal.check_lock: null
- internal.connect: null
- internal.serialize_remote_roles: null
- internal.check_conflict: null
- internal.check_ceph_data: null
- internal.vm_setup: null
- kernel: *id001
- internal.base: null
- internal.archive: null
- internal.coredump: null
- internal.sudo: null
- internal.syslog: null
- internal.timer: null
- chef: null
- clock.check: null
- install:
    branch: dumpling
- print: '**** done dumpling install'
- ceph:
    fs: xfs
- parallel:
  - workload
- print: '**** done parallel'
- install.upgrade:
    client.0:
      branch: firefly
    mon.a:
      branch: firefly
    mon.b:
      branch: firefly
- print: '**** done install.upgrade'
- ceph.restart: null
- print: '**** done restart'
- parallel:
  - workload2
  - upgrade-sequence
- print: '**** done parallel'
- install.upgrade:
    client.0: null
- print: '**** done install.upgrade client.0 to the version from teuthology-suite
    arg'
- rados:
    clients:
    - client.0
    ec_pool: true
    objects: 500
    op_weights:
      append: 45
      delete: 10
      read: 45
      write: 0
    ops: 4000
- rados:
    clients:
    - client.1
    objects: 50
    op_weights:
      delete: 50
      read: 100
      rollback: 50
      snap_create: 50
      snap_remove: 50
      write: 100
    ops: 4000
- workunit:
    clients:
      client.1:
      - rados/load-gen-mix.sh
- sequential:
  - mon_thrash:
      revive_delay: 20
      thrash_delay: 1
  - workunit:
      clients:
        client.1:
        - rados/test.sh
  - print: '**** done rados/test.sh - 6-final-workload'
- workunit:
    clients:
      client.1:
      - cls/test_cls_rbd.sh
- workunit:
    clients:
      client.1:
      - rbd/import_export.sh
    env:
      RBD_CREATE_ARGS: --new-format
- rgw:
  - client.1
- s3tests:
    client.1:
      rgw_server: client.1
- swift:
    client.1:
      rgw_server: client.1
teuthology_branch: master
tube: vps
upgrade-sequence:
  sequential:
  - install.upgrade:
      mon.a: null
  - print: '**** done install.upgrade mon.a to the version from teuthology-suite arg'
  - install.upgrade:
      mon.b: null
  - print: '**** done install.upgrade mon.b to the version from teuthology-suite arg'
  - ceph.restart:
      daemons:
      - mon.a
      - mon.b
      - mon.c
      wait-for-healthy: true
  - sleep:
      duration: 60
  - ceph.restart:
      daemons:
      - osd.0
      - osd.1
      - osd.2
      - osd.3
      wait-for-healthy: true
  - sleep:
      duration: 60
  - ceph.restart:
    - mds.a
  - sleep:
      duration: 60
  - exec:
      mon.a:
      - ceph osd crush tunables firefly
verbose: true
worker_log: /var/lib/teuthworker/archive/worker_logs/worker.vps.4650
workload:
  sequential:
  - workunit:
      branch: dumpling
      clients:
        client.0:
        - rados/test.sh
        - cls
  - print: '**** done rados/test.sh &  cls'
  - workunit:
      branch: dumpling
      clients:
        client.0:
        - rados/load-gen-big.sh
  - print: '**** done rados/load-gen-big.sh'
  - workunit:
      branch: dumpling
      clients:
        client.0:
        - rbd/test_librbd.sh
  - print: '**** done rbd/test_librbd.sh'
  - workunit:
      branch: dumpling
      clients:
        client.0:
        - rbd/test_librbd_python.sh
  - print: '**** done rbd/test_librbd_python.sh'
workload2:
  sequential:
  - workunit:
      branch: firefly
      clients:
        client.0:
        - rados/test.sh
        - cls
  - print: '**** done #rados/test.sh and cls 2'
  - workunit:
      branch: firefly
      clients:
        client.0:
        - rados/load-gen-big.sh
  - print: '**** done rados/load-gen-big.sh 2'
  - workunit:
      branch: firefly
      clients:
        client.0:
        - rbd/test_librbd.sh
  - print: '**** done rbd/test_librbd.sh 2'
  - workunit:
      branch: firefly
      clients:
        client.0:
        - rbd/test_librbd_python.sh
  - print: '**** done rbd/test_librbd_python.sh 2'
description: upgrade:dumpling-firefly-x/parallel/{0-cluster/start.yaml 1-dumpling-install/dumpling.yaml
  2-workload/{rados_api.yaml rados_loadgenbig.yaml test_rbd_api.yaml test_rbd_python.yaml}
  3-firefly-upgrade/firefly.yaml 4-workload/{rados_api.yaml rados_loadgenbig.yaml
  test_rbd_api.yaml test_rbd_python.yaml} 5-upgrade-sequence/upgrade-by-type.yaml
  6-final-workload/{ec-readwrite.yaml rados-snaps-few-objects.yaml rados_loadgenmix.yaml
  rados_mon_thrash.yaml rbd_cls.yaml rbd_import_export.yaml rgw_s3tests.yaml rgw_swift.yaml}
  distros/debian_7.0.yaml}
duration: 4069.9191370010376
failure_reason: '''wait_until_healthy''reached maximum tries (150) after waiting for
  900 seconds'
flavor: basic
owner: scheduled_teuthology@teuthology
success: false
#1

Updated by Yuri Weinstein over 9 years ago

  • Priority changed from Normal to Urgent
#2

Updated by Sage Weil over 9 years ago

  • Status changed from New to Resolved
#3

Updated by Yuri Weinstein over 9 years ago

  • Status changed from Resolved to New

Still seeing the crash, but only in one test now:

http://qa-proxy.ceph.com/teuthology/teuthology-2014-08-27_17:50:01-upgrade:dumpling-firefly-x-master-distro-basic-vps/456876/teuthology.log

2014-08-27T20:33:19.416 INFO:tasks.ceph.mds.a.vpm055.stderr:*** Caught signal (Floating point exception) **
2014-08-27T20:33:19.416 INFO:tasks.ceph.mds.a.vpm055.stderr: in thread 7f340f600700
2014-08-27T20:33:19.418 INFO:tasks.ceph.mds.a.vpm055.stderr: ceph version 0.80.5-202-g8e3120f (8e3120fcb379a00d370e4c04d34af35e596e2de9)
2014-08-27T20:33:19.418 INFO:tasks.ceph.mds.a.vpm055.stderr: 1: ceph-mds() [0x82f2c1]
2014-08-27T20:33:19.419 INFO:tasks.ceph.mds.a.vpm055.stderr: 2: (()+0xf710) [0x7f3414103710]
2014-08-27T20:33:19.419 INFO:tasks.ceph.mds.a.vpm055.stderr: 3: (Locker::calc_new_client_ranges(CInode*, unsigned long, std::map<client_t, client_writeable_range_t, std::less<client_t>, std::allocator<std::pair<client_t const, client_writeable_range_t> > >&)+0x54) [0x6f3984]
2014-08-27T20:33:19.419 INFO:tasks.ceph.mds.a.vpm055.stderr: 4: (Locker::check_inode_max_size(CInode*, bool, bool, unsigned long, bool, unsigned long, utime_t)+0x113) [0x70aaf3]
2014-08-27T20:33:19.419 INFO:tasks.ceph.mds.a.vpm055.stderr: 5: (MDCache::start_files_to_recover(std::vector<CInode*, std::allocator<CInode*> >&, std::vector<CInode*, std::allocator<CInode*> >&)+0x52) [0x63d3f2]
2014-08-27T20:33:19.419 INFO:tasks.ceph.mds.a.vpm055.stderr: 6: (MDCache::open_snap_parents()+0xb47) [0x6a5a87]
2014-08-27T20:33:19.420 INFO:tasks.ceph.mds.a.vpm055.stderr: 7: (MDCache::rejoin_gather_finish()+0x146) [0x6a62a6]
2014-08-27T20:33:19.420 INFO:tasks.ceph.mds.a.vpm055.stderr: 8: (MDCache::rejoin_send_rejoins()+0x2f9a) [0x6a930a]
2014-08-27T20:33:19.420 INFO:tasks.ceph.mds.a.vpm055.stderr: 9: (MDS::rejoin_joint_start()+0x142) [0x575f32]
2014-08-27T20:33:19.420 INFO:tasks.ceph.mds.a.vpm055.stderr: 10: (MDS::handle_mds_map(MMDSMap*)+0x4683) [0x58fb33]
2014-08-27T20:33:19.420 INFO:tasks.ceph.mds.a.vpm055.stderr: 11: (MDS::handle_core_message(Message*)+0x9ab) [0x59069b]
2014-08-27T20:33:19.420 INFO:tasks.ceph.mds.a.vpm055.stderr: 12: (MDS::_dispatch(Message*)+0x2f) [0x59076f]
2014-08-27T20:33:19.421 INFO:tasks.ceph.mds.a.vpm055.stderr: 13: (MDS::ms_dispatch(Message*)+0x1b3) [0x592233]
2014-08-27T20:33:19.421 INFO:tasks.ceph.mds.a.vpm055.stderr: 14: (DispatchQueue::entry()+0x5a2) [0xa0bd02]
2014-08-27T20:33:19.421 INFO:tasks.ceph.mds.a.vpm055.stderr: 15: (DispatchQueue::DispatchThread::entry()+0xd) [0x9d370d]
2014-08-27T20:33:19.421 INFO:tasks.ceph.mds.a.vpm055.stderr: 16: (()+0x79d1) [0x7f34140fb9d1]
2014-08-27T20:33:19.421 INFO:tasks.ceph.mds.a.vpm055.stderr: 17: (clone()+0x6d) [0x7f341328fb6d]
2014-08-27T20:33:19.421 INFO:tasks.ceph.mds.a.vpm055.stderr:2014-08-27 23:33:19.418082 7f340f600700 -1 *** Caught signal (Floating point exception) **
2014-08-27T20:33:19.421 INFO:tasks.ceph.mds.a.vpm055.stderr: in thread 7f340f600700
2014-08-27T20:33:19.422 INFO:tasks.ceph.mds.a.vpm055.stderr:
2014-08-27T20:33:19.422 INFO:tasks.ceph.mds.a.vpm055.stderr: ceph version 0.80.5-202-g8e3120f (8e3120fcb379a00d370e4c04d34af35e596e2de9)
2014-08-27T20:33:19.422 INFO:tasks.ceph.mds.a.vpm055.stderr: 1: ceph-mds() [0x82f2c1]
2014-08-27T20:33:19.422 INFO:tasks.ceph.mds.a.vpm055.stderr: 2: (()+0xf710) [0x7f3414103710]
2014-08-27T20:33:19.422 INFO:tasks.ceph.mds.a.vpm055.stderr: 3: (Locker::calc_new_client_ranges(CInode*, unsigned long, std::map<client_t, client_writeable_range_t, std::less<client_t>, std::allocator<std::pair<client_t const, client_writeable_range_t> > >&)+0x54) [0x6f3984]
2014-08-27T20:33:19.423 INFO:tasks.ceph.mds.a.vpm055.stderr: 4: (Locker::check_inode_max_size(CInode*, bool, bool, unsigned long, bool, unsigned long, utime_t)+0x113) [0x70aaf3]
2014-08-27T20:33:19.423 INFO:tasks.ceph.mds.a.vpm055.stderr: 5: (MDCache::start_files_to_recover(std::vector<CInode*, std::allocator<CInode*> >&, std::vector<CInode*, std::allocator<CInode*> >&)+0x52) [0x63d3f2]
2014-08-27T20:33:19.423 INFO:tasks.ceph.mds.a.vpm055.stderr: 6: (MDCache::open_snap_parents()+0xb47) [0x6a5a87]
2014-08-27T20:33:19.423 INFO:tasks.ceph.mds.a.vpm055.stderr: 7: (MDCache::rejoin_gather_finish()+0x146) [0x6a62a6]
2014-08-27T20:33:19.423 INFO:tasks.ceph.mds.a.vpm055.stderr: 8: (MDCache::rejoin_send_rejoins()+0x2f9a) [0x6a930a]
2014-08-27T20:33:19.423 INFO:tasks.ceph.mds.a.vpm055.stderr: 9: (MDS::rejoin_joint_start()+0x142) [0x575f32]
2014-08-27T20:33:19.424 INFO:tasks.ceph.mds.a.vpm055.stderr: 10: (MDS::handle_mds_map(MMDSMap*)+0x4683) [0x58fb33]
2014-08-27T20:33:19.424 INFO:tasks.ceph.mds.a.vpm055.stderr: 11: (MDS::handle_core_message(Message*)+0x9ab) [0x59069b]
2014-08-27T20:33:19.424 INFO:tasks.ceph.mds.a.vpm055.stderr: 12: (MDS::_dispatch(Message*)+0x2f) [0x59076f]
2014-08-27T20:33:19.424 INFO:tasks.ceph.mds.a.vpm055.stderr: 13: (MDS::ms_dispatch(Message*)+0x1b3) [0x592233]
2014-08-27T20:33:19.424 INFO:tasks.ceph.mds.a.vpm055.stderr: 14: (DispatchQueue::entry()+0x5a2) [0xa0bd02]
2014-08-27T20:33:19.424 INFO:tasks.ceph.mds.a.vpm055.stderr: 15: (DispatchQueue::DispatchThread::entry()+0xd) [0x9d370d]
2014-08-27T20:33:19.425 INFO:tasks.ceph.mds.a.vpm055.stderr: 16: (()+0x79d1) [0x7f34140fb9d1]
2014-08-27T20:33:19.425 INFO:tasks.ceph.mds.a.vpm055.stderr: 17: (clone()+0x6d) [0x7f341328fb6d]
#4

Updated by Sage Weil over 9 years ago

  • Status changed from New to Resolved

That crash is on the same old commit, from before the fix was applied. The latest firefly has the backported patch.
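
For readers landing here later: the crash path presumably divides by a size increment taken from the inode's file layout, so a zero increment would explain the integer divide-by-zero, and the fix referred to above would amount to not performing that arithmetic with a zero divisor. The snippet below is only a hypothetical illustration of such a guard, with invented names; it is not the backported commit:

// guard_sketch.cpp -- hypothetical illustration only; not the actual patch.
#include <cstdio>

// Invented stand-in for the layout increment the round-up divides by.
struct FileLayoutLike {
  unsigned long size_increment;
};

// Round 'size' up to a multiple of the layout increment, but refuse to divide
// by zero: fall back to the raw size when the layout looks bogus.
static unsigned long safe_round_up(unsigned long size, const FileLayoutLike& layout) {
  if (layout.size_increment == 0)
    return size;                          // guard: avoid SIGFPE on a zero divisor
  return ((size + layout.size_increment - 1) / layout.size_increment)
         * layout.size_increment;
}

int main() {
  FileLayoutLike bogus{0}, sane{4u << 20};
  std::printf("%lu %lu\n", safe_round_up(4096, bogus), safe_round_up(4096, sane));
  return 0;
}
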
