Bug #1976 (Closed): osd: timeout getting clean

Added by Sage Weil over 12 years ago. Updated over 12 years ago.

Status: Closed
Priority: Urgent
Assignee: -
Category: OSD
Target version: -
% Done: 0%


Description

Actually, this may be a monitor thing. Seen it twice now:

/var/lib/teuthworker/archive/nightly_coverage_2012-01-24-a/8880
/var/lib/teuthworker/archive/nightly_coverage_2012-01-23-b/8773

2012-01-24T02:33:35.530 INFO:teuthology.task.ceph.mon.2.err:2012-01-24 02:12:05.318944 7f2b9f060700 mon.2@2(peon).pg v845 PGMonitor::preprocess_pg_stats: no monitor session!
2012-01-24T02:33:35.530 INFO:teuthology.task.ceph.mon.1.err:2012-01-24 02:12:05.319365 7f1c65650700 mon.1@0(leader).pg v845 PGMonitor::preprocess_pg_stats: no monitor session!
2012-01-24T02:33:47.031 INFO:teuthology.task.ceph.mon.1.err:2012-01-24 02:12:16.818886 7f1c65650700 mon.1@0(leader).pg v846 PGMonitor::preprocess_pg_stats: no monitor session!
2012-01-24T02:33:47.031 INFO:teuthology.task.ceph.mon.1.err:2012-01-24 02:12:16.818944 7f1c65650700 mon.1@0(leader).pg v846 PGMonitor::preprocess_pg_stats: no monitor session!
2012-01-24T02:33:47.131 INFO:teuthology.task.ceph.mon.2.err:2012-01-24 02:12:16.919551 7f2b9f060700 mon.2@2(peon).pg v846 PGMonitor::preprocess_pg_stats: no monitor session!
2012-01-24T02:33:47.136 INFO:teuthology.task.ceph.mon.2.err:2012-01-24 02:12:16.923964 7f2b9f060700 mon.2@2(peon).pg v846 PGMonitor::preprocess_pg_stats: no monitor session!
2012-01-24T02:33:47.136 INFO:teuthology.task.ceph.mon.2.err:2012-01-24 02:12:16.924135 7f2b9f060700 mon.2@2(peon).pg v846 PGMonitor::preprocess_pg_stats: no monitor session!
2012-01-24T02:33:48.363 DEBUG:teuthology.orchestra.run:Running: 'LD_LIBRARY_PRELOAD=/tmp/cephtest/binary/usr/local/lib /tmp/cephtest/enable-coredump /tmp/cephtest/binary/usr/local/bin/ceph-coverage /tmp/cephtest/archive/coverage /tmp/cephtest/binary/usr/local/bin/ceph -k /tmp/cephtest/ceph.keyring -c /tmp/cephtest/ceph.conf --concise -s'
2012-01-24T02:33:54.824 INFO:teuthology.task.thrashosds.ceph_manager:2012-01-24 02:12:22.926816   mds e5: 1/1/1 up {0=a=up:active}
2012-01-24 02:12:22.927055   osd e123: 6 osds: 6 up, 6 in
2012-01-24 02:12:22.927175   mon e1: 3 mons at {0=10.3.14.183:6789/0,1=10.3.14.182:6789/0,2=10.3.14.182:6790/0}
2012-01-24 02:12:23.959943    pg v847: 108 pgs: 16 active, 91 active+clean, 1 active+clean+scrubbing; 10660 MB data, 212 GB used, 823 GB / 1055 GB avail; 2091/5374 degraded (38.910%)
2012-01-24 02:12:24.601922   log 2012-01-24 02:12:19.876793 mon.0 10.3.14.182:6789/0 86 : [INF] mon.1@0 won leader election with quorum 0,1,2

2012-01-24T02:33:57.831 DEBUG:teuthology.orchestra.run:Running: 'LD_LIBRARY_PRELOAD=/tmp/cephtest/binary/usr/local/lib /tmp/cephtest/enable-coredump /tmp/cephtest/binary/usr/local/bin/ceph-coverage /tmp/cephtest/archive/coverage /tmp/cephtest/binary/usr/local/bin/ceph -k /tmp/cephtest/ceph.keyring -c /tmp/cephtest/ceph.conf --concise -- pg dump --format=json'
2012-01-24T02:33:58.693 DEBUG:teuthology.orchestra.run:Running: 'LD_LIBRARY_PRELOAD=/tmp/cephtest/binary/usr/local/lib /tmp/cephtest/enable-coredump /tmp/cephtest/binary/usr/local/bin/ceph-coverage /tmp/cephtest/archive/coverage /tmp/cephtest/binary/usr/local/bin/ceph -k /tmp/cephtest/ceph.keyring -c /tmp/cephtest/ceph.conf --concise -s'
2012-01-24T02:34:04.321 INFO:teuthology.task.ceph.mon.1.err:2012-01-24 02:12:34.107506 7f1c65650700 mon.1@0(leader).pg v850 PGMonitor::preprocess_pg_stats: no monitor session!
2012-01-24T02:34:04.345 INFO:teuthology.task.ceph.mon.2.err:2012-01-24 02:12:34.131715 7f2b9f060700 mon.2@2(peon).pg v850 PGMonitor::preprocess_pg_stats: no monitor session!
2012-01-24T02:34:04.346 INFO:teuthology.task.ceph.mon.2.err:2012-01-24 02:12:34.131760 7f2b9f060700 mon.2@2(peon).pg v850 PGMonitor::preprocess_pg_stats: no monitor session!
2012-01-24T02:34:04.476 INFO:teuthology.task.thrashosds.ceph_manager:2012-01-24 02:12:31.589185   mds e5: 1/1/1 up {0=a=up:active}
2012-01-24 02:12:31.589529   osd e123: 6 osds: 6 up, 6 in
2012-01-24 02:12:31.589656   log 2012-01-24 02:12:12.161518 mon.1 10.3.14.183:6789/0 23 : [INF] mon.0 calling new monitor election
2012-01-24 02:12:31.589739   mon e1: 3 mons at {0=10.3.14.183:6789/0,1=10.3.14.182:6789/0,2=10.3.14.182:6790/0}
2012-01-24 02:12:34.107018    pg v850: 108 pgs: 16 active, 91 active+clean, 1 active+clean+scrubbing; 10660 MB data, 212 GB used, 822 GB / 1055 GB avail; 2062/5374 degraded (38.370%)

2012-01-24T02:34:07.360 INFO:teuthology.task.ceph.mon.1.err:2012-01-24 02:12:37.145958 7f1c65650700 mon.1@0(leader).pg v851 PGMonitor::preprocess_pg_stats: no monitor session!
2012-01-24T02:34:07.481 DEBUG:teuthology.orchestra.run:Running: 'LD_LIBRARY_PRELOAD=/tmp/cephtest/binary/usr/local/lib /tmp/cephtest/enable-coredump /tmp/cephtest/binary/usr/local/bin/ceph-coverage /tmp/cephtest/archive/coverage /tmp/cephtest/binary/usr/local/bin/ceph -k /tmp/cephtest/ceph.keyring -c /tmp/cephtest/ceph.conf --concise -- pg dump --format=json'
2012-01-24T02:34:07.985 DEBUG:teuthology.orchestra.run:Running: 'LD_LIBRARY_PRELOAD=/tmp/cephtest/binary/usr/local/lib /tmp/cephtest/enable-coredump /tmp/cephtest/binary/usr/local/bin/ceph-coverage /tmp/cephtest/archive/coverage /tmp/cephtest/binary/usr/local/bin/ceph -k /tmp/cephtest/ceph.keyring -c /tmp/cephtest/ceph.conf --concise -s'
2012-01-24T02:34:16.529 INFO:teuthology.task.ceph.mon.1.err:2012-01-24 02:12:46.313484 7f1c65650700 mon.1@0(leader).pg v853 PGMonitor::preprocess_pg_stats: no monitor session!
2012-01-24T02:34:16.529 INFO:teuthology.task.ceph.mon.1.err:2012-01-24 02:12:46.313583 7f1c65650700 mon.1@0(leader).pg v853 PGMonitor::preprocess_pg_stats: no monitor session!
2012-01-24T02:34:16.529 INFO:teuthology.task.ceph.mon.2.err:2012-01-24 02:12:46.313380 7f2b9f060700 mon.2@2(peon).pg v853 PGMonitor::preprocess_pg_stats: no monitor session!
2012-01-24T02:34:17.294 INFO:teuthology.task.thrashosds.ceph_manager:2012-01-24 02:12:40.847219   mds e5: 1/1/1 up {0=a=up:active}
2012-01-24 02:12:40.847634   osd e123: 6 osds: 6 up, 6 in
2012-01-24 02:12:40.847752   log 2012-01-24 02:12:12.161518 mon.1 10.3.14.183:6789/0 23 : [INF] mon.0 calling new monitor election
2012-01-24 02:12:40.847944   mon e1: 3 mons at {0=10.3.14.183:6789/0,1=10.3.14.182:6789/0,2=10.3.14.182:6790/0}
2012-01-24 02:12:46.313095    pg v853: 108 pgs: 16 active, 91 active+clean, 1 active+clean+scrubbing; 10660 MB data, 212 GB used, 822 GB / 1055 GB avail; 2033/5374 degraded (37.830%)

2012-01-24T02:34:17.295 ERROR:teuthology.run_tasks:Manager failed: <contextlib.GeneratorContextManager object at 0x2c30c10>
Traceback (most recent call last):
  File "/var/lib/teuthworker/teuthology/teuthology/run_tasks.py", line 45, in run_tasks
    suppress = manager.__exit__(*exc_info)
  File "/usr/lib/python2.6/contextlib.py", line 23, in __exit__
    self.gen.next()
  File "/var/lib/teuthworker/teuthology/teuthology/task/thrashosds.py", line 85, in task
    manager.wait_till_clean(config.get('timeout', 360))
  File "/var/lib/teuthworker/teuthology/teuthology/task/ceph_manager.py", line 238, in wait_till_clean
    'failed to become clean before timeout expired'
AssertionError: failed to become clean before timeout expired
2012-01-24T02:34:17.296 DEBUG:teuthology.run_tasks:Unwinding manager <contextlib.GeneratorContextManager object at 0x2c113d0>
2012-01-24T02:34:17.296 ERROR:teuthology.contextutil:Saw exception from nested tasks
Traceback (most recent call last):
  File "/var/lib/teuthworker/teuthology/teuthology/contextutil.py", line 27, in nested
    yield vars
  File "/var/lib/teuthworker/teuthology/teuthology/task/ceph.py", line 972, in task
    yield
  File "/var/lib/teuthworker/teuthology/teuthology/run_tasks.py", line 45, in run_tasks
    suppress = manager.__exit__(*exc_info)
  File "/usr/lib/python2.6/contextlib.py", line 23, in __exit__
    self.gen.next()
  File "/var/lib/teuthworker/teuthology/teuthology/task/thrashosds.py", line 85, in task
    manager.wait_till_clean(config.get('timeout', 360))
  File "/var/lib/teuthworker/teuthology/teuthology/task/ceph_manager.py", line 238, in wait_till_clean
    'failed to become clean before timeout expired'
AssertionError: failed to become clean before timeout expired
2012-01-24T02:34:17.297 INFO:teuthology.task.ceph:Shutting down mds daemons...
2012-01-24T02:34:17.329 INFO:teuthology.task.ceph.mds.a.err:*** Caught signal (Terminated) **
2012-01-24T02:34:17.329 INFO:teuthology.task.ceph.mds.a.err: in thread 7f190ebbd780. Shutting down.
2012-01-24T02:34:19.038 INFO:teuthology.task.ceph.mds.a:mds.a: Stopped
2012-01-24T02:34:19.038 INFO:teuthology.task.ceph:Shutting down osd daemons...
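
For reference, the repeated 'ceph -s' and 'ceph pg dump --format=json' invocations above come from ceph_manager.py's wait_till_clean, the same function that raises the assertion in the traceback above. A minimal sketch of that loop, assuming a hypothetical get_num_active_clean helper and an assumed 'pg_stats' key in the pg dump JSON (this is not the actual teuthology code):

    import json
    import subprocess
    import time

    def get_num_active_clean():
        # Hypothetical helper: count PGs that are exactly active+clean
        # in 'ceph pg dump --format=json' ('pg_stats' key assumed).
        out = subprocess.check_output(['ceph', 'pg', 'dump', '--format=json'])
        return sum(1 for pg in json.loads(out)['pg_stats']
                   if pg['state'] == 'active+clean')

    def wait_till_clean(num_pgs, timeout=360):
        # Fixed wall-clock deadline: fails even if recovery is still
        # making progress, which is the failure mode in this ticket.
        start = time.time()
        while get_num_active_clean() < num_pgs:
            assert time.time() - start < timeout, \
                'failed to become clean before timeout expired'
            time.sleep(3)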

In both cases (modulo sha1):

  kernel:
    sha1: 44f6e2f27c69080d03d66e1c82b0760e3053af24
  nuke-on-error: true
  overrides:
    ceph:
      btrfs: 1
      coverage: true
      log-whitelist:
      - clocks not synchronized
      sha1: 1e4210937fa3951881d39425a0469dcadaae61bf
  roles:
  - - mon.0
    - mds.a
    - osd.0
    - osd.1
    - osd.2
  - - mon.1
    - mon.2
    - client.0
    - osd.3
    - osd.4
    - osd.5
  tasks:
  - chef: null
  - ceph:
      log-whitelist:
      - wrongly marked me down or wrong addr
  - thrashosds: null
  - radosbench:
      clients:
      - client.0
      time: 1800

#1 Updated by Sage Weil over 12 years ago

  • Priority changed from Normal to High

#2 Updated by Sage Weil over 12 years ago

  • Priority changed from High to Urgent

#3 Updated by Samuel Just over 12 years ago

  • Status changed from New to Closed

In the case of 8880 at least, it looked like the OSDs were still making progress. I've changed wait_till_clean to time out based on the time since the last change in the number of active+clean PGs.
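
A minimal sketch of that change, reusing the hypothetical get_num_active_clean helper from the sketch in the description (this shows the shape of the fix, not the actual commit): the deadline is measured from the last observed change in the active+clean count, so the assertion only fires when recovery stalls.

    import time

    def wait_till_clean(num_pgs, timeout=360):
        num_clean = get_num_active_clean()  # hypothetical helper, as above
        start = time.time()
        while num_clean < num_pgs:
            cur = get_num_active_clean()
            if cur != num_clean:
                start = time.time()  # progress was made; restart the stall timer
                num_clean = cur
            assert time.time() - start < timeout, \
                'failed to become clean before timeout expired'
            time.sleep(3)

With this change, a slow but steadily recovering cluster no longer trips the 360-second default; only a genuine stall does.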
