Bug #1976 (Closed): osd: timeout getting clean

Added by Sage Weil over 12 years ago. Updated over 12 years ago.

Status: Closed
Priority: Urgent
Assignee: -
Category: OSD
Target version: -
% Done: 0%


Description

Actually, this may be a monitor thing. Seen it twice now:

/var/lib/teuthworker/archive/nightly_coverage_2012-01-24-a/8880
/var/lib/teuthworker/archive/nightly_coverage_2012-01-23-b/8773

2012-01-24T02:33:35.530 INFO:teuthology.task.ceph.mon.2.err:2012-01-24 02:12:05.318944 7f2b9f060700 mon.2@2(peon).pg v845 PGMonitor::preprocess_pg_stats: no monitor session!
2012-01-24T02:33:35.530 INFO:teuthology.task.ceph.mon.1.err:2012-01-24 02:12:05.319365 7f1c65650700 mon.1@0(leader).pg v845 PGMonitor::preprocess_pg_stats: no monitor session!
2012-01-24T02:33:47.031 INFO:teuthology.task.ceph.mon.1.err:2012-01-24 02:12:16.818886 7f1c65650700 mon.1@0(leader).pg v846 PGMonitor::preprocess_pg_stats: no monitor session!
2012-01-24T02:33:47.031 INFO:teuthology.task.ceph.mon.1.err:2012-01-24 02:12:16.818944 7f1c65650700 mon.1@0(leader).pg v846 PGMonitor::preprocess_pg_stats: no monitor session!
2012-01-24T02:33:47.131 INFO:teuthology.task.ceph.mon.2.err:2012-01-24 02:12:16.919551 7f2b9f060700 mon.2@2(peon).pg v846 PGMonitor::preprocess_pg_stats: no monitor session!
2012-01-24T02:33:47.136 INFO:teuthology.task.ceph.mon.2.err:2012-01-24 02:12:16.923964 7f2b9f060700 mon.2@2(peon).pg v846 PGMonitor::preprocess_pg_stats: no monitor session!
2012-01-24T02:33:47.136 INFO:teuthology.task.ceph.mon.2.err:2012-01-24 02:12:16.924135 7f2b9f060700 mon.2@2(peon).pg v846 PGMonitor::preprocess_pg_stats: no monitor session!
2012-01-24T02:33:48.363 DEBUG:teuthology.orchestra.run:Running: 'LD_LIBRARY_PRELOAD=/tmp/cephtest/binary/usr/local/lib /tmp/cephtest/enable-coredump /tmp/cephtest/binary/usr/local/bin/ceph-coverage /tmp/cephtest/archive/coverage /tmp/cephtest/binary/usr/local/bin/ceph -k /tmp/cephtest/ceph.keyring -c /tmp/cephtest/ceph.conf --concise -s'
2012-01-24T02:33:54.824 INFO:teuthology.task.thrashosds.ceph_manager:2012-01-24 02:12:22.926816   mds e5: 1/1/1 up {0=a=up:active}
2012-01-24 02:12:22.927055   osd e123: 6 osds: 6 up, 6 in
2012-01-24 02:12:22.927175   mon e1: 3 mons at {0=10.3.14.183:6789/0,1=10.3.14.182:6789/0,2=10.3.14.182:6790/0}
2012-01-24 02:12:23.959943    pg v847: 108 pgs: 16 active, 91 active+clean, 1 active+clean+scrubbing; 10660 MB data, 212 GB used, 823 GB / 1055 GB avail; 2091/5374 degraded (38.910%)
2012-01-24 02:12:24.601922   log 2012-01-24 02:12:19.876793 mon.0 10.3.14.182:6789/0 86 : [INF] mon.1@0 won leader election with quorum 0,1,2

2012-01-24T02:33:57.831 DEBUG:teuthology.orchestra.run:Running: 'LD_LIBRARY_PRELOAD=/tmp/cephtest/binary/usr/local/lib /tmp/cephtest/enable-coredump /tmp/cephtest/binary/usr/local/bin/ceph-coverage /tmp/cephtest/archive/coverage /tmp/cephtest/binary/usr/local/bin/ceph -k /tmp/cephtest/ceph.keyring -c /tmp/cephtest/ceph.conf --concise -- pg dump --format=json'
2012-01-24T02:33:58.693 DEBUG:teuthology.orchestra.run:Running: 'LD_LIBRARY_PRELOAD=/tmp/cephtest/binary/usr/local/lib /tmp/cephtest/enable-coredump /tmp/cephtest/binary/usr/local/bin/ceph-coverage /tmp/cephtest/archive/coverage /tmp/cephtest/binary/usr/local/bin/ceph -k /tmp/cephtest/ceph.keyring -c /tmp/cephtest/ceph.conf --concise -s'
2012-01-24T02:34:04.321 INFO:teuthology.task.ceph.mon.1.err:2012-01-24 02:12:34.107506 7f1c65650700 mon.1@0(leader).pg v850 PGMonitor::preprocess_pg_stats: no monitor session!
2012-01-24T02:34:04.345 INFO:teuthology.task.ceph.mon.2.err:2012-01-24 02:12:34.131715 7f2b9f060700 mon.2@2(peon).pg v850 PGMonitor::preprocess_pg_stats: no monitor session!
2012-01-24T02:34:04.346 INFO:teuthology.task.ceph.mon.2.err:2012-01-24 02:12:34.131760 7f2b9f060700 mon.2@2(peon).pg v850 PGMonitor::preprocess_pg_stats: no monitor session!
2012-01-24T02:34:04.476 INFO:teuthology.task.thrashosds.ceph_manager:2012-01-24 02:12:31.589185   mds e5: 1/1/1 up {0=a=up:active}
2012-01-24 02:12:31.589529   osd e123: 6 osds: 6 up, 6 in
2012-01-24 02:12:31.589656   log 2012-01-24 02:12:12.161518 mon.1 10.3.14.183:6789/0 23 : [INF] mon.0 calling new monitor election
2012-01-24 02:12:31.589739   mon e1: 3 mons at {0=10.3.14.183:6789/0,1=10.3.14.182:6789/0,2=10.3.14.182:6790/0}
2012-01-24 02:12:34.107018    pg v850: 108 pgs: 16 active, 91 active+clean, 1 active+clean+scrubbing; 10660 MB data, 212 GB used, 822 GB / 1055 GB avail; 2062/5374 degraded (38.370%)

2012-01-24T02:34:07.360 INFO:teuthology.task.ceph.mon.1.err:2012-01-24 02:12:37.145958 7f1c65650700 mon.1@0(leader).pg v851 PGMonitor::preprocess_pg_stats: no monitor session!
2012-01-24T02:34:07.481 DEBUG:teuthology.orchestra.run:Running: 'LD_LIBRARY_PRELOAD=/tmp/cephtest/binary/usr/local/lib /tmp/cephtest/enable-coredump /tmp/cephtest/binary/usr/local/bin/ceph-coverage /tmp/cephtest/archive/coverage /tmp/cephtest/binary/usr/local/bin/ceph -k /tmp/cephtest/ceph.keyring -c /tmp/cephtest/ceph.conf --concise -- pg dump --format=json'
2012-01-24T02:34:07.985 DEBUG:teuthology.orchestra.run:Running: 'LD_LIBRARY_PRELOAD=/tmp/cephtest/binary/usr/local/lib /tmp/cephtest/enable-coredump /tmp/cephtest/binary/usr/local/bin/ceph-coverage /tmp/cephtest/archive/coverage /tmp/cephtest/binary/usr/local/bin/ceph -k /tmp/cephtest/ceph.keyring -c /tmp/cephtest/ceph.conf --concise -s'
2012-01-24T02:34:16.529 INFO:teuthology.task.ceph.mon.1.err:2012-01-24 02:12:46.313484 7f1c65650700 mon.1@0(leader).pg v853 PGMonitor::preprocess_pg_stats: no monitor session!
2012-01-24T02:34:16.529 INFO:teuthology.task.ceph.mon.1.err:2012-01-24 02:12:46.313583 7f1c65650700 mon.1@0(leader).pg v853 PGMonitor::preprocess_pg_stats: no monitor session!
2012-01-24T02:34:16.529 INFO:teuthology.task.ceph.mon.2.err:2012-01-24 02:12:46.313380 7f2b9f060700 mon.2@2(peon).pg v853 PGMonitor::preprocess_pg_stats: no monitor session!
2012-01-24T02:34:17.294 INFO:teuthology.task.thrashosds.ceph_manager:2012-01-24 02:12:40.847219   mds e5: 1/1/1 up {0=a=up:active}
2012-01-24 02:12:40.847634   osd e123: 6 osds: 6 up, 6 in
2012-01-24 02:12:40.847752   log 2012-01-24 02:12:12.161518 mon.1 10.3.14.183:6789/0 23 : [INF] mon.0 calling new monitor election
2012-01-24 02:12:40.847944   mon e1: 3 mons at {0=10.3.14.183:6789/0,1=10.3.14.182:6789/0,2=10.3.14.182:6790/0}
2012-01-24 02:12:46.313095    pg v853: 108 pgs: 16 active, 91 active+clean, 1 active+clean+scrubbing; 10660 MB data, 212 GB used, 822 GB / 1055 GB avail; 2033/5374 degraded (37.830%)

2012-01-24T02:34:17.295 ERROR:teuthology.run_tasks:Manager failed: <contextlib.GeneratorContextManager object at 0x2c30c10>
Traceback (most recent call last):
  File "/var/lib/teuthworker/teuthology/teuthology/run_tasks.py", line 45, in run_tasks
    suppress = manager.__exit__(*exc_info)
  File "/usr/lib/python2.6/contextlib.py", line 23, in __exit__
    self.gen.next()
  File "/var/lib/teuthworker/teuthology/teuthology/task/thrashosds.py", line 85, in task
    manager.wait_till_clean(config.get('timeout', 360))
  File "/var/lib/teuthworker/teuthology/teuthology/task/ceph_manager.py", line 238, in wait_till_clean
    'failed to become clean before timeout expired'
AssertionError: failed to become clean before timeout expired
2012-01-24T02:34:17.296 DEBUG:teuthology.run_tasks:Unwinding manager <contextlib.GeneratorContextManager object at 0x2c113d0>
2012-01-24T02:34:17.296 ERROR:teuthology.contextutil:Saw exception from nested tasks
Traceback (most recent call last):
  File "/var/lib/teuthworker/teuthology/teuthology/contextutil.py", line 27, in nested
    yield vars
  File "/var/lib/teuthworker/teuthology/teuthology/task/ceph.py", line 972, in task
    yield
  File "/var/lib/teuthworker/teuthology/teuthology/run_tasks.py", line 45, in run_tasks
    suppress = manager.__exit__(*exc_info)
  File "/usr/lib/python2.6/contextlib.py", line 23, in __exit__
    self.gen.next()
  File "/var/lib/teuthworker/teuthology/teuthology/task/thrashosds.py", line 85, in task
    manager.wait_till_clean(config.get('timeout', 360))
  File "/var/lib/teuthworker/teuthology/teuthology/task/ceph_manager.py", line 238, in wait_till_clean
    'failed to become clean before timeout expired'
AssertionError: failed to become clean before timeout expired
2012-01-24T02:34:17.297 INFO:teuthology.task.ceph:Shutting down mds daemons...
2012-01-24T02:34:17.329 INFO:teuthology.task.ceph.mds.a.err:*** Caught signal (Terminated) **
2012-01-24T02:34:17.329 INFO:teuthology.task.ceph.mds.a.err: in thread 7f190ebbd780. Shutting down.
2012-01-24T02:34:19.038 INFO:teuthology.task.ceph.mds.a:mds.a: Stopped
2012-01-24T02:34:19.038 INFO:teuthology.task.ceph:Shutting down osd daemons...
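
For reference, the repeated 'ceph -s' and 'ceph pg dump --format=json' invocations above come from ceph_manager.py's wait_till_clean, the same function that raises the assertion in the traceback above. A minimal sketch of that loop, assuming a hypothetical get_num_active_clean helper and an assumed 'pg_stats' key in the pg dump JSON (this is not the actual teuthology code):

    import json
    import subprocess
    import time

    def get_num_active_clean():
        # Hypothetical helper: count PGs that are exactly active+clean
        # in 'ceph pg dump --format=json' ('pg_stats' key assumed).
        out = subprocess.check_output(['ceph', 'pg', 'dump', '--format=json'])
        return sum(1 for pg in json.loads(out)['pg_stats']
                   if pg['state'] == 'active+clean')

    def wait_till_clean(num_pgs, timeout=360):
        # Fixed wall-clock deadline: fails even if recovery is still
        # making progress, which is the failure mode in this ticket.
        start = time.time()
        while get_num_active_clean() < num_pgs:
            assert time.time() - start < timeout, \
                'failed to become clean before timeout expired'
            time.sleep(3)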

In both cases (modulo sha1):

  kernel:
    sha1: 44f6e2f27c69080d03d66e1c82b0760e3053af24
  nuke-on-error: true
  overrides:
    ceph:
      btrfs: 1
      coverage: true
      log-whitelist:
      - clocks not synchronized
      sha1: 1e4210937fa3951881d39425a0469dcadaae61bf
  roles:
  - - mon.0
    - mds.a
    - osd.0
    - osd.1
    - osd.2
  - - mon.1
    - mon.2
    - client.0
    - osd.3
    - osd.4
    - osd.5
  tasks:
  - chef: null
  - ceph:
      log-whitelist:
      - wrongly marked me down or wrong addr
  - thrashosds: null
  - radosbench:
      clients:
      - client.0
      time: 1800

#1 Updated by Sage Weil over 12 years ago

  • Priority changed from Normal to High

#2 Updated by Sage Weil over 12 years ago

  • Priority changed from High to Urgent

#3 Updated by Samuel Just over 12 years ago

  • Status changed from New to Closed

In the case of 8880 at least, it looked like the OSDs were still making progress. I've changed wait_till_clean to time out based on the time since the last change in the number of active+clean PGs.
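
A minimal sketch of that change, reusing the hypothetical get_num_active_clean helper from the sketch in the description (this shows the shape of the fix, not the actual commit): the deadline is measured from the last observed change in the active+clean count, so the assertion only fires when recovery stalls.

    import time

    def wait_till_clean(num_pgs, timeout=360):
        num_clean = get_num_active_clean()  # hypothetical helper, as above
        start = time.time()
        while num_clean < num_pgs:
            cur = get_num_active_clean()
            if cur != num_clean:
                start = time.time()  # progress was made; restart the stall timer
                num_clean = cur
            assert time.time() - start < timeout, \
                'failed to become clean before timeout expired'
            time.sleep(3)

With this change, a slow but steadily recovering cluster no longer trips the 360-second default; only a genuine stall does.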
