Bug #1976
osd: timeout getting clean
Status:
Closed
Priority:
Urgent
Assignee:
-
Category:
OSD
Target version:
-
% Done:
0%
Regression:
No
Severity:
3 - minor
Description
Actually, this may be a monitor thing. Seen it twice now:
/var/lib/teuthworker/archive/nightly_coverage_2012-01-24-a/8880
/var/lib/teuthworker/archive/nightly_coverage_2012-01-23-b/8773
2012-01-24T02:33:35.530 INFO:teuthology.task.ceph.mon.2.err:2012-01-24 02:12:05.318944 7f2b9f060700 mon.2@2(peon).pg v845 PGMonitor::preprocess_pg_stats: no monitor session!
2012-01-24T02:33:35.530 INFO:teuthology.task.ceph.mon.1.err:2012-01-24 02:12:05.319365 7f1c65650700 mon.1@0(leader).pg v845 PGMonitor::preprocess_pg_stats: no monitor session!
2012-01-24T02:33:47.031 INFO:teuthology.task.ceph.mon.1.err:2012-01-24 02:12:16.818886 7f1c65650700 mon.1@0(leader).pg v846 PGMonitor::preprocess_pg_stats: no monitor session!
2012-01-24T02:33:47.031 INFO:teuthology.task.ceph.mon.1.err:2012-01-24 02:12:16.818944 7f1c65650700 mon.1@0(leader).pg v846 PGMonitor::preprocess_pg_stats: no monitor session!
2012-01-24T02:33:47.131 INFO:teuthology.task.ceph.mon.2.err:2012-01-24 02:12:16.919551 7f2b9f060700 mon.2@2(peon).pg v846 PGMonitor::preprocess_pg_stats: no monitor session!
2012-01-24T02:33:47.136 INFO:teuthology.task.ceph.mon.2.err:2012-01-24 02:12:16.923964 7f2b9f060700 mon.2@2(peon).pg v846 PGMonitor::preprocess_pg_stats: no monitor session!
2012-01-24T02:33:47.136 INFO:teuthology.task.ceph.mon.2.err:2012-01-24 02:12:16.924135 7f2b9f060700 mon.2@2(peon).pg v846 PGMonitor::preprocess_pg_stats: no monitor session!
2012-01-24T02:33:48.363 DEBUG:teuthology.orchestra.run:Running: 'LD_LIBRARY_PRELOAD=/tmp/cephtest/binary/usr/local/lib /tmp/cephtest/enable-coredump /tmp/cephtest/binary/usr/local/bin/ceph-coverage /tmp/cephtest/archive/coverage /tmp/cephtest/binary/usr/local/bin/ceph -k /tmp/cephtest/ceph.keyring -c /tmp/cephtest/ceph.conf --concise -s'
2012-01-24T02:33:54.824 INFO:teuthology.task.thrashosds.ceph_manager:2012-01-24 02:12:22.926816 mds e5: 1/1/1 up {0=a=up:active}
2012-01-24 02:12:22.927055 osd e123: 6 osds: 6 up, 6 in
2012-01-24 02:12:22.927175 mon e1: 3 mons at {0=10.3.14.183:6789/0,1=10.3.14.182:6789/0,2=10.3.14.182:6790/0}
2012-01-24 02:12:23.959943 pg v847: 108 pgs: 16 active, 91 active+clean, 1 active+clean+scrubbing; 10660 MB data, 212 GB used, 823 GB / 1055 GB avail; 2091/5374 degraded (38.910%)
2012-01-24 02:12:24.601922 log 2012-01-24 02:12:19.876793 mon.0 10.3.14.182:6789/0 86 : [INF] mon.1@0 won leader election with quorum 0,1,2
2012-01-24T02:33:57.831 DEBUG:teuthology.orchestra.run:Running: 'LD_LIBRARY_PRELOAD=/tmp/cephtest/binary/usr/local/lib /tmp/cephtest/enable-coredump /tmp/cephtest/binary/usr/local/bin/ceph-coverage /tmp/cephtest/archive/coverage /tmp/cephtest/binary/usr/local/bin/ceph -k /tmp/cephtest/ceph.keyring -c /tmp/cephtest/ceph.conf --concise -- pg dump --format=json'
2012-01-24T02:33:58.693 DEBUG:teuthology.orchestra.run:Running: 'LD_LIBRARY_PRELOAD=/tmp/cephtest/binary/usr/local/lib /tmp/cephtest/enable-coredump /tmp/cephtest/binary/usr/local/bin/ceph-coverage /tmp/cephtest/archive/coverage /tmp/cephtest/binary/usr/local/bin/ceph -k /tmp/cephtest/ceph.keyring -c /tmp/cephtest/ceph.conf --concise -s'
2012-01-24T02:34:04.321 INFO:teuthology.task.ceph.mon.1.err:2012-01-24 02:12:34.107506 7f1c65650700 mon.1@0(leader).pg v850 PGMonitor::preprocess_pg_stats: no monitor session!
2012-01-24T02:34:04.345 INFO:teuthology.task.ceph.mon.2.err:2012-01-24 02:12:34.131715 7f2b9f060700 mon.2@2(peon).pg v850 PGMonitor::preprocess_pg_stats: no monitor session!
2012-01-24T02:34:04.346 INFO:teuthology.task.ceph.mon.2.err:2012-01-24 02:12:34.131760 7f2b9f060700 mon.2@2(peon).pg v850 PGMonitor::preprocess_pg_stats: no monitor session!
2012-01-24T02:34:04.476 INFO:teuthology.task.thrashosds.ceph_manager:2012-01-24 02:12:31.589185 mds e5: 1/1/1 up {0=a=up:active}
2012-01-24 02:12:31.589529 osd e123: 6 osds: 6 up, 6 in
2012-01-24 02:12:31.589656 log 2012-01-24 02:12:12.161518 mon.1 10.3.14.183:6789/0 23 : [INF] mon.0 calling new monitor election
2012-01-24 02:12:31.589739 mon e1: 3 mons at {0=10.3.14.183:6789/0,1=10.3.14.182:6789/0,2=10.3.14.182:6790/0}
2012-01-24 02:12:34.107018 pg v850: 108 pgs: 16 active, 91 active+clean, 1 active+clean+scrubbing; 10660 MB data, 212 GB used, 822 GB / 1055 GB avail; 2062/5374 degraded (38.370%)
2012-01-24T02:34:07.360 INFO:teuthology.task.ceph.mon.1.err:2012-01-24 02:12:37.145958 7f1c65650700 mon.1@0(leader).pg v851 PGMonitor::preprocess_pg_stats: no monitor session!
2012-01-24T02:34:07.481 DEBUG:teuthology.orchestra.run:Running: 'LD_LIBRARY_PRELOAD=/tmp/cephtest/binary/usr/local/lib /tmp/cephtest/enable-coredump /tmp/cephtest/binary/usr/local/bin/ceph-coverage /tmp/cephtest/archive/coverage /tmp/cephtest/binary/usr/local/bin/ceph -k /tmp/cephtest/ceph.keyring -c /tmp/cephtest/ceph.conf --concise -- pg dump --format=json'
2012-01-24T02:34:07.985 DEBUG:teuthology.orchestra.run:Running: 'LD_LIBRARY_PRELOAD=/tmp/cephtest/binary/usr/local/lib /tmp/cephtest/enable-coredump /tmp/cephtest/binary/usr/local/bin/ceph-coverage /tmp/cephtest/archive/coverage /tmp/cephtest/binary/usr/local/bin/ceph -k /tmp/cephtest/ceph.keyring -c /tmp/cephtest/ceph.conf --concise -s'
2012-01-24T02:34:16.529 INFO:teuthology.task.ceph.mon.1.err:2012-01-24 02:12:46.313484 7f1c65650700 mon.1@0(leader).pg v853 PGMonitor::preprocess_pg_stats: no monitor session!
2012-01-24T02:34:16.529 INFO:teuthology.task.ceph.mon.1.err:2012-01-24 02:12:46.313583 7f1c65650700 mon.1@0(leader).pg v853 PGMonitor::preprocess_pg_stats: no monitor session!
2012-01-24T02:34:16.529 INFO:teuthology.task.ceph.mon.2.err:2012-01-24 02:12:46.313380 7f2b9f060700 mon.2@2(peon).pg v853 PGMonitor::preprocess_pg_stats: no monitor session!
2012-01-24T02:34:17.294 INFO:teuthology.task.thrashosds.ceph_manager:2012-01-24 02:12:40.847219 mds e5: 1/1/1 up {0=a=up:active}
2012-01-24 02:12:40.847634 osd e123: 6 osds: 6 up, 6 in
2012-01-24 02:12:40.847752 log 2012-01-24 02:12:12.161518 mon.1 10.3.14.183:6789/0 23 : [INF] mon.0 calling new monitor election
2012-01-24 02:12:40.847944 mon e1: 3 mons at {0=10.3.14.183:6789/0,1=10.3.14.182:6789/0,2=10.3.14.182:6790/0}
2012-01-24 02:12:46.313095 pg v853: 108 pgs: 16 active, 91 active+clean, 1 active+clean+scrubbing; 10660 MB data, 212 GB used, 822 GB / 1055 GB avail; 2033/5374 degraded (37.830%)
2012-01-24T02:34:17.295 ERROR:teuthology.run_tasks:Manager failed: <contextlib.GeneratorContextManager object at 0x2c30c10>
Traceback (most recent call last):
  File "/var/lib/teuthworker/teuthology/teuthology/run_tasks.py", line 45, in run_tasks
    suppress = manager.__exit__(*exc_info)
  File "/usr/lib/python2.6/contextlib.py", line 23, in __exit__
    self.gen.next()
  File "/var/lib/teuthworker/teuthology/teuthology/task/thrashosds.py", line 85, in task
    manager.wait_till_clean(config.get('timeout', 360))
  File "/var/lib/teuthworker/teuthology/teuthology/task/ceph_manager.py", line 238, in wait_till_clean
    'failed to become clean before timeout expired'
AssertionError: failed to become clean before timeout expired
2012-01-24T02:34:17.296 DEBUG:teuthology.run_tasks:Unwinding manager <contextlib.GeneratorContextManager object at 0x2c113d0>
2012-01-24T02:34:17.296 ERROR:teuthology.contextutil:Saw exception from nested tasks
Traceback (most recent call last):
  File "/var/lib/teuthworker/teuthology/teuthology/contextutil.py", line 27, in nested
    yield vars
  File "/var/lib/teuthworker/teuthology/teuthology/task/ceph.py", line 972, in task
    yield
  File "/var/lib/teuthworker/teuthology/teuthology/run_tasks.py", line 45, in run_tasks
    suppress = manager.__exit__(*exc_info)
  File "/usr/lib/python2.6/contextlib.py", line 23, in __exit__
    self.gen.next()
  File "/var/lib/teuthworker/teuthology/teuthology/task/thrashosds.py", line 85, in task
    manager.wait_till_clean(config.get('timeout', 360))
  File "/var/lib/teuthworker/teuthology/teuthology/task/ceph_manager.py", line 238, in wait_till_clean
    'failed to become clean before timeout expired'
AssertionError: failed to become clean before timeout expired
2012-01-24T02:34:17.297 INFO:teuthology.task.ceph:Shutting down mds daemons...
2012-01-24T02:34:17.329 INFO:teuthology.task.ceph.mds.a.err:*** Caught signal (Terminated) **
2012-01-24T02:34:17.329 INFO:teuthology.task.ceph.mds.a.err: in thread 7f190ebbd780. Shutting down.
2012-01-24T02:34:19.038 INFO:teuthology.task.ceph.mds.a:mds.a: Stopped
2012-01-24T02:34:19.038 INFO:teuthology.task.ceph:Shutting down osd daemons...
In both cases (modulo sha1):
kernel:
  sha1: 44f6e2f27c69080d03d66e1c82b0760e3053af24
nuke-on-error: true
overrides:
  ceph:
    btrfs: 1
    coverage: true
    log-whitelist:
    - clocks not synchronized
    sha1: 1e4210937fa3951881d39425a0469dcadaae61bf
roles:
- - mon.0
  - mds.a
  - osd.0
  - osd.1
  - osd.2
- - mon.1
  - mon.2
  - client.0
  - osd.3
  - osd.4
  - osd.5
tasks:
- chef: null
- ceph:
    log-whitelist:
    - wrongly marked me down or wrong addr
- thrashosds: null
- radosbench:
    clients:
    - client.0
    time: 1800
History
#1 Updated by Sage Weil almost 12 years ago
- Priority changed from Normal to High
#2 Updated by Sage Weil almost 12 years ago
- Priority changed from High to Urgent
#3 Updated by Samuel Just almost 12 years ago
- Status changed from New to Closed
In the case of 8880 at least, it looked like the osds were still making progress. I've changed wait_till_clean to time out based on the time since the last change in the active+clean count.
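The reworked timeout logic can be sketched roughly as follows. This is an illustration of the idea, not the actual teuthology code: the `get_num_active_clean` callable and the injectable `clock`/`sleep` parameters are hypothetical stand-ins for what ceph_manager.py does internally by polling `pg dump`.

```python
import time

def wait_till_clean(get_num_active_clean, num_pgs, timeout=360,
                    poll_interval=3, clock=time.time, sleep=time.sleep):
    """Wait until all PGs are active+clean.

    Instead of one fixed overall deadline, the timeout is measured from
    the last time the active+clean count changed, so a recovery that is
    slow but still making progress is not failed prematurely.
    """
    num_active_clean = get_num_active_clean()
    last_change = clock()
    while num_active_clean < num_pgs:
        sleep(poll_interval)
        cur = get_num_active_clean()
        if cur != num_active_clean:
            # Any change in the count (either direction) resets the timer.
            num_active_clean = cur
            last_change = clock()
        assert clock() - last_change < timeout, \
            'failed to become clean before timeout expired'
```

With this shape, a run like 8880, where PGs keep trickling into active+clean past the old 360-second wall-clock limit, only fails once the count has been flat for the full timeout.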