Bug #14684: test_scrub_checks fails - CephFS - Ceph

Actions

Copy link

Bug #14684

closed

test_scrub_checks fails

Added by Zheng Yan about 8 years ago. Updated about 8 years ago.

Status:

Resolved

Priority:

Urgent

Assignee:

John Spray

Category:

Target version:

% Done:

Source:

other

Tags:

Backport:

Regression:

Severity:

3 - minor

Reviewed:

Affected Versions:

ceph-qa-suite:

Component(FS):

Labels (FS):

Pull request ID:

Crash signature (v1):

Crash signature (v2):

Description

http://qa-proxy.ceph.com/teuthology/teuthology-2016-02-03_14:03:10-fs-jewel---basic-smithi/5365/teuthology.log

2016-02-03T22:32:28.995 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2016-02-03T22:32:28.995 INFO:tasks.cephfs_test_runner:  File "/var/lib/teuthworker/src/ceph-qa-suite_jewel/tasks/cephfs/test_scrub_checks.py", line 39, in test_scrub_checks
2016-02-03T22:32:28.995 INFO:tasks.cephfs_test_runner:    self._checks(1)
2016-02-03T22:32:28.995 INFO:tasks.cephfs_test_runner:  File "/var/lib/teuthworker/src/ceph-qa-suite_jewel/tasks/cephfs/test_scrub_checks.py", line 123, in _checks
2016-02-03T22:32:28.995 INFO:tasks.cephfs_test_runner:    self.asok_command(mds_rank, command, success_validator)
2016-02-03T22:32:28.995 INFO:tasks.cephfs_test_runner:  File "/var/lib/teuthworker/src/ceph-qa-suite_jewel/tasks/cephfs/test_scrub_checks.py", line 149, in asok_command
2016-02-03T22:32:28.996 INFO:tasks.cephfs_test_runner:    mds_id = self.fs.get_active_names()[mds_rank]
2016-02-03T22:32:28.996 INFO:tasks.cephfs_test_runner:IndexError: list index out of range

Actions

Copy link

Updated by Greg Farnum about 8 years ago

Priority changed from Normal to High

http://pulpito.ceph.com/teuthology-2016-02-10_14:03:02-fs-jewel---basic-smithi/4160/

2016-02-10T21:04:11.934 INFO:tasks.cephfs_test_runner:======================================================================
2016-02-10T21:04:11.934 INFO:tasks.cephfs_test_runner:ERROR: test_scrub_checks (tasks.cephfs.test_scrub_checks.TestScrubChecks)
2016-02-10T21:04:11.935 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2016-02-10T21:04:11.935 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2016-02-10T21:04:11.935 INFO:tasks.cephfs_test_runner:  File "/var/lib/teuthworker/src/ceph-qa-suite_jewel/tasks/cephfs/test_scrub_checks.py", line 39, in test_scrub_checks
2016-02-10T21:04:11.935 INFO:tasks.cephfs_test_runner:    self._checks(1)
2016-02-10T21:04:11.935 INFO:tasks.cephfs_test_runner:  File "/var/lib/teuthworker/src/ceph-qa-suite_jewel/tasks/cephfs/test_scrub_checks.py", line 133, in _checks
2016-02-10T21:04:11.935 INFO:tasks.cephfs_test_runner:    lambda j, r: json_validator(j, r, "return_code", -errno.ENODATA))
2016-02-10T21:04:11.936 INFO:tasks.cephfs_test_runner:  File "/var/lib/teuthworker/src/ceph-qa-suite_jewel/tasks/cephfs/test_scrub_checks.py", line 156, in asok_command
2016-02-10T21:04:11.936 INFO:tasks.cephfs_test_runner:    jout = json.loads(sout)
2016-02-10T21:04:11.936 INFO:tasks.cephfs_test_runner:  File "/usr/lib/python2.7/json/__init__.py", line 338, in loads
2016-02-10T21:04:11.936 INFO:tasks.cephfs_test_runner:    return _default_decoder.decode(s)
2016-02-10T21:04:11.936 INFO:tasks.cephfs_test_runner:  File "/usr/lib/python2.7/json/decoder.py", line 369, in decode
2016-02-10T21:04:11.936 INFO:tasks.cephfs_test_runner:    raise ValueError(errmsg("Extra data", s, end, len(s)))
2016-02-10T21:04:11.936 INFO:tasks.cephfs_test_runner:ValueError: Extra data: line 2 column 1 - line 3 column 1 (char 17 - 24)

Actions

Copy link

Updated by Greg Farnum about 8 years ago

http://pulpito.ceph.com/teuthology-2016-02-12_14:03:01-fs-jewel---basic-smithi/8245/

2016-02-12T23:59:13.690 INFO:tasks.cephfs_test_runner:ValueError: Extra data: line 2 column 1 - line 3 column 1 (char 17 - 24)

We haven't seen it on master branch yet, strangely?

Actions

Copy link

Updated by Greg Farnum about 8 years ago

http://pulpito.ceph.com/teuthology-2016-02-05_14:03:02-fs-jewel---basic-smithi/8806/

Same as the original case

Actions

Copy link

Updated by Greg Farnum about 8 years ago

gregf-2016-02-15_13:33:12-fs-greg-fs-testing-215-sure---basic-smithi/10011/

original; and here we have it on master(+)

Actions

Copy link

Updated by Greg Farnum about 8 years ago

Priority changed from High to Urgent

And again: http://pulpito.ceph.com/gregf-2016-02-15_18:08:49-fs-greg-fs-testing-215-1---basic-smithi/10771/

It doesn't seem to be consistent, although all these examples are on smithi, so maybe we've got a lovely timing bug?

Actions

Copy link

Updated by Zheng Yan about 8 years ago

http://pulpito.ceph.com/teuthology-2016-02-19_14:03:01-fs-jewel---basic-smithi/17838/

2016-02-19T19:49:01.019 INFO:tasks.cephfs_test_runner:======================================================================
2016-02-19T19:49:01.019 INFO:tasks.cephfs_test_runner:ERROR: test_scrub_checks (tasks.cephfs.test_scrub_checks.TestScrubChecks)
2016-02-19T19:49:01.019 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2016-02-19T19:49:01.020 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2016-02-19T19:49:01.020 INFO:tasks.cephfs_test_runner:  File "/var/lib/teuthworker/src/ceph-qa-suite_jewel/tasks/cephfs/test_scrub_checks.py", line 39, in test_scrub_checks
2016-02-19T19:49:01.020 INFO:tasks.cephfs_test_runner:    self._checks(1)
2016-02-19T19:49:01.020 INFO:tasks.cephfs_test_runner:  File "/var/lib/teuthworker/src/ceph-qa-suite_jewel/tasks/cephfs/test_scrub_checks.py", line 123, in _checks
2016-02-19T19:49:01.020 INFO:tasks.cephfs_test_runner:    self.asok_command(mds_rank, command, success_validator)
2016-02-19T19:49:01.021 INFO:tasks.cephfs_test_runner:  File "/var/lib/teuthworker/src/ceph-qa-suite_jewel/tasks/cephfs/test_scrub_checks.py", line 149, in asok_command
2016-02-19T19:49:01.021 INFO:tasks.cephfs_test_runner:    mds_id = self.fs.get_active_names()[mds_rank]
2016-02-19T19:49:01.021 INFO:tasks.cephfs_test_runner:IndexError: list index out of range

Actions

Copy link

Updated by Zheng Yan about 8 years ago

Status changed from New to Fix Under Review
Assignee set to John Spray

2016-02-20 03:49:00.515417 mon.0 172.21.15.7:6789/0 145 : cluster [INF] osdmap e16: 4 osds: 4 up, 4 in
2016-02-20 03:49:00.565680 mon.0 172.21.15.7:6789/0 146 : cluster [INF] mdsmap e19: 1/1/1 up {0=a=up:replay}
2016-02-20 03:49:00.565745 mon.0 172.21.15.7:6789/0 147 : cluster [INF] pgmap v36: 32 pgs: 32 active+clean; 12393 kB data, 593 MB used, 368 GB / 372 GB avail; 0 B/s rd, 0 B/s wr, 323 op/s
2016-02-20 03:49:00.334717 mon.0 172.21.15.7:6789/0 142 : cluster [INF] pgmap v35: 32 pgs: 32 active+clean; 12393 kB data, 593 MB used, 368 GB / 372 GB avail; 354 B/s rd, 96511 B/s wr, 400 op/s
2016-02-20 03:49:00.334819 mon.0 172.21.15.7:6789/0 143 : cluster [INF] mds.0 172.21.15.7:6808/19849 up:active
2016-02-20 03:49:00.392920 mon.0 172.21.15.7:6789/0 144 : cluster [INF] mdsmap e18: 1/1/1 up {0=a-s=up:active}, 1 up:standby
2016-02-20 03:49:00.515417 mon.0 172.21.15.7:6789/0 145 : cluster [INF] osdmap e16: 4 osds: 4 up, 4 in
2016-02-20 03:49:00.565680 mon.0 172.21.15.7:6789/0 146 : cluster [INF] mdsmap e19: 1/1/1 up {0=a=up:replay}
2016-02-20 03:49:00.565745 mon.0 172.21.15.7:6789/0 147 : cluster [INF] pgmap v36: 32 pgs: 32 active+clean; 12393 kB data, 593 MB used, 368 GB / 372 GB avail; 0 B/s rd, 0 B/s wr, 323 op/s
2016-02-20 03:49:01.599201 mon.0 172.21.15.7:6789/0 150 : cluster [INF] mds.? 172.21.15.7:6809/19849 up:boot
2016-02-20 03:49:01.599342 mon.0 172.21.15.7:6789/0 151 : cluster [INF] mds.0 172.21.15.18:6808/29561 up:reconnect
2016-02-20 03:49:01.599467 mon.0 172.21.15.7:6789/0 152 : cluster [INF] mdsmap e20: 1/1/1 up {0=a=up:reconnect}, 1 up:standby
2016-02-20 03:49:01.599201 mon.0 172.21.15.7:6789/0 150 : cluster [INF] mds.? 172.21.15.7:6809/19849 up:boot
2016-02-20 03:49:01.599342 mon.0 172.21.15.7:6789/0 151 : cluster [INF] mds.0 172.21.15.18:6808/29561 up:reconnect
2016-02-20 03:49:01.599467 mon.0 172.21.15.7:6789/0 152 : cluster [INF] mdsmap e20: 1/1/1 up {0=a=up:reconnect}, 1 up:standby

there were mds takeovers even we didn't enable mds thrash.

2016-02-20 03:48:59.809408 7fd4c8036700  1 -- 172.21.15.7:6789/0 <== mds.0 172.21.15.7:6808/19849 20 ==== mdsbeacon(4179/a-s up:active seq 12 v17) v4 ==== 355+0+0 (3921659038 0 0) 0x7fd4db429c00 con 0x7fd4db7cb080
2016-02-20 03:48:59.809480 7fd4c8036700 20 mon.b@0(leader) e1 _ms_dispatch existing session 0x7fd4dba8cc40 for mds.? 172.21.15.7:6808/19849
2016-02-20 03:48:59.809489 7fd4c8036700 20 mon.b@0(leader) e1  caps allow *
2016-02-20 03:48:59.809500 7fd4c8036700 10 mon.b@0(leader).paxosservice(mdsmap 1..17) dispatch 0x7fd4db429c00 mdsbeacon(4179/a-s up:active seq 12 v17) v4 from mds.0 172.21.15.7:6808/19849 con 0x7fd4db7cb080
2016-02-20 03:48:59.809510 7fd4c8036700  5 mon.b@0(leader).paxos(paxos active c 1..124) is_readable = 1 - now=2016-02-20 03:48:59.809510 lease_expire=2016-02-20 03:49:04.074688 has v0 lc 124
2016-02-20 03:48:59.809526 7fd4c8036700 10 mon.b@0(leader).mds e17 preprocess_query mdsbeacon(4179/a-s up:active seq 12 v17) v4 from mds.0 172.21.15.7:6808/19849
2016-02-20 03:48:59.809541 7fd4c8036700 20 is_capable service=mds command= exec on cap allow *
2016-02-20 03:48:59.809545 7fd4c8036700 20  allow so far , doing grant allow *
2016-02-20 03:48:59.809546 7fd4c8036700 20  allow all
2016-02-20 03:48:59.809548 7fd4c8036700 12 mon.b@0(leader).mds e17 preprocess_beacon mdsbeacon(4179/a-s up:active seq 12 v17) v4 from mds.0 172.21.15.7:6808/19849 compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table}
2016-02-20 03:48:59.809562 7fd4c8036700 20 mon.b@0(leader).mds e17 preprocess_beacon health metrics for gid 4179 were updated

2016-02-20 03:49:00.334984 7fd4c9839700 10 mon.b@0(leader).mds e18 e18: 1/1/1 up {0=a-s=up:active}, 1 up:standby
2016-02-20 03:49:00.335006 7fd4c9839700 10 mon.b@0(leader).mds e18 no beacon from 4179 172.21.15.7:6808/19849 mds.0.2 up:active since 2016-02-20 03:48:43.808993
2016-02-20 03:49:00.335022 7fd4c9839700 10 mon.b@0(leader).mds e18  replacing 4179 172.21.15.7:6808/19849 mds.0.2 up:active with 4136/a 172.21.15.18:6808/29561

looks like monitor does note beacon if health metrics changes.

https://github.com/ceph/ceph/pull/7757

Actions

Copy link