Bug #44382
qa: FAIL: test_barrier (tasks.cephfs.test_full.TestClusterFull)
Status: Closed
Description
2020-02-29T13:47:06.460 INFO:teuthology.orchestra.run.smithi155:> sudo python3 -c '
2020-02-29T13:47:06.460 INFO:teuthology.orchestra.run.smithi155:> import os
2020-02-29T13:47:06.461 INFO:teuthology.orchestra.run.smithi155:>
2020-02-29T13:47:06.461 INFO:teuthology.orchestra.run.smithi155:> print(open(os.path.join("/sys/kernel/debug/ceph/e11d0fe9-5e22-4d3c-8734-8685b960d41e.client5193", "osdmap")).read())
2020-02-29T13:47:06.461 INFO:teuthology.orchestra.run.smithi155:> '
2020-02-29T13:47:06.542 INFO:teuthology.orchestra.run.smithi155.stdout:epoch 35 barrier 35 flags 0x588000
2020-02-29T13:47:06.542 INFO:teuthology.orchestra.run.smithi155.stdout:pool 1 'device_health_metrics' type 1 size 2 min_size 1 pg_num 1 pg_num_mask 0 flags 0x1 lfor 0 read_tier -1 write_tier -1
2020-02-29T13:47:06.543 INFO:teuthology.orchestra.run.smithi155.stdout:pool 2 'rbd' type 1 size 2 min_size 1 pg_num 8 pg_num_mask 7 flags 0x1 lfor 0 read_tier -1 write_tier -1
2020-02-29T13:47:06.543 INFO:teuthology.orchestra.run.smithi155.stdout:pool 5 'cephfs_metadata' type 1 size 2 min_size 1 pg_num 8 pg_num_mask 7 flags 0x1 lfor 0 read_tier -1 write_tier -1
2020-02-29T13:47:06.543 INFO:teuthology.orchestra.run.smithi155.stdout:pool 6 'cephfs_data' type 1 size 2 min_size 1 pg_num 8 pg_num_mask 7 flags 0x1 lfor 0 read_tier -1 write_tier -1
2020-02-29T13:47:06.543 INFO:teuthology.orchestra.run.smithi155.stdout:osd0 (1)172.21.15.183:6801 100% (exists, up) 100%
2020-02-29T13:47:06.543 INFO:teuthology.orchestra.run.smithi155.stdout:osd1 (1)172.21.15.183:6805 100% (exists, up) 100%
2020-02-29T13:47:06.543 INFO:teuthology.orchestra.run.smithi155.stdout:osd2 (1)172.21.15.183:6815 100% (exists, up) 100%
2020-02-29T13:47:06.544 INFO:teuthology.orchestra.run.smithi155.stdout:osd3 (1)172.21.15.183:6825 100% (exists, up) 100%
2020-02-29T13:47:06.544 INFO:teuthology.orchestra.run.smithi155.stdout:osd4 (1)172.21.15.178:6816 100% (exists, up) 100%
2020-02-29T13:47:06.544 INFO:teuthology.orchestra.run.smithi155.stdout:osd5 (1)172.21.15.178:6803 100% (exists, up) 100%
2020-02-29T13:47:06.544 INFO:teuthology.orchestra.run.smithi155.stdout:osd6 (1)172.21.15.178:6810 100% (exists, up) 100%
2020-02-29T13:47:06.544 INFO:teuthology.orchestra.run.smithi155.stdout:osd7 (1)172.21.15.178:6827 100% (exists, up) 100%
2020-02-29T13:47:06.544 INFO:teuthology.orchestra.run.smithi155.stdout:
2020-02-29T13:47:06.593 INFO:tasks.cephfs_test_runner:test_barrier (tasks.cephfs.test_full.TestClusterFull) ... FAIL
...
2020-02-29T13:47:10.985 INFO:tasks.cephfs_test_runner:======================================================================
2020-02-29T13:47:10.985 INFO:tasks.cephfs_test_runner:FAIL: test_barrier (tasks.cephfs.test_full.TestClusterFull)
2020-02-29T13:47:10.986 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2020-02-29T13:47:10.986 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2020-02-29T13:47:10.986 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_batrick_ceph_wip-pdonnell-testing-20200229.001503/qa/tasks/cephfs/test_full.py", line 105, in test_barrier
2020-02-29T13:47:10.986 INFO:tasks.cephfs_test_runner:    self.assertEqual(mount_a_epoch, mount_a_initial_epoch)
2020-02-29T13:47:10.986 INFO:tasks.cephfs_test_runner:AssertionError: 35 != 33
From: /ceph/teuthology-archive/pdonnell-2020-02-29_02:56:38-kcephfs-wip-pdonnell-testing-20200229.001503-distro-basic-smithi/4810991/teuthology.log
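The failing comparison is driven by the epoch reported in the kernel client's debugfs "osdmap" file, whose first line appears in the log above as "epoch 35 barrier 35 flags 0x588000". For reference, a minimal sketch of parsing that header line (parse_osdmap_header is a hypothetical helper, not part of the qa suite; the qa code's get_osd_epoch() does the equivalent):

import re

def parse_osdmap_header(line):
    # The kernel client's debugfs "osdmap" file begins with a line like:
    #   epoch 35 barrier 35 flags 0x588000
    m = re.match(r"epoch (\d+) barrier (\d+)", line)
    if m is None:
        raise ValueError("unexpected osdmap header: %r" % line)
    return int(m.group(1)), int(m.group(2))

# With the output captured above:
epoch, barrier = parse_osdmap_header("epoch 35 barrier 35 flags 0x588000")
assert (epoch, barrier) == (35, 35)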
See also:
3 jobs: ['4810991', '4811028', '4811065']
suites intersection: ['clusters/1-mds-4-client.yaml', 'conf/{client.yaml', 'k-testing.yaml}', 'kcephfs/recovery/{begin.yaml', 'kclient/{mount.yaml', 'log-config.yaml', 'mds.yaml', 'mon.yaml', 'ms-die-on-skipped.yaml}}', 'osd-asserts.yaml', 'osd.yaml}', 'overrides/{distro/testing/{flavor/centos_latest.yaml', 'overrides/{frag_enable.yaml', 'tasks/mds-full.yaml}', 'whitelist_health.yaml', 'whitelist_wrongly_marked_down.yaml}']
suites union: ['clusters/1-mds-4-client.yaml', 'conf/{client.yaml', 'k-testing.yaml}', 'kcephfs/recovery/{begin.yaml', 'kclient/{mount.yaml', 'log-config.yaml', 'mds.yaml', 'mon.yaml', 'ms-die-on-skipped.yaml}}', 'objectstore-ec/bluestore-comp-ec-root.yaml', 'objectstore-ec/bluestore-comp.yaml', 'objectstore-ec/bluestore-ec-root.yaml', 'osd-asserts.yaml', 'osd.yaml}', 'overrides/{distro/testing/{flavor/centos_latest.yaml', 'overrides/{frag_enable.yaml', 'tasks/mds-full.yaml}', 'whitelist_health.yaml', 'whitelist_wrongly_marked_down.yaml}']
This looks like another testing kernel regression.
Updated by Jeff Layton about 4 years ago
When was the last successful run before this? It'd be nice to know what kernel it was using to help ID the cause.
Updated by Patrick Donnelly about 4 years ago
Updated by Jeff Layton about 4 years ago
This test seems to be based on some very subtle assumptions:
def test_barrier(self):
    """
    That when an OSD epoch barrier is set on an MDS, subsequently
    issued capabilities cause clients to update their OSD map to that
    epoch.
    """

    # Sync up clients with initial MDS OSD map barrier
    self.mount_a.open_no_data("foo")
    self.mount_b.open_no_data("bar")

    # Grab mounts' initial OSD epochs: later we will check that
    # it hasn't advanced beyond this point.
    mount_a_initial_epoch = self.mount_a.get_osd_epoch()[0]
    mount_b_initial_epoch = self.mount_b.get_osd_epoch()[0]

    # Freshly mounted at start of test, should be up to date with OSD map
    self.assertGreaterEqual(mount_a_initial_epoch, self.initial_osd_epoch)
    self.assertGreaterEqual(mount_b_initial_epoch, self.initial_osd_epoch)

    # Set and unset a flag to cause OSD epoch to increment
    self.fs.mon_manager.raw_cluster_cmd("osd", "set", "pause")
    self.fs.mon_manager.raw_cluster_cmd("osd", "unset", "pause")

    out = self.fs.mon_manager.raw_cluster_cmd("osd", "dump", "--format=json").strip()
    new_epoch = json.loads(out)['epoch']
    self.assertNotEqual(self.initial_osd_epoch, new_epoch)

    # Do a metadata operation on clients, witness that they end up with
    # the old OSD map from startup time (nothing has prompted client
    # to update its map)
    self.mount_a.open_no_data("alpha")
    self.mount_b.open_no_data("bravo1")

    # Sleep long enough that if the OSD map was propagating it would
    # have done so (this is arbitrary because we are 'waiting' for something
    # to *not* happen).
    time.sleep(30)
    mount_a_epoch, mount_a_barrier = self.mount_a.get_osd_epoch()
    self.assertEqual(mount_a_epoch, mount_a_initial_epoch)
    mount_b_epoch, mount_b_barrier = self.mount_b.get_osd_epoch()
    self.assertEqual(mount_b_epoch, mount_b_initial_epoch)
OK, so basically we're assuming that even though the osd map was updated, we are not going to get that update because the client is idle? We are opening files on the mounts but don't seem to be doing any other I/O there, so there's no reason we should have to fetch the map, granted.
I'm not sure I'd consider that behavior necessarily broken though.
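For concreteness, the arithmetic behind the failure reads like this (a worked interpretation; the one-epoch bump per osd flag change is an assumption, not something the log states):

# A worked reading of "AssertionError: 35 != 33". Assumption (not stated
# in the log): each osd flag change bumps the map by one epoch.
mount_a_initial_epoch = 33      # client's epoch right after mount
cluster_epoch = 33 + 2          # after "osd set pause" + "osd unset pause"
mount_a_epoch = 35              # what the supposedly idle client reported

# The test asserts the idle client is still at its initial epoch; instead
# it had already fetched the new map, hence assertEqual(35, 33) fails.
assert mount_a_epoch == cluster_epoch
assert mount_a_epoch != mount_a_initial_epoch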
Updated by Jeff Layton about 4 years ago
Actually, it's the later assertion:
# ...and none of this should have affected the oblivious mount a,
# because it wasn't doing any data or metadata IO
mount_a_epoch, mount_a_barrier = self.mount_a.get_osd_epoch()
self.assertEqual(mount_a_epoch, mount_a_initial_epoch)
Still, we presumably have 2 kcephfs mounts here. I'm not sure we should assume this behavior.
Updated by Zheng Yan about 4 years ago
- Status changed from New to Fix Under Review
- Backport set to octopus,nautilus,luminous
- Pull request ID set to 34164
Updated by Greg Farnum about 4 years ago
- Status changed from Fix Under Review to Pending Backport
Updated by Nathan Cutler about 4 years ago
- Copied to Backport #45217: nautilus: qa: FAIL: test_barrier (tasks.cephfs.test_full.TestClusterFull) added
Updated by Nathan Cutler about 4 years ago
- Copied to Backport #45218: luminous: qa: FAIL: test_barrier (tasks.cephfs.test_full.TestClusterFull) added
Updated by Nathan Cutler about 4 years ago
- Copied to Backport #45219: octopus: qa: FAIL: test_barrier (tasks.cephfs.test_full.TestClusterFull) added
Updated by Nathan Cutler almost 4 years ago
- Status changed from Pending Backport to Resolved
While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".