Project

General

Profile

Bug #44382

qa: FAIL: test_barrier (tasks.cephfs.test_full.TestClusterFull)

Added by Patrick Donnelly 4 months ago. Updated about 2 months ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
-
Target version:
% Done:

0%

Source:
Q/A
Tags:
Backport:
octopus,nautilus,luminous
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
kceph
Labels (FS):
Pull request ID:
Crash signature:

Description

2020-02-29T13:47:06.460 INFO:teuthology.orchestra.run.smithi155:> sudo python3 -c '
2020-02-29T13:47:06.460 INFO:teuthology.orchestra.run.smithi155:> import os
2020-02-29T13:47:06.461 INFO:teuthology.orchestra.run.smithi155:>
2020-02-29T13:47:06.461 INFO:teuthology.orchestra.run.smithi155:> print(open(os.path.join("/sys/kernel/debug/ceph/e11d0fe9-5e22-4d3c-8734-8685b960d41e.client5193", "osdmap")).read())
2020-02-29T13:47:06.461 INFO:teuthology.orchestra.run.smithi155:> '
2020-02-29T13:47:06.542 INFO:teuthology.orchestra.run.smithi155.stdout:epoch 35 barrier 35 flags 0x588000
2020-02-29T13:47:06.542 INFO:teuthology.orchestra.run.smithi155.stdout:pool 1 'device_health_metrics' type 1 size 2 min_size 1 pg_num 1 pg_num_mask 0 flags 0x1 lfor 0 read_tier -1 write_tier -1
2020-02-29T13:47:06.543 INFO:teuthology.orchestra.run.smithi155.stdout:pool 2 'rbd' type 1 size 2 min_size 1 pg_num 8 pg_num_mask 7 flags 0x1 lfor 0 read_tier -1 write_tier -1
2020-02-29T13:47:06.543 INFO:teuthology.orchestra.run.smithi155.stdout:pool 5 'cephfs_metadata' type 1 size 2 min_size 1 pg_num 8 pg_num_mask 7 flags 0x1 lfor 0 read_tier -1 write_tier -1
2020-02-29T13:47:06.543 INFO:teuthology.orchestra.run.smithi155.stdout:pool 6 'cephfs_data' type 1 size 2 min_size 1 pg_num 8 pg_num_mask 7 flags 0x1 lfor 0 read_tier -1 write_tier -1
2020-02-29T13:47:06.543 INFO:teuthology.orchestra.run.smithi155.stdout:osd0     (1)172.21.15.183:6801   100%    (exists, up)    100%
2020-02-29T13:47:06.543 INFO:teuthology.orchestra.run.smithi155.stdout:osd1     (1)172.21.15.183:6805   100%    (exists, up)    100%
2020-02-29T13:47:06.543 INFO:teuthology.orchestra.run.smithi155.stdout:osd2     (1)172.21.15.183:6815   100%    (exists, up)    100%
2020-02-29T13:47:06.544 INFO:teuthology.orchestra.run.smithi155.stdout:osd3     (1)172.21.15.183:6825   100%    (exists, up)    100%
2020-02-29T13:47:06.544 INFO:teuthology.orchestra.run.smithi155.stdout:osd4     (1)172.21.15.178:6816   100%    (exists, up)    100%
2020-02-29T13:47:06.544 INFO:teuthology.orchestra.run.smithi155.stdout:osd5     (1)172.21.15.178:6803   100%    (exists, up)    100%
2020-02-29T13:47:06.544 INFO:teuthology.orchestra.run.smithi155.stdout:osd6     (1)172.21.15.178:6810   100%    (exists, up)    100%
2020-02-29T13:47:06.544 INFO:teuthology.orchestra.run.smithi155.stdout:osd7     (1)172.21.15.178:6827   100%    (exists, up)    100%
2020-02-29T13:47:06.544 INFO:teuthology.orchestra.run.smithi155.stdout:
2020-02-29T13:47:06.593 INFO:tasks.cephfs_test_runner:test_barrier (tasks.cephfs.test_full.TestClusterFull) ... FAIL
...
2020-02-29T13:47:10.985 INFO:tasks.cephfs_test_runner:======================================================================
2020-02-29T13:47:10.985 INFO:tasks.cephfs_test_runner:FAIL: test_barrier (tasks.cephfs.test_full.TestClusterFull)
2020-02-29T13:47:10.986 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2020-02-29T13:47:10.986 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2020-02-29T13:47:10.986 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_batrick_ceph_wip-pdonnell-testing-20200229.001503/qa/tasks/cephfs/test_full.py", line 105, in test_barrier
2020-02-29T13:47:10.986 INFO:tasks.cephfs_test_runner:    self.assertEqual(mount_a_epoch, mount_a_initial_epoch)
2020-02-29T13:47:10.986 INFO:tasks.cephfs_test_runner:AssertionError: 35 != 33

From: /ceph/teuthology-archive/pdonnell-2020-02-29_02:56:38-kcephfs-wip-pdonnell-testing-20200229.001503-distro-basic-smithi/4810991/teuthology.log

See also:

3 jobs: ['4810991', '4811028', '4811065']
suites intersection: ['clusters/1-mds-4-client.yaml', 'conf/{client.yaml', 'k-testing.yaml}', 'kcephfs/recovery/{begin.yaml', 'kclient/{mount.yaml', 'log-config.yaml', 'mds.yaml', 'mon.yaml', 'ms-die-on-skipped.yaml}}', 'osd-asserts.yaml', 'osd.yaml}', 'overrides/{distro/testing/{flavor/centos_latest.yaml', 'overrides/{frag_enable.yaml', 'tasks/mds-full.yaml}', 'whitelist_health.yaml', 'whitelist_wrongly_marked_down.yaml}']
suites union: ['clusters/1-mds-4-client.yaml', 'conf/{client.yaml', 'k-testing.yaml}', 'kcephfs/recovery/{begin.yaml', 'kclient/{mount.yaml', 'log-config.yaml', 'mds.yaml', 'mon.yaml', 'ms-die-on-skipped.yaml}}', 'objectstore-ec/bluestore-comp-ec-root.yaml', 'objectstore-ec/bluestore-comp.yaml', 'objectstore-ec/bluestore-ec-root.yaml', 'osd-asserts.yaml', 'osd.yaml}', 'overrides/{distro/testing/{flavor/centos_latest.yaml', 'overrides/{frag_enable.yaml', 'tasks/mds-full.yaml}', 'whitelist_health.yaml', 'whitelist_wrongly_marked_down.yaml}']

This looks like another testing kernel regression.


Related issues

Copied to fs - Backport #45217: nautilus: qa: FAIL: test_barrier (tasks.cephfs.test_full.TestClusterFull) Resolved
Copied to fs - Backport #45218: luminous: qa: FAIL: test_barrier (tasks.cephfs.test_full.TestClusterFull) Rejected
Copied to fs - Backport #45219: octopus: qa: FAIL: test_barrier (tasks.cephfs.test_full.TestClusterFull) Resolved

History

#1 Updated by Jeff Layton 4 months ago

When was the last successful run before this? It'd be nice to know what kernel it was using to help ID the cause.

#3 Updated by Jeff Layton 4 months ago

This test seems to be based on some very subtle assumptions:

    def test_barrier(self):                                                     
        """                                                                     
        That when an OSD epoch barrier is set on an MDS, subsequently           
        issued capabilities cause clients to update their OSD map to that       
        epoch.                                                                  
        """                                                                     

        # Sync up clients with initial MDS OSD map barrier                      
        self.mount_a.open_no_data("foo")                                        
        self.mount_b.open_no_data("bar")                                        

        # Grab mounts' initial OSD epochs: later we will check that             
        # it hasn't advanced beyond this point.                                 
        mount_a_initial_epoch = self.mount_a.get_osd_epoch()[0]                 
        mount_b_initial_epoch = self.mount_b.get_osd_epoch()[0]                 

        # Freshly mounted at start of test, should be up to date with OSD map   
        self.assertGreaterEqual(mount_a_initial_epoch, self.initial_osd_epoch)  
        self.assertGreaterEqual(mount_b_initial_epoch, self.initial_osd_epoch)  

        # Set and unset a flag to cause OSD epoch to increment                  
        self.fs.mon_manager.raw_cluster_cmd("osd", "set", "pause")              
        self.fs.mon_manager.raw_cluster_cmd("osd", "unset", "pause")            

        out = self.fs.mon_manager.raw_cluster_cmd("osd", "dump", "--format=json").strip()
        new_epoch = json.loads(out)['epoch']                                    
        self.assertNotEqual(self.initial_osd_epoch, new_epoch)                  

        # Do a metadata operation on clients, witness that they end up with     
        # the old OSD map from startup time (nothing has prompted client        
        # to update its map)                                                    
        self.mount_a.open_no_data("alpha")                                      
        self.mount_b.open_no_data("bravo1")                                     

        # Sleep long enough that if the OSD map was propagating it would        
        # have done so (this is arbitrary because we are 'waiting' for something
        # to *not* happen).                                                     
        time.sleep(30)                                                          

        mount_a_epoch, mount_a_barrier = self.mount_a.get_osd_epoch()           
        self.assertEqual(mount_a_epoch, mount_a_initial_epoch)                  
        mount_b_epoch, mount_b_barrier = self.mount_b.get_osd_epoch()           
        self.assertEqual(mount_b_epoch, mount_b_initial_epoch)                  

Ok, so basically we're assuming that even though the osd map was updated that we are not going to get that update because the client is idle? We are opening files on the mounts but don't seem to be otherwise doing any I/O there, so there's no reason we should have to fetch the map there, granted.

I'm not sure I'd consider that behavior necessarily broken though.

#4 Updated by Jeff Layton 4 months ago

Actually, it's the later assertion:

        # ...and none of this should have affected the oblivious mount a,       
        # because it wasn't doing any data or metadata IO                       
        mount_a_epoch, mount_a_barrier = self.mount_a.get_osd_epoch()           
        self.assertEqual(mount_a_epoch, mount_a_initial_epoch)                  

Still, we presumably have 2 kcephfs mounts here. I'm not sure we should assume this behavior.

#5 Updated by Zheng Yan 4 months ago

  • Status changed from New to Fix Under Review
  • Backport set to octopus,nautilus,luminous
  • Pull request ID set to 34164

#6 Updated by Greg Farnum 3 months ago

  • Status changed from Fix Under Review to Pending Backport

#7 Updated by Nathan Cutler 3 months ago

  • Copied to Backport #45217: nautilus: qa: FAIL: test_barrier (tasks.cephfs.test_full.TestClusterFull) added

#8 Updated by Nathan Cutler 3 months ago

  • Copied to Backport #45218: luminous: qa: FAIL: test_barrier (tasks.cephfs.test_full.TestClusterFull) added

#9 Updated by Nathan Cutler 3 months ago

  • Copied to Backport #45219: octopus: qa: FAIL: test_barrier (tasks.cephfs.test_full.TestClusterFull) added

#10 Updated by Nathan Cutler about 2 months ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".

Also available in: Atom PDF