Bug #44382: qa: FAIL: test_barrier (tasks.cephfs.test_full.TestClusterFull) - CephFS - Ceph

Actions

Copy link

Bug #44382

closed

qa: FAIL: test_barrier (tasks.cephfs.test_full.TestClusterFull)

Added by Patrick Donnelly about 4 years ago. Updated almost 4 years ago.

Status:

Resolved

Priority:

Urgent

Assignee:

Zheng Yan

Category:

Target version:

Ceph - v15.0.0

% Done:

Source:

Q/A

Tags:

Backport:

octopus,nautilus,luminous

Regression:

Severity:

3 - minor

Reviewed:

Affected Versions:

ceph-qa-suite:

Component(FS):

kceph

Labels (FS):

Pull request ID:

34164

Crash signature (v1):

Crash signature (v2):

Description

2020-02-29T13:47:06.460 INFO:teuthology.orchestra.run.smithi155:> sudo python3 -c '
2020-02-29T13:47:06.460 INFO:teuthology.orchestra.run.smithi155:> import os
2020-02-29T13:47:06.461 INFO:teuthology.orchestra.run.smithi155:>
2020-02-29T13:47:06.461 INFO:teuthology.orchestra.run.smithi155:> print(open(os.path.join("/sys/kernel/debug/ceph/e11d0fe9-5e22-4d3c-8734-8685b960d41e.client5193", "osdmap")).read())
2020-02-29T13:47:06.461 INFO:teuthology.orchestra.run.smithi155:> '
2020-02-29T13:47:06.542 INFO:teuthology.orchestra.run.smithi155.stdout:epoch 35 barrier 35 flags 0x588000
2020-02-29T13:47:06.542 INFO:teuthology.orchestra.run.smithi155.stdout:pool 1 'device_health_metrics' type 1 size 2 min_size 1 pg_num 1 pg_num_mask 0 flags 0x1 lfor 0 read_tier -1 write_tier -1
2020-02-29T13:47:06.543 INFO:teuthology.orchestra.run.smithi155.stdout:pool 2 'rbd' type 1 size 2 min_size 1 pg_num 8 pg_num_mask 7 flags 0x1 lfor 0 read_tier -1 write_tier -1
2020-02-29T13:47:06.543 INFO:teuthology.orchestra.run.smithi155.stdout:pool 5 'cephfs_metadata' type 1 size 2 min_size 1 pg_num 8 pg_num_mask 7 flags 0x1 lfor 0 read_tier -1 write_tier -1
2020-02-29T13:47:06.543 INFO:teuthology.orchestra.run.smithi155.stdout:pool 6 'cephfs_data' type 1 size 2 min_size 1 pg_num 8 pg_num_mask 7 flags 0x1 lfor 0 read_tier -1 write_tier -1
2020-02-29T13:47:06.543 INFO:teuthology.orchestra.run.smithi155.stdout:osd0     (1)172.21.15.183:6801   100%    (exists, up)    100%
2020-02-29T13:47:06.543 INFO:teuthology.orchestra.run.smithi155.stdout:osd1     (1)172.21.15.183:6805   100%    (exists, up)    100%
2020-02-29T13:47:06.543 INFO:teuthology.orchestra.run.smithi155.stdout:osd2     (1)172.21.15.183:6815   100%    (exists, up)    100%
2020-02-29T13:47:06.544 INFO:teuthology.orchestra.run.smithi155.stdout:osd3     (1)172.21.15.183:6825   100%    (exists, up)    100%
2020-02-29T13:47:06.544 INFO:teuthology.orchestra.run.smithi155.stdout:osd4     (1)172.21.15.178:6816   100%    (exists, up)    100%
2020-02-29T13:47:06.544 INFO:teuthology.orchestra.run.smithi155.stdout:osd5     (1)172.21.15.178:6803   100%    (exists, up)    100%
2020-02-29T13:47:06.544 INFO:teuthology.orchestra.run.smithi155.stdout:osd6     (1)172.21.15.178:6810   100%    (exists, up)    100%
2020-02-29T13:47:06.544 INFO:teuthology.orchestra.run.smithi155.stdout:osd7     (1)172.21.15.178:6827   100%    (exists, up)    100%
2020-02-29T13:47:06.544 INFO:teuthology.orchestra.run.smithi155.stdout:
2020-02-29T13:47:06.593 INFO:tasks.cephfs_test_runner:test_barrier (tasks.cephfs.test_full.TestClusterFull) ... FAIL
...
2020-02-29T13:47:10.985 INFO:tasks.cephfs_test_runner:======================================================================
2020-02-29T13:47:10.985 INFO:tasks.cephfs_test_runner:FAIL: test_barrier (tasks.cephfs.test_full.TestClusterFull)
2020-02-29T13:47:10.986 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2020-02-29T13:47:10.986 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2020-02-29T13:47:10.986 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_batrick_ceph_wip-pdonnell-testing-20200229.001503/qa/tasks/cephfs/test_full.py", line 105, in test_barrier
2020-02-29T13:47:10.986 INFO:tasks.cephfs_test_runner:    self.assertEqual(mount_a_epoch, mount_a_initial_epoch)
2020-02-29T13:47:10.986 INFO:tasks.cephfs_test_runner:AssertionError: 35 != 33

From: /ceph/teuthology-archive/pdonnell-2020-02-29_02:56:38-kcephfs-wip-pdonnell-testing-20200229.001503-distro-basic-smithi/4810991/teuthology.log

Updated by Jeff Layton about 4 years ago

When was the last successful run before this? It'd be nice to know what kernel it was using to help ID the cause.

Actions

Copy link

Updated by Patrick Donnelly about 4 years ago

Don't see it in http://pulpito.ceph.com/pdonnell-2020-02-16_17:35:17-kcephfs-wip-pdonnell-testing-20200215.033325-distro-basic-smithi/

Actions

Copy link

Updated by Jeff Layton about 4 years ago

This test seems to be based on some very subtle assumptions:

    def test_barrier(self):                                                     
        """                                                                     
        That when an OSD epoch barrier is set on an MDS, subsequently           
        issued capabilities cause clients to update their OSD map to that       
        epoch.                                                                  
        """                                                                     

        # Sync up clients with initial MDS OSD map barrier                      
        self.mount_a.open_no_data("foo")                                        
        self.mount_b.open_no_data("bar")                                        

        # Grab mounts' initial OSD epochs: later we will check that             
        # it hasn't advanced beyond this point.                                 
        mount_a_initial_epoch = self.mount_a.get_osd_epoch()[0]                 
        mount_b_initial_epoch = self.mount_b.get_osd_epoch()[0]                 

        # Freshly mounted at start of test, should be up to date with OSD map   
        self.assertGreaterEqual(mount_a_initial_epoch, self.initial_osd_epoch)  
        self.assertGreaterEqual(mount_b_initial_epoch, self.initial_osd_epoch)  

        # Set and unset a flag to cause OSD epoch to increment                  
        self.fs.mon_manager.raw_cluster_cmd("osd", "set", "pause")              
        self.fs.mon_manager.raw_cluster_cmd("osd", "unset", "pause")            

        out = self.fs.mon_manager.raw_cluster_cmd("osd", "dump", "--format=json").strip()
        new_epoch = json.loads(out)['epoch']                                    
        self.assertNotEqual(self.initial_osd_epoch, new_epoch)                  

        # Do a metadata operation on clients, witness that they end up with     
        # the old OSD map from startup time (nothing has prompted client        
        # to update its map)                                                    
        self.mount_a.open_no_data("alpha")                                      
        self.mount_b.open_no_data("bravo1")                                     

        # Sleep long enough that if the OSD map was propagating it would        
        # have done so (this is arbitrary because we are 'waiting' for something
        # to *not* happen).                                                     
        time.sleep(30)                                                          

        mount_a_epoch, mount_a_barrier = self.mount_a.get_osd_epoch()           
        self.assertEqual(mount_a_epoch, mount_a_initial_epoch)                  
        mount_b_epoch, mount_b_barrier = self.mount_b.get_osd_epoch()           
        self.assertEqual(mount_b_epoch, mount_b_initial_epoch)

Ok, so basically we're assuming that even though the osd map was updated that we are not going to get that update because the client is idle? We are opening files on the mounts but don't seem to be otherwise doing any I/O there, so there's no reason we should have to fetch the map there, granted.

I'm not sure I'd consider that behavior necessarily broken though.

Actions

Copy link

Updated by Jeff Layton about 4 years ago

Actually, it's the later assertion:

        # ...and none of this should have affected the oblivious mount a,       
        # because it wasn't doing any data or metadata IO                       
        mount_a_epoch, mount_a_barrier = self.mount_a.get_osd_epoch()           
        self.assertEqual(mount_a_epoch, mount_a_initial_epoch)

Still, we presumably have 2 kcephfs mounts here. I'm not sure we should assume this behavior.

Actions

Copy link

Updated by Zheng Yan about 4 years ago

Status changed from New to Fix Under Review
Backport set to octopus,nautilus,luminous
Pull request ID set to 34164

Actions

Copy link

Updated by Greg Farnum about 4 years ago

Status changed from Fix Under Review to Pending Backport

Actions

Copy link

Updated by Nathan Cutler about 4 years ago

Copied to Backport #45217: nautilus: qa: FAIL: test_barrier (tasks.cephfs.test_full.TestClusterFull) added

Actions

Copy link

Updated by Nathan Cutler about 4 years ago

Copied to Backport #45218: luminous: qa: FAIL: test_barrier (tasks.cephfs.test_full.TestClusterFull) added

Actions

Copy link

Updated by Nathan Cutler about 4 years ago

Copied to Backport #45219: octopus: qa: FAIL: test_barrier (tasks.cephfs.test_full.TestClusterFull) added

Actions

Copy link

#10

Updated by Nathan Cutler almost 4 years ago

Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".

Actions

Copy link

Also available in: Atom PDF

Project

General

Profile

Ceph » CephFS

Custom queries

Bug #44382

qa: FAIL: test_barrier (tasks.cephfs.test_full.TestClusterFull)

Updated by Jeff Layton about 4 years ago

Updated by Patrick Donnelly about 4 years ago

Updated by Jeff Layton about 4 years ago

Updated by Jeff Layton about 4 years ago

Updated by Zheng Yan about 4 years ago

Updated by Greg Farnum about 4 years ago

Updated by Nathan Cutler about 4 years ago

Updated by Nathan Cutler about 4 years ago

Updated by Nathan Cutler about 4 years ago

Updated by Nathan Cutler almost 4 years ago