Bug #19550

re-enable ENOSPC tests in ceph-qa-suite for kclient

Added by Jeff Layton 8 months ago. Updated 7 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Target version: -
Start date: 04/07/2017
Due date:
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Release:
Needs Doc: No

Description

Now that the patches for this bug are in the testing branch, we need to re-enable the ENOSPC handling tests for the kclient:

http://tracker.ceph.com/issues/17204

John apparently knows what to do here.

History

#1 Updated by Jeff Layton 8 months ago

One of the tests failed, I think due to this?

2017-04-07T14:18:54.833 INFO:tasks.cephfs_test_runner:======================================================================
2017-04-07T14:18:54.834 INFO:tasks.cephfs_test_runner:ERROR: test_barrier (tasks.cephfs.test_full.TestClusterFull)
2017-04-07T14:18:54.834 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2017-04-07T14:18:54.834 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2017-04-07T14:18:54.834 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_jcsp_ceph_wip-19550/qa/tasks/cephfs/test_full.py", line 62, in test_barrier
2017-04-07T14:18:54.835 INFO:tasks.cephfs_test_runner:    mount_a_initial_epoch = self.mount_a.get_osd_epoch()[0]
2017-04-07T14:18:54.835 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_jcsp_ceph_wip-19550/qa/tasks/cephfs/kernel_mount.py", line 258, in get_osd_epoch
2017-04-07T14:18:54.835 INFO:tasks.cephfs_test_runner:    epoch_barrier = int(lines[2].split()[1].strip('"'))
2017-04-07T14:18:54.835 INFO:tasks.cephfs_test_runner:ValueError: invalid literal for int() with base 10: 'open'
2017-04-07T14:18:54.835 INFO:tasks.cephfs_test_runner:
2017-04-07T14:18:54.835 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2017-04-07T14:18:54.835 INFO:tasks.cephfs_test_runner:Ran 1 test in 41.765s
2017-04-07T14:18:54.835 INFO:tasks.cephfs_test_runner:
2017-04-07T14:18:54.835 INFO:tasks.cephfs_test_runner:FAILED (errors=1)
2017-04-07T14:18:54.835 INFO:tasks.cephfs_test_runner:
2017-04-07T14:18:54.836 INFO:tasks.cephfs_test_runner:======================================================================
2017-04-07T14:18:54.836 INFO:tasks.cephfs_test_runner:ERROR: test_barrier (tasks.cephfs.test_full.TestClusterFull)
2017-04-07T14:18:54.836 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2017-04-07T14:18:54.836 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2017-04-07T14:18:54.836 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_jcsp_ceph_wip-19550/qa/tasks/cephfs/test_full.py", line 62, in test_barrier
2017-04-07T14:18:54.836 INFO:tasks.cephfs_test_runner:    mount_a_initial_epoch = self.mount_a.get_osd_epoch()[0]
2017-04-07T14:18:54.836 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_jcsp_ceph_wip-19550/qa/tasks/cephfs/kernel_mount.py", line 258, in get_osd_epoch
2017-04-07T14:18:54.837 INFO:tasks.cephfs_test_runner:    epoch_barrier = int(lines[2].split()[1].strip('"'))
2017-04-07T14:18:54.837 INFO:tasks.cephfs_test_runner:ValueError: invalid literal for int() with base 10: 'open'
2017-04-07T14:18:54.837 INFO:tasks.cephfs_test_runner:
2017-04-07T14:18:54.837 ERROR:teuthology.run_tasks:Saw exception from tasks.

#2 Updated by Jeff Layton 7 months ago

Ah ok, I think I see. This is the current get_osd_epoch in kernel_mount.py:

    def get_osd_epoch(self):
        """ 
        Return 2-tuple of osd_epoch, osd_epoch_barrier
        """ 
        osd_map = self._read_debug_file("osdmap")
        lines = osd_map.split("\n")
        epoch = int(lines[0].split()[1])

        mds_sessions = self._read_debug_file("mds_sessions")
        lines = mds_sessions.split("\n") 
        epoch_barrier = int(lines[2].split()[1].strip('"'))

        return epoch, epoch_barrier

The debugfs file format is different. One way to fix this would be to open /sys/kernel/debug/ceph/client*/osdmap and scrape the first line, which now does this:

seq_printf(s, "epoch %u barrier %u flags 0x%x\n", map->epoch,
           osdc->epoch_barrier, map->flags);

Just snip out the uints for the epoch and barrier.
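
For illustration, a version of get_osd_epoch that parses that new first-line format could look roughly like this (a minimal sketch only, assuming the line is exactly what the seq_printf above emits; the real fix is whatever lands in kernel_mount.py):

    def get_osd_epoch(self):
        """
        Return 2-tuple of osd_epoch, osd_epoch_barrier
        """
        osd_map = self._read_debug_file("osdmap")
        # First line now looks like: "epoch 21 barrier 21 flags 0x8"
        fields = osd_map.split("\n")[0].split()
        epoch = int(fields[1])
        epoch_barrier = int(fields[3])
        return epoch, epoch_barrier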

#3 Updated by Jeff Layton 7 months ago

  • Assignee changed from John Spray to Jeff Layton

Bah, good time for me to learn some python. I'll fix this.

#4 Updated by John Spray 7 months ago

  • Status changed from New to Need Review

Just linking the PR here for my convenience:
https://github.com/ceph/ceph/pull/14396

#6 Updated by John Spray 7 months ago

2017-04-09T19:13:08.843 INFO:tasks.cephfs_test_runner:
2017-04-09T19:13:08.843 INFO:tasks.cephfs_test_runner:======================================================================
2017-04-09T19:13:08.843 INFO:tasks.cephfs_test_runner:ERROR: test_barrier (tasks.cephfs.test_full.TestClusterFull)
2017-04-09T19:13:08.843 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2017-04-09T19:13:08.843 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2017-04-09T19:13:08.843 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_jcsp_ceph_wip-19550/qa/tasks/cephfs/test_full.py", line 111, in test_barrier
2017-04-09T19:13:08.843 INFO:tasks.cephfs_test_runner:    lambda x: x[0] > new_epoch or x[1] > new_epoch)
2017-04-09T19:13:08.844 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_jcsp_ceph_wip-19550/qa/tasks/ceph_test_case.py", line 122, in wait_until_equal
2017-04-09T19:13:08.846 INFO:tasks.cephfs_test_runner:    elapsed, expect_val, val
2017-04-09T19:13:08.846 INFO:tasks.cephfs_test_runner:RuntimeError: Timed out after 30 seconds waiting for (21, 21) (currently (19, 21))
2017-04-09T19:13:08.846 INFO:tasks.cephfs_test_runner:
2017-04-09T19:13:08.846 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2017-04-09T19:13:08.846 INFO:tasks.cephfs_test_runner:Ran 1 test in 103.223s
2017-04-09T19:13:08.847 INFO:tasks.cephfs_test_runner:
2017-04-09T19:13:08.847 INFO:tasks.cephfs_test_runner:FAILED (errors=1)
2017-04-09T19:13:08.847 INFO:tasks.cephfs_test_runner:
2017-04-09T19:13:08.847 INFO:tasks.cephfs_test_runner:======================================================================
2017-04-09T19:13:08.847 INFO:tasks.cephfs_test_runner:ERROR: test_barrier (tasks.cephfs.test_full.TestClusterFull)
2017-04-09T19:13:08.847 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2017-04-09T19:13:08.847 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2017-04-09T19:13:08.847 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_jcsp_ceph_wip-19550/qa/tasks/cephfs/test_full.py", line 111, in test_barrier
2017-04-09T19:13:08.847 INFO:tasks.cephfs_test_runner:    lambda x: x[0] > new_epoch or x[1] > new_epoch)
2017-04-09T19:13:08.847 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_jcsp_ceph_wip-19550/qa/tasks/ceph_test_case.py", line 122, in wait_until_equal
2017-04-09T19:13:08.847 INFO:tasks.cephfs_test_runner:    elapsed, expect_val, val
2017-04-09T19:13:08.847 INFO:tasks.cephfs_test_runner:RuntimeError: Timed out after 30 seconds waiting for (21, 21) (currently (19, 21))
2017-04-09T19:13:08.848 INFO:tasks.cephfs_test_runner:
2017-04-09T19:13:08.848 ERROR:teuthology.run_tasks:Saw exception from tasks.

Different failure -- this part of the test expects that when the client sees an osdmap barrier higher than the map it currently has, it will request that higher-versioned map from the mon. It seems like the kclient isn't doing that.

#7 Updated by Jeff Layton 7 months ago

Yes, I think the problem is that we need ceph_osdc_update_epoch_barrier to call maybe_request_map if it updates the barrier. Let me spin up a patch for that and we can rerun the tests. Come to think of it, Ilya had mentioned that we needed to do that there, and I forgot to add it when going over his comments.

Now fixed in the ceph kernel testing branch, and test is running here:

http://pulpito.ceph.com/jlayton-2017-04-10_11:17:37-kcephfs:recovery-master-testing-basic-smithi/

#8 Updated by Jeff Layton 7 months ago

  • Assignee changed from Jeff Layton to John Spray

(Reassigning back to John as I need his input here)

The latest test that John kicked off died here:

2017-04-10T17:56:27.675 INFO:tasks.cephfs_test_runner:                          
2017-04-10T17:56:27.676 INFO:tasks.cephfs_test_runner:======================================================================
2017-04-10T17:56:27.676 INFO:tasks.cephfs_test_runner:FAIL: test_barrier (tasks.cephfs.test_full.TestClusterFull)
2017-04-10T17:56:27.676 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2017-04-10T17:56:27.676 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2017-04-10T17:56:27.676 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_jcsp_ceph_wip-19550/qa/tasks/cephfs/test_full.py", line 89, in test_barrier
2017-04-10T17:56:27.676 INFO:tasks.cephfs_test_runner:    self.assertEqual(mount_a_epoch, mount_a_initial_epoch)
2017-04-10T17:56:27.676 INFO:tasks.cephfs_test_runner:AssertionError: 22 != 20  
2017-04-10T17:56:27.676 INFO:tasks.cephfs_test_runner:                          
2017-04-10T17:56:27.676 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2017-04-10T17:56:27.676 INFO:tasks.cephfs_test_runner:Ran 1 test in 71.033s     
2017-04-10T17:56:27.676 INFO:tasks.cephfs_test_runner:                          
2017-04-10T17:56:27.677 INFO:tasks.cephfs_test_runner:FAILED (failures=1)       
2017-04-10T17:56:27.677 INFO:tasks.cephfs_test_runner:                          
2017-04-10T17:56:27.677 INFO:tasks.cephfs_test_runner:======================================================================
2017-04-10T17:56:27.677 INFO:tasks.cephfs_test_runner:FAIL: test_barrier (tasks.cephfs.test_full.TestClusterFull)
2017-04-10T17:56:27.677 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2017-04-10T17:56:27.677 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2017-04-10T17:56:27.677 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_jcsp_ceph_wip-19550/qa/tasks/cephfs/test_full.py", line 89, in test_barrier
2017-04-10T17:56:27.677 INFO:tasks.cephfs_test_runner:    self.assertEqual(mount_a_epoch, mount_a_initial_epoch)
2017-04-10T17:56:27.677 INFO:tasks.cephfs_test_runner:AssertionError: 22 != 20  
2017-04-10T17:56:27.677 INFO:tasks.cephfs_test_runner:                          

That's from:

        mount_a_epoch, mount_a_barrier = self.mount_a.get_osd_epoch()
        self.assertEqual(mount_a_epoch, mount_a_initial_epoch)

Basically the test samples the osd map epoch, does some operations and then samples it again and expects the epoch to be the same. It wasn't the same though. Between fetching mount_a_initial_epoch and mount_a_epoch, the code does this:

        self.fs.mon_manager.raw_cluster_cmd("osd", "set", "pause")
        self.fs.mon_manager.raw_cluster_cmd("osd", "unset", "pause")

Wouldn't that cause the epoch to be incremented twice (which is what we see happening here)? I'm not sure why we'd expect it to stay the same there.
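
For context, a quick way to confirm that each flag change bumps the global epoch would be to ask the mon directly, along these lines (a hypothetical sketch reusing the test's mon_manager helper; not part of the actual test):

        import json

        # Hypothetical check: "ceph osd dump" reports the current osdmap epoch.
        before = json.loads(self.fs.mon_manager.raw_cluster_cmd(
            "osd", "dump", "--format=json"))["epoch"]
        self.fs.mon_manager.raw_cluster_cmd("osd", "set", "pause")
        self.fs.mon_manager.raw_cluster_cmd("osd", "unset", "pause")
        after = json.loads(self.fs.mon_manager.raw_cluster_cmd(
            "osd", "dump", "--format=json"))["epoch"]
        # Each flag change publishes a new osdmap, so we'd expect after == before + 2.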

#9 Updated by John Spray 7 months ago

The assertion is not saying that the epoch hasn't incremented (globally); it's checking that an idle mount isn't getting new osdmap epochs.

The idea of this was to detect the difference between a client that's correctly getting a new map when it needs to, vs. a client that is for some reason continuously subscribing to the map.

It's not wrong behaviour for a client to just continuously subscribe to the osdmap, but it's unexpected.

#10 Updated by Jeff Layton 7 months ago

The kernel client does do a continuous subscription when the PAUSE* flags are set. From maybe_request_map:

        if (ceph_osdmap_flag(osdc, CEPH_OSDMAP_FULL) ||                         
            ceph_osdmap_flag(osdc, CEPH_OSDMAP_PAUSERD) ||                      
            ceph_osdmap_flag(osdc, CEPH_OSDMAP_PAUSEWR)) {                      
                dout("%s osdc %p continuous\n", __func__, osdc);                
                continuous = true;                                              
        } else {                                                                
                dout("%s osdc %p onetime\n", __func__, osdc);                   
        }                                                                       

What I'm not sure of is why the client would have noticed the pause flags in the first place. I could see it subscribing continuously after noticing the flag the first time, though.

I'll bring my personal test rig up a little later and see what's going on.

#11 Updated by Jeff Layton 7 months ago

Can't quite reproduce this locally. I have a vstart cluster set up. Mounted it, and got the osdmap epoch. Then I ran "ceph osd set pause" and "ceph osd unset pause" on the vstart machine, and the epoch in the kclient debugfs file never changed from the original value of 6.

I'll try again in a bit. Could this be a result of a race somehow?

#12 Updated by Jeff Layton 7 months ago

Ok, I think I might sort of see. I turned the dout() calls in maybe_request_map into printks, and added a dump_stack() in there (so we could see how it was getting called).

[ 1906.336194] libceph: mon0 192.168.1.149:6789 session established
[ 1906.337751] libceph: client4133 fsid 58a37278-1ad0-48fe-8308-d0c9e41f3313
[ 2932.817579] maybe_request_map osdc ffff90566f305778 onetime
[ 2932.818928] CPU: 1 PID: 1277 Comm: kworker/1:1 Tainted: G           OE   4.11.0-rc3+ #25
[ 2932.821036] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-1.fc25 04/01/2014
[ 2932.823295] Workqueue: ceph-msgr ceph_con_workfn [libceph]
[ 2932.824483] Call Trace:
[ 2932.825074]  dump_stack+0x63/0x86
[ 2932.825853]  maybe_request_map+0x50/0xd0 [libceph]
[ 2932.826897]  ceph_osdc_update_epoch_barrier+0x5b/0x90 [libceph]
[ 2932.828242]  ceph_handle_caps+0x21c/0x1b10 [ceph]
[ 2932.829225]  ? release_sock+0x8c/0xa0
[ 2932.829999]  ? tcp_recvmsg+0x4c1/0xb40
[ 2932.830789]  ? inet_recvmsg+0x4b/0xc0
[ 2932.831570]  dispatch+0x1f7/0x1810 [ceph]
[ 2932.832451]  ceph_con_workfn+0x5fe/0x2f30 [libceph]
[ 2932.833610]  ? kvm_sched_clock_read+0x1e/0x30
[ 2932.834749]  ? __switch_to+0x227/0x410
[ 2932.835779]  process_one_work+0x197/0x450
[ 2932.836956]  worker_thread+0x4e/0x4a0
[ 2932.838024]  kthread+0x101/0x140
[ 2932.838885]  ? process_one_work+0x450/0x450
[ 2932.839801]  ? kthread_park+0x90/0x90
[ 2932.840702]  ret_from_fork+0x2c/0x40
[ 2995.960251] maybe_request_map osdc ffff90566f305778 continuous (flags=0x7800c)
[ 2995.961968] CPU: 1 PID: 1472 Comm: kworker/1:2 Tainted: G           OE   4.11.0-rc3+ #25
[ 2995.963501] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-1.fc25 04/01/2014
[ 2995.965574] Workqueue: ceph-msgr ceph_con_workfn [libceph]
[ 2995.966904] Call Trace:
[ 2995.967714]  dump_stack+0x63/0x86
[ 2995.968687]  maybe_request_map+0x50/0xd0 [libceph]
[ 2995.969840]  ceph_osdc_handle_map+0x33a/0x7a0 [libceph]
[ 2995.971056]  ? ceph_tcp_recvmsg+0x68/0x90 [libceph]
[ 2995.972212]  dispatch+0x31f/0x6e0 [libceph]
[ 2995.973405]  ? read_partial.isra.27+0x50/0x80 [libceph]
[ 2995.974619]  ceph_con_workfn+0x5fe/0x2f30 [libceph]
[ 2995.975846]  ? lock_timer_base+0x81/0xa0
[ 2995.976885]  ? update_curr+0x13f/0x1d0
[ 2995.977833]  ? account_entity_dequeue+0xab/0xe0
[ 2995.978909]  ? __switch_to+0x227/0x410
[ 2995.979859]  process_one_work+0x197/0x450
[ 2995.980807]  worker_thread+0x4e/0x4a0
[ 2995.981845]  kthread+0x101/0x140
[ 2995.982830]  ? process_one_work+0x450/0x450
[ 2995.983795]  ? kthread_park+0x90/0x90
[ 2995.984997]  ret_from_fork+0x2c/0x40
[ 2997.123260] maybe_request_map osdc ffff90566f305778 onetime
[ 2997.124835] CPU: 1 PID: 1472 Comm: kworker/1:2 Tainted: G           OE   4.11.0-rc3+ #25
[ 2997.126793] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-1.fc25 04/01/2014
[ 2997.128551] Workqueue: ceph-msgr ceph_con_workfn [libceph]
[ 2997.129780] Call Trace:
[ 2997.130471]  dump_stack+0x63/0x86
[ 2997.131349]  maybe_request_map+0x50/0xd0 [libceph]
[ 2997.132456]  ceph_osdc_handle_map+0x33a/0x7a0 [libceph]
[ 2997.133631]  ? ceph_tcp_recvmsg+0x68/0x90 [libceph]
[ 2997.134710]  dispatch+0x31f/0x6e0 [libceph]
[ 2997.135649]  ? read_partial.isra.27+0x50/0x80 [libceph]
[ 2997.136767]  ceph_con_workfn+0x5fe/0x2f30 [libceph]
[ 2997.137813]  ? lock_timer_base+0x81/0xa0
[ 2997.138699]  ? update_curr+0x13f/0x1d0
[ 2997.139569]  ? account_entity_dequeue+0xab/0xe0
[ 2997.140570]  ? __switch_to+0x227/0x410
[ 2997.141445]  process_one_work+0x197/0x450
[ 2997.142361]  worker_thread+0x4e/0x4a0
[ 2997.143247]  kthread+0x101/0x140
[ 2997.144041]  ? process_one_work+0x450/0x450
[ 2997.144948]  ? kthread_park+0x90/0x90
[ 2997.145773]  ret_from_fork+0x2c/0x40

So the first call occurs because when we open the file, we request some caps, and the cap message has an epoch barrier in it. We call maybe_request_map when we update the epoch barrier, and that always asks for the epoch that we currently have +1. Here's the current function (with my debugging code):

static void maybe_request_map(struct ceph_osd_client *osdc)
{
        bool continuous = false;

        verify_osdc_locked(osdc);
        WARN_ON(!osdc->osdmap->epoch);

        if (ceph_osdmap_flag(osdc, CEPH_OSDMAP_FULL) ||
            ceph_osdmap_flag(osdc, CEPH_OSDMAP_PAUSERD) ||
            ceph_osdmap_flag(osdc, CEPH_OSDMAP_PAUSEWR)) {
                printk("%s osdc %p continuous (flags=0x%x)\n", __func__, osdc,
                        osdc->osdmap->flags);
                continuous = true;
        } else {
                printk("%s osdc %p onetime\n", __func__, osdc);
        }
        dump_stack();

        if (ceph_monc_want_map(&osdc->client->monc, CEPH_SUB_OSDMAP,
                               osdc->osdmap->epoch + 1, continuous))
                ceph_monc_renew_subs(&osdc->client->monc);
}

So, we always request the next epoch, and so we get one as soon as the pause flags are set. Once we see a map with those flags set, we ask for a continuous sub. Then we see a map with them unset and make a one-shot request. This behavior has been there for a while and isn't changed by this patchset, fwiw.

So, I think the fundamental problem I have is that I'm not sure what the rules ought to be for requesting a map. Maybe if we nail down how it should work, I can go in and clean up the maybe_request_map calls to DTRT.

#13 Updated by Jeff Layton 7 months ago

Simple patch that should fix this case, I think:

diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index f187d36240b4..90f0c5a0478a 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -3337,7 +3337,9 @@ void ceph_osdc_update_epoch_barrier(struct ceph_osd_client *osdc, u32 eb)
                        dout("updating epoch_barrier from %u to %u\n",
                                        osdc->epoch_barrier, eb);
                        osdc->epoch_barrier = eb;
-                       maybe_request_map(osdc);
+                       /* Request map if we're not to the barrier yet */
+                       if (eb > osdc->osdmap->epoch)
+                               maybe_request_map(osdc);
                }
                up_write(&osdc->lock);
        } else {

I'll plan to push this to the testing branch and we can give it a go.

#14 Updated by Jeff Layton 7 months ago

Ok, with the latest kernel testing branch, and John's ceph wip-19550 branch for the testsuite, it gets a bit farther. The latest problem is here:

2017-04-13T16:14:19.145 INFO:teuthology.orchestra.run.smithi184.stderr:Traceback (most recent call last):
2017-04-13T16:14:19.145 INFO:teuthology.orchestra.run.smithi184.stderr:  File "<string>", line 50, in <module>
2017-04-13T16:14:19.145 INFO:teuthology.orchestra.run.smithi184.stderr:RuntimeError: close() failed to raise error

This test does some buffered writes after filling the filesystem and then a close, and expects to get an error back on the close() syscall.

kcephfs does not even have a "flush" file_operation, so it'll always return 0 on close. That could be added and it should fix this, but evidently some applications don't really like getting errors on close. It may be better to nerf this part of the test.

Thoughts?

#15 Updated by John Spray 7 months ago

What happens to dirty buffers when a file is closed in the kernel client? They just stick around in the background until they can be successfully flushed, and the client doesn't give up the buffered write caps?

Anyway, if flush errors don't bubble up to close() then we'll need to disable this test for the kernel client.

The next test (test_full_fsync) has similar expectations that errors from flushing buffers get exposed on calls to fsync; if the kernel client's behaviour there is different too, then we might need to disable that one as well.

#16 Updated by Jeff Layton 7 months ago

John Spray wrote:

What happens to dirty buffers when a file is closed in the kernel client? They just stick around in the background until they can be successfully flushed, and the client doesn't give up the buffered write caps?

That is a good question. Let's see...

The cached inode data is not flushed on closing the file description since there is no ->flush op. Each dirty page seems to hold a WR and possibly BUFFER cap ref. So, I think that just means the kernel will hold the cached data until the caps are revoked...but I don't quite see where cap references are put after buffered writeback.

Ahh, never mind, I see it now. So yeah, we'll just hold the data in cache until we have to release it due to memory pressure, kupdated-style flushing, or cap revocation.

Anyway, if flush errors don't bubble up to close() then we'll need to disable this test for the kernel client.

The next test (test_full_fsync) has similar expectations that errors from flushing buffers get exposed on calls to fsync; if the kernel client's behaviour there is different too, then we might need to disable that one as well.

fsync should be fine here. We expect to get writeback errors on fsync, and part of the work I'm currently doing upstream in the kernel is ensuring that you always get an error when this occurs.

POSIX is a bit ambiguous about errors on close, and the kernel is really inconsistent here too, which doesn't help. NFS (for instance) will return errors on close, and CIFS too, but most other filesystems don't. This is an area where we need more consistent behavior from filesystems, IMO, but that's a larger battle.

I'm fine with allowing kcephfs to return errors on close, but some applications will take those errors to mean that the close itself failed, rather than writeback failing. Given that NFS has done this for a long time though, I'm not too worried about it if we did want to enable it.

The only other concern is that in order to detect errors on close, you have to flush all of the inode data on any close. That can have performance implications.

#17 Updated by Jeff Layton 7 months ago

Ok, I added a flush operation to kcephfs which does a vfs_fsync when the file is open for write, and returns the error. The testsuite still failed, and I think this is the reason why:

            # Okay, now we're going to fill up the filesystem, and then keep    
            # writing until we see an error from fsync.  As long as we're doing 
            # buffered IO, the error should always only appear from fsync and not
            # from write                                                        
            full = False                                                        

            for n in range(0, {fill_mb} + 1):                                   
                try:                                                            
                    bytes += os.write(f, 'x' * 1024 * 1024)                     
                    print "wrote bytes via buffered write, moving on to fsync"  
                except OSError as e:                                            
                    print "Unexpected error %s from write() instead of fsync()" % e
                    raise                                                       

                try:                                                            
                    os.fsync(f)                                                 
                    print "fsync'ed successfully"                               
                except OSError as e:                                            
                    print "Reached fullness after %.2f MB" % (bytes / (1024.0 * 1024.0))
                    full = True                                                 
                    break                                                       
                else:                                                           
                    print "Not full yet after %.2f MB" % (bytes / (1024.0 * 1024.0))

                if n > {fill_mb} * 0.8:                                         
                    # Be cautious in the last region where we expect to hit     
                    # the full condition, so that we don't overshoot too dramatically
                    print "sleeping a bit as we've exceeded 80% of our expected full ratio" 
                    time.sleep({full_wait})                                     

            if not full:                                                        
                raise RuntimeError("Failed to reach fullness after writing %d bytes" % bytes)

            # The error sticks to the inode until we dispose of it              
            try:                                                                
                os.close(f)                                                     
            except OSError:                                                     
                print "Saw error from close() as expected"                      
            else:                                                               
                raise RuntimeError("Did not see expected error from close()")                                                     

The comment that says "the error sticks to the inode until we dispose of it" is entirely wrong. The current kernel code will clean out the error once it has been reported via fsync. I think we'll need to fix that piece of the testcase.

It may be best not to verify any sort of error reporting on close. We can always add that in later, and it's certainly not required for POSIX compliance.
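
For example, the tail of that script could be made tolerant of both behaviours, something along these lines (a hypothetical sketch, reusing the names from the excerpt above):

            # Hypothetical: accept either outcome, since the kernel client
            # reports the writeback error via fsync() and then clears it,
            # so close() may legitimately return 0 here.
            try:
                os.close(f)
                print "close() returned 0; error already reported via fsync()"
            except OSError:
                print "Saw error from close() as expected"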

#18 Updated by Jeff Layton 7 months ago

Ok, with this suite repo/branch and the code currently in testing, the failing test now passes:

https://github.com/jtlayton/ceph/tree/wip-19550

I think what we need to sort out at this point is whether we want close to return an error on writeback failures. Whatever we choose, I think we need to strive to make the kernel and userland clients work the same way. Perhaps we can discuss at the next standup.

#19 Updated by John Spray 7 months ago

As you say, exposing writeback errors in close() is not required. I like doing it from a pragmatic point of view, because it increases the chances that a userspace application will at least notice that something is wrong (the set of applications that check the retval of close() or call fsync() is hopefully bigger than the set of applications that call fsync() correctly).

In the kernel, are writeback errors being stored for exposing in fsync (and just by choice not being exposed in close()), or are they just unavailable without doing more work to remember them? I expect this is all somewhat tied up in the other changes you were making to avoid eating errors after they've been exposed to one fsync caller.

#20 Updated by Jeff Layton 7 months ago

John Spray wrote:

As you say, exposing writeback errors in close() is not required. I like doing it from a pragmatic point of view, because it increases the chances that a userspace application will at least notice that something is wrong (the set of applications that check the retval of close() or call fsync() is hopefully bigger than the set of applications that call fsync() correctly).

Agreed. I like returning it in close() too, but there was some pushback on this in the discussion on LWN:

https://lwn.net/Articles/718734/#Comments

...and Neil Brown brought up a good point that if you want to do this, then you need to basically do an fsync() on every close call. That sucks for something like a build, where you do open/write/close on a lot of small files.

For performance reasons it's good not to flush on every close since that allows you to batch together writes to low-level storage for multiple files (very helpful on a local block-based fs), and not have to synchronously wait on writeback before returning on the close call (which is a big deal on both local and network filesystems too).

In the kernel, are writeback errors being stored for exposing in fsync (and just by choice not being exposed in close()), or are they just unavailable without doing more work to remember them? I expect this is all somewhat tied up in the other changes you were making to avoid eating errors after they've been exposed to one fsync caller.

Yes, this is definitely tied up with the kernel changes I'm making. It's actually quite easy to return these errors at close. We just basically have the ->flush routine call vfs_fsync and return the error. Some filesystems already do this (NFS for instance), but it's a per-fs decision. For now, I've decided not to worry about changing any close() behavior if I can help it.

#21 Updated by John Spray 7 months ago

Interesting thread.

I think the "whether to sync on close" aspect is a bit of a red herring -- for local filesystems I can see how syncing on close is undesirable, but for network filesystems where we need to sync before another client can read, and where we are never combining IOs from multiple files, it doesn't make much difference to us. But mainly, we don't have to sync on close in order to expose errors in close(), if those errors have already happened.

If we don't sync on close, then whether someone sees the error out of close() is going to be probabilistic (i.e. whether a writeback happened soon enough), so I can see the purist argument for not returning the error at all rather than returning it sometimes. I just think that pragmatism should win out here -- it feels mean if we have a file we know is EIO, and we're returning 0 to the user on close(), which we know is probably the last time they will ever ask us about the state of that file.

#22 Updated by Jeff Layton 7 months ago

John Spray wrote:

Interesting thread.

I think the "whether to sync on close" aspect is a bit of a red herring -- for local filesystems I can see how syncing on close is undesirable, but for network filesystems where we need to sync before another client can read, and where we are never combining IOs from multiple files, it doesn't make much difference to us. But mainly, we don't have to sync on close in order to expose errors in close(), if those errors have already happened.

Well, you do still want to generally flush on close with a netfs. It's just that you don't necessarily need to wait for that flush to complete if you aren't worried about returning errors from it.

If we don't sync on close, then whether someone sees the error out of close() is going to be probabilistic (i.e. whether a writeback happened soon enough), so I can see the purist argument for not returning the error at all rather than returning it sometimes. I just think that pragmatism should win out here -- it feels mean if we have a file we know is EIO, and we're returning 0 to the user on close(), which we know is probably the last time they will ever ask us about the state of that file.

Well, under the principle that "more information is good", sure...but there's another principle at play here too: inconsistent API behavior is bad. If you want consistent behavior from your filesystem here, then you do need to flush.

If the errors have already happened, then sure, you can return an error, but that model's not exactly reliable as an indicator. Anyone who cares will need to fsync before close anyway.

#23 Updated by John Spray 7 months ago

Patch to update fuse behaviour to avoid the spurious repeated return of errors on fclose if they've already been exposed with fsync:
https://github.com/ceph/ceph/pull/14589 (http://tracker.ceph.com/issues/19640)

#24 Updated by John Spray 7 months ago

  • Status changed from Need Review to Resolved
