Bug #3681
closed
kclient fsx fails nightly
Description
teuthology-2012-12-23_03:00:04-regression-master-testing-gcov/23184 FAIL scheduled_teuthology@teuthology collection:kernel-cephfs clusters:fixed-3.yaml fs:btrfs.yaml tasks:kclient_workunit_suites_fsx.yaml 2122s
teuthology-2012-12-24_03:00:03-regression-master-testing-gcov/23997 FAIL scheduled_teuthology@teuthology collection:kernel-cephfs clusters:fixed-3.yaml fs:btrfs.yaml tasks:kclient_workunit_suites_fsx.yaml 3361s
teuthology-2012-12-25_03:00:02-regression-master-testing-gcov/27378 pass scheduled_teuthology@teuthology collection:kernel-cephfs clusters:fixed-3.yaml fs:btrfs.yaml tasks:kclient_workunit_suites_fsx.yaml 4321s
teuthology-2012-12-26_03:00:10-regression-master-testing-gcov/27759 pass scheduled_teuthology@teuthology collection:kernel-cephfs clusters:fixed-3.yaml fs:btrfs.yaml tasks:kclient_workunit_suites_fsx.yaml 3540s
Files
Updated by Sam Lang over 11 years ago
- File fsx-failure-syslog added
It's most likely all the same bug, but fsx fails in a different way each time (always because of a truncate down). The one I'm looking at does the following:
truncate down from 104576 -> 75711
Then it tries to verify the size with a getattr, but getattr returns 104576. The kernel log shows that:
- The setattr successfully sets the size:
Dec 31 10:09:58 plana23 kernel: [1218326.551021] ceph: setattr ffff88020ea01c60 size 104576 -> 75711
- Then on the trace reply:
Dec 31 10:10:07 plana23 kernel: [1218401.360063] ceph: size 104576 -> 75711
- But then during the getattr, the CAP_TRUNC message arrives:
Dec 31 10:10:12 plana23 kernel: [1218406.195224] ceph: handle_cap_trunc inode ffff88020ea01c60 mds0 seq 154 to 75711 seq 36
Dec 31 10:10:12 plana23 kernel: [1218406.254293] ceph: size 75711 -> 104576
Note that the second line sets i_size back to 104576, because the CAP_TRUNC message has truncate_size=75711 but size=104576.
So the problem seems to be coming from the MDS. But on inspecting the MDS, Locker::issue_truncate() is called after pop_and_dirty_projected_inode (which should correctly set the inode size to the value sent in the setattr). So it's still unclear how the MDS is sending back size=104576...
Attached is the client log.
Updated by Sam Lang over 11 years ago
- Status changed from 12 to 7
The race here is between a truncate down and the completion of OSD write ops, which triggers a cap flush. The exact order that triggers it is:
a) write increases size, updates inode->i_size to 104576
b) setattr to mds for truncate down size 75711
c) write to osds complete, cap flushed, cap update sends size 104576 to mds
d) setattr response, inode->i_size set to 75711
e) cap truncate message received with size 104576, sets inode->i_size to 104576
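The ordering in steps a) through e) can be replayed in a small toy model. This is only a sketch under stated assumptions: `ToyMds` and `replay_race` are hypothetical names, the MDS is reduced to a single size field, and the assumption that the cap truncate message reports whatever size the MDS holds at that moment is a guess that matches the log, not something confirmed from the MDS code.

```python
class ToyMds:
    """Hypothetical stand-in for the MDS; it only tracks one file size."""

    def __init__(self):
        self.size = 0

    def setattr_truncate(self, new_size):
        # Apply the truncate and return the size carried by the setattr reply.
        self.size = new_size
        return new_size

    def cap_update(self, client_size):
        # A cap flush carries the client's current view of the file size.
        self.size = client_size

    def cap_trunc_size(self):
        # Assumption: the cap truncate message reports the MDS's current size.
        return self.size


def replay_race(old_size=104576, new_size=75711):
    mds = ToyMds()
    i_size = old_size                            # a) write extends the file
    reply = mds.setattr_truncate(new_size)       # b) setattr reaches the MDS
    mds.cap_update(i_size)                       # c) write completes, cap flush sends old size
    i_size = reply                               # d) setattr response applied
    i_size = mds.cap_trunc_size()                # e) cap truncate message applied
    return i_size


if __name__ == "__main__":
    print(replay_race())
```

In this model the client ends with the pre-truncate size, which is the same symptom fsx trips over when it verifies the size with a getattr.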
Pushed a proposed fix to wip-3681, which is currently being tested. It addresses the case where the setattr sends a request to the MDS even though the client holds the exclusive cap, and ensures that i_size has the correct value. The same race still exists if the client doesn't hold the exclusive cap, so I'm opening a separate issue for the multi-client case.
Updated by Sam Lang over 11 years ago
Proposed fix to set i_size before the setattr request:
This will resolve the above issue, because the cap flush on write finish will send the truncated size to the MDS. It may not work for multi-client scenarios. Consider the following:
a) clientA write increases size, updates inode->i_size to 104576
b) clientB setattr to mds for truncate down size 75711 [mds size=75711]
c) clientA write to osds complete, cap flushed, cap update sends size 104576 to mds [mds size=104576]
d) clientB setattr response, inode->i_size set to 75711
e) clientB cap truncate message received with size 104576, sets inode->i_size to 104576
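To make the difference between the two cases concrete, here is a toy replay of both orderings with i_size set before the setattr request is sent. It is a sketch, not the kernel patch: `ToyMds`, `single_client_with_fix` and `two_clients_with_fix` are hypothetical names, and the model assumes the cap truncate message simply reports the size the MDS currently holds.

```python
class ToyMds:
    """Hypothetical stand-in for the MDS; it only tracks one file size."""

    def __init__(self):
        self.size = 0

    def setattr_truncate(self, new_size):
        # Apply the truncate and return the size carried by the setattr reply.
        self.size = new_size
        return new_size

    def cap_update(self, client_size):
        # A cap flush carries the flushing client's view of the file size.
        self.size = client_size

    def cap_trunc_size(self):
        # Assumption: the cap truncate message reports the MDS's current size.
        return self.size


def single_client_with_fix(old_size=104576, new_size=75711):
    mds = ToyMds()
    i_size = old_size                        # write extends the file
    i_size = new_size                        # fix: set i_size before sending setattr
    reply = mds.setattr_truncate(new_size)   # setattr reaches the MDS
    mds.cap_update(i_size)                   # cap flush now carries the truncated size
    i_size = reply                           # setattr response applied
    i_size = mds.cap_trunc_size()            # cap truncate message applied
    return i_size


def two_clients_with_fix(old_size=104576, new_size=75711):
    mds = ToyMds()
    a_size = old_size                        # a) clientA write extends the file
    b_size = new_size                        # clientB sets its own i_size first
    reply = mds.setattr_truncate(new_size)   # b) clientB setattr reaches the MDS
    mds.cap_update(a_size)                   # c) clientA cap flush still sends old size
    b_size = reply                           # d) clientB setattr response applied
    b_size = mds.cap_trunc_size()            # e) clientB cap truncate message applied
    return b_size


if __name__ == "__main__":
    print(single_client_with_fix(), two_clients_with_fix())
```

Under these assumptions the single-client replay ends at the truncated size, while clientB in the two-client replay still ends at clientA's larger size, because clientA's flush is not affected by a change to clientB's local i_size.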
Updated by Ian Colle about 11 years ago
- Status changed from 7 to 12
- Assignee deleted (Sam Lang)
Updated by Ian Colle about 11 years ago
- Assignee set to Sage Weil
Should review entire kernel locking around truncate.
Updated by Zheng Yan almost 11 years ago
I think this has already been fixed (it was a cap revoke bug in the MDS code). When handling a truncate request, the current MDS revokes write caps from clients.
Updated by Sage Weil over 10 years ago
- Status changed from 12 to 7
Added fsx back into the kcephfs test suite. Reportedly fsx now passes, but we should verify before closing this bug.