Bug #54067
closedfs/maxentries.sh test fails with "2022-01-21T12:47:05.490 DEBUG:teuthology.orchestra.run:got remote process result: 124"
Description
https://pulpito.ceph.com/vshankar-2022-01-21_07:36:21-fs-wip-vshankar-fscrypt-20220121-095846-testing-default-smithi/6631519/
https://pulpito.ceph.com/vshankar-2022-01-27_12:23:50-fs-wip-vshankar-fscrypt-20220121-095846-testing-default-smithi/6643330/
2022-01-21T12:47:05.490 DEBUG:teuthology.orchestra.run:got remote process result: 124
2022-01-21T12:47:05.492 INFO:tasks.workunit:Stopping ['fs/maxentries'] on client.0...
2022-01-21T12:47:05.493 DEBUG:teuthology.orchestra.run.smithi146:> sudo rm -rf -- /home/ubuntu/cephtest/workunits.list.client.0 /home/ubuntu/cephtest/clone.client.0
2022-01-21T12:47:06.536 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_75f1bcd237a8b6022305cd53ba87e106d45aa665/teuthology/run_tasks.py", line 91, in run_tasks
    manager = run_one_task(taskname, ctx=ctx, config=config)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_75f1bcd237a8b6022305cd53ba87e106d45aa665/teuthology/run_tasks.py", line 70, in run_one_task
    return task(**kwargs)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_deb627556d87622270d2432ff54074989085bcec/qa/tasks/workunit.py", line 135, in task
    coverage_and_limits=not config.get('no_coverage_and_limits', None))
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_75f1bcd237a8b6022305cd53ba87e106d45aa665/teuthology/parallel.py", line 84, in __exit__
    for result in self:
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_75f1bcd237a8b6022305cd53ba87e106d45aa665/teuthology/parallel.py", line 98, in __next__
    resurrect_traceback(result)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_75f1bcd237a8b6022305cd53ba87e106d45aa665/teuthology/parallel.py", line 30, in resurrect_traceback
    raise exc.exc_info[1]
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_75f1bcd237a8b6022305cd53ba87e106d45aa665/teuthology/parallel.py", line 23, in capture_traceback
    return func(*args, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_deb627556d87622270d2432ff54074989085bcec/qa/tasks/workunit.py", line 427, in _run_tests
    label="workunit test {workunit}".format(workunit=workunit)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_75f1bcd237a8b6022305cd53ba87e106d45aa665/teuthology/orchestra/remote.py", line 509, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_75f1bcd237a8b6022305cd53ba87e106d45aa665/teuthology/orchestra/run.py", line 455, in run
    r.wait()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_75f1bcd237a8b6022305cd53ba87e106d45aa665/teuthology/orchestra/run.py", line 161, in wait
    self._raise_for_status()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_75f1bcd237a8b6022305cd53ba87e106d45aa665/teuthology/orchestra/run.py", line 183, in _raise_for_status
    node=self.hostname, label=self.label
Updated by Venky Shankar over 2 years ago
- Status changed from New to Triaged
- Assignee set to Rishabh Dave
Updated by Jeff Layton over 2 years ago
I think status 124 means that the test timed out, and the logs seem to jibe with that: there was a long stretch of apparent inactivity before the test failed. Maybe there is something in one of the MDS logs from that time?
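For context, exit status 124 is what GNU coreutils `timeout` returns when it kills a command for exceeding its time limit (teuthology wraps workunit commands with `timeout`, which is why a hung test surfaces this way). A quick check, independent of ceph:

```shell
# GNU timeout sends SIGTERM after the limit and exits with status 124
timeout 1 sleep 5
echo "exit status: $?"
```

Running this prints `exit status: 124` after about one second, matching the "got remote process result: 124" line in the teuthology log.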
Updated by Venky Shankar over 2 years ago
Jeff Layton wrote:
I think status 124 means that the test timed out, and the logs seem to jibe with that: there was a long stretch of apparent inactivity before the test failed. Maybe there is something in one of the MDS logs from that time?
Right - the test timed out and this happens consistently with fscrypt changes: https://shaman.ceph.com/builds/ceph/wip-vshankar-fscrypt-20220121-095846/
(FYI - the build has just the fscrypt changes)
Updated by Jeff Layton about 2 years ago
- Assignee changed from Rishabh Dave to Jeff Layton
This line tells us the workunit:
2022-01-21T12:47:05.492 INFO:tasks.workunit:Stopping ['fs/maxentries'] on client.0...
...which AFAICT is just this shell script:
https://github.com/ceph/ceph/blob/master/qa/workunits/fs/maxentries/maxentries.sh
It looks like it hung during make_files():
2022-01-27T12:52:16.287 INFO:tasks.workunit.client.0.smithi065.stderr:+ ok_dir_max_entries=100
2022-01-27T12:52:16.288 INFO:tasks.workunit.client.0.smithi065.stderr:+ fail_dir_max_entries=101
2022-01-27T12:52:16.288 INFO:tasks.workunit.client.0.smithi065.stderr:+ make_files 100
So it creates 100 files and then expects the next 100 creates to fail. I wonder if this has more to do with async creates than with anything in the fscrypt PR? I'll see if I can reproduce it locally.
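The overall shape of such a workunit can be sketched as below. This is an illustrative stand-in, not the actual maxentries.sh: the real script drives a CephFS directory-entry limit, whereas here the "must fail" create is simulated with a missing path component so the sketch runs anywhere. The `expect_false` helper mirrors the pattern ceph qa workunits commonly use for expected-failure steps.

```shell
#!/bin/sh
set -e

# expected-failure helper: succeed only if the wrapped command fails
expect_false() { if "$@"; then return 1; else return 0; fi; }

dir=$(mktemp -d)

# phase 1: creates up to the limit must all succeed
for i in $(seq 1 100); do
    touch "$dir/file_$i"
done
[ "$(ls "$dir" | wc -l)" -eq 100 ]

# phase 2: a create beyond the limit must fail; simulated here with a
# nonexistent path component instead of a real directory-entry limit
expect_false touch "$dir/missing/file_101" 2>/dev/null

rm -rf "$dir"
echo OK
```

The hang discussed in this ticket corresponds to phase 1 stalling: if a create never returns (or an fsync waiter is never woken), the script sits idle until teuthology's `timeout` kills it with status 124.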
Updated by Jeff Layton about 2 years ago
- Project changed from CephFS to Linux kernel client
- Category deleted (Administration/Usability)
I think the problem is in the kclient. We're not waking up the tasks waiting on an async create properly when the MDS returns an error. I'm testing a patch now.
As to why you're only seeing this with the fscrypt PR, I'm not sure. It may be another case where earlier runs ran this test with -o wsync.
Updated by Jeff Layton about 2 years ago
Ok, I have a kernel patch that I'll post soon that should fix this, but I think there is another issue (probably in the MDS). The maxentries test sets the max entries in a directory to 100 and then creates 101 files in it. With the last file, it expects to see an error when it fsyncs, but sometimes the MDS just returns success.
I have a pr_warn message that the kernel pops when the MDS returns an error, and it doesn't happen in this case. That leads me to believe that the MDS is not enforcing this limit correctly (the client is not really aware of it).
Updated by Venky Shankar about 2 years ago
Jeff Layton wrote:
I think the problem is in the kclient. We're not waking up the tasks waiting on an async create properly when the MDS returns an error. I'm testing a patch now.
As to why you're only seeing this with the fscrypt PR, I'm not sure. It may be another case where earlier runs ran this test with -o wsync.
It was a rerun of this yaml matrix with master.
Funnily enough, I now cannot reproduce this by rerunning the yaml matrix with master :/
Updated by Venky Shankar about 2 years ago
Used a test branch that does not have the fscrypt changes, with the testing kernel. Run link: https://pulpito.ceph.com/vshankar-2022-02-09_12:43:34-fs-wip-vshankar-testing-20220201-113815-testing-default-smithi/6671431/
Test passes.
yaml matrix:
fs/functional/{begin clusters/1a3s-mds-4c-client conf/{client mds mon osd} distro/{rhel_8} mount/kclient/{mount-syntax/{v2} mount overrides/{distro/testing/{flavor/ubuntu_latest k-testing} ms-die-on-skipped}} objectstore/bluestore-bitmap overrides/{no_client_pidfile whitelist_health whitelist_wrongly_marked_down} tasks/workunit/dir-max-entries}
Updated by Jeff Layton about 2 years ago
Strange. Venky, can you rerun the test with the current testing kernel and the fscrypt changes? I'd like to confirm that it also passes now.
Updated by Venky Shankar about 2 years ago
Jeff was able to hit this without the fscrypt changes.
I reran fs:functional with a freshly built fscrypt branch: https://pulpito.ceph.com/vshankar-2022-02-11_14:08:00-fs:functional-wip-vshankar-fscrypt-20220211-150029-testing-default-smithi/
Updated by Jeff Layton about 2 years ago
- Status changed from Triaged to Fix Under Review
Updated by Jeff Layton almost 2 years ago
- Status changed from Fix Under Review to Resolved
I haven't seen this with the latest kernels, so I think this is now resolved. I had at least one bug in the fscrypt series that may have been a candidate for this, and a number of other fixes may also have been candidates.