Bug #54067

closed

fs/maxentries.sh test fails with "2022-01-21T12:47:05.490 DEBUG:teuthology.orchestra.run:got remote process result: 124"

Added by Venky Shankar about 2 years ago. Updated almost 2 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
-
Target version:
% Done:

0%

Source:
Community (dev)
Tags:
Backport:
quincy, pacific
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

https://pulpito.ceph.com/vshankar-2022-01-21_07:36:21-fs-wip-vshankar-fscrypt-20220121-095846-testing-default-smithi/6631519/
https://pulpito.ceph.com/vshankar-2022-01-27_12:23:50-fs-wip-vshankar-fscrypt-20220121-095846-testing-default-smithi/6643330/

2022-01-21T12:47:05.490 DEBUG:teuthology.orchestra.run:got remote process result: 124
2022-01-21T12:47:05.492 INFO:tasks.workunit:Stopping ['fs/maxentries'] on client.0...
2022-01-21T12:47:05.493 DEBUG:teuthology.orchestra.run.smithi146:> sudo rm -rf -- /home/ubuntu/cephtest/workunits.list.client.0 /home/ubuntu/cephtest/clone.client.0
2022-01-21T12:47:06.536 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_75f1bcd237a8b6022305cd53ba87e106d45aa665/teuthology/run_tasks.py", line 91, in run_tasks
    manager = run_one_task(taskname, ctx=ctx, config=config)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_75f1bcd237a8b6022305cd53ba87e106d45aa665/teuthology/run_tasks.py", line 70, in run_one_task
    return task(**kwargs)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_deb627556d87622270d2432ff54074989085bcec/qa/tasks/workunit.py", line 135, in task
    coverage_and_limits=not config.get('no_coverage_and_limits', None))
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_75f1bcd237a8b6022305cd53ba87e106d45aa665/teuthology/parallel.py", line 84, in __exit__
    for result in self:
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_75f1bcd237a8b6022305cd53ba87e106d45aa665/teuthology/parallel.py", line 98, in __next__
    resurrect_traceback(result)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_75f1bcd237a8b6022305cd53ba87e106d45aa665/teuthology/parallel.py", line 30, in resurrect_traceback
    raise exc.exc_info[1]
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_75f1bcd237a8b6022305cd53ba87e106d45aa665/teuthology/parallel.py", line 23, in capture_traceback
    return func(*args, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_deb627556d87622270d2432ff54074989085bcec/qa/tasks/workunit.py", line 427, in _run_tests
    label="workunit test {workunit}".format(workunit=workunit)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_75f1bcd237a8b6022305cd53ba87e106d45aa665/teuthology/orchestra/remote.py", line 509, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_75f1bcd237a8b6022305cd53ba87e106d45aa665/teuthology/orchestra/run.py", line 455, in run
    r.wait()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_75f1bcd237a8b6022305cd53ba87e106d45aa665/teuthology/orchestra/run.py", line 161, in wait
    self._raise_for_status()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_75f1bcd237a8b6022305cd53ba87e106d45aa665/teuthology/orchestra/run.py", line 183, in _raise_for_status
    node=self.hostname, label=self.label
Actions #1

Updated by Venky Shankar about 2 years ago

  • Status changed from New to Triaged
  • Assignee set to Rishabh Dave
Actions #2

Updated by Jeff Layton about 2 years ago

I think status 124 means that it timed out, and the logs seem to jibe with that, as there was a long stretch of apparent inactivity before the test failed. Maybe there is something in one of the MDS logs from that time?
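
For reference, an exit status of 124 is what GNU coreutils' timeout returns when the command it wraps runs past its time limit, which matches the result above (teuthology presumably wraps the workunit in such a timeout). A quick local check:

# GNU timeout kills the wrapped command after the limit and exits with 124
timeout 2 sleep 10
echo $?   # prints 124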

Actions #3

Updated by Venky Shankar about 2 years ago

Jeff Layton wrote:

I think status 124 means that it timed out, and the logs seem to jibe with that, as there was a long stretch of apparent inactivity before the test failed. Maybe there is something in one of the MDS logs from that time?

Right - the test timed out and this happens consistently with fscrypt changes: https://shaman.ceph.com/builds/ceph/wip-vshankar-fscrypt-20220121-095846/

(FYI - the build has just the fscrypt changes)

Actions #4

Updated by Jeff Layton about 2 years ago

  • Assignee changed from Rishabh Dave to Jeff Layton

This line tells us the workunit:

2022-01-21T12:47:05.492 INFO:tasks.workunit:Stopping ['fs/maxentries'] on client.0...

...which AFAICT is just this shell script:

https://github.com/ceph/ceph/blob/master/qa/workunits/fs/maxentries/maxentries.sh

It looks like it hung during make_files():

2022-01-27T12:52:16.287 INFO:tasks.workunit.client.0.smithi065.stderr:+ ok_dir_max_entries=100      
2022-01-27T12:52:16.288 INFO:tasks.workunit.client.0.smithi065.stderr:+ fail_dir_max_entries=101    
2022-01-27T12:52:16.288 INFO:tasks.workunit.client.0.smithi065.stderr:+ make_files 100                     

So it creates 100 files and then expects the next 100 creates to fail. I wonder if this has more to do with async creates than anything in the fscrypt PR? I'll see if I can reproduce locally.
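
Roughly, the relevant part of the script does something like this (a simplified sketch, not the actual maxentries.sh; the mds_dir_max_entries knob and the exact failure check are assumptions based on the test's intent):

# sketch only: cap per-directory entries, then create files until the limit bites
ceph config set mds mds_dir_max_entries 100   # assumed config name, not copied from the script

mkdir -p testdir
for i in $(seq 1 100); do
    touch "testdir/file_$i" || exit 1         # the first 100 creates must succeed
done

if touch "testdir/file_101"; then             # a create beyond the limit should fail
    echo "ERROR: create beyond the limit unexpectedly succeeded"
    exit 1
fi

With async creates, the failing create may only surface its error later (e.g. on fsync), which would explain a hang if that error is never delivered to the waiting task.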

Actions #5

Updated by Jeff Layton about 2 years ago

  • Project changed from CephFS to Linux kernel client
  • Category deleted (Administration/Usability)

I think the problem is in the kclient. We're not waking up the tasks waiting on an async create properly when the MDS returns an error. I'm testing a patch now.

As to why you're only seeing this with the fscrypt PR, I'm not sure. It may be another case where earlier runs ran this test with -o wsync.
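
If async creates turn out to be the variable, one way to rule them out is to mount the kernel client with synchronous namespace operations; a rough example (the monitor address and key are placeholders):

# force synchronous dirops (no async creates) on the kernel client
sudo mount -t ceph <mon-addr>:/ /mnt/cephfs -o name=admin,secret=<key>,wsync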

Actions #6

Updated by Jeff Layton about 2 years ago

Ok, I have a kernel patch that should fix this, and I'll post it soon, but I think there is another issue (probably in the MDS). The maxentries test sets the max entries in a directory to 100 and then creates 101 files in it. For the last file, it expects to see an error when it fsyncs, but sometimes the MDS just returns success.

I have a pr_warn message that the kernel emits when the MDS returns an error, and it doesn't fire in this case. That leads me to believe that the MDS is not enforcing this limit correctly (the client is not really aware of the limit).
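
One way to watch for that client-side warning while rerunning the test (the exact pr_warn text depends on the local patch, so grep loosely):

# follow the kernel log during the run; look for ceph client messages rather than an exact string
sudo dmesg -wT | grep -i ceph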

Actions #8

Updated by Venky Shankar about 2 years ago

Jeff Layton wrote:

I think the problem is in the kclient. We're not waking up the tasks waiting on an async create properly when the MDS returns an error. I'm testing a patch now.

As to why you're only seeing this with the fscrypt PR, I'm not sure. It may be another case where earlier runs ran this test with -o wsync.

It was a rerun of this yaml matrix with master.

It's funny, BTW - now I cannot rerun this yaml matrix with master :/

Actions #9

Updated by Venky Shankar about 2 years ago

Used a test branch that does not have the fscrypt changes, with the testing kernel. Run link: https://pulpito.ceph.com/vshankar-2022-02-09_12:43:34-fs-wip-vshankar-testing-20220201-113815-testing-default-smithi/6671431/

Test passes.

yaml matrix:

fs/functional/{begin clusters/1a3s-mds-4c-client conf/{client mds mon osd} distro/{rhel_8} mount/kclient/{mount-syntax/{v2} mount overrides/{distro/testing/{flavor/ubuntu_latest k-testing} ms-die-on-skipped}} objectstore/bluestore-bitmap overrides/{no_client_pidfile whitelist_health whitelist_wrongly_marked_down} tasks/workunit/dir-max-entries}

Actions #10

Updated by Jeff Layton about 2 years ago

Strange. Venky, can you rerun the test with the current testing kernel and the fscrypt changes? I'd like to confirm that it also passes now.

Actions #11

Updated by Venky Shankar about 2 years ago

Jeff was able to hit this without fscrypt changes.

I reran fs:functional with a freshly built fscrypt branch - https://pulpito.ceph.com/vshankar-2022-02-11_14:08:00-fs:functional-wip-vshankar-fscrypt-20220211-150029-testing-default-smithi/

Actions #12

Updated by Jeff Layton about 2 years ago

  • Status changed from Triaged to Fix Under Review
Actions #13

Updated by Jeff Layton almost 2 years ago

  • Status changed from Fix Under Review to Resolved

I haven't seen this with the latest kernels so I think this is now resolved. I had at least one bug in the fscrypt series that may have been a candidate for this, and there were a number of other fixes that may also have been candidates.
