Bug #54067

closed

fs/maxentries.sh test fails with "2022-01-21T12:47:05.490 DEBUG:teuthology.orchestra.run:got remote process result: 124"

Added by Venky Shankar about 2 years ago. Updated almost 2 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
-
Target version:
% Done:

0%

Source:
Community (dev)
Tags:
Backport:
quincy, pacific
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

https://pulpito.ceph.com/vshankar-2022-01-21_07:36:21-fs-wip-vshankar-fscrypt-20220121-095846-testing-default-smithi/6631519/
https://pulpito.ceph.com/vshankar-2022-01-27_12:23:50-fs-wip-vshankar-fscrypt-20220121-095846-testing-default-smithi/6643330/

2022-01-21T12:47:05.490 DEBUG:teuthology.orchestra.run:got remote process result: 124
2022-01-21T12:47:05.492 INFO:tasks.workunit:Stopping ['fs/maxentries'] on client.0...
2022-01-21T12:47:05.493 DEBUG:teuthology.orchestra.run.smithi146:> sudo rm -rf -- /home/ubuntu/cephtest/workunits.list.client.0 /home/ubuntu/cephtest/clone.client.0
2022-01-21T12:47:06.536 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_75f1bcd237a8b6022305cd53ba87e106d45aa665/teuthology/run_tasks.py", line 91, in run_tasks
    manager = run_one_task(taskname, ctx=ctx, config=config)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_75f1bcd237a8b6022305cd53ba87e106d45aa665/teuthology/run_tasks.py", line 70, in run_one_task
    return task(**kwargs)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_deb627556d87622270d2432ff54074989085bcec/qa/tasks/workunit.py", line 135, in task
    coverage_and_limits=not config.get('no_coverage_and_limits', None))
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_75f1bcd237a8b6022305cd53ba87e106d45aa665/teuthology/parallel.py", line 84, in __exit__
    for result in self:
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_75f1bcd237a8b6022305cd53ba87e106d45aa665/teuthology/parallel.py", line 98, in __next__
    resurrect_traceback(result)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_75f1bcd237a8b6022305cd53ba87e106d45aa665/teuthology/parallel.py", line 30, in resurrect_traceback
    raise exc.exc_info[1]
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_75f1bcd237a8b6022305cd53ba87e106d45aa665/teuthology/parallel.py", line 23, in capture_traceback
    return func(*args, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_deb627556d87622270d2432ff54074989085bcec/qa/tasks/workunit.py", line 427, in _run_tests
    label="workunit test {workunit}".format(workunit=workunit)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_75f1bcd237a8b6022305cd53ba87e106d45aa665/teuthology/orchestra/remote.py", line 509, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_75f1bcd237a8b6022305cd53ba87e106d45aa665/teuthology/orchestra/run.py", line 455, in run
    r.wait()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_75f1bcd237a8b6022305cd53ba87e106d45aa665/teuthology/orchestra/run.py", line 161, in wait
    self._raise_for_status()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_75f1bcd237a8b6022305cd53ba87e106d45aa665/teuthology/orchestra/run.py", line 183, in _raise_for_status
    node=self.hostname, label=self.label
Actions #1

Updated by Venky Shankar about 2 years ago

  • Status changed from New to Triaged
  • Assignee set to Rishabh Dave
Actions #2

Updated by Jeff Layton about 2 years ago

I think status 124 means that it timed out, and the logs seem to jibe with that, as there was a long stretch of apparent inactivity before the test failed. Maybe there is something in one of the MDS logs from that time?
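
For reference, an exit status of 124 is what GNU coreutils' timeout returns when the command it wraps runs past its time limit, which matches the result above (teuthology presumably wraps the workunit in such a timeout). A quick local check:

# GNU timeout kills the wrapped command after the limit and exits with 124
timeout 2 sleep 10
echo $?   # prints 124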

Actions #3

Updated by Venky Shankar about 2 years ago

Jeff Layton wrote:

I think status 124 means that it timed out, and the logs seem to jibe with that, as there was a long stretch of apparent inactivity before the test failed. Maybe there is something in one of the MDS logs from that time?

Right - the test timed out and this happens consistently with fscrypt changes: https://shaman.ceph.com/builds/ceph/wip-vshankar-fscrypt-20220121-095846/

(FYI - the build has just the fscrypt changes)

Actions #4

Updated by Jeff Layton about 2 years ago

  • Assignee changed from Rishabh Dave to Jeff Layton

This line tells us the workunit:

2022-01-21T12:47:05.492 INFO:tasks.workunit:Stopping ['fs/maxentries'] on client.0...

...which AFAICT is just this shell script:

https://github.com/ceph/ceph/blob/master/qa/workunits/fs/maxentries/maxentries.sh

It looks like it hung during make_files():

2022-01-27T12:52:16.287 INFO:tasks.workunit.client.0.smithi065.stderr:+ ok_dir_max_entries=100      
2022-01-27T12:52:16.288 INFO:tasks.workunit.client.0.smithi065.stderr:+ fail_dir_max_entries=101    
2022-01-27T12:52:16.288 INFO:tasks.workunit.client.0.smithi065.stderr:+ make_files 100                     

So it creates 100 files and then expects the next 100 creates to fail. I wonder if this has more to do with async creates than anything in the fscrypt PR? I'll see if I can reproduce locally.
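
Roughly, the relevant part of the script does something like this (a simplified sketch, not the actual maxentries.sh; the mds_dir_max_entries knob and the exact failure check are assumptions based on the test's intent):

# sketch only: cap per-directory entries, then create files until the limit bites
ceph config set mds mds_dir_max_entries 100   # assumed config name, not copied from the script

mkdir -p testdir
for i in $(seq 1 100); do
    touch "testdir/file_$i" || exit 1         # the first 100 creates must succeed
done

if touch "testdir/file_101"; then             # a create beyond the limit should fail
    echo "ERROR: create beyond the limit unexpectedly succeeded"
    exit 1
fi

With async creates, the failing create may only surface its error later (e.g. on fsync), which would explain a hang if that error is never delivered to the waiting task.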

Actions #5

Updated by Jeff Layton about 2 years ago

  • Project changed from CephFS to Linux kernel client
  • Category deleted (Administration/Usability)

I think the problem is in the kclient. We're not waking up the tasks waiting on an async create properly when the MDS returns an error. I'm testing a patch now.

As to why you're only seeing this with the fscrypt PR, I'm not sure. It may be another case where earlier runs ran this test with -o wsync.
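
If async creates turn out to be the variable, one way to rule them out is to mount the kernel client with synchronous namespace operations; a rough example (the monitor address and key are placeholders):

# force synchronous dirops (no async creates) on the kernel client
sudo mount -t ceph <mon-addr>:/ /mnt/cephfs -o name=admin,secret=<key>,wsync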

Actions #6

Updated by Jeff Layton about 2 years ago

Ok, I have a kernel patch that should fix this, and I'll post it soon, but I think there is another issue (probably in the MDS). The maxentries test sets the max entries in a directory to 100 and then creates 101 files in it. For the last file, it expects to see an error when it fsyncs, but sometimes the MDS just returns success.

I have a pr_warn message that the kernel emits when the MDS returns an error, and it doesn't fire in this case. That leads me to believe that the MDS is not enforcing this limit correctly (the client is not really aware of the limit).
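
One way to watch for that client-side warning while rerunning the test (the exact pr_warn text depends on the local patch, so grep loosely):

# follow the kernel log during the run; look for ceph client messages rather than an exact string
sudo dmesg -wT | grep -i ceph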

Actions #8

Updated by Venky Shankar about 2 years ago

Jeff Layton wrote:

I think the problem is in the kclient. We're not waking up the tasks waiting on an async create properly when the MDS returns an error. I'm testing a patch now.

As to why you're only seeing this with the fscrypt PR, I'm not sure. It may be another case where earlier runs ran this test with -o wsync.

It was a rerun of this yaml matrix with master.

It's funny, BTW - now I cannot rerun this yaml matrix with master :/

Actions #9

Updated by Venky Shankar about 2 years ago

Used a test branch that does not have the fscrypt changes, with the testing kernel. Run link: https://pulpito.ceph.com/vshankar-2022-02-09_12:43:34-fs-wip-vshankar-testing-20220201-113815-testing-default-smithi/6671431/

Test passes.

yaml matrix:

fs/functional/{begin clusters/1a3s-mds-4c-client conf/{client mds mon osd} distro/{rhel_8} mount/kclient/{mount-syntax/{v2} mount overrides/{distro/testing/{flavor/ubuntu_latest k-testing} ms-die-on-skipped}} objectstore/bluestore-bitmap overrides/{no_client_pidfile whitelist_health whitelist_wrongly_marked_down} tasks/workunit/dir-max-entries}

Actions #10

Updated by Jeff Layton about 2 years ago

Strange. Venky, can you rerun the test with the current testing kernel and the fscrypt changes? I'd like to confirm that it also passes now.

Actions #11

Updated by Venky Shankar about 2 years ago

Jeff was able to hit this without fscrypt changes.

I reran fs:functional with a freshly built fscrypt branch - https://pulpito.ceph.com/vshankar-2022-02-11_14:08:00-fs:functional-wip-vshankar-fscrypt-20220211-150029-testing-default-smithi/

Actions #12

Updated by Jeff Layton about 2 years ago

  • Status changed from Triaged to Fix Under Review
Actions #13

Updated by Jeff Layton almost 2 years ago

  • Status changed from Fix Under Review to Resolved

I haven't seen this with the latest kernels so I think this is now resolved. I had at least one bug in the fscrypt series that may have been a candidate for this, and there were a number of other fixes that may also have been candidates.
