Bug #46046

Test failure: test_create_multiple_exports (tasks.cephfs.test_nfs.TestNFS)

Added by Kefu Chai 23 days ago. Updated 16 days ago.

Status:
Pending Backport
Priority:
Normal
Assignee:
Category:
-
Target version:
% Done:

0%

Source:
Development
Tags:
Backport:
octopus
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature:

Description

rados/cephadm/workunits/{distro/ubuntu_18.04_podman.yaml task/test_orch_cli.yaml}

2020-06-17T10:57:38.033 INFO:tasks.cephfs_test_runner:Starting test: test_create_multiple_exports (tasks.cephfs.test_nfs.TestNFS)
...
2020-06-17T10:58:01.463 INFO:teuthology.orchestra.run.smithi125.stderr:Error EINVAL: Failed to delete exports: {err} and {ret}
2020-06-17T10:58:01.467 DEBUG:teuthology.orchestra.run:got remote process result: 22
2020-06-17T10:58:01.468 INFO:teuthology.orchestra.run.smithi125:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph --log-early log 'Ended test tasks.cephfs.test_nfs.TestNFS.test_create_multiple_exports'
2020-06-17T10:58:01.791 INFO:ceph.mgr.a.smithi125.stdout:Jun 17 10:58:01 smithi125 bash[33685]: debug 2020-06-17T10:58:01.464+0000 7f39d9d80700 -1 mgr.server reply reply (22) Invalid argument Failed to delete exports: {err} and {ret}
2020-06-17T10:58:01.930 INFO:tasks.cephfs_test_runner:test_create_multiple_exports (tasks.cephfs.test_nfs.TestNFS) ... ERROR
2020-06-17T10:58:01.931 INFO:tasks.cephfs_test_runner:
2020-06-17T10:58:01.931 INFO:tasks.cephfs_test_runner:======================================================================
2020-06-17T10:58:01.931 INFO:tasks.cephfs_test_runner:ERROR: test_create_multiple_exports (tasks.cephfs.test_nfs.TestNFS)
2020-06-17T10:58:01.932 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2020-06-17T10:58:01.932 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2020-06-17T10:58:01.932 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_ceph_ceph_master/qa/tasks/cephfs/test_nfs.py", line 119, in test_create_multiple_exports
2020-06-17T10:58:01.933 INFO:tasks.cephfs_test_runner:    self._test_delete_cluster()
2020-06-17T10:58:01.934 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_ceph_ceph_master/qa/tasks/cephfs/test_nfs.py", line 66, in _test_delete_cluster
2020-06-17T10:58:01.934 INFO:tasks.cephfs_test_runner:    self._nfs_cmd('cluster', 'delete', self.cluster_id)
2020-06-17T10:58:01.934 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_ceph_ceph_master/qa/tasks/cephfs/test_nfs.py", line 19, in _nfs_cmd
2020-06-17T10:58:01.935 INFO:tasks.cephfs_test_runner:    return self._cmd("nfs", *args)
2020-06-17T10:58:01.935 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_ceph_ceph_master/qa/tasks/cephfs/test_nfs.py", line 16, in _cmd
2020-06-17T10:58:01.935 INFO:tasks.cephfs_test_runner:    return self.mgr_cluster.mon_manager.raw_cluster_cmd(*args)
2020-06-17T10:58:01.936 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_ceph_ceph_master/qa/tasks/ceph_manager.py", line 1363, in raw_cluster_cmd
2020-06-17T10:58:01.936 INFO:tasks.cephfs_test_runner:    stdout=BytesIO(),
2020-06-17T10:58:01.936 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/remote.py", line 206, in run
2020-06-17T10:58:01.936 INFO:tasks.cephfs_test_runner:    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
2020-06-17T10:58:01.937 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 475, in run
2020-06-17T10:58:01.937 INFO:tasks.cephfs_test_runner:    r.wait()
2020-06-17T10:58:01.937 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 162, in wait
2020-06-17T10:58:01.938 INFO:tasks.cephfs_test_runner:    self._raise_for_status()
2020-06-17T10:58:01.938 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 184, in _raise_for_status
2020-06-17T10:58:01.938 INFO:tasks.cephfs_test_runner:    node=self.hostname, label=self.label
2020-06-17T10:58:01.939 INFO:tasks.cephfs_test_runner:teuthology.exceptions.CommandFailedError: Command failed on smithi125 with status 22: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph --log-early nfs cluster delete test'
/a/kchai-2020-06-17_10:32:37-rados-master-distro-basic-smithi/5157608
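A side note on the error text itself: the literal `{err} and {ret}` in the reply above suggests the mgr built the message with a plain string where an f-string was intended (an inference from the log alone; the run in comment #2, on a different branch, shows the interpolated form). A minimal illustration of the difference:

```python
# Illustration only: how a missing f-prefix yields the literal "{err} and {ret}"
# seen in the mgr reply, versus the interpolated form from comment #2.
err = "[errno 110] RADOS timed out"
ret = -22

plain = "Failed to delete exports: {err} and {ret}"          # no f-prefix: braces survive
interpolated = f"Failed to delete exports: {err} and {ret}"  # f-string: values substituted

print(plain)
print(interpolated)
```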

Related issues

Copied to fs - Backport #46155: octopus: Test failure: test_create_multiple_exports (tasks.cephfs.test_nfs.TestNFS) Resolved

History

#1 Updated by Kefu Chai 23 days ago

  • Assignee set to Varsha Rao

Varsha, could you take a look?

#2 Updated by Kefu Chai 23 days ago

2020-06-17T09:00:43.188 INFO:teuthology.orchestra.run.smithi167.stderr:Error EINVAL: Failed to delete exports: [errno 110] RADOS timed out (Failed to notify b'conf-nfs.ganesha-test') and -22
2020-06-17T09:00:43.191 DEBUG:teuthology.orchestra.run:got remote process result: 22
2020-06-17T09:00:43.192 INFO:teuthology.orchestra.run.smithi167:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph --log-early log 'Ended test tasks.cephfs.test_nfs.TestNFS.test_create_multiple_exports'
2020-06-17T09:00:43.532 INFO:ceph.mgr.a.smithi167.stdout:Jun 17 09:00:43 smithi167 bash[33738]: debug 2020-06-17T09:00:43.182+0000 7fc6fb8fa700 -1 mgr.server reply reply (22) Invalid argument Failed to delete exports: [errno 110] RADOS timed out (Failed to notify b'conf-nfs.ganesha-test') and -22
2020-06-17T09:00:43.929 INFO:tasks.cephfs_test_runner:test_create_multiple_exports (tasks.cephfs.test_nfs.TestNFS) ... ERROR
2020-06-17T09:00:43.930 INFO:tasks.cephfs_test_runner:
2020-06-17T09:00:43.930 INFO:tasks.cephfs_test_runner:======================================================================
2020-06-17T09:00:43.931 INFO:tasks.cephfs_test_runner:ERROR: test_create_multiple_exports (tasks.cephfs.test_nfs.TestNFS)
2020-06-17T09:00:43.931 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2020-06-17T09:00:43.931 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2020-06-17T09:00:43.931 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-kefu-testing-2020-06-17-1349/qa/tasks/cephfs/test_nfs.py", line 119, in test_create_multiple_exports
2020-06-17T09:00:43.932 INFO:tasks.cephfs_test_runner:    self._test_delete_cluster()
2020-06-17T09:00:43.932 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-kefu-testing-2020-06-17-1349/qa/tasks/cephfs/test_nfs.py", line 66, in _test_delete_cluster
2020-06-17T09:00:43.932 INFO:tasks.cephfs_test_runner:    self._nfs_cmd('cluster', 'delete', self.cluster_id)
2020-06-17T09:00:43.933 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-kefu-testing-2020-06-17-1349/qa/tasks/cephfs/test_nfs.py", line 19, in _nfs_cmd
2020-06-17T09:00:43.933 INFO:tasks.cephfs_test_runner:    return self._cmd("nfs", *args)
2020-06-17T09:00:43.933 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-kefu-testing-2020-06-17-1349/qa/tasks/cephfs/test_nfs.py", line 16, in _cmd
2020-06-17T09:00:43.933 INFO:tasks.cephfs_test_runner:    return self.mgr_cluster.mon_manager.raw_cluster_cmd(*args)
2020-06-17T09:00:43.934 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-kefu-testing-2020-06-17-1349/qa/tasks/ceph_manager.py", line 1363, in raw_cluster_cmd
2020-06-17T09:00:43.934 INFO:tasks.cephfs_test_runner:    stdout=BytesIO(),
2020-06-17T09:00:43.934 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/remote.py", line 206, in run
2020-06-17T09:00:43.934 INFO:tasks.cephfs_test_runner:    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
2020-06-17T09:00:43.935 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 475, in run
2020-06-17T09:00:43.935 INFO:tasks.cephfs_test_runner:    r.wait()
2020-06-17T09:00:43.936 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 162, in wait
2020-06-17T09:00:43.936 INFO:tasks.cephfs_test_runner:    self._raise_for_status()
2020-06-17T09:00:43.936 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 184, in _raise_for_status
2020-06-17T09:00:43.936 INFO:tasks.cephfs_test_runner:    node=self.hostname, label=self.label
2020-06-17T09:00:43.937 INFO:tasks.cephfs_test_runner:teuthology.exceptions.CommandFailedError: Command failed on smithi167 with status 22: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph --log-early nfs cluster delete test'

/a/kchai-2020-06-17_08:41:50-rados-wip-kefu-testing-2020-06-17-1349-distro-basic-smithi/5157584

#4 Updated by Varsha Rao 22 days ago

mgr log

2020-06-17T09:00:43.046+0000 7fc7177fa700  1 -- 172.21.15.167:0/244649789 --> [v2:172.21.15.167:3300/0,v1:172.21.15.167:6789/0] -- log(1 entries from seq 76 at 2020-06-17T09:00:42.333096+0000) v1 -- 0x559d5f604e00 con 0x559d5eb70000
2020-06-17T09:00:43.182+0000 7fc71d806700  1 --2- 172.21.15.167:0/47090163 >> [v2:172.21.15.167:6810/539135740,v1:172.21.15.167:6811/539135740] conn(0x559d5ede4400 0x559d5f123200 secure :-1 s=THROTTLE_DONE pgs=10 cs=0 l=1 rx=0x559d5f1533b0 tx=0x559d5ec795e0).handle_read_frame_epilogue_main read frame epilogue bytes=32
2020-06-17T09:00:43.182+0000 7fc71d806700  1 -- 172.21.15.167:0/47090163 <== osd.1 v2:172.21.15.167:6810/539135740 33 ==== watch-notify(notify_complete (2) cookie 94134390019584 notify 133143986182 ret -110) v3 ==== 42+0+24 (secure 0 0 0) 0x559d5f810000 con 0x559d5ede4400
2020-06-17T09:00:43.182+0000 7fc6fb8fa700  0 [volumes WARNING volumes.fs.nfs] Failed to delete exports
2020-06-17T09:00:43.182+0000 7fc6fb8fa700  0 [volumes WARNING volumes.fs.nfs] Failed to delete NFS Cluster
2020-06-17T09:00:43.182+0000 7fc6fb8fa700 20 mgr ~Gil Destroying new thread state 0x559d5f512000
2020-06-17T09:00:43.182+0000 7fc6fb8fa700 -1 mgr.server reply reply (22) Invalid argument Failed to delete exports: [errno 110] RADOS timed out (Failed to notify b'conf-nfs.ganesha-test') and -22
2020-06-17T09:00:43.182+0000 7fc6fb8fa700  1 -- [v2:172.21.15.167:6800/1072681736,v1:172.21.15.167:6801/1072681736] --> 172.21.15.167:0/2463696467 -- mgr_command_reply(tid 0: -22 Failed to delete exports: [errno 110] RADOS timed out (Failed to notify b'conf-nfs.ganesha-test') and -22) v1 -- 0x559d5ed9fa20 con 0x559d5ef75000

#5 Updated by Patrick Donnelly 22 days ago

Varsha, I recommend you change task/test_orch_cli.yaml to add rados client debugging ("debug rados = 10") so it's easier to see the problem. See qa/cephfs/conf/mds.yaml for how to add that override.
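As a sketch of that override (following the usual teuthology `overrides` layout; the exact conf section is an assumption — the notifying rados client here lives in the mgr, so that seems the reasonable target):

```yaml
# Hypothetical addition to task/test_orch_cli.yaml; layout follows the
# usual qa override convention (see qa/cephfs/conf/mds.yaml for the style).
overrides:
  ceph:
    conf:
      mgr:
        debug rados: 10    # assumed section: the notify originates in the mgr
```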

#6 Updated by Patrick Donnelly 22 days ago

I dug into this more. Here's where the mgr notifies:

2020-06-17T10:57:56.460+0000 7f39d9d80700  1 -- 172.21.15.125:0/3575526903 --> [v2:172.21.15.125:6810/4251849281,v1:172.21.15.125:6811/4251849281] -- osd_op(unknown.0.0:27 2.1b 2:df026d98:test::conf-nfs.ganesha-test:head [notify cookie 94801177838720 in=12b] snapc 0=[] ondisk+read+known_if_redirected e31) v8 -- 0x56389e167800 con 0x56389e022400

and then gets ETIMEDOUT 5 seconds later:

2020-06-17T10:58:01.460+0000 7f39f5c88700  1 -- 172.21.15.125:0/3575526903 <== osd.1 v2:172.21.15.125:6810/4251849281 33 ==== watch-notify(notify_complete (2) cookie 94801177838720 notify 133143986182 ret -110) v3 ==== 42+0+24 (secure 0 0 0) 0x56389dde7d40 con 0x56389e022400

So, perhaps the client (ganesha) died?

From the OSD log:

2020-06-17T10:57:56.448+0000 7f5aa4a36700 10 osd.1 pg_epoch: 31 pg[2.1b( v 31'12 (0'0,31'12] local-lis/les=24/25 n=1 ec=24/24 lis/c=24/24 les/c/f=25/25/0 sis=24) [1,0,2] r=0 lpr=24 crt=31'12 lcod 31'11 mlcod 31'11 active+clean]  -- Watch(93925274022624,client.14288) start_notify 133143986181
2020-06-17T10:57:56.448+0000 7f5aa4a36700 10 Notify(94801177838720,133143986181  watchers=0) start_watcher
2020-06-17T10:57:56.448+0000 7f5aa4a36700 10 osd.1 pg_epoch: 31 pg[2.1b( v 31'12 (0'0,31'12] local-lis/les=24/25 n=1 ec=24/24 lis/c=24/24 les/c/f=25/25/0 sis=24) [1,0,2] r=0 lpr=24 crt=31'12 lcod 31'11 mlcod 31'11 active+clean]  -- Watch(93925274022624,client.14288) send_notify
2020-06-17T10:57:56.448+0000 7f5aa4a36700  1 -- [v2:172.21.15.125:6810/4251849281,v1:172.21.15.125:6811/4251849281] --> 172.21.15.125:0/3667345523 -- watch-notify(notify (1) cookie 93925274022624 notify 133143986181 ret 0) v3 -- 0x55eda281b040 con 0x55ed95df0800

from: /ceph/teuthology-archive/kchai-2020-06-17_10:32:/log/cad82e04-b088-11ea-a06d-001a4aab830c/ceph-osd.1.log.gz

msgr debug indicates that client went away:

2020-06-17T10:57:56.480+0000 7f5ac6932700 10 --2- [v2:172.21.15.125:6810/4251849281,v1:172.21.15.125:6811/4251849281] >> 172.21.15.125:0/3667345523 conn(0x55ed95df0800 0x55ed9617c880 unknown :-1 s=READY pgs=3 cs=0 l=1 rx=0 tx=0).discard_out_queue started 
2020-06-17T10:57:56.480+0000 7f5ac6932700 20 EpollDriver.del_event del event fd=74 cur_mask=1 delmask=3 to 5
2020-06-17T10:57:56.480+0000 7f5ab5257700  2 osd.1 31 ms_handle_reset con 0x55ed95df0800 session 0x55eda2431a40

and then the notify is eventually cancelled:

2020-06-17T10:58:01.460+0000 7f5aac245700 10 Notify(94801177838720,133143986182  watchers=1) timeout
2020-06-17T10:58:01.460+0000 7f5aac245700 10 Notify(94801177838720,133143986182  watchers=1) maybe_complete_notify -- 1 in progress watchers
2020-06-17T10:58:01.460+0000 7f5aac245700  1 -- [v2:172.21.15.125:6810/4251849281,v1:172.21.15.125:6811/4251849281] --> 172.21.15.125:0/3575526903 -- watch-notify(notify_complete (2) cookie 94801177838720 notify 133143986182 ret -110) v3 -- 0x55eda281b520 con 0x55eda231e400
2020-06-17T10:58:01.460+0000 7f5aac245700  5 --2- [v2:172.21.15.125:6810/4251849281,v1:172.21.15.125:6811/4251849281] >> 172.21.15.125:0/3575526903 conn(0x55eda231e400 0x55ed9823b180 secure :-1 s=READY pgs=3 cs=0 l=1 rx=0x55eda2394780 tx=0x55ed961932c0).send_message enqueueing message m=0x55eda281b520 type=44 watch-notify(notify_complete (2) cookie 94801177838720 notify 133143986182 ret -110) v3
2020-06-17T10:58:01.460+0000 7f5aac245700 15 --2- [v2:172.21.15.125:6810/4251849281,v1:172.21.15.125:6811/4251849281] >> 172.21.15.125:0/3575526903 conn(0x55eda231e400 0x55ed9823b180 secure :-1 s=READY pgs=3 cs=0 l=1 rx=0x55eda2394780 tx=0x55ed961932c0).send_message inline write is denied, reschedule m=0x55eda281b520
2020-06-17T10:58:01.460+0000 7f5aac245700 20 Event(0x55ed94ebb040 nevent=5000 time_id=15).wakeup
2020-06-17T10:58:01.460+0000 7f5aac245700 10 osd.1 pg_epoch: 31 pg[2.1b( v 31'13 (0'0,31'13] local-lis/les=24/25 n=1 ec=24/24 lis/c=24/24 les/c/f=25/25/0 sis=24) [1,0,2] r=0 lpr=24 crt=31'13 lcod 31'12 mlcod 31'12 active+clean]  -- Watch(93925274022624,client.14288) cancel_notify 133143986182
2020-06-17T10:58:01.460+0000 7f5ac6131700 10 -- [v2:172.21.15.125:6810/4251849281,v1:172.21.15.125:6811/4251849281] >> 172.21.15.125:0/3575526903 conn(0x55eda231e400 msgr2=0x55ed9823b180 secure :-1 s=STATE_CONNECTION_ESTABLISHED l=1).handle_write
2020-06-17T10:58:01.460+0000 7f5ac6131700 10 --2- [v2:172.21.15.125:6810/4251849281,v1:172.21.15.125:6811/4251849281] >> 172.21.15.125:0/3575526903 conn(0x55eda231e400 0x55ed9823b180 secure :-1 s=READY pgs=3 cs=0 l=1 rx=0x55eda2394780 tx=0x55ed961932c0).write_event
2020-06-17T10:58:01.460+0000 7f5ac6131700 20 --2- [v2:172.21.15.125:6810/4251849281,v1:172.21.15.125:6811/4251849281] >> 172.21.15.125:0/3575526903 conn(0x55eda231e400 0x55ed9823b180 secure :-1 s=READY pgs=3 cs=0 l=1 rx=0x55eda2394780 tx=0x55ed961932c0).prepare_send_message m=watch-notify(notify_complete (2) cookie 94801177838720 notify 133143986182 ret -110) v3
2020-06-17T10:58:01.460+0000 7f5ac6131700 20 --2- [v2:172.21.15.125:6810/4251849281,v1:172.21.15.125:6811/4251849281] >> 172.21.15.125:0/3575526903 conn(0x55eda231e400 0x55ed9823b180 secure :-1 s=READY pgs=3 cs=0 l=1 rx=0x55eda2394780 tx=0x55ed961932c0).prepare_send_message encoding features 4540138292837744639 0x55eda281b520 watch-notify(notify_complete (2) cookie 94801177838720 notify 133143986182 ret -110) v3

The client that didn't respond (172.21.15.125:0/3667345523) can be confirmed to be Ganesha from the mon logs:

2020-06-17T10:57:43.740+0000 7f2cf9843700  1 -- [v2:172.21.15.125:3300/0,v1:172.21.15.125:6789/0] <== client.? 172.21.15.125:0/3667345523 1 ==== mon_getmap magic: 0 v1 ==== 0+0+0 (secure 0 0 0) 0x55df2b9af7c0 con 0x55df2bd32400
2020-06-17T10:57:43.740+0000 7f2cf9843700 20 mon.a@0(leader) e1 _ms_dispatch existing session 0x55df2d9b6000 for client.?
2020-06-17T10:57:43.740+0000 7f2cf9843700 20 mon.a@0(leader) e1  entity client.nfs.ganesha-test.smithi125 caps allow r

From: /ceph/teuthology-archive/kchai-2020-06-17_10:32:/log/cad82e04-b088-11ea-a06d-001a4aab830c/ceph-mon.a.log.gz

(Lesson: look at the messages being sent/received.)

So, this looks like a simple matter of Ganesha going away for whatever reason. We should tolerate that by ignoring ETIMEDOUT on notify.
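That tolerance could be sketched as below. This is illustrative only, not the actual change in PR 35646: a stub stands in for librados' notify on the conf object, and `RadosTimedOut`, `notify_ganesha`, and `dead_watcher_notify` are names invented for the sketch.

```python
import errno

class RadosTimedOut(Exception):
    """Stand-in for the rados timeout error (errno 110) on a notify."""

def notify_ganesha(do_notify, conf_obj, msg=b""):
    """Notify ganesha instances watching conf_obj; tolerate a dead watcher.

    If no daemon responds before the notify timeout (ETIMEDOUT), treat it
    as success: the daemon we wanted to poke is already gone, so cluster
    deletion should proceed instead of failing with EINVAL.
    """
    try:
        do_notify(conf_obj, msg)
    except RadosTimedOut:
        # Ganesha went away; nothing left to notify, so ignore the timeout.
        pass

# Stub simulating an OSD completing the notify with ret -110, as in the logs.
def dead_watcher_notify(obj, msg):
    raise RadosTimedOut(errno.ETIMEDOUT)

notify_ganesha(dead_watcher_notify, "conf-nfs.ganesha-test")
print("cluster delete proceeded despite notify timeout")
```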

#8 Updated by Varsha Rao 21 days ago

  • Status changed from New to Fix Under Review
  • Target version set to v16.0.0
  • Source set to Development
  • Backport set to octopus
  • Pull request ID set to 35646

#9 Updated by Patrick Donnelly 16 days ago

  • Status changed from Fix Under Review to Pending Backport

#10 Updated by Varsha Rao 16 days ago

  • Copied to Backport #46155: octopus: Test failure: test_create_multiple_exports (tasks.cephfs.test_nfs.TestNFS) added

#11 Updated by Nathan Cutler 16 days ago

  • Project changed from Orchestrator to fs
