Bug #61732

closed

pacific: test_cluster_info fails from "No daemons reported"

Added by Laura Flores 11 months ago. Updated 26 days ago.

Status:
Resolved
Priority:
High
Category:
Correctness/Safety
Target version:
% Done:

100%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
NFS-cluster
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

/a/yuriw-2023-06-15_19:41:47-rados-wip-yuri6-testing-2023-06-14-0754-pacific-distro-default-smithi/7305740

2023-06-15T22:03:58.417 INFO:teuthology.orchestra.run.smithi171.stdout:No daemons reported
2023-06-15T22:03:58.430 WARNING:teuthology.contextutil:reached maximum tries (11) after waiting for 40 seconds
2023-06-15T22:03:58.431 WARNING:tasks.cephfs.test_nfs:NFS Ganesha cluster deployment failed, retrying
2023-06-15T22:03:58.431 DEBUG:teuthology.orchestra.run.smithi171:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph log 'Ended test tasks.cephfs.test_nfs.TestNFS.test_cluster_info'
2023-06-15T22:03:58.991 INFO:tasks.cephfs_test_runner:test_cluster_info (tasks.cephfs.test_nfs.TestNFS) ... ERROR
2023-06-15T22:03:58.991 INFO:tasks.cephfs_test_runner:
2023-06-15T22:03:58.992 INFO:tasks.cephfs_test_runner:======================================================================

2023-06-15T22:03:58.992 INFO:tasks.cephfs_test_runner:ERROR: test_cluster_info (tasks.cephfs.test_nfs.TestNFS)
2023-06-15T22:03:58.992 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2023-06-15T22:03:58.992 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2023-06-15T22:03:58.993 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_ceph_ceph-c_a2a4ed2b4fbd1366687a5db6ac3695c86d95455f/qa/tasks/cephfs/test_nfs.py", line 574, in test_cluster_info
2023-06-15T22:03:58.993 INFO:tasks.cephfs_test_runner:    self._test_create_cluster()
2023-06-15T22:03:58.993 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_ceph_ceph-c_a2a4ed2b4fbd1366687a5db6ac3695c86d95455f/qa/tasks/cephfs/test_nfs.py", line 149, in _test_create_cluster
2023-06-15T22:03:58.993 INFO:tasks.cephfs_test_runner:    while proceed():
2023-06-15T22:03:58.994 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_teuthology_961f4fb51318c373681de5844aadbc6dc0e58abc/teuthology/contextutil.py", line 134, in __call__
2023-06-15T22:03:58.994 INFO:tasks.cephfs_test_runner:    raise MaxWhileTries(error_msg)
2023-06-15T22:03:58.994 INFO:tasks.cephfs_test_runner:teuthology.exceptions.MaxWhileTries: reached maximum tries (11) after waiting for 40 seconds

Actions #1

Updated by Laura Flores 11 months ago

  • Backport set to pacific
Actions #2

Updated by Venky Shankar 10 months ago

  • Category set to Correctness/Safety
  • Assignee set to Dhairya Parmar
  • Target version set to v19.0.0
  • Labels (FS) NFS-cluster added
Actions #3

Updated by Venky Shankar 10 months ago

  • Status changed from New to Triaged
Actions #4

Updated by Laura Flores 10 months ago

/a/yuriw-2023-06-23_20:51:14-rados-wip-yuri8-testing-2023-06-22-1309-pacific-distro-default-smithi/7314160

Actions #5

Updated by Laura Flores 10 months ago

  • Priority changed from Normal to High
Actions #6

Updated by Dhairya Parmar 10 months ago

Laura Flores wrote:

Occurs quite a bit. Perhaps from a recent regression?

See http://pulpito.front.sepia.ceph.com/lflores-2023-07-05_17:10:12-rados-wip-yuri8-testing-2023-06-22-1309-pacific-distro-default-smithi/ for example.

I'm going through logs now, will update soon.

Actions #7

Updated by Dhairya Parmar 10 months ago

@laura this isn't seen in quincy or reef, is it?

Actions #8

Updated by Laura Flores 10 months ago

Dhairya Parmar wrote:

@laura this isn't seen in quincy or reef, is it?

Right. But since it occurs in pacific, it might be in quincy and reef too, so I will continue to update the tracker if I see anything.

Actions #9

Updated by Dhairya Parmar 10 months ago

Laura Flores wrote:

Dhairya Parmar wrote:

@laura this isn't seen in quincy or reef, is it?

Right. But since it occurs in pacific, it might be in quincy and reef too, so I will continue to update the tracker if I see anything.

Can you share the list of PRs included in the run? Also, is it the same set of PRs as in Yuri's run that you pasted above?

Actions #11

Updated by Dhairya Parmar 10 months ago

Laura Flores wrote:

See https://trello.com/c/qQnRTrLO/1792-wip-yuri8-testing-2023-06-22-1309-pacific-old-wip-yuri8-testing-2023-06-22-1004-pacific-old-wip-yuri8-testing-2023-06-22-0834-pa.

This is a private workspace and I need access to it. I have already requested access; who do I need to get in touch with for permission? Yuri W?

Actions #12

Updated by Dhairya Parmar 10 months ago

The log is full of lines complaining that the NFS cluster daemon could not be found:

INFO:teuthology.orchestra.run.smithi164.stdout:No daemons reported

The issue lies here:

2023-07-05T17:43:26.341 DEBUG:teuthology.orchestra.run.smithi164:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph n f s ' ' c l u s t e r ' ' c r e a t e ' ' t e s t

i.e. the stray spaces and single quotes make the command uninterpretable, so it fails with EINVAL:

2023-07-05T17:43:26.907 INFO:teuthology.orchestra.run.smithi164.stderr:no valid command found; 10 closest matches:
2023-07-05T17:43:26.907 INFO:teuthology.orchestra.run.smithi164.stderr:pg stat
2023-07-05T17:43:26.907 INFO:teuthology.orchestra.run.smithi164.stderr:pg getmap
2023-07-05T17:43:26.907 INFO:teuthology.orchestra.run.smithi164.stderr:pg dump [all|summary|sum|delta|pools|osds|pgs|pgs_brief...]
2023-07-05T17:43:26.907 INFO:teuthology.orchestra.run.smithi164.stderr:pg dump_json [all|summary|sum|pools|osds|pgs...]
2023-07-05T17:43:26.907 INFO:teuthology.orchestra.run.smithi164.stderr:pg dump_pools_json
2023-07-05T17:43:26.908 INFO:teuthology.orchestra.run.smithi164.stderr:pg ls-by-pool <poolstr> [<states>...]
2023-07-05T17:43:26.908 INFO:teuthology.orchestra.run.smithi164.stderr:pg ls-by-primary <id|osd.id> [<pool:int>] [<states>...]
2023-07-05T17:43:26.908 INFO:teuthology.orchestra.run.smithi164.stderr:pg ls-by-osd <id|osd.id> [<pool:int>] [<states>...]
2023-07-05T17:43:26.908 INFO:teuthology.orchestra.run.smithi164.stderr:pg ls [<pool:int>] [<states>...]
2023-07-05T17:43:26.908 INFO:teuthology.orchestra.run.smithi164.stderr:pg dump_stuck [inactive|unclean|stale|undersized|degraded...] [<threshold:int>]
2023-07-05T17:43:26.908 INFO:teuthology.orchestra.run.smithi164.stderr:Error EINVAL: invalid command
2023-07-05T17:43:26.909 DEBUG:teuthology.orchestra.run:got remote process result: 22

This repeats in a loop and eventually fails.

Actions #13

Updated by Dhairya Parmar 10 months ago

_test_create_cluster() in test_nfs needed to inspect stderr, so I created a new helper, _nfs_complete_cmd():

    def _nfs_complete_cmd(self, cmd):
        return self.mgr_cluster.mon_manager.run_cluster_cmd(args=f"nfs {cmd}",
                                                            stdout=StringIO(),
                                                            stderr=StringIO(),
                                                            check_status=False)

Which is being used here [0].

This helper is called differently from the existing ones: instead of passing a tuple I passed a string, because it is more readable and the underlying code in the main branch allows it [1]. That code is missing from the pacific branch [2], which explains the stray spaces and unintended single quotes when the command `nfs cluster create test` is interpreted by pacific's run_cluster_cmd().

The commits that allow both strings and tuples when passing CLI commands are [3] and [4], and they were never backported to pacific. So either I change [0] to pass a tuple, or we backport [3] and [4]. Either way works, but I'd recommend backporting, because the issue may come up again when someone passes a command as a string only to find some unearthly command in the pacific teuthology logs :P

[0] https://github.com/ceph/ceph/pull/50809/files#diff-61b87b23c38fe121bbe5f110686a0cd1e5e338811b5fa1a9456c4548bd206055R153-R154
[1] https://github.com/ceph/ceph/blob/main/qa/tasks/ceph_manager.py#L1562-L1565
[2] https://github.com/ceph/ceph/blob/pacific/qa/tasks/ceph_manager.py#L1560-L1593
[3] https://github.com/ceph/ceph/commit/93677576c1fd6d0e4e2991a9ba6be6d222ea98ea
[4] https://github.com/ceph/ceph/commit/a1dc6b6c1964423158dcd7c930db5e3063ff210e
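
The string-versus-tuple behaviour described above can be reproduced outside teuthology. The following is only a minimal sketch, not the actual run_cluster_cmd() code: the function names, the shortened PREFIX, and the use of shlex for normalization and log-style quoting are all assumptions made for illustration. It shows how a string that is treated as a sequence of arguments falls apart into single characters, with each space rendered as a quoted `' '`.

```python
import shlex

# Shortened, illustrative command prefix (assumption; the real prefix in the
# logs also includes sudo, adjust-ulimits, ceph-coverage and timeout).
PREFIX = ["ceph", "--cluster", "ceph"]


def build_args_without_normalization(args):
    """Hypothetical stand-in for a code path that assumes `args` is a
    tuple/list. list() on a str yields one element per character."""
    return PREFIX + list(args)


def build_args_with_normalization(args):
    """Hypothetical stand-in for a code path that accepts both strings
    and tuples by splitting strings into words first."""
    if isinstance(args, str):
        args = shlex.split(args)
    return PREFIX + list(args)


def render(argv):
    """Quote each argument the way a shell-style log line would show it."""
    return " ".join(shlex.quote(a) for a in argv)


cmd = "nfs cluster create test"
broken = render(build_args_without_normalization(cmd))
fixed = render(build_args_with_normalization(cmd))
as_tuple = render(build_args_without_normalization(tuple(cmd.split())))

print(broken)    # the string is split per character, spaces become ' '
print(fixed)     # the string is split into words first
print(as_tuple)  # passing a tuple works without any normalization
```

The first printed line has the same shape as the broken command in the teuthology log, while the other two correspond to the two proposed fixes: normalizing strings inside the helper, or having the caller pass a tuple.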

Actions #14

Updated by Laura Flores 9 months ago

/a/yuriw-2023-07-19_14:33:14-rados-wip-yuri11-testing-2023-07-18-0927-pacific-distro-default-smithi/7343428

Actions #15

Updated by Sridhar Seshasayee 9 months ago

/a/yuriw-2023-07-26_15:54:22-rados-wip-yuri6-testing-2023-07-24-0819-pacific-distro-default-smithi/7353337
/a/yuriw-2023-07-26_15:54:22-rados-wip-yuri6-testing-2023-07-24-0819-pacific-distro-default-smithi/7353548
/a/yuriw-2023-07-26_15:54:22-rados-wip-yuri6-testing-2023-07-24-0819-pacific-distro-default-smithi/7353740
/a/yuriw-2023-07-26_15:54:22-rados-wip-yuri6-testing-2023-07-24-0819-pacific-distro-default-smithi/7353948

Actions #16

Updated by Venky Shankar 9 months ago

Dhairya, could you try backporting the dependent PRs?

Actions #17

Updated by Dhairya Parmar 9 months ago

Venky Shankar wrote:

Dhairya, could you try backporting the dependent PRs?

okay

Actions #18

Updated by Laura Flores 9 months ago

/a/yuriw-2023-08-02_20:21:03-rados-wip-yuri3-testing-2023-08-01-0825-pacific-distro-default-smithi/7358531

Actions #19

Updated by Venky Shankar 9 months ago

  • Status changed from Triaged to Fix Under Review
  • Pull request ID set to 52763
Actions #20

Updated by Venky Shankar 9 months ago

  • Subject changed from test_cluster_info fails from "No daemons reported" to pacific: test_cluster_info fails from "No daemons reported"
Actions #21

Updated by Laura Flores 9 months ago

/a/yuriw-2023-08-08_14:45:33-rados-wip-yuri6-testing-2023-08-03-0807-pacific-distro-default-smithi/7362839

Actions #22

Updated by Laura Flores 9 months ago

/a/yuriw-2023-08-10_20:19:11-rados-wip-yuri2-testing-2023-08-08-0755-pacific-distro-default-smithi/7366072

Actions #23

Updated by Aishwarya Mathuria 9 months ago

/a/yuriw-2023-08-16_22:40:18-rados-wip-yuri2-testing-2023-08-16-1142-pacific-distro-default-smithi/7370706/

Actions #24

Updated by Laura Flores 8 months ago

/a/yuriw-2023-08-21_23:10:07-rados-pacific-release-distro-default-smithi/7375005

Actions #25

Updated by Laura Flores 8 months ago

/a/yuriw-2023-09-01_19:14:47-rados-wip-batrick-testing-20230831.124848-pacific-distro-default-smithi/7386551

Actions #26

Updated by Laura Flores 6 months ago

/a/lflores-2023-11-01_18:38:59-rados-wip-yuri5-testing-2023-10-24-0737-pacific-distro-default-smithi/7443306

Actions #27

Updated by Yuri Weinstein 5 months ago

  • Target version changed from v19.0.0 to v16.2.15

merged

Actions #28

Updated by Konstantin Shalygin 26 days ago

  • Status changed from Fix Under Review to Resolved
  • % Done changed from 0 to 100
  • Backport deleted (pacific)