Project

General

Profile

Bug #46815

mgr/dashboard: FAIL: test_pool_create (tasks.mgr.dashboard.test_pool.PoolTest)

Added by Tatjana Dehler 6 months ago. Updated 4 months ago.

Status:
Resolved
Priority:
Urgent
Category:
API Tests
Target version:
-
% Done:

0%

Source:
Q/A
Tags:
Backport:
Regression:
Yes
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature:

Description

2020-08-03 03:43:22,156.156 INFO:__main__:----------------------------------------------------------------------
2020-08-03 03:43:22,156.156 INFO:__main__:Traceback (most recent call last):
2020-08-03 03:43:22,156.156 INFO:__main__:  File "/home/jenkins-build/build/workspace/ceph-api-nightly-nautilus-backend/qa/tasks/mgr/dashboard/test_pool.py", line 267, in test_pool_create
2020-08-03 03:43:22,156.156 INFO:__main__:    self.assertIn(conf, new_pool['configuration'])
2020-08-03 03:43:22,156.156 INFO:__main__:AssertionError: {'name': 'rbd_qos_bps_limit', 'source': 1, 'value': '2048'} not found in []
2020-08-03 03:43:22,156.156 INFO:__main__:
2020-08-03 03:43:22,157.157 INFO:__main__:----------------------------------------------------------------------
2020-08-03 03:43:22,157.157 INFO:__main__:Ran 82 tests in 1096.900s

- https://jenkins.ceph.com/view/mgr-dashboard/job/ceph-api-nightly-nautilus-backend/248/
- https://jenkins.ceph.com/view/mgr-dashboard/job/ceph-api-nightly-nautilus-backend/247/
- https://jenkins.ceph.com/view/mgr-dashboard/job/ceph-api-nightly-nautilus-backend/256/
- https://pulpito.ceph.com/laura-2020-08-20_12:55:29-rados:dashboard-wip-laura-testing-nautilus-batch-36607-36613-36692-distro-basic-smithi/
- https://jenkins.ceph.com/view/mgr-dashboard/job/ceph-api-nightly-nautilus-backend/273/
- https://jenkins.ceph.com/view/mgr-dashboard/job/ceph-api-nightly-nautilus-backend/289/
- https://jenkins.ceph.com/job/ceph-api/4061/


Related issues

Related to mgr - Feature #36191: mgr/dashboard: Add support for managing RBD QoS Resolved 01/30/2019
Related to mgr - Bug #44224: mgr/dashboard: Timeouts for rbd.py calls New

History

#1 Updated by Tatjana Dehler 6 months ago

  • Assignee deleted (Laura Paduano)

#2 Updated by Laura Paduano 6 months ago

  • Description updated (diff)

#3 Updated by Laura Paduano 5 months ago

  • Description updated (diff)
  • Priority changed from Normal to High
  • Affected Versions v14.2.9 added
  • Affected Versions deleted (v14.2.11)

#4 Updated by Tatjana Dehler 5 months ago

I tried to reproduce the issue on my local environment but unfortunately it didn't happen there.

#5 Updated by Laura Paduano 5 months ago

2020-08-20T14:42:26.381 INFO:tasks.mgr.dashboard.helper:command result:
2020-08-20T14:42:26.382 INFO:tasks.cephfs_test_runner:
2020-08-20T14:42:26.382 INFO:tasks.cephfs_test_runner:======================================================================
2020-08-20T14:42:26.383 INFO:tasks.cephfs_test_runner:FAIL: test_pool_create (tasks.mgr.dashboard.test_pool.PoolTest)
2020-08-20T14:42:26.383 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2020-08-20T14:42:26.383 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2020-08-20T14:42:26.384 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-laura-testing-nautilus-batch-36607-36613-36692/qa/tasks/mgr/dashboard/test_pool.py", line 267, in test_pool_create
2020-08-20T14:42:26.384 INFO:tasks.cephfs_test_runner:    self.assertIn(conf, new_pool['configuration'])
2020-08-20T14:42:26.384 INFO:tasks.cephfs_test_runner:AssertionError: {'name': 'rbd_qos_bps_limit', 'source': 1, 'value': '2048'} not found in []
2020-08-20T14:42:26.385 INFO:tasks.cephfs_test_runner:
2020-08-20T14:42:26.385 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2020-08-20T14:42:26.385 INFO:tasks.cephfs_test_runner:Ran 123 tests in 4324.539s
2020-08-20T14:42:26.386 INFO:tasks.cephfs_test_runner:
2020-08-20T14:42:26.386 INFO:tasks.cephfs_test_runner:FAILED (failures=1)
2020-08-20T14:42:26.386 INFO:tasks.cephfs_test_runner:
2020-08-20T14:42:26.387 INFO:tasks.cephfs_test_runner:======================================================================
2020-08-20T14:42:26.388 INFO:tasks.cephfs_test_runner:FAIL: test_pool_create (tasks.mgr.dashboard.test_pool.PoolTest)
2020-08-20T14:42:26.388 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2020-08-20T14:42:26.388 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2020-08-20T14:42:26.388 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-laura-testing-nautilus-batch-36607-36613-36692/qa/tasks/mgr/dashboard/test_pool.py", line 267, in test_pool_create
2020-08-20T14:42:26.389 INFO:tasks.cephfs_test_runner:    self.assertIn(conf, new_pool['configuration'])
2020-08-20T14:42:26.389 INFO:tasks.cephfs_test_runner:AssertionError: {'name': 'rbd_qos_bps_limit', 'source': 1, 'value': '2048'} not found in []
2020-08-20T14:42:26.390 INFO:tasks.cephfs_test_runner:
2020-08-20T14:42:26.390 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/run_tasks.py", line 93, in run_tasks
    manager.__enter__()
  File "/usr/lib/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-laura-testing-nautilus-batch-36607-36613-36692/qa/tasks/cephfs_test_runner.py", line 207, in task
    raise RuntimeError("Test failure: {0}".format(", ".join(bad_tests)))
RuntimeError: Test failure: test_pool_create (tasks.mgr.dashboard.test_pool.PoolTest)
2020-08-20T14:42:26.416 ERROR:teuthology.run_tasks: Sentry event: http://sentry.ceph.com/sepia/teuthology/?q=c72affb1f1d94dc593c31fc319416645
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/run_tasks.py", line 93, in run_tasks
    manager.__enter__()
  File "/usr/lib/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-laura-testing-nautilus-batch-36607-36613-36692/qa/tasks/cephfs_test_runner.py", line 207, in task
    raise RuntimeError("Test failure: {0}".format(", ".join(bad_tests)))
RuntimeError: Test failure: test_pool_create (tasks.mgr.dashboard.test_pool.PoolTest)

- http://pulpito.ceph.com/laura-2020-08-20_12:55:29-rados:dashboard-wip-laura-testing-nautilus-batch-36607-36613-36692-distro-basic-smithi/
- http://pulpito.ceph.com/laura-2020-08-21_08:58:47-rados:dashboard-wip-laura-testing-nautilus-batch-36607-36613-36692-distro-basic-smithi/
- http://pulpito.ceph.com/laura-2020-08-25_07:37:15-rados:dashboard-wip-laura-testing-nautilus-batch-36607-36613-36692-distro-basic-smithi/

#6 Updated by Tatjana Dehler 5 months ago

I ran the tests a few time now. It happens (not in every run) on my local environment as well.

#7 Updated by Laura Paduano 5 months ago

  • Description updated (diff)

#9 Updated by Laura Paduano 5 months ago

  • Description updated (diff)

#10 Updated by Deepika Upadhyay 5 months ago

/a/yuriw-2020-09-10_22:09:32-rados-wip-yuri2-testing-2020-09-10-0000-nautilus-distro-basic-smithi/5422088/teuthology.log
/a/yuriw-2020-09-10_22:09:32-rados-wip-yuri2-testing-2020-09-10-0000-nautilus-distro-basic-smithi/5422178/teuthology.log

#13 Updated by Lenz Grimmer 4 months ago

According to git blame qa/tasks/mgr/dashboard/test_pool.py in the nautilus branch, that test has not been modified for quite some time:

22dce7c3b51d qa/tasks/mgr/dashboard/test_pool.py    (Patrick Nawracay 2019-02-19 11:14:33 +0100 250)             'configuration': {
22dce7c3b51d qa/tasks/mgr/dashboard/test_pool.py    (Patrick Nawracay 2019-02-19 11:14:33 +0100 251)                 'rbd_qos_bps_limit': 2048,
22dce7c3b51d qa/tasks/mgr/dashboard/test_pool.py    (Patrick Nawracay 2019-02-19 11:14:33 +0100 252)                 'rbd_qos_iops_limit': None,
22dce7c3b51d qa/tasks/mgr/dashboard/test_pool.py    (Patrick Nawracay 2019-02-19 11:14:33 +0100 253)             },
22dce7c3b51d qa/tasks/mgr/dashboard/test_pool.py    (Patrick Nawracay 2019-02-19 11:14:33 +0100 254)         }
22dce7c3b51d qa/tasks/mgr/dashboard/test_pool.py    (Patrick Nawracay 2019-02-19 11:14:33 +0100 255)         expected_configuration = [{
22dce7c3b51d qa/tasks/mgr/dashboard/test_pool.py    (Patrick Nawracay 2019-02-19 11:14:33 +0100 256)             'name': 'rbd_qos_bps_limit',
22dce7c3b51d qa/tasks/mgr/dashboard/test_pool.py    (Patrick Nawracay 2019-02-19 11:14:33 +0100 257)             'source': 1,
22dce7c3b51d qa/tasks/mgr/dashboard/test_pool.py    (Patrick Nawracay 2019-02-19 11:14:33 +0100 258)             'value': '2048',
22dce7c3b51d qa/tasks/mgr/dashboard/test_pool.py    (Patrick Nawracay 2019-02-19 11:14:33 +0100 259)         }, {
22dce7c3b51d qa/tasks/mgr/dashboard/test_pool.py    (Patrick Nawracay 2019-02-19 11:14:33 +0100 260)             'name': 'rbd_qos_iops_limit',
22dce7c3b51d qa/tasks/mgr/dashboard/test_pool.py    (Patrick Nawracay 2019-02-19 11:14:33 +0100 261)             'source': 0,
22dce7c3b51d qa/tasks/mgr/dashboard/test_pool.py    (Patrick Nawracay 2019-02-19 11:14:33 +0100 262)             'value': '0',
e6c1056ad5ad qa/tasks/mgr/dashboard/test_pool.py    (Sebastian Wagner 2018-03-21 17:15:29 +0100 263)         }]
22dce7c3b51d qa/tasks/mgr/dashboard/test_pool.py    (Patrick Nawracay 2019-02-19 11:14:33 +0100 264)         self._task_post('/api/pool/', pool)
22dce7c3b51d qa/tasks/mgr/dashboard/test_pool.py    (Patrick Nawracay 2019-02-19 11:14:33 +0100 265)         self.assertStatus(201)
22dce7c3b51d qa/tasks/mgr/dashboard/test_pool.py    (Patrick Nawracay 2019-02-19 11:14:33 +0100 266)         new_pool = self._get_pool(pool['pool'])
22dce7c3b51d qa/tasks/mgr/dashboard/test_pool.py    (Patrick Nawracay 2019-02-19 11:14:33 +0100 267)         self._validate_pool_properties(pool, new_pool)
22dce7c3b51d qa/tasks/mgr/dashboard/test_pool.py    (Patrick Nawracay 2019-02-19 11:14:33 +0100 268)         for conf in expected_configuration:
22dce7c3b51d qa/tasks/mgr/dashboard/test_pool.py    (Patrick Nawracay 2019-02-19 11:14:33 +0100 269)             self.assertIn(conf, new_pool['configuration'])
cbee3719580c qa/tasks/mgr/dashboard/test_pool.py    (Sebastian Wagner 2018-03-22 12:08:18 +0100 270) 

Did some recent backport into Nautilus make changes to the RBD QoS BPS settings? If I'm not mistaken, the test is expecting rbd_qos_bps_limit to be set to 2048, but the Ceph pool does not contain this value.

#14 Updated by Lenz Grimmer 4 months ago

  • Regression changed from No to Yes

#15 Updated by Lenz Grimmer 4 months ago

  • Related to Feature #36191: mgr/dashboard: Add support for managing RBD QoS added

#16 Updated by Lenz Grimmer 4 months ago

I've asked Mykola to take a look (as he's the original author of the RBD QoS feature), and he made the following observation:

I wonder couldn't it be related to https://github.com/ceph/ceph/pull/35367/

It has a change to src/pybind/mgr/dashboard/services/rbd.py, RbdConfiguration.list() function to return an empty configuration if the pool in a "bad" state. Can we check this somehow?

#17 Updated by Stephan Müller 4 months ago

  • Status changed from New to In Progress
  • Assignee set to Stephan Müller

#18 Updated by Stephan Müller 4 months ago

  • Status changed from In Progress to Can't reproduce

I tried to reproduce the failure with two different environments and in neither of them the failure could be reproduced.

At this point I only can guess what the cause could be:
  • A change in the Jekinsjob.
  • To slow test systems?

To the second point. First a pool will be created, than special options will be set like compression and after that the qos settings will be applied. As the set qos settings are expected and the pool creation is an asynchronous call, it could be possible if a system is slow enough that the qos settings are not set during some milliseconds even if it's on the same hardware (vstart cluster).

On the other hand the same test exists in master and octopus too and those tests are not falling on them and do not need a sleep.

Was there any significant change in the jobs that resulted in longer test runs?

#20 Updated by Laura Paduano 4 months ago

Stephan Müller wrote:

I tried to reproduce the failure with two different environments and in neither of them the failure could be reproduced.

Did you try to run all tests or just the pool suite?
Tatjana mentioned that she was able to reproduce it when running the pool suite, but it did not happen all the time.

#21 Updated by Ernesto Puerta 4 months ago

  • Status changed from Can't reproduce to New

Reopening. Happening in Nautilus (missing fix?):

https://jenkins.ceph.com/job/ceph-api/3702/

#24 Updated by Laura Paduano 4 months ago

  • Description updated (diff)

#25 Updated by Nathan Cutler 4 months ago

This failure is reproducing in about 60-80% of "jenkins test api" runs in nautilus PRs at the moment.

#26 Updated by Nathan Cutler 4 months ago

  • Severity changed from 3 - minor to 2 - major

#27 Updated by Ernesto Puerta 4 months ago

  • Assignee changed from Stephan Müller to Ernesto Puerta
  • Priority changed from High to Urgent

I'll have a look at this.

#28 Updated by Stephan Müller 4 months ago

Laura Paduano wrote:

Stephan Müller wrote:

I tried to reproduce the failure with two different environments and in neither of them the failure could be reproduced.

Did you try to run all tests or just the pool suite?
Tatjana mentioned that she was able to reproduce it when running the pool suite, but it did not happen all the time.

I ran the pool suite tests multiple times and used the API directly with different combinations via calling it directly using Insomnia (where I could replay the direct API calls) and through the dashboard.

As said it's not happening in my environments.

#29 Updated by Mykola Golub 4 months ago

Just to confirm my assumption that the root cause is "len(pg_status) 1 and 'unknown' in pg_status[0]" condition check in RbdConfiguration.list function, I added some diagnostic there and created a test PR [1]. When it fails I see now [2]:

2020-10-01 04:19:26,758.758 INFO:__main__:Traceback (most recent call last):
2020-10-01 04:19:26,758.758 INFO:__main__:  File "/home/jenkins-build/build/workspace/ceph-api/qa/tasks/mgr/dashboard/test_pool.py", line 271, in test_pool_create
2020-10-01 04:19:26,758.758 INFO:__main__:    self.assertIn(conf, new_pool['configuration'])
2020-10-01 04:19:26,758.758 INFO:__main__:AssertionError: {'name': 'rbd_qos_bps_limit', 'source': 1, 'value': '2048'} not found in [{'source': 1, 'name': 'debug', 'value': 'pg_status is unknown'}]

I.e. it really fails when all pgs for the pool are in unknown state.

The 'unknown' state means that pg peering is not complete yet, so it looks like the test just not wait long enough for pgs to peer after the pool is created.

Note, in the master we have a different condition here [3]: "len(pg_status) 1 and 'incomplete' in pg_status0". I.e. we check for all pg in 'incomplete' state. A pg enters incomplete state if the peering failed. So it is more correct check, still I don't like this too. It looks just like an attempt to make CI happy, not to fix a potential problem that may happen in reality. Usually you will not observe all pgs in incomplete state, but only some of them. And it is a matter of luck if the rbd metadata object belongs to incomplete pg or not (in the first case the request would get stuck, in the second it would succeed).

If we really want something like this to make sure it does not get stuck, we should check that all pgs are in 'active+...' state. And I think it is a bad idea to just return an empty list. This is misleading both for the users and for troubleshooting issues like this one. The error should be propagated and the dashboard should inform the user that currently the pool configuration is not accessible because the pool is not in healthy state.

[1] https://github.com/ceph/ceph/pull/37491
[2] https://jenkins.ceph.com/job/ceph-api/4296/
[3] https://github.com/ceph/ceph/commit/e174b91d6b7670ed575577ddff18edc354be69fb

#30 Updated by Ernesto Puerta 4 months ago

  • Related to Bug #44224: mgr/dashboard: Timeouts for rbd.py calls added

#31 Updated by Ernesto Puerta 4 months ago

Thanks a lot, Mykola, for the analysis. So I'll backport that suboptimal fix and raise the priority of https://tracker.ceph.com/issues/43765

If we really want something like this to make sure it does not get stuck, we should check that all pgs are in 'active+...' state. And I think it is a bad idea to just return an empty list. This is misleading both for the users and for troubleshooting issues like this one. The error should be propagated and the dashboard should inform the user that currently the pool configuration is not accessible because the pool is not in healthy state.

Fully agree. Ignoring errors is the worst way of dealing with them.

#32 Updated by Laura Paduano 4 months ago

Thanks a lot for investigating and looking into the issue, Mykola!
And thanks for looking into fixing the issue, Ernesto!

#33 Updated by Ernesto Puerta 4 months ago

Ok it seems that during the backporting from master to nautilus, a line with 'incomplete' became 'unknown'.

I'll simply backport that.

#34 Updated by Ernesto Puerta 4 months ago

  • Pull request ID set to 37505

#35 Updated by Ernesto Puerta 4 months ago

Alternative fix (proposed by Stephan): https://github.com/ceph/ceph/pull/37510

#36 Updated by Nathan Cutler 4 months ago

  • Status changed from New to Resolved
  • Pull request ID changed from 37505 to 37510

Also available in: Atom PDF