Bug #53680


ERROR:tasks.rook:'waiting for service removal' reached maximum tries (90) after waiting for 900 seconds

Added by Laura Flores over 2 years ago. Updated about 2 years ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
-
Target version:
-
% Done:
0%
Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

/a/yuriw-2021-12-17_22:45:37-rados-wip-yuri10-testing-2021-12-17-1119-distro-default-smithi/6569344/

2021-12-18T01:35:01.041 INFO:teuthology.orchestra.run.smithi185.stdout:[{"placement": {"host_pattern": "*"}, "service_name": "crash", "service_type": "crash", "status": {"container_image_id": "1b23219043771c7aeaf383e73d414c625d03fae66d69ca7172f26ede96eefd1d", "container_image_name": "quay.ceph.io/ceph-ci/ceph:91fdab49fed87aa0a3dbbceccc27e84ab4f80130", "created": "2021-12-18T01:11:19.000000Z", "last_refresh": "2021-12-18T01:35:00.947402Z", "running": 1, "size": 1}}, {"placement": {"count": 1}, "service_name": "mgr", "service_type": "mgr", "status": {"container_image_id": "1b23219043771c7aeaf383e73d414c625d03fae66d69ca7172f26ede96eefd1d", "container_image_name": "quay.ceph.io/ceph-ci/ceph:91fdab49fed87aa0a3dbbceccc27e84ab4f80130", "created": "2021-12-18T01:02:05.000000Z", "last_refresh": "2021-12-18T01:35:00.947402Z", "running": 1, "size": 1}}, {"placement": {"count": 1}, "service_name": "mon", "service_type": "mon", "status": {"container_image_id": "1b23219043771c7aeaf383e73d414c625d03fae66d69ca7172f26ede96eefd1d", "container_image_name": "quay.ceph.io/ceph-ci/ceph:91fdab49fed87aa0a3dbbceccc27e84ab4f80130", "created": "2021-12-18T01:01:38.000000Z", "last_refresh": "2021-12-18T01:35:00.947402Z", "running": 1, "size": 1}}, {"service_name": "osd", "service_type": "osd", "spec": {"filter_logic": "AND", "objectstore": "bluestore"}, "status": {"container_image_id": "1b23219043771c7aeaf383e73d414c625d03fae66d69ca7172f26ede96eefd1d", "container_image_name": "quay.ceph.io/ceph-ci/ceph:91fdab49fed87aa0a3dbbceccc27e84ab4f80130", "created": "2021-12-18T01:03:41.000000Z", "last_refresh": "2021-12-18T01:35:00.947402Z", "running": 8, "size": 4}, "unmanaged": true}, {"placement": {"host_pattern": "*"}, "service_id": "all-available-devices", "service_name": "osd.all-available-devices", "service_type": "osd", "spec": {"data_devices": {"all": true}, "filter_logic": "AND", "objectstore": "bluestore"}, "status": {"last_refresh": "2021-12-18T01:35:00.947402Z", "running": 0, "size": 0}}, {"placement": {"count": 1}, "service_id": "foo", "service_name": "rgw.foo", "service_type": "rgw", "spec": {"rgw_frontend_port": 80}, "status": {"container_image_id": "1b23219043771c7aeaf383e73d414c625d03fae66d69ca7172f26ede96eefd1d", "container_image_name": "quay.ceph.io/ceph-ci/ceph:91fdab49fed87aa0a3dbbceccc27e84ab4f80130", "created": "2021-12-18T01:13:57.000000Z", "last_refresh": "2021-12-18T01:35:00.947402Z", "running": 1, "size": 1}}]
2021-12-18T01:35:01.064 ERROR:teuthology.contextutil:Saw exception from nested tasks
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_95a7d4799b562f3bbb5ec66107094963abd62fa1/teuthology/contextutil.py", line 33, in nested
    yield vars
  File "/home/teuthworker/src/github.com_ceph_ceph-c_91fdab49fed87aa0a3dbbceccc27e84ab4f80130/qa/tasks/rook.py", line 669, in task
    while proceed():
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_95a7d4799b562f3bbb5ec66107094963abd62fa1/teuthology/contextutil.py", line 133, in __call__
    raise MaxWhileTries(error_msg)
teuthology.exceptions.MaxWhileTries: 'waiting for service removal' reached maximum tries (90) after waiting for 900 seconds
2021-12-18T01:35:01.065 ERROR:tasks.rook:'waiting for service removal' reached maximum tries (90) after waiting for 900 seconds
Traceback (most recent call last):
  File "/home/teuthworker/src/github.com_ceph_ceph-c_91fdab49fed87aa0a3dbbceccc27e84ab4f80130/qa/tasks/rook.py", line 530, in ceph_config_keyring
    yield
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_95a7d4799b562f3bbb5ec66107094963abd62fa1/teuthology/contextutil.py", line 33, in nested
    yield vars
  File "/home/teuthworker/src/github.com_ceph_ceph-c_91fdab49fed87aa0a3dbbceccc27e84ab4f80130/qa/tasks/rook.py", line 669, in task
    while proceed():
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_95a7d4799b562f3bbb5ec66107094963abd62fa1/teuthology/contextutil.py", line 133, in __call__
    raise MaxWhileTries(error_msg)
teuthology.exceptions.MaxWhileTries: 'waiting for service removal' reached maximum tries (90) after waiting for 900 seconds
2021-12-18T01:35:01.065 INFO:tasks.rook:Cleaning up config and client.admin keyring
2021-12-18T01:35:01.066 DEBUG:teuthology.orchestra.run.smithi185:> sudo rm -f /etc/ceph/ceph.conf /etc/ceph/ceph.client.admin.keyring
2021-12-18T01:35:01.079 ERROR:tasks.rook:'waiting for service removal' reached maximum tries (90) after waiting for 900 seconds
Traceback (most recent call last):
  File "/home/teuthworker/src/github.com_ceph_ceph-c_91fdab49fed87aa0a3dbbceccc27e84ab4f80130/qa/tasks/rook.py", line 478, in rook_post_config
    yield
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_95a7d4799b562f3bbb5ec66107094963abd62fa1/teuthology/contextutil.py", line 33, in nested
    yield vars
  File "/home/teuthworker/src/github.com_ceph_ceph-c_91fdab49fed87aa0a3dbbceccc27e84ab4f80130/qa/tasks/rook.py", line 669, in task
    while proceed():
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_95a7d4799b562f3bbb5ec66107094963abd62fa1/teuthology/contextutil.py", line 133, in __call__
    raise MaxWhileTries(error_msg)
teuthology.exceptions.MaxWhileTries: 'waiting for service removal' reached maximum tries (90) after waiting for 900 seconds
2021-12-18T01:35:01.080 ERROR:tasks.rook:'waiting for service removal' reached maximum tries (90) after waiting for 900 seconds
Traceback (most recent call last):
  File "/home/teuthworker/src/github.com_ceph_ceph-c_91fdab49fed87aa0a3dbbceccc27e84ab4f80130/qa/tasks/rook.py", line 442, in rook_toolbox
    yield
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_95a7d4799b562f3bbb5ec66107094963abd62fa1/teuthology/contextutil.py", line 33, in nested
    yield vars
  File "/home/teuthworker/src/github.com_ceph_ceph-c_91fdab49fed87aa0a3dbbceccc27e84ab4f80130/qa/tasks/rook.py", line 669, in task
    while proceed():
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_95a7d4799b562f3bbb5ec66107094963abd62fa1/teuthology/contextutil.py", line 133, in __call__
    raise MaxWhileTries(error_msg)
teuthology.exceptions.MaxWhileTries: 'waiting for service removal' reached maximum tries (90) after waiting for 900 seconds
2021-12-18T01:35:01.141 DEBUG:teuthology.orchestra.remote:smithi185:rook/cluster/examples/kubernetes/ceph/operator.yaml is 22KB
2021-12-18T01:35:01.191 DEBUG:teuthology.orchestra.run.smithi185:> kubectl delete -f rook/cluster/examples/kubernetes/ceph/toolbox.yaml
2021-12-18T01:35:01.255 INFO:teuthology.orchestra.run.smithi185.stdout:deployment.apps "rook-ceph-tools" deleted
2021-12-18T01:35:01.287 ERROR:tasks.rook:'waiting for service removal' reached maximum tries (90) after waiting for 900 seconds
Traceback (most recent call last):
  File "/home/teuthworker/src/github.com_ceph_ceph-c_91fdab49fed87aa0a3dbbceccc27e84ab4f80130/qa/tasks/rook.py", line 379, in rook_cluster
    yield
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_95a7d4799b562f3bbb5ec66107094963abd62fa1/teuthology/contextutil.py", line 33, in nested
    yield vars
  File "/home/teuthworker/src/github.com_ceph_ceph-c_91fdab49fed87aa0a3dbbceccc27e84ab4f80130/qa/tasks/rook.py", line 669, in task
    while proceed():
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_95a7d4799b562f3bbb5ec66107094963abd62fa1/teuthology/contextutil.py", line 133, in __call__
    raise MaxWhileTries(error_msg)
teuthology.exceptions.MaxWhileTries: 'waiting for service removal' reached maximum tries (90) after waiting for 900 seconds
2021-12-18T01:35:01.288 DEBUG:teuthology.orchestra.run.smithi185:> kubectl delete -f cluster.yaml
2021-12-18T01:35:01.380 INFO:teuthology.orchestra.run.smithi185.stdout:cephcluster.ceph.rook.io "rook-ceph" deleted
2021-12-18T01:35:01.383 INFO:tasks.rook.operator.smithi185.stdout:2021-12-18 01:35:01.383570 I | ceph-cluster-controller: CR "rook-ceph" is going be deleted, cancelling any ongoing orchestration
2021-12-18T01:35:01.685 INFO:tasks.rook.operator.smithi185.stdout:2021-12-18 01:35:01.685102 I | ceph-cluster-controller: CephCluster "rook-ceph/rook-ceph" will not be deleted until all dependents are removed: CephObjectStores: [foo]
2021-12-18T01:35:01.696 INFO:tasks.rook.operator.smithi185.stdout:2021-12-18 01:35:01.696075 E | ceph-cluster-controller: failed to reconcile CephCluster "rook-ceph/rook-ceph". CephCluster "rook-ceph/rook-ceph" will not be deleted until all dependents are removed: CephObjectStores: [foo]
2021-12-18T01:35:01.696 INFO:tasks.rook.operator.smithi185.stdout:2021-12-18 01:35:01.696111 I | op-k8sutil: Reporting Event rook-ceph:rook-ceph Warning:ReconcileFailed:CephCluster "rook-ceph/rook-ceph" will not be deleted until all dependents are removed: CephObjectStores: [foo]
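
For reference, the timeout comes from the teardown path in qa/tasks/rook.py: after removing services through the orchestrator, the task polls `ceph orch ls` inside teuthology's `safe_while` helper until the removed services disappear from the listing. In this run `rgw.foo` and `osd.all-available-devices` are still listed at the final poll (see the JSON above), so the loop exhausts its tries. Below is a minimal sketch of that polling pattern, not the actual task code; `orch_ls` is a hypothetical callable standing in for the task's shell call.

```python
# Minimal sketch of the retry pattern that raises MaxWhileTries, assuming the
# general shape of qa/tasks/rook.py. `orch_ls` is a hypothetical callable that
# returns the JSON output of "ceph orch ls --format json".
import json

from teuthology.contextutil import safe_while


def wait_for_service_removal(orch_ls, keep=frozenset({'crash', 'mgr', 'mon', 'osd'})):
    # 90 tries x 10 second sleep == the "900 seconds" reported in the failure.
    with safe_while(sleep=10, tries=90,
                    action='waiting for service removal') as proceed:
        while proceed():
            remaining = {svc['service_name'] for svc in json.loads(orch_ls())}
            if remaining <= keep:
                return
            # In this run rgw.foo (and osd.all-available-devices) never leave
            # the listing, so proceed() eventually raises
            # teuthology.exceptions.MaxWhileTries.
```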
Actions #1

Updated by Kamoltat (Junior) Sirivadhna over 2 years ago

/a/yuriw-2021-12-21_18:01:07-rados-wip-yuri3-testing-2021-12-21-0749-distro-default-smithi/6576218/

Actions #2

Updated by Sebastian Wagner over 2 years ago

  • Assignee set to Joseph Sawaya
Actions #3

Updated by Laura Flores over 2 years ago

/a/yuriw-2022-01-04_21:52:15-rados-wip-yuri7-testing-2022-01-04-1159-distro-default-smithi/6595518

Actions #4

Updated by Joseph Sawaya over 2 years ago

The first two logs are due to this ListBuckets call failing in the RGW pod: https://github.com/rook/rook/blob/0d8fd9d8a47799fbb2607fded7bab757fee2fd6a/pkg/operator/ceph/object/dependents.go#L97. This is the error we get when that happens:

failed to reconcile CephObjectStore "rook-ceph/foo". failed to get dependents of CephObjectStore "rook-ceph/foo": failed to list buckets in CephObjectStore "rook-ceph/foo": Get "http://rook-ceph-rgw-foo.rook-ceph.svc:80/admin/bucket?": dial tcp 10.98.24.136:80: connect: connection refused
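
The "connection refused" against rook-ceph-rgw-foo.rook-ceph.svc suggests the RGW pod behind that service wasn't up, or had no endpoints, when the dependents check ran. A quick way to confirm that from the node is sketched below; the `app=rook-ceph-rgw` selector is Rook's usual labelling convention and is an assumption, not something taken from this log.

```python
# Hedged sketch: check whether the service the dependents check dials has a
# running RGW pod and endpoints behind it. The "app=rook-ceph-rgw" selector is
# assumed from Rook's usual labelling, not read from this log.
import subprocess

for cmd in (
    ['kubectl', '-n', 'rook-ceph', 'get', 'pods', '-l', 'app=rook-ceph-rgw', '-o', 'wide'],
    ['kubectl', '-n', 'rook-ceph', 'get', 'endpoints', 'rook-ceph-rgw-foo'],
):
    subprocess.run(cmd, check=False)
```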

We also get another error earlier in the log saying that multisite can't be created for the CephObjectStore, but that one goes away, and we end up with the error above when the CephObjectStore tries to reconcile:

failed to reconcile CephObjectStore "rook-ceph/foo". failed to create object store deployments: failed to configure multisite for object store: failed create ceph multisite for object-store ["foo"]: failed to update period%!(EXTRA []string=[]): exit status 2
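
For context, the multisite setup that fails here is Rook driving `radosgw-admin` to create a realm, zonegroup, and zone named after the object store and then committing a period; "failed to update period" is that last step exiting non-zero. A rough sketch of the sequence follows; the exact flags Rook passes are an assumption, not copied from its source.

```python
# Rough sketch of the radosgw-admin sequence behind the multisite error; the
# realm/zonegroup/zone names follow the object store name "foo", and the exact
# flags are an assumption rather than a copy of Rook's code.
import subprocess


def rgw_admin(*args):
    subprocess.run(['radosgw-admin', *args], check=True)


rgw_admin('realm', 'create', '--rgw-realm=foo', '--default')
rgw_admin('zonegroup', 'create', '--rgw-zonegroup=foo', '--master', '--default')
rgw_admin('zone', 'create', '--rgw-zonegroup=foo', '--rgw-zone=foo', '--master', '--default')
rgw_admin('period', 'update', '--commit')  # the step reported as "failed to update period"
```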

In the third log we're getting a different error related to the creation of the "rgw-admin-ops-user", and it looks like it signifies an error with the radosgw-admin command. This is the error we get when that happens:

failed to reconcile CephObjectStore "rook-ceph/foo". failed to check for object buckets. failed to get admin ops API context: failed to create or retrieve rgw admin ops user: failed to create object user "rgw-admin-ops-user". error code 1 for object store "foo": failed to create s3 user. 2022-01-05T03:22:34.737+0000 7f4cb28aa340 0 failed reading zonegroup info: ret -2 (2) No such file or directory
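
The "failed reading zonegroup info: ret -2 (2) No such file or directory" comes from `radosgw-admin` itself, i.e. the zonegroup metadata wasn't readable yet when Rook tried to create its admin ops user. A minimal sketch of that user-creation step is below; the display name and caps are assumptions about what Rook grants, not values taken from this log or from Rook's source.

```python
# Hedged sketch of the step that fails in the third log: creating Rook's
# "rgw-admin-ops-user" and granting it admin caps. The display name and caps
# below are assumptions, not copied from Rook's code.
import subprocess

subprocess.run(
    ['radosgw-admin', 'user', 'create',
     '--uid=rgw-admin-ops-user',
     '--display-name=RGW Admin Ops User'],
    check=True,
)
subprocess.run(
    ['radosgw-admin', 'caps', 'add',
     '--uid=rgw-admin-ops-user',
     '--caps=buckets=*;users=*;usage=read;metadata=read;zone=read'],
    check=True,
)
```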

To me it seems the issue isn't caused by the orchestrator or by Rook, since there were no changes to the orchestrator or to Rook 1.7.2, and the error changed between runs. If the same error persists, then this warrants another look.

Actions #5

Updated by Laura Flores over 2 years ago

Thanks Joseph. I frequently review teuthology runs, so I'll update this tracker if the problem persists. Hopefully if there are more occurrences, we can narrow down the root cause.

Actions #6

Updated by Laura Flores over 2 years ago

Happened again here: /a/yuriw-2022-01-13_14:57:55-rados-wip-yuri5-testing-2022-01-12-1534-distro-default-smithi/6612758

Actions #7

Updated by Laura Flores about 2 years ago

/a/yuriw-2022-02-21_15:40:41-rados-wip-yuri4-testing-2022-02-18-0800-distro-default-smithi/6698305
/a/yuriw-2022-02-21_15:40:41-rados-wip-yuri4-testing-2022-02-18-0800-distro-default-smithi/6698462
/a/yuriw-2022-02-21_15:40:41-rados-wip-yuri4-testing-2022-02-18-0800-distro-default-smithi/6698542
/a/yuriw-2022-02-22_16:14:07-rados-wip-yuri4-testing-2022-02-18-0800-distro-default-smithi/6700744
/a/yuriw-2022-02-22_16:14:07-rados-wip-yuri4-testing-2022-02-18-0800-distro-default-smithi/6700746
/a/yuriw-2022-02-22_16:14:07-rados-wip-yuri4-testing-2022-02-18-0800-distro-default-smithi/6700752
/a/yuriw-2022-02-22_16:14:07-rados-wip-yuri4-testing-2022-02-18-0800-distro-default-smithi/6700754

Actions #8

Updated by Laura Flores about 2 years ago

/a/yuriw-2022-03-01_22:42:19-rados-wip-yuri4-testing-2022-03-01-1206-distro-default-smithi/6715405

Actions #9

Updated by Laura Flores about 2 years ago

  • Priority changed from Normal to High

Upping the priority of this because it is failing a lot in the rados suite.

Actions #10

Updated by Laura Flores about 2 years ago

/a/dgalloway-2022-03-09_02:34:58-rados-wip-45272-distro-basic-smithi/6727547

Actions #11

Updated by Laura Flores about 2 years ago

/a/yuriw-2022-03-10_01:04:51-rados-wip-yuri5-testing-2022-03-07-0958-distro-default-smithi/6728619

Actions #12

Updated by Aishwarya Mathuria about 2 years ago

/a/yuriw-2022-03-14_18:47:44-rados-wip-yuri3-testing-2022-03-14-0946-distro-default-smithi/6736585
/a/yuriw-2022-03-14_18:47:44-rados-wip-yuri3-testing-2022-03-14-0946-distro-default-smithi/6736425

Actions #13

Updated by Sridhar Seshasayee about 2 years ago

/a/yuriw-2022-03-16_20:38:07-rados-wip-yuri3-testing-2022-03-16-1030-distro-default-smithi/6739268

Actions #14

Updated by Joseph Sawaya about 2 years ago

Neha brought it to my attention that this issue is still coming up, so I'm going to attempt to fix it by removing the orchestrator commands from the test suite, so that the rook suite just creates a Rook cluster and runs radosbench. The Rook orchestrator is not being maintained at the moment, so I think it's safe to remove testing for it. Since the broken orchestrator could easily be the cause of this issue, it makes sense to drop it from the test suite for now and think of a more appropriate solution for running radosbench on Rook in teuthology later.

Actions #15

Updated by Neha Ojha about 2 years ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 45749
Actions #16

Updated by Aishwarya Mathuria about 2 years ago

/a/yuriw-2022-04-06_16:35:43-rados-wip-yuri5-testing-2022-04-05-1720-distro-default-smithi/6779888

Actions #17

Updated by Laura Flores about 2 years ago

  • Status changed from Fix Under Review to Resolved