Bug #49888
rados/singleton: radosbench.py: teuthology.exceptions.MaxWhileTries: reached maximum tries (3650) after waiting for 21900 seconds
Description
2021-03-18T15:45:54.926 ERROR:teuthology.run_tasks:Manager failed: radosbench
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_b96569170f15eae4604f361990ea65737b28dff1/teuthology/run_tasks.py", line 176, in run_tasks
    suppress = manager.__exit__(*exc_info)
  File "/usr/lib/python3.6/contextlib.py", line 88, in __exit__
    next(self.gen)
  File "/home/teuthworker/src/github.com_ceph_ceph-c_3ae49e85764e073e154cc3617a04d1d62c9897ac/qa/tasks/radosbench.py", line 140, in task
    run.wait(radosbench.values(), timeout=timeout)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_b96569170f15eae4604f361990ea65737b28dff1/teuthology/orchestra/run.py", line 473, in wait
    check_time()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_b96569170f15eae4604f361990ea65737b28dff1/teuthology/contextutil.py", line 133, in __call__
    raise MaxWhileTries(error_msg)
teuthology.exceptions.MaxWhileTries: reached maximum tries (3650) after waiting for 21900 seconds
2021-03-18T15:45:54.939 DEBUG:teuthology.run_tasks:Unwinding manager thrashosds
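The numbers in the exception line up with a polling loop: 3650 tries at 6 seconds each is the 21900 seconds reported above. The following is a minimal sketch of that retry/timeout pattern, not the actual teuthology implementation; the function name, the 6-second default, and the toy condition are assumptions for illustration only.

# A minimal sketch (not the actual teuthology code) of the polling pattern
# behind this error: 3650 tries at 6 seconds each gives the 21900 seconds
# reported above. Names and defaults here are assumptions.
import math
import time


class MaxWhileTries(Exception):
    """Raised when the polled condition never becomes true in time."""


def wait_until(condition, timeout, sleep=6):
    """Poll condition() every `sleep` seconds, giving up after ~`timeout` seconds."""
    tries = math.ceil(timeout / sleep)  # e.g. 21900 / 6 = 3650 tries
    for _ in range(tries):
        if condition():
            return
        time.sleep(sleep)
    raise MaxWhileTries(
        "reached maximum tries (%d) after waiting for %d seconds"
        % (tries, tries * sleep)
    )


# Usage sketch: radosbench.py effectively waits like this for the
# `rados bench` client processes to finish; here the condition never
# becomes true, so MaxWhileTries is raised once the tries are exhausted.
if __name__ == "__main__":
    wait_until(lambda: False, timeout=12, sleep=1)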
rados/singleton/{all/thrash-eio mon_election/connectivity msgr-failures/many msgr/async-v1only objectstore/bluestore-stupid rados supported-random-distro$/{centos_8}}
/a/yuriw-2021-03-17_23:47:24-rados-wip-yuri6-testing-3.17.21-distro-basic-smithi/5975231
History
#1 Updated by Sridhar Seshasayee over 2 years ago
Observed the same failure here:
/a/sseshasa-2021-04-23_18:11:53-rados-wip-sseshasa-testing-2021-04-23-2212-distro-basic-smithi/6069205/
#2 Updated by Neha Ojha about 2 years ago
rados/singleton/{all/thrash-eio mon_election/classic msgr-failures/many msgr/async-v1only objectstore/bluestore-bitmap rados supported-random-distro$/{rhel_8}}
/a/benhanokh-2021-07-29_20:44:57-rados-wip_gbenhano_ncb_fix3-distro-basic-smithi/6301914
#3 Updated by Neha Ojha about 2 years ago
- Subject changed from qa/tasks/radosbench.py: teuthology.exceptions.MaxWhileTries: reached maximum tries (3650) after waiting for 21900 seconds to rados/singleton: radosbench.py: teuthology.exceptions.MaxWhileTries: reached maximum tries (3650) after waiting for 21900 seconds
rados/singleton/{all/thrash-eio mon_election/connectivity msgr-failures/many msgr/async-v1only objectstore/bluestore-stupid rados supported-random-distro$/{centos_8}}
/a/yuriw-2021-08-06_16:31:19-rados-wip-yuri-master-8.6.21-distro-basic-smithi/6324604
This only happens with rados/singleton, and every time it occurs the job goes dead.
#4 Updated by Aishwarya Mathuria about 2 years ago
/a/yuriw-2021-08-27_21:20:08-rados-wip-yuri2-testing-2021-08-27-1207-distro-basic-smithi/6363835
#5 Updated by Laura Flores over 1 year ago
/a/yuriw-2022-03-19_14:37:23-rados-wip-yuri6-testing-2022-03-18-1104-distro-default-smithi/6746760
#6 Updated by Radoslaw Zarzynski over 1 year ago
A note from the bug scrub: although this is a client-side symptom (which might be the result of a backend failure or an issue in the test itself), there is a pattern among these recurrences: rados/singleton/{all/thrash-eio.
#7 Updated by Laura Flores over 1 year ago
/a/yuriw-2022-06-02_00:50:42-rados-wip-yuri4-testing-2022-06-01-1350-pacific-distro-default-smithi/6859734
Description: rados/singleton/{all/thrash-eio mon_election/connectivity msgr-failures/many msgr/async-v1only objectstore/bluestore-comp-lz4 rados supported-random-distro$/{ubuntu_latest}}
#8 Updated by Radoslaw Zarzynski over 1 year ago
Not high priority. Possibly a test issue.
#9 Updated by Matan Breizman about 1 year ago
/a/yuriw-2022-08-15_17:54:08-rados-wip-yuri2-testing-2022-08-15-0848-quincy-distro-default-smithi/6973760
#10 Updated by Kamoltat (Junior) Sirivadhna about 1 year ago
/a/yuriw-2022-09-27_23:37:28-rados-wip-yuri2-testing-2022-09-27-1455-distro-default-smithi/7046230/
#11 Updated by Laura Flores 5 months ago
- Tags set to test-failure
/a/yuriw-2023-04-24_22:54:45-rados-wip-yuri7-testing-2023-04-19-1343-distro-default-smithi/7250551
Something interesting at the end of the mgr log:
/a/yuriw-2023-04-24_22:54:45-rados-wip-yuri7-testing-2023-04-19-1343-distro-default-smithi/7250551/remote/smithi121/log/ceph-mgr.x.log.gz
2023-04-27T23:51:41.761+0000 7f5d406fa640 10 mgr.server operator() 24 pgs: 3 active+recovery_unfound+undersized+degraded+remapped, 21 active+clean; 4.0 GiB data, 8.8 GiB used, 441 GiB / 450 GiB avail; 24/8218 objects degraded (0.292%); 12/4109 objects unfound (0.292%)
2023-04-27T23:51:41.761+0000 7f5d406fa640 10 mgr.server operator() 4 health checks
2023-04-27T23:51:41.761+0000 7f5d406fa640 20 mgr.server operator() health checks:
{
    "OBJECT_UNFOUND": {
        "severity": "HEALTH_WARN",
        "summary": {
            "message": "12/4109 objects unfound (0.292%)",
            "count": 12
        },
        "detail": [
            {
                "message": "pg 2.9 has 3 unfound objects"
            },
            {
                "message": "pg 2.2 has 4 unfound objects"
            },
            {
                "message": "pg 2.6 has 5 unfound objects"
            }
        ]
    },
    "OSD_TOO_MANY_REPAIRS": {
        "severity": "HEALTH_WARN",
        "summary": {
            "message": "Too many repaired reads on 1 OSDs",
            "count": 1
        },
        "detail": [
            {
                "message": "osd.0 had 16 reads repaired"
            }
        ]
    },
    "PG_DAMAGED": {
        "severity": "HEALTH_ERR",
        "summary": {
            "message": "Possible data damage: 3 pgs recovery_unfound",
            "count": 3
        },
        "detail": [
            {
                "message": "pg 2.2 is active+recovery_unfound+undersized+degraded+remapped, acting [0], 4 unfound"
            },
            {
                "message": "pg 2.6 is active+recovery_unfound+undersized+degraded+remapped, acting [0], 5 unfound"
            },
            {
                "message": "pg 2.9 is active+recovery_unfound+undersized+degraded+remapped, acting [0], 3 unfound"
            }
        ]
    },
    "PG_DEGRADED": {
        "severity": "HEALTH_WARN",
        "summary": {
            "message": "Degraded data redundancy: 24/8218 objects degraded (0.292%), 3 pgs degraded, 3 pgs undersized",
            "count": 6
        },
        "detail": [
            {
                "message": "pg 2.2 is stuck undersized for 11h, current state active+recovery_unfound+undersized+degraded+remapped, last acting [0]"
            },
            {
                "message": "pg 2.6 is stuck undersized for 11h, current state active+recovery_unfound+undersized+degraded+remapped, last acting [0]"
            },
            {
                "message": "pg 2.9 is stuck undersized for 11h, current state active+recovery_unfound+undersized+degraded+remapped, last acting [0]"
            }
        ]
    }
}
#12 Updated by Laura Flores 5 months ago
/a/yuriw-2023-04-26_01:16:19-rados-wip-yuri11-testing-2023-04-25-1605-pacific-distro-default-smithi/7253869
/a/yuriw-2023-05-01_16:31:04-rados-wip-yuri11-testing-2023-04-25-1605-pacific-distro-default-smithi/7259839
#13 Updated by Nitzan Mordechai 5 months ago
- Assignee set to Nitzan Mordechai
#14 Updated by Nitzan Mordechai 5 months ago
It looks like all the failures are related to thrash-eio. I checked all the archived runs we still have, and all of them show the same symptoms: recovery is stuck and won't complete because we injected EIO on an object, but the object was lost during the random write on the primary and the secondary also marked it as missing.
Each of these tests uses a pool size of 2 replicas, which doesn't look like enough given the EIO injection probability (0.33); we should increase the replica size (a rough illustration follows below).
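As a back-of-envelope illustration of the replica-size argument, assuming purely for the sake of arithmetic that each copy is hit independently with the quoted injection probability of 0.33 (the real failure also involves the object being lost mid-write, so this is a sense of scale, not a model):

# Illustration only: treat each replica as independently hit by EIO
# injection with probability 0.33 (the figure quoted above).
p_eio = 0.33

for size in (2, 3):
    print("size=%d: P(all copies hit) ~= %.3f" % (size, p_eio ** size))

# Expected output:
# size=2: P(all copies hit) ~= 0.109
# size=3: P(all copies hit) ~= 0.036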
In thrash-eio.yaml we have:
override:
  ceph:
    conf:
      mon:
        osd default pool size: 3
I tried to find any other reference to "osd default pool size" but couldn't find it in the code; I only found it in 3 other yaml files.
It seems to be a typo or misconfiguration of "osd pool default size"; does that make sense?
#15 Updated by Nitzan Mordechai 5 months ago
- Status changed from New to In Progress
If my above comment is correct, teuthology also has that incorrect configuration as a default; in placeholder.py we thought we were setting it correctly:
'ceph-deploy': {
    'conf': {
        'client': {
            'log file': '/var/log/ceph/ceph-$name.$pid.log'
        },
        'mon': {
            'osd default pool size': 2
        }
    }
}
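For comparison, this is presumably the shape the fragment takes once the key is renamed to the real mon option (the actual change is tracked in the PR below; this sketch is not the real diff, and the wrapping dict is only there to make it valid standalone Python):

# Sketch of the presumed correction: "osd pool default size" is the
# recognized option name, so the misspelled "osd default pool size"
# never takes effect. Not the actual placeholder.py diff.
config = {
    'ceph-deploy': {
        'conf': {
            'client': {
                'log file': '/var/log/ceph/ceph-$name.$pid.log'
            },
            'mon': {
                'osd pool default size': 2
            }
        }
    }
}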
#16 Updated by Nitzan Mordechai 5 months ago
- Pull request ID set to 51527
#17 Updated by Nitzan Mordechai 5 months ago
- Status changed from In Progress to Fix Under Review
#19 Updated by Laura Flores 5 months ago
/a/yuriw-2023-05-16_23:44:06-rados-wip-yuri10-testing-2023-05-16-1243-distro-default-smithi/7276255
Hey Nitzan, yes, your analysis makes sense. Thanks for taking a look!
#20 Updated by Laura Flores 4 months ago
/a/yuriw-2023-05-23_22:39:17-rados-wip-yuri6-testing-2023-05-23-0757-reef-distro-default-smithi/7284941
#21 Updated by Radoslaw Zarzynski 4 months ago
- Status changed from Fix Under Review to Pending Backport
#22 Updated by Backport Bot 4 months ago
- Copied to Backport #61447: reef: rados/singleton: radosbench.py: teuthology.exceptions.MaxWhileTries: reached maximum tries (3650) after waiting for 21900 seconds added
#23 Updated by Backport Bot 4 months ago
- Copied to Backport #61448: pacific: rados/singleton: radosbench.py: teuthology.exceptions.MaxWhileTries: reached maximum tries (3650) after waiting for 21900 seconds added
#24 Updated by Backport Bot 4 months ago
- Copied to Backport #61449: quincy: rados/singleton: radosbench.py: teuthology.exceptions.MaxWhileTries: reached maximum tries (3650) after waiting for 21900 seconds added
#25 Updated by Backport Bot 4 months ago
- Tags set to backport_processed