Bug #49888
rados/singleton: radosbench.py: teuthology.exceptions.MaxWhileTries: reached maximum tries (3650) after waiting for 21900 seconds
Status: Closed
% Done: 100%
Description
2021-03-18T15:45:54.926 ERROR:teuthology.run_tasks:Manager failed: radosbench
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_b96569170f15eae4604f361990ea65737b28dff1/teuthology/run_tasks.py", line 176, in run_tasks
    suppress = manager.__exit__(*exc_info)
  File "/usr/lib/python3.6/contextlib.py", line 88, in __exit__
    next(self.gen)
  File "/home/teuthworker/src/github.com_ceph_ceph-c_3ae49e85764e073e154cc3617a04d1d62c9897ac/qa/tasks/radosbench.py", line 140, in task
    run.wait(radosbench.values(), timeout=timeout)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_b96569170f15eae4604f361990ea65737b28dff1/teuthology/orchestra/run.py", line 473, in wait
    check_time()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_b96569170f15eae4604f361990ea65737b28dff1/teuthology/contextutil.py", line 133, in __call__
    raise MaxWhileTries(error_msg)
teuthology.exceptions.MaxWhileTries: reached maximum tries (3650) after waiting for 21900 seconds
2021-03-18T15:45:54.939 DEBUG:teuthology.run_tasks:Unwinding manager thrashosds
rados/singleton/{all/thrash-eio mon_election/connectivity msgr-failures/many msgr/async-v1only objectstore/bluestore-stupid rados supported-random-distro$/{centos_8}}
/a/yuriw-2021-03-17_23:47:24-rados-wip-yuri6-testing-3.17.21-distro-basic-smithi/5975231
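For context on the numbers in the error message: 3650 tries at a 6-second polling interval is 3650 * 6 = 21900 seconds. A self-contained sketch of that retry pattern (modeled loosely on teuthology's safe_while loop; the names and signature here are illustrative, not teuthology's actual API):

    import time

    # Illustrative sketch (not teuthology's actual code): run.wait() polls a
    # condition via a safe_while-style helper, and with tries=3650 at a
    # 6-second sleep, 21900 seconds elapse before MaxWhileTries is raised.
    class MaxWhileTries(Exception):
        pass

    def wait_for(condition, sleep=6, tries=3650):
        waited = 0
        for _ in range(tries):
            if condition():
                return
            time.sleep(sleep)
            waited += sleep
        raise MaxWhileTries(
            f"reached maximum tries ({tries}) after waiting for {waited} seconds"
        )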
Updated by Sridhar Seshasayee about 3 years ago
Observed the same failure here:
/a/sseshasa-2021-04-23_18:11:53-rados-wip-sseshasa-testing-2021-04-23-2212-distro-basic-smithi/6069205/
Updated by Neha Ojha almost 3 years ago
rados/singleton/{all/thrash-eio mon_election/classic msgr-failures/many msgr/async-v1only objectstore/bluestore-bitmap rados supported-random-distro$/{rhel_8}}
/a/benhanokh-2021-07-29_20:44:57-rados-wip_gbenhano_ncb_fix3-distro-basic-smithi/6301914
Updated by Neha Ojha over 2 years ago
- Subject changed from qa/tasks/radosbench.py: teuthology.exceptions.MaxWhileTries: reached maximum tries (3650) after waiting for 21900 seconds to rados/singleton: radosbench.py: teuthology.exceptions.MaxWhileTries: reached maximum tries (3650) after waiting for 21900 seconds
rados/singleton/{all/thrash-eio mon_election/connectivity msgr-failures/many msgr/async-v1only objectstore/bluestore-stupid rados supported-random-distro$/{centos_8}}
/a/yuriw-2021-08-06_16:31:19-rados-wip-yuri-master-8.6.21-distro-basic-smithi/6324604
This only happens with rados/singleton, and every time it does, the job goes dead.
Updated by Aishwarya Mathuria over 2 years ago
/a/yuriw-2021-08-27_21:20:08-rados-wip-yuri2-testing-2021-08-27-1207-distro-basic-smithi/6363835
Updated by Laura Flores about 2 years ago
/a/yuriw-2022-03-19_14:37:23-rados-wip-yuri6-testing-2022-03-18-1104-distro-default-smithi/6746760
Updated by Radoslaw Zarzynski about 2 years ago
A note from the bug scrub: although this is a client-side symptom (which might be the result of a backend failure or an issue in the test itself), there is a pattern among these recurrences: all of them run rados/singleton/{all/thrash-eio.
Updated by Laura Flores almost 2 years ago
/a/yuriw-2022-06-02_00:50:42-rados-wip-yuri4-testing-2022-06-01-1350-pacific-distro-default-smithi/6859734
Description: rados/singleton/{all/thrash-eio mon_election/connectivity msgr-failures/many msgr/async-v1only objectstore/bluestore-comp-lz4 rados supported-random-distro$/{ubuntu_latest}}
Updated by Radoslaw Zarzynski almost 2 years ago
Not high priority. Possibly a test issue.
Updated by Matan Breizman over 1 year ago
/a/yuriw-2022-08-15_17:54:08-rados-wip-yuri2-testing-2022-08-15-0848-quincy-distro-default-smithi/6973760
Updated by Kamoltat (Junior) Sirivadhna over 1 year ago
/a/yuriw-2022-09-27_23:37:28-rados-wip-yuri2-testing-2022-09-27-1455-distro-default-smithi/7046230/
Updated by Laura Flores 12 months ago
- Tags set to test-failure
/a/yuriw-2023-04-24_22:54:45-rados-wip-yuri7-testing-2023-04-19-1343-distro-default-smithi/7250551
Something interesting at the end of the mgr log:
/a/yuriw-2023-04-24_22:54:45-rados-wip-yuri7-testing-2023-04-19-1343-distro-default-smithi/7250551/remote/smithi121/log/ceph-mgr.x.log.gz
2023-04-27T23:51:41.761+0000 7f5d406fa640 10 mgr.server operator() 24 pgs: 3 active+recovery_unfound+undersized+degraded+remapped, 21 active+clean; 4.0 GiB data, 8.8 GiB used, 441 GiB / 450 GiB avail; 24/8218 objects degraded (0.292%); 12/4109 objects unfound (0.292%)
2023-04-27T23:51:41.761+0000 7f5d406fa640 10 mgr.server operator() 4 health checks
2023-04-27T23:51:41.761+0000 7f5d406fa640 20 mgr.server operator() health checks:
{
"OBJECT_UNFOUND": {
"severity": "HEALTH_WARN",
"summary": {
"message": "12/4109 objects unfound (0.292%)",
"count": 12
},
"detail": [
{
"message": "pg 2.9 has 3 unfound objects"
},
{
"message": "pg 2.2 has 4 unfound objects"
},
{
"message": "pg 2.6 has 5 unfound objects"
}
]
},
"OSD_TOO_MANY_REPAIRS": {
"severity": "HEALTH_WARN",
"summary": {
"message": "Too many repaired reads on 1 OSDs",
"count": 1
},
"detail": [
{
"message": "osd.0 had 16 reads repaired"
}
]
},
"PG_DAMAGED": {
"severity": "HEALTH_ERR",
"summary": {
"message": "Possible data damage: 3 pgs recovery_unfound",
"count": 3
},
"detail": [
{
"message": "pg 2.2 is active+recovery_unfound+undersized+degraded+remapped, acting [0], 4 unfound"
},
{
"message": "pg 2.6 is active+recovery_unfound+undersized+degraded+remapped, acting [0], 5 unfound"
},
{
"message": "pg 2.9 is active+recovery_unfound+undersized+degraded+remapped, acting [0], 3 unfound"
}
]
},
"PG_DEGRADED": {
"severity": "HEALTH_WARN",
"summary": {
"message": "Degraded data redundancy: 24/8218 objects degraded (0.292%), 3 pgs degraded, 3 pgs undersized",
"count": 6
},
"detail": [
{
"message": "pg 2.2 is stuck undersized for 11h, current state active+recovery_unfound+undersized+degraded+remapped, last acting [0]"
},
{
"message": "pg 2.6 is stuck undersized for 11h, current state active+recovery_unfound+undersized+degraded+remapped, last acting [0]"
},
{
"message": "pg 2.9 is stuck undersized for 11h, current state active+recovery_unfound+undersized+degraded+remapped, last acting [0]"
}
]
}
}
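For anyone triaging similar runs, the health-check JSON above is easy to summarize mechanically. A minimal sketch (not part of the test suite) that parses a block of that shape; the abbreviated payload here is copied from the OBJECT_UNFOUND entry above:

    import json

    # Abbreviated copy of the OBJECT_UNFOUND check from the mgr log above;
    # the full payload also carries OSD_TOO_MANY_REPAIRS, PG_DAMAGED, and
    # PG_DEGRADED entries with the same shape.
    raw = '''{
      "OBJECT_UNFOUND": {
        "severity": "HEALTH_WARN",
        "summary": {"message": "12/4109 objects unfound (0.292%)", "count": 12},
        "detail": [{"message": "pg 2.9 has 3 unfound objects"},
                   {"message": "pg 2.2 has 4 unfound objects"},
                   {"message": "pg 2.6 has 5 unfound objects"}]
      }
    }'''

    for code, check in json.loads(raw).items():
        print(f"{check['severity']}: {code} - {check['summary']['message']}")
        for item in check.get("detail", []):
            print(f"  {item['message']}")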
Updated by Laura Flores 12 months ago
/a/yuriw-2023-04-26_01:16:19-rados-wip-yuri11-testing-2023-04-25-1605-pacific-distro-default-smithi/7253869
/a/yuriw-2023-05-01_16:31:04-rados-wip-yuri11-testing-2023-04-25-1605-pacific-distro-default-smithi/7259839
Updated by Nitzan Mordechai 12 months ago
It looks like all the failures are related to thrash-eio. I checked all of the archived runs we still have, and all of them show the same symptoms: recovery is still waiting and will never complete, since we injected EIO on an object, the object was then lost during the random write on the primary, and the secondary also marked it as missing.
Each of the tests runs with size 2 (two replicas), which does not look like enough given the EIO injection probability (0.33); we should increase the replica size.
In thrash-eio.yaml we have:

    override:
      ceph:
        conf:
          mon:
            osd default pool size: 3

I tried to find any other reference to "osd default pool size" but could not find it in the code; I did find it in 3 other yaml files.
It seems to be a typo or a misconfiguration of "osd pool default size". Does that make sense?
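If that is right, the mistyped key should be easy to find mechanically. A hypothetical helper (not in the ceph tree; the script and the qa/ path are assumptions for illustration) that scans the suite yaml files for the invalid option name:

    #!/usr/bin/env python3
    # Hypothetical helper (not in the ceph tree): scan qa/ for the mistyped
    # option. "osd default pool size" is not a valid Ceph option name, so an
    # override using it silently has no effect; the valid spelling is
    # "osd pool default size".
    import pathlib

    MISTYPED = "osd default pool size"

    for path in sorted(pathlib.Path("qa").rglob("*.yaml")):
        for lineno, line in enumerate(path.read_text().splitlines(), start=1):
            if MISTYPED in line:
                print(f"{path}:{lineno}: {line.strip()}")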
Updated by Nitzan Mordechai 12 months ago
- Status changed from New to In Progress
If my comment above is correct, teuthology also has that incorrect configuration as a default; in placeholder.py we thought we were setting it correctly:

    'ceph-deploy': {
        'conf': {
            'client': {
                'log file': '/var/log/ceph/ceph-$name.$pid.log'
            },
            'mon': {
                'osd default pool size': 2
            }
        }
    }
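For comparison, a sketch of what the corrected default would look like, assuming the intended option is "osd pool default size" (illustrative only, not the merged fix):

    # Sketch of the corrected teuthology default, assuming the intended
    # Ceph option is 'osd pool default size' rather than the unrecognized
    # 'osd default pool size'.
    corrected_defaults = {
        'ceph-deploy': {
            'conf': {
                'client': {
                    'log file': '/var/log/ceph/ceph-$name.$pid.log'
                },
                'mon': {
                    'osd pool default size': 2  # was 'osd default pool size'
                }
            }
        }
    }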
Updated by Nitzan Mordechai 12 months ago
- Status changed from In Progress to Fix Under Review
Updated by Laura Flores 12 months ago
/a/yuriw-2023-05-16_23:44:06-rados-wip-yuri10-testing-2023-05-16-1243-distro-default-smithi/7276255
Hey Nitzan, yes, your analysis makes sense. Thanks for taking a look!
Updated by Laura Flores 12 months ago
/a/yuriw-2023-05-23_22:39:17-rados-wip-yuri6-testing-2023-05-23-0757-reef-distro-default-smithi/7284941
Updated by Radoslaw Zarzynski 11 months ago
- Status changed from Fix Under Review to Pending Backport
Updated by Backport Bot 11 months ago
- Copied to Backport #61447: reef: rados/singleton: radosbench.py: teuthology.exceptions.MaxWhileTries: reached maximum tries (3650) after waiting for 21900 seconds added
Updated by Backport Bot 11 months ago
- Copied to Backport #61448: pacific: rados/singleton: radosbench.py: teuthology.exceptions.MaxWhileTries: reached maximum tries (3650) after waiting for 21900 seconds added
Updated by Backport Bot 11 months ago
- Copied to Backport #61449: quincy: rados/singleton: radosbench.py: teuthology.exceptions.MaxWhileTries: reached maximum tries (3650) after waiting for 21900 seconds added
Updated by Konstantin Shalygin 6 months ago
- Status changed from Pending Backport to Resolved
- % Done changed from 0 to 100