Bug #49888

rados/singleton: radosbench.py: teuthology.exceptions.MaxWhileTries: reached maximum tries (3650) after waiting for 21900 seconds

Added by Neha Ojha over 2 years ago. Updated 4 months ago.

Status:
Pending Backport
Priority:
Normal
Category:
-
Target version:
-
% Done:
0%

Source:
Tags:
backport_processed
Backport:
pacific,quincy,reef
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2021-03-18T15:45:54.926 ERROR:teuthology.run_tasks:Manager failed: radosbench
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_b96569170f15eae4604f361990ea65737b28dff1/teuthology/run_tasks.py", line 176, in run_tasks
    suppress = manager.__exit__(*exc_info)
  File "/usr/lib/python3.6/contextlib.py", line 88, in __exit__
    next(self.gen)
  File "/home/teuthworker/src/github.com_ceph_ceph-c_3ae49e85764e073e154cc3617a04d1d62c9897ac/qa/tasks/radosbench.py", line 140, in task
    run.wait(radosbench.values(), timeout=timeout)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_b96569170f15eae4604f361990ea65737b28dff1/teuthology/orchestra/run.py", line 473, in wait
    check_time()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_b96569170f15eae4604f361990ea65737b28dff1/teuthology/contextutil.py", line 133, in __call__
    raise MaxWhileTries(error_msg)
teuthology.exceptions.MaxWhileTries: reached maximum tries (3650) after waiting for 21900 seconds
2021-03-18T15:45:54.939 DEBUG:teuthology.run_tasks:Unwinding manager thrashosds

rados/singleton/{all/thrash-eio mon_election/connectivity msgr-failures/many msgr/async-v1only objectstore/bluestore-stupid rados supported-random-distro$/{centos_8}}

/a/yuriw-2021-03-17_23:47:24-rados-wip-yuri6-testing-3.17.21-distro-basic-smithi/5975231
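For context, the error in the traceback comes from teuthology's bounded-wait behaviour: 3650 tries at a 6-second polling interval is exactly 21900 seconds. The snippet below is a simplified, self-contained illustration of that pattern, not teuthology's actual contextutil/run.wait code; the helper name wait_until is made up for the example.

import time


class MaxWhileTries(Exception):
    """Raised when the polled condition never becomes true within the budget."""


def wait_until(check_done, interval=6, tries=3650):
    """Poll check_done() every `interval` seconds, giving up after `tries` attempts.

    3650 tries * 6 seconds = 21900 seconds, matching the numbers in the traceback.
    """
    for attempt in range(1, tries + 1):
        if check_done():
            return attempt
        time.sleep(interval)
    raise MaxWhileTries(
        "reached maximum tries (%d) after waiting for %d seconds"
        % (tries, tries * interval)
    )


if __name__ == "__main__":
    deadline = time.time() + 2
    # A condition that becomes true after about 2 seconds, polled every second.
    print("done after", wait_until(lambda: time.time() >= deadline, interval=1, tries=10), "tries")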


Related issues

Copied to RADOS - Backport #61447: reef: rados/singleton: radosbench.py: teuthology.exceptions.MaxWhileTries: reached maximum tries (3650) after waiting for 21900 seconds Resolved
Copied to RADOS - Backport #61448: pacific: rados/singleton: radosbench.py: teuthology.exceptions.MaxWhileTries: reached maximum tries (3650) after waiting for 21900 seconds In Progress
Copied to RADOS - Backport #61449: quincy: rados/singleton: radosbench.py: teuthology.exceptions.MaxWhileTries: reached maximum tries (3650) after waiting for 21900 seconds In Progress

History

#1 Updated by Sridhar Seshasayee over 2 years ago

Observed the same failure here:
/a/sseshasa-2021-04-23_18:11:53-rados-wip-sseshasa-testing-2021-04-23-2212-distro-basic-smithi/6069205/

#2 Updated by Neha Ojha about 2 years ago

rados/singleton/{all/thrash-eio mon_election/classic msgr-failures/many msgr/async-v1only objectstore/bluestore-bitmap rados supported-random-distro$/{rhel_8}}

/a/benhanokh-2021-07-29_20:44:57-rados-wip_gbenhano_ncb_fix3-distro-basic-smithi/6301914

#3 Updated by Neha Ojha about 2 years ago

  • Subject changed from qa/tasks/radosbench.py: teuthology.exceptions.MaxWhileTries: reached maximum tries (3650) after waiting for 21900 seconds to rados/singleton: radosbench.py: teuthology.exceptions.MaxWhileTries: reached maximum tries (3650) after waiting for 21900 seconds

rados/singleton/{all/thrash-eio mon_election/connectivity msgr-failures/many msgr/async-v1only objectstore/bluestore-stupid rados supported-random-distro$/{centos_8}}

/a/yuriw-2021-08-06_16:31:19-rados-wip-yuri-master-8.6.21-distro-basic-smithi/6324604

This only happens with rados/singleton, and every time the job goes dead.

#4 Updated by Aishwarya Mathuria about 2 years ago

/a/yuriw-2021-08-27_21:20:08-rados-wip-yuri2-testing-2021-08-27-1207-distro-basic-smithi/6363835

#5 Updated by Laura Flores over 1 year ago

/a/yuriw-2022-03-19_14:37:23-rados-wip-yuri6-testing-2022-03-18-1104-distro-default-smithi/6746760

#6 Updated by Radoslaw Zarzynski over 1 year ago

A note from the bug scrub: although this is a client-side symptom (which might be the result of a backend failure or an issue in the test itself), there is a pattern across these recurrences: all of them are rados/singleton/{all/thrash-eio jobs.

#7 Updated by Laura Flores over 1 year ago

/a/yuriw-2022-06-02_00:50:42-rados-wip-yuri4-testing-2022-06-01-1350-pacific-distro-default-smithi/6859734

Description: rados/singleton/{all/thrash-eio mon_election/connectivity msgr-failures/many msgr/async-v1only objectstore/bluestore-comp-lz4 rados supported-random-distro$/{ubuntu_latest}}

#8 Updated by Radoslaw Zarzynski over 1 year ago

Not high priority. Possibly a test issue.

#9 Updated by Matan Breizman about 1 year ago

/a/yuriw-2022-08-15_17:54:08-rados-wip-yuri2-testing-2022-08-15-0848-quincy-distro-default-smithi/6973760

#10 Updated by Kamoltat (Junior) Sirivadhna about 1 year ago

/a/yuriw-2022-09-27_23:37:28-rados-wip-yuri2-testing-2022-09-27-1455-distro-default-smithi/7046230/

#11 Updated by Laura Flores 5 months ago

  • Tags set to test-failure

/a/yuriw-2023-04-24_22:54:45-rados-wip-yuri7-testing-2023-04-19-1343-distro-default-smithi/7250551

Something interesting at the end of the mgr log:
/a/yuriw-2023-04-24_22:54:45-rados-wip-yuri7-testing-2023-04-19-1343-distro-default-smithi/7250551/remote/smithi121/log/ceph-mgr.x.log.gz

2023-04-27T23:51:41.761+0000 7f5d406fa640 10 mgr.server operator() 24 pgs: 3 active+recovery_unfound+undersized+degraded+remapped, 21 active+clean; 4.0 GiB data, 8.8 GiB used, 441 GiB / 450 GiB avail; 24/8218 objects degraded (0.292%); 12/4109 objects unfound (0.292%)
2023-04-27T23:51:41.761+0000 7f5d406fa640 10 mgr.server operator() 4 health checks
2023-04-27T23:51:41.761+0000 7f5d406fa640 20 mgr.server operator() health checks:
{
    "OBJECT_UNFOUND": {
        "severity": "HEALTH_WARN",
        "summary": {
            "message": "12/4109 objects unfound (0.292%)",
            "count": 12
        },
        "detail": [
            {
                "message": "pg 2.9 has 3 unfound objects" 
            },
            {
                "message": "pg 2.2 has 4 unfound objects" 
            },
            {
                "message": "pg 2.6 has 5 unfound objects" 
            }
        ]
    },
    "OSD_TOO_MANY_REPAIRS": {
        "severity": "HEALTH_WARN",
        "summary": {
            "message": "Too many repaired reads on 1 OSDs",
            "count": 1
        },
        "detail": [
            {
                "message": "osd.0 had 16 reads repaired" 
            }
        ]
    },
    "PG_DAMAGED": {
        "severity": "HEALTH_ERR",
        "summary": {
            "message": "Possible data damage: 3 pgs recovery_unfound",
            "count": 3
        },
        "detail": [
            {
                "message": "pg 2.2 is active+recovery_unfound+undersized+degraded+remapped, acting [0], 4 unfound" 
            },
            {
                "message": "pg 2.6 is active+recovery_unfound+undersized+degraded+remapped, acting [0], 5 unfound" 
            },
            {
                "message": "pg 2.9 is active+recovery_unfound+undersized+degraded+remapped, acting [0], 3 unfound" 
            }
        ]
    },
    "PG_DEGRADED": {
        "severity": "HEALTH_WARN",
        "summary": {
            "message": "Degraded data redundancy: 24/8218 objects degraded (0.292%), 3 pgs degraded, 3 pgs undersized",
            "count": 6
        },
        "detail": [
            {
                "message": "pg 2.2 is stuck undersized for 11h, current state active+recovery_unfound+undersized+degraded+remapped, last acting [0]" 
            },
            {
                "message": "pg 2.6 is stuck undersized for 11h, current state active+recovery_unfound+undersized+degraded+remapped, last acting [0]" 
            },
            {
                "message": "pg 2.9 is stuck undersized for 11h, current state active+recovery_unfound+undersized+degraded+remapped, last acting [0]" 
            }
        ]
    }
}

#12 Updated by Laura Flores 5 months ago

/a/yuriw-2023-04-26_01:16:19-rados-wip-yuri11-testing-2023-04-25-1605-pacific-distro-default-smithi/7253869
/a/yuriw-2023-05-01_16:31:04-rados-wip-yuri11-testing-2023-04-25-1605-pacific-distro-default-smithi/7259839

#13 Updated by Nitzan Mordechai 5 months ago

  • Assignee set to Nitzan Mordechai

#14 Updated by Nitzan Mordechai 5 months ago

Looks like all the failures are related to thrash-eio. I checked all the archived runs that are still available, and all of them show the same symptoms: recovery is stuck and won't complete because we injected an EIO on an object, but the object was lost during the random write on the primary, and the secondary also marked it as missing.
Each of these tests uses a pool size of 2 replicas, which doesn't look like enough given the EIO injection probability (0.33); we should increase the replica size (see the sketch at the end of this comment).

In thrash-eio.yaml we have:

override:
  ceph:
    conf:
      mon:
        osd default pool size: 3

I tried to find any other reference to "osd default pool size" but couldn't find it in the code; I did find it in 3 other yaml files.
It seems to be a typo or a misconfiguration of "osd pool default size". Does that make sense?
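As a rough illustration of the replica-size argument above (my own arithmetic, not from the test logs): if each replica of an object can independently be hit by the injected EIO with probability 0.33, the chance that every copy is affected shrinks considerably when going from 2 to 3 replicas.

# Illustrative only: assumes independent EIO injection per replica,
# which is a simplification of what the thrasher actually does.
p_eio = 0.33

for pool_size in (2, 3):
    p_all_copies_hit = p_eio ** pool_size
    print("size %d: P(all copies hit) = %.3f" % (pool_size, p_all_copies_hit))

# Output:
# size 2: P(all copies hit) = 0.109
# size 3: P(all copies hit) = 0.036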

#15 Updated by Nitzan Mordechai 5 months ago

  • Status changed from New to In Progress

If my above comment is correct, teuthology also has that incorrect configuration as a default; in placeholder.py we thought we were setting it correctly (see the search sketch after the excerpt):

       'ceph-deploy': {
            'conf': {
                'client': {
                    'log file': '/var/log/ceph/ceph-$name.$pid.log'
                },
                'mon': {
                    'osd default pool size': 2
                }
            }
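One way to double-check where the misspelled key still appears (in the qa yaml files from the previous comment as well as in teuthology defaults) could look like the following. This is a hypothetical helper written for this ticket, not code from either repository; the qa/suites path and the function name are assumptions.

import pathlib

MISSPELLED = "osd default pool size"   # misspelled variant found in the yaml files
CORRECT = "osd pool default size"      # option name ceph actually recognizes


def find_misspelled_key(root="qa/suites"):
    """Yield (path, line number, line) for each occurrence of the typo under root."""
    for path in pathlib.Path(root).rglob("*.yaml"):
        for lineno, line in enumerate(path.read_text().splitlines(), 1):
            if MISSPELLED in line:
                yield path, lineno, line.strip()


if __name__ == "__main__":
    for path, lineno, line in find_misspelled_key():
        print("%s:%d: %s  (should be '%s')" % (path, lineno, line, CORRECT))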

#16 Updated by Nitzan Mordechai 5 months ago

  • Pull request ID set to 51527

#17 Updated by Nitzan Mordechai 5 months ago

  • Status changed from In Progress to Fix Under Review

#18 Updated by Neha Ojha 5 months ago

  • Backport set to pacific,quincy,reef

#19 Updated by Laura Flores 5 months ago

/a/yuriw-2023-05-16_23:44:06-rados-wip-yuri10-testing-2023-05-16-1243-distro-default-smithi/7276255

Hey Nitzan, yes, your analysis makes sense. Thanks for taking a look!

#20 Updated by Laura Flores 4 months ago

/a/yuriw-2023-05-23_22:39:17-rados-wip-yuri6-testing-2023-05-23-0757-reef-distro-default-smithi/7284941

#21 Updated by Radoslaw Zarzynski 4 months ago

  • Status changed from Fix Under Review to Pending Backport

#22 Updated by Backport Bot 4 months ago

  • Copied to Backport #61447: reef: rados/singleton: radosbench.py: teuthology.exceptions.MaxWhileTries: reached maximum tries (3650) after waiting for 21900 seconds added

#23 Updated by Backport Bot 4 months ago

  • Copied to Backport #61448: pacific: rados/singleton: radosbench.py: teuthology.exceptions.MaxWhileTries: reached maximum tries (3650) after waiting for 21900 seconds added

#24 Updated by Backport Bot 4 months ago

  • Copied to Backport #61449: quincy: rados/singleton: radosbench.py: teuthology.exceptions.MaxWhileTries: reached maximum tries (3650) after waiting for 21900 seconds added

#25 Updated by Backport Bot 4 months ago

  • Tags set to backport_processed
