Bug #49888

closed

rados/singleton: radosbench.py: teuthology.exceptions.MaxWhileTries: reached maximum tries (3650) after waiting for 21900 seconds

Added by Neha Ojha about 3 years ago. Updated 5 months ago.

Status:
Resolved
Priority:
Normal
Category:
-
Target version:
-
% Done:
100%

Source:
Tags:
backport_processed
Backport:
pacific,quincy,reef
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2021-03-18T15:45:54.926 ERROR:teuthology.run_tasks:Manager failed: radosbench
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_b96569170f15eae4604f361990ea65737b28dff1/teuthology/run_tasks.py", line 176, in run_tasks
    suppress = manager.__exit__(*exc_info)
  File "/usr/lib/python3.6/contextlib.py", line 88, in __exit__
    next(self.gen)
  File "/home/teuthworker/src/github.com_ceph_ceph-c_3ae49e85764e073e154cc3617a04d1d62c9897ac/qa/tasks/radosbench.py", line 140, in task
    run.wait(radosbench.values(), timeout=timeout)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_b96569170f15eae4604f361990ea65737b28dff1/teuthology/orchestra/run.py", line 473, in wait
    check_time()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_b96569170f15eae4604f361990ea65737b28dff1/teuthology/contextutil.py", line 133, in __call__
    raise MaxWhileTries(error_msg)
teuthology.exceptions.MaxWhileTries: reached maximum tries (3650) after waiting for 21900 seconds
2021-03-18T15:45:54.939 DEBUG:teuthology.run_tasks:Unwinding manager thrashosds

rados/singleton/{all/thrash-eio mon_election/connectivity msgr-failures/many msgr/async-v1only objectstore/bluestore-stupid rados supported-random-distro$/{centos_8}}

/a/yuriw-2021-03-17_23:47:24-rados-wip-yuri6-testing-3.17.21-distro-basic-smithi/5975231
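
For context, the MaxWhileTries in the traceback comes from teuthology's bounded-wait helper: run.wait() polls the radosbench client processes on a fixed budget and gives up once it is exhausted. A minimal sketch of that pattern (my own illustration, not the actual teuthology code), using the interval implied by the error itself (3650 tries x 6 s = 21900 s):

import time

class MaxWhileTries(Exception):
    pass

def wait_until(check, sleep=6, tries=3650):
    # Poll check() until it returns True, or give up after `tries` attempts.
    # With the values above this is the 21900-second budget in the error,
    # i.e. the radosbench clients never exited within ~6 hours.
    for _ in range(tries):
        if check():
            return
        time.sleep(sleep)
    raise MaxWhileTries(
        "reached maximum tries ({}) after waiting for {} seconds".format(
            tries, tries * sleep))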


Related issues 3 (0 open, 3 closed)

Copied to RADOS - Backport #61447: reef: rados/singleton: radosbench.py: teuthology.exceptions.MaxWhileTries: reached maximum tries (3650) after waiting for 21900 seconds (Resolved, Nitzan Mordechai)
Copied to RADOS - Backport #61448: pacific: rados/singleton: radosbench.py: teuthology.exceptions.MaxWhileTries: reached maximum tries (3650) after waiting for 21900 seconds (Resolved, Nitzan Mordechai)
Copied to RADOS - Backport #61449: quincy: rados/singleton: radosbench.py: teuthology.exceptions.MaxWhileTries: reached maximum tries (3650) after waiting for 21900 seconds (Resolved, Nitzan Mordechai)
Actions #1

Updated by Sridhar Seshasayee almost 3 years ago

Observed the same failure here:
/a/sseshasa-2021-04-23_18:11:53-rados-wip-sseshasa-testing-2021-04-23-2212-distro-basic-smithi/6069205/

Actions #2

Updated by Neha Ojha over 2 years ago

rados/singleton/{all/thrash-eio mon_election/classic msgr-failures/many msgr/async-v1only objectstore/bluestore-bitmap rados supported-random-distro$/{rhel_8}}

/a/benhanokh-2021-07-29_20:44:57-rados-wip_gbenhano_ncb_fix3-distro-basic-smithi/6301914

Actions #3

Updated by Neha Ojha over 2 years ago

  • Subject changed from qa/tasks/radosbench.py: teuthology.exceptions.MaxWhileTries: reached maximum tries (3650) after waiting for 21900 seconds to rados/singleton: radosbench.py: teuthology.exceptions.MaxWhileTries: reached maximum tries (3650) after waiting for 21900 seconds

rados/singleton/{all/thrash-eio mon_election/connectivity msgr-failures/many msgr/async-v1only objectstore/bluestore-stupid rados supported-random-distro$/{centos_8}}

/a/yuriw-2021-08-06_16:31:19-rados-wip-yuri-master-8.6.21-distro-basic-smithi/6324604

This only happens with rados/singleton, and every time the job goes dead.

Actions #4

Updated by Aishwarya Mathuria over 2 years ago

/a/yuriw-2021-08-27_21:20:08-rados-wip-yuri2-testing-2021-08-27-1207-distro-basic-smithi/6363835

Actions #5

Updated by Laura Flores about 2 years ago

/a/yuriw-2022-03-19_14:37:23-rados-wip-yuri6-testing-2022-03-18-1104-distro-default-smithi/6746760

Actions #6

Updated by Radoslaw Zarzynski about 2 years ago

A note from the bug scrub: although this is a client-side symptom (which might be the result of a backend failure or an issue in the test itself), there is a pattern across these recurrences: rados/singleton/{all/thrash-eio.

Actions #7

Updated by Laura Flores almost 2 years ago

/a/yuriw-2022-06-02_00:50:42-rados-wip-yuri4-testing-2022-06-01-1350-pacific-distro-default-smithi/6859734

Description: rados/singleton/{all/thrash-eio mon_election/connectivity msgr-failures/many msgr/async-v1only objectstore/bluestore-comp-lz4 rados supported-random-distro$/{ubuntu_latest}}

Actions #8

Updated by Radoslaw Zarzynski almost 2 years ago

Not high priority. Possibly a test issue.

Actions #9

Updated by Matan Breizman over 1 year ago

/a/yuriw-2022-08-15_17:54:08-rados-wip-yuri2-testing-2022-08-15-0848-quincy-distro-default-smithi/6973760

Actions #10

Updated by Kamoltat (Junior) Sirivadhna over 1 year ago

/a/yuriw-2022-09-27_23:37:28-rados-wip-yuri2-testing-2022-09-27-1455-distro-default-smithi/7046230/

Actions #11

Updated by Laura Flores 12 months ago

  • Tags set to test-failure

/a/yuriw-2023-04-24_22:54:45-rados-wip-yuri7-testing-2023-04-19-1343-distro-default-smithi/7250551

Something interesting at the end of the mgr log:
/a/yuriw-2023-04-24_22:54:45-rados-wip-yuri7-testing-2023-04-19-1343-distro-default-smithi/7250551/remote/smithi121/log/ceph-mgr.x.log.gz

2023-04-27T23:51:41.761+0000 7f5d406fa640 10 mgr.server operator() 24 pgs: 3 active+recovery_unfound+undersized+degraded+remapped, 21 active+clean; 4.0 GiB data, 8.8 GiB used, 441 GiB / 450 GiB avail; 24/8218 objects degraded (0.292%); 12/4109 objects unfound (0.292%)
2023-04-27T23:51:41.761+0000 7f5d406fa640 10 mgr.server operator() 4 health checks
2023-04-27T23:51:41.761+0000 7f5d406fa640 20 mgr.server operator() health checks:
{
    "OBJECT_UNFOUND": {
        "severity": "HEALTH_WARN",
        "summary": {
            "message": "12/4109 objects unfound (0.292%)",
            "count": 12
        },
        "detail": [
            {
                "message": "pg 2.9 has 3 unfound objects" 
            },
            {
                "message": "pg 2.2 has 4 unfound objects" 
            },
            {
                "message": "pg 2.6 has 5 unfound objects" 
            }
        ]
    },
    "OSD_TOO_MANY_REPAIRS": {
        "severity": "HEALTH_WARN",
        "summary": {
            "message": "Too many repaired reads on 1 OSDs",
            "count": 1
        },
        "detail": [
            {
                "message": "osd.0 had 16 reads repaired" 
            }
        ]
    },
    "PG_DAMAGED": {
        "severity": "HEALTH_ERR",
        "summary": {
            "message": "Possible data damage: 3 pgs recovery_unfound",
            "count": 3
        },
        "detail": [
            {
                "message": "pg 2.2 is active+recovery_unfound+undersized+degraded+remapped, acting [0], 4 unfound" 
            },
            {
                "message": "pg 2.6 is active+recovery_unfound+undersized+degraded+remapped, acting [0], 5 unfound" 
            },
            {
                "message": "pg 2.9 is active+recovery_unfound+undersized+degraded+remapped, acting [0], 3 unfound" 
            }
        ]
    },
    "PG_DEGRADED": {
        "severity": "HEALTH_WARN",
        "summary": {
            "message": "Degraded data redundancy: 24/8218 objects degraded (0.292%), 3 pgs degraded, 3 pgs undersized",
            "count": 6
        },
        "detail": [
            {
                "message": "pg 2.2 is stuck undersized for 11h, current state active+recovery_unfound+undersized+degraded+remapped, last acting [0]" 
            },
            {
                "message": "pg 2.6 is stuck undersized for 11h, current state active+recovery_unfound+undersized+degraded+remapped, last acting [0]" 
            },
            {
                "message": "pg 2.9 is stuck undersized for 11h, current state active+recovery_unfound+undersized+degraded+remapped, last acting [0]" 
            }
        ]
    }
}

Actions #12

Updated by Laura Flores 12 months ago

/a/yuriw-2023-04-26_01:16:19-rados-wip-yuri11-testing-2023-04-25-1605-pacific-distro-default-smithi/7253869
/a/yuriw-2023-05-01_16:31:04-rados-wip-yuri11-testing-2023-04-25-1605-pacific-distro-default-smithi/7259839

Actions #13

Updated by Nitzan Mordechai 11 months ago

  • Assignee set to Nitzan Mordechai
Actions #14

Updated by Nitzan Mordechai 11 months ago

Looks like all the failures are related to thrash-eio. I checked all the archives we still have, and all of them show the same symptoms: recovery keeps waiting and cannot complete because we injected EIO on an object, but the object was lost during the random write on the primary, and the secondary also marked it as missing.
Each of the tests uses a pool size of 2 replicas, which does not look like enough given the EIO injection probability (0.33); we should increase the replica size.
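
A rough back-of-envelope illustration of that point (my own sketch, using the 0.33 injection probability quoted above and treating replica reads as independent, which is an assumption):

p_eio = 0.33  # EIO injection probability per replica read (figure from the comment above)
for size in (2, 3):
    # probability that every copy of an object gets hit, leaving it unfound
    print("size={}: P(all replicas hit) ~= {:.3f}".format(size, p_eio ** size))
# size=2: ~0.109   size=3: ~0.036 -- roughly 3x less likely per object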

In thrash-eio.yaml we have:

override:
  ceph:
    conf:
      mon:
        osd default pool size: 3

I tried to find any other reference to "osd default pool size" but couldn't find it in the code; I did find it in 3 other yaml files.
It seems to be a typo or a misconfiguration of "osd pool default size". Does that make sense?
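
For illustration only, the corrected override would presumably look like the following (my assumption; the actual fix is in the pull request referenced below):

override:
  ceph:
    conf:
      mon:
        # real option name -- the misspelled "osd default pool size" key
        # above is silently ignored by the mon
        osd pool default size: 3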

Actions #15

Updated by Nitzan Mordechai 11 months ago

  • Status changed from New to In Progress

If my above comment is correct, teuthology also has that incorrect configuration as its default. In placeholder.py we thought we were setting it correctly:

       'ceph-deploy': {
            'conf': {
                'client': {
                    'log file': '/var/log/ceph/ceph-$name.$pid.log'
                },
                'mon': {
                    'osd default pool size': 2
                }
            }
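
Presumably the same rename applies there (again my assumption; the actual change is tracked in the pull request below):

        'ceph-deploy': {
            'conf': {
                'client': {
                    'log file': '/var/log/ceph/ceph-$name.$pid.log'
                },
                'mon': {
                    # corrected option name so the 2-replica default actually takes effect
                    'osd pool default size': 2
                }
            }
        }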

Actions #16

Updated by Nitzan Mordechai 11 months ago

  • Pull request ID set to 51527
Actions #17

Updated by Nitzan Mordechai 11 months ago

  • Status changed from In Progress to Fix Under Review
Actions #18

Updated by Neha Ojha 11 months ago

  • Backport set to pacific,quincy,reef
Actions #19

Updated by Laura Flores 11 months ago

/a/yuriw-2023-05-16_23:44:06-rados-wip-yuri10-testing-2023-05-16-1243-distro-default-smithi/7276255

Hey Nitzan, yes, your analysis makes sense. Thanks for taking a look!

Actions #20

Updated by Laura Flores 11 months ago

/a/yuriw-2023-05-23_22:39:17-rados-wip-yuri6-testing-2023-05-23-0757-reef-distro-default-smithi/7284941

Actions #21

Updated by Radoslaw Zarzynski 11 months ago

  • Status changed from Fix Under Review to Pending Backport
Actions #22

Updated by Backport Bot 11 months ago

  • Copied to Backport #61447: reef: rados/singleton: radosbench.py: teuthology.exceptions.MaxWhileTries: reached maximum tries (3650) after waiting for 21900 seconds added
Actions #23

Updated by Backport Bot 11 months ago

  • Copied to Backport #61448: pacific: rados/singleton: radosbench.py: teuthology.exceptions.MaxWhileTries: reached maximum tries (3650) after waiting for 21900 seconds added
Actions #24

Updated by Backport Bot 11 months ago

  • Copied to Backport #61449: quincy: rados/singleton: radosbench.py: teuthology.exceptions.MaxWhileTries: reached maximum tries (3650) after waiting for 21900 seconds added
Actions #25

Updated by Backport Bot 11 months ago

  • Tags set to backport_processed
Actions #26

Updated by Konstantin Shalygin 5 months ago

  • Status changed from Pending Backport to Resolved
  • % Done changed from 0 to 100