Bug #49888

closed

rados/singleton: radosbench.py: teuthology.exceptions.MaxWhileTries: reached maximum tries (3650) after waiting for 21900 seconds

Added by Neha Ojha about 3 years ago. Updated 5 months ago.

Status:
Resolved
Priority:
Normal
Category:
-
Target version:
-
% Done:
100%

Source:
Tags:
backport_processed
Backport:
pacific,quincy,reef
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2021-03-18T15:45:54.926 ERROR:teuthology.run_tasks:Manager failed: radosbench
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_b96569170f15eae4604f361990ea65737b28dff1/teuthology/run_tasks.py", line 176, in run_tasks
    suppress = manager.__exit__(*exc_info)
  File "/usr/lib/python3.6/contextlib.py", line 88, in __exit__
    next(self.gen)
  File "/home/teuthworker/src/github.com_ceph_ceph-c_3ae49e85764e073e154cc3617a04d1d62c9897ac/qa/tasks/radosbench.py", line 140, in task
    run.wait(radosbench.values(), timeout=timeout)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_b96569170f15eae4604f361990ea65737b28dff1/teuthology/orchestra/run.py", line 473, in wait
    check_time()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_b96569170f15eae4604f361990ea65737b28dff1/teuthology/contextutil.py", line 133, in __call__
    raise MaxWhileTries(error_msg)
teuthology.exceptions.MaxWhileTries: reached maximum tries (3650) after waiting for 21900 seconds
2021-03-18T15:45:54.939 DEBUG:teuthology.run_tasks:Unwinding manager thrashosds

rados/singleton/{all/thrash-eio mon_election/connectivity msgr-failures/many msgr/async-v1only objectstore/bluestore-stupid rados supported-random-distro$/{centos_8}}

/a/yuriw-2021-03-17_23:47:24-rados-wip-yuri6-testing-3.17.21-distro-basic-smithi/5975231
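
For context, the MaxWhileTries in the traceback comes from teuthology's bounded-wait helper: run.wait() polls the radosbench client processes on a fixed budget and gives up once it is exhausted. A minimal sketch of that pattern (my own illustration, not the actual teuthology code), using the interval implied by the error itself (3650 tries x 6 s = 21900 s):

import time

class MaxWhileTries(Exception):
    pass

def wait_until(check, sleep=6, tries=3650):
    # Poll check() until it returns True, or give up after `tries` attempts.
    # With the values above this is the 21900-second budget in the error,
    # i.e. the radosbench clients never exited within ~6 hours.
    for _ in range(tries):
        if check():
            return
        time.sleep(sleep)
    raise MaxWhileTries(
        "reached maximum tries ({}) after waiting for {} seconds".format(
            tries, tries * sleep))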


Related issues 3 (0 open, 3 closed)

Copied to RADOS - Backport #61447: reef: rados/singleton: radosbench.py: teuthology.exceptions.MaxWhileTries: reached maximum tries (3650) after waiting for 21900 seconds (Resolved, Nitzan Mordechai)
Copied to RADOS - Backport #61448: pacific: rados/singleton: radosbench.py: teuthology.exceptions.MaxWhileTries: reached maximum tries (3650) after waiting for 21900 seconds (Resolved, Nitzan Mordechai)
Copied to RADOS - Backport #61449: quincy: rados/singleton: radosbench.py: teuthology.exceptions.MaxWhileTries: reached maximum tries (3650) after waiting for 21900 seconds (Resolved, Nitzan Mordechai)
Actions #1

Updated by Sridhar Seshasayee almost 3 years ago

Observed the same failure here:
/a/sseshasa-2021-04-23_18:11:53-rados-wip-sseshasa-testing-2021-04-23-2212-distro-basic-smithi/6069205/

Actions #2

Updated by Neha Ojha over 2 years ago

rados/singleton/{all/thrash-eio mon_election/classic msgr-failures/many msgr/async-v1only objectstore/bluestore-bitmap rados supported-random-distro$/{rhel_8}}

/a/benhanokh-2021-07-29_20:44:57-rados-wip_gbenhano_ncb_fix3-distro-basic-smithi/6301914

Actions #3

Updated by Neha Ojha over 2 years ago

  • Subject changed from qa/tasks/radosbench.py: teuthology.exceptions.MaxWhileTries: reached maximum tries (3650) after waiting for 21900 seconds to rados/singleton: radosbench.py: teuthology.exceptions.MaxWhileTries: reached maximum tries (3650) after waiting for 21900 seconds

rados/singleton/{all/thrash-eio mon_election/connectivity msgr-failures/many msgr/async-v1only objectstore/bluestore-stupid rados supported-random-distro$/{centos_8}}

/a/yuriw-2021-08-06_16:31:19-rados-wip-yuri-master-8.6.21-distro-basic-smithi/6324604

This only happens with rados/singleton, and every time the job goes dead.

Actions #4

Updated by Aishwarya Mathuria over 2 years ago

/a/yuriw-2021-08-27_21:20:08-rados-wip-yuri2-testing-2021-08-27-1207-distro-basic-smithi/6363835

Actions #5

Updated by Laura Flores about 2 years ago

/a/yuriw-2022-03-19_14:37:23-rados-wip-yuri6-testing-2022-03-18-1104-distro-default-smithi/6746760

Actions #6

Updated by Radoslaw Zarzynski about 2 years ago

A note from the bug scrub: although this is a client-side symptom (which might be the result of a backend failure or an issue in the test itself), there is a pattern across these recurrences: rados/singleton/{all/thrash-eio.

Actions #7

Updated by Laura Flores almost 2 years ago

/a/yuriw-2022-06-02_00:50:42-rados-wip-yuri4-testing-2022-06-01-1350-pacific-distro-default-smithi/6859734

Description: rados/singleton/{all/thrash-eio mon_election/connectivity msgr-failures/many msgr/async-v1only objectstore/bluestore-comp-lz4 rados supported-random-distro$/{ubuntu_latest}}

Actions #8

Updated by Radoslaw Zarzynski almost 2 years ago

Not high priority. Possibly a test issue.

Actions #9

Updated by Matan Breizman over 1 year ago

/a/yuriw-2022-08-15_17:54:08-rados-wip-yuri2-testing-2022-08-15-0848-quincy-distro-default-smithi/6973760

Actions #10

Updated by Kamoltat (Junior) Sirivadhna over 1 year ago

/a/yuriw-2022-09-27_23:37:28-rados-wip-yuri2-testing-2022-09-27-1455-distro-default-smithi/7046230/

Actions #11

Updated by Laura Flores 12 months ago

  • Tags set to test-failure

/a/yuriw-2023-04-24_22:54:45-rados-wip-yuri7-testing-2023-04-19-1343-distro-default-smithi/7250551

Something interesting at the end of the mgr log:
/a/yuriw-2023-04-24_22:54:45-rados-wip-yuri7-testing-2023-04-19-1343-distro-default-smithi/7250551/remote/smithi121/log/ceph-mgr.x.log.gz

2023-04-27T23:51:41.761+0000 7f5d406fa640 10 mgr.server operator() 24 pgs: 3 active+recovery_unfound+undersized+degraded+remapped, 21 active+clean; 4.0 GiB data, 8.8 GiB used, 441 GiB / 450 GiB avail; 24/8218 objects degraded (0.292%); 12/4109 objects unfound (0.292%)
2023-04-27T23:51:41.761+0000 7f5d406fa640 10 mgr.server operator() 4 health checks
2023-04-27T23:51:41.761+0000 7f5d406fa640 20 mgr.server operator() health checks:
{
    "OBJECT_UNFOUND": {
        "severity": "HEALTH_WARN",
        "summary": {
            "message": "12/4109 objects unfound (0.292%)",
            "count": 12
        },
        "detail": [
            {
                "message": "pg 2.9 has 3 unfound objects" 
            },
            {
                "message": "pg 2.2 has 4 unfound objects" 
            },
            {
                "message": "pg 2.6 has 5 unfound objects" 
            }
        ]
    },
    "OSD_TOO_MANY_REPAIRS": {
        "severity": "HEALTH_WARN",
        "summary": {
            "message": "Too many repaired reads on 1 OSDs",
            "count": 1
        },
        "detail": [
            {
                "message": "osd.0 had 16 reads repaired" 
            }
        ]
    },
    "PG_DAMAGED": {
        "severity": "HEALTH_ERR",
        "summary": {
            "message": "Possible data damage: 3 pgs recovery_unfound",
            "count": 3
        },
        "detail": [
            {
                "message": "pg 2.2 is active+recovery_unfound+undersized+degraded+remapped, acting [0], 4 unfound" 
            },
            {
                "message": "pg 2.6 is active+recovery_unfound+undersized+degraded+remapped, acting [0], 5 unfound" 
            },
            {
                "message": "pg 2.9 is active+recovery_unfound+undersized+degraded+remapped, acting [0], 3 unfound" 
            }
        ]
    },
    "PG_DEGRADED": {
        "severity": "HEALTH_WARN",
        "summary": {
            "message": "Degraded data redundancy: 24/8218 objects degraded (0.292%), 3 pgs degraded, 3 pgs undersized",
            "count": 6
        },
        "detail": [
            {
                "message": "pg 2.2 is stuck undersized for 11h, current state active+recovery_unfound+undersized+degraded+remapped, last acting [0]" 
            },
            {
                "message": "pg 2.6 is stuck undersized for 11h, current state active+recovery_unfound+undersized+degraded+remapped, last acting [0]" 
            },
            {
                "message": "pg 2.9 is stuck undersized for 11h, current state active+recovery_unfound+undersized+degraded+remapped, last acting [0]" 
            }
        ]
    }
}

Actions #12

Updated by Laura Flores 12 months ago

/a/yuriw-2023-04-26_01:16:19-rados-wip-yuri11-testing-2023-04-25-1605-pacific-distro-default-smithi/7253869
/a/yuriw-2023-05-01_16:31:04-rados-wip-yuri11-testing-2023-04-25-1605-pacific-distro-default-smithi/7259839

Actions #13

Updated by Nitzan Mordechai 11 months ago

  • Assignee set to Nitzan Mordechai
Actions #14

Updated by Nitzan Mordechai 11 months ago

Looks like all the failures are related to thrash-eio. I checked all the archives we still have, and all of them show the same symptoms: recovery keeps waiting and cannot complete because we injected EIO on an object, but the object was lost during the random write on the primary, and the secondary also marked it as missing.
Each of the tests uses a pool size of 2 replicas, which does not look like enough given the EIO injection probability (0.33); we should increase the replica size.
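
A rough back-of-envelope illustration of that point (my own sketch, using the 0.33 injection probability quoted above and treating replica reads as independent, which is an assumption):

p_eio = 0.33  # EIO injection probability per replica read (figure from the comment above)
for size in (2, 3):
    # probability that every copy of an object gets hit, leaving it unfound
    print("size={}: P(all replicas hit) ~= {:.3f}".format(size, p_eio ** size))
# size=2: ~0.109   size=3: ~0.036 -- roughly 3x less likely per object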

In thrash-eio.yaml we have:

override:
  ceph:
    conf:
      mon:
        osd default pool size: 3

I tried to find any other reference to "osd default pool size" but couldn't find it in the code; I did find it in 3 other yaml files.
It seems to be a typo or a misconfiguration of "osd pool default size". Does that make sense?
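
For illustration only, the corrected override would presumably look like the following (my assumption; the actual fix is in the pull request referenced below):

override:
  ceph:
    conf:
      mon:
        # real option name -- the misspelled "osd default pool size" key
        # above is silently ignored by the mon
        osd pool default size: 3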

Actions #15

Updated by Nitzan Mordechai 11 months ago

  • Status changed from New to In Progress

If my above comment is correct, teuthology also has that incorrect configuration as its default. In placeholder.py we thought we were setting it correctly:

       'ceph-deploy': {
            'conf': {
                'client': {
                    'log file': '/var/log/ceph/ceph-$name.$pid.log'
                },
                'mon': {
                    'osd default pool size': 2
                }
            }
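
Presumably the same rename applies there (again my assumption; the actual change is tracked in the pull request below):

        'ceph-deploy': {
            'conf': {
                'client': {
                    'log file': '/var/log/ceph/ceph-$name.$pid.log'
                },
                'mon': {
                    # corrected option name so the 2-replica default actually takes effect
                    'osd pool default size': 2
                }
            }
        }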

Actions #16

Updated by Nitzan Mordechai 11 months ago

  • Pull request ID set to 51527
Actions #17

Updated by Nitzan Mordechai 11 months ago

  • Status changed from In Progress to Fix Under Review
Actions #18

Updated by Neha Ojha 11 months ago

  • Backport set to pacific,quincy,reef
Actions #19

Updated by Laura Flores 11 months ago

/a/yuriw-2023-05-16_23:44:06-rados-wip-yuri10-testing-2023-05-16-1243-distro-default-smithi/7276255

Hey Nitzan, yes, your analysis makes sense. Thanks for taking a look!

Actions #20

Updated by Laura Flores 11 months ago

/a/yuriw-2023-05-23_22:39:17-rados-wip-yuri6-testing-2023-05-23-0757-reef-distro-default-smithi/7284941

Actions #21

Updated by Radoslaw Zarzynski 11 months ago

  • Status changed from Fix Under Review to Pending Backport
Actions #22

Updated by Backport Bot 11 months ago

  • Copied to Backport #61447: reef: rados/singleton: radosbench.py: teuthology.exceptions.MaxWhileTries: reached maximum tries (3650) after waiting for 21900 seconds added
Actions #23

Updated by Backport Bot 11 months ago

  • Copied to Backport #61448: pacific: rados/singleton: radosbench.py: teuthology.exceptions.MaxWhileTries: reached maximum tries (3650) after waiting for 21900 seconds added
Actions #24

Updated by Backport Bot 11 months ago

  • Copied to Backport #61449: quincy: rados/singleton: radosbench.py: teuthology.exceptions.MaxWhileTries: reached maximum tries (3650) after waiting for 21900 seconds added
Actions #25

Updated by Backport Bot 11 months ago

  • Tags set to backport_processed
Actions #26

Updated by Konstantin Shalygin 5 months ago

  • Status changed from Pending Backport to Resolved
  • % Done changed from 0 to 100