Bug #52993

upgrade:octopus-x Test: Upgrade test failed due to timeout of the "ceph pg dump" command

Added by Sridhar Seshasayee over 2 years ago. Updated almost 2 years ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

/a/teuthology-2021-10-12_13:10:21-upgrade:octopus-x-pacific-distro-basic-smithi/6433896
/a/teuthology-2021-10-12_13:10:21-upgrade:octopus-x-pacific-distro-basic-smithi/6433913

The observations below are from JobID: 6433896

Tasks that succeeded:


2021-10-12T18:55:59.285 INFO:teuthology.task.print:**** done install octopus
2021-10-12T18:57:27.424 INFO:teuthology.task.print:**** done ceph
2021-10-12T18:57:47.679 INFO:teuthology.task.print:**** done => started rgw client.0
2021-10-12T19:00:34.731 INFO:teuthology.task.print:**** done install.upgrade of first 3 nodes  -> mon.a & mon.b are restarted

Mgr logs after upgrade to pacific:

2021-10-12T19:00:40.064+0000 7f918f8b9380 -1 WARNING: all dangerous and experimental features are enabled.
2021-10-12T19:00:40.064+0000 7f918f8b9380  0 ceph version 16.2.6-210-gb56eac81 (b56eac81c3c91b10bbed736a3b40663b238eeb82) pacific (stable), process ceph-mgr, pid 19719

The manager waits around 20 seconds (mgr_stats_period * 4) but then gives up because the OSDs
did not report any PG status. This is likely the root of the problem and the reason the
"ceph pg dump" command timed out in the test. We have not yet looked into why the OSDs stopped reporting.

Start of the loop to wait for OSDs to report PG status:

2021-10-12T19:00:53.148+0000 7f9163124700 10 mgr.server tick
2021-10-12T19:00:53.148+0000 7f9163124700  1 mgr.server send_report Not sending PG status to monitor yet, waiting for OSDs
2021-10-12T19:00:53.148+0000 7f9163124700 20 mgr.server adjust_pgs
2021-10-12T19:00:53.148+0000 7f9163124700 10 mgr.server operator() creating_or_unknown 265 max_creating 1024 left 759

Mgr gives up waiting for OSDs to report PG states:

2021-10-12T19:01:11.148+0000 7f9163124700 10 mgr.server tick
2021-10-12T19:01:11.148+0000 7f9163124700  1 mgr.server send_report Giving up on OSDs that haven't reported yet, sending potentially incomplete PG state to mon
2021-10-12T19:01:11.148+0000 7f9163124700 10 mgr update_delta_stats  v7
2021-10-12T19:01:11.148+0000 7f9163124700  4 OSD 0 is up, but has no stats
2021-10-12T19:01:11.148+0000 7f9163124700  4 OSD 1 is up, but has no stats
2021-10-12T19:01:11.148+0000 7f9163124700  4 OSD 2 is up, but has no stats
2021-10-12T19:01:11.148+0000 7f9163124700  4 OSD 3 is up, but has no stats
2021-10-12T19:01:11.148+0000 7f9163124700  4 OSD 4 is up, but has no stats
2021-10-12T19:01:11.148+0000 7f9163124700  4 OSD 5 is up, but has no stats
2021-10-12T19:01:11.148+0000 7f9163124700  4 OSD 6 is up, but has no stats
2021-10-12T19:01:11.148+0000 7f9163124700  4 OSD 7 is up, but has no stats
2021-10-12T19:01:11.148+0000 7f9163124700  4 OSD 8 is up, but has no stats
2021-10-12T19:01:11.148+0000 7f9163124700  4 OSD 9 is up, but has no stats
2021-10-12T19:01:11.148+0000 7f9163124700  4 OSD 10 is up, but has no stats
2021-10-12T19:01:11.148+0000 7f9163124700  4 OSD 11 is up, but has no stats
2021-10-12T19:01:11.148+0000 7f9163124700 10 mgr.server operator() 265 pgs: 265 unknown; 0 B data, 0 B used, 0 B / 0 B avail
2021-10-12T19:01:11.148+0000 7f9163124700 10 mgr.server operator() 0 health checks
2021-10-12T19:01:11.148+0000 7f9163124700 20 mgr.server operator() health checks:

A while later Mgr reports many PGs inactive:

2021-10-12T19:01:51.160+0000 7f9163124700 10 mgr.server operator() 265 pgs: 265 unknown; 0 B data, 0 B used, 0 B / 0 B avail
2021-10-12T19:01:51.160+0000 7f9163124700 10 mgr.server operator() 1 health checks
2021-10-12T19:01:51.160+0000 7f9163124700 20 mgr.server operator() health checks:
{
    "PG_AVAILABILITY": {
        "severity": "HEALTH_WARN",
        "summary": {
            "message": "Reduced data availability: 265 pgs inactive",
            "count": 265
        },
        "detail": [
            {
                "message": "pg 3.26 is stuck inactive for 60s, current state unknown, last acting []" 
            },
            {
                "message": "pg 3.27 is stuck inactive for 60s, current state unknown, last acting []" 
            },
            {
                "message": "pg 3.28 is stuck inactive for 60s, current state unknown, last acting []" 
            },
            {
                "message": "pg 3.29 is stuck inactive for 60s, current state unknown, last acting []" 
            },
            {
                "message": "pg 3.2a is stuck inactive for 60s, current state unknown, last acting []" 
            },
            {
                "message": "pg 3.2b is stuck inactive for 60s, current state unknown, last acting []" 
            },
            {
                "message": "pg 3.2c is stuck inactive for 60s, current state unknown, last acting []" 
            },
            {
                "message": "pg 3.2d is stuck inactive for 60s, current state unknown, last acting []" 
            },
            {
                "message": "pg 3.2e is stuck inactive for 60s, current state unknown, last acting []" 
            },
            {
                "message": "pg 3.2f is stuck inactive for 60s, current state unknown, last acting []" 
            },
            {
                "message": "pg 3.30 is stuck inactive for 60s, current state unknown, last acting []" 
            },
            {
                "message": "pg 3.31 is stuck inactive for 60s, current state unknown, last acting []" 
            },
            {
                "message": "pg 3.32 is stuck inactive for 60s, current state unknown, last acting []" 
            },
            {
                "message": "pg 3.33 is stuck inactive for 60s, current state unknown, last acting []" 
            },
            {
                "message": "pg 3.34 is stuck inactive for 60s, current state unknown, last acting []" 
            },
            {
                "message": "pg 3.35 is stuck inactive for 60s, current state unknown, last acting []" 
            },
            {
                "message": "pg 3.36 is stuck inactive for 60s, current state unknown, last acting []" 
            },
            {
                "message": "pg 3.37 is stuck inactive for 60s, current state unknown, last acting []" 
            },
            {
                "message": "pg 3.38 is stuck inactive for 60s, current state unknown, last acting []" 
            },
            {
                "message": "pg 3.39 is stuck inactive for 60s, current state unknown, last acting []" 
            },
            {
                "message": "pg 3.3a is stuck inactive for 60s, current state unknown, last acting []" 
            },
            {
                "message": "pg 3.3b is stuck inactive for 60s, current state unknown, last acting []" 
            },
            {
                "message": "pg 3.3c is stuck inactive for 60s, current state unknown, last acting []" 
            },
            {
                "message": "pg 3.3d is stuck inactive for 60s, current state unknown, last acting []" 
            },
            {
                "message": "pg 3.3e is stuck inactive for 60s, current state unknown, last acting []" 
            },
            {
                "message": "pg 3.3f is stuck inactive for 60s, current state unknown, last acting []" 
            },
            {
                "message": "pg 4.20 is stuck inactive for 60s, current state unknown, last acting []" 
            },
            {
                "message": "pg 4.28 is stuck inactive for 60s, current state unknown, last acting []" 
            },
            {
                "message": "pg 4.29 is stuck inactive for 60s, current state unknown, last acting []" 
            },
            {
                "message": "pg 4.2a is stuck inactive for 60s, current state unknown, last acting []" 
            },
            {
                "message": "pg 4.2b is stuck inactive for 60s, current state unknown, last acting []" 
            },            
            {
                "message": "pg 4.2c is stuck inactive for 60s, current state unknown, last acting []" 
            },
            {
                "message": "pg 4.2d is stuck inactive for 60s, current state unknown, last acting []" 
            },
            {
                "message": "pg 4.2e is stuck inactive for 60s, current state unknown, last acting []" 
            },
            {
                "message": "pg 4.2f is stuck inactive for 60s, current state unknown, last acting []" 
            },
            {
                "message": "pg 4.30 is stuck inactive for 60s, current state unknown, last acting []" 
            },
            {
                "message": "pg 4.31 is stuck inactive for 60s, current state unknown, last acting []" 
            },
            {
                "message": "pg 4.32 is stuck inactive for 60s, current state unknown, last acting []" 
            },
            {
                "message": "pg 4.33 is stuck inactive for 60s, current state unknown, last acting []" 
            },
            {
                "message": "pg 4.34 is stuck inactive for 60s, current state unknown, last acting []" 
            },
            {
                "message": "pg 4.35 is stuck inactive for 60s, current state unknown, last acting []" 
            },
            {
                "message": "pg 4.36 is stuck inactive for 60s, current state unknown, last acting []" 
            },
            {
                "message": "pg 4.37 is stuck inactive for 60s, current state unknown, last acting []" 
            },
            {
                "message": "pg 4.38 is stuck inactive for 60s, current state unknown, last acting []" 
            },
            {
                "message": "pg 4.39 is stuck inactive for 60s, current state unknown, last acting []" 
            },
            {
                "message": "pg 4.3a is stuck inactive for 60s, current state unknown, last acting []" 
            },
            {
                "message": "pg 4.3b is stuck inactive for 60s, current state unknown, last acting []" 
            },
            {
                "message": "pg 4.3c is stuck inactive for 60s, current state unknown, last acting []" 
            },
            {
                "message": "pg 4.3d is stuck inactive for 60s, current state unknown, last acting []" 
            },
            {
                "message": "pg 4.3e is stuck inactive for 60s, current state unknown, last acting []" 
            },
            {
                "message": "pg 4.3f is stuck inactive for 60s, current state unknown, last acting []" 
            }
        ]
    }
}
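
For anyone reproducing this interactively, the same availability warning can be queried directly
with the standard CLI (generic commands, not specific to this job):

ceph health detail --format=json-pretty    # includes the PG_AVAILABILITY check shown above
ceph pg dump_stuck inactive                # lists the PGs that are stuck in an inactive state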

Teuthology logs showing the timeout of the "pg dump" command during this time (after mgr upgrade to pacific):

Note that a mismatch between the flushed PG stats and "osd last-stat-seq" was reported only for osd.8. The test logged the error but continued, as shown below.

2021-10-12T19:01:25.031 DEBUG:teuthology.orchestra.run.smithi137:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph osd last-stat-seq osd.8
2021-10-12T19:01:25.157 INFO:teuthology.orchestra.run.smithi137.stderr:2021-10-12T19:01:25.148+0000 7fa31800c700 -1 WARNING: all dangerous and experimental features are enabled.
2021-10-12T19:01:25.161 INFO:teuthology.orchestra.run.smithi137.stderr:2021-10-12T19:01:25.164+0000 7fa316daa700 -1 WARNING: all dangerous and experimental features are enabled.
2021-10-12T19:01:25.317 INFO:teuthology.orchestra.run.smithi137.stdout:0
2021-10-12T19:01:25.331 INFO:tasks.ceph.ceph_manager.ceph:need seq 45 got 0 for osd.8

...

2021-10-12T19:09:00.593 INFO:tasks.ceph.ceph_manager.ceph:need seq 45 got 0 for osd.8
2021-10-12T19:09:01.594 INFO:tasks.ceph:ignoring flush pg stats error, probably testing upgrade: timed out waiting for mon to be updated with osd.8: 0 < 45
2021-10-12T19:09:01.969 INFO:tasks.ceph.ceph_manager.ceph:waiting for clean
2021-10-12T19:09:01.969 DEBUG:teuthology.orchestra.run.smithi137:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph pg dump --format=json

...

2021-10-12T19:11:01.992 DEBUG:teuthology.orchestra.run:got remote process result: 124
2021-10-12T19:11:01.994 ERROR:teuthology.run_tasks:Saw exception from tasks.
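
For context, the "need seq 45 got 0" messages come from the teuthology helper that flushes each
OSD's PG stats and then waits for the cluster's last-stat-seq for that OSD to catch up. A rough
shell equivalent of that wait (a simplified sketch, not the actual teuthology code):

seq=$(ceph tell osd.8 flush_pg_stats)              # sequence number of the stats the OSD just flushed
while [ "$(ceph osd last-stat-seq osd.8)" -lt "$seq" ]; do
    sleep 1                                        # in this run the value stayed at 0, so seq 45 was never reached
done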

Last mgrreport sent by the OSDs

None of the OSDs (including the ones that were upgraded) sent a mgrreport after the upgrade.
This indicates a communication problem between the OSDs and the mgr after the upgrade.

One example from osd.0 is shown below: its last mgrreport was sent at around 19:00:32,
just before the upgrade at around 19:00:40.

2021-10-12T19:00:32.408+0000 7f650a05d700 20 osd.0 40 reports for 0 queries
2021-10-12T19:00:32.408+0000 7f650a05d700  1 -- [v2:172.21.15.137:6826/16508,v1:172.21.15.137:6827/16508] --> [v2:172.21.15.137:6800/16326,v1:172.21.15.137:6801/16326] -- mgrreport(unknown.0 +0-0 packed 1222 daemon_metrics=2) v9 -- 0x55fc13799300 con 0x55fc10d98000
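
If this reproduces, one way to gather more data on the OSD to mgr path would be to raise the
MgrClient/mgr debug levels and re-check whether stats arrive (a suggestion only, not something
that was done in this run):

ceph config set osd debug_mgrc 20     # MgrClient logging on the OSD side
ceph config set mgr debug_mgr 20      # server-side logging in ceph-mgr
ceph tell osd.0 flush_pg_stats        # returns the stats sequence the OSD just flushed
ceph osd last-stat-seq osd.0          # should catch up with that sequence once reporting works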

Related issues

Related to mgr - Bug #51815: ceph pg dump times out due to no active mgr (Pending Backport)

History

#1 Updated by Laura Flores almost 2 years ago

A similar problem happened on a rados/singleton test for Octopus:

/a/yuriw-2022-04-26_20:58:55-rados-wip-yuri2-testing-2022-04-26-1132-octopus-distro-default-smithi/6807516
Description: rados/singleton/{all/rebuild-mondb msgr-failures/none msgr/async-v2only objectstore/bluestore-comp-lz4 rados supported-random-distro$/{centos_8}}

teuthology.log: `ceph pg dump` is issued, but the mon reports a bad auth method.

2022-04-28T02:35:56.581 DEBUG:teuthology.orchestra.run.smithi122:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph pg dump --format=json
2022-04-28T02:35:57.616 INFO:tasks.ceph.mon.c.smithi122.stderr:2022-04-28T02:35:57.615+0000 7fe8a90d0700 -1 mon.c@2(peon) e1 handle_auth_bad_method hmm, they didn't like 2 result (13) Permission denied
2022-04-28T02:36:11.551 INFO:tasks.ceph.mon.a.smithi122.stderr:2022-04-28T02:36:11.550+0000 7f8993693700 -1 mon.a@0(leader) e1 handle_auth_bad_method hmm, they didn't like 2 result (13) Permission denied
2022-04-28T02:36:12.625 INFO:tasks.ceph.mon.c.smithi122.stderr:2022-04-28T02:36:12.624+0000 7fe8a90d0700 -1 mon.c@2(peon) e1 handle_auth_bad_method hmm, they didn't like 2 result (13) Permission denied
2022-04-28T02:36:26.552 INFO:tasks.ceph.mon.a.smithi122.stderr:2022-04-28T02:36:26.551+0000 7f8993693700 -1 mon.a@0(leader) e1 handle_auth_bad_method hmm, they didn't like 2 result (13) Permission denied
2022-04-28T02:36:27.635 INFO:tasks.ceph.mon.c.smithi122.stderr:2022-04-28T02:36:27.635+0000 7fe8a90d0700 -1 mon.c@2(peon) e1 handle_auth_bad_method hmm, they didn't like 2 result (13) Permission denied
2022-04-28T02:36:41.553 INFO:tasks.ceph.mon.a.smithi122.stderr:2022-04-28T02:36:41.552+0000 7f8993693700 -1 mon.a@0(leader) e1 handle_auth_bad_method hmm, they didn't like 2 result (13) Permission denied
2022-04-28T02:36:42.651 INFO:tasks.ceph.mon.c.smithi122.stderr:2022-04-28T02:36:42.650+0000 7fe8a90d0700 -1 mon.c@2(peon) e1 handle_auth_bad_method hmm, they didn't like 2 result (13) Permission denied
2022-04-28T02:36:56.553 INFO:tasks.ceph.mon.a.smithi122.stderr:2022-04-28T02:36:56.552+0000 7f8993693700 -1 mon.a@0(leader) e1 handle_auth_bad_method hmm, they didn't like 2 result (13) Permission denied
2022-04-28T02:36:57.651 INFO:tasks.ceph.mon.c.smithi122.stderr:2022-04-28T02:36:57.650+0000 7fe8a90d0700 -1 mon.c@2(peon) e1 handle_auth_bad_method hmm, they didn't like 2 result (13) Permission denied
2022-04-28T02:37:11.554 INFO:tasks.ceph.mon.a.smithi122.stderr:2022-04-28T02:37:11.553+0000 7f8993693700 -1 mon.a@0(leader) e1 handle_auth_bad_method hmm, they didn't like 2 result (13) Permission denied
2022-04-28T02:37:12.659 INFO:tasks.ceph.mon.c.smithi122.stderr:2022-04-28T02:37:12.658+0000 7fe8a90d0700 -1 mon.c@2(peon) e1 handle_auth_bad_method hmm, they didn't like 2 result (13) Permission denied
2022-04-28T02:37:26.555 INFO:tasks.ceph.mon.a.smithi122.stderr:2022-04-28T02:37:26.554+0000 7f8993693700 -1 mon.a@0(leader) e1 handle_auth_bad_method hmm, they didn't like 2 result (13) Permission denied
2022-04-28T02:37:27.663 INFO:tasks.ceph.mon.c.smithi122.stderr:2022-04-28T02:37:27.663+0000 7fe8a90d0700 -1 mon.c@2(peon) e1 handle_auth_bad_method hmm, they didn't like 2 result (13) Permission denied
2022-04-28T02:37:41.555 INFO:tasks.ceph.mon.a.smithi122.stderr:2022-04-28T02:37:41.554+0000 7f8993693700 -1 mon.a@0(leader) e1 handle_auth_bad_method hmm, they didn't like 2 result (13) Permission denied
2022-04-28T02:37:42.679 INFO:tasks.ceph.mon.c.smithi122.stderr:2022-04-28T02:37:42.678+0000 7fe8a90d0700 -1 mon.c@2(peon) e1 handle_auth_bad_method hmm, they didn't like 2 result (13) Permission denied
2022-04-28T02:37:56.555 INFO:tasks.ceph.mon.a.smithi122.stderr:2022-04-28T02:37:56.555+0000 7f8993693700 -1 mon.a@0(leader) e1 handle_auth_bad_method hmm, they didn't like 2 result (13) Permission denied
2022-04-28T02:37:56.613 DEBUG:teuthology.orchestra.run:got remote process result: 124
2022-04-28T02:37:56.614 ERROR:teuthology.contextutil:Saw exception from nested tasks

mgr.x.log.gz: mgr waits for OSDs to report PG status

2022-04-28T02:34:33.102+0000 7ff1b1b9c700 10 mgr.server tick 
2022-04-28T02:34:33.102+0000 7ff1b1b9c700  1 mgr.server send_report Not sending PG status to monitor yet, waiting for OSDs
2022-04-28T02:34:33.102+0000 7ff1b1b9c700 20 mgr.server adjust_pgs 
2022-04-28T02:34:33.102+0000 7ff1b1b9c700 10 mgr.server operator() creating_or_unknown 25 max_creating 1024 left 999

mgr.x.log.gz: After looping for about a minute, the mgr gives up and reports unknown PGs.

2022-04-28T02:35:31.120+0000 7ff1b1b9c700 10 mgr.server tick 
2022-04-28T02:35:31.120+0000 7ff1b1b9c700 15 mgr get_health_checks getting health checks for balancer
2022-04-28T02:35:31.120+0000 7ff1b1b9c700 15 mgr get_health_checks getting health checks for crash
2022-04-28T02:35:31.120+0000 7ff1b1b9c700 15 mgr get_health_checks getting health checks for devicehealth
2022-04-28T02:35:31.120+0000 7ff1b1b9c700 15 mgr get_health_checks getting health checks for iostat
2022-04-28T02:35:31.120+0000 7ff1b1b9c700 15 mgr get_health_checks getting health checks for orchestrator
2022-04-28T02:35:31.120+0000 7ff1b1b9c700 15 mgr get_health_checks getting health checks for pg_autoscaler
2022-04-28T02:35:31.120+0000 7ff1b1b9c700 15 mgr get_health_checks getting health checks for progress
2022-04-28T02:35:31.120+0000 7ff1b1b9c700 10 mgr update_delta_stats  v44
2022-04-28T02:35:31.120+0000 7ff1b1b9c700  4 OSD 0 is up, but has no stats
2022-04-28T02:35:31.120+0000 7ff1b1b9c700  4 OSD 1 is up, but has no stats
2022-04-28T02:35:31.120+0000 7ff1b1b9c700  4 OSD 2 is up, but has no stats
2022-04-28T02:35:31.120+0000 7ff1b1b9c700 10 mgr.server operator() 25 pgs: 25 unknown; 0 B data, 0 B used, 0 B / 0 B avail
2022-04-28T02:35:31.120+0000 7ff1b1b9c700 10 mgr.server operator() 1 health checks
2022-04-28T02:35:31.120+0000 7ff1b1b9c700 20 mgr.server operator() health checks:
{
    "PG_AVAILABILITY": {
        "severity": "HEALTH_WARN",
        "summary": {
            "message": "Reduced data availability: 8 pgs inactive",
            "count": 8
        },
        "detail": [
            {
                "message": "pg 1.0 is stuck inactive for 60s, current state unknown, last acting []" 
            },
            {
                "message": "pg 1.1 is stuck inactive for 60s, current state unknown, last acting []" 
            },
            {
                "message": "pg 1.2 is stuck inactive for 60s, current state unknown, last acting []" 
            },
            {
                "message": "pg 1.3 is stuck inactive for 60s, current state unknown, last acting []" 
            },
            {
                "message": "pg 1.4 is stuck inactive for 60s, current state unknown, last acting []" 
            },
            {
                "message": "pg 1.5 is stuck inactive for 60s, current state unknown, last acting []" 
            },
            {
                "message": "pg 1.6 is stuck inactive for 60s, current state unknown, last acting []" 
            },
            {
                "message": "pg 1.7 is stuck inactive for 60s, current state unknown, last acting []" 
            }
        ]
    }
}

The same test ended up passing on a rerun, so this does not happen every time: yuriw-2022-05-09_21:42:51-rados-wip-yuri2-testing-2022-04-26-1132-octopus-distro-default-smithi/6829092.

#2 Updated by Neha Ojha almost 2 years ago

  • Related to Bug #51815: ceph pg dump times out due to no active mgr added

#3 Updated by Radoslaw Zarzynski almost 2 years ago

We haven't backported the fix for https://tracker.ceph.com/issues/51815 to Octopus (per Neha's explanation).
