Project

General

Profile

Actions

Bug #51815

open

ceph pg dump times out due to no active mgr

Added by Neha Ojha over 2 years ago. Updated over 1 year ago.

Status:
Pending Backport
Priority:
High
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
backport_processed
Backport:
pacific, octopus
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2021-07-22T03:33:49.333 DEBUG:teuthology.orchestra.run.smithi050:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage daemon-helper kill ceph-mgr -f --cluster ceph -i x
2021-07-22T03:33:49.335 INFO:tasks.ceph.mgr.x:Started
...
2021-07-22T03:35:27.068 INFO:tasks.ceph.mon.a.smithi050.stderr:2021-07-22T03:35:27.067+0000 7fd05e2be700 -1 log_channel(cluster) log [ERR] : Health check failed: no active mgr (MGR_DOWN)
...
2021-07-22T03:35:38.826 DEBUG:teuthology.orchestra.run.smithi050:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph pg dump --format=json
2021-07-22T03:37:38.839 DEBUG:teuthology.orchestra.run:got remote process result: 124
2021-07-22T03:33:49.395+0000 7efe70d3d380  0 ceph version 16.2.5-107-g25279b58 (25279b580b12bb6c4029912fd2e89174d631ce62) pacific (stable), process ceph-mgr, pid 23364
...
2021-07-22T03:33:53.063+0000 7efe6bcf1700  1 -- 172.21.15.50:0/23364 <== mon.2 v1:172.21.15.50:6791/0 2 ==== auth_reply(proto 2 -13 (13) Permission denied) v1 ==== 24+0+0 (unknown 4080774440 0 0) 0x55e9b0762ea0 con 0x55e9aee70800
2021-07-22T03:33:53.063+0000 7efe6bcf1700 10 cephx client: 0x55e9aed87ee0 handle_response ret = -13
2021-07-22T03:33:53.063+0000 7efe6bcf1700  1 -- 172.21.15.50:0/23364 >> v1:172.21.15.50:6791/0 conn(0x55e9aee70800 legacy=0x55e9af9e0800 unknown :-1 s=STATE_CONNECTION_ESTABLISHED l=1).mark_down
2021-07-22T03:33:53.063+0000 7efe6bcf1700  1 -- 172.21.15.50:0/23364 <== mon.0 v1:172.21.15.50:6789/0 2 ==== auth_reply(proto 2 -13 (13) Permission denied) v1 ==== 24+0+0 (unknown 4080774440 0 0) 0x55e9b076dd40 con 0x55e9aee70c00
2021-07-22T03:33:53.063+0000 7efe6bcf1700 10 cephx client: 0x55e9aed87e10 handle_response ret = -13
2021-07-22T03:33:53.063+0000 7efe6bcf1700  1 -- 172.21.15.50:0/23364 >> v1:172.21.15.50:6789/0 conn(0x55e9aee70c00 legacy=0x55e9af9e1000 unknown :-1 s=STATE_CONNECTION_ESTABLISHED l=1).mark_down
2021-07-22T03:33:53.063+0000 7efe6bcf1700  1 -- 172.21.15.50:0/23364 <== mon.1 v1:172.21.15.50:6790/0 1 ==== auth_reply(proto 2 0 (0) Success) v1 ==== 33+0+0 (unknown 623661766 0 0) 0x55e9aed816c0 con 0x55e9b32ee000
2021-07-22T03:33:53.063+0000 7efe6bcf1700 10 monclient(hunting): _init_auth method 2
2021-07-22T03:33:53.063+0000 7efe6bcf1700 10 monclient(hunting): _init_auth already have auth, reseting
2021-07-22T03:33:53.063+0000 7efe6bcf1700 10 cephx client: reset
2021-07-22T03:33:53.063+0000 7efe6bcf1700 10 cephx client: 0x55e9aed871e0 handle_response ret = 0
2021-07-22T03:33:53.063+0000 7efe6bcf1700 10 cephx client:  got initial server challenge 215dfa99bc67a265
2021-07-22T03:33:53.063+0000 7efe6bcf1700 10 cephx client: validate_tickets: want=55 need=0 have=55
2021-07-22T03:33:53.063+0000 7efe6bcf1700 10 cephx: validate_tickets want 55 have 23 need 32
2021-07-22T03:33:53.063+0000 7efe6bcf1700 10 cephx client: want=55 need=32 have=23
2021-07-22T03:33:53.063+0000 7efe6bcf1700 10 cephx client: build_request
2021-07-22T03:33:53.063+0000 7efe6bcf1700 20 cephx client: old ticket len=96
2021-07-22T03:33:53.063+0000 7efe6bcf1700 10 cephx client: get auth session key: client_challenge 90515daa5a75b25c
2021-07-22T03:33:53.063+0000 7efe6bcf1700  1 -- 172.21.15.50:0/23364 --> v1:172.21.15.50:6790/0 -- auth(proto 2 132 bytes epoch 0) v1 -- 0x55e9b32f8780 con 0x55e9b32ee000
2021-07-22T03:33:53.063+0000 7efe6bcf1700  1 -- 172.21.15.50:0/23364 <== mon.1 v1:172.21.15.50:6790/0 2 ==== auth_reply(proto 2 -13 (13) Permission denied) v1 ==== 24+0+0 (unknown 4080774440 0 0) 0x55e9b0762d00 con 0x55e9b32ee000
2021-07-22T03:33:53.063+0000 7efe6bcf1700 10 cephx client: 0x55e9aed871e0 handle_response ret = -13
2021-07-22T03:33:53.063+0000 7efe6bcf1700  1 -- 172.21.15.50:0/23364 >> v1:172.21.15.50:6790/0 conn(0x55e9b32ee000 legacy=0x55e9b32f4000 unknown :-1 s=STATE_CONNECTION_ESTABLISHED l=1).mark_down
2021-07-22T03:33:53.063+0000 7efe6bcf1700 10 monclient: _finish_hunting -13
2021-07-22T03:33:53.063+0000 7efe6bcf1700  1 monclient: no mon sessions established
2021-07-22T03:33:53.063+0000 7efe6bcf1700 10 monclient: _finish_auth -13

rados/singleton/{all/rebuild-mondb mon_election/connectivity msgr-failures/many msgr/async-v1only objectstore/bluestore-comp-lz4 rados supported-random-distro$/{ubuntu_latest}}

/a/yuriw-2021-07-21_19:04:13-rados-wip-yuri4-testing-2021-07-21-0859-pacific-distro-basic-smithi/6285312


Related issues 4 (1 open3 closed)

Related to RADOS - Bug #52621: cephx: verify_authorizer could not decrypt ticket info: error: bad magic in decode_decryptCan't reproduce

Actions
Related to RADOS - Bug #52993: upgrade:octopus-x Test: Upgrade test failed due to timeout of the "ceph pg dump" commandNew

Actions
Copied to mgr - Backport #52844: pacific: ceph pg dump times out due to no active mgr ResolvedCory SnyderActions
Copied to mgr - Backport #55634: octopus: ceph pg dump times out due to no active mgr RejectedPonnuvel PActions
Actions #1

Updated by Neha Ojha over 2 years ago

  • Priority changed from Normal to High

rados/singleton/{all/rebuild-mondb mon_election/classic msgr-failures/many msgr/async-v1only objectstore/bluestore-comp-zlib rados supported-random-distro$/{centos_8.stream}}

/a/benhanokh-2021-08-04_06:12:22-rados-wip_gbenhano_ncbz-distro-basic-smithi/6310647

Actions #2

Updated by Neha Ojha over 2 years ago

  • Description updated (diff)
Actions #3

Updated by Neha Ojha over 2 years ago

/a/yuriw-2021-08-11_14:27:15-rados-wip-yuri8-testing-2021-08-09-0844-pacific-distro-basic-smithi/6334927

Actions #4

Updated by Neha Ojha over 2 years ago

rados/singleton/{all/rebuild-mondb mon_election/connectivity msgr-failures/many msgr/async-v1only objectstore/bluestore-comp-lz4 rados supported-random-distro$/{centos_8}}

/a/yuriw-2021-09-16_18:23:18-rados-wip-yuri2-testing-2021-09-16-0923-distro-basic-smithi/6393541

Actions #5

Updated by Neha Ojha over 2 years ago

  • Assignee set to Neha Ojha
Actions #7

Updated by Neha Ojha over 2 years ago

  • Related to Bug #52621: cephx: verify_authorizer could not decrypt ticket info: error: bad magic in decode_decrypt added
Actions #8

Updated by Neha Ojha over 2 years ago

Seems to happen when all the three mons return -13

/a/nojha-2021-09-22_21:52:06-rados:singleton-wip-yuri2-testing-2021-09-16-0923-distro-basic-smithi/6402985/

2021-09-23T01:18:45.968+0000 7fcfe67f8700  1 -- 172.21.15.142:0/38616 <== mon.0 v1:172.21.15.142:6789/0 2 ==== auth_reply(proto 2 -13 (13) Permission denied) v1 ==== 24+0+0 (unknown 4080774440 0 0) 0x55d8655ad860 con 0x55d85f587400
2021-09-23T01:18:45.968+0000 7fcfe67f8700  1 -- 172.21.15.142:0/38616 <== mon.2 v1:172.21.15.142:6791/0 2 ==== auth_reply(proto 2 -13 (13) Permission denied) v1 ==== 24+0+0 (unknown 4080774440 0 0) 0x55d8655b3d40 con 0x55d85f587000
2021-09-23T01:18:45.968+0000 7fcfe67f8700  1 -- 172.21.15.142:0/38616 <== mon.1 v1:172.21.15.142:6790/0 2 ==== auth_reply(proto 2 -13 (13) Permission denied) v1 ==== 24+0+0 (unknown 4080774440 0 0) 0x55d85e853380 con 0x55d865650000
...
2021-09-23T01:18:45.968+0000 7fcfe67f8700 10 monclient: _finish_hunting -13
2021-09-23T01:18:45.968+0000 7fcfe67f8700  1 monclient: no mon sessions established
2021-09-23T01:18:45.968+0000 7fcfe67f8700 10 monclient: _finish_auth -13

Comparing with a passed test /a/nojha-2021-09-22_21:52:06-rados:singleton-wip-yuri2-testing-2021-09-16-0923-distro-basic-smithi/6402979/, even if there is one monitor not returning an error, things work fine.

2021-09-23T01:18:46.282+0000 7f56b6ac4700  1 -- 172.21.15.101:0/1620652081 <== mon.2 v1:172.21.15.101:6791/0 2 ==== auth_reply(proto 2 -13 (13) Permission denied) v1 ==== 24+0+0 (unknown 4003534095 0 0) 0x56082f74b860 con 0x56082f69cc00
2021-09-23T01:18:46.282+0000 7f56b6ac4700 10 cephx client: 0x56082f77f450 handle_response ret = -13
2021-09-23T01:18:46.282+0000 7f56b6ac4700  1 -- 172.21.15.101:0/1620652081 <== mon.1 v1:172.21.15.101:6790/0 2 ==== auth_reply(proto 2 -13 (13) Permission denied) v1 ==== 24+0+0 (unknown 4003534095 0 0) 0x56082f607ba0 con 0x56082f54fc00
2021-09-23T01:18:46.282+0000 7f56b6ac4700  1 -- 172.21.15.101:0/1620652081 <== mon.0 v1:172.21.15.101:6789/0 2 ==== auth_reply(proto 2 0 (0) Success) v1 ==== 233+0+0 (unknown 2285516212 0 0) 0x56082f87e680 con 0x56082f57c800
...
2021-09-23T01:18:46.282+0000 7f56b6ac4700 10 monclient: _finish_hunting 0
2021-09-23T01:18:46.282+0000 7f56b6ac4700  1 monclient: found mon.a
Mon.0

2021-09-23T01:18:46.282+0000 7f11cfbbd700  1 -- v1:172.21.15.101:6789/0 <== client.10026 v1:172.21.15.101:0/1620652081 2 ==== auth(proto 2 132 bytes epoch 0) v1 ==== 162+0+0 (unknown 491090841 0 0) 0x56478f0e9200 con 0x56478e93a000
2021-09-23T01:18:46.282+0000 7f11cfbbd700 20 mon.a@0(leader) e1 _ms_dispatch existing session 0x56478e956000 for client.10026
2021-09-23T01:18:46.282+0000 7f11cfbbd700 20 mon.a@0(leader) e1  entity_name mgr.x global_id 10026 (reclaim_pending) caps
2021-09-23T01:18:46.282+0000 7f11cfbbd700 10 mon.a@0(leader).paxosservice(auth 1..11) dispatch 0x56478f0e9200 auth(proto 2 132 bytes epoch 0) v1 from client.10026 v1:172.21.15.101:0/1620652081 con 0x56478e93a000
2021-09-23T01:18:46.282+0000 7f11cfbbd700  5 mon.a@0(leader).paxos(paxos active c 66..79) is_readable = 1 - now=2021-09-23T01:18:46.282821+0000 lease_expire=2021-09-23T01:18:51.247354+0000 has v0 lc 79
2021-09-23T01:18:46.282+0000 7f11cfbbd700 10 mon.a@0(leader).auth v11 preprocess_query auth(proto 2 132 bytes epoch 0) v1 from client.10026 v1:172.21.15.101:0/1620652081
2021-09-23T01:18:46.282+0000 7f11cfbbd700 10 mon.a@0(leader).auth v11 prep_auth() blob_size=132
2021-09-23T01:18:46.282+0000 7f11cfbbd700 10 cephx server mgr.x: handle_request get_auth_session_key for mgr.x
2021-09-23T01:18:46.282+0000 7f11cfbbd700 20 cephx server mgr.x:  checking key: req.key=279919d7d2b0a3f5 expected_key=279919d7d2b0a3f5
2021-09-23T01:18:46.282+0000 7f11cfbbd700 20 cephx server mgr.x:  checking old_ticket: secret_id=2 len=96, old_ticket_may_be_omitted=0
2021-09-23T01:18:46.282+0000 7f11cfbbd700 20 cephx server mgr.x:  decoded old_ticket: global_id=10026
2021-09-23T01:18:46.282+0000 7f11cfbbd700 10 cephx server mgr.x:  allowing reclaim of global_id 10026 (valid ticket presented, will encrypt new ticket)
2021-09-23T01:18:46.282+0000 7f11cfbbd700 10 cephx: build_service_ticket_reply encoding 1 tickets with secret AQBi1UthG+ahNBAAcpjN4OgfIxzmTUSWjZuo4g==
2021-09-23T01:18:46.282+0000 7f11cfbbd700 10 cephx: build_service_ticket service auth secret_id 2 ticket_info.ticket.name=mgr.x ticket.global_id 10026
2021-09-23T01:18:46.282+0000 7f11cfbbd700 10 cephx keyserverdata: get_caps: name=mgr.x
2021-09-23T01:18:46.282+0000 7f11cfbbd700 10 cephx keyserverdata: get_secret: num of caps=3
2021-09-23T01:18:46.282+0000 7f11cfbbd700 10 mon.a@0(leader) e1 ms_handle_authentication session 0x56478e956000 con 0x56478e93a000 addr v1:172.21.15.101:0/1620652081 MonSession(client.10026 v1:172.21.15.101:0/1620652081 is open , features 0x3f01cfbb7ffdffff (luminous))
2021-09-23T01:18:46.282+0000 7f11cfbbd700  2 mon.a@0(leader) e1 send_reply 0x56478e9725a0 0x56478f154ea0 auth_reply(proto 2 0 (0) Success) v1

Mon.1

2021-09-23T01:18:46.282+0000 7f1ee27cc700  1 -- v1:172.21.15.101:6790/0 <== client.10026 v1:172.21.15.101:0/1620652081 2 ==== auth(proto 2 132 bytes epoch 0) v1 ==== 162+0+0 (unknown 247189302 0 0) 0x559522e53080 con 0x5595229aa000
2021-09-23T01:18:46.282+0000 7f1ee27cc700 20 mon.b@1(peon) e1 _ms_dispatch existing session 0x559522df1680 for client.10026
2021-09-23T01:18:46.282+0000 7f1ee27cc700 20 mon.b@1(peon) e1  entity_name mgr.x global_id 10026 (reclaim_pending) caps
2021-09-23T01:18:46.282+0000 7f1ee27cc700 10 mon.b@1(peon).paxosservice(auth 1..11) dispatch 0x559522e53080 auth(proto 2 132 bytes epoch 0) v1 from client.10026 v1:172.21.15.101:0/1620652081 con 0x5595229aa000
2021-09-23T01:18:46.282+0000 7f1ee27cc700  5 mon.b@1(peon).paxos(paxos active c 66..79) is_readable = 1 - now=2021-09-23T01:18:46.282865+0000 lease_expire=2021-09-23T01:18:51.247354+0000 has v0 lc 79
2021-09-23T01:18:46.282+0000 7f1ee27cc700 10 mon.b@1(peon).auth v11 preprocess_query auth(proto 2 132 bytes epoch 0) v1 from client.10026 v1:172.21.15.101:0/1620652081
2021-09-23T01:18:46.282+0000 7f1ee27cc700 10 mon.b@1(peon).auth v11 prep_auth() blob_size=132
2021-09-23T01:18:46.282+0000 7f1ee27cc700 10 cephx server mgr.x: handle_request get_auth_session_key for mgr.x
2021-09-23T01:18:46.282+0000 7f1ee27cc700 20 cephx server mgr.x:  checking key: req.key=50ea35923a58c457 expected_key=50ea35923a58c457
2021-09-23T01:18:46.282+0000 7f1ee27cc700 20 cephx server mgr.x:  checking old_ticket: secret_id=2 len=96, old_ticket_may_be_omitted=0
2021-09-23T01:18:46.282+0000 7f1ee27cc700  0 cephx: ceph_decode_ticket could not get service secret for service_id=auth secret_id=2
2021-09-23T01:18:46.282+0000 7f1ee27cc700  0 cephx server mgr.x:  attempt to reclaim global_id 10026 using bad ticket
2021-09-23T01:18:46.282+0000 7f1ee27cc700  0 cephx server mgr.x:  could not verify old ticket
2021-09-23T01:18:46.282+0000 7f1ee27cc700  2 mon.b@1(peon) e1 send_reply 0x5595229ad680 0x559522f0ad00 auth_reply(proto 2 -13 (13) Permission denied) v1

Mon.2

2021-09-23T01:18:46.282+0000 7fc255234700  1 -- v1:172.21.15.101:6791/0 <== client.10026 v1:172.21.15.101:0/1620652081 2 ==== auth(proto 2 132 bytes epoch 0) v1 ==== 162+0+0 (unknown 1546966740 0 0) 0x559589d96180 con 0x559589dab800
2021-09-23T01:18:46.282+0000 7fc255234700 20 mon.c@2(peon) e1 _ms_dispatch existing session 0x559589ccb440 for client.10026
2021-09-23T01:18:46.282+0000 7fc255234700 20 mon.c@2(peon) e1  entity_name mgr.x global_id 10026 (reclaim_pending) caps 
2021-09-23T01:18:46.282+0000 7fc255234700 10 mon.c@2(peon).paxosservice(auth 1..11) dispatch 0x559589d96180 auth(proto 2 132 bytes epoch 0) v1 from client.10026 v1:172.21.15.101:0/1620652081 con 0x559589dab800
2021-09-23T01:18:46.282+0000 7fc255234700  5 mon.c@2(peon).paxos(paxos active c 66..79) is_readable = 1 - now=2021-09-23T01:18:46.282871+0000 lease_expire=2021-09-23T01:18:51.247354+0000 has v0 lc 79
2021-09-23T01:18:46.282+0000 7fc255234700 10 mon.c@2(peon).auth v11 preprocess_query auth(proto 2 132 bytes epoch 0) v1 from client.10026 v1:172.21.15.101:0/1620652081
2021-09-23T01:18:46.282+0000 7fc255234700 10 mon.c@2(peon).auth v11 prep_auth() blob_size=132
2021-09-23T01:18:46.282+0000 7fc255234700 10 cephx server mgr.x: handle_request get_auth_session_key for mgr.x
2021-09-23T01:18:46.282+0000 7fc255234700 20 cephx server mgr.x:  checking key: req.key=85f72a27737c1f4b expected_key=85f72a27737c1f4b
2021-09-23T01:18:46.282+0000 7fc255234700 20 cephx server mgr.x:  checking old_ticket: secret_id=2 len=96, old_ticket_may_be_omitted=0
2021-09-23T01:18:46.282+0000 7fc255234700  0 cephx: ceph_decode_ticket could not get service secret for service_id=auth secret_id=2
2021-09-23T01:18:46.282+0000 7fc255234700  0 cephx server mgr.x:  attempt to reclaim global_id 10026 using bad ticket
2021-09-23T01:18:46.282+0000 7fc255234700  0 cephx server mgr.x:  could not verify old ticket
2021-09-23T01:18:46.282+0000 7fc255234700  2 mon.c@2(peon) e1 send_reply 0x559589cfe4b0 0x559589df8680 auth_reply(proto 2 -13 (13) Permission denied) v1
Actions #9

Updated by Sage Weil over 2 years ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 43335
Actions #10

Updated by Neha Ojha over 2 years ago

  • Assignee changed from Neha Ojha to Sage Weil
Actions #11

Updated by Kefu Chai over 2 years ago

  • Status changed from Fix Under Review to Pending Backport
Actions #12

Updated by Backport Bot over 2 years ago

  • Copied to Backport #52844: pacific: ceph pg dump times out due to no active mgr added
Actions #13

Updated by Loïc Dachary over 2 years ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".

Actions #14

Updated by Neha Ojha almost 2 years ago

  • Related to Bug #52993: upgrade:octopus-x Test: Upgrade test failed due to timeout of the "ceph pg dump" command added
Actions #15

Updated by Laura Flores almost 2 years ago

  • Status changed from Resolved to Pending Backport
  • Backport changed from pacific to pacific, octopus

Regarding https://tracker.ceph.com/issues/52993#note-3, this fix should also be backported to Octopus.

Actions #16

Updated by Backport Bot almost 2 years ago

  • Copied to Backport #55634: octopus: ceph pg dump times out due to no active mgr added
Actions #17

Updated by Backport Bot over 1 year ago

  • Tags set to backport_processed
Actions

Also available in: Atom PDF