Bug #51815
openceph pg dump times out due to no active mgr
0%
Description
2021-07-22T03:33:49.333 DEBUG:teuthology.orchestra.run.smithi050:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage daemon-helper kill ceph-mgr -f --cluster ceph -i x 2021-07-22T03:33:49.335 INFO:tasks.ceph.mgr.x:Started ... 2021-07-22T03:35:27.068 INFO:tasks.ceph.mon.a.smithi050.stderr:2021-07-22T03:35:27.067+0000 7fd05e2be700 -1 log_channel(cluster) log [ERR] : Health check failed: no active mgr (MGR_DOWN) ... 2021-07-22T03:35:38.826 DEBUG:teuthology.orchestra.run.smithi050:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph pg dump --format=json 2021-07-22T03:37:38.839 DEBUG:teuthology.orchestra.run:got remote process result: 124
2021-07-22T03:33:49.395+0000 7efe70d3d380 0 ceph version 16.2.5-107-g25279b58 (25279b580b12bb6c4029912fd2e89174d631ce62) pacific (stable), process ceph-mgr, pid 23364 ... 2021-07-22T03:33:53.063+0000 7efe6bcf1700 1 -- 172.21.15.50:0/23364 <== mon.2 v1:172.21.15.50:6791/0 2 ==== auth_reply(proto 2 -13 (13) Permission denied) v1 ==== 24+0+0 (unknown 4080774440 0 0) 0x55e9b0762ea0 con 0x55e9aee70800 2021-07-22T03:33:53.063+0000 7efe6bcf1700 10 cephx client: 0x55e9aed87ee0 handle_response ret = -13 2021-07-22T03:33:53.063+0000 7efe6bcf1700 1 -- 172.21.15.50:0/23364 >> v1:172.21.15.50:6791/0 conn(0x55e9aee70800 legacy=0x55e9af9e0800 unknown :-1 s=STATE_CONNECTION_ESTABLISHED l=1).mark_down 2021-07-22T03:33:53.063+0000 7efe6bcf1700 1 -- 172.21.15.50:0/23364 <== mon.0 v1:172.21.15.50:6789/0 2 ==== auth_reply(proto 2 -13 (13) Permission denied) v1 ==== 24+0+0 (unknown 4080774440 0 0) 0x55e9b076dd40 con 0x55e9aee70c00 2021-07-22T03:33:53.063+0000 7efe6bcf1700 10 cephx client: 0x55e9aed87e10 handle_response ret = -13 2021-07-22T03:33:53.063+0000 7efe6bcf1700 1 -- 172.21.15.50:0/23364 >> v1:172.21.15.50:6789/0 conn(0x55e9aee70c00 legacy=0x55e9af9e1000 unknown :-1 s=STATE_CONNECTION_ESTABLISHED l=1).mark_down 2021-07-22T03:33:53.063+0000 7efe6bcf1700 1 -- 172.21.15.50:0/23364 <== mon.1 v1:172.21.15.50:6790/0 1 ==== auth_reply(proto 2 0 (0) Success) v1 ==== 33+0+0 (unknown 623661766 0 0) 0x55e9aed816c0 con 0x55e9b32ee000 2021-07-22T03:33:53.063+0000 7efe6bcf1700 10 monclient(hunting): _init_auth method 2 2021-07-22T03:33:53.063+0000 7efe6bcf1700 10 monclient(hunting): _init_auth already have auth, reseting 2021-07-22T03:33:53.063+0000 7efe6bcf1700 10 cephx client: reset 2021-07-22T03:33:53.063+0000 7efe6bcf1700 10 cephx client: 0x55e9aed871e0 handle_response ret = 0 2021-07-22T03:33:53.063+0000 7efe6bcf1700 10 cephx client: got initial server challenge 215dfa99bc67a265 2021-07-22T03:33:53.063+0000 7efe6bcf1700 10 cephx client: validate_tickets: want=55 need=0 have=55 2021-07-22T03:33:53.063+0000 7efe6bcf1700 10 cephx: validate_tickets want 55 have 23 need 32 2021-07-22T03:33:53.063+0000 7efe6bcf1700 10 cephx client: want=55 need=32 have=23 2021-07-22T03:33:53.063+0000 7efe6bcf1700 10 cephx client: build_request 2021-07-22T03:33:53.063+0000 7efe6bcf1700 20 cephx client: old ticket len=96 2021-07-22T03:33:53.063+0000 7efe6bcf1700 10 cephx client: get auth session key: client_challenge 90515daa5a75b25c 2021-07-22T03:33:53.063+0000 7efe6bcf1700 1 -- 172.21.15.50:0/23364 --> v1:172.21.15.50:6790/0 -- auth(proto 2 132 bytes epoch 0) v1 -- 0x55e9b32f8780 con 0x55e9b32ee000 2021-07-22T03:33:53.063+0000 7efe6bcf1700 1 -- 172.21.15.50:0/23364 <== mon.1 v1:172.21.15.50:6790/0 2 ==== auth_reply(proto 2 -13 (13) Permission denied) v1 ==== 24+0+0 (unknown 4080774440 0 0) 0x55e9b0762d00 con 0x55e9b32ee000 2021-07-22T03:33:53.063+0000 7efe6bcf1700 10 cephx client: 0x55e9aed871e0 handle_response ret = -13 2021-07-22T03:33:53.063+0000 7efe6bcf1700 1 -- 172.21.15.50:0/23364 >> v1:172.21.15.50:6790/0 conn(0x55e9b32ee000 legacy=0x55e9b32f4000 unknown :-1 s=STATE_CONNECTION_ESTABLISHED l=1).mark_down 2021-07-22T03:33:53.063+0000 7efe6bcf1700 10 monclient: _finish_hunting -13 2021-07-22T03:33:53.063+0000 7efe6bcf1700 1 monclient: no mon sessions established 2021-07-22T03:33:53.063+0000 7efe6bcf1700 10 monclient: _finish_auth -13
rados/singleton/{all/rebuild-mondb mon_election/connectivity msgr-failures/many msgr/async-v1only objectstore/bluestore-comp-lz4 rados supported-random-distro$/{ubuntu_latest}}
/a/yuriw-2021-07-21_19:04:13-rados-wip-yuri4-testing-2021-07-21-0859-pacific-distro-basic-smithi/6285312
Updated by Neha Ojha over 2 years ago
- Priority changed from Normal to High
rados/singleton/{all/rebuild-mondb mon_election/classic msgr-failures/many msgr/async-v1only objectstore/bluestore-comp-zlib rados supported-random-distro$/{centos_8.stream}}
/a/benhanokh-2021-08-04_06:12:22-rados-wip_gbenhano_ncbz-distro-basic-smithi/6310647
Updated by Neha Ojha over 2 years ago
/a/yuriw-2021-08-11_14:27:15-rados-wip-yuri8-testing-2021-08-09-0844-pacific-distro-basic-smithi/6334927
Updated by Neha Ojha over 2 years ago
rados/singleton/{all/rebuild-mondb mon_election/connectivity msgr-failures/many msgr/async-v1only objectstore/bluestore-comp-lz4 rados supported-random-distro$/{centos_8}}
/a/yuriw-2021-09-16_18:23:18-rados-wip-yuri2-testing-2021-09-16-0923-distro-basic-smithi/6393541
Updated by Neha Ojha over 2 years ago
Updated by Neha Ojha over 2 years ago
- Related to Bug #52621: cephx: verify_authorizer could not decrypt ticket info: error: bad magic in decode_decrypt added
Updated by Neha Ojha over 2 years ago
Seems to happen when all the three mons return -13
/a/nojha-2021-09-22_21:52:06-rados:singleton-wip-yuri2-testing-2021-09-16-0923-distro-basic-smithi/6402985/
2021-09-23T01:18:45.968+0000 7fcfe67f8700 1 -- 172.21.15.142:0/38616 <== mon.0 v1:172.21.15.142:6789/0 2 ==== auth_reply(proto 2 -13 (13) Permission denied) v1 ==== 24+0+0 (unknown 4080774440 0 0) 0x55d8655ad860 con 0x55d85f587400 2021-09-23T01:18:45.968+0000 7fcfe67f8700 1 -- 172.21.15.142:0/38616 <== mon.2 v1:172.21.15.142:6791/0 2 ==== auth_reply(proto 2 -13 (13) Permission denied) v1 ==== 24+0+0 (unknown 4080774440 0 0) 0x55d8655b3d40 con 0x55d85f587000 2021-09-23T01:18:45.968+0000 7fcfe67f8700 1 -- 172.21.15.142:0/38616 <== mon.1 v1:172.21.15.142:6790/0 2 ==== auth_reply(proto 2 -13 (13) Permission denied) v1 ==== 24+0+0 (unknown 4080774440 0 0) 0x55d85e853380 con 0x55d865650000 ... 2021-09-23T01:18:45.968+0000 7fcfe67f8700 10 monclient: _finish_hunting -13 2021-09-23T01:18:45.968+0000 7fcfe67f8700 1 monclient: no mon sessions established 2021-09-23T01:18:45.968+0000 7fcfe67f8700 10 monclient: _finish_auth -13
Comparing with a passed test /a/nojha-2021-09-22_21:52:06-rados:singleton-wip-yuri2-testing-2021-09-16-0923-distro-basic-smithi/6402979/, even if there is one monitor not returning an error, things work fine.
2021-09-23T01:18:46.282+0000 7f56b6ac4700 1 -- 172.21.15.101:0/1620652081 <== mon.2 v1:172.21.15.101:6791/0 2 ==== auth_reply(proto 2 -13 (13) Permission denied) v1 ==== 24+0+0 (unknown 4003534095 0 0) 0x56082f74b860 con 0x56082f69cc00 2021-09-23T01:18:46.282+0000 7f56b6ac4700 10 cephx client: 0x56082f77f450 handle_response ret = -13 2021-09-23T01:18:46.282+0000 7f56b6ac4700 1 -- 172.21.15.101:0/1620652081 <== mon.1 v1:172.21.15.101:6790/0 2 ==== auth_reply(proto 2 -13 (13) Permission denied) v1 ==== 24+0+0 (unknown 4003534095 0 0) 0x56082f607ba0 con 0x56082f54fc00 2021-09-23T01:18:46.282+0000 7f56b6ac4700 1 -- 172.21.15.101:0/1620652081 <== mon.0 v1:172.21.15.101:6789/0 2 ==== auth_reply(proto 2 0 (0) Success) v1 ==== 233+0+0 (unknown 2285516212 0 0) 0x56082f87e680 con 0x56082f57c800 ... 2021-09-23T01:18:46.282+0000 7f56b6ac4700 10 monclient: _finish_hunting 0 2021-09-23T01:18:46.282+0000 7f56b6ac4700 1 monclient: found mon.a
Mon.0 2021-09-23T01:18:46.282+0000 7f11cfbbd700 1 -- v1:172.21.15.101:6789/0 <== client.10026 v1:172.21.15.101:0/1620652081 2 ==== auth(proto 2 132 bytes epoch 0) v1 ==== 162+0+0 (unknown 491090841 0 0) 0x56478f0e9200 con 0x56478e93a000 2021-09-23T01:18:46.282+0000 7f11cfbbd700 20 mon.a@0(leader) e1 _ms_dispatch existing session 0x56478e956000 for client.10026 2021-09-23T01:18:46.282+0000 7f11cfbbd700 20 mon.a@0(leader) e1 entity_name mgr.x global_id 10026 (reclaim_pending) caps 2021-09-23T01:18:46.282+0000 7f11cfbbd700 10 mon.a@0(leader).paxosservice(auth 1..11) dispatch 0x56478f0e9200 auth(proto 2 132 bytes epoch 0) v1 from client.10026 v1:172.21.15.101:0/1620652081 con 0x56478e93a000 2021-09-23T01:18:46.282+0000 7f11cfbbd700 5 mon.a@0(leader).paxos(paxos active c 66..79) is_readable = 1 - now=2021-09-23T01:18:46.282821+0000 lease_expire=2021-09-23T01:18:51.247354+0000 has v0 lc 79 2021-09-23T01:18:46.282+0000 7f11cfbbd700 10 mon.a@0(leader).auth v11 preprocess_query auth(proto 2 132 bytes epoch 0) v1 from client.10026 v1:172.21.15.101:0/1620652081 2021-09-23T01:18:46.282+0000 7f11cfbbd700 10 mon.a@0(leader).auth v11 prep_auth() blob_size=132 2021-09-23T01:18:46.282+0000 7f11cfbbd700 10 cephx server mgr.x: handle_request get_auth_session_key for mgr.x 2021-09-23T01:18:46.282+0000 7f11cfbbd700 20 cephx server mgr.x: checking key: req.key=279919d7d2b0a3f5 expected_key=279919d7d2b0a3f5 2021-09-23T01:18:46.282+0000 7f11cfbbd700 20 cephx server mgr.x: checking old_ticket: secret_id=2 len=96, old_ticket_may_be_omitted=0 2021-09-23T01:18:46.282+0000 7f11cfbbd700 20 cephx server mgr.x: decoded old_ticket: global_id=10026 2021-09-23T01:18:46.282+0000 7f11cfbbd700 10 cephx server mgr.x: allowing reclaim of global_id 10026 (valid ticket presented, will encrypt new ticket) 2021-09-23T01:18:46.282+0000 7f11cfbbd700 10 cephx: build_service_ticket_reply encoding 1 tickets with secret AQBi1UthG+ahNBAAcpjN4OgfIxzmTUSWjZuo4g== 2021-09-23T01:18:46.282+0000 7f11cfbbd700 10 cephx: build_service_ticket service auth secret_id 2 ticket_info.ticket.name=mgr.x ticket.global_id 10026 2021-09-23T01:18:46.282+0000 7f11cfbbd700 10 cephx keyserverdata: get_caps: name=mgr.x 2021-09-23T01:18:46.282+0000 7f11cfbbd700 10 cephx keyserverdata: get_secret: num of caps=3 2021-09-23T01:18:46.282+0000 7f11cfbbd700 10 mon.a@0(leader) e1 ms_handle_authentication session 0x56478e956000 con 0x56478e93a000 addr v1:172.21.15.101:0/1620652081 MonSession(client.10026 v1:172.21.15.101:0/1620652081 is open , features 0x3f01cfbb7ffdffff (luminous)) 2021-09-23T01:18:46.282+0000 7f11cfbbd700 2 mon.a@0(leader) e1 send_reply 0x56478e9725a0 0x56478f154ea0 auth_reply(proto 2 0 (0) Success) v1 Mon.1 2021-09-23T01:18:46.282+0000 7f1ee27cc700 1 -- v1:172.21.15.101:6790/0 <== client.10026 v1:172.21.15.101:0/1620652081 2 ==== auth(proto 2 132 bytes epoch 0) v1 ==== 162+0+0 (unknown 247189302 0 0) 0x559522e53080 con 0x5595229aa000 2021-09-23T01:18:46.282+0000 7f1ee27cc700 20 mon.b@1(peon) e1 _ms_dispatch existing session 0x559522df1680 for client.10026 2021-09-23T01:18:46.282+0000 7f1ee27cc700 20 mon.b@1(peon) e1 entity_name mgr.x global_id 10026 (reclaim_pending) caps 2021-09-23T01:18:46.282+0000 7f1ee27cc700 10 mon.b@1(peon).paxosservice(auth 1..11) dispatch 0x559522e53080 auth(proto 2 132 bytes epoch 0) v1 from client.10026 v1:172.21.15.101:0/1620652081 con 0x5595229aa000 2021-09-23T01:18:46.282+0000 7f1ee27cc700 5 mon.b@1(peon).paxos(paxos active c 66..79) is_readable = 1 - now=2021-09-23T01:18:46.282865+0000 lease_expire=2021-09-23T01:18:51.247354+0000 has v0 lc 79 2021-09-23T01:18:46.282+0000 7f1ee27cc700 10 mon.b@1(peon).auth v11 preprocess_query auth(proto 2 132 bytes epoch 0) v1 from client.10026 v1:172.21.15.101:0/1620652081 2021-09-23T01:18:46.282+0000 7f1ee27cc700 10 mon.b@1(peon).auth v11 prep_auth() blob_size=132 2021-09-23T01:18:46.282+0000 7f1ee27cc700 10 cephx server mgr.x: handle_request get_auth_session_key for mgr.x 2021-09-23T01:18:46.282+0000 7f1ee27cc700 20 cephx server mgr.x: checking key: req.key=50ea35923a58c457 expected_key=50ea35923a58c457 2021-09-23T01:18:46.282+0000 7f1ee27cc700 20 cephx server mgr.x: checking old_ticket: secret_id=2 len=96, old_ticket_may_be_omitted=0 2021-09-23T01:18:46.282+0000 7f1ee27cc700 0 cephx: ceph_decode_ticket could not get service secret for service_id=auth secret_id=2 2021-09-23T01:18:46.282+0000 7f1ee27cc700 0 cephx server mgr.x: attempt to reclaim global_id 10026 using bad ticket 2021-09-23T01:18:46.282+0000 7f1ee27cc700 0 cephx server mgr.x: could not verify old ticket 2021-09-23T01:18:46.282+0000 7f1ee27cc700 2 mon.b@1(peon) e1 send_reply 0x5595229ad680 0x559522f0ad00 auth_reply(proto 2 -13 (13) Permission denied) v1 Mon.2 2021-09-23T01:18:46.282+0000 7fc255234700 1 -- v1:172.21.15.101:6791/0 <== client.10026 v1:172.21.15.101:0/1620652081 2 ==== auth(proto 2 132 bytes epoch 0) v1 ==== 162+0+0 (unknown 1546966740 0 0) 0x559589d96180 con 0x559589dab800 2021-09-23T01:18:46.282+0000 7fc255234700 20 mon.c@2(peon) e1 _ms_dispatch existing session 0x559589ccb440 for client.10026 2021-09-23T01:18:46.282+0000 7fc255234700 20 mon.c@2(peon) e1 entity_name mgr.x global_id 10026 (reclaim_pending) caps 2021-09-23T01:18:46.282+0000 7fc255234700 10 mon.c@2(peon).paxosservice(auth 1..11) dispatch 0x559589d96180 auth(proto 2 132 bytes epoch 0) v1 from client.10026 v1:172.21.15.101:0/1620652081 con 0x559589dab800 2021-09-23T01:18:46.282+0000 7fc255234700 5 mon.c@2(peon).paxos(paxos active c 66..79) is_readable = 1 - now=2021-09-23T01:18:46.282871+0000 lease_expire=2021-09-23T01:18:51.247354+0000 has v0 lc 79 2021-09-23T01:18:46.282+0000 7fc255234700 10 mon.c@2(peon).auth v11 preprocess_query auth(proto 2 132 bytes epoch 0) v1 from client.10026 v1:172.21.15.101:0/1620652081 2021-09-23T01:18:46.282+0000 7fc255234700 10 mon.c@2(peon).auth v11 prep_auth() blob_size=132 2021-09-23T01:18:46.282+0000 7fc255234700 10 cephx server mgr.x: handle_request get_auth_session_key for mgr.x 2021-09-23T01:18:46.282+0000 7fc255234700 20 cephx server mgr.x: checking key: req.key=85f72a27737c1f4b expected_key=85f72a27737c1f4b 2021-09-23T01:18:46.282+0000 7fc255234700 20 cephx server mgr.x: checking old_ticket: secret_id=2 len=96, old_ticket_may_be_omitted=0 2021-09-23T01:18:46.282+0000 7fc255234700 0 cephx: ceph_decode_ticket could not get service secret for service_id=auth secret_id=2 2021-09-23T01:18:46.282+0000 7fc255234700 0 cephx server mgr.x: attempt to reclaim global_id 10026 using bad ticket 2021-09-23T01:18:46.282+0000 7fc255234700 0 cephx server mgr.x: could not verify old ticket 2021-09-23T01:18:46.282+0000 7fc255234700 2 mon.c@2(peon) e1 send_reply 0x559589cfe4b0 0x559589df8680 auth_reply(proto 2 -13 (13) Permission denied) v1
Updated by Sage Weil over 2 years ago
- Status changed from New to Fix Under Review
- Pull request ID set to 43335
Updated by Neha Ojha over 2 years ago
- Assignee changed from Neha Ojha to Sage Weil
Updated by Kefu Chai over 2 years ago
- Status changed from Fix Under Review to Pending Backport
Updated by Backport Bot over 2 years ago
- Copied to Backport #52844: pacific: ceph pg dump times out due to no active mgr added
Updated by Loïc Dachary over 2 years ago
- Status changed from Pending Backport to Resolved
While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
Updated by Neha Ojha almost 2 years ago
- Related to Bug #52993: upgrade:octopus-x Test: Upgrade test failed due to timeout of the "ceph pg dump" command added
Updated by Laura Flores almost 2 years ago
- Status changed from Resolved to Pending Backport
- Backport changed from pacific to pacific, octopus
Regarding https://tracker.ceph.com/issues/52993#note-3, this fix should also be backported to Octopus.
Updated by Backport Bot almost 2 years ago
- Copied to Backport #55634: octopus: ceph pg dump times out due to no active mgr added