Actions
Bug #20230
closedmon: fail to form large quorum; msg/async busy loop
% Done:
0%
Source:
Tags:
Backport:
jewel,kraken
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
2017-06-08 22:45:16.293178 7f1284fa1700 10 cephx: verify_authorizer ok nonce 640b4d61622a698c reply_bl.length()=36 2017-06-08 22:45:16.293183 7f1284fa1700 1 -- 172.21.15.47:6790/0 >> 172.21.15.47:6791/0 conn(0x7f12999ba800 :6790 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg existing racing replace or mark_down happened while replacing. existing_state=STATE_CLOSED 2017-06-08 22:45:16.293309 7f1284fa1700 10 mon.e@3(probing) e1 ms_verify_authorizer 172.21.15.47:6791/0 mon protocol 2 2017-06-08 22:45:16.293317 7f1284fa1700 10 cephx: verify_authorizer decrypted service mon secret_id=18446744073709551615 2017-06-08 22:45:16.293337 7f1284fa1700 10 cephx: verify_authorizer global_id=0 2017-06-08 22:45:16.293347 7f1284fa1700 10 cephx: verify_authorizer ok nonce 10cefe7a2bf928bd reply_bl.length()=36 2017-06-08 22:45:16.293351 7f1284fa1700 1 -- 172.21.15.47:6790/0 >> 172.21.15.47:6791/0 conn(0x7f12999ba800 :6790 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg existing racing replace or mark_down happened while replacing. existing_state=STATE_CLOSED 2017-06-08 22:45:16.293482 7f1284fa1700 10 mon.e@3(probing) e1 ms_verify_authorizer 172.21.15.47:6791/0 mon protocol 2 2017-06-08 22:45:16.293490 7f1284fa1700 10 cephx: verify_authorizer decrypted service mon secret_id=18446744073709551615 2017-06-08 22:45:16.293510 7f1284fa1700 10 cephx: verify_authorizer global_id=0 2017-06-08 22:45:16.293520 7f1284fa1700 10 cephx: verify_authorizer ok nonce 44d294235565df39 reply_bl.length()=36 2017-06-08 22:45:16.293527 7f1284fa1700 1 -- 172.21.15.47:6790/0 >> 172.21.15.47:6791/0 conn(0x7f12999ba800 :6790 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg existing racing replace or mark_down happened while replacing. existing_state=STATE_CLOSED 2017-06-08 22:45:16.293657 7f1284fa1700 10 mon.e@3(probing) e1 ms_verify_authorizer 172.21.15.47:6791/0 mon protocol 2 2017-06-08 22:45:16.293665 7f1284fa1700 10 cephx: verify_authorizer decrypted service mon secret_id=18446744073709551615 2017-06-08 22:45:16.293685 7f1284fa1700 10 cephx: verify_authorizer global_id=0 2017-06-08 22:45:16.293695 7f1284fa1700 10 cephx: verify_authorizer ok nonce 6b0a40433a7bc7ab reply_bl.length()=36 2017-06-08 22:45:16.293700 7f1284fa1700 1 -- 172.21.15.47:6790/0 >> 172.21.15.47:6791/0 conn(0x7f12999ba800 :6790 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg existing racing replace or mark_down happened while replacing. existing_state=STATE_CLOSED 2017-06-08 22:45:16.293828 7f1284fa1700 10 mon.e@3(probing) e1 ms_verify_authorizer 172.21.15.47:6791/0 mon protocol 2 2017-06-08 22:45:16.293836 7f1284fa1700 10 cephx: verify_authorizer decrypted service mon secret_id=18446744073709551615 2017-06-08 22:45:16.293857 7f1284fa1700 10 cephx: verify_authorizer global_id=0 2017-06-08 22:45:16.293867 7f1284fa1700 10 cephx: verify_authorizer ok nonce 53dc56b13c8d1a82 reply_bl.length()=36 2017-06-08 22:45:16.293871 7f1284fa1700 1 -- 172.21.15.47:6790/0 >> 172.21.15.47:6791/0 conn(0x7f12999ba800 :6790 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg existing racing replace or mark_down happened while replacing. existing_state=STATE_CLOSED 2017-06-08 22:45:16.293999 7f1284fa1700 10 mon.e@3(probing) e1 ms_verify_authorizer 172.21.15.47:6791/0 mon protocol 2 2017-06-08 22:45:16.294007 7f1284fa1700 10 cephx: verify_authorizer decrypted service mon secret_id=18446744073709551615 2017-06-08 22:45:16.294027 7f1284fa1700 10 cephx: verify_authorizer global_id=0 2017-06-08 22:45:16.294037 7f1284fa1700 10 cephx: verify_authorizer ok nonce 308807b74e5faadc reply_bl.length()=36 2017-06-08 22:45:16.294043 7f1284fa1700 1 -- 172.21.15.47:6790/0 >> 172.21.15.47:6791/0 conn(0x7f12999ba800 :6790 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg existing racing replace or mark_down happened while replacing. existing_state=STATE_CLOSED 2017-06-08 22:45:16.294165 7f1284fa1700 10 mon.e@3(probing) e1 ms_verify_authorizer 172.21.15.47:6791/0 mon protocol 2 2017-06-08 22:45:16.294173 7f1284fa1700 10 cephx: verify_authorizer decrypted service mon secret_id=18446744073709551615 2017-06-08 22:45:16.294193 7f1284fa1700 10 cephx: verify_authorizer global_id=0 2017-06-08 22:45:16.294203 7f1284fa1700 10 cephx: verify_authorizer ok nonce 22c316b82b07fb92 reply_bl.length()=36 2017-06-08 22:45:16.294210 7f1284fa1700 1 -- 172.21.15.47:6790/0 >> 172.21.15.47:6791/0 conn(0x7f12999ba800 :6790 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg existing racing replace or mark_down happened while replacing. existing_state=STATE_CLOSED 2017-06-08 22:45:16.294333 7f1284fa1700 10 mon.e@3(probing) e1 ms_verify_authorizer 172.21.15.47:6791/0 mon protocol 2 2017-06-08 22:45:16.294341 7f1284fa1700 10 cephx: verify_authorizer decrypted service mon secret_id=18446744073709551615 2017-06-08 22:45:16.294361 7f1284fa1700 10 cephx: verify_authorizer global_id=0 2017-06-08 22:45:16.294371 7f1284fa1700 10 cephx: verify_authorizer ok nonce 178bc6577fca5c47 reply_bl.length()=36 2017-06-08 22:45:16.294377 7f1284fa1700 1 -- 172.21.15.47:6790/0 >> 172.21.15.47:6791/0 conn(0x7f12999ba800 :6790 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg existing racing replace or mark_down happened while replacing. existing_state=STATE_CLOSED 2017-06-08 22:45:16.294497 7f1284fa1700 10 mon.e@3(probing) e1 ms_verify_authorizer 172.21.15.47:6791/0 mon protocol 2 2017-06-08 22:45:16.294506 7f1284fa1700 10 cephx: verify_authorizer decrypted service mon secret_id=18446744073709551615 2017-06-08 22:45:16.294527 7f1284fa1700 10 cephx: verify_authorizer global_id=0
and as a result
2017-06-08 22:46:19.294935 7fed72117700 10 mon.b@0(probing) e1 outside_quorum now b,h,j,k,m,p,s, need 11
mon.e fails to join the quroum, test fails.
/a/sage-2017-06-08_20:27:41-rados-wip-sage-testing2-distro-basic-smithi/1271816
rados/multimon/{clusters/21.yaml mon_kv_backend/leveldb.yaml msgr-failures/few.yaml msgr/async.yaml objectstore/filestore-btrfs.yaml rados.yaml tasks/mon_recovery.yaml}
Updated by Sage Weil almost 7 years ago
if (existing->replacing || existing->state == STATE_CLOSED) { ldout(async_msgr->cct, 1) << __func__ << " existing racing replace or mark_down happened while replacing." << " existing_state=" << get_state_name(existing->state) << dendl; reply.global_seq = existing->peer_global_seq; r = _reply_accept(CEPH_MSGR_TAG_RETRY_GLOBAL, connect, reply, authorizer_reply); existing->lock.unlock(); if (r < 0) goto fail; return 0; }
I don't understand what this code is trying to do. It looks to me like the CLOSED connection has replacing=true so we end up in a busy loop because nothing is cleaning it up. But why not ignore or tear down that other connection here instead of looping?
Updated by Sage Weil almost 7 years ago
- Subject changed from mon: fail to form large quorum; busy auth msgr loop to mon: fail to form large quorum; msg/async busy loop
if (existing->replacing || existing->state == STATE_CLOSED) { ldout(async_msgr->cct, 1) << __func__ << " existing racing replace or mark_down happened while replacing." << " existing_state=" << get_state_name(existing->state) << dendl; reply.global_seq = existing->peer_global_seq; r = _reply_accept(CEPH_MSGR_TAG_RETRY_GLOBAL, connect, reply, authorizer_reply); existing->lock.unlock(); if (r < 0) goto fail; return 0; }
I don't understand what this code is trying to do. It looks to me like the CLOSED connection has replacing=true so we end up in a busy loop because nothing is cleaning it up. But why not ignore or tear down that other connection here instead of looping?
Updated by Alexey Sheplyakov almost 7 years ago
- Backport set to jewel,kraken,luminous
AsyncConnection::handle_connect_msg() has the same bug in Jewel [0], Kraken [1], and Luminous [2]
[0] https://github.com/ceph/ceph/blob/jewel/src/msg/async/AsyncConnection.cc#L1713-L1722
[1] https://github.com/ceph/ceph/blob/kraken/src/msg/async/AsyncConnection.cc#L1513-L1522
[2] https://github.com/ceph/ceph/blob/luminous/src/msg/async/AsyncConnection.cc#L1538-L1547
Updated by Sage Weil almost 7 years ago
- Status changed from 12 to Fix Under Review
Updated by Yuri Weinstein almost 7 years ago
Updated by Sage Weil almost 7 years ago
- Status changed from Fix Under Review to Pending Backport
- Priority changed from Immediate to Urgent
Updated by Nathan Cutler almost 7 years ago
- Backport changed from jewel,kraken,luminous to jewel,kraken
Updated by Nathan Cutler almost 7 years ago
- Copied to Backport #20314: jewel: mon: fail to form large quorum; msg/async busy loop added
Updated by Nathan Cutler almost 7 years ago
- Copied to Backport #20315: kraken: mon: fail to form large quorum; msg/async busy loop added
Updated by Nathan Cutler over 6 years ago
- Status changed from Pending Backport to Resolved
Actions