Project

General

Profile

Actions

Bug #22462

closed

mon: unknown message type 1537 in luminous->mimic upgrade tests

Added by Josh Durgin over 6 years ago. Updated almost 6 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
Correctness/Safety
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Monitor
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

http://pulpito.ceph.com/teuthology-2017-12-14_22:26:40-upgrade:luminous-x:point-to-point-x-master-distro-basic-ovh/

mon.a was mimic, mon.b and c were still luminous. mon.a was out of quorum, with logs including:

2017-12-15 17:05:26.772 7fb0e30ec700  1 -- 172.21.2.165:6789/0 --> 172.21.2.121:6789/0 -- forward(log(1 entries from seq 10 at 2017-12-15 17:05:26.775932) v1 caps allow * tid 14 con_features 2305244844817580027) v3 -- 0x5558549d5600 con 
0
2017-12-15 17:05:26.772 7fb0df0e4700 10 _calc_signature seq 67 front_crc_ = 3288517548 middle_crc = 0 data_crc = 0 sig = 4805415437538457764
2017-12-15 17:05:26.772 7fb0df0e4700 20 Putting signature in client message(seq # 67): sig = 4805415437538457764
2017-12-15 17:05:26.860 7fb0df0e4700 10 _calc_signature seq 204841642 front_crc_ = 2129739439 middle_crc = 0 data_crc = 0 sig = 7961375364363924504
2017-12-15 17:05:26.860 7fb0df0e4700  0 can't decode unknown message type 1537 MSG_AUTH=17
2017-12-15 17:05:26.860 7fb0df0e4700  1 -- 172.21.2.165:6789/0 >> 172.21.2.121:6789/0 conn(0x555854e1dc00 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=578 cs=1 l=0).process decode message failed 
2017-12-15 17:05:26.860 7fb0e30ec700  1 -- 172.21.2.165:6789/0 <== mon.0 172.21.2.121:6789/0 204841642 ==== paxos(begin lc 2093 fc 0 pn 1100 opn 0) v4 ==== 11398+0+0 (2129739439 0 0) 0x5558549d5600 con 0x555854e1dc00
2017-12-15 17:05:26.860 7fb0e30ec700 20 mon.a@1(peon) e1 _ms_dispatch existing session 0x555854b2e4c0 for mon.0 172.21.2.121:6789/0
2017-12-15 17:05:26.860 7fb0e30ec700 20 mon.a@1(peon) e1  caps allow *
2017-12-15 17:05:26.860 7fb0e30ec700 20 is_capable service=mon command= read on cap allow *
2017-12-15 17:05:26.860 7fb0e30ec700 20  allow so far , doing grant allow *
2017-12-15 17:05:26.860 7fb0e30ec700 20  allow all
2017-12-15 17:05:26.860 7fb0e30ec700 20 is_capable service=mon command= exec on cap allow *
2017-12-15 17:05:26.860 7fb0e30ec700 20  allow so far , doing grant allow *
2017-12-15 17:05:26.860 7fb0e30ec700 20  allow all
2017-12-15 17:05:26.860 7fb0e30ec700 10 mon.a@1(peon).paxos(paxos active c 1507..2093) handle_begin paxos(begin lc 2093 fc 0 pn 1100 opn 0) v4
2017-12-15 17:05:26.860 7fb0e30ec700 10 mon.a@1(peon).paxos(paxos updating c 1507..2093) accepting value for 2094 pn 1100
2017-12-15 17:05:26.860 7fb0df0e4700  1 -- 172.21.2.165:6789/0 >> - conn(0x555854e6e000 :6789 s=STATE_ACCEPTING pgs=0 cs=0 l=0)._process_connection sd=25 -
2017-12-15 17:05:26.860 7fb0df0e4700 10 mon.a@1(peon) e1 ms_verify_authorizer 172.21.2.121:6789/0 mon protocol 2
2017-12-15 17:05:26.860 7fb0df0e4700 10 cephx: verify_authorizer decrypted service mon secret_id=18446744073709551615
2017-12-15 17:05:26.860 7fb0df0e4700 10 cephx: verify_authorizer global_id=0
2017-12-15 17:05:26.860 7fb0df0e4700 10 cephx: verify_authorizer ok nonce 2a487cb01d4ed43b reply_bl.length()=36
2017-12-15 17:05:26.860 7fb0df0e4700  0 -- 172.21.2.165:6789/0 >> 172.21.2.121:6789/0 conn(0x555854e6e000 :6789 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 2 vs existing csq=1 existing_state=STATE_STANDBY
2017-12-15 17:05:26.860 7fb0df0e4700  1 -- 172.21.2.165:6789/0 >> 172.21.2.121:6789/0 conn(0x555854e6e000 :6789 s=STATE_CLOSED pgs=0 cs=0 l=0).handle_connect_msg stop myself to swap existing
2017-12-15 17:05:26.864 7fb0df0e4700 10 mon.a@1(peon) e1 ms_verify_authorizer 172.21.2.121:6789/0 mon protocol 2
2017-12-15 17:05:26.864 7fb0df0e4700 10 cephx: verify_authorizer decrypted service mon secret_id=18446744073709551615
2017-12-15 17:05:26.864 7fb0df0e4700 10 cephx: verify_authorizer global_id=0
2017-12-15 17:05:26.864 7fb0df0e4700 10 cephx: verify_authorizer ok nonce 725a06fb2cd89a32 reply_bl.length()=36
2017-12-15 17:05:26.864 7fb0df0e4700 10 In get_auth_session_handler for protocol 2
2017-12-15 17:05:26.864 7fb0e30ec700  1 -- 172.21.2.165:6789/0 --> 172.21.2.121:6789/0 -- paxos(accept lc 2093 fc 0 pn 1100 opn 0) v4 -- 0x5558550ae600 con 0
2017-12-15 17:05:26.864 7fb0e30ec700 10 mon.a@1(peon) e1 ms_handle_reset 0x555854e6e000 172.21.2.121:6789/0
2017-12-15 17:05:26.864 7fb0df0e4700 10 _calc_signature seq 68 front_crc_ = 1210378228 middle_crc = 0 data_crc = 0 sig = 8133008694137864262
2017-12-15 17:05:26.864 7fb0df0e4700 20 Putting signature in client message(seq # 68): sig = 8133008694137864262
2017-12-15 17:05:26.864 7fb0df0e4700  0 can't decode unknown message type 1537 MSG_AUTH=17

and continuing:

2017-12-15 23:50:36.726 7fb0e00e6700 10 mon.a@1(probing) e1 ms_verify_authorizer 172.21.2.121:6790/0 mon protocol 2
2017-12-15 23:50:36.726 7fb0e00e6700 10 cephx: verify_authorizer decrypted service mon secret_id=18446744073709551615
2017-12-15 23:50:36.726 7fb0e00e6700 10 cephx: verify_authorizer global_id=0
2017-12-15 23:50:36.726 7fb0e00e6700 10 cephx: verify_authorizer ok nonce 2c69fc1306aa9050 reply_bl.length()=36
2017-12-15 23:50:36.726 7fb0e00e6700 10 In get_auth_session_handler for protocol 2
2017-12-15 23:50:36.726 7fb0df8e5700 10 mon.a@1(probing) e1 ms_verify_authorizer 172.21.2.121:6789/0 mon protocol 2
2017-12-15 23:50:36.726 7fb0df8e5700 10 cephx: verify_authorizer decrypted service mon secret_id=18446744073709551615
2017-12-15 23:50:36.726 7fb0df8e5700 10 cephx: verify_authorizer global_id=0
2017-12-15 23:50:36.726 7fb0df8e5700 10 cephx: verify_authorizer ok nonce 51d54ab542324674 reply_bl.length()=36
2017-12-15 23:50:36.726 7fb0df8e5700 10 In get_auth_session_handler for protocol 2
2017-12-15 23:50:36.726 7fb0df8e5700  0 can't decode unknown message type 1537 MSG_AUTH=17
2017-12-15 23:50:36.726 7fb0df8e5700  1 -- 172.21.2.165:6789/0 >> 172.21.2.121:6789/0 conn(0x555854e1dc00 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=11470279 cs=11469271 l=0).process decode message failed 
2017-12-15 23:50:36.730 7fb0e00e6700  0 can't decode unknown message type 1537 MSG_AUTH=17
2017-12-15 23:50:36.730 7fb0e00e6700  1 -- 172.21.2.165:6789/0 >> 172.21.2.121:6790/0 conn(0x555854e1e300 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=9317698 cs=9323132 l=0).process decode message failed 
2017-12-15 23:50:36.730 7fb0df8e5700  1 -- 172.21.2.165:6789/0 >> - conn(0x555857416a00 :6789 s=STATE_ACCEPTING pgs=0 cs=0 l=0)._process_connection sd=11 -
2017-12-15 23:50:36.730 7fb0df8e5700 10 mon.a@1(probing) e1 ms_verify_authorizer 172.21.2.121:6789/0 mon protocol 2
2017-12-15 23:50:36.730 7fb0df8e5700 10 cephx: verify_authorizer decrypted service mon secret_id=18446744073709551615
2017-12-15 23:50:36.730 7fb0df8e5700 10 cephx: verify_authorizer global_id=0
2017-12-15 23:50:36.730 7fb0df8e5700 10 cephx: verify_authorizer ok nonce 288fc77d655f47a8 reply_bl.length()=36
2017-12-15 23:50:36.730 7fb0df8e5700  0 -- 172.21.2.165:6789/0 >> 172.21.2.121:6789/0 conn(0x555857416a00 :6789 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 11469272 vs existing csq=11469271 existing_state=STATE_STANDBY
2017-12-15 23:50:36.730 7fb0df0e4700  1 -- 172.21.2.165:6789/0 reap_dead start
2017-12-15 23:50:36.730 7fb0e30ec700 10 mon.a@1(probing) e1 ms_handle_reset 0x555857416a00 172.21.2.121:6789/0
2017-12-15 23:50:36.730 7fb0df8e5700  1 -- 172.21.2.165:6789/0 >> 172.21.2.121:6789/0 conn(0x555857416a00 :6789 s=STATE_CLOSED pgs=0 cs=0 l=0).handle_connect_msg stop myself to swap existing
2017-12-15 23:50:36.730 7fb0df8e5700  1 -- 172.21.2.165:6789/0 >> - conn(0x55585790ce00 :6789 s=STATE_ACCEPTING pgs=0 cs=0 l=0)._process_connection sd=25 -
2017-12-15 23:50:36.730 7fb0df8e5700 10 mon.a@1(probing) e1 ms_verify_authorizer 172.21.2.121:6789/0 mon protocol 2
2017-12-15 23:50:36.730 7fb0df8e5700 10 cephx: verify_authorizer decrypted service mon secret_id=18446744073709551615
2017-12-15 23:50:36.730 7fb0df8e5700 10 cephx: verify_authorizer global_id=0
2017-12-15 23:50:36.730 7fb0df8e5700 10 cephx: verify_authorizer ok nonce 2a382f157676ea4 reply_bl.length()=36
2017-12-15 23:50:36.730 7fb0df8e5700 10 In get_auth_session_handler for protocol 2
2017-12-15 23:50:36.730 7fb0df8e5700 10 mon.a@1(probing) e1 ms_verify_authorizer 172.21.2.121:6790/0 mon protocol 2
2017-12-15 23:50:36.730 7fb0df8e5700 10 cephx: verify_authorizer decrypted service mon secret_id=18446744073709551615
2017-12-15 23:50:36.730 7fb0df8e5700 10 cephx: verify_authorizer global_id=0
2017-12-15 23:50:36.730 7fb0df8e5700 10 cephx: verify_authorizer ok nonce 33d1f4010539dc16 reply_bl.length()=36
2017-12-15 23:50:36.730 7fb0df8e5700  0 -- 172.21.2.165:6789/0 >> 172.21.2.121:6790/0 conn(0x55585790ce00 :6789 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 9323133 vs existing csq=9323132 existing_state=STATE_STANDBY
2017-12-15 23:50:36.730 7fb0e30ec700 10 mon.a@1(probing) e1 ms_handle_reset 0x55585790ce00 172.21.2.121:6790/0
2017-12-15 23:50:36.730 7fb0df8e5700  1 -- 172.21.2.165:6789/0 >> 172.21.2.121:6790/0 conn(0x55585790ce00 :6789 s=STATE_CLOSED pgs=0 cs=0 l=0).handle_connect_msg stop myself to swap existing
2017-12-15 23:50:36.730 7fb0df8e5700  0 can't decode unknown message type 1537 MSG_AUTH=17
2017-12-15 23:50:36.730 7fb0df8e5700  1 -- 172.21.2.165:6789/0 >> 172.21.2.121:6789/0 conn(0x555854e1dc00 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=11470281 cs=11469273 l=0).process decode message failed 

versions were: 12.2.2-37-g5bef6fa and 13.0.0-4056-g5043e4b

Actions #1

Updated by Yuri Weinstein about 6 years ago

  • Assignee set to Kefu Chai

@Kefu Chai could pls take a look?

Actions #2

Updated by Kefu Chai about 6 years ago

MMonHealth (MSG_MON_HEALTH=0x601 (1537)) was removed in https://github.com/ceph/ceph/commit/7b4a741fbda4dc817a003c694e96c8df7c1d2092 .

$ git tag --contains 7b4a741fbda4dc817a003c694e96c8df7c1d2092
v13.0.1
$ git merge-base --is-ancestor 7b4a741fbda4dc817a003c694e96c8df7c1d2092 5043e4b && echo 1
1

so at that moment, the mimic mon.a was reading the MMonHealth sent from luminous mon.{b,c} . and it did not understand MMonHealth.

Actions #3

Updated by Kefu Chai about 6 years ago

  • Status changed from New to Fix Under Review
  • Component(RADOS) Monitor added
Actions #4

Updated by Kefu Chai about 6 years ago

  • Status changed from Fix Under Review to Resolved
Actions #5

Updated by Aleksandr Rudenko almost 6 years ago

Maybe i have the same issue during upgrade Jewel->Luminous http://tracker.ceph.com/issues/24481?next_issue_id=24480&prev_issue_id=13999 ?

Actions

Also available in: Atom PDF