Project

General

Profile

Bug #24481

[mon] can't decode unknown message type 1544 MSG_AUTH=17

Added by Aleksandr Rudenko almost 6 years ago. Updated about 5 years ago.

Status:
Fix Under Review
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi, friends.

I updated cluster from Hammer(0.94.10) to Jewel(10.2.10) recently an all works fine.

Now, i can't update cluster from Jewel to Luminous(12.2.5).

I have 5 MONs.

I updated packages, restarted MON service but MON didn't joined to cluster.

In logs of new MOn i see:

2018-06-09 18:31:58.401195 7f53c4b94e40  0 mon.c@-1(probing).osd e97076 crush map has features 1107558400, adjusting msgr requires
2018-06-09 18:31:58.401199 7f53c4b94e40  0 mon.c@-1(probing).osd e97076 crush map has features 144115189183414272, adjusting msgr requires
2018-06-09 18:31:58.401201 7f53c4b94e40  0 mon.c@-1(probing).osd e97076 crush map has features 1107558400, adjusting msgr requires
2018-06-09 18:31:58.401640 7f53c4b94e40  1 mon.c@-1(probing).paxosservice(auth 27501..27729) refresh upgraded, format 0 -> 1
2018-06-09 18:31:58.405124 7f53c4b94e40  0 mon.c@-1(probing) e14  my rank is now 4 (was -1)
2018-06-09 18:31:58.407637 7f53bcf98700  0 -- 192.168.144.96:6789/0 >> 192.168.144.94:6789/0 conn(0x7f53cf49d800 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=5862845 cs=1 l=0).process missed message?  skipped from seq 0 to 324975786
2018-06-09 18:31:58.407698 7f53bc797700  0 -- 192.168.144.96:6789/0 >> 192.168.144.95:6789/0 conn(0x7f53cf49f000 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=5570472 cs=1 l=0).process missed message?  skipped from seq 0 to 581869062
2018-06-09 18:31:58.407885 7f53b833d700  1 mon.c@4(synchronizing) e14 sync_obtain_latest_monmap
2018-06-09 18:31:58.408133 7f53bd799700  0 -- 192.168.144.96:6789/0 >> 192.168.144.72:6789/0 conn(0x7f53cf49c000 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=6280990 cs=1 l=0).process missed message?  skipped from seq 0 to 638861135
2018-06-09 18:31:58.408206 7f53b833d700  1 mon.c@4(synchronizing) e14 sync_obtain_latest_monmap obtained monmap e14
2018-06-09 18:31:58.408212 7f53bc797700  0 -- 192.168.144.96:6789/0 >> 192.168.144.6:6789/0 conn(0x7f53cf48e800 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=7342571 cs=1 l=0).process missed message?  skipped from seq 0 to 1859752335
2018-06-09 18:31:59.480169 7f53b0c4b700  1 leveldb: Level-0 table #502: started
2018-06-09 18:31:59.611803 7f53b0c4b700  1 leveldb: Level-0 table #502: 33053398 bytes OK
2018-06-09 18:31:59.618460 7f53b0c4b700  1 leveldb: Delete type=0 #500

2018-06-09 18:31:59.866280 7f53b0c4b700  1 leveldb: Level-0 table #504: started
2018-06-09 18:31:59.991979 7f53b0c4b700  1 leveldb: Level-0 table #504: 33792335 bytes OK
2018-06-09 18:31:59.997304 7f53b0c4b700  1 leveldb: Delete type=0 #501

2018-06-09 18:32:04.346362 7f53b833d700  1 mon.c@4(synchronizing).mds e1 Unable to load 'last_metadata'
2018-06-09 18:32:04.400960 7f53b833d700  0 log_channel(cluster) log [INF] : mon.c calling monitor election
2018-06-09 18:32:04.401069 7f53b833d700  1 mon.c@4(electing).elector(3893) init, last seen epoch 3893, mid-election, bumping
2018-06-09 18:32:08.405774 7f53bc797700  0 -- 192.168.144.96:6789/0 >> 192.168.144.6:6789/0 conn(0x7f53cf48e800 :-1 s=STATE_OPEN pgs=7342571 cs=1 l=0).fault initiating reconnect
2018-06-09 18:32:08.407720 7f53bc797700  0 -- 192.168.144.96:6789/0 >> 192.168.144.6:6789/0 conn(0x7f53cf48e800 :-1 s=STATE_OPEN pgs=7342573 cs=3 l=0).fault initiating reconnect
2018-06-09 18:32:08.409374 7f53bc797700  0 -- 192.168.144.96:6789/0 >> 192.168.144.6:6789/0 conn(0x7f53cf48e800 :-1 s=STATE_OPEN pgs=7342574 cs=5 l=0).fault initiating reconnect
2018-06-09 18:32:08.411206 7f53bc797700  0 -- 192.168.144.96:6789/0 >> 192.168.144.6:6789/0 conn(0x7f53cf48e800 :-1 s=STATE_OPEN pgs=7342575 cs=7 l=0).fault initiating reconnect
2018-06-09 18:32:08.413154 7f53bc797700  0 -- 192.168.144.96:6789/0 >> 192.168.144.6:6789/0 conn(0x7f53cf48e800 :-1 s=STATE_OPEN pgs=7342576 cs=9 l=0).fault initiating reconnect

in logs of current active MON i see:

can't decode unknown message type 1544 MSG_AUTH=17

and some time this:

2018-06-08 23:42:02.910601 7fc6ddb8b700  0 mon.e@0(leader) e8 ms_verify_authorizer cephx enabled, but no authorizer (required for mon)
2018-06-08 23:42:02.910613 7fc6ddb8b700  0 -- 192.168.144.6:6789/0 >> 192.168.144.94:6789/0 pipe(0x7fc6ff23e000 sd=12 :6789 s=0 pgs=0 cs=0 l=0 c=0x7fc6fb237800).accept: got bad authorizer

I naven't reproduce this in test environment. I tested update Hammer->Jewel-Luminous many times in different setups and always all was fine.

Any recommendation, please.

History

#1 Updated by Aleksandr Rudenko almost 6 years ago

debug mon = 20 on active master MON:

2018-06-09 19:55:17.079952 7f5414fe7700 10 mon.e@0(electing) e14 ms_verify_authorizer 192.168.144.96:6789/0 mon protocol 2
2018-06-09 19:55:17.080041 7f5414fe7700  0 -- 192.168.144.6:6789/0 >> 192.168.144.96:6789/0 pipe(0x7f542e2d3400 sd=11 :6789 s=0 pgs=0 cs=0 l=0 c=0x7f542dd3b080).accept connect_seq 26358 vs existing 26357 stat
e standby
2018-06-09 19:55:17.080156 7f54178f3700 10 mon.e@0(electing) e14 ms_handle_reset 0x7f542dd3b080 192.168.144.96:6789/0
2018-06-09 19:55:17.080493 7f5414fe7700  0 can't decode unknown message type 1544 MSG_AUTH=17
2018-06-09 19:55:17.080521 7f5414fe7700  0 -- 192.168.144.6:6789/0 >> 192.168.144.96:6789/0 pipe(0x7f542e2d3400 sd=11 :6789 s=2 pgs=6780368 cs=26359 l=0 c=0x7f542ca94a00).fault with nothing to send, going to standby
2018-06-09 19:55:17.081677 7f541a8a5700 10 mon.e@0(electing) e14 ms_verify_authorizer 192.168.144.96:6789/0 mon protocol 0
2018-06-09 19:55:17.081707 7f541a8a5700  0 mon.e@0(electing) e14 ms_verify_authorizer cephx enabled, but no authorizer (required for mon)
2018-06-09 19:55:17.081712 7f541a8a5700  0 -- 192.168.144.6:6789/0 >> 192.168.144.96:6789/0 pipe(0x7f542e2d2000 sd=8 :6789 s=0 pgs=0 cs=0 l=0 c=0x7f542dd3b200).accept: got bad authorizer
2018-06-09 19:55:17.082063 7f541a8a5700 10 mon.e@0(electing) e14 ms_verify_authorizer 192.168.144.96:6789/0 mon protocol 0
2018-06-09 19:55:17.082083 7f541a8a5700  0 mon.e@0(electing) e14 ms_verify_authorizer cephx enabled, but no authorizer (required for mon)
2018-06-09 19:55:17.082087 7f541a8a5700  0 -- 192.168.144.6:6789/0 >> 192.168.144.96:6789/0 pipe(0x7f542e2d2000 sd=8 :6789 s=0 pgs=0 cs=0 l=0 c=0x7f542dd3b200).accept: got bad authorizer
2018-06-09 19:55:17.082598 7f54178f3700 10 mon.e@0(electing) e14 ms_handle_reset 0x7f542dd3b200 192.168.144.96:6789/0

#2 Updated by Josh Durgin almost 6 years ago

Can you restart all the monitors with luminous code? That should get you past this issue.

#3 Updated by Sage Weil almost 6 years ago

  • Status changed from New to Fix Under Review

#5 Updated by Greg Farnum about 5 years ago

  • Project changed from RADOS to Messengers

Also available in: Atom PDF