Bug #21986

ceph: the second mon cannot join the quorum

Added by linghucong linghucong over 6 years ago. Updated over 6 years ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:
0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
ceph-disk
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

The second mon can never join the quorum.

root@node-1151:/home/pkg/tmp# ceph -v
ceph version 13.0.0-2613-gce6ba63 (ce6ba63e143b194dc6f42f0f9620df8673161da7) mimic (dev)
root@node-1151:/home/pkg/tmp# ps -ef|grep ceph-mon
root 22837 1 62 19:20 ? 00:03:11 ceph-mon -i node-1151
root 22975 20949 0 19:25 pts/0 00:00:00 grep --color=auto ceph-mon

root@node-1152:/home/pkg/tmp# ceph -s
  cluster:
    id:     3edc30f3-2157-4251-b94c-2a81db839bc8
    health: HEALTH_WARN
            too many PGs per OSD (320 > max 300)
            1/3 mons down, quorum node-1150,node-1152

  services:
    mon: 3 daemons, quorum node-1150,node-1152, out of quorum: node-1151
    mgr: node-1150(active), standbys: node-1151, node-1152
    osd: 3 osds: 3 up, 3 in

  data:
    pools:   5 pools, 320 pgs
    objects: 16674 objects, 95901 MB
    usage:   287 GB used, 2509 GB / 2797 GB avail
    pgs:     320 active+clean

mon log:

2017-10-31 19:27:41.866 7fb72a771700 0 -- 10.11.1.151:6789/0 >> 10.11.1.150:6789/0 conn(0x559a7c1ab000 :6789 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 610232 vs existing csq=610231 existing_state=STATE_STANDBY
2017-10-31 19:27:41.866 7fb72a771700 0 can't decode unknown message type 1537 MSG_AUTH=17
2017-10-31 19:27:41.866 7fb72a771700 0 -- 10.11.1.151:6789/0 >> 10.11.1.150:6789/0 conn(0x559a7c1a9800 :6789 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 610234 vs existing csq=610233 existing_state=STATE_STANDBY
2017-10-31 19:27:41.866 7fb72a771700 0 can't decode unknown message type 1537 MSG_AUTH=17
2017-10-31 19:27:41.866 7fb72a771700 0 -- 10.11.1.151:6789/0 >> 10.11.1.150:6789/0 conn(0x559a7c1a2800 :6789 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 610236 vs existing csq=610235 existing_state=STATE_STANDBY
2017-10-31 19:27:41.866 7fb72a771700 0 can't decode unknown message type 1537 MSG_AUTH=17
2017-10-31 19:27:41.870 7fb72a771700 0 -- 10.11.1.151:6789/0 >> 10.11.1.150:6789/0 conn(0x559a7c19e000 :6789 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 610238 vs existing csq=610237 existing_state=STATE_STANDBY
2017-10-31 19:27:41.870 7fb72a771700 0 can't decode unknown message type 1537 MSG_AUTH=17
2017-10-31 19:27:41.870 7fb72a771700 0 -- 10.11.1.151:6789/0 >> 10.11.1.150:6789/0 conn(0x559a7c192800 :6789 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 610240 vs existing csq=610239 existing_state=STATE_STANDBY
2017-10-31 19:27:41.870 7fb72a771700 0 can't decode unknown message type 1537 MSG_AUTH=17
2017-10-31 19:27:41.870 7fb72a771700 0 -- 10.11.1.151:6789/0 >> 10.11.1.150:6789/0 conn(0x559a7c1ab000 :6789 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 610242 vs existing csq=610241 existing_state=STATE_STANDBY
2017-10-31 19:27:41.870 7fb72a771700 0 can't decode unknown message type 1537 MSG_AUTH=17
2017-10-31 19:27:41.870 7fb72a771700 0 -- 10.11.1.151:6789/0 >> 10.11.1.150:6789/0 conn(0x559a7c1a9800 :6789 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 610244 vs existing csq=610243 existing_state=STATE_STANDBY
2017-10-31 19:27:41.874 7fb72a771700 0 can't decode unknown message type 1537 MSG_AUTH=17
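The repeated "can't decode unknown message type 1537" lines mean this monitor is receiving a message type it does not recognize, which usually points to the peer monitors running a different (newer or older) build. A minimal sketch for comparing the monitor binaries actually installed on each node (host names taken from this ticket; ssh access is assumed):

# Sketch only: print the locally installed ceph-mon build on every mon host.
for host in node-1150 node-1151 node-1152; do
    echo "== $host =="
    ssh "$host" ceph-mon --version
done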


Related issues: 1 (0 open, 1 closed)

Related to Ceph - Bug #21770: ceph mon core dump when use ceph osd perf cmd. (Resolved, Joao Eduardo Luis, 10/12/2017)

Actions #1

Updated by Joao Eduardo Luis over 6 years ago

Are all the monitors running on master?
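For reference, one way to answer this from the cluster side is to ask the running daemons themselves rather than the installed packages; a hedged sketch (the mon ID below is taken from this ticket, and `ceph versions` only reports daemons the cluster can currently reach):

ceph versions                      # summary of versions reported by the running daemons (Luminous and later)
ceph daemon mon.node-1151 version  # run on node-1151: version of the running mon via its admin socket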

Actions #2

Updated by linghucong linghucong over 6 years ago

Yes, they are all the same.

root@node-1152:~# dpkg-query -l|grep ceph-mon
ii ceph-mon 12.1.2-1 amd64 monitor server for the ceph storage system

root@node-1151:/home/pkg/tmp# dpkg-query -l|grep ceph-mon
ii ceph-mon 12.1.2-1 amd64 monitor server for the ceph storage system

root@node-1150:~# dpkg-query -l|grep ceph-mon
ii ceph-mon 12.1.2-1 amd64 monitor server for the ceph storage system

Actions #3

Updated by linghucong linghucong over 6 years ago

Some logs:

2017-11-01 10:58:04.483 7f6a4e783180 0 starting mon.node-1151 rank 1 at 10.11.1.151:6789/0 mon_data /var/lib/ceph/mon/ceph-node-1151 fsid 3edc30f3-2157-4251-b94c-2a81db839bc8
2017-11-01 10:58:04.483 7f6a4e783180 1 mon.node-1151@-1(probing) e13 preinit fsid 3edc30f3-2157-4251-b94c-2a81db839bc8
2017-11-01 10:58:04.483 7f6a4e783180 1 mon.node-1151@-1(probing).mds e0 Unable to load 'last_metadata'
2017-11-01 10:58:04.483 7f6a4e783180 0 mon.node-1151@-1(probing).mds e1 print_map
e1
enable_multiple, ever_enabled_multiple: 0,0
compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=file layout v2}
legacy client fscid: -1

No filesystems configured

2017-11-01 10:58:04.487 7f6a4e783180 0 mon.node-1151@-1(probing).osd e149 crush map has features 288514051259236352, adjusting msgr requires
2017-11-01 10:58:04.487 7f6a4e783180 0 mon.node-1151@-1(probing).osd e149 crush map has features 288514051259236352, adjusting msgr requires
2017-11-01 10:58:04.487 7f6a4e783180 0 mon.node-1151@-1(probing).osd e149 crush map has features 1009089991638532096, adjusting msgr requires
2017-11-01 10:58:04.487 7f6a4e783180 0 mon.node-1151@-1(probing).osd e149 crush map has features 288514051259236352, adjusting msgr requires
2017-11-01 10:58:04.487 7f6a4e783180 1 mon.node-1151@-1(probing).paxosservice(auth 501..730) refresh upgraded, format 0 > 2
2017-11-01 10:58:04.491 7f6a41dd7700 0 -- 10.11.1.151:0/28543 >> 10.11.1.150:6802/103206 conn(0x558cd57b2000 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0 l=0).handle_connect_reply connect got BADAUTHORIZER
2017-11-01 10:58:04.491 7f6a4e783180 0 mon.node-1151@-1(probing) e13 my rank is now 1 (was 1)
2017-11-01 10:58:04.491 7f6a41dd7700 0 -- 10.11.1.151:0/28543 >> 10.11.1.150:6802/103206 conn(0x558cd57b2000 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0 l=0).handle_connect_reply connect got BADAUTHORIZER
2017-11-01 10:58:04.551 7f6a415d6700 0 -- 10.11.1.151:6789/0 >> 10.11.1.152:6789/0 conn(0x558cd57b6800 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=69366333 cs=1 l=0).process missed message? skipped from seq 0 to 319737780
2017-11-01 10:58:04.551 7f6a44ddd700 1 mon.node-1151@1(synchronizing) e13 sync_obtain_latest_monmap
2017-11-01 10:58:04.551 7f6a44ddd700 1 mon.node-1151@1(synchronizing) e13 sync_obtain_latest_monmap obtained monmap e13
2017-11-01 10:58:04.607 7f6a40dd5700 0 -- 10.11.1.151:6789/0 >> 10.11.1.150:6789/0 conn(0x558cd57b5000 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=88757478 cs=1 l=0).process missed message? skipped from seq 0 to 954877718

Actions #4

Updated by linghucong linghucong over 6 years ago

leader mon log

2017-11-01 06:25:02.829796 7f33fa53e700 10 mon.node-1150@0(leader) e13 ms_get_authorizer for mon
2017-11-01 06:25:02.829896 7f33fa53e700 10 mon.node-1150@0(leader) e13 ms_verify_authorizer 10.11.1.151:6789/0 mon protocol 2
2017-11-01 06:25:02.829957 7f33fa53e700 0 -- 10.11.1.150:6789/0 >> 10.11.1.151:6789/0 conn(0x562885f6f000 :6789 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 62359140 vs existing csq=62359140 existing_state=STATE_CONNECTING_WAIT_CONNECT_REPLY
2017-11-01 06:25:02.830092 7f33fe546700 10 mon.node-1150@0(leader) e13 ms_handle_reset 0x562885f6f000 10.11.1.151:6789/0
2017-11-01 06:25:02.830202 7f33fa53e700 10 mon.node-1150@0(leader) e13 ms_get_authorizer for mon
2017-11-01 06:25:02.830728 7f33fa53e700 0 -- 10.11.1.150:6789/0 >> 10.11.1.151:6789/0 conn(0x562874d35000 :-1 s=STATE_OPEN pgs=82831859 cs=62359141 l=0).fault initiating reconnect
2017-11-01 06:25:02.830972 7f33fa53e700 10 mon.node-1150@0(leader) e13 ms_get_authorizer for mon
2017-11-01 06:25:02.831055 7f33fa53e700 10 mon.node-1150@0(leader) e13 ms_verify_authorizer 10.11.1.151:6789/0 mon protocol 2
2017-11-01 06:25:02.831100 7f33fa53e700 0 -- 10.11.1.150:6789/0 >> 10.11.1.151:6789/0 conn(0x562885f6f000 :6789 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 62359142 vs existing csq=62359142 existing_state=STATE_CONNECTING_WAIT_CONNECT_REPLY
2017-11-01 06:25:02.831271 7f33fe546700 10 mon.node-1150@0(leader) e13 ms_handle_reset 0x562885f6f000 10.11.1.151:6789/0
2017-11-01 06:25:02.831288 7f33fa53e700 10 mon.node-1150@0(leader) e13 ms_get_authorizer for mon
2017-11-01 06:25:02.831744 7f33fa53e700 0 -- 10.11.1.150:6789/0 >> 10.11.1.151:6789/0 conn(0x562874d35000 :-1 s=STATE_OPEN pgs=82831862 cs=62359143 l=0).fault initiating reconnect
2017-11-01 06:25:02.831981 7f33fa53e700 10 mon.node-1150@0(leader) e13 ms_get_authorizer for mon
2017-11-01 06:25:02.832157 7f33fa53e700 10 mon.node-1150@0(leader) e13 ms_verify_authorizer 10.11.1.151:6789/0 mon protocol 2
2017-11-01 06:25:02.832198 7f33fa53e700 0 -- 10.11.1.150:6789/0 >> 10.11.1.151:6789/0 conn(0x5628a6c88800 :6789 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 62359144 vs existing csq=62359144 existing_state=STATE_CONNECTING_WAIT_CONNECT_REPLY
2017-11-01 06:25:02.832270 7f33fe546700 10 mon.node-1150@0(leader) e13 ms_handle_reset 0x5628a6c88800 10.11.1.151:6789/0
2017-11-01 06:25:02.832288 7f33fa53e700 10 mon.node-1150@0(leader) e13 ms_get_authorizer for mon
2017-11-01 06:25:02.832729 7f33fa53e700 0 -- 10.11.1.150:6789/0 >> 10.11.1.151:6789/0 conn(0x562874d35000 :-1 s=STATE_OPEN pgs=82831865 cs=62359145 l=0).fault initiating reconnect
2017-11-01 06:25:02.832955 7f33fa53e700 10 mon.node-1150@0(leader) e13 ms_verify_authorizer 10.11.1.151:6789/0 mon protocol 2
2017-11-01 06:25:02.832998 7f33fa53e700 0 -- 10.11.1.150:6789/0 >> 10.11.1.151:6789/0 conn(0x5628a6c87000 :6789 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 62359146 vs existing csq=62359146 existing_state=STATE_CONNECTING_WAIT_BANNER_AND_IDENTIFY
2017-11-01 06:25:02.833031 7f33fa53e700 10 mon.node-1150@0(leader) e13 ms_get_authorizer for mon
2017-11-01 06:25:02.833136 7f33fe546700 10 mon.node-1150@0(leader) e13 ms_handle_reset 0x5628a6c87000 10.11.1.151:6789/0
2017-11-01 06:25:02.833241 7f33fa53e700 10 mon.node-1150@0(leader) e13 ms_get_authorizer for mon
2017-11-01 06:25:02.833911 7f33fa53e700 0 -- 10.11.1.150:6789/0 >> 10.11.1.151:6789/0 conn(0x562874d35000 :-1 s=STATE_OPEN pgs=82831868 cs=62359147 l=0).fault initiating reconnect
2017-11-01 06:25:02.834106 7f33fa53e700 10 mon.node-1150@0(leader) e13 ms_get_authorizer for mon
2017-11-01 06:25:02.834247 7f33fa53e700 10 mon.node-1150@0(leader) e13 ms_verify_authorizer 10.11.1.151:6789/0 mon protocol 2
2017-11-01 06:25:02.834307 7f33fa53e700 0 -- 10.11.1.150:6789/0 >> 10.11.1.151:6789/0 conn(0x5628a6c85800 :6789 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 62359148 vs existing csq=62359148 existing_state=STATE_CONNECTING_WAIT_CONNECT_REPLY
2017-11-01 06:25:02.834402 7f33fe546700 10 mon.node-1150@0(leader) e13 ms_handle_reset 0x5628a6c85800 10.11.1.151:6789/0
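The level-10 lines above come from raised monitor debug logging; a rough sketch of how such output can be enabled and later reverted on the leader (daemon name taken from this ticket, exact log levels are illustrative):

# Sketch: bump mon debug logging via the admin socket, then restore the default.
ceph daemon mon.node-1150 config set debug_mon 10/10
# ...reproduce and collect /var/log/ceph/ceph-mon.node-1150.log...
ceph daemon mon.node-1150 config set debug_mon 1/5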

Actions #5

Updated by linghucong linghucong over 6 years ago

After I rebooted all the mons, everything is OK. I had updated the mons from 12.1.1.

root@node-1151:/var/log/ceph# ceph -s
  cluster:
    id:     3edc30f3-2157-4251-b94c-2a81db839bc8
    health: HEALTH_WARN
            too many PGs per OSD (320 > max 300)

  services:
    mon: 3 daemons, quorum node-1150,node-1151,node-1152
    mgr: node-1150(active), standbys: node-1151, node-1152
    osd: 3 osds: 3 up, 3 in

  data:
    pools:   5 pools, 320 pgs
    objects: 16.3K objects, 93.7G
    usage:   288G used, 2.45T / 2.73T avail
    pgs:     320 active+clean

  io:
    client: 5116 B/s rd, 1023 B/s wr, 5 op/s rd, 0 op/s wr
Actions #6

Updated by Shinobu Kinjo over 6 years ago

  • Status changed from New to Closed
Actions #7

Updated by Kefu Chai over 6 years ago

@linghucong

A couple of questions:

1. Are all the monitors running master? In the "Description", the mon on node-1151 is running master, but we are not sure whether the others are. In #21986-2, however, you claimed that they are all luminous 12.1.2.

2. What are the steps to reproduce this issue?

Actions #8

Updated by Kefu Chai over 6 years ago

  • Related to Bug #21770: ceph mon core dump when use ceph osd perf cmd. added
Actions #9

Updated by linghucong linghucong over 6 years ago

1. Yes, they are all running the master code. I updated the cluster to master.

2. I cannot reproduce it again after rebooting all the mons.

Actions #10

Updated by Paul Emmerich over 6 years ago

#21770 disappears if you restart all mons and mgrs, so what you are seeing here is very likely the osd perf bug.
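For completeness, restarting every mon and mgr on a systemd-managed deployment might look like the sketch below (assuming the packaged ceph-mon@/ceph-mgr@ units, with daemon IDs matching the short host names, as they appear to in this ticket):

# Sketch: run on each of node-1150, node-1151 and node-1152.
systemctl restart ceph-mon@$(hostname -s)
systemctl restart ceph-mgr@$(hostname -s)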
