Feature #21198
open
Monitors don't handle incomplete network splits
Description
When the network between two monitors (the one with the minimum rank and the one with the maximum rank) is disconnected, the monitor with the maximum rank keeps calling elections and the monitors can't work. For example, the command "ceph -s" can't complete.
The network topology is as below (monitor rank, with the node number in parentheses):
       1(175)
      /      \
(176)2        0(174)
The log from node 176 is below; the other logs are in the attachment.
2017-08-22 18:07:27.800813 7fcdae7b4700 5 mon.node176@2(electing).elector(11791) election timer expired
2017-08-22 18:07:27.800841 7fcdae7b4700 5 mon.node176@2(electing).elector(11791) start -- can i be leader?
2017-08-22 18:07:27.800922 7fcdae7b4700 1 mon.node176@2(electing).elector(11791) init, last seen epoch 11791
2017-08-22 18:07:27.805078 7fcdadfb3700 5 mon.node176@2(electing).elector(11791) handle_propose from mon.1
2017-08-22 18:07:27.805081 7fcdadfb3700 10 mon.node176@2(electing).elector(11791) handle_propose required features 9025616074506240, peer features 576460752032874495
2017-08-22 18:07:27.805083 7fcdadfb3700 10 mon.node176@2(electing).elector(11791) bump_epoch 11791 to 11793
2017-08-22 18:07:27.806013 7fcdadfb3700 10 mon.node176@2(electing) e5 join_election
2017-08-22 18:07:27.806028 7fcdadfb3700 10 mon.node176@2(electing) e5 _reset
2017-08-22 18:07:27.806034 7fcdadfb3700 10 mon.node176@2(electing) e5 cancel_probe_timeout (none scheduled)
2017-08-22 18:07:27.806038 7fcdadfb3700 10 mon.node176@2(electing) e5 timecheck_finish
2017-08-22 18:07:27.806041 7fcdadfb3700 10 mon.node176@2(electing) e5 scrub_event_cancel
2017-08-22 18:07:27.806044 7fcdadfb3700 10 mon.node176@2(electing) e5 scrub_reset
2017-08-22 18:07:27.806053 7fcdadfb3700 5 mon.node176@2(electing).elector(11793) defer to 1
2017-08-22 18:07:33.806178 7fcdae7b4700 5 mon.node176@2(electing).elector(11793) election timer expired
2017-08-22 18:07:33.806202 7fcdae7b4700 5 mon.node176@2(electing).elector(11793) start -- can i be leader?
2017-08-22 18:07:33.806247 7fcdae7b4700 1 mon.node176@2(electing).elector(11793) init, last seen epoch 11793
2017-08-22 18:07:33.810660 7fcdadfb3700 5 mon.node176@2(electing).elector(11793) handle_propose from mon.1
2017-08-22 18:07:33.810662 7fcdadfb3700 10 mon.node176@2(electing).elector(11793) handle_propose required features 9025616074506240, peer features 576460752032874495
2017-08-22 18:07:33.810665 7fcdadfb3700 10 mon.node176@2(electing).elector(11793) bump_epoch 11793 to 11795
2017-08-22 18:07:33.811652 7fcdadfb3700 10 mon.node176@2(electing) e5 join_election
2017-08-22 18:07:33.811666 7fcdadfb3700 10 mon.node176@2(electing) e5 _reset
2017-08-22 18:07:33.811668 7fcdadfb3700 10 mon.node176@2(electing) e5 cancel_probe_timeout (none scheduled)
2017-08-22 18:07:33.811670 7fcdadfb3700 10 mon.node176@2(electing) e5 timecheck_finish
2017-08-22 18:07:33.811673 7fcdadfb3700 10 mon.node176@2(electing) e5 scrub_event_cancel
2017-08-22 18:07:33.811674 7fcdadfb3700 10 mon.node176@2(electing) e5 scrub_reset
2017-08-22 18:07:33.811681 7fcdadfb3700 5 mon.node176@2(electing).elector(11795) defer to 1
2017-08-22 18:07:39.811807 7fcdae7b4700 5 mon.node176@2(electing).elector(11795) election timer expired
2017-08-22 18:07:39.811841 7fcdae7b4700 5 mon.node176@2(electing).elector(11795) start -- can i be leader?
2017-08-22 18:07:39.811925 7fcdae7b4700 1 mon.node176@2(electing).elector(11795) init, last seen epoch 11795
2017-08-22 18:07:39.816982 7fcdadfb3700 5 mon.node176@2(electing).elector(11795) handle_propose from mon.1
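The repeating pattern in this log can be summarized mechanically. The following is a minimal sketch, not a Ceph tool: the helper name `summarize_elections` and the embedded `SAMPLE_LOG` (a few lines copied from the node176 log above) are assumptions made for illustration. It extracts the gap between successive "election timer expired" lines and the `bump_epoch` transitions.

```python
import re
from datetime import datetime

# A few lines copied verbatim from the node176 log above.
SAMPLE_LOG = """\
2017-08-22 18:07:27.800813 7fcdae7b4700 5 mon.node176@2(electing).elector(11791) election timer expired
2017-08-22 18:07:27.805083 7fcdadfb3700 10 mon.node176@2(electing).elector(11791) bump_epoch 11791 to 11793
2017-08-22 18:07:33.806178 7fcdae7b4700 5 mon.node176@2(electing).elector(11793) election timer expired
2017-08-22 18:07:33.810665 7fcdadfb3700 10 mon.node176@2(electing).elector(11793) bump_epoch 11793 to 11795
2017-08-22 18:07:39.811807 7fcdae7b4700 5 mon.node176@2(electing).elector(11795) election timer expired
"""

TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{6}) ")
BUMP = re.compile(r"bump_epoch (\d+) to (\d+)")


def summarize_elections(log_text):
    """Return (seconds between timer expiries, list of (old, new) epoch bumps)."""
    expiries = []
    bumps = []
    for line in log_text.splitlines():
        ts = TIMESTAMP.match(line)
        if ts is None:
            continue  # skip lines that do not start with a timestamp
        if line.endswith("election timer expired"):
            expiries.append(datetime.strptime(ts.group(1), "%Y-%m-%d %H:%M:%S.%f"))
        bump = BUMP.search(line)
        if bump:
            bumps.append((int(bump.group(1)), int(bump.group(2))))
    # Gap between each expiry and the next one, in seconds.
    gaps = [(later - earlier).total_seconds()
            for earlier, later in zip(expiries, expiries[1:])]
    return gaps, bumps


if __name__ == "__main__":
    gaps, bumps = summarize_elections(SAMPLE_LOG)
    print("seconds between timer expiries:", gaps)
    print("epoch bumps:", bumps)
```

On the excerpt above, the timer expires about every 6 seconds and each round bumps the epoch by 2 (11791 to 11793, then 11793 to 11795), so the election on node 176 never settles.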
Files
Updated by zhiang li over 6 years ago
Updated by Greg Farnum over 6 years ago
- Tracker changed from Bug to Feature
- Subject changed from the network between monitors(the minimum rank and the maximum rank) disconnect, the node of the maximum rank always keep electing and the mon can't work to Monitors don't handle incomplete network splits
- Category set to Administration/Usability
Yep. If you have a network partition that can be crossed by some monitors but not others, we're screwed.
Resolving it would be a big project to try and detect these situations and work around them by declaring some monitors dead.
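To make the "declare some monitors dead" idea concrete, here is a rough sketch of one possible approach. It is not how Ceph works and not a proposed patch. The function names (`link`, `fully_connected`, `pick_quorum`) and the input format (a set of working links between monitor ranks) are hypothetical, and the sketch assumes some component already knows which links are up. It looks for the largest set of monitors that can all reach each other directly and that still forms a majority, preferring lower ranks.

```python
from itertools import combinations


def link(a, b):
    """Hypothetical representation of a working, bidirectional link."""
    return frozenset((a, b))


def fully_connected(members, links):
    """True if every pair of monitors in `members` has a working link."""
    return all(link(a, b) in links for a, b in combinations(members, 2))


def pick_quorum(ranks, links):
    """Return the largest fully connected majority of monitors, or None.

    Candidates are tried from largest to smallest; within one size,
    combinations() over the sorted ranks yields lower ranks first.
    Brute force is acceptable here because monitor counts are small.
    """
    ranks = sorted(ranks)
    majority = len(ranks) // 2 + 1
    for size in range(len(ranks), majority - 1, -1):
        for candidate in combinations(ranks, size):
            if fully_connected(candidate, links):
                return candidate
    return None


if __name__ == "__main__":
    # Topology from this ticket: 1 reaches both 0 and 2, but 0 and 2 are cut off.
    ranks = [0, 1, 2]
    links = {link(0, 1), link(1, 2)}
    quorum = pick_quorum(ranks, links)
    dead = sorted(set(ranks) - set(quorum)) if quorum else ranks
    print("quorum:", quorum, "declared dead:", dead)
```

For the topology in this ticket the sketch keeps monitors 0 and 1 and drops monitor 2. The hard part that this leaves out is exactly what makes it a big project: gathering a consistent view of link state across monitors that cannot all talk to each other.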