Bug #18166
monitor cannot start because of "FAILED assert(info.state == MDSMap::STATE_STANDBY)"
Status: Closed
Description
ceph version: v10.2.3
operating system: Ubuntu 14.04
Linux kernel version: 3.13.0
Description:
I am testing CephFS and have two MDS daemons. When I start them with 'start ceph-mds-all', the MDS cluster is OK: one daemon is active
and the other is standby. But when I tested the hot-standby option on the command line, a problem arose.
I have three nodes:
ceph1: monitor + mds + osd
ceph2: monitor + mds + osd
ceph3: monitor + osd
Procedure:
1. Start the two MDS daemons.
2. Stop the MDS that is in standby state.
3. Run the command: "ceph-mds --cluster=ceph -i ceph2 --setuser ceph --setgroup ceph --hot-standby 0"
At this step the MDS cannot start, so I do the next step:
4. Restart the monitor service.
At this point I found that one monitor (ceph1) was down. When I restarted the down monitor, it came back up, but then the other monitor (ceph2) went down.
monitor log:
-132> 2016-12-07 08:59:38.803492 7f9a7a2f3480 10 mon.ceph3@-1(probing).paxosservice(mdsmap 1..438) refresh
-131> 2016-12-07 08:59:38.803506 7f9a7a2f3480 10 mon.ceph3@-1(probing).mds e0 update_from_paxos version 438, my e 0
-130> 2016-12-07 08:59:38.803642 7f9a7a2f3480 10 mon.ceph3@-1(probing).mds e0 update_from_paxos got 438
-129> 2016-12-07 08:59:38.803686 7f9a7a2f3480 4 mon.ceph3@-1(probing).mds e438 new map
-128> 2016-12-07 08:59:38.803723 7f9a7a2f3480 0 mon.ceph3@-1(probing).mds e438 print_map
e438
enable_multiple, ever_enabled_multiple: 0,0
compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=file layout v2}
Filesystem 'first_cephfs' (1)
fs_name first_cephfs
epoch 436
flags 8
created 2016-11-16 06:06:34.376616
modified 2016-11-23 11:37:26.768424
tableserver 0
root 0
session_timeout 60
session_autoclose 300
max_file_size 1099511627776
last_failure 0
last_failure_osd_epoch 68
compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=file layout v2}
max_mds 1
in 0
up {0=58523}
failed
damaged
stopped
data_pools 1
metadata_pool 2
inline_data disabled
58523: 10.10.38.40:6806/13750 'ceph1' mds.0.433 up:active seq 70
Standby daemons:
39073: 10.10.38.41:6805/12990 'ceph2' mds.-1.0 up:standby seq 1 (standby for rank 0)
-8> 2016-12-07 08:59:38.825925 7f9a7a2f3480 10 mon.ceph3@0(leader).paxosservice(mdsmap 1..438) election_finished
-7> 2016-12-07 08:59:38.825926 7f9a7a2f3480 10 mon.ceph3@0(leader).paxosservice(mdsmap 1..438) _active
-6> 2016-12-07 08:59:38.825928 7f9a7a2f3480 7 mon.ceph3@0(leader).paxosservice(mdsmap 1..438) _active creating new pending
-5> 2016-12-07 08:59:38.825939 7f9a7a2f3480 10 mon.ceph3@0(leader).mds e438 create_pending e439
-4> 2016-12-07 08:59:38.825943 7f9a7a2f3480 10 mon.ceph3@0(leader).mds e438 e438: 1/1/1 up {0=ceph1=up:active}, 1 up:standby
-3> 2016-12-07 08:59:38.825960 7f9a7a2f3480 20 mon.ceph3@0(leader).mds e438 gid 39073 is standby and following nobody
-2> 2016-12-07 08:59:38.825966 7f9a7a2f3480 10 mon.ceph3@0(leader).mds e438 setting to follow mds rank 0
-1> 2016-12-07 08:59:38.826000 7f9a6ffff700 5 asok(0x7f9a839eb480) AdminSocket: request 'get_command_descriptions' '' to 0x7f9a83a44270 returned 2165 bytes
0> 2016-12-07 08:59:38.828473 7f9a7a2f3480 -1 mon/MDSMonitor.cc: In function 'bool MDSMonitor::maybe_promote_standby(std::shared_ptr<Filesystem>)' thread 7f9a7a2f3480 time 2016-12-07 08:59:38.825971
mon/MDSMonitor.cc: 2797: FAILED assert(info.state == MDSMap::STATE_STANDBY)
ceph version 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x7f9a79e329fb]
2: (MDSMonitor::maybe_promote_standby(std::shared_ptr<Filesystem>)+0x976) [0x7f9a79b73016]
3: (MDSMonitor::tick()+0x397) [0x7f9a79b76fa7]
4: (MDSMonitor::on_active()+0x15) [0x7f9a79b6d0a5]
5: (PaxosService::_active()+0x1df) [0x7f9a79af92ef]
6: (PaxosService::election_finished()+0x67) [0x7f9a79af9a47]
7: (Monitor::win_election(unsigned int, std::set<int, std::less<int>, std::allocator<int> >&, unsigned long, MonCommand const*, int, std::set<int, std::less<int>, std::allocator<int> > const*)+0x236) [0x7f9a79abb336]
8: (Monitor::win_standalone_election()+0x158) [0x7f9a79abb738]
9: (Monitor::bootstrap()+0xa03) [0x7f9a79abc283]
10: (Monitor::init()+0x190) [0x7f9a79abc590]
11: (main()+0x24ca) [0x7f9a79a32a4a]
12: (__libc_start_main()+0xf5) [0x7f9a76ff6ec5]
13: (()+0x2609ca) [0x7f9a79a849ca]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Updated by Greg Farnum over 7 years ago
- Project changed from Ceph to CephFS
- Component(FS) MDSMonitor added
So this cluster is freshly-created with version 10.2.3?
Can you upload the monitor log with ceph-post-file? (Preferably one with "debug mon = 20" set.)
Maybe the assert is just bad because the MDS was in standby but got marked as failed, but I'd like to see the log to make sure.
Updated by guotao Yao over 7 years ago
- File ceph-mon.ceph3.log added
The attachment is the log of the crashed monitor.
Thanks!
Updated by John Spray over 7 years ago
- Status changed from New to 12
It looks like MDSMonitor::maybe_promote_standby is iterating over pending_fsmap.standby_daemons, but inside the loop it calls try_standby_replay, which modifies standby_daemons (via assign_standby_replay).
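For illustration only, here is a minimal C++ sketch of the pattern being described; the names (MDSInfo, assign_standby_replay, the state constants) are hypothetical stand-ins, not the actual MDSMonitor code. Erasing from a std::map inside a loop that iterates that same map invalidates the loop's iterators, which is how a stale entry whose state is no longer STANDBY could end up hitting an assert like the one above. One safe pattern, assuming the loop only needs the keys/values as they were at the start, is to iterate over a snapshot of the map instead.

// Hypothetical sketch (not the actual Ceph code) of mutating a map of
// standby daemons while iterating over it, and one way to avoid it.
#include <cstdint>
#include <iostream>
#include <map>
#include <string>

constexpr int STATE_STANDBY = 0;
constexpr int STATE_STANDBY_REPLAY = 1;

struct MDSInfo {
    std::string name;
    int state;  // STATE_STANDBY or STATE_STANDBY_REPLAY
};

std::map<uint64_t, MDSInfo> standby_daemons;

// Stand-in for assign_standby_replay(): marks the daemon as a replay
// follower and removes it from the standby map, which invalidates any
// iterator currently pointing at that entry.
void assign_standby_replay(uint64_t gid) {
    standby_daemons[gid].state = STATE_STANDBY_REPLAY;
    standby_daemons.erase(gid);  // mutation of the container
}

int main() {
    standby_daemons[39073] = {"ceph2", STATE_STANDBY};
    standby_daemons[40000] = {"ceph4", STATE_STANDBY};

    // Unsafe: the range-for holds iterators into standby_daemons while
    // assign_standby_replay() erases from it -> undefined behaviour, and
    // an assert(info.state == STATE_STANDBY) can trip on a stale entry.
    //
    // for (const auto& [gid, info] : standby_daemons) {
    //     assign_standby_replay(gid);
    // }

    // Safer: iterate over a snapshot taken before any mutation.
    const std::map<uint64_t, MDSInfo> snapshot = standby_daemons;
    for (const auto& [gid, info] : snapshot) {
        if (info.state == STATE_STANDBY)
            assign_standby_replay(gid);
    }
    std::cout << "remaining standbys: " << standby_daemons.size() << "\n";
    return 0;
}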
Updated by John Spray over 7 years ago
- Status changed from 12 to Fix Under Review
- Assignee set to John Spray
- Backport set to jewel
Updated by John Spray over 7 years ago
- Status changed from Fix Under Review to Pending Backport
- Backport changed from jewel to jewel kraken
Updated by Nathan Cutler over 7 years ago
- Copied to Backport #18282: jewel: monitor cannot start because of "FAILED assert(info.state == MDSMap::STATE_STANDBY)" added
Updated by Nathan Cutler over 7 years ago
- Copied to Backport #18283: kraken: monitor cannot start because of "FAILED assert(info.state == MDSMap::STATE_STANDBY)" added
Updated by Nathan Cutler over 7 years ago
- Backport changed from jewel kraken to jewel, kraken
Updated by Nathan Cutler about 7 years ago
kraken backport is unnecessary (fix already in v11.2.0)
Updated by Nathan Cutler about 7 years ago
- Status changed from Pending Backport to Resolved