Support #40103: ceph monitor cannot start - RADOS - Ceph


Support #40103

open

ceph monitor cannot start

Added by JIANYU LI almost 5 years ago. Updated almost 5 years ago.

Status: New
Priority: Normal
Assignee:
Category:
Target version:
% Done:
Tags:
Reviewed:
Affected Versions: Ceph - v12.2.12
Component(RADOS):
Pull request ID:

Description

I have a Ceph cluster that has been running for over 2 years, and the monitors began crashing yesterday. Some OSDs had been flapping up and down occasionally, and sometimes I needed to rebuild an OSD. I found 3 OSDs down yesterday; they may or may not have caused this issue.

Ceph version: 12.2.12 (upgrading from 12.2.8 did not fix the issue).
I have 5 mon nodes. When I start the mon service on the first 2 nodes, they run fine. Once I start the service on the third node, all 3 nodes start flapping up and down because of an abort in OSDMonitor::build_incremental. I also tried to recover the monitor from a single node (removing the other 4 nodes) by injecting a monmap, but that node keeps crashing as well.
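One possible reading of this pattern, offered as an assumption rather than a diagnosis: Ceph monitors need a strict majority of the monmap to form quorum, so with 5 monitors the third one started is the first that allows quorum to form. Only a monitor in quorum serves subscriptions, which is the code path (handle_subscribe, then build_incremental) that appears in the backtrace. The helper names below are hypothetical and only illustrate the majority arithmetic.

```python
def quorum_size(num_mons: int) -> int:
    """Smallest strict majority of the monitors listed in the monmap."""
    return num_mons // 2 + 1


def has_quorum(num_mons: int, num_up: int) -> bool:
    """True once enough monitors are up to form a majority."""
    return num_up >= quorum_size(num_mons)


if __name__ == "__main__":
    # 5 monitors in the monmap, started one at a time.
    for up in range(1, 6):
        print(f"{up}/5 mons up -> quorum possible: {has_quorum(5, up)}")
    # After shrinking the monmap to a single monitor, that monitor is a
    # quorum by itself, which would match the single-node crash as well.
    print(f"1/1 mons up -> quorum possible: {has_quorum(1, 1)}")
```

Under this assumption, the first 2 monitors look healthy only because they have not formed quorum yet and are not answering subscription requests.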

Crash log from the mon:
May 31 02:26:09 ctlr101 systemd[1]: Started Ceph cluster monitor daemon.
May 31 02:26:09 ctlr101 ceph-mon[2632098]: 2019-05-31 02:26:09.345533 7fe250321080 -1 compacting monitor store ...
May 31 02:26:11 ctlr101 ceph-mon[2632098]: 2019-05-31 02:26:11.320926 7fe250321080 -1 done compacting
May 31 02:26:16 ctlr101 ceph-mon[2632098]: 2019-05-31 02:26:16.497933 7fe242925700 -1 log_channel(cluster) log [ERR] : overall HEALTH_ERR 13 osds down; 1 host (6 osds) down; 74266/2566020 objects misplace
May 31 02:26:16 ctlr101 ceph-mon[2632098]: *** Caught signal (Aborted) **
May 31 02:26:16 ctlr101 ceph-mon[2632098]: in thread 7fe24692d700 thread_name:ms_dispatch
May 31 02:26:16 ctlr101 ceph-mon[2632098]: ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous (stable)
May 31 02:26:16 ctlr101 ceph-mon[2632098]: 1: (()+0x9e6334) [0x558c5f2fb334]
May 31 02:26:16 ctlr101 ceph-mon[2632098]: 2: (()+0x11390) [0x7fe24f6ce390]
May 31 02:26:16 ctlr101 ceph-mon[2632098]: 3: (gsignal()+0x38) [0x7fe24dc14428]
May 31 02:26:16 ctlr101 ceph-mon[2632098]: 4: (abort()+0x16a) [0x7fe24dc1602a]
May 31 02:26:16 ctlr101 ceph-mon[2632098]: 5: (OSDMonitor::build_incremental(unsigned int, unsigned int, unsigned long)+0x9c5) [0x558c5ee80455]
May 31 02:26:16 ctlr101 ceph-mon[2632098]: 6: (OSDMonitor::send_incremental(unsigned int, MonSession*, bool, boost::intrusive_ptr<MonOpRequest>)+0xcf) [0x558c5ee80b3f]
May 31 02:26:16 ctlr101 ceph-mon[2632098]: 7: (OSDMonitor::check_osdmap_sub(Subscription*)+0x22d) [0x558c5ee8622d]
May 31 02:26:16 ctlr101 ceph-mon[2632098]: 8: (Monitor::handle_subscribe(boost::intrusive_ptr<MonOpRequest>)+0x1082) [0x558c5ecdb0b2]
May 31 02:26:16 ctlr101 ceph-mon[2632098]: 9: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x9f4) [0x558c5ed05114]
May 31 02:26:16 ctlr101 ceph-mon[2632098]: 10: (Monitor::_ms_dispatch(Message*)+0x6db) [0x558c5ed061ab]
May 31 02:26:16 ctlr101 ceph-mon[2632098]: 11: (Monitor::ms_dispatch(Message*)+0x23) [0x558c5ed372c3]
May 31 02:26:16 ctlr101 ceph-mon[2632098]: 12: (DispatchQueue::entry()+0xf4a) [0x558c5f2a205a]
May 31 02:26:16 ctlr101 ceph-mon[2632098]: 13: (DispatchQueue::DispatchThread::entry()+0xd) [0x558c5f035dcd]
May 31 02:26:16 ctlr101 ceph-mon[2632098]: 14: (()+0x76ba) [0x7fe24f6c46ba]
May 31 02:26:16 ctlr101 ceph-mon[2632098]: 15: (clone()+0x6d) [0x7fe24dce641d]
May 31 02:26:16 ctlr101 ceph-mon[2632098]: 2019-05-31 02:26:16.578932 7fe24692d700 -1 *** Caught signal (Aborted) **
May 31 02:26:16 ctlr101 ceph-mon[2632098]: in thread 7fe24692d700 thread_name:ms_dispatch
May 31 02:26:16 ctlr101 ceph-mon[2632098]: ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous (stable)
May 31 02:26:16 ctlr101 ceph-mon[2632098]: 1: (()+0x9e6334) [0x558c5f2fb334]
May 31 02:26:16 ctlr101 ceph-mon[2632098]: 2: (()+0x11390) [0x7fe24f6ce390]
May 31 02:26:16 ctlr101 ceph-mon[2632098]: 3: (gsignal()+0x38) [0x7fe24dc14428]
May 31 02:26:16 ctlr101 ceph-mon[2632098]: 4: (abort()+0x16a) [0x7fe24dc1602a]
May 31 02:26:16 ctlr101 ceph-mon[2632098]: 5: (OSDMonitor::build_incremental(unsigned int, unsigned int, unsigned long)+0x9c5) [0x558c5ee80455]
May 31 02:26:16 ctlr101 ceph-mon[2632098]: 6: (OSDMonitor::send_incremental(unsigned int, MonSession*, bool, boost::intrusive_ptr<MonOpRequest>)+0xcf) [0x558c5ee80b3f]
May 31 02:26:16 ctlr101 ceph-mon[2632098]: 7: (OSDMonitor::check_osdmap_sub(Subscription*)+0x22d) [0x558c5ee8622d]
May 31 02:26:16 ctlr101 ceph-mon[2632098]: 8: (Monitor::handle_subscribe(boost::intrusive_ptr<MonOpRequest>)+0x1082) [0x558c5ecdb0b2]
May 31 02:26:16 ctlr101 ceph-mon[2632098]: 9: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x9f4) [0x558c5ed05114]
May 31 02:26:16 ctlr101 ceph-mon[2632098]: 10: (Monitor::_ms_dispatch(Message*)+0x6db) [0x558c5ed061ab]
May 31 02:26:16 ctlr101 ceph-mon[2632098]: 11: (Monitor::ms_dispatch(Message*)+0x23) [0x558c5ed372c3]
May 31 02:26:16 ctlr101 ceph-mon[2632098]: 12: (DispatchQueue::entry()+0xf4a) [0x558c5f2a205a]
May 31 02:26:16 ctlr101 ceph-mon[2632098]: 13: (DispatchQueue::DispatchThread::entry()+0xd) [0x558c5f035dcd]
May 31 02:26:16 ctlr101 ceph-mon[2632098]: 14: (()+0x76ba) [0x7fe24f6c46ba]
May 31 02:26:16 ctlr101 ceph-mon[2632098]: 15: (clone()+0x6d) [0x7fe24dce641d]
May 31 02:26:16 ctlr101 ceph-mon[2632098]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
May 31 02:26:16 ctlr101 ceph-mon[2632098]: -1501> 2019-05-31 02:26:09.345533 7fe250321080 -1 compacting monitor store ...
May 31 02:26:16 ctlr101 ceph-mon[2632098]: -1475> 2019-05-31 02:26:11.320926 7fe250321080 -1 done compacting
May 31 02:26:16 ctlr101 ceph-mon[2632098]: -946> 2019-05-31 02:26:16.497933 7fe242925700 -1 log_channel(cluster) log [ERR] : overall HEALTH_ERR 13 osds down; 1 host (6 osds) down; 74266/2566020 objects
May 31 02:26:16 ctlr101 ceph-mon[2632098]: 0> 2019-05-31 02:26:16.578932 7fe24692d700 -1 *** Caught signal (Aborted) **
May 31 02:26:16 ctlr101 ceph-mon[2632098]: in thread 7fe24692d700 thread_name:ms_dispatch
May 31 02:26:16 ctlr101 ceph-mon[2632098]: ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous (stable)
May 31 02:26:16 ctlr101 ceph-mon[2632098]: 1: (()+0x9e6334) [0x558c5f2fb334]
May 31 02:26:16 ctlr101 ceph-mon[2632098]: 2: (()+0x11390) [0x7fe24f6ce390]
May 31 02:26:16 ctlr101 ceph-mon[2632098]: 3: (gsignal()+0x38) [0x7fe24dc14428]
May 31 02:26:16 ctlr101 ceph-mon[2632098]: 4: (abort()+0x16a) [0x7fe24dc1602a]
May 31 02:26:16 ctlr101 ceph-mon[2632098]: 5: (OSDMonitor::build_incremental(unsigned int, unsigned int, unsigned long)+0x9c5) [0x558c5ee80455]
May 31 02:26:16 ctlr101 ceph-mon[2632098]: 6: (OSDMonitor::send_incremental(unsigned int, MonSession*, bool, boost::intrusive_ptr<MonOpRequest>)+0xcf) [0x558c5ee80b3f]
May 31 02:26:16 ctlr101 ceph-mon[2632098]: 7: (OSDMonitor::check_osdmap_sub(Subscription*)+0x22d) [0x558c5ee8622d]
May 31 02:26:16 ctlr101 ceph-mon[2632098]: 8: (Monitor::handle_subscribe(boost::intrusive_ptr<MonOpRequest>)+0x1082) [0x558c5ecdb0b2]
May 31 02:26:16 ctlr101 ceph-mon[2632098]: 9: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x9f4) [0x558c5ed05114]
May 31 02:26:16 ctlr101 ceph-mon[2632098]: 10: (Monitor::_ms_dispatch(Message*)+0x6db) [0x558c5ed061ab]
May 31 02:26:16 ctlr101 ceph-mon[2632098]: 11: (Monitor::ms_dispatch(Message*)+0x23) [0x558c5ed372c3]
May 31 02:26:16 ctlr101 ceph-mon[2632098]: 12: (DispatchQueue::entry()+0xf4a) [0x558c5f2a205a]
May 31 02:26:16 ctlr101 ceph-mon[2632098]: 13: (DispatchQueue::DispatchThread::entry()+0xd) [0x558c5f035dcd]
May 31 02:26:16 ctlr101 ceph-mon[2632098]: 14: (()+0x76ba) [0x7fe24f6c46ba]
May 31 02:26:16 ctlr101 ceph-mon[2632098]: 15: (clone()+0x6d) [0x7fe24dce641d]
May 31 02:26:16 ctlr101 ceph-mon[2632098]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
May 31 02:26:16 ctlr101 systemd[1]: ceph-mon@ctlr101.service: Main process exited, code=killed, status=6/ABRT
May 31 02:26:16 ctlr101 systemd[1]: ceph-mon@ctlr101.service: Unit entered failed state.
May 31 02:26:16 ctlr101 systemd[1]: ceph-mon@ctlr101.service: Failed with result 'signal'.
May 31 02:26:26 ctlr101 systemd[1]: ceph-mon@ctlr101.service: Service hold-off time over, scheduling restart.
May 31 02:26:26 ctlr101 systemd[1]: Stopped Ceph cluster monitor daemon.
May 31 02:26:26 ctlr101 systemd[1]: Started Ceph cluster monitor daemon.
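
The log above contains the same backtrace three times (the signal handler output, the log entry, and the recent-events dump). As a rough triage aid, the sketch below pulls the frames out of journal lines like these, collapses repeated traces by their addresses, and reports the first named Ceph frame. The function names are hypothetical, and the regular expression only assumes the `N: (symbol+offset) [address]` frame layout seen in this log.

```python
import re

# Matches the tail of a backtrace line, for example:
#   "... 5: (OSDMonitor::build_incremental(unsigned int, ...)+0x9c5) [0x558c5ee80455]"
FRAME_RE = re.compile(r"\s(\d+): \((.*)\) \[(0x[0-9a-f]+)\]\s*$")


def extract_traces(lines):
    """Group consecutive frame lines into traces of (index, symbol, address)."""
    traces, current = [], []
    for line in lines:
        match = FRAME_RE.search(line)
        if match:
            index = int(match.group(1))
            if index == 1 and current:  # frame 1 starts a new trace
                traces.append(tuple(current))
                current = []
            current.append((index, match.group(2), match.group(3)))
        elif current:  # a non-frame line ends the current trace
            traces.append(tuple(current))
            current = []
    if current:
        traces.append(tuple(current))
    return traces


def unique_traces(traces):
    """Drop traces whose address sequence was already seen."""
    seen, result = set(), []
    for trace in traces:
        key = tuple(address for _, _, address in trace)
        if key not in seen:
            seen.add(key)
            result.append(trace)
    return result


def first_named_frame(trace, skip=("gsignal", "abort")):
    """Return the first symbol name that is neither anonymous nor in skip."""
    for _, symbol, _ in trace:
        name = symbol.split("(", 1)[0]  # text before the argument list
        if name and name not in skip:
            return name
    return None


if __name__ == "__main__":
    sample = [
        "May 31 02:26:16 ctlr101 ceph-mon[2632098]: 3: (gsignal()+0x38) [0x7fe24dc14428]",
        "May 31 02:26:16 ctlr101 ceph-mon[2632098]: 4: (abort()+0x16a) [0x7fe24dc1602a]",
        "May 31 02:26:16 ctlr101 ceph-mon[2632098]: 5: (OSDMonitor::build_incremental(unsigned int, unsigned int, unsigned long)+0x9c5) [0x558c5ee80455]",
    ]
    for trace in unique_traces(extract_traces(sample)):
        print(first_named_frame(trace))
```

Feeding it the full journal output (for example, lines saved from `journalctl -u ceph-mon@ctlr101`) should reduce the three copies to a single trace.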

The ceph -s command times out most of the time. Occasionally, when 3 or more mon services are up, I get a result, but the mon services go down again very quickly.

root@ctlr101:~# ceph -s
cluster:
    id:     53264466-680b-42e6-899d-d042c3a8334a
    health: HEALTH_ERR
            6 osds down
            1 host (6 osds) down
            74266/2566020 objects misplaced (2.894%)
            Reduced data availability: 446 pgs inactive, 440 pgs peering
            Degraded data redundancy: 108173/2566020 objects degraded (4.216%), 142 pgs degraded, 330 pgs undersized
            18600 slow requests are blocked > 32 sec. Implicated osds 8,21,27,29,32,41,63,91,96,98,100
            27371 stuck requests are blocked > 4096 sec. Implicated osds 14,25,26,34,37,46,48,50,51,58,59,60,61,66,67,69,73,74,75,90,95,99
            2/5 mons down, quorum ctlr101,ctlr201,ctlr301

services:
    mon: 5 daemons, quorum ctlr101,ctlr201,ctlr301, out of quorum: ceph101, ceph201
    mgr: ceph101(active), standbys: ceph301, ctlr201, ctlr301, ceph201, ctlr101
    mds: cephfs-1/1/1 up  {0=ceph101=up:active}, 2 up:standby
    osd: 52 osds: 46 up, 52 in; 22 remapped pgs
    rgw: 3 daemons active

data:
    pools:   20 pools, 2528 pgs
    objects: 855.34k objects, 3.69TiB
    usage:   11.4TiB used, 28.3TiB / 39.7TiB avail
    pgs:     0.237% pgs unknown
             17.445% pgs not active
             108173/2566020 objects degraded (4.216%)
             74266/2566020 objects misplaced (2.894%)
             1667 active+clean
             413  peering
             198  active+undersized
             141  active+undersized+degraded
             60   active+remapped+backfill_wait
             27   remapped+peering
             12   active+clean+remapped
             6    unknown
             2    active+undersized+remapped
             1    active+undersized+degraded+remapped+backfilling
             1    remapped

io:
    client:   5.65MiB/s rd, 81.1KiB/s wr, 143op/s rd, 43op/s wr

Note: the io data above is stale; the values have not changed for 1 day.
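
The percentages in the data section of the ceph -s output above can be cross-checked from the per-state PG counts. The sketch below copies the numbers from that output and assumes that "not active" means any state without the `active` flag, excluding `unknown`; that interpretation is an assumption that happens to reproduce the reported figures, not a statement of how Ceph computes them.

```python
# PG counts per state, copied from the "pgs:" list in the ceph -s output.
PG_STATES = {
    "active+clean": 1667,
    "peering": 413,
    "active+undersized": 198,
    "active+undersized+degraded": 141,
    "active+remapped+backfill_wait": 60,
    "remapped+peering": 27,
    "active+clean+remapped": 12,
    "unknown": 6,
    "active+undersized+remapped": 2,
    "active+undersized+degraded+remapped+backfilling": 1,
    "remapped": 1,
}
TOTAL_PGS = 2528          # "20 pools, 2528 pgs"
OBJECT_COPIES = 2566020   # denominator of the degraded/misplaced ratios


def pct(part: int, whole: int) -> float:
    """Percentage rounded to three decimals, as printed by ceph -s."""
    return round(100.0 * part / whole, 3)


def not_active_pgs(states: dict) -> int:
    """PGs whose state has no 'active' flag, not counting 'unknown'."""
    return sum(
        count
        for state, count in states.items()
        if state != "unknown" and "active" not in state.split("+")
    )


if __name__ == "__main__":
    print("sum of states:", sum(PG_STATES.values()), "of", TOTAL_PGS)
    print("unknown %:", pct(PG_STATES["unknown"], TOTAL_PGS))
    print("not active %:", pct(not_active_pgs(PG_STATES), TOTAL_PGS))
    print("degraded %:", pct(108173, OBJECT_COPIES))
    print("misplaced %:", pct(74266, OBJECT_COPIES))
```

The state counts add up to the reported 2528 PGs, so the listing is at least internally consistent even though the io figures are stale.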


Updated by Greg Farnum almost 5 years ago