Bug #43318

monitor marks all services (osd, mgr) down

Added by simon gao over 4 years ago. Updated over 4 years ago.

Status:
Need More Info
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
rados
Component(RADOS):
Monitor
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Suddenly, all mgrs and osds in my cluster began to be marked down by the monitor.
The monitor log looks like this:
```
2019-12-02 11:18:48.437216 7f3dc73bd700 0 mon.xxxx-c3478@0(leader).data_health(50) update_stats avail 71% total 47.1GiB, used 11.0GiB, avail 33.7GiB
2019-12-02 11:19:09.206883 7f3dc73bd700 0 log_channel(cluster) log [INF] : Manager daemon xxxx-c3478 is unresponsive. No standby daemons available.
2019-12-02 11:19:09.206978 7f3dc73bd700 0 log_channel(cluster) log [WRN] : Health check failed: no active mgr (MGR_DOWN)
2019-12-02 11:19:12.303696 7f3dc0bb0700 0 log_channel(cluster) log [DBG] : mgrmap e6415: no daemons active
2019-12-02 11:19:49.588504 7f3dc73bd700 0 mon.xxxx-c3478@0(leader).data_health(50) update_stats avail 71% total 47.1GiB, used 11.0GiB, avail 33.7GiB
2019-12-02 11:20:07.900015 7f3dc4bb8700 0 log_channel(cluster) log [INF] : Activating manager daemon xxxx-c3542
2019-12-02 11:20:09.340391 7f3dc73bd700 0 log_channel(cluster) log [INF] : Health check cleared: MGR_DOWN (was: no active mgr)
2019-12-02 11:20:12.035889 7f3dc0bb0700 0 log_channel(cluster) log [DBG] : mgrmap e6416: xxxx-c3542(active, starting)
2019-12-02 11:20:24.117495 7f3dc4bb8700 0 log_channel(cluster) log [DBG] : Standby manager daemon xxxx-c3510 started
2019-12-02 11:20:29.412746 7f3dc0bb0700 0 log_channel(cluster) log [DBG] : mgrmap e6417: xxxx-c3542(active, starting), standbys: xxxx-c3510
2019-12-02 11:20:39.734093 7f3dc73bd700 0 log_channel(cluster) log [INF] : Manager daemon xxxx-c3542 is unresponsive, replacing it with standby daemon xxxx-c3510
2019-12-02 11:20:42.309267 7f3dc0bb0700 0 log_channel(cluster) log [DBG] : Standby manager daemon xxxx-c3478 started
2019-12-02 11:20:42.309295 7f3dc0bb0700 0 log_channel(cluster) log [DBG] : mgrmap e6418: xxxx-c3510(active, starting)
2019-12-02 11:20:47.259694 7f3dc0bb0700 0 log_channel(cluster) log [DBG] : mgrmap e6419: xxxx-c3510(active, starting), standbys: xxxx-c3478
2019-12-02 11:20:49.703433 7f3dc73bd700 0 mon.xxxx-c3478@0(leader).data_health(50) update_stats avail 71% total 47.1GiB, used 11.0GiB, avail 33.7GiB
2019-12-02 11:21:16.758797 7f3dc0bb0700 0 log_channel(cluster) log [DBG] : mgrmap e6420: xxxx-c3510(active, starting)
2019-12-02 11:21:25.622953 7f3dc73bd700 0 log_channel(cluster) log [INF] : Manager daemon xxxx-c3510 is unresponsive. No standby daemons available.
2019-12-02 11:21:25.623031 7f3dc73bd700 0 log_channel(cluster) log [WRN] : Health check failed: no active mgr (MGR_DOWN)
2019-12-02 11:21:27.823344 7f3dc0bb0700 0 log_channel(cluster) log [DBG] : mgrmap e6421: no daemons active
2019-12-02 11:21:35.222219 7f3dc4bb8700 0 log_channel(cluster) log [INF] : Activating manager daemon xxxx-c3542
2019-12-02 11:21:36.321142 7f3dc73bd700 0 log_channel(cluster) log [INF] : Health check cleared: MGR_DOWN (was: no active mgr)
2019-12-02 11:21:41.425900 7f3dc0bb0700 0 log_channel(cluster) log [DBG] : mgrmap e6422: xxxx-c3542(active, starting)
2019-12-02 11:21:50.369749 7f3dc73bd700 0 mon.xxxx-c3478@0(leader).data_health(50) update_stats avail 71% total 47.1GiB, used 11.0GiB, avail 33.7GiB
2019-12-02 11:21:58.984426 7f3dc4bb8700 0 log_channel(cluster) log [DBG] : Standby manager daemon xxxx-c3510 started
2019-12-02 11:22:01.776657 7f3dc0bb0700 0 log_channel(cluster) log [DBG] : mgrmap e6423: xxxx-c3542(active, starting), standbys: xxxx-c3510
2019-12-02 11:22:09.886330 7f3dc73bd700 0 log_channel(cluster) log [INF] : Manager daemon xxxx-c3542 is unresponsive, replacing it with standby daemon xxxx-c3510
2019-12-02 11:22:13.396388 7f3dc0bb0700 0 log_channel(cluster) log [DBG] : mgrmap e6424: xxxx-c3510(active, starting)
2019-12-02 11:22:46.957938 7f3dc73bd700 0 log_channel(cluster) log [INF] : Manager daemon xxxx-c3510 is unresponsive. No standby daemons available.
2019-12-02 11:22:46.958028 7f3dc73bd700 0 log_channel(cluster) log [WRN] : Health check failed: no active mgr (MGR_DOWN)
2019-12-02 11:22:48.330422 7f3dc0bb0700 0 log_channel(cluster) log [INF] : Activating manager daemon xxxx-c3542
2019-12-02 11:22:48.330440 7f3dc0bb0700 0 log_channel(cluster) log [DBG] : mgrmap e6425: no daemons active
2019-12-02 11:22:49.592789 7f3dc73bd700 0 log_channel(cluster) log [INF] : Health check cleared: MGR_DOWN (was: no active mgr)
2019-12-02 11:22:50.861478 7f3dc73bd700 0 mon.xxxx-c3478@0(leader).data_health(50) update_stats avail 71% total 47.1GiB, used 11.0GiB, avail 33.7GiB
2019-12-02 11:22:52.091876 7f3dc0bb0700 0 log_channel(cluster) log [DBG] : mgrmap e6426: xxxx-c3542(active, starting)
```
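
The "daemon is unresponsive" messages suggest the monitor is not receiving mgr beacons within its grace period, so it keeps failing over the active mgr. A minimal diagnostic sketch, assuming the monitor id xxxx-c3478 from the log above and default option names (mon_mgr_beacon_grace, mon_osd_report_timeout); nothing here is confirmed from this particular cluster:
```
# How long the monitor waits for mgr/osd beacons before declaring daemons unresponsive
# (run on the monitor host, via the admin socket).
ceph daemon mon.xxxx-c3478 config get mon_mgr_beacon_grace
ceph daemon mon.xxxx-c3478 config get mon_osd_report_timeout

# Clock skew between monitors can also make beacons look stale.
ceph time-sync-status

# Overall cluster state while the flapping is happening.
ceph -s
```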

History

#1 Updated by Neha Ojha over 4 years ago

  • Status changed from New to Need More Info

Can you provide mgr logs from when this happened?

#2 Updated by simon gao over 4 years ago

The mgr writes no log output even with debug_mgr set to 40.
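
For reference, a hedged sketch of how the debug level can be raised on a running mgr and where its log would land with a default package install; the daemon id xxxx-c3542 is taken from the log above and should be adjusted to the local host:
```
# Raise the mgr debug level via the daemon's admin socket
# (run on the host where that mgr is running).
ceph daemon mgr.xxxx-c3542 config set debug_mgr 20/20

# With the default logging configuration the output goes to the per-daemon log file.
tail -f /var/log/ceph/ceph-mgr.xxxx-c3542.log
```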
