Bug #50420

all OSDs down after mon scrub takes too long

Added by hoan nv about 3 years ago. Updated almost 3 years ago.

Status:
Need More Info
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi all.

My cluster has 5 mons and everything was OK.

My ceph mon config:

mon_osd_report_timeout = 1800
mon_scrub_max_keys = 500
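
These two options can be cross-checked against what the monitors are actually running with; a quick sketch, assuming centralized config (Mimic or later) and using the mon name ceph-mon-1 that appears in the logs below:

ceph config get mon mon_osd_report_timeout
ceph config get mon mon_scrub_max_keys
# or, on a mon host, read them from the running daemon over its admin socket
ceph daemon mon.ceph-mon-1 config get mon_osd_report_timeout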

One day the leader mon logged:

2021-04-18 18:32:35.292 7fee4af5c700  0 log_channel(cluster) log [DBG] : scrub ok on 0,1,2,3,4: ScrubResult(keys {auth=109,config=2,health=10,logm=379} crc {auth=4175143226,config=2351104334,health=3725708835,logm=1504818101})
2021-04-18 18:32:35.316 7fee4af5c700  0 log_channel(cluster) log [DBG] : scrub ok on 0,1,2,3,4: ScrubResult(keys {logm=500} crc {logm=3519233493})
2021-04-18 18:32:35.358 7fee4af5c700  0 log_channel(cluster) log [DBG] : scrub ok on 0,1,2,3,4: ScrubResult(keys {logm=500} crc {logm=360977370})
2021-04-18 18:32:35.401 7fee4af5c700  0 log_channel(cluster) log [DBG] : scrub ok on 0,1,2,3,4: ScrubResult(keys {logm=48,mds_health=3,mds_metadata=1,mdsmap=34,mgr=414} crc {logm=3065177786,mds_health=3427260264,mds_metadata=4153380910,mdsmap=3025637172,mgr=1163851084})
2021-04-18 18:32:35.427 7fee4af5c700  0 log_channel(cluster) log [DBG] : scrub ok on 0,1,2,3,4: ScrubResult(keys {mgr=146,mgr_command_descs=1,mgr_metadata=4,mgrstat=146,mon_config_key=203} crc {mgr=1664927963,mgr_command_descs=2533195504,mgr_metadata=167114949,mgrstat=2601563346,mon_config_key=1097652760})
2021-04-18 18:32:35.435 7fee4af5c700  0 log_channel(cluster) log [DBG] : scrub ok on 0,1,2,3,4: ScrubResult(keys {mon_config_key=6,monmap=26,osd_metadata=468} crc {mon_config_key=2683020667,monmap=2672004040,osd_metadata=2946197946})
2021-04-18 18:32:35.441 7fee4af5c700  0 log_channel(cluster) log [DBG] : scrub ok on 0,1,2,3,4: ScrubResult(keys {osd_metadata=223,osd_pg_creating=1,osd_snap=276} crc {osd_metadata=213405302,osd_pg_creating=1164028210,osd_snap=3162940035})
2021-04-18 18:32:35.446 7fee4af5c700  0 log_channel(cluster) log [DBG] : scrub ok on 0,1,2,3,4: ScrubResult(keys {osd_snap=500} crc {osd_snap=2627010237})
2021-04-18 18:32:35.451 7fee4af5c700  0 log_channel(cluster) log [DBG] : scrub ok on 0,1,2,3,4: ScrubResult(keys {osd_snap=500} crc {osd_snap=3257476674})
2021-04-18 18:32:35.457 7fee4af5c700  0 log_channel(cluster) log [DBG] : scrub ok on 0,1,2,3,4: ScrubResult(keys {osd_snap=500} crc {osd_snap=3260111946})
2021-04-18 18:32:35.462 7fee4af5c700  0 log_channel(cluster) log [DBG] : scrub ok on 0,1,2,3,4: ScrubResult(keys {osd_snap=500} crc {osd_snap=3560445762})
2021-04-18 18:32:35.467 7fee4af5c700  0 log_channel(cluster) log [DBG] : scrub ok on 0,1,2,3,4: ScrubResult(keys {osd_snap=500} crc {osd_snap=1767217459})
2021-04-18 18:32:35.472 7fee4af5c700  0 log_channel(cluster) log [DBG] : scrub ok on 0,1,2,3,4: ScrubResult(keys {osd_snap=500} crc {osd_snap=1157970184})
2021-04-18 18:32:35.477 7fee4af5c700  0 log_channel(cluster) log [DBG] : scrub ok on 0,1,2,3,4: ScrubResult(keys {osd_snap=500} crc {osd_snap=4184155926})
....

2021-04-18 19:02:38.184 7fee4af5c700  0 log_channel(cluster) log [DBG] : scrub ok on 0,1,2,3,4: ScrubResult(keys {osd_snap=500} crc {osd_snap=1037417682})
2021-04-18 19:02:38.189 7fee4af5c700  0 log_channel(cluster) log [DBG] : scrub ok on 0,1,2,3,4: ScrubResult(keys {osd_snap=500} crc {osd_snap=1150375840})
2021-04-18 19:02:38.194 7fee4af5c700  0 log_channel(cluster) log [DBG] : scrub ok on 0,1,2,3,4: ScrubResult(keys {osd_snap=500} crc {osd_snap=1110637961})
2021-04-18 19:02:38.206 7fee4af5c700  0 log_channel(cluster) log [DBG] : scrub ok on 0,1,2,3,4: ScrubResult(keys {osd_snap=500} crc {osd_snap=3042911527})
2021-04-18 19:02:38.214 7fee4af5c700  0 log_channel(cluster) log [DBG] : scrub ok on 0,1,2,3,4: ScrubResult(keys {osd_snap=500} crc {osd_snap=887725757})
2021-04-18 19:02:38.218 7fee4af5c700  0 log_channel(cluster) log [DBG] : scrub ok on 0,1,2,3,4: ScrubResult(keys {osd_snap=500} crc {osd_snap=2982522619})

The ceph mon scrub ran for about 1800 seconds (roughly 18:32 to 19:02); after that, all OSDs were marked down and the cluster had slow ops.

2021-04-18 19:02:05.624 7fee51769700 -1 mon.ceph-mon-1@0(leader).osd e4587000 no beacon from osd.439 since 2021-04-18 18:32:04.932260, 1800.691873 seconds ago.  marking down
2021-04-18 19:02:05.624 7fee51769700  0 log_channel(cluster) log [INF] : osd.453 marked down after no beacon for 1800.262573 seconds
2021-04-18 19:02:05.624 7fee51769700 -1 mon.ceph-mon-1@0(leader).osd e4587000 no beacon from osd.453 since 2021-04-18 18:32:05.361560, 1800.262573 seconds ago.  marking down
2021-04-18 19:02:05.624 7fee51769700  0 log_channel(cluster) log [INF] : osd.524 marked down after no beacon for 1802.030657 seconds
2021-04-18 19:02:05.624 7fee51769700 -1 mon.ceph-mon-1@0(leader).osd e4587000 no beacon from osd.524 since 2021-04-18 18:32:03.593476, 1802.030657 seconds ago.  marking down
2021-04-18 19:02:05.624 7fee51769700  0 log_channel(cluster) log [INF] : osd.526 marked down after no beacon for 1801.345993 seconds
2021-04-18 19:02:05.624 7fee51769700 -1 mon.ceph-mon-1@0(leader).osd e4587000 no beacon from osd.526 since 2021-04-18 18:32:04.278140, 1801.345993 seconds ago.  marking down
2021-04-18 19:02:05.624 7fee51769700  0 log_channel(cluster) log [INF] : osd.542 marked down after no beacon for 1802.250146 seconds
2021-04-18 19:02:05.624 7fee51769700 -1 mon.ceph-mon-1@0(leader).osd e4587000 no beacon from osd.542 since 2021-04-18 18:32:03.373987, 1802.250146 seconds ago.  marking down
2021-04-18 19:02:05.624 7fee51769700  0 log_channel(cluster) log [INF] : osd.567 marked down after no beacon for 1804.243142 seconds
2021-04-18 19:02:05.624 7fee51769700 -1 mon.ceph-mon-1@0(leader).osd e4587000 no beacon from osd.567 since 2021-04-18 18:32:01.380991, 1804.243142 seconds ago.  marking down
2021-04-18 19:02:05.624 7fee51769700  0 log_channel(cluster) log [INF] : osd.576 marked down after no beacon for 1803.987611 seconds
2021-04-18 19:02:05.624 7fee51769700 -1 mon.ceph-mon-1@0(leader).osd e4587000 no beacon from osd.576 since 2021-04-18 18:32:01.636522, 1803.987611 seconds ago.  marking down
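
The last beacons above were received around 18:32:01-18:32:05, just before the scrub began, and the OSDs were marked down almost exactly mon_osd_report_timeout (1800) seconds later, which suggests the leader was not processing OSD beacons while it was scrubbing. A possible stop-gap (a sketch only, assuming the option can be changed at runtime; not a confirmed fix) is to raise the timeout so a long scrub no longer crosses it:

ceph config set mon mon_osd_report_timeout 3600
# or, without centralized config:
ceph tell mon.* injectargs '--mon_osd_report_timeout=3600'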

I restarted the mon and the cluster health returned to OK.

Why did the ceph mon scrub take so long, and how can I fix it?
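
With mon_scrub_max_keys = 500 the scrub walks the whole mon store in 500-key chunks, so a very large store takes correspondingly long. A sketch for checking the store size and asking a monitor to compact its store (paths assume the default mon data directory layout; compaction is only a guess at a remedy here):

# on each mon host
du -sh /var/lib/ceph/mon/*/store.db
# ask the leader to compact its store
ceph tell mon.ceph-mon-1 compact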

Thanks.

#1

Updated by Neha Ojha almost 3 years ago

  • Status changed from New to Need More Info

Can you provide us with the cluster log from this time? How large is your mon db?
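
For anyone gathering this, a sketch assuming the default paths: the cluster log is written on the mon hosts as /var/log/ceph/ceph.log (with rotated copies alongside it), so the window around the incident can be pulled with something like:

grep '2021-04-18 1[89]:' /var/log/ceph/ceph.log
zgrep '2021-04-18 1[89]:' /var/log/ceph/ceph.log*.gz

The mon db size can be checked with the du command sketched above.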
