Bug #50420
all OSDs down after mon scrub runs too long
Status:
Need More Info
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:
0%
Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
Hi all.
My cluster has 5 mons, and everything was fine.
My ceph mon config:
mon_osd_report_timeout = 1800
mon_scrub_max_keys = 500
One day the leader mon logged:
2021-04-18 18:32:35.292 7fee4af5c700 0 log_channel(cluster) log [DBG] : scrub ok on 0,1,2,3,4: ScrubResult(keys {auth=109,config=2,health=10,logm=379} crc {auth=4175143226,config=2351104334,health=3725708835,logm=1504818101})
2021-04-18 18:32:35.316 7fee4af5c700 0 log_channel(cluster) log [DBG] : scrub ok on 0,1,2,3,4: ScrubResult(keys {logm=500} crc {logm=3519233493})
2021-04-18 18:32:35.358 7fee4af5c700 0 log_channel(cluster) log [DBG] : scrub ok on 0,1,2,3,4: ScrubResult(keys {logm=500} crc {logm=360977370})
2021-04-18 18:32:35.401 7fee4af5c700 0 log_channel(cluster) log [DBG] : scrub ok on 0,1,2,3,4: ScrubResult(keys {logm=48,mds_health=3,mds_metadata=1,mdsmap=34,mgr=414} crc {logm=3065177786,mds_health=3427260264,mds_metadata=4153380910,mdsmap=3025637172,mgr=1163851084})
2021-04-18 18:32:35.427 7fee4af5c700 0 log_channel(cluster) log [DBG] : scrub ok on 0,1,2,3,4: ScrubResult(keys {mgr=146,mgr_command_descs=1,mgr_metadata=4,mgrstat=146,mon_config_key=203} crc {mgr=1664927963,mgr_command_descs=2533195504,mgr_metadata=167114949,mgrstat=2601563346,mon_config_key=1097652760})
2021-04-18 18:32:35.435 7fee4af5c700 0 log_channel(cluster) log [DBG] : scrub ok on 0,1,2,3,4: ScrubResult(keys {mon_config_key=6,monmap=26,osd_metadata=468} crc {mon_config_key=2683020667,monmap=2672004040,osd_metadata=2946197946})
2021-04-18 18:32:35.441 7fee4af5c700 0 log_channel(cluster) log [DBG] : scrub ok on 0,1,2,3,4: ScrubResult(keys {osd_metadata=223,osd_pg_creating=1,osd_snap=276} crc {osd_metadata=213405302,osd_pg_creating=1164028210,osd_snap=3162940035})
2021-04-18 18:32:35.446 7fee4af5c700 0 log_channel(cluster) log [DBG] : scrub ok on 0,1,2,3,4: ScrubResult(keys {osd_snap=500} crc {osd_snap=2627010237})
2021-04-18 18:32:35.451 7fee4af5c700 0 log_channel(cluster) log [DBG] : scrub ok on 0,1,2,3,4: ScrubResult(keys {osd_snap=500} crc {osd_snap=3257476674})
2021-04-18 18:32:35.457 7fee4af5c700 0 log_channel(cluster) log [DBG] : scrub ok on 0,1,2,3,4: ScrubResult(keys {osd_snap=500} crc {osd_snap=3260111946})
2021-04-18 18:32:35.462 7fee4af5c700 0 log_channel(cluster) log [DBG] : scrub ok on 0,1,2,3,4: ScrubResult(keys {osd_snap=500} crc {osd_snap=3560445762})
2021-04-18 18:32:35.467 7fee4af5c700 0 log_channel(cluster) log [DBG] : scrub ok on 0,1,2,3,4: ScrubResult(keys {osd_snap=500} crc {osd_snap=1767217459})
2021-04-18 18:32:35.472 7fee4af5c700 0 log_channel(cluster) log [DBG] : scrub ok on 0,1,2,3,4: ScrubResult(keys {osd_snap=500} crc {osd_snap=1157970184})
2021-04-18 18:32:35.477 7fee4af5c700 0 log_channel(cluster) log [DBG] : scrub ok on 0,1,2,3,4: ScrubResult(keys {osd_snap=500} crc {osd_snap=4184155926})
....
2021-04-18 19:02:38.184 7fee4af5c700 0 log_channel(cluster) log [DBG] : scrub ok on 0,1,2,3,4: ScrubResult(keys {osd_snap=500} crc {osd_snap=1037417682})
2021-04-18 19:02:38.189 7fee4af5c700 0 log_channel(cluster) log [DBG] : scrub ok on 0,1,2,3,4: ScrubResult(keys {osd_snap=500} crc {osd_snap=1150375840})
2021-04-18 19:02:38.194 7fee4af5c700 0 log_channel(cluster) log [DBG] : scrub ok on 0,1,2,3,4: ScrubResult(keys {osd_snap=500} crc {osd_snap=1110637961})
2021-04-18 19:02:38.206 7fee4af5c700 0 log_channel(cluster) log [DBG] : scrub ok on 0,1,2,3,4: ScrubResult(keys {osd_snap=500} crc {osd_snap=3042911527})
2021-04-18 19:02:38.214 7fee4af5c700 0 log_channel(cluster) log [DBG] : scrub ok on 0,1,2,3,4: ScrubResult(keys {osd_snap=500} crc {osd_snap=887725757})
2021-04-18 19:02:38.218 7fee4af5c700 0 log_channel(cluster) log [DBG] : scrub ok on 0,1,2,3,4: ScrubResult(keys {osd_snap=500} crc {osd_snap=2982522619})
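Judging from the first and last timestamps in the scrub log above, the whole scrub spanned roughly 30 minutes, slightly more than the cluster's configured mon_osd_report_timeout of 1800 seconds. A small sketch of that arithmetic, using only the two timestamps from the log:

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"

# First and last scrub log entries from the output above.
start = datetime.strptime("2021-04-18 18:32:35.292", FMT)
end = datetime.strptime("2021-04-18 19:02:38.218", FMT)

elapsed = (end - start).total_seconds()
print(elapsed)  # 1802.926 -- just over mon_osd_report_timeout = 1800
```

So by the time the scrub finished, the beacon timeout had already expired for any OSD whose last beacon arrived just before the scrub started.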
The ceph mon scrub process ran for about 1800 seconds; after that, all OSDs were marked down and the cluster had slow ops.
2021-04-18 19:02:05.624 7fee51769700 -1 mon.ceph-mon-1@0(leader).osd e4587000 no beacon from osd.439 since 2021-04-18 18:32:04.932260, 1800.691873 seconds ago. marking down
2021-04-18 19:02:05.624 7fee51769700 0 log_channel(cluster) log [INF] : osd.453 marked down after no beacon for 1800.262573 seconds
2021-04-18 19:02:05.624 7fee51769700 -1 mon.ceph-mon-1@0(leader).osd e4587000 no beacon from osd.453 since 2021-04-18 18:32:05.361560, 1800.262573 seconds ago. marking down
2021-04-18 19:02:05.624 7fee51769700 0 log_channel(cluster) log [INF] : osd.524 marked down after no beacon for 1802.030657 seconds
2021-04-18 19:02:05.624 7fee51769700 -1 mon.ceph-mon-1@0(leader).osd e4587000 no beacon from osd.524 since 2021-04-18 18:32:03.593476, 1802.030657 seconds ago. marking down
2021-04-18 19:02:05.624 7fee51769700 0 log_channel(cluster) log [INF] : osd.526 marked down after no beacon for 1801.345993 seconds
2021-04-18 19:02:05.624 7fee51769700 -1 mon.ceph-mon-1@0(leader).osd e4587000 no beacon from osd.526 since 2021-04-18 18:32:04.278140, 1801.345993 seconds ago. marking down
2021-04-18 19:02:05.624 7fee51769700 0 log_channel(cluster) log [INF] : osd.542 marked down after no beacon for 1802.250146 seconds
2021-04-18 19:02:05.624 7fee51769700 -1 mon.ceph-mon-1@0(leader).osd e4587000 no beacon from osd.542 since 2021-04-18 18:32:03.373987, 1802.250146 seconds ago. marking down
2021-04-18 19:02:05.624 7fee51769700 0 log_channel(cluster) log [INF] : osd.567 marked down after no beacon for 1804.243142 seconds
2021-04-18 19:02:05.624 7fee51769700 -1 mon.ceph-mon-1@0(leader).osd e4587000 no beacon from osd.567 since 2021-04-18 18:32:01.380991, 1804.243142 seconds ago. marking down
2021-04-18 19:02:05.624 7fee51769700 0 log_channel(cluster) log [INF] : osd.576 marked down after no beacon for 1803.987611 seconds
2021-04-18 19:02:05.624 7fee51769700 -1 mon.ceph-mon-1@0(leader).osd e4587000 no beacon from osd.576 since 2021-04-18 18:32:01.636522, 1803.987611 seconds ago. marking down
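The mark-downs above match the leader's beacon timeout rule: an OSD is marked down once no beacon has arrived for mon_osd_report_timeout seconds. A minimal sketch of that check (a hypothetical simplification for illustration, not Ceph's actual OSDMonitor code):

```python
# Hypothetical simplification of the leader mon's beacon-timeout check.
MON_OSD_REPORT_TIMEOUT = 1800.0  # seconds, as configured in this cluster

def should_mark_down(seconds_since_last_beacon: float) -> bool:
    """True once the OSD's last beacon is older than the timeout."""
    return seconds_since_last_beacon > MON_OSD_REPORT_TIMEOUT

# Beacon ages taken from the log lines above; every one exceeds the timeout.
ages = [1800.691873, 1800.262573, 1802.030657, 1801.345993,
        1802.250146, 1804.243142, 1803.987611]
print(all(should_mark_down(a) for a in ages))  # True
```

Every beacon age in the log is 1800-1805 seconds, i.e. the last beacons were processed right before the scrub began at 18:32:35, which suggests the mon stopped processing beacons for the duration of the scrub.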
After I restarted the mon, cluster health returned to OK.
Why does the ceph mon scrub take so long, and how can I fix this?
Thanks.