Support #54621


ask for help about ceph mds offline

Added by zhiyong lin about 2 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
common
Target version:
% Done:

0%

Tags:
Reviewed:
Affected Versions:
Pull request ID:

Description

We hit a serious problem in our production environment: our Ceph MDS went offline.

It looks just like this issue, except that our Ceph version is 14.2.3 and had not been upgraded:
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/KQ5A5OWRIUEOJBC7VILBGDIKPQGJQIWN/

Last Friday morning, 4 PGs in our ceph_fs_data pool became inconsistent. We ran ceph pg repair, but it didn't work; flushing the OSD journal didn't work either, so we decided to mark the primary OSD of each PG out so the data would be remapped to other OSDs. But before we could do that, the cluster suddenly reported a large number of MDS slow requests. We restarted the active MDS, hoping the standby MDS would take over and clear the slow requests (at this point the 4 PGs in the ceph_fs_data pool were still inconsistent).
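For reference, our first-pass repair attempt looked roughly like the following; the PG id 2.1a is a placeholder, not a real PG from our cluster:

```shell
# Show which PGs the cluster currently flags as inconsistent.
ceph health detail | grep -E 'pg [0-9a-f.]+ is .*inconsistent'

# Inspect which objects/shards disagree in a given PG
# (PG id 2.1a is a hypothetical example).
rados list-inconsistent-obj 2.1a --format=json-pretty

# Ask the primary OSD to repair that PG.
ceph pg repair 2.1a
```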

After this restart, both of our MDS daemons stayed in standby and neither ever became active!
We tried
ceph fs set cephfs max_mds 1
ceph fs set cephfs allow_standby_replay false
ceph mds repaired 0
as well as the two recovery procedures described on this page:
https://docs.ceph.com/en/latest/cephfs/disaster-recovery-experts/#disaster-recovery-experts
but they all failed...
The CephFS filesystem stayed degraded and offline, and there were no useful messages in the MDS or OSD logs even though we set the log level to 20.
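When an MDS refuses to leave standby, it can help to check why the rank is not being assigned before digging into logs; a rough checklist (the daemon id mds.a is a placeholder):

```shell
# Show filesystem state: ranks, and which daemons are active/standby.
ceph fs status
ceph mds stat

# Dump the full FSMap, including whether rank 0 is marked failed or damaged.
ceph fs dump

# Raise debug verbosity on a running MDS (daemon id "a" is a placeholder).
ceph tell mds.a config set debug_mds 20
```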

It scared us badly. By the way, after we restarted the OSDs hosting those PGs, the inconsistent flag disappeared from the health output, but the underlying inconsistency was still there!!

Finally we ran ceph pg deep-scrub against every PG to recheck them, which revealed the hidden inconsistent PGs. We marked the related OSDs out and down so the data was remapped; once no PG was inconsistent anymore, we reran the commands and they worked:
ceph fs set cephfs max_mds 1
ceph fs set cephfs allow_standby_replay false
ceph mds repaired 0
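The deep-scrub sweep can be sketched roughly as follows; we assume the plain `ceph pg ls-by-pool` output lists the PG id in the first column after a header line, and OSD id 7 is a placeholder:

```shell
# Deep-scrub every PG in the data pool to surface hidden inconsistencies
# (assumes the PG id is the first column after the header line).
for pg in $(ceph pg ls-by-pool ceph_fs_data | awk 'NR>1 {print $1}'); do
    ceph pg deep-scrub "$pg"
done

# List any PGs still flagged inconsistent after the scrubs complete.
ceph pg ls inconsistent

# Mark a suspect OSD out/down so its PGs remap (OSD id 7 is a placeholder).
ceph osd out 7
ceph osd down 7
```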

Thank god, we finally brought CephFS back to normal. But after this accident we are all confused: how can an inconsistent PG in the ceph_fs_data pool affect the MDS transition from standby to active? We thought the MDS depends only on the ceph_fs_metadata pool; it should be none of the ceph_fs_data pool's business.
We later used ceph-objectstore-tool to deliberately create many PG inconsistencies in the ceph_fs_data pool and restarted the MDS, but everything worked fine, and ceph pg repair could fix them all.
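For reference, our corruption experiments used ceph-objectstore-tool roughly as below; the OSD id, data path, PG id, and object name are all placeholders, and the OSD must be stopped before touching its store:

```shell
# Stop the OSD whose object store we want to modify (id 3 is a placeholder).
systemctl stop ceph-osd@3

# List the objects held by this OSD (one JSON record per object).
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-3 --op list

# Overwrite one replica's data for a single object to create an
# inconsistency that a deep scrub should then detect.
dd if=/dev/urandom of=/tmp/garbage bs=4096 count=1
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-3 --pgid 2.1a \
    '10000000000.00000000' set-bytes /tmp/garbage

systemctl start ceph-osd@3
```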
Well, maybe the slow requests were also related to the MDS restart process? We have no idea and could not reproduce or explain it.

Any advice... thanks so much!
Should we avoid restarting ceph-mds while PGs in the ceph_fs_data pool are inconsistent? And how should we understand this behavior?


