mds: fix mds damaged due to unexpected journal length
Recently we hit a rare but serious MDS damage issue while frequently switching/restarting MDS daemons. The MDS is reported damaged because the journaler found the log size is less than expected. Since this is a production cluster, we had to recalculate write_pos and reset it in the journal header to bring the MDS back up quickly, but doing so may cause data loss in some cases.
2018-09-10 17:26:54.499445 7fe688cc0180  0 pidfile_write: ignore empty --pid-file
2018-09-10 17:26:54.678165 7fe68249e700  1 mds.xxx handle_mds_map standby
2018-09-10 17:26:54.690824 7fe68249e700  1 mds.0.1900764 handle_mds_map i am now mds.0.1900764
2018-09-10 17:26:54.690828 7fe68249e700  1 mds.0.1900764 handle_mds_map state change up:boot --> up:replay
2018-09-10 17:26:54.690836 7fe68249e700  1 mds.0.1900764 replay_start
2018-09-10 17:26:54.690850 7fe68249e700  1 mds.0.1900764 recovery set is
2018-09-10 17:26:54.690857 7fe68249e700  1 mds.0.1900764 waiting for osdmap 106280 (which blacklists prior instance)
2018-09-10 17:26:54.800741 7fe67dc95700  0 mds.0.cache creating system inode with ino:100
2018-09-10 17:26:54.800861 7fe67dc95700  0 mds.0.cache creating system inode with ino:1
2018-09-10 17:26:56.465455 7fe67dc95700  0 mds.0.journaler(ro) _finish_read got less than expected (4194304)
2018-09-10 17:26:56.465481 7fe67bc91700  0 mds.0.log _replay journaler got error -22, aborting
2018-09-10 17:26:56.467651 7fe67bc91700 -1 log_channel(cluster) log [ERR] : Error loading MDS rank 0: (22) Invalid argument
2018-09-10 17:26:56.471741 7fe67bc91700  1 mds.xxx respawn
We use "cephfs-journal-tool journal inspect" to locate the last log object was damaged (200.00834942). This object size is only 4193284 bytes. Then the write_pos we recalculated from this offset should be 36087669980164, but from "cephfs-journal-tool header get", we get the write_pos is 36087669981184. This means the write_pos in the journal header is larger than the real pos, so mds thinks it is damaged.
It is very difficult to reproduce this issue, but we seem to have found a scenario that might trigger it:
t1: Journaler::_do_flush
    pending_safe[flush_pos_1] -> next_safe_pos_1
    filer.write
t2: Journaler::_do_flush
    pending_safe[flush_pos_2] -> next_safe_pos_2 (next_safe_pos_2 > next_safe_pos_1)
    filer.write
t3: Journaler::_finish_flush (for flush_pos_1)
    pending_safe.erase(flush_pos_1)
    pending_safe not empty, safe_pos = next_safe_pos_2
t4: switch/restart mds
    Journaler::_write_head write_pos = safe_pos (next_safe_pos_2)
    (something wrong with flush_pos_2, not committed to the osd?)
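To make the timeline concrete, here is a minimal, self-contained model of the suspected race. This is hypothetical code: the names mirror the Journaler members in the timeline above, but the logic and values are simplified and made up, not the real implementation.

```cpp
// Simplified model of the suspected race: two flushes are in flight; when the
// first one is acked, safe_pos is advanced to the member next_safe_pos, which
// already reflects the *second* flush. A header written at that moment records
// a write_pos beyond the data that is actually durable on the OSDs.
#include <cstdint>
#include <cstdio>
#include <map>

struct JournalerModel {
  uint64_t safe_pos = 0;       // highest offset believed durable on the OSDs
  uint64_t next_safe_pos = 0;  // offset that becomes safe once pending flushes land
  std::map<uint64_t, uint64_t> pending_safe;  // flush_pos -> next_safe_pos at _do_flush

  // _do_flush: remember what this flush will make safe, advance next_safe_pos,
  // and (conceptually) issue filer.write().
  void do_flush(uint64_t flush_pos, uint64_t new_next_safe_pos) {
    pending_safe[flush_pos] = new_next_safe_pos;
    next_safe_pos = new_next_safe_pos;
  }

  // _finish_flush: the suspected problem is here. flush_pos_2 is still pending,
  // yet safe_pos is advanced to the member next_safe_pos (= next_safe_pos_2),
  // which covers a write that has not been acknowledged yet.
  void finish_flush(uint64_t flush_pos) {
    pending_safe.erase(flush_pos);
    safe_pos = next_safe_pos;
  }

  // _write_head on switch/restart: the header records write_pos = safe_pos.
  uint64_t write_head() const { return safe_pos; }
};

int main() {
  JournalerModel j;
  j.do_flush(/*flush_pos_1*/ 100, /*next_safe_pos_1*/ 150);  // t1
  j.do_flush(/*flush_pos_2*/ 150, /*next_safe_pos_2*/ 200);  // t2
  j.finish_flush(100);  // t3: only the first write has been acked by the OSDs
  // t4: mds switch/restart; if the second write never reaches the OSDs,
  // the header now claims more journal data than actually exists.
  printf("header write_pos = %llu, but durable data ends at 150\n",
         (unsigned long long)j.write_head());
  return 0;
}
```

If the write behind flush_pos_2 never reaches the OSDs (for example because the old MDS instance is blacklisted during the switch), the header written at t4 claims more journal data than is actually there, which would match the gap observed above.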