Bug #6458

Updated by Greg Farnum over 10 years ago

Got a report on irc from a user whose log was 611 bytes shorter than the header indicated it should be. His guess was that it had happened the day before when he restarted the MDS "a couple times" while some OSDs were down.

Checking details:
1) The header object indicated the log should have ended at an object boundary. The last object was 611 bytes short (as evidenced by the object reads in the log, and manual listings he pasted).
2) After the problem began, he ran a deep scrub, which turned up clean; the issue was not filesystem corruption/lost writes on a single OSD.
3) The log ended cleanly (except for being shorter than it should have been); the last entry was the correct length and there was no extra data.
4) Fixing the header fixed the problem.

I did not gather enough data to disprove it having been degraded to a single copy, having the OSD holding the data lose the last write, and having it recover elsewhere to a different node. That seems less likely to me than some coding issue, though I have been quite unable to find one.

Looking at the code, it turns out that Journaler::flush will unconditionally call write_head(), which uses the current in-memory locations. We should instead be preparing a header (of the current state) and then sending that to disk once we get acks of our journal log writes! Not doing so leads to the MDS getting stuck at the end of replay, waiting to read entries that will never be written.
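
To illustrate the ordering problem between header writes and journal log writes, here is a toy Python model. It is not Ceph's Journaler code: `ToyStore`, `ToyJournal`, `defer_header`, and `shortfall_after_crash` are hypothetical names invented for this sketch, the 1000-byte first entry is arbitrary, and object boundaries are not modelled. Only the 611-byte figure comes from the report. The sketch assumes a crash happens after a flush but before the last log write is acked.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class ToyStore:
    """Stand-in for durable storage: acked log bytes plus the on-disk header."""
    log: bytearray = field(default_factory=bytearray)
    header_write_pos: int = 0


class ToyJournal:
    """Hypothetical journal; defer_header=True models the proposed ordering."""

    def __init__(self, store: ToyStore, defer_header: bool) -> None:
        self.store = store
        self.defer_header = defer_header
        self.write_pos = 0                         # in-memory end of the log
        self.safe_pos = 0                          # end of the acked log
        self.in_flight: List[Tuple[int, bytes]] = []
        self.pending_header: Optional[int] = None  # prepared, not yet written

    def append(self, data: bytes) -> None:
        # Queue a log write; it is not durable until ack_all() runs.
        self.in_flight.append((self.write_pos, data))
        self.write_pos += len(data)

    def flush(self) -> None:
        if self.defer_header:
            # Prepare a header snapshot, but do not send it to disk yet.
            self.pending_header = self.write_pos
        else:
            # Eager variant: the header records the in-memory position
            # even though the log bytes may not be durable yet.
            self.store.header_write_pos = self.write_pos

    def ack_all(self) -> None:
        # Simulate the store acknowledging every queued log write.
        for offset, data in self.in_flight:
            assert offset == len(self.store.log)
            self.store.log += data
        self.in_flight.clear()
        self.safe_pos = self.write_pos
        # Only now is it safe to persist a prepared header.
        if self.pending_header is not None and self.pending_header <= self.safe_pos:
            self.store.header_write_pos = self.pending_header
            self.pending_header = None


def shortfall_after_crash(defer_header: bool) -> int:
    """Return header position minus durable log length after a simulated crash."""
    store = ToyStore()
    journal = ToyJournal(store, defer_header)
    journal.append(b"x" * 1000)
    journal.flush()
    journal.ack_all()
    journal.append(b"y" * 611)
    journal.flush()
    # Crash here: the 611-byte write is never acked.
    return store.header_write_pos - len(store.log)
```

In this model, `shortfall_after_crash(False)` returns 611, so the header points past the end of the durable log, while `shortfall_after_crash(True)` returns 0 because the header is only persisted once the log writes it covers have been acked.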
