osd: mark pg and use replica on EIO from client read
Copying the email thread below; opening this issue to track the enhancement.
Date: Wed, 29 Oct 2014 08:11:01 -0700
From: firstname.lastname@example.org
To: email@example.com
CC: firstname.lastname@example.org
Subject: Re: OSD crashed due to filestore EIO

On Wed, 29 Oct 2014, GuangYang wrote:
> Recently we observed an OSD crash due to file corruption in the filesystem,
> which leads to an assertion failure at FileStore::read as EIO is not
> tolerated. As file corruption is normal in large deployments, I am
> wondering whether that behavior is too aggressive, especially for EC pools.
>
> After searching, I found a flag that might help: filestore_fail_eio,
> which can make the OSD survive an EIO failure; it is true by default,
> though. I haven't tested it yet.

That will remove the immediate assert. Currently, for an object being read
by a client, it will just pass EIO back to the client, though, which is
clearly not what we want.

> Does it make sense to adjust the behavior a little bit: if the filestore
> read fails due to file corruption, return the failure and at the same
> time mark the PG as inconsistent. Due to the redundancy (replication
> or EC), the request can still be served, and at the same time we can
> get an alert saying there is an inconsistency and manually trigger a PG
> repair?

That would be ideal, yeah. I think that initially it makes sense to do
*just that read* via a replica but let the admin trigger the repair. This
most closely mirrors what scrub currently does on EIO (mark inconsistent
but let the admin repair). Later, when we support automatic repair, that
option can affect both scrub- and client-triggered EIOs.

We just need to be careful that any EIO on *metadata* still triggers a
failure, as we need to be especially careful about handling that. IIRC
there is a flag passed to read indicating whether EIO is okay; we should
probably use that so that EIO-ok vs EIO-not-ok cases are still clearly
annotated.
#3 Updated by Wei Luo over 7 years ago
- Status changed from New to In Progress
Currently the OSD checks the PG map, selects only k shards, and sends sub-read requests for those. If one of those reads fails, it asserts and dumps core.
I see two possible solutions:
1. If a sub-read fails, read the remaining m shards and use ECUtil to reconstruct the correct data.
   The logic is the same as when all sub-reads succeed, at the cost of extra latency whenever a sub-read fails.
   There are two ways to handle the error:
   a. Read the previously unselected shards, merge them with the data this request already received, and decode with ECUtil.
      This is quicker but harder to code, and we have to handle a second read failure as well.
   b. Rebuild the sub-read request, excluding the failed OSD. This may be slower but is cleaner to code.
2. The OSD sends k+m sub-read requests up front and filters out errored results when they return. If at least k results are healthy, it decodes them with ECUtil.
   Since each client request triggers more sub-reads, the total read load on the cluster increases.
   Latency is lower when some sub-reads fail, and I see in the design doc that EC can take the fastest k responses and return to the client, which also improves read latency.
   This solution does not change the current logic, and the fix is simpler.
Furthermore, we could add a parameter r (0 < r <= m) and send k+r sub-reads, to balance disk load against the error rate.