Feature #9943

osd: mark pg and use replica on EIO from client read

Added by Guang Yang over 7 years ago. Updated over 7 years ago.

Status: In Progress
Assignee: Wei Luo
Target version:
% Done:
Source: Community (dev)
Affected Versions:
Pull request ID:


Copy the below email thread and open an issue to track the enhancement.

Date: Wed, 29 Oct 2014 08:11:01 -0700
Subject: Re: OSD crashed due to filestore EIO

On Wed, 29 Oct 2014, GuangYang wrote:
> Recently we observed an OSD crash due to file corruption in filesystem,
> which leads to an assertion failure at FileStore::read as EIO is not
> tolerated. As file corruption is normal in large deployment, I am
> thinking if that behavior is too aggressive, especially for EC pool.
> After searching, I found this flag might help : filestore_fail_eio,
> which can make the OSD survive an EIO failure, it is true by default
> though. I haven't tested it yet.

 That will remove the immediate assert. Currently, for an object being read
 by a client, it will just pass EIO back to the client, though, which is
 clearly not what we want.

> Does it make sense to adjust the behavior a little bit, if the filestore
> read fail due to file corruption, return back the failure and at the
> same time mark the PG as inconsistent, due the redundancy (replication
> or EC), the request can still be served, and at the same time, we can
> get alert saying there is inconsistency and manually trigger a PG
> repair?
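The flow in the quoted proposal (return from a surviving copy, flag the PG as inconsistent, leave repair to the admin) can be sketched in a few lines. This is a toy model of a replicated pool, not the OSD code: `PG`, `client_read`, the dict-backed shards, and the `EIO` sentinel are all hypothetical stand-ins.

```python
# Toy model of the proposed read path for a replicated pool. Everything
# here (PG, client_read, dict-backed shards, the EIO sentinel) is a
# hypothetical stand-in, not the actual OSD code.
import errno

EIO = errno.EIO  # stands in for a local filestore read returning EIO

class PG:
    """shards[0] is the primary; the rest are replicas."""
    def __init__(self, shards):
        self.shards = shards        # list of dicts: object name -> bytes
        self.inconsistent = False   # set on any EIO so the admin gets an alert

    def client_read(self, name):
        # Try copies in order. On EIO, mark the PG inconsistent (leaving
        # repair to the admin, as scrub does today) and serve the read from
        # the next replica instead of passing EIO back to the client.
        for shard in self.shards:
            data = shard.get(name, EIO)
            if data is EIO:
                self.inconsistent = True
                continue
            return data
        raise IOError(errno.EIO, "no intact copy of %r" % name)
```

In this sketch the client still gets its data after a primary EIO, and `inconsistent` is the hook where a real implementation would raise the scrub-style alert.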

 That would be ideal, yeah. I think that initially it makes sense to do
 *just that read* via a replica but let the admin trigger the repair.
 This most closely mirrors what scrub currently does on EIO (mark
 inconsistent but let admin repair). Later, when we support automatic
 repair, that option can affect both scrub and client-triggered EIOs?

 We just need to be careful that any EIO on *metadata* still triggers a
 failure as we need to be especially careful about handling that. IIRC
 there is a flag passed to read indicating whether EIO is okay; we should
 probably use that so that EIO-ok vs EIO-notok cases are still clearly
 distinguished.
Related issues

Related to Ceph - Bug #8588: In the erasure-coded pool, primary OSD will crash at decoding if any data chunk's size is changed Duplicate 06/11/2014


#1 Updated by Sage Weil over 7 years ago

  • Subject changed from OSD crashed due to filestore EIO to osd: mark pg and use replica on EIO from client read

#2 Updated by Guang Yang over 7 years ago

  • Assignee set to Wei Luo

Wei will work on this one.

#3 Updated by Wei Luo over 7 years ago

  • Status changed from New to In Progress

Currently the OSD consults the PG map, selects only k shards, and sends sub-read requests to just those, so if any one sub-read fails it asserts and core dumps.
I see two possible solutions:
1. If a sub-read fails, read the remaining m shards and use ECUtil to reconstruct the correct data. The logic is the same as when all sub-reads succeed, but latency goes up when a sub-read fails. There are two ways to handle the error:
   a. Read the previously unselected shards, merge them with the data this request already fetched, and use ECUtil to decode the correct data to return. This is faster but harder to code, and we have to handle read failures a second time.
   b. Rebuild the sub-read request without the failed OSD and reissue it. This may be slower but is cleaner to code.
2. The OSD sends all k+m sub-read requests up front and filters out error results as they return; once at least k results are good, it decodes them with ECUtil. Each client request fans out into more sub-reads, so total read load on the system increases, but read latency is lower when some sub-reads fail. I also saw in the design doc that EC can return to the client as soon as the fastest k responses arrive, which would improve read latency. This solution does not change the current logic, so the fix is simpler.
Furthermore, we can add a parameter r, with 0 < r <= m, and issue k+r sub-reads to balance disk load against error tolerance.
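Solution 2 above (fan out all k+m sub-reads, filter out the errors, decode from any k good results) can be sketched with a toy systematic code: k data chunks plus one XOR parity chunk (m = 1), standing in for the real erasure-code plugin behind ECUtil. `encode`, `fanout_read`, and the equal-length-chunk assumption are illustrative, not the OSD implementation.

```python
# Toy single-parity erasure code (k data chunks + 1 XOR parity, m = 1)
# illustrating the fan-out read of solution 2; not the real EC plugin.
from functools import reduce

def encode(chunks):
    # parity is the byte-wise XOR of the k (equal-length) data chunks,
    # so any k of the k+1 shards suffice to decode
    parity = bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*chunks))
    return list(chunks) + [parity]

def fanout_read(shards, k):
    # Issue all k+m sub-reads, drop the EIO results (modelled as None),
    # and decode as soon as k good shards are in hand.
    good = {i: d for i, d in enumerate(shards) if d is not None}
    if len(good) < k:
        raise IOError("fewer than k shards readable; cannot decode")
    if all(i in good for i in range(k)):
        return b"".join(good[i] for i in range(k))
    # exactly one data shard lost (m = 1): XOR of the survivors,
    # parity included, rebuilds it
    missing = next(i for i in range(k) if i not in good)
    survivors = list(good.values())
    good[missing] = bytes(reduce(lambda a, b: a ^ b, col)
                          for col in zip(*survivors))
    return b"".join(good[i] for i in range(k))
```

The proposed parameter r would replace the full fan-out with k+r sub-reads; the same filter-and-decode step then applies to whichever k of those succeed.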

#4 Updated by Wei Luo over 7 years ago
