Bug #24193

In the case of a network partition, write and delete operations succeed but users receive a time_out error

Added by Shooter qu almost 6 years ago. Updated almost 6 years ago.

Status: Closed
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Community (user)
Regression: No
Severity: 2 - major

Description

We are testing various alternatives for object storage systems and their resilience to network failures. We noticed suspicious behavior while testing Ceph v13.0.2. We deployed a cluster of 3 OSDs and 1 monitor (each daemon running on a separate machine). For clients, we used the librados API.

We used the following configuration:

* osd pool default size = 3
* osd pool default min size = 3
* rados_osd_op_timeout = 15 (without this option, the librados client blocks indefinitely in the case of a network partition)
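
For reference, a minimal client-side sketch of this setup, assuming the Python rados bindings and a hypothetical pool named "testpool" (the two osd pool default size options live in the cluster's ceph.conf):

    import rados

    # Connect with a 15 s per-op timeout so a partitioned OSD surfaces
    # as rados.TimedOut instead of blocking the client forever.
    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf',
                          conf={'rados_osd_op_timeout': '15'})
    cluster.connect()
    ioctx = cluster.open_ioctx('testpool')  # hypothetical pool name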

The following is the sequence of events to reproduce the behavior (a client-side sketch follows the list):
1- Create a network partition in which one OSD cannot communicate with the other two OSDs but can still communicate with the monitor.
2- Send a write or delete operation, which will fail (time_out error) after 15 s (rados_osd_op_timeout).
3- Heal the network partition to restore all communication between the OSDs.
4- Send a read request. After a failed write, the read returns the data written by the failed write operation; after a failed delete, the client receives a message that the object is not found, which means the delete executed successfully.
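
Here is a minimal sketch of steps 2 and 4 from the client's point of view, again assuming the Python rados bindings and hypothetical "testpool"/"obj1" names; the partition itself (steps 1 and 3) is created and healed outside the client, e.g. with firewall rules on the OSD hosts:

    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf',
                          conf={'rados_osd_op_timeout': '15'})
    cluster.connect()
    ioctx = cluster.open_ioctx('testpool')

    # Step 2: with the partition in place, the write times out client-side.
    try:
        ioctx.write_full('obj1', b'new data')
    except rados.TimedOut:
        print('write timed out after 15 s')

    # ... heal the partition (step 3), then:

    # Step 4: the read succeeds and returns the data from the "failed" write.
    print(ioctx.read('obj1'))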

Our concern is whether this is expected behavior or not.

Actions #1

Updated by Greg Farnum almost 6 years ago

  • Status changed from New to Closed

This is expected behavior. The "rados osd op timeout" config causes the client to return an error and cancel the op locally if it doesn't get a response within that period, but nothing can cancel an op that's already been sent over the network. In your case, the operation has been delivered to an OSD immediately, but that OSD has to block on the rest of the cluster. It doesn't forget about the op, though!

(There are other circumstances in which it could drop the op, depending on your version and if the network partition leads to OSD map changes, but once a client sends an op over the wire then the client needs to be prepared for that op to have happened.)
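
In client terms, one hypothetical way to follow this guidance (a sketch, not an official librados recipe) is to treat rados.TimedOut as "outcome unknown" and verify the object state before retrying:

    import rados

    def write_with_verify(ioctx, name, payload):
        # A timed-out op may still have been applied on the OSD, so treat
        # TimedOut as "outcome unknown": check the object before retrying.
        try:
            ioctx.write_full(name, payload)
        except rados.TimedOut:
            try:
                if ioctx.read(name, len(payload)) == payload:
                    return  # the "failed" write actually landed
            except rados.ObjectNotFound:
                pass
            ioctx.write_full(name, payload)  # full-object writes retry safely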

Actions #2

Updated by Shooter qu almost 6 years ago

Thank you for the clarification. I am not an expert in Ceph's internal design, but I am worried about the write API semantics.
I just wanted to note that when a write operation fails, it typically leaves no side effects and does not change the data. That is not the case with this configuration of Ceph: when the timeout happens, the application does not know what state the object is in. It may or may not contain the last write.

This is confusing for the application: if a write operation fails, the application may retry it, leading to duplicate records in the object, or it may abandon the write, in which case future reads will return a value the application is not expecting (because the "failed" write actually happened).

It would help if Ceph had clearer and more consistent semantics for operation failures (an illustration of the duplicate-record risk follows below), and the documentation may need to be updated to reflect them.
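
As a concrete illustration of the duplicate-record concern (a hypothetical sketch using the Python rados bindings, not a pattern the thread prescribes):

    import rados

    def append_record(ioctx, name, record):
        # ioctx.append is not idempotent: if the first attempt timed out
        # but was actually applied, a blind retry writes the record twice.
        try:
            ioctx.append(name, record)
        except rados.TimedOut:
            ioctx.append(name, record)  # may duplicate the record!

A safer approach is the verify-before-retry sketch above, or using idempotent full-object writes so a retry cannot duplicate data.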

Thank you

Actions #3

Updated by Greg Farnum almost 6 years ago

We'd happily accept PRs about this, but AFAIK the osd op timeout is an undocumented API and is not intended for use with any of the standard interfaces. Certain users had custom applications which are happy with the given semantics.

(This is not a problem particular to Ceph, btw — any networked storage system which offers an error-on-timeout is going to have the same behavior.)

Actions #4

Updated by Shooter qu almost 6 years ago

Thank you for the prompt response and for the clarification.
