Bug #24193: In case of a network partition, write and delete operations succeed even though users receive a time_out error

Added by Shooter qu almost 6 years ago. Updated almost 6 years ago.

Status: Closed
Priority: Normal
Assignee:
Category:
Target version:
% Done:
Source: Community (user)
Tags:
Backport:
Regression:
Severity: 2 - major
Reviewed:
Affected Versions: v13.0.0
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We are testing various object storage systems and their resilience to network failures. We noticed suspicious behavior while working with Ceph v13.0.2. We deployed a cluster of 3 OSDs and 1 monitor, with each daemon running on a separate machine. For clients, we used the librados API.

We used the following configuration:

* osd pool default size = 3
* osd pool default min size = 3
* rados_osd_op_timeout = 15 (without this setting, the librados client blocks indefinitely during a network partition)
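
For reference, a minimal sketch of how the client-side timeout could be applied from the Python librados binding. The helper name `apply_client_options` and the `CLIENT_OPTIONS` dict are hypothetical, introduced only for illustration; the sketch assumes the cluster handle exposes `conf_set(option, value)` as `rados.Rados` does, and that the two pool defaults are set on the cluster side rather than by the client.

```python
# Client-side option from the report; values are passed as strings.
CLIENT_OPTIONS = {"rados_osd_op_timeout": "15"}


def apply_client_options(cluster, options=CLIENT_OPTIONS):
    """Apply each option to a librados-style cluster handle.

    `cluster` is assumed to provide conf_set(name, value), as the
    python-rados `Rados` object does. Call this before connect().
    """
    for name, value in options.items():
        cluster.conf_set(name, str(value))
    return cluster


# Intended use with the real binding (assumed paths, not run here):
#   import rados
#   cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
#   apply_client_options(cluster)
#   cluster.connect()
```

Without the timeout the client call would simply never return during the partition, so the option is what turns the hang into a visible error.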

The following sequence of events reproduces the behavior:
1- Create a network partition in which one OSD cannot communicate with the other two OSDs but can still communicate with the monitor.
2- Send a write or delete operation, which fails with a time_out error after 15 s (rados_osd_op_timeout).
3- Heal the network partition to restore communication among all OSDs.
4- Send a read request. After a write, the read returns the data written by the failed write operation. After a delete, the client is told that the object is not found, which means the delete operation was executed successfully.
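
Steps 2 and 4 for the write case can be sketched on the client side roughly as follows. This is an illustrative sketch, not the code used in the test: `ProbeResult`, `attempt_write` and `verify_after_heal` are hypothetical names, and `ioctx` is assumed to behave like a python-rados I/O context with `write_full(key, data)` and `read(key, length)`. Creating and healing the partition (steps 1 and 3) happens outside this code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProbeResult:
    write_error: Optional[str]  # exception name, or None if the write returned
    applied: Optional[bool]     # None until verify_after_heal() has run


def attempt_write(ioctx, key: str, payload: bytes) -> ProbeResult:
    """Step 2: issue the write while the partition is in place."""
    try:
        ioctx.write_full(key, payload)
    except Exception as exc:
        # With python-rados a timeout is expected to surface as
        # rados.TimedOut; any exception is recorded here (assumption).
        return ProbeResult(write_error=type(exc).__name__, applied=None)
    return ProbeResult(write_error=None, applied=None)


def verify_after_heal(ioctx, key: str, payload: bytes,
                      result: ProbeResult) -> ProbeResult:
    """Step 4: after the partition is healed, read the object back."""
    try:
        stored = ioctx.read(key, len(payload))
    except Exception:
        # e.g. rados.ObjectNotFound; other read errors are not
        # distinguished in this sketch.
        stored = None
    result.applied = (stored == payload)
    return result
```

A result with `write_error` set and `applied` equal to `True` corresponds to the situation described above: the client saw an error, yet the data is readable afterwards.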

Our concern is whether this is expected behavior.

Updated by Greg Farnum almost 6 years ago

Updated by Shooter qu almost 6 years ago

Updated by Greg Farnum almost 6 years ago

Updated by Shooter qu almost 6 years ago