Bug #24193

closed

In the case of a network partition, write and delete operations succeed even though users receive a time_out error

Added by Shooter qu almost 6 years ago. Updated almost 6 years ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We are testing various object storage systems and their resilience to network failures. We noticed suspicious behavior while working with Ceph v13.0.2. We deployed a cluster of 3 OSDs and 1 monitor (each daemon running on a separate machine). For clients, we used the librados API.

We used the following configuration:

* osd pool default size = 3
* osd pool default min size = 3
* rados_osd_op_timeout = 15 (without this setting, the librados client blocks indefinitely during a network partition)

The following sequence of events reproduces the behavior:
1- Create a network partition in which one OSD cannot communicate with the other two OSDs but can still communicate with the monitor.
2- Send a write or delete operation, which fails with a time_out error after 15 s (rados_osd_op_timeout).
3- Heal the network partition to restore all communication between the OSDs.
4- Send a read request. After a write operation, the read returns the data written by the write that reported failure. After a delete operation, the client receives an object-not-found error, which means that the delete was executed successfully.
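Steps 2 and 4 could be scripted roughly as follows with the Python librados bindings. This is only a sketch: the report does not say which librados binding was used, and the names `write_then_verify`, `run_against_cluster`, the `wait_for_heal` callback, the object key, and the payload are all illustrative assumptions. Creating and healing the partition (steps 1 and 3) happens outside the script.

```python
def write_then_verify(ioctx, key, payload, wait_for_heal):
    """Issue a write that may time out, then read the object back.

    `ioctx` is expected to behave like rados.Ioctx (write_full / read).
    `wait_for_heal` is a hypothetical callback that blocks until the
    network partition has been healed (step 3).
    Returns a (write_outcome, read_outcome) pair of strings.
    """
    # Step 2: the write is expected to fail with a timeout on the client.
    try:
        ioctx.write_full(key, payload)
        write_outcome = "ok"
    except Exception as exc:  # broad on purpose: we only record the error type
        write_outcome = type(exc).__name__

    # Step 3 is performed externally; wait until it is done.
    wait_for_heal()

    # Step 4: check whether the "failed" write is visible after all.
    try:
        data = ioctx.read(key)
        read_outcome = "present" if data == payload else "different"
    except Exception as exc:
        read_outcome = type(exc).__name__
    return write_outcome, read_outcome


def run_against_cluster(conffile, pool, key="partition-test"):
    """Run the check against a real cluster (needs the Ceph Python bindings)."""
    import rados  # imported lazily so the helper above works without it

    cluster = rados.Rados(conffile=conffile)
    # Same client-side timeout as in the report; must be set before connect().
    cluster.conf_set("rados_osd_op_timeout", "15")
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx(pool)
        try:
            return write_then_verify(
                ioctx, key, b"payload",
                lambda: input("Heal the partition, then press Enter: "),
            )
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()
```

If the behavior described above is reproduced, the write outcome would be a timeout error type while the read outcome is "present". The delete case could be checked the same way by calling `ioctx.remove_object(key)` instead of `write_full` and expecting an object-not-found error on the read.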

Our concern is whether this behavior is expected.
