Support #38526


Ceph cluster unavailable if a server with OSDs goes down even if replication is set on host...

Added by Bogdan Adrian Velica about 5 years ago. Updated about 5 years ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Category:
OSD
Target version:
-
% Done:
0%

Tags:
Reviewed:
Affected Versions:
Pull request ID:

Description

Hello,

I have the following setup:
- 5 servers with 22 HDD disks (as OSDs);
- 5 monitor daemons running on the same servers;
- 3 MDS daemons (1 active and 2 standby);
- Linux OS on all servers: Ubuntu 16.04.5 with kernel 4.15.0-43-generic;
- Ceph cluster Mimic 13.2.4.
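
The layout described above can be cross-checked from any admin node; a minimal sketch using standard ceph CLI commands (no ticket-specific assumptions):

# list hosts and the OSDs placed under each host bucket
ceph osd tree

# confirm all daemons are running Mimic 13.2.4
ceph versions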

The CRUSH rule dump is:

ceph osd crush rule dump
[
    {
        "rule_id": 0,
        "rule_name": "replicated_rule",
        "ruleset": 0,
        "type": 1,
        "min_size": 1,
        "max_size": 10,
        "steps": [
            {
                "op": "take",
                "item": -1,
                "item_name": "default" 
            },
            {
                "op": "chooseleaf_firstn",
                "num": 0,
                "type": "host" 
            },
            {
                "op": "emit" 
            }
        ]
    }
]
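
This rule picks each replica from a different host bucket (chooseleaf_firstn with type "host"), so a pool of size 2 using it should keep its two copies on two different servers. A minimal sketch for verifying that the RBD pool actually uses this rule and where an object's copies land (the pool name rbd and the object name are assumptions, adjust to the real names):

# show which CRUSH rule the pool is assigned to
ceph osd pool get rbd crush_rule

# show the OSDs (and therefore hosts) an example object maps to
ceph osd map rbd some-object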

My ceph.conf file looks like this on all ceph servers:

### Ceph SATA Cluster ###

[global]
fsid = 98295168-4f61-4660-82bf-94383ef747cf

public_network = 172.24.152.0/21
cluster_network = 172.24.160.0/24

mon_initial_members = ceph-hyperstore101-sata
mon_host = 172.24.152.33, 172.24.152.34, 172.24.152.35, 172.24.152.36, 172.24.152.37

auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx

### End config file ###
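
Since five monitors are listed in mon_host, monitor quorum should survive one server going down (it needs 3 of 5). A quick, hedged way to check the monitor side while a server is powered off:

# list monitors and the current quorum
ceph mon stat

# more detail, including which monitors are out of quorum
ceph quorum_status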

and my conf file looks like this on the clients (I am using RBD):

[global]
fsid = 98295168-4f61-4660-82bf-94383ef747cf

public_network = 172.24.152.0/21
cluster_network = 172.24.160.0/24
mon_initial_members = ceph-hyperstore101-sata
mon_host = 172.24.152.33
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
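
One side note, not confirmed as the cause here: the client configuration lists only a single monitor address in mon_host, so if 172.24.152.33 happens to be the server that gets powered off, the client has no other monitor to fall back to regardless of data placement. A minimal sketch for testing a client against another monitor explicitly:

# query cluster status through a different monitor
ceph -m 172.24.152.34 -s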

Now, if I power off one server, the entire cluster becomes unavailable and all the clients using RBD (I am only using RBD at the moment) can no longer read data from their mapped RBD disks. The replication size of the RBD pool is 2 (the data plus one copy).

I have checked that the data on OSD.1, for example, is replicated to OSD.44 on a different server, so that part is OK.

Am I missing something?
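
When the cluster freezes like this, the pool's replication settings and the PG states usually tell the story; a minimal diagnostic sketch (standard commands, run while one server is down):

# replication size and min_size of every pool
ceph osd pool ls detail

# look for PGs reported as undersized, degraded or inactive
ceph health detail
ceph pg stat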

Actions #1

Updated by Greg Farnum about 5 years ago

  • Tracker changed from Bug to Support

This is a better question for the ceph-user mailing list than the tracker. :)

But it is probably the min_size, which you will want to better understand.
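
For anyone finding this later: if min_size equals size (2 here), PGs become inactive and client I/O blocks as soon as one replica is lost. A hedged sketch of checking and, if acceptable for the workload, lowering it (the pool name rbd is an assumption; min_size 1 with size 2 trades durability during recovery for availability):

ceph osd pool get rbd min_size
ceph osd pool set rbd min_size 1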

Actions #2

Updated by Bogdan Adrian Velica about 5 years ago

Greg Farnum wrote:

This is a better question for the ceph-user mailing list than the tracker. :)

But it is probably the min_size, which you will want to better understand.

Thank you for the hint. Now I understand.

Actions #3

Updated by Bogdan Adrian Velica about 5 years ago

Please disregard this; it is not a bug, just a misconfiguration problem.

Actions #4

Updated by Greg Farnum about 5 years ago

  • Status changed from New to Closed