Bug #52814

open

RGW services stop responding when datacenter down

Added by Pablo Higueras over 2 years ago. Updated over 2 years ago.

Status: New
Priority: Normal
Assignee: Or Friedmann
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite: rgw
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We planned a shutdown of half of our cluster hosts in order to move them to a new datacenter. Right after the shutdown, all the RGW services stopped responding, which means a service outage for us, since production depends on Ceph. We did this first in a test environment, so it wasn't catastrophic, but we still have to move the rest of our datacenter, and almost every host there is real production.

Here is the cluster status at the time:

$ ceph -s
  cluster:
    id:     b410bef1-f09c-521c-b3e9-0198cb041e6c
    health: HEALTH_WARN
            2 hosts fail cephadm check
            8 large omap objects
            1/3 mons down, quorum cephm00a,cephm00c
            noout,norecover flag(s) set
            2 datacenters (38 osds) down
            38 osds down
            4 hosts (38 osds) down
            Reduced data availability: 88 pgs inactive
            Degraded data redundancy: 315300237/646890369 objects degraded (48.741%), 2132 pgs degraded, 2217 pgs undersized

As for RGW, I don't know what to share, since it suddenly stopped responding. Requests keep arriving, but none of them get a response:

2021-10-05T04:10:16.863+0000 7f6ab45f4700  1 ====== starting new request req=0x7f6a2ab75620 =====
2021-10-05T04:10:29.833+0000 7f6a5d546700  1 ====== starting new request req=0x7f6a2aaf4620 =====
2021-10-05T04:10:42.775+0000 7f6af1e6f700  1 ====== starting new request req=0x7f6a2aa73620 =====
2021-10-05T04:10:55.820+0000 7f6a53d33700  1 ====== starting new request req=0x7f6a2a9f2620 =====
2021-10-05T04:11:08.782+0000 7f6adfe4b700  1 ====== starting new request req=0x7f6a2a971620 =====
2021-10-05T04:11:21.790+0000 7f6acbe23700  1 ====== starting new request req=0x7f6a2a8f0620 =====
2021-10-05T04:11:34.822+0000 7f6a7b582700  1 ====== starting new request req=0x7f6a2a86f620 =====
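
To check a full RGW log for the pattern above (requests that start but never complete), a small script along these lines might help. This is only a sketch: the "starting new request" pattern is taken from the lines above, but the "req done" pattern for completed requests is an assumption about the log format and may need adjusting, and the function name is made up for illustration.

```python
import re

# Matches the "starting new request" lines shown in the log excerpt above.
START_RE = re.compile(r"====== starting new request req=(0x[0-9a-f]+) =====")
# ASSUMPTION: a completed request is logged with "====== req done req=0x...".
# Adjust this pattern if your RGW version logs completion differently.
DONE_RE = re.compile(r"====== req done req=(0x[0-9a-f]+)")


def unanswered_requests(lines):
    """Map each req id that started but never completed to its start timestamp."""
    pending = {}
    for line in lines:
        started = START_RE.search(line)
        if started:
            # The first whitespace-separated field is the timestamp.
            pending[started.group(1)] = line.split()[0]
            continue
        done = DONE_RE.search(line)
        if done:
            # Request finished, so it is no longer pending.
            pending.pop(done.group(1), None)
    return pending


if __name__ == "__main__":
    # Two lines copied from the log excerpt above.
    sample = [
        "2021-10-05T04:10:16.863+0000 7f6ab45f4700  1 ====== starting new request req=0x7f6a2ab75620 =====",
        "2021-10-05T04:10:29.833+0000 7f6a5d546700  1 ====== starting new request req=0x7f6a2aaf4620 =====",
    ]
    for req, started_at in unanswered_requests(sample).items():
        print(f"{req} started at {started_at} and never completed")
```

In practice the `lines` argument would be the opened log file; every request it reports is one that entered RGW and never got a logged completion.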

The "radosgw-admin" command doesn't respond either; I guess that is expected, since none of the daemons are responding.
