Bug #52814

open

RGW services stop responding when datacenter down

Added by Pablo Higueras over 2 years ago. Updated over 2 years ago.

Status:
New
Priority:
Normal
Assignee:
Or Friedmann
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
rgw
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We planned a shutdown of half of our cluster hosts to move them to a new datacenter, and right after that all the RGW services stopped responding, so we suffered a service outage since production depends on Ceph. We actually did this first in a test environment, so it wasn't catastrophic, but we still have to move the rest of our datacenter, and almost every host there is real production.

Here is some status output so you can see the state of the cluster:

$ ceph -s
  cluster:
    id:     b410bef1-f09c-521c-b3e9-0198cb041e6c
    health: HEALTH_WARN
            2 hosts fail cephadm check
            8 large omap objects
            1/3 mons down, quorum cephm00a,cephm00c
            noout,norecover flag(s) set
            2 datacenters (38 osds) down
            38 osds down
            4 hosts (38 osds) down
            Reduced data availability: 88 pgs inactive
            Degraded data redundancy: 315300237/646890369 objects degraded (48.741%), 2132 pgs degraded, 2217 pgs undersized
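
For reference, the noout and norecover flags shown above were set by us before shutting the hosts down, roughly along these lines (reconstructed from the status output; the exact sequence may have differed):

# Prevent down OSDs from being marked out and avoid recovery traffic during the move
ceph osd set noout
ceph osd set norecover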

Regarding RGW, I don't know what else to share since it suddenly stopped responding. Requests keep coming in, but none of them get a response:

2021-10-05T04:10:16.863+0000 7f6ab45f4700  1 ====== starting new request req=0x7f6a2ab75620 =====
2021-10-05T04:10:29.833+0000 7f6a5d546700  1 ====== starting new request req=0x7f6a2aaf4620 =====
2021-10-05T04:10:42.775+0000 7f6af1e6f700  1 ====== starting new request req=0x7f6a2aa73620 =====
2021-10-05T04:10:55.820+0000 7f6a53d33700  1 ====== starting new request req=0x7f6a2a9f2620 =====
2021-10-05T04:11:08.782+0000 7f6adfe4b700  1 ====== starting new request req=0x7f6a2a971620 =====
2021-10-05T04:11:21.790+0000 7f6acbe23700  1 ====== starting new request req=0x7f6a2a8f0620 =====
2021-10-05T04:11:34.822+0000 7f6a7b582700  1 ====== starting new request req=0x7f6a2a86f620 =====

Also "radosgw-admin" command doesn't respond, I guess it's normal since neither of the daemons are responding.

Actions #1

Updated by Pablo Higueras over 2 years ago

We found a "solution" to reactivate the RGW service. The only workaround was to remove every host that was down, and the OSDs on it, from the CRUSH map.

Here is the piece of code that "solved" the issue for us:

# List the hosts reported down by the OSD_HOST_DOWN health check
for host in $(ceph health detail -f json | jq -r '.checks.OSD_HOST_DOWN.detail[].message' | awk '{print $2}'); do
    # Remove every OSD belonging to that host from the CRUSH map
    for osd in $(ceph osd metadata | jq -r '.[] | select(.hostname=="'$host'") | .id'); do
        ceph osd crush rm "osd.$osd"
    done
    # Finally remove the (now empty) host bucket itself
    ceph osd crush rm "$host"
done

Once the hosts are back up we can restore the CRUSH map to its original state.
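
One way to bring the CRUSH map back is to dump it before removing anything and re-import it afterwards, for example (a sketch using the standard getcrushmap/setcrushmap commands; the file name is arbitrary):

# Before removing anything: save the current CRUSH map as a binary backup
ceph osd getcrushmap -o crushmap.backup
# After the hosts are back up: re-import the saved map
ceph osd setcrushmap -i crushmap.backup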

Anyway, I still reckon that this situation is a bug that should not occur in a clustered environment.

Actions #2

Updated by Sebastian Wagner over 2 years ago

  • Project changed from Ceph to rgw

Actions #3

Updated by Casey Bodley over 2 years ago

  • Assignee set to Or Friedmann

Hi Or, this case looks kind of similar to the fast-fail work you did previously. Can you tell whether that change would apply here?
