Bug #52814

open

RGW services stop responding when datacenter down

Added by Pablo Higueras over 2 years ago. Updated over 2 years ago.

Status:
New
Priority:
Normal
Assignee:
Or Friedmann
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
rgw
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We planned a shutdown of half of our cluster hosts to move them to a new datacenter, and right after that all the RGW services stopped responding, so we suffered a service outage since production depends on Ceph. We actually did this first in a test environment, so it wasn't catastrophic, but we still have to move the rest of our datacenter, and almost every host there is real production.

Here is some status output so you can see the state of the cluster:

$ ceph -s
  cluster:
    id:     b410bef1-f09c-521c-b3e9-0198cb041e6c
    health: HEALTH_WARN
            2 hosts fail cephadm check
            8 large omap objects
            1/3 mons down, quorum cephm00a,cephm00c
            noout,norecover flag(s) set
            2 datacenters (38 osds) down
            38 osds down
            4 hosts (38 osds) down
            Reduced data availability: 88 pgs inactive
            Degraded data redundancy: 315300237/646890369 objects degraded (48.741%), 2132 pgs degraded, 2217 pgs undersized
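
For reference, the noout and norecover flags shown above were set by us before shutting the hosts down, roughly along these lines (reconstructed from the status output; the exact sequence may have differed):

# Prevent down OSDs from being marked out and avoid recovery traffic during the move
ceph osd set noout
ceph osd set norecover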

Regarding RGW, I don't know what else to share since it suddenly stopped responding. Requests keep coming in, but none of them get a response:

2021-10-05T04:10:16.863+0000 7f6ab45f4700  1 ====== starting new request req=0x7f6a2ab75620 =====
2021-10-05T04:10:29.833+0000 7f6a5d546700  1 ====== starting new request req=0x7f6a2aaf4620 =====
2021-10-05T04:10:42.775+0000 7f6af1e6f700  1 ====== starting new request req=0x7f6a2aa73620 =====
2021-10-05T04:10:55.820+0000 7f6a53d33700  1 ====== starting new request req=0x7f6a2a9f2620 =====
2021-10-05T04:11:08.782+0000 7f6adfe4b700  1 ====== starting new request req=0x7f6a2a971620 =====
2021-10-05T04:11:21.790+0000 7f6acbe23700  1 ====== starting new request req=0x7f6a2a8f0620 =====
2021-10-05T04:11:34.822+0000 7f6a7b582700  1 ====== starting new request req=0x7f6a2a86f620 =====

Also "radosgw-admin" command doesn't respond, I guess it's normal since neither of the daemons are responding.

Actions #1

Updated by Pablo Higueras over 2 years ago

We found a "solution" to reactivate the RGW service. The only workaround was to remove every host that was down, and the OSDs on it, from the CRUSH map.

Here is the piece of code that "solved" the issue for us:

# List the hosts reported down by the OSD_HOST_DOWN health check
for host in $(ceph health detail -f json | jq -r '.checks.OSD_HOST_DOWN.detail[].message' | awk '{print $2}'); do
    # Remove every OSD belonging to that host from the CRUSH map
    for osd in $(ceph osd metadata | jq -r '.[] | select(.hostname=="'$host'") | .id'); do
        ceph osd crush rm "osd.$osd"
    done
    # Finally remove the (now empty) host bucket itself
    ceph osd crush rm "$host"
done

Once the hosts are back up we can restore the CRUSH map to its original state.
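
One way to bring the CRUSH map back is to dump it before removing anything and re-import it afterwards, for example (a sketch using the standard getcrushmap/setcrushmap commands; the file name is arbitrary):

# Before removing anything: save the current CRUSH map as a binary backup
ceph osd getcrushmap -o crushmap.backup
# After the hosts are back up: re-import the saved map
ceph osd setcrushmap -i crushmap.backup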

Anyway, I still reckon that this situation is a bug that should not occur in a clustered environment.

Actions #2

Updated by Sebastian Wagner over 2 years ago

  • Project changed from Ceph to rgw

Actions #3

Updated by Casey Bodley over 2 years ago

  • Assignee set to Or Friedmann

Hi Or, this case looks kind of similar to the fast-fail work you did previously. Can you tell whether that change would apply here?
