Bug #39175 (closed): RGW DELETE calls partially missed shortly after OSD startup

Added by Bryan Stillwell about 5 years ago. Updated almost 5 years ago.

Status: Resolved
Priority: Urgent
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We have two separate clusters (physically 2,000+ miles apart) that are seeing
PGs go inconsistent while we reboot hosts to apply the latest security
patches. Digging into the inconsistent PGs points to DELETE calls, coming into
the RGWs shortly after one of the OSDs restarts, being missed by one of the
three OSDs in the PG.
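For context, the mismatches surface through the normal deep-scrub reporting. A minimal sketch of how to spot them with the standard Luminous tooling (pool name as used on these clusters):

# Inconsistent PGs show up in cluster health once a deep scrub notices the mismatch:
ceph health detail | grep inconsistent

# List the inconsistent PGs for the bucket index pool:
rados list-inconsistent-pg .rgw.buckets.index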

First cluster:
Version: 12.2.8 (Luminous)
OSDs: 1,964 (1,720 HDDs, 244 SSDs)
OSD back-end: FileStore

Second cluster:
Version: 12.2.8 (Luminous)
OSDs: 570 (475 HDDs, 95 SSDs)
OSD back-end: FileStore

This was determined by doing a dump of the omap values on each of the OSDs that
make up the PG and comparing the results:

rados -p .rgw.buckets.index listomapvals [offending object mentioned in logs]
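Note that rados reads go through the acting primary, so the command above only shows one copy at a time. A hedged sketch of one way to read each replica's omap directly (not necessarily the exact procedure used here; the OSD id, paths, and object spec are illustrative, using osd.1740 / PG 7.146a from the first timeline below):

# With the OSD stopped so the FileStore can be opened directly:
systemctl stop ceph-osd@1740

# Locate the bucket index object inside the PG (bucket index objects are named .dir.<bucket-id>):
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1740 \
    --journal-path /var/lib/ceph/osd/ceph-1740/journal \
    --pgid 7.146a --op list | grep '\.dir\.'

# Dump its omap keys and hash them so the three replicas can be compared with a quick md5sum:
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1740 \
    --journal-path /var/lib/ceph/osd/ceph-1740/journal \
    --pgid 7.146a '<object-json-from-the-list-above>' list-omap | md5sum

systemctl start ceph-osd@1740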

Here are the timelines for 5 different PGs:

Cluster 1 - PG 7.146a [1740,1802,1814]:
2019-04-05 18:08:17 - osd.1740 marked itself down
2019-04-05 18:18:27 - osd.1740 boot
2019-04-05 18:19:12 - DELETE call seen on RGW for object whose omap values exist on osd.1740, but not osd.1802 or osd.1814
2019-04-10 01:27:02 - omap_digest mismatch detected

Cluster 1 - PG 7.1a62 [1840,1786,1736]:
2019-04-05 16:40:28 - osd.1736 marked itself down
2019-04-05 16:49:58 - osd.1736 boot
2019-04-05 16:50:47 - DELETE call seen on RGW for object whose omap values exist on osd.1736, but not osd.1840 or osd.1786
2019-04-10 11:31:07 - omap_digest mismatch detected

Cluster 2 - PG 7.3 [504,525,556]:
2019-04-05 09:08:16 - osd.504 marked itself down
2019-04-05 09:12:03 - osd.504 boot
2019-04-05 09:13:09 - DELETE call seen on RGW for object whose omap values exist on osd.504, but not osd.525 or osd.556
2019-04-08 11:46:15 - omap_digest mismatch detected

Cluster 2 - PG 7.9 [492,546,523]:
2019-04-04 14:37:34 - osd.492 marked itself down
2019-04-04 14:40:35 - osd.492 boot
2019-04-04 14:41:55 - DELETE call seen on RGW for object whose omap values exist on osd.492, but not osd.546 or osd.523
2019-04-08 12:06:14 - omap_digest mismatch detected

Cluster 2 - PG 7.2b [488,511,541]:
2019-04-03 13:54:17 - osd.488 marked itself down
2019-04-03 13:59:27 - osd.488 boot
2019-04-03 14:00:54 - DELETE call seen on RGW for object whose omap values exist on osd.488, but not osd.511 or osd.541
2019-04-08 12:42:21 - omap_digest mismatch detected

As you can see, the DELETE calls happen 45-90 seconds after the OSD
boots. Then, for some reason, the omap data isn't removed from the OSD that
just booted, but it is removed from the other two OSDs.
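The per-shard digests in the deep-scrub report line up with this: the shard whose omap_digest differs is the OSD that had just rebooted. A sketch of pulling that out (field names as emitted by the Luminous-era list-inconsistent-obj JSON):

# Print each inconsistent object with its per-shard omap digests:
rados list-inconsistent-obj 7.146a --format=json | \
    jq '.inconsistents[] | {object: .object.name, shards: [.shards[] | {osd, omap_digest}]}'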

On both clusters the SSDs are used for the .rgw.buckets.index pool, but other
clusters where the .rgw.buckets.index pool still lives on HDDs aren't seeing
this problem.
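For what it's worth, one way to check which device class backs the index pool on a given cluster (rule names will differ per cluster):

# Which CRUSH rule does the bucket index pool use, and does it target the SSDs?
ceph osd pool get .rgw.buckets.index crush_rule
ceph osd crush rule dump
ceph osd crush tree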
