Bug #39175


RGW DELETE calls partially missed shortly after OSD startup

Added by Bryan Stillwell about 5 years ago. Updated almost 5 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We have two separate clusters (physically 2,000+ miles apart) that are seeing
PGs go inconsistent while doing reboots to apply the latest security
patches. Digging into the inconsistent PGs shows that DELETE calls arriving at
the RGWs shortly after one of the OSDs restarts are being missed by one of the
three OSDs.

First cluster:
Version: 12.2.8 (Luminous)
OSDs: 1,964 (1,720 HDDs, 244 SSDs)
OSD back-end: FileStore

Second cluster:
Version: 12.2.8 (Luminous)
OSDs: 570 (475 HDDs, 95 SSDs)
OSD back-end: FileStore

This was determined by doing a dump of the omap values on each of the OSDs that
make up the PG and comparing the results:

rados -p .rgw.buckets.index listomapvals [offending object mentioned in logs]
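A rough sketch of that comparison, assuming the acting set and the offending object name are already known from the scrub errors. The OSD IDs and object name below are placeholders, and primary affinity is used to force reads from a specific replica (the same trick described in the comments further down); this is a sketch, not the exact procedure we ran:

POOL=.rgw.buckets.index
OBJ='.dir.default.2451118.1796353'   # placeholder: object from the scrub error
OSDS=(1740 1802 1814)                # placeholder: acting set of the PG

for osd in "${OSDS[@]}"; do
    # Make this OSD the preferred primary so the read is served by it.
    for other in "${OSDS[@]}"; do
        [ "$other" = "$osd" ] && aff=1 || aff=0
        ceph osd primary-affinity "osd.$other" "$aff"
    done
    sleep 10   # give peering a moment to settle
    rados -p "$POOL" listomapvals "$OBJ" > "omapvals-osd.$osd.txt"
done

# Restore defaults and compare the dumps.
for osd in "${OSDS[@]}"; do ceph osd primary-affinity "osd.$osd" 1; done
diff -u omapvals-osd.1740.txt omapvals-osd.1802.txt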

Here are the timelines for 5 different PGs:

Cluster 1 - PG 7.146a [1740,1802,1814]:
2019-04-05 18:08:17 - osd.1740 marked itself down
2019-04-05 18:18:27 - osd.1740 boot
2019-04-05 18:19:12 - DELETE call seen on RGW for object whose omap values exist on osd.1740, but not osd.1802 or osd.1814
2019-04-10 01:27:02 - omap_digest mismatch detected

Cluster 1 - PG 7.1a62 [1840,1786,1736]:
2019-04-05 16:40:28 - osd.1736 marked itself down
2019-04-05 16:49:58 - osd.1736 boot
2019-04-05 16:50:47 - DELETE call seen on RGW for object whose omap values exist on osd.1736, but not osd.1840 or osd.1786
2019-04-10 11:31:07 - omap_digest mismatch detected

Cluster 2 - PG 7.3 [504,525,556]:
2019-04-05 09:08:16 - osd.504 marked itself down
2019-04-05 09:12:03 - osd.504 boot
2019-04-05 09:13:09 - DELETE call seen on RGW for object whose omap values exist on osd.504, but not osd.525 or osd.556
2019-04-08 11:46:15 - omap_digest mismatch detected

Cluster 2 - PG 7.9 [492,546,523]:
2019-04-04 14:37:34 - osd.492 marked itself down
2019-04-04 14:40:35 - osd.492 boot
2019-04-04 14:41:55 - DELETE call seen on RGW for object whose omap values exist on osd.492, but not osd.546 or osd.523
2019-04-08 12:06:14 - omap_digest mismatch detected

Cluster 2 - PG 7.2b [488,511,541]:
2019-04-03 13:54:17 - osd.488 marked itself down
2019-04-03 13:59:27 - osd.488 boot
2019-04-03 14:00:54 - DELETE call seen on RGW for object whose omap values exist on osd.488, but not osd.511 or osd.541
2019-04-08 12:42:21 - omap_digest mismatch detected

As you can see, the DELETE calls happen 45-90 seconds after the OSD boots.
Then, for some reason, the omap data isn't removed from the OSD that just
booted, but it is removed from the other two OSDs.

For both clusters the SSDs are used for the .rgw.buckets.index pool, but other
clusters where the .rgw.buckets.index pool still exists on HDDs aren't seeing
this problem.

Actions #1

Updated by Patrick Donnelly about 5 years ago

  • Project changed from Ceph to rgw
Actions #2

Updated by Kyle Bader about 5 years ago

  • Source set to Community (user)
  • Affected Versions v12.2.8 added
Actions #3

Updated by Patrick Donnelly about 5 years ago

  • Project changed from rgw to RADOS
Actions #4

Updated by Greg Farnum about 5 years ago

Is there any chance of getting good debug logs of the event while it happens (i.e., not just after scrub detects the issue)?

How is the OSD restart happening, and what is the SSD? Is it a clean restart, or something that might result in an SSD losing power and dropping updates? :/

Actions #5

Updated by Greg Farnum about 5 years ago

Or maybe I misread that; is the claim that an OSD reboots, then a delete happens, and then later on you discover that the delete happened on two copies but not the third? Exactly what operations are you seeing at which layers of the stack?

Actions #6

Updated by Sage Weil about 5 years ago

  • Priority changed from Normal to Urgent
Actions #7

Updated by Bryan Stillwell about 5 years ago

The reason the OSDs are rebooting is that we're applying the latest OS updates for CentOS, so it should be a proper shutdown.

This is what appears to be happening:

1. All OSDs are shut down and the server is rebooted
2. The server boots up and starts the OSDs
3. The OSD finishes booting and displays messages like this:

2019-04-15 16:27:39 - osd.527 load_pgs
2019-04-15 16:27:40 - osd.527 load_pgs opened 51 pgs
2019-04-15 16:27:40 - osd.527 boot

4. Then a DELETE call (or two) comes in:

2019-04-15 16:27:53 - DELETE /b2955/backups/PUBLISH-2019-04-13T20-11-29.067Z.json
2019-04-15 16:28:50 - DELETE /b247/backups/AUTO-2019-04-15T23-15-35.933Z.json

5. Scrubs happen later which detect the problem

2019-04-17 11:39:51.158721 osd.480 osd.480 10.26.20.7:6820/10511 299 : cluster [ERR] 7.50 : soid 7:0a477eb8:::.dir.default.2451118.1796353:head omap_digest 0x7e65e1ec != omap_digest 0xd5c9d878 from shard 480
2019-04-17 12:04:27.538815 osd.480 osd.480 10.26.20.7:6820/10511 300 : cluster [ERR] 7.50 : soid 7:0bca7957:::.dir.default.28479564.1255843:head omap_digest 0xe0c83379 != omap_digest 0x66bbad9 from shard 480

6. Then using listomapkeys and primary-affinity I can grab the keys for each OSD (see the sketch after these steps for an alternative that reads directly from the object store):

rados -p .rgw.buckets.index listomapkeys .dir.default.2451118.1796353 >pg-7.50-osd.480-listomapkeys-1.txt
rados -p .rgw.buckets.index listomapkeys .dir.default.28479564.1255843 >pg-7.50-osd.480-listomapkeys-2.txt
ceph osd primary-affinity osd.480 0
rados -p .rgw.buckets.index listomapkeys .dir.default.2451118.1796353 >pg-7.50-osd.544-listomapkeys-1.txt
rados -p .rgw.buckets.index listomapkeys .dir.default.28479564.1255843 >pg-7.50-osd.544-listomapkeys-2.txt
ceph osd primary-affinity osd.544 0
rados -p .rgw.buckets.index listomapkeys .dir.default.2451118.1796353 >pg-7.50-osd.527-listomapkeys-1.txt
rados -p .rgw.buckets.index listomapkeys .dir.default.28479564.1255843 >pg-7.50-osd.527-listomapkeys-2.txt
ceph osd primary-affinity osd.544 1
ceph osd primary-affinity osd.480 1

7. Doing a comparison points to the objects which were deleted, but still show up on one of the OSDs:

[root@a2plcephmon004 ~]# diff -u pg-7.50-osd.480-listomapkeys-1.txt pg-7.50-osd.527-listomapkeys-1.txt
--- pg-7.50-osd.480-listomapkeys-1.txt 2019-04-17 14:12:59.443810473 -0700
+++ pg-7.50-osd.527-listomapkeys-1.txt 2019-04-17 14:13:50.168719106 -0700
@@ -20031,6 +20031,7 @@
 backups/AUTO-2019-04-13T20-21-53.652Z.json
 backups/AUTO-2019-04-16T21-15-26.172Z.json
 backups/AUTO-2019-04-16T21-27-34.611Z.json
+backups/PUBLISH-2019-04-13T20-11-29.067Z.json
 backups/PUBLISH-2019-04-13T20-51-27.564Z.json
 backups/PUBLISH-2019-04-13T21-37-47.976Z.json
 backups/PUBLISH-2019-04-13T21-38-35.323Z.json

[root@a2plcephmon004 ~]# diff -u pg-7.50-osd.480-listomapkeys-2.txt pg-7.50-osd.527-listomapkeys-2.txt
--- pg-7.50-osd.480-listomapkeys-2.txt 2019-04-17 14:13:03.252803616 -0700
+++ pg-7.50-osd.527-listomapkeys-2.txt 2019-04-17 14:13:53.798712564 -0700
@@ -22961,6 +22961,7 @@
 pub/e4df/script.js
 pub/e37c/styles.css
 pub/24d4/script.js
+backups/AUTO-2019-04-15T23-15-35.933Z.json
 backups/AUTO-2019-04-15T23-20-14.397Z.json
 backups/AUTO-2019-04-15T23-20-35.727Z.json
 backups/AUTO-2019-04-15T23-21-23.572Z.json

So after osd.527 booted it missed deleting those two objects. It only seems to happen within the first 2 minutes after the OSD boots.
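An alternative to the primary-affinity dance in step 6 is to read the omap keys directly from one replica's object store with ceph-objectstore-tool while that OSD is stopped. A hedged sketch, assuming FileStore paths; the data path, journal path, PG id, and object name are placeholders and would need to match the actual OSD:

systemctl stop ceph-osd@527
ceph-objectstore-tool \
    --data-path /var/lib/ceph/osd/ceph-527 \
    --journal-path /var/lib/ceph/osd/ceph-527/journal \
    --pgid 7.50 '.dir.default.2451118.1796353' \
    list-omap > pg-7.50-osd.527-objectstore-listomap.txt
systemctl start ceph-osd@527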

Actions #8

Updated by Bryan Stillwell about 5 years ago

I could grab you the debug logs, but that could take a while. Which knobs do you want me to turn up?

This is what I'm thinking:

  1. Set the requested debug settings in ceph.conf
  2. Restart the SSD-based OSD process
  3. Do a deep scrub of every PG on that OSD
  4. Repeat until the problem shows back up

Are there any corrections you would like to make to the process?

Actions #9

Updated by Neha Ojha almost 5 years ago

Bryan Stillwell wrote:

I could grab you the debug logs, but that could take a while. Which knobs do you want me to turn up?

This is what I'm thinking:

  1. Set the requested debug settings in ceph.conf
  2. Restart the SSD-based OSD process
  3. Do a deep scrub of every PG on that OSD
  4. Repeat until the problem shows back up

Are there any corrections you would like to make to the process?

The process sounds fine; make sure to turn on logging with debug_osd=20 and debug_filestore=5.
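A hedged sketch of how that plan could look on the affected node. osd.527 is a placeholder, and the debug levels need to be in ceph.conf so they are active from the moment the OSD boots, since the bad window is the first couple of minutes after startup:

# Merge into the existing [osd] section of /etc/ceph/ceph.conf:
#   [osd]
#   debug osd = 20
#   debug filestore = 5

# Restart the OSD so it comes up with the higher log levels.
systemctl restart ceph-osd@527

# Ask the OSD to deep-scrub all of its PGs, then look for new inconsistencies.
ceph osd deep-scrub osd.527
ceph health detail | grep -i inconsistent
rados list-inconsistent-pg .rgw.buckets.index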

Actions #10

Updated by Neha Ojha almost 5 years ago

Bryan, could you also upload the osd.516 and osd.563 logs from the same time period as you've provided for osd.503?

Actions #11

Updated by Bryan Stillwell almost 5 years ago

Unfortunately I didn't turn up debug logging for every OSD in the cluster so I don't have those logs. I'll reproduce it again and grab you those logs.

Actions #12

Updated by Neha Ojha almost 5 years ago

In both sets of logs you have shared so far, it seems that the object appears in the OSD log during omap-set-vals ("setting vals:"), an omap write, a few seconds after the RGW DELETE call on that object is seen.

Here's an example:

[30/Apr/2019:12:35:51 -0700] - DELETE seen on RGW
2019-04-30 12:35:54.230926 - do_osd_op omap-set-vals setting vals on the object.

This seems a bit weird as we would have expected the entry to be present when RGW sent the DELETE, but we could probably take a look at the sequence of operations on that object from the RGW logs, and see how those correlate to what we are seeing in the OSD logs. Do you have more information that you could provide on that object?
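A hedged example of that correlation, assuming default log locations and using object names from earlier in this ticket as placeholders (omap-set-vals is the op name quoted above; omap-rm-keys is an assumption for the removal side):

OBJ='backups/PUBLISH-2019-04-13T20-11-29.067Z.json'   # placeholder S3 object
IDX='.dir.default.2451118.1796353'                    # placeholder bucket index object

# All RGW operations touching the S3 object.
grep -F "$OBJ" /var/log/ceph/*rgw*.log

# All omap writes/removals on the corresponding bucket index object in the
# debug_osd=20 OSD log.
grep -F "$IDX" /var/log/ceph/ceph-osd.527.log | grep -E 'omap-set-vals|omap-rm-keys'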

Actions #13

Updated by Wes Dillingham almost 5 years ago

Hi Neha,

I am on Bryan's team. Bryan is out this week but is returning soon.

I was able to inspect logs for about 9 hours prior to the delete for any action on that object, and there was none logged on any of our RGWs. Those 9 hours would encompass all of the work Bryan did to recreate the inconsistent PG.

I believe these objects are part of a monthly backup operation (written once, never read, deleted a month later), since later that day (hours after the PG goes inconsistent) we see a similar object with a date stamp a month later get written.

If there is a particular action you would like us to take (turn up RGW logging, etc.) or a characteristic of the object you would like us to investigate, we can attempt to recreate this again. Thanks.

Actions #14

Updated by Neha Ojha almost 5 years ago

Hi Wes,

Can you check whether the bucket on which you are seeing issues was present in a Jewel cluster? We have seen something similar in Jewel before (http://tracker.ceph.com/issues/18533), which was fixed in 10.2.8.

Actions #15

Updated by Wes Dillingham almost 5 years ago

The cluster itself followed this upgrade path:
0.94.10 -> 10.2.10 -> 12.2.5 -> 12.2.8

We will look into the history of the particular bucket.

Actions #16

Updated by Bryan Stillwell almost 5 years ago

The bucket was created on 2017-01-26 while the cluster was running the 0.94.3 (Hammer) release. Also the cluster has the following history:

2017-09-20: Upgraded to Hammer (0.94.10)
2018-02-06: Upgraded to Jewel (10.2.10)
2018-07-12: Upgraded to Luminous (12.2.5)
2018-10-04: Upgraded to Luminous (12.2.8)
2019-03-22: Moved the bucket index pool from HDDs to SSDs

Based on the name of the object, I'm fairly confident it was created on 2019-03-30. Since the bucket index pool was switched to SSDs just 8 days prior to this object being uploaded, could there be a problem with the older bucket index format and the increased performance provided by the SSDs? If so, is there any way to upgrade the bucket index to the newer format?
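For reference, a hedged sketch of how the bucket's creation time and index usage could be double-checked (the bucket name is a placeholder):

BUCKET=b2955   # placeholder bucket name

# Entrypoint metadata, including creation_time and the bucket instance id.
radosgw-admin metadata get bucket:$BUCKET

# Per-bucket statistics for the resolved instance.
radosgw-admin bucket stats --bucket=$BUCKET

# Reports how full each bucket's index shards are across the zone.
radosgw-admin bucket limit check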

Actions #17

Updated by Bryan Stillwell almost 5 years ago

Neha,

It was great meeting with you in Barcelona! I can't remember everything you wanted me to gather, but here's a breakdown for 4 of our object clusters and whether or not they're seeing this problem:

a2
Status: Seeing the inconsistent PG problem during OSD restarts
PGs in .rgw.buckets.index: 128
SSD-based OSDs backing .rgw.buckets.index: 95
PGs/SSD: 4
Total omap size: 470.9 GiB
omap/SSD: 5.0 GiB
omap/PG: 3.7 GiB
History:
2016-03-31 - Cluster created (0.94.3)
2017-09-20 - Upgraded (0.94.10)
2018-02-06 - Upgraded (10.2.10)
2018-07-12 - Upgraded (12.2.5)
2018-10-04 - Upgraded (12.2.8)
2019-03-22 - Switched buckets index pool over to SSDs

n3
Status: No problems with inconsistent PGs
PGs in .rgw.buckets.index: 64
SSD-based OSDs backing .rgw.buckets.index: 0
PGs/SSD: 0
Total omap size: Unknown (all pools on HDDs)
omap/SSD: Unknown (all pools on HDDs)
History:
2016-06-28 - Created (0.94.3)
2017-05-23 - Upgraded (0.94.10)
2018-01-24 - Upgraded (10.2.10)
2018-07-10 - Upgraded (12.2.5)
2018-10-05 - Upgraded (12.2.8)
2019-05-08 - Upgraded (12.2.12)

p3
Status: Seeing the inconsistent PG problem during OSD restarts
PGs in .rgw.buckets.index: 8192
SSD-based OSDs backing .rgw.buckets.index: 244
PGs/SSD: 101
Total omap size: 6,619.5 GiB
omap/SSD: 27.1 GiB
omap/PG: 0.8 GiB
History:
2015-12-16 - Created (0.94.3)
2017-07-18 - Upgraded (0.94.10)
2018-02-12 - Upgraded (10.2.10)
2018-08-02 - Upgraded (12.2.5)
2018-10-10 - Upgraded (12.2.8)
2018-12-07 - Moved .rgw.buckets.index pool to SSDs

s2
Status: No problems with inconsistent PGs
PGs in default.rgw.buckets.index: 1024
SSD-based OSDs backing .rgw.buckets.index: 83
PGs/SSD: 37
Total omap size: 69.7 GiB
omap/SSD: 0.8 GiB
omap/PG: 0.07 GiB
History:
2017-09-18 - Created (10.2.9)
2018-01-22 - Upgraded (10.2.10)
2018-07-24 - Upgraded (12.2.5)
2018-10-09 - Upgraded (12.2.8)
2019-02-27 - Moved default.rgw.buckets.index pool to SSDs
2019-05-09 - Upgraded (12.2.12)
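A hedged sketch of how per-OSD omap sizes like the ones above can be gathered on FileStore OSDs, assuming the default data layout where the omap database (leveldb or rocksdb) lives under current/omap:

# Run on each host carrying SSD-based index OSDs.
for osd_dir in /var/lib/ceph/osd/ceph-*; do
    [ -d "$osd_dir/current/omap" ] || continue   # skip BlueStore OSDs
    printf '%s: ' "$(basename "$osd_dir")"
    du -sh "$osd_dir/current/omap" | awk '{print $1}'
done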

The two clusters with the problem (a2 & p3) have a few things in common:

1. They both were installed with the hammer release (0.94.3)
2. They both have a high amount of omap data per OSD (5.0 GiB in one and 27.1 GiB in the other)
3. They're both still running 12.2.8 while the other two are running 12.2.12

The third point is unlikely to matter, since we only upgraded those two clusters to 12.2.12 because they weren't seeing the problem.

As for the other two points, because these clusters started on Hammer they began with leveldb and were switched to rocksdb when we moved the index pools to SSDs. That might be the problem, but I'm starting to feel like a large amount of omap data on top of rocksdb is the combination needed to get this bug to surface.

The main reason there's so much omap data on the clusters with the problem is because of the bug which caused lots of stale bucket instances to be created. I'm thinking that if we use the new cleanup command that was introduced in 12.2.11 (radosgw-admin stale-instances rm) we can drastically reduce the amount of omap data per SSD and avoid this problem (but not fix the bug of course).
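If it helps, a hedged sketch of that cleanup, assuming the full subcommand form documented for 12.2.11 (it lives under reshard); reviewing the list before removing anything seems prudent:

# List stale bucket index instances left behind by earlier resharding bugs,
# then remove them.
radosgw-admin reshard stale-instances list
radosgw-admin reshard stale-instances rm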

What are your thoughts on this? I imagine you would like to find the bug in the code, but we would really like to be able to finish patching these clusters and upgrade them to 12.2.12.

Thanks,
Bryan

Actions #18

Updated by Josh Durgin almost 5 years ago

Hey Bryan, Neha's out this week. I'd like to verify whether this could be the same bug we'd seen before (http://tracker.ceph.com/issues/18533).

Since it only appears on clusters that started out on hammer, where that bug was present, it seems likely that it may be the culprit.

Essentially that bug results in the internal structure of omap for the bucket index becoming incorrect, particularly when deletions occur. What matters is not whether the S3 object was created recently, but whether the bucket has existed since 0.94.3 with the same internal omap structure.

When moving the bucket index pool to SSD, did you end up backfilling onto the SSDs? This would rewrite the internal omap structures and wipe out the effects of that bug. If you moved things around via ceph-kvstore-tool or ceph-objectstore-tool, the bug in the internal omap structure would be preserved.

Actions #19

Updated by Bryan Stillwell almost 5 years ago

Hey Josh,

We backfilled onto the SSDs by creating a new crush rule which just uses the ssd class and switching the .rgw.buckets.index to use that new rule.
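For context, a hedged sketch of that kind of move (the rule name is a placeholder; Luminous device classes are assumed):

# Create a replicated rule restricted to the ssd device class, then point
# the index pool at it; Ceph backfills the PGs onto the SSD OSDs.
ceph osd crush rule create-replicated index-ssd default host ssd
ceph osd pool set .rgw.buckets.index crush_rule index-ssd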

Also I finished upgrading the a2 cluster to 12.2.12 today and was still able to reproduce the problem. My next step is to split the PGs from 128 to 2048 to try and reduce the amount of omap data per PG.

Bryan

Actions #20

Updated by Josh Durgin almost 5 years ago

Ok, so this is a different bug then. Any chance you're planning on migrating to bluestore with part of one of the problem clusters? FileStore has a more complex omap structure that may be the source of this bug. I doubt omap data/pg is related, but it's tough to tell.

Actions #21

Updated by Bryan Stillwell almost 5 years ago

We are planning on migrating all of our clusters to BlueStore, but that's going to take the rest of the year. We could move just the SSD-based OSDs over to BlueStore, but first we'll need to do it on our test and staging clusters.

What are the odds this will fix the problem?

Actions #22

Updated by Josh Durgin almost 5 years ago

I'd say the odds are high that migrating the bucket indexes to BlueStore would fix it; the omap structure there is very simple compared to FileStore's. If it doesn't fix it, that rules out a large area where the bug might be.
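A hedged sketch of converting one of the SSD index OSDs to BlueStore via the standard destroy/re-create flow; the OSD id and device are placeholders, and it's worth checking against the Luminous BlueStore migration docs (and ceph osd safe-to-destroy) before running anything like this:

ID=527           # placeholder OSD id
DEV=/dev/sdq     # placeholder SSD device backing that OSD

# Mark the OSD out, wait until it is safe to destroy, then stop it.
ceph osd out $ID
systemctl stop ceph-osd@$ID

# Destroy keeps the OSD id and CRUSH position so the new BlueStore OSD
# reuses them, limiting data movement to this one OSD's PGs.
ceph osd destroy $ID --yes-i-really-mean-it

# Wipe the old FileStore data and re-create the OSD as BlueStore.
ceph-volume lvm zap $DEV --destroy
ceph-volume lvm create --bluestore --data $DEV --osd-id $ID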

Actions #23

Updated by Bryan Stillwell almost 5 years ago

After switching both of these over to using BlueStore for the SSDs, the problem has gone away! Thanks!

Actions #24

Updated by Neha Ojha almost 5 years ago

  • Status changed from New to Resolved

That's great. Marking this bug as "Resolved" since migrating to BlueStore fixed the issue.
