Bug #43174 (closed)

pgs inconsistent, union_shard_errors=missing

Added by Aleksandr Rudenko over 4 years ago. Updated over 2 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
Scrub/Repair
Target version:
-
% Done:
0%

Source:
Tags:
Backport:
octopus, nautilus, mimic, luminous
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
OSD
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi,

Luminous 12.2.12.
2/3 OSDs - Filestore, 1/3 - Bluestore
size=3, min_size=2
Cluster used as S3 (RadosGW).

I have "pgs inconsistent" for about 80 PGs.

For example, here is one of the inconsistent objects:

rados list-inconsistent-obj 6.f32 | jq
{
  "epoch": 31466,
  "inconsistents": [
    {
      "object": {
        "name": "d48c233a-cef5-4072-8fee-8e425695b655.319082.2_ovBck/SEKRETERYA 28.06.2019/SEBLA-HUKUK/Sebla-temyiz dilekçesine cevap - birleştirme için.docx",
        "nspace": "",
        "locator": "",
        "snap": "head",
        "version": 41926
      },
      "errors": [],
      "union_shard_errors": [
        "missing" 
      ],
      "selected_object_info": {
        "oid": {
          "oid": "d48c233a-cef5-4072-8fee-8e425695b655.319082.2_ovBck/SEKRETERYA 28.06.2019/SEBLA-HUKUK/Sebla-temyiz dilekçesine cevap - birleştirme için.docx",
          "key": "",
          "snapid": -2,
          "hash": 2273537842,
          "max": 0,
          "pool": 6,
          "namespace": "" 
        },
        "version": "31462'45912",
        "prior_version": "31410'41926",
        "last_reqid": "osd.47.0:57943766",
        "user_version": 41926,
        "size": 62411,
        "mtime": "2019-11-21 07:52:29.497853",
        "local_mtime": "2019-11-21 07:52:29.513779",
        "lost": 0,
        "flags": [
          "dirty",
          "data_digest",
          "omap_digest" 
        ],
        "legacy_snaps": [],
        "truncate_seq": 0,
        "truncate_size": 0,
        "data_digest": "0x3b9127ee",
        "omap_digest": "0xffffffff",
        "expected_object_size": 0,
        "expected_write_size": 0,
        "alloc_hint_flags": 0,
        "manifest": {
          "type": 0,
          "redirect_target": {
            "oid": "",
            "key": "",
            "snapid": 0,
            "hash": 0,
            "max": 0,
            "pool": -9223372036854776000,
            "namespace": "" 
          }
        },
        "watchers": {}
      },
      "shards": [
        {
          "osd": 9,
          "primary": false,
          "errors": [
            "missing" 
          ]
        },
        {
          "osd": 47,
          "primary": true,
          "errors": [
            "missing" 
          ]
        },
        {
          "osd": 62,
          "primary": false,
          "errors": [],
          "size": 62411,
          "omap_digest": "0xffffffff",
          "data_digest": "0x3b9127ee" 
        }
      ]
    }
  ]
}

As you can see, 2/3 of the OSDs have "errors": ["missing"]. The primary OSD (47) has this error too, but I can GET this object with awscli (through the S3 API), and the md5 of this object matches the ETag (the integrity of the object is not broken). If I stop OSD 62 (which has the object according to the report), I can still successfully get this object using the S3 API.

If I run PG repair, I can see this in the cluster log:

36:45.145709 osd.47 osd.47 172.19.0.17:6860/1093343 4254 : cluster [ERR] 6.f32 repair : stat mismatch, got 3556/3555 objects, 0/0 clones, 3556/3555 dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 0/0 whiteouts, 11169954579/11169814803 bytes, 0/0 hit_set_archive bytes.
2019-12-05 20:36:45.148312 osd.47 osd.47 172.19.0.17:6860/1093343 4255 : cluster [ERR] 6.f32 repair 1 missing, 0 inconsistent objects
2019-12-05 20:36:45.148434 osd.47 osd.47 172.19.0.17:6860/1093343 4256 : cluster [ERR] 6.f32 repair 4 errors, 2 fixed

After repairing, I ran a deep-scrub again.

2019-12-06 12:50:30.742346 osd.47 osd.47 172.19.0.17:6860/1093343 4268 : cluster [ERR] 6.f32 shard 62 6:4cf6c1e1:::d48c233a-cef5-4072-8fee-8e425695b655.319082.2_ovBck%2fSEKRETERYA 28.06.2019%2fSEBLA-HUKUK%2fSebla-temyiz dilek%c3%a7esine cevap - birle%c5%9ftirme i%c3%a7in.docx:head : missing
2019-12-06 12:50:32.872768 osd.47 osd.47 172.19.0.17:6860/1093343 4269 : cluster [ERR] 6.f32 shard 9 6:4cf6c1e1:::d48c233a-cef5-4072-8fee-8e425695b655.319082.2_ovBck%2fSEKRETERYA 28.06.2019%2fSEBLA-HUKUK%2fSebla-temyiz dilek%c3%a7esine cevap - birle%c5%9ftirme i%c3%a7in.docx:head : missing
2019-12-06 12:50:32.872781 osd.47 osd.47 172.19.0.17:6860/1093343 4270 : cluster [ERR] 6.f32 shard 47 6:4cf6c1e1:::d48c233a-cef5-4072-8fee-8e425695b655.319082.2_ovBck%2fSEKRETERYA 28.06.2019%2fSEBLA-HUKUK%2fSebla-temyiz dilek%c3%a7esine cevap - birle%c5%9ftirme i%c3%a7in.docx:head : missing
...
2019-12-06 13:14:45.485929 osd.47 osd.47 172.19.0.17:6860/1093343 4272 : cluster [ERR] 6.f32 deep-scrub 1 missing, 0 inconsistent objects
2019-12-06 13:14:45.485941 osd.47 osd.47 172.19.0.17:6860/1093343 4273 : cluster [ERR] 6.f32 deep-scrub 4 errors

After the deep-scrub I can see "missing" on the same 2/3 OSDs.

Why can I get the object successfully from S3 when 2/3 of the OSDs are missing the object?
What does "missing" mean?


Related issues 7 (1 open, 6 closed)

Related to RADOS - Bug #39116: Draining filestore osd, removing, and adding new bluestore osd causes OSDs to crash (New, 04/04/2019)
Has duplicate RADOS - Bug #43175: pgs inconsistent, union_shard_errors=missing (Duplicate)
Has duplicate RADOS - Bug #43176: pgs inconsistent, union_shard_errors=missing (Duplicate)
Copied to RADOS - Backport #47362: nautilus: pgs inconsistent, union_shard_errors=missing (Resolved, Mykola Golub)
Copied to RADOS - Backport #47363: octopus: pgs inconsistent, union_shard_errors=missing (Resolved, Mykola Golub)
Copied to RADOS - Backport #47364: luminous: pgs inconsistent, union_shard_errors=missing (Resolved, Mykola Golub)
Copied to RADOS - Backport #47365: mimic: pgs inconsistent, union_shard_errors=missing (Resolved, Mykola Golub)
#1

Updated by Nathan Cutler over 4 years ago

  • Has duplicate Bug #43175: pgs inconsistent, union_shard_errors=missing added
#2

Updated by Nathan Cutler over 4 years ago

  • Has duplicate Bug #43176: pgs inconsistent, union_shard_errors=missing added
#3

Updated by Greg Farnum over 4 years ago

  • Tracker changed from Bug to Support
  • Status changed from New to Closed

If you fetch an object in RGW and its backing RADOS objects are missing, it just fills in the space with zeros. It sounds like you checksummed it after getting the object while the OSD which held it was running, then turned off the OSD and saw that it was fetched successfully (but it would be empty the second time!).

If you have missing objects, it's best to just repair them, but you also want to find out how so many went missing to begin with. Perhaps you had a power failure and your drives aren't respecting flushes and syncs correctly under power failures?

The mailing list or IRC will be able to give you more support on issues like this in the future. :)

#4

Updated by Aleksandr Rudenko over 4 years ago

Greg, thanks for the reply.

Greg Farnum wrote:

If you fetch an object in RGW and its backing RADOS objects are missing, it just fills in the space with zeros.

It's not possible. All S3 objects have an ETag, and I compared it with the md5 and it matches. Moreover, the objects are documents like PDFs and I can open them.

I can get the inconsistent objects from all three OSDs using ceph-objectstore-tool like this:

ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-61 --pgid 6.386 "d48c233a-cef5-4072-8fee-8e425695b655.319082.2_ovBck/logo.jpg" get-bytes logo.dump

logo.dump is identical on all OSDs and has the correct md5.

I think it's a problem in the scrub mechanism, but I'm not sure.

If you have missing objects, it's best to just repair them

How can I repair this object? I ran repair many times on different PGs with no success.

#5

Updated by Greg Farnum over 4 years ago

  • Status changed from Closed to New
  • Assignee set to David Zafman

Hmm, this may be something else then. David, does it look familiar?

#6

Updated by David Zafman over 4 years ago

  • Related to Bug #39116: Draining filestore osd, removing, and adding new bluestore osd causes OSDs to crash added
#7

Updated by David Zafman over 4 years ago

Scrub incorrectly thinks the object really isn't there, but we know it is.

The way you can see missing objects during scrub that actually exist is if different OSDs are sorting the objects in a different order. During scrub, a range of objects is requested from each OSD. If the OSDs don't sort them the same way, some objects may sort outside the requested range and not be reported to the primary. Or an "extra" object sorts inside the range while the primary and the other replicas don't have that object in the current range. Either way, these objects are marked as missing.
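
To make the failure mode concrete, here is a toy standalone C++ sketch (illustration only, not Ceph code: the object names, the "first two objects per chunk" rule, and both comparators are invented). It shows how two replicas that hold the same objects, but sort them differently, return chunks that disagree, so an object a replica really stores can look "missing":

// toy_scrub.cpp -- illustration only; build with: g++ -std=c++17 toy_scrub.cpp
#include <algorithm>
#include <cstdio>
#include <set>
#include <string>
#include <vector>

// Raw byte order (filestore-like in this sketch): bytes >= 0x80 sort after ASCII.
static bool raw_less(const std::string& a, const std::string& b) {
  return std::lexicographical_compare(
      a.begin(), a.end(), b.begin(), b.end(),
      [](char x, char y) {
        return static_cast<unsigned char>(x) < static_cast<unsigned char>(y);
      });
}

// Signed-char order (a stand-in for "some other order"): on platforms where
// char is signed, bytes >= 0x80 compare as negative and sort before ASCII.
static bool other_less(const std::string& a, const std::string& b) {
  return std::lexicographical_compare(a.begin(), a.end(), b.begin(), b.end(),
                                      [](char x, char y) { return x < y; });
}

int main() {
  // The same three object names exist on both sides; one contains high bytes.
  std::vector<std::string> names = {"alpha", "\xef\xbe\x91_doc", "zulu"};

  std::vector<std::string> primary = names, replica = names;
  std::sort(primary.begin(), primary.end(), raw_less);
  std::sort(replica.begin(), replica.end(), other_less);

  // Each side reports its first two objects as the current "chunk".
  std::set<std::string> p(primary.begin(), primary.begin() + 2);
  std::set<std::string> r(replica.begin(), replica.begin() + 2);

  // An object inside the primary's chunk but outside the replica's chunk is
  // flagged as missing on the replica, even though the replica stores it.
  for (const std::string& o : p) {
    if (r.count(o) == 0) {
      std::printf("%s falls outside the replica's chunk -> reported as missing\n",
                  o.c_str());
    }
  }
  return 0;
}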

Are the missing ones always on the same objectstore type? From your initial description, is osd.62 bluestore and osd.9/osd.47 filestore, or vice versa?

Any difference between your OSDs should be examined.

  • Versions different for some OSDs (mixing pre-luminous OSDs)
  • Maybe different configuration values between OSDs
  • Filestore vs Bluestore, if we have a bug in v12.2.12
  • Sortbitwise needed to be set before upgrade to Luminous. Not sure Luminous OSDs would even boot if it isn't set.
  • Are you running your own Ceph build?

Check OSD versions. Verify that configuration values are the same. Finishing the upgrade from filestore to bluestore might help, but both should work.

#8

Updated by Aleksandr Rudenko over 4 years ago

Hi David.

Are you running your own Ceph build?

No, we use the official (community) build.

Sortbitwise needed to be set before upgrade to Luminous. Not sure Luminous OSDs would even boot if it isn't set.

Sortbitwise was set before the update to Luminous, as described in the official upgrade manual.
Bluestore OSDs were added (on Luminous) about a year after the upgrade to Luminous.

Maybe different configuration values between OSDs

We use identical parameters for all OSDs (filestore, bluestore), and there is nothing special in our config.

Versions different for some OSDs (mixing pre-luminous OSDs)

No. Versions are identical.

Are the missing ones always using the same objectstore?

No. I can see 'missing' on filestore as well as on bluestore.

#9

Updated by Mykola Golub almost 4 years ago

  • Pull request ID set to 35938

One of our customers also experienced this issue after adding bluestore OSDs to a filestore-backed cluster.

Using ceph-objectstore-tool we produced listings of the affected PGs on both bluestore and filestore OSDs and found that the order of some objects differed. I cannot provide the examples without the customer's permission, because the object names contain rather sensitive information, but the root cause seems clear to me. The affected objects had the same hash, so the object name was used for sorting. But as I see from the code, bluestore uses the escaped string for the key [1] while filestore uses the raw string [2,3]. Note that the object names we saw the problem with were in Japanese, so they had many escaped characters.

I have a patch up for review [4] that makes filestore use the same order as bluestore. Applying this solution, though, would mean having the same problem on a cluster with a mix of old- and new-version filestore OSDs.

[1] https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L146
[2] https://github.com/ceph/ceph/blob/master/src/os/filestore/HashIndex.cc#L1070
[3] https://github.com/ceph/ceph/blob/master/src/os/filestore/HashIndex.h#L405
[4] https://github.com/ceph/ceph/pull/35938

#10

Updated by Mykola Golub almost 4 years ago

Our partners noticed that there is actually an issue with how bluestore escapes the key strings. Here is their patch, with a comment that illustrates the issue:

diff --git a/src/os/bluestore/BlueStore.cc b/src/os/bluestore/BlueStore.cc
index 700bc6918b..f83752c1c6 100644
--- a/src/os/bluestore/BlueStore.cc
+++ b/src/os/bluestore/BlueStore.cc
@@ -189,11 +189,11 @@ static void append_escaped(const string &in, S *out)
   char hexbyte[in.length() * 3 + 1];
   char* ptr = &hexbyte[0];
   for (string::const_iterator i = in.begin(); i != in.end(); ++i) {
-    if (*i <= '#') {
+    if ((unsigned char)*i <= '#') {
       *ptr++ = '#';
       *ptr++ = "0123456789abcdef"[(*i >> 4) & 0x0f];
       *ptr++ = "0123456789abcdef"[*i & 0x0f];
-    } else if (*i >= '~') {
+    } else if ((unsigned char)*i >= '~') {
       *ptr++ = '~';
       *ptr++ = "0123456789abcdef"[(*i >> 4) & 0x0f];
       *ptr++ = "0123456789abcdef"[*i & 0x0f];

Patch comment:
From the object list of the example: "ム" should be handled by the "~" branch, but it is escaped with "#" (in the actual behavior "ム" is processed as the UTF-8 bytes "#ef#be#91"; it should be "~EF~BE~91"). Thus this comes from an i18n issue.

A std::string element is of type char, and when the byte value exceeds 0x7f the char value becomes negative (where char is signed).

basic_string (From Japanese C++ reference)
https://cpprefjp.github.io/reference/string/basic_string.html

So, with the patch applied, I get the same order for the escaped strings as for the unescaped ones, and the object order on filestore (which uses unescaped strings) matches the order on bluestore. But if we fix append_escaped on bluestore, the existing keys in the DB become invalid and would require a conversion on upgrade.
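
To see the pitfall in isolation, here is a minimal standalone C++ sketch (illustration only, not Ceph code; it copies the escape rule from the append_escaped() excerpt in the diff above and uses an arbitrary three-byte UTF-8 name). With plain char the high bytes compare as negative and take the '#' branch; with the unsigned-char cast they take the intended '~' branch:

// escape_demo.cpp -- illustration only; build with: g++ -std=c++17 escape_demo.cpp
#include <cstdio>
#include <string>

// Escape rule modeled on the append_escaped() snippet above: bytes <= '#'
// get a '#' prefix, bytes >= '~' get a '~' prefix, the rest pass through.
// buggy=true compares the raw (possibly signed) char; buggy=false casts to
// unsigned char first, as in the patch.
static std::string escape(const std::string& in, bool buggy) {
  static const char* hex = "0123456789abcdef";
  std::string out;
  for (char c : in) {
    unsigned char u = static_cast<unsigned char>(c);
    bool low  = buggy ? (c <= '#') : (u <= '#');
    bool high = buggy ? (c >= '~') : (u >= '~');
    if (low) {
      out += '#';
      out += hex[(u >> 4) & 0x0f];
      out += hex[u & 0x0f];
    } else if (high) {
      out += '~';
      out += hex[(u >> 4) & 0x0f];
      out += hex[u & 0x0f];
    } else {
      out += c;
    }
  }
  return out;
}

int main() {
  // UTF-8 bytes 0xef 0xbe 0x91 (the character from the patch comment).
  std::string name = "\xef\xbe\x91";
  // On platforms where char is signed, 0xef is negative, so it satisfies
  // *i <= '#' and is escaped with '#'; keys starting with '#' sort before
  // plain ASCII names, unlike the raw byte order used by filestore.
  std::printf("buggy escape: %s\n", escape(name, true).c_str());   // #ef#be#91
  std::printf("fixed escape: %s\n", escape(name, false).c_str());  // ~ef~be~91
  return 0;
}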

#11

Updated by David Zafman almost 4 years ago

  • Status changed from New to In Progress
  • Assignee changed from David Zafman to Mykola Golub
#12

Updated by Mykola Golub almost 4 years ago

  • Pull request ID changed from 35938 to 36230
#13

Updated by Kefu Chai over 3 years ago

  • Status changed from In Progress to Resolved
#14

Updated by Josh Durgin over 3 years ago

  • Tracker changed from Support to Bug
  • Status changed from Resolved to Pending Backport
  • Target version deleted (v10.2.12)
  • Backport set to octopus, nautilus
  • Regression set to No
  • Severity set to 3 - minor
#15

Updated by Josh Durgin over 3 years ago

  • Priority changed from Normal to Urgent
  • Severity changed from 3 - minor to 2 - major
#16

Updated by Mykola Golub over 3 years ago

  • Copied to Backport #47362: nautilus: pgs inconsistent, union_shard_errors=missing added
#17

Updated by Mykola Golub over 3 years ago

  • Copied to Backport #47363: octopus: pgs inconsistent, union_shard_errors=missing added
#18

Updated by Nathan Cutler over 3 years ago

  • Backport changed from octopus, nautilus to octopus, nautilus, mimic, luminous
#19

Updated by Nathan Cutler over 3 years ago

  • Copied to Backport #47364: luminous: pgs inconsistent, union_shard_errors=missing added
#20

Updated by Nathan Cutler over 3 years ago

  • Copied to Backport #47365: mimic: pgs inconsistent, union_shard_errors=missing added
#21

Updated by Nathan Cutler over 2 years ago

  • Status changed from Pending Backport to Resolved