Bug #37942 (closed): Integer underflow in bucket stats

Added by Paul Emmerich over 5 years ago. Updated about 2 years ago.

Status: Can't reproduce
Priority: Normal
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I've found a cluster reporting these stats for a bucket:

            "rgw.none": {
                "size": 0,
                "size_actual": 0,
                "size_utilized": 0,
                "size_kb": 0,
                "size_kb_actual": 0,
                "size_kb_utilized": 0,
                "num_objects": 18446744073709551613
            },

This confused one of our scripts a little bit. However, I've got no idea how the bucket ended up with these stats.

I guess the fix is to replace a -1 literal with a 0 somewhere, but I couldn't find any obvious place in the code where this is happening.
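
For reference, 18446744073709551613 is exactly what an unsigned 64-bit counter holds after it has been decremented three more times than it was incremented (2^64 - 3). A minimal standalone sketch of the wrap-around (plain C++, not the actual RGW accounting code):

    // Sketch: decrementing an unsigned 64-bit counter below zero wraps around.
    #include <cstdint>
    #include <iostream>

    int main() {
      uint64_t num_objects = 0;          // "empty" bucket
      num_objects -= 3;                  // three more decrements than increments
      std::cout << num_objects << "\n";  // 18446744073709551613 (2^64 - 3)
      // Reinterpreting the same bits as signed shows the underlying value:
      std::cout << static_cast<int64_t>(num_objects) << "\n";  // -3
    }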

Actions #1

Updated by Paul Emmerich over 5 years ago

Oh, the cluster is all Mimic 13.2.2.

Actions #2

Updated by Casey Bodley about 5 years ago

  • Assignee set to J. Eric Ivancich
Actions #3

Updated by J. Eric Ivancich about 5 years ago

  • Affected Versions v13.2.2 added
Actions #4

Updated by J. Eric Ivancich about 5 years ago

It appears it's set to -3 rather than -1. 18446744073709551615 would be -1.

Actions #5

Updated by J. Eric Ivancich almost 5 years ago

This one is going to be hard to reproduce given the count is -3. So unless you have a reproducer, I'm inclined to close this as "Can't Reproduce".

I believe resharding that bucket would recalculate bucket stats. There was a bug with the recalculation whose fix has not yet been backported to mimic (see: https://tracker.ceph.com/issues/37473), but I don't think that would result in a non-zero object count for an empty bucket.

There's also the issue of reporting a huge positive rather than a negative. If that's the behavior in master then we should probably fix that (and backport it).

Please let me know your thoughts.

Actions #6

Updated by J. Eric Ivancich almost 5 years ago

I just looked at master, and the stats are stored as unsigned 64-bit ints. I don't know if it would help to at least display that value as negative to possibly better indicate a bug. I'll discuss this with colleagues.
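
One possible way to make such values stand out in the output, sketched here with a hypothetical helper rather than the actual radosgw-admin formatting code: treat counters in the top half of the uint64 range as wrapped negatives when displaying them.

    // Sketch only; format_object_count is a hypothetical helper, not RGW's API.
    #include <cstdint>
    #include <iostream>
    #include <string>

    std::string format_object_count(uint64_t raw) {
      // Values in the top half of the uint64 range are almost certainly underflows.
      int64_t as_signed = static_cast<int64_t>(raw);
      if (as_signed < 0) {
        return std::to_string(as_signed) + " (underflow?)";
      }
      return std::to_string(raw);
    }

    int main() {
      std::cout << format_object_count(18446744073709551613ULL) << "\n";  // "-3 (underflow?)"
    }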

Actions #7

Updated by J. Eric Ivancich almost 5 years ago

There was a fix that prevented these values from going negative (see: http://tracker.ceph.com/issues/20934). You're running mimic, and this should have gone in before the release of mimic. Was this cluster running an older version of ceph at any time?

Actions #8

Updated by J. Eric Ivancich almost 5 years ago

(Note: on master this fix is commit 634215eea1ddd4e4f5dc0066c4a2e745cfc20475)

Actions #9

Updated by Paul Emmerich almost 5 years ago

Cluster was deployed with Luminous, later upgraded to Mimic. S3 was first deployed after the upgrade to 13.2.2.

I can't reproduce this, no. I've just seen it on this one cluster.

Actions #10

Updated by J. Eric Ivancich almost 5 years ago

  • Status changed from New to Can't reproduce
Actions #11

Updated by Paul Emmerich almost 5 years ago

I've checked other clusters for this behavior, and basically every cluster has at least one bucket that reports stats like this; but I don't have a way to reproduce it on a fresh cluster. It seems like something that just happens...

Values seem to vary between -10 and -2.

Actions #12

Updated by Yue Zhu about 4 years ago

We have seen exactly the same issue on ceph version 12.2.12, where num_objects in rgw.none shows extremely large values on some buckets, like the example below:

"usage": {
        "rgw.none": {
            "size": 0,
            "size_actual": 0,
            "size_utilized": 0,
            "size_kb": 0,
            "size_kb_actual": 0,
            "size_kb_utilized": 0,
            "num_objects": 18446744073709551607
        },
        "rgw.main": {
            "size": 1687971465874,
            "size_actual": 1696692400128,
            "size_utilized": 1687971465874,
            "size_kb": 1648409635,
            "size_kb_actual": 1656926172,
            "size_kb_utilized": 1648409635,
            "num_objects": 4290147
        },
        "rgw.multimeta": {
            "size": 0,
            "size_actual": 0,
            "size_utilized": 0,
            "size_kb": 0,
            "size_kb_actual": 0,
            "size_kb_utilized": 0,
            "num_objects": 75
        }
    },
    "bucket_quota": {
        "enabled": false,
        "check_on_raw": false,
        "max_size": -1,
        "max_size_kb": 0,
        "max_objects": -1
    }
Actions #13

Updated by dovefi Z over 3 years ago

Yue Zhu wrote:

We have seen exactly the same issue on ceph version 12.2.12, where num_objects in rgw.none show extremely large values on some buckets, like below

[...]

Hi Yue Zhu, have you fixed this?

Actions #14

Updated by Janne Johansson about 3 years ago

Paul Emmerich wrote:

I've found a cluster reporting these stats for a bucket:
[...]

This confused one of our scripts a little bit. However, I've got no idea how the bucket ended up with these stats.

I guess the fix is to replace a -1 literal with a 0 somewhere, but I couldn't find any obvious place in the code where this is happening.

Now I see this on a Mimic cluster. It has never been anything else.
Installed as 13.2.8 (I think), all nodes now running 13.2.10.

The autosharding is aiming for 65521 shards on a bunch of buckets,

{
    "time": "2021-02-18 13:56:59.197586Z",
    "bucket_name": "nextcloud-bucket-2",
    "old_num_shards": 1,
    "new_num_shards": 65521
}

and this bucket has: (slightly abbreviated)

{
  "bucket": "nextcloud-bucket-2",
  "num_shards": 0,
  "placement_rule": "default-placement",
  },
  "usage": {
    "rgw.none": {
      "size": 0,
      "size_actual": 0,
      "size_utilized": 0,
      "size_kb": 0,
      "size_kb_actual": 0,
      "size_kb_utilized": 0,
      "num_objects": 18446744073709551602
    },

There have been a few of these from what I can see:

    radosgw-admin reshard list | grep 65521 | wc -l
    10

so while I don't know how they appear, they don't seem rare here.

Actions #15

Updated by J. Eric Ivancich about 3 years ago

Janne Johansson wrote:

Paul Emmerich wrote:

I've found a cluster reporting these stats for a bucket:
[...]

This confused one of our scripts a little bit. However, I've got no idea how the bucket ended up with these stats.

I guess the fix is to replace a -1 literal with a 0 somewhere, but I couldn't find any obvious place in the code where this is happening.

Now I see this on a Mimic cluster. It has never been anything else.
Installed as 13.2.8 (I think), all nodes now running 13.2.10.

The autosharding is aiming for 65521 shards on a bunch of buckets,

[...]

and this bucket has: (slightly abbreviated)

[...]

There have been a few of these from what I can see:

    radosgw-admin reshard list | grep 65521 | wc -l
    10

so while I don't know how they appear, they don't seem rare here.

I want to make sure I'm clear on what you see as the connection between shard count and a bad num_objects in rgw.none. Are you saying these bad num_objects values are associated with the number of shards being 65521?

65521 is defined as the maximum number of shards. So we would expect to see it when the number of objects was at or larger than 6.5 billion (65521 * 100000).
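
A rough sketch of how an underflowed counter would drive the suggested shard count to that cap, assuming the roughly 100000-objects-per-shard target mentioned above; the constant and function names below are illustrative, not RGW's actual identifiers:

    // Sketch only: a wrapped-around object count pushes the shard calculation
    // to the maximum. Names, constants, and the exact formula are illustrative.
    #include <cstdint>
    #include <iostream>

    constexpr uint64_t kObjectsPerShardTarget = 100000;  // assumed reshard target
    constexpr uint64_t kMaxShards = 65521;               // hard cap mentioned above

    uint64_t suggested_shards(uint64_t num_objects) {
      uint64_t wanted = num_objects / kObjectsPerShardTarget + 1;
      return wanted > kMaxShards ? kMaxShards : wanted;
    }

    int main() {
      uint64_t bogus = UINT64_C(0) - 3;                  // 18446744073709551613
      std::cout << suggested_shards(bogus) << "\n";      // prints 65521
    }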

Actions #16

Updated by 玮文 胡 almost 3 years ago

I can reproduce this on our cluster, running 16.2.3, and also with radosgw compiled from the current master branch. It appears on a small bucket with only 40 GiB of data.

I'm going to investigate further. If you need more info, let me know.

Actions #17

Updated by 玮文 胡 almost 3 years ago

It is hard to find the root cause; I gave up. I have confirmed with gdb that this large value comes from RADOS, so the bug must be wherever the stats are updated.

A "radosgw-admin bucket reshard" to the same number of shards (11 in our case) fixes this. The "rgw.none" category just disappeared from the stats.

Our cluster was deployed as 15.2.8, now upgraded to 16.2.3.

Actions #18

Updated by Janne Johansson about 2 years ago

J. Eric Ivancich wrote:

Janne Johansson wrote:

Paul Emmerich wrote:

I've found a cluster reporting these stats for a bucket:
[...]

This confused one of our scripts a little bit. However, I've got no idea how the bucket ended up with these stats.

I guess the fix is to replace a -1 literal with a 0 somewhere, but I couldn't find any obvious place in the code where this is happening.

Now I see this on a Mimic cluster. It has never been anything else.
Installed as 13.2.8 (I think), all nodes now running 13.2.10.

The autosharding is aiming for 65521 shards on a bunch of buckets,

[...]

and this bucket has: (slightly abbreviated)

[...]

There have been a few of these from what I can see:

    radosgw-admin reshard list | grep 65521 | wc -l
    10

so while I don't know how they appear, they don't seem rare here.

I want to make sure I'm clear on what you see as the connection between shard count and a bad num_objects in rgw.none. Are you saying these bad num_objects values are associated with the number of shards being 65521?

65521 is defined as the maximum number of shards. So we would expect to see it when the number of objects was at or larger than 6.5 billion (65521 * 100000).

Yes, the auto-sharder seems to react to the crazy high number and aims to shard the bucket accordingly, which fails, and then it is stuck wanting to create 65521 shards, while the negative number stays until I run "bucket check --fix". I'm not sure whether it returns later, or what triggers the underflow below zero, but some kind of race while emptying/deleting objects springs to mind.
