Bug #15416
ceph df reports incorrect pool usage
Status: Closed
Description
Hi guys
We're currently facing an outage with our OpenStack cluster, which is backed by Ceph. The issue appears to be related to the Cinder cold-store pool and the number of objects Ceph thinks it contains:
GLOBAL:
    SIZE   AVAIL  RAW USED  %RAW USED
    165T   160T   5100G     3.00
POOLS:
    NAME                      ID  USED    %USED  MAX AVAIL  OBJECTS
    cinder-volumes_p02        3   8E      0      53943G     -9223372036854751073
    glance-images_p02         4   1178G   0.69   53943G     92985
    ephemeral-vms_p02         5   229G    0.14   53943G     44052
    cinder-volumes-cache_p02  6   16585M  0      651G       26914
    glance-images-cache_p02   7   14718M  0      651G       12350
    ephemeral-vms-cache_p02   8   50774M  0.03   651G       85424
As you can see, the cinder-volumes_p02 pool reports a huge number in both the USED and OBJECTS columns - what could be causing this, and how can we get it re-read?
We can create RBD devices in the affected pool manually using the rbd client, but Cinder refuses to use the pool - I presume it thinks the pool is full.
We're running Infernalis 9.2.1 on Trusty - if there are any pertinent logs I can provide to help then please let me know.
Updated by Blair Bethwaite almost 8 years ago
Simon Weald wrote:
As you can see, the cinder-volumes_p02 pool reports a huge number in both the USED and OBJECTS columns - what could be causing this, and how can we get it re-read?
What does "rbd ls -l" show you? I'm guessing you might have inadvertently created a huge volume, cinder may then be rejecting creation of any new volumes based on your over-provisioning config (I think cinder does this now, what version are you on?). You should check the cinder-volume and/or cinder-scheduler logs (preferably with debug enabled) and update here with some output showing what happens when you try to create a new cinder volume.
The overflowed OBJECTS count looks like it might be a bug.
Cheers,
Blair
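An aside on the numbers themselves: 8E USED is 2^63 bytes (8 EiB), and the OBJECTS figure is INT64_MIN plus 24735 — both consistent with an unsigned 64-bit counter wrapping past the sign bit and being printed through a signed formatter. A quick illustration of the two's-complement arithmetic (not Ceph's actual internals, just the reinterpretation):

```python
import struct

signed = -9223372036854751073  # OBJECTS value shown by `ceph df`

# Reinterpret the same 64-bit pattern as an unsigned integer.
unsigned, = struct.unpack("<Q", struct.pack("<q", signed))

print(unsigned)          # 9223372036854800543
print(unsigned - 2**63)  # 24735, i.e. just past the sign bit
```

So the stored count appears to be a little over 2^63, which a signed printout renders as a huge negative number.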
Updated by Simon Weald almost 8 years ago
Blair Bethwaite wrote:
What does "rbd ls -l" show you?
It didn't paste well so I've dumped it on pastebin here: http://pastebin.com/raw/7r1GbFeN
I'm guessing you might have inadvertently created a huge volume, cinder may then be rejecting creation of any new volumes based on your over-provisioning config (I think cinder does this now, what version are you on?). You should check the cinder-volume and/or cinder-scheduler logs (preferably with debug enabled) and update here with some output showing what happens when you try to create a new cinder volume.
We use vanilla OpenStack packages from the Ubuntu cloud repos, so we're on 2015.1.3-0ubuntu1 (Kilo). We're not setting max_over_subscription_ratio, so it's defaulting to 20; however, the Ceph cluster is currently only at roughly 3% of its raw capacity, and there are currently no restrictions set on pool sizes.
As it was causing disruption for the OpenStack cluster, we moved the data to a new pool and pointed OpenStack at that, so we aren't actually using the problematic pool any more; however, we haven't removed it, in the hope that we can understand what caused this issue.
Thanks for the input!
Simon
Updated by Sage Weil almost 8 years ago
- Status changed from New to Need More Info
Scrub should correct this. Can you trigger a scrub on all pgs in that pool and see if it goes away?
We won't be able to get enough info here to figure out how it happened, but we haven't seen this post-Infernalis, so I'm not too worried. But confirming that scrub fixes it would be helpful.
Thanks!
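For reference, one way to do this is to walk the PG list and scrub each PG in the pool. This is only a sketch: the pool id (3, for cinder-volumes_p02) is taken from the `ceph df` output above, and the `pgs_brief` dump format is assumed to print one `<pool>.<pg>` id per line as it does on Infernalis-era releases.

```shell
#!/usr/bin/env bash
# Sketch: issue a deep scrub for every PG in pool id 3
# (cinder-volumes_p02, per the ceph df output above).
POOL_ID=3

# pgs_brief lines start with the PG id, e.g. "3.1f active+clean ...";
# split on '.' and space, keep rows whose pool prefix matches.
for pg in $(ceph pg dump pgs_brief 2>/dev/null \
            | awk -v p="$POOL_ID" -F'[. ]' '$1 == p { print $1 "." $2 }'); do
    ceph pg deep-scrub "$pg"
done
```

`ceph pg scrub` would request a lighter (metadata-only) scrub instead; later releases also grew per-pool helpers that make this loop unnecessary.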
Updated by Simon Weald almost 8 years ago
Sage Weil wrote:
Scrub should correct this. Can you trigger a scrub on all pgs in that pool and see if it goes away?
We won't be able to get enough info here to figure out how it happened, but we haven't seen this post-Infernalis, so I'm not too worried. But confirming that scrub fixes it would be helpful.
Thanks!
Hi Sage, thanks for coming back on this. I triggered a deep scrub of all pgs in the problematic pool about 12 hours ago, but ceph df is still showing the funky numbers, so no dice.
Regarding the cause, are you able to elaborate at all? This is a live customer-facing cluster and I need to produce an internal incident report, so any information you can share for that would be handy, if at all possible.
Additionally, when you say post-infernalis, do you mean Jewel onwards?
Thanks!
Simon
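One way to confirm the deep scrub actually covered every PG in the pool is to check each PG's last deep-scrub timestamp. A sketch, assuming plain `ceph pg dump` output whose first line is the column header containing `deep_scrub_stamp` (adjust if your version prints a preamble first):

```shell
# Print each PG in pool id 3 with its last deep-scrub timestamp,
# locating the deep_scrub_stamp column from the header row.
ceph pg dump 2>/dev/null | awk '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "deep_scrub_stamp") c = i }
    $1 ~ /^3\./ { print $1, $c }'
```

Any PG whose stamp predates the scrub request either hasn't been scheduled yet or never completed.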
Updated by Greg Farnum almost 7 years ago
- Status changed from Need More Info to Can't reproduce
Infernalis is done.