Bug #15416 (closed): ceph df reports incorrect pool usage

Added by Simon Weald about 8 years ago. Updated almost 7 years ago.

Status: Can't reproduce
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: other
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi guys

We're currently facing an outage with our OpenStack cluster, which is backed by Ceph. The issue appears to be related to the Cinder cold-store pool and the number of objects Ceph thinks it contains:

GLOBAL:
    SIZE     AVAIL     RAW USED     %RAW USED 
    165T      160T        5100G          3.00 
POOLS:
    NAME                         ID     USED       %USED     MAX AVAIL     OBJECTS              
    cinder-volumes_p02           3          8E         0        53943G     -9223372036854751073 
    glance-images_p02            4       1178G      0.69        53943G                    92985 
    ephemeral-vms_p02            5        229G      0.14        53943G                    44052 
    cinder-volumes-cache_p02     6      16585M         0          651G                    26914 
    glance-images-cache_p02      7      14718M         0          651G                    12350 
    ephemeral-vms-cache_p02      8      50774M      0.03          651G                    85424 

As you can see, the cinder-volumes_p02 pool reports a huge number in both the USED and OBJECTS columns - what could be causing this, and how can we get it re-read?
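
Purely as an illustration of where those figures sit (this is just arithmetic, not a claim about how the counters got into this state, and it assumes the pool stats are signed 64-bit counters):

    # The reported OBJECTS value is exactly INT64_MIN + 24735:
    echo $(( -9223372036854775807 - 1 + 24735 ))   # -9223372036854751073

    # The same bit pattern read as an unsigned 64-bit value is just over 2^63:
    printf '%u\n' -9223372036854751073             # 9223372036854800543

    # 2^63 bytes is exactly 8 EiB, so a byte counter in a similar state would
    # pretty-print as roughly "8E" - consistent with the USED column above.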

We can create RBD devices in the affected pool using the rbd client manually, but Cinder refuses to use the pool as I presume it thinks that it is full.

We're running Infernalis 9.2.1 on Trusty - if there are any pertinent logs I can provide to help then please let me know.

#1

Updated by Blair Bethwaite almost 8 years ago

Simon Weald wrote:

As you can see, the cinder-volumes_p02 pool reports a huge number in both the USED and OBJECTS column - what could be causing this, and how can we get it re-read?

What does "rbd ls -l" show you? I'm guessing you might have inadvertently created a huge volume; Cinder may then be rejecting creation of any new volumes based on your over-provisioning config (I think Cinder does this now - what version are you on?). You should check the cinder-volume and/or cinder-scheduler logs (preferably with debug enabled) and update here with some output showing what happens when you try to create a new Cinder volume.
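
Roughly what I mean (a sketch - the pool name is taken from your ceph df output, and the log paths assume the stock Ubuntu packaging):

    # List the images in the affected pool with their sizes, to spot an
    # unexpectedly huge volume
    rbd ls -l cinder-volumes_p02

    # With debug = True set in cinder.conf, watch the scheduler/volume logs
    # while creating a test volume
    tail -f /var/log/cinder/cinder-scheduler.log /var/log/cinder/cinder-volume.log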

The overflowed OBJECTS count looks like it might be a bug.

Cheers,
Blair

#2

Updated by Simon Weald almost 8 years ago

Blair Bethwaite wrote:

What does "rbd ls -l" show you?

It didn't paste well so I've dumped it on pastebin here: http://pastebin.com/raw/7r1GbFeN

I'm guessing you might have inadvertently created a huge volume; Cinder may then be rejecting creation of any new volumes based on your over-provisioning config (I think Cinder does this now - what version are you on?). You should check the cinder-volume and/or cinder-scheduler logs (preferably with debug enabled) and update here with some output showing what happens when you try to create a new Cinder volume.

We use vanilla OpenStack packages from the Ubuntu cloud repos, so we're on 2015.1.3-0ubuntu1 (Kilo). We're not setting max_over_subscription_ratio, so it's defaulting to a value of 20; however, the Ceph cluster is currently only at roughly 3% of its raw capacity, and there are no restrictions set on pool sizes.
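
For reference, a quick way to confirm nothing is overriding it (assuming the stock config location):

    # max_over_subscription_ratio can be set in [DEFAULT] or a per-backend
    # section of cinder.conf; no matches means the Kilo default of 20.0 applies
    grep -n max_over_subscription_ratio /etc/cinder/cinder.conf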

As it was causing disruption for the OpenStack cluster, we moved the data to a new pool and pointed OpenStack at that, so we aren't actually using the problematic pool any more; however, we haven't removed it, in the hope that we can understand what caused this issue.

Thanks for the input!

Simon

#3

Updated by Sage Weil almost 8 years ago

  • Status changed from New to Need More Info

Scrub should correct this. Can you trigger a scrub on all pgs in that pool and see if it goes away?
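
Something along these lines should do it (a sketch - pool 3 per the ceph df output above; the exact pg dump output format varies a bit between releases):

    # PG IDs are prefixed with the pool ID (cinder-volumes_p02 is pool 3),
    # so pick them out of a brief pg dump and deep-scrub each one
    ceph pg dump pgs_brief | awk '$1 ~ /^3\./ {print $1}' |
        while read pg; do ceph pg deep-scrub "$pg"; done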

We won't be able to get enough info here to figure out how it happened, but we haven't seen this post-Infernalis, so I'm not too worried. But confirming that scrub fixes it would be helpful.

Thanks!

#4

Updated by Simon Weald almost 8 years ago

Sage Weil wrote:

Scrub should correct this. Can you trigger a scrub on all pgs in that pool and see if it goes away?

We won't be able to get enough info here to figure out how it happened, but we haven't seen this post-Infernalis, so I'm not too worried. But confirming that scrub fixes it would be helpful.

Thanks!

Hi Sage, thanks for coming back on this. I triggered a deep scrub of all pgs in the problematic pool about 12 hours ago, but ceph df is still showing the funky numbers, so no dice.

Regarding the cause, are you able to elaborate at all? This is a live, customer-facing cluster and I need to produce an internal incident report, so any information you can share would be handy, if at all possible.

Additionally, when you say post-infernalis, do you mean Jewel onwards?

Thanks!

Simon

#5

Updated by Greg Farnum almost 7 years ago

  • Status changed from Need More Info to Can't reproduce

Infernalis is done.
