Bug #52406

open

cephfs_metadata pool got full after upgrade from Nautilus to Pacific 16.2.5

Added by Denis Polom over 2 years ago. Updated over 2 years ago.

Status:
Need More Info
Priority:
Normal
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi

I have the following setup on my Ceph cluster:

cephfs_metadata pool - uses a crush rule that selects only SSD devices not used by any other pool, with replica size 3
cephfs_data pool - uses a crush rule that selects only HDD devices, erasure coded
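
For reference, a layout like the one described could be built roughly as follows. This is a hedged sketch, not the reporter's actual commands: the rule name `ssd-only`, the EC profile name `hdd-ec`, and the k/m values are illustrative; the pool names and PG counts are taken from the `ceph df` output below.

```shell
# Replicated rule restricted to the SSD device class (rule name is illustrative)
ceph osd crush rule create-replicated ssd-only default host ssd

# EC profile restricted to the HDD device class (profile name and k/m are illustrative)
ceph osd erasure-code-profile set hdd-ec k=4 m=2 crush-device-class=hdd

# Metadata pool: 3x replicated on SSDs only
ceph osd pool create cephfs_metadata 128 128 replicated ssd-only
ceph osd pool set cephfs_metadata size 3

# Data pool: erasure coded on HDDs only; EC data pools need overwrites enabled for CephFS
ceph osd pool create cephfs_data 8192 8192 erasure hdd-ec
ceph osd pool set cephfs_data allow_ec_overwrites true

# --force is required when the data pool is erasure coded
ceph fs new cephfs cephfs_metadata cephfs_data --force
```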

SSD utilization before the upgrade was about 1%. After the upgrade, SSD utilization started rising by about 15% per day.

ID   CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP      META      AVAIL    %USE   VAR   PGS  STATUS
119    ssd  0.10918   1.00000  112 GiB   58 GiB   57 GiB   381 MiB   518 MiB   54 GiB  51.92  0.85  128      up
100    ssd  0.10918   1.00000  112 GiB   58 GiB   57 GiB   334 MiB   519 MiB   54 GiB  51.88  0.85  128      up
 82    ssd  0.10918   1.00000  112 GiB   58 GiB   57 GiB   405 MiB   494 MiB   54 GiB  51.92  0.85  128      up
CLASS     SIZE    AVAIL     USED  RAW USED  %RAW USED
hdd    192 TiB   75 TiB  118 TiB   118 TiB      61.18
ssd    335 GiB  161 GiB  174 GiB   174 GiB      51.90
TOTAL  192 TiB   75 TiB  118 TiB   118 TiB      61.17

--- POOLS ---
POOL                   ID   PGS   STORED  OBJECTS     USED  %USED  MAX AVAIL
cephfs_data             1  8192   50 TiB   16.32M   87 TiB  61.25     33 TiB
cephfs_metadata         2   128  1.3 GiB    1.99k  4.0 GiB   2.66     48 GiB
device_health_metrics   3     1   89 MiB      438  177 MiB      0     28 TiB

until the metadata pool got full. I checked the crush map and crush rules, and all was OK; there was no misconfiguration. We added new SSDs, and even their utilization rose immediately. I tried draining the OSDs on the SSD drives one by one and recreating each OSD, but it didn't help.
After restarting the cluster, all PGs on the metadata pool became unknown.
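
For completeness, the checks and the drain procedure described above can be done with commands like these (the OSD id 119 is just one of the SSD OSDs from the `ceph osd df` snippet above, used as an example):

```shell
# Verify which crush rule each pool actually uses, and what the rules select
ceph osd crush rule dump
ceph osd pool get cephfs_metadata crush_rule

# Per-OSD utilization laid out along the crush tree
ceph osd df tree

# Drain one SSD OSD by pushing its crush weight to 0, which migrates data off it
ceph osd crush reweight osd.119 0
# ...wait for backfill to finish, then take it out before recreating it
ceph osd out 119
```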

Because the data on the cluster wasn't production data, I decided to recreate the CephFS.
I removed CephFS and all its pools from the cluster, but the OSDs remained utilized anyway (I waited for hours):

# ceph -s
  cluster:
    id:     aac4b123-8351-4442-a07c-e2c62f15591b
    health: HEALTH_WARN
            noout flag(s) set
            3 nearfull osd(s)

  services:
    mon: 3 daemons, quorum cache2-mon2,cache2-mon3,cache2-mon1 (age 20s)
    mgr: cache2-mon3(active, since 52s), standbys: cache2-mon1, cache2-mon2
    osd: 399 osds: 399 up (since 34m), 399 in (since 73m)
         flags noout

  data:
    pools:   0 pools, 0 pgs
    objects: 0 objects, 0 B
    usage:   54 TiB used, 135 TiB / 188 TiB avail
    pgs:

So I purged all OSDs and created them again, then filled CephFS with the same data as before destroying it. Now everything looks normal; utilization is reasonable, as it was before the upgrade.
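
A purge-and-recreate cycle like the one described would look roughly like this per OSD; this is a sketch, not the exact procedure used, and the OSD id and block device path are illustrative (this assumes ceph-volume with LVM):

```shell
# Stop the OSD daemon first (noout was already set, so no rebalancing starts)
systemctl stop ceph-osd@119

# Purge removes the OSD from the crush map, auth keys, and the osdmap in one step
ceph osd purge 119 --yes-i-really-mean-it

# Recreate the OSD on the wiped device (device path is a placeholder)
ceph-volume lvm zap /dev/sdX --destroy
ceph-volume lvm create --data /dev/sdX
```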

There is definitely something wrong with upgrading from the latest Nautilus to Pacific, and I was lucky that the data wasn't production.
