Bug #40011

ceph -s shows wrong number of pools when pool was deleted

Added by Jan Fajerski almost 5 years ago. Updated over 3 years ago.

Status:
Resolved
Priority:
High
Assignee:
Daniel Oliveira
Category:
-
Target version:
-
% Done:
0%

Source:
Tags:
Backport:
nautilus, mimic
Regression:
No
Severity:
3 - minor
Reviewed:
ceph-qa-suite:
Pull request ID:
31560
Crash signature (v1):
Crash signature (v2):

Description

This is reproducible in a vstart cluster:

 MDS=0 ../src/vstart.sh -n -b -d
 bin/ceph osd pool create foo 12
 bin/ceph osd pool create bar 12
 bin/ceph osd pool create foobar 12
 bin/ceph -s
 bin/ceph tell mon.\* injectargs '--mon-allow-pool-delete=true'
 bin/ceph osd pool rm foo foo --yes-i-really-really-mean-it
 bin/ceph -s
 bin/ceph osd lspools

"ceph -s" show the following at the first invocation:

*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
2019-05-23 10:26:46.503 7fbb7db4c700 -1 WARNING: all dangerous and experimental features are enabled.
2019-05-23 10:26:46.519 7fbb7db4c700 -1 WARNING: all dangerous and experimental features are enabled.
  cluster:
    id:     d240be1a-33ca-483d-94e7-aadc47d6e8a4
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum a,b,c (age 18m)
    mgr: x(active, since 17m)
    osd: 3 osds: 3 up (since 17m), 3 in (since 17m)

  data:
    pools:   3 pools, 36 pgs
    objects: 0 objects, 0 B
    usage:   6.0 GiB used, 27 GiB / 33 GiB avail
    pgs:     36 active+clean

After deleting the pool:

*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
2019-05-23 10:27:02.763 7f9f5f7d2700 -1 WARNING: all dangerous and experimental features are enabled.
2019-05-23 10:27:02.783 7f9f5f7d2700 -1 WARNING: all dangerous and experimental features are enabled.
  cluster:
    id:     d240be1a-33ca-483d-94e7-aadc47d6e8a4
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum a,b,c (age 18m)
    mgr: x(active, since 18m)
    osd: 3 osds: 3 up (since 17m), 3 in (since 17m)

  data:
    pools:   3 pools, 24 pgs
    objects: 0 objects, 0 B
    usage:   6.0 GiB used, 27 GiB / 33 GiB avail
    pgs:     24 active+clean

Note that the PG count changes as expected, but the number of pools does not. "ceph osd lspools" is not affected.
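
For anyone triaging this, a quick way to surface the mismatch is to pull the pool count from both places and compare; a rough sketch for a vstart cluster, assuming jq is available and that the status JSON exposes the count under "pgmap.num_pools":

 # pgmap-derived count, i.e. what "ceph -s" prints (JSON path assumed)
 bin/ceph -s -f json | jq '.pgmap.num_pools'
 # osdmap-derived count, i.e. what "ceph osd lspools" reflects
 bin/ceph osd lspools | wc -l

When the bug is present, the first number stays at the pre-deletion value while the second drops immediately.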


Related issues 9 (0 open, 9 closed)

Related to mgr - Bug #40871: osd status reports old crush location after osd moves (Resolved; Kefu Chai)
Has duplicate mgr - Bug #41414: OSDMonitor: deleted pool still shown in stats via `ceph status` (Duplicate)
Has duplicate Ceph - Bug #41832: Different pools count in ceph -s and ceph osd pool ls (Duplicate; 09/14/2019)
Has duplicate RADOS - Bug #41944: inconsistent pool count in ceph -s output (Resolved; 09/20/2019)
Has duplicate RADOS - Bug #42592: ceph-mon/mgr PGstat Segmentation Fault (Duplicate; 11/01/2019)
Has duplicate RADOS - Bug #42689: nautilus mon/mgr: ceph status:pool number display is not right (Duplicate; 11/08/2019)
Has duplicate CephFS - Bug #41228: mon: deleting a CephFS and its pools causes MONs to crash (Resolved)
Copied to mgr - Backport #42857: mimic: ceph -s shows wrong number of pools when pool was deleted (Rejected)
Copied to mgr - Backport #42858: nautilus: ceph -s shows wrong number of pools when pool was deleted (Resolved; Nathan Cutler)

Actions #1

Updated by Jan Fajerski almost 5 years ago

  • Affected Versions v15.0.0 added
Actions #2

Updated by Nathan Cutler almost 5 years ago

  • Backport set to nautilus
Actions #3

Updated by Nathan Cutler almost 5 years ago

  • Regression changed from No to Yes
Actions #4

Updated by Nathan Cutler almost 5 years ago

  • Affected Versions v14.2.0, v14.2.1, v14.2.2 added
Actions #5

Updated by Jan Fajerski almost 5 years ago

  • Regression changed from Yes to No
  • Affected Versions deleted (v15.0.0)

It actually shows the correct number of pools (2) for a short time and then displays the erroneous 3 pools after a few seconds.

Actions #6

Updated by Greg Farnum almost 5 years ago

  • Project changed from RADOS to mgr
  • Priority changed from Normal to High

This data is actually sourced from the manager's pgstats. It has turned up on the mailing list a couple of times and is resolved by restarting the manager.
I took a brief look and really don’t see how it could be going wrong. Maybe it’s not going wrong when the manager handles a new osdmap but rather when it propagates that state elsewhere, with some protocol issue?
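
For clusters hitting this in the field, the workaround Greg mentions boils down to bouncing the active mgr; in the vstart reproducer above that could look roughly like this (the mgr name "x" matches the status output earlier in this ticket):

 # fail the active mgr so it (or a standby) re-registers and rebuilds its pg/pool stats
 bin/ceph mgr fail x
 # once the mgr is back, the pool count should match "ceph osd lspools" again
 bin/ceph -s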

Actions #8

Updated by Noah Watkins almost 5 years ago

It looks to me like `ceph status` is getting this state not from the ceph-mgr but from the MgrStatMonitor PaxosService. The difference between lspools and the pool count is that the former comes from the osdmap while the latter comes from the pgmap.
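
To make the two sources explicit, the osdmap and pgmap views can be dumped side by side; a rough sketch:

 # osdmap view: the pool list maintained by the OSDMonitor (matches "ceph osd lspools")
 bin/ceph osd dump | grep '^pool '
 # pgmap view: per-pool stats tracked by the mgr and mirrored into MgrStatMonitor;
 # a deleted pool lingering here lines up with the wrong count in "ceph -s"
 bin/ceph pg dump pools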

Actions #9

Updated by Daniel Oliveira over 4 years ago

I started investigating this last week. I was only able to reproduce it once so far.

Actions #10

Updated by Daniel Oliveira over 4 years ago

Still checking if we have an environment where this could be reproducible at will since I was only able to see the behavior once.

Actions #11

Updated by Sebastian Wagner over 4 years ago

  • Assignee set to Daniel Oliveira
Actions #12

Updated by Sebastian Wagner over 4 years ago

  • Related to Bug #40871: osd status reports old crush location after osd moves added
Actions #13

Updated by Neha Ojha over 4 years ago

  • Has duplicate Bug #41414: OSDMonitor: deleted pool still shown in stats via `ceph status` added
Actions #14

Updated by Kefu Chai over 4 years ago

  • Assignee changed from Daniel Oliveira to Kefu Chai

assigning it to myself to see if it's a dup.

Actions #15

Updated by Daniel Oliveira over 4 years ago

@Kefu Chai,

Just trying to understand: did you assign it to yourself to check whether it is a dup of something else? Would you like me to still check on it?

Thanks,
-Daniel

Actions #16

Updated by Nathan Cutler over 4 years ago

  • Has duplicate Bug #41832: Different pools count in ceph -s and ceph osd pool ls added
Actions #17

Updated by Kefu Chai over 4 years ago

  • Assignee changed from Kefu Chai to Daniel Oliveira

@Daniel I assigned it to myself temporarily to see if #40871 is a dup of this one, in the hope of resolving them together, but it seems they are different. Sorry for hijacking this ticket from you!

i am returning it to you.

Actions #18

Updated by Kefu Chai over 4 years ago

Not reproducible on master (261fab6465877862f777c9e6a7225863472cd53a), nautilus v14.2.0, nautilus v14.2.2, or nautilus HEAD (v14.2.4-27-g462e659cea).

Actions #19

Updated by Daniel Oliveira over 4 years ago

@Kefu Chai,

No problem at all! I just wanted to make sure I was on the same page!
Also, your comment https://tracker.ceph.com/issues/40011#note-18 explains why I wasn't able to reproduce it and ended up helping me to validate it.

Thanks!

Actions #20

Updated by Nathan Cutler over 4 years ago

I wonder if the messenger is involved here? If it happens more often in downstream products, that might be because msgr version 1 is in use there, while Kefu's and Daniel's attempts might have been using msgr2?

(Just thinking out loud after reading Greg's comment #40011-6)
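
If someone wants to rule the messenger in or out, it should be enough to check whether msgr2 binding is enabled and what the mons actually advertise; a sketch for a vstart cluster (queries mon.a's admin socket, which vstart sets up):

 # is msgr2 binding enabled on the running mon?
 bin/ceph daemon mon.a config get ms_bind_msgr2
 # do the mon address vectors contain v2 (msgr2) endpoints, or only v1?
 bin/ceph mon dump | grep -E 'v1:|v2:'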

Actions #21

Updated by Kefu Chai over 4 years ago

Nathan, that's plausible. I didn't adjust "ms_bind_msgr2", and I think "ms_bind_msgr2=true" has been around since v14.1.0:

$ git tag --contains 40a7dfbb1f25cae7cea68de18af981cb3a1b980f
v14.1.0
v14.1.1
v14.2.0
v14.2.1
v14.2.2
v14.2.3
v15.0.0
Actions #22

Updated by Jan Fajerski over 4 years ago

This still reproduces for me on current master:

jan@ws ~/code/ceph/ceph/build (git)-[master] % bin/ceph osd pool rm foo foo --yes-i-really-really-mean-it
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
2019-09-19T09:02:55.409+0200 7fcf49ab9700 -1 WARNING: all dangerous and experimental features are enabled.
2019-09-19T09:02:55.449+0200 7fcf49ab9700 -1 WARNING: all dangerous and experimental features are enabled.
pool 'foo' removed
jan@ws ~/code/ceph/ceph/build (git)-[master] % bin/ceph -s
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
2019-09-19T09:03:01.793+0200 7fb389d32700 -1 WARNING: all dangerous and experimental features are enabled.
2019-09-19T09:03:01.817+0200 7fb389d32700 -1 WARNING: all dangerous and experimental features are enabled.
  cluster:
    id:     151dbb4b-8bf7-452f-9f81-0c3968859117
    health: HEALTH_WARN
            3 pools have too many placement groups

  services:
    mon: 3 daemons, quorum a,b,c (age 2m)
    mgr: x(active, since 2m)
    mds: a:1 {0=a=up:active} 2 up:standby
    osd: 3 osds: 3 up (since 102s), 3 in (since 102s)

  task status:
    scrub status:
        mds.0: idle

  data:
    pools:   5 pools, 48 pgs
    objects: 22 objects, 2.2 KiB
    usage:   6.0 GiB used, 3.0 TiB / 3.0 TiB avail
    pgs:     48 active+clean

jan@ws ~/code/ceph/ceph/build (git)-[master] % bin/ceph osd lspools
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
2019-09-19T09:03:08.125+0200 7fbb616de700 -1 WARNING: all dangerous and experimental features are enabled.
2019-09-19T09:03:08.157+0200 7fbb616de700 -1 WARNING: all dangerous and experimental features are enabled.
1 cephfs.a.meta
2 cephfs.a.data
4 bar
5 foobar
jan@ws ~/code/ceph/ceph/build (git)-[master] % bin/ceph --version
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
ceph version v15.0.0-5218-g3d7e5b0e3f (3d7e5b0e3fcf0dda9c664175ce6a0c0f3673a662) octopus (dev)

Actions #23

Updated by Jan Fajerski over 4 years ago

  • Affected Versions v15.0.0 added
Actions #24

Updated by Nathan Cutler over 4 years ago

  • Has duplicate Bug #41944: inconsistent pool count in ceph -s output added
Actions #25

Updated by Daniel Oliveira over 4 years ago

@Jan,

Thanks for the update! I will redeploy my test environment and recheck it.

Actions #26

Updated by Sage Weil over 4 years ago

This bug is probably somewhere in PGMap.cc -- that's where the pool count comes from, and that structure is updated in awkward ways by examining new OSDMap updates. It also happens on the mgr and is reported periodically to the mon, so it is normal for this mismatch to exist for 1-2 seconds (but not longer than that).
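
Given that a short-lived mismatch is expected, it helps to distinguish the transient case from the stuck one by polling for a while; a minimal sketch, reusing the assumed "pgmap.num_pools" JSON path from above:

 # poll for ~30 seconds: the counts should converge within a couple of seconds,
 # and the bug is only confirmed if they still disagree at the end
 for i in $(seq 30); do
     echo "pgmap: $(bin/ceph -s -f json | jq '.pgmap.num_pools')  osdmap: $(bin/ceph osd lspools | wc -l)"
     sleep 1
 done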

Actions #27

Updated by Kefu Chai over 4 years ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 31560

Per the downstream BZ, that's not the case.

Actions #28

Updated by Neha Ojha over 4 years ago

  • Related to Bug #42689: nautilus mon/mgr: ceph status:pool number display is not right added
Actions #29

Updated by Neha Ojha over 4 years ago

  • Related to Bug #42592: ceph-mon/mgr PGstat Segmentation Fault added
Actions #30

Updated by Neha Ojha over 4 years ago

  • Related to Bug #41228: mon: deleting a CephFS and its pools causes MONs to crash added
Actions #31

Updated by Kefu Chai over 4 years ago

  • Related to deleted (Bug #42592: ceph-mon/mgr PGstat Segmentation Fault)
Actions #32

Updated by Kefu Chai over 4 years ago

  • Related to deleted (Bug #42689: nautilus mon/mgr: ceph status:pool number display is not right)
Actions #33

Updated by Kefu Chai over 4 years ago

  • Has duplicate Bug #42592: ceph-mon/mgr PGstat Segmentation Fault added
Actions #34

Updated by Kefu Chai over 4 years ago

  • Has duplicate Bug #42689: nautilus mon/mgr: ceph status:pool number display is not right added
Actions #35

Updated by Kefu Chai over 4 years ago

  • Related to deleted (Bug #41228: mon: deleting a CephFS and its pools causes MONs to crash)
Actions #36

Updated by Kefu Chai over 4 years ago

  • Status changed from Fix Under Review to Pending Backport
  • Backport changed from nautilus to nautilus, mimic
Actions #37

Updated by Kefu Chai over 4 years ago

  • Has duplicate Bug #41228: mon: deleting a CephFS and its pools causes MONs to crash added
Actions #38

Updated by Nathan Cutler over 4 years ago

  • Copied to Backport #42857: mimic: ceph -s shows wrong number of pools when pool was deleted added
Actions #39

Updated by Nathan Cutler over 4 years ago

  • Copied to Backport #42858: nautilus: ceph -s shows wrong number of pools when pool was deleted added
Actions #40

Updated by Nathan Cutler over 3 years ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
