Bug #40011

ceph -s shows wrong number of pools when pool was deleted

Added by Jan Fajerski 11 months ago. Updated 5 months ago.

Status:
Pending Backport
Priority:
High
Category:
-
Target version:
-
% Done:
0%

Source:
Tags:
Backport:
nautilus, mimic
Regression:
No
Severity:
3 - minor
Reviewed:
ceph-qa-suite:
Pull request ID:
Crash signature:

Description

This is reproducible in a vstart cluster:

 MDS=0 ../src/vstart.sh -n -b -d
 bin/ceph osd pool create foo 12
 bin/ceph osd pool create bar 12
 bin/ceph osd pool create foobar 12
 bin/ceph -s
 bin/ceph tell mon.\* injectargs '--mon-allow-pool-delete=true'
 bin/ceph osd pool rm foo foo --yes-i-really-really-mean-it
 bin/ceph -s
 bin/ceph osd lspools

"ceph -s" shows the following at the first invocation:

*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
2019-05-23 10:26:46.503 7fbb7db4c700 -1 WARNING: all dangerous and experimental features are enabled.
2019-05-23 10:26:46.519 7fbb7db4c700 -1 WARNING: all dangerous and experimental features are enabled.
  cluster:
    id:     d240be1a-33ca-483d-94e7-aadc47d6e8a4
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum a,b,c (age 18m)
    mgr: x(active, since 17m)
    osd: 3 osds: 3 up (since 17m), 3 in (since 17m)

  data:
    pools:   3 pools, 36 pgs
    objects: 0 objects, 0 B
    usage:   6.0 GiB used, 27 GiB / 33 GiB avail
    pgs:     36 active+clean

After deleting the pool:

*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
2019-05-23 10:27:02.763 7f9f5f7d2700 -1 WARNING: all dangerous and experimental features are enabled.
2019-05-23 10:27:02.783 7f9f5f7d2700 -1 WARNING: all dangerous and experimental features are enabled.
  cluster:
    id:     d240be1a-33ca-483d-94e7-aadc47d6e8a4
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum a,b,c (age 18m)
    mgr: x(active, since 18m)
    osd: 3 osds: 3 up (since 17m), 3 in (since 17m)

  data:
    pools:   3 pools, 24 pgs
    objects: 0 objects, 0 B
    usage:   6.0 GiB used, 27 GiB / 33 GiB avail
    pgs:     24 active+clean

Note that the PG count changes as expected, but the number of pools does not. "ceph osd lspools" is not affected.
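
To make the mismatch easier to spot, here is a minimal sketch that pulls the pool count out of the plain-text "ceph -s" output and compares it with the number of pools listed by "ceph osd lspools". The function names (status_pool_count, lspools_count, compare_pool_counts) are made up for illustration. It assumes the "pools:   N pools, M pgs" line format shown above and one pool per line in the lspools output, which may not hold on older releases.

```shell
#!/bin/sh
# Hypothetical helpers, not part of Ceph. They only parse text on stdin.

# Print the pool count from plain-text `ceph -s` output.
# Assumes a line of the form "pools:   3 pools, 24 pgs".
status_pool_count() {
    awk '$1 == "pools:" { print $2; exit }'
}

# Print the number of pools in `ceph osd lspools` output.
# Assumes one "<id> <name>" pair per line; other lines are ignored.
lspools_count() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 2 { n++ } END { print n + 0 }'
}

# Compare the two counts; returns non-zero when they differ.
compare_pool_counts() {
    if [ "$1" = "$2" ]; then
        echo "ok: $1 pools"
    else
        echo "mismatch: status=$1 lspools=$2"
        return 1
    fi
}
```

In the vstart cluster this could be used as:

 compare_pool_counts "$(bin/ceph -s 2>/dev/null | status_pool_count)" "$(bin/ceph osd lspools 2>/dev/null | lspools_count)"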


Related issues

Related to mgr - Bug #40871: osd status reports old crush location after osd moves Pending Backport 07/22/2019
Duplicated by mgr - Bug #41414: OSDMonitor: deleted pool still shown in stats via `ceph status` Duplicate
Duplicated by Ceph - Bug #41832: Different pools count in ceph -s and ceph osd pool ls Duplicate 09/14/2019
Duplicated by RADOS - Bug #41944: inconsistent pool count in ceph -s output Need More Info 09/20/2019
Duplicated by RADOS - Bug #42592: ceph-mon/mgr PGstat Segmentation Fault Duplicate 11/01/2019
Duplicated by RADOS - Bug #42689: nautilus mon/mgr: ceph status:pool number display is not right Duplicate 11/08/2019
Duplicated by fs - Bug #41228: mon: deleting a CephFS and its pools causes MONs to crash Duplicate
Copied to mgr - Backport #42857: mimic: ceph -s shows wrong number of pools when pool was deleted New
Copied to mgr - Backport #42858: nautilus: ceph -s shows wrong number of pools when pool was deleted Resolved

History

#1 Updated by Jan Fajerski 11 months ago

  • Affected Versions v15.0.0 added

#2 Updated by Nathan Cutler 11 months ago

  • Backport set to nautilus

#3 Updated by Nathan Cutler 11 months ago

  • Regression changed from No to Yes

#4 Updated by Nathan Cutler 11 months ago

  • Affected Versions v14.2.0, v14.2.1, v14.2.2 added

#5 Updated by Jan Fajerski 11 months ago

  • Regression changed from Yes to No
  • Affected Versions deleted (v15.0.0)

It actually shows the correct number of pools (2) for a short time and then displays the erroneous 3 pools after a few seconds.

#6 Updated by Greg Farnum 11 months ago

  • Project changed from RADOS to mgr
  • Priority changed from Normal to High

This data is actually sourced from the manager’s pgstats. It has turned up on the mailing list a couple of times and is resolved by restarting the manager.
I took a brief look and really don’t see how it could be going wrong. Maybe it’s not going wrong when the manager handles a new osdmap but rather when it propagates that state elsewhere, with some protocol issue?

#8 Updated by Noah Watkins 10 months ago

It looks to me like `ceph status` is getting this state not from the ceph-mgr but from the MgrStatMonitor PaxosService. The difference between `lspools` and the pool count is that the former comes from the osdmap while the latter comes from the pgmap.

#9 Updated by Daniel Oliveira 8 months ago

I started investigating this last week. I was only able to reproduce it once so far.

#10 Updated by Daniel Oliveira 8 months ago

Still checking if we have an environment where this could be reproducible at will since I was only able to see the behavior once.

#11 Updated by Sebastian Wagner 8 months ago

  • Assignee set to Daniel Oliveira

#12 Updated by Sebastian Wagner 8 months ago

  • Related to Bug #40871: osd status reports old crush location after osd moves added

#13 Updated by Neha Ojha 8 months ago

  • Duplicated by Bug #41414: OSDMonitor: deleted pool still shown in stats via `ceph status` added

#14 Updated by Kefu Chai 7 months ago

  • Assignee changed from Daniel Oliveira to Kefu Chai

assigning it to myself to see if it's a dup.

#15 Updated by Daniel Oliveira 7 months ago

@Kefu,

Just trying to understand: you assigned it to yourself to check whether it is a dup of what? Would you like me to still check on it?

Thanks,
-Daniel

#16 Updated by Nathan Cutler 7 months ago

  • Duplicated by Bug #41832: Different pools count in ceph -s and ceph osd pool ls added

#17 Updated by Kefu Chai 7 months ago

  • Assignee changed from Kefu Chai to Daniel Oliveira

@Daniel i assigned it to myself temporarily to see if #40871 is a dup of this one, in the hope of resolving them together, but it seems they are different. sorry for hijacking this ticket from you!

i am returning it to you.

#18 Updated by Kefu Chai 7 months ago

not reproducible on master (261fab6465877862f777c9e6a7225863472cd53a), nautilus v14.2.0, nautilus v14.2.2, or nautilus HEAD (v14.2.4-27-g462e659cea).

#19 Updated by Daniel Oliveira 7 months ago

@Kefu,

No problem at all! I just wanted to make sure I was on the same page!
Also, your comment https://tracker.ceph.com/issues/40011#note-18 explains why I wasn't able to reproduce it and ended up helping me to validate it.

Thanks!

#20 Updated by Nathan Cutler 7 months ago

I wonder if the messenger is involved here? If it happens more often in downstream products, that might be because msgr version 1 is in use there, while Kefu's and Daniel's attempts might have been using msgr2?

(Just thinking out loud after reading Greg's comment #40011-6)

#21 Updated by Kefu Chai 7 months ago

Nathan, that's plausible. i didn't adjust "ms_bind_msgr2". and i think "ms_bind_msgr2=true" has been around since v14.1.0:

$ git tag --contains 40a7dfbb1f25cae7cea68de18af981cb3a1b980f
v14.1.0
v14.1.1
v14.2.0
v14.2.1
v14.2.2
v14.2.3
v15.0.0

#22 Updated by Jan Fajerski 7 months ago

This still reproduces for me on current master

jan@ws ~/code/ceph/ceph/build (git)-[master] % bin/ceph osd pool rm foo foo --yes-i-really-really-mean-it
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
2019-09-19T09:02:55.409+0200 7fcf49ab9700 -1 WARNING: all dangerous and experimental features are enabled.
2019-09-19T09:02:55.449+0200 7fcf49ab9700 -1 WARNING: all dangerous and experimental features are enabled.
pool 'foo' removed
jan@ws ~/code/ceph/ceph/build (git)-[master] % bin/ceph -s
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
2019-09-19T09:03:01.793+0200 7fb389d32700 -1 WARNING: all dangerous and experimental features are enabled.
2019-09-19T09:03:01.817+0200 7fb389d32700 -1 WARNING: all dangerous and experimental features are enabled.
  cluster:
    id:     151dbb4b-8bf7-452f-9f81-0c3968859117
    health: HEALTH_WARN
            3 pools have too many placement groups

  services:
    mon: 3 daemons, quorum a,b,c (age 2m)
    mgr: x(active, since 2m)
    mds: a:1 {0=a=up:active} 2 up:standby
    osd: 3 osds: 3 up (since 102s), 3 in (since 102s)

  task status:
    scrub status:
        mds.0: idle

  data:
    pools:   5 pools, 48 pgs
    objects: 22 objects, 2.2 KiB
    usage:   6.0 GiB used, 3.0 TiB / 3.0 TiB avail
    pgs:     48 active+clean

jan@ws ~/code/ceph/ceph/build (git)-[master] % bin/ceph osd lspools
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
2019-09-19T09:03:08.125+0200 7fbb616de700 -1 WARNING: all dangerous and experimental features are enabled.
2019-09-19T09:03:08.157+0200 7fbb616de700 -1 WARNING: all dangerous and experimental features are enabled.
1 cephfs.a.meta
2 cephfs.a.data
4 bar
5 foobar
jan@ws ~/code/ceph/ceph/build (git)-[master] % bin/ceph --version
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
ceph version v15.0.0-5218-g3d7e5b0e3f (3d7e5b0e3fcf0dda9c664175ce6a0c0f3673a662) octopus (dev)

#23 Updated by Jan Fajerski 7 months ago

  • Affected Versions v15.0.0 added

#24 Updated by Nathan Cutler 7 months ago

  • Duplicated by Bug #41944: inconsistent pool count in ceph -s output added

#25 Updated by Daniel Oliveira 7 months ago

@Jan,

Thanks for the update! I will redeploy my test environment and recheck it.

#26 Updated by Sage Weil 7 months ago

This bug is probably somewhere in PGMap.cc, since that's where the pool count comes from. That structure is updated in awkward ways by examining new OSDMap updates. The update also happens on the mgr and is reported periodically to the mon, so it's normal for this mismatch to be there for 1-2 seconds (but not longer than that).
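
Since a short-lived mismatch is expected, a reproducer probably needs to wait out a grace period before calling it a bug. A rough sketch of that, assuming a one-second polling interval; wait_for_agreement is a hypothetical helper, and the check command it runs is supplied by the caller:

```shell
#!/bin/sh
# Hypothetical helper, not part of Ceph.
# Usage: wait_for_agreement <grace-seconds> <check-command> [args...]
# Re-runs the check once per second until it exits 0 (success) or the
# grace period has elapsed (failure).
wait_for_agreement() {
    grace=$1
    shift
    elapsed=0
    while :; do
        # The check should exit 0 when the pool counts agree.
        "$@" && return 0
        # Give up once the grace period has been used up.
        [ "$elapsed" -ge "$grace" ] && return 1
        sleep 1
        elapsed=$((elapsed + 1))
    done
}
```

For example, with some_check standing in for any command that exits 0 when the pgmap and osdmap pool counts match:

 wait_for_agreement 5 some_check || echo "pool counts still disagree after 5s"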

#27 Updated by Kefu Chai 5 months ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 31560

per the downstream bz, that's not the case.

#28 Updated by Neha Ojha 5 months ago

  • Related to Bug #42689: nautilus mon/mgr: ceph status:pool number display is not right added

#29 Updated by Neha Ojha 5 months ago

  • Related to Bug #42592: ceph-mon/mgr PGstat Segmentation Fault added

#30 Updated by Neha Ojha 5 months ago

  • Related to Bug #41228: mon: deleting a CephFS and its pools causes MONs to crash added

#31 Updated by Kefu Chai 5 months ago

  • Related to deleted (Bug #42592: ceph-mon/mgr PGstat Segmentation Fault)

#32 Updated by Kefu Chai 5 months ago

  • Related to deleted (Bug #42689: nautilus mon/mgr: ceph status:pool number display is not right)

#33 Updated by Kefu Chai 5 months ago

  • Duplicated by Bug #42592: ceph-mon/mgr PGstat Segmentation Fault added

#34 Updated by Kefu Chai 5 months ago

  • Duplicated by Bug #42689: nautilus mon/mgr: ceph status:pool number display is not right added

#35 Updated by Kefu Chai 5 months ago

  • Related to deleted (Bug #41228: mon: deleting a CephFS and its pools causes MONs to crash)

#36 Updated by Kefu Chai 5 months ago

  • Status changed from Fix Under Review to Pending Backport
  • Backport changed from nautilus to nautilus, mimic

#37 Updated by Kefu Chai 5 months ago

  • Duplicated by Bug #41228: mon: deleting a CephFS and its pools causes MONs to crash added

#38 Updated by Nathan Cutler 5 months ago

  • Copied to Backport #42857: mimic: ceph -s shows wrong number of pools when pool was deleted added

#39 Updated by Nathan Cutler 5 months ago

  • Copied to Backport #42858: nautilus: ceph -s shows wrong number of pools when pool was deleted added
