Bug #40011

ceph -s shows wrong number of pools when pool was deleted

Added by Jan Fajerski almost 5 years ago. Updated over 3 years ago.

Status:
Resolved
Priority:
High
Assignee:
Daniel Oliveira
Category:
-
Target version:
-
% Done:
0%

Source:
Tags:
Backport:
nautilus, mimic
Regression:
No
Severity:
3 - minor
Reviewed:
ceph-qa-suite:
Pull request ID:
31560
Crash signature (v1):
Crash signature (v2):

Description

This is reproducible in a vstart cluster:

 MDS=0 ../src/vstart.sh -n -b -d
 bin/ceph osd pool create foo 12
 bin/ceph osd pool create bar 12
 bin/ceph osd pool create foobar 12
 bin/ceph -s
 bin/ceph tell mon.\* injectargs '--mon-allow-pool-delete=true'
 bin/ceph osd pool rm foo foo --yes-i-really-really-mean-it
 bin/ceph -s
 bin/ceph osd lspools

"ceph -s" show the following at the first invocation:

*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
2019-05-23 10:26:46.503 7fbb7db4c700 -1 WARNING: all dangerous and experimental features are enabled.
2019-05-23 10:26:46.519 7fbb7db4c700 -1 WARNING: all dangerous and experimental features are enabled.
  cluster:
    id:     d240be1a-33ca-483d-94e7-aadc47d6e8a4
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum a,b,c (age 18m)
    mgr: x(active, since 17m)
    osd: 3 osds: 3 up (since 17m), 3 in (since 17m)

  data:
    pools:   3 pools, 36 pgs
    objects: 0 objects, 0 B
    usage:   6.0 GiB used, 27 GiB / 33 GiB avail
    pgs:     36 active+clean

After deleting the pool:

*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
2019-05-23 10:27:02.763 7f9f5f7d2700 -1 WARNING: all dangerous and experimental features are enabled.
2019-05-23 10:27:02.783 7f9f5f7d2700 -1 WARNING: all dangerous and experimental features are enabled.
  cluster:
    id:     d240be1a-33ca-483d-94e7-aadc47d6e8a4
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum a,b,c (age 18m)
    mgr: x(active, since 18m)
    osd: 3 osds: 3 up (since 17m), 3 in (since 17m)

  data:
    pools:   3 pools, 24 pgs
    objects: 0 objects, 0 B
    usage:   6.0 GiB used, 27 GiB / 33 GiB avail
    pgs:     24 active+clean

Note that the PG count changes as expected, but the number of pools does not. "ceph osd lspools" is not affected.
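
For anyone triaging this, a quick way to surface the mismatch is to pull the pool count from both places and compare; a rough sketch for a vstart cluster, assuming jq is available and that the status JSON exposes the count under "pgmap.num_pools":

 # pgmap-derived count, i.e. what "ceph -s" prints (JSON path assumed)
 bin/ceph -s -f json | jq '.pgmap.num_pools'
 # osdmap-derived count, i.e. what "ceph osd lspools" reflects
 bin/ceph osd lspools | wc -l

When the bug is present, the first number stays at the pre-deletion value while the second drops immediately.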


Related issues 9 (0 open, 9 closed)

Related to mgr - Bug #40871: osd status reports old crush location after osd moves (Resolved; Kefu Chai)
Has duplicate mgr - Bug #41414: OSDMonitor: deleted pool still shown in stats via `ceph status` (Duplicate)
Has duplicate Ceph - Bug #41832: Different pools count in ceph -s and ceph osd pool ls (Duplicate; 09/14/2019)
Has duplicate RADOS - Bug #41944: inconsistent pool count in ceph -s output (Resolved; 09/20/2019)
Has duplicate RADOS - Bug #42592: ceph-mon/mgr PGstat Segmentation Fault (Duplicate; 11/01/2019)
Has duplicate RADOS - Bug #42689: nautilus mon/mgr: ceph status:pool number display is not right (Duplicate; 11/08/2019)
Has duplicate CephFS - Bug #41228: mon: deleting a CephFS and its pools causes MONs to crash (Resolved)
Copied to mgr - Backport #42857: mimic: ceph -s shows wrong number of pools when pool was deleted (Rejected)
Copied to mgr - Backport #42858: nautilus: ceph -s shows wrong number of pools when pool was deleted (Resolved; Nathan Cutler)

Actions #1

Updated by Jan Fajerski almost 5 years ago

  • Affected Versions v15.0.0 added
Actions #2

Updated by Nathan Cutler almost 5 years ago

  • Backport set to nautilus
Actions #3

Updated by Nathan Cutler almost 5 years ago

  • Regression changed from No to Yes
Actions #4

Updated by Nathan Cutler almost 5 years ago

  • Affected Versions v14.2.0, v14.2.1, v14.2.2 added
Actions #5

Updated by Jan Fajerski almost 5 years ago

  • Regression changed from Yes to No
  • Affected Versions deleted (v15.0.0)

It actually shows the correct number of pools (2) for a short time and then displays the erroneous 3 pools after a few seconds.

Actions #6

Updated by Greg Farnum almost 5 years ago

  • Project changed from RADOS to mgr
  • Priority changed from Normal to High

This data is actually sourced from the manager's pgstats. It has turned up on the mailing list a couple of times and is resolved by restarting the manager.
I took a brief look and really don’t see how it could be going wrong. Maybe it’s not going wrong when the manager handles a new osdmap but rather when it propagates that state elsewhere, with some protocol issue?
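
For clusters hitting this in the field, the workaround Greg mentions boils down to bouncing the active mgr; in the vstart reproducer above that could look roughly like this (the mgr name "x" matches the status output earlier in this ticket):

 # fail the active mgr so it (or a standby) re-registers and rebuilds its pg/pool stats
 bin/ceph mgr fail x
 # once the mgr is back, the pool count should match "ceph osd lspools" again
 bin/ceph -s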

Actions #8

Updated by Noah Watkins almost 5 years ago

It looks to me like `ceph status` is getting this state not from the ceph-mgr but from the MgrStatMonitor PaxosService. The difference between lspools and the pool count is that the former comes from the osdmap while the latter comes from the pgmap.
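
To make the two sources explicit, the osdmap and pgmap views can be dumped side by side; a rough sketch:

 # osdmap view: the pool list maintained by the OSDMonitor (matches "ceph osd lspools")
 bin/ceph osd dump | grep '^pool '
 # pgmap view: per-pool stats tracked by the mgr and mirrored into MgrStatMonitor;
 # a deleted pool lingering here lines up with the wrong count in "ceph -s"
 bin/ceph pg dump pools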

Actions #9

Updated by Daniel Oliveira over 4 years ago

I started investigating this last week. I was only able to reproduce it once so far.

Actions #10

Updated by Daniel Oliveira over 4 years ago

Still checking if we have an environment where this could be reproducible at will since I was only able to see the behavior once.

Actions #11

Updated by Sebastian Wagner over 4 years ago

  • Assignee set to Daniel Oliveira
Actions #12

Updated by Sebastian Wagner over 4 years ago

  • Related to Bug #40871: osd status reports old crush location after osd moves added
Actions #13

Updated by Neha Ojha over 4 years ago

  • Has duplicate Bug #41414: OSDMonitor: deleted pool still shown in stats via `ceph status` added
Actions #14

Updated by Kefu Chai over 4 years ago

  • Assignee changed from Daniel Oliveira to Kefu Chai

assigning it to myself to see if it's a dup.

Actions #15

Updated by Daniel Oliveira over 4 years ago

@Kefu Chai,

Just trying to understand: did you assign it to yourself to check whether it is a dup of something else? Would you like me to still check on it?

Thanks,
-Daniel

Actions #16

Updated by Nathan Cutler over 4 years ago

  • Has duplicate Bug #41832: Different pools count in ceph -s and ceph osd pool ls added
Actions #17

Updated by Kefu Chai over 4 years ago

  • Assignee changed from Kefu Chai to Daniel Oliveira

@Daniel I assigned it to myself temporarily to see if #40871 is a dup of this one, in the hope of resolving them together, but it seems they are different. Sorry for hijacking this ticket from you!

i am returning it to you.

Actions #18

Updated by Kefu Chai over 4 years ago

Not reproducible on master (261fab6465877862f777c9e6a7225863472cd53a), nautilus v14.2.0, nautilus v14.2.2, or nautilus HEAD (v14.2.4-27-g462e659cea).

Actions #19

Updated by Daniel Oliveira over 4 years ago

@Kefu Chai,

No problem at all! I just wanted to make sure I was on the same page!
Also, your comment https://tracker.ceph.com/issues/40011#note-18 explains why I wasn't able to reproduce it and ended up helping me to validate it.

Thanks!

Actions #20

Updated by Nathan Cutler over 4 years ago

I wonder if the messenger is involved here? If it happens more often in downstream products, that might be because msgr version 1 is in use there, while Kefu's and Daniel's attempts might have been using msgr2?

(Just thinking out loud after reading Greg's comment #40011-6)
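
If someone wants to rule the messenger in or out, it should be enough to check whether msgr2 binding is enabled and what the mons actually advertise; a sketch for a vstart cluster (queries mon.a's admin socket, which vstart sets up):

 # is msgr2 binding enabled on the running mon?
 bin/ceph daemon mon.a config get ms_bind_msgr2
 # do the mon address vectors contain v2 (msgr2) endpoints, or only v1?
 bin/ceph mon dump | grep -E 'v1:|v2:'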

Actions #21

Updated by Kefu Chai over 4 years ago

Nathan, that's plausible. I didn't adjust "ms_bind_msgr2", and I think "ms_bind_msgr2=true" has been around since v14.1.0:

$ git tag --contains 40a7dfbb1f25cae7cea68de18af981cb3a1b980f
v14.1.0
v14.1.1
v14.2.0
v14.2.1
v14.2.2
v14.2.3
v15.0.0
Actions #22

Updated by Jan Fajerski over 4 years ago

This still reproduces for me on current master:

jan@ws ~/code/ceph/ceph/build (git)-[master] % bin/ceph osd pool rm foo foo --yes-i-really-really-mean-it
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
2019-09-19T09:02:55.409+0200 7fcf49ab9700 -1 WARNING: all dangerous and experimental features are enabled.
2019-09-19T09:02:55.449+0200 7fcf49ab9700 -1 WARNING: all dangerous and experimental features are enabled.
pool 'foo' removed
jan@ws ~/code/ceph/ceph/build (git)-[master] % bin/ceph -s
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
2019-09-19T09:03:01.793+0200 7fb389d32700 -1 WARNING: all dangerous and experimental features are enabled.
2019-09-19T09:03:01.817+0200 7fb389d32700 -1 WARNING: all dangerous and experimental features are enabled.
  cluster:
    id:     151dbb4b-8bf7-452f-9f81-0c3968859117
    health: HEALTH_WARN
            3 pools have too many placement groups

  services:
    mon: 3 daemons, quorum a,b,c (age 2m)
    mgr: x(active, since 2m)
    mds: a:1 {0=a=up:active} 2 up:standby
    osd: 3 osds: 3 up (since 102s), 3 in (since 102s)

  task status:
    scrub status:
        mds.0: idle

  data:
    pools:   5 pools, 48 pgs
    objects: 22 objects, 2.2 KiB
    usage:   6.0 GiB used, 3.0 TiB / 3.0 TiB avail
    pgs:     48 active+clean

jan@ws ~/code/ceph/ceph/build (git)-[master] % bin/ceph osd lspools
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
2019-09-19T09:03:08.125+0200 7fbb616de700 -1 WARNING: all dangerous and experimental features are enabled.
2019-09-19T09:03:08.157+0200 7fbb616de700 -1 WARNING: all dangerous and experimental features are enabled.
1 cephfs.a.meta
2 cephfs.a.data
4 bar
5 foobar
jan@ws ~/code/ceph/ceph/build (git)-[master] % bin/ceph --version
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
ceph version v15.0.0-5218-g3d7e5b0e3f (3d7e5b0e3fcf0dda9c664175ce6a0c0f3673a662) octopus (dev)

Actions #23

Updated by Jan Fajerski over 4 years ago

  • Affected Versions v15.0.0 added
Actions #24

Updated by Nathan Cutler over 4 years ago

  • Has duplicate Bug #41944: inconsistent pool count in ceph -s output added
Actions #25

Updated by Daniel Oliveira over 4 years ago

@Jan,

Thanks for the update! I will redeploy my test environment and recheck it.

Actions #26

Updated by Sage Weil over 4 years ago

This bug is probably somewhere in PGMap.cc -- that's where the pool count comes from, and that structure is updated in awkward ways by examining new OSDMap updates. It also happens on the mgr and is reported periodically to the mon, so it is normal for this mismatch to exist for 1-2 seconds (but not longer than that).
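
Given that a short-lived mismatch is expected, it helps to distinguish the transient case from the stuck one by polling for a while; a minimal sketch, reusing the assumed "pgmap.num_pools" JSON path from above:

 # poll for ~30 seconds: the counts should converge within a couple of seconds,
 # and the bug is only confirmed if they still disagree at the end
 for i in $(seq 30); do
     echo "pgmap: $(bin/ceph -s -f json | jq '.pgmap.num_pools')  osdmap: $(bin/ceph osd lspools | wc -l)"
     sleep 1
 done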

Actions #27

Updated by Kefu Chai over 4 years ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 31560

Per the downstream BZ, that's not the case.

Actions #28

Updated by Neha Ojha over 4 years ago

  • Related to Bug #42689: nautilus mon/mgr: ceph status:pool number display is not right added
Actions #29

Updated by Neha Ojha over 4 years ago

  • Related to Bug #42592: ceph-mon/mgr PGstat Segmentation Fault added
Actions #30

Updated by Neha Ojha over 4 years ago

  • Related to Bug #41228: mon: deleting a CephFS and its pools causes MONs to crash added
Actions #31

Updated by Kefu Chai over 4 years ago

  • Related to deleted (Bug #42592: ceph-mon/mgr PGstat Segmentation Fault)
Actions #32

Updated by Kefu Chai over 4 years ago

  • Related to deleted (Bug #42689: nautilus mon/mgr: ceph status:pool number display is not right)
Actions #33

Updated by Kefu Chai over 4 years ago

  • Has duplicate Bug #42592: ceph-mon/mgr PGstat Segmentation Fault added
Actions #34

Updated by Kefu Chai over 4 years ago

  • Has duplicate Bug #42689: nautilus mon/mgr: ceph status:pool number display is not right added
Actions #35

Updated by Kefu Chai over 4 years ago

  • Related to deleted (Bug #41228: mon: deleting a CephFS and its pools causes MONs to crash)
Actions #36

Updated by Kefu Chai over 4 years ago

  • Status changed from Fix Under Review to Pending Backport
  • Backport changed from nautilus to nautilus, mimic
Actions #37

Updated by Kefu Chai over 4 years ago

  • Has duplicate Bug #41228: mon: deleting a CephFS and its pools causes MONs to crash added
Actions #38

Updated by Nathan Cutler over 4 years ago

  • Copied to Backport #42857: mimic: ceph -s shows wrong number of pools when pool was deleted added
Actions #39

Updated by Nathan Cutler over 4 years ago

  • Copied to Backport #42858: nautilus: ceph -s shows wrong number of pools when pool was deleted added
Actions #40

Updated by Nathan Cutler over 3 years ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
