Bug #25182


Upmaps forgotten after restarting OSDs

Added by Bryan Stillwell over 5 years ago. Updated almost 5 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: Administration/Usability
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS): Monitor
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Problem:
I have a small cluster at home, and I noticed during both the 12.2.5 -> 12.2.7 upgrade and the 12.2.7 -> 13.2.1 upgrade that the upmap settings are forgotten when the OSDs are restarted.

Expected behavior:
Upmaps are remembered so they don't have to be re-applied after each OSD restart.

Additional notes:
Originally I created the upmaps with the balancer, but because of this problem they are now set using 'ceph osd pg-upmap-items' like this:

ceph osd pg-upmap-items 5.38 10 18
ceph osd pg-upmap-items 5.d 2 18
ceph osd pg-upmap-items 5.21 11 18
ceph osd pg-upmap-items 5.48 2 18
ceph osd pg-upmap-items 5.3d 11 18
ceph osd pg-upmap-items 5.14 5 18
ceph osd pg-upmap-items 5.1 11 18 8 6 9 0
ceph osd pg-upmap-items 5.f 11 18
ceph osd pg-upmap-items 5.3 10 6 11 0
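
A quick way to check whether the mappings survive a restart (a sketch, assuming a systemd deployment; substitute the actual OSD id):

ceph osd dump | grep pg_upmap_items     # record the current upmap exceptions
systemctl restart ceph-osd@<id>         # restart one OSD
ceph osd dump | grep pg_upmap_items     # compare; any lines now missing were dropped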


Files

disappearing-upmaps.log.gz (308 KB) - Bryan Stillwell, 08/27/2018 07:54 PM
Actions #1

Updated by John Spray over 5 years ago

  • Project changed from Ceph to RADOS
Actions #2

Updated by Josh Durgin over 5 years ago

  • Category set to Administration/Usability
  • Priority changed from Normal to High
  • Component(RADOS) Monitor added
Actions #3

Updated by Sage Weil over 5 years ago

It is expected that the upmaps may evaporate if the "raw" CRUSH mapping changes. This shouldn't happen for osd up/down, but it may happen for osd in/out or osd addition or other crush changes.

Actions #4

Updated by Sage Weil over 5 years ago

Hmm, I wasn't able to reproduce this...

Actions #5

Updated by Greg Farnum over 5 years ago

  • Priority changed from High to Normal
Actions #6

Updated by Bryan Stillwell over 5 years ago

What debugging logs would be helpful in figuring this out? I just restarted an OSD on my 13.2.1-based cluster and all the PGs that contained that OSD had their upmaps disappear. The total number of pg_upmap_items dropped from 288 to 240 in the process.
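
For reference, counts like the ones above can be pulled from the JSON dump (a sketch; assumes the jq tool is available):

ceph osd dump --format json | jq '.pg_upmap_items | length'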

Actions #7

Updated by Sage Weil over 5 years ago

Bryan Stillwell wrote:

What debugging logs would be helpful in figuring this out? I just restarted an OSD on my 13.2.1-based cluster and all the PGs that contained that OSD had their upmaps disappear. The total number of pg_upmap_items dropped from 288 to 240 in the process.

If you could turn up the mon log (debug mon = 20, debug ms = 1, and debug osd = 20 on the mons only) and then reproduce it (restart an osd and observe the mappings go away) that should tell us where/why they are getting removed... thanks!
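
One way to apply those settings on a running monitor is with injectargs (a sketch; replace mon.a with the actual monitor id), or by adding them to the [mon] section of ceph.conf on the monitor hosts and restarting the mons:

ceph tell mon.a injectargs '--debug_mon 20 --debug_ms 1 --debug_osd 20'

[mon]
    debug mon = 20
    debug ms = 1
    debug osd = 20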

Actions #8

Updated by Bryan Stillwell over 5 years ago

I believe these log messages explain why the upmaps are being removed, but I'll attach the relevant section of the log as well:

2018-08-27 13:33:25.988 7fa8dee42700 10 maybe_remove_pg_upmaps cancel invalid pg_upmap_items entry 5.1->[4,11,12,9,1,10]
2018-08-27 13:33:25.988 7fa8dee42700 10 maybe_remove_pg_upmaps cancel invalid pg_upmap_items entry 5.2->[6,11,8,1]
2018-08-27 13:33:25.988 7fa8dee42700 10 maybe_remove_pg_upmaps cancel invalid pg_upmap_items entry 5.3->[6,9]
2018-08-27 13:33:25.988 7fa8dee42700 10 maybe_remove_pg_upmaps cancel invalid pg_upmap_items entry 5.5->[6,11]
2018-08-27 13:33:25.988 7fa8dee42700 10 maybe_remove_pg_upmaps cancel invalid pg_upmap_items entry 5.7->[12,7]
2018-08-27 13:33:25.988 7fa8dee42700 10 maybe_remove_pg_upmaps cancel invalid pg_upmap_items entry 5.8->[6,1]
2018-08-27 13:33:25.988 7fa8dee42700 10 maybe_remove_pg_upmaps cancel invalid pg_upmap_items entry 5.9->[17,10,8,1]
2018-08-27 13:33:25.988 7fa8dee42700 10 maybe_remove_pg_upmaps cancel invalid pg_upmap_items entry 5.b->[0,11,8,4,6,9]
...
2018-08-27 13:33:25.988 7fa8dee42700 10 maybe_remove_pg_upmaps cancel invalid pg_upmap_items entry 5.47->[12,9]
2018-08-27 13:33:25.988 7fa8dee42700 10 maybe_remove_pg_upmaps cancel invalid pg_upmap_items entry 5.48->[6,7]
2018-08-27 13:33:25.988 7fa8dee42700 10 maybe_remove_pg_upmaps cancel invalid pg_upmap_items entry 5.4d->[2,3]
2018-08-27 13:33:25.988 7fa8dee42700 10 maybe_remove_pg_upmaps cancel invalid pg_upmap_items entry 5.4e->[6,17]
2018-08-27 13:33:25.988 7fa8dee42700 10 maybe_remove_pg_upmaps cancel invalid pg_upmap_items entry 5.4f->[10,17]
2018-08-27 13:33:25.988 7fa8dee42700 10 maybe_remove_pg_upmaps cancel invalid pg_upmap_items entry 5.55->[6,3]
2018-08-27 13:33:25.988 7fa8dee42700 10 maybe_remove_pg_upmaps cancel invalid pg_upmap_items entry 5.57->[2,3]
2018-08-27 13:33:25.988 7fa8dee42700 10 maybe_remove_pg_upmaps cancel invalid pg_upmap_items entry 5.6a->[12,3]

Actions #9

Updated by Bryan Stillwell over 5 years ago

One thing I've noticed after living with this for a while is that the upmap entries that are forgotten are always for the same pool, which uses 4+2 erasure coding.
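
For reference, a pool's erasure-code profile can be confirmed with (a sketch; substitute the actual pool and profile names):

ceph osd pool get <pool-name> erasure_code_profile
ceph osd erasure-code-profile get <profile-name>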

Actions #10

Updated by Bryan Stillwell over 5 years ago

After upgrading to 13.2.4 this problem went away. I believe this is the change that fixed it:

https://github.com/ceph/ceph/pull/25365

And this is the one which pulled it into Mimic:

https://github.com/ceph/ceph/pull/25419

Feel free to mark this as resolved.

Actions #11

Updated by Josh Durgin almost 5 years ago

  • Status changed from New to Resolved

Thanks for verifying the fixes, Bryan. Looks like those are all backported to mimic + luminous.
