Bug #25182
Upmaps forgotten after restarting OSDs
Status: Closed
Description
Problem:
I have a small cluster at home, and I noticed during the upgrades from 12.2.5 to 12.2.7 and from 12.2.7 to 13.2.1 that the upmap settings are forgotten when the OSDs are restarted.
Expected behavior:
Upmaps are remembered so they don't have to be re-applied after each OSD restart.
Additional notes:
Originally I created the upmaps with the balancer, but because of this problem they are now set using 'ceph osd pg-upmap-items' like this:
ceph osd pg-upmap-items 5.38 10 18
ceph osd pg-upmap-items 5.d 2 18
ceph osd pg-upmap-items 5.21 11 18
ceph osd pg-upmap-items 5.48 2 18
ceph osd pg-upmap-items 5.3d 11 18
ceph osd pg-upmap-items 5.14 5 18
ceph osd pg-upmap-items 5.1 11 18 8 6 9 0
ceph osd pg-upmap-items 5.f 11 18
ceph osd pg-upmap-items 5.3 10 6 11 0
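If the upmaps keep getting dropped, they can be regenerated from the cluster's own record rather than retyped: `ceph osd dump --format json` reports the active upmaps in a `pg_upmap_items` array. A minimal sketch of turning that JSON back into the commands above (the embedded sample mirrors two of the entries in this ticket; verify the field names against your own cluster's dump output):

```python
import json

# Sample of the shape "ceph osd dump --format json" uses for upmaps
# (pgid plus a list of from/to mappings). Values are taken from the
# commands in this ticket; check your own dump before relying on it.
osd_dump = json.loads("""
{
  "pg_upmap_items": [
    {"pgid": "5.38", "mappings": [{"from": 10, "to": 18}]},
    {"pgid": "5.1",  "mappings": [{"from": 11, "to": 18},
                                  {"from": 8,  "to": 6},
                                  {"from": 9,  "to": 0}]}
  ]
}
""")

def upmap_commands(dump):
    """Rebuild 'ceph osd pg-upmap-items' commands from an osd dump."""
    cmds = []
    for item in dump.get("pg_upmap_items", []):
        pairs = " ".join(f"{m['from']} {m['to']}" for m in item["mappings"])
        cmds.append(f"ceph osd pg-upmap-items {item['pgid']} {pairs}")
    return cmds

for cmd in upmap_commands(osd_dump):
    print(cmd)
```

Saving the dump before a restart makes it easy to re-apply whatever the monitors discard afterwards.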
Updated by Josh Durgin over 5 years ago
- Category set to Administration/Usability
- Priority changed from Normal to High
- Component(RADOS) Monitor added
Updated by Sage Weil over 5 years ago
It is expected that the upmaps may evaporate if the "raw" CRUSH mapping changes. This shouldn't happen for OSD up/down, but it may happen for OSD in/out, OSD addition, or other CRUSH changes.
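The intended invalidation rule can be sketched roughly like this: each upmap pair redirects a "from" OSD to a "to" OSD, and if a CRUSH change means the raw mapping for that PG no longer includes the "from" OSD, the pair is no longer meaningful and gets cancelled. This is an illustrative Python sketch only; the real check is `OSDMap::maybe_remove_pg_upmaps` in Ceph's C++ code and covers more cases (and, per the rest of this ticket, had a bug for EC pools):

```python
def prune_invalid_upmaps(raw_mapping, upmap_items):
    """Keep only upmap pairs whose 'from' OSD still appears in the raw
    CRUSH mapping for that PG; drop entries left with no valid pairs.
    Illustrative sketch of the intended rule, not Ceph's actual code."""
    pruned = {}
    for pgid, pairs in upmap_items.items():
        raw = set(raw_mapping[pgid])
        valid = [(f, t) for f, t in pairs if f in raw]
        if valid:
            pruned[pgid] = valid
    return pruned

# Hypothetical example: after a CRUSH change, PG 5.38's raw mapping no
# longer includes OSD 10, so its 10->18 remap is cancelled; 5.d's
# 2->18 remap survives because OSD 2 is still in the raw mapping.
raw = {"5.38": [3, 7, 12], "5.d": [2, 5, 14]}
items = {"5.38": [(10, 18)], "5.d": [(2, 18)]}
print(prune_invalid_upmaps(raw, items))  # {'5.d': [(2, 18)]}
```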
Updated by Bryan Stillwell over 5 years ago
What debugging logs would be helpful in figuring this out? I just restarted an OSD on my 13.2.1-based cluster and all the PGs that contained that OSD had their upmaps disappear. The total number of pg_upmap_items dropped from 288 to 240 in the process.
Updated by Sage Weil over 5 years ago
Bryan Stillwell wrote:
What debugging logs would be helpful in figuring this out? I just restarted an OSD on my 13.2.1-based cluster and all the PGs that contained that OSD had their upmaps disappear. The total number of pg_upmap_items dropped from 288 to 240 in the process.
If you could turn up the mon log (debug mon = 20, debug ms = 1, and debug osd = 20 on the mons only) and then reproduce it (restart an osd and observe the mappings go away) that should tell us where/why they are getting removed... thanks!
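For anyone following along, those debug levels can be set in `ceph.conf` on the monitor hosts (taking effect on the next mon restart); a minimal fragment matching the settings asked for above:

```
# ceph.conf on the monitor hosts only
[mon]
    debug mon = 20
    debug ms = 1
    debug osd = 20
```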
Updated by Bryan Stillwell over 5 years ago
I believe these log messages explain why the upmaps are being removed, but I'll attach the relevant section of the log as well:
2018-08-27 13:33:25.988 7fa8dee42700 10 maybe_remove_pg_upmaps cancel invalid pg_upmap_items entry 5.1->[4,11,12,9,1,10]
2018-08-27 13:33:25.988 7fa8dee42700 10 maybe_remove_pg_upmaps cancel invalid pg_upmap_items entry 5.2->[6,11,8,1]
2018-08-27 13:33:25.988 7fa8dee42700 10 maybe_remove_pg_upmaps cancel invalid pg_upmap_items entry 5.3->[6,9]
2018-08-27 13:33:25.988 7fa8dee42700 10 maybe_remove_pg_upmaps cancel invalid pg_upmap_items entry 5.5->[6,11]
2018-08-27 13:33:25.988 7fa8dee42700 10 maybe_remove_pg_upmaps cancel invalid pg_upmap_items entry 5.7->[12,7]
2018-08-27 13:33:25.988 7fa8dee42700 10 maybe_remove_pg_upmaps cancel invalid pg_upmap_items entry 5.8->[6,1]
2018-08-27 13:33:25.988 7fa8dee42700 10 maybe_remove_pg_upmaps cancel invalid pg_upmap_items entry 5.9->[17,10,8,1]
2018-08-27 13:33:25.988 7fa8dee42700 10 maybe_remove_pg_upmaps cancel invalid pg_upmap_items entry 5.b->[0,11,8,4,6,9]
...
2018-08-27 13:33:25.988 7fa8dee42700 10 maybe_remove_pg_upmaps cancel invalid pg_upmap_items entry 5.47->[12,9]
2018-08-27 13:33:25.988 7fa8dee42700 10 maybe_remove_pg_upmaps cancel invalid pg_upmap_items entry 5.48->[6,7]
2018-08-27 13:33:25.988 7fa8dee42700 10 maybe_remove_pg_upmaps cancel invalid pg_upmap_items entry 5.4d->[2,3]
2018-08-27 13:33:25.988 7fa8dee42700 10 maybe_remove_pg_upmaps cancel invalid pg_upmap_items entry 5.4e->[6,17]
2018-08-27 13:33:25.988 7fa8dee42700 10 maybe_remove_pg_upmaps cancel invalid pg_upmap_items entry 5.4f->[10,17]
2018-08-27 13:33:25.988 7fa8dee42700 10 maybe_remove_pg_upmaps cancel invalid pg_upmap_items entry 5.55->[6,3]
2018-08-27 13:33:25.988 7fa8dee42700 10 maybe_remove_pg_upmaps cancel invalid pg_upmap_items entry 5.57->[2,3]
2018-08-27 13:33:25.988 7fa8dee42700 10 maybe_remove_pg_upmaps cancel invalid pg_upmap_items entry 5.6a->[12,3]
Updated by Bryan Stillwell over 5 years ago
One thing I've noticed after living with this for a while is that the upmap entries that are forgotten are always for the same pool, which uses 4+2 erasure coding.
Updated by Bryan Stillwell over 5 years ago
After upgrading to 13.2.4 this problem went away. I believe this is the change that fixed it:
https://github.com/ceph/ceph/pull/25365
And this is the one which pulled it into Mimic:
https://github.com/ceph/ceph/pull/25419
Feel free to mark this as resolved.
Updated by Josh Durgin almost 5 years ago
- Status changed from New to Resolved
Thanks for verifying the fixes, Bryan. It looks like those are all backported to mimic and luminous.