Bug #17023

OSD failed to subscribe skipped osdmaps after "ceph osd pause"

Added by Kefu Chai over 2 years ago. Updated over 2 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Target version: -
Start date: 08/10/2016
Due date:
% Done: 0%
Source: Community (user)
Tags:
Backport: jewel
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:

Description

Per Wido's comment in #16982-7:

I tried adding new OSDs to the cluster and they also have to catch up, which never happens until I restart them over and over.
osd.136 in this case is a fresh OSD. You can see it jumps with 1.000 maps (osd_map_message_max), but then just waits.
I restart the osd, it goes 1k maps forward and waits. I restart, etc, etc.

The root cause is analyzed in #16982-11.

In short: while "ceph osd pause" is in effect, the subscription sent by the Objecter keeps getting in the way of the OSD's own subscription, so the OSD cannot subscribe to the older osdmaps it needs to catch up with the cluster.

A workaround is to avoid "ceph osd pause".
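
To make the failure mode easier to picture, here is a toy model in Python. It is only a sketch and not Ceph code: the single-slot subscription table, the class and function names, and the epoch numbers (0 and 4000) are all assumptions chosen for illustration. The only values taken from this report are the batch size of 1000 maps (osd_map_message_max) and the observation that each restart moves the OSD forward by one batch.

```python
import math

# Batch size reported above (osd_map_message_max); everything else is hypothetical.
OSD_MAP_MESSAGE_MAX = 1000


def restarts_to_catch_up(osd_epoch, cluster_epoch, batch=OSD_MAP_MESSAGE_MAX):
    """Restarts needed if each restart advances the OSD by one batch and then stalls."""
    behind = max(0, cluster_epoch - osd_epoch)
    return math.ceil(behind / batch)


class ToySubscriptions:
    """Hypothetical table with one slot per map type: a later request replaces an earlier one."""

    def __init__(self):
        self.want = {}

    def subscribe(self, what, start):
        self.want[what] = start


def simulate(osd_epoch, cluster_epoch, paused, batch=OSD_MAP_MESSAGE_MAX, rounds=10):
    """Return the epoch the toy OSD reaches after a number of subscription rounds."""
    subs = ToySubscriptions()
    for _ in range(rounds):
        # The OSD asks for the first epoch it is missing.
        subs.subscribe("osdmap", osd_epoch + 1)
        if paused:
            # Assumption: while paused, the Objecter asks for maps newer than
            # the latest one, replacing the OSD's request in the same slot.
            subs.subscribe("osdmap", cluster_epoch + 1)
        start = subs.want["osdmap"]
        if start > cluster_epoch:
            continue  # nothing newer exists yet, so no maps are delivered
        # Otherwise at most one batch of maps is delivered.
        osd_epoch = min(cluster_epoch, start - 1 + batch)
    return osd_epoch


if __name__ == "__main__":
    print(simulate(0, 4000, paused=False))
    print(simulate(0, 4000, paused=True))
    print(restarts_to_catch_up(0, 4000))
```

In this model the unpaused OSD walks forward one batch per round until it reaches the cluster epoch, while the paused one never receives an older map because its request is always replaced. The real interaction between the OSD and the Objecter is described in #16982-11 and may differ in detail.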

ceph-osd.136.log.gz (47.9 KB) Wido den Hollander, 08/11/2016 06:51 AM


Related issues

Related to Ceph - Bug #16982: OSD crash after upgrade to Jewel: give useful error when trying to commit 4000 maps to a 100MB journal Resolved 08/10/2016
Copied to Ceph - Backport #17089: jewel: OSD failed to subscribe skipped osdmaps after "ceph osd pause" Resolved

History

#1 Updated by Kefu Chai over 2 years ago

  • Copied from Bug #16982: OSD crash after upgrade to Jewel: give useful error when trying to commit 4000 maps to a 100MB journal added

#2 Updated by Wido den Hollander over 2 years ago

To clarify why we did the 'osd pause'.

The upgrade from Hammer -> Jewel didn't go very well. We had machines go OOM, CPU 100% busy and even a disk failing under the pressure.

We wanted to be back up and running ASAP, but data integrity was even more important. I set the pause flag to prevent any changes to the data so that the cluster could make all PGs active again before I thought it was safe to continue.

#3 Updated by Kefu Chai over 2 years ago

  • Status changed from New to Need Review

#4 Updated by Kefu Chai over 2 years ago

  • Status changed from Need Review to Resolved

#5 Updated by Kefu Chai over 2 years ago

  • Status changed from Resolved to Pending Backport

#6 Updated by Loic Dachary over 2 years ago

  • Tags deleted (jewel)
  • Backport set to jewel

#7 Updated by Loic Dachary over 2 years ago

  • Copied to Backport #17089: jewel: OSD failed to subscribe skipped osdmaps after "ceph osd pause" added

#8 Updated by Loic Dachary over 2 years ago

  • Copied from deleted (Bug #16982: OSD crash after upgrade to Jewel: give useful error when trying to commit 4000 maps to a 100MB journal)

#9 Updated by Loic Dachary over 2 years ago

  • Related to Bug #16982: OSD crash after upgrade to Jewel: give useful error when trying to commit 4000 maps to a 100MB journal added

#10 Updated by Loic Dachary over 2 years ago

  • Status changed from Pending Backport to Resolved
