Backport #14592

closed

osd crashes when handling a stale pg-create message (hammer)

Added by Kefu Chai over 8 years ago. Updated about 7 years ago.

Status: Closed
Priority: High
Assignee:
Target version: -
Release: hammer
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

See http://tracker.ceph.com/issues/13990#note-39.

Steps to reproduce:

  1. The monitor sends pg-create messages. Each pg_create.created is set to pool.last_change, and for a newly created pool, pool.last_change is the OSDMonitor.pending_inc.epoch at that moment. However, these PGs fail to be created because
    • some OSD is down but not out, or
    • some OSD has osd_debug_drop_pg_create_probability set to 1.0 (this option is only available in hammer), or
    • the OSD has just started up and is still waiting for an osdmap.
  2. Meanwhile, other changes update the osdmap. Once the number of osdmap epochs reaches the threshold, the monitor starts to trim them.
  3. When the OSDs are back in business, they start processing the pg-create messages. These messages reference old osdmap epochs that have already been trimmed by both the monitor and the OSD, so when an OSD tries to build the prior set, those maps are missing and an assertion fails.
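To make the failure sequence concrete, here is a toy Python model, not Ceph code. A set of epochs stands in for the OSD's osdmap store, and `MapStore`, `trim_to` and `build_prior_set` are hypothetical names. The real OSD asserts when a map is missing; this sketch returns `False` instead so the outcome can be observed.

```python
# Toy model of the failure, not Ceph code; all names are hypothetical.


class MapStore:
    """Stand-in for an OSD's osdmap store: just a set of epochs."""

    def __init__(self) -> None:
        self.epochs: set[int] = set()

    def add(self, epoch: int) -> None:
        self.epochs.add(epoch)

    def trim_to(self, floor: int) -> None:
        # Drop every epoch strictly below `floor`, as map trimming would.
        self.epochs = {e for e in self.epochs if e >= floor}

    def newest(self) -> int:
        return max(self.epochs)


def build_prior_set(store: MapStore, created: int) -> bool:
    """Return True if every map from `created` to the newest epoch is present.

    The real OSD asserts when a map is missing; returning False keeps the
    sketch observable instead of aborting.
    """
    return all(e in store.epochs for e in range(created, store.newest() + 1))


store = MapStore()
for e in range(1, 101):
    store.add(e)

# Step 1: a pg-create with created = 10 is sent, but no OSD acts on it yet.
stale_created = 10

# Step 2: the osdmap keeps changing, then old epochs are trimmed.
for e in range(101, 601):
    store.add(e)
store.trim_to(101)

# Step 3: the OSD finally handles the stale pg-create.
print(build_prior_set(store, stale_created))  # False: epochs 10..100 are gone
print(build_prior_set(store, 200))            # True: a fresh pg-create is fine
```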

A possible fix:

The monitor should not trim the osdmaps until every pg-create message that references them has been processed by the OSDs.
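A minimal sketch of that idea, assuming the monitor keeps a table of outstanding pg-create messages keyed by PG id, each with its `created` epoch. The table, the example PG ids, and the helpers `safe_trim_floor` and `ack_pg_create` are hypothetical illustrations, not OSDMonitor's actual API. The trim floor is clamped so that no epoch still referenced by an unprocessed pg-create is removed.

```python
# Sketch of the proposed fix; names are hypothetical, not OSDMonitor's API.


def safe_trim_floor(desired_floor: int, pending_creates: dict[str, int]) -> int:
    """Return the epoch below which the monitor may trim osdmaps.

    pending_creates maps a PG id to the `created` epoch of a pg-create
    message that the OSDs have not processed yet.
    """
    if not pending_creates:
        return desired_floor
    # Never trim an epoch that an outstanding pg-create still references.
    return min(desired_floor, min(pending_creates.values()))


def ack_pg_create(pending_creates: dict[str, int], pgid: str) -> None:
    """Forget a pg-create once the OSD reports the PG as created."""
    pending_creates.pop(pgid, None)


pending = {"1.0": 10, "1.1": 10, "2.0": 250}
print(safe_trim_floor(501, pending))  # 10: pool 1's pg-creates pin epoch 10
ack_pg_create(pending, "1.0")
ack_pg_create(pending, "1.1")
print(safe_trim_floor(501, pending))  # 250
ack_pg_create(pending, "2.0")
print(safe_trim_floor(501, pending))  # 501: nothing pending, trim as usual
```

One trade-off of this approach: a pg-create that is never processed (for example, one dropped because osd_debug_drop_pg_create_probability is 1.0) would pin old osdmaps indefinitely, so a real fix would likely also need a way to bound how long such entries are kept.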


Related issues: 1 (0 open, 1 closed)

Related to Ceph - Bug #13990: Hammer (0.94.3) OSD does not delete old OSD Maps in a timely fashion (maybe at all?) | Resolved | Kefu Chai | 12/05/2015
