Bug #4675

closed

mon: pg creations don't get queued on mon startup

Added by Sage Weil about 11 years ago. Updated about 11 years ago.

Status: Resolved
Priority: Urgent
Category: Monitor
% Done: 0%
Source: Q/A
Backport: cuttlefish, bobtail
Severity: 3 - minor

Description

PGMonitor::send_pg_creates() also divvies up pending PG creations among the OSDs they currently map to. This happens from update_from_paxos(), and presumably also when the OSDMap is updated. On startup, it happens via preinit() -> init_paxos(), which calls PGMonitor::update_from_paxos() before the OSDMonitor's, which means the OSDMap is not yet loaded and every PG maps to no OSD. Until the next OSDMap update, a reconnecting OSD will not see the creations queued for it.
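To make the ordering problem concrete, here is a minimal C++ model of it. Every type and member (ToyOSDMap, ToyOSDMonitor, ToyPGMonitor, primary_of, creating_by_osd, buggy_startup) is hypothetical and heavily simplified, not Ceph's actual monitor code, and the placement rule pg % num_osds is an assumption standing in for real PG placement. It shows that when the PGMonitor update runs first, every pending creation is filed under "no OSD", and nothing re-files it when the OSDMap is loaded afterwards.

```cpp
// Simplified model of the startup-ordering bug. All names are hypothetical
// stand-ins, NOT Ceph's real PGMonitor/OSDMonitor API.
#include <cassert>
#include <map>
#include <set>

constexpr int NO_OSD = -1;  // a PG that "maps to no OSD"

// Stand-in for the OSDMap: empty until the OSDMonitor loads it.
struct ToyOSDMap {
  bool loaded = false;
  int num_osds = 0;
  // Hypothetical placement rule: pg -> primary OSD, or NO_OSD if no map.
  int primary_of(int pg) const {
    if (!loaded || num_osds == 0) return NO_OSD;
    return pg % num_osds;
  }
};

struct ToyOSDMonitor {
  ToyOSDMap osdmap;
  // Stand-in for loading the committed OSDMap from paxos.
  void update_from_paxos() {
    osdmap.loaded = true;
    osdmap.num_osds = 3;
  }
};

struct ToyPGMonitor {
  const ToyOSDMap* osdmap = nullptr;             // shared view of the OSDMap
  std::set<int> creating_pgs;                    // PGs awaiting creation
  std::map<int, std::set<int>> creating_by_osd;  // per-OSD creation queues

  // Divvies pending creations among the OSDs they currently map to.
  void send_pg_creates() {
    assert(osdmap != nullptr);
    creating_by_osd.clear();
    for (int pg : creating_pgs)
      creating_by_osd[osdmap->primary_of(pg)].insert(pg);
  }

  // Loads committed state, then immediately divvies it up.
  void update_from_paxos() {
    creating_pgs = {0, 1, 2, 3, 4, 5};  // pretend these were committed
    send_pg_creates();
  }
};

// Buggy order: the PGMonitor updates before the OSDMap has been loaded.
std::map<int, std::set<int>> buggy_startup() {
  ToyOSDMonitor osdmon;
  ToyPGMonitor pgmon;
  pgmon.osdmap = &osdmon.osdmap;
  pgmon.update_from_paxos();   // OSDMap empty: every PG -> NO_OSD
  osdmon.update_from_paxos();  // too late; nothing re-divvies the queue
  return pgmon.creating_by_osd;
}
```

Calling buggy_startup() yields a single queue keyed by NO_OSD holding all six PGs, so a lookup for osd.0 finds nothing, which is the symptom described in the issue.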

The fix is probably to split the divvying out of send_pg_creates() and then ensure that it is called at some later point during startup, once the OSDMap has been loaded.
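A minimal C++ sketch of the proposed split. All types and names here (ToyOSDMap, ToyPGMonitor, map_pg_creates, fixed_startup) are hypothetical simplifications, not Ceph's API or the fix that actually landed; the OSDMonitor's load is reduced to setting two fields, and pg % num_osds stands in for real placement. update_from_paxos() now only loads pending creations, and the OSD mapping runs as a separate step after the OSDMap is available (the assert documents that precondition).

```cpp
// Sketch of splitting "divvy creations among OSDs" out of send_pg_creates()
// and running it after the OSDMap is loaded. All names are hypothetical,
// NOT Ceph's real monitor API.
#include <cassert>
#include <map>
#include <set>

constexpr int NO_OSD = -1;  // a PG that "maps to no OSD"

struct ToyOSDMap {
  bool loaded = false;
  int num_osds = 0;
  int primary_of(int pg) const {
    if (!loaded || num_osds == 0) return NO_OSD;
    return pg % num_osds;  // hypothetical placement rule
  }
};

struct ToyPGMonitor {
  const ToyOSDMap* osdmap = nullptr;
  std::set<int> creating_pgs;                    // PGs awaiting creation
  std::map<int, std::set<int>> creating_by_osd;  // per-OSD creation queues

  // Split-out step: map pending creations to their current OSDs.
  // Only valid once the OSDMap has been loaded.
  void map_pg_creates() {
    assert(osdmap != nullptr && osdmap->loaded);
    creating_by_osd.clear();
    for (int pg : creating_pgs)
      creating_by_osd[osdmap->primary_of(pg)].insert(pg);
  }

  // Sending no longer maps; it just reports the queue for one OSD.
  std::set<int> send_pg_creates(int osd) const {
    auto it = creating_by_osd.find(osd);
    return it == creating_by_osd.end() ? std::set<int>{} : it->second;
  }

  // Loading committed state no longer touches the OSD mapping.
  void update_from_paxos() { creating_pgs = {0, 1, 2, 3, 4, 5}; }
};

// Fixed startup: load both monitors' state first, then map creations.
ToyPGMonitor fixed_startup(ToyOSDMap& osdmap) {
  ToyPGMonitor pgmon;
  pgmon.osdmap = &osdmap;
  pgmon.update_from_paxos();  // safe: no OSD mapping happens here
  osdmap.loaded = true;       // stand-in for OSDMonitor::update_from_paxos()
  osdmap.num_osds = 3;
  pgmon.map_pg_creates();     // every PG now maps to a real OSD
  return pgmon;
}
```

With three OSDs, fixed_startup() queues PGs {0, 3}, {1, 4} and {2, 5} for osd.0, osd.1 and osd.2, and nothing lands under NO_OSD. In this model the same map_pg_creates() call would also be made whenever a new OSDMap is applied, so queued creations follow PGs as they remap.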

The result is that a pool creation that races with a mon restart will hang. Since this was triggered by the job below (OSD thrashing only), either some other path also gets into this state or there is a different bug. In any case, after that hang, restarting the mon put it into this buggy state, so it should be fixed regardless.

ubuntu@teuthology:/a/sage-2013-04-06_09:10:56-rados-wip-osd-throttle-testing-basic/9833$ cat orig.config.yaml 
kernel:
  kdb: true
  sha1: b0bb70d12c365872547f10d185bf88eba3ed6083
machine_type: plana
nuke-on-error: true
overrides:
  ceph:
    conf:
      global:
        ms inject delay max: 1
        ms inject delay probability: 0.005
        ms inject delay type: osd
        ms inject socket failures: 2500
      mon:
        debug mon: 20
        debug ms: 20
        debug paxos: 20
    fs: ext4
    log-whitelist:
    - slow request
    sha1: aca0aea1bfbafba9cab1b2c693760b824bd82d30
  s3tests:
    branch: master
  workunit:
    sha1: aca0aea1bfbafba9cab1b2c693760b824bd82d30
roles:
- - mon.a
  - mon.c
  - osd.0
  - osd.1
  - osd.2
- - mon.b
  - mds.a
  - osd.3
  - osd.4
  - osd.5
  - client.0
tasks:
- chef: null
- clock.check: null
- install: null
- ceph:
    log-whitelist:
    - wrongly marked me down
    - objects unfound and apparently lost
- thrashosds:
    chance_pgnum_grow: 1
    chance_pgpnum_fix: 1
    timeout: 1200
- ceph-fuse: null
- workunit:
    clients:
      client.0:
      - rados/test.sh


Related issues: 1 (0 open, 1 closed)

Related to Ceph - Bug #4813: pgs stuck creating (Resolved, Samuel Just, 04/25/2013)
