Bug #4675: mon: pg creations don't get queued on mon startup - Ceph - Ceph

Bug #4675

closed

mon: pg creations don't get queued on mon startup

Added by Sage Weil about 11 years ago. Updated almost 11 years ago.

Status:

Resolved

Priority:

Urgent

Assignee:

Sage Weil

Category:

Monitor

Target version:

v0.61 - Cuttlefish

Source:

Q/A

Backport:

cuttlefish, bobtail

Severity:

3 - minor

Description

PGMonitor::send_pg_creates() also divvies up pg creations among the current OSDs they map to. This happens from update_from_paxos(), and presumably also when osdmaps update. On startup, it happens from preinit() -> init_paxos(), which calls PGMonitor::update_from_paxos() before the OSDMonitor, which means the OSDMap is not loaded yet and everything maps to no OSD. Until there is an osdmap update, a reconnecting osd will fail to see creations queued for it.

The fix is probably to break the divvying out of send_pg_creates(), and then ensure that it is called at some other point during startup.
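
The ordering problem can be sketched with a toy model. This is not the Ceph code: ToyOsdMap, ToyPgMonitor, map_pg_creates(), creates_for() and simulate_startup() are hypothetical names, and the modulo placement is only a stand-in for the real mapping. It shows that if the per-OSD create queues are computed while the map is still at epoch 0, and never recomputed, a reconnecting OSD sees nothing queued for it.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <vector>

// Hypothetical stand-in for the OSDMap; epoch 0 means "not loaded yet".
struct ToyOsdMap {
  unsigned epoch = 0;
  std::vector<int> up_osds;  // OSDs currently up

  // Toy placement: pick an up OSD by modulo, or -1 when there is none.
  int primary_for(unsigned pg) const {
    if (up_osds.empty()) return -1;
    return up_osds[pg % up_osds.size()];
  }
};

// Hypothetical stand-in for the PGMonitor's creation bookkeeping.
struct ToyPgMonitor {
  std::set<unsigned> creating_pgs;                   // PGs waiting to be created
  std::map<int, std::set<unsigned>> creates_by_osd;  // per-OSD creation queue

  // The "divvying" step, split out from the sending step.
  void map_pg_creates(const ToyOsdMap& osdmap) {
    creates_by_osd.clear();
    for (unsigned pg : creating_pgs) {
      int osd = osdmap.primary_for(pg);
      if (osd >= 0) creates_by_osd[osd].insert(pg);
    }
  }

  // What a (re)connecting OSD would be told to create.
  std::set<unsigned> creates_for(int osd) const {
    auto it = creates_by_osd.find(osd);
    return it == creates_by_osd.end() ? std::set<unsigned>{} : it->second;
  }
};

// Simulate monitor startup. With remap_after_load == false the mapping is
// only computed before the OSDMap is loaded (the ordering described above).
ToyPgMonitor simulate_startup(bool remap_after_load) {
  ToyOsdMap osdmap;  // epoch 0, nothing loaded
  ToyPgMonitor pgmon;
  pgmon.creating_pgs = {0, 1, 2, 3};

  pgmon.map_pg_creates(osdmap);            // PG side runs first
  assert(pgmon.creates_by_osd.empty());    // everything mapped to no OSD

  osdmap.epoch = 10;                       // OSD side loads its map afterwards
  osdmap.up_osds = {0, 1};

  if (remap_after_load) pgmon.map_pg_creates(osdmap);
  return pgmon;
}
```

In this model, simulate_startup(false) leaves every queue empty, while simulate_startup(true) queues PGs 0 and 2 for osd 0 and PGs 1 and 3 for osd 1, which is the effect of calling the divvying step again at some later point during startup.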

The result is that a pool creation that races with a mon restart will hang. Some other path also gets in this state, or there is a different bug, since it was triggered by the job below (osd thrashing only). In any case, after that hang, restarting the mon got into this buggy state, so it should get fixed regardless.

ubuntu@teuthology:/a/sage-2013-04-06_09:10:56-rados-wip-osd-throttle-testing-basic/9833$ cat orig.config.yaml 
kernel:
  kdb: true
  sha1: b0bb70d12c365872547f10d185bf88eba3ed6083
machine_type: plana
nuke-on-error: true
overrides:
  ceph:
    conf:
      global:
        ms inject delay max: 1
        ms inject delay probability: 0.005
        ms inject delay type: osd
        ms inject socket failures: 2500
      mon:
        debug mon: 20
        debug ms: 20
        debug paxos: 20
    fs: ext4
    log-whitelist:
    - slow request
    sha1: aca0aea1bfbafba9cab1b2c693760b824bd82d30
  s3tests:
    branch: master
  workunit:
    sha1: aca0aea1bfbafba9cab1b2c693760b824bd82d30
roles:
- - mon.a
  - mon.c
  - osd.0
  - osd.1
  - osd.2
- - mon.b
  - mds.a
  - osd.3
  - osd.4
  - osd.5
  - client.0
tasks:
- chef: null
- clock.check: null
- install: null
- ceph:
    log-whitelist:
    - wrongly marked me down
    - objects unfound and apparently lost
- thrashosds:
    chance_pgnum_grow: 1
    chance_pgpnum_fix: 1
    timeout: 1200
- ceph-fuse: null
- workunit:
    clients:
      client.0:
      - rados/test.sh

Related issues 1 (0 open — 1 closed)

Updated by Sage Weil about 11 years ago

Status changed from New to Fix Under Review
Priority changed from High to Urgent

wip-mon-pg

Updated by Ian Colle about 11 years ago

Assignee set to Greg Farnum

Greg - can you please review this wip branch?

Updated by Greg Farnum about 11 years ago

Status changed from Fix Under Review to Need More Info

Okay, I've looked at the patches and I've looked at the bug description and I can't tell what the problem is here. The effective change from the patches is to queue PG creates less frequently. Each monitor calls update_from_paxos() when it boots or wins an election, and when the OSDMonitor updates then it tells the PGMonitor to check the map, which calculates these mappings again. So it should all be fine with or without the patches (which would not have any impact on the stuck-creating PGs that I see when I go look at the teuthology archive).
Unfortunately there are no logs that I can find, but I see that there were 8 PGs creating and it manages one of them; do we know this is a pool create and not a split? Why do we believe the issue is with the monitors?

Updated by Greg Farnum about 11 years ago

Assignee changed from Greg Farnum to Sage Weil

Updated by Sage Weil about 11 years ago

Status changed from Need More Info to Fix Under Review
Assignee changed from Sage Weil to Greg Farnum

the problem is that update_from_paxos() is called on startup when the osdmap isn't loaded yet, so it remaps everything to [] and no creates are queued. then osdmap does get loaded, mon starts up, osds reconnect.. but if there are then no osdmap updates, then the creates never get recalculated with a non-broken value. unfortunately it's a difficult case to reproduce; i only saw it once with the hung qa run last week. easy fix though.

the quick fix is just the if get_epoch() != 0 checks in the second patch. the first patch can wait.. eventually we'll want to be more explicit about when we recalc the mapping and when we send, although as you say that's not needed for cuttlefish.
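
A rough sketch of what such an epoch guard looks like, using a toy model rather than the actual patch. ToyOsdMap, ToyPgMonitor, map_pg_creates() and queued_creates() are hypothetical names, and the modulo placement is an assumption; the only idea taken from the thread is the get_epoch() != 0 check. It shows the guard refusing to compute a mapping against an unloaded map; it does not by itself show when the mapping is recomputed later.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <vector>

// Hypothetical stand-in for the OSDMap; epoch 0 means "not loaded yet".
struct ToyOsdMap {
  unsigned epoch = 0;
  std::vector<int> up_osds;

  unsigned get_epoch() const { return epoch; }

  // Toy placement: pick an up OSD by modulo, or -1 when there is none.
  int primary_for(unsigned pg) const {
    if (up_osds.empty()) return -1;
    return up_osds[pg % up_osds.size()];
  }
};

struct ToyPgMonitor {
  std::set<unsigned> creating_pgs;
  std::map<int, std::set<unsigned>> creates_by_osd;

  // Returns false, leaving the queues untouched, while the map is unloaded.
  bool map_pg_creates(const ToyOsdMap& osdmap) {
    if (osdmap.get_epoch() == 0) return false;  // the guard
    creates_by_osd.clear();
    for (unsigned pg : creating_pgs) {
      int osd = osdmap.primary_for(pg);
      if (osd >= 0) creates_by_osd[osd].insert(pg);
    }
    return true;
  }
};

// Helper for experiments: total creates queued for a given map state.
std::size_t queued_creates(unsigned epoch, const std::vector<int>& up_osds,
                           unsigned num_creating) {
  ToyOsdMap osdmap;
  osdmap.epoch = epoch;
  osdmap.up_osds = up_osds;

  ToyPgMonitor pgmon;
  for (unsigned pg = 0; pg < num_creating; ++pg) pgmon.creating_pgs.insert(pg);

  bool mapped = pgmon.map_pg_creates(osdmap);
  assert(mapped == (epoch != 0));  // the guard skips only the unloaded map

  std::size_t total = 0;
  for (const auto& entry : pgmon.creates_by_osd) total += entry.second.size();
  return total;
}
```

For example, queued_creates(0, {0, 1}, 4) queues nothing because the guard fires, while queued_creates(7, {0, 1}, 4) queues all four creations.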

Updated by Greg Farnum about 11 years ago

Okay, but an OSD booting creates a new OSD Map, which will lead to PGMonitor::check_osd_map(), which will lead to send_pg_creates(). I do see that we won't actually calculate them again when the OSDMap initially loads since we'll have seen it previously, so that assumption of mine wasn't quite right, but as soon as we have an OSD boot we're good.
So, thus the thrashing monitors but not thrashing OSDs. Okay, I see it now. But this won't actually fix that race either; the only callers of send_pg_creates() are PGMonitor::update_from_paxos() and PGMonitor::check_osd_map(). The second is called whenever the OSD Map changes and the PGMap hasn't seen it before, but that's true for the same cases with and without these patches. PGMonitor::update_from_paxos() is also going to happen at the same times with or without these patches. So I'm still not seeing how these do anything. Just pushed a wip-4675-model that might handle this better; check it out?
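
The "hasn't seen it before" condition can be illustrated with a small model. This is a sketch under assumptions, not the monitor code: ToyPgMonitor, its fields and remaps_after_restart() are hypothetical, and the comparison against a persisted epoch is only my reading of the behaviour described here.

```cpp
#include <cassert>
#include <vector>

// Hypothetical model of the PG side remembering which OSDMap epoch it saw.
struct ToyPgMonitor {
  unsigned last_osdmap_epoch = 0;  // newest OSDMap epoch already processed
  unsigned remaps = 0;             // how often creates were recalculated

  // Recalculate only for an epoch that has not been seen before.
  bool check_osd_map(unsigned osdmap_epoch) {
    if (osdmap_epoch <= last_osdmap_epoch) return false;  // already seen
    last_osdmap_epoch = osdmap_epoch;
    ++remaps;
    return true;
  }
};

// Restart with a persisted epoch, then observe a sequence of OSDMap epochs.
unsigned remaps_after_restart(unsigned persisted_epoch,
                              const std::vector<unsigned>& observed_epochs) {
  ToyPgMonitor pgmon;
  pgmon.last_osdmap_epoch = persisted_epoch;
  for (unsigned e : observed_epochs) pgmon.check_osd_map(e);
  assert(pgmon.remaps <= observed_epochs.size());
  return pgmon.remaps;
}
```

In this model, remaps_after_restart(10, {10}) performs no recalculation because the reloaded map was already seen, while remaps_after_restart(10, {10, 11}) recalculates once when a new epoch (for example from an OSD boot) arrives.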

Also I think the actual referenced hang here is a different issue than this race.
