Bug #5665 (closed)
mds takeover too early causes new mds to shutdown
Status: Duplicate
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Q/A
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDS
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
After replay we get:
2013-07-17 21:50:59.701234 7f39a5eb1700 1 mds.0.2 rejoin_done
2013-07-17 21:50:59.701236 7f39a5eb1700 10 mds.0.cache show_subtrees - no subtrees
2013-07-17 21:50:59.701239 7f39a5eb1700 7 mds.0.cache show_cache
2013-07-17 21:50:59.701241 7f39a5eb1700 7 mds.0.cache unlinked [inode 1 [...2,head] / auth v1 snaprealm=0x1b39900 f(v0 1=0+1) n(v0 1=0+1) (iversion lock) 0x1b48860]
2013-07-17 21:50:59.701248 7f39a5eb1700 7 mds.0.cache unlinked [inode 100 [...2,head] ~mds0/ auth v1 snaprealm=0x1b39480 f(v0 11=1+10) n(v0 11=1+10) (iversion lock) 0x1b48000]
2013-07-17 21:50:59.701254 7f39a5eb1700 1 mds.0.2 empty cache, no subtrees, leaving cluster
2013-07-17 21:50:59.701256 7f39a5eb1700 3 mds.0.2 request_state down:stopped
Full logs for the original and the takeover MDS are attached.
The job config was:
ubuntu@teuthology:/a/teuthology-2013-07-17_20:00:59-fs-cuttlefish-testing-basic/71119$ cat orig.config.yaml
kernel:
  kdb: true
  sha1: 77c8bf2f972a9d6ff446c49a41678bf931bbee44
machine_type: plana
nuke-on-error: true
overrides:
  admin_socket:
    branch: cuttlefish
  ceph:
    conf:
      client:
        debug client: 10
      mds:
        debug mds: 20
        debug ms: 1
      mon:
        debug mon: 20
        debug ms: 20
        debug paxos: 20
      osd:
        osd op thread timeout: 60
    fs: btrfs
    log-whitelist:
    - slow request
    - wrongly marked me down
    sha1: 39bffac6b6c898882d03de392f7f2218933d942b
  ceph-deploy:
    conf:
      client:
        debug monc: 20
        debug ms: 1
        debug objecter: 20
        debug rados: 20
        log file: /var/log/ceph/ceph-..log
      mon:
        debug mon: 20
        debug ms: 20
        debug paxos: 20
  install:
    ceph:
      sha1: 39bffac6b6c898882d03de392f7f2218933d942b
  s3tests:
    branch: cuttlefish
  workunit:
    sha1: 39bffac6b6c898882d03de392f7f2218933d942b
roles:
- - mon.a
  - mon.c
  - osd.0
  - osd.1
  - osd.2
- - mon.b
  - mds.a
  - osd.3
  - osd.4
  - osd.5
- - client.0
  - mds.b-s-a
tasks:
- chef: null
- clock.check: null
- install: null
- ceph: null
- mds_thrash: null
- ceph-fuse: null
- workunit:
    clients:
      all:
      - suites/pjd.sh
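For context, a minimal self-contained sketch of the decision visible in the log above: an MDS that finishes rejoin holding an empty cache and no subtrees requests down:stopped instead of going active. This is illustrative only, not the actual Ceph MDS source; the names MDSSketch, rejoin_done, and request_state are hypothetical stand-ins.

// Sketch of the "empty cache, no subtrees, leaving cluster" decision.
// All names here are illustrative, not Ceph internals.
#include <iostream>

enum class MDSState { Rejoin, Active, Stopped };

struct MDSSketch {
  int num_subtrees = 0;            // subtrees held after replay/rejoin (none here)
  MDSState state = MDSState::Rejoin;

  void request_state(MDSState s) {
    // The real MDS asks the monitor for a state change; we just record and print it.
    state = s;
    std::cout << "request_state "
              << (s == MDSState::Stopped ? "down:stopped" : "up:active") << "\n";
  }

  void rejoin_done() {
    if (num_subtrees == 0) {
      // Matches the last two log lines: the takeover MDS decides it has nothing to serve.
      std::cout << "empty cache, no subtrees, leaving cluster\n";
      request_state(MDSState::Stopped);
      return;
    }
    request_state(MDSState::Active);
  }
};

int main() {
  MDSSketch takeover_mds;   // the standby that grabbed the rank too early
  takeover_mds.rejoin_done();
}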
Updated by Greg Farnum over 10 years ago
Isn't this basically the MDS not getting to write all its startup state to disk?
Seems like maybe we should just prevent the tests from killing them prior to that instead of investing work to recover from it.
Updated by Zheng Yan over 10 years ago
- Status changed from New to Duplicate
I think this is a duplicate of #4894.