Bug #5665 (closed): mds takeover too early causes new mds to shutdown

Added by Sage Weil almost 11 years ago. Updated almost 8 years ago.

Status: Duplicate
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Q/A
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDS
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

After replay we get:

2013-07-17 21:50:59.701234 7f39a5eb1700  1 mds.0.2 rejoin_done
2013-07-17 21:50:59.701236 7f39a5eb1700 10 mds.0.cache show_subtrees - no subtrees
2013-07-17 21:50:59.701239 7f39a5eb1700  7 mds.0.cache show_cache
2013-07-17 21:50:59.701241 7f39a5eb1700  7 mds.0.cache  unlinked [inode 1 [...2,head] / auth v1 snaprealm=0x1b39900 f(v0 1=0+1) n(v0 1=0+1) (iversion lock) 0x1b48860]
2013-07-17 21:50:59.701248 7f39a5eb1700  7 mds.0.cache  unlinked [inode 100 [...2,head] ~mds0/ auth v1 snaprealm=0x1b39480 f(v0 11=1+10) n(v0 11=1+10) (iversion lock) 0x1b48000]
2013-07-17 21:50:59.701254 7f39a5eb1700  1 mds.0.2  empty cache, no subtrees, leaving cluster
2013-07-17 21:50:59.701256 7f39a5eb1700  3 mds.0.2 request_state down:stopped

Full logs for the original and takeover mds are attached.
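
For context, the shutdown in the log above is the rejoin-completion check: once rejoin finishes, the MDS inspects its cache, and if it holds no subtrees it concludes it has nothing to serve and asks the monitors for down:stopped. Below is a minimal, self-contained sketch of that decision as it appears in the log; the struct and helper names are illustrative stand-ins, not the actual Ceph MDS/MDCache code.

#include <iostream>
#include <string>
#include <vector>

// Hypothetical, simplified model of the MDS cache state after rejoin.
// In the real daemon this state lives in MDCache; the names below are
// illustrative only.
struct CacheModel {
    std::vector<std::string> subtrees;  // authoritative subtrees held by this rank
    std::vector<std::string> unlinked;  // inodes not linked into any subtree (e.g. "/", "~mds0/")
};

// Mirrors the decision visible in the log: after rejoin completes, an MDS
// whose cache contains no subtrees concludes it has nothing to serve and
// requests the down:stopped state from the monitors.
void rejoin_done(const CacheModel& cache) {
    std::cout << "rejoin_done\n";
    if (cache.subtrees.empty()) {
        std::cout << "show_subtrees - no subtrees\n";
        for (const auto& ino : cache.unlinked)
            std::cout << " unlinked " << ino << "\n";
        std::cout << " empty cache, no subtrees, leaving cluster\n";
        std::cout << "request_state down:stopped\n";  // the premature shutdown in this ticket
        return;
    }
    std::cout << "becoming active with " << cache.subtrees.size() << " subtrees\n";
}

int main() {
    // The takeover MDS in this job ended up with an empty subtree map after
    // replay (see the log above), modeled here as:
    CacheModel after_replay{{}, {"[inode 1 /]", "[inode 100 ~mds0/]"}};
    rejoin_done(after_replay);
    return 0;
}

In the attached logs the takeover daemon (mds.b-s-a) hits exactly this branch right after rejoin_done, which is why it stops itself instead of taking over.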

The job was:

ubuntu@teuthology:/a/teuthology-2013-07-17_20:00:59-fs-cuttlefish-testing-basic/71119$ cat orig.config.yaml 
kernel:
  kdb: true
  sha1: 77c8bf2f972a9d6ff446c49a41678bf931bbee44
machine_type: plana
nuke-on-error: true
overrides:
  admin_socket:
    branch: cuttlefish
  ceph:
    conf:
      client:
        debug client: 10
      mds:
        debug mds: 20
        debug ms: 1
      mon:
        debug mon: 20
        debug ms: 20
        debug paxos: 20
      osd:
        osd op thread timeout: 60
    fs: btrfs
    log-whitelist:
    - slow request
    - wrongly marked me down
    sha1: 39bffac6b6c898882d03de392f7f2218933d942b
  ceph-deploy:
    conf:
      client:
        debug monc: 20
        debug ms: 1
        debug objecter: 20
        debug rados: 20
        log file: /var/log/ceph/ceph-..log
      mon:
        debug mon: 20
        debug ms: 20
        debug paxos: 20
  install:
    ceph:
      sha1: 39bffac6b6c898882d03de392f7f2218933d942b
  s3tests:
    branch: cuttlefish
  workunit:
    sha1: 39bffac6b6c898882d03de392f7f2218933d942b
roles:
- - mon.a
  - mon.c
  - osd.0
  - osd.1
  - osd.2
- - mon.b
  - mds.a
  - osd.3
  - osd.4
  - osd.5
- - client.0
  - mds.b-s-a
tasks:
- chef: null
- clock.check: null
- install: null
- ceph: null
- mds_thrash: null
- ceph-fuse: null
- workunit:
    clients:
      all:
      - suites/pjd.sh


Files

ceph-mds.b-s-a.log (44.2 KB) - Sage Weil, 07/18/2013 09:10 AM
ceph-mds.a.log (89.4 KB) - Sage Weil, 07/18/2013 09:10 AM
#1

Updated by Greg Farnum almost 11 years ago

Isn't this basically the MDS not getting to write all its startup state to disk?
Seems like maybe we should just prevent the tests from killing them prior to that instead of investing work to recover from it.

#2

Updated by Sage Weil almost 11 years ago

  • Priority changed from High to Normal
#3

Updated by Zheng Yan over 10 years ago

  • Status changed from New to Duplicate

I think this is a duplicate of #4894.

#4

Updated by Greg Farnum almost 8 years ago

  • Component(FS) MDS added