Bug #5236: mds assert when starting file scan - CephFS - Ceph

Bug #5236

closed

mds assert when starting file scan

Added by Sage Weil almost 11 years ago. Updated almost 8 years ago.

Status: Resolved
Priority: Urgent
Assignee: Zheng Yan
Source: Q/A
Severity: 3 - minor
Component(FS): MDS
Description

2013-06-03T05:02:42.986 INFO:teuthology.task.ceph.mds.b-s-a.err:mds/Locker.cc: In function 'void Locker::file_recover(ScatterLock*)' thread 7ff0fed1a700 time 2013-06-03 05:03:24.725204
2013-06-03T05:02:42.986 INFO:teuthology.task.ceph.mds.b-s-a.err:mds/Locker.cc: 4442: FAILED assert(lock->get_state() == LOCK_PRE_SCAN)
2013-06-03T05:02:42.986 INFO:teuthology.task.ceph.mds.b-s-a.err: ceph version 0.63-322-gf7c1944 (f7c19440290d4b82ced0320d1dfc4676ad5083d2)
2013-06-03T05:02:42.986 INFO:teuthology.task.ceph.mds.b-s-a.err: 1: (Locker::file_recover(ScatterLock*)+0x1dc) [0x612fcc]
2013-06-03T05:02:42.987 INFO:teuthology.task.ceph.mds.b-s-a.err: 2: (MDCache::start_files_to_recover(std::vector<CInode*, std::allocator<CInode*> >&, std::vector<CInode*, std::allocator<CInode*> >&)+0x86) [0x57a8e6]
2013-06-03T05:02:42.987 INFO:teuthology.task.ceph.mds.b-s-a.err: 3: (MDCache::open_snap_parents()+0x9dc) [0x5cb97c]
2013-06-03T05:02:42.987 INFO:teuthology.task.ceph.mds.b-s-a.err: 4: (MDCache::rejoin_gather_finish()+0x115) [0x5ceac5]
2013-06-03T05:02:42.987 INFO:teuthology.task.ceph.mds.b-s-a.err: 5: (MDCache::rejoin_send_rejoins()+0x1329) [0x5d4b89]
2013-06-03T05:02:42.987 INFO:teuthology.task.ceph.mds.b-s-a.err: 6: (MDS::rejoin_joint_start()+0x130) [0x4cbf30]
2013-06-03T05:02:42.988 INFO:teuthology.task.ceph.mds.b-s-a.err: 7: (MDS::handle_mds_map(MMDSMap*)+0x39cc) [0x4e03ec]
2013-06-03T05:02:42.988 INFO:teuthology.task.ceph.mds.b-s-a.err: 8: (MDS::handle_core_message(Message*)+0xb1b) [0x4e1d0b]
2013-06-03T05:02:42.988 INFO:teuthology.task.ceph.mds.b-s-a.err: 9: (MDS::_dispatch(Message*)+0x2f) [0x4e1e9f]
2013-06-03T05:02:42.988 INFO:teuthology.task.ceph.mds.b-s-a.err: 10: (MDS::ms_dispatch(Message*)+0x1d3) [0x4e3923]
2013-06-03T05:02:42.988 INFO:teuthology.task.ceph.mds.b-s-a.err: 11: (DispatchQueue::entry()+0x3f1) [0x839021]
2013-06-03T05:02:42.988 INFO:teuthology.task.ceph.mds.b-s-a.err: 12: (DispatchQueue::DispatchThread::entry()+0xd) [0x7b571d]
2013-06-03T05:02:42.989 INFO:teuthology.task.ceph.mds.b-s-a.err: 13: (()+0x7e9a) [0x7ff103076e9a]
2013-06-03T05:02:42.989 INFO:teuthology.task.ceph.mds.b-s-a.err: 14: (clone()+0x6d) [0x7ff10182cccd]
2013-06-03T05:02:42.989 INFO:teuthology.task.ceph.mds.b-s-a.err: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

job was

ubuntu@teuthology:/var/lib/teuthworker/archive/teuthology-2013-06-03_01:00:48-fs-master-testing-basic/30161$ cat orig.config.yaml 
kernel:
  kdb: true
  sha1: 19bb6a83cb93383b363cc5956e304213f0f1b79f
machine_type: plana
nuke-on-error: true
overrides:
  ceph:
    conf:
      client:
        debug client: 10
      global:
        mds inject delay type: osd mds
        ms inject delay max: 1
        ms inject delay probability: 0.005
        ms inject socket failures: 2500
      mds:
        debug mds: 20
        debug ms: 1
      mon:
        debug mon: 20
        debug ms: 20
        debug paxos: 20
    log-whitelist:
    - slow request
    - wrongly marked me down
    sha1: f7c19440290d4b82ced0320d1dfc4676ad5083d2
  install:
    ceph:
      sha1: f7c19440290d4b82ced0320d1dfc4676ad5083d2
  s3tests:
    branch: master
  workunit:
    sha1: f7c19440290d4b82ced0320d1dfc4676ad5083d2
roles:
- - mon.a
  - mon.c
  - osd.0
  - osd.1
  - osd.2
- - mon.b
  - mds.a
  - osd.3
  - osd.4
  - osd.5
- - client.0
  - mds.b-s-a
tasks:
- chef: null
- clock.check: null
- install: null
- ceph: null
- mds_thrash: null
- ceph-fuse: null
- workunit:
    clients:
      all:
      - suites/fsstress.sh

Files

ceph-mds.a.log (64 MB)

Sage Weil, 06/03/2013 02:17 PM

Updated by Sage Weil almost 11 years ago

Category set to 1
Priority changed from Normal to Urgent
Source changed from other to Q/A

Updated by Sage Weil almost 11 years ago

Assignee set to Sage Weil

ubuntu@teuthology:/var/lib/teuthworker/archive/teuthology-2013-06-03_01:00:48-fs-master-testing-basic/30161

Updated by Sage Weil almost 11 years ago

File ceph-mds.a.log added
Status changed from New to 12
Assignee changed from Sage Weil to Zheng Yan

Yan, I got as far as identifying the problem: rejoin_gather_finish -> identify_files_to_recover is getting called twice, once from rejoin_start (via the completion check at the end of rejoin_start -> process_imported_caps) and again from rejoin_joint_start -> rejoin_send_rejoins. I think this broke with one of your recent changes... do you have any quick thoughts on the cleanest way to resolve it? A bool to make us only do the rejoin_gather_finish() once? That could short-circuit the entire rejoin_send_rejoins() call in rejoin_joint_start(): set it true in rejoin_start(), clear it in rejoin_gather_finish(), and add a guard...

Full mds log is attached.

Thanks!
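
For illustration, the guard proposed above might look roughly like the following toy model. This is only a sketch and not the actual MDS code: RejoinModel, LockState, gather_pending, recover_calls and run_scenario are hypothetical names, the real MDS/MDCache methods take arguments and do far more, and the PreScan -> Scan transition inside file_recover() is an assumption made so that a second call would trip the same kind of assert seen in the backtrace.

```cpp
#include <cassert>

// Toy model only: hypothetical stand-ins, not Ceph's MDS/MDCache/Locker classes.
enum class LockState { PreScan, Scan };

struct RejoinModel {
    LockState file_lock = LockState::PreScan;
    bool gather_pending = false;  // hypothetical guard flag
    int recover_calls = 0;

    // Mirrors the failed assert: recovery may only start from the pre-scan state.
    void file_recover() {
        assert(file_lock == LockState::PreScan);
        file_lock = LockState::Scan;  // assumed transition; a second call would assert
        ++recover_calls;
    }

    void rejoin_gather_finish() {
        if (!gather_pending) return;  // guard: already ran for this rejoin
        gather_pending = false;       // clear the flag here
        file_recover();
    }

    void rejoin_send_rejoins() { rejoin_gather_finish(); }

    // gather_already_complete models the completion check at the end of rejoin_start.
    void rejoin_start(bool gather_already_complete) {
        gather_pending = true;        // set the flag here
        if (gather_already_complete) rejoin_gather_finish();
    }

    void rejoin_joint_start() {
        if (!gather_pending) return;  // short-circuit the whole rejoin_send_rejoins() call
        rejoin_send_rejoins();
    }
};

// Runs both entry points in order and reports how often recovery was started.
int run_scenario(bool gather_already_complete) {
    RejoinModel m;
    m.rejoin_start(gather_already_complete);
    m.rejoin_joint_start();
    return m.recover_calls;
}
```

In this model, run_scenario(true) and run_scenario(false) both start recovery exactly once; without the two guard checks, the first scenario would hit the assert in file_recover() on the second call.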
