Bug #9540

closed

Crash during FS upgrade: assert(o->get_num_ref() == 0)

Added by John Spray over 9 years ago. Updated over 9 years ago.

Status:
Rejected
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:
0%

Source:
other
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

   -10> 2014-09-18 07:05:14.008418 7f71ab92d700 10 mds.0.locker mark_updated_scatterlock (ifile sync dirty) - already on list since 2014-09-18 07:05:11.804950
    -9> 2014-09-18 07:05:14.008420 7f71ab92d700 10 mds.0.journal EMetaBlob.replay updated dir [dir 600 ~mds0/stray0/ [2,head] auth v=7891 cv=0/0 state=1610612736 f(v1 m2014-09-18 06:58:11.104563 136=2+134)/f(v1 m2014-09-18 06:58:11.104563 139=5+134) n(v2 rc2014-09-18 06:58:11.104563 b8388608 136=2+134)/n(v2 rc2014-09-18 06:58:11.104563 b10194500 139=5+134) hs=115+304,ss=0+0 dirty=419 | child=1 dirty=1 0x6399d90]
    -8> 2014-09-18 07:05:14.008431 7f71ab92d700 10 mds.0.journal EMetaBlob.replay unlinking [dentry #100/stray0/10000000f8b [2,head] auth (dversion lock) v=7882 inode=0x76ea338 | inodepin=1 dirty=1 0x7b042e0]
    -7> 2014-09-18 07:05:14.008434 7f71ab92d700 12 mds.0.cache.dir(600) unlink_inode [dentry #100/stray0/10000000f8b [2,head] auth (dversion lock) v=7882 inode=0x76ea338 | inodepin=1 dirty=1 0x7b042e0] [inode 10000000f8b [2,head] ~mds0/stray0/10000000f8b auth v7882 dirtyparent s=1805892 nl=0 n(v0 b1805892 1=1+0) (iversion lock) | truncating=1 dirtyparent=1 dirty=1 0x76ea338]
    -6> 2014-09-18 07:05:14.008442 7f71ab92d700 10 mds.0.journal EMetaBlob.replay had [dentry #100/stray0/10000000f8b [2,head] auth NULL (dversion lock) v=7890 inode=0 | inodepin=0 dirty=1 0x7b042e0]
    -5> 2014-09-18 07:05:14.008445 7f71ab92d700 10 mds.0.journal  unlinked set contains {0x76ea338=0x6399d90}
    -4> 2014-09-18 07:05:14.008446 7f71ab92d700 10 mds.0.cache remove_inode_recursive [inode 10000000f8b [2,head] #10000000f8b auth v7882 dirtyparent s=1805892 nl=0 n(v0 b1805892 1=1+0) (iversion lock) | truncating=1 dirtyparent=1 dirty=1 0x76ea338]
    -3> 2014-09-18 07:05:14.008450 7f71ab92d700 14 mds.0.cache remove_inode [inode 10000000f8b [2,head] #10000000f8b auth v7882 dirtyparent s=1805892 nl=0 n(v0 b1805892 1=1+0) (iversion lock) | truncating=1 dirtyparent=1 dirty=1 0x76ea338]
    -2> 2014-09-18 07:05:14.008454 7f71ab92d700 10 mds.0.cache.ino(10000000f8b)  mark_clean [inode 10000000f8b [2,head] #10000000f8b auth v7882 dirtyparent s=1805892 nl=0 n(v0 b1805892 1=1+0) (iversion lock) | truncating=1 dirtyparent=1 dirty=1 0x76ea338]
    -1> 2014-09-18 07:05:14.008458 7f71ab92d700 10 mds.0.cache.ino(10000000f8b) clear_dirty_parent
     0> 2014-09-18 07:05:14.009733 7f71ab92d700 -1 mds/MDCache.cc: In function 'void MDCache::remove_inode(CInode*)' thread 7f71ab92d700 time 2014-09-18 07:05:14.008467
mds/MDCache.cc: 310: FAILED assert(o->get_num_ref() == 0)

 ceph version 0.85-723-g83bd343 (83bd3430e3a17b77265e696095904b7a9032d2ee)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x7f) [0x90eaff]
 2: (MDCache::remove_inode(CInode*)+0x782) [0x630152]
 3: (MDCache::remove_inode_recursive(CInode*)+0x288) [0x63d5c8]
 4: (EMetaBlob::replay(MDS*, LogSegment*, MDSlaveUpdate*)+0x43cd) [0x815b7d]
 5: (EUpdate::replay(MDS*)+0x3a) [0x81e79a]
 6: (MDLog::_replay_thread()+0x698) [0x7a3168]
 7: (MDLog::ReplayThread::entry()+0xd) [0x5a099d]
 8: (()+0x7e9a) [0x7f71b5d22e9a]
 9: (clone()+0x6d) [0x7f71b48d73fd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
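
One possible reading of the log above: the inode is printed with three pins (truncating, dirtyparent, dirty), and mark_clean and clear_dirty_parent run immediately before the assert, which would leave the truncating pin as the one outstanding reference. That is an inference from the log lines, not something confirmed in this ticket. A toy Python model of the invariant, where the class and function names are hypothetical stand-ins and not Ceph's actual API:

```python
class CacheObject:
    """Toy stand-in for a cache object that tracks named pins (assumption)."""

    def __init__(self, name, pins):
        self.name = name
        self.pins = set(pins)

    def get_num_ref(self):
        # One reference per outstanding pin in this simplified model.
        return len(self.pins)

    def put(self, pin):
        self.pins.discard(pin)


def remove_inode(cache, obj):
    # Mirrors the order seen in the log: mark_clean, then clear_dirty_parent.
    obj.put("dirty")
    obj.put("dirtyparent")
    # The invariant behind the failed assert: nothing may still pin the object.
    if obj.get_num_ref() != 0:
        raise AssertionError(f"{obj.name} still pinned by {sorted(obj.pins)}")
    del cache[obj.name]


cache = {}
inode = CacheObject("10000000f8b", ["truncating", "dirtyparent", "dirty"])
cache[inode.name] = inode
try:
    remove_inode(cache, inode)
except AssertionError as err:
    print(err)  # 10000000f8b still pinned by ['truncating']
```

In this model the removal fails because a pin that replay did not expect is still held, which matches the shape of the crash but says nothing about why the pin was there.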

YAML that hit this (once, haven't tried again yet):

interactive-on-error: true

overrides:
  ceph:
    conf:
      mds:
          debug mds: 20
          mds verify scatter: false
      client:
          debug client: 20
      mon:
        mon warn on legacy crush tunables: false
    log-whitelist:
    - scrub
    fs: xfs
roles:
- - mon.a
  - mds.a
  - osd.0
  - osd.1
  - osd.2
  - client.0
- - mon.b
  - mon.c
  - osd.3
  - osd.4
  - osd.5
  - mds.a-s
  - client.1

# Client 0 will remain mounted continuously
# Client 1 will be remounted after each upgrade.
# Both will experience the same workloads

tasks:
- install:
    branch: emperor
- print: "**** done emperor install" 
- ceph:
    fs: xfs
- print: "**** done ceph cluster setup" 
- ceph-fuse:
- workunit:
      clients:
         all:
             - suites/fsstress.sh
             #- fs/misc/trivial_sync.sh
- print: "**** done workunit on emperor" 
- install.upgrade:
    all:
      branch: firefly
- ceph-fuse:
    client.1:
        mounted: false
- ceph.restart:
- ceph-fuse:
    client.1:
        mounted: true
- workunit:
      clients:
         all:
             - suites/fsstress.sh
             #- fs/misc/trivial_sync.sh
- print: "**** done workunit on firefly" 
- install.upgrade:
    all:
        sha1: 83bd3430e3a17b77265e696095904b7a9032d2ee
- ceph-fuse:
    client.1:
        mounted: false
- ceph.restart:
- ceph-fuse:
    client.1:
        mounted: true
- workunit:
      clients:
         all:
             - suites/fsstress.sh
             #- fs/misc/trivial_sync.sh
- print: "**** done workunit on latest" 

- interactive:

#1

Updated by John Spray over 9 years ago

The crash hits at the last ceph.restart (after upgrade from firefly to 83bd3430e3a17b77265e696095904b7a9032d2ee).

That SHA1 was used, rather than giant/master HEAD, because this run was meant to show that the test would trigger the failure fixed by:

commit 386f2d7c829422695a1b1f41bd3f17ca3eef1f61
Author: John Spray <john.spray@redhat.com>
Date:   Thu Sep 11 14:07:59 2014 +0100

    mds: update segment references during journal rewrite

    ... to avoid leaving log events that reference log
    segments by offsets which no longer exist.

    Signed-off-by: John Spray <john.spray@redhat.com>
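
The idea in that commit message can be sketched in a few lines. This is a minimal illustration, not the actual MDS code: the LogEvent type, its fields, and rewrite_journal are all hypothetical names. It assumes each event records the offset of the segment it belongs to, and that a rewrite moves every event to a new offset.

```python
from dataclasses import dataclass, replace


@dataclass
class LogEvent:
    name: str
    start: int       # byte offset of this event in the journal
    size: int        # encoded size in bytes
    seg_offset: int  # offset of the segment this event refers to


def rewrite_journal(events, new_base):
    """Lay events out again from new_base and remap segment references."""
    remap = {}
    rewritten = []
    pos = new_base
    # First pass: assign new positions and remember old -> new offsets.
    for ev in events:
        remap[ev.start] = pos
        rewritten.append(replace(ev, start=pos))
        pos += ev.size
    # Second pass: point each event at the new offset of its segment.
    # Skipping this pass leaves references to offsets that no longer exist.
    for ev in rewritten:
        ev.seg_offset = remap[ev.seg_offset]
    return rewritten


old = [
    LogEvent("segment-start", 4096, 100, 4096),
    LogEvent("update", 4196, 50, 4096),
]
new = rewrite_journal(old, new_base=0)
print([(ev.name, ev.start, ev.seg_offset) for ev in new])
# [('segment-start', 0, 0), ('update', 100, 0)]
```

Without the second pass, the "update" event would still reference offset 4096, where no segment starts after the rewrite.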

#2

Updated by John Spray over 9 years ago

  • Status changed from New to Rejected

Never mind, seems like this was just another manifestation of the original segment reference bug -- giant HEAD is OK.
