Bug #8086

closed

FDCache::clear failed assert

Added by Mark Nelson about 10 years ago. Updated almost 10 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: OSD
Target version: -
% Done: 0%
Source: other
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hit this today during tiering performance testing with an EC backend. The OSD is on an SSD that is part of the cache tier and sits in a parallel CRUSH hierarchy.

18465_object84189 [assert-version v192,copy-get max 8388608] 6.3ec66cd5 RETRY=2 ack+retry+read+ignore_cache+ignore_overlay+flush e188) v4 currently waiting for pg to exist locally
     0> 2014-04-12 12:55:22.945660 7f413c195700 -1 os/FDCache.h: In function 'void FDCache::clear(const ghobject_t&)' thread 7f413c195700 time 2014-04-12 12:55:22.941373
os/FDCache.h: 77: FAILED assert(!registry.lookup(hoid))

 ceph version 0.79-128-g28371a2 (28371a2463cce4600054d00df526c43efa218e0a)
 1: (FileStore::lfn_unlink(coll_t, ghobject_t const&, SequencerPosition const&, bool)+0x494) [0x874b54]
 2: (FileStore::_remove(coll_t, ghobject_t const&, SequencerPosition const&)+0x8b) [0x874eab]
 3: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, int, ThreadPool::TPHandle*)+0x2901) [0x884d01]
 4: (FileStore::_do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long, ThreadPool::TPHandle*)+0x6c) [0x886cac]
 5: (FileStore::_do_op(FileStore::OpSequencer*, ThreadPool::TPHandle&)+0x167) [0x886e37]
 6: (ThreadPool::worker(ThreadPool::WorkThread*)+0xaef) [0xa3977f]
 7: (ThreadPool::WorkThread::entry()+0x10) [0xa3a670]
 8: (()+0x7f6e) [0x7f414674af6e]
 9: (clone()+0x6d) [0x7f4144aeb9cd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 0 lockdep
   0/ 0 context
   0/ 0 crush
   0/ 0 mds
   0/ 0 mds_balancer
   0/ 0 mds_locker
   0/ 0 mds_log
   0/ 0 mds_log_expire
   0/ 0 mds_migrator
   0/ 0 buffer
   0/ 0 timer
   0/ 0 filer
   0/ 1 striper
   0/ 0 objecter
   0/ 0 rados
   0/ 0 rbd
   0/ 0 journaler
   0/ 0 objectcacher
   0/ 0 client
   0/ 0 osd
   0/ 0 optracker
   0/ 0 objclass
   0/ 0 filestore
   1/ 3 keyvaluestore
   0/ 0 journal
   0/ 0 ms
   0/ 0 mon
   0/ 0 monc
   0/ 0 paxos
   0/ 0 tp
   0/ 0 auth
   1/ 5 crypto
   0/ 0 finisher
   0/ 0 heartbeatmap
   0/ 0 perfcounter
   0/ 0 rgw
   1/ 5 javaclient
   0/ 0 asok
   0/ 0 throttle
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /tmp/cbt/ceph/log/osd.33.log
--- end dump of recent events ---
2014-04-12 12:55:22.961071 7f413c195700 -1 *** Caught signal (Aborted) **
 in thread 7f413c195700

 ceph version 0.79-128-g28371a2 (28371a2463cce4600054d00df526c43efa218e0a)
 1: ceph-osd() [0x965adf]
 2: (()+0xfbb0) [0x7f4146752bb0]
 3: (gsignal()+0x37) [0x7f4144a27f77]
 4: (abort()+0x148) [0x7f4144a2b5e8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f41453336e5]
 6: (()+0x5e856) [0x7f4145331856]
 7: (()+0x5e883) [0x7f4145331883]
 8: (()+0x5eaae) [0x7f4145331aae]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1f2) [0xa488d2]
 10: (FileStore::lfn_unlink(coll_t, ghobject_t const&, SequencerPosition const&, bool)+0x494) [0x874b54]
 11: (FileStore::_remove(coll_t, ghobject_t const&, SequencerPosition const&)+0x8b) [0x874eab]
 12: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, int, ThreadPool::TPHandle*)+0x2901) [0x884d01]
 13: (FileStore::_do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long, ThreadPool::TPHandle*)+0x6c) [0x886cac]
 14: (FileStore::_do_op(FileStore::OpSequencer*, ThreadPool::TPHandle&)+0x167) [0x886e37]
 15: (ThreadPool::worker(ThreadPool::WorkThread*)+0xaef) [0xa3977f]
 16: (ThreadPool::WorkThread::entry()+0x10) [0xa3a670]
 17: (()+0x7f6e) [0x7f414674af6e]
 18: (clone()+0x6d) [0x7f4144aeb9cd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
     0> 2014-04-12 12:55:22.961071 7f413c195700 -1 *** Caught signal (Aborted) **
 in thread 7f413c195700

 ceph version 0.79-128-g28371a2 (28371a2463cce4600054d00df526c43efa218e0a)
 1: ceph-osd() [0x965adf]
 2: (()+0xfbb0) [0x7f4146752bb0]
 3: (gsignal()+0x37) [0x7f4144a27f77]
 4: (abort()+0x148) [0x7f4144a2b5e8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f41453336e5]
 6: (()+0x5e856) [0x7f4145331856]
 7: (()+0x5e883) [0x7f4145331883]
 8: (()+0x5eaae) [0x7f4145331aae]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1f2) [0xa488d2]
 10: (FileStore::lfn_unlink(coll_t, ghobject_t const&, SequencerPosition const&, bool)+0x494) [0x874b54]
 11: (FileStore::_remove(coll_t, ghobject_t const&, SequencerPosition const&)+0x8b) [0x874eab]
 12: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, int, ThreadPool::TPHandle*)+0x2901) [0x884d01]
 13: (FileStore::_do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long, ThreadPool::TPHandle*)+0x6c) [0x886cac]
 14: (FileStore::_do_op(FileStore::OpSequencer*, ThreadPool::TPHandle&)+0x167) [0x886e37]
 15: (ThreadPool::worker(ThreadPool::WorkThread*)+0xaef) [0xa3977f]
 16: (ThreadPool::WorkThread::entry()+0x10) [0xa3a670]
 17: (()+0x7f6e) [0x7f414674af6e]
 18: (clone()+0x6d) [0x7f4144aeb9cd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- end dump of recent events ---
Actions #1

Updated by Mark Nelson about 10 years ago

Note: of the 6 OSDs in the cache tier, 5 appeared to fail with a similar stack trace.

Actions #2

Updated by Samuel Just about 10 years ago

You'll probably have to reproduce with debug filestore = 20 and debug osd = 20.

Actions #3

Updated by Mark Nelson about 10 years ago

On further review, this seems to happen when an erasure-coded base pool that has an associated writeback cache pool is deleted or recreated via ceph osd pool delete/create.
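
As background on the failed check: the trace shows only the expression assert(!registry.lookup(hoid)) inside FDCache::clear. One plausible reading, which is an assumption here and not something this ticket confirms, is that a registry can still return an entry after it has been cleared if some other holder keeps a reference to it. The C++ sketch below illustrates that general pattern with hypothetical ToyRegistry and ToyFD types. It is not Ceph's code, and it may not match the cause that wip-8086 addressed.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>

// Hypothetical stand-in for a cached file descriptor; not Ceph's real type.
struct ToyFD {
    int fd;
};

// Hypothetical registry (an assumption for illustration): it keeps strong
// references for cached entries and weak references so that entries still
// held elsewhere remain discoverable.
class ToyRegistry {
public:
    std::shared_ptr<ToyFD> add(const std::string& key, int fd) {
        assert(fd >= 0);  // toy precondition
        auto entry = std::make_shared<ToyFD>(ToyFD{fd});
        cached_[key] = entry;
        in_use_[key] = entry;
        return entry;
    }

    // Returns the entry if anyone (the cache or an external holder)
    // still keeps it alive, otherwise nullptr.
    std::shared_ptr<ToyFD> lookup(const std::string& key) {
        auto it = in_use_.find(key);
        if (it == in_use_.end()) return nullptr;
        auto entry = it->second.lock();
        if (!entry) in_use_.erase(it);  // last holder is gone; forget the key
        return entry;
    }

    // Drops only the registry's own strong reference.
    void clear(const std::string& key) { cached_.erase(key); }

private:
    std::map<std::string, std::shared_ptr<ToyFD>> cached_;
    std::map<std::string, std::weak_ptr<ToyFD>> in_use_;
};

// Returns true when a check of the form assert(!registry.lookup(key)),
// placed right after clear(), would fail.
bool clear_check_would_fail(bool caller_keeps_reference) {
    ToyRegistry registry;
    std::shared_ptr<ToyFD> held = registry.add("object84189", 42);
    if (!caller_keeps_reference) held.reset();
    registry.clear("object84189");
    return registry.lookup("object84189") != nullptr;
}
```

In this toy model, clear_check_would_fail(true) reports a failing check because the caller's reference keeps the entry visible to lookup, while clear_check_would_fail(false) does not. Logs captured at the higher debug levels suggested above would be needed to tell whether anything similar happened on the affected OSDs.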

Actions #4

Updated by Samuel Just about 10 years ago

  • Status changed from New to 7
  • Assignee set to Samuel Just
Actions #5

Updated by Mark Nelson about 10 years ago

wip-8086 appears to have solved this. Thanks Sam!

Actions #6

Updated by Samuel Just about 10 years ago

  • Status changed from 7 to Resolved
Actions #7

Updated by Samuel Just almost 10 years ago

  • Status changed from Resolved to 12
Actions #8

Updated by Samuel Just almost 10 years ago

  • Status changed from 12 to Fix Under Review
Actions #9

Updated by Sage Weil almost 10 years ago

  • Status changed from Fix Under Review to Resolved