Bug #8086
FDCache::clear failed assert (closed)
% Done: 0%
Source: other
Severity: 3 - minor
Description
Hit this today during tiering performance testing with an EC backend. The OSD is on an SSD that is part of the cache tier and sits in a parallel CRUSH hierarchy.
18465_object84189 [assert-version v192,copy-get max 8388608] 6.3ec66cd5 RETRY=2 ack+retry+read+ignore_cache+ignore_overlay+flush e188) v4 currently waiting for pg to exist locally
0> 2014-04-12 12:55:22.945660 7f413c195700 -1 os/FDCache.h: In function 'void FDCache::clear(const ghobject_t&)' thread 7f413c195700 time 2014-04-12 12:55:22.941373
os/FDCache.h: 77: FAILED assert(!registry.lookup(hoid))
ceph version 0.79-128-g28371a2 (28371a2463cce4600054d00df526c43efa218e0a)
1: (FileStore::lfn_unlink(coll_t, ghobject_t const&, SequencerPosition const&, bool)+0x494) [0x874b54]
2: (FileStore::_remove(coll_t, ghobject_t const&, SequencerPosition const&)+0x8b) [0x874eab]
3: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, int, ThreadPool::TPHandle*)+0x2901) [0x884d01]
4: (FileStore::_do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long, ThreadPool::TPHandle*)+0x6c) [0x886cac]
5: (FileStore::_do_op(FileStore::OpSequencer*, ThreadPool::TPHandle&)+0x167) [0x886e37]
6: (ThreadPool::worker(ThreadPool::WorkThread*)+0xaef) [0xa3977f]
7: (ThreadPool::WorkThread::entry()+0x10) [0xa3a670]
8: (()+0x7f6e) [0x7f414674af6e]
9: (clone()+0x6d) [0x7f4144aeb9cd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- logging levels ---
0/ 5 none
0/ 0 lockdep
0/ 0 context
0/ 0 crush
0/ 0 mds
0/ 0 mds_balancer
0/ 0 mds_locker
0/ 0 mds_log
0/ 0 mds_log_expire
0/ 0 mds_migrator
0/ 0 buffer
0/ 0 timer
0/ 0 filer
0/ 1 striper
0/ 0 objecter
0/ 0 rados
0/ 0 rbd
0/ 0 journaler
0/ 0 objectcacher
0/ 0 client
0/ 0 osd
0/ 0 optracker
0/ 0 objclass
0/ 0 filestore
1/ 3 keyvaluestore
0/ 0 journal
0/ 0 ms
0/ 0 mon
0/ 0 monc
0/ 0 paxos
0/ 0 tp
0/ 0 auth
1/ 5 crypto
0/ 0 finisher
0/ 0 heartbeatmap
0/ 0 perfcounter
0/ 0 rgw
1/ 5 javaclient
0/ 0 asok
0/ 0 throttle
-2/-2 (syslog threshold)
-1/-1 (stderr threshold)
max_recent 10000
max_new 1000
log_file /tmp/cbt/ceph/log/osd.33.log
--- end dump of recent events ---
2014-04-12 12:55:22.961071 7f413c195700 -1 *** Caught signal (Aborted) **
in thread 7f413c195700
ceph version 0.79-128-g28371a2 (28371a2463cce4600054d00df526c43efa218e0a)
1: ceph-osd() [0x965adf]
2: (()+0xfbb0) [0x7f4146752bb0]
3: (gsignal()+0x37) [0x7f4144a27f77]
4: (abort()+0x148) [0x7f4144a2b5e8]
5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f41453336e5]
6: (()+0x5e856) [0x7f4145331856]
7: (()+0x5e883) [0x7f4145331883]
8: (()+0x5eaae) [0x7f4145331aae]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1f2) [0xa488d2]
10: (FileStore::lfn_unlink(coll_t, ghobject_t const&, SequencerPosition const&, bool)+0x494) [0x874b54]
11: (FileStore::_remove(coll_t, ghobject_t const&, SequencerPosition const&)+0x8b) [0x874eab]
12: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, int, ThreadPool::TPHandle*)+0x2901) [0x884d01]
13: (FileStore::_do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long, ThreadPool::TPHandle*)+0x6c) [0x886cac]
14: (FileStore::_do_op(FileStore::OpSequencer*, ThreadPool::TPHandle&)+0x167) [0x886e37]
15: (ThreadPool::worker(ThreadPool::WorkThread*)+0xaef) [0xa3977f]
16: (ThreadPool::WorkThread::entry()+0x10) [0xa3a670]
17: (()+0x7f6e) [0x7f414674af6e]
18: (clone()+0x6d) [0x7f4144aeb9cd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- begin dump of recent events ---
0> 2014-04-12 12:55:22.961071 7f413c195700 -1 *** Caught signal (Aborted) **
in thread 7f413c195700
ceph version 0.79-128-g28371a2 (28371a2463cce4600054d00df526c43efa218e0a)
1: ceph-osd() [0x965adf]
2: (()+0xfbb0) [0x7f4146752bb0]
3: (gsignal()+0x37) [0x7f4144a27f77]
4: (abort()+0x148) [0x7f4144a2b5e8]
5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f41453336e5]
6: (()+0x5e856) [0x7f4145331856]
7: (()+0x5e883) [0x7f4145331883]
8: (()+0x5eaae) [0x7f4145331aae]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1f2) [0xa488d2]
10: (FileStore::lfn_unlink(coll_t, ghobject_t const&, SequencerPosition const&, bool)+0x494) [0x874b54]
11: (FileStore::_remove(coll_t, ghobject_t const&, SequencerPosition const&)+0x8b) [0x874eab]
12: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, int, ThreadPool::TPHandle*)+0x2901) [0x884d01]
13: (FileStore::_do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long, ThreadPool::TPHandle*)+0x6c) [0x886cac]
14: (FileStore::_do_op(FileStore::OpSequencer*, ThreadPool::TPHandle&)+0x167) [0x886e37]
15: (ThreadPool::worker(ThreadPool::WorkThread*)+0xaef) [0xa3977f]
16: (ThreadPool::WorkThread::entry()+0x10) [0xa3a670]
17: (()+0x7f6e) [0x7f414674af6e]
18: (clone()+0x6d) [0x7f4144aeb9cd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- logging levels ---
0/ 5 none
0/ 0 lockdep
0/ 0 context
0/ 0 crush
0/ 0 mds
0/ 0 mds_balancer
0/ 0 mds_locker
0/ 0 mds_log
0/ 0 mds_log_expire
0/ 0 mds_migrator
0/ 0 buffer
0/ 0 timer
0/ 0 filer
0/ 1 striper
0/ 0 objecter
0/ 0 rados
0/ 0 rbd
0/ 0 journaler
0/ 0 objectcacher
0/ 0 client
0/ 0 osd
0/ 0 optracker
0/ 0 objclass
0/ 0 filestore
1/ 3 keyvaluestore
0/ 0 journal
0/ 0 ms
0/ 0 mon
0/ 0 monc
0/ 0 paxos
0/ 0 tp
0/ 0 auth
1/ 5 crypto
0/ 0 finisher
0/ 0 heartbeatmap
0/ 0 perfcounter
0/ 0 rgw
1/ 5 javaclient
0/ 0 asok
0/ 0 throttle
-2/-2 (syslog threshold)
-1/-1 (stderr threshold)
max_recent 10000
max_new 1000
log_file /tmp/cbt/ceph/log/osd.33.log
--- end dump of recent events ---
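For readers unfamiliar with the failing check, `assert(!registry.lookup(hoid))` in `FDCache::clear` expects a lookup to miss once an entry has been cleared. The following is only a rough sketch of how a check of that shape can fail, and it rests on an assumption this ticket does not confirm: that the registry tracks entries weakly while callers may still hold shared references. `Registry`, its members, and `clear_leaves_entry` are hypothetical names for illustration, not Ceph's actual classes, and this is not a statement of the root cause here.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>

// Hypothetical stand-in for a registry of cached file descriptors.
// Assumption: the cache pins each entry with a strong reference and also
// tracks it weakly, so an entry stays findable while any caller holds it.
class Registry {
  std::map<std::string, std::shared_ptr<int>> pinned;  // cache's own strong refs
  std::map<std::string, std::weak_ptr<int>> weak;      // all entries still alive
public:
  // Insert an entry and hand the caller a shared handle to it.
  std::shared_ptr<int> add(const std::string& key, int fd) {
    auto p = std::make_shared<int>(fd);
    pinned[key] = p;
    weak[key] = p;
    return p;
  }

  // Return a handle if the entry is still alive anywhere, else nullptr.
  std::shared_ptr<int> lookup(const std::string& key) {
    auto it = weak.find(key);
    if (it == weak.end()) return nullptr;
    auto p = it->second.lock();
    if (!p) weak.erase(it);  // entry already destroyed; forget it
    return p;
  }

  // Drop only the cache's own strong reference.
  void clear(const std::string& key) { pinned.erase(key); }
};

// Same shape as the failing check: clear, then expect lookup to miss.
// Returns true in the case such an assert would reject.
bool clear_leaves_entry(Registry& r, const std::string& key) {
  r.clear(key);
  return r.lookup(key) != nullptr;
}
```

In this sketch the check passes when nobody else holds the handle and trips when some other holder still does, which is one reason to capture higher-level logs before drawing conclusions about what held the reference.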
Updated by Mark Nelson about 10 years ago
Note: of the 6 OSDs in the cache tier, 5 appeared to fail with a similar stack trace.
Updated by Samuel Just about 10 years ago
You'll probably have to reproduce with debug filestore = 20 and debug osd = 20.
Updated by Mark Nelson about 10 years ago
On further review, this seems to happen when an erasure-coded base pool (one with an associated writeback cache pool) is deleted or recreated via ceph osd pool delete/create.
Updated by Samuel Just about 10 years ago
- Status changed from New to 7
- Assignee set to Samuel Just
Updated by Mark Nelson about 10 years ago
wip-8086 appears to have solved this. Thanks Sam!
Updated by Samuel Just about 10 years ago
- Status changed from 12 to Fix Under Review
Updated by Sage Weil about 10 years ago
- Status changed from Fix Under Review to Resolved