Bug #9480 (closed)

OSD crashes during object deletion

Added by Somnath Roy over 9 years ago. Updated over 9 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: -
Target version: -
% Done: 0%
Source: other
Tags:
Backport: giant
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Steps to reproduce:

1. Run a command like the following:

rados bench -p rbench 200 write -t 32 -b 1024

The OSD will crash with the following trace:

2014-09-12 13:48:06.820524 7fb56596d700 -1 os/FDCache.h: In function 'void FDCache::clear(const ghobject_t&)' thread 7fb56596d700 time 2014-09-12 13:48:06.815407
os/FDCache.h: 89: FAILED assert(!registry[registry_id].lookup(hoid))

ceph version 0.84-998-gfcf8059 (fcf805972124dac1eae18b1cfd286790462b8ec8)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0xa82a0b]
2: (FileStore::lfn_unlink(coll_t, ghobject_t const&, SequencerPosition const&, bool)+0x54b) [0x8918eb]
3: (FileStore::_remove(coll_t, ghobject_t const&, SequencerPosition const&)+0x8b) [0x891d8b]
4: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, int, ThreadPool::TPHandle*)+0x25ce) [0x8a0fae]
5: (FileStore::_do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long, ThreadPool::TPHandle*)+0x44) [0x8a32a4]
6: (FileStore::_do_op(FileStore::OpSequencer*, ThreadPool::TPHandle&)+0x169) [0x8a3479]
7: (ThreadPool::worker(ThreadPool::WorkThread*)+0xac0) [0xa707b0]
8: (ThreadPool::WorkThread::entry()+0x10) [0xa72b30]
9: (()+0x7f6e) [0x7fb570cd7f6e]
10: (clone()+0x6d) [0x7fb56f2c59cd]


Related issues: 1 (0 open, 1 closed)

Has duplicate: Ceph - Bug #9711: 'cache' osd crash on ceph 0.86 (Duplicate, 10/09/2014)

Actions #1

Updated by Somnath Roy over 9 years ago

I have root-caused it; it seems to be happening because of one of my earlier changes :-( .. Here is the root cause.

1. FDCache::clear(), and thus SharedLRU::clear(), is not able to remove the object from SharedLRU::weak_refs since the FDCache ref is still held by some other thread. The assert is there to prevent an FD leak.

2. Other than lfn_unlink(), only lfn_open() works with the fdcache, and I had earlier moved fdcache.lookup() out of the scope of the index lock as an optimization. We thought that in case of a cache hit there is no need to call get_index() and lock it.

3. Moving fdcache.lookup() back within the index lock seems to fix the issue (a simplified sketch of the ordering follows below).

4. Now the logic matches Firefly.
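To make point 3 concrete, here is a minimal, compilable sketch of the ordering change. It is not the actual FileStore code: Index, get_index(), fdcache and FDRef below are stand-ins modelled on the names used in this discussion.

// Sketch only -- simplified stand-ins for FileStore internals, not the real code.
#include <memory>
#include <shared_mutex>

struct FD {};
using FDRef = std::shared_ptr<FD>;

struct Index { std::shared_mutex access_lock; };        // plays the role of the collection index lock
Index *get_index() { static Index idx; return &idx; }   // stand-in for FileStore::get_index()

struct FdCacheStub {
  FDRef lookup() { return nullptr; }                     // stand-in for fdcache.lookup(hoid)
  FDRef open_and_insert() { return std::make_shared<FD>(); }
} fdcache;

// Racy ordering (the earlier optimization): the cache lookup happens before the
// index lock is taken, so a hit can interleave with lfn_unlink(), which takes the
// lock and then calls fdcache.clear() followed by the failing assert.
FDRef lfn_open_racy() {
  if (FDRef fd = fdcache.lookup())
    return fd;                                           // hit path never touches the index lock
  Index *index = get_index();
  std::unique_lock<std::shared_mutex> l(index->access_lock);
  return fdcache.open_and_insert();
}

// Fixed ordering (matches Firefly): the lookup is done only while the index lock
// is held, so it is serialized against lfn_unlink()'s clear().
FDRef lfn_open_fixed() {
  Index *index = get_index();
  std::unique_lock<std::shared_mutex> l(index->access_lock);
  if (FDRef fd = fdcache.lookup())
    return fd;
  return fdcache.open_and_insert();
}

int main() { lfn_open_fixed(); return 0; }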

But I am not sure whether this will prevent the FD leak in all scenarios. What about the following scenario?

1. Thread A gets the index write lock and gets a hit in the fdcache. The FD is returned to the caller. The shared_ptr ref will still be 1.

2. Meanwhile, Thread B tries to remove the object via lfn_unlink(). It gets the index write lock successfully and calls fdcache.clear().

3. At this point, the FDRef will not be deleted since Thread A is still working with it (ref = 1). This will trigger the assert if the FD is not removed before the assert checks the lookup. A valid race condition (sketched below).
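To illustrate the scenario, here is a small self-contained model. It is not the Ceph SharedLRU/FDCache code; the types and names below are simplified assumptions. The point it demonstrates: clear() can only drop the cache's own strong reference, so a lookup performed right after clear() still succeeds while another thread holds the FDRef, which is exactly the condition the os/FDCache.h:89 assert rejects.

// Simplified model of a shared-ref FD cache -- illustration only, not Ceph code.
#include <iostream>
#include <map>
#include <memory>
#include <mutex>
#include <string>

struct FD { int fd = 42; };                 // stand-in for the cached file descriptor
using FDRef = std::shared_ptr<FD>;

class MiniSharedCache {
  std::mutex lock;
  std::map<std::string, std::weak_ptr<FD>> weak_refs;  // entry lives until the last FDRef dies
  std::map<std::string, FDRef> lru;                    // strong refs held by the cache itself

public:
  void add(const std::string &key) {
    std::lock_guard<std::mutex> l(lock);
    // The deleter erases the weak_refs entry only when the LAST FDRef is dropped.
    FDRef ref(new FD, [this, key](FD *p) {
      std::lock_guard<std::mutex> g(lock);
      weak_refs.erase(key);
      delete p;
    });
    weak_refs[key] = ref;
    lru[key] = ref;
  }

  FDRef lookup(const std::string &key) {
    std::lock_guard<std::mutex> l(lock);
    auto i = weak_refs.find(key);
    if (i == weak_refs.end())
      return nullptr;
    return i->second.lock();
  }

  void clear(const std::string &key) {
    FDRef doomed;                           // destroyed after the lock is released
    std::lock_guard<std::mutex> l(lock);
    auto i = lru.find(key);
    if (i != lru.end()) {
      doomed = std::move(i->second);        // drop only the cache's own strong ref
      lru.erase(i);
    }
  }
};

int main() {
  MiniSharedCache fdcache;
  fdcache.add("obj1");

  FDRef held = fdcache.lookup("obj1");      // "Thread A": cache hit, keeps holding the FDRef

  fdcache.clear("obj1");                    // "Thread B": the lfn_unlink() path clears the entry
  // Equivalent of the post-clear check in FDCache::clear(): the entry is still
  // visible because Thread A's FDRef keeps it alive, so the assert would fire.
  std::cout << "after clear, lookup " << (fdcache.lookup("obj1") ? "hits" : "misses") << "\n";

  held.reset();                             // Thread A finally drops its reference
  std::cout << "after release, lookup " << (fdcache.lookup("obj1") ? "hits" : "misses") << "\n";
  return 0;
}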

Somehow I am not able to hit this scenario, and I believe a similar race condition exists in Firefly as well.

So my question is: will the fix in lfn_open() be sufficient?

Actions #2

Updated by Somnath Roy over 9 years ago

  • Backport set to Giant
Actions #3

Updated by Somnath Roy over 9 years ago

Created the following pull request for the fix.

https://github.com/ceph/ceph/pull/2510

Actions #4

Updated by Samuel Just over 9 years ago

  • Status changed from New to 7
Actions #5

Updated by Samuel Just over 9 years ago

  • Status changed from 7 to Resolved
Actions #6

Updated by Loïc Dachary over 9 years ago

  • Status changed from Resolved to 12
  • Backport changed from Giant to giant
Actions #7

Updated by Samuel Just over 9 years ago

  • Assignee changed from Somnath Roy to Samuel Just
Actions #8

Updated by Samuel Just over 9 years ago

  • Status changed from 12 to 7
Actions #9

Updated by Sage Weil over 9 years ago

  • Status changed from 7 to Pending Backport
Actions #10

Updated by Samuel Just over 9 years ago

  • Status changed from Pending Backport to Resolved