Bug #24101
mds: deadlock during fsstress workunit with 9 actives
Description
The op the client was blocked on:
pdonnell@smithi104:/var/log/ceph$ sudo ceph --admin-daemon=/var/run/ceph/ceph-client.0.11792.asok mds_requests
{
    "request": {
        "tid": 11153,
        "op": "lookup",
        "path": "#0x10000000107/f31",
        "path2": "",
        "ino": "0x10000000107",
        "hint_ino": "0x0",
        "sent_stamp": "2018-05-11 12:41:53.192579",
        "mds": 5,
        "resend_mds": -1,
        "send_to_auth": 0,
        "sent_on_mseq": 1,
        "retry_attempt": 2,
        "got_unsafe": 0,
        "uid": 0,
        "gid": 0,
        "oldest_client_tid": 11118,
        "mdsmap_epoch": 0,
        "flags": 0,
        "num_retry": 0,
        "num_fwd": 0,
        "num_releases": 0,
        "abort_rc": 0
    }
}
Rank 5 was waiting on an rdlock from rank 0, and rank 0 was waiting on a capability release from the single client.
Just to see what would happen, I tried failing mds.b (rank 5). That didn't do anything, so I tried failing mds.i (rank 0), which mds.b was waiting on for the rdlock. That also didn't do anything.
Finally, I evicted the client so that we would get logs from the run.
Related issues
History
#1 Updated by Zheng Yan almost 6 years ago
caused by https://github.com/ceph/ceph/pull/21615
#2 Updated by Zheng Yan almost 6 years ago
- Status changed from New to 12
#3 Updated by Patrick Donnelly almost 6 years ago
- Assignee set to Zheng Yan
#4 Updated by Patrick Donnelly almost 6 years ago
- Related to Bug #23837: client: deleted inode's Bufferhead which was in STATE::Tx would lead a assert fail added
#5 Updated by Ivan Guan almost 6 years ago
Zheng Yan wrote:
caused by https://github.com/ceph/ceph/pull/21615
Sorry, I don't understand why this PR can cause #24101. Could you explain it in detail?
#6 Updated by Zheng Yan almost 6 years ago
Ivan Guan wrote:
Zheng Yan wrote:
caused by https://github.com/ceph/ceph/pull/21615
Sorry, I don't understand why this PR can cause #24101. Could you explain it in detail?
void ObjectCacher::discard_writeback(ObjectSet *oset,
                                     const vector<ObjectExtent>& exls,
                                     Context* on_finish)
{
  assert(lock.is_locked());
  bool was_dirty = oset->dirty_or_tx > 0;

  C_GatherBuilder gather(cct);
  _discard(oset, exls, &gather);

  if (gather.has_subs()) {
    bool flushed = was_dirty && oset->dirty_or_tx == 0;
    gather.set_finisher(new FunctionContext(
      [this, oset, flushed, on_finish](int) {
        assert(lock.is_locked());
        if (flushed && flush_set_callback)
          flush_set_callback(flush_set_callback_arg, oset);
        if (on_finish)
          on_finish->complete(0);
      }));
    gather.activate();
    return;
  }

  _discard_finish(oset, was_dirty, on_finish);
}
"oset->dirty_or_tx > 0" is true before _discard() is called. After _discard() returns, oset->dirty_or_tx becomes 0 while gather.has_subs() is true. flush_set_callback() does not get called in this case, which leaks the client's Fcb caps.
#7 Updated by Patrick Donnelly almost 6 years ago
- Status changed from 12 to Closed
Apparently resolved by the revert.