Bug #21480
bluestore: flush_commit is racy
Status:
Resolved
Priority:
Urgent
Assignee:
-
Target version:
-
% Done:
0%
Source:
Tags:
Backport:
mimic,luminous
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
Observed a hang on the 'osd bench' command:
/a/yuriw-2017-09-19_19:54:13-rados-wip-yuri-testing3-2017-09-19-1710-distro-basic-smithi/1648854
2017-09-19 22:29:11.978724 7f70bd261700 1 heartbeat_map is_healthy 'OSD::command_tp thread 0x7f709a9e7700' had suicide timed out after 900
2017-09-19 22:29:11.983003 7f709a9e7700 -1 *** Caught signal (Aborted) **
 in thread 7f709a9e7700 thread_name:tp_osd_cmd
 ceph version 13.0.0-1010-g1c941a3 (1c941a39eaa824e91551e0b37ebcca96e0f6f174) mimic (dev)
 1: (()+0xa39e89) [0x7f70c29c3e89]
 2: (()+0x10330) [0x7f70c0699330]
 3: (pthread_cond_wait()+0xc4) [0x7f70c0695404]
 4: (C_SaferCond::wait()+0x8c) [0x7f70c24f741c]
 5: (OSD::do_command(Connection*, unsigned long, std::vector<std::string, std::allocator<std::string> >&, ceph::buffer::list&)+0x1b7d) [0x7f70c24e4d7d]
 6: (OSD::CommandWQ::_process(OSD::Command*, ThreadPool::TPHandle&)+0x49) [0x7f70c25291a9]
 7: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa6e) [0x7f70c2a06d4e]
 8: (ThreadPool::WorkThread::entry()+0x10) [0x7f70c2a07c30]
flush_commit() appears to be racy: the state is set to KV_DONE without holding the osr lock, but the flush_commit() waiter relies on that lock.
The possibly good news is that there are only a handful of users of flush_commit(); maybe we can just drop it.
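To illustrate the kind of race suspected above, here is a minimal sketch of the general pattern for avoiding a lost wakeup: the state change is published under the same lock that the waiter holds while checking it. This is not BlueStore code and not necessarily how the bug was fixed. Sequencer, TxcState, qlock, qcond, mark_kv_done(), wait_for_commit() and run_demo() are all hypothetical names chosen for illustration.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical stand-ins; these are NOT the real BlueStore types.
enum class TxcState { PREPARE, KV_DONE };

struct Sequencer {
  std::mutex qlock;               // plays the role of the osr lock
  std::condition_variable qcond;
  TxcState state = TxcState::PREPARE;

  // Committing side: update the state while holding the same lock the
  // waiter uses, then wake any waiter.
  void mark_kv_done() {
    {
      std::lock_guard<std::mutex> l(qlock);
      state = TxcState::KV_DONE;
    }
    qcond.notify_all();
  }

  // Waiting side: the predicate is evaluated with qlock held, so the state
  // cannot change between the check and the sleep.
  void wait_for_commit() {
    std::unique_lock<std::mutex> l(qlock);
    qcond.wait(l, [this] { return state == TxcState::KV_DONE; });
  }
};

// Runs one committer thread against one waiter and reports the final state.
bool run_demo() {
  Sequencer osr;
  std::thread committer([&osr] { osr.mark_kv_done(); });
  osr.wait_for_commit();
  committer.join();
  return osr.state == TxcState::KV_DONE;
}
```

If the state were written without qlock, a waiter could observe the old state, the writer could then update it and signal, and the waiter could go to sleep afterwards and never be woken, which would look like the hang in the backtrace above.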