Bug #15695

closed

FileStore: umount hang because sync thread doesn't exit

Added by Haomai Wang almost 8 years ago. Updated almost 8 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Target version: -
% Done: 0%
Source: Community (dev)
Tags:
Backport: jewel
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

int FileStore::umount()
  ......
  do_force_sync();

  lock.Lock();
  stop = true;
  sync_cond.Signal();
  lock.Unlock();

Because the force_sync flag is not set under the same lock acquisition as stop, the sync thread can end up waiting twice, which is not expected:

utime_t startwait = ceph_clock_now(g_ceph_context);
if (!force_sync) {
  dout(20) << "sync_entry waiting for max_interval " << max_interval << dendl;
  sync_cond.WaitInterval(g_ceph_context, lock, max_interval);
} else {
  dout(20) << "sync_entry not waiting, force_sync set" << dendl;
}
if (force_sync) {
  dout(20) << "sync_entry force_sync set" << dendl;
  force_sync = false;
} else {
  // wait for at least the min interval
  utime_t woke = ceph_clock_now(g_ceph_context);
  woke -= startwait;
  dout(20) << "sync_entry woke after " << woke << dendl;
  if (woke < min_interval) {
    utime_t t = min_interval;
    t -= woke;
    dout(20) << "sync_entry waiting for another " << t
             << " to reach min interval " << min_interval << dendl;
    sync_cond.WaitInterval(g_ceph_context, lock, t);
  }
}

http://pulpito.ceph.com/haomai-2016-05-01_23:40:37-rados-wip-haomai-testing-distro-basic-smithi/161404/
http://pulpito.ceph.com/haomai-2016-05-01_23:40:37-rados-wip-haomai-testing-distro-basic-smithi/161419

Both of these jobs got stuck in this case, and WaitInterval waited for 10 hours because of the command-line arguments:

ceph_test_filestore_idempotent_sequence run-sequence-to 0 b.00 b.00/journal --test-seed 59 --osd-journal-size 100 --log-file b.00.clean --debug-filestore 20 --filestore-min-sync-interval 36000 --filestore-max-sync-interval 36001
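
To put a number on the second wait, here is a small sketch that mirrors only the arithmetic of the min-interval branch in the sync_entry excerpt. The helper name second_wait_seconds is hypothetical and not part of FileStore, and the 1-second wake-up time in the note below is an assumed value chosen for illustration.

```cpp
#include <cassert>

// Hypothetical helper (not FileStore code): reproduces the arithmetic of the
// "wait for at least the min interval" branch. All values are in seconds.
// Returns how long the second WaitInterval would be asked to wait when the
// thread wakes `woke` seconds after startwait.
double second_wait_seconds(double woke, double min_interval, bool force_sync) {
  assert(woke >= 0.0 && min_interval >= 0.0);
  if (force_sync) {
    return 0.0;  // force_sync branch: the flag is cleared, no second wait
  }
  if (woke < min_interval) {
    return min_interval - woke;  // t = min_interval; t -= woke;
  }
  return 0.0;  // already past the min interval
}
```

With --filestore-min-sync-interval 36000, a thread that wakes after 1 second with force_sync not set would request second_wait_seconds(1.0, 36000.0, false), i.e. 35999 seconds, just under 10 hours.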


Related issues 1 (0 open, 1 closed)

Copied to Ceph - Backport #15768: jewel: FileStore: umount hang because sync thread doesn't exit (Resolved, Abhishek Varshney)
Actions #1

Updated by Haomai Wang almost 8 years ago

We could reduce "--filestore-min-sync-interval 36000" to a lower value for the QA test suite, or modify sync_entry to avoid the unexpected wait?
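
One possible shape for the second option, as a rough sketch only: check stop after every wake-up and make the min-interval wait return early when stop is set. This uses std::condition_variable rather than Ceph's Mutex/Cond, and SyncLoop, request_stop and run_and_stop_seconds are hypothetical names; it is not the fix that was merged.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical stand-in for the sync thread state. Field names echo the
// excerpt above, but this is not the FileStore code.
struct SyncLoop {
  std::mutex lock;
  std::condition_variable sync_cond;
  bool stop = false;
  bool force_sync = false;

  void sync_entry(std::chrono::seconds min_interval,
                  std::chrono::seconds max_interval) {
    assert(min_interval <= max_interval);
    std::unique_lock<std::mutex> l(lock);
    while (!stop) {
      auto startwait = std::chrono::steady_clock::now();
      if (!force_sync) {
        sync_cond.wait_for(l, max_interval);
      }
      if (stop) break;  // re-check stop after every wake-up
      if (force_sync) {
        force_sync = false;
      } else {
        auto woke = std::chrono::steady_clock::now() - startwait;
        if (woke < min_interval) {
          // The predicate makes this wait return as soon as stop (or
          // force_sync) is set, even if it was set before the wait began.
          sync_cond.wait_for(l, min_interval - woke,
                             [this] { return stop || force_sync; });
          if (stop) break;
        }
      }
      // A real sync thread would commit here.
    }
  }

  // Hypothetical umount-side helper: set stop under the lock, then wake
  // the sync thread.
  void request_stop() {
    {
      std::lock_guard<std::mutex> g(lock);
      stop = true;
    }
    sync_cond.notify_all();
  }
};

// Starts the loop with the intervals from the failing test, asks it to
// stop shortly afterwards, and returns the total elapsed time in seconds.
inline double run_and_stop_seconds() {
  SyncLoop s;
  auto t0 = std::chrono::steady_clock::now();
  std::thread t([&s] {
    s.sync_entry(std::chrono::seconds(36000), std::chrono::seconds(36001));
  });
  std::this_thread::sleep_for(std::chrono::milliseconds(50));
  s.request_stop();
  t.join();
  return std::chrono::duration<double>(
             std::chrono::steady_clock::now() - t0).count();
}
```

Even with the 36000/36001 second intervals, run_and_stop_seconds() should return well under a second here, because neither wait can outlive the stop flag.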

Actions #2

Updated by Kefu Chai almost 8 years ago

  • Status changed from New to Fix Under Review
  • Source changed from other to Community (dev)
Actions #3

Updated by Kefu Chai almost 8 years ago

  • Assignee set to Kefu Chai
Actions #4

Updated by Sage Weil almost 8 years ago

  • Status changed from Fix Under Review to Pending Backport
  • Backport set to jewel
Actions #5

Updated by Nathan Cutler almost 8 years ago

  • Copied to Backport #15768: jewel: FileStore: umount hang because sync thread doesn't exit added
Actions #6

Updated by Loïc Dachary almost 8 years ago

  • Status changed from Pending Backport to Resolved