Feature #55283
qa: add fsync/sync stuck waiting for unsafe request test
% Done:
0%
Source:
Tags:
Backport:
Reviewed:
Description
The kclient has fixed this in:
6316d4197437 ("ceph: put the requests/sessions when it fails to alloc memory")
716e68d445fa ("ceph: fix off by one bugs in unsafe_request_wait()")
2e31f19343da ("ceph: flush the mdlog before waiting on unsafe reqs")
65628853caba ("ceph: flush mdlog before umounting")
4c0090260622 ("ceph: make iterate_sessions a global symbol")
90803e2b16ec ("ceph: make ceph_create_session_msg a global symbol")
And libcephfs has fixed this in:
https://github.com/ceph/ceph/pull/44255
Related issues
History
#1 Updated by Xiubo Li almost 2 years ago
- Priority changed from Normal to High
#2 Updated by Xiubo Li almost 2 years ago
- Related to Bug #55284: kclient: filesystem sync will stuck for around 5 seconds sometimes added
#3 Updated by Xiubo Li almost 2 years ago
- Status changed from New to In Progress
- Assignee set to Xiubo Li
#4 Updated by Xiubo Li almost 2 years ago
- Status changed from In Progress to Fix Under Review
- Pull request ID set to 45886
#5 Updated by Xiubo Li almost 2 years ago
Added support for two test cases: one for file sync (fsync) and another for filesystem sync.
#6 Updated by Xiubo Li almost 2 years ago
Before the fix this was very easy to reproduce, and the duration was usually around 4 seconds or more:
2022-04-12 16:52:03,516.516 INFO:tasks.cephfs.test_misc:rmdir i = 91, duration = 4.060511827468872
2022-04-12 16:52:08,538.538 INFO:tasks.cephfs.test_misc:mkdir i = 92, duration = 4.193295955657959
2022-04-12 16:52:13,538.538 INFO:tasks.cephfs.test_misc:rmdir i = 92, duration = 4.091280698776245
2022-04-12 16:52:18,865.865 INFO:tasks.cephfs.test_misc:mkdir i = 93, duration = 4.566471338272095
2022-04-12 16:52:23,644.644 INFO:tasks.cephfs.test_misc:rmdir i = 93, duration = 3.8654799461364746
2022-04-12 16:52:28,538.538 INFO:tasks.cephfs.test_misc:mkdir i = 94, duration = 4.076276063919067
2022-04-12 16:52:33,515.515 INFO:tasks.cephfs.test_misc:rmdir i = 94, duration = 4.069322109222412
2022-04-12 16:52:38,528.528 INFO:tasks.cephfs.test_misc:mkdir i = 95, duration = 4.2052106857299805
2022-04-12 16:52:43,524.524 INFO:tasks.cephfs.test_misc:rmdir i = 95, duration = 4.074751138687134
2022-04-12 16:52:48,523.523 INFO:tasks.cephfs.test_misc:mkdir i = 96, duration = 4.137550115585327
2022-04-12 16:52:53,552.552 INFO:tasks.cephfs.test_misc:rmdir i = 96, duration = 4.125017404556274
2022-04-12 16:52:58,530.530 INFO:tasks.cephfs.test_misc:mkdir i = 97, duration = 4.160985708236694
2022-04-12 16:53:03,511.511 INFO:tasks.cephfs.test_misc:rmdir i = 97, duration = 4.096373796463013
So currently the qa test case runs mkdir/rmdir 300 times, and if no duration exceeds 4 seconds we consider it successful:
self.assertLess(duration, 4)
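The timing loop described above can be sketched as a small standalone script. This is a hypothetical illustration, not the actual qa test (which runs inside the teuthology/cephfs test framework on a CephFS mount); the names `timed`, `run`, `REPEATS`, and `THRESHOLD` are invented for this sketch, and `REPEATS` is trimmed from the 300 iterations the real test uses:

```python
import os
import time

REPEATS = 10        # the qa test uses 300; trimmed here for illustration
THRESHOLD = 4.0     # seconds, matching the assertLess(duration, 4) bound

def timed(fn, *args):
    """Run fn(*args) and return how long it took in seconds."""
    start = time.time()
    fn(*args)
    return time.time() - start

def run(path="sync_test_dir"):
    """Repeat mkdir + sync + rmdir, asserting no call exceeds THRESHOLD."""
    worst = 0.0
    for i in range(REPEATS):
        # On CephFS, mkdir issues an unsafe MDS request; the following
        # sync is what used to stall for ~4-5 seconds before the fix.
        worst = max(worst, timed(os.mkdir, path))
        worst = max(worst, timed(os.sync))
        worst = max(worst, timed(os.rmdir, path))
        assert worst < THRESHOLD, f"iteration {i}: {worst:.3f}s exceeds bound"
    return worst

if __name__ == "__main__":
    print(f"worst duration = {run():.6f}s")
```

Run on an ordinary local filesystem every call completes in milliseconds; the interesting behavior only appears on a CephFS mount where `sync` must wait for unsafe requests to be flushed by the MDS.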
The following is the logs after fixing in kclient/libcephfs:
2022-04-12 16:05:24,923.923 INFO:tasks.cephfs.test_misc:58, 9, duration = 0.011046648025512695
2022-04-12 16:05:25,872.872 INFO:tasks.cephfs.test_misc:59, 9, duration = 0.016795873641967773
2022-04-12 16:05:27,174.174 INFO:tasks.cephfs.test_misc:59, 9, duration = 0.0077762603759765625
2022-04-12 16:05:28,171.171 INFO:tasks.cephfs.test_misc:60, 9, duration = 0.014036417007446289
2022-04-12 16:05:29,573.573 INFO:tasks.cephfs.test_misc:60, 9, duration = 0.014261007308959961
2022-04-12 16:05:30,576.576 INFO:tasks.cephfs.test_misc:61, 9, duration = 0.009625434875488281
2022-04-12 16:05:31,926.926 INFO:tasks.cephfs.test_misc:61, 9, duration = 0.013339519500732422
2022-04-12 16:05:32,955.955 INFO:tasks.cephfs.test_misc:62, 9, duration = 0.01258397102355957
2022-04-12 16:05:34,341.341 INFO:tasks.cephfs.test_misc:62, 9, duration = 0.0188295841217041
2022-04-12 16:05:35,390.390 INFO:tasks.cephfs.test_misc:63, 9, duration = 0.010167360305786133
2022-04-12 16:05:36,748.748 INFO:tasks.cephfs.test_misc:63, 9, duration = 0.009861469268798828
2022-04-12 16:05:37,724.724 INFO:tasks.cephfs.test_misc:64, 9, duration = 0.01287531852722168
2022-04-12 16:05:39,060.060 INFO:tasks.cephfs.test_misc:64, 9, duration = 0.015590667724609375
2022-04-12 16:05:40,046.046 INFO:tasks.cephfs.test_misc:65, 9, duration = 0.011040925979614258
2022-04-12 16:05:41,392.392 INFO:tasks.cephfs.test_misc:65, 9, duration = 0.01649618148803711
2022-04-12 16:05:42,374.374 INFO:tasks.cephfs.test_misc:66, 9, duration = 0.010881185531616211
2022-04-12 16:05:43,778.778 INFO:tasks.cephfs.test_misc:66, 9, duration = 0.01652979850769043
Maybe we should use 2 or 3 seconds instead of 4 seconds here? I am just a little afraid that under heavy load the safe reply could still take much longer, which would be unrelated to the bug we are fixing here...
#7 Updated by Xiubo Li almost 2 years ago
- Related to Bug #55329: qa: add test case for fsync crash issue added
#8 Updated by Venky Shankar almost 2 years ago
- Status changed from Fix Under Review to Resolved