Feature #55283

qa: add fsync/sync stuck waiting for unsafe request test

Added by Xiubo Li almost 2 years ago. Updated almost 2 years ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
-
Target version:
-
% Done:
0%

Source:
Tags:
Backport:
Reviewed:
Affected Versions:
Component(FS):
qa-suite
Labels (FS):
qa
Pull request ID:

Description

The kclient has fixed this in the following commits:

6316d4197437 ("ceph: put the requests/sessions when it fails to alloc memory")
716e68d445fa ("ceph: fix off by one bugs in unsafe_request_wait()")
2e31f19343da ("ceph: flush the mdlog before waiting on unsafe reqs")
65628853caba ("ceph: flush mdlog before umounting")
4c0090260622 ("ceph: make iterate_sessions a global symbol")
90803e2b16ec ("ceph: make ceph_create_session_msg a global symbol")

And libcephfs has fixed it in:

https://github.com/ceph/ceph/pull/44255

Related issues

Related to Linux kernel client - Bug #55284: kclient: filesystem sync will stuck for around 5 seconds sometimes Resolved
Related to CephFS - Bug #55329: qa: add test case for fsync crash issue Resolved

History

#1 Updated by Xiubo Li almost 2 years ago

  • Priority changed from Normal to High

#2 Updated by Xiubo Li almost 2 years ago

  • Related to Bug #55284: kclient: filesystem sync will stuck for around 5 seconds sometimes added

#3 Updated by Xiubo Li almost 2 years ago

  • Status changed from New to In Progress
  • Assignee set to Xiubo Li

#4 Updated by Xiubo Li almost 2 years ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 45886

#5 Updated by Xiubo Li almost 2 years ago

Added support for two test cases: one for file sync (fsync) and another for filesystem sync.
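
A minimal sketch of what the file-sync (fsync) half could look like, assuming a cephfs qa-suite test class with a locally mounted client at self.mount_a.mountpoint; the helper name and layout here are illustrative, not the merged qa code:

import logging
import os
import time

log = logging.getLogger(__name__)

def test_fsync_duration(self, iterations=300, threshold=4):
    # Hypothetical file-sync half of the test: write to a file on a locally
    # mounted client and time fsync(); before the kclient fix, fsync could
    # stall for several seconds waiting on unsafe MDS requests.
    path = os.path.join(self.mount_a.mountpoint, "fsync_testfile")
    fd = os.open(path, os.O_CREAT | os.O_RDWR)
    try:
        for i in range(iterations):
            os.write(fd, b"x" * 4096)
            start = time.time()
            os.fsync(fd)
            duration = time.time() - start
            log.info("fsync i = {}, duration = {}".format(i, duration))
            self.assertLess(duration, threshold)
    finally:
        os.close(fd)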

#6 Updated by Xiubo Li almost 2 years ago

Normally, before this fix we could reproduce the issue very easily, and in most cases the duration was around 4 seconds or longer:

2022-04-12 16:52:03,516.516 INFO:tasks.cephfs.test_misc:rmdir i = 91, duration = 4.060511827468872
2022-04-12 16:52:08,538.538 INFO:tasks.cephfs.test_misc:mkdir i = 92, duration = 4.193295955657959
2022-04-12 16:52:13,538.538 INFO:tasks.cephfs.test_misc:rmdir i = 92, duration = 4.091280698776245
2022-04-12 16:52:18,865.865 INFO:tasks.cephfs.test_misc:mkdir i = 93, duration = 4.566471338272095
2022-04-12 16:52:23,644.644 INFO:tasks.cephfs.test_misc:rmdir i = 93, duration = 3.8654799461364746
2022-04-12 16:52:28,538.538 INFO:tasks.cephfs.test_misc:mkdir i = 94, duration = 4.076276063919067
2022-04-12 16:52:33,515.515 INFO:tasks.cephfs.test_misc:rmdir i = 94, duration = 4.069322109222412
2022-04-12 16:52:38,528.528 INFO:tasks.cephfs.test_misc:mkdir i = 95, duration = 4.2052106857299805
2022-04-12 16:52:43,524.524 INFO:tasks.cephfs.test_misc:rmdir i = 95, duration = 4.074751138687134
2022-04-12 16:52:48,523.523 INFO:tasks.cephfs.test_misc:mkdir i = 96, duration = 4.137550115585327
2022-04-12 16:52:53,552.552 INFO:tasks.cephfs.test_misc:rmdir i = 96, duration = 4.125017404556274
2022-04-12 16:52:58,530.530 INFO:tasks.cephfs.test_misc:mkdir i = 97, duration = 4.160985708236694
2022-04-12 16:53:03,511.511 INFO:tasks.cephfs.test_misc:rmdir i = 97, duration = 4.096373796463013

So currently the qa test cases run mkdir/rmdir 300 times, and if no duration exceeds 4 seconds we assume the test succeeded:

self.assertLess(duration, 4)
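
A rough sketch of how that loop could be structured, assuming the usual cephfs qa-suite conventions (a test class exposing self.mount_a.run_shell and the unittest assertion above); the names and exact layout are illustrative, not necessarily the merged test:

import logging
import time

log = logging.getLogger(__name__)

def test_filesystem_sync_stuck(self):
    # Illustrative loop: run mkdir/rmdir, then time a filesystem-wide sync;
    # before the fix, sync could stall ~4-5 seconds waiting on unsafe requests.
    for i in range(300):
        for op in ("mkdir", "rmdir"):
            self.mount_a.run_shell([op, "testdir"])
            start = time.time()
            self.mount_a.run_shell(["sync"])
            duration = time.time() - start
            log.info("{} i = {}, duration = {}".format(op, i, duration))
            self.assertLess(duration, 4)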

The following are the logs after the fix in kclient/libcephfs:

2022-04-12 16:05:24,923.923 INFO:tasks.cephfs.test_misc:58, 9, duration = 0.011046648025512695
2022-04-12 16:05:25,872.872 INFO:tasks.cephfs.test_misc:59, 9, duration = 0.016795873641967773
2022-04-12 16:05:27,174.174 INFO:tasks.cephfs.test_misc:59, 9, duration = 0.0077762603759765625
2022-04-12 16:05:28,171.171 INFO:tasks.cephfs.test_misc:60, 9, duration = 0.014036417007446289
2022-04-12 16:05:29,573.573 INFO:tasks.cephfs.test_misc:60, 9, duration = 0.014261007308959961
2022-04-12 16:05:30,576.576 INFO:tasks.cephfs.test_misc:61, 9, duration = 0.009625434875488281
2022-04-12 16:05:31,926.926 INFO:tasks.cephfs.test_misc:61, 9, duration = 0.013339519500732422
2022-04-12 16:05:32,955.955 INFO:tasks.cephfs.test_misc:62, 9, duration = 0.01258397102355957
2022-04-12 16:05:34,341.341 INFO:tasks.cephfs.test_misc:62, 9, duration = 0.0188295841217041
2022-04-12 16:05:35,390.390 INFO:tasks.cephfs.test_misc:63, 9, duration = 0.010167360305786133
2022-04-12 16:05:36,748.748 INFO:tasks.cephfs.test_misc:63, 9, duration = 0.009861469268798828
2022-04-12 16:05:37,724.724 INFO:tasks.cephfs.test_misc:64, 9, duration = 0.01287531852722168
2022-04-12 16:05:39,060.060 INFO:tasks.cephfs.test_misc:64, 9, duration = 0.015590667724609375
2022-04-12 16:05:40,046.046 INFO:tasks.cephfs.test_misc:65, 9, duration = 0.011040925979614258
2022-04-12 16:05:41,392.392 INFO:tasks.cephfs.test_misc:65, 9, duration = 0.01649618148803711
2022-04-12 16:05:42,374.374 INFO:tasks.cephfs.test_misc:66, 9, duration = 0.010881185531616211
2022-04-12 16:05:43,778.778 INFO:tasks.cephfs.test_misc:66, 9, duration = 0.01652979850769043

Maybe we should use 2 or 3 seconds instead of 4 seconds here? I am just a little afraid that under heavy load the safe reply could still take much longer, which would not be related to the bug we are fixing here...

#7 Updated by Xiubo Li almost 2 years ago

  • Related to Bug #55329: qa: add test case for fsync crash issue added

#8 Updated by Venky Shankar almost 2 years ago

  • Status changed from Fix Under Review to Resolved
