Bug #63679
client: handle zero byte sync/async write cases
Status: Closed
Description
A bit of background: libcephfs is now capable of performing async I/O and a new api `Client::ll_preadv_pwritev` is available for it. I'm writing test cases in [0] and trying to cover all the edge cases as I can; while running them locally, one the test cases aimed to test zero bytes async I/O failed at the write call and returned this back trace
/home/dparmar/ceph/src/osdc/Striper.cc: 186: FAILED ceph_assert(len > 0)
ceph version 18.0.0-7275-g8bc4cf1939c (8bc4cf1939ce76918542dbb5433652de295e5916) reef (dev)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x125) [0x7f7f5602127d]
2: (ceph::register_assert_context(ceph::common::CephContext*)+0) [0x7f7f56021484]
3: (Striper::file_to_extents(ceph::common::CephContext*, file_layout_t const*, unsigned long, unsigned long, unsigned long, unsigned long, boost::container::small_vector<striper::LightweightObjectExtent, 4ul, void, void>*)+0x31d) [0x55e76c4a0be7]
4: (Striper::file_to_extents(ceph::common::CephContext*, char const*, file_layout_t const*, unsigned long, unsigned long, unsigned long, std::vector<ObjectExtent, std::allocator<ObjectExtent> >&, unsigned long)+0x79) [0x55e76c4a17e7]
5: (Striper::file_to_extents(ceph::common::CephContext*, inodeno_t, file_layout_t const*, unsigned long, unsigned long, unsigned long, std::vector<ObjectExtent, std::allocator<ObjectExtent> >&)+0x75) [0x55e76c3bc083]
6: (ObjectCacher::file_write(ObjectCacher::ObjectSet*, file_layout_t*, SnapContext const&, long, unsigned long, ceph::buffer::v15_2_0::list&, std::chrono::time_point<ceph::real_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > >, int, Context*, bool)+0x65) [0x55e76c3db4a1]
7: (Client::_write(Fh*, long, unsigned long, char const*, iovec const*, int, Context*, bool, bool)+0x99b) [0x55e76c3a9639]
8: (Client::_preadv_pwritev_locked(Fh*, iovec const*, unsigned int, long, bool, bool, Context*, ceph::buffer::v15_2_0::list*, bool, bool)+0x170) [0x55e76c3aa098]
9: (Client::ll_preadv_pwritev(Fh*, iovec const*, int, long, bool, Context*, ceph::buffer::v15_2_0::list*, bool, bool)+0xbb) [0x55e76c3aa879]
10: (TestClient_LlreadvLlwritevZeroBytes_Test::TestBody()+0x452) [0x55e76c340e74]
11: (void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*)+0x1b) [0x55e76c431082]
12: (void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*)+0x80) [0x55e76c439019]
13: (testing::Test::Run()+0xb4) [0x55e76c42ab94]
14: (testing::TestInfo::Run()+0x104) [0x55e76c42ac9a]
15: (testing::TestSuite::Run()+0xb2) [0x55e76c42ad4e]
16: (testing::internal::UnitTestImpl::RunAllTests()+0x36b) [0x55e76c42c1d7]
17: (bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*)+0x1b) [0x55e76c4312f2]
18: (bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*)+0x80) [0x55e76c43945f]
19: (testing::UnitTest::Run()+0x63) [0x55e76c42ae69]
20: (RUN_ALL_TESTS()+0x11) [0x55e76c33f85f]
21: main()
22: /lib64/libc.so.6(+0x27550) [0x7f7f54c4a550]
23: __libc_start_main()
24: _start()
It points to `Striper::file_to_extents`. The assertion that failed was added in 7bfde0a78e2a81ed9dac80b04ed17bfd426fed45. The only purpose I can find for `len` in the function is that it is assigned to `left`, the loop variable, so the loop is a no-op if its value is zero anyway. So do we really need this assertion?
Updated by Dhairya Parmar 5 months ago
I experimented with removing the assertion: the test case passed, but the test run could not complete; it stalled during teardown at
2023-11-29T18:25:36.044+0530 7fcb8f4309c0 2 client.4526 cache still has 0+1 items, waiting (for caps to release?)
2023-11-29T18:25:36.968+0530 7fcb6affd6c0 20 client.4526 tick
2023-11-29T18:25:36.968+0530 7fcb6affd6c0 10 client.4526 renew_caps()
2023-11-29T18:25:36.968+0530 7fcb6affd6c0 15 client.4526 renew_caps requesting from mds.0
2023-11-29T18:25:36.968+0530 7fcb6affd6c0 10 client.4526 renew_caps mds.0
2023-11-29T18:25:36.968+0530 7fcb6affd6c0 20 client.4526 collect_and_send_metrics
2023-11-29T18:25:36.968+0530 7fcb6affd6c0 20 client.4526 collect_and_send_global_metrics
2023-11-29T18:25:36.968+0530 7fcb6affd6c0 20 client.4526 trim_cache size 0 max 16384
2023-11-29T18:25:36.968+0530 7fcb6affd6c0 20 client.4526 upkeep thread waiting interval 1.000000000s
2023-11-29T18:25:36.968+0530 7fcb817fa6c0 10 client.4526 handle_client_session client_session(renewcaps seq 2) v1 from mds.0
2023-11-29T18:25:36.968+0530 7fcb817fa6c0 10 client.4526 unmounting: trim pass, size was 0+1
2023-11-29T18:25:36.968+0530 7fcb817fa6c0 20 client.4526 trim_cache size 0 max 16384
2023-11-29T18:25:36.968+0530 7fcb817fa6c0 10 client.4526 unmounting: trim pass, size still 0+1
2023-11-29T18:25:37.968+0530 7fcb6affd6c0 20 client.4526 tick
2023-11-29T18:25:37.968+0530 7fcb6affd6c0 20 client.4526 collect_and_send_metrics
2023-11-29T18:25:37.968+0530 7fcb6affd6c0 20 client.4526 collect_and_send_global_metrics
2023-11-29T18:25:37.968+0530 7fcb6affd6c0 20 client.4526 trim_cache size 0 max 16384
2023-11-29T18:25:37.968+0530 7fcb6affd6c0 20 client.4526 upkeep thread waiting interval 1.000000000s
Updated by Venky Shankar 5 months ago
Does this fail in the sync path? If not, then it could be due to the async changes.
Updated by Dhairya Parmar 5 months ago
Venky Shankar wrote:
Does this fail in the sync path? If not, then it could be due to the async changes.
I see the sync paths using the same helper function, so this might have always existed, but there is no test case covering it; I can add one and test it out. PR coming soon.
Updated by Xiubo Li 5 months ago
Dhairya Parmar wrote:
i experimented by removing the assertion, the test case passed but test case execution could not be completed, its stalled during tear down at
[...]
It seems the client is waiting for RADOS to respond and wake it up, but since the len is zero, that wakeup is just missed, I guess.
Updated by Dhairya Parmar 5 months ago
Xiubo Li wrote:
Dhairya Parmar wrote:
I experimented with removing the assertion: the test case passed, but the test run could not complete; it stalled during teardown at
[...]
It seems the client is waiting for RADOS to respond and wake it up, but since the len is zero, that wakeup is just missed, I guess.
Yeah, it looks like that.
Updated by Dhairya Parmar 5 months ago
From yesterday's conversation with Venky during standup: fixing this on the client side.
Updated by Dhairya Parmar 5 months ago
- Project changed from RADOS to CephFS
- Subject changed from osdc: cannot perform zero byte async write to client: handle zero byte sync/async write cases
- Category changed from Correctness/Safety to Correctness/Safety
- Assignee set to Dhairya Parmar
- Target version set to v19.0.0
- Backport set to quincy,reef
- Component(FS) Client added
Updated by Dhairya Parmar 5 months ago
- Related to Bug #63697: client: zero byte sync write fails added
Updated by Dhairya Parmar 5 months ago
- Related to Bug #63734: client: handle callback when async io fails added