Bug #63896

client: contiguous read fails for non-contiguous write (in async I/O api)

Added by Dhairya Parmar 4 months ago. Updated 3 months ago.

Status: In Progress
Priority: Normal
Category: Correctness/Safety
Target version: v19.0.0
% Done: 0%
Source:
Tags:
Backport: reef
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): Client
Labels (FS):
Pull request ID: 55144
Crash signature (v1):
Crash signature (v2):

Description

When waiting for onfinish, with this test case (trivial before and after parts removed):

  for(int i = 0; i < NUM_BUF; ++i) {
    writefinish.reset(new C_SaferCond("test-nonblocking-writefinish-non-contiguous"));
    rc = client->ll_preadv_pwritev(fh, current_iov++, 1, i * NUM_BUF * 10,
                                   true, writefinish.get(), nullptr);
    ASSERT_EQ(rc, 0);
    total_bytes_written += writefinish->wait();
  }
  ASSERT_EQ(total_bytes_written, bytes_to_write);

  readfinish.reset(new C_SaferCond("test-nonblocking-readfinish-contiguous"));
  rc = client->ll_preadv_pwritev(fh, iov_in_contiguous, NUM_BUF, 0, false,
                                 readfinish.get(), &bl);
  ASSERT_EQ(rc, 0);
  total_bytes_read = readfinish->wait();
  // should be less since the data written is non-contiguous but the read
  // was contiguous
  ASSERT_LE(total_bytes_read, bytes_to_write);

the execution stalls; see 1_execution_stalled_contiguous_read (attached below).

But if the code that waits for the context to complete is removed, i.e.:

  for(int i = 0; i < NUM_BUF; ++i) {
    writefinish.reset(new C_SaferCond("test-nonblocking-writefinish-non-contiguous"));
    rc = client->ll_preadv_pwritev(fh, current_iov++, 1, i * NUM_BUF * 10,
                                   true, writefinish.get(), nullptr);
    ASSERT_EQ(rc, 0);
    total_bytes_written += writefinish->wait();
  }
  ASSERT_EQ(total_bytes_written, bytes_to_write);

  readfinish.reset(new C_SaferCond("test-nonblocking-readfinish-contiguous"));
  rc = client->ll_preadv_pwritev(fh, iov_in_contiguous, NUM_BUF, 0, false,
                                 readfinish.get(), &bl);
  ASSERT_EQ(rc, 0);

the execution aborts with a core dump:

2023-12-28T19:53:08.780+0530 7f0f47a4a9c0 20 client.4423 awaiting reply|forward|kick on 0x7fff7814eaa0
terminate called after throwing an instance of 'std::system_error'
  what():  Invalid argument
*** Caught signal (Aborted) **

See 2_core_dumped (attached below) for more info.

Ideally, these reads should have just returned fewer bytes than the total bytes written, since the data between the offsets must be 0, but this case is not handled.


Files

1_execution_stalled_contiguous_read (102 KB): log for the case where the contiguous read stalls (for a non-contiguous write) waiting on context completion. Dhairya Parmar, 12/28/2023 02:17 PM
2_core_dumped (268 KB): log for the case where we do not wait for context completion when performing a contiguous read after a non-contiguous write. Dhairya Parmar, 12/28/2023 02:25 PM
Actions #1

Updated by Dhairya Parmar 4 months ago

  • Description updated (diff)
Actions #2

Updated by Dhairya Parmar 4 months ago

  • Subject changed from client: contiguous read for non-contiguous write to client: contiguous read fails for non-contiguous write
Actions #3

Updated by Dhairya Parmar 4 months ago

From my findings, the read operation needs to follow the same layout as the write operation, i.e. the read must take place at the locations where the data was written, otherwise the read call just stalls. This contradicts my expectation: return whatever data is read (which in this case is less than what was written).

So my question is: is the existing behaviour (of not allowing a contiguous read for a non-contiguous write) anticipated, or is this actually a bug and the behaviour should be what I described above?

Actions #4

Updated by Dhairya Parmar 4 months ago

This fails with a backtrace when done synchronously:

    -1> 2024-01-02T17:15:56.824+0530 7f0c392299c0 -1 /home/dparmar/CephRepoForRunningTestsLocally/ceph/src/common/mutex_debug.h: In function 'void ceph::mutex_debug_detail::mutex_debug_impl<<anonymous> >::_pre_unlock() [with bool Recursive = false]' thread 7f0c392299c0 time 2024-01-02T17:15:56.822549+0530
/home/dparmar/CephRepoForRunningTestsLocally/ceph/src/common/mutex_debug.h: 163: FAILED ceph_assert(nlock == 1)

 ceph version 18.0.0-7275-g8bc4cf1939c (8bc4cf1939ce76918542dbb5433652de295e5916) reef (dev)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x125) [0x7f0c39c4b726]
 2: (ceph::register_assert_context(ceph::common::CephContext*)+0) [0x7f0c39c4b948]
 3: (ceph::mutex_debug_detail::mutex_debug_impl<false>::_pre_unlock()+0x32) [0x55ab11a6448a]
 4: (ceph::mutex_debug_detail::mutex_debug_impl<false>::unlock(bool)+0x10) [0x55ab11a644d2]
 5: (Client::ll_preadv_pwritev(Fh*, iovec const*, int, long, bool, Context*, ceph::buffer::v15_2_0::list*, bool, bool)+0x112) [0x55ab11aced2c]
 6: (TestClient_LlreadvLlwritevNonContiguous_Test::TestBody()+0x677) [0x55ab11a6105b]
 7: (void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*)+0x1b) [0x55ab11b5b3fb]
 8: (void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*)+0x80) [0x55ab11b6393c]
 9: (testing::Test::Run()+0xb4) [0x55ab11b548aa]
 10: (testing::TestInfo::Run()+0x104) [0x55ab11b549b0]
 11: (testing::TestSuite::Run()+0xb2) [0x55ab11b54a64]
 12: (testing::internal::UnitTestImpl::RunAllTests()+0x36b) [0x55ab11b5611b]
 13: (bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*)+0x1b) [0x55ab11b5b6a7]
 14: (bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*)+0x80) [0x55ab11b63e99]
 15: (testing::UnitTest::Run()+0x63) [0x55ab11b54b7f]
 16: (RUN_ALL_TESTS()+0x11) [0x55ab11a5f80b]
 17: main()
 18: /lib64/libc.so.6(+0x27b8a) [0x7f0c38849b8a]
 19: __libc_start_main()
 20: _start()

     0> 2024-01-02T17:15:56.826+0530 7f0c392299c0 -1 *** Caught signal (Aborted) **
 in thread 7f0c392299c0 thread_name:ceph_test_clien
Actions #5

Updated by Dhairya Parmar 4 months ago

  • Subject changed from client: contiguous read fails for non-contiguous write to client: contiguous read fails for non-contiguous write (in async I/O api)
Actions #6

Updated by Dhairya Parmar 4 months ago

Update: it will not return fewer bytes but the same number of bytes, since RADOS fills up the holes with 0s; the question still remains the same, i.e. the stalling of the read call.
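
In test terms (a minimal sketch reusing the fixture from the description above; bytes_requested is a hypothetical variable holding the total length covered by iov_in_contiguous, everything else is from the existing test code), the contiguous read would then be expected to complete with the full requested length, holes included:

  readfinish.reset(new C_SaferCond("test-nonblocking-readfinish-contiguous"));
  rc = client->ll_preadv_pwritev(fh, iov_in_contiguous, NUM_BUF, 0, false,
                                 readfinish.get(), &bl);
  ASSERT_EQ(rc, 0);
  total_bytes_read = readfinish->wait();
  // RADOS zero-fills the holes, so the full requested length comes back
  ASSERT_EQ(total_bytes_read, bytes_requested);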

Actions #7

Updated by Dhairya Parmar 4 months ago

So it seems like the caps ref count is still non-zero after the test case teardown:

2024-01-08T17:26:42.594+0530 7f746a27d9c0 10 client.4342 _put_inode on 0x10000000003.head(faked_ino=0 nref=19 ll_ref=0 cap_refs={4=0,1024=1,2048=0,4096=0,8192=0} open={3=0} mode=100666 size=2016/4194304 nlink=0 btime=2024-01-08T17:26:42.522136+0530 mtime=2024-01-08T17:26:42.568853+0530 ctime=2024-01-08T17:26:42.569355+0530 change_attr=5 caps=pAsLsXsFsxcrwb(0=pAsLsXsFsxcrwb) objectset[0x10000000003 ts 0/0 objects 1 dirty_or_tx 0] 0x7f743800ce00) n = 17
2024-01-08T17:26:42.594+0530 7f746a27d9c0  2 client.4342 cache still has 0+1 items, waiting (for caps to release?)

The interesting part is "cap_refs={4=0,1024=1,2048=0,4096=0,8192=0}": 1024 is the file cache cap, which apparently is never put, and that has stalled the read call.
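
For reference, a small self-contained sketch of how that 1024 decomposes into the file cache (Fc) cap; this follows my reading of the cap encoding in src/include/ceph_fs.h, and the constants below are local stand-ins rather than the real macro names:

  #include <cstdint>

  // Generic cap bits get shifted into a per-slot position; the file slot
  // shift is 8 and the generic "cache" bit is 4, so the file cache cap (Fc)
  // lands on bit value 1024 -- the key seen in cap_refs above.
  constexpr uint32_t generic_cache_bit = 4;   // stand-in for CEPH_CAP_GCACHE
  constexpr uint32_t file_slot_shift   = 8;   // stand-in for CEPH_CAP_SFILE
  constexpr uint32_t file_cache_cap    = generic_cache_bit << file_slot_shift;
  static_assert(file_cache_cap == 1024, "cap_refs key 1024 is the Fc cap");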

Actions #8

Updated by Venky Shankar 4 months ago

Dhairya Parmar wrote:

So it seems like the caps ref count is still non-zero after the test case teardown:

[...]

The interesting part is "cap_refs={4=0,1024=1,2048=0,4096=0,8192=0}": 1024 is the file cache cap, which apparently is never put, and that has stalled the read call.

So, this is the bug we spoke about after stand-up a couple of days ago. This would require putting a cap ref when the async call finishes.

Also, just FYI, the read call returning zeros (0s) isn't an issue since that's the expected behaviour, correct?

Actions #9

Updated by Venky Shankar 4 months ago

  • Category set to Correctness/Safety
  • Status changed from New to Triaged
  • Assignee set to Dhairya Parmar
  • Target version set to v19.0.0
  • Backport set to reef
  • Component(FS) Client added
Actions #10

Updated by Dhairya Parmar 4 months ago

Venky Shankar wrote:

Dhairya Parmar wrote:

So it seems like the caps ref count is still non-zero after the test case teardown:

[...]

The interesting part is "cap_refs={4=0,1024=1,2048=0,4096=0,8192=0}": 1024 is the file cache cap, which apparently is never put, and that has stalled the read call.

So, this is the bug we spoke about after stand-up a couple of days ago. This would require putting a cap ref when the async call finishes.

Yes; since the managed pointer is of type C_Read_Async_Finisher, the file cache cap needs to be put in C_Read_Async_Finisher::finish before finishing the read call.
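
A rough sketch of that idea (not the actual patch; the members clnt, in, and onfinish are assumptions about what Client::C_Read_Async_Finisher holds, and locking around the call is deliberately elided):

  void Client::C_Read_Async_Finisher::finish(int r)
  {
    // Drop the Fc (file cache) cap reference taken when the async read was
    // issued, mirroring what the synchronous read path does after its wait.
    // Without this, cap_refs keeps 1024=1 and teardown waits forever.
    clnt->put_cap_ref(in, CEPH_CAP_FILE_CACHE);
    // Then hand the result to the caller's Context as before.
    onfinish->complete(r);
  }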

Also, just FYI, the read call returning zeros (0s) isn't an issue since that's the expected behaviour, correct?

Yup, it's RADOS that memsets all the holes with 0s, so the read call (sync or async) will always return the number of bytes you requested. Long story short: the contents will not be the same (compared with the buffers that were written) but the number of bytes will be.

Actions #11

Updated by Dhairya Parmar 4 months ago

RCA:

When doing a read like this, where ObjectCacher::file_read cannot find the amount of data requested at the given offset, it returns zero. For sync calls the code is straightforward: the inode gets a file cache cap ref, the client waits for the I/O to finish, puts the cap ref, and then does readahead. For the async call, however, the code only takes a file cache cap ref and leaves the rest of the process to the async call finisher (i.e. Client::C_Read_Async_Finisher). The issue is that this code path never decrements the ref count, so we have a cap leak, which ultimately prevents the I/O call from returning successfully: all acquired cap refs must be released, otherwise we keep waiting indefinitely for the caps to be released (a toy model of this asymmetry follows the log excerpt below):

2024-01-08T17:26:42.594+0530 7f746a27d9c0 10 client.4342 _put_inode on 0x10000000003.head(faked_ino=0 nref=19 ll_ref=0 cap_refs={4=0,1024=1,2048=0,4096=0,8192=0} open={3=0} mode=100666 size=2016/4194304 nlink=0 btime=2024-01-08T17:26:42.522136+0530 mtime=2024-01-08T17:26:42.568853+0530 ctime=2024-01-08T17:26:42.569355+0530 change_attr=5 caps=pAsLsXsFsxcrwb(0=pAsLsXsFsxcrwb) objectset[0x10000000003 ts 0/0 objects 1 dirty_or_tx 0] 0x7f743800ce00) n = 17
2024-01-08T17:26:42.594+0530 7f746a27d9c0  2 client.4342 cache still has 0+1 items, waiting (for caps to release?)
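
To make the asymmetry concrete, here is a self-contained toy model of the control flow described above; none of it is real Ceph code, it only mirrors the take-ref/drop-ref pattern of the two paths:

  #include <cassert>
  #include <functional>
  #include <iostream>

  // fc_refs stands in for cap_refs[Fc] on the inode.
  struct ToyInode { int fc_refs = 0; };

  static void get_cap_ref(ToyInode &in) { ++in.fc_refs; }
  static void put_cap_ref(ToyInode &in) { --in.fc_refs; }

  // Synchronous path: take the ref, wait for the read, drop the ref.
  static void sync_read(ToyInode &in) {
    get_cap_ref(in);
    /* ... wait for ObjectCacher::file_read to complete ... */
    put_cap_ref(in);                       // balanced
  }

  // Async path before the fix: the finisher completes the caller's Context,
  // but nobody drops the ref, so Fc can never be released at teardown.
  static void async_read_buggy(ToyInode &in, std::function<void(int)> onfinish) {
    get_cap_ref(in);
    /* ... ObjectCacher::file_read(..., C_Read_Async_Finisher) ... */
    onfinish(0);                           // the read itself completes fine
    // missing: put_cap_ref(in);
  }

  int main() {
    ToyInode in;
    sync_read(in);
    assert(in.fc_refs == 0);               // sync path is balanced
    async_read_buggy(in, [](int){});
    std::cout << "leaked Fc refs: " << in.fc_refs << "\n";  // prints 1
  }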
Actions #12

Updated by Dhairya Parmar 4 months ago

  • Pull request ID set to 55144
Actions #13

Updated by Dhairya Parmar 4 months ago

  • Regression changed from No to Yes
Actions #14

Updated by Dhairya Parmar 4 months ago

  • Regression changed from Yes to No
Actions #15

Updated by Dhairya Parmar 3 months ago

  • Status changed from Triaged to In Progress