Bug #40765
Status: Closed
mimic: "Command failed (workunit test rados/test.sh)" in smoke/master/mimic
Description
Run: http://pulpito.ceph.com/teuthology-2019-07-12_06:00:03-smoke-mimic-testing-basic-smithi/
Jobs: '4113997', '4113998'
Logs: http://qa-proxy.ceph.com/teuthology/teuthology-2019-07-12_06:00:03-smoke-mimic-testing-basic-smithi/4113997/teuthology.log
2019-07-12T09:58:39.832 INFO:tasks.mon_thrash.mon_thrasher:waiting for 90.0 secs before reviving monitors
2019-07-12T09:59:02.440 INFO:teuthology.orchestra.run.smithi019:Running:
2019-07-12T09:59:02.441 INFO:teuthology.orchestra.run.smithi019:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2019-07-12T09:59:02.445 INFO:teuthology.orchestra.run.smithi041:Running:
2019-07-12T09:59:02.445 INFO:teuthology.orchestra.run.smithi041:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2019-07-12T09:59:02.450 INFO:teuthology.orchestra.run.smithi080:Running:
2019-07-12T09:59:02.450 INFO:teuthology.orchestra.run.smithi080:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2019-07-12T09:59:30.049 INFO:tasks.workunit.client.0.smithi019.stderr:++ cleanup
2019-07-12T09:59:30.049 INFO:tasks.workunit.client.0.smithi019.stderr:++ pkill -P 9446
2019-07-12T09:59:30.083 INFO:tasks.workunit.client.0.smithi019.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/rados/test.sh: line 10: 9459 Terminated bash -o pipefail -exc "ceph_test_rados_$f $color 2>&1 | tee ceph_test_rados_$ff.log | sed \"s/^/$r: /\""
2019-07-12T09:59:30.083 INFO:tasks.workunit.client.0.smithi019.stderr:++ true
2019-07-12T09:59:30.085 DEBUG:teuthology.orchestra.run:got remote process result: 124
2019-07-12T09:59:30.085 INFO:tasks.workunit:Stopping ['rados/test.sh'] on client.0...
2019-07-12T09:59:30.085 INFO:teuthology.orchestra.run.smithi019:Running:
2019-07-12T09:59:30.086 INFO:teuthology.orchestra.run.smithi019:> sudo rm -rf -- /home/ubuntu/cephtest/workunits.list.client.0 /home/ubuntu/cephtest/clone.client.0
2019-07-12T09:59:30.506 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/run_tasks.py", line 86, in run_tasks
    manager = run_one_task(taskname, ctx=ctx, config=config)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/run_tasks.py", line 65, in run_one_task
    return task(**kwargs)
  File "/home/teuthworker/src/git.ceph.com_ceph_mimic/qa/tasks/workunit.py", line 123, in task
    timeout=timeout,cleanup=cleanup)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 85, in __exit__
    for result in self:
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 99, in next
    resurrect_traceback(result)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 22, in capture_traceback
    return func(*args, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_ceph_mimic/qa/tasks/workunit.py", line 409, in _run_tests
    label="workunit test {workunit}".format(workunit=workunit)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/remote.py", line 205, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 437, in run
    r.wait()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 162, in wait
    self._raise_for_status()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 184, in _raise_for_status
    node=self.hostname, label=self.label
CommandFailedError: Command failed (workunit test rados/test.sh) on smithi019 with status 124: 'mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=6ac795a266b802d40086d0a664964097794d516d TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.0 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.0 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/clone.client.0/qa/workunits/rados/test.sh'
2019-07-12T09:59:30.517 DEBUG:teuthology.run_tasks:Unwinding manager mon_thrash
Files
Updated by Yuri Weinstein almost 5 years ago
- Subject changed from "Command failed (workunit test rados/test.sh)" in smoke/master to "Command failed (workunit test rados/test.sh)" in smoke/master/mimic
- Priority changed from Normal to High
- Affected Versions v13.2.7 added
Seems on mimic as well
http://pulpito.ceph.com/teuthology-2019-07-13_06:00:03-smoke-mimic-testing-basic-smithi/
2019-07-13T10:30:06.616 INFO:teuthology.orchestra.run.smithi194:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2019-07-13T10:30:32.086 INFO:tasks.workunit.client.0.smithi158.stderr:++ cleanup
2019-07-13T10:30:32.086 INFO:tasks.workunit.client.0.smithi158.stderr:++ pkill -P 8938
2019-07-13T10:30:32.106 INFO:tasks.workunit.client.0.smithi158.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/rados/test.sh: line 10: 8951 Terminated bash -o pipefail -exc "ceph_test_rados_$f $color 2>&1 | tee ceph_test_rados_$ff.log | sed \"s/^/$r: /\""
2019-07-13T10:30:32.106 INFO:tasks.workunit.client.0.smithi158.stderr:++ true
2019-07-13T10:30:32.107 DEBUG:teuthology.orchestra.run:got remote process result: 124
2019-07-13T10:30:32.108 INFO:tasks.workunit:Stopping ['rados/test.sh'] on client.0...
2019-07-13T10:30:32.108 INFO:teuthology.orchestra.run.smithi158:Running:
2019-07-13T10:30:32.108 INFO:teuthology.orchestra.run.smithi158:> sudo rm -rf -- /home/ubuntu/cephtest/workunits.list.client.0 /home/ubuntu/cephtest/clone.client.0
2019-07-13T10:30:32.513 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/run_tasks.py", line 86, in run_tasks
    manager = run_one_task(taskname, ctx=ctx, config=config)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/run_tasks.py", line 65, in run_one_task
    return task(**kwargs)
  File "/home/teuthworker/src/git.ceph.com_ceph_mimic/qa/tasks/workunit.py", line 123, in task
    timeout=timeout,cleanup=cleanup)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 85, in __exit__
    for result in self:
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 99, in next
    resurrect_traceback(result)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 22, in capture_traceback
    return func(*args, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_ceph_mimic/qa/tasks/workunit.py", line 409, in _run_tests
    label="workunit test {workunit}".format(workunit=workunit)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/remote.py", line 205, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 437, in run
    r.wait()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 162, in wait
    self._raise_for_status()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 184, in _raise_for_status
    node=self.hostname, label=self.label
CommandFailedError: Command failed (workunit test rados/test.sh) on smithi158 with status 124: 'mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=6ac795a266b802d40086d0a664964097794d516d TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.0 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.0 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/clone.client.0/qa/workunits/rados/test.sh'
Updated by Sage Weil almost 5 years ago
- Related to Bug #38238: rados/test.sh: api_aio_pp doesn't seem to start added
Updated by Sage Weil almost 5 years ago
2019-07-12T06:59:36.562 INFO:tasks.workunit.client.0.smithi019.stdout: api_io: [ RUN      ] LibRadosIoPP.TruncTestPP
2019-07-12T06:59:36.563 INFO:tasks.workunit.client.0.smithi019.stdout: api_io: [       OK ] LibRadosIoPP.TruncTestPP (17 ms)
2019-07-12T06:59:36.563 INFO:tasks.workunit.client.0.smithi019.stdout: api_io: [ RUN      ] LibRadosIoPP.RemoveTestPP
2019-07-12T06:59:37.397 INFO:tasks.workunit.client.0.smithi019.stdout: api_io: [       OK ] LibRadosIoPP.RemoveTestPP (7
delete_pools_parallel: process_1_[9896]: starting.
This appears to be the problematic child process.
Updated by Brad Hubbard over 4 years ago
- Has duplicate Bug #40926: "Command failed (workunit test rados/test.sh)" in rados added
Updated by Brad Hubbard over 4 years ago
Still working on this but steps to reproduce...
$ cat << EOF >> orig.config.yaml
interactive-on-error: true
$(tail -n+2 /a/teuthology-2019-07-24_06:00:03-smoke-mimic-testing-basic-smithi/4144267/orig.config.yaml)
EOF
$ rm -rf ~/working/archive/* ; teuthology -v -a ~/working/archive/ --lock orig.config.yaml
Make sure ~/working/archive/ exists.
NOTE: You might need to change machine-type to "mira" if there are no smithis available (that's what I did so maybe just change it anyway?).
Try to take note of the machines that are being used. I just scanned the output for them, but there are probably better ways. Let it get to the point where it's running the tests, then log onto each machine (or all at once) and run the following.
$ ps auwwx|grep ceph_test_rados_api_io
Eventually you'll find the process we want to investigate, and it will hang around for a long time, so you don't have to hurry much. You can also kill the teuthology task that started it; I think that will stop teuthology from killing the test process.
I grabbed a gcore and I'm currently looking at that.
Updated by Brad Hubbard over 4 years ago
Looking at a live process, it doesn't seem to be a deadlock or "hang" but more like some sort of livelock: the 'msgr-worker-1' thread is chewing CPU, with the 'ms_dispatch' and 'log' threads using a little CPU as well.
top - 09:48:20 up 3:14, 1 user, load average: 0.06, 0.12, 0.11
Threads: 175 total, 1 running, 120 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.8 us, 1.7 sy, 0.0 ni, 95.7 id, 0.0 wa, 0.0 hi, 1.8 si, 0.0 st
KiB Mem : 16402184 total, 14600708 free, 220520 used, 1580956 buff/cache
KiB Swap: 2097148 total, 2097148 free, 0 used. 15842512 avail Mem

  PID USER   PR NI    VIRT   RES   SHR S %CPU %MEM    TIME+ COMMAND
10612 ubuntu 20  0 1186956 26604 16680 S  9.6  0.2 20:17.43 msgr-worker-1
10624 ubuntu 20  0 1186956 26604 16680 S  1.7  0.2  3:40.29 ms_dispatch
10603 ubuntu 20  0 1186956 26604 16680 S  1.0  0.2  2:01.56 log

$ ps axHo stat,pid,tid,pgid,ppid,comm,wchan|grep 9524
Sl 9524  9524 9503 9519 ceph_test_rados -
Sl 9524 10603 9503 9519 log             -
Sl 9524 10611 9503 9519 msgr-worker-0   epoll_wait
Sl 9524 10612 9503 9519 msgr-worker-1   epoll_wait
Sl 9524 10613 9503 9519 msgr-worker-2   epoll_wait
Sl 9524 10621 9503 9519 service         -
Sl 9524 10622 9503 9519 admin_socket    x64_sys_poll
Sl 9524 10623 9503 9519 ceph_test_rados -
Sl 9524 10624 9503 9519 ms_dispatch     -
Sl 9524 10625 9503 9519 ms_local        -
Sl 9524 10626 9503 9519 safe_timer      -
Sl 9524 10627 9503 9519 fn_anonymous    -
Sl 9524 10628 9503 9519 safe_timer      -
Sl 9524 10629 9503 9519 safe_timer      -
Sl 9524 10630 9503 9519 fn-radosclient  -
Updated by Brad Hubbard over 4 years ago
I dumped out the log (500 lines, attached); it's almost exclusively this sequence, so we're not communicating with osd.3.
2019-07-26 16:52:59.098856 thread: 0x7f5a9fa29700 priority: 0 subsystem: 27 bad crc in data 1465487574 != exp 1041109653 from 172.21.4.108:6804/10042
2019-07-26 16:52:59.098856 thread: 0x7f5a9e226700 priority: 1 subsystem: 14 client.4580.objecter ms_handle_reset 0x7f5a8c01f0f0 session 0x55855d9cc5f0 osd.3
2019-07-26 16:52:59.106857 thread: 0x7f5a9fa29700 priority: 0 subsystem: 27 bad crc in data 1465487574 != exp 1041109653 from 172.21.4.108:6804/10042
2019-07-26 16:52:59.106857 thread: 0x7f5a9e226700 priority: 1 subsystem: 14 client.4580.objecter ms_handle_reset 0x7f5a8c01b7a0 session 0x55855d9cc5f0 osd.3
2019-07-26 16:52:59.110857 thread: 0x7f5a9fa29700 priority: 0 subsystem: 27 bad crc in data 1465487574 != exp 1041109653 from 172.21.4.108:6804/10042
2019-07-26 16:52:59.114857 thread: 0x7f5a9e226700 priority: 1 subsystem: 14 client.4580.objecter ms_handle_reset 0x7f5a8c01f0f0 session 0x55855d9cc5f0 osd.3
2019-07-26 16:52:59.118857 thread: 0x7f5a9fa29700 priority: 0 subsystem: 27 bad crc in data 1465487574 != exp 1041109653 from 172.21.4.108:6804/10042
2019-07-26 16:52:59.118857 thread: 0x7f5a9e226700 priority: 1 subsystem: 14 client.4580.objecter ms_handle_reset 0x7f5a8c01b7a0 session 0x55855d9cc5f0 osd.3
2019-07-26 16:52:59.126857 thread: 0x7f5a9fa29700 priority: 0 subsystem: 27 bad crc in data 1465487574 != exp 1041109653 from 172.21.4.108:6804/10042
2019-07-26 16:52:59.126857 thread: 0x7f5a9e226700 priority: 1 subsystem: 14 client.4580.objecter ms_handle_reset 0x7f5a8c01f0f0 session 0x55855d9cc5f0 osd.3
For the time being you can get the coredump from https://people.redhat.com/~bhubbard/core.9524.gz and osd.3's log at https://people.redhat.com/~bhubbard/ceph-osd.3.log.gz in case someone wants to carry on :)
From the osd log it looks like the client is just retrying a read over and over again and keeps failing.
Thread 10612 and the main thread look like this, so they match up with what the logs are saying, I guess.
Thread 4 (Thread 0x7f5a9fa29700 (LWP 10612)):
#0  0x00007f5aa50dbbb7 in epoll_wait (epfd=7, events=0x55855d9b13a0, maxevents=5000, timeout=30000) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x00007f5aa5fc475e in EpollDriver::event_wait (this=0x55855d9878d0, fired_events=std::vector of length 0, capacity 0, tvp=<optimized out>) at ./src/msg/async/EventEpoll.cc:122
#2  0x00007f5aa5d81f57 in EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*) () at ./src/msg/async/Event.cc:401
#3  0x00007f5aa5d84da6 in NetworkStack::<lambda()>::operator() (__closure=0x55855da9b8b8, __closure=0x55855da9b8b8) at ./src/msg/async/Stack.cc:53
#4  std::_Function_handler<void (), NetworkStack::add_thread(unsigned int)::{lambda()#1}>::_M_invoke(std::_Any_data const&) () at /usr/include/c++/8/bits/std_function.h:297
#5  0x00007f5aa568066f in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007f5aae9736db in start_thread (arg=0x7f5a9fa29700) at pthread_create.c:463
#7  0x00007f5aa50db88f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 1 (Thread 0x7f5aaed93fc0 (LWP 9524)):
#0  0x00007f5aae9799f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7ffd318fb340) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, mutex=0x7ffd318fb378, cond=0x7ffd318fb318) at pthread_cond_wait.c:502
#2  __pthread_cond_wait (cond=cond@entry=0x7ffd318fb318, mutex=mutex@entry=0x7ffd318fb378) at pthread_cond_wait.c:655
#3  0x00007f5aae6b0fd6 in Cond::Wait (this=this@entry=0x7ffd318fb310, mutex=...) at ./src/common/Cond.h:48
#4  0x00007f5aae6a957f in librados::IoCtxImpl::operate_read(object_t const&, ObjectOperation*, ceph::buffer::list*, int) () at ./src/librados/IoCtxImpl.cc:754
#5  0x00007f5aae6692b9 in librados::IoCtx::operate(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, librados::ObjectReadOperation*, ceph::buffer::list*) () at ./src/librados/librados.cc:1521
#6  0x000055855cc20e60 in LibRadosIoPP_CmpExtPP_Test::TestBody() () at /usr/include/c++/8/ext/new_allocator.h:79
#7  0x000055855cc6333a in testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void> (location=0x55855cc8361c "the test body", method=<optimized out>, object=0x55855da02950) at ./src/googletest/googletest/src/gtest.cc:2383
#8  testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void> (object=object@entry=0x55855da02950, method=<optimized out>, location=location@entry=0x55855cc8361c "the test body") at ./src/googletest/googletest/src/gtest.cc:2438
#9  0x000055855cc5935a in testing::Test::Run (this=0x55855da02950) at ./src/googletest/googletest/src/gtest.cc:2474
#10 0x000055855cc594a8 in testing::TestInfo::Run (this=0x55855d95a6b0) at ./src/googletest/googletest/src/gtest.cc:2656
#11 0x000055855cc59585 in testing::TestCase::Run (this=0x55855d935e90) at ./src/googletest/googletest/src/gtest.cc:2774
#12 0x000055855cc59aa4 in testing::internal::UnitTestImpl::RunAllTests() () at /usr/include/c++/8/bits/stl_vector.h:930
#13 0x000055855cc6381a in testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool> (location=0x55855cc84860 "auxiliary test code (environments or event listeners)", method=<optimized out>, object=0x55855d93aad0) at ./src/googletest/googletest/src/gtest.cc:2383
#14 testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool> (object=0x55855d93aad0, method=<optimized out>, location=location@entry=0x55855cc84860 "auxiliary test code (environments or event listeners)") at ./src/googletest/googletest/src/gtest.cc:2438
#15 0x000055855cc59bbd in testing::UnitTest::Run() () at ./src/googletest/googletest/include/gtest/gtest.h:1287
#16 0x000055855cc1aff9 in RUN_ALL_TESTS () at ./src/googletest/googletest/include/gtest/gtest.h:2233
#17 main (argc=<optimized out>, argv=0x7ffd318fb928) at ./src/googletest/googlemock/src/gmock_main.cc:53
Updated by Brad Hubbard over 4 years ago
- File ceph-dump-log.txt.gz ceph-dump-log.txt.gz added
Updated by Brad Hubbard over 4 years ago
Client:
2019-07-30 23:49:44.351 7fe779d6d700 20 -- 127.0.0.1:0/2016699731 >> 127.0.0.1:6805/5820 conn(0x5643de7b1540 :-1 s=STATE_OPEN pgs=4 cs=1 l=1).prepare_send_message encoding features 4611087854031667195 0x5643de78af10 osd_op(client.14126.0:59 2.5 2:b3a108dd:test-rados-api-mira110-6042-31::foo:head [read 0~4] snapc 0=[] ondisk+read+known_if_redirected e21) v8
2019-07-30 23:49:44.351 7fe779d6d700 20 -- 127.0.0.1:0/2016699731 >> 127.0.0.1:6805/5820 conn(0x5643de7b1540 :-1 s=STATE_OPEN pgs=4 cs=1 l=1).write_message signed m=0x5643de78af10): sig = 10196077939576485518
2019-07-30 23:49:44.351 7fe779d6d700 20 -- 127.0.0.1:0/2016699731 >> 127.0.0.1:6805/5820 conn(0x5643de7b1540 :-1 s=STATE_OPEN pgs=4 cs=1 l=1).write_message sending message type=42 src client.14126 front=242 data=0 off 0
2019-07-30 23:49:44.351 7fe779d6d700 20 -- 127.0.0.1:0/2016699731 >> 127.0.0.1:6805/5820 conn(0x5643de7b1540 :-1 s=STATE_OPEN pgs=4 cs=1 l=1).write_message sending 30 0x5643de78af10
2019-07-30 23:49:44.351 7fe779d6d700 10 -- 127.0.0.1:0/2016699731 >> 127.0.0.1:6805/5820 conn(0x5643de7b1540 :-1 s=STATE_OPEN pgs=4 cs=1 l=1)._try_send sent bytes 317 remaining bytes 0
2019-07-30 23:49:44.351 7fe779d6d700 10 -- 127.0.0.1:0/2016699731 >> 127.0.0.1:6805/5820 conn(0x5643de7b1540 :-1 s=STATE_OPEN pgs=4 cs=1 l=1).write_message sending 0x5643de78af10 done.
2019-07-30 23:49:44.351 7fe779d6d700 20 -- 127.0.0.1:0/2016699731 >> 127.0.0.1:6805/5820 conn(0x5643de7b1540 :-1 s=STATE_OPEN pgs=4 cs=1 l=1).process prev state is STATE_OPEN
2019-07-30 23:49:44.352 7fe779d6d700 20 -- 127.0.0.1:0/2016699731 >> 127.0.0.1:6805/5820 conn(0x5643de7b1540 :-1 s=STATE_OPEN_MESSAGE_HEADER pgs=4 cs=1 l=1).process prev state is STATE_OPEN
2019-07-30 23:49:44.352 7fe779d6d700 20 -- 127.0.0.1:0/2016699731 >> 127.0.0.1:6805/5820 conn(0x5643de7b1540 :-1 s=STATE_OPEN_MESSAGE_HEADER pgs=4 cs=1 l=1).process begin MSG
2019-07-30 23:49:44.352 7fe779d6d700 20 -- 127.0.0.1:0/2016699731 >> 127.0.0.1:6805/5820 conn(0x5643de7b1540 :-1 s=STATE_OPEN_MESSAGE_HEADER pgs=4 cs=1 l=1).process got MSG header
2019-07-30 23:49:44.352 7fe779d6d700 20 -- 127.0.0.1:0/2016699731 >> 127.0.0.1:6805/5820 conn(0x5643de7b1540 :-1 s=STATE_OPEN_MESSAGE_HEADER pgs=4 cs=1 l=1).process got envelope type=43 src osd.1 front=147 data=4 off 0
2019-07-30 23:49:44.352 7fe779d6d700 20 -- 127.0.0.1:0/2016699731 >> 127.0.0.1:6805/5820 conn(0x5643de7b1540 :-1 s=STATE_OPEN_MESSAGE_THROTTLE_MESSAGE pgs=4 cs=1 l=1).process prev state is STATE_OPEN_MESSAGE_HEADER
2019-07-30 23:49:44.352 7fe779d6d700 20 -- 127.0.0.1:0/2016699731 >> 127.0.0.1:6805/5820 conn(0x5643de7b1540 :-1 s=STATE_OPEN_MESSAGE_THROTTLE_BYTES pgs=4 cs=1 l=1).process prev state is STATE_OPEN_MESSAGE_THROTTLE_MESSAGE
2019-07-30 23:49:44.352 7fe779d6d700 20 -- 127.0.0.1:0/2016699731 >> 127.0.0.1:6805/5820 conn(0x5643de7b1540 :-1 s=STATE_OPEN_MESSAGE_THROTTLE_DISPATCH_QUEUE pgs=4 cs=1 l=1).process prev state is STATE_OPEN_MESSAGE_THROTTLE_BYTES
2019-07-30 23:49:44.352 7fe779d6d700 20 -- 127.0.0.1:0/2016699731 >> 127.0.0.1:6805/5820 conn(0x5643de7b1540 :-1 s=STATE_OPEN_MESSAGE_READ_FRONT pgs=4 cs=1 l=1).process prev state is STATE_OPEN_MESSAGE_THROTTLE_DISPATCH_QUEUE
2019-07-30 23:49:44.352 7fe779d6d700 20 -- 127.0.0.1:0/2016699731 >> 127.0.0.1:6805/5820 conn(0x5643de7b1540 :-1 s=STATE_OPEN_MESSAGE_READ_FRONT pgs=4 cs=1 l=1).process got front 147
2019-07-30 23:49:44.352 7fe779d6d700 10 -- 127.0.0.1:0/2016699731 >> 127.0.0.1:6805/5820 conn(0x5643de7b1540 :-1 s=STATE_OPEN_MESSAGE_READ_DATA_PREPARE pgs=4 cs=1 l=1).process seleting rx buffer v 2 at offset 0 len 4
2019-07-30 23:49:44.352 7fe779d6d700 10 -- 127.0.0.1:0/2016699731 >> 127.0.0.1:6805/5820 conn(0x5643de7b1540 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=4 cs=1 l=1).process aborted = 0
2019-07-30 23:49:44.352 7fe779d6d700 20 -- 127.0.0.1:0/2016699731 >> 127.0.0.1:6805/5820 conn(0x5643de7b1540 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=4 cs=1 l=1).process got 147 + 0 + 4 byte message
2019-07-30 23:49:44.352 7fe779d6d700  0 bad crc in data 1465487574 != exp 1041109653 from 127.0.0.1:6805/5820
2019-07-30 23:49:44.352 7fe779d6d700 20 00000000 43 45 50 48 |CEPH| 00000004
2019-07-30 23:49:44.352 7fe779d6d700  1 -- 127.0.0.1:0/2016699731 >> 127.0.0.1:6805/5820 conn(0x5643de7b1540 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=4 cs=1 l=1).process decode message failed
OSD:
2019-07-30 23:49:44.351 7f4e49124700  1 -- 127.0.0.1:6805/5820 --> 127.0.0.1:0/2016699731 -- osd_op_reply(59 foo [read 0~4] v0'0 uv3 ondisk = 0) v8 -- 0x55927bd2d080 con 0
2019-07-30 23:49:44.351 7f4e49124700 15 -- 127.0.0.1:6805/5820 >> 127.0.0.1:0/2016699731 conn(0x55927ba8de00 :6805 s=STATE_OPEN pgs=5 cs=1 l=1).send_message inline write is denied, reschedule m=0x55927bd2d080
2019-07-30 23:49:44.351 7f4e49124700 20 Event(0x55927b4b3c80 nevent=5000 time_id=7).wakeup
2019-07-30 23:49:44.351 7f4e49124700 10 osd.1 21 dequeue_op 0x55927be1f680 finish
2019-07-30 23:49:44.351 7f4e49124700 20 osd.1 op_wq(0) _process empty q, waiting
2019-07-30 23:49:44.351 7f4e65bd1700 10 -- 127.0.0.1:6805/5820 >> 127.0.0.1:0/2016699731 conn(0x55927ba8de00 :6805 s=STATE_OPEN pgs=5 cs=1 l=1).handle_write
2019-07-30 23:49:44.351 7f4e65bd1700 20 -- 127.0.0.1:6805/5820 >> 127.0.0.1:0/2016699731 conn(0x55927ba8de00 :6805 s=STATE_OPEN pgs=5 cs=1 l=1).prepare_send_message m osd_op_reply(59 foo [read 0~4] v0'0 uv3 ondisk = 0) v8
2019-07-30 23:49:44.351 7f4e65bd1700 20 -- 127.0.0.1:6805/5820 >> 127.0.0.1:0/2016699731 conn(0x55927ba8de00 :6805 s=STATE_OPEN pgs=5 cs=1 l=1).prepare_send_message encoding features 4611087854031667195 0x55927bd2d080 osd_op_reply(59 foo [read 0~4] v0'0 uv3 ondisk = 0) v8
2019-07-30 23:49:44.351 7f4e65bd1700 20 -- 127.0.0.1:6805/5820 >> 127.0.0.1:0/2016699731 conn(0x55927ba8de00 :6805 s=STATE_OPEN pgs=5 cs=1 l=1).write_message signed m=0x55927bd2d080): sig = 2555750077203053029
2019-07-30 23:49:44.351 7f4e65bd1700 20 -- 127.0.0.1:6805/5820 >> 127.0.0.1:0/2016699731 conn(0x55927ba8de00 :6805 s=STATE_OPEN pgs=5 cs=1 l=1).write_message sending message type=43 src osd.1 front=147 data=4 off 0
2019-07-30 23:49:44.351 7f4e65bd1700 20 -- 127.0.0.1:6805/5820 >> 127.0.0.1:0/2016699731 conn(0x55927ba8de00 :6805 s=STATE_OPEN pgs=5 cs=1 l=1).write_message sending 30 0x55927bd2d080
2019-07-30 23:49:44.351 7f4e65bd1700 10 -- 127.0.0.1:6805/5820 >> 127.0.0.1:0/2016699731 conn(0x55927ba8de00 :6805 s=STATE_OPEN pgs=5 cs=1 l=1)._try_send sent bytes 226 remaining bytes 0
2019-07-30 23:49:44.351 7f4e65bd1700 10 -- 127.0.0.1:6805/5820 >> 127.0.0.1:0/2016699731 conn(0x55927ba8de00 :6805 s=STATE_OPEN pgs=5 cs=1 l=1).write_message sending 0x55927bd2d080 done.
2019-07-30 23:49:44.352 7f4e65bd1700 20 -- 127.0.0.1:6805/5820 >> 127.0.0.1:0/2016699731 conn(0x55927ba8de00 :6805 s=STATE_OPEN pgs=5 cs=1 l=1).process prev state is STATE_OPEN
2019-07-30 23:49:44.352 7f4e65bd1700  1 -- 127.0.0.1:6805/5820 >> 127.0.0.1:0/2016699731 conn(0x55927ba8de00 :6805 s=STATE_OPEN pgs=5 cs=1 l=1).read_bulk peer close file descriptor 53
2019-07-30 23:49:44.352 7f4e65bd1700  1 -- 127.0.0.1:6805/5820 >> 127.0.0.1:0/2016699731 conn(0x55927ba8de00 :6805 s=STATE_OPEN pgs=5 cs=1 l=1).read_until read failed
2019-07-30 23:49:44.352 7f4e65bd1700  1 -- 127.0.0.1:6805/5820 >> 127.0.0.1:0/2016699731 conn(0x55927ba8de00 :6805 s=STATE_OPEN pgs=5 cs=1 l=1).process read tag failed
2019-07-30 23:49:44.352 7f4e65bd1700  1 -- 127.0.0.1:6805/5820 >> 127.0.0.1:0/2016699731 conn(0x55927ba8de00 :6805 s=STATE_OPEN pgs=5 cs=1 l=1).fault on lossy channel, failing
2019-07-30 23:49:44.352 7f4e65bd1700  2 -- 127.0.0.1:6805/5820 >> 127.0.0.1:0/2016699731 conn(0x55927ba8de00 :6805 s=STATE_OPEN pgs=5 cs=1 l=1)._stop
2019-07-30 23:49:44.352 7f4e65bd1700 10 -- 127.0.0.1:6805/5820 >> 127.0.0.1:0/2016699731 conn(0x55927ba8de00 :6805 s=STATE_OPEN pgs=5 cs=1 l=1).discard_out_queue started
Updated by Brad Hubbard over 4 years ago
Ok, slowly closing in on this I think.
bad crc in data 1465487574 != exp 1041109653 from 172.21.4.108:6804/10042
That's our error, seen during the execution (specifically the read) of the "LibRadosIoPP.CmpExtPP" test, which looks like this.
TEST_F(LibRadosIoPP, CmpExtPP) {
  bufferlist bl;
  bl.append("ceph");
  ObjectWriteOperation write1;
  write1.write(0, bl);
  ASSERT_EQ(0, ioctx.operate("foo", &write1));
  bufferlist new_bl;
  new_bl.append("CEPH");
  ObjectWriteOperation write2;
  write2.cmpext(0, bl, nullptr);
  write2.write(0, new_bl);
  ASSERT_EQ(0, ioctx.operate("foo", &write2));
  ObjectReadOperation read;
  read.read(0, bl.length(), NULL, NULL);
  ASSERT_EQ(0, ioctx.operate("foo", &read, &bl));
  ASSERT_EQ(0, memcmp(bl.c_str(), "CEPH", 4));
}
Now if we build and execute the following code.
#include "include/rados/librados.h"
#include "include/rados/librados.hpp"
#include <iostream>

using namespace std;
using namespace librados;

int main() {
  bufferlist bl1, bl2;
  bl1.append("ceph");
  bl2.append("CEPH");
  cout << bl1.c_str() << " = " << bl1.crc32c(0) << std::endl;
  cout << bl2.c_str() << " = " << bl2.crc32c(0) << std::endl;
}
We get this output.
ceph = 1465487574
CEPH = 1041109653
So it looks like we expected the checksum for "CEPH" but got the checksum for "ceph". So what did we actually read?
2019-07-30 23:49:44.352 7fe779d6d700 0 bad crc in data 1465487574 != exp 1041109653 from 127.0.0.1:6805/5820 2019-07-30 23:49:44.352 7fe779d6d700 20 00000000 43 45 50 48 |CEPH|
We read "CEPH" so the next question is why would footer.data_crc contain the checksum for "ceph"? Do we have some cache for the checksum that we haven't invalidated when we do the second write?
Updated by Brad Hubbard over 4 years ago
Simplified test case.
#include "include/rados/librados.h"
#include "include/rados/librados.hpp"
#include <iostream>

using namespace std;
using namespace librados;

int main() {
  Rados rados;
  rados.init("admin");
  rados.conf_parse_env(NULL);
  rados.conf_read_file(NULL);
  rados.connect();
  rados.pool_create("test-pool");
  IoCtx ioctx;
  rados.ioctx_create("test-pool", ioctx);
  bufferlist bl;
  bl.append("ceph");
  ObjectWriteOperation write1;
  write1.write(0, bl);
  ioctx.operate("foo", &write1);
  bufferlist new_bl;
  new_bl.append("CEPH");
  ObjectWriteOperation write2;
  write2.write(0, new_bl);
  ioctx.operate("foo", &write2);
  ObjectReadOperation read;
  read.read(0, bl.length(), NULL, NULL);
  ioctx.operate("foo", &read, &bl);
  cout << bl.c_str() << endl;
}
Updated by Yuri Weinstein over 4 years ago
Updated by Brad Hubbard over 4 years ago
I was discussing this with Josh Durgin and he noticed the reuse of the bufferlist for the read; he mentioned we'd had problems with buffer reuse in the past. Sure enough, testing with a new buffer for the read works, as does adding "bl.invalidate_crc();" just before the read. I believe the problem is related to the crc caching code in buffer::list::crc32c, but I'm still looking into the specifics. Of course, a simple fix for the test would be to not reuse the write buffer for the read, but that would be cheating, right? ;)
Updated by Brad Hubbard over 4 years ago
Correcting my statement in comment #12.
313   __u32 data_crc = data.crc32c(0);
314   if (data_crc != footer.data_crc) {
315     if (cct) {
316       ldout(cct, 0) << "bad crc in data " << data_crc << " != exp " << footer.data_crc << dendl;
We read "CEPH" and footer.data_crc contains 1041109653, which is the correct crc for "CEPH" but ceph::buffer::list::crc32c (called on line 313) returns a cached crc for "ceph" (1465487574). This is because we don't invalidate the crc when we re-use the buffer in AsyncConnection::process(). I have a solution and should have a PR after I test it.
Updated by Brad Hubbard over 4 years ago
- Status changed from New to In Progress
- Backport set to luminous
- Pull request ID set to 29519
Updated by Brad Hubbard over 4 years ago
- Subject changed from "Command failed (workunit test rados/test.sh)" in smoke/master/mimic to mimic: "Command failed (workunit test rados/test.sh)" in smoke/master/mimic
Updated by Brad Hubbard over 4 years ago
- Is duplicate of Bug #38416: crc cache should be invalidated when posting preallocated rx buffers added
Updated by Brad Hubbard over 4 years ago
- Status changed from In Progress to Duplicate
Updated by Brad Hubbard over 4 years ago
- Has duplicate Bug #38238: rados/test.sh: api_aio_pp doesn't seem to start added
Updated by Brad Hubbard over 4 years ago
- Has duplicate deleted (Bug #38238: rados/test.sh: api_aio_pp doesn't seem to start)