Bug #40765 (closed)

mimic: "Command failed (workunit test rados/test.sh)" in smoke/master/mimic

Added by Yuri Weinstein almost 5 years ago. Updated over 4 years ago.

Status: Duplicate
Priority: Urgent
Assignee:
Category: -
Target version: -
% Done: 0%
Source: Q/A
Tags:
Backport: luminous
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite: smoke
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Run: http://pulpito.ceph.com/teuthology-2019-07-12_06:00:03-smoke-mimic-testing-basic-smithi/
Jobs: '4113997', '4113998'
Logs: http://qa-proxy.ceph.com/teuthology/teuthology-2019-07-12_06:00:03-smoke-mimic-testing-basic-smithi/4113997/teuthology.log

2019-07-12T09:58:39.832 INFO:tasks.mon_thrash.mon_thrasher:waiting for 90.0 secs before reviving monitors
2019-07-12T09:59:02.440 INFO:teuthology.orchestra.run.smithi019:Running:
2019-07-12T09:59:02.441 INFO:teuthology.orchestra.run.smithi019:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2019-07-12T09:59:02.445 INFO:teuthology.orchestra.run.smithi041:Running:
2019-07-12T09:59:02.445 INFO:teuthology.orchestra.run.smithi041:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2019-07-12T09:59:02.450 INFO:teuthology.orchestra.run.smithi080:Running:
2019-07-12T09:59:02.450 INFO:teuthology.orchestra.run.smithi080:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2019-07-12T09:59:30.049 INFO:tasks.workunit.client.0.smithi019.stderr:++ cleanup
2019-07-12T09:59:30.049 INFO:tasks.workunit.client.0.smithi019.stderr:++ pkill -P 9446
2019-07-12T09:59:30.083 INFO:tasks.workunit.client.0.smithi019.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/rados/test.sh: line 10:  9459 Terminated              bash -o pipefail -exc "ceph_test_rados_$f $color 2>&1 | tee ceph_test_rados_$ff.log | sed \"s/^/$r: /\"" 
2019-07-12T09:59:30.083 INFO:tasks.workunit.client.0.smithi019.stderr:++ true
2019-07-12T09:59:30.085 DEBUG:teuthology.orchestra.run:got remote process result: 124
2019-07-12T09:59:30.085 INFO:tasks.workunit:Stopping ['rados/test.sh'] on client.0...
2019-07-12T09:59:30.085 INFO:teuthology.orchestra.run.smithi019:Running:
2019-07-12T09:59:30.086 INFO:teuthology.orchestra.run.smithi019:> sudo rm -rf -- /home/ubuntu/cephtest/workunits.list.client.0 /home/ubuntu/cephtest/clone.client.0
2019-07-12T09:59:30.506 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/run_tasks.py", line 86, in run_tasks
    manager = run_one_task(taskname, ctx=ctx, config=config)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/run_tasks.py", line 65, in run_one_task
    return task(**kwargs)
  File "/home/teuthworker/src/git.ceph.com_ceph_mimic/qa/tasks/workunit.py", line 123, in task
    timeout=timeout,cleanup=cleanup)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 85, in __exit__
    for result in self:
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 99, in next
    resurrect_traceback(result)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 22, in capture_traceback
    return func(*args, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_ceph_mimic/qa/tasks/workunit.py", line 409, in _run_tests
    label="workunit test {workunit}".format(workunit=workunit)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/remote.py", line 205, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 437, in run
    r.wait()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 162, in wait
    self._raise_for_status()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 184, in _raise_for_status
    node=self.hostname, label=self.label
CommandFailedError: Command failed (workunit test rados/test.sh) on smithi019 with status 124: 'mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=6ac795a266b802d40086d0a664964097794d516d TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.0 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.0 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/clone.client.0/qa/workunits/rados/test.sh'
2019-07-12T09:59:30.517 DEBUG:teuthology.run_tasks:Unwinding manager mon_thrash

Files

ceph-dump-log.txt.gz (2.44 KB), Brad Hubbard, 07/26/2019 10:21 AM

Related issues 3 (0 open, 3 closed)

Related to RADOS - Bug #38238: rados/test.sh: api_aio_pp doesn't seem to start (Duplicate, Brad Hubbard, 02/08/2019)

Has duplicate Ceph - Bug #40926: "Command failed (workunit test rados/test.sh)" in rados (Duplicate, 07/24/2019)

Is duplicate of RADOS - Bug #38416: crc cache should be invalidated when posting preallocated rx buffers (Resolved, Ilya Dryomov, 02/20/2019)
Actions #1

Updated by Yuri Weinstein almost 5 years ago

  • Subject changed from "Command failed (workunit test rados/test.sh)" in smoke/master to "Command failed (workunit test rados/test.sh)" in smoke/master/mimic
  • Priority changed from Normal to High
  • Affected Versions v13.2.7 added

Seems to occur on mimic as well

http://pulpito.ceph.com/teuthology-2019-07-13_06:00:03-smoke-mimic-testing-basic-smithi/

http://qa-proxy.ceph.com/teuthology/teuthology-2019-07-13_06:00:03-smoke-mimic-testing-basic-smithi/4117340/teuthology.log

2019-07-13T10:30:06.616 INFO:teuthology.orchestra.run.smithi194:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2019-07-13T10:30:32.086 INFO:tasks.workunit.client.0.smithi158.stderr:++ cleanup
2019-07-13T10:30:32.086 INFO:tasks.workunit.client.0.smithi158.stderr:++ pkill -P 8938
2019-07-13T10:30:32.106 INFO:tasks.workunit.client.0.smithi158.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/rados/test.sh: line 10:  8951 Terminated              bash -o pipefail -exc "ceph_test_rados_$f $color 2>&1 | tee ceph_test_rados_$ff.log | sed \"s/^/$r: /\"" 
2019-07-13T10:30:32.106 INFO:tasks.workunit.client.0.smithi158.stderr:++ true
2019-07-13T10:30:32.107 DEBUG:teuthology.orchestra.run:got remote process result: 124
2019-07-13T10:30:32.108 INFO:tasks.workunit:Stopping ['rados/test.sh'] on client.0...
2019-07-13T10:30:32.108 INFO:teuthology.orchestra.run.smithi158:Running:
2019-07-13T10:30:32.108 INFO:teuthology.orchestra.run.smithi158:> sudo rm -rf -- /home/ubuntu/cephtest/workunits.list.client.0 /home/ubuntu/cephtest/clone.client.0
2019-07-13T10:30:32.513 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/run_tasks.py", line 86, in run_tasks
    manager = run_one_task(taskname, ctx=ctx, config=config)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/run_tasks.py", line 65, in run_one_task
    return task(**kwargs)
  File "/home/teuthworker/src/git.ceph.com_ceph_mimic/qa/tasks/workunit.py", line 123, in task
    timeout=timeout,cleanup=cleanup)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 85, in __exit__
    for result in self:
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 99, in next
    resurrect_traceback(result)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 22, in capture_traceback
    return func(*args, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_ceph_mimic/qa/tasks/workunit.py", line 409, in _run_tests
    label="workunit test {workunit}".format(workunit=workunit)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/remote.py", line 205, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 437, in run
    r.wait()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 162, in wait
    self._raise_for_status()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 184, in _raise_for_status
    node=self.hostname, label=self.label
CommandFailedError: Command failed (workunit test rados/test.sh) on smithi158 with status 124: 'mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=6ac795a266b802d40086d0a664964097794d516d TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.0 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.0 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/clone.client.0/qa/workunits/rados/test.sh'
Actions #2

Updated by Greg Farnum almost 5 years ago

  • Project changed from Ceph to RADOS
Actions #3

Updated by Neha Ojha almost 5 years ago

  • Priority changed from High to Urgent
Actions #4

Updated by Sage Weil almost 5 years ago

  • Related to Bug #38238: rados/test.sh: api_aio_pp doesn't seem to start added
Actions #5

Updated by Sage Weil almost 5 years ago

2019-07-12T06:59:36.562 INFO:tasks.workunit.client.0.smithi019.stdout:                   api_io: [ RUN      ] LibRadosIoPP.TruncTestPP
2019-07-12T06:59:36.563 INFO:tasks.workunit.client.0.smithi019.stdout:                   api_io: [       OK ] LibRadosIoPP.TruncTestPP (17 ms)
2019-07-12T06:59:36.563 INFO:tasks.workunit.client.0.smithi019.stdout:                   api_io: [ RUN      ] LibRadosIoPP.RemoveTestPP
2019-07-12T06:59:37.397 INFO:tasks.workunit.client.0.smithi019.stdout:                   api_io: [       OK ] LibRadosIoPP.RemoveTestPP (7    delete_pools_parallel: process_1_[9896]: starting.

delete_pools_parallel (process_1_[9896]) appears to be the problematic child process.
Actions #6

Updated by Brad Hubbard over 4 years ago

  • Has duplicate Bug #40926: "Command failed (workunit test rados/test.sh)" in rados added
Actions #7

Updated by Brad Hubbard over 4 years ago

Still working on this, but here are steps to reproduce...

$ cat << EOF >> orig.config.yaml                                                                                                                                                                                        
interactive-on-error: true
$(tail -n+2 /a/teuthology-2019-07-24_06:00:03-smoke-mimic-testing-basic-smithi/4144267/orig.config.yaml)
EOF

$ rm -rf ~/working/archive/* ;teuthology -v -a ~/working/archive/ --lock orig.config.yaml

Make sure ~/working/archive/ exists.

NOTE: You might need to change machine-type to "mira" if there are no smithis available (that's what I did so maybe just change it anyway?).

Try to take note of the machines that are being used. I just scanned the output for them, but there are probably better ways. Let it get to the point where it's running the tests, then log onto each machine (or all of them at once) and run the following.

$ ps auwwx|grep ceph_test_rados_api_io

Eventually you'll find the process we want to investigate; it will hang around for a long time, so you don't have to hurry much. You can also kill the task that started it, which I think will stop it from being killed.

I grabbed a gcore and I'm currently looking at that.

Actions #8

Updated by Brad Hubbard over 4 years ago

Looking at a live process, it doesn't seem to be a deadlock or "hang" but more like some sort of livelock: the 'msgr-worker-1' thread is chewing the CPU, and the 'ms_dispatch' and 'log' threads are using a little bit of CPU as well.

top - 09:48:20 up  3:14,  1 user,  load average: 0.06, 0.12, 0.11
Threads: 175 total,   1 running, 120 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.8 us,  1.7 sy,  0.0 ni, 95.7 id,  0.0 wa,  0.0 hi,  1.8 si,  0.0 st
KiB Mem : 16402184 total, 14600708 free,   220520 used,  1580956 buff/cache
KiB Swap:  2097148 total,  2097148 free,        0 used. 15842512 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND                                                                                                                                                                
  10612 ubuntu    20   0 1186956  26604  16680 S  9.6  0.2  20:17.43 msgr-worker-1                                                                                                                                                          
  10624 ubuntu    20   0 1186956  26604  16680 S  1.7  0.2   3:40.29 ms_dispatch                                                                                                                                                            
  10603 ubuntu    20   0 1186956  26604  16680 S  1.0  0.2   2:01.56 log

$ ps axHo stat,pid,tid,pgid,ppid,comm,wchan|grep 9524
Sl      9524    9524    9503    9519 ceph_test_rados -
Sl      9524   10603    9503    9519 log             -
Sl      9524   10611    9503    9519 msgr-worker-0   epoll_wait
Sl      9524   10612    9503    9519 msgr-worker-1   epoll_wait
Sl      9524   10613    9503    9519 msgr-worker-2   epoll_wait
Sl      9524   10621    9503    9519 service         -
Sl      9524   10622    9503    9519 admin_socket    x64_sys_poll
Sl      9524   10623    9503    9519 ceph_test_rados -
Sl      9524   10624    9503    9519 ms_dispatch     -
Sl      9524   10625    9503    9519 ms_local        -
Sl      9524   10626    9503    9519 safe_timer      -
Sl      9524   10627    9503    9519 fn_anonymous    -
Sl      9524   10628    9503    9519 safe_timer      -
Sl      9524   10629    9503    9519 safe_timer      -
Sl      9524   10630    9503    9519 fn-radosclient  -
Actions #9

Updated by Brad Hubbard over 4 years ago

I dumped out the log (500 lines, attached); it is almost exclusively this sequence, so we're not communicating with osd.3.

2019-07-26 16:52:59.098856 thread: 0x7f5a9fa29700 priority: 0 subsystem: 27 bad crc in data 1465487574 != exp 1041109653 from 172.21.4.108:6804/10042
2019-07-26 16:52:59.098856 thread: 0x7f5a9e226700 priority: 1 subsystem: 14 client.4580.objecter ms_handle_reset 0x7f5a8c01f0f0 session 0x55855d9cc5f0 osd.3
2019-07-26 16:52:59.106857 thread: 0x7f5a9fa29700 priority: 0 subsystem: 27 bad crc in data 1465487574 != exp 1041109653 from 172.21.4.108:6804/10042
2019-07-26 16:52:59.106857 thread: 0x7f5a9e226700 priority: 1 subsystem: 14 client.4580.objecter ms_handle_reset 0x7f5a8c01b7a0 session 0x55855d9cc5f0 osd.3
2019-07-26 16:52:59.110857 thread: 0x7f5a9fa29700 priority: 0 subsystem: 27 bad crc in data 1465487574 != exp 1041109653 from 172.21.4.108:6804/10042
2019-07-26 16:52:59.114857 thread: 0x7f5a9e226700 priority: 1 subsystem: 14 client.4580.objecter ms_handle_reset 0x7f5a8c01f0f0 session 0x55855d9cc5f0 osd.3
2019-07-26 16:52:59.118857 thread: 0x7f5a9fa29700 priority: 0 subsystem: 27 bad crc in data 1465487574 != exp 1041109653 from 172.21.4.108:6804/10042
2019-07-26 16:52:59.118857 thread: 0x7f5a9e226700 priority: 1 subsystem: 14 client.4580.objecter ms_handle_reset 0x7f5a8c01b7a0 session 0x55855d9cc5f0 osd.3
2019-07-26 16:52:59.126857 thread: 0x7f5a9fa29700 priority: 0 subsystem: 27 bad crc in data 1465487574 != exp 1041109653 from 172.21.4.108:6804/10042
2019-07-26 16:52:59.126857 thread: 0x7f5a9e226700 priority: 1 subsystem: 14 client.4580.objecter ms_handle_reset 0x7f5a8c01f0f0 session 0x55855d9cc5f0 osd.3

For the time being you can get the coredump from https://people.redhat.com/~bhubbard/core.9524.gz and osd.3's log at https://people.redhat.com/~bhubbard/ceph-osd.3.log.gz in case someone wants to carry on :)

From the osd log it looks like the client is just retrying a read over and over again and keeps failing.

Thread 10612 and the main thread look like this, so they match up with what the logs are saying, I guess.

Thread 4 (Thread 0x7f5a9fa29700 (LWP 10612)):
#0  0x00007f5aa50dbbb7 in epoll_wait (epfd=7, events=0x55855d9b13a0, maxevents=5000, timeout=30000) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x00007f5aa5fc475e in EpollDriver::event_wait (this=0x55855d9878d0, fired_events=std::vector of length 0, capacity 0, tvp=<optimized out>) at ./src/msg/async/EventEpoll.cc:122
#2  0x00007f5aa5d81f57 in EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*) () at ./src/msg/async/Event.cc:401
#3  0x00007f5aa5d84da6 in NetworkStack::<lambda()>::operator() (__closure=0x55855da9b8b8, __closure=0x55855da9b8b8) at ./src/msg/async/Stack.cc:53
#4  std::_Function_handler<void (), NetworkStack::add_thread(unsigned int)::{lambda()#1}>::_M_invoke(std::_Any_data const&) () at /usr/include/c++/8/bits/std_function.h:297
#5  0x00007f5aa568066f in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007f5aae9736db in start_thread (arg=0x7f5a9fa29700) at pthread_create.c:463
#7  0x00007f5aa50db88f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 1 (Thread 0x7f5aaed93fc0 (LWP 9524)):
#0  0x00007f5aae9799f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7ffd318fb340) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, mutex=0x7ffd318fb378, cond=0x7ffd318fb318) at pthread_cond_wait.c:502
#2  __pthread_cond_wait (cond=cond@entry=0x7ffd318fb318, mutex=mutex@entry=0x7ffd318fb378) at pthread_cond_wait.c:655
#3  0x00007f5aae6b0fd6 in Cond::Wait (this=this@entry=0x7ffd318fb310, mutex=...) at ./src/common/Cond.h:48
#4  0x00007f5aae6a957f in librados::IoCtxImpl::operate_read(object_t const&, ObjectOperation*, ceph::buffer::list*, int) () at ./src/librados/IoCtxImpl.cc:754
#5  0x00007f5aae6692b9 in librados::IoCtx::operate(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, librados::ObjectReadOperation*, ceph::buffer::list*) () at ./src/librados/librados.cc:1521
#6  0x000055855cc20e60 in LibRadosIoPP_CmpExtPP_Test::TestBody() () at /usr/include/c++/8/ext/new_allocator.h:79
#7  0x000055855cc6333a in testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void> (location=0x55855cc8361c "the test body", method=<optimized out>, object=0x55855da02950)
    at ./src/googletest/googletest/src/gtest.cc:2383
#8  testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void> (object=object@entry=0x55855da02950, method=<optimized out>, location=location@entry=0x55855cc8361c "the test body")
    at ./src/googletest/googletest/src/gtest.cc:2438
#9  0x000055855cc5935a in testing::Test::Run (this=0x55855da02950) at ./src/googletest/googletest/src/gtest.cc:2474
#10 0x000055855cc594a8 in testing::TestInfo::Run (this=0x55855d95a6b0) at ./src/googletest/googletest/src/gtest.cc:2656
#11 0x000055855cc59585 in testing::TestCase::Run (this=0x55855d935e90) at ./src/googletest/googletest/src/gtest.cc:2774
#12 0x000055855cc59aa4 in testing::internal::UnitTestImpl::RunAllTests() () at /usr/include/c++/8/bits/stl_vector.h:930
#13 0x000055855cc6381a in testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool> (location=0x55855cc84860 "auxiliary test code (environments or event listeners)", method=<optimized out>, 
    object=0x55855d93aad0) at ./src/googletest/googletest/src/gtest.cc:2383
#14 testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool> (object=0x55855d93aad0, method=<optimized out>, 
    location=location@entry=0x55855cc84860 "auxiliary test code (environments or event listeners)") at ./src/googletest/googletest/src/gtest.cc:2438
#15 0x000055855cc59bbd in testing::UnitTest::Run() () at ./src/googletest/googletest/include/gtest/gtest.h:1287
#16 0x000055855cc1aff9 in RUN_ALL_TESTS () at ./src/googletest/googletest/include/gtest/gtest.h:2233
#17 main (argc=<optimized out>, argv=0x7ffd318fb928) at ./src/googletest/googlemock/src/gmock_main.cc:53
Actions #11

Updated by Brad Hubbard over 4 years ago

Client:

2019-07-30 23:49:44.351 7fe779d6d700 20 -- 127.0.0.1:0/2016699731 >> 127.0.0.1:6805/5820 conn(0x5643de7b1540 :-1 s=STATE_OPEN pgs=4 cs=1 l=1).prepare_send_message encoding features 4611087854031667195 0x5643de78af10 osd_op(client.14126.0:59 2.5 2:b3a108dd:test-rados-api-mira110-6042-31::foo:head [read 0~4] snapc 0=[] ondisk+read+known_if_redirected e21) v8
2019-07-30 23:49:44.351 7fe779d6d700 20 -- 127.0.0.1:0/2016699731 >> 127.0.0.1:6805/5820 conn(0x5643de7b1540 :-1 s=STATE_OPEN pgs=4 cs=1 l=1).write_message signed m=0x5643de78af10): sig = 10196077939576485518
2019-07-30 23:49:44.351 7fe779d6d700 20 -- 127.0.0.1:0/2016699731 >> 127.0.0.1:6805/5820 conn(0x5643de7b1540 :-1 s=STATE_OPEN pgs=4 cs=1 l=1).write_message sending message type=42 src client.14126 front=242 data=0 off 0
2019-07-30 23:49:44.351 7fe779d6d700 20 -- 127.0.0.1:0/2016699731 >> 127.0.0.1:6805/5820 conn(0x5643de7b1540 :-1 s=STATE_OPEN pgs=4 cs=1 l=1).write_message sending 30 0x5643de78af10
2019-07-30 23:49:44.351 7fe779d6d700 10 -- 127.0.0.1:0/2016699731 >> 127.0.0.1:6805/5820 conn(0x5643de7b1540 :-1 s=STATE_OPEN pgs=4 cs=1 l=1)._try_send sent bytes 317 remaining bytes 0
2019-07-30 23:49:44.351 7fe779d6d700 10 -- 127.0.0.1:0/2016699731 >> 127.0.0.1:6805/5820 conn(0x5643de7b1540 :-1 s=STATE_OPEN pgs=4 cs=1 l=1).write_message sending 0x5643de78af10 done.
2019-07-30 23:49:44.351 7fe779d6d700 20 -- 127.0.0.1:0/2016699731 >> 127.0.0.1:6805/5820 conn(0x5643de7b1540 :-1 s=STATE_OPEN pgs=4 cs=1 l=1).process prev state is STATE_OPEN
2019-07-30 23:49:44.352 7fe779d6d700 20 -- 127.0.0.1:0/2016699731 >> 127.0.0.1:6805/5820 conn(0x5643de7b1540 :-1 s=STATE_OPEN_MESSAGE_HEADER pgs=4 cs=1 l=1).process prev state is STATE_OPEN
2019-07-30 23:49:44.352 7fe779d6d700 20 -- 127.0.0.1:0/2016699731 >> 127.0.0.1:6805/5820 conn(0x5643de7b1540 :-1 s=STATE_OPEN_MESSAGE_HEADER pgs=4 cs=1 l=1).process begin MSG
2019-07-30 23:49:44.352 7fe779d6d700 20 -- 127.0.0.1:0/2016699731 >> 127.0.0.1:6805/5820 conn(0x5643de7b1540 :-1 s=STATE_OPEN_MESSAGE_HEADER pgs=4 cs=1 l=1).process got MSG header
2019-07-30 23:49:44.352 7fe779d6d700 20 -- 127.0.0.1:0/2016699731 >> 127.0.0.1:6805/5820 conn(0x5643de7b1540 :-1 s=STATE_OPEN_MESSAGE_HEADER pgs=4 cs=1 l=1).process got envelope type=43 src osd.1 front=147 data=4 off 0
2019-07-30 23:49:44.352 7fe779d6d700 20 -- 127.0.0.1:0/2016699731 >> 127.0.0.1:6805/5820 conn(0x5643de7b1540 :-1 s=STATE_OPEN_MESSAGE_THROTTLE_MESSAGE pgs=4 cs=1 l=1).process prev state is STATE_OPEN_MESSAGE_HEADER
2019-07-30 23:49:44.352 7fe779d6d700 20 -- 127.0.0.1:0/2016699731 >> 127.0.0.1:6805/5820 conn(0x5643de7b1540 :-1 s=STATE_OPEN_MESSAGE_THROTTLE_BYTES pgs=4 cs=1 l=1).process prev state is STATE_OPEN_MESSAGE_THROTTLE_MESSAGE
2019-07-30 23:49:44.352 7fe779d6d700 20 -- 127.0.0.1:0/2016699731 >> 127.0.0.1:6805/5820 conn(0x5643de7b1540 :-1 s=STATE_OPEN_MESSAGE_THROTTLE_DISPATCH_QUEUE pgs=4 cs=1 l=1).process prev state is STATE_OPEN_MESSAGE_THROTTLE_BYTES
2019-07-30 23:49:44.352 7fe779d6d700 20 -- 127.0.0.1:0/2016699731 >> 127.0.0.1:6805/5820 conn(0x5643de7b1540 :-1 s=STATE_OPEN_MESSAGE_READ_FRONT pgs=4 cs=1 l=1).process prev state is STATE_OPEN_MESSAGE_THROTTLE_DISPATCH_QUEUE
2019-07-30 23:49:44.352 7fe779d6d700 20 -- 127.0.0.1:0/2016699731 >> 127.0.0.1:6805/5820 conn(0x5643de7b1540 :-1 s=STATE_OPEN_MESSAGE_READ_FRONT pgs=4 cs=1 l=1).process got front 147
2019-07-30 23:49:44.352 7fe779d6d700 10 -- 127.0.0.1:0/2016699731 >> 127.0.0.1:6805/5820 conn(0x5643de7b1540 :-1 s=STATE_OPEN_MESSAGE_READ_DATA_PREPARE pgs=4 cs=1 l=1).process seleting rx buffer v 2 at offset 0 len 4
2019-07-30 23:49:44.352 7fe779d6d700 10 -- 127.0.0.1:0/2016699731 >> 127.0.0.1:6805/5820 conn(0x5643de7b1540 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=4 cs=1 l=1).process aborted = 0
2019-07-30 23:49:44.352 7fe779d6d700 20 -- 127.0.0.1:0/2016699731 >> 127.0.0.1:6805/5820 conn(0x5643de7b1540 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=4 cs=1 l=1).process got 147 + 0 + 4 byte message
2019-07-30 23:49:44.352 7fe779d6d700  0 bad crc in data 1465487574 != exp 1041109653 from 127.0.0.1:6805/5820
2019-07-30 23:49:44.352 7fe779d6d700 20  00000000  43 45 50 48                                       |CEPH|                                                                                                                                  
00000004

2019-07-30 23:49:44.352 7fe779d6d700  1 -- 127.0.0.1:0/2016699731 >> 127.0.0.1:6805/5820 conn(0x5643de7b1540 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=4 cs=1 l=1).process decode message failed·

OSD:

2019-07-30 23:49:44.351 7f4e49124700  1 -- 127.0.0.1:6805/5820 --> 127.0.0.1:0/2016699731 -- osd_op_reply(59 foo [read 0~4] v0'0 uv3 ondisk = 0) v8 -- 0x55927bd2d080 con 0                                                                  
2019-07-30 23:49:44.351 7f4e49124700 15 -- 127.0.0.1:6805/5820 >> 127.0.0.1:0/2016699731 conn(0x55927ba8de00 :6805 s=STATE_OPEN pgs=5 cs=1 l=1).send_message inline write is denied, reschedule m=0x55927bd2d080                             
2019-07-30 23:49:44.351 7f4e49124700 20 Event(0x55927b4b3c80 nevent=5000 time_id=7).wakeup                                                                                                                                                   
2019-07-30 23:49:44.351 7f4e49124700 10 osd.1 21 dequeue_op 0x55927be1f680 finish                                                                                                                                                            
2019-07-30 23:49:44.351 7f4e49124700 20 osd.1 op_wq(0) _process empty q, waiting                                                                                                                                                             
2019-07-30 23:49:44.351 7f4e65bd1700 10 -- 127.0.0.1:6805/5820 >> 127.0.0.1:0/2016699731 conn(0x55927ba8de00 :6805 s=STATE_OPEN pgs=5 cs=1 l=1).handle_write                                                                                 
2019-07-30 23:49:44.351 7f4e65bd1700 20 -- 127.0.0.1:6805/5820 >> 127.0.0.1:0/2016699731 conn(0x55927ba8de00 :6805 s=STATE_OPEN pgs=5 cs=1 l=1).prepare_send_message m osd_op_reply(59 foo [read 0~4] v0'0 uv3 ondisk = 0) v8                
2019-07-30 23:49:44.351 7f4e65bd1700 20 -- 127.0.0.1:6805/5820 >> 127.0.0.1:0/2016699731 conn(0x55927ba8de00 :6805 s=STATE_OPEN pgs=5 cs=1 l=1).prepare_send_message encoding features 4611087854031667195 0x55927bd2d080 osd_op_reply(59 foo [read 0~4] v0'0 uv3 ondisk = 0) v8
2019-07-30 23:49:44.351 7f4e65bd1700 20 -- 127.0.0.1:6805/5820 >> 127.0.0.1:0/2016699731 conn(0x55927ba8de00 :6805 s=STATE_OPEN pgs=5 cs=1 l=1).write_message signed m=0x55927bd2d080): sig = 2555750077203053029                            
2019-07-30 23:49:44.351 7f4e65bd1700 20 -- 127.0.0.1:6805/5820 >> 127.0.0.1:0/2016699731 conn(0x55927ba8de00 :6805 s=STATE_OPEN pgs=5 cs=1 l=1).write_message sending message type=43 src osd.1 front=147 data=4 off 0                       
2019-07-30 23:49:44.351 7f4e65bd1700 20 -- 127.0.0.1:6805/5820 >> 127.0.0.1:0/2016699731 conn(0x55927ba8de00 :6805 s=STATE_OPEN pgs=5 cs=1 l=1).write_message sending 30 0x55927bd2d080                                                      
2019-07-30 23:49:44.351 7f4e65bd1700 10 -- 127.0.0.1:6805/5820 >> 127.0.0.1:0/2016699731 conn(0x55927ba8de00 :6805 s=STATE_OPEN pgs=5 cs=1 l=1)._try_send sent bytes 226 remaining bytes 0                                                   
2019-07-30 23:49:44.351 7f4e65bd1700 10 -- 127.0.0.1:6805/5820 >> 127.0.0.1:0/2016699731 conn(0x55927ba8de00 :6805 s=STATE_OPEN pgs=5 cs=1 l=1).write_message sending 0x55927bd2d080 done.                                                   
2019-07-30 23:49:44.352 7f4e65bd1700 20 -- 127.0.0.1:6805/5820 >> 127.0.0.1:0/2016699731 conn(0x55927ba8de00 :6805 s=STATE_OPEN pgs=5 cs=1 l=1).process prev state is STATE_OPEN                                                             
2019-07-30 23:49:44.352 7f4e65bd1700  1 -- 127.0.0.1:6805/5820 >> 127.0.0.1:0/2016699731 conn(0x55927ba8de00 :6805 s=STATE_OPEN pgs=5 cs=1 l=1).read_bulk peer close file descriptor 53                                                      
2019-07-30 23:49:44.352 7f4e65bd1700  1 -- 127.0.0.1:6805/5820 >> 127.0.0.1:0/2016699731 conn(0x55927ba8de00 :6805 s=STATE_OPEN pgs=5 cs=1 l=1).read_until read failed                                                                       
2019-07-30 23:49:44.352 7f4e65bd1700  1 -- 127.0.0.1:6805/5820 >> 127.0.0.1:0/2016699731 conn(0x55927ba8de00 :6805 s=STATE_OPEN pgs=5 cs=1 l=1).process read tag failed                                                                      
2019-07-30 23:49:44.352 7f4e65bd1700  1 -- 127.0.0.1:6805/5820 >> 127.0.0.1:0/2016699731 conn(0x55927ba8de00 :6805 s=STATE_OPEN pgs=5 cs=1 l=1).fault on lossy channel, failing                                                              
2019-07-30 23:49:44.352 7f4e65bd1700  2 -- 127.0.0.1:6805/5820 >> 127.0.0.1:0/2016699731 conn(0x55927ba8de00 :6805 s=STATE_OPEN pgs=5 cs=1 l=1)._stop                                                                                        
2019-07-30 23:49:44.352 7f4e65bd1700 10 -- 127.0.0.1:6805/5820 >> 127.0.0.1:0/2016699731 conn(0x55927ba8de00 :6805 s=STATE_OPEN pgs=5 cs=1 l=1).discard_out_queue started
Actions #12

Updated by Brad Hubbard over 4 years ago

Ok, slowly closing in on this I think.

bad crc in data 1465487574 != exp 1041109653 from 172.21.4.108:6804/10042

That's our error, seen during the execution (specifically when doing the read) of the "LibRadosIoPP.CmpExtPP" test, which looks like this:

TEST_F(LibRadosIoPP, CmpExtPP) {
  bufferlist bl;
  bl.append("ceph");
  ObjectWriteOperation write1;
  write1.write(0, bl);
  ASSERT_EQ(0, ioctx.operate("foo", &write1));

  bufferlist new_bl;
  new_bl.append("CEPH");
  ObjectWriteOperation write2;
  write2.cmpext(0, bl, nullptr);
  write2.write(0, new_bl);
  ASSERT_EQ(0, ioctx.operate("foo", &write2));

  ObjectReadOperation read;
  read.read(0, bl.length(), NULL, NULL);
  ASSERT_EQ(0, ioctx.operate("foo", &read, &bl));
  ASSERT_EQ(0, memcmp(bl.c_str(), "CEPH", 4));
}

Now if we build and execute the following code.

#include "include/rados/librados.h" 
#include "include/rados/librados.hpp" 
#include <iostream>

using namespace std;
using namespace librados;

int main()
{
  bufferlist bl1, bl2;
  bl1.append("ceph");
  bl2.append("CEPH");
  cout << bl1.c_str() << " = " << bl1.crc32c(0) << std::endl;
  cout << bl2.c_str() << " = " << bl2.crc32c(0) << std::endl;
}

We get this output.

ceph = 1465487574
CEPH = 1041109653

So it looks like we expected the checksum for "CEPH" but got the checksum for "ceph". So what did we actually read?

2019-07-30 23:49:44.352 7fe779d6d700  0 bad crc in data 1465487574 != exp 1041109653 from 127.0.0.1:6805/5820
2019-07-30 23:49:44.352 7fe779d6d700 20  00000000  43 45 50 48                                       |CEPH|

We read "CEPH" so the next question is why would footer.data_crc contain the checksum for "ceph"? Do we have some cache for the checksum that we haven't invalidated when we do the second write?

Actions #13

Updated by Brad Hubbard over 4 years ago

Simplified test case.

#include "include/rados/librados.h" 
#include "include/rados/librados.hpp" 
#include <iostream>

using namespace std;
using namespace librados;

int main()
{
  Rados rados;
  rados.init("admin");          
  rados.conf_parse_env(NULL);
  rados.conf_read_file(NULL);
  rados.connect();      
  rados.pool_create("test-pool");
  IoCtx ioctx;
  rados.ioctx_create("test-pool", ioctx);

  bufferlist bl;
  bl.append("ceph");
  ObjectWriteOperation write1;
  write1.write(0, bl);
  ioctx.operate("foo", &write1);

  bufferlist new_bl;
  new_bl.append("CEPH");
  ObjectWriteOperation write2;
  write2.write(0, new_bl);
  ioctx.operate("foo", &write2);

  ObjectReadOperation read;
  read.read(0, bl.length(), NULL, NULL);
  ioctx.operate("foo", &read, &bl);
  cout << bl.c_str() << endl;
}
Actions #15

Updated by Brad Hubbard over 4 years ago

I was discussing this with Josh Durgin and he noticed the reuse of the bufferlist for the read, and mentioned we'd had problems with buffer reuse in the past. Sure enough, testing with a new buffer for the read works, as does adding "bl.invalidate_crc();" just before the read. I believe the problem is related to some crc caching code in buffer::list::crc32c, but I'm still looking into the specifics. Of course, a simple fix for the test would be to not reuse the write buffer for the read, but that would be cheating, right? ;)
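
For illustration, here is a minimal sketch of those two workarounds applied to the simplified test case from comment #13 (same assumptions as that repro: an "admin" client, a throwaway "test-pool", and no error checking; this only papers over the symptom on the client side, it is not the actual fix):

#include "include/rados/librados.h"
#include "include/rados/librados.hpp"
#include <iostream>

using namespace std;
using namespace librados;

int main()
{
  Rados rados;
  rados.init("admin");
  rados.conf_parse_env(NULL);
  rados.conf_read_file(NULL);
  rados.connect();
  rados.pool_create("test-pool");
  IoCtx ioctx;
  rados.ioctx_create("test-pool", ioctx);

  // Same two writes as the simplified test case above.
  bufferlist bl;
  bl.append("ceph");
  ObjectWriteOperation write1;
  write1.write(0, bl);
  ioctx.operate("foo", &write1);

  bufferlist new_bl;
  new_bl.append("CEPH");
  ObjectWriteOperation write2;
  write2.write(0, new_bl);
  ioctx.operate("foo", &write2);

  // Workaround 1: read into a fresh bufferlist instead of reusing the write buffer.
  bufferlist read_bl;
  ObjectReadOperation read1;
  read1.read(0, bl.length(), NULL, NULL);
  ioctx.operate("foo", &read1, &read_bl);
  cout << read_bl.c_str() << endl;

  // Workaround 2: keep reusing bl, but drop its cached crc just before the read.
  bl.invalidate_crc();
  ObjectReadOperation read2;
  read2.read(0, bl.length(), NULL, NULL);
  ioctx.operate("foo", &read2, &bl);
  cout << bl.c_str() << endl;
}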

Actions #16

Updated by Brad Hubbard over 4 years ago

  • Assignee set to Brad Hubbard
Actions #17

Updated by Brad Hubbard over 4 years ago

Correcting my statement in comment #12.

    313       __u32 data_crc = data.crc32c(0);
    314       if (data_crc != footer.data_crc) {
    315         if (cct) {
    316           ldout(cct, 0) << "bad crc in data " << data_crc << " != exp " << footer.data_crc << dendl;

We read "CEPH" and footer.data_crc contains 1041109653, which is the correct crc for "CEPH" but ceph::buffer::list::crc32c (called on line 313) returns a cached crc for "ceph" (1465487574). This is because we don't invalidate the crc when we re-use the buffer in AsyncConnection::process(). I have a solution and should have a PR after I test it.

Actions #18

Updated by Brad Hubbard over 4 years ago

  • Status changed from New to In Progress
  • Backport set to luminous
  • Pull request ID set to 29519
Actions #19

Updated by Brad Hubbard over 4 years ago

  • Subject changed from "Command failed (workunit test rados/test.sh)" in smoke/master/mimic to mimic: "Command failed (workunit test rados/test.sh)" in smoke/master/mimic
Actions #20

Updated by Brad Hubbard over 4 years ago

  • Is duplicate of Bug #38416: crc cache should be invalidated when posting preallocated rx buffers added
Actions #21

Updated by Brad Hubbard over 4 years ago

  • Status changed from In Progress to Duplicate
Actions #22

Updated by Brad Hubbard over 4 years ago

  • Has duplicate Bug #38238: rados/test.sh: api_aio_pp doesn't seem to start added
Actions #23

Updated by Brad Hubbard over 4 years ago

  • Has duplicate deleted (Bug #38238: rados/test.sh: api_aio_pp doesn't seem to start)