Bug #42387
closedceph_test_admin_socket_output fails in rados qa suite
0%
Description
osd.0
2019-10-20T13:51:51.501+0000 7f6974ac3700 1 osd.0 32 bench count 1073741824 bsize 4 MiB ... 2019-10-20T13:51:58.623+0000 7f6974ac3700 -1 asok(0x55c8ad926000) AdminSocket: error writing response length (32) Broken pipe
ceph_test_admin_socket_output
2019-10-20T13:51:51.765 INFO:teuthology.orchestra.run.mira027.stdout:Sending commands to osd socket 2019-10-20T13:51:51.765 INFO:teuthology.orchestra.run.mira027.stdout:Sending command "bench" 2019-10-20T13:51:56.505 INFO:teuthology.orchestra.run.mira027.stderr:run_command AdminSocketClient::do_request errored with: safe_read(3) failed to read message size: (11) Resource temporarily unavailabl e
/a/kchai-2019-10-20_12:50:42-rados-wip-kefu-testing-2019-10-18-1835-distro-basic-mira/4427741/
not reproducible locally
Updated by Kefu Chai over 4 years ago
pushed ceph_test_admin_socket_output to ceph-ci, so i can rerun this test without Sage's asock changes.
Updated by Kefu Chai over 4 years ago
being tested at http://pulpito.ceph.com/kchai-2019-10-21_11:21:40-rados-wip-before-asok-changes-distro-basic-mira/.
http://pulpito.ceph.com/kchai-2019-10-21_11:21:40-rados-wip-before-asok-changes-distro-basic-mira/ is green. so the suspect is very likely https://github.com/ceph/ceph/pull/30217
Updated by Kefu Chai over 4 years ago
- Status changed from New to Fix Under Review
- Pull request ID set to 30859
Updated by Kefu Chai over 4 years ago
- Status changed from Fix Under Review to Resolved
Updated by Kefu Chai over 4 years ago
- Status changed from Resolved to New
2019-11-11T09:32:03.043 INFO:teuthology.orchestra.run.mira057.stderr:/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/cento s7/DIST/centos7/MACHINE_SIZE/huge/release/15.0.0-6788-gd1320f4/rpm/el7/BUILD/ceph-15.0.0-6788-gd1320f4/src/test/admin_socket_output.cc: In function 'std::pair<std::basic_string<c har>, std::basic_string<char> > AdminSocketOutput::run_command(AdminSocketClient&, const string&, bool)' thread 7f5217c76980 time 2019-11-11T09:32:03.038754+0000 2019-11-11T09:32:03.043 INFO:teuthology.orchestra.run.mira057.stderr:/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/cento s7/DIST/centos7/MACHINE_SIZE/huge/release/15.0.0-6788-gd1320f4/rpm/el7/BUILD/ceph-15.0.0-6788-gd1320f4/src/test/admin_socket_output.cc: 119: ceph_abort_msg("abort() called") 2019-11-11T09:32:03.043 INFO:teuthology.orchestra.run.mira057.stderr: ceph version 15.0.0-6788-gd1320f4 (d1320f44859751513827bfdf742bc8bcc860bda7) octopus (dev) 2019-11-11T09:32:03.044 INFO:teuthology.orchestra.run.mira057.stderr: 1: (ceph::__ceph_abort(char const*, int, char const*, std::string const&)+0xdd) [0x7f520eef0896] 2019-11-11T09:32:03.044 INFO:teuthology.orchestra.run.mira057.stderr: 2: (AdminSocketOutput::run_command(AdminSocketClient&, std::string const&, bool)+0x31f) [0x55e7e0d451af] 2019-11-11T09:32:03.044 INFO:teuthology.orchestra.run.mira057.stderr: 3: (AdminSocketOutput::gather_socket_output()+0x3fd) [0x55e7e0d4654d] 2019-11-11T09:32:03.044 INFO:teuthology.orchestra.run.mira057.stderr: 4: (AdminSocketOutput::exec()+0x88) [0x55e7e0d46bd8] 2019-11-11T09:32:03.044 INFO:teuthology.orchestra.run.mira057.stderr: 5: (main()+0x883) [0x55e7e0d3db03] 2019-11-11T09:32:03.044 INFO:teuthology.orchestra.run.mira057.stderr: 6: (__libc_start_main()+0xf5) [0x7f520cb81545] 2019-11-11T09:32:03.044 INFO:teuthology.orchestra.run.mira057.stderr: 7: (()+0x360ef) [0x55e7e0d3e0ef] 2019-11-11T09:32:03.056 DEBUG:teuthology.orchestra.run:got remote process result: None
/a/kchai-2019-11-11_03:49:19-rados-wip-kefu2-testing-2019-11-03-0500-distro-basic-mira/4495159
wip-kefu2-testing-2019-11-03-0500 contains 29c97547a956469cbf089a705da70df83edb2b24
Updated by Brad Hubbard over 4 years ago
- Assignee set to Brad Hubbard
It's only the bench command that causes the issue.
$ sudo gdb --args ceph_test_admin_socket_output --osd (gdb) b src/test/admin_socket_output.cc:164 (gdb) r ... Breakpoint 1, AdminSocketOutput::gather_socket_output() () at /usr/src/debug/ceph-15.0.0-6788-gd1320f4/src/test/admin_socket_output.cc:164 164 sresults.insert(run_command(client, (*iter)->get_name())); ... (gdb) del Delete all breakpoints? (y or n) y (gdb) set ++iter (gdb) c Continuing. Sending command "bluestore allocator dump block" ... Test passed [Inferior 1 (process 140190) exited normally]
$ sudo /bin/time ceph daemon osd.1 bench { "bytes_written": 1073741824, "blocksize": 4194304, "elapsed_sec": 6.8355695379999997, "bytes_per_sec": 157081545.00234419, "iops": 37.451158762537048 } 0.15user 0.03system 0:07.20elapsed 2%CPU (0avgtext+0avgdata 19916maxresident)k 3528inputs+0outputs (5major+12858minor)pagefaults 0swaps
If I add a break just before the read, after the request has been sent, I pause for a count of seven and continue and the program runs to completion.
(gdb) b src/common/admin_socket_client.cc:139 Breakpoint 2 at 0x7fffef35bd1d: file /usr/src/debug/ceph-15.0.0-6788-gd1320f4/src/common/admin_socket_client.cc, line 139. (gdb) r Starting program: /bin/ceph_test_admin_socket_output --osd ... Sending commands to osd socket Sending command "bench" Breakpoint 2, AdminSocketClient::do_request(std::string, std::string*) () at /usr/src/debug/ceph-15.0.0-6788-gd1320f4/src/common/admin_socket_client.cc:139 139 res = safe_read_exact(socket_fd, &message_size_raw, (gdb) c <----------------- No pause Continuing. run_command AdminSocketClient::do_request errored with: safe_read(7) failed to read message size: (11) Resource temporarily unavailable /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/15.0.0-6788-gd1320f4/rpm/el7/BUILD/ceph-15.0.0-6788-gd1320f4/src/test/admin_socket_out put.cc: In function 'std::pair<std::basic_string<char>, std::basic_string<char> > AdminSocketOutput::run_command(AdminSocketClient&, const string&, bool)' thread 7ffff7fde980 time 2019-11-13T08:19:52.136682+0000 /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/15.0.0-6788-gd1320f4/rpm/el7/BUILD/ceph-15.0.0-6788-gd1320f4/src/test/admin_socket_out put.cc: 119: ceph_abort_msg("abort() called") ceph version 15.0.0-6788-gd1320f4 (d1320f44859751513827bfdf742bc8bcc860bda7) octopus (dev) 1: (ceph::__ceph_abort(char const*, int, char const*, std::string const&)+0xdd) [0x7fffef25a896] 2: (AdminSocketOutput::run_command(AdminSocketClient&, std::string const&, bool)+0x31f) [0x5555555911af] 3: (AdminSocketOutput::gather_socket_output()+0x3fd) [0x55555559254d] 4: (AdminSocketOutput::exec()+0x88) [0x555555592bd8] 5: (main()+0x883) [0x555555589b03] 6: (__libc_start_main()+0xf5) [0x7fffeceeb545] 7: (()+0x360ef) [0x55555558a0ef] (gdb) k Kill the program being debugged? (y or n) y (gdb) r Starting program: /bin/ceph_test_admin_socket_output --osd ... Sending commands to osd socket Sending command "bench" Breakpoint 2, AdminSocketClient::do_request(std::string, std::string*) () at /usr/src/debug/ceph-15.0.0-6788-gd1320f4/src/common/admin_socket_client.cc:139 139 res = safe_read_exact(socket_fd, &message_size_raw, (gdb) del Delete all breakpoints? (y or n) y (gdb) c <---- Pause for count of 7 Continuing. ... Test passed [Inferior 1 (process 143476) exited normally]
Updated by Brad Hubbard over 4 years ago
As suspected this is due to the length of time the 'bench' command takes to run and return a result to the socket.
This is the timeout we are hitting.
48 static std::string asok_connect(const std::string &path, int *fd) 49 { ... 72 struct timeval timer; 73 timer.tv_sec = 5; 74 timer.tv_usec = 0; 75 if (::setsockopt(socket_fd, SOL_SOCKET, SO_RCVTIMEO, &timer, sizeof(timer))) {
This has come to light since https://github.com/ceph/ceph/pull/30217/commits/8987f94416f453829eae6dda08837ef5a42531c6 moved the 'bench' command to the admin socket.
Setting the timer to 10 seconds allows the process to exit normally.
Updated by Brad Hubbard over 4 years ago
- Pull request ID changed from 30859 to 31623
Updated by Nathan Cutler over 4 years ago
- Status changed from New to Fix Under Review
Updated by Brad Hubbard over 4 years ago
- Status changed from Fix Under Review to In Progress
Updated by Kefu Chai over 4 years ago
- Status changed from In Progress to Pending Backport
Updated by Nathan Cutler over 4 years ago
- Copied to Backport #42878: nautilus: ceph_test_admin_socket_output fails in rados qa suite added
Updated by Nathan Cutler over 4 years ago
- Copied to Backport #42879: mimic: ceph_test_admin_socket_output fails in rados qa suite added
Updated by Nathan Cutler about 4 years ago
- Status changed from Pending Backport to Resolved
While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".