Project

General

Profile

Actions

Bug #42387

closed

ceph_test_admin_socket_output fails in rados qa suite

Added by Kefu Chai over 4 years ago. Updated about 4 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
nautilus, mimic
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

osd.0

2019-10-20T13:51:51.501+0000 7f6974ac3700  1 osd.0 32  bench count 1073741824 bsize 4 MiB
...
2019-10-20T13:51:58.623+0000 7f6974ac3700 -1 asok(0x55c8ad926000) AdminSocket: error writing response length (32) Broken pipe

ceph_test_admin_socket_output

2019-10-20T13:51:51.765 INFO:teuthology.orchestra.run.mira027.stdout:Sending commands to osd socket
2019-10-20T13:51:51.765 INFO:teuthology.orchestra.run.mira027.stdout:Sending command "bench" 
2019-10-20T13:51:56.505 INFO:teuthology.orchestra.run.mira027.stderr:run_command AdminSocketClient::do_request errored with: safe_read(3) failed to read message size: (11) Resource temporarily unavailabl
e

/a/kchai-2019-10-20_12:50:42-rados-wip-kefu-testing-2019-10-18-1835-distro-basic-mira/4427741/

not reproducible locally


Related issues 2 (0 open2 closed)

Copied to RADOS - Backport #42878: nautilus: ceph_test_admin_socket_output fails in rados qa suiteResolvedNathan CutlerActions
Copied to RADOS - Backport #42879: mimic: ceph_test_admin_socket_output fails in rados qa suiteResolvedNathan CutlerActions
Actions #1

Updated by Kefu Chai over 4 years ago

pushed ceph_test_admin_socket_output to ceph-ci, so i can rerun this test without Sage's asock changes.

Actions #3

Updated by Kefu Chai over 4 years ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 30859
Actions #4

Updated by Kefu Chai over 4 years ago

  • Status changed from Fix Under Review to Resolved
Actions #5

Updated by Kefu Chai over 4 years ago

  • Status changed from Resolved to New
2019-11-11T09:32:03.043 INFO:teuthology.orchestra.run.mira057.stderr:/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/cento
s7/DIST/centos7/MACHINE_SIZE/huge/release/15.0.0-6788-gd1320f4/rpm/el7/BUILD/ceph-15.0.0-6788-gd1320f4/src/test/admin_socket_output.cc: In function 'std::pair<std::basic_string<c
har>, std::basic_string<char> > AdminSocketOutput::run_command(AdminSocketClient&, const string&, bool)' thread 7f5217c76980 time 2019-11-11T09:32:03.038754+0000
2019-11-11T09:32:03.043 INFO:teuthology.orchestra.run.mira057.stderr:/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/cento
s7/DIST/centos7/MACHINE_SIZE/huge/release/15.0.0-6788-gd1320f4/rpm/el7/BUILD/ceph-15.0.0-6788-gd1320f4/src/test/admin_socket_output.cc: 119: ceph_abort_msg("abort() called")
2019-11-11T09:32:03.043 INFO:teuthology.orchestra.run.mira057.stderr: ceph version 15.0.0-6788-gd1320f4 (d1320f44859751513827bfdf742bc8bcc860bda7) octopus (dev)
2019-11-11T09:32:03.044 INFO:teuthology.orchestra.run.mira057.stderr: 1: (ceph::__ceph_abort(char const*, int, char const*, std::string const&)+0xdd) [0x7f520eef0896]
2019-11-11T09:32:03.044 INFO:teuthology.orchestra.run.mira057.stderr: 2: (AdminSocketOutput::run_command(AdminSocketClient&, std::string const&, bool)+0x31f) [0x55e7e0d451af]
2019-11-11T09:32:03.044 INFO:teuthology.orchestra.run.mira057.stderr: 3: (AdminSocketOutput::gather_socket_output()+0x3fd) [0x55e7e0d4654d]
2019-11-11T09:32:03.044 INFO:teuthology.orchestra.run.mira057.stderr: 4: (AdminSocketOutput::exec()+0x88) [0x55e7e0d46bd8]
2019-11-11T09:32:03.044 INFO:teuthology.orchestra.run.mira057.stderr: 5: (main()+0x883) [0x55e7e0d3db03]
2019-11-11T09:32:03.044 INFO:teuthology.orchestra.run.mira057.stderr: 6: (__libc_start_main()+0xf5) [0x7f520cb81545]
2019-11-11T09:32:03.044 INFO:teuthology.orchestra.run.mira057.stderr: 7: (()+0x360ef) [0x55e7e0d3e0ef]
2019-11-11T09:32:03.056 DEBUG:teuthology.orchestra.run:got remote process result: None

/a/kchai-2019-11-11_03:49:19-rados-wip-kefu2-testing-2019-11-03-0500-distro-basic-mira/4495159

wip-kefu2-testing-2019-11-03-0500 contains 29c97547a956469cbf089a705da70df83edb2b24

Actions #6

Updated by Brad Hubbard over 4 years ago

  • Assignee set to Brad Hubbard

It's only the bench command that causes the issue.


$ sudo gdb --args ceph_test_admin_socket_output --osd
(gdb) b src/test/admin_socket_output.cc:164
(gdb) r
...
Breakpoint 1, AdminSocketOutput::gather_socket_output() () at /usr/src/debug/ceph-15.0.0-6788-gd1320f4/src/test/admin_socket_output.cc:164
164           sresults.insert(run_command(client, (*iter)->get_name()));
...
(gdb) del
Delete all breakpoints? (y or n) y
(gdb) set ++iter
(gdb) c
Continuing.
Sending command "bluestore allocator dump block" 
...
Test passed
[Inferior 1 (process 140190) exited normally]

$ sudo /bin/time ceph daemon osd.1 bench
{
    "bytes_written": 1073741824,
    "blocksize": 4194304,
    "elapsed_sec": 6.8355695379999997,
    "bytes_per_sec": 157081545.00234419,
    "iops": 37.451158762537048
}
0.15user 0.03system 0:07.20elapsed 2%CPU (0avgtext+0avgdata 19916maxresident)k
3528inputs+0outputs (5major+12858minor)pagefaults 0swaps

If I add a break just before the read, after the request has been sent, I pause for a count of seven and continue and the program runs to completion.

(gdb) b src/common/admin_socket_client.cc:139
Breakpoint 2 at 0x7fffef35bd1d: file /usr/src/debug/ceph-15.0.0-6788-gd1320f4/src/common/admin_socket_client.cc, line 139.
(gdb) r
Starting program: /bin/ceph_test_admin_socket_output --osd
...
Sending commands to osd socket
Sending command "bench" 

Breakpoint 2, AdminSocketClient::do_request(std::string, std::string*) () at /usr/src/debug/ceph-15.0.0-6788-gd1320f4/src/common/admin_socket_client.cc:139
139       res = safe_read_exact(socket_fd, &message_size_raw,
(gdb) c     <----------------- No pause
Continuing.
run_command AdminSocketClient::do_request errored with: safe_read(7) failed to read message size: (11) Resource temporarily unavailable
/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/15.0.0-6788-gd1320f4/rpm/el7/BUILD/ceph-15.0.0-6788-gd1320f4/src/test/admin_socket_out
put.cc: In function 'std::pair<std::basic_string<char>, std::basic_string<char> > AdminSocketOutput::run_command(AdminSocketClient&, const string&, bool)' thread 7ffff7fde980 time 2019-11-13T08:19:52.136682+0000
/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/15.0.0-6788-gd1320f4/rpm/el7/BUILD/ceph-15.0.0-6788-gd1320f4/src/test/admin_socket_out
put.cc: 119: ceph_abort_msg("abort() called")
 ceph version 15.0.0-6788-gd1320f4 (d1320f44859751513827bfdf742bc8bcc860bda7) octopus (dev)
 1: (ceph::__ceph_abort(char const*, int, char const*, std::string const&)+0xdd) [0x7fffef25a896]
 2: (AdminSocketOutput::run_command(AdminSocketClient&, std::string const&, bool)+0x31f) [0x5555555911af]
 3: (AdminSocketOutput::gather_socket_output()+0x3fd) [0x55555559254d]
 4: (AdminSocketOutput::exec()+0x88) [0x555555592bd8]
 5: (main()+0x883) [0x555555589b03]
 6: (__libc_start_main()+0xf5) [0x7fffeceeb545]
 7: (()+0x360ef) [0x55555558a0ef]
(gdb) k
Kill the program being debugged? (y or n) y
(gdb) r
Starting program: /bin/ceph_test_admin_socket_output --osd
...
Sending commands to osd socket
Sending command "bench" 

Breakpoint 2, AdminSocketClient::do_request(std::string, std::string*) () at /usr/src/debug/ceph-15.0.0-6788-gd1320f4/src/common/admin_socket_client.cc:139
139       res = safe_read_exact(socket_fd, &message_size_raw,
(gdb) del 
Delete all breakpoints? (y or n) y
(gdb) c   <---- Pause for count of 7
Continuing.
...
Test passed
[Inferior 1 (process 143476) exited normally]
Actions #7

Updated by Brad Hubbard over 4 years ago

As suspected this is due to the length of time the 'bench' command takes to run and return a result to the socket.

This is the timeout we are hitting.

 48 static std::string asok_connect(const std::string &path, int *fd)
 49 {
...
 72   struct timeval timer;
 73   timer.tv_sec = 5;
 74   timer.tv_usec = 0;
 75   if (::setsockopt(socket_fd, SOL_SOCKET, SO_RCVTIMEO, &timer, sizeof(timer))) {

This has come to light since https://github.com/ceph/ceph/pull/30217/commits/8987f94416f453829eae6dda08837ef5a42531c6 moved the 'bench' command to the admin socket.

Setting the timer to 10 seconds allows the process to exit normally.

Actions #8

Updated by Brad Hubbard over 4 years ago

  • Pull request ID changed from 30859 to 31623
Actions #9

Updated by Nathan Cutler over 4 years ago

  • Backport set to nautilus, mimic
Actions #10

Updated by Nathan Cutler over 4 years ago

  • Status changed from New to Fix Under Review
Actions #11

Updated by Brad Hubbard over 4 years ago

  • Status changed from Fix Under Review to In Progress
Actions #12

Updated by Kefu Chai over 4 years ago

  • Status changed from In Progress to Pending Backport
Actions #13

Updated by Nathan Cutler over 4 years ago

  • Copied to Backport #42878: nautilus: ceph_test_admin_socket_output fails in rados qa suite added
Actions #14

Updated by Nathan Cutler over 4 years ago

  • Copied to Backport #42879: mimic: ceph_test_admin_socket_output fails in rados qa suite added
Actions #15

Updated by Nathan Cutler about 4 years ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".

Actions

Also available in: Atom PDF