Bug #47179

AdminSocket::do_accept() terminate called after throwing an instance of 'std::out_of_range'

Added by Neha Ojha 8 months ago. Updated 7 months ago.

Status: Pending Backport
Priority: High
Assignee:
Target version: -
% Done: 0%
Source:
Tags:
Backport: octopus, nautilus
Regression: Yes
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2020-08-27T12:28:49.073 INFO:teuthology.orchestra.run.smithi072.stdout:Sending commands to client socket
2020-08-27T12:28:49.074 INFO:teuthology.orchestra.run.smithi072.stdout:Sending command "cache erase" 
2020-08-27T12:28:49.081 INFO:tasks.rgw.client.0.smithi072.stdout:terminate called after throwing an instance of 'std::out_of_range'
2020-08-27T12:28:49.081 INFO:tasks.rgw.client.0.smithi072.stdout:  what():  map::at
2020-08-27T12:28:49.081 INFO:tasks.rgw.client.0.smithi072.stdout:*** Caught signal (Aborted) **
2020-08-27T12:28:49.082 INFO:tasks.rgw.client.0.smithi072.stdout: in thread 7f006bfff700 thread_name:admin_socket
2020-08-27T12:28:49.082 INFO:tasks.rgw.client.0.smithi072.stdout: ceph version 16.0.0-4708-gfb44c2f (fb44c2f5009f8e2e18663251149c3bf6b06256a9) pacific (dev)
2020-08-27T12:28:49.082 INFO:tasks.rgw.client.0.smithi072.stdout: 1: (()+0x3efd0) [0x7f008d6bffd0]
2020-08-27T12:28:49.083 INFO:tasks.rgw.client.0.smithi072.stdout: 2: (gsignal()+0xc7) [0x7f008d6bff47]
2020-08-27T12:28:49.083 INFO:tasks.rgw.client.0.smithi072.stdout: 3: (abort()+0x141) [0x7f008d6c18b1]
2020-08-27T12:28:49.083 INFO:tasks.rgw.client.0.smithi072.stdout: 4: (()+0x8c957) [0x7f00827c8957]
2020-08-27T12:28:49.083 INFO:tasks.rgw.client.0.smithi072.stdout: 5: (()+0x92ae6) [0x7f00827ceae6]
2020-08-27T12:28:49.083 INFO:tasks.rgw.client.0.smithi072.stdout: 6: (()+0x91b49) [0x7f00827cdb49]
2020-08-27T12:28:49.084 INFO:tasks.rgw.client.0.smithi072.stdout: 7: (__gxx_personality_v0()+0x2a8) [0x7f00827ce4b8]
2020-08-27T12:28:49.084 INFO:tasks.rgw.client.0.smithi072.stdout: 8: (()+0x10573) [0x7f0082196573]
2020-08-27T12:28:49.084 INFO:tasks.rgw.client.0.smithi072.stdout: 9: (_Unwind_Resume()+0x125) [0x7f0082196df5]
2020-08-27T12:28:49.085 INFO:tasks.rgw.client.0.smithi072.stdout: 10: (AdminSocket::do_accept()+0xecc) [0x7f0083ff130c]
2020-08-27T12:28:49.085 INFO:tasks.rgw.client.0.smithi072.stdout: 11: (AdminSocket::entry()+0x328) [0x7f0083ff1868]
2020-08-27T12:28:49.085 INFO:tasks.rgw.client.0.smithi072.stdout: 12: (()+0xbd6df) [0x7f00827f96df]
2020-08-27T12:28:49.085 INFO:tasks.rgw.client.0.smithi072.stdout: 13: (()+0x76db) [0x7f0082f976db]
2020-08-27T12:28:49.085 INFO:tasks.rgw.client.0.smithi072.stdout: 14: (clone()+0x3f) [0x7f008d7a2a3f]

rados/singleton-nomsgr/{all/admin_socket_output rados supported-random-distro$/{ubuntu_latest}}

/a/teuthology-2020-08-26_07:01:02-rados-master-distro-basic-smithi/5377107
/a/kchai-2020-08-27_10:01:54-rados-wip-kefu-testing-2020-08-27-1255-distro-basic-smithi/5379955


Related issues

Related to CephFS - Bug #46985: common: validate type CephBool cause 'invalid command json' Pending Backport
Copied to rgw - Backport #47465: nautilus: AdminSocket::do_accept() terminate called after throwing an instance of 'std::out_of_range' Need More Info
Copied to rgw - Backport #47466: octopus: AdminSocket::do_accept() terminate called after throwing an instance of 'std::out_of_range' Resolved

History

#1 Updated by Brad Hubbard 8 months ago

  • Assignee set to Brad Hubbard

#2 Updated by Brad Hubbard 8 months ago

Ah, this stack is coming from an instance of radosgw. I had wondered about the call to clone, since ceph_test_admin_socket_output is single-threaded.

The stack from ceph_test_admin_socket_output is further down and looks like this.

2020-08-27T12:28:49.199 INFO:teuthology.orchestra.run.smithi072.stderr:run_command AdminSocketClient::do_request errored with: safe_read(3) failed to read message size: (33) Numerical argument out of domain
2020-08-27T12:28:49.199 INFO:teuthology.orchestra.run.smithi072.stderr:/build/ceph-16.0.0-4708-gfb44c2f/src/test/admin_socket_output.cc: In function 'std::pair<std::__cxx11::basic_string<char>, std::__cxx11::basic_string<char> > AdminSocketOutput::run_command(AdminSocketClient&, const string&, bool)' thread 7fd5cd138c00 time 2020-08-27T12:28:49.201692+0000
2020-08-27T12:28:49.200 INFO:teuthology.orchestra.run.smithi072.stderr:/build/ceph-16.0.0-4708-gfb44c2f/src/test/admin_socket_output.cc: 119: ceph_abort_msg("abort() called")
2020-08-27T12:28:49.200 INFO:teuthology.orchestra.run.smithi072.stderr: ceph version 16.0.0-4708-gfb44c2f (fb44c2f5009f8e2e18663251149c3bf6b06256a9) pacific (dev)
2020-08-27T12:28:49.200 INFO:teuthology.orchestra.run.smithi072.stderr: 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xe1) [0x7fd5c42ec477]
2020-08-27T12:28:49.200 INFO:teuthology.orchestra.run.smithi072.stderr: 2: (AdminSocketOutput::run_command(AdminSocketClient&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool)+0x3a0) [0x5612e7362a10]
2020-08-27T12:28:49.200 INFO:teuthology.orchestra.run.smithi072.stderr: 3: (AdminSocketOutput::gather_socket_output()+0x3fa) [0x5612e736494a]
2020-08-27T12:28:49.201 INFO:teuthology.orchestra.run.smithi072.stderr: 4: (AdminSocketOutput::exec()+0x88) [0x5612e7364fd8]
2020-08-27T12:28:49.201 INFO:teuthology.orchestra.run.smithi072.stderr: 5: (main()+0x833) [0x5612e735cd93]
2020-08-27T12:28:49.201 INFO:teuthology.orchestra.run.smithi072.stderr: 6: (__libc_start_main()+0xe7) [0x7fd5c34f8b97]
2020-08-27T12:28:49.201 INFO:teuthology.orchestra.run.smithi072.stderr: 7: (_start()+0x2a) [0x5612e735d6ba]

Looking at the rgw log, it looks like this.

   -23> 2020-08-27T12:28:48.975+0000 7f00627fc700 10 monclient: tick
   -22> 2020-08-27T12:28:48.987+0000 7f0063fff700 10 monclient: _renew_subs
   -21> 2020-08-27T12:28:48.987+0000 7f0063fff700 10 monclient: _send_mon_message to mon.a at v2:172.21.15.72:3300/0
   -20> 2020-08-27T12:28:48.991+0000 7eff6effd700  1 ====== starting new request req=0x7eff6eff5040 =====
   -19> 2020-08-27T12:28:48.991+0000 7eff6effd700  2 req 1 0s initializing for trans_id = tx000000000000000000001-005f47a700-1098-default
   -18> 2020-08-27T12:28:48.995+0000 7eff6effd700  2 req 1 0.004000000s getting op 0
   -17> 2020-08-27T12:28:48.995+0000 7eff6effd700  2 req 1 0.004000000s s3:list_buckets verifying requester
   -16> 2020-08-27T12:28:48.995+0000 7eff6effd700  2 req 1 0.004000000s s3:list_buckets normalizing buckets and tenants
   -15> 2020-08-27T12:28:48.995+0000 7eff6effd700  2 req 1 0.004000000s s3:list_buckets init permissions
   -14> 2020-08-27T12:28:48.995+0000 7eff6effd700  2 req 1 0.004000000s s3:list_buckets recalculating target
   -13> 2020-08-27T12:28:48.995+0000 7eff6effd700  2 req 1 0.004000000s s3:list_buckets reading permissions
   -12> 2020-08-27T12:28:48.995+0000 7eff6effd700  2 req 1 0.004000000s s3:list_buckets init op
   -11> 2020-08-27T12:28:48.995+0000 7eff6effd700  2 req 1 0.004000000s s3:list_buckets verifying op mask
   -10> 2020-08-27T12:28:48.995+0000 7eff6effd700  2 req 1 0.004000000s s3:list_buckets verifying op permissions
    -9> 2020-08-27T12:28:48.995+0000 7eff6effd700  2 req 1 0.004000000s s3:list_buckets verifying op params
    -8> 2020-08-27T12:28:48.995+0000 7eff6effd700  2 req 1 0.004000000s s3:list_buckets pre-executing
    -7> 2020-08-27T12:28:48.995+0000 7eff6effd700  2 req 1 0.004000000s s3:list_buckets executing
    -6> 2020-08-27T12:28:48.995+0000 7eff6effd700  2 req 1 0.004000000s s3:list_buckets completing
    -5> 2020-08-27T12:28:48.995+0000 7eff6effd700  5 nothing to log for operation
    -4> 2020-08-27T12:28:48.995+0000 7eff6effd700  2 req 1 0.004000000s s3:list_buckets op status=0
    -3> 2020-08-27T12:28:48.995+0000 7eff6effd700  2 req 1 0.004000000s s3:list_buckets http status=200
    -2> 2020-08-27T12:28:48.995+0000 7eff6effd700  1 ====== req done req=0x7eff6eff5040 op status=0 http_status=200 latency=0.004000000s ======
    -1> 2020-08-27T12:28:48.995+0000 7eff6effd700  1 civetweb: 0x7f007829a010: 172.21.15.72 - - [27/Aug/2020:12:28:48 +0000] "GET / HTTP/1.1" 200 413 - curl/7.58.0
     0> 2020-08-27T12:28:49.079+0000 7f006bfff700 -1 *** Caught signal (Aborted) **
 in thread 7f006bfff700 thread_name:admin_socket

 ceph version 16.0.0-4708-gfb44c2f (fb44c2f5009f8e2e18663251149c3bf6b06256a9) pacific (dev)
 1: (()+0x3efd0) [0x7f008d6bffd0]
 2: (gsignal()+0xc7) [0x7f008d6bff47]
 3: (abort()+0x141) [0x7f008d6c18b1]
 4: (()+0x8c957) [0x7f00827c8957]
 5: (()+0x92ae6) [0x7f00827ceae6]
 6: (()+0x91b49) [0x7f00827cdb49]
 7: (__gxx_personality_v0()+0x2a8) [0x7f00827ce4b8]
 8: (()+0x10573) [0x7f0082196573]
 9: (_Unwind_Resume()+0x125) [0x7f0082196df5]
 10: (AdminSocket::do_accept()+0xecc) [0x7f0083ff130c]
 11: (AdminSocket::entry()+0x328) [0x7f0083ff1868]
 12: (()+0xbd6df) [0x7f00827f96df]
 13: (()+0x76db) [0x7f0082f976db]
 14: (clone()+0x3f) [0x7f008d7a2a3f]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

So ceph_test_admin_socket_output sent the first admin socket command, "cache erase", to the radosgw daemon, and that killed the daemon. ceph_test_admin_socket_output then aborted on the error from 'safe_read', because the socket had been disconnected. In other words, ceph_test_admin_socket_output was doing its job.

The crash seems to be specific to radosgw (although it is in 'common' code), and I can reproduce it locally with that daemon but not with the others. Will investigate further.

#3 Updated by Brad Hubbard 7 months ago

  • Project changed from RADOS to rgw

#4 Updated by Brad Hubbard 7 months ago

  • Pull request ID set to 36938

This is happening because the handlers for "cache erase" and "cache inspect" do not check that a 'target' argument is present in the command before trying to use it. I have submitted a patch.

#5 Updated by Brad Hubbard 7 months ago

  • Status changed from New to Fix Under Review

#6 Updated by Brad Hubbard 7 months ago

  • Related to Bug #46985: common: validate type CephBool cause 'invalid command json' added

#8 Updated by Kefu Chai 7 months ago

  • Pull request ID changed from 36938 to 37098

#9 Updated by Kefu Chai 7 months ago

  • Backport set to octopus, nautilus
  • Regression changed from No to Yes

#10 Updated by Kefu Chai 7 months ago

  • Status changed from Fix Under Review to Pending Backport

#11 Updated by Kefu Chai 7 months ago

Please assign the backport to me if the backport script does not work for the related PR.

#12 Updated by Nathan Cutler 7 months ago

  • Copied to Backport #47465: nautilus: AdminSocket::do_accept() terminate called after throwing an instance of 'std::out_of_range' added

#13 Updated by Nathan Cutler 7 months ago

  • Copied to Backport #47466: octopus: AdminSocket::do_accept() terminate called after throwing an instance of 'std::out_of_range' added
