Bug #24982 (closed)

mgr: terminate called after throwing an instance of 'std::out_of_range' in DaemonPerfCounters::update

Added by Iain Bucław almost 6 years ago. Updated about 5 years ago.

Status: Resolved
Priority: Urgent
Regression: No
Severity: 3 - minor
% Done: 0%

Description

Backtrace from logs:

2018-07-18 14:22:49.241346 7fc045459700 20 mgr.server handle_report updating existing DaemonState for rgw,bucket
2018-07-18 14:22:49.241349 7fc045459700 20 mgr update loading 0 new types, 0 old types, had 146 types, got 214 bytes of data
2018-07-18 14:22:49.242640 7fc045459700 -1 *** Caught signal (Aborted) **
 in thread 7fc045459700 thread_name:ms_dispatch

 ceph version 12.2.7 (3ec878d1e53e1aeb47a9f619c49d9e7c0aa384d5) luminous (stable)
 1: (()+0x40e744) [0x560798ff2744]
 2: (()+0x11390) [0x7fc053a13390]
 3: (gsignal()+0x38) [0x7fc0529a3428]
 4: (abort()+0x16a) [0x7fc0529a502a]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x16d) [0x7fc0532e684d]
 6: (()+0x8d6b6) [0x7fc0532e46b6]
 7: (()+0x8d701) [0x7fc0532e4701]
 8: (()+0x8d919) [0x7fc0532e4919]
 9: (std::__throw_out_of_range(char const*)+0x3f) [0x7fc05330d2cf]
 10: (DaemonPerfCounters::update(MMgrReport*)+0x197c) [0x560798e86dec]
 11: (DaemonServer::handle_report(MMgrReport*)+0x269) [0x560798e8f3d9]
 12: (DaemonServer::ms_dispatch(Message*)+0x47) [0x560798e9d5a7]
 13: (DispatchQueue::entry()+0xf4a) [0x56079934caba]
 14: (DispatchQueue::DispatchThread::entry()+0xd) [0x5607990edaed]
 15: (()+0x76ba) [0x7fc053a096ba]
 16: (clone()+0x6d) [0x7fc052a7541d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Backtrace from stdout:

terminate called after throwing an instance of 'std::out_of_range'
  what():  map::at
*** Caught signal (Aborted) **
 in thread 7fbb9de22700 thread_name:ms_dispatch
 ceph version 12.2.7 (3ec878d1e53e1aeb47a9f619c49d9e7c0aa384d5) luminous (stable)
 1: (()+0x40e744) [0x564115e80744]
 2: (()+0x11390) [0x7fbbac51b390]
 3: (gsignal()+0x38) [0x7fbbab4ab428]
 4: (abort()+0x16a) [0x7fbbab4ad02a]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x16d) [0x7fbbabdee84d]
 6: (()+0x8d6b6) [0x7fbbabdec6b6]
 7: (()+0x8d701) [0x7fbbabdec701]
 8: (()+0x8d919) [0x7fbbabdec919]
 9: (std::__throw_out_of_range(char const*)+0x3f) [0x7fbbabe152cf]
 10: (DaemonPerfCounters::update(MMgrReport*)+0x197c) [0x564115d14dec]
 11: (DaemonServer::handle_report(MMgrReport*)+0x269) [0x564115d1d3d9]
 12: (DaemonServer::ms_dispatch(Message*)+0x47) [0x564115d2b5a7]
 13: (DispatchQueue::entry()+0xf4a) [0x5641161daaba]
 14: (DispatchQueue::DispatchThread::entry()+0xd) [0x564115f7baed]
 15: (()+0x76ba) [0x7fbbac5116ba]
 16: (clone()+0x6d) [0x7fbbab57d41d]
(The same backtrace is repeated twice more in the log dump.)

Aborted

This patch introduces the use of `map::at`:

https://github.com/ceph/ceph/commit/1164ef2f32d81d4f35623c3f6a77af2b6871f962#diff-1d4ae230c3c43537437b704c5d05a40cR167

Notes from diagnosing the issue on IRC:

  • It would only be triggered when the perf counter being updated was never 'declared', and therefore never created, before the update arrived (see the sketch after this list).
  • The mgrs that fail must have got into a state where they think some of the perf counters being updated were never declared by the osds/rgw, while the other mgrs either did see those counters declared or receive no updates for them.
  • Mgrs only seem to crash when a perf counter update comes from radosgw.
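
A minimal standalone sketch of that failure mode (hypothetical names, not the actual mgr code): the counter-type map is only populated when a daemon declares its counters, so an update for an undeclared path makes std::map::at throw std::out_of_range, and with no handler on the dispatch path the process aborts, matching the "what(): map::at" seen above.

#include <iostream>
#include <map>
#include <stdexcept>
#include <string>

int main() {
  // Stand-in for the mgr's map of declared perf counter types; it is
  // only populated when a daemon sends its counter declarations.
  std::map<std::string, int> types;
  types.emplace("rgw.req", 1);

  // An update arrives for a path that was never declared.
  const std::string path = "rgw.qlen";
  try {
    int type = types.at(path);  // throws std::out_of_range ("map::at")
    std::cout << "counter type " << type << "\n";
  } catch (const std::out_of_range &e) {
    // The mgr has no such handler on its dispatch path, so the exception
    // escapes, std::terminate runs, and the daemon aborts.
    std::cout << "what(): " << e.what() << "\n";
  }
  return 0;
}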

Related issues: 1 (0 open, 1 closed)

Related to mgr - Bug #36244: mgr crash when handle_report updating existing DaemonState for rgw (Resolved, Mykola Golub, 09/28/2018)

Actions #1

Updated by Iain Bucław almost 6 years ago

It was suggested to set:

mgr_stats_threshold = 12

However, this only produces an error that the value is out of range, and the setting seems to be ignored.

With it set to 10, all mgrs still crash. I'm going to have to revert all mgr binaries back to 12.2.5.

Actions #2

Updated by Patrick Donnelly over 5 years ago

  • Project changed from Ceph to mgr
  • Subject changed from terminate called after throwing an instance of 'std::out_of_range' in DaemonPerfCounters::update to mgr: terminate called after throwing an instance of 'std::out_of_range' in DaemonPerfCounters::update
  • Priority changed from Normal to High
Actions #3

Updated by John Spray over 5 years ago

Iain: the way it rejects mgr_stats_threshold values that are too high is a bug: http://tracker.ceph.com/issues/25197

Actions #4

Updated by John Spray over 5 years ago

  • Assignee set to Boris Ranto

Boris: this looks like a regression, could you take a look please?

Actions #5

Updated by Boris Ranto over 5 years ago

I can take a look, but only in two weeks, since I am going on vacation tomorrow. If anyone else wants to take a look in the meantime, any sort of help is welcome.

Actions #6

Updated by Boris Ranto over 5 years ago

I took a quick look; a couple of notes:

While the patch did add the map::at call, we call map::at even before that to get the type for the path from the `types` map. That is likely where the exception occurs. Both `types` and `instances` are populated in the same step, so if one is defined then the other should be too. The only difference I can see between populating `types` and `instances` is that we use `std::make_pair` for `types` and `std::pair` for `instances`. AFAIK, both should be identical; in case they are not, I have pushed a `wip-mgr-make-pair` branch to ceph-ci so that it gets built and you can test it:

https://shaman.ceph.com/builds/ceph/wip-mgr-make-pair/e3c9afcc4b3d72d4603dc2f7241ca7895b6335a2/

You can choose your distro variant and then click through to the actual packages/repositories. The build is based on the latest upstream luminous branch (i.e. 12.2.7 plus a couple of patches).

If it does not help, then this probably is not a regression (unless we have also made changes to the way rgw reports its perf counters).

Anyway, how reproducible is this (always/once/a couple of times)? Does it help if you reboot the radosgw node that is making it fail?
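
For what it's worth, here is a quick self-contained check (my own example, not the ceph code) that the two insertion styles produce identical map contents:

#include <cassert>
#include <map>
#include <string>
#include <utility>

int main() {
  std::map<std::string, int> a, b;
  // The two styles mentioned above: make_pair vs. an explicit pair.
  a.insert(std::make_pair(std::string("rgw.req"), 1));
  b.insert(std::pair<std::string, int>("rgw.req", 1));
  assert(a == b);  // same entries either way
  return 0;
}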

Actions #7

Updated by Burkhard Linke over 5 years ago

We are also affected by this bug.

During the upgrade from 12.2.5 to 12.2.7, the mgr started to abort upon restart with the same stack trace as mentioned above.

The problem also persisted after all RGW nodes were updated to 12.2.7. We run three nodes with two instances each (for internal and external users) using haproxy and pacemaker. After I terminated all RGW instances except those running on one host, the mgrs stopped crashing.

The RGWs use the same ceph user credentials (one user for internal, one for external), so maybe this problem is related to this kind of HA setup?

We can reproduce the problem by starting a second instance on a different host if more/extended logs are needed.

Actions #8

Updated by Iain Bucław over 5 years ago

Boris Ranto wrote:

[...]

Anyway, how reproducible is this (always/once/a couple of times)? Does it help if you reboot the radosgw node that is making it fail?

It happens within the first five seconds of the mgr becoming "active".

$ sudo -u ceph /usr/bin/ceph-mgr -f --cluster ceph --id eu-262 --setuser ceph --setgroup ceph
ignoring --setuser ceph since I am not root
ignoring --setgroup ceph since I am not root
terminate called after throwing an instance of 'std::out_of_range'
  what():  map::at
*** Caught signal (Aborted) **
 in thread 7f6f4c228700 thread_name:ms_dispatch
 ceph version 12.2.7-92-ge3c9afc (e3c9afcc4b3d72d4603dc2f7241ca7895b6335a2) luminous (stable)
 1: (()+0x40f074) [0x563051e43074]
 2: (()+0x11390) [0x7f6f59fa2390]
 3: (gsignal()+0x38) [0x7f6f58f32428]
 4: (abort()+0x16a) [0x7f6f58f3402a]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x16d) [0x7f6f5987584d]
 6: (()+0x8d6b6) [0x7f6f598736b6]
 7: (()+0x8d701) [0x7f6f59873701]
 8: (()+0x8d919) [0x7f6f59873919]
 9: (std::__throw_out_of_range(char const*)+0x3f) [0x7f6f5989c2cf]
 10: (DaemonPerfCounters::update(MMgrReport*)+0x197c) [0x563051cd702c]
 11: (DaemonServer::handle_report(MMgrReport*)+0x269) [0x563051cdf619]
 12: (DaemonServer::ms_dispatch(Message*)+0x47) [0x563051ced7e7]
 13: (DispatchQueue::entry()+0xf4a) [0x56305219d4ea]
 14: (DispatchQueue::DispatchThread::entry()+0xd) [0x563051f3e50d]
 15: (()+0x76ba) [0x7f6f59f986ba]
 16: (clone()+0x6d) [0x7f6f5900441d]
(The same backtrace is repeated twice more in the log dump.)

Aborted

Actions #9

Updated by John Spray over 5 years ago

This isn't reproducing for me in a development environment built from the 12.2.7 tag.

Anything else you can tell me about how the RGW daemons are configured?

Actions #10

Updated by Burkhard Linke over 5 years ago

In our case, the RGW instances use the following setup on three hosts:

[client.radosgw.gateway-internal]
keyring = /etc/ceph/ceph.client.radosgw-internal.keyring
debug rgw = 0
rgw frontends = civetweb port=8080 num_threads=100
rgw print continue = false
rgw dns name = s3.internal.XYZ

rgw keystone admin user = radosgw
rgw keystone admin password = XYZ
rgw keystone token cache size = 10000
rgw keystone url = http://XYZ:5000
rgw keystone admin tenant = services
rgw keystone admin domain = Default
rgw keystone api version = 3
rgw s3 auth use keystone = true
rgw keystone accepted roles = Member, member, admin
rgw keystone revocation interval = 900

rgw num rados handles = 100

[client.radosgw.gateway]
keyring = /etc/ceph/ceph.client.radosgw.keyring
debug rgw = 0
rgw frontends = civetweb port=8081 num_threads=100
rgw print continue = false
rgw dns name = s3.XYZ

rgw keystone admin user = radosgw
rgw keystone admin password = XYZ
rgw keystone token cache size = 10000
rgw keystone url = http://XYZ:5000
rgw keystone admin tenant = services
rgw keystone admin domain = Default
rgw keystone api version = 3
rgw s3 auth use keystone = true
rgw keystone accepted roles = Member, member, admin
rgw keystone revocation interval = 900

rgw num rados handles = 100

haproxy setup (although this is probably not part of the problem):
global
ssl-default-bind-ciphers ECDH+AESGCM:DH+AESGCM:ECDH+AES256:DH+AES256:ECDH+AES128:DH+AES:ECDH+3DES:DH+3DES:RSA+AESGCM:RSA+AES:RSA+3DES:!aNULL:!MD5:!DSS
ssl-default-bind-options no-sslv3

defaults
log global
maxconn 8000
option redispatch
retries 3
stats enable
timeout http-request 10s
timeout queue 1m
timeout connect 10s
timeout client 1m
timeout server 1m
timeout check 10s

listen external
bind XYZ:443 ssl crt /etc/haproxy/s3.XYZ.pem
mode http
balance roundrobin
option tcplog
option http-keep-alive
server ceph-storage-07 XYZ1:8081 check
server ceph-storage-08 XYZ2:8081 check
server ceph-storage-09 XYZ3:8081 check

listen internal
bind XYZ:80
mode http
balance roundrobin
option tcplog
option http-keep-alive
stats enable
stats hide-version
stats refresh 30s
stats show-node
stats auth admin:XYZ
stats uri /ha-stats
server ceph-storage-07 XYZ1:8080 check
server ceph-storage-08 XYZ2:8080 check
server ceph-storage-09 XYZ3:8080 check

haproxy and RGW are colocated with OSDs on the same host, pacemaker (config not shown) manages the VIP setup and VIP failover.

Actions #11

Updated by Jens Harbott over 5 years ago

We are seeing the same issue after upgrading from 12.2.5 to 12.2.8. Similar scenario with three rgw nodes; running with just one rgw daemon active avoids the crash, but would lead to severe performance problems in our production setup. So please tag this as a regression and provide a fix. If you need more data to reproduce, I'm happy to help.

Maybe related: When running under 12.2.5, even while 3 rgw daemons are active, ceph -s still outputs:

  services:
    rgw: 1 daemon active
Actions #12

Updated by Iain Bucław over 5 years ago

The regression still persists in 12.2.8, so I've downgraded to the ceph-mgr 12.2.5 binaries... again.

     0> 2018-09-10 10:18:49.857757 7fe023eae700 -1 *** Caught signal (Aborted) **
 in thread 7fe023eae700 thread_name:ms_dispatch

 ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)
 1: (()+0x4105b4) [0x55db085df5b4]
 2: (()+0x11390) [0x7fe0325e7390]
 3: (gsignal()+0x38) [0x7fe031577428]
 4: (abort()+0x16a) [0x7fe03157902a]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x16d) [0x7fe031eba84d]
 6: (()+0x8d6b6) [0x7fe031eb86b6]
 7: (()+0x8d701) [0x7fe031eb8701]
 8: (()+0x8d919) [0x7fe031eb8919]
 9: (std::__throw_out_of_range(char const*)+0x3f) [0x7fe031ee12cf]
 10: (DaemonPerfCounters::update(MMgrReport*)+0x199c) [0x55db084733cc]
 11: (DaemonServer::handle_report(MMgrReport*)+0x269) [0x55db0847b9b9]
 12: (DaemonServer::ms_dispatch(Message*)+0x47) [0x55db08489b87]
 13: (DispatchQueue::entry()+0xf4a) [0x55db0893c4fa]
 14: (DispatchQueue::DispatchThread::entry()+0xd) [0x55db086dac7d]
 15: (()+0x76ba) [0x7fe0325dd6ba]
 16: (clone()+0x6d) [0x7fe03164941d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Actions #13

Updated by Iain Bucław over 5 years ago

Jens Harbott wrote:

[...]

Maybe related: When running under 12.2.5, even while 3 rgw daemons are active, ceph -s still outputs:

[...]

In the smallest region, there are 5 servers running rgw, handling 2 realms.

    rgw: 2 daemons active
Actions #14

Updated by Iain Bucław over 5 years ago

Iain Bucław wrote:

[...]

In the smallest region, there are 5 servers running rgw, handling 2 realms.

[...]

That is to say, I think the 'daemons active' figure reflects only the number of realms in the cluster, not the number of running instances (in my example above, there are 10 running instances).

Actions #15

Updated by Jens Harbott over 5 years ago

FYI: with the patch from http://tracker.ceph.com/issues/26838 applied to 12.2.8 and setting

mgr_stats_threshold = 11

(not 12 as mentioned above), the mgr daemons seem to be running fine now. I'm not sure, though, what the side effects of this setting may be.
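
In case it helps anyone else, this is the shape of the workaround in ceph.conf (assuming the usual [mgr] section; the mgr daemons need a restart for it to take effect):

[mgr]
# Workaround only: raises the minimum priority of perf counters the mgr
# collects, so the code path handling undeclared counters is not exercised.
mgr_stats_threshold = 11

As I understand it, 11 is one above the highest counter priority, so this effectively disables perf counter collection by the mgr.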

Actions #16

Updated by Dmitry Mishin about 5 years ago

Jens Harbott wrote:

I'm not sure, though, what the side effects of this setting may be.

Some metrics, like bandwidth and IOPS, stop working.

Is there any progress on this issue? I just updated to 13.2.4 and still see the same problem.

Actions #17

Updated by Boris Ranto about 5 years ago

I believe this should be fixed by this PR:

https://github.com/ceph/ceph/pull/25534

It is being backported to luminous and mimic. The luminous backport is already in and should be part of the next release, 12.2.11. I am not sure what the state of the mimic backport is.
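
I haven't gone through the PR diff here, but the defensive shape of a fix for this class of crash (a sketch only, with stand-in names; not the actual PR) is to replace the throwing lookup with a checked one:

#include <iostream>
#include <map>
#include <string>

struct PerfCounterType { int prio = 0; };  // stand-in, not the ceph type

void update_counter(const std::map<std::string, PerfCounterType> &types,
                    const std::string &path) {
  // Checked lookup: skip updates for counters that were never declared,
  // instead of letting std::map::at throw and take down the whole mgr.
  auto it = types.find(path);
  if (it == types.end()) {
    std::cout << "ignoring update for undeclared counter " << path << "\n";
    return;  // degrade gracefully rather than abort
  }
  std::cout << "updating counter with prio " << it->second.prio << "\n";
}

int main() {
  std::map<std::string, PerfCounterType> types{{"rgw.req", {5}}};
  update_counter(types, "rgw.req");   // declared: updated
  update_counter(types, "rgw.qlen");  // undeclared: skipped, no crash
  return 0;
}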

Actions #18

Updated by Lenz Grimmer about 5 years ago

  • Related to Bug #36244: mgr crash when handle_report updating existing DaemonState for rgw added
Actions #19

Updated by Sage Weil about 5 years ago

  • Priority changed from High to Urgent

/a/sage-2019-02-15_00:51:48-rados-wip-sage-testing-2019-02-14-1642-distro-basic-smithi/3591594

2019-02-15 05:14:41.904 7fb9dc242700  4 mgr.server handle_report from 0x560eb2606400 osd,1
2019-02-15 05:14:41.904 7fb9dc242700 20 mgr.server handle_report updating existing DaemonState for osd,1
2019-02-15 05:14:41.904 7fb9dc242700 20 mgr update loading 0 new types, 0 old types, had 110 types, got 782 bytes of data
2019-02-15 05:14:41.905 7fb9dc242700 -1 *** Caught signal (Segmentation fault) **
 in thread 7fb9dc242700 thread_name:ms_dispatch

 ceph version 14.0.1-3749-g2aae580 (2aae58097fd39ec4bff12ccfd1de93e28cef88fa) nautilus (dev)
 1: (()+0xf5d0) [0x7fb9fb7915d0]
 2: (DaemonPerfCounters::update(MMgrReport*)+0x37c) [0x560eada3e07c]
 3: (DaemonServer::handle_report(MMgrReport*)+0x3ab) [0x560eada107fb]
 4: (DaemonServer::ms_dispatch(Message*)+0x195) [0x560eada25f55]
 5: (Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x26) [0x560eada39e96]
 6: (DispatchQueue::entry()+0x11b9) [0x7fb9fe074bf9]
 7: (DispatchQueue::DispatchThread::entry()+0xd) [0x7fb9fe12343d]
 8: (()+0x7dd5) [0x7fb9fb789dd5]
 9: (clone()+0x6d) [0x7fb9fa439ead]
Actions #20

Updated by Sage Weil about 5 years ago

Never mind, the new failure is unrelated!

Actions #21

Updated by Boris Ranto about 5 years ago

  • Status changed from New to Resolved

As I mentioned above, I believe this was resolved by a patch for a different ticket. I'll close this. Feel free to re-open if you can hit this with luminous 12.2.11+ or current master.

Actions #22

Updated by Nathan Cutler about 5 years ago

Duplicate of #36244

Actions #23

Updated by Dmitry Mishin about 5 years ago

I still don't see the backport to mimic. Is there a ticket for it?

Actions #24

Updated by Boris Ranto about 5 years ago

It looks like the mimic backport for this was also merged (https://github.com/ceph/ceph/pull/25864), and it should be in the 13.2.5 release.

Actions #25

Updated by Nathan Cutler about 5 years ago

@Dmitry: For backports, look at the "Copied to" entries in #36244

(Hint: the mimic backport is #37826)

Actions #26

Updated by Dmitry Mishin about 5 years ago

Awesome, thanks!
