
Bug #24982

mgr: terminate called after throwing an instance of 'std::out_of_range' in DaemonPerfCounters::update

Added by Iain Bucław 8 months ago. Updated about 1 month ago.

Status: Resolved
Priority: Urgent
Start date: 07/18/2018
% Done: 0%
Regression: No
Severity: 3 - minor

Description

Backtrace from logs:

2018-07-18 14:22:49.241346 7fc045459700 20 mgr.server handle_report updating existing DaemonState for rgw,bucket
2018-07-18 14:22:49.241349 7fc045459700 20 mgr update loading 0 new types, 0 old types, had 146 types, got 214 bytes of data
2018-07-18 14:22:49.242640 7fc045459700 -1 *** Caught signal (Aborted) **
 in thread 7fc045459700 thread_name:ms_dispatch

 ceph version 12.2.7 (3ec878d1e53e1aeb47a9f619c49d9e7c0aa384d5) luminous (stable)
 1: (()+0x40e744) [0x560798ff2744]
 2: (()+0x11390) [0x7fc053a13390]
 3: (gsignal()+0x38) [0x7fc0529a3428]
 4: (abort()+0x16a) [0x7fc0529a502a]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x16d) [0x7fc0532e684d]
 6: (()+0x8d6b6) [0x7fc0532e46b6]
 7: (()+0x8d701) [0x7fc0532e4701]
 8: (()+0x8d919) [0x7fc0532e4919]
 9: (std::__throw_out_of_range(char const*)+0x3f) [0x7fc05330d2cf]
 10: (DaemonPerfCounters::update(MMgrReport*)+0x197c) [0x560798e86dec]
 11: (DaemonServer::handle_report(MMgrReport*)+0x269) [0x560798e8f3d9]
 12: (DaemonServer::ms_dispatch(Message*)+0x47) [0x560798e9d5a7]
 13: (DispatchQueue::entry()+0xf4a) [0x56079934caba]
 14: (DispatchQueue::DispatchThread::entry()+0xd) [0x5607990edaed]
 15: (()+0x76ba) [0x7fc053a096ba]
 16: (clone()+0x6d) [0x7fc052a7541d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Backtrace from stdout:

terminate called after throwing an instance of 'std::out_of_range'
  what():  map::at
*** Caught signal (Aborted) **
 in thread 7fbb9de22700 thread_name:ms_dispatch
 ceph version 12.2.7 (3ec878d1e53e1aeb47a9f619c49d9e7c0aa384d5) luminous (stable)
 1: (()+0x40e744) [0x564115e80744]
 2: (()+0x11390) [0x7fbbac51b390]
 3: (gsignal()+0x38) [0x7fbbab4ab428]
 4: (abort()+0x16a) [0x7fbbab4ad02a]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x16d) [0x7fbbabdee84d]
 6: (()+0x8d6b6) [0x7fbbabdec6b6]
 7: (()+0x8d701) [0x7fbbabdec701]
 8: (()+0x8d919) [0x7fbbabdec919]
 9: (std::__throw_out_of_range(char const*)+0x3f) [0x7fbbabe152cf]
 10: (DaemonPerfCounters::update(MMgrReport*)+0x197c) [0x564115d14dec]
 11: (DaemonServer::handle_report(MMgrReport*)+0x269) [0x564115d1d3d9]
 12: (DaemonServer::ms_dispatch(Message*)+0x47) [0x564115d2b5a7]
 13: (DispatchQueue::entry()+0xf4a) [0x5641161daaba]
 14: (DispatchQueue::DispatchThread::entry()+0xd) [0x564115f7baed]
 15: (()+0x76ba) [0x7fbbac5116ba]
 16: (clone()+0x6d) [0x7fbbab57d41d]
2018-07-18 15:37:51.827425 7fbb9de22700 -1 *** Caught signal (Aborted) **
 in thread 7fbb9de22700 thread_name:ms_dispatch

 ceph version 12.2.7 (3ec878d1e53e1aeb47a9f619c49d9e7c0aa384d5) luminous (stable)
 1: (()+0x40e744) [0x564115e80744]
 2: (()+0x11390) [0x7fbbac51b390]
 3: (gsignal()+0x38) [0x7fbbab4ab428]
 4: (abort()+0x16a) [0x7fbbab4ad02a]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x16d) [0x7fbbabdee84d]
 6: (()+0x8d6b6) [0x7fbbabdec6b6]
 7: (()+0x8d701) [0x7fbbabdec701]
 8: (()+0x8d919) [0x7fbbabdec919]
 9: (std::__throw_out_of_range(char const*)+0x3f) [0x7fbbabe152cf]
 10: (DaemonPerfCounters::update(MMgrReport*)+0x197c) [0x564115d14dec]
 11: (DaemonServer::handle_report(MMgrReport*)+0x269) [0x564115d1d3d9]
 12: (DaemonServer::ms_dispatch(Message*)+0x47) [0x564115d2b5a7]
 13: (DispatchQueue::entry()+0xf4a) [0x5641161daaba]
 14: (DispatchQueue::DispatchThread::entry()+0xd) [0x564115f7baed]
 15: (()+0x76ba) [0x7fbbac5116ba]
 16: (clone()+0x6d) [0x7fbbab57d41d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

     0> 2018-07-18 15:37:51.827425 7fbb9de22700 -1 *** Caught signal (Aborted) **
 in thread 7fbb9de22700 thread_name:ms_dispatch

 ceph version 12.2.7 (3ec878d1e53e1aeb47a9f619c49d9e7c0aa384d5) luminous (stable)
 1: (()+0x40e744) [0x564115e80744]
 2: (()+0x11390) [0x7fbbac51b390]
 3: (gsignal()+0x38) [0x7fbbab4ab428]
 4: (abort()+0x16a) [0x7fbbab4ad02a]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x16d) [0x7fbbabdee84d]
 6: (()+0x8d6b6) [0x7fbbabdec6b6]
 7: (()+0x8d701) [0x7fbbabdec701]
 8: (()+0x8d919) [0x7fbbabdec919]
 9: (std::__throw_out_of_range(char const*)+0x3f) [0x7fbbabe152cf]
 10: (DaemonPerfCounters::update(MMgrReport*)+0x197c) [0x564115d14dec]
 11: (DaemonServer::handle_report(MMgrReport*)+0x269) [0x564115d1d3d9]
 12: (DaemonServer::ms_dispatch(Message*)+0x47) [0x564115d2b5a7]
 13: (DispatchQueue::entry()+0xf4a) [0x5641161daaba]
 14: (DispatchQueue::DispatchThread::entry()+0xd) [0x564115f7baed]
 15: (()+0x76ba) [0x7fbbac5116ba]
 16: (clone()+0x6d) [0x7fbbab57d41d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Aborted

This patch introduces the use of `map::at`:

https://github.com/ceph/ceph/commit/1164ef2f32d81d4f35623c3f6a77af2b6871f962#diff-1d4ae230c3c43537437b704c5d05a40cR167
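For context, a minimal standalone illustration (not Ceph code) of the C++ behaviour behind the abort: `std::map::at` throws `std::out_of_range` for a missing key, and since nothing on the ms_dispatch thread catches it, `std::terminate` brings the whole daemon down with the `what(): map::at` message seen above.

#include <iostream>
#include <map>
#include <stdexcept>
#include <string>

int main() {
  // Stand-in for the mgr's per-daemon counter maps; names are illustrative.
  std::map<std::string, int> types;
  types["declared.counter"] = 1;

  // operator[] would silently default-construct a missing entry, but
  // map::at throws std::out_of_range -- left uncaught on a dispatch
  // thread, this produces "terminate called after throwing ...".
  try {
    int t = types.at("undeclared.counter");
    (void)t;
  } catch (const std::out_of_range &e) {
    std::cout << "what(): " << e.what() << "\n";  // libstdc++ prints "map::at"
  }
  return 0;
}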

Notes on diagnosing the issue on IRC:

  • It would only be triggered when the perf counter being updated was not 'declared', and thus created, before being updated.
  • The mgrs that fail must have got into a state where the mgr thinks some of the perf counters being updated were never declared by the osds/rgw, while the other mgrs either did see those counters declared or have no updates for them.
  • Mgrs seem to crash only when a perf counter update comes from radosgw (a toy model of the guarded lookup is sketched after this list).
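The diagnosis suggests a defensive fix on the mgr side: skip updates for counter paths that were never declared instead of letting `map::at` throw. The sketch below is a self-contained toy model of that idea, not the actual Ceph code or the fix that eventually landed; the names (`ToyPerfCounters`, `declare`, `update`) are illustrative only.

#include <cstdint>
#include <iostream>
#include <map>
#include <string>

// Toy model of per-daemon counter state: updates for paths that were
// never declared are dropped instead of aborting the process.
struct ToyPerfCounters {
  std::map<std::string, uint64_t> instances;  // declared counters only

  void declare(const std::string &path) { instances[path]; }

  void update(const std::string &path, uint64_t v) {
    auto i = instances.find(path);
    if (i == instances.end()) {
      // Undeclared counter, e.g. one the daemon declared to a previous
      // active mgr: ignore it rather than calling instances.at(path).
      std::cerr << "ignoring undeclared counter " << path << "\n";
      return;
    }
    i->second = v;
  }
};

int main() {
  ToyPerfCounters c;
  c.declare("rgw.req");
  c.update("rgw.req", 42);        // fine
  c.update("rgw.failed_req", 1);  // would have thrown with map::at
}

With `instances.at(path)` in place of the `find()`, the second update() call would abort the process exactly as in the backtraces above.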

Related issues

Related to mgr - Bug #36244: mgr crash when handle_report updating existing DaemonState for rgw (Resolved, 09/28/2018)

History

#1 Updated by Iain Bucław 8 months ago

It was suggested to set:

mgr_stats_threshold = 12

However, this only produces an error saying the value is out of range, and the setting appears to be ignored.

Setting it to 10 does not help either; all mgrs still crash. I'm going to have to revert all mgr binaries back to 12.2.5.
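For anyone trying this suggestion, a minimal ceph.conf sketch of where the option goes (assuming it is set on the mgr side; per comments #3 and #15 below, accepted values top out at 11, which is why 12 is rejected):

[mgr]
# Counters with priority below the threshold are not reported to the mgr.
# Values above 11 are rejected (see #25197), hence the out-of-range error.
mgr_stats_threshold = 10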

#2 Updated by Patrick Donnelly 8 months ago

  • Project changed from Ceph to mgr
  • Subject changed from terminate called after throwing an instance of 'std::out_of_range' in DaemonPerfCounters::update to mgr: terminate called after throwing an instance of 'std::out_of_range' in DaemonPerfCounters::update
  • Priority changed from Normal to High

#3 Updated by John Spray 8 months ago

Iain: the way mgr_stats_threshold is rejected when set too high is itself a bug: http://tracker.ceph.com/issues/25197

#4 Updated by John Spray 8 months ago

  • Assignee set to Boris Ranto

Boris: this looks like a regression, could you take a look please?

#5 Updated by Boris Ranto 8 months ago

I can take a look, but only in two weeks, as I am going on vacation tomorrow. If anyone else wants to take a look in the meantime, any help is welcome.

#6 Updated by Boris Ranto 8 months ago

I took a quick look; a couple of notes:

While the patch did add the `map::at` call, we call `map::at` even before that, to get the type for the path from the `types` map; that is likely where the exception occurs. Both `types` and `instances` are populated in the same step, so if one is defined then the other should be too. The only difference I can see between how `types` and `instances` are populated is that we use `std::make_pair` for `types` and `std::pair` for `instances`. AFAIK the two should be identical, but in case they are not, I have pushed a `wip-mgr-make-pair` branch to ceph-ci so that it gets built and you can test it:

https://shaman.ceph.com/builds/ceph/wip-mgr-make-pair/e3c9afcc4b3d72d4603dc2f7241ca7895b6335a2/

You can choose your distro variant there and click through to the actual packages/repositories. The build is based on the latest upstream luminous branch (i.e. 12.2.7 plus a couple of patches).
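For what it's worth, a standalone illustration (not Ceph code) of the equivalence described above: inserting with `std::make_pair` and with an explicit `std::pair` temporary leaves the two maps in identical states, so the difference in how `types` and `instances` are populated should be purely cosmetic.

#include <cassert>
#include <map>
#include <string>
#include <utility>

int main() {
  std::map<std::string, int> types, instances;

  // The two insertion styles being contrasted; both build the same pair.
  types.insert(std::make_pair(std::string("rgw.req"), 1));
  instances.insert(std::pair<std::string, int>("rgw.req", 1));

  // Identical observable state: same keys, same values.
  assert(types == instances);
  return 0;
}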

If it does not help, then this probably is not a regression (unless we have also made changes to the way rgw reports its perf counters).

Anyway, how reproducible is this (always/once/couple of times)? Will it help if you reboot the radosgw node that is making it fail?

#7 Updated by Burkhard Linke 8 months ago

We are also affected by this bug.

During the upgrade from 12.2.5 to 12.2.7, the mgr started aborting on restart with the same stack trace as above.

The problem also persisted after all RGW nodes were updated to 12.2.7. We run three nodes with two instances each (for internal and external users) behind haproxy and pacemaker. After I terminated all RGW instances except those running on one host, the mgrs stopped crashing.

The RGWs use the same ceph user credentials (one user for internal, one user for external), so maybe this problem is related to this kind of HA setup?

We can reproduce the problem by starting a second instance on a different host if more/extended logs are needed.

#8 Updated by Iain Bucław 8 months ago

Boris Ranto wrote:

[...]

Anyway, how reproducible is this (always/once/couple of times)? Will it help if you reboot the radosgw node that is making it fail?

It happens within the first five seconds of the mgr becoming "active".

$ sudo -u ceph /usr/bin/ceph-mgr -f --cluster ceph --id eu-262 --setuser ceph --setgroup ceph
ignoring --setuser ceph since I am not root
ignoring --setgroup ceph since I am not root
terminate called after throwing an instance of 'std::out_of_range'
  what():  map::at
*** Caught signal (Aborted) **
 in thread 7f6f4c228700 thread_name:ms_dispatch
 ceph version 12.2.7-92-ge3c9afc (e3c9afcc4b3d72d4603dc2f7241ca7895b6335a2) luminous (stable)
 1: (()+0x40f074) [0x563051e43074]
 2: (()+0x11390) [0x7f6f59fa2390]
 3: (gsignal()+0x38) [0x7f6f58f32428]
 4: (abort()+0x16a) [0x7f6f58f3402a]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x16d) [0x7f6f5987584d]
 6: (()+0x8d6b6) [0x7f6f598736b6]
 7: (()+0x8d701) [0x7f6f59873701]
 8: (()+0x8d919) [0x7f6f59873919]
 9: (std::__throw_out_of_range(char const*)+0x3f) [0x7f6f5989c2cf]
 10: (DaemonPerfCounters::update(MMgrReport*)+0x197c) [0x563051cd702c]
 11: (DaemonServer::handle_report(MMgrReport*)+0x269) [0x563051cdf619]
 12: (DaemonServer::ms_dispatch(Message*)+0x47) [0x563051ced7e7]
 13: (DispatchQueue::entry()+0xf4a) [0x56305219d4ea]
 14: (DispatchQueue::DispatchThread::entry()+0xd) [0x563051f3e50d]
 15: (()+0x76ba) [0x7f6f59f986ba]
 16: (clone()+0x6d) [0x7f6f5900441d]
2018-08-01 17:53:13.594981 7f6f4c228700 -1 *** Caught signal (Aborted) **
 in thread 7f6f4c228700 thread_name:ms_dispatch

 ceph version 12.2.7-92-ge3c9afc (e3c9afcc4b3d72d4603dc2f7241ca7895b6335a2) luminous (stable)
 1: (()+0x40f074) [0x563051e43074]
 2: (()+0x11390) [0x7f6f59fa2390]
 3: (gsignal()+0x38) [0x7f6f58f32428]
 4: (abort()+0x16a) [0x7f6f58f3402a]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x16d) [0x7f6f5987584d]
 6: (()+0x8d6b6) [0x7f6f598736b6]
 7: (()+0x8d701) [0x7f6f59873701]
 8: (()+0x8d919) [0x7f6f59873919]
 9: (std::__throw_out_of_range(char const*)+0x3f) [0x7f6f5989c2cf]
 10: (DaemonPerfCounters::update(MMgrReport*)+0x197c) [0x563051cd702c]
 11: (DaemonServer::handle_report(MMgrReport*)+0x269) [0x563051cdf619]
 12: (DaemonServer::ms_dispatch(Message*)+0x47) [0x563051ced7e7]
 13: (DispatchQueue::entry()+0xf4a) [0x56305219d4ea]
 14: (DispatchQueue::DispatchThread::entry()+0xd) [0x563051f3e50d]
 15: (()+0x76ba) [0x7f6f59f986ba]
 16: (clone()+0x6d) [0x7f6f5900441d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

     0> 2018-08-01 17:53:13.594981 7f6f4c228700 -1 *** Caught signal (Aborted) **
 in thread 7f6f4c228700 thread_name:ms_dispatch

 ceph version 12.2.7-92-ge3c9afc (e3c9afcc4b3d72d4603dc2f7241ca7895b6335a2) luminous (stable)
 1: (()+0x40f074) [0x563051e43074]
 2: (()+0x11390) [0x7f6f59fa2390]
 3: (gsignal()+0x38) [0x7f6f58f32428]
 4: (abort()+0x16a) [0x7f6f58f3402a]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x16d) [0x7f6f5987584d]
 6: (()+0x8d6b6) [0x7f6f598736b6]
 7: (()+0x8d701) [0x7f6f59873701]
 8: (()+0x8d919) [0x7f6f59873919]
 9: (std::__throw_out_of_range(char const*)+0x3f) [0x7f6f5989c2cf]
 10: (DaemonPerfCounters::update(MMgrReport*)+0x197c) [0x563051cd702c]
 11: (DaemonServer::handle_report(MMgrReport*)+0x269) [0x563051cdf619]
 12: (DaemonServer::ms_dispatch(Message*)+0x47) [0x563051ced7e7]
 13: (DispatchQueue::entry()+0xf4a) [0x56305219d4ea]
 14: (DispatchQueue::DispatchThread::entry()+0xd) [0x563051f3e50d]
 15: (()+0x76ba) [0x7f6f59f986ba]
 16: (clone()+0x6d) [0x7f6f5900441d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Aborted

#9 Updated by John Spray 8 months ago

This isn't reproducing for me in a development environment built from the 12.2.7 tag.

Anything else you can tell me about how the RGW daemons are configured?

#10 Updated by Burkhard Linke 8 months ago

In our case, the RGW instances use the following setup on three hosts:

[client.radosgw.gateway-internal]
keyring = /etc/ceph/ceph.client.radosgw-internal.keyring
debug rgw = 0
rgw frontends = civetweb port=8080 num_threads=100
rgw print continue = false
rgw dns name = s3.internal.XYZ

rgw keystone admin user = radosgw
rgw keystone admin password = XYZ
rgw keystone token cache size = 10000
rgw keystone url = http://XYZ:5000
rgw keystone admin tenant = services
rgw keystone admin domain = Default
rgw keystone api version = 3
rgw s3 auth use keystone = true
rgw keystone accepted roles = Member, member, admin
rgw keystone revocation interval = 900

rgw num rados handles = 100

[client.radosgw.gateway]
keyring = /etc/ceph/ceph.client.radosgw.keyring
debug rgw = 0
rgw frontends = civetweb port=8081 num_threads=100
rgw print continue = false
rgw dns name = s3.XYZ

rgw keystone admin user = radosgw
rgw keystone admin password = XYZ
rgw keystone token cache size = 10000
rgw keystone url = http://XYZ:5000
rgw keystone admin tenant = services
rgw keystone admin domain = Default
rgw keystone api version = 3
rgw s3 auth use keystone = true
rgw keystone accepted roles = Member, member, admin
rgw keystone revocation interval = 900

rgw num rados handles = 100

haproxy setup (although this is probably not part of the problem):
global
ssl-default-bind-ciphers ECDH+AESGCM:DH+AESGCM:ECDH+AES256:DH+AES256:ECDH+AES128:DH+AES:ECDH+3DES:DH+3DES:RSA+AESGCM:RSA+AES:RSA+3DES:!aNULL:!MD5:!DSS
ssl-default-bind-options no-sslv3

defaults
log global
maxconn 8000
option redispatch
retries 3
stats enable
timeout http-request 10s
timeout queue 1m
timeout connect 10s
timeout client 1m
timeout server 1m
timeout check 10s

listen external
bind XYZ:443 ssl crt /etc/haproxy/s3.XYZ.pem
mode http
balance roundrobin
option tcplog
option http-keep-alive
server ceph-storage-07 XYZ1:8081 check
server ceph-storage-08 XYZ2:8081 check
server ceph-storage-09 XYZ3:8081 check

listen internal
bind XYZ:80
mode http
balance roundrobin
option tcplog
option http-keep-alive
stats enable
stats hide-version
stats refresh 30s
stats show-node
stats auth admin:XYZ
stats uri /ha-stats
server ceph-storage-07 XYZ1:8080 check
server ceph-storage-08 XYZ2:8080 check
server ceph-storage-09 XYZ3:8080 check

haproxy and RGW are colocated with OSDs on the same hosts; pacemaker (config not shown) manages the VIP setup and failover.

#11 Updated by Jens Harbott 6 months ago

We are seeing the same issue after upgrading from 12.2.5 to 12.2.8. Similar scenario with three rgw nodes: running with just one rgw daemon active avoids the crash, but would lead to severe performance problems in our production setup. So please tag this as a regression and provide a fix. If you need more data to reproduce, I'm happy to help.

Maybe related: When running under 12.2.5, even while 3 rgw daemons are active, ceph -s still outputs:

  services:
    rgw: 1 daemon active

#12 Updated by Iain Bucław 6 months ago

The regression still persists in 12.2.8; I have downgraded to the ceph-mgr 12.2.5 binaries... again.

     0> 2018-09-10 10:18:49.857757 7fe023eae700 -1 *** Caught signal (Aborted) **
 in thread 7fe023eae700 thread_name:ms_dispatch

 ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)
 1: (()+0x4105b4) [0x55db085df5b4]
 2: (()+0x11390) [0x7fe0325e7390]
 3: (gsignal()+0x38) [0x7fe031577428]
 4: (abort()+0x16a) [0x7fe03157902a]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x16d) [0x7fe031eba84d]
 6: (()+0x8d6b6) [0x7fe031eb86b6]
 7: (()+0x8d701) [0x7fe031eb8701]
 8: (()+0x8d919) [0x7fe031eb8919]
 9: (std::__throw_out_of_range(char const*)+0x3f) [0x7fe031ee12cf]
 10: (DaemonPerfCounters::update(MMgrReport*)+0x199c) [0x55db084733cc]
 11: (DaemonServer::handle_report(MMgrReport*)+0x269) [0x55db0847b9b9]
 12: (DaemonServer::ms_dispatch(Message*)+0x47) [0x55db08489b87]
 13: (DispatchQueue::entry()+0xf4a) [0x55db0893c4fa]
 14: (DispatchQueue::DispatchThread::entry()+0xd) [0x55db086dac7d]
 15: (()+0x76ba) [0x7fe0325dd6ba]
 16: (clone()+0x6d) [0x7fe03164941d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

#13 Updated by Iain Bucław 6 months ago

Jens Harbott wrote:

[...]

Maybe related: When running under 12.2.5, even while 3 rgw daemons are active, ceph -s still outputs:

[...]

In the smallest region, there are 5 servers running rgw, handling 2 realms.

    rgw: 2 daemons active

#14 Updated by Iain Bucław 6 months ago

Iain Bucław wrote:

[...]

In the smallest region, there are 5 servers running rgw, handling 2 realms.

[...]

That is to say, I think the "daemons active" count only reflects the number of realms in the cluster, not the number of running instances (in my example above, there are 10 running instances).

#15 Updated by Jens Harbott 6 months ago

FYI with the patch from http://tracker.ceph.com/issues/26838 applied to 12.2.8 and setting

mgr_stats_threshold = 11

(not 12 as mentioned above), the mgr daemons seem to be running fine now. Not sure, though, what the side effects of this setting may be.

#16 Updated by Dmitry Mishin about 2 months ago

Jens Harbott wrote:

Not sure, though, what the side effects of this setting may be.

Some metrics, like bandwidth and IOPS, stop working.
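That is expected if the threshold works as a priority cutoff: only counters whose priority is at or above the threshold are shipped to the mgr, and the priority scale tops out at 10, so a threshold of 11 filters out everything. The constants below are quoted from the luminous-era PerfCountersBuilder from memory, so treat the exact values as an assumption:

// Priority constants from src/common/perf_counters.h (luminous era,
// reproduced from memory -- treat as approximate). The scale ends at 10,
// so mgr_stats_threshold = 11 excludes every counter.
static const int PRIO_CRITICAL      = 10;
static const int PRIO_INTERESTING   = 8;
static const int PRIO_USEFUL        = 5;
static const int PRIO_UNINTERESTING = 2;
static const int PRIO_DEBUGONLY     = 0;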

Is there any progress on this issue? I just updated to 13.2.4 and still see the same problem.

#17 Updated by Boris Ranto about 2 months ago

I believe this should be fixed by this PR:

https://github.com/ceph/ceph/pull/25534

It is being backported to luminous and mimic. The luminous backport seems to be in already and should be part of the next release, 12.2.11. I am not sure what the state of the mimic backport is.

#18 Updated by Lenz Grimmer about 2 months ago

  • Related to Bug #36244: mgr crash when handle_report updating existing DaemonState for rgw added

#19 Updated by Sage Weil about 1 month ago

  • Priority changed from High to Urgent

/a/sage-2019-02-15_00:51:48-rados-wip-sage-testing-2019-02-14-1642-distro-basic-smithi/3591594

2019-02-15 05:14:41.904 7fb9dc242700  4 mgr.server handle_report from 0x560eb2606400 osd,1
2019-02-15 05:14:41.904 7fb9dc242700 20 mgr.server handle_report updating existing DaemonState for osd,1
2019-02-15 05:14:41.904 7fb9dc242700 20 mgr update loading 0 new types, 0 old types, had 110 types, got 782 bytes of data
2019-02-15 05:14:41.905 7fb9dc242700 -1 *** Caught signal (Segmentation fault) **
 in thread 7fb9dc242700 thread_name:ms_dispatch

 ceph version 14.0.1-3749-g2aae580 (2aae58097fd39ec4bff12ccfd1de93e28cef88fa) nautilus (dev)
 1: (()+0xf5d0) [0x7fb9fb7915d0]
 2: (DaemonPerfCounters::update(MMgrReport*)+0x37c) [0x560eada3e07c]
 3: (DaemonServer::handle_report(MMgrReport*)+0x3ab) [0x560eada107fb]
 4: (DaemonServer::ms_dispatch(Message*)+0x195) [0x560eada25f55]
 5: (Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x26) [0x560eada39e96]
 6: (DispatchQueue::entry()+0x11b9) [0x7fb9fe074bf9]
 7: (DispatchQueue::DispatchThread::entry()+0xd) [0x7fb9fe12343d]
 8: (()+0x7dd5) [0x7fb9fb789dd5]
 9: (clone()+0x6d) [0x7fb9fa439ead]

#20 Updated by Sage Weil about 1 month ago

Never mind, the new failure is unrelated!

#21 Updated by Boris Ranto about 1 month ago

  • Status changed from New to Resolved

As I mentioned above, I believe this was resolved by a patch for a different ticket. I'll close this. Feel free to re-open if you can hit this with luminous 12.2.11+ or current master.

#22 Updated by Nathan Cutler about 1 month ago

Duplicate of #36244

#23 Updated by Dmitry Mishin about 1 month ago

I still don't see the backport to mimic... Is there a ticket for it?

#24 Updated by Boris Ranto about 1 month ago

It looks like the mimic backport was also merged (https://github.com/ceph/ceph/pull/25864); it should be in the 13.2.5 release.

#25 Updated by Nathan Cutler about 1 month ago

@Dmitry: For backports, look at the "Copied to" entries in #36244

(Hint: the mimic backport is #37826)

#26 Updated by Dmitry Mishin about 1 month ago

Awesome, thanks!
