Project

General

Profile

Actions

Bug #45136

open

mgr: crash with "tcmalloc::allocate_full_cpp_throw_oom"

Added by Ernesto Puerta about 4 years ago. Updated almost 2 years ago.

Status:
Need More Info
Priority:
High
Assignee:
-
Category:
ceph-mgr
Target version:
-
% Done:

0%

Source:
Q/A
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

  • Crash time (02:45:38 UTC).
  • Downstream version (14.2.8-28.el7cp) running on bare metal (RHEL7.8), system with 32 GB RAM.
  • Cluster setup:
# ceph -s
  cluster:
    health: HEALTH_WARN
            1 pool(s) have non-power-of-two pg_num
            application not enabled on 1 pool(s)
            1 daemons have recently crashed

  services:
    mon: 3 daemons, quorum magna057,magna067,magna120 (age 2h)
    mgr: magna057(active, since 14h), standbys: magna120
    osd: 9 osds: 9 up (since 2h), 9 in (since 2h)
    rgw: 2 daemons active (magna057.rgw0, magna067.rgw0)

  task status:

  data:
    pools:   5 pools, 132 pgs
    objects: 116.87k objects, 458 GiB
    usage:   1.3 TiB used, 6.8 TiB / 8.2 TiB avail
    pgs:     132 active+clean

Several tcmalloc large alloc started to be reported (/var/log/messages) 4 hours before the crash (every 30min-2hours):

messages:Apr 16 22:41:48 magna057 ceph-mgr: tcmalloc: large alloc 1073741824 bytes == 0x5628c62f8000 @  0x7fbf605db35f 0x7fbf605faa56 0x5625e6034976 0x5625e6034b77 0x5625e602a839 0x5625e60a1928 0x5625e60a889b 0x5625e60b0157 0x5625e609a916 0x7fbf61400c29 0x7fbf614ae0ad 0x7fbf5eaccea5 0x7fbf5d77a8dd
messages:Apr 17 00:25:53 magna057 ceph-mgr: tcmalloc: large alloc 2147483648 bytes == 0x562b05510000 @  0x7fbf605db35f 0x7fbf605faa56 0x5625e6034976 0x5625e6034b77 0x5625e602a839 0x5625e60a1928 0x5625e60a889b 0x5625e60b0157 0x5625e609a916 0x7fbf61400c29 0x7fbf614ae0ad 0x7fbf5eaccea5 0x7fbf5d77a8dd
messages:Apr 17 01:01:45 magna057 ceph-mgr: tcmalloc: large alloc 2147483648 bytes == 0x562bc5db8000 @  0x7fbf605db35f 0x7fbf605faa56 0x5625e6034976 0x5625e6034b77 0x5625e602a839 0x5625e60a1928 0x5625e60a889b 0x5625e60b0157 0x5625e609a916 0x7fbf61400c29 0x7fbf614ae0ad 0x7fbf5eaccea5 0x7fbf5d77a8dd
messages:Apr 17 01:35:23 magna057 ceph-mgr: tcmalloc: large alloc 2147483648 bytes == 0x562c465b8000 @  0x7fbf605db35f 0x7fbf605faa56 0x5625e6034976 0x5625e6034b77 0x5625e602a839 0x5625e60a1928 0x5625e60a889b 0x5625e60b0157 0x5625e609a916 0x7fbf61400c29 0x7fbf614ae0ad 0x7fbf5eaccea5 0x7fbf5d77a8dd
messages:Apr 17 02:09:16 magna057 ceph-mgr: tcmalloc: large alloc 2147483648 bytes == 0x562b054b8000 @  0x7fbf605db35f 0x7fbf605faa56 0x5625e6034976 0x5625e6034b77 0x5625e602a839 0x5625e60a1928 0x5625e60a889b 0x5625e60b0157 0x5625e609a916 0x7fbf61400c29 0x7fbf614ae0ad 0x7fbf5eaccea5 0x7fbf5d77a8dd
messages:Apr 17 02:45:38 magna057 ceph-mgr: tcmalloc: large alloc 2147483648 bytes == (nil) @  0x7fbf605db35f 0x7fbf605faa56 0x5625e6034976 0x5625e6034b77 0x5625e602a839 0x5625e60a1928 0x5625e60a889b 0x5625e60b0157 0x5625e609a916 0x7fbf61400c29 0x7fbf614ae0ad 0x7fbf5eaccea5 0x7fbf5d77a8dd

Stats:

magna57 (active ceph-mgr colocated with mon and rgw)

magna057 (RAM, 48 hours around OOM event)

magna57 ceph-mgr log:

2020-04-17 02:45:34.040 7fbf5af87700  0 auth: could not find secret_id=390
2020-04-17 02:45:34.040 7fbf5af87700  0 cephx: verify_authorizer could not get service secret for service mgr secret_id=390
2020-04-17 02:45:35.063 7fbf5af87700  0 auth: could not find secret_id=391
2020-04-17 02:45:35.063 7fbf5af87700  0 cephx: verify_authorizer could not get service secret for service mgr secret_id=391
2020-04-17 02:45:35.878 7fbf2fba6700  0 log_channel(cluster) log [DBG] : pgmap v58306: 132 pgs: 9 stale+peering, 123 active+clean; 458 GiB data, 1.2 TiB used, 4.2 TiB / 5.5 TiB avail; 4.0 KiB/s rd, 0 B/s wr, 6 op/s
2020-04-17 02:45:37.879 7fbf2fba6700  0 log_channel(cluster) log [DBG] : pgmap v58307: 132 pgs: 9 stale+peering, 123 active+clean; 458 GiB data, 1.2 TiB used, 4.2 TiB / 5.5 TiB avail; 4.0 KiB/s rd, 0 B/s wr, 6 op/s
2020-04-17 02:45:39.029 7fbf5577c700 -1 *** Caught signal (Aborted) **
 in thread 7fbf5577c700 thread_name:ms_dispatch

 ceph version 14.2.8-28.el7cp (b32eac9fd60c00c62a2d3c85d88b483be7b55ba1) nautilus (stable)
 1: (()+0xf630) [0x7fbf5ead4630]
 2: (gsignal()+0x37) [0x7fbf5d6b2387]
 3: (abort()+0x148) [0x7fbf5d6b3a78]
 4: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7fbf5dfc27d5]
 5: (()+0x5e746) [0x7fbf5dfc0746]
 6: (()+0x5e773) [0x7fbf5dfc0773]
 7: (()+0x5e993) [0x7fbf5dfc0993]
 8: (()+0x1772b) [0x7fbf605dc72b]
 9: (tcmalloc::allocate_full_cpp_throw_oom(unsigned long)+0xf3) [0x7fbf605faa03]
 10: (void std::vector<std::pair<Context*, int>, std::allocator<std::pair<Context*, int> > >::_M_realloc_insert<std::pair<Context*, int> >(__gnu_cxx::__normal_iterator<std::pair<Context*, int>*, std::vector<std::pair<Context*, int>, std::allocator<std::pair<Context*, int> > > >, std::pair<Context*, int>&&)+0x56) [0x5625e6034976]
 11: (Finisher::queue(Context*, int)+0xe7) [0x5625e6034b77]
 12: (ActivePyModules::notify_all(LogEntry const&)+0x109) [0x5625e602a839]
 13: (Mgr::handle_log(MLog*)+0x58) [0x5625e60a1928]
 14: (Mgr::ms_dispatch(Message*)+0x22b) [0x5625e60a889b]
 15: (MgrStandby::ms_dispatch(Message*)+0x97) [0x5625e60b0157]
 16: (Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x26) [0x5625e609a916]
 17: (DispatchQueue::entry()+0x1699) [0x7fbf61400c29]
 18: (DispatchQueue::DispatchThread::entry()+0xd) [0x7fbf614ae0ad]
 19: (()+0x7ea5) [0x7fbf5eaccea5]
 20: (clone()+0x6d) [0x7fbf5d77a8dd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
 -9999> 2020-04-17 02:32:59.553 7fbf5577c700  4 mgr ms_dispatch service_map(e20 1 svc) v1
 -9998> 2020-04-17 02:32:59.908 7fbf30ba8700  4 mgr.server handle_report from 0x56264c176800 mon,magna120
 -9997> 2020-04-17 02:33:00.014 7fbf5577c700  4 mgr ms_dispatch active log(5000 entries from seq 232332058 at 2020-04-17 02:32:42.159809) v1
 -9996> 2020-04-17 02:33:00.014 7fbf5577c700  4 mgr ms_dispatch log(5000 entries from seq 232332058 at 2020-04-17 02:32:42.159809) v1
 -9995> 2020-04-17 02:33:00.058 7fbf5577c700  4 mgr ms_dispatch active service_map(e20 1 svc) v1
 -9994> 2020-04-17 02:33:00.058 7fbf5577c700  4 mgr ms_dispatch service_map(e20 1 svc) v1
 -9993> 2020-04-17 02:33:00.188 7fbf30ba8700  4 mgr.server handle_report from 0x5625f55c3c00 osd,3
 -9992> 2020-04-17 02:33:00.322 7fbf30ba8700  4 mgr.server handle_report from 0x5625f2f72c00 osd,0
 -9991> 2020-04-17 02:33:00.694 7fbf51f75700  4 mgr send_beacon active
 -9990> 2020-04-17 02:33:00.695 7fbf51f75700 10 monclient: _send_mon_message to mon.magna057 at v2:10.8.128.57:3300/0
 -9989> 2020-04-17 02:33:01.140 7fbf5477a700 10 monclient: tick
 -9988> 2020-04-17 02:33:01.140 7fbf5477a700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2020-04-17 02:32:31.141196)
 -9987> 2020-04-17 02:33:01.140 7fbf5477a700 10 log_client  log_queue is 5 last_log 57779 sent 57774 num 5 unsent 5 sending 5
 -9986> 2020-04-17 02:33:01.140 7fbf5477a700 10 log_client  will send 2020-04-17 02:32:51.451241 mgr.magna057 (mgr.154572) 57775 : cluster [DBG] pgmap v57924: 132 pgs: 9 stale+peering, 123 active+clean; 458 GiB data, 1.2 TiB used, 4.2 TiB / 5.5 TiB avail
 -9985> 2020-04-17 02:33:01.140 7fbf5477a700 10 log_client  will send 2020-04-17 02:32:53.452017 mgr.magna057 (mgr.154572) 57776 : cluster [DBG] pgmap v57925: 132 pgs: 9 stale+peering, 123 active+clean; 458 GiB data, 1.2 TiB used, 4.2 TiB / 5.5 TiB avail
 -9984> 2020-04-17 02:33:01.140 7fbf5477a700 10 log_client  will send 2020-04-17 02:32:55.452660 mgr.magna057 (mgr.154572) 57777 : cluster [DBG] pgmap v57926: 132 pgs: 9 stale+peering, 123 active+clean; 458 GiB data, 1.2 TiB used, 4.2 TiB / 5.5 TiB avail
 -9983> 2020-04-17 02:33:01.140 7fbf5477a700 10 log_client  will send 2020-04-17 02:32:57.485798 mgr.magna057 (mgr.154572) 57778 : cluster [DBG] pgmap v57927: 132 pgs: 9 stale+peering, 123 active+clean; 458 GiB data, 1.2 TiB used, 4.2 TiB / 5.5 TiB avail
 -9982> 2020-04-17 02:33:01.140 7fbf5477a700 10 log_client  will send 2020-04-17 02:32:59.486323 mgr.magna057 (mgr.154572) 57779 : cluster [DBG] pgmap v57928: 132 pgs: 9 stale+peering, 123 active+clean; 458 GiB data, 1.2 TiB used, 4.2 TiB / 5.5 TiB avail
 -9981> 2020-04-17 02:33:01.140 7fbf5477a700 10 monclient: _send_mon_message to mon.magna057 at v2:10.8.128.57:3300/0
 -9980> 2020-04-17 02:33:01.158 7fbf5577c700  4 mgr ms_dispatch active log(5000 entries from seq 232337058 at 2020-04-17 02:32:42.190039) v1
 -9979> 2020-04-17 02:33:01.158 7fbf5577c700  4 mgr ms_dispatch log(5000 entries from seq 232337058 at 2020-04-17 02:32:42.190039) v1
 -9978> 2020-04-17 02:33:01.197 7fbf5af87700  0 auth: could not find secret_id=390
 -9977> 2020-04-17 02:33:01.197 7fbf5af87700  0 cephx: verify_authorizer could not get service secret for service mgr secret_id=390
 -9976> 2020-04-17 02:33:01.197 7fbf5af87700 10 monclient: handle_auth_request bad authorizer on 0x56264c176400
 -9975> 2020-04-17 02:33:01.205 7fbf5577c700  4 mgr ms_dispatch active service_map(e20 1 svc) v1
 -9974> 2020-04-17 02:33:01.205 7fbf5577c700  4 mgr ms_dispatch service_map(e20 1 svc) v1
 -9973> 2020-04-17 02:33:01.309 7fbf5577c700  4 mgr ms_dispatch active service_map(e20 1 svc) v1
 -9972> 2020-04-17 02:33:01.309 7fbf5577c700  4 mgr ms_dispatch service_map(e20 1 svc) v1
 -9971> 2020-04-17 02:33:01.310 7fbf5577c700  4 mgr ms_dispatch active mgrdigest v1
 -9970> 2020-04-17 02:33:01.310 7fbf5577c700  4 mgr ms_dispatch mgrdigest v1
 -9969> 2020-04-17 02:33:01.485 7fbf2fba6700  0 log_channel(cluster) log [DBG] : pgmap v57929: 132 pgs: 9 stale+peering, 123 active+clean; 458 GiB data, 1.2 TiB used, 4.2 TiB / 5.5 TiB avail
 -9968> 2020-04-17 02:33:01.485 7fbf2fba6700 10 monclient: _send_mon_message to mon.magna057 at v2:10.8.128.57:3300/0
 -9967> 2020-04-17 02:33:01.796 7fbf5577c700  4 mgr ms_dispatch active log(5001 entries from seq 232342058 at 2020-04-17 02:32:43.259328) v1
 -9966> 2020-04-17 02:33:01.796 7fbf5577c700  4 mgr ms_dispatch log(5001 entries from seq 232342058 at 2020-04-17 02:32:43.259328) v1
 -9965> 2020-04-17 02:33:01.836 7fbf30ba8700  4 mgr.server handle_report from 0x5625ec448c00 rgw,magna067.rgw0
 -9964> 2020-04-17 02:33:01.841 7fbf5577c700  4 mgr ms_dispatch active service_map(e20 1 svc) v1
 -9963> 2020-04-17 02:33:01.841 7fbf5577c700  4 mgr ms_dispatch service_map(e20 1 svc) v1
 -9962> 2020-04-17 02:33:01.841 7fbf5577c700 10 log_client handle_log_ack log(last 57779) v1
 -9961> 2020-04-17 02:33:01.841 7fbf5577c700 10 log_client  logged 2020-04-17 02:32:51.451241 mgr.magna057 (mgr.154572) 57775 : cluster [DBG] pgmap v57924: 132 pgs: 9 stale+peering, 123 active+clean; 458 GiB data, 1.2 TiB used, 4.2 TiB / 5.5 TiB avail
 -9960> 2020-04-17 02:33:01.841 7fbf5577c700 10 log_client  logged 2020-04-17 02:32:53.452017 mgr.magna057 (mgr.154572) 57776 : cluster [DBG] pgmap v57925: 132 pgs: 9 stale+peering, 123 active+clean; 458 GiB data, 1.2 TiB used, 4.2 TiB / 5.5 TiB avail
 -9959> 2020-04-17 02:33:01.841 7fbf5577c700 10 log_client  logged 2020-04-17 02:32:55.452660 mgr.magna057 (mgr.154572) 57777 : cluster [DBG] pgmap v57926: 132 pgs: 9 stale+peering, 123 active+clean; 458 GiB data, 1.2 TiB used, 4.2 TiB / 5.5 TiB avail
 -9958> 2020-04-17 02:33:01.841 7fbf5577c700 10 log_client  logged 2020-04-17 02:32:57.485798 mgr.magna057 (mgr.154572) 57778 : cluster [DBG] pgmap v57927: 132 pgs: 9 stale+peering, 123 active+clean; 458 GiB data, 1.2 TiB used, 4.2 TiB / 5.5 TiB avail
 -9957> 2020-04-17 02:33:01.841 7fbf5577c700 10 log_client  logged 2020-04-17 02:32:59.486323 mgr.magna057 (mgr.154572) 57779 : cluster [DBG] pgmap v57928: 132 pgs: 9 stale+peering, 123 active+clean; 458 GiB data, 1.2 TiB used, 4.2 TiB / 5.5 TiB avail
 -9956> 2020-04-17 02:33:01.987 7fbf30ba8700  4 mgr.server handle_report from 0x5625ed9b2c00 mon,magna067
 -9955> 2020-04-17 02:33:02.293 7fbf30ba8700  4 mgr.server handle_report from 0x5625ec448800 osd,8
 -9954> 2020-04-17 02:33:02.297 7fbf5577c700  4 mgr ms_dispatch active log(5000 entries from seq 232347058 at 2020-04-17 02:32:43.287662) v1
 -9953> 2020-04-17 02:33:02.297 7fbf5577c700  4 mgr ms_dispatch log(5000 entries from seq 232347058 at 2020-04-17 02:32:43.287662) v1
 -9952> 2020-04-17 02:33:02.327 7fbf30ba8700  4 mgr.server handle_report from 0x5625fb403800 osd,5
 -9951> 2020-04-17 02:33:02.340 7fbf5577c700  4 mgr ms_dispatch active service_map(e20 1 svc) v1
 -9950> 2020-04-17 02:33:02.340 7fbf5577c700  4 mgr ms_dispatch service_map(e20 1 svc) v1
 -9949> 2020-04-17 02:33:02.360 7fbf30ba8700  4 mgr.server handle_report from 0x5629d222dc00 osd,7
 -9948> 2020-04-17 02:33:02.397 7fbf30ba8700  4 mgr.server handle_report from 0x56288cd56000 osd,1
 -9947> 2020-04-17 02:33:02.695 7fbf51f75700  4 mgr send_beacon active
 -9946> 2020-04-17 02:33:02.695 7fbf51f75700 10 monclient: _send_mon_message to mon.magna057 at v2:10.8.128.57:3300/0
 -9945> 2020-04-17 02:33:02.854 7fbf30ba8700  4 mgr.server handle_report from 0x5625f753c400 mgr,magna057
 -9944> 2020-04-17 02:33:02.867 7fbf30ba8700  4 mgr.server handle_report from 0x5625ed58a800 mon,magna057
 -9943> 2020-04-17 02:33:02.873 7fbf2dba2700  4 mgr get_config get_config key: mgr/prometheus/magna057/rbd_stats_pools
 -9942> 2020-04-17 02:33:02.873 7fbf2dba2700  4 mgr get_config get_config key: mgr/prometheus/rbd_stats_pools
 -9941> 2020-04-17 02:33:02.873 7fbf2dba2700  4 mgr get_typed_config get_typed_config [magna057/]rbd_stats_pools not found
 -9940> 2020-04-17 02:33:02.997 7fbf5577c700  4 mgr ms_dispatch active log(5000 entries from seq 232352058 at 2020-04-17 02:32:43.316695) v1
 -9939> 2020-04-17 02:33:02.997 7fbf5577c700  4 mgr ms_dispatch log(5000 entries from seq 232352058 at 2020-04-17 02:32:43.316695) v1
 -9938> 2020-04-17 02:33:03.040 7fbf5577c700  4 mgr ms_dispatch active service_map(e20 1 svc) v1
 -9937> 2020-04-17 02:33:03.040 7fbf5577c700  4 mgr ms_dispatch service_map(e20 1 svc) v1
 -9936> 2020-04-17 02:33:03.043 7fbf30ba8700  4 mgr.server handle_report from 0x5625ec745400 rgw,magna057.rgw0
 -9935> 2020-04-17 02:33:03.177 7fbf5577c700  4 mgr ms_dispatch active service_map(e20 1 svc) v1
 -9934> 2020-04-17 02:33:03.177 7fbf5577c700  4 mgr ms_dispatch service_map(e20 1 svc) v1
 -9933> 2020-04-17 02:33:03.177 7fbf5577c700  4 mgr ms_dispatch active mgrdigest v1
 -9932> 2020-04-17 02:33:03.177 7fbf5577c700  4 mgr ms_dispatch mgrdigest v1
 -9931> 2020-04-17 02:33:03.312 7fbf5af87700  0 auth: could not find secret_id=391
 -9930> 2020-04-17 02:33:03.312 7fbf5af87700  0 cephx: verify_authorizer could not get service secret for service mgr secret_id=391
 -9929> 2020-04-17 02:33:03.312 7fbf5af87700 10 monclient: handle_auth_request bad authorizer on 0x5625ec8bf800
 -9928> 2020-04-17 02:33:03.486 7fbf2fba6700  0 log_channel(cluster) log [DBG] : pgmap v57930: 132 pgs: 9 stale+peering, 123 active+clean; 458 GiB data, 1.2 TiB used, 4.2 TiB / 5.5 TiB avail
 -9927> 2020-04-17 02:33:03.486 7fbf2fba6700 10 monclient: _send_mon_message to mon.magna057 at v2:10.8.128.57:3300/0
 -9926> 2020-04-17 02:33:03.655 7fbf5577c700  4 mgr ms_dispatch active log(5000 entries from seq 232357058 at 2020-04-17 02:32:44.303167) v1
 -9925> 2020-04-17 02:33:03.655 7fbf5577c700  4 mgr ms_dispatch log(5000 entries from seq 232357058 at 2020-04-17 02:32:44.303167) v1
 -9924> 2020-04-17 02:33:03.699 7fbf5577c700  4 mgr ms_dispatch active service_map(e20 1 svc) v1
 -9923> 2020-04-17 02:33:03.699 7fbf5577c700  4 mgr ms_dispatch service_map(e20 1 svc) v1
 -9922> 2020-04-17 02:33:03.884 7fbf5af87700  0 auth: could not find secret_id=390
 -9921> 2020-04-17 02:33:03.884 7fbf5af87700  0 cephx: verify_authorizer could not get service secret for service mgr secret_id=390
 -9920> 2020-04-17 02:33:03.884 7fbf5af87700 10 monclient: handle_auth_request bad authorizer on 0x5629d222c000
 -9919> 2020-04-17 02:33:04.155 7fbf5577c700  4 mgr ms_dispatch active log(5000 entries from seq 232362058 at 2020-04-17 02:32:44.332027) v1
 -9918> 2020-04-17 02:33:04.155 7fbf5577c700  4 mgr ms_dispatch log(5000 entries from seq 232362058 at 2020-04-17 02:32:44.332027) v1
 -9917> 2020-04-17 02:33:04.195 7fbf5577c700  4 mgr ms_dispatch active service_map(e20 1 svc) v1
 -9916> 2020-04-17 02:33:04.195 7fbf5577c700  4 mgr ms_dispatch service_map(e20 1 svc) v1
 -9915> 2020-04-17 02:33:04.696 7fbf51f75700  4 mgr send_beacon active
 -9914> 2020-04-17 02:33:04.696 7fbf51f75700 10 monclient: _send_mon_message to mon.magna057 at v2:10.8.128.57:3300/0
 -9913> 2020-04-17 02:33:04.816 7fbf5577c700  4 mgr ms_dispatch active log(5000 entries from seq 232367058 at 2020-04-17 02:32:44.362081) v1
 -9912> 2020-04-17 02:33:04.816 7fbf5577c700  4 mgr ms_dispatch log(5000 entries from seq 232367058 at 2020-04-17 02:32:44.362081) v1
 -9911> 2020-04-17 02:33:04.858 7fbf5577c700  4 mgr ms_dispatch active service_map(e20 1 svc) v1
 -9910> 2020-04-17 02:33:04.859 7fbf5577c700  4 mgr ms_dispatch service_map(e20 1 svc) v1
 -9909> 2020-04-17 02:33:04.887 7fbf5af87700  0 auth: could not find secret_id=391
 -9908> 2020-04-17 02:33:04.887 7fbf5af87700  0 cephx: verify_authorizer could not get service secret for service mgr secret_id=391
...
   -38> 2020-04-17 02:45:35.878 7fbf2fba6700 10 monclient: _send_mon_message to mon.magna057 at v2:10.8.128.57:3300/0
   -37> 2020-04-17 02:45:36.359 7fbf5577c700  4 mgr ms_dispatch active log(5000 entries from seq 235915427 at 2020-04-17 02:45:21.649510) v1
   -36> 2020-04-17 02:45:36.359 7fbf5577c700  4 mgr ms_dispatch log(5000 entries from seq 235915427 at 2020-04-17 02:45:21.649510) v1
   -35> 2020-04-17 02:45:36.400 7fbf5577c700  4 mgr ms_dispatch active service_map(e20 1 svc) v1
   -34> 2020-04-17 02:45:36.400 7fbf5577c700  4 mgr ms_dispatch service_map(e20 1 svc) v1
   -33> 2020-04-17 02:45:36.826 7fbf30ba8700  4 mgr.server handle_report from 0x5625f19a4c00 rgw,magna067.rgw0
   -32> 2020-04-17 02:45:36.877 7fbf5577c700  4 mgr ms_dispatch active log(5000 entries from seq 235920427 at 2020-04-17 02:45:21.678877) v1
   -31> 2020-04-17 02:45:36.877 7fbf5577c700  4 mgr ms_dispatch log(5000 entries from seq 235920427 at 2020-04-17 02:45:21.678877) v1
   -30> 2020-04-17 02:45:36.917 7fbf5577c700  4 mgr ms_dispatch active service_map(e20 1 svc) v1
   -29> 2020-04-17 02:45:36.917 7fbf5577c700  4 mgr ms_dispatch service_map(e20 1 svc) v1
   -28> 2020-04-17 02:45:36.974 7fbf30ba8700  4 mgr.server handle_report from 0x5625ec448c00 mon,magna067
   -27> 2020-04-17 02:45:37.032 7fbf5577c700  4 mgr ms_dispatch active service_map(e20 1 svc) v1
   -26> 2020-04-17 02:45:37.032 7fbf5577c700  4 mgr ms_dispatch service_map(e20 1 svc) v1
   -25> 2020-04-17 02:45:37.032 7fbf5577c700  4 mgr ms_dispatch active mgrdigest v1
   -24> 2020-04-17 02:45:37.032 7fbf5577c700  4 mgr ms_dispatch mgrdigest v1
   -23> 2020-04-17 02:45:37.210 7fbf30ba8700  4 mgr.server handle_report from 0x5625eb821400 osd,8
   -22> 2020-04-17 02:45:37.248 7fbf30ba8700  4 mgr.server handle_report from 0x5629d222d800 osd,5
   -21> 2020-04-17 02:45:37.411 7fbf30ba8700  4 mgr.server handle_report from 0x5629d222dc00 osd,7
   -20> 2020-04-17 02:45:37.412 7fbf51f75700  4 mgr send_beacon active
   -19> 2020-04-17 02:45:37.412 7fbf51f75700 10 monclient: _send_mon_message to mon.magna057 at v2:10.8.128.57:3300/0
   -18> 2020-04-17 02:45:37.446 7fbf30ba8700  4 mgr.server handle_report from 0x56288cd56000 osd,1
   -17> 2020-04-17 02:45:37.517 7fbf5577c700  4 mgr ms_dispatch active log(5000 entries from seq 235925427 at 2020-04-17 02:45:22.664953) v1
   -16> 2020-04-17 02:45:37.517 7fbf5577c700  4 mgr ms_dispatch log(5000 entries from seq 235925427 at 2020-04-17 02:45:22.664953) v1
   -15> 2020-04-17 02:45:37.557 7fbf5577c700  4 mgr ms_dispatch active service_map(e20 1 svc) v1
   -14> 2020-04-17 02:45:37.557 7fbf5577c700  4 mgr ms_dispatch service_map(e20 1 svc) v1
   -13> 2020-04-17 02:45:37.824 7fbf30ba8700  4 mgr.server handle_report from 0x5625eda8c000 mon,magna057
   -12> 2020-04-17 02:45:37.849 7fbf30ba8700  4 mgr.server handle_report from 0x5625ed668800 mgr,magna057
   -11> 2020-04-17 02:45:37.879 7fbf2fba6700  0 log_channel(cluster) log [DBG] : pgmap v58307: 132 pgs: 9 stale+peering, 123 active+clean; 458 GiB data, 1.2 TiB used, 4.2 TiB / 5.5 TiB avail; 4.0 KiB/s rd, 0 B/s wr, 6 op/s
   -10> 2020-04-17 02:45:37.879 7fbf2fba6700 10 monclient: _send_mon_message to mon.magna057 at v2:10.8.128.57:3300/0
    -9> 2020-04-17 02:45:38.008 7fbf31bea700 10 monclient: tick
    -8> 2020-04-17 02:45:38.008 7fbf31bea700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2020-04-17 02:45:08.009722)
    -7> 2020-04-17 02:45:38.050 7fbf5577c700  4 mgr ms_dispatch active log(5000 entries from seq 235930427 at 2020-04-17 02:45:22.694286) v1
    -6> 2020-04-17 02:45:38.050 7fbf5577c700  4 mgr ms_dispatch log(5000 entries from seq 235930427 at 2020-04-17 02:45:22.694286) v1
    -5> 2020-04-17 02:45:38.070 7fbf30ba8700  4 mgr.server handle_report from 0x5625ec745400 rgw,magna057.rgw0
    -4> 2020-04-17 02:45:38.090 7fbf5577c700  4 mgr ms_dispatch active service_map(e20 1 svc) v1
    -3> 2020-04-17 02:45:38.090 7fbf5577c700  4 mgr ms_dispatch service_map(e20 1 svc) v1
    -2> 2020-04-17 02:45:38.786 7fbf5577c700  4 mgr ms_dispatch active log(5000 entries from seq 235935427 at 2020-04-17 02:45:23.777440) v1
    -1> 2020-04-17 02:45:38.786 7fbf5577c700  4 mgr ms_dispatch log(5000 entries from seq 235935427 at 2020-04-17 02:45:23.777440) v1
     0> 2020-04-17 02:45:39.029 7fbf5577c700 -1 *** Caught signal (Aborted) **
 in thread 7fbf5577c700 thread_name:ms_dispatch

 ceph version 14.2.8-28.el7cp (b32eac9fd60c00c62a2d3c85d88b483be7b55ba1) nautilus (stable)
 1: (()+0xf630) [0x7fbf5ead4630]
 2: (gsignal()+0x37) [0x7fbf5d6b2387]
 3: (abort()+0x148) [0x7fbf5d6b3a78]
 4: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7fbf5dfc27d5]
 5: (()+0x5e746) [0x7fbf5dfc0746]
 6: (()+0x5e773) [0x7fbf5dfc0773]
 7: (()+0x5e993) [0x7fbf5dfc0993]
 8: (()+0x1772b) [0x7fbf605dc72b]
 9: (tcmalloc::allocate_full_cpp_throw_oom(unsigned long)+0xf3) [0x7fbf605faa03]
 10: (void std::vector<std::pair<Context*, int>, std::allocator<std::pair<Context*, int> > >::_M_realloc_insert<std::pair<Context*, int> >(__gnu_cxx::__normal_iterator<std::pair<Context*, int>*, std::vector<std::pair<Context*, int>, std::allocator<std::pair<Context*, int> > > >, std::pair<Context*, int>&&)+0x56) [0x5625e6034976]
 11: (Finisher::queue(Context*, int)+0xe7) [0x5625e6034b77]
 12: (ActivePyModules::notify_all(LogEntry const&)+0x109) [0x5625e602a839]
 13: (Mgr::handle_log(MLog*)+0x58) [0x5625e60a1928]
 14: (Mgr::ms_dispatch(Message*)+0x22b) [0x5625e60a889b]
 15: (MgrStandby::ms_dispatch(Message*)+0x97) [0x5625e60b0157]
 16: (Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x26) [0x5625e609a916]
 17: (DispatchQueue::entry()+0x1699) [0x7fbf61400c29]
 18: (DispatchQueue::DispatchThread::entry()+0xd) [0x7fbf614ae0ad]
 19: (()+0x7ea5) [0x7fbf5eaccea5]
 20: (clone()+0x6d) [0x7fbf5d77a8dd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   1/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 0 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 1 reserver
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 rgw_sync
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 xio
   1/ 5 compressor
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   4/ 5 memdb
   1/ 5 kinetic
   1/ 5 fuse
   1/ 5 mgr
   1/ 5 mgrc
   1/ 5 dpdk
   1/ 5 eventtrace
   1/ 5 prioritycache
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-mgr.magna057.log
--- end dump of recent events ---


Files

oom.png (183 KB) oom.png magna057 (RAM, 48 hours around OOM event) Ernesto Puerta, 04/17/2020 05:35 PM
Actions #1

Updated by Ernesto Puerta about 4 years ago

Actions #2

Updated by Sebastian Wagner about 4 years ago

might be a mem leak?

Actions #3

Updated by Yaarit Hatuka over 2 years ago

  • Crash signature (v1) updated (diff)

Moving the crash data into a note as 'Crash signature (v1)' should hold the value of 'stack_sig' key (which is missing in this case).

{
    "os_version_id": "7.8", 
    "utsname_release": "3.10.0-1127.el7.x86_64", 
    "os_name": "Red Hat Enterprise Linux Server", 
    "entity_name": "mgr.magna057", 
    "timestamp": "2020-04-17 02:45:39.021860Z", 
    "process_name": "ceph-mgr", 
    "utsname_machine": "x86_64", 
    "utsname_sysname": "Linux", 
    "os_version": "7.8 (Maipo)", 
    "os_id": "rhel", 
    "utsname_version": "#1 SMP Tue Feb 18 16:39:12 EST 2020", 
    "backtrace": [
        "(()+0xf630) [0x7fbf5ead4630]", 
        "(gsignal()+0x37) [0x7fbf5d6b2387]", 
        "(abort()+0x148) [0x7fbf5d6b3a78]", 
        "(__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7fbf5dfc27d5]", 
        "(()+0x5e746) [0x7fbf5dfc0746]", 
        "(()+0x5e773) [0x7fbf5dfc0773]", 
        "(()+0x5e993) [0x7fbf5dfc0993]", 
        "(()+0x1772b) [0x7fbf605dc72b]", 
        "(tcmalloc::allocate_full_cpp_throw_oom(unsigned long)+0xf3) [0x7fbf605faa03]", 
        "(void std::vector<std::pair<Context*, int>, std::allocator<std::pair<Context*, int> > >::_M_realloc_insert<std::pair<Context*, int> >(__gnu_cxx::__normal_iterator<std::pair<Context*, int>*, std::vector<std::pair<Context*, int>, std::allocator<std::pair<Context*, int> > > >, std::pair<Context*, int>&&)+0x56) [0x5625e6034976]", 
        "(Finisher::queue(Context*, int)+0xe7) [0x5625e6034b77]", 
        "(ActivePyModules::notify_all(LogEntry const&)+0x109) [0x5625e602a839]", 
        "(Mgr::handle_log(MLog*)+0x58) [0x5625e60a1928]", 
        "(Mgr::ms_dispatch(Message*)+0x22b) [0x5625e60a889b]", 
        "(MgrStandby::ms_dispatch(Message*)+0x97) [0x5625e60b0157]", 
        "(Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x26) [0x5625e609a916]", 
        "(DispatchQueue::entry()+0x1699) [0x7fbf61400c29]", 
        "(DispatchQueue::DispatchThread::entry()+0xd) [0x7fbf614ae0ad]", 
        "(()+0x7ea5) [0x7fbf5eaccea5]", 
        "(clone()+0x6d) [0x7fbf5d77a8dd]" 
    ], 
    "utsname_hostname": "magna057.ceph.redhat.com", 
    "crash_id": "2020-04-17_02:45:39.021860Z_e73adf03-9acd-413f-b1ee-a961db95901c", 
    "ceph_version": "14.2.8-28.el7cp" 
}
Actions #4

Updated by Radoslaw Zarzynski almost 2 years ago

  • Status changed from New to Need More Info

Nautilus is OOL. Does the problem still happen on a newer version?

Actions

Also available in: Atom PDF