Bug #55656

mgr crash on "The path '/prometheus_receiver' was not found."

Added by David Galloway 9 months ago. Updated 7 months ago.

Status:
Closed
Priority:
Urgent
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

   -52> 2022-05-13T17:03:22.774+0000 7f30bdd3e700 10 monclient: tick
   -51> 2022-05-13T17:03:22.774+0000 7f30bdd3e700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2022-05-13T17:02:52.775732+0000)
   -50> 2022-05-13T17:03:23.254+0000 7f309302b700 10 monclient: tick
   -49> 2022-05-13T17:03:23.254+0000 7f309302b700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2022-05-13T17:02:53.256025+0000)
   -48> 2022-05-13T17:03:23.282+0000 7f309ee82700 10 monclient: tick
   -47> 2022-05-13T17:03:23.282+0000 7f309ee82700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2022-05-13T17:02:53.286845+0000)
   -46> 2022-05-13T17:03:23.326+0000 7f315e507700 10 monclient: tick
   -45> 2022-05-13T17:03:23.326+0000 7f315e507700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2022-05-13T17:02:53.329543+0000)
   -44> 2022-05-13T17:03:23.346+0000 7f30ab099700 10 monclient: tick
   -43> 2022-05-13T17:03:23.346+0000 7f30ab099700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2022-05-13T17:02:53.348720+0000)
   -42> 2022-05-13T17:03:23.354+0000 7f315bd02700 10 monclient: _send_mon_message to mon.ivan02 at v2:172.21.2.222:3300/0
   -41> 2022-05-13T17:03:23.586+0000 7f3069c9b700  0 [dashboard ERROR exception] Internal Server Error
Traceback (most recent call last):
  File "/lib/python3.6/site-packages/cherrypy/lib/static.py", line 58, in serve_file
    st = os.stat(path)
FileNotFoundError: [Errno 2] No such file or directory: '/usr/share/ceph/mgr/dashboard/frontend/dist/en-US/prometheus_receiver'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/share/ceph/mgr/dashboard/services/exception.py", line 47, in dashboard_exception_handler
    return handler(*args, **kwargs)
  File "/lib/python3.6/site-packages/cherrypy/_cpdispatch.py", line 54, in __call__
    return self.callable(*self.args, **self.kwargs)
  File "/usr/share/ceph/mgr/dashboard/controllers/home.py", line 134, in __call__
    return serve_file(full_path)
  File "/lib/python3.6/site-packages/cherrypy/lib/static.py", line 65, in serve_file
    raise cherrypy.NotFound()
cherrypy._cperror.NotFound: (404, "The path '/prometheus_receiver' was not found.")
   -40> 2022-05-13T17:03:23.590+0000 7f3069c9b700  0 [dashboard INFO request] [::ffff:172.21.2.202:34646] [POST] [404] [0.006s] [513.0B] [43c44f60-2cf5-4611-90b8-09681c4bbce3] /prometheus_receiver
   -39> 2022-05-13T17:03:23.774+0000 7f30bdd3e700 10 monclient: tick
   -38> 2022-05-13T17:03:23.774+0000 7f30bdd3e700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2022-05-13T17:02:53.775837+0000)
   -37> 2022-05-13T17:03:23.914+0000 7f3164513700 10 monclient: handle_auth_request added challenge on 0x55f772df9000
   -36> 2022-05-13T17:03:23.922+0000 7f313a313700  0 log_channel(cluster) log [DBG] : pgmap v70: 2833 pgs: 409 active+remapped+backfilling, 2424 active+clean; 75 TiB data, 142 TiB used, 614 TiB / 756 TiB avail; 2.0 MiB/s wr, 4 op/s; 19485149/332380401 objects misplaced (5.862%); 578 MiB/s, 521 objects/s recovering
   -35> 2022-05-13T17:03:23.922+0000 7f313a313700 10 monclient: _send_mon_message to mon.ivan02 at v2:172.21.2.222:3300/0
   -34> 2022-05-13T17:03:23.922+0000 7f3163d12700 10 monclient: handle_auth_request added challenge on 0x55f772df8c00
   -33> 2022-05-13T17:03:24.254+0000 7f309302b700 10 monclient: tick
   -32> 2022-05-13T17:03:24.254+0000 7f309302b700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2022-05-13T17:02:54.256129+0000)
   -31> 2022-05-13T17:03:24.282+0000 7f309ee82700 10 monclient: tick
   -30> 2022-05-13T17:03:24.282+0000 7f309ee82700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2022-05-13T17:02:54.286952+0000)
   -29> 2022-05-13T17:03:24.326+0000 7f315e507700 10 monclient: tick
   -28> 2022-05-13T17:03:24.326+0000 7f315e507700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2022-05-13T17:02:54.329656+0000)
   -27> 2022-05-13T17:03:24.326+0000 7f315e507700 10 log_client  log_queue is 1 last_log 212 sent 211 num 1 unsent 1 sending 1
   -26> 2022-05-13T17:03:24.326+0000 7f315e507700 10 log_client  will send 2022-05-13T17:03:23.925102+0000 mgr.reesi004.tplfrt (mgr.847761379) 212 : cluster [DBG] pgmap v70: 2833 pgs: 409 active+remapped+backfilling, 2424 active+clean; 75 TiB data, 142 TiB used, 614 TiB / 756 TiB avail; 2.0 MiB/s wr, 4 op/s; 19485149/332380401 objects misplaced (5.862%); 578 MiB/s, 521 objects/s recovering
   -25> 2022-05-13T17:03:24.326+0000 7f315e507700 10 monclient: _send_mon_message to mon.ivan02 at v2:172.21.2.222:3300/0
   -24> 2022-05-13T17:03:24.346+0000 7f30ab099700 10 monclient: tick
   -23> 2022-05-13T17:03:24.346+0000 7f30ab099700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2022-05-13T17:02:54.348815+0000)
   -22> 2022-05-13T17:03:24.418+0000 7f3139311700  0 log_channel(audit) log [DBG] : from='client.847813360 -' entity='client.admin' cmd=[{"prefix": "osd df", "target": ["mon-mgr", ""]}]: dispatch
   -21> 2022-05-13T17:03:24.538+0000 7f316050b700 10 log_client handle_log_ack log(last 212) v1
   -20> 2022-05-13T17:03:24.538+0000 7f316050b700 10 log_client  logged 2022-05-13T17:03:23.925102+0000 mgr.reesi004.tplfrt (mgr.847761379) 212 : cluster [DBG] pgmap v70: 2833 pgs: 409 active+remapped+backfilling, 2424 active+clean; 75 TiB data, 142 TiB used, 614 TiB / 756 TiB avail; 2.0 MiB/s wr, 4 op/s; 19485149/332380401 objects misplaced (5.862%); 578 MiB/s, 521 objects/s recovering
   -19> 2022-05-13T17:03:24.774+0000 7f30bdd3e700 10 monclient: tick
   -18> 2022-05-13T17:03:24.774+0000 7f30bdd3e700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2022-05-13T17:02:54.775951+0000)
   -17> 2022-05-13T17:03:25.254+0000 7f309302b700 10 monclient: tick
   -16> 2022-05-13T17:03:25.254+0000 7f309302b700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2022-05-13T17:02:55.256243+0000)
   -15> 2022-05-13T17:03:25.282+0000 7f309ee82700 10 monclient: tick
   -14> 2022-05-13T17:03:25.282+0000 7f309ee82700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2022-05-13T17:02:55.287057+0000)
   -13> 2022-05-13T17:03:25.326+0000 7f315e507700 10 monclient: tick
   -12> 2022-05-13T17:03:25.326+0000 7f315e507700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2022-05-13T17:02:55.329817+0000)
   -11> 2022-05-13T17:03:25.326+0000 7f315e507700 10 log_client  log_queue is 1 last_log 213 sent 212 num 1 unsent 1 sending 1
   -10> 2022-05-13T17:03:25.326+0000 7f315e507700 10 log_client  will send 2022-05-13T17:03:24.422797+0000 mgr.reesi004.tplfrt (mgr.847761379) 213 : audit [DBG] from='client.847813360 -' entity='client.admin' cmd=[{"prefix": "osd df", "target": ["mon-mgr", ""]}]: dispatch
    -9> 2022-05-13T17:03:25.326+0000 7f315e507700 10 monclient: _send_mon_message to mon.ivan02 at v2:172.21.2.222:3300/0
    -8> 2022-05-13T17:03:25.346+0000 7f30ab099700 10 monclient: tick
    -7> 2022-05-13T17:03:25.346+0000 7f30ab099700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2022-05-13T17:02:55.348913+0000)
    -6> 2022-05-13T17:03:25.350+0000 7f30bcd3c700  0 [balancer INFO root] Optimize plan auto_2022-05-13_17:03:25
    -5> 2022-05-13T17:03:25.350+0000 7f30bcd3c700  0 [balancer INFO root] Mode upmap, max misplaced 0.060000
    -4> 2022-05-13T17:03:25.350+0000 7f30bcd3c700  0 [balancer INFO root] do_upmap
    -3> 2022-05-13T17:03:25.358+0000 7f30bcd3c700  0 [balancer INFO root] pools ['cephfs.teuthology.meta', 'rbd', 'data', 'cephfs.teuthology.data-ec', 'cephfs.teuthology.data', 'default.rgw.buckets.data', 'cephfs.scratch.meta', 'default.rgw.log', 'cephsqlite', '.rgw.root', 'metadata', 'libvirt-pool', 'default.rgw.control', '.mgr', 'default.rgw.buckets.non-ec', 'telemetry', 'default.rgw.buckets.index', 'default.rgw.meta', 'cephfs.scratch.data']
    -2> 2022-05-13T17:03:25.366+0000 7f30bcd3c700 -1 /home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.0-178-gcced371a/rpm/el8/BUILD/ceph-17.2.0-178-gcced371a/src/osd/OSDMap.cc: In function 'float OSDMap::calc_deviations(ceph::common::CephContext*, const std::map<int, std::set<pg_t> >&, const std::map<int, float>&, float, std::map<int, float>&, std::multimap<float, int>&, float&)' thread 7f30bcd3c700 time 2022-05-13T17:03:25.361749+0000
/home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.0-178-gcced371a/rpm/el8/BUILD/ceph-17.2.0-178-gcced371a/src/osd/OSDMap.cc: 5015: FAILED ceph_assert(osd_weight.count(oid))

 ceph version 17.2.0-178-gcced371a (cced371a6398564c97d1c1ccafdd43033f6c92df) quincy (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x152) [0x7f3169efe024]
 2: /usr/lib64/ceph/libceph-common.so.2(+0x283245) [0x7f3169efe245]
 3: (OSDMap::calc_deviations(ceph::common::CephContext*, std::map<int, std::set<pg_t, std::less<pg_t>, std::allocator<pg_t> >, std::less<int>, std::allocator<std::pair<int const, std::set<pg_t, std::less<pg_t>, std::allocator<pg_t> > > > > const&, std::map<int, float, std::less<int>, std::allocator<std::pair<int const, float> > > const&, float, std::map<int, float, std::less<int>, std::allocator<std::pair<int const, float> > >&, std::multimap<float, int, std::less<float>, std::allocator<std::pair<float const, int> > >&, float&)+0xe0) [0x7f316a3c8160]
 4: (OSDMap::calc_pg_upmaps(ceph::common::CephContext*, unsigned int, int, std::set<long, std::less<long>, std::allocator<long> > const&, OSDMap::Incremental*, unsigned int*)+0x389) [0x7f316a3cc4f9]
 5: /usr/bin/ceph-mgr(+0x2982f6) [0x55f763f212f6]
 6: /lib64/libpython3.6m.so.1.0(+0x19d3b7) [0x7f316ae2f3b7]
 7: _PyEval_EvalFrameDefault()
 8: /lib64/libpython3.6m.so.1.0(+0xf9b84) [0x7f316ad8bb84]
 9: /lib64/libpython3.6m.so.1.0(+0x17a590) [0x7f316ae0c590]
 10: /lib64/libpython3.6m.so.1.0(+0x19d657) [0x7f316ae2f657]
 11: _PyEval_EvalFrameDefault()
 12: /lib64/libpython3.6m.so.1.0(+0x17a3a8) [0x7f316ae0c3a8]
 13: /lib64/libpython3.6m.so.1.0(+0x19d657) [0x7f316ae2f657]
 14: _PyEval_EvalFrameDefault()
 15: /lib64/libpython3.6m.so.1.0(+0x17a3a8) [0x7f316ae0c3a8]
 16: /lib64/libpython3.6m.so.1.0(+0x19d657) [0x7f316ae2f657]
 17: _PyEval_EvalFrameDefault()
 18: /lib64/libpython3.6m.so.1.0(+0xfa4f6) [0x7f316ad8c4f6]
 19: _PyFunction_FastCallDict()
 20: _PyObject_FastCallDict()
 21: /lib64/libpython3.6m.so.1.0(+0x10e210) [0x7f316ada0210]
 22: _PyObject_FastCallDict()
 23: PyObject_CallMethod()
 24: (PyModuleRunner::serve()+0x66) [0x55f763f1ccf6]
 25: (PyModuleRunner::PyModuleRunnerThread::entry()+0x3e3) [0x55f763f1e333]
 26: /lib64/libpthread.so.0(+0x81cf) [0x7f3168d141cf]
 27: clone()

    -1> 2022-05-13T17:03:25.374+0000 7f315bd02700 10 monclient: _send_mon_message to mon.ivan02 at v2:172.21.2.222:3300/0
     0> 2022-05-13T17:03:25.374+0000 7f30bcd3c700 -1 *** Caught signal (Aborted) **
 in thread 7f30bcd3c700 thread_name:balancer

 ceph version 17.2.0-178-gcced371a (cced371a6398564c97d1c1ccafdd43033f6c92df) quincy (stable)
 1: /lib64/libpthread.so.0(+0x12ce0) [0x7f3168d1ece0]
 2: gsignal()
 3: abort()
 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1b0) [0x7f3169efe082]
 5: /usr/lib64/ceph/libceph-common.so.2(+0x283245) [0x7f3169efe245]
 6: (OSDMap::calc_deviations(ceph::common::CephContext*, std::map<int, std::set<pg_t, std::less<pg_t>, std::allocator<pg_t> >, std::less<int>, std::allocator<std::pair<int const, std::set<pg_t, std::less<pg_t>, std::allocator<pg_t> > > > > const&, std::map<int, float, std::less<int>, std::allocator<std::pair<int const, float> > > const&, float, std::map<int, float, std::less<int>, std::allocator<std::pair<int const, float> > >&, std::multimap<float, int, std::less<float>, std::allocator<std::pair<float const, int> > >&, float&)+0xe0) [0x7f316a3c8160]
 7: (OSDMap::calc_pg_upmaps(ceph::common::CephContext*, unsigned int, int, std::set<long, std::less<long>, std::allocator<long> > const&, OSDMap::Incremental*, unsigned int*)+0x389) [0x7f316a3cc4f9]
 8: /usr/bin/ceph-mgr(+0x2982f6) [0x55f763f212f6]
 9: /lib64/libpython3.6m.so.1.0(+0x19d3b7) [0x7f316ae2f3b7]
 10: _PyEval_EvalFrameDefault()
 11: /lib64/libpython3.6m.so.1.0(+0xf9b84) [0x7f316ad8bb84]
 12: /lib64/libpython3.6m.so.1.0(+0x17a590) [0x7f316ae0c590]
 13: /lib64/libpython3.6m.so.1.0(+0x19d657) [0x7f316ae2f657]
 14: _PyEval_EvalFrameDefault()
 15: /lib64/libpython3.6m.so.1.0(+0x17a3a8) [0x7f316ae0c3a8]
 16: /lib64/libpython3.6m.so.1.0(+0x19d657) [0x7f316ae2f657]
 17: _PyEval_EvalFrameDefault()
 18: /lib64/libpython3.6m.so.1.0(+0x17a3a8) [0x7f316ae0c3a8]
 19: /lib64/libpython3.6m.so.1.0(+0x19d657) [0x7f316ae2f657]
 20: _PyEval_EvalFrameDefault()
 21: /lib64/libpython3.6m.so.1.0(+0xfa4f6) [0x7f316ad8c4f6]
 22: _PyFunction_FastCallDict()
 23: _PyObject_FastCallDict()
 24: /lib64/libpython3.6m.so.1.0(+0x10e210) [0x7f316ada0210]
 25: _PyObject_FastCallDict()
 26: PyObject_CallMethod()
 27: (PyModuleRunner::serve()+0x66) [0x55f763f1ccf6]
 28: (PyModuleRunner::PyModuleRunnerThread::entry()+0x3e3) [0x55f763f1e333]
 29: /lib64/libpthread.so.0(+0x81cf) [0x7f3168d141cf]
 30: clone()
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 rbd_pwl
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 immutable_obj_cache
   0/ 5 client
   1/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 0 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 1 reserver
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 rgw_sync
   1/ 5 rgw_datacache
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 compressor
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   4/ 5 memdb
   1/ 5 fuse
   2/ 5 mgr
   1/ 5 mgrc
   1/ 5 dpdk
   1/ 5 eventtrace
   1/ 5 prioritycache
   0/ 5 test
   0/ 5 cephfs_mirror
   0/ 5 cephsqlite
   0/ 5 seastore
   0/ 5 seastore_onode
   0/ 5 seastore_odata
   0/ 5 seastore_omap
   0/ 5 seastore_tm
   0/ 5 seastore_cleaner
   0/ 5 seastore_lba
   0/ 5 seastore_cache
   0/ 5 seastore_journal
   0/ 5 seastore_device
   0/ 5 alienstore
   1/ 5 mclock
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
--- pthread ID / name mapping for recent threads ---
  7f306949a700 / dashboard
  7f3069c9b700 / dashboard
  7f306a49c700 / dashboard
  7f306ac9d700 / dashboard
  7f306b49e700 / dashboard
  7f306bcdf700 / dashboard
  7f306c4e0700 / dashboard
  7f306cce1700 / dashboard
  7f306d522700 / dashboard
  7f306dd23700 / dashboard

Related issues

Related to RADOS - Bug #48896: osd/OSDMap.cc: FAILED ceph_assert(osd_weight.count(i.first)) New
Related to Orchestrator - Bug #55638: alertmanager webhook urls may lead to 404 Resolved

History

#1 Updated by David Galloway 9 months ago

Crashed with the dashboard module enabled but it managed to bring itself back up this time.

    -5> 2022-05-13T19:38:38.234+0000 7f0831279700  0 [balancer INFO root] Optimize plan auto_2022-05-13_19:38:38
    -4> 2022-05-13T19:38:38.234+0000 7f0831279700  0 [balancer INFO root] Mode upmap, max misplaced 0.060000
    -3> 2022-05-13T19:38:38.234+0000 7f0831279700  0 [balancer INFO root] do_upmap
    -2> 2022-05-13T19:38:38.234+0000 7f0831279700  0 [balancer INFO root] pools ['telemetry', 'libvirt-pool', 'default.rgw.meta', 'default.rgw.buckets.data', 'default.rgw.buckets.non-ec', 'cephfs.scratch.data', 'cephfs.scratch.meta', 'default.rgw.control', '.rgw.root', 'rbd', 'data', 'cephsqlite', 'cephfs.teuthology.data-ec', 'cephfs.teuthology.data', 'cephfs.teuthology.meta', 'default.rgw.buckets.index', '.mgr', 'metadata', 'default.rgw.log']
    -1> 2022-05-13T19:38:38.238+0000 7f0831279700 -1 /home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.0-178-gcced371a/rpm/el8/BUILD/ceph-17.2.0-178-gcced371a/src/osd/OSDMap.cc: In function 'float OSDMap::calc_deviations(ceph::common::CephContext*, const std::map<int, std::set<pg_t> >&, const std::map<int, float>&, float, std::map<int, float>&, std::multimap<float, int>&, float&)' thread 7f0831279700 time 2022-05-13T19:38:38.239677+0000
/home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.0-178-gcced371a/rpm/el8/BUILD/ceph-17.2.0-178-gcced371a/src/osd/OSDMap.cc: 5015: FAILED ceph_assert(osd_weight.count(oid))

 ceph version 17.2.0-178-gcced371a (cced371a6398564c97d1c1ccafdd43033f6c92df) quincy (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x152) [0x7f08de41a024]
 2: /usr/lib64/ceph/libceph-common.so.2(+0x283245) [0x7f08de41a245]
 3: (OSDMap::calc_deviations(ceph::common::CephContext*, std::map<int, std::set<pg_t, std::less<pg_t>, std::allocator<pg_t> >, std::less<int>, std::allocator<std::pair<int const, std::set<pg_t, std::less<pg_t>, std::allocator<pg_t> > > > > const&, std::map<int, float, std::less<int>, std::allocator<std::pair<int const, float> > > const&, float, std::map<int, float, std::less<int>, std::allocator<std::pair<int const, float> > >&, std::multimap<float, int, std::less<float>, std::allocator<std::pair<float const, int> > >&, float&)+0xe0) [0x7f08de8e4160]
 4: (OSDMap::calc_pg_upmaps(ceph::common::CephContext*, unsigned int, int, std::set<long, std::less<long>, std::allocator<long> > const&, OSDMap::Incremental*, unsigned int*)+0x389) [0x7f08de8e84f9]
 5: /usr/bin/ceph-mgr(+0x2982f6) [0x55a490e402f6]
 6: /lib64/libpython3.6m.so.1.0(+0x19d3b7) [0x7f08df34b3b7]
 7: _PyEval_EvalFrameDefault()
 8: /lib64/libpython3.6m.so.1.0(+0xf9b84) [0x7f08df2a7b84]
 9: /lib64/libpython3.6m.so.1.0(+0x17a590) [0x7f08df328590]
 10: /lib64/libpython3.6m.so.1.0(+0x19d657) [0x7f08df34b657]
 11: _PyEval_EvalFrameDefault()
 12: /lib64/libpython3.6m.so.1.0(+0x17a3a8) [0x7f08df3283a8]
 13: /lib64/libpython3.6m.so.1.0(+0x19d657) [0x7f08df34b657]
 14: _PyEval_EvalFrameDefault()
 15: /lib64/libpython3.6m.so.1.0(+0x17a3a8) [0x7f08df3283a8]
 16: /lib64/libpython3.6m.so.1.0(+0x19d657) [0x7f08df34b657]
 17: _PyEval_EvalFrameDefault()
 18: /lib64/libpython3.6m.so.1.0(+0xfa4f6) [0x7f08df2a84f6]
 19: _PyFunction_FastCallDict()
 20: _PyObject_FastCallDict()
 21: /lib64/libpython3.6m.so.1.0(+0x10e210) [0x7f08df2bc210]
 22: _PyObject_FastCallDict()
 23: PyObject_CallMethod()
 24: (PyModuleRunner::serve()+0x66) [0x55a490e3bcf6]
 25: (PyModuleRunner::PyModuleRunnerThread::entry()+0x3e3) [0x55a490e3d333]
 26: /lib64/libpthread.so.0(+0x81cf) [0x7f08dd2301cf]
 27: clone()

     0> 2022-05-13T19:38:38.246+0000 7f0831279700 -1 *** Caught signal (Aborted) **
 in thread 7f0831279700 thread_name:balancer

 ceph version 17.2.0-178-gcced371a (cced371a6398564c97d1c1ccafdd43033f6c92df) quincy (stable)
 1: /lib64/libpthread.so.0(+0x12ce0) [0x7f08dd23ace0]
 2: gsignal()
 3: abort()
 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1b0) [0x7f08de41a082]
 5: /usr/lib64/ceph/libceph-common.so.2(+0x283245) [0x7f08de41a245]
 6: (OSDMap::calc_deviations(ceph::common::CephContext*, std::map<int, std::set<pg_t, std::less<pg_t>, std::allocator<pg_t> >, std::less<int>, std::allocator<std::pair<int const, std::set<pg_t, std::less<pg_t>, std::allocator<pg_t> > > > > const&, std::map<int, float, std::less<int>, std::allocator<std::pair<int const, float> > > const&, float, std::map<int, float, std::less<int>, std::allocator<std::pair<int const, float> > >&, std::multimap<float, int, std::less<float>, std::allocator<std::pair<float const, int> > >&, float&)+0xe0) [0x7f08de8e4160]
 7: (OSDMap::calc_pg_upmaps(ceph::common::CephContext*, unsigned int, int, std::set<long, std::less<long>, std::allocator<long> > const&, OSDMap::Incremental*, unsigned int*)+0x389) [0x7f08de8e84f9]
 8: /usr/bin/ceph-mgr(+0x2982f6) [0x55a490e402f6]
 9: /lib64/libpython3.6m.so.1.0(+0x19d3b7) [0x7f08df34b3b7]
 10: _PyEval_EvalFrameDefault()
 11: /lib64/libpython3.6m.so.1.0(+0xf9b84) [0x7f08df2a7b84]
 12: /lib64/libpython3.6m.so.1.0(+0x17a590) [0x7f08df328590]
 13: /lib64/libpython3.6m.so.1.0(+0x19d657) [0x7f08df34b657]
 14: _PyEval_EvalFrameDefault()
 15: /lib64/libpython3.6m.so.1.0(+0x17a3a8) [0x7f08df3283a8]
 16: /lib64/libpython3.6m.so.1.0(+0x19d657) [0x7f08df34b657]
 17: _PyEval_EvalFrameDefault()
 18: /lib64/libpython3.6m.so.1.0(+0x17a3a8) [0x7f08df3283a8]
 19: /lib64/libpython3.6m.so.1.0(+0x19d657) [0x7f08df34b657]
 20: _PyEval_EvalFrameDefault()
 21: /lib64/libpython3.6m.so.1.0(+0xfa4f6) [0x7f08df2a84f6]
 22: _PyFunction_FastCallDict()
 23: _PyObject_FastCallDict()
 24: /lib64/libpython3.6m.so.1.0(+0x10e210) [0x7f08df2bc210]
 25: _PyObject_FastCallDict()
 26: PyObject_CallMethod()
 27: (PyModuleRunner::serve()+0x66) [0x55a490e3bcf6]
 28: (PyModuleRunner::PyModuleRunnerThread::entry()+0x3e3) [0x55a490e3d333]
 29: /lib64/libpthread.so.0(+0x81cf) [0x7f08dd2301cf]
 30: clone()
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 rbd_pwl
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 immutable_obj_cache
   0/ 5 client
   1/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 0 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 1 reserver
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 rgw_sync
   1/ 5 rgw_datacache
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 compressor
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   4/ 5 memdb
   1/ 5 fuse
   2/ 5 mgr
   1/ 5 mgrc
   1/ 5 dpdk
   1/ 5 eventtrace
   1/ 5 prioritycache
   0/ 5 test
   0/ 5 cephfs_mirror
   0/ 5 cephsqlite
   0/ 5 seastore
   0/ 5 seastore_onode
   0/ 5 seastore_odata
   0/ 5 seastore_omap
   0/ 5 seastore_tm
   0/ 5 seastore_cleaner
   0/ 5 seastore_lba
   0/ 5 seastore_cache
   0/ 5 seastore_journal
   0/ 5 seastore_device
   0/ 5 alienstore
   1/ 5 mclock
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
--- pthread ID / name mapping for recent threads ---

#2 Updated by Laura Flores 9 months ago

David Galloway wrote:

Crashed with the dashboard module enabled but it managed to bring itself back up this time.

[...]

@David which node and log was this output taken from? I'm trying to see what was going on with the balancer at that time.

From the code, it looks like the balancer was making sure that an OSD was present in the crush tree:

  for (auto& [oid, opgs] : pgs_by_osd) {
    // make sure osd is still there (belongs to this crush-tree)
    ceph_assert(osd_weight.count(oid));
    float target = osd_weight.at(oid) * pgs_per_weight;
    float deviation = (float)opgs.size() - target;
    ldout(cct, 20) << " osd." << oid
                   << "\tpgs " << opgs.size()
                   << "\ttarget " << target
                   << "\tdeviation " << deviation
                   << dendl;

But it wasn't able to find it.
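The failure mode can be illustrated with a minimal Python sketch of that check (the real code is C++ in OSDMap::calc_deviations; the data below is hypothetical). An OSD that still has PGs mapped to it but is missing from the osd_weight map — i.e. no longer in the crush tree — trips the assertion:

```python
# Hypothetical Python sketch of the assertion in OSDMap::calc_deviations.
# Not the real implementation; it only shows the inconsistent-state trigger.

def calc_deviations(pgs_by_osd, osd_weight, pgs_per_weight):
    deviations = {}
    for oid, opgs in pgs_by_osd.items():
        # make sure osd is still there (belongs to this crush-tree)
        assert oid in osd_weight, f"osd.{oid} has PGs but no crush weight"
        target = osd_weight[oid] * pgs_per_weight
        deviations[oid] = len(opgs) - target
    return deviations

# Consistent maps: every OSD holding PGs also has a weight -> no assert.
calc_deviations({0: {"1.a", "1.b"}, 1: {"1.c"}}, {0: 1.0, 1: 1.0}, 1.5)

# osd.2 still appears in pgs_by_osd but has no entry in osd_weight,
# which is the state the crash log reports as FAILED ceph_assert.
try:
    calc_deviations({0: {"1.a"}, 2: {"1.b"}}, {0: 1.0}, 1.0)
except AssertionError as e:
    print(e)  # osd.2 has PGs but no crush weight
```

In the C++ code the assert aborts the mgr process (the "Caught signal (Aborted)" in thread_name:balancer above) rather than raising a catchable exception.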

#3 Updated by Laura Flores 9 months ago

  • Related to Bug #48896: osd/OSDMap.cc: FAILED ceph_assert(osd_weight.count(i.first)) added

#4 Updated by David Galloway 9 months ago

Pretty sure it was reesi004. But because the reesi root drives are so small and we have the debug level set so high, we logrotate something like 6 times a day, so the log is unfortunately gone.

#5 Updated by Laura Flores 9 months ago

Got it, thanks David.

#6 Updated by Redouane Kachach Elhichou 9 months ago

  • Related to Bug #55638: alertmanager webhook urls may lead to 404 added

#7 Updated by Adam King 9 months ago

At least for the dashboard related failure, we added https://github.com/ceph/ceph/pull/46306 to the LRC and re-enabled the dashboard module and there have been no further crashes so I think that part should be fixed. Not sure about the balancer portion of this tracker.

#8 Updated by Radoslaw Zarzynski 7 months ago

  • Status changed from New to Closed

The balancer issue is already tracked in the RADOS project: https://tracker.ceph.com/issues/48896.
Closing this one.
