Bug #48689

closed

Erratic MGR behaviour after new cluster install

Added by Jeremi A over 3 years ago. Updated over 2 years ago.

Status:
Duplicate
Priority:
Normal
Assignee:
-
Category:
-
Target version:
% Done:

0%

Source:
Community (user)
Tags:
mgr
Backport:
Regression:
No
Severity:
1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
ceph-ansible
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

What happened:

Finished building a new cluster but am experiencing the following issues:
  • MGR constantly and randomly crashes right from the start
  • When I reboot a single host (with 24 OSDs) as part of a test, about 60 OSDs go offline (the 24 on that host plus random others) and get marked as down. They don't come back online.
  • The only way to get the OSDs back in is to `sudo docker restart <osd>` on the host (a bulk version is sketched below).
What you expected to happen:
  • The MGR not to crash constantly right after a vanilla build
  • OSDs to come back in when a host reboots, or when they randomly restart
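
For reference, a bulk form of that docker-restart workaround (a sketch only; it assumes the containerized deployment names each OSD container starting with ceph-osd-, so adjust the filter to your setup):

# restart every ceph-osd container on this host
for c in $(sudo docker ps --format '{{.Names}}' | grep '^ceph-osd'); do
    sudo docker restart "$c"
done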

How to reproduce it (minimal and precise):
Brand new cluster; the issues appeared during the first test (taking 1 host down).
cephfs (pool 14) = EC 8+2, HDD only

Logs:
MGR - http://pastefile.fr/cf09c6d2c02641b69b9feb1c4a7ba66a

ceph health detail (extract of one pg)

 pg 14.7fc is stuck undersized for 13m, current state active+undersized+degraded+remapped+backfill_wait, last acting [201,216,84,198,94,214,16,289,262,2147483647]
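
Note: in an EC 8+2 pool the acting set should contain ten OSDs; 2147483647 (2^31 - 1) is the placeholder Ceph prints when no OSD is mapped to a shard, so one shard of this PG currently has no OSD assigned. A quick check of that constant (hypothetical one-liner):

python3 -c 'print(2**31 - 1)'   # 2147483647
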
ansible -i /opt/ilifu_ceph-config/inventory mons -m shell -a "sudo docker logs ceph-mgr-\`hostname\` 2>&1 | grep -i 'fail to parse'"
B-01-34-cephctl.maas | CHANGED | rc=0 >>
2020-12-21T11:21:23.320+0200 7f1631ecb700  0 [devicehealth ERROR root] Fail to parse JSON result from daemon osd.1 ()
2020-12-21T11:21:24.352+0200 7f1631ecb700  0 [devicehealth ERROR root] Fail to parse JSON result from daemon osd.3 ()
2020-12-21T11:21:30.680+0200 7f1631ecb700  0 [devicehealth ERROR root] Fail to parse JSON result from daemon osd.9 ()
2020-12-21T11:21:43.377+0200 7f1631ecb700  0 [devicehealth ERROR root] Fail to parse JSON result from daemon osd.20 ()
2020-12-21T11:21:51.485+0200 7f1631ecb700  0 [devicehealth ERROR root] Fail to parse JSON result from daemon osd.27 ()
2020-12-21T11:21:54.117+0200 7f1631ecb700  0 [devicehealth ERROR root] Fail to parse JSON result from daemon osd.30 ()
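
The empty parentheses "()" in these messages show the devicehealth module got nothing it could parse back from those OSDs. One way to see what an OSD can actually produce is to run smartctl with JSON output against one of its data devices (a sketch only; the container name ceph-osd-1 and device path /dev/sdX are placeholders, and --json requires smartmontools >= 7.0):

sudo docker exec ceph-osd-1 smartctl -a --json /dev/sdX | python3 -m json.tool   # fails loudly if the output is empty or not valid JSON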

Further devicehealth/progress log entries:
2020-12-21T11:23:48.405+0200 7f1631ecb700  0 [devicehealth ERROR root] Fail to parse JSON result from daemon osd.140 ()
2020-12-21T11:31:21.206+0200 7f160e111700  0 [progress WARNING root] osd.140 marked out

ceph crash ls
Error initializing cluster client: ObjectNotFound('RADOS object not found (error calling conf_read_file)',)
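
The "error calling conf_read_file" message typically means the ceph CLI on the host cannot find a readable /etc/ceph/ceph.conf; in a containerized deployment the same commands can be run inside the mgr container instead (a sketch, reusing the ceph-mgr-`hostname` naming from the commands above):

sudo docker exec ceph-mgr-`hostname` ceph crash ls
sudo docker exec ceph-mgr-`hostname` ceph crash info <crash-id>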

Dec 21 13:15:41 B-01-34-cephctl docker[32628]: 2020-12-21T13:15:41.739+0200 7f64f0c82700 -1 client.0 error registering admin socket command: (17) File exists
Dec 21 13:15:41 B-01-34-cephctl docker[32628]: message repeated 4 times: [ 2020-12-21T13:15:41.739+0200 7f64f0c82700 -1 client.0 error registering admin socket command: (17) File exists]
Dec 21 13:15:42 B-01-34-cephctl docker[32628]: 2020-12-21T13:15:42.231+0200 7f64e35b3700  0 log_channel(cluster) log [DBG] : pgmap v3: 2241 pgs: 3 active+clean+scrubbing+deep, 2238 active+clean; 45 TiB data, 75 TiB used, 4.2 PiB / 4.2 PiB avail
Dec 21 13:15:42 B-01-34-cephctl docker[32628]: 2020-12-21T13:15:42.699+0200 7f64e45b5700  0 log_channel(cluster) log [DBG] : pgmap v4: 2241 pgs: 3 active+clean+scrubbing+deep, 2238 active+clean; 45 TiB data, 75 TiB used, 4.2 PiB / 4.2 PiB avail
Dec 21 13:15:42 B-01-34-cephctl docker[32628]: /usr/include/c++/8/bits/stl_algo.h:3721: constexpr const _Tp& std::clamp(const _Tp&, const _Tp&, const _Tp&) [with _Tp = unsigned int]: Assertion '!(__hi < __lo)' failed.
Dec 21 13:15:42 B-01-34-cephctl docker[32628]: *** Caught signal (Aborted) **

Vars
Inventory - http://pastefile.fr/050f196318ed37375a12cfa879717d8f
Group vars - http://pastefile.fr/8e51fbd46706d650e4ab9b75b776cb8c
Crush map - http://pastefile.fr/afbb6fb06066333fac44d94c144c9446

Environment:
  • OS (e.g. from /etc/os-release): `18.04.5 LTS (Bionic Beaver)`
  • Kernel (e.g. `uname -a`): `5.4.0-58-generic #64~18.04.1-Ubuntu SMP Wed Dec 9`
  • Docker version if applicable (e.g. `docker version`): `client/server 19.03.6`
  • Ansible version (e.g. `ansible-playbook --version`): `ansible-playbook 2.9.0`
  • ceph-ansible version (e.g. `git head or tag or stable branch`): `stable-5.0`
  • Ceph version (e.g. `ceph -v`): `ceph version 15.2.8 (bdf3eebcd22d7d0b3dd4d5501bee5bac354d5b55) octopus (stable)`

Related issues 1 (1 open, 0 closed)

Is duplicate of mgr - Bug #47132: mgr: Caught signal (Segmentation fault) thread_name:safe_timer (Need More Info)

Actions #1

Updated by Jeremi A over 3 years ago

ubuntu@B-01-16-cephctl:~$ sudo docker exec -it ceph-mgr-B-01-16-cephctl ceph daemon mgr.B-01-16-cephctl perf dump 2>&1 | tee mgr.perf-dump.dave
{
    "AsyncMessenger::Worker-0": {
        "msgr_recv_messages": 6,
        "msgr_send_messages": 5,
        "msgr_recv_bytes": 668000,
        "msgr_send_bytes": 640,
        "msgr_created_connections": 4,
        "msgr_active_connections": 18446744073709551615,
...
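For context, 18446744073709551615 is 2^64 - 1, i.e. the bit pattern of -1 stored in an unsigned 64-bit counter, which suggests msgr_active_connections was decremented below zero rather than ever genuinely reaching that value. Quick check (hypothetical one-liner):

python3 -c 'print(2**64 - 1)'   # 18446744073709551615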
Actions #2

Updated by Jeremi A over 3 years ago

root@B-01-16-cephctl:/etc/systemd/system/multi-user.target.wants# sudo docker logs ceph-crash-`hostname`
INFO:ceph-crash:monitoring path /var/lib/ceph/crash, delay 600s
WARNING:ceph-crash:post /var/lib/ceph/crash/2020-12-21T15:12:07.945396Z_9355f0a4-a727-4962-bf42-0a30f656b3bf as client.crash.B-01-16-cephctl failed: b'[errno 13] RADOS permission denied (error connecting to the cluster)\n'
WARNING:ceph-crash:post /var/lib/ceph/crash/2020-12-21T14:11:52.011757Z_9a4362a1-3c2a-4dae-bc01-031ca0495937 as client.crash.B-01-16-cephctl failed: b'[errno 13] RADOS permission denied (error connecting to the cluster)\n'
WARNING:ceph-crash:post /var/lib/ceph/crash/2020-12-21T14:04:55.808225Z_e2ddde68-6326-4138-8416-ebedac7465a1 as client.crash.B-01-16-cephctl failed: b'[errno 13] RADOS permission denied (error connecting to the cluster)\n'
WARNING:ceph-crash:post /var/lib/ceph/crash/2020-12-21T14:37:56.788143Z_adb7df3a-9d36-4466-9a29-754b54a9fbdf as client.crash.B-01-16-cephctl failed: b'[errno 13] RADOS permission denied (error connecting to the cluster)\n'
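
These failures mean the per-host client.crash.<hostname> key cannot authenticate with the cluster or lacks the required caps. A way to compare its caps with what the crash module documentation expects (a sketch; assumes the ceph CLI is run from inside the mgr container, as above):

sudo docker exec ceph-mgr-`hostname` ceph auth get client.crash.`hostname`
# the crash module docs create a posting key with caps: mon 'profile crash' mgr 'profile crash'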
Actions #3

Updated by Brad Hubbard over 3 years ago

  • Project changed from Ceph to mgr
  • Category deleted (Monitor)
Actions #4

Updated by Brad Hubbard over 3 years ago

  • Is duplicate of Bug #47132: mgr: Caught signal (Segmentation fault) thread_name:safe_timer added
Actions #5

Updated by Neha Ojha over 3 years ago

  • Status changed from New to Need More Info

Is it possible for you to capture a coredump and share it with us?
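
For anyone following along, one way to capture such a coredump from a containerized mgr (a sketch only; assumptions: cores are written according to the host's kernel.core_pattern, which containers share, /var/crash exists on the host, and the mgr container permits an unlimited core size, e.g. via --ulimit core=-1 in its docker run options):

sudo mkdir -p /var/crash
sudo sysctl -w kernel.core_pattern='/var/crash/core.%e.%p.%t'
# after the next mgr crash:
ls -lh /var/crash/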

Actions #6

Updated by Jeremi A over 3 years ago

Neha Ojha wrote:

Is it possible for you to capture a coredump and share it with us?

Hi, we've rebuilt our cluster from scratch in the meantime. We had 312 OSDs, but rebuilding it with only 240 OSDs seems to have it stable. However, this means we cannot grow the cluster or make use of all our nodes/OSDs.

Actions #7

Updated by Yaarit Hatuka over 2 years ago

  • Status changed from Need More Info to Duplicate