Bug #48689

Erratic MGR behaviour after new cluster install

Added by Jeremi A about 1 month ago. Updated 15 days ago.

Status: Need More Info
Priority: Normal
Assignee: -
Category: -
Target version:
% Done: 0%
Source: Community (user)
Tags: mgr
Backport:
Regression: No
Severity: 1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite: ceph-ansible
Pull request ID:
Crash signature:

Description

What happened:

Finished building a new cluster but am experiencing the following issues:
  • The MGR constantly and randomly crashes, right from the start.
  • When I reboot a single host (with 24 OSDs) as part of a test, about 60 OSDs go offline (the host's 24 plus random others) and are marked down. They don't come back online.
  • The only way to get those OSDs back in is to `sudo docker restart <osd>` on the host (see the sketch after this list).
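
Not part of the original report: a rough sketch of the restart workaround above, assuming ceph-ansible's usual ceph-osd-<id> container naming; adjust the name filter if the containers are named differently.

# On the affected host: restart every OSD container that is no longer running.
for c in $(sudo docker ps -a --filter name=ceph-osd --filter status=exited --format '{{.Names}}'); do
    sudo docker restart "$c"
done
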
What you expected to happen:
  • The MGR not crashing constantly right after a vanilla build
  • OSDs to come back in when a host reboots, or when they randomly restart

How to reproduce it (minimal and precise):
Brand-new cluster; the issues appeared during the first test (taking 1 host down).
cephfs pool (id 14) = EC 8+2, HDD only

Logs:
MGR- http://pastefile.fr/cf09c6d2c02641b69b9feb1c4a7ba66a

ceph health detail (extract of one pg)

 pg 14.7fc is stuck undersized for 13m, current state active+undersized+degraded+remapped+backfill_wait, last acting [201,216,84,198,94,214,16,289,262,2147483647]
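
Not part of the original output: in an EC pool's acting set, 2147483647 (0x7fffffff, CRUSH's "none" marker) means no OSD is currently mapped to that shard, which matches OSDs staying down after the reboot test. The mapping can be inspected with the standard pg query command; it is run here inside a mon container because the host CLI below cannot read ceph.conf (the ceph-mon-<hostname> container name is ceph-ansible's convention and an assumption here).

sudo docker exec ceph-mon-`hostname` ceph pg 14.7fc query
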
ansible -i /opt/ilifu_ceph-config/inventory mons -m shell -a "sudo docker logs ceph-mgr-\`hostname\` 2>&1 | grep -i 'fail to parse'"
B-01-34-cephctl.maas | CHANGED | rc=0 >>
2020-12-21T11:21:23.320+0200 7f1631ecb700  0 [devicehealth ERROR root] Fail to parse JSON result from daemon osd.1 ()
2020-12-21T11:21:24.352+0200 7f1631ecb700  0 [devicehealth ERROR root] Fail to parse JSON result from daemon osd.3 ()
2020-12-21T11:21:30.680+0200 7f1631ecb700  0 [devicehealth ERROR root] Fail to parse JSON result from daemon osd.9 ()
2020-12-21T11:21:43.377+0200 7f1631ecb700  0 [devicehealth ERROR root] Fail to parse JSON result from daemon osd.20 ()
2020-12-21T11:21:51.485+0200 7f1631ecb700  0 [devicehealth ERROR root] Fail to parse JSON result from daemon osd.27 ()
2020-12-21T11:21:54.117+0200 7f1631ecb700  0 [devicehealth ERROR root] Fail to parse JSON result from daemon osd.30 ()
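
Not part of the original report: the empty parentheses suggest the OSDs returned nothing for the devicehealth module to parse. A hedged way to see what device-health data the cluster actually holds (run inside a mon or mgr container for the same ceph.conf reason noted above; <devid> is a placeholder taken from the `ceph device ls` output):

ceph device ls
ceph device get-health-metrics <devid>
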

2020-12-21T11:23:48.405+0200 7f1631ecb700  0 [devicehealth ERROR root] Fail to parse JSON result from daemon osd.140 ()
2020-12-21T11:31:21.206+0200 7f160e111700  0 [progress WARNING root] osd.140 marked out

ceph crash ls
Error initializing cluster client: ObjectNotFound('RADOS object not found (error calling conf_read_file)',)

Dec 21 13:15:41 B-01-34-cephctl docker[32628]: 2020-12-21T13:15:41.739+0200 7f64f0c82700 -1 client.0 error registering admin socket command: (17) File exists
Dec 21 13:15:41 B-01-34-cephctl docker[32628]: message repeated 4 times: [ 2020-12-21T13:15:41.739+0200 7f64f0c82700 -1 client.0 error registering admin socket command: (17) File exists]
Dec 21 13:15:42 B-01-34-cephctl docker[32628]: 2020-12-21T13:15:42.231+0200 7f64e35b3700  0 log_channel(cluster) log [DBG] : pgmap v3: 2241 pgs: 3 active+clean+scrubbing+deep, 2238 active+clean; 45 TiB data, 75 TiB used, 4.2 PiB / 4.2 PiB avail
Dec 21 13:15:42 B-01-34-cephctl docker[32628]: 2020-12-21T13:15:42.699+0200 7f64e45b5700  0 log_channel(cluster) log [DBG] : pgmap v4: 2241 pgs: 3 active+clean+scrubbing+deep, 2238 active+clean; 45 TiB data, 75 TiB used, 4.2 PiB / 4.2 PiB avail
Dec 21 13:15:42 B-01-34-cephctl docker[32628]: /usr/include/c++/8/bits/stl_algo.h:3721: constexpr const _Tp& std::clamp(const _Tp&, const _Tp&, const _Tp&) [with _Tp = unsigned int]: Assertion '!(__hi < __lo)' failed.
Dec 21 13:15:42 B-01-34-cephctl docker[32628]: *** Caught signal (Aborted) **
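
Not part of the original log: the failed assertion means std::clamp was called with hi < lo somewhere in the mgr before it aborted. Assuming crash reports do get posted (they currently fail with a permission error, see update #2 below), the matching backtrace could be pulled from inside a mon or mgr container with the crash module:

ceph crash ls
ceph crash info <crash-id>    # <crash-id>: an ID printed by `ceph crash ls`
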

Vars
Inventory - http://pastefile.fr/050f196318ed37375a12cfa879717d8f
Group vars - http://pastefile.fr/8e51fbd46706d650e4ab9b75b776cb8c
Crush map - http://pastefile.fr/afbb6fb06066333fac44d94c144c9446

Environment:
  • OS (e.g. from /etc/os-release): `18.04.5 LTS (Bionic Beaver)`
  • Kernel (e.g. `uname -a`): `5.4.0-58-generic #64~18.04.1-Ubuntu SMP Wed Dec 9`
  • Docker version if applicable (e.g. `docker version`): `client/server 19.03.6`
  • Ansible version (e.g. `ansible-playbook --version`): `ansible-playbook 2.9.0`
  • ceph-ansible version (e.g. `git head or tag or stable branch`): `stable-5.0`
  • Ceph version (e.g. `ceph -v`): `ceph version 15.2.8 (bdf3eebcd22d7d0b3dd4d5501bee5bac354d5b55) octopus (stable)`

Related issues

Duplicates: mgr - Bug #47132: mgr: Caught signal (Segmentation fault) thread_name:safe_timer (status: New)

History

#1 Updated by Jeremi A about 1 month ago

ubuntu@B-01-16-cephctl:~$ sudo docker exec -it ceph-mgr-B-01-16-cephctl ceph daemon mgr.B-01-16-cephctl perf dump 2>&1 | tee mgr.perf-dump.dave
{
    "AsyncMessenger::Worker-0": {
        "msgr_recv_messages": 6,
        "msgr_send_messages": 5,
        "msgr_recv_bytes": 668000,
        "msgr_send_bytes": 640,
        "msgr_created_connections": 4,
        "msgr_active_connections": 18446744073709551615,
...
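
Not part of the original comment: 18446744073709551615 is 2^64 - 1, i.e. what an unsigned 64-bit counter reports after being decremented below zero, so msgr_active_connections has underflowed rather than the mgr really tracking ~1.8e19 connections. A quick sanity check of the value:

python3 -c 'print(2**64 - 1)'    # prints 18446744073709551615
printf '%u\n' -1                 # same value when printed as an unsigned 64-bit integer
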

#2 Updated by Jeremi A about 1 month ago

root@B-01-16-cephctl:/etc/systemd/system/multi-user.target.wants# sudo docker logs ceph-crash-`hostname`
INFO:ceph-crash:monitoring path /var/lib/ceph/crash, delay 600s
WARNING:ceph-crash:post /var/lib/ceph/crash/2020-12-21T15:12:07.945396Z_9355f0a4-a727-4962-bf42-0a30f656b3bf as client.crash.B-01-16-cephctl failed: b'[errno 13] RADOS permission denied (error connecting to the cluster)\n'
WARNING:ceph-crash:post /var/lib/ceph/crash/2020-12-21T14:11:52.011757Z_9a4362a1-3c2a-4dae-bc01-031ca0495937 as client.crash.B-01-16-cephctl failed: b'[errno 13] RADOS permission denied (error connecting to the cluster)\n'
WARNING:ceph-crash:post /var/lib/ceph/crash/2020-12-21T14:04:55.808225Z_e2ddde68-6326-4138-8416-ebedac7465a1 as client.crash.B-01-16-cephctl failed: b'[errno 13] RADOS permission denied (error connecting to the cluster)\n'
WARNING:ceph-crash:post /var/lib/ceph/crash/2020-12-21T14:37:56.788143Z_adb7df3a-9d36-4466-9a29-754b54a9fbdf as client.crash.B-01-16-cephctl failed: b'[errno 13] RADOS permission denied (error connecting to the cluster)\n'
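
Not part of the original comment: these permission-denied posts usually mean the client.crash.<host> key is missing the caps the mgr crash module expects (mon and mgr 'profile crash'). A hedged check and fix, using the key name shown in the warnings:

ceph auth get client.crash.B-01-16-cephctl
ceph auth caps client.crash.B-01-16-cephctl mon 'profile crash' mgr 'profile crash'
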

#3 Updated by Brad Hubbard about 1 month ago

  • Project changed from Ceph to mgr
  • Category deleted (Monitor)

#4 Updated by Brad Hubbard about 1 month ago

  • Duplicates Bug #47132: mgr: Caught signal (Segmentation fault) thread_name:safe_timer added

#5 Updated by Neha Ojha 19 days ago

  • Status changed from New to Need More Info

Is it possible for you to capture a coredump and share it with us?
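
Not part of the tracker comment: one hedged way to grab at least a snapshot core of the running, containerized mgr from the host, assuming gdb (which provides gcore) is installed there. Capturing the core of the actual crash would additionally require the container's core ulimit and the host's kernel.core_pattern to be configured, which this sketch does not cover.

MGR_PID=$(pgrep -f ceph-mgr)                  # assumes a single ceph-mgr process on this host
sudo gcore -o /tmp/ceph-mgr.core "$MGR_PID"   # writes /tmp/ceph-mgr.core.<pid>
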

#6 Updated by Jeremi A 15 days ago

Neha Ojha wrote:

Is it possible for you to capture a coredump and share it with us?

Hi, we've rebuilt our cluster from scratch in the meantime. We had 312 OSDs, but rebuilding it with only 240 OSDs seems to have made it stable. However, this means we cannot grow the cluster or make use of all our nodes/OSDs.
