Bug #48689
Erratic MGR behaviour after new cluster install
Status: Closed
Description
What happened:
Finished building a new cluster, but I am experiencing the following issues:
- MGR constantly and randomly crashes, right from the start
- When I reboot a single host (with 24 OSDs) as part of a test, about 60 OSDs go offline (the host's 24 plus random others) and get marked as down. They don't come back online.
- The only way to get the OSDs back in again is to run `sudo docker restart <osd>` on the host
What you expected to happen:
- MGR not to crash constantly right after a vanilla build
- OSDs to come back in when a host reboots, or when they randomly restart
How to reproduce it (minimal and precise):
Brand new cluster; the issues appeared during the first test (taking 1 host down).
Pool `cephfs` (id 14) = EC 8+2, HDD only
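For context, the arithmetic behind the undersized PGs reported below: an EC 8+2 pool places 10 shards per PG, and (assuming Ceph's default `min_size = k + 1` for erasure-coded pools, introduced in Nautilus) a PG stays active with at most one shard missing. A quick sketch:

```python
# EC 8+2 profile: how many shards a PG needs, and how many it can lose.
k, m = 8, 2              # data shards, coding shards
size = k + m             # 10 shards per PG, one per failure domain
min_size = k + 1         # default EC min_size since Nautilus (assumption)
tolerated_down = size - min_size

print(size, min_size, tolerated_down)  # 10 9 1
```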
Logs:
MGR- http://pastefile.fr/cf09c6d2c02641b69b9feb1c4a7ba66a
ceph health detail (extract of one pg)
pg 14.7fc is stuck undersized for 13m, current state active+undersized+degraded+remapped+backfill_wait, last acting [201,216,84,198,94,214,16,289,262,2147483647]
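The value 2147483647 (0x7fffffff) in the acting set is CRUSH's "none" placeholder, i.e. CRUSH could not map one of the ten shards of this EC 8+2 PG to any OSD. A minimal sketch (hypothetical helper, not Ceph code) that flags such holes:

```python
# Detect CRUSH "none" placeholders in a PG's acting set.
# 0x7fffffff (2147483647) is the value Ceph prints when no OSD is mapped.
CRUSH_ITEM_NONE = 0x7FFFFFFF

def missing_shards(acting):
    """Return the shard indices that CRUSH failed to map to an OSD."""
    return [i for i, osd in enumerate(acting) if osd == CRUSH_ITEM_NONE]

acting = [201, 216, 84, 198, 94, 214, 16, 289, 262, 2147483647]
print(missing_shards(acting))  # [9]: the tenth shard of the 8+2 profile has no OSD
```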
ansible -i /opt/ilifu_ceph-config/inventory mons -m shell -a "sudo docker logs ceph-mgr-\`hostname\` 2>&1 | grep -i 'fail to parse'
B-01-34-cephctl.maas | CHANGED | rc=0 >>
2020-12-21T11:21:23.320+0200 7f1631ecb700 0 [devicehealth ERROR root] Fail to parse JSON result from daemon osd.1 ()
2020-12-21T11:21:24.352+0200 7f1631ecb700 0 [devicehealth ERROR root] Fail to parse JSON result from daemon osd.3 ()
2020-12-21T11:21:30.680+0200 7f1631ecb700 0 [devicehealth ERROR root] Fail to parse JSON result from daemon osd.9 ()
2020-12-21T11:21:43.377+0200 7f1631ecb700 0 [devicehealth ERROR root] Fail to parse JSON result from daemon osd.20 ()
2020-12-21T11:21:51.485+0200 7f1631ecb700 0 [devicehealth ERROR root] Fail to parse JSON result from daemon osd.27 ()
2020-12-21T11:21:54.117+0200 7f1631ecb700 0 [devicehealth ERROR root] Fail to parse JSON result from daemon osd.30 ()
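The empty parentheses at the end of each "Fail to parse JSON result" line suggest the daemon returned an empty reply, which any JSON parser rejects. A sketch of that failure mode (illustrative only, not the devicehealth module's actual code):

```python
import json

def parse_daemon_result(raw):
    """Mimic parsing a daemon's JSON reply; an empty reply reproduces the logged failure."""
    try:
        return json.loads(raw)
    except ValueError:
        # devicehealth would log: "Fail to parse JSON result from daemon osd.N ()"
        return None

print(parse_daemon_result(""))         # None: an empty reply cannot be parsed
print(parse_daemon_result('{"a": 1}')) # {'a': 1}
```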
ceph crash ls
2020-12-21T11:23:48.405+0200 7f1631ecb700 0 [devicehealth ERROR root] Fail to parse JSON result from daemon osd.140 ()
2020-12-21T11:31:21.206+0200 7f160e111700 0 [progress WARNING root] osd.140 marked out
Error initializing cluster client: ObjectNotFound('RADOS object not found (error calling conf_read_file)',)
Dec 21 13:15:41 B-01-34-cephctl docker[32628]: 2020-12-21T13:15:41.739+0200 7f64f0c82700 -1 client.0 error registering admin socket command: (17) File exists
Dec 21 13:15:41 B-01-34-cephctl docker[32628]: message repeated 4 times: [ 2020-12-21T13:15:41.739+0200 7f64f0c82700 -1 client.0 error registering admin socket command: (17) File exists]
Dec 21 13:15:42 B-01-34-cephctl docker[32628]: 2020-12-21T13:15:42.231+0200 7f64e35b3700 0 log_channel(cluster) log [DBG] : pgmap v3: 2241 pgs: 3 active+clean+scrubbing+deep, 2238 active+clean; 45 TiB data, 75 TiB used, 4.2 PiB / 4.2 PiB avail
Dec 21 13:15:42 B-01-34-cephctl docker[32628]: 2020-12-21T13:15:42.699+0200 7f64e45b5700 0 log_channel(cluster) log [DBG] : pgmap v4: 2241 pgs: 3 active+clean+scrubbing+deep, 2238 active+clean; 45 TiB data, 75 TiB used, 4.2 PiB / 4.2 PiB avail
Dec 21 13:15:42 B-01-34-cephctl docker[32628]: /usr/include/c++/8/bits/stl_algo.h:3721: constexpr const _Tp& std::clamp(const _Tp&, const _Tp&, const _Tp&) [with _Tp = unsigned int]: Assertion '!(__hi < __lo)' failed.
Dec 21 13:15:42 B-01-34-cephctl docker[32628]: *** Caught signal (Aborted) **
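The fatal line is libstdc++'s precondition check for `std::clamp`: the call aborts whenever the upper bound is below the lower bound, which matches the abort signal that follows. A minimal Python sketch mirroring that precondition (not the mgr's actual code path):

```python
def clamp(value, lo, hi):
    """Mirror std::clamp's contract: callers must guarantee lo <= hi."""
    # libstdc++ aborts with "Assertion '!(__hi < __lo)' failed." in this case
    assert not (hi < lo), "inverted bounds: hi < lo"
    return max(lo, min(value, hi))

print(clamp(5, 0, 10))  # 5

try:
    clamp(5, 10, 0)  # inverted bounds: analogous to the crash in the mgr log
except AssertionError as e:
    print("aborted:", e)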
Vars
Inventory - http://pastefile.fr/050f196318ed37375a12cfa879717d8f
Group vars - http://pastefile.fr/8e51fbd46706d650e4ab9b75b776cb8c
Crush map - http://pastefile.fr/afbb6fb06066333fac44d94c144c9446
- OS (e.g. from /etc/os-release): `18.04.5 LTS (Bionic Beaver)`
- Kernel (e.g. `uname -a`): `5.4.0-58-generic #64~18.04.1-Ubuntu SMP Wed Dec 9`
- Docker version if applicable (e.g. `docker version`): `client/server 19.03.6`
- Ansible version (e.g. `ansible-playbook --version`): `ansible-playbook 2.9.0`
- ceph-ansible version (e.g. `git head or tag or stable branch`): `stable-5.0`
- Ceph version (e.g. `ceph -v`): `ceph version 15.2.8 (bdf3eebcd22d7d0b3dd4d5501bee5bac354d5b55) octopus (stable)`
Updated by Jeremi A over 3 years ago
ubuntu@B-01-16-cephctl:~$ sudo docker exec -it ceph-mgr-B-01-16-cephctl ceph daemon mgr.B-01-16-cephctl perf dump 2>&1 | tee mgr.perf-dump.dave
{
    "AsyncMessenger::Worker-0": {
        "msgr_recv_messages": 6,
        "msgr_send_messages": 5,
        "msgr_recv_bytes": 668000,
        "msgr_send_bytes": 640,
        "msgr_created_connections": 4,
        "msgr_active_connections": 18446744073709551615,
...
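Note the `msgr_active_connections` value: 18446744073709551615 is 2^64 - 1, which suggests an unsigned 64-bit counter that was decremented below zero and wrapped. Reinterpreted as a signed value it reads -1, a quick check one can do on any suspect perf counter:

```python
# Reinterpret a suspiciously huge unsigned 64-bit perf counter as signed
# two's complement to reveal a likely underflow.
def as_signed_64(u):
    """Map an unsigned 64-bit value to its signed interpretation."""
    return u - 2**64 if u >= 2**63 else u

print(as_signed_64(18446744073709551615))  # -1: the counter underflowed
print(as_signed_64(4))                     # 4: small values are unchanged
```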
Updated by Jeremi A over 3 years ago
root@B-01-16-cephctl:/etc/systemd/system/multi-user.target.wants# sudo docker logs ceph-crash-`hostname`
INFO:ceph-crash:monitoring path /var/lib/ceph/crash, delay 600s
WARNING:ceph-crash:post /var/lib/ceph/crash/2020-12-21T15:12:07.945396Z_9355f0a4-a727-4962-bf42-0a30f656b3bf as client.crash.B-01-16-cephctl failed: b'[errno 13] RADOS permission denied (error connecting to the cluster)\n'
WARNING:ceph-crash:post /var/lib/ceph/crash/2020-12-21T14:11:52.011757Z_9a4362a1-3c2a-4dae-bc01-031ca0495937 as client.crash.B-01-16-cephctl failed: b'[errno 13] RADOS permission denied (error connecting to the cluster)\n'
WARNING:ceph-crash:post /var/lib/ceph/crash/2020-12-21T14:04:55.808225Z_e2ddde68-6326-4138-8416-ebedac7465a1 as client.crash.B-01-16-cephctl failed: b'[errno 13] RADOS permission denied (error connecting to the cluster)\n'
WARNING:ceph-crash:post /var/lib/ceph/crash/2020-12-21T14:37:56.788143Z_adb7df3a-9d36-4466-9a29-754b54a9fbdf as client.crash.B-01-16-cephctl failed: b'[errno 13] RADOS permission denied (error connecting to the cluster)\n'
Updated by Brad Hubbard over 3 years ago
- Project changed from Ceph to mgr
- Category deleted (Monitor)
Updated by Brad Hubbard over 3 years ago
- Is duplicate of Bug #47132: mgr: Caught signal (Segmentation fault) thread_name:safe_timer added
Updated by Neha Ojha over 3 years ago
- Status changed from New to Need More Info
Is it possible for you to capture a coredump and share it with us?
Updated by Jeremi A over 3 years ago
Neha Ojha wrote:
Is it possible for you to capture a coredump and share it with us?
Hi, we've rebuilt our cluster from scratch in the meantime. We had 312 OSDs, but rebuilding it with only 240 OSDs seems to have made it stable. However, this means we cannot grow the cluster or make use of all our nodes/OSDs.
Updated by Yaarit Hatuka over 2 years ago
- Status changed from Need More Info to Duplicate