Bug #48689

Erratic MGR behaviour after new cluster install

Added by Jeremi A over 3 years ago. Updated almost 3 years ago.

Status: Duplicate
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Community (user)
Tags: mgr
Backport: -
Regression: No
Severity: 1 - critical
Reviewed: -
Affected Versions: -
ceph-qa-suite: ceph-ansible
Pull request ID: -
Crash signature (v1): -
Crash signature (v2): -

Description

What happened:

Finished building a new cluster, but I'm experiencing the following issues:
  • The MGR crashes constantly and randomly, right from the start
  • When I reboot a single host (with 24 OSDs) as part of a test, about 60 OSDs (the host's 24 plus random others) go offline and get marked down, and they don't come back online
  • The only way to get the OSDs back in again is to `sudo docker restart <osd>` on the host (a scripted sketch of this workaround follows below)
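As a stop-gap, the per-host restart can be scripted. A minimal sketch, assuming this ceph-ansible Docker deployment names its OSD containers with a "ceph-osd" prefix (verify with `docker ps -a` and adjust the name filter if yours differ):

# Restart every exited OSD container on this host.
# Assumes container names start with "ceph-osd" (an assumption, not verified here).
sudo docker ps -a --filter "name=ceph-osd" --filter "status=exited" -q \
  | xargs -r sudo docker restart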
What you expected to happen:
  • The MGR not to crash constantly right after a vanilla build
  • OSDs to come back in when a host reboots, or when they randomly restart

How to reproduce it (minimal and precise):
Brand-new cluster; the issues appeared during the very first test (taking one host down).
cephfs data pool (id 14) = EC 8+2, HDD only
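For reference, the pool layout can be confirmed as follows. This is a sketch; `cephfs_data` is a placeholder for the actual data pool name (pool id 14 here):

# Confirm the erasure-code profile behind the cephfs data pool.
ceph osd pool get cephfs_data erasure_code_profile
ceph osd erasure-code-profile get <profile-name>
# min_size determines whether PGs stay active when shards are lost:
ceph osd pool get cephfs_data min_size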

Logs:
MGR: http://pastefile.fr/cf09c6d2c02641b69b9feb1c4a7ba66a

ceph health detail (extract for one PG)

 pg 14.7fc is stuck undersized for 13m, current state active+undersized+degraded+remapped+backfill_wait, last acting [201,216,84,198,94,214,16,289,262,2147483647]
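Note that 2147483647 (0x7fffffff) in the acting set is CRUSH's "none" placeholder: no OSD is currently mapped to that EC shard. The mapping can be inspected like this (a sketch using the PG id from the extract above):

# Show the up/acting OSD sets for the stuck PG:
ceph pg map 14.7fc
# Full peering/recovery state, including why a shard is unmapped:
ceph pg 14.7fc query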
ansible -i /opt/ilifu_ceph-config/inventory mons -m shell -a "sudo docker logs ceph-mgr-\`hostname\` 2>&1 | grep -i 'fail to parse'"
B-01-34-cephctl.maas | CHANGED | rc=0 >>
2020-12-21T11:21:23.320+0200 7f1631ecb700  0 [devicehealth ERROR root] Fail to parse JSON result from daemon osd.1 ()
2020-12-21T11:21:24.352+0200 7f1631ecb700  0 [devicehealth ERROR root] Fail to parse JSON result from daemon osd.3 ()
2020-12-21T11:21:30.680+0200 7f1631ecb700  0 [devicehealth ERROR root] Fail to parse JSON result from daemon osd.9 ()
2020-12-21T11:21:43.377+0200 7f1631ecb700  0 [devicehealth ERROR root] Fail to parse JSON result from daemon osd.20 ()
2020-12-21T11:21:51.485+0200 7f1631ecb700  0 [devicehealth ERROR root] Fail to parse JSON result from daemon osd.27 ()
2020-12-21T11:21:54.117+0200 7f1631ecb700  0 [devicehealth ERROR root] Fail to parse JSON result from daemon osd.30 ()
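These messages come from the mgr devicehealth module, which polls each OSD for SMART data and here gets back an empty result it cannot parse. As a test, the polling can be exercised by hand and then switched off; a sketch, where `ceph-osd-1` is an assumed container name for osd.1's host:

# Ask one OSD for its SMART data directly over the admin socket
# (run on the OSD's host; container name is an assumption):
sudo docker exec ceph-osd-1 ceph daemon osd.1 smart
# Stop devicehealth from polling while debugging:
ceph device monitoring off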

Further MGR log lines:

2020-12-21T11:23:48.405+0200 7f1631ecb700  0 [devicehealth ERROR root] Fail to parse JSON result from daemon osd.140 ()
2020-12-21T11:31:21.206+0200 7f160e111700  0 [progress WARNING root] osd.140 marked out

ceph crash ls
Error initializing cluster client: ObjectNotFound('RADOS object not found (error calling conf_read_file)',)
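The `conf_read_file` error only means the `ceph` CLI on the host found no /etc/ceph/ceph.conf; it is unrelated to the crash itself. Two ways around it (a sketch; the mon container name mirrors the ceph-mgr naming used earlier and is an assumption):

# Run the CLI inside a container that already has a ceph.conf...
sudo docker exec ceph-mon-$(hostname) ceph crash ls
# ...or point the host CLI at an explicit conf and keyring:
ceph -c /etc/ceph/ceph.conf -k /etc/ceph/ceph.client.admin.keyring crash ls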

Dec 21 13:15:41 B-01-34-cephctl docker[32628]: 2020-12-21T13:15:41.739+0200 7f64f0c82700 -1 client.0 error registering admin socket command: (17) File exists
Dec 21 13:15:41 B-01-34-cephctl docker[32628]: message repeated 4 times: [ 2020-12-21T13:15:41.739+0200 7f64f0c82700 -1 client.0 error registering admin socket command: (17) File exists]
Dec 21 13:15:42 B-01-34-cephctl docker[32628]: 2020-12-21T13:15:42.231+0200 7f64e35b3700  0 log_channel(cluster) log [DBG] : pgmap v3: 2241 pgs: 3 active+clean+scrubbing+deep, 2238 active+clean; 45 TiB data, 75 TiB used, 4.2 PiB / 4.2 PiB avail
Dec 21 13:15:42 B-01-34-cephctl docker[32628]: 2020-12-21T13:15:42.699+0200 7f64e45b5700  0 log_channel(cluster) log [DBG] : pgmap v4: 2241 pgs: 3 active+clean+scrubbing+deep, 2238 active+clean; 45 TiB data, 75 TiB used, 4.2 PiB / 4.2 PiB avail
Dec 21 13:15:42 B-01-34-cephctl docker[32628]: /usr/include/c++/8/bits/stl_algo.h:3721: constexpr const _Tp& std::clamp(const _Tp&, const _Tp&, const _Tp&) [with _Tp = unsigned int]: Assertion '!(__hi < __lo)' failed.
Dec 21 13:15:42 B-01-34-cephctl docker[32628]: *** Caught signal (Aborted) **
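The abort in std::clamp above is the actual mgr crash. Once the CLI works (see the conf_read_file note above), the recorded crashes and their backtraces can be pulled for comparison with the linked duplicate; a sketch, with <crash-id> a placeholder for an id from the listing:

# List crashes recorded by the cluster, then dump one backtrace:
ceph crash ls
ceph crash info <crash-id>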

Vars
Inventory - http://pastefile.fr/050f196318ed37375a12cfa879717d8f
Group vars - http://pastefile.fr/8e51fbd46706d650e4ab9b75b776cb8c
Crush map - http://pastefile.fr/afbb6fb06066333fac44d94c144c9446

Environment:
  • OS (e.g. from /etc/os-release): `18.04.5 LTS (Bionic Beaver)`
  • Kernel (e.g. `uname -a`): `5.4.0-58-generic #64~18.04.1-Ubuntu SMP Wed Dec 9`
  • Docker version if applicable (e.g. `docker version`): `client/server 19.03.6`
  • Ansible version (e.g. `ansible-playbook --version`): `ansible-playbook 2.9.0`
  • ceph-ansible version (e.g. `git head or tag or stable branch`): `stable-5.0`
  • Ceph version (e.g. `ceph -v`): `ceph version 15.2.8 (bdf3eebcd22d7d0b3dd4d5501bee5bac354d5b55) octopus (stable)`

Related issues: 1 (1 open, 0 closed)

Is duplicate of mgr - Bug #47132: mgr: Caught signal (Segmentation fault) thread_name:safe_timer (status: Need More Info)
