Bug #48689

Erratic MGR behaviour after new cluster install

Added by Jeremi A over 3 years ago. Updated almost 3 years ago.

Status: Duplicate
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Community (user)
Tags: mgr
Backport: -
Regression: No
Severity: 1 - critical
Reviewed: -
Affected Versions: -
ceph-qa-suite: ceph-ansible
Pull request ID: -
Crash signature (v1): -
Crash signature (v2): -

Description

What happened:

Finished building a new cluster, but I'm experiencing the following issues:
  • The MGR crashes constantly and randomly, right from the start
  • When I reboot a single host (with 24 OSDs) as part of a test, about 60 OSDs (the host's 24 plus random others) go offline and get marked down, and they don't come back online
  • The only way to get the OSDs back in again is to `sudo docker restart <osd>` on the host (a scripted sketch of this workaround follows below)
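As a stop-gap, the per-host restart can be scripted. A minimal sketch, assuming this ceph-ansible Docker deployment names its OSD containers with a "ceph-osd" prefix (verify with `docker ps -a` and adjust the name filter if yours differ):

# Restart every exited OSD container on this host.
# Assumes container names start with "ceph-osd" (an assumption, not verified here).
sudo docker ps -a --filter "name=ceph-osd" --filter "status=exited" -q \
  | xargs -r sudo docker restart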
What you expected to happen:
  • The MGR not to crash constantly right after a vanilla build
  • OSDs to come back in when a host reboots, or when they randomly restart

How to reproduce it (minimal and precise):
Brand-new cluster; the issues appeared during the very first test (taking one host down).
cephfs data pool (id 14) = EC 8+2, HDD only
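For reference, the pool layout can be confirmed as follows. This is a sketch; `cephfs_data` is a placeholder for the actual data pool name (pool id 14 here):

# Confirm the erasure-code profile behind the cephfs data pool.
ceph osd pool get cephfs_data erasure_code_profile
ceph osd erasure-code-profile get <profile-name>
# min_size determines whether PGs stay active when shards are lost:
ceph osd pool get cephfs_data min_size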

Logs:
MGR: http://pastefile.fr/cf09c6d2c02641b69b9feb1c4a7ba66a

ceph health detail (extract for one PG)

 pg 14.7fc is stuck undersized for 13m, current state active+undersized+degraded+remapped+backfill_wait, last acting [201,216,84,198,94,214,16,289,262,2147483647]
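Note that 2147483647 (0x7fffffff) in the acting set is CRUSH's "none" placeholder: no OSD is currently mapped to that EC shard. The mapping can be inspected like this (a sketch using the PG id from the extract above):

# Show the up/acting OSD sets for the stuck PG:
ceph pg map 14.7fc
# Full peering/recovery state, including why a shard is unmapped:
ceph pg 14.7fc query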
ansible -i /opt/ilifu_ceph-config/inventory mons -m shell -a "sudo docker logs ceph-mgr-\`hostname\` 2>&1 | grep -i 'fail to parse'"
B-01-34-cephctl.maas | CHANGED | rc=0 >>
2020-12-21T11:21:23.320+0200 7f1631ecb700  0 [devicehealth ERROR root] Fail to parse JSON result from daemon osd.1 ()
2020-12-21T11:21:24.352+0200 7f1631ecb700  0 [devicehealth ERROR root] Fail to parse JSON result from daemon osd.3 ()
2020-12-21T11:21:30.680+0200 7f1631ecb700  0 [devicehealth ERROR root] Fail to parse JSON result from daemon osd.9 ()
2020-12-21T11:21:43.377+0200 7f1631ecb700  0 [devicehealth ERROR root] Fail to parse JSON result from daemon osd.20 ()
2020-12-21T11:21:51.485+0200 7f1631ecb700  0 [devicehealth ERROR root] Fail to parse JSON result from daemon osd.27 ()
2020-12-21T11:21:54.117+0200 7f1631ecb700  0 [devicehealth ERROR root] Fail to parse JSON result from daemon osd.30 ()
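These messages come from the mgr devicehealth module, which polls each OSD for SMART data and here gets back an empty result it cannot parse. As a test, the polling can be exercised by hand and then switched off; a sketch, where `ceph-osd-1` is an assumed container name for osd.1's host:

# Ask one OSD for its SMART data directly over the admin socket
# (run on the OSD's host; container name is an assumption):
sudo docker exec ceph-osd-1 ceph daemon osd.1 smart
# Stop devicehealth from polling while debugging:
ceph device monitoring off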

Further MGR log lines:

2020-12-21T11:23:48.405+0200 7f1631ecb700  0 [devicehealth ERROR root] Fail to parse JSON result from daemon osd.140 ()
2020-12-21T11:31:21.206+0200 7f160e111700  0 [progress WARNING root] osd.140 marked out

ceph crash ls
Error initializing cluster client: ObjectNotFound('RADOS object not found (error calling conf_read_file)',)
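The `conf_read_file` error only means the `ceph` CLI on the host found no /etc/ceph/ceph.conf; it is unrelated to the crash itself. Two ways around it (a sketch; the mon container name mirrors the ceph-mgr naming used earlier and is an assumption):

# Run the CLI inside a container that already has a ceph.conf...
sudo docker exec ceph-mon-$(hostname) ceph crash ls
# ...or point the host CLI at an explicit conf and keyring:
ceph -c /etc/ceph/ceph.conf -k /etc/ceph/ceph.client.admin.keyring crash ls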

Dec 21 13:15:41 B-01-34-cephctl docker[32628]: 2020-12-21T13:15:41.739+0200 7f64f0c82700 -1 client.0 error registering admin socket command: (17) File exists
Dec 21 13:15:41 B-01-34-cephctl docker[32628]: message repeated 4 times: [ 2020-12-21T13:15:41.739+0200 7f64f0c82700 -1 client.0 error registering admin socket command: (17) File exists]
Dec 21 13:15:42 B-01-34-cephctl docker[32628]: 2020-12-21T13:15:42.231+0200 7f64e35b3700  0 log_channel(cluster) log [DBG] : pgmap v3: 2241 pgs: 3 active+clean+scrubbing+deep, 2238 active+clean; 45 TiB data, 75 TiB used, 4.2 PiB / 4.2 PiB avail
Dec 21 13:15:42 B-01-34-cephctl docker[32628]: 2020-12-21T13:15:42.699+0200 7f64e45b5700  0 log_channel(cluster) log [DBG] : pgmap v4: 2241 pgs: 3 active+clean+scrubbing+deep, 2238 active+clean; 45 TiB data, 75 TiB used, 4.2 PiB / 4.2 PiB avail
Dec 21 13:15:42 B-01-34-cephctl docker[32628]: /usr/include/c++/8/bits/stl_algo.h:3721: constexpr const _Tp& std::clamp(const _Tp&, const _Tp&, const _Tp&) [with _Tp = unsigned int]: Assertion '!(__hi < __lo)' failed.
Dec 21 13:15:42 B-01-34-cephctl docker[32628]: *** Caught signal (Aborted) **
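The abort in std::clamp above is the actual mgr crash. Once the CLI works (see the conf_read_file note above), the recorded crashes and their backtraces can be pulled for comparison with the linked duplicate; a sketch, with <crash-id> a placeholder for an id from the listing:

# List crashes recorded by the cluster, then dump one backtrace:
ceph crash ls
ceph crash info <crash-id>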

Vars
Inventory - http://pastefile.fr/050f196318ed37375a12cfa879717d8f
Group vars - http://pastefile.fr/8e51fbd46706d650e4ab9b75b776cb8c
Crush map - http://pastefile.fr/afbb6fb06066333fac44d94c144c9446

Environment:
  • OS (e.g. from /etc/os-release): `18.04.5 LTS (Bionic Beaver)`
  • Kernel (e.g. `uname -a`): `5.4.0-58-generic #64~18.04.1-Ubuntu SMP Wed Dec 9`
  • Docker version if applicable (e.g. `docker version`): `client/server 19.03.6`
  • Ansible version (e.g. `ansible-playbook --version`): `ansible-playbook 2.9.0`
  • ceph-ansible version (e.g. `git head or tag or stable branch`): `stable-5.0`
  • Ceph version (e.g. `ceph -v`): `ceph version 15.2.8 (bdf3eebcd22d7d0b3dd4d5501bee5bac354d5b55) octopus (stable)`

Related issues: 1 (1 open, 0 closed)

Is duplicate of mgr - Bug #47132: mgr: Caught signal (Segmentation fault) thread_name:safe_timer (status: Need More Info)
