Bug #48689
Erratic MGR behaviour after new cluster install
Status: Closed
Description
What happened:
Finished building a new cluster, but I am experiencing the following issues:
- MGR constantly and randomly crashes, right from the start
- When I reboot a single host (with 24 OSDs) as part of a test, about 60 OSDs go offline (the host's 24 plus random others) and get marked as down. They don't come back online.
- The only way to get the OSDs back in again is to run `sudo docker restart <osd>` on the host
What you expected to happen:
- MGR not to crash constantly right after a vanilla build
- OSDs to come back in when a host reboots, or when they randomly restart
How to reproduce it (minimal and precise):
Brand new cluster; the issues appeared during the first test (taking 1 host down).
Pool `cephfs` (id 14) = EC 8+2, HDD only
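For context, the arithmetic behind the undersized PGs reported below: an EC 8+2 pool places 10 shards per PG, and (assuming Ceph's default `min_size = k + 1` for erasure-coded pools, introduced in Nautilus) a PG stays active with at most one shard missing. A quick sketch:

```python
# EC 8+2 profile: how many shards a PG needs, and how many it can lose.
k, m = 8, 2              # data shards, coding shards
size = k + m             # 10 shards per PG, one per failure domain
min_size = k + 1         # default EC min_size since Nautilus (assumption)
tolerated_down = size - min_size

print(size, min_size, tolerated_down)  # 10 9 1
```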
Logs:
MGR- http://pastefile.fr/cf09c6d2c02641b69b9feb1c4a7ba66a
ceph health detail (extract of one pg)
pg 14.7fc is stuck undersized for 13m, current state active+undersized+degraded+remapped+backfill_wait, last acting [201,216,84,198,94,214,16,289,262,2147483647]
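The value 2147483647 (0x7fffffff) in the acting set is CRUSH's "none" placeholder, i.e. CRUSH could not map one of the ten shards of this EC 8+2 PG to any OSD. A minimal sketch (hypothetical helper, not Ceph code) that flags such holes:

```python
# Detect CRUSH "none" placeholders in a PG's acting set.
# 0x7fffffff (2147483647) is the value Ceph prints when no OSD is mapped.
CRUSH_ITEM_NONE = 0x7FFFFFFF

def missing_shards(acting):
    """Return the shard indices that CRUSH failed to map to an OSD."""
    return [i for i, osd in enumerate(acting) if osd == CRUSH_ITEM_NONE]

acting = [201, 216, 84, 198, 94, 214, 16, 289, 262, 2147483647]
print(missing_shards(acting))  # [9]: the tenth shard of the 8+2 profile has no OSD
```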
ansible -i /opt/ilifu_ceph-config/inventory mons -m shell -a "sudo docker logs ceph-mgr-\`hostname\` 2>&1 | grep -i 'fail to parse'
B-01-34-cephctl.maas | CHANGED | rc=0 >>
2020-12-21T11:21:23.320+0200 7f1631ecb700 0 [devicehealth ERROR root] Fail to parse JSON result from daemon osd.1 ()
2020-12-21T11:21:24.352+0200 7f1631ecb700 0 [devicehealth ERROR root] Fail to parse JSON result from daemon osd.3 ()
2020-12-21T11:21:30.680+0200 7f1631ecb700 0 [devicehealth ERROR root] Fail to parse JSON result from daemon osd.9 ()
2020-12-21T11:21:43.377+0200 7f1631ecb700 0 [devicehealth ERROR root] Fail to parse JSON result from daemon osd.20 ()
2020-12-21T11:21:51.485+0200 7f1631ecb700 0 [devicehealth ERROR root] Fail to parse JSON result from daemon osd.27 ()
2020-12-21T11:21:54.117+0200 7f1631ecb700 0 [devicehealth ERROR root] Fail to parse JSON result from daemon osd.30 ()
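The empty parentheses at the end of each "Fail to parse JSON result" line suggest the daemon returned an empty reply, which any JSON parser rejects. A sketch of that failure mode (illustrative only, not the devicehealth module's actual code):

```python
import json

def parse_daemon_result(raw):
    """Mimic parsing a daemon's JSON reply; an empty reply reproduces the logged failure."""
    try:
        return json.loads(raw)
    except ValueError:
        # devicehealth would log: "Fail to parse JSON result from daemon osd.N ()"
        return None

print(parse_daemon_result(""))         # None: an empty reply cannot be parsed
print(parse_daemon_result('{"a": 1}')) # {'a': 1}
```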
ceph crash ls
2020-12-21T11:23:48.405+0200 7f1631ecb700 0 [devicehealth ERROR root] Fail to parse JSON result from daemon osd.140 ()
2020-12-21T11:31:21.206+0200 7f160e111700 0 [progress WARNING root] osd.140 marked out
Error initializing cluster client: ObjectNotFound('RADOS object not found (error calling conf_read_file)',)
Dec 21 13:15:41 B-01-34-cephctl docker[32628]: 2020-12-21T13:15:41.739+0200 7f64f0c82700 -1 client.0 error registering admin socket command: (17) File exists
Dec 21 13:15:41 B-01-34-cephctl docker[32628]: message repeated 4 times: [ 2020-12-21T13:15:41.739+0200 7f64f0c82700 -1 client.0 error registering admin socket command: (17) File exists]
Dec 21 13:15:42 B-01-34-cephctl docker[32628]: 2020-12-21T13:15:42.231+0200 7f64e35b3700 0 log_channel(cluster) log [DBG] : pgmap v3: 2241 pgs: 3 active+clean+scrubbing+deep, 2238 active+clean; 45 TiB data, 75 TiB used, 4.2 PiB / 4.2 PiB avail
Dec 21 13:15:42 B-01-34-cephctl docker[32628]: 2020-12-21T13:15:42.699+0200 7f64e45b5700 0 log_channel(cluster) log [DBG] : pgmap v4: 2241 pgs: 3 active+clean+scrubbing+deep, 2238 active+clean; 45 TiB data, 75 TiB used, 4.2 PiB / 4.2 PiB avail
Dec 21 13:15:42 B-01-34-cephctl docker[32628]: /usr/include/c++/8/bits/stl_algo.h:3721: constexpr const _Tp& std::clamp(const _Tp&, const _Tp&, const _Tp&) [with _Tp = unsigned int]: Assertion '!(__hi < __lo)' failed.
Dec 21 13:15:42 B-01-34-cephctl docker[32628]: *** Caught signal (Aborted) **
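The fatal line is libstdc++'s precondition check for `std::clamp`: the call aborts whenever the upper bound is below the lower bound, which matches the abort signal that follows. A minimal Python sketch mirroring that precondition (not the mgr's actual code path):

```python
def clamp(value, lo, hi):
    """Mirror std::clamp's contract: callers must guarantee lo <= hi."""
    # libstdc++ aborts with "Assertion '!(__hi < __lo)' failed." in this case
    assert not (hi < lo), "inverted bounds: hi < lo"
    return max(lo, min(value, hi))

print(clamp(5, 0, 10))  # 5

try:
    clamp(5, 10, 0)  # inverted bounds: analogous to the crash in the mgr log
except AssertionError as e:
    print("aborted:", e)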
Vars
Inventory - http://pastefile.fr/050f196318ed37375a12cfa879717d8f
Group vars - http://pastefile.fr/8e51fbd46706d650e4ab9b75b776cb8c
Crush map - http://pastefile.fr/afbb6fb06066333fac44d94c144c9446
- OS (e.g. from /etc/os-release): `18.04.5 LTS (Bionic Beaver)`
- Kernel (e.g. `uname -a`): `5.4.0-58-generic #64~18.04.1-Ubuntu SMP Wed Dec 9`
- Docker version if applicable (e.g. `docker version`): `client/server 19.03.6`
- Ansible version (e.g. `ansible-playbook --version`): `ansible-playbook 2.9.0`
- ceph-ansible version (e.g. `git head or tag or stable branch`): `stable-5.0`
- Ceph version (e.g. `ceph -v`): `ceph version 15.2.8 (bdf3eebcd22d7d0b3dd4d5501bee5bac354d5b55) octopus (stable)`
Updated by Jeremi A over 3 years ago
ubuntu@B-01-16-cephctl:~$ sudo docker exec -it ceph-mgr-B-01-16-cephctl ceph daemon mgr.B-01-16-cephctl perf dump 2>&1 | tee mgr.perf-dump.dave
{
    "AsyncMessenger::Worker-0": {
        "msgr_recv_messages": 6,
        "msgr_send_messages": 5,
        "msgr_recv_bytes": 668000,
        "msgr_send_bytes": 640,
        "msgr_created_connections": 4,
        "msgr_active_connections": 18446744073709551615,
...
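Note the `msgr_active_connections` value: 18446744073709551615 is 2^64 - 1, which suggests an unsigned 64-bit counter that was decremented below zero and wrapped. Reinterpreted as a signed value it reads -1, a quick check one can do on any suspect perf counter:

```python
# Reinterpret a suspiciously huge unsigned 64-bit perf counter as signed
# two's complement to reveal a likely underflow.
def as_signed_64(u):
    """Map an unsigned 64-bit value to its signed interpretation."""
    return u - 2**64 if u >= 2**63 else u

print(as_signed_64(18446744073709551615))  # -1: the counter underflowed
print(as_signed_64(4))                     # 4: small values are unchanged
```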
Updated by Jeremi A over 3 years ago
root@B-01-16-cephctl:/etc/systemd/system/multi-user.target.wants# sudo docker logs ceph-crash-`hostname`
INFO:ceph-crash:monitoring path /var/lib/ceph/crash, delay 600s
WARNING:ceph-crash:post /var/lib/ceph/crash/2020-12-21T15:12:07.945396Z_9355f0a4-a727-4962-bf42-0a30f656b3bf as client.crash.B-01-16-cephctl failed: b'[errno 13] RADOS permission denied (error connecting to the cluster)\n'
WARNING:ceph-crash:post /var/lib/ceph/crash/2020-12-21T14:11:52.011757Z_9a4362a1-3c2a-4dae-bc01-031ca0495937 as client.crash.B-01-16-cephctl failed: b'[errno 13] RADOS permission denied (error connecting to the cluster)\n'
WARNING:ceph-crash:post /var/lib/ceph/crash/2020-12-21T14:04:55.808225Z_e2ddde68-6326-4138-8416-ebedac7465a1 as client.crash.B-01-16-cephctl failed: b'[errno 13] RADOS permission denied (error connecting to the cluster)\n'
WARNING:ceph-crash:post /var/lib/ceph/crash/2020-12-21T14:37:56.788143Z_adb7df3a-9d36-4466-9a29-754b54a9fbdf as client.crash.B-01-16-cephctl failed: b'[errno 13] RADOS permission denied (error connecting to the cluster)\n'
Updated by Brad Hubbard over 3 years ago
- Project changed from Ceph to mgr
- Category deleted (Monitor)
Updated by Brad Hubbard over 3 years ago
- Is duplicate of Bug #47132: mgr: Caught signal (Segmentation fault) thread_name:safe_timer added
Updated by Neha Ojha over 3 years ago
- Status changed from New to Need More Info
Is it possible for you to capture a coredump and share it with us?
Updated by Jeremi A over 3 years ago
Neha Ojha wrote:
Is it possible for you to capture a coredump and share it with us?
Hi, we've rebuilt our cluster from scratch in the meantime. We had 312 OSDs, but rebuilding it with only 240 OSDs seems to have made it stable. However, this means we cannot grow the cluster or make use of all our nodes/OSDs.
Updated by Yaarit Hatuka over 2 years ago
- Status changed from Need More Info to Duplicate