Actions
Bug #57964
openCephadm: MONs are not getting back after /var/log filesystem is full
Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:
0%
Source:
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Description
I had at least two examples where Cephadm controlled cluster had /var/log full in CentOS Linux and it obviously caused a crash of the MONs. After cleaning up some logs and/or expanding the filesystem, MONs are unable to get back and form a cluster. There's a huge flood of the error messages in system's journal, MONs still do claim there's no space available:
Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.105444+0000 mon.w04s01 (mon.0) 6425 : cluster [WRN] --- Logging error --- Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.105463+0000 mon.w04s01 (mon.0) 6426 : cluster [WRN] Traceback (most recent call last): Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.105479+0000 mon.w04s01 (mon.0) 6427 : cluster [WRN] File "/usr/lib64/python3.6/logging/__init__.py", line 998, in emit Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.105493+0000 mon.w04s01 (mon.0) 6428 : cluster [WRN] self.flush() Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.105515+0000 mon.w04s01 (mon.0) 6429 : cluster [WRN] File "/usr/lib64/python3.6/logging/__init__.py", line 978, in flush Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.105533+0000 mon.w04s01 (mon.0) 6430 : cluster [WRN] self.stream.flush() Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.105550+0000 mon.w04s01 (mon.0) 6431 : cluster [WRN] OSError: [Errno 28] No space left on device Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.105566+0000 mon.w04s01 (mon.0) 6432 : cluster [WRN] Call stack: Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.105585+0000 mon.w04s01 (mon.0) 6433 : cluster [WRN] File "/var/lib/ceph/6710a0d0-3d6d-11ed-a0be-b47af1c197ec/cephadm.9f5947044415209b488d90a5f9b9807235c9c2dc29836a12b73019a60ec20bde", line 9281, in <module> Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.105601+0000 mon.w04s01 (mon.0) 6434 : cluster [WRN] main() Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.105615+0000 mon.w04s01 (mon.0) 6435 : cluster [WRN] File "/var/lib/ceph/6710a0d0-3d6d-11ed-a0be-b47af1c197ec/cephadm.9f5947044415209b488d90a5f9b9807235c9c2dc29836a12b73019a60ec20bde", line 9269, in main Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.105632+0000 mon.w04s01 (mon.0) 6436 : cluster [WRN] r = ctx.func(ctx) Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.105647+0000 mon.w04s01 (mon.0) 6437 : cluster [WRN] File "/var/lib/ceph/6710a0d0-3d6d-11ed-a0be-b47af1c197ec/cephadm.9f5947044415209b488d90a5f9b9807235c9c2dc29836a12b73019a60ec20bde", line 2034, in _infer_config Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.105665+0000 mon.w04s01 (mon.0) 6438 : cluster [WRN] return func(ctx) Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.105680+0000 mon.w04s01 (mon.0) 6439 : cluster [WRN] File "/var/lib/ceph/6710a0d0-3d6d-11ed-a0be-b47af1c197ec/cephadm.9f5947044415209b488d90a5f9b9807235c9c2dc29836a12b73019a60ec20bde", line 1950, in _infer_fsid Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.105698+0000 mon.w04s01 (mon.0) 6440 : cluster [WRN] return func(ctx) Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.105716+0000 mon.w04s01 (mon.0) 6441 : cluster [WRN] File "/var/lib/ceph/6710a0d0-3d6d-11ed-a0be-b47af1c197ec/cephadm.9f5947044415209b488d90a5f9b9807235c9c2dc29836a12b73019a60ec20bde", line 2062, in _infer_image Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.105734+0000 mon.w04s01 (mon.0) 6442 : cluster [WRN] return func(ctx) Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.105750+0000 mon.w04s01 (mon.0) 6443 : cluster [WRN] File "/var/lib/ceph/6710a0d0-3d6d-11ed-a0be-b47af1c197ec/cephadm.9f5947044415209b488d90a5f9b9807235c9c2dc29836a12b73019a60ec20bde", line 1937, in _validate_fsid Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.105767+0000 mon.w04s01 (mon.0) 6444 : cluster [WRN] return func(ctx) Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.105781+0000 mon.w04s01 (mon.0) 6445 : cluster [WRN] File "/var/lib/ceph/6710a0d0-3d6d-11ed-a0be-b47af1c197ec/cephadm.9f5947044415209b488d90a5f9b9807235c9c2dc29836a12b73019a60ec20bde", line 5995, in command_ceph_volume Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.105799+0000 mon.w04s01 (mon.0) 6446 : cluster [WRN] out, err, code = call_throws(ctx, c.run_cmd()) Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.105815+0000 mon.w04s01 (mon.0) 6447 : cluster [WRN] File "/var/lib/ceph/6710a0d0-3d6d-11ed-a0be-b47af1c197ec/cephadm.9f5947044415209b488d90a5f9b9807235c9c2dc29836a12b73019a60ec20bde", line 1734, in call_throws Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.105832+0000 mon.w04s01 (mon.0) 6448 : cluster [WRN] out, err, ret = call(ctx, command, desc, verbosity, timeout, **kwargs) Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.105847+0000 mon.w04s01 (mon.0) 6449 : cluster [WRN] File "/var/lib/ceph/6710a0d0-3d6d-11ed-a0be-b47af1c197ec/cephadm.9f5947044415209b488d90a5f9b9807235c9c2dc29836a12b73019a60ec20bde", line 1716, in call Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.105865+0000 mon.w04s01 (mon.0) 6450 : cluster [WRN] stdout, stderr, returncode = async_run(run_with_timeout()) Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.105881+0000 mon.w04s01 (mon.0) 6451 : cluster [WRN] File "/var/lib/ceph/6710a0d0-3d6d-11ed-a0be-b47af1c197ec/cephadm.9f5947044415209b488d90a5f9b9807235c9c2dc29836a12b73019a60ec20bde", line 1657, in async_run Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.105898+0000 mon.w04s01 (mon.0) 6452 : cluster [WRN] return loop.run_until_complete(coro) Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.105913+0000 mon.w04s01 (mon.0) 6453 : cluster [WRN] File "/usr/lib64/python3.6/asyncio/base_events.py", line 471, in run_until_complete Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.105932+0000 mon.w04s01 (mon.0) 6454 : cluster [WRN] self.run_forever() Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.105955+0000 mon.w04s01 (mon.0) 6455 : cluster [WRN] File "/usr/lib64/python3.6/asyncio/base_events.py", line 438, in run_forever Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.105975+0000 mon.w04s01 (mon.0) 6456 : cluster [WRN] self._run_once() Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.105991+0000 mon.w04s01 (mon.0) 6457 : cluster [WRN] File "/usr/lib64/python3.6/asyncio/base_events.py", line 1451, in _run_once Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.106007+0000 mon.w04s01 (mon.0) 6458 : cluster [WRN] handle._run() Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.106023+0000 mon.w04s01 (mon.0) 6459 : cluster [WRN] File "/usr/lib64/python3.6/asyncio/events.py", line 145, in _run Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.106038+0000 mon.w04s01 (mon.0) 6460 : cluster [WRN] self._callback(*self._args) Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.106055+0000 mon.w04s01 (mon.0) 6461 : cluster [WRN] File "/var/lib/ceph/6710a0d0-3d6d-11ed-a0be-b47af1c197ec/cephadm.9f5947044415209b488d90a5f9b9807235c9c2dc29836a12b73019a60ec20bde", line 1695, in tee Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.106072+0000 mon.w04s01 (mon.0) 6462 : cluster [WRN] logger.debug(prefix + message.rstrip()) Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.106089+0000 mon.w04s01 (mon.0) 6463 : cluster [WRN] Message: '/bin/docker: File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch' Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.106106+0000 mon.w04s01 (mon.0) 6464 : cluster [WRN] Arguments: () Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.106123+0000 mon.w04s01 (mon.0) 6465 : cluster [WRN] --- Logging error --- Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.106140+0000 mon.w04s01 (mon.0) 6466 : cluster [WRN] Traceback (most recent call last): Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.106166+0000 mon.w04s01 (mon.0) 6467 : cluster [WRN] File "/usr/lib64/python3.6/logging/__init__.py", line 998, in emit Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.106183+0000 mon.w04s01 (mon.0) 6468 : cluster [WRN] self.flush() Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.106198+0000 mon.w04s01 (mon.0) 6469 : cluster [WRN] File "/usr/lib64/python3.6/logging/__init__.py", line 978, in flush Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.106218+0000 mon.w04s01 (mon.0) 6470 : cluster [WRN] self.stream.flush() Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.106235+0000 mon.w04s01 (mon.0) 6471 : cluster [WRN] OSError: [Errno 28] No space left on device Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.106250+0000 mon.w04s01 (mon.0) 6472 : cluster [WRN] Call stack: Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.106267+0000 mon.w04s01 (mon.0) 6473 : cluster [WRN] File "/var/lib/ceph/6710a0d0-3d6d-11ed-a0be-b47af1c197ec/cephadm.9f5947044415209b488d90a5f9b9807235c9c2dc29836a12b73019a60ec20bde", line 9281, in <module> Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.106284+0000 mon.w04s01 (mon.0) 6474 : cluster [WRN] main() Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.106301+0000 mon.w04s01 (mon.0) 6475 : cluster [WRN] File "/var/lib/ceph/6710a0d0-3d6d-11ed-a0be-b47af1c197ec/cephadm.9f5947044415209b488d90a5f9b9807235c9c2dc29836a12b73019a60ec20bde", line 9269, in main Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.106318+0000 mon.w04s01 (mon.0) 6476 : cluster [WRN] r = ctx.func(ctx) Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.106336+0000 mon.w04s01 (mon.0) 6477 : cluster [WRN] File "/var/lib/ceph/6710a0d0-3d6d-11ed-a0be-b47af1c197ec/cephadm.9f5947044415209b488d90a5f9b9807235c9c2dc29836a12b73019a60ec20bde", line 2034, in _infer_config Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.106351+0000 mon.w04s01 (mon.0) 6478 : cluster [WRN] return func(ctx) Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.106369+0000 mon.w04s01 (mon.0) 6479 : cluster [WRN] File "/var/lib/ceph/6710a0d0-3d6d-11ed-a0be-b47af1c197ec/cephadm.9f5947044415209b488d90a5f9b9807235c9c2dc29836a12b73019a60ec20bde", line 1950, in _infer_fsid Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.106387+0000 mon.w04s01 (mon.0) 6480 : cluster [WRN] return func(ctx) Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.106411+0000 mon.w04s01 (mon.0) 6481 : cluster [WRN] File "/var/lib/ceph/6710a0d0-3d6d-11ed-a0be-b47af1c197ec/cephadm.9f5947044415209b488d90a5f9b9807235c9c2dc29836a12b73019a60ec20bde", line 2062, in _infer_image Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.106428+0000 mon.w04s01 (mon.0) 6482 : cluster [WRN] return func(ctx) Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.106444+0000 mon.w04s01 (mon.0) 6483 : cluster [WRN] File "/var/lib/ceph/6710a0d0-3d6d-11ed-a0be-b47af1c197ec/cephadm.9f5947044415209b488d90a5f9b9807235c9c2dc29836a12b73019a60ec20bde", line 1937, in _validate_fsid Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.106461+0000 mon.w04s01 (mon.0) 6484 : cluster [WRN] return func(ctx) Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.106480+0000 mon.w04s01 (mon.0) 6485 : cluster [WRN] File "/var/lib/ceph/6710a0d0-3d6d-11ed-a0be-b47af1c197ec/cephadm.9f5947044415209b488d90a5f9b9807235c9c2dc29836a12b73019a60ec20bde", line 5995, in command_ceph_volume Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.106497+0000 mon.w04s01 (mon.0) 6486 : cluster [WRN] out, err, code = call_throws(ctx, c.run_cmd()) Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.106513+0000 mon.w04s01 (mon.0) 6487 : cluster [WRN] File "/var/lib/ceph/6710a0d0-3d6d-11ed-a0be-b47af1c197ec/cephadm.9f5947044415209b488d90a5f9b9807235c9c2dc29836a12b73019a60ec20bde", line 1734, in call_throws Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.106529+0000 mon.w04s01 (mon.0) 6488 : cluster [WRN] out, err, ret = call(ctx, command, desc, verbosity, timeout, **kwargs) Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.106547+0000 mon.w04s01 (mon.0) 6489 : cluster [WRN] File "/var/lib/ceph/6710a0d0-3d6d-11ed-a0be-b47af1c197ec/cephadm.9f5947044415209b488d90a5f9b9807235c9c2dc29836a12b73019a60ec20bde", line 1716, in call Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.106561+0000 mon.w04s01 (mon.0) 6490 : cluster [WRN] stdout, stderr, returncode = async_run(run_with_timeout()) Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.106581+0000 mon.w04s01 (mon.0) 6491 : cluster [WRN] File "/var/lib/ceph/6710a0d0-3d6d-11ed-a0be-b47af1c197ec/cephadm.9f5947044415209b488d90a5f9b9807235c9c2dc29836a12b73019a60ec20bde", line 1657, in async_run Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.106599+0000 mon.w04s01 (mon.0) 6492 : cluster [WRN] return loop.run_until_complete(coro) Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.106617+0000 mon.w04s01 (mon.0) 6493 : cluster [WRN] File "/usr/lib64/python3.6/asyncio/base_events.py", line 471, in run_until_complete Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.106637+0000 mon.w04s01 (mon.0) 6494 : cluster [WRN] self.run_forever() Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.106654+0000 mon.w04s01 (mon.0) 6495 : cluster [WRN] File "/usr/lib64/python3.6/asyncio/base_events.py", line 438, in run_forever Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.106672+0000 mon.w04s01 (mon.0) 6496 : cluster [WRN] self._run_once() Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.106689+0000 mon.w04s01 (mon.0) 6497 : cluster [WRN] File "/usr/lib64/python3.6/asyncio/base_events.py", line 1451, in _run_once Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.106707+0000 mon.w04s01 (mon.0) 6498 : cluster [WRN] handle._run() Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.106724+0000 mon.w04s01 (mon.0) 6499 : cluster [WRN] File "/usr/lib64/python3.6/asyncio/events.py", line 145, in _run Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.106740+0000 mon.w04s01 (mon.0) 6500 : cluster [WRN] self._callback(*self._args) Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.106760+0000 mon.w04s01 (mon.0) 6501 : cluster [WRN] File "/var/lib/ceph/6710a0d0-3d6d-11ed-a0be-b47af1c197ec/cephadm.9f5947044415209b488d90a5f9b9807235c9c2dc29836a12b73019a60ec20bde", line 1695, in tee Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.106777+0000 mon.w04s01 (mon.0) 6502 : cluster [WRN] logger.debug(prefix + message.rstrip()) Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.106793+0000 mon.w04s01 (mon.0) 6503 : cluster [WRN] Message: '/bin/docker: instance.main()' Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.106809+0000 mon.w04s01 (mon.0) 6504 : cluster [WRN] Arguments: ()
The only way of getting back into a quorum was modifying the monmap to have a single MON running, then starting MGRs, then slowly restore cluster's condition using Cephadm.
No data to display
Actions