Project

General

Profile

Bug #57964

Cephadm: MONs are not getting back after /var/log filesystem is full

Added by Piotr Parczewski 3 months ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I had at least two examples where Cephadm controlled cluster had /var/log full in CentOS Linux and it obviously caused a crash of the MONs. After cleaning up some logs and/or expanding the filesystem, MONs are unable to get back and form a cluster. There's a huge flood of the error messages in system's journal, MONs still do claim there's no space available:

Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.105444+0000 mon.w04s01 (mon.0) 6425 : cluster [WRN] --- Logging error ---
Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.105463+0000 mon.w04s01 (mon.0) 6426 : cluster [WRN] Traceback (most recent call last):
Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.105479+0000 mon.w04s01 (mon.0) 6427 : cluster [WRN]   File "/usr/lib64/python3.6/logging/__init__.py", line 998, in emit
Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.105493+0000 mon.w04s01 (mon.0) 6428 : cluster [WRN]     self.flush()
Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.105515+0000 mon.w04s01 (mon.0) 6429 : cluster [WRN]   File "/usr/lib64/python3.6/logging/__init__.py", line 978, in flush
Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.105533+0000 mon.w04s01 (mon.0) 6430 : cluster [WRN]     self.stream.flush()
Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.105550+0000 mon.w04s01 (mon.0) 6431 : cluster [WRN] OSError: [Errno 28] No space left on device
Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.105566+0000 mon.w04s01 (mon.0) 6432 : cluster [WRN] Call stack:
Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.105585+0000 mon.w04s01 (mon.0) 6433 : cluster [WRN]   File "/var/lib/ceph/6710a0d0-3d6d-11ed-a0be-b47af1c197ec/cephadm.9f5947044415209b488d90a5f9b9807235c9c2dc29836a12b73019a60ec20bde", line 9281, in <module>
Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.105601+0000 mon.w04s01 (mon.0) 6434 : cluster [WRN]     main()
Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.105615+0000 mon.w04s01 (mon.0) 6435 : cluster [WRN]   File "/var/lib/ceph/6710a0d0-3d6d-11ed-a0be-b47af1c197ec/cephadm.9f5947044415209b488d90a5f9b9807235c9c2dc29836a12b73019a60ec20bde", line 9269, in main
Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.105632+0000 mon.w04s01 (mon.0) 6436 : cluster [WRN]     r = ctx.func(ctx)
Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.105647+0000 mon.w04s01 (mon.0) 6437 : cluster [WRN]   File "/var/lib/ceph/6710a0d0-3d6d-11ed-a0be-b47af1c197ec/cephadm.9f5947044415209b488d90a5f9b9807235c9c2dc29836a12b73019a60ec20bde", line 2034, in _infer_config
Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.105665+0000 mon.w04s01 (mon.0) 6438 : cluster [WRN]     return func(ctx)
Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.105680+0000 mon.w04s01 (mon.0) 6439 : cluster [WRN]   File "/var/lib/ceph/6710a0d0-3d6d-11ed-a0be-b47af1c197ec/cephadm.9f5947044415209b488d90a5f9b9807235c9c2dc29836a12b73019a60ec20bde", line 1950, in _infer_fsid
Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.105698+0000 mon.w04s01 (mon.0) 6440 : cluster [WRN]     return func(ctx)
Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.105716+0000 mon.w04s01 (mon.0) 6441 : cluster [WRN]   File "/var/lib/ceph/6710a0d0-3d6d-11ed-a0be-b47af1c197ec/cephadm.9f5947044415209b488d90a5f9b9807235c9c2dc29836a12b73019a60ec20bde", line 2062, in _infer_image
Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.105734+0000 mon.w04s01 (mon.0) 6442 : cluster [WRN]     return func(ctx)
Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.105750+0000 mon.w04s01 (mon.0) 6443 : cluster [WRN]   File "/var/lib/ceph/6710a0d0-3d6d-11ed-a0be-b47af1c197ec/cephadm.9f5947044415209b488d90a5f9b9807235c9c2dc29836a12b73019a60ec20bde", line 1937, in _validate_fsid
Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.105767+0000 mon.w04s01 (mon.0) 6444 : cluster [WRN]     return func(ctx)
Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.105781+0000 mon.w04s01 (mon.0) 6445 : cluster [WRN]   File "/var/lib/ceph/6710a0d0-3d6d-11ed-a0be-b47af1c197ec/cephadm.9f5947044415209b488d90a5f9b9807235c9c2dc29836a12b73019a60ec20bde", line 5995, in command_ceph_volume
Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.105799+0000 mon.w04s01 (mon.0) 6446 : cluster [WRN]     out, err, code = call_throws(ctx, c.run_cmd())
Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.105815+0000 mon.w04s01 (mon.0) 6447 : cluster [WRN]   File "/var/lib/ceph/6710a0d0-3d6d-11ed-a0be-b47af1c197ec/cephadm.9f5947044415209b488d90a5f9b9807235c9c2dc29836a12b73019a60ec20bde", line 1734, in call_throws
Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.105832+0000 mon.w04s01 (mon.0) 6448 : cluster [WRN]     out, err, ret = call(ctx, command, desc, verbosity, timeout, **kwargs)
Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.105847+0000 mon.w04s01 (mon.0) 6449 : cluster [WRN]   File "/var/lib/ceph/6710a0d0-3d6d-11ed-a0be-b47af1c197ec/cephadm.9f5947044415209b488d90a5f9b9807235c9c2dc29836a12b73019a60ec20bde", line 1716, in call
Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.105865+0000 mon.w04s01 (mon.0) 6450 : cluster [WRN]     stdout, stderr, returncode = async_run(run_with_timeout())
Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.105881+0000 mon.w04s01 (mon.0) 6451 : cluster [WRN]   File "/var/lib/ceph/6710a0d0-3d6d-11ed-a0be-b47af1c197ec/cephadm.9f5947044415209b488d90a5f9b9807235c9c2dc29836a12b73019a60ec20bde", line 1657, in async_run
Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.105898+0000 mon.w04s01 (mon.0) 6452 : cluster [WRN]     return loop.run_until_complete(coro)
Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.105913+0000 mon.w04s01 (mon.0) 6453 : cluster [WRN]   File "/usr/lib64/python3.6/asyncio/base_events.py", line 471, in run_until_complete
Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.105932+0000 mon.w04s01 (mon.0) 6454 : cluster [WRN]     self.run_forever()
Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.105955+0000 mon.w04s01 (mon.0) 6455 : cluster [WRN]   File "/usr/lib64/python3.6/asyncio/base_events.py", line 438, in run_forever
Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.105975+0000 mon.w04s01 (mon.0) 6456 : cluster [WRN]     self._run_once()
Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.105991+0000 mon.w04s01 (mon.0) 6457 : cluster [WRN]   File "/usr/lib64/python3.6/asyncio/base_events.py", line 1451, in _run_once
Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.106007+0000 mon.w04s01 (mon.0) 6458 : cluster [WRN]     handle._run()
Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.106023+0000 mon.w04s01 (mon.0) 6459 : cluster [WRN]   File "/usr/lib64/python3.6/asyncio/events.py", line 145, in _run
Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.106038+0000 mon.w04s01 (mon.0) 6460 : cluster [WRN]     self._callback(*self._args)
Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.106055+0000 mon.w04s01 (mon.0) 6461 : cluster [WRN]   File "/var/lib/ceph/6710a0d0-3d6d-11ed-a0be-b47af1c197ec/cephadm.9f5947044415209b488d90a5f9b9807235c9c2dc29836a12b73019a60ec20bde", line 1695, in tee
Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.106072+0000 mon.w04s01 (mon.0) 6462 : cluster [WRN]     logger.debug(prefix + message.rstrip())
Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.106089+0000 mon.w04s01 (mon.0) 6463 : cluster [WRN] Message: '/bin/docker:   File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch'
Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.106106+0000 mon.w04s01 (mon.0) 6464 : cluster [WRN] Arguments: ()
Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.106123+0000 mon.w04s01 (mon.0) 6465 : cluster [WRN] --- Logging error ---
Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.106140+0000 mon.w04s01 (mon.0) 6466 : cluster [WRN] Traceback (most recent call last):
Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.106166+0000 mon.w04s01 (mon.0) 6467 : cluster [WRN]   File "/usr/lib64/python3.6/logging/__init__.py", line 998, in emit
Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.106183+0000 mon.w04s01 (mon.0) 6468 : cluster [WRN]     self.flush()
Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.106198+0000 mon.w04s01 (mon.0) 6469 : cluster [WRN]   File "/usr/lib64/python3.6/logging/__init__.py", line 978, in flush
Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.106218+0000 mon.w04s01 (mon.0) 6470 : cluster [WRN]     self.stream.flush()
Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.106235+0000 mon.w04s01 (mon.0) 6471 : cluster [WRN] OSError: [Errno 28] No space left on device
Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.106250+0000 mon.w04s01 (mon.0) 6472 : cluster [WRN] Call stack:
Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.106267+0000 mon.w04s01 (mon.0) 6473 : cluster [WRN]   File "/var/lib/ceph/6710a0d0-3d6d-11ed-a0be-b47af1c197ec/cephadm.9f5947044415209b488d90a5f9b9807235c9c2dc29836a12b73019a60ec20bde", line 9281, in <module>
Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.106284+0000 mon.w04s01 (mon.0) 6474 : cluster [WRN]     main()
Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.106301+0000 mon.w04s01 (mon.0) 6475 : cluster [WRN]   File "/var/lib/ceph/6710a0d0-3d6d-11ed-a0be-b47af1c197ec/cephadm.9f5947044415209b488d90a5f9b9807235c9c2dc29836a12b73019a60ec20bde", line 9269, in main
Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.106318+0000 mon.w04s01 (mon.0) 6476 : cluster [WRN]     r = ctx.func(ctx)
Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.106336+0000 mon.w04s01 (mon.0) 6477 : cluster [WRN]   File "/var/lib/ceph/6710a0d0-3d6d-11ed-a0be-b47af1c197ec/cephadm.9f5947044415209b488d90a5f9b9807235c9c2dc29836a12b73019a60ec20bde", line 2034, in _infer_config
Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.106351+0000 mon.w04s01 (mon.0) 6478 : cluster [WRN]     return func(ctx)
Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.106369+0000 mon.w04s01 (mon.0) 6479 : cluster [WRN]   File "/var/lib/ceph/6710a0d0-3d6d-11ed-a0be-b47af1c197ec/cephadm.9f5947044415209b488d90a5f9b9807235c9c2dc29836a12b73019a60ec20bde", line 1950, in _infer_fsid
Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.106387+0000 mon.w04s01 (mon.0) 6480 : cluster [WRN]     return func(ctx)
Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.106411+0000 mon.w04s01 (mon.0) 6481 : cluster [WRN]   File "/var/lib/ceph/6710a0d0-3d6d-11ed-a0be-b47af1c197ec/cephadm.9f5947044415209b488d90a5f9b9807235c9c2dc29836a12b73019a60ec20bde", line 2062, in _infer_image
Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.106428+0000 mon.w04s01 (mon.0) 6482 : cluster [WRN]     return func(ctx)
Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.106444+0000 mon.w04s01 (mon.0) 6483 : cluster [WRN]   File "/var/lib/ceph/6710a0d0-3d6d-11ed-a0be-b47af1c197ec/cephadm.9f5947044415209b488d90a5f9b9807235c9c2dc29836a12b73019a60ec20bde", line 1937, in _validate_fsid
Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.106461+0000 mon.w04s01 (mon.0) 6484 : cluster [WRN]     return func(ctx)
Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.106480+0000 mon.w04s01 (mon.0) 6485 : cluster [WRN]   File "/var/lib/ceph/6710a0d0-3d6d-11ed-a0be-b47af1c197ec/cephadm.9f5947044415209b488d90a5f9b9807235c9c2dc29836a12b73019a60ec20bde", line 5995, in command_ceph_volume
Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.106497+0000 mon.w04s01 (mon.0) 6486 : cluster [WRN]     out, err, code = call_throws(ctx, c.run_cmd())
Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.106513+0000 mon.w04s01 (mon.0) 6487 : cluster [WRN]   File "/var/lib/ceph/6710a0d0-3d6d-11ed-a0be-b47af1c197ec/cephadm.9f5947044415209b488d90a5f9b9807235c9c2dc29836a12b73019a60ec20bde", line 1734, in call_throws
Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.106529+0000 mon.w04s01 (mon.0) 6488 : cluster [WRN]     out, err, ret = call(ctx, command, desc, verbosity, timeout, **kwargs)
Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.106547+0000 mon.w04s01 (mon.0) 6489 : cluster [WRN]   File "/var/lib/ceph/6710a0d0-3d6d-11ed-a0be-b47af1c197ec/cephadm.9f5947044415209b488d90a5f9b9807235c9c2dc29836a12b73019a60ec20bde", line 1716, in call
Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.106561+0000 mon.w04s01 (mon.0) 6490 : cluster [WRN]     stdout, stderr, returncode = async_run(run_with_timeout())
Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.106581+0000 mon.w04s01 (mon.0) 6491 : cluster [WRN]   File "/var/lib/ceph/6710a0d0-3d6d-11ed-a0be-b47af1c197ec/cephadm.9f5947044415209b488d90a5f9b9807235c9c2dc29836a12b73019a60ec20bde", line 1657, in async_run
Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.106599+0000 mon.w04s01 (mon.0) 6492 : cluster [WRN]     return loop.run_until_complete(coro)
Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.106617+0000 mon.w04s01 (mon.0) 6493 : cluster [WRN]   File "/usr/lib64/python3.6/asyncio/base_events.py", line 471, in run_until_complete
Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.106637+0000 mon.w04s01 (mon.0) 6494 : cluster [WRN]     self.run_forever()
Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.106654+0000 mon.w04s01 (mon.0) 6495 : cluster [WRN]   File "/usr/lib64/python3.6/asyncio/base_events.py", line 438, in run_forever
Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.106672+0000 mon.w04s01 (mon.0) 6496 : cluster [WRN]     self._run_once()
Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.106689+0000 mon.w04s01 (mon.0) 6497 : cluster [WRN]   File "/usr/lib64/python3.6/asyncio/base_events.py", line 1451, in _run_once
Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.106707+0000 mon.w04s01 (mon.0) 6498 : cluster [WRN]     handle._run()
Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.106724+0000 mon.w04s01 (mon.0) 6499 : cluster [WRN]   File "/usr/lib64/python3.6/asyncio/events.py", line 145, in _run
Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.106740+0000 mon.w04s01 (mon.0) 6500 : cluster [WRN]     self._callback(*self._args)
Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.106760+0000 mon.w04s01 (mon.0) 6501 : cluster [WRN]   File "/var/lib/ceph/6710a0d0-3d6d-11ed-a0be-b47af1c197ec/cephadm.9f5947044415209b488d90a5f9b9807235c9c2dc29836a12b73019a60ec20bde", line 1695, in tee
Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.106777+0000 mon.w04s01 (mon.0) 6502 : cluster [WRN]     logger.debug(prefix + message.rstrip())
Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.106793+0000 mon.w04s01 (mon.0) 6503 : cluster [WRN] Message: '/bin/docker:     instance.main()'
Nov 01 22:22:45 w04s02 bash[810172]: cluster 2022-11-01T07:01:41.106809+0000 mon.w04s01 (mon.0) 6504 : cluster [WRN] Arguments: ()

The only way of getting back into a quorum was modifying the monmap to have a single MON running, then starting MGRs, then slowly restore cluster's condition using Cephadm.

Also available in: Atom PDF