Bug #57431

open

All mon servers unresponsive when ceph health detail too long

Added by Glen Baars over 1 year ago.

Status: New
Priority: Normal
Assignee: -
Category: Monitor
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We had a production outage caused by this issue. All 5 of our mon servers became unresponsive, each using 100% of a single core (each server has 32 cores).

The root hard drive on one of our 27 Ceph nodes became read-only, and cephadm repeatedly reported this into the cluster log. I couldn't get ceph health detail to display because the output was too large, but judging from the logs, it was huge.

There should be a limit on the size of the health detail or a rate limit.
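To make the request concrete, here is a minimal Python sketch of the two guards suggested above: a size cap on each detail message and a per-check rate limit. It is an illustration only and not how the Ceph monitor is implemented; the class name `HealthDetailGuard`, its parameters, and the default values are hypothetical.

```python
import time


class HealthDetailGuard:
    """Hypothetical guard: cap the size of a health detail message and
    drop repeats of the same check that arrive too quickly."""

    def __init__(self, max_bytes=4096, min_interval=60.0, clock=time.monotonic):
        self.max_bytes = max_bytes        # size cap per message (assumed value)
        self.min_interval = min_interval  # seconds between messages per check (assumed value)
        self.clock = clock                # injectable clock, useful for testing
        self._last_sent = {}              # check name -> time of last accepted message

    def filter(self, check, text):
        """Return the text to log (possibly truncated), or None to drop it."""
        now = self.clock()
        last = self._last_sent.get(check)
        # Rate limit: drop the message if this check was logged too recently.
        if last is not None and now - last < self.min_interval:
            return None
        self._last_sent[check] = now

        # Size cap: truncate on a byte budget, ignoring a split trailing character.
        data = text.encode("utf-8")
        if len(data) > self.max_bytes:
            text = data[: self.max_bytes].decode("utf-8", errors="ignore") + " ... [truncated]"
        return text
```

With a guard like this, a traceback repeated every few milliseconds would produce at most one bounded message per check per interval, instead of an unbounded stream.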

In case someone else runs into this issue: our workaround was to inject mon_health_detail_to_clog=false into the mon server config.
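The report does not give the exact command used for the workaround. As a hedged sketch, the helper below builds two plausible forms: `ceph config set`, which needs a working mon quorum, and `ceph daemon ... config set`, which goes through the local admin socket of a single mon. The function names and the `mon_id` parameter are hypothetical, and on a cephadm deployment the admin socket command would likely have to run inside the mon container. By default the commands are only printed, not executed.

```python
import subprocess

OPTION = "mon_health_detail_to_clog"  # option name taken from the report


def build_workaround_commands(mon_id=None):
    """Return argv lists that would set the option to false.

    mon_id is a hypothetical local mon name (e.g. the host's short name);
    when given, an admin-socket variant for that mon is added.
    """
    cmds = [["ceph", "config", "set", "mon", OPTION, "false"]]
    if mon_id is not None:
        cmds.append(["ceph", "daemon", f"mon.{mon_id}", "config", "set", OPTION, "false"])
    return cmds


def apply_workaround(cmds, dry_run=True):
    """Print each command, or run it when dry_run is False."""
    for argv in cmds:
        if dry_run:
            print(" ".join(argv))
        else:
            subprocess.run(argv, check=True)
```

Calling `apply_workaround(build_workaround_commands())` prints the cluster-wide command so it can be reviewed before being run by hand.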

We run 16.2.7, but I expect this affects most versions.

Here is an example of the errors that were repeated into the cluster log:

2022-09-03T00:00:10.097+0000 7f3ae8689700 1 log_channel(cluster) log [ERR] : -- Logging error ---
2022-09-03T00:00:10.129+0000 7f3ae8689700 1 log_channel(cluster) log [ERR] : Traceback (most recent call last):
2022-09-03T00:00:10.161+0000 7f3ae8689700 -1 log_channel(cluster) log [ERR] : File "/usr/lib/python3.8/logging/__init__.py", line 1089, in emit
2022-09-03T00:00:10.193+0000 7f3ae8689700 -1 log_channel(cluster) log [ERR] : self.flush()
2022-09-03T00:00:10.229+0000 7f3ae8689700 -1 log_channel(cluster) log [ERR] : File "/usr/lib/python3.8/logging/__init__.py", line 1069, in flush
2022-09-03T00:00:10.261+0000 7f3ae8689700 -1 log_channel(cluster) log [ERR] : self.stream.flush()
2022-09-03T00:00:10.293+0000 7f3ae8689700 -1 log_channel(cluster) log [ERR] : OSError: [Errno 30] Read-only file system
2022-09-03T00:00:10.329+0000 7f3ae8689700 -1 log_channel(cluster) log [ERR] : Call stack:
2022-09-03T00:00:10.361+0000 7f3ae8689700 -1 log_channel(cluster) log [ERR] : File "/var/lib/ceph/87e0c6ee-5b9d-4f32-b7f0-188d0cfd52b8/cephadm.55e70975756e8c180366666f9fa21d3301c67edc3a5000698fd6e7ccb6fcafee", line 8571, in <module>
2022-09-03T00:00:10.397+0000 7f3ae8689700 -1 log_channel(cluster) log [ERR] : main()
2022-09-03T00:00:10.429+0000 7f3ae8689700 -1 log_channel(cluster) log [ERR] : File "/var/lib/ceph/87e0c6ee-5b9d-4f32-b7f0-188d0cfd52b8/cephadm.55e70975756e8c180366666f9fa21d3301c67edc3a5000698fd6e7ccb6fcafee", line 8559, in main
2022-09-03T00:00:10.461+0000 7f3ae8689700 -1 log_channel(cluster) log [ERR] : r = ctx.func(ctx)
2022-09-03T00:00:10.493+0000 7f3ae8689700 -1 log_channel(cluster) log [ERR] : File "/var/lib/ceph/87e0c6ee-5b9d-4f32-b7f0-188d0cfd52b8/cephadm.55e70975756e8c180366666f9fa21d3301c67edc3a5000698fd6e7ccb6fcafee", line 1737, in _infer_config
2022-09-03T00:00:10.529+0000 7f3ae8689700 -1 log_channel(cluster) log [ERR] : return func(ctx)
2022-09-03T00:00:10.561+0000 7f3ae8689700 -1 log_channel(cluster) log [ERR] : File "/var/lib/ceph/87e0c6ee-5b9d-4f32-b7f0-188d0cfd52b8/cephadm.55e70975756e8c180366666f9fa21d3301c67edc3a5000698fd6e7ccb6fcafee", line 1678, in _infer_fsid
2022-09-03T00:00:10.593+0000 7f3ae8689700 -1 log_channel(cluster) log [ERR] : return func(ctx)
2022-09-03T00:00:10.625+0000 7f3ae8689700 -1 log_channel(cluster) log [ERR] : File "/var/lib/ceph/87e0c6ee-5b9d-4f32-b7f0-188d0cfd52b8/cephadm.55e70975756e8c180366666f9fa21d3301c67edc3a5000698fd6e7ccb6fcafee", line 1765, in _infer_image
2022-09-03T00:00:10.661+0000 7f3ae8689700 -1 log_channel(cluster) log [ERR] : return func(ctx)
2022-09-03T00:00:10.693+0000 7f3ae8689700 -1 log_channel(cluster) log [ERR] : File "/var/lib/ceph/87e0c6ee-5b9d-4f32-b7f0-188d0cfd52b8/cephadm.55e70975756e8c180366666f9fa21d3301c67edc3a5000698fd6e7ccb6fcafee", line 1665, in _validate_fsid
2022-09-03T00:00:10.729+0000 7f3ae8689700 -1 log_channel(cluster) log [ERR] : return func(ctx)
2022-09-03T00:00:10.765+0000 7f3ae8689700 -1 log_channel(cluster) log [ERR] : File "/var/lib/ceph/87e0c6ee-5b9d-4f32-b7f0-188d0cfd52b8/cephadm.55e70975756e8c180366666f9fa21d3301c67edc3a5000698fd6e7ccb6fcafee", line 4822, in command_ceph_volume
2022-09-03T00:00:10.797+0000 7f3ae8689700 -1 log_channel(cluster) log [ERR] : out, err, code = call_throws(ctx, c.run_cmd())
2022-09-03T00:00:10.829+0000 7f3ae8689700 -1 log_channel(cluster) log [ERR] : File "/var/lib/ceph/87e0c6ee-5b9d-4f32-b7f0-188d0cfd52b8/cephadm.55e70975756e8c180366666f9fa21d3301c67edc3a5000698fd6e7ccb6fcafee", line 1465, in call_throws
2022-09-03T00:00:10.861+0000 7f3ae8689700 -1 log_channel(cluster) log [ERR] : out, err, ret = call(ctx, command, desc, verbosity, timeout, **kwargs)
2022-09-03T00:00:10.901+0000 7f3ae8689700 -1 log_channel(cluster) log [ERR] : File "/var/lib/ceph/87e0c6ee-5b9d-4f32-b7f0-188d0cfd52b8/cephadm.55e70975756e8c180366666f9fa21d3301c67edc3a5000698fd6e7ccb6fcafee", line 1447, in call
2022-09-03T00:00:10.933+0000 7f3ae8689700 -1 log_channel(cluster) log [ERR] : stdout, stderr, returncode = async_run(run_with_timeout())
2022-09-03T00:00:10.965+0000 7f3ae8689700 -1 log_channel(cluster) log [ERR] : File "/usr/lib/python3.8/asyncio/runners.py", line 44, in run
2022-09-03T00:00:11.001+0000 7f3ae8689700 -1 log_channel(cluster) log [ERR] : return loop.run_until_complete(main)
2022-09-03T00:00:11.033+0000 7f3ae8689700 -1 log_channel(cluster) log [ERR] : File "/usr/lib/python3.8/asyncio/base_events.py", line 603, in run_until_complete
2022-09-03T00:00:11.065+0000 7f3ae8689700 -1 log_channel(cluster) log [ERR] : self.run_forever()
2022-09-03T00:00:11.101+0000 7f3ae8689700 -1 log_channel(cluster) log [ERR] : File "/usr/lib/python3.8/asyncio/base_events.py", line 570, in run_forever
2022-09-03T00:00:11.133+0000 7f3ae8689700 -1 log_channel(cluster) log [ERR] : self._run_once()
2022-09-03T00:00:11.165+0000 7f3ae8689700 -1 log_channel(cluster) log [ERR] : File "/usr/lib/python3.8/asyncio/base_events.py", line 1859, in _run_once
2022-09-03T00:00:11.201+0000 7f3ae8689700 -1 log_channel(cluster) log [ERR] : handle._run()
2022-09-03T00:00:11.233+0000 7f3ae8689700 -1 log_channel(cluster) log [ERR] : File "/usr/lib/python3.8/asyncio/events.py", line 81, in _run
2022-09-03T00:00:11.269+0000 7f3ae8689700 -1 log_channel(cluster) log [ERR] : self._context.run(self._callback, *self._args)
2022-09-03T00:00:11.301+0000 7f3ae8689700 -1 log_channel(cluster) log [ERR] : File "/var/lib/ceph/87e0c6ee-5b9d-4f32-b7f0-188d0cfd52b8/cephadm.55e70975756e8c180366666f9fa21d3301c67edc3a5000698fd6e7ccb6fcafee", line 1426, in tee
2022-09-03T00:00:11.337+0000 7f3ae8689700 -1 log_channel(cluster) log [ERR] : logger.debug(prefix + message.rstrip())
2022-09-03T00:00:11.369+0000 7f3ae8689700 -1 log_channel(cluster) log [ERR] : Message: "/usr/bin/docker: log_output('stdout', line, terminal_verbose, logfile_verbose)"
2022-09-03T00:00:11.401+0000 7f3ae8689700 -1 log_channel(cluster) log [ERR] : Arguments: ()
2022-09-03T00:00:11.433+0000 7f3ae8689700 -1 log_channel(cluster) log [ERR] : -- Logging error ---
