Bug #49736


cephfs-top: missing keys in the client_metadata

Added by Jos Collin about 3 years ago. Updated about 3 years ago.

Status: Resolved
Priority: Urgent
Assignee: Jos Collin
Category: -
Target version: v17.0.0
% Done: 0%
Source: Q/A
Tags:
Backport: pacific
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID: 40210
Crash signature (v1):
Crash signature (v2):

Description

There are missing keys in the mgr/stats client_metadata for some clients, which causes the exception mentioned in the BZ [1] to be raised in cephfs-top [2]. Either cephfs-top should handle the missing metadata entries or mgr/stats should fill in defaults until it can update the metadata. The exception occurs intermittently while cephfs-top is running, with no definite steps to reproduce it.

Below is the `ceph fs perf stats` output dumped during the exception. Notice client.14585, whose metadata contains only the IP key.

{"version": 1, "global_counters": ["cap_hit", "read_latency", "write_latency", "metadata_latency", "dentry_lease"], "counters": [], 

"client_metadata": 
{"client.14504": {"IP": "127.0.0.1", "hostname": "smithi069", "root": "/", "mount_point": "/mnt/cephfs", "valid_metrics": ["cap_hit", "read_latency", "write_latency", "metadata_latency", "dentry_lease"]}, 
"client.14507": {"IP": "127.0.0.1", "hostname": "smithi069", "root": "/", "mount_point": "/mnt/cephfs2", "valid_metrics": ["cap_hit", "read_latency", "write_latency", "metadata_latency", "dentry_lease"]}, 
"client.14585": {"IP": "127.0.0.1"}}, 

"global_metrics": 
{"client.14504": [[2, 0], [0, 0], [0, 0], [0, 3038554], [0, 0]], 
"client.14507": [[2, 0], [0, 0], [0, 0], [0, 3091147], [0, 0]], 
"client.14585": [[0, 0], [0, 0], [0, 0], [0, 0], [0, 0]]}, 

"metrics": {"delayed_ranks": [], "mds.0": {"client.14504": [], "client.14507": [], "client.14585": []}}}

The mgr logs during the exception reflect the same. They cannot be attached to this ticket because of the 1000 KB maximum file size limit.

More Details:
Here [3] we set the IP metadata initially and then send a request to the MDS for the remaining metadata. If cephfs-top queries mgr/stats in the meantime, the current (incomplete) stats are dumped, which causes the exception. So cephfs-top should be prepared to handle that, OR mgr/stats should fill in defaults (N/A, not available) and update them later when it receives the metadata query reply. On the MDS side, it was observed that the metadata query reply did not contain metadata for client.14585 - this also needs to be debugged.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1934426
[2] https://github.com/ceph/ceph/blob/master/src/tools/cephfs/top/cephfs-top#L256
[3] https://github.com/ceph/ceph/blob/master/src/pybind/mgr/stats/fs/perf_stats.py#L275
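
As a purely illustrative sketch of the first option (the field names are taken from the dump above; the "N/A" fallback is an assumption, not the actual cephfs-top code), a consumer can substitute a placeholder for metadata keys that have not arrived yet instead of raising KeyError:

# Hypothetical defensive access: fall back to "N/A" for metadata keys that
# mgr/stats has not filled in yet for a newly seen client.
def display_fields(client_meta):
    return {
        key: client_meta.get(key, "N/A")
        for key in ("IP", "hostname", "root", "mount_point")
    }

print(display_fields({"IP": "127.0.0.1"}))
# {'IP': '127.0.0.1', 'hostname': 'N/A', 'root': 'N/A', 'mount_point': 'N/A'}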


Related issues 1 (0 open, 1 closed)

Copied to CephFS - Backport #49973: pacific: cephfs-top: missing keys in the client_metadata (Resolved, Jos Collin)
Actions #1

Updated by Jos Collin about 3 years ago

  • Description updated (diff)
Actions #2

Updated by Patrick Donnelly about 3 years ago

  • Status changed from New to Triaged
  • Assignee set to Jos Collin
  • Priority changed from Normal to Urgent
  • Target version set to v17.0.0
  • Source set to Q/A
  • Backport set to pacific
  • Severity changed from 3 - minor to 2 - major

Either cephfs-top should handle the missing metadata entries or the mgr/stats should fill in defaults until it can update the metadata.

My intuition is that cephfs-top should be tolerant of the missing metadata entries. Venky, what do you think?

Actions #3

Updated by Venky Shankar about 3 years ago

Patrick Donnelly wrote:

Either cephfs-top should handle the missing metadata entries or the mgr/stats should fill in defaults until it can update the metadata.

My intuition is that cephfs-top should be tolerant of the missing metadata entries. Venky, what do you think?

Right. That's one part of the problem that needs to be handled in cephfs-top (or mgr/stats). mgr/stats sets minimum metadata (IP addr) when it sees a client for the first time. Then it sends an (async) request to the MDS for the other metadata. The perf stats data handed out by mgr/stats would contain incomplete metadata for this client until mgr/stats receives the metadata request reply from the MDS.
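
A rough sketch of what the alternative (mgr/stats filling defaults) could look like; the helper names and the "N/A" placeholder are assumptions and do not reflect the actual perf_stats.py code:

# Hypothetical flow: register placeholder metadata alongside the IP when a
# client is first seen, then overwrite it once the async MDS metadata
# request is answered.
DEFAULT_KEYS = ("hostname", "root", "mount_point")

def register_client(client_metadata, client_id, ip):
    client_metadata[client_id] = {"IP": ip}
    for key in DEFAULT_KEYS:
        client_metadata[client_id].setdefault(key, "N/A")

def on_metadata_reply(client_metadata, client_id, reply):
    # 'reply' is the per-client client_metadata dict from the "client ls" output
    client_metadata[client_id].update(
        {key: reply[key] for key in DEFAULT_KEYS if key in reply}
    )

With this, the perf stats dump always carries a complete (if partly placeholder) metadata record for every client.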

The other issue which I see in Jos's setup is that the metadata request reply ("client ls" call to the MDS) didn't contain metadata for the client (client.14585):


2021-03-11T09:51:18.622+0000 7fa3252ad700  0 [stats DEBUG root] notify: client metadata=[{'id': 14507, 'entity': {'name': {'type': 'client', 'num': 14507}, 'addr': {'type': 'any', 'addr': '127.0.0.1:0', 'nonce': 1132974732}}, 'state': 'open', 'num_leases': 0, 'num_caps': 1, 'request_load_avg': 0, 'uptime': 191.827600803, 'requests_in_flight': 0, 'completed_requests': [], 'reconnecting': False, 'recall_caps': {'value': 0, 'halflife': 60}, 'release_caps': {'value': 0, 'halflife': 60}, 'recall_caps_throttle': {'value': 0, 'halflife': 1.5}, 'recall_caps_throttle2o': {'value': 0, 'halflife': 0.5}, 'session_cache_liveness': {'value': 0.6421762966800849, 'halflife': 300}, 'cap_acquisition': {'value': 0, 'halflife': 10}, 'delegated_inos': [], 'inst': 'client.14507 127.0.0.1:0/1132974732', 'prealloc_inos': [], 'client_metadata': {'client_features': {'feature_bits': '0x000000000000ffff'}, 'metric_spec': {'metric_flags': {'feature_bits': '0x000000000000001f'}}, 'ceph_sha1': '68142daf25e396d4bd8c9caee31c4f0bfe88164f', 'ceph_version': 'ceph version 17.0.0-1725-g68142daf25 (68142daf25e396d4bd8c9caee31c4f0bfe88164f) quincy (dev)', 'entity_id': 'admin', 'hostname': 'smithi069', 'mount_point': '/mnt/cephfs2', 'pid': '22511', 'root': '/'}}, {'id': 14504, 'entity': {'name': {'type': 'client', 'num': 14504}, 'addr': {'type': 'any', 'addr': '127.0.0.1:0', 'nonce': 3460583018}}, 'state': 'open', 'num_leases': 0, 'num_caps': 1, 'request_load_avg': 0, 'uptime': 192.483585759, 'requests_in_flight': 0, 'completed_requests': [], 'reconnecting': False, 'recall_caps': {'value': 0, 'halflife': 60}, 'release_caps': {'value': 0, 'halflife': 60}, 'recall_caps_throttle': {'value': 0, 'halflife': 1.5}, 'recall_caps_throttle2o': {'value': 0, 'halflife': 0.5}, 'session_cache_liveness': {'value': 0.6418025110207033, 'halflife': 300}, 'cap_acquisition': {'value': 0, 'halflife': 10}, 'delegated_inos': [], 'inst': 'client.14504 127.0.0.1:0/3460583018', 'prealloc_inos': [], 'client_metadata': {'client_features': {'feature_bits': '0x000000000000ffff'}, 'metric_spec': {'metric_flags': {'feature_bits': '0x000000000000001f'}}, 'ceph_sha1': '68142daf25e396d4bd8c9caee31c4f0bfe88164f', 'ceph_version': 'ceph version 17.0.0-1725-g68142daf25 (68142daf25e396d4bd8c9caee31c4f0bfe88164f) quincy (dev)', 'entity_id': 'admin', 'hostname': 'smithi069', 'mount_point': '/mnt/cephfs', 'pid': '22470', 'root': '/'}}]

Why the metadata request reply did not carry an entry for client.14585 needs to be examined further. MDSRank::dump_sessions() has this filter:


if (!filter.match(*s, std::bind(&Server::waiting_for_reconnect, server, std::placeholders::_1))) {
  continue;
}

... which might be the reason that the client got filtered out of the session dump. Jos, could you please check if that's the case?
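
One way to check that (a hypothetical snippet, assuming rank 0 and that `ceph tell mds.0 session ls` returns JSON) is to see whether the client in question appears in the session listing at all:

# Hypothetical check: if client.14585 shows up in "session ls", it was not
# filtered out of the dump and the missing metadata has another cause.
import json
import subprocess

out = subprocess.check_output(["ceph", "tell", "mds.0", "session", "ls"])
sessions = json.loads(out)
print("client.14585 present:", any(s.get("id") == 14585 for s in sessions))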

Actions #4

Updated by Jos Collin about 3 years ago

Venky Shankar wrote:
MDSRank::dump_sessions() has this filter:

[...]

... which might be the reason that the client got filtered out of the session dump. Jos, could you please check if that's the case?

I've checked the mds logs. It doesn't hit that filter. As of now, there's nothing interesting for this particular client (client.14585) on the mds side - it looks the same as the other clients. Needs more checking.

Actions #6

Updated by Jos Collin about 3 years ago

  • Status changed from Triaged to Fix Under Review
  • Pull request ID set to 40210
Actions #7

Updated by Venky Shankar about 3 years ago

  • Status changed from Fix Under Review to Pending Backport
Actions #8

Updated by Backport Bot about 3 years ago

  • Copied to Backport #49973: pacific: cephfs-top: missing keys in the client_metadata added
Actions #9

Updated by Nathan Cutler about 3 years ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
