Bug #59551

closed

mgr/stats: exception ValueError :invalid literal for int() with base 16: '0x'

Added by xinyu wang 12 months ago. Updated 8 months ago.

Status: Resolved
Priority: Normal
Assignee: Jos Collin
Category: Correctness/Safety
Target version: v19.0.0
% Done: 0%
Source:
Tags: backport_processed
Backport: reef,quincy,pacific
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): cephfs-top
Labels (FS):
Pull request ID: 51655
Crash signature (v1):
Crash signature (v2):

Description

The 'ceph fs perf stats' command misses some metadata for cephfs clients, such as kernel_version.

2023-04-25T19:28:29.089+0800 7f87064a8700 -1 mgr notify stats.notify:
2023-04-25T19:28:29.089+0800 7f87064a8700 -1 mgr notify Traceback (most recent call last):
  File "/opt/ceph/src/pybind/mgr/stats/module.py", line 32, in notify
    self.fs_perf_stats.notify_cmd(notify_id)
  File "/opt/ceph/src/pybind/mgr/stats/fs/perf_stats.py", line 177, in notify_cmd
    metric_features = int(metadata[CLIENT_METADATA_KEY]["metric_spec"]["metric_flags"]["feature_bits"], 16)
ValueError: invalid literal for int() with base 16: '0x'
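
For reference, the failure can be reproduced in plain Python: int() with base 16 rejects the bare string '0x' that some clients report for their metric feature bits. The snippet below is only an illustrative sketch of a defensive parse (the helper name is made up and this is not necessarily the fix that eventually landed), showing how a caller could treat an empty or '0x'-only value as 'no metric features':

# Illustrative sketch only (hypothetical helper): parse the hex feature-bits
# string defensively instead of calling int(..., 16) directly.
def parse_feature_bits(feature_bits: str) -> int:
    """Parse a hex string such as '0x0000000000003bff'.

    Older kernel clients may report just '0x' (no metric features);
    treat that, or an empty string, as 0 instead of raising ValueError.
    """
    feature_bits = feature_bits.strip()
    if feature_bits in ("", "0x"):
        return 0
    return int(feature_bits, 16)

try:
    int("0x", 16)                                    # exactly the failure in the traceback above
except ValueError as e:
    print(e)                                         # invalid literal for int() with base 16: '0x'

print(parse_feature_bits("0x"))                      # 0
print(parse_feature_bits("0x0000000000003bff"))      # 15359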

Files

mgr.storage01.log (191 KB), Eugen Block, 05/05/2023 07:37 AM
mds-session-ls.out (14.4 KB), Eugen Block, 05/17/2023 12:51 PM

Related issues 3 (0 open, 3 closed)

Copied to CephFS - Backport #61734: pacific: mgr/stats: exception ValueError :invalid literal for int() with base 16: '0x' (Resolved, Jos Collin)
Copied to CephFS - Backport #61735: reef: mgr/stats: exception ValueError :invalid literal for int() with base 16: '0x' (Resolved, Jos Collin)
Copied to CephFS - Backport #61736: quincy: mgr/stats: exception ValueError :invalid literal for int() with base 16: '0x' (Resolved, Jos Collin)
Actions #1

Updated by Venky Shankar 12 months ago

  • Category set to Correctness/Safety
  • Assignee set to Jos Collin
  • Target version set to v19.0.0
  • Backport set to reef,quincy,pacific
Actions #2

Updated by Eugen Block 12 months ago

Not sure if this is required, but I wanted to add some more information: while running cephfs-top, the mgr module crashes all the time:

storage01:~ # ceph health detail
HEALTH_WARN 23 mgr modules have recently crashed
[WRN] RECENT_MGR_MODULE_CRASH: 23 mgr modules have recently crashed
    mgr module stats crashed in daemon mgr.storage01.ygsvte on host storage01 at 2023-05-04T11:36:48.948046Z
    mgr module stats crashed in daemon mgr.storage01.ygsvte on host storage01 at 2023-05-04T11:35:33.897963Z
    mgr module stats crashed in daemon mgr.storage01.ygsvte on host storage01 at 2023-05-04T11:36:13.927182Z
    mgr module stats crashed in daemon mgr.storage01.ygsvte on host storage01 at 2023-05-04T11:35:13.908107Z

But I get output from cephfs-top, so it kind of works, but it leaves all these crashes behind.

Actions #3

Updated by Jos Collin 12 months ago

Eugen Block wrote:

Not sure if this is required, but I wanted to add some more information: while running cephfs-top, the mgr module crashes all the time:

[...]

But I get output from cephfs-top, so it kind of works, but it leaves all these crashes behind.

I don't see these crashes in `ceph health detail` with the latest cephfs-top code. Can you confirm whether it's a related crash? Could you please attach the mgr logs?

Actions #4

Updated by Jos Collin 12 months ago

  • Status changed from New to Need More Info

xinyu wang wrote:

The 'ceph fs perf stats' command misses some metadata for cephfs clients, such as kernel_version.

[...]

@xinyu
Could you please provide the debugfs log file so I can check more on this issue? Let me know your kernel version too.

Actions #5

Updated by Eugen Block 12 months ago

Jos Collin wrote:

Eugen Block wrote:

Not sure if this is required, but I wanted to add some more information: while running cephfs-top, the mgr module crashes all the time:

[...]

But I get output from cephfs-top, so it kind of works, but it leaves all these crashes behind.

I don't see these crashes in `ceph health detail` with the latest cephfs-top code. Can you confirm whether it's a related crash? Could you please attach the mgr logs?

Yes, these crashes are related to that module; here's the info from the last crash:

storage01:~ # ceph crash info 2023-05-05T07:24:28.258666Z_33a408c8-ecfa-4edf-8c37-c093ccf69bd6
{
    "backtrace": [
        "  File \"/usr/share/ceph/mgr/stats/module.py\", line 32, in notify\n    self.fs_perf_stats.notify_cmd(notify_id)",
        "  File \"/usr/share/ceph/mgr/stats/fs/perf_stats.py\", line 177, in notify_cmd\n    metric_features = int(metadata[CLIENT_METADATA_KEY][\"metric_spec\"][\"metric_flags\"][\"feature_bits\"], 16)",
        "ValueError: invalid literal for int() with base 16: '0x'" 
    ],
    "ceph_version": "17.2.6",
    "crash_id": "2023-05-05T07:24:28.258666Z_33a408c8-ecfa-4edf-8c37-c093ccf69bd6",
    "entity_name": "mgr.storage01.ygsvte",
    "mgr_module": "stats",
    "mgr_module_caller": "ActivePyModule::notify",
    "mgr_python_exception": "ValueError",
    "os_id": "centos",
    "os_name": "CentOS Stream",
    "os_version": "8",
    "os_version_id": "8",
    "process_name": "ceph-mgr",
    "stack_sig": "971ae170f1fff7f7bc0b7ae86d164b2b0136a8bd5ca7956166ea5161e51ad42c",
    "timestamp": "2023-05-05T07:24:28.258666Z",
    "utsname_hostname": "storage01",
    "utsname_machine": "x86_64",
    "utsname_release": "5.14.21-150400.24.60-default",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP PREEMPT_DYNAMIC Wed Apr 12 12:13:32 UTC 2023 (93dbe2e)" 
}

The mgr log file is attached. I truncated the file to contain only the logs from the last crashes. Let me know if you need more. By the way, the operating system of the VM is openSUSE Leap 15.4 and cephfs-top is installed from rpm:
cephfs-top-17.2.6.248+gad656d572cb-lp154.2.1.noarch

Actions #6

Updated by Eugen Block 12 months ago

This is kind of strange: when I initially wanted to test cephfs-top, I chose a different virtual ceph cluster which was already running ceph version 17.2.6 (images from quay.io) on openSUSE Leap 15.4, but I got some curses error messages. I then chose to use the above one-node cluster, where I got cephfs-top to work but with the mentioned stack trace. In the other cluster I have now managed to mitigate the curses error, and cephfs-top works there as well, but without breaking the mgr module. Both systems are Leap 15.4 with ceph version 17.2.6:

# machine with stack trace
storage01:~ # ceph versions
{
[...]
    "overall": {
        "ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)": 9
    }
}

storage01:~ # grep NAME /etc/os-release 
NAME="openSUSE Leap" 
PRETTY_NAME="openSUSE Leap 15.4" 
CPE_NAME="cpe:/o:opensuse:leap:15.4" 

storage01:~ # rpm -qa | grep cephfs-top
cephfs-top-17.2.6.248+gad656d572cb-lp154.2.1.noarch

# machine without stack trace
nautilus:~ # ceph versions
{
[...]
    "overall": {
        "ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)": 21
    }
}

nautilus:~ # grep NAME /etc/os-release 
NAME="openSUSE Leap" 
PRETTY_NAME="openSUSE Leap 15.4" 
CPE_NAME="cpe:/o:opensuse:leap:15.4" 

nautilus:~ # rpm -qa | grep cephfs-top
cephfs-top-17.2.6.248+gad656d572cb-lp154.2.1.noarch

So they are practically identical; the only difference is that the nautilus machine was originally installed with Leap 15.1 and upgraded to Quincy, while the other was installed with Leap 15.3 and Pacific. Any idea what the difference could be here?

Actions #7

Updated by Jos Collin 11 months ago

Eugen Block wrote:

This is kind of strange: when I initially wanted to test cephfs-top, I chose a different virtual ceph cluster which was already running ceph version 17.2.6 (images from quay.io) on openSUSE Leap 15.4, but I got some curses error messages. I then chose to use the above one-node cluster, where I got cephfs-top to work but with the mentioned stack trace. In the other cluster I have now managed to mitigate the curses error, and cephfs-top works there as well, but without breaking the mgr module. Both systems are Leap 15.4 with ceph version 17.2.6:

[...]

So they are practically identical; the only difference is that the nautilus machine was originally installed with Leap 15.1 and upgraded to Quincy, while the other was installed with Leap 15.3 and Pacific. Any idea what the difference could be here?

Leap 15.1 uses kernel v4.12. You need kernel v5.14 to have the complete set of 'perf stats' patches.
Could you please send the output of 'session ls' so I can check?

Actions #8

Updated by Eugen Block 11 months ago

Both VMs use the same kernel version (they are not running 15.1 anymore; both have been upgraded to 15.4 on the way to Quincy):

storage01:~ # uname -r
5.14.21-150400.24.60-default

nautilus:~ # uname -r
5.14.21-150400.24.60-default

The session ls output would only show mounted clients, and neither of these storage nodes has the cephfs mounted. Or do you suspect that older clients mounting the cephfs could cause the stack trace? Could you please clarify? In the cluster with the stack trace there are older kernel clients, for example:


        "client_metadata": {
            "client_features": {
                "feature_bits": "0x0000000000003bff" 
            },
            "metric_spec": {
                "metric_flags": {
                    "feature_bits": "0x" 
                }
            },
            "entity_id": "nova-mount",
            "hostname": "compute01",
            "kernel_version": "5.3.18-lp152.106-default",
            "root": "/openstack-cluster/nova-instances" 

Actions #9

Updated by Jos Collin 11 months ago

Eugen Block wrote:

Both VMs use the same kernel version (they are not running 15.1 anymore; both have been upgraded to 15.4 on the way to Quincy):

storage01:~ # uname -r
5.14.21-150400.24.60-default

nautilus:~ # uname -r
5.14.21-150400.24.60-default

So does the above-mentioned issue (in the Description) exist in your cluster?

The session ls output would only show mounted clients, and neither of these storage nodes has the cephfs mounted. Or do you suspect that older clients mounting the cephfs could cause the stack trace? Could you please clarify? In the cluster with the stack trace there are older kernel clients, for example:
[...]

I need the output from the older cluster, where you hit the issue.

Actions #10

Updated by Eugen Block 11 months ago

Nothing has changed on storage01 since my first crash report. The crash happens on Leap 15.4 with Quincy. I just wanted to point out that on a different cluster with the same kernel, same OS, and same ceph version the crashes are not happening. I can paste the entire session ls output if you want; I've uploaded it as a text file.

Actions #11

Updated by Jos Collin 11 months ago

  • Status changed from Need More Info to Fix Under Review
  • Pull request ID set to 51655
Actions #12

Updated by Venky Shankar 10 months ago

  • Status changed from Fix Under Review to Pending Backport
Actions #13

Updated by Backport Bot 10 months ago

  • Copied to Backport #61734: pacific: mgr/stats: exception ValueError :invalid literal for int() with base 16: '0x' added
Actions #14

Updated by Backport Bot 10 months ago

  • Copied to Backport #61735: reef: mgr/stats: exception ValueError :invalid literal for int() with base 16: '0x' added
Actions #15

Updated by Backport Bot 10 months ago

  • Copied to Backport #61736: quincy: mgr/stats: exception ValueError :invalid literal for int() with base 16: '0x' added
Actions #16

Updated by Backport Bot 10 months ago

  • Tags set to backport_processed
Actions #17

Updated by Jos Collin 8 months ago

  • Status changed from Pending Backport to Resolved