Bug #59551

closed

mgr/stats: exception ValueError :invalid literal for int() with base 16: '0x'

Added by xinyu wang 12 months ago. Updated 8 months ago.

Status: Resolved
Priority: Normal
Assignee: Jos Collin
Category: Correctness/Safety
Target version: v19.0.0
% Done: 0%
Source:
Tags: backport_processed
Backport: reef,quincy,pacific
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): cephfs-top
Labels (FS):
Pull request ID: 51655
Crash signature (v1):
Crash signature (v2):

Description

The 'ceph fs perf stats' command misses some metadata for cephfs clients, such as kernel_version.

2023-04-25T19:28:29.089+0800 7f87064a8700 -1 mgr notify stats.notify:
2023-04-25T19:28:29.089+0800 7f87064a8700 -1 mgr notify Traceback (most recent call last):
  File "/opt/ceph/src/pybind/mgr/stats/module.py", line 32, in notify
    self.fs_perf_stats.notify_cmd(notify_id)
  File "/opt/ceph/src/pybind/mgr/stats/fs/perf_stats.py", line 177, in notify_cmd
    metric_features = int(metadata[CLIENT_METADATA_KEY]["metric_spec"]["metric_flags"]["feature_bits"], 16)
ValueError: invalid literal for int() with base 16: '0x'
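
For reference, the failure can be reproduced in plain Python: int() with base 16 rejects the bare string '0x' that some clients report for their metric feature bits. The snippet below is only an illustrative sketch of a defensive parse (the helper name is made up and this is not necessarily the fix that eventually landed), showing how a caller could treat an empty or '0x'-only value as 'no metric features':

# Illustrative sketch only (hypothetical helper): parse the hex feature-bits
# string defensively instead of calling int(..., 16) directly.
def parse_feature_bits(feature_bits: str) -> int:
    """Parse a hex string such as '0x0000000000003bff'.

    Older kernel clients may report just '0x' (no metric features);
    treat that, or an empty string, as 0 instead of raising ValueError.
    """
    feature_bits = feature_bits.strip()
    if feature_bits in ("", "0x"):
        return 0
    return int(feature_bits, 16)

try:
    int("0x", 16)                                    # exactly the failure in the traceback above
except ValueError as e:
    print(e)                                         # invalid literal for int() with base 16: '0x'

print(parse_feature_bits("0x"))                      # 0
print(parse_feature_bits("0x0000000000003bff"))      # 15359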

Files

mgr.storage01.log (191 KB), Eugen Block, 05/05/2023 07:37 AM
mds-session-ls.out (14.4 KB), Eugen Block, 05/17/2023 12:51 PM

Related issues 3 (0 open, 3 closed)

Copied to CephFS - Backport #61734: pacific: mgr/stats: exception ValueError :invalid literal for int() with base 16: '0x' (Resolved, Jos Collin)
Copied to CephFS - Backport #61735: reef: mgr/stats: exception ValueError :invalid literal for int() with base 16: '0x' (Resolved, Jos Collin)
Copied to CephFS - Backport #61736: quincy: mgr/stats: exception ValueError :invalid literal for int() with base 16: '0x' (Resolved, Jos Collin)
Actions #1

Updated by Venky Shankar 12 months ago

  • Category set to Correctness/Safety
  • Assignee set to Jos Collin
  • Target version set to v19.0.0
  • Backport set to reef,quincy,pacific
Actions #2

Updated by Eugen Block 12 months ago

Not sure if this is required, but I wanted to add some more information: while running cephfs-top, the mgr module crashes all the time:

storage01:~ # ceph health detail
HEALTH_WARN 23 mgr modules have recently crashed
[WRN] RECENT_MGR_MODULE_CRASH: 23 mgr modules have recently crashed
    mgr module stats crashed in daemon mgr.storage01.ygsvte on host storage01 at 2023-05-04T11:36:48.948046Z
    mgr module stats crashed in daemon mgr.storage01.ygsvte on host storage01 at 2023-05-04T11:35:33.897963Z
    mgr module stats crashed in daemon mgr.storage01.ygsvte on host storage01 at 2023-05-04T11:36:13.927182Z
    mgr module stats crashed in daemon mgr.storage01.ygsvte on host storage01 at 2023-05-04T11:35:13.908107Z

But I get output from cephfs-top, so it kind of works, but it leaves all these crashes behind.

Actions #3

Updated by Jos Collin 12 months ago

Eugen Block wrote:

Not sure if this is required, but I wanted to add some more information: while running cephfs-top, the mgr module crashes all the time:

[...]

But I get output from cephfs-top, so it kind of works, but it leaves all these crashes behind.

I don't see these crashes in `ceph health detail` with the latest cephfs-top code. Can you confirm whether it's a related crash? Could you please attach the mgr logs?

Actions #4

Updated by Jos Collin 12 months ago

  • Status changed from New to Need More Info

xinyu wang wrote:

The 'ceph fs perf stats' command misses some metadata for cephfs clients, such as kernel_version.

[...]

@xinyu
Could you please provide the debugfs log file so I can check more on this issue? Let me know your kernel version too.

Actions #5

Updated by Eugen Block 12 months ago

Jos Collin wrote:

Eugen Block wrote:

Not sure if this is required, but I wanted to add some more information: while running cephfs-top, the mgr module crashes all the time:

[...]

But I get output from cephfs-top, so it kind of works, but it leaves all these crashes behind.

I don't see these crashes in `ceph health detail` with the latest cephfs-top code. Can you confirm whether it's a related crash? Could you please attach the mgr logs?

Yes, these crashes are related to that module; here's the info from the last crash:

storage01:~ # ceph crash info 2023-05-05T07:24:28.258666Z_33a408c8-ecfa-4edf-8c37-c093ccf69bd6
{
    "backtrace": [
        "  File \"/usr/share/ceph/mgr/stats/module.py\", line 32, in notify\n    self.fs_perf_stats.notify_cmd(notify_id)",
        "  File \"/usr/share/ceph/mgr/stats/fs/perf_stats.py\", line 177, in notify_cmd\n    metric_features = int(metadata[CLIENT_METADATA_KEY][\"metric_spec\"][\"metric_flags\"][\"feature_bits\"], 16)",
        "ValueError: invalid literal for int() with base 16: '0x'" 
    ],
    "ceph_version": "17.2.6",
    "crash_id": "2023-05-05T07:24:28.258666Z_33a408c8-ecfa-4edf-8c37-c093ccf69bd6",
    "entity_name": "mgr.storage01.ygsvte",
    "mgr_module": "stats",
    "mgr_module_caller": "ActivePyModule::notify",
    "mgr_python_exception": "ValueError",
    "os_id": "centos",
    "os_name": "CentOS Stream",
    "os_version": "8",
    "os_version_id": "8",
    "process_name": "ceph-mgr",
    "stack_sig": "971ae170f1fff7f7bc0b7ae86d164b2b0136a8bd5ca7956166ea5161e51ad42c",
    "timestamp": "2023-05-05T07:24:28.258666Z",
    "utsname_hostname": "storage01",
    "utsname_machine": "x86_64",
    "utsname_release": "5.14.21-150400.24.60-default",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP PREEMPT_DYNAMIC Wed Apr 12 12:13:32 UTC 2023 (93dbe2e)" 
}

The mgr log file is attached. I truncated the file to contain only the logs from the last crashes. Let me know if you need more. By the way, the operating system of the VM is openSUSE Leap 15.4 and cephfs-top is installed from rpm:
cephfs-top-17.2.6.248+gad656d572cb-lp154.2.1.noarch

Actions #6

Updated by Eugen Block 12 months ago

This is kind of strange: when I initially wanted to test cephfs-top, I chose a different virtual ceph cluster which was already running ceph version 17.2.6 (images from quay.io) on openSUSE Leap 15.4, but I got some curses error messages. I then chose to use the above one-node cluster, where I got cephfs-top to work but with the mentioned stack trace. In the other cluster I have now managed to mitigate the curses error, and cephfs-top works there as well, but without breaking the mgr module. Both systems are Leap 15.4 with ceph version 17.2.6:

# machine with stack trace
storage01:~ # ceph versions
{
[...]
    "overall": {
        "ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)": 9
    }
}

storage01:~ # grep NAME /etc/os-release 
NAME="openSUSE Leap" 
PRETTY_NAME="openSUSE Leap 15.4" 
CPE_NAME="cpe:/o:opensuse:leap:15.4" 

storage01:~ # rpm -qa | grep cephfs-top
cephfs-top-17.2.6.248+gad656d572cb-lp154.2.1.noarch

# machine without stack trace
nautilus:~ # ceph versions
{
[...]
    "overall": {
        "ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)": 21
    }
}

nautilus:~ # grep NAME /etc/os-release 
NAME="openSUSE Leap" 
PRETTY_NAME="openSUSE Leap 15.4" 
CPE_NAME="cpe:/o:opensuse:leap:15.4" 

nautilus:~ # rpm -qa | grep cephfs-top
cephfs-top-17.2.6.248+gad656d572cb-lp154.2.1.noarch

So they are practically identical; the only difference is that the nautilus machine was originally installed with Leap 15.1 and upgraded to Quincy, while the other was installed with Leap 15.3 and Pacific. Any idea what the difference could be here?

Actions #7

Updated by Jos Collin 11 months ago

Eugen Block wrote:

This is kind of strange: when I initially wanted to test cephfs-top, I chose a different virtual ceph cluster which was already running ceph version 17.2.6 (images from quay.io) on openSUSE Leap 15.4, but I got some curses error messages. I then chose to use the above one-node cluster, where I got cephfs-top to work but with the mentioned stack trace. In the other cluster I have now managed to mitigate the curses error, and cephfs-top works there as well, but without breaking the mgr module. Both systems are Leap 15.4 with ceph version 17.2.6:

[...]

So they are practically identical; the only difference is that the nautilus machine was originally installed with Leap 15.1 and upgraded to Quincy, while the other was installed with Leap 15.3 and Pacific. Any idea what the difference could be here?

Leap 15.1 uses kernel v4.12. You need kernel v5.14 to have the complete set of 'perf stats' patches.
Could you please send the output of 'session ls' so I can check?

Actions #8

Updated by Eugen Block 11 months ago

Both VMs use the same kernel version (they are not running 15.1 anymore; both have been upgraded to 15.4 on the way to Quincy):

storage01:~ # uname -r
5.14.21-150400.24.60-default

nautilus:~ # uname -r
5.14.21-150400.24.60-default

The session ls output would only show mounted clients, and neither of these storage nodes has the cephfs mounted. Or do you suspect that older clients mounting the cephfs could cause the stack trace? Could you please clarify? In the cluster with the stack trace there are older kernel clients, for example:


        "client_metadata": {
            "client_features": {
                "feature_bits": "0x0000000000003bff" 
            },
            "metric_spec": {
                "metric_flags": {
                    "feature_bits": "0x" 
                }
            },
            "entity_id": "nova-mount",
            "hostname": "compute01",
            "kernel_version": "5.3.18-lp152.106-default",
            "root": "/openstack-cluster/nova-instances" 

Actions #9

Updated by Jos Collin 11 months ago

Eugen Block wrote:

Both VMs use the same kernel version (they are not running 15.1 anymore; both have been upgraded to 15.4 on the way to Quincy):

storage01:~ # uname -r
5.14.21-150400.24.60-default

nautilus:~ # uname -r
5.14.21-150400.24.60-default

So does the above-mentioned issue (in the Description) exist in your cluster?

The session ls output would only show mounted clients, and neither of these storage nodes has the cephfs mounted. Or do you suspect that older clients mounting the cephfs could cause the stack trace? Could you please clarify? In the cluster with the stack trace there are older kernel clients, for example:
[...]

I need the output from the older cluster, where you hit the issue.

Actions #10

Updated by Eugen Block 11 months ago

Nothing has changed on storage01 since my first crash report. The crash happens on Leap 15.4 with Quincy. I just wanted to point out that on a different cluster with the same kernel, same OS, and same ceph version the crashes are not happening. I can paste the entire session ls output if you want; I've uploaded it as a text file.

Actions #11

Updated by Jos Collin 11 months ago

  • Status changed from Need More Info to Fix Under Review
  • Pull request ID set to 51655
Actions #12

Updated by Venky Shankar 10 months ago

  • Status changed from Fix Under Review to Pending Backport
Actions #13

Updated by Backport Bot 10 months ago

  • Copied to Backport #61734: pacific: mgr/stats: exception ValueError :invalid literal for int() with base 16: '0x' added
Actions #14

Updated by Backport Bot 10 months ago

  • Copied to Backport #61735: reef: mgr/stats: exception ValueError :invalid literal for int() with base 16: '0x' added
Actions #15

Updated by Backport Bot 10 months ago

  • Copied to Backport #61736: quincy: mgr/stats: exception ValueError :invalid literal for int() with base 16: '0x' added
Actions #16

Updated by Backport Bot 10 months ago

  • Tags set to backport_processed
Actions #17

Updated by Jos Collin 8 months ago

  • Status changed from Pending Backport to Resolved