Bug #59551
mgr/stats: exception ValueError: invalid literal for int() with base 16: '0x'
Status: Closed
Description
The 'ceph fs perf stats' command misses some metadata for cephfs clients, such as kernel_version.
```
2023-04-25T19:28:29.089+0800 7f87064a8700 -1 mgr notify stats.notify:
2023-04-25T19:28:29.089+0800 7f87064a8700 -1 mgr notify Traceback (most recent call last):
  File "/opt/ceph/src/pybind/mgr/stats/module.py", line 32, in notify
    self.fs_perf_stats.notify_cmd(notify_id)
  File "/opt/ceph/src/pybind/mgr/stats/fs/perf_stats.py", line 177, in notify_cmd
    metric_features = int(metadata[CLIENT_METADATA_KEY]["metric_spec"]["metric_flags"]["feature_bits"], 16)
ValueError: invalid literal for int() with base 16: '0x'
```
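For context, the failure is easy to reproduce directly in Python: `int('0x', 16)` always raises, because a bare `'0x'` prefix with no digits is not a valid hex literal. A minimal sketch of the problem and one possible defensive parse follows; the `parse_feature_bits` helper is hypothetical and illustrative only, not the actual fix from the linked pull request.

```python
# Hypothetical helper (not from the Ceph source): parse a hex
# feature-bits string, treating a bare '0x' (as sent by some older
# kernel clients) as zero instead of raising ValueError.
def parse_feature_bits(raw: str) -> int:
    stripped = raw.strip().lower()
    if stripped in ("", "0x"):
        return 0
    return int(stripped, 16)

# The exact call that crashes the stats module:
try:
    int("0x", 16)
except ValueError as e:
    print(e)  # invalid literal for int() with base 16: '0x'

# The defensive variant handles both the degenerate and the normal case:
print(parse_feature_bits("0x"))                  # 0
print(parse_feature_bits("0x0000000000003bff"))  # 15359
```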
Updated by Venky Shankar 12 months ago
- Category set to Correctness/Safety
- Assignee set to Jos Collin
- Target version set to v19.0.0
- Backport set to reef,quincy,pacific
Updated by Eugen Block 12 months ago
Not sure if required but I wanted to add some more information, while running cephfs-top the mgr module crashes all the time:
```
storage01:~ # ceph health detail
HEALTH_WARN 23 mgr modules have recently crashed
[WRN] RECENT_MGR_MODULE_CRASH: 23 mgr modules have recently crashed
    mgr module stats crashed in daemon mgr.storage01.ygsvte on host storage01 at 2023-05-04T11:36:48.948046Z
    mgr module stats crashed in daemon mgr.storage01.ygsvte on host storage01 at 2023-05-04T11:35:33.897963Z
    mgr module stats crashed in daemon mgr.storage01.ygsvte on host storage01 at 2023-05-04T11:36:13.927182Z
    mgr module stats crashed in daemon mgr.storage01.ygsvte on host storage01 at 2023-05-04T11:35:13.908107Z
```
But I do get output from cephfs-top, so it kind of works; it just leaves all these crashes behind.
Updated by Jos Collin 12 months ago
Eugen Block wrote:
Not sure if required but I wanted to add some more information, while running cephfs-top the mgr module crashes all the time:
[...]
But I get output from cephfs-top so it kind of works but leaves all these crashes.
I don't see these crashes in `ceph health detail` with the latest cephfs-top code. Can you confirm whether it's a related crash? Could you please attach the mgr logs?
Updated by Jos Collin 12 months ago
- Status changed from New to Need More Info
xinyu wang wrote:
'ceph fs perf stats' command miss some metadata for cephfs client, such as kernel_version.
[...]
@xinyu
Could you please provide the debugfs log file to check more on this issue? Let me know your kernel version too.
Updated by Eugen Block 12 months ago
- File mgr.storage01.log mgr.storage01.log added
Jos Collin wrote:
Eugen Block wrote:
Not sure if required but I wanted to add some more information, while running cephfs-top the mgr module crashes all the time:
[...]
But I get output from cephfs-top so it kind of works but leaves all these crashes.
I don't see these crashes in `ceph health detail` with the latest cephfs-top code. Can you confirm if it's related crash? Could you please attach the mgr logs?
Yes, these crashes are related to that module, here's the info from the last crash:
```
storage01:~ # ceph crash info 2023-05-05T07:24:28.258666Z_33a408c8-ecfa-4edf-8c37-c093ccf69bd6
{
    "backtrace": [
        "  File \"/usr/share/ceph/mgr/stats/module.py\", line 32, in notify\n    self.fs_perf_stats.notify_cmd(notify_id)",
        "  File \"/usr/share/ceph/mgr/stats/fs/perf_stats.py\", line 177, in notify_cmd\n    metric_features = int(metadata[CLIENT_METADATA_KEY][\"metric_spec\"][\"metric_flags\"][\"feature_bits\"], 16)",
        "ValueError: invalid literal for int() with base 16: '0x'"
    ],
    "ceph_version": "17.2.6",
    "crash_id": "2023-05-05T07:24:28.258666Z_33a408c8-ecfa-4edf-8c37-c093ccf69bd6",
    "entity_name": "mgr.storage01.ygsvte",
    "mgr_module": "stats",
    "mgr_module_caller": "ActivePyModule::notify",
    "mgr_python_exception": "ValueError",
    "os_id": "centos",
    "os_name": "CentOS Stream",
    "os_version": "8",
    "os_version_id": "8",
    "process_name": "ceph-mgr",
    "stack_sig": "971ae170f1fff7f7bc0b7ae86d164b2b0136a8bd5ca7956166ea5161e51ad42c",
    "timestamp": "2023-05-05T07:24:28.258666Z",
    "utsname_hostname": "storage01",
    "utsname_machine": "x86_64",
    "utsname_release": "5.14.21-150400.24.60-default",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP PREEMPT_DYNAMIC Wed Apr 12 12:13:32 UTC 2023 (93dbe2e)"
}
```
The mgr log file is attached. I truncated the file to contain only the logs from the last crashes. Let me know if you need more. By the way, the operating system of the VM is openSUSE Leap 15.4 and cephfs-top is installed from rpm:
cephfs-top-17.2.6.248+gad656d572cb-lp154.2.1.noarch
Updated by Eugen Block 12 months ago
This is kind of strange. When I initially wanted to test cephfs-top, I chose a different virtual ceph cluster that was already running ceph version 17.2.6 (images from quay.io) on openSUSE Leap 15.4, but I got some curses error messages. I then switched to the one-node cluster above, where I got cephfs-top to work, but with the mentioned stack trace. Now I have managed to mitigate the curses error on the other cluster, and cephfs-top works there as well, without breaking the mgr module. Both systems are Leap 15.4 with ceph version 17.2.6:
```
# machine with stack trace
storage01:~ # ceph versions
{
    [...]
    "overall": {
        "ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)": 9
    }
}
storage01:~ # grep NAME /etc/os-release
NAME="openSUSE Leap"
PRETTY_NAME="openSUSE Leap 15.4"
CPE_NAME="cpe:/o:opensuse:leap:15.4"
storage01:~ # rpm -qa | grep cephfs-top
cephfs-top-17.2.6.248+gad656d572cb-lp154.2.1.noarch

# machine without stack trace
nautilus:~ # ceph versions
{
    [...]
    "overall": {
        "ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)": 21
    }
}
nautilus:~ # grep NAME /etc/os-release
NAME="openSUSE Leap"
PRETTY_NAME="openSUSE Leap 15.4"
CPE_NAME="cpe:/o:opensuse:leap:15.4"
nautilus:~ # rpm -qa | grep cephfs-top
cephfs-top-17.2.6.248+gad656d572cb-lp154.2.1.noarch
```
So they are practically identical; the only difference is that the nautilus machine was originally installed with Leap 15.1 and upgraded to Quincy, while the other was installed with Leap 15.3 and Pacific. Any idea what the difference could be here?
Updated by Jos Collin 11 months ago
Eugen Block wrote:
This is kind of strange, when I initially wanted to test cephfs-top I chose a different virtual ceph cluster which already was running ceph version 17.2.6 (images from quay.io) on openSUSE Leap 15.4, but I got some curses error messages. Then I chose to use above one-node cluster where I got cephfs-top to work but with mentioned stack trace. Now in the other cluster I managed to mitigate the curses error and now cephfs-top works there as well, but without breaking the mgr module. Both systems are Leap 15.4 with ceph version 17.2.6:
[...]
So they are practically identical, only the nautilus machine has been installed with Leap 15.1 and was upgraded to Quincy while the other was installed with Leap 15.3 and Pacific. Any idea what the difference could be here?
Leap 15.1 uses kernel v4.12. You need kernel v5.14 to have the complete set of 'perf stats' patches.
Could you please send the output of 'session ls' to check?
Updated by Eugen Block 11 months ago
Both VMs use the same kernel version (they are not running 15.1 anymore, both have been upgraded to 15.4 on the way to quincy):
```
storage01:~ # uname -r
5.14.21-150400.24.60-default
nautilus:~ # uname -r
5.14.21-150400.24.60-default
```
The session ls output would only show mounted clients, but both these storage nodes do not have the cephfs mounted. Or do you suspect that older clients mounting the cephfs could cause the stack trace? Could you please clarify? In the cluster with a stack trace there are older kernel clients, for example:
```
"client_metadata": {
    "client_features": {
        "feature_bits": "0x0000000000003bff"
    },
    "metric_spec": {
        "metric_flags": {
            "feature_bits": "0x"
        }
    },
    "entity_id": "nova-mount",
    "hostname": "compute01",
    "kernel_version": "5.3.18-lp152.106-default",
    "root": "/openstack-cluster/nova-instances"
```
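To find which mounted clients would trigger this parse failure, one could scan the session-list JSON for metadata whose metric `feature_bits` is a bare '0x'. A hedged sketch follows; the `find_bad_metric_clients` helper and the overall session-list shape are assumptions for illustration, while the nested field names follow the snippet above.

```python
import json

def find_bad_metric_clients(sessions_json: str) -> list:
    """Return (entity_id, kernel_version) pairs for sessions whose
    metric_spec feature_bits is a bare '0x' or empty string."""
    bad = []
    for session in json.loads(sessions_json):
        md = session.get("client_metadata", {})
        bits = (md.get("metric_spec", {})
                  .get("metric_flags", {})
                  .get("feature_bits", ""))
        if bits.strip().lower() in ("", "0x"):
            bad.append((md.get("entity_id"), md.get("kernel_version")))
    return bad

# Sample session entry mirroring the metadata quoted above:
sample = json.dumps([{
    "client_metadata": {
        "client_features": {"feature_bits": "0x0000000000003bff"},
        "metric_spec": {"metric_flags": {"feature_bits": "0x"}},
        "entity_id": "nova-mount",
        "hostname": "compute01",
        "kernel_version": "5.3.18-lp152.106-default",
    }
}])
print(find_bad_metric_clients(sample))
# [('nova-mount', '5.3.18-lp152.106-default')]
```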
Updated by Jos Collin 11 months ago
Eugen Block wrote:
Both VMs use the same kernel version (they are not running 15.1 anymore, both have been upgraded to 15.4 on the way to quincy):
storage01:~ # uname -r
5.14.21-150400.24.60-default
nautilus:~ # uname -r
5.14.21-150400.24.60-default
So does the above-mentioned issue (in the Description) exist in your cluster?
The session ls output would only show mounted clients, but both these storage nodes do not have the cephfs mounted. Or do you suspect that older clients mounting the cephfs could cause the stack trace? Could you please clarify? In the cluster with a stack trace there are older kernel clients, for example:
[...]
I need the output from the older cluster, where you hit the issue.
Updated by Eugen Block 11 months ago
- File mds-session-ls.out mds-session-ls.out added
Nothing has changed on storage01 since my first crash report. The crash happens on Leap 15.4 with Quincy. I just wanted to point out that on a different cluster with the same kernel, same OS, same ceph version the crashes are not happening. I can paste the entire session ls output if you want, I've uploaded it as a text file.
Updated by Jos Collin 11 months ago
- Status changed from Need More Info to Fix Under Review
- Pull request ID set to 51655
Updated by Venky Shankar 10 months ago
- Status changed from Fix Under Review to Pending Backport
Updated by Backport Bot 10 months ago
- Copied to Backport #61734: pacific: mgr/stats: exception ValueError :invalid literal for int() with base 16: '0x' added
Updated by Backport Bot 10 months ago
- Copied to Backport #61735: reef: mgr/stats: exception ValueError :invalid literal for int() with base 16: '0x' added
Updated by Backport Bot 10 months ago
- Copied to Backport #61736: quincy: mgr/stats: exception ValueError :invalid literal for int() with base 16: '0x' added
Updated by Jos Collin 8 months ago
- Status changed from Pending Backport to Resolved