Bug #55859

open

Radosgw-admin: illegal instruction, running on commodity hardware

Added by Samuel Martin Moro almost 2 years ago. Updated 6 months ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Community (user)
Tags: radosgw-admin illegal instruction opteron quincy
Backport:
Regression: Yes
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi,

I recently re-deployed a cluster from scratch. It used to run Octopus on top of CentOS 8. A couple of weeks ago, I switched to Debian 11 and Quincy.

Previously, I could run "radosgw-admin" commands from my ceph nodes. Since re-deploying, this is no longer possible.
Everything else works just fine (rbd, cephfs, mgrs, dashboard, ...).

root@mon1:~# radosgw-admin user create --uid=prometheus-xp --display-name="Prometheus s3 exporter" --email=monitoring@prometheus
Illegal instruction
root@mon1:~# radosgw-admin
Illegal instruction

This definitely looks CPU related ...

Note that I'm running on relatively old hardware: some ProLiant MicroServer. I don't recall which generation, and it doesn't show in dmidecode (4/5/6?)... let's go with "old".

root@mon1:~# dmesg
...
[676261.074960] traps: radosgw-admin[33623] trap invalid opcode ip:55ad8f7b1963 sp:7ffdb7f19d10 error:0 in radosgw-admin[55ad8f69c000+cda000]
[676273.935910] traps: radosgw-admin[33625] trap invalid opcode ip:55d6106b1963 sp:7ffdbef995f0 error:0 in radosgw-admin[55d61059c000+cda000]
[676724.158210] traps: radosgw-admin[33873] trap invalid opcode ip:557c6a8a8963 sp:7ffe7e1e6f90 error:0 in radosgw-admin[557c6a793000+cda000]
[676727.997861] traps: radosgw-admin[33874] trap invalid opcode ip:563924285963 sp:7ffd3fe44900 error:0 in radosgw-admin[563924170000+cda000]
[676731.330261] traps: radosgw-admin[33875] trap invalid opcode ip:55cf22fc2963 sp:7fff3dc7ec30 error:0 in radosgw-admin[55cf22ead000+cda000]
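
The faulting addresses in the traps above all end in the same digits, which hints that the same instruction faults on every run. Here is a minimal Python sketch, not part of the original report, that subtracts the mapping base from the faulting `ip` on each line. The regex and the helper name `fault_offsets` are illustrative assumptions. The result is an offset into the mapping the kernel reports; before it can be matched against a disassembly, it still has to be combined with that segment's file offset (for example from `readelf -l` or `/proc/<pid>/maps`).

```python
import re

# The five trap lines from dmesg quoted above.
DMESG = """\
[676261.074960] traps: radosgw-admin[33623] trap invalid opcode ip:55ad8f7b1963 sp:7ffdb7f19d10 error:0 in radosgw-admin[55ad8f69c000+cda000]
[676273.935910] traps: radosgw-admin[33625] trap invalid opcode ip:55d6106b1963 sp:7ffdbef995f0 error:0 in radosgw-admin[55d61059c000+cda000]
[676724.158210] traps: radosgw-admin[33873] trap invalid opcode ip:557c6a8a8963 sp:7ffe7e1e6f90 error:0 in radosgw-admin[557c6a793000+cda000]
[676727.997861] traps: radosgw-admin[33874] trap invalid opcode ip:563924285963 sp:7ffd3fe44900 error:0 in radosgw-admin[563924170000+cda000]
[676731.330261] traps: radosgw-admin[33875] trap invalid opcode ip:55cf22fc2963 sp:7fff3dc7ec30 error:0 in radosgw-admin[55cf22ead000+cda000]
"""

# Captures: faulting ip, object name, mapping base, mapping size (all hex).
TRAP_RE = re.compile(r"ip:([0-9a-f]+) .* in ([^\[\s]+)\[([0-9a-f]+)\+([0-9a-f]+)\]")


def fault_offsets(text):
    """Return (object name, hex offset of ip within the mapping) per trap line."""
    results = []
    for line in text.splitlines():
        match = TRAP_RE.search(line)
        if match is None:
            continue
        ip, name, base, size = match.groups()
        offset = int(ip, 16) - int(base, 16)
        # Sanity check: the faulting address must lie inside the mapping.
        if 0 <= offset < int(size, 16):
            results.append((name, hex(offset)))
    return results


if __name__ == "__main__":
    for name, offset in fault_offsets(DMESG):
        print(name, offset)
```

Every line yields the same offset (0x115963), so address randomisation is the only thing that differs between runs.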

root@mon1:~# cat /proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 16
model : 2
model name : AMD Opteron 23xx (Gen 3 Class Opteron)
stepping : 3
microcode : 0x1000065
cpu MHz : 2196.340
cache size : 512 KB
physical id : 0
siblings : 1
core id : 0
cpu cores : 1
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 6
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow rep_good nopl cpuid extd_apicid tsc_known_freq pni cx16 x2apic popcnt tsc_deadline_timer hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw vmmcall arat
bugs : tlb_mmatch fxsave_leak sysret_ss_attrs null_seg spectre_v1 spectre_v2
bogomips : 4392.68
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management:

processor : 1
vendor_id : AuthenticAMD
cpu family : 16
model : 2
model name : AMD Opteron 23xx (Gen 3 Class Opteron)
stepping : 3
microcode : 0x1000065
cpu MHz : 2196.340
cache size : 512 KB
physical id : 1
siblings : 1
core id : 0
cpu cores : 1
apicid : 1
initial apicid : 1
fpu : yes
fpu_exception : yes
cpuid level : 6
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow rep_good nopl cpuid extd_apicid tsc_known_freq pni cx16 x2apic popcnt tsc_deadline_timer hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw vmmcall arat
bugs : tlb_mmatch fxsave_leak sysret_ss_attrs null_seg spectre_v1 spectre_v2
bogomips : 4392.68
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management:

root@mon1:~# dpkg -l|grep ceph
ii ceph 17.2.0-1~bpo11+1 amd64 distributed storage and file system
ii ceph-base 17.2.0-1~bpo11+1 amd64 common ceph daemon libraries and management tools
ii ceph-common 17.2.0-1~bpo11+1 amd64 common utilities to mount and interact with a ceph storage cluster
ii ceph-fuse 17.2.0-1~bpo11+1 amd64 FUSE-based client for the Ceph distributed file system
ii ceph-mds 17.2.0-1~bpo11+1 amd64 metadata server for the ceph distributed file system
ii ceph-mgr 17.2.0-1~bpo11+1 amd64 manager for the ceph distributed storage system
ii ceph-mgr-cephadm 17.2.0-1~bpo11+1 all cephadm orchestrator module for ceph-mgr
ii ceph-mgr-dashboard 17.2.0-1~bpo11+1 all dashboard module for ceph-mgr
ii ceph-mgr-diskprediction-local 17.2.0-1~bpo11+1 all diskprediction-local module for ceph-mgr
ii ceph-mgr-k8sevents 17.2.0-1~bpo11+1 all kubernetes events module for ceph-mgr
ii ceph-mgr-modules-core 17.2.0-1~bpo11+1 all ceph manager modules which are always enabled
ii ceph-mon 17.2.0-1~bpo11+1 amd64 monitor server for the ceph storage system
ii ceph-osd 17.2.0-1~bpo11+1 amd64 OSD server for the ceph storage system
ii ceph-volume 17.2.0-1~bpo11+1 all tool to facilidate OSD deployment
ii cephadm 17.2.0-1~bpo11+1 amd64 cephadm utility to bootstrap ceph daemons with systemd and containers
ii libcephfs2 17.2.0-1~bpo11+1 amd64 Ceph distributed file system client library
ii libsqlite3-mod-ceph 17.2.0-1~bpo11+1 amd64 SQLite3 VFS for Ceph
ii python3-ceph-argparse 17.2.0-1~bpo11+1 all Python 3 utility libraries for Ceph CLI
ii python3-ceph-common 17.2.0-1~bpo11+1 all Python 3 utility libraries for Ceph
ii python3-cephfs 17.2.0-1~bpo11+1 amd64 Python 3 libraries for the Ceph libcephfs library

And I was able to confirm this: on a Kubernetes cluster (still running on old hardware, Intel(R) Xeon(R) CPU X5650), I can build an image with the same Debian/Ceph versions and start a Pod.
There, radosgw-admin commands work just fine. I was able to create my users, set permissions, ...

Whereas when I try to manage S3 from my Ceph nodes, it crashes. I can't even get a help message out of radosgw-admin.

Something "broke", or compatibility was lost, somewhere between Octopus and Quincy for such an old CPU.
As far as I can see, it only affects radosgw-admin.
It could be some compiler option / optimization (?)
It could also be specific to the Debian packaging: I'm not sure I would have had this issue had I stuck with some CentOS 8 derivative ...
If it is Debian-specific, the regression may have been introduced even before Octopus: the last time I had these nodes running Debian/Ceph/radosgw-admin successfully was with Firefly or Giant. I switched to CentOS after that.

Any chance this could be fixed in future versions of radosgw-admin?

Thanks!

Actions #1

Updated by Samuel Martin Moro almost 2 years ago

Side note: this isn't just radosgw-admin. radosgw itself is also affected.

My radosgw daemon is already running in a Kubernetes container.
Just checked: I get the same error when running radosgw commands from my Ceph nodes:

root@mon1:~# radosgw --help
Illegal instruction

Actions #2

Updated by Nico Schottelius 6 months ago

I have just upgraded a cluster to 17.2.6, and radosgw and radosgw-admin are also crashing for me:

[rook@rook-ceph-tools-76f9674d97-pf4z4 /]$ radosgw-admin zonegroup list
Illegal instruction
[rook@rook-ceph-tools-76f9674d97-pf4z4 /]$ 

And radosgw fails with:

debug 2023-10-15T22:02:48.626+0000 7fdefa862740  0 rgw main: ERROR: could not find zonegroup (place5)
debug 2023-10-15T22:02:48.626+0000 7fdefa862740  0 rgw main: ERROR: failed to start notify service ((2) No such file or directory
debug 2023-10-15T22:02:48.626+0000 7fdefa862740  0 rgw main: ERROR: failed to init services (ret=(2) No such file or directory)
debug 2023-10-15T22:02:48.633+0000 7fdefa862740 -1 Couldn't init storage provider (RADOS)

The CPUs in this cluster are mixed. Some of them are:

processor    : 23
vendor_id    : GenuineIntel
cpu family    : 6
model        : 44
model name    : Intel(R) Xeon(R) CPU           L5640  @ 2.27GHz
stepping    : 2
microcode    : 0x13
cpu MHz        : 1600.000
cache size    : 12288 KB
physical id    : 1
siblings    : 12
core id        : 10
cpu cores    : 6
apicid        : 53
initial apicid    : 53
fpu        : yes
fpu_exception    : yes
cpuid level    : 11
wp        : yes
flags        : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 popcnt aes lahf_lm epb pti tpr_shadow vnmi flexpriority ept vpid dtherm ida arat
vmx flags    : vnmi preemption_timer invvpid ept_x_only ept_1gb flexpriority tsc_offset vtpr mtf vapic ept vpid unrestricted_guest ple
bugs        : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit mmio_unknown
bogomips    : 4535.70
clflush size    : 64
cache_alignment    : 64
address sizes    : 40 bits physical, 48 bits virtual
power management:

While others are:

processor    : 47
vendor_id    : AuthenticAMD
cpu family    : 16
model        : 9
model name    : AMD Opteron(tm) Processor 6172
stepping    : 1
microcode    : 0x10000d9
cpu MHz        : 2100.195
cache size    : 512 KB
physical id    : 1
siblings    : 12
core id        : 5
cpu cores    : 12
apicid        : 27
initial apicid    : 27
fpu        : yes
fpu_exception    : yes
cpuid level    : 5
wp        : yes
flags        : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid amd_dcm pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt nodeid_msr hw_pstate vmmcall npt lbrv svm_lock nrip_save pausefilter
bugs        : tlb_mmatch fxsave_leak sysret_ss_attrs null_seg spectre_v1 spectre_v2
bogomips    : 4202.51
TLB size    : 1024 4K pages
clflush size    : 64
cache_alignment    : 64
address sizes    : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate
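
Since the crash is reported as an illegal instruction, one plausible but unconfirmed explanation is that the binaries use an instruction-set extension that the Opteron lacks. The Xeon L5640 and Opteron 6172 flag lists above can be compared mechanically. The Python sketch below is only an illustration: the helper name `missing_isa_flags` and the hand-picked `ISA_FLAGS` set are assumptions, and the output only narrows down candidates, it does not show which instruction actually faults.

```python
# Flag lines copied from the two /proc/cpuinfo excerpts above.
XEON_L5640 = (
    "fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 "
    "clflush dts acpi mmx fxsr sse sse2 ht tm pbe syscall nx pdpe1gb rdtscp lm "
    "constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc "
    "cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 "
    "cx16 xtpr pdcm pcid dca sse4_1 sse4_2 popcnt aes lahf_lm epb pti "
    "tpr_shadow vnmi flexpriority ept vpid dtherm ida arat"
)
OPTERON_6172 = (
    "fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 "
    "clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm "
    "3dnowext 3dnow constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid "
    "amd_dcm pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy "
    "abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt nodeid_msr "
    "hw_pstate vmmcall npt lbrv svm_lock nrip_save pausefilter"
)

# Assumption: a hand-picked, non-exhaustive set of /proc/cpuinfo names for
# instruction-set extensions that user-space code can be compiled to use.
ISA_FLAGS = {
    "sse", "sse2", "pni", "ssse3", "sse4_1", "sse4_2", "sse4a", "popcnt",
    "pclmulqdq", "aes", "avx", "avx2", "fma", "bmi1", "bmi2", "abm", "f16c",
    "movbe",
}


def missing_isa_flags(reference, candidate):
    """List ISA extensions that `reference` advertises and `candidate` does not."""
    ref = set(reference.split())
    cand = set(candidate.split())
    return sorted((ref - cand) & ISA_FLAGS)


if __name__ == "__main__":
    print(missing_isa_flags(XEON_L5640, OPTERON_6172))
    # prints ['aes', 'pclmulqdq', 'sse4_1', 'sse4_2', 'ssse3']
```

Swapping the arguments lists the extensions only the Opteron has (`abm` and `sse4a` here), which are not relevant to a crash on the Opteron itself.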

Which CPU features does Ceph need in order to run?

Actions #3

Updated by Nico Schottelius 6 months ago

Checking dmesg on the affected host, which in this case runs on the Opteron CPU:

[34219.930946] traps: radosgw[6812] trap invalid opcode ip:7f6035a720ec sp:7ffedc2424e0 error:0 in libradosgw.so.2.0.0[7f6035968000+e7c000]
[34520.894475] traps: radosgw[12613] trap invalid opcode ip:7fc6bdcde0ec sp:7fff90e82bf0 error:0 in libradosgw.so.2.0.0[7fc6bdbd4000+e7c000]
[34828.902790] traps: radosgw[18501] trap invalid opcode ip:7f4bb60070ec sp:7ffdc487bb10 error:0 in libradosgw.so.2.0.0[7f4bb5efd000+e7c000]
[35132.894687] traps: radosgw[24296] trap invalid opcode ip:7f054cb9b0ec sp:7ffd5247a790 error:0 in libradosgw.so.2.0.0[7f054ca91000+e7c000]
[35435.897653] traps: radosgw[29801] trap invalid opcode ip:7f5e610490ec sp:7ffebcbb19e0 error:0 in libradosgw.so.2.0.0[7f5e60f3f000+e7c000]

This is based on the quay.io images running version 17.2.6.
