Bug #55859

Radosgw-admin: illegal instruction, running on commodity hardware

Added by Samuel Martin Moro almost 2 years ago. Updated 13 days ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Community (user)
Tags: radosgw-admin illegal instruction opteron quincy
Backport:
Regression: Yes
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi,

I recently re-deployed a cluster from scratch. It used to run Octopus on top of CentOS 8; a couple of weeks ago I switched to Debian 11 and Quincy.

Previously, I could run "radosgw-admin" commands from my Ceph nodes. Since re-deploying, this is no longer possible.
Everything else works just fine (rbd, cephfs, mgrs, dashboard, ...).

root@mon1:~# radosgw-admin user create --uid=prometheus-xp --display-name="Prometheus s3 exporter" --email=monitoring@prometheus
Illegal instruction
root@mon1:~# radosgw-admin
Illegal instruction

This definitely looks CPU-related ...
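For what it's worth, here is one way the exact faulting instruction could be captured, assuming gdb can be installed on the node (just a sketch, I haven't pasted the output):

# On SIGILL, gdb stops the process, and the second command prints the trapping instruction:
gdb -batch -ex run -ex 'x/i $pc' --args radosgw-admin --help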

Note that I'm running on relatively old hardware: some ProLiant MicroServer; I don't recall which gen, and it doesn't show in dmidecode (4/5/6?)... let's go with "old".

root@mon1:~# dmesg
...
[676261.074960] traps: radosgw-admin[33623] trap invalid opcode ip:55ad8f7b1963 sp:7ffdb7f19d10 error:0 in radosgw-admin[55ad8f69c000+cda000]
[676273.935910] traps: radosgw-admin[33625] trap invalid opcode ip:55d6106b1963 sp:7ffdbef995f0 error:0 in radosgw-admin[55d61059c000+cda000]
[676724.158210] traps: radosgw-admin[33873] trap invalid opcode ip:557c6a8a8963 sp:7ffe7e1e6f90 error:0 in radosgw-admin[557c6a793000+cda000]
[676727.997861] traps: radosgw-admin[33874] trap invalid opcode ip:563924285963 sp:7ffd3fe44900 error:0 in radosgw-admin[563924170000+cda000]
[676731.330261] traps: radosgw-admin[33875] trap invalid opcode ip:55cf22fc2963 sp:7fff3dc7ec30 error:0 in radosgw-admin[55cf22ead000+cda000]
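The same instruction can also be located statically from those trap lines: ip minus the mapping base gives the offset into the binary (0x55ad8f7b1963 - 0x55ad8f69c000 = 0x115963 for the first one). A sketch, assuming the usual /usr/bin/radosgw-admin path and a PIE linked at base 0, so the file's virtual addresses line up with that offset:

# Compute the offset of the faulting instruction inside the binary:
python3 -c 'print(hex(0x55ad8f7b1963 - 0x55ad8f69c000))'    # -> 0x115963
# Disassemble a few bytes around that offset:
objdump -d /usr/bin/radosgw-admin --start-address=0x115950 --stop-address=0x115990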

root@mon1:~# cat /proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 16
model : 2
model name : AMD Opteron 23xx (Gen 3 Class Opteron)
stepping : 3
microcode : 0x1000065
cpu MHz : 2196.340
cache size : 512 KB
physical id : 0
siblings : 1
core id : 0
cpu cores : 1
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 6
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow rep_good nopl cpuid extd_apicid tsc_known_freq pni cx16 x2apic popcnt tsc_deadline_timer hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw vmmcall arat
bugs : tlb_mmatch fxsave_leak sysret_ss_attrs null_seg spectre_v1 spectre_v2
bogomips : 4392.68
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management:

processor : 1
vendor_id : AuthenticAMD
cpu family : 16
model : 2
model name : AMD Opteron 23xx (Gen 3 Class Opteron)
stepping : 3
microcode : 0x1000065
cpu MHz : 2196.340
cache size : 512 KB
physical id : 1
siblings : 1
core id : 0
cpu cores : 1
apicid : 1
initial apicid : 1
fpu : yes
fpu_exception : yes
cpuid level : 6
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow rep_good nopl cpuid extd_apicid tsc_known_freq pni cx16 x2apic popcnt tsc_deadline_timer hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw vmmcall arat
bugs : tlb_mmatch fxsave_leak sysret_ss_attrs null_seg spectre_v1 spectre_v2
bogomips : 4392.68
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management:
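Note the flags line: sse2 and sse4a are there, but none of ssse3 / sse4_1 / sse4_2 / avx. A quick sketch to list what this CPU is missing:

# Check /proc/cpuinfo for the extensions a Gen 3 Opteron lacks:
for f in ssse3 sse4_1 sse4_2 avx avx2; do
    grep -qw "$f" /proc/cpuinfo && echo "$f: present" || echo "$f: missing"
done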

root@mon1:~# dpkg -l|grep ceph
ii ceph 17.2.0-1~bpo11+1 amd64 distributed storage and file system
ii ceph-base 17.2.0-1~bpo11+1 amd64 common ceph daemon libraries and management tools
ii ceph-common 17.2.0-1~bpo11+1 amd64 common utilities to mount and interact with a ceph storage cluster
ii ceph-fuse 17.2.0-1~bpo11+1 amd64 FUSE-based client for the Ceph distributed file system
ii ceph-mds 17.2.0-1~bpo11+1 amd64 metadata server for the ceph distributed file system
ii ceph-mgr 17.2.0-1~bpo11+1 amd64 manager for the ceph distributed storage system
ii ceph-mgr-cephadm 17.2.0-1~bpo11+1 all cephadm orchestrator module for ceph-mgr
ii ceph-mgr-dashboard 17.2.0-1~bpo11+1 all dashboard module for ceph-mgr
ii ceph-mgr-diskprediction-local 17.2.0-1~bpo11+1 all diskprediction-local module for ceph-mgr
ii ceph-mgr-k8sevents 17.2.0-1~bpo11+1 all kubernetes events module for ceph-mgr
ii ceph-mgr-modules-core 17.2.0-1~bpo11+1 all ceph manager modules which are always enabled
ii ceph-mon 17.2.0-1~bpo11+1 amd64 monitor server for the ceph storage system
ii ceph-osd 17.2.0-1~bpo11+1 amd64 OSD server for the ceph storage system
ii ceph-volume 17.2.0-1~bpo11+1 all tool to facilidate OSD deployment
ii cephadm 17.2.0-1~bpo11+1 amd64 cephadm utility to bootstrap ceph daemons with systemd and containers
ii libcephfs2 17.2.0-1~bpo11+1 amd64 Ceph distributed file system client library
ii libsqlite3-mod-ceph 17.2.0-1~bpo11+1 amd64 SQLite3 VFS for Ceph
ii python3-ceph-argparse 17.2.0-1~bpo11+1 all Python 3 utility libraries for Ceph CLI
ii python3-ceph-common 17.2.0-1~bpo11+1 all Python 3 utility libraries for Ceph
ii python3-cephfs 17.2.0-1~bpo11+1 amd64 Python 3 libraries for the Ceph libcephfs library

And I was able to confirm it from Kubernetes (still running on old hardware, ... an Intel(R) Xeon(R) CPU X5650):
I built an image with the same Debian/Ceph versions and started a Pod,
and radosgw-admin commands work just fine there. I was able to create my users, set permissions, ...

Whereas when I want to manage S3 from my Ceph nodes, it crashes; I can't even get a help message out of radosgw-admin.
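To rule out my host install, the same image could presumably be run on the Opteron node itself. A rough sketch, with the upstream download.ceph.com Quincy repo standing in for wherever the ~bpo11+1 builds actually came from, and assuming docker is available and radosgw-admin ships in ceph-common as on my nodes:

docker run --rm debian:11 bash -c '
  apt-get update && apt-get install -y curl gnupg ca-certificates &&
  curl -fsSL https://download.ceph.com/keys/release.asc | gpg --dearmor -o /usr/share/keyrings/ceph.gpg &&
  echo "deb [signed-by=/usr/share/keyrings/ceph.gpg] https://download.ceph.com/debian-quincy bullseye main" > /etc/apt/sources.list.d/ceph.list &&
  apt-get update && apt-get install -y ceph-common &&
  radosgw-admin --version'

If that still dies with Illegal instruction inside the container, the OS install is off the hook and it really is the binary vs. the CPU.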

Something "broke" / compatibility was lost between Octopus and Quincy for such an old CPU.
It only affects radosgw-admin, though, as far as I could see.
It could be some compiler option / optimization (?).
It could be specific to the Debian packaging; I'm not sure I would have had this issue sticking with some CentOS 8 derivative ...
Assuming it's Debian-specific, the regression may have been introduced even before Octopus: the last time I had those nodes running Debian/Ceph/radosgw-admin successfully was with Firefly or Giant, and I switched to CentOS then ...
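One rough way to test the compiler-option theory: scan the shipped binary for instructions this CPU lacks (pshufb is SSSE3, crc32 is SSE4.2, the v-prefixed ones are AVX). A nonzero count is only a hint, since such code is often behind runtime CPU dispatch and the mnemonics can also appear inside symbol names, but combined with the disassembly at the trap offset it would show whether they were emitted unconditionally:

# Count lines mentioning post-K10 instructions in the binary:
objdump -d /usr/bin/radosgw-admin | grep -cE '\b(pshufb|crc32|vpxor|vmovdqa)'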

Any chance this could be fixed in future versions of radosgw-admin?

Thanks!
