Bug #61748
open
[crimson] Restart of OSD service removed all the data from the cluster
% Done: 0%
Source:
Tags: crimson
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
Observed on a Reef-based Crimson cluster built with this image: https://shaman.ceph.com/builds/ceph/main/ff8144fac0bdb12d803d6c3905e68584dd10bb19/crimson/347418/
Created multiple replicated pools and wrote data using rados bench.
After restarting the OSD service with 'ceph orch restart <osd.service>', all existing data in every pool was cleared: each test pool dropped to 0 B stored and 0 objects.
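The steps above can be sketched as an ordered command list. This is only an illustration of the sequence, not the exact invocation used for this report: the helper name `repro_commands`, the bench duration, the PG count, and the `--no-cleanup` flag are assumptions, since the original pool and `rados bench` options were not recorded here.

```python
import shlex


def repro_commands(pools, service, seconds=60, pg_num=32):
    """Return the reproduction commands in order (hypothetical helper).

    seconds and pg_num are assumed values; the report does not state them.
    """
    cmds = []
    for pool in pools:
        # Create a replicated pool with the given number of placement groups.
        cmds.append(["ceph", "osd", "pool", "create", pool, str(pg_num)])
        # Write benchmark objects; --no-cleanup keeps them after the run
        # so that their survival across the restart can be checked.
        cmds.append(["rados", "bench", "-p", pool, str(seconds),
                     "write", "--no-cleanup"])
    # Capture pool usage before and after restarting the OSD service.
    cmds.append(["ceph", "df", "detail"])
    cmds.append(["ceph", "orch", "restart", service])
    cmds.append(["ceph", "df", "detail"])
    return cmds


if __name__ == "__main__":
    for cmd in repro_commands(["test_bench"], "osd.all-available-devices"):
        print(shlex.join(cmd))
```

The script only prints the commands; running them against a cluster is left to the reader, and the second 'ceph df detail' should be taken once all OSDs are back up.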
# ceph df detail
2023-06-21T02:39:32.748+0000 7f806c7b4700 -1 WARNING: the following dangerous and experimental features are enabled: crimson
2023-06-21T02:39:32.748+0000 7f806c7b4700 -1 WARNING: the following dangerous and experimental features are enabled: crimson
--- RAW STORAGE ---
CLASS  SIZE     AVAIL    USED     RAW USED  %RAW USED
TOTAL  1.3 TiB  1.3 TiB  3.2 GiB  3.2 GiB   0.24
--- POOLS ---
POOL               ID  PGS  STORED   (DATA)   (OMAP)  OBJECTS  USED     (DATA)   (OMAP)  %USED  MAX AVAIL  QUOTA OBJECTS  QUOTA BYTES  DIRTY  USED COMPR  UNDER COMPR
.mgr               1   1    0 B      0 B      0 B     0        0 B      0 B      0 B     0      426 GiB    N/A            N/A          N/A    0 B         0 B
test_bench         2   32   38 MiB   38 MiB   0 B     9.66k    38 MiB   38 MiB   0 B     0      426 GiB    N/A            N/A          N/A    0 B         0 B
test_bench_objs    3   32   47 MiB   47 MiB   0 B     11.97k   47 MiB   47 MiB   0 B     0      426 GiB    N/A            N/A          N/A    0 B         0 B
test_bench_objs_2  4   32   7.8 MiB  7.8 MiB  0 B     1.99k    7.8 MiB  7.8 MiB  0 B     0      426 GiB    N/A            N/A          N/A    0 B         0 B
test_omap          5   32   0 B      0 B      0 B     98       0 B      0 B      0 B     0      426 GiB    N/A            N/A          N/A    0 B         0 B
[ceph: root@dell-r640-039 /]# ceph orch ls
2023-06-21T04:25:29.722+0000 7f18b05d3700 -1 WARNING: the following dangerous and experimental features are enabled: crimson
2023-06-21T04:25:29.723+0000 7f18b05d3700 -1 WARNING: the following dangerous and experimental features are enabled: crimson
NAME                       PORTS        RUNNING  REFRESHED  AGE  PLACEMENT
alertmanager               ?:9093,9094  1/1      5m ago     3h   count:1
ceph-exporter                           3/3      5m ago     3h   *
grafana                    ?:3000       1/1      5m ago     3h   count:1
mgr                                     2/2      5m ago     3h   label:mgr
mon                                     3/3      5m ago     3h   label:mon
node-exporter              ?:9100       3/3      5m ago     3h   *
osd.all-available-devices               9        5m ago     3h   *
prometheus                 ?:9095       1/1      5m ago     3h   count:1
[ceph: root@dell-r640-039 /]# ceph orch restart osd.all-available-devices
2023-06-21T04:25:48.259+0000 7fde377df700 -1 WARNING: the following dangerous and experimental features are enabled: crimson
2023-06-21T04:25:48.260+0000 7fde377df700 -1 WARNING: the following dangerous and experimental features are enabled: crimson
Scheduled to restart osd.2 on host 'dell-r640-039.dsal.lab.eng.rdu2.redhat.com'
Scheduled to restart osd.5 on host 'dell-r640-039.dsal.lab.eng.rdu2.redhat.com'
Scheduled to restart osd.8 on host 'dell-r640-039.dsal.lab.eng.rdu2.redhat.com'
Scheduled to restart osd.1 on host 'dell-r640-073.dsal.lab.eng.rdu2.redhat.com'
Scheduled to restart osd.4 on host 'dell-r640-073.dsal.lab.eng.rdu2.redhat.com'
Scheduled to restart osd.7 on host 'dell-r640-073.dsal.lab.eng.rdu2.redhat.com'
Scheduled to restart osd.0 on host 'dell-r640-069.dsal.lab.eng.rdu2.redhat.com'
Scheduled to restart osd.3 on host 'dell-r640-069.dsal.lab.eng.rdu2.redhat.com'
Scheduled to restart osd.6 on host 'dell-r640-069.dsal.lab.eng.rdu2.redhat.com'
[ceph: root@dell-r640-039 /]# ceph df detail
2023-06-21T04:42:27.478+0000 7fc13876f700 -1 WARNING: the following dangerous and experimental features are enabled: crimson
2023-06-21T04:42:27.479+0000 7fc13876f700 -1 WARNING: the following dangerous and experimental features are enabled: crimson
--- RAW STORAGE ---
CLASS  SIZE     AVAIL    USED     RAW USED  %RAW USED
TOTAL  1.3 TiB  1.3 TiB  1.5 GiB  1.5 GiB   0.11
--- POOLS ---
POOL               ID  PGS  STORED   (DATA)   (OMAP)  OBJECTS  USED     (DATA)   (OMAP)  %USED  MAX AVAIL  QUOTA OBJECTS  QUOTA BYTES  DIRTY  USED COMPR  UNDER COMPR
.mgr               1   1    449 KiB  449 KiB  0 B     2        449 KiB  449 KiB  0 B     0      427 GiB    N/A            N/A          N/A    0 B         0 B
test_bench         2   32   0 B      0 B      0 B     0        0 B      0 B      0 B     0      427 GiB    N/A            N/A          N/A    0 B         0 B
test_bench_objs    3   32   0 B      0 B      0 B     0        0 B      0 B      0 B     0      427 GiB    N/A            N/A          N/A    0 B         0 B
test_bench_objs_2  4   32   0 B      0 B      0 B     0        0 B      0 B      0 B     0      427 GiB    N/A            N/A          N/A    0 B         0 B
test_omap          5   32   0 B      0 B      0 B     0        0 B      0 B      0 B     0      427 GiB    N/A            N/A          N/A    0 B         0 B
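The before/after comparison can also be done mechanically. The following is a minimal sketch, assuming the snapshots were captured with 'ceph df detail --format json' and that each pool entry exposes its object count under `pools[].stats.objects`; the function names are hypothetical, and the sample snapshots carry only the two exact object counts visible in the tables above (.mgr and test_omap).

```python
def object_counts(df_json):
    """Map pool name -> object count from a parsed `ceph df` JSON document.

    Assumes the layout {"pools": [{"name": ..., "stats": {"objects": ...}}]}.
    """
    return {p["name"]: p["stats"]["objects"] for p in df_json["pools"]}


def pools_that_lost_objects(before, after):
    """Return {pool: (objects_before, objects_after)} for pools that shrank."""
    b, a = object_counts(before), object_counts(after)
    return {
        name: (count, a.get(name, 0))
        for name, count in b.items()
        if a.get(name, 0) < count
    }


# Sample snapshots using the exact object counts shown in the report.
before = {"pools": [
    {"name": ".mgr", "stats": {"objects": 0}},
    {"name": "test_omap", "stats": {"objects": 98}},
]}
after = {"pools": [
    {"name": ".mgr", "stats": {"objects": 2}},
    {"name": "test_omap", "stats": {"objects": 0}},
]}

if __name__ == "__main__":
    for pool, (old, new) in pools_that_lost_objects(before, after).items():
        print(f"{pool}: {old} -> {new} objects")
```

With real snapshots, the two dictionaries would come from `json.loads` on the captured command output; a pool missing from the second snapshot is treated as having 0 objects.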
Cluster config -
# ceph config dump
2023-06-21T04:48:15.559+0000 7fa5e0a3c700 -1 WARNING: the following dangerous and experimental features are enabled: crimson
2023-06-21T04:48:15.559+0000 7fa5e0a3c700 -1 WARNING: the following dangerous and experimental features are enabled: crimson
WHO MASK LEVEL OPTION VALUE RO
global basic container_image quay.ceph.io/ceph-ci/ceph@sha256:87239ee25da7bd08962ffe94a73ff5429fa4391108bad2efed02617933edded1 *
global advanced enable_experimental_unrecoverable_data_corrupting_features crimson
global basic log_to_file true
global advanced mon_cluster_log_to_file true
global advanced osd_pool_default_pg_autoscale_mode off
mon advanced auth_allow_insecure_global_id_reclaim false
mon advanced cluster_network 10.1.240.0/24 *
mon advanced osd_pool_default_crimson true
mon advanced public_network 10.1.240.0/23 *
mgr advanced mgr/cephadm/container_init True *
mgr advanced mgr/cephadm/migration_current 6 *
mgr advanced mgr/dashboard/ALERTMANAGER_API_HOST http://dell-r640-039.dsal.lab.eng.rdu2.redhat.com:9093 *
mgr advanced mgr/dashboard/GRAFANA_API_SSL_VERIFY false *
mgr advanced mgr/dashboard/GRAFANA_API_URL https://dell-r640-039.dsal.lab.eng.rdu2.redhat.com:3000 *
mgr advanced mgr/dashboard/PROMETHEUS_API_HOST http://dell-r640-039.dsal.lab.eng.rdu2.redhat.com:9095 *
mgr advanced mgr/dashboard/ssl_server_port 8443 *
mgr advanced mgr/orchestrator/orchestrator cephadm
osd host:dell-r640-039 basic osd_memory_target 43343119974
osd host:dell-r640-069 basic osd_memory_target 45848560571
osd host:dell-r640-073 basic osd_memory_target 44416904806
osd advanced osd_memory_target_autotune true
Ceph Cluster status -
# ceph status
2023-06-21T04:49:18.793+0000 7fb8886cd700 -1 WARNING: the following dangerous and experimental features are enabled: crimson
2023-06-21T04:49:18.793+0000 7fb8886cd700 -1 WARNING: the following dangerous and experimental features are enabled: crimson
  cluster:
    id:     203a849c-0fcb-11ee-918b-78ac443b3604
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum dell-r640-039,dell-r640-073,dell-r640-069 (age 4h)
    mgr: dell-r640-039.fzgmxo(active, since 4h), standbys: dell-r640-073.zrjkat
    osd: 9 osds: 9 up (since 19m), 9 in (since 4h)

  data:
    pools:   5 pools, 129 pgs
    objects: 2 objects, 449 KiB
    usage:   1.5 GiB used, 1.3 TiB / 1.3 TiB avail
    pgs:     129 active+clean
Cluster version
# cephadm shell -- ceph version
Inferring fsid 203a849c-0fcb-11ee-918b-78ac443b3604
Inferring config /var/lib/ceph/203a849c-0fcb-11ee-918b-78ac443b3604/mon.dell-r640-039/config
Using ceph image with id 'e50e16176f87' and tag 'ff8144fac0bdb12d803d6c3905e68584dd10bb19-crimson' created on 2023-06-20 23:00:38 +0000 UTC
quay.ceph.io/ceph-ci/ceph@sha256:87239ee25da7bd08962ffe94a73ff5429fa4391108bad2efed02617933edded1
2023-06-21T02:19:02.294+0000 7f15d7a4f700 -1 WARNING: the following dangerous and experimental features are enabled: crimson
2023-06-21T02:19:02.295+0000 7f15d7a4f700 -1 WARNING: the following dangerous and experimental features are enabled: crimson
ceph version 18.0.0-4505-gff8144fa (ff8144fac0bdb12d803d6c3905e68584dd10bb19) reef (dev)
Cluster logs - http://magna002.ceph.redhat.com/ceph-qe-logs/harsh/crimson_osd_restart/