Bug #61748

open

[crimson] Restart of OSD service removed all the data from the cluster

Added by Harsh Kumar 11 months ago. Updated 25 days ago.

Status: New
Priority: Normal
Category: -
Target version: -
% Done: 0%
Source:
Tags: crimson
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Observed on a Reef-based Crimson cluster built from this image: https://shaman.ceph.com/builds/ceph/main/ff8144fac0bdb12d803d6c3905e68584dd10bb19/crimson/347418/

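Created multiple replicated pools and wrote data to them using rados bench.
Upon restarting the OSD service with 'ceph orch restart <osd.service>', all existing data was cleared from every pool. The pools are still present, but each test pool now reports 0 objects and 0 B stored. The cluster nevertheless stays HEALTH_OK with all PGs active+clean.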
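The reproduction steps above can be sketched as a Python dry run that only prints the command sequence. This is an illustration under stated assumptions, not the reporter's exact procedure. The pool names and PG count are taken from the `ceph df detail` output in this report, and the 60-second bench duration is a guess. `test_omap` is left out because the report does not say how it was populated. `repro_commands` and `VERIFY_CMD` are hypothetical names.

```python
import shlex

# Pool names and the PG count come from the `ceph df detail` output in this
# report; the 60-second bench duration is an assumption (not stated there).
POOLS = ["test_bench", "test_bench_objs", "test_bench_objs_2"]
PG_NUM = 32
BENCH_SECONDS = 60
OSD_SERVICE = "osd.all-available-devices"

# Run this only after `ceph status` shows every OSD up again, because
# `ceph orch restart` merely schedules the restarts.
VERIFY_CMD = ["ceph", "df", "detail"]


def repro_commands(pools=POOLS, pg_num=PG_NUM, seconds=BENCH_SECONDS,
                   osd_service=OSD_SERVICE):
    """Build the command sequence described in this report, up to the restart."""
    cmds = []
    for pool in pools:
        # Replicated pool; with osd_pool_default_crimson=true in the mon config
        # it is expected to be created as a Crimson pool.
        cmds.append(["ceph", "osd", "pool", "create", pool,
                     str(pg_num), str(pg_num), "replicated"])
        # --no-cleanup leaves the benchmark objects in the pool afterwards.
        cmds.append(["rados", "bench", "-p", pool, str(seconds), "write",
                     "--no-cleanup"])
    cmds.append(["ceph", "df", "detail"])  # baseline before the restart
    cmds.append(["ceph", "orch", "restart", osd_service])
    return cmds


if __name__ == "__main__":
    # Dry run: print the commands for review instead of executing them.
    for cmd in repro_commands() + [VERIFY_CMD]:
        print(shlex.join(cmd))
```

Keeping the verification command separate from the restart reflects the behavior seen in the output below. `ceph orch restart` returns after scheduling the restarts, and the post-restart check was taken about 17 minutes later.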

Pool usage before the restart:

# ceph df detail
2023-06-21T02:39:32.748+0000 7f806c7b4700 -1 WARNING: the following dangerous and experimental features are enabled: crimson
2023-06-21T02:39:32.748+0000 7f806c7b4700 -1 WARNING: the following dangerous and experimental features are enabled: crimson
--- RAW STORAGE ---
CLASS     SIZE    AVAIL     USED  RAW USED  %RAW USED
TOTAL  1.3 TiB  1.3 TiB  3.2 GiB   3.2 GiB       0.24

--- POOLS ---
POOL               ID  PGS   STORED   (DATA)  (OMAP)  OBJECTS     USED   (DATA)  (OMAP)  %USED  MAX AVAIL  QUOTA OBJECTS  QUOTA BYTES  DIRTY  USED COMPR  UNDER COMPR
.mgr                1    1      0 B      0 B     0 B        0      0 B      0 B     0 B      0    426 GiB            N/A          N/A    N/A         0 B          0 B
test_bench          2   32   38 MiB   38 MiB     0 B    9.66k   38 MiB   38 MiB     0 B      0    426 GiB            N/A          N/A    N/A         0 B          0 B
test_bench_objs     3   32   47 MiB   47 MiB     0 B   11.97k   47 MiB   47 MiB     0 B      0    426 GiB            N/A          N/A    N/A         0 B          0 B
test_bench_objs_2   4   32  7.8 MiB  7.8 MiB     0 B    1.99k  7.8 MiB  7.8 MiB     0 B      0    426 GiB            N/A          N/A    N/A         0 B          0 B
test_omap           5   32      0 B      0 B     0 B       98      0 B      0 B     0 B      0    426 GiB            N/A          N/A    N/A         0 B          0 B
[ceph: root@dell-r640-039 /]# ceph orch ls
2023-06-21T04:25:29.722+0000 7f18b05d3700 -1 WARNING: the following dangerous and experimental features are enabled: crimson
2023-06-21T04:25:29.723+0000 7f18b05d3700 -1 WARNING: the following dangerous and experimental features are enabled: crimson
NAME                       PORTS        RUNNING  REFRESHED  AGE  PLACEMENT  
alertmanager               ?:9093,9094      1/1  5m ago     3h   count:1    
ceph-exporter                               3/3  5m ago     3h   *          
grafana                    ?:3000           1/1  5m ago     3h   count:1    
mgr                                         2/2  5m ago     3h   label:mgr  
mon                                         3/3  5m ago     3h   label:mon  
node-exporter              ?:9100           3/3  5m ago     3h   *          
osd.all-available-devices                     9  5m ago     3h   *          
prometheus                 ?:9095           1/1  5m ago     3h   count:1    
[ceph: root@dell-r640-039 /]# ceph orch restart osd.all-available-devices
2023-06-21T04:25:48.259+0000 7fde377df700 -1 WARNING: the following dangerous and experimental features are enabled: crimson
2023-06-21T04:25:48.260+0000 7fde377df700 -1 WARNING: the following dangerous and experimental features are enabled: crimson
Scheduled to restart osd.2 on host 'dell-r640-039.dsal.lab.eng.rdu2.redhat.com'
Scheduled to restart osd.5 on host 'dell-r640-039.dsal.lab.eng.rdu2.redhat.com'
Scheduled to restart osd.8 on host 'dell-r640-039.dsal.lab.eng.rdu2.redhat.com'
Scheduled to restart osd.1 on host 'dell-r640-073.dsal.lab.eng.rdu2.redhat.com'
Scheduled to restart osd.4 on host 'dell-r640-073.dsal.lab.eng.rdu2.redhat.com'
Scheduled to restart osd.7 on host 'dell-r640-073.dsal.lab.eng.rdu2.redhat.com'
Scheduled to restart osd.0 on host 'dell-r640-069.dsal.lab.eng.rdu2.redhat.com'
Scheduled to restart osd.3 on host 'dell-r640-069.dsal.lab.eng.rdu2.redhat.com'
Scheduled to restart osd.6 on host 'dell-r640-069.dsal.lab.eng.rdu2.redhat.com'
[ceph: root@dell-r640-039 /]# ceph df detail
2023-06-21T04:42:27.478+0000 7fc13876f700 -1 WARNING: the following dangerous and experimental features are enabled: crimson
2023-06-21T04:42:27.479+0000 7fc13876f700 -1 WARNING: the following dangerous and experimental features are enabled: crimson
--- RAW STORAGE ---
CLASS     SIZE    AVAIL     USED  RAW USED  %RAW USED
TOTAL  1.3 TiB  1.3 TiB  1.5 GiB   1.5 GiB       0.11

--- POOLS ---
POOL               ID  PGS   STORED   (DATA)  (OMAP)  OBJECTS     USED   (DATA)  (OMAP)  %USED  MAX AVAIL  QUOTA OBJECTS  QUOTA BYTES  DIRTY  USED COMPR  UNDER COMPR
.mgr                1    1  449 KiB  449 KiB     0 B        2  449 KiB  449 KiB     0 B      0    427 GiB            N/A          N/A    N/A         0 B          0 B
test_bench          2   32      0 B      0 B     0 B        0      0 B      0 B     0 B      0    427 GiB            N/A          N/A    N/A         0 B          0 B
test_bench_objs     3   32      0 B      0 B     0 B        0      0 B      0 B     0 B      0    427 GiB            N/A          N/A    N/A         0 B          0 B
test_bench_objs_2   4   32      0 B      0 B     0 B        0      0 B      0 B     0 B      0    427 GiB            N/A          N/A    N/A         0 B          0 B
test_omap           5   32      0 B      0 B     0 B        0      0 B      0 B     0 B      0    427 GiB            N/A          N/A    N/A         0 B          0 B
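The two `ceph df detail` outputs above can be compared mechanically. The following is a minimal Python sketch, not part of Ceph; `parse_count`, `parse_pools` and `lost_objects` are hypothetical helpers. The parser assumes the plain-text POOLS layout shown in this report, where each size is printed as a value and a unit separated by a space, so OBJECTS is the tenth whitespace-separated field. The sample rows are copied from the outputs above, truncated after the USED column.

```python
import re

# Multipliers for the abbreviated object counts that `ceph df detail` prints
# (for example "9.66k"). Handling only k/M/G is an assumption of this sketch.
_COUNT_SUFFIX = {"": 1, "k": 10**3, "M": 10**6, "G": 10**9}


def parse_count(token: str) -> int:
    """Convert an abbreviated count such as '11.97k' or '98' to an int."""
    m = re.fullmatch(r"([0-9.]+)([kMG]?)", token)
    if m is None:
        raise ValueError(f"unrecognised object count: {token!r}")
    return round(float(m.group(1)) * _COUNT_SUFFIX[m.group(2)])


def parse_pools(df_text: str) -> dict[str, int]:
    """Return {pool_name: object_count} from the POOLS section of `ceph df detail`."""
    pools: dict[str, int] = {}
    in_pools = False
    for line in df_text.splitlines():
        if line.startswith("--- POOLS ---"):
            in_pools = True
            continue
        if not in_pools:
            continue
        if line.startswith("---"):  # another section begins
            break
        if not line.strip() or line.startswith("POOL "):  # blank or header row
            continue
        fields = line.split()
        # NAME, ID, PGS, then STORED, (DATA), (OMAP) as "<value> <unit>" pairs,
        # which makes OBJECTS the tenth field (index 9).
        pools[fields[0]] = parse_count(fields[9])
    return pools


def lost_objects(before: dict[str, int], after: dict[str, int]) -> dict[str, int]:
    """Map each pool whose object count dropped to the number of objects missing."""
    return {name: count - after.get(name, 0)
            for name, count in before.items()
            if after.get(name, 0) < count}


# Rows copied from this report, truncated after the USED column.
BEFORE_SAMPLE = """\
--- POOLS ---
POOL               ID  PGS   STORED   (DATA)  (OMAP)  OBJECTS     USED
.mgr                1    1      0 B      0 B     0 B        0      0 B
test_bench          2   32   38 MiB   38 MiB     0 B    9.66k   38 MiB
test_bench_objs     3   32   47 MiB   47 MiB     0 B   11.97k   47 MiB
test_bench_objs_2   4   32  7.8 MiB  7.8 MiB     0 B    1.99k  7.8 MiB
test_omap           5   32      0 B      0 B     0 B       98      0 B
"""

AFTER_SAMPLE = """\
--- POOLS ---
POOL               ID  PGS   STORED   (DATA)  (OMAP)  OBJECTS     USED
.mgr                1    1  449 KiB  449 KiB     0 B        2  449 KiB
test_bench          2   32      0 B      0 B     0 B        0      0 B
test_bench_objs     3   32      0 B      0 B     0 B        0      0 B
test_bench_objs_2   4   32      0 B      0 B     0 B        0      0 B
test_omap           5   32      0 B      0 B     0 B        0      0 B
"""

if __name__ == "__main__":
    missing = lost_objects(parse_pools(BEFORE_SAMPLE), parse_pools(AFTER_SAMPLE))
    for pool, count in missing.items():
        print(f"{pool}: {count} objects missing after restart")
```

On these samples, every test pool is flagged. `.mgr` is not flagged because its object count went from 0 to 2. Because the column layout of the plain-text table is not a stable interface, scripted checks are probably better served by `ceph df detail --format json`.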

Cluster configuration:

# ceph config dump
2023-06-21T04:48:15.559+0000 7fa5e0a3c700 -1 WARNING: the following dangerous and experimental features are enabled: crimson
2023-06-21T04:48:15.559+0000 7fa5e0a3c700 -1 WARNING: the following dangerous and experimental features are enabled: crimson
WHO     MASK                LEVEL     OPTION                                                      VALUE                                                                                              RO
global                      basic     container_image                                             quay.ceph.io/ceph-ci/ceph@sha256:87239ee25da7bd08962ffe94a73ff5429fa4391108bad2efed02617933edded1  * 
global                      advanced  enable_experimental_unrecoverable_data_corrupting_features  crimson                                                                                              
global                      basic     log_to_file                                                 true                                                                                                 
global                      advanced  mon_cluster_log_to_file                                     true                                                                                                 
global                      advanced  osd_pool_default_pg_autoscale_mode                          off                                                                                                  
mon                         advanced  auth_allow_insecure_global_id_reclaim                       false                                                                                                
mon                         advanced  cluster_network                                             10.1.240.0/24                                                                                      * 
mon                         advanced  osd_pool_default_crimson                                    true                                                                                                 
mon                         advanced  public_network                                              10.1.240.0/23                                                                                      * 
mgr                         advanced  mgr/cephadm/container_init                                  True                                                                                               * 
mgr                         advanced  mgr/cephadm/migration_current                               6                                                                                                  * 
mgr                         advanced  mgr/dashboard/ALERTMANAGER_API_HOST                         http://dell-r640-039.dsal.lab.eng.rdu2.redhat.com:9093                                             * 
mgr                         advanced  mgr/dashboard/GRAFANA_API_SSL_VERIFY                        false                                                                                              * 
mgr                         advanced  mgr/dashboard/GRAFANA_API_URL                               https://dell-r640-039.dsal.lab.eng.rdu2.redhat.com:3000                                            * 
mgr                         advanced  mgr/dashboard/PROMETHEUS_API_HOST                           http://dell-r640-039.dsal.lab.eng.rdu2.redhat.com:9095                                             * 
mgr                         advanced  mgr/dashboard/ssl_server_port                               8443                                                                                               * 
mgr                         advanced  mgr/orchestrator/orchestrator                               cephadm                                                                                              
osd     host:dell-r640-039  basic     osd_memory_target                                           43343119974                                                                                          
osd     host:dell-r640-069  basic     osd_memory_target                                           45848560571                                                                                          
osd     host:dell-r640-073  basic     osd_memory_target                                           44416904806                                                                                          
osd                         advanced  osd_memory_target_autotune                                  true 

Ceph cluster status:

# ceph status
2023-06-21T04:49:18.793+0000 7fb8886cd700 -1 WARNING: the following dangerous and experimental features are enabled: crimson
2023-06-21T04:49:18.793+0000 7fb8886cd700 -1 WARNING: the following dangerous and experimental features are enabled: crimson
  cluster:
    id:     203a849c-0fcb-11ee-918b-78ac443b3604
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum dell-r640-039,dell-r640-073,dell-r640-069 (age 4h)
    mgr: dell-r640-039.fzgmxo(active, since 4h), standbys: dell-r640-073.zrjkat
    osd: 9 osds: 9 up (since 19m), 9 in (since 4h)

  data:
    pools:   5 pools, 129 pgs
    objects: 2 objects, 449 KiB
    usage:   1.5 GiB used, 1.3 TiB / 1.3 TiB avail
    pgs:     129 active+clean

Cluster version:

# cephadm shell -- ceph version
Inferring fsid 203a849c-0fcb-11ee-918b-78ac443b3604
Inferring config /var/lib/ceph/203a849c-0fcb-11ee-918b-78ac443b3604/mon.dell-r640-039/config
Using ceph image with id 'e50e16176f87' and tag 'ff8144fac0bdb12d803d6c3905e68584dd10bb19-crimson' created on 2023-06-20 23:00:38 +0000 UTC
quay.ceph.io/ceph-ci/ceph@sha256:87239ee25da7bd08962ffe94a73ff5429fa4391108bad2efed02617933edded1
2023-06-21T02:19:02.294+0000 7f15d7a4f700 -1 WARNING: the following dangerous and experimental features are enabled: crimson
2023-06-21T02:19:02.295+0000 7f15d7a4f700 -1 WARNING: the following dangerous and experimental features are enabled: crimson
ceph version 18.0.0-4505-gff8144fa (ff8144fac0bdb12d803d6c3905e68584dd10bb19) reef (dev)

Cluster logs: http://magna002.ceph.redhat.com/ceph-qe-logs/harsh/crimson_osd_restart/

Actions #1

Updated by Aishwarya Mathuria 25 days ago

  • Assignee set to Aishwarya Mathuria