Bug #54456
Radosgw not available after restarts with error "Initialization timeout, failed to initialize" when OSD becomes full
Status:
New
Priority:
Normal
Assignee:
-
Category:
OSD
Target version:
-
% Done:
0%
Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
rgw
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
When an OSD becomes full, the radosgw pod continues to run in read-only mode until it is restarted. If a radosgw pod gets re-created/restarted during the OSD-full situation, it never boots up correctly.
How to reproduce it (minimal and precise)
- Create a CephCluster with 1 OSD and a CephObjectStore on Kubernetes.
- Pump data into the cluster via s3cmd until the OSD becomes full, i.e. reaches full_ratio.
- s3cmd operations become read-only until the radosgw pod is restarted.
- Delete the existing radosgw pod; the new radosgw pod never boots up and fails with the error "Initialization timeout, failed to initialize". A shell sketch of these steps follows below.
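Roughly the sequence used, as a minimal sketch (bucket name and payload file are placeholders; s3cmd is assumed to be already configured against the rgw endpoint with S3 credentials, and the pod label is the default one set by Rook):

s3cmd mb s3://testbucket
# keep writing until the single OSD crosses full_ratio; the loop exits once writes fail
while s3cmd put ./1gb.bin "s3://testbucket/obj-$RANDOM"; do :; done

# from the Rook toolbox: confirm the full condition
ceph health detail    # reports "1 full osd(s)" and full pools

# restart the gateway and watch the replacement pod never become Ready
kubectl -n rook-ceph delete pod -l app=rook-ceph-rgw
kubectl -n rook-ceph get pods -l app=rook-ceph-rgw -w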
CephCluster.yaml
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  cephVersion:
    image: ceph/ceph:v16.2.6
  cleanupPolicy:
    sanitizeDisks: {}
  crashCollector:
    disable: true
  dashboard:
    enabled: true
    ssl: true
  dataDirHostPath: /var/lib/rook
  disruptionManagement:
    osdMaintenanceTimeout: 30
  external: {}
  healthCheck:
    daemonHealth:
      mon:
        interval: 45s
        timeout: 600s
      osd:
        interval: 60s
      status: {}
    livenessProbe:
      mgr: {}
      mon: {}
      osd: {}
  logCollector: {}
  mgr:
    count: 1
    modules:
    - enabled: true
      name: pg_autoscaler
  mon:
    count: 1
    volumeClaimTemplate:
      metadata: {}
      spec:
        resources:
          requests:
            storage: 10Gi
        storageClassName: longhorn-backup-single-replica
      status: {}
  monitoring:
    enabled: false
    rulesNamespace: rook-ceph
  network: {}
  placement:
    mgr:
      podAntiAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
        - podAffinityTerm:
            labelSelector:
              matchLabels:
                app: rook-ceph-mgr
            topologyKey: kubernetes.io/hostname
          weight: 100
      topologySpreadConstraints:
      - labelSelector:
          matchLabels:
            app: rook-ceph-mgr
        maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: ScheduleAnyway
    mon:
      podAntiAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
        - podAffinityTerm:
            labelSelector:
              matchLabels:
                app: rook-ceph-mon
            topologyKey: kubernetes.io/hostname
          weight: 100
      topologySpreadConstraints:
      - labelSelector:
          matchLabels:
            app: rook-ceph-mon
        maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: ScheduleAnyway
  priorityClassNames:
    all: system-cluster-critical
  resources:
    crashcollector:
      requests:
        cpu: 50m
        memory: 100Mi
    mgr:
      requests:
        cpu: 100m
        memory: 512Mi
    mon:
      requests:
        cpu: 100m
        memory: 512Mi
    osd:
      requests:
        cpu: 100m
        memory: 1Gi
    prepareosd:
      requests:
        cpu: 50m
        memory: 100Mi
  security:
    kms: {}
  storage:
    storageClassDeviceSets:
    - count: 1
      name: set1
      placement:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: rook-ceph-osd
              topologyKey: kubernetes.io/hostname
            weight: 100
        topologySpreadConstraints:
        - labelSelector:
            matchLabels:
              app: rook-ceph-osd
          maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
      portable: true
      preparePlacement:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: rook-ceph-osd-prepare
              topologyKey: kubernetes.io/hostname
            weight: 100
        topologySpreadConstraints:
        - labelSelector:
            matchLabels:
              app: rook-ceph-osd-prepare
          maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
      resources:
        requests:
          cpu: 100m
          memory: 1Gi
      tuneDeviceClass: true
      volumeClaimTemplates:
      - metadata:
          name: data
        spec:
          accessModes:
          - ReadWriteOnce
          resources:
            requests:
              storage: 10Gi
          storageClassName: longhorn-backup-single-replica
          volumeMode: Block
        status: {}
CephObjectStore.yaml
apiVersion: ceph.rook.io/v1
kind: CephObjectStore
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  dataPool:
    erasureCoded:
      codingChunks: 0
      dataChunks: 0
    failureDomain: host
    mirroring: {}
    parameters:
      compression_mode: none
    quotas: {}
    replicated:
      requireSafeReplicaSize: false
      size: 1
    statusCheck:
      mirror: {}
  gateway:
    instances: 1
    placement:
      podAntiAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
        - podAffinityTerm:
            labelSelector:
              matchLabels:
                app: rook-ceph-rgw
            topologyKey: kubernetes.io/hostname
          weight: 100
    port: 80
    priorityClassName: system-cluster-critical
    resources:
      requests:
        cpu: 100m
        memory: 512Mi
  healthCheck:
    bucket:
      disabled: false
      interval: 60s
    livenessProbe:
      disabled: false
    readinessProbe:
      disabled: false
  metadataPool:
    erasureCoded:
      codingChunks: 0
      dataChunks: 0
    failureDomain: host
    mirroring: {}
    parameters:
      compression_mode: none
    quotas: {}
    replicated:
      requireSafeReplicaSize: false
      size: 1
    statusCheck:
      mirror: {}
  preservePoolsOnDelete: false
  zone:
    name: ""
Crashing pod(s) logs
debug 2022-03-01T15:24:25.698+0000 7f671ec29440 10 rgw main: Cannot find current period zone using local zone
debug 2022-03-01T15:24:25.698+0000 7f671ec29440 20 rgw main: rados->read ofs=0 len=0
debug 2022-03-01T15:24:25.698+0000 7f671ec29440 20 rgw main: rados_obj.operate() r=0 bl.length=915
debug 2022-03-01T15:24:25.698+0000 7f671ec29440 20 rgw main: zone rook-ceph found
debug 2022-03-01T15:24:25.698+0000 7f671ec29440 20 rgw main: rados->read ofs=0 len=0
debug 2022-03-01T15:24:25.698+0000 7f671ec29440 20 rgw main: rados_obj.operate() r=-2 bl.length=0
debug 2022-03-01T15:24:25.699+0000 7f671ec29440 20 rgw main: rados->read ofs=0 len=0
debug 2022-03-01T15:24:25.699+0000 7f671ec29440 20 rgw main: rados_obj.operate() r=-2 bl.length=0
debug 2022-03-01T15:24:25.699+0000 7f671ec29440 20 rgw main: started sync module instance, tier type =
debug 2022-03-01T15:24:25.699+0000 7f671ec29440 20 rgw main: started zone id=a90fa03f-fbc3-4074-9bab-378d0ab42131 (name=rook-ceph) with tier type =
debug 2022-03-01T15:29:25.669+0000 7f670b0a1700 -1 Initialization timeout, failed to initialize
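For reference, the gateway log can be pulled like this (a sketch, assuming the default Rook namespace and the app=rook-ceph-rgw label set by the operator):

kubectl -n rook-ceph get pods -l app=rook-ceph-rgw            # pod restarts and never becomes Ready
kubectl -n rook-ceph logs -l app=rook-ceph-rgw --tail=100     # ends with "Initialization timeout, failed to initialize"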
Environment:
- OS (e.g. from /etc/os-release): "Red Hat Enterprise Linux 8.2 (Ootpa)"
- Kernel (e.g. `uname -a`): Linux server0 4.18.0-193.65.2.el8_2.x86_64
- Cloud provider or hardware configuration: Azure
- Rook version (use `rook version` inside of a Rook Pod): v1.7.9
- Storage backend version (e.g. for ceph do `ceph -v`): 16.2.6 (ee28fb57e47e9f88813e24bbf4c14496ca299d31) pacific (stable)
- Kubernetes version (use `kubectl version`): v1.21.4+rke2r2
- Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift): Rancher rke2 v1.21.4+rke2r2
- Storage backend status (e.g. for Ceph use `ceph health` in the [Rook Ceph toolbox](https://rook.io/docs/rook/latest/ceph-toolbox.html)): HEALTH_ERR 1 full osd(s); 7 pool(s) full; 7 pool(s) have no replicas configured; OSD count 1 < osd_pool_default_size 3
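The storage backend status above can be reproduced from the Rook toolbox; a sketch, assuming the standard rook-ceph-tools deployment:

kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph health detail
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd df    # shows the single OSD above full_ratio (default 0.95)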