Bug #49166
Status: Closed
All OSDs down after docker upgrade: KernelDevice.cc: 999: FAILED ceph_assert(is_valid_io(off, len))
Description
Hello,
I'm running a 4-node Ceph cluster on Debian 10.7. There was a Docker update from version 20.10.1 to version 20.10.3, so a minor update. However, after installing the update with 'apt upgrade', all the Ceph containers went down, obviously because of the Docker restart. I rebooted the machine, but the OSDs are not coming back.
When I run docker ps, I can see a few services still running:
6e28dc0c0b6a ceph/ceph:v15 "bash" 42 minutes ago Up 42 minutes nostalgic_banach
582265e28589 ceph/ceph:v15 "/usr/bin/ceph-crash…" 18 hours ago Up 18 hours ceph-8fde54d0-45e9-11eb-86ab-a23d47ea900e-crash.osd07
0b5b8f6b09d9 ceph/ceph:v15 "/usr/bin/ceph-mds -…" 18 hours ago Up 18 hours ceph-8fde54d0-45e9-11eb-86ab-a23d47ea900e-mds.cephfs-sata.osd07.kofdrp
fb7cc0b58ef5 prom/node-exporter:v0.18.1 "/bin/node_exporter …" 18 hours ago Up 18 hours ceph-8fde54d0-45e9-11eb-86ab-a23d47ea900e-node-exporter.osd07
Here's my ceph -s output:
root@osd07:/var/lib/ceph/osd# ceph -s
  cluster:
    id:     8fde54d0-45e9-11eb-86ab-a23d47ea900e
    health: HEALTH_WARN
            1 osds down
            1 host (8 osds) down
            Degraded data redundancy: 396924/7903527 objects degraded (5.022%), 64 pgs degraded, 64 pgs undersized

  services:
    mon: 3 daemons, quorum osd04,osd06,osd05 (age 5w)
    mgr: osd04.wljcez(active, since 5w), standbys: osd05.uvfdor
    mds: cephfs-sata:1 {0=cephfs-sata.osd05.evynxa=up:active} 1 up:standby
    osd: 26 osds: 18 up (since 24h), 19 in (since 24h); 20 remapped pgs

  data:
    pools:   4 pools, 577 pgs
    objects: 3.95M objects, 41 TiB
    usage:   79 TiB used, 128 TiB / 207 TiB avail
    pgs:     396924/7903527 objects degraded (5.022%)
             513 active+clean
             44  active+undersized+degraded
             16  active+undersized+degraded+remapped+backfill_wait
             4   active+undersized+degraded+remapped+backfilling

  io:
    client:   37 MiB/s rd, 14 MiB/s wr, 158 op/s rd, 31 op/s wr
    recovery: 98 MiB/s, 5 objects/s
I then tried to start the OSD services manually, but every attempt fails, for all 8 of them. I've attached the log to this issue.
I have no clue why they won't start. I already tried downgrading Docker to the previous version, without any luck. I also checked all the permissions on the devices and directories and compared them with the other nodes; they all look fine as well.
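For reference, this is roughly what I ran on the broken node (a sketch, not exact history: osd.0 stands in for each OSD id, and the unit name assumes the usual cephadm systemd naming of ceph-&lt;fsid&gt;@osd.&lt;id&gt;.service):

```shell
# fsid of my cluster, taken from the ceph -s output above
fsid=8fde54d0-45e9-11eb-86ab-a23d47ea900e

# try to start one of the failed OSDs (cephadm-managed systemd unit)
systemctl start "ceph-${fsid}@osd.0.service" 2>/dev/null \
  || echo "osd.0 unit failed to start"

# look at the unit log for the ceph_assert backtrace
journalctl -u "ceph-${fsid}@osd.0.service" -n 50 --no-pager 2>/dev/null

# compare ownership/permissions of the OSD data dir with a healthy node
ls -ln "/var/lib/ceph/${fsid}/osd.0" 2>/dev/null \
  || echo "no osd.0 data dir on this host"
```

The unit start fails the same way for every osd.N on this host, while the same checks on the other nodes look normal.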
Any idea how to fix this without recreating the OSDs?
Thanks!