Bug #49166 (Closed)
All OSD down after docker upgrade: KernelDevice.cc: 999: FAILED ceph_assert(is_valid_io(off, len))
Description
Hello,
I'm running a 4-node Ceph cluster on Debian 10.7. There was a Docker update from version 20.10.1 to 20.10.3, so a minor update. However, after installing it with 'apt upgrade', all the Ceph containers went down, obviously because of the Docker restart. I rebooted the machine, but the OSDs are not coming back.
When I type docker ps I can see a few services running:
6e28dc0c0b6a ceph/ceph:v15 "bash" 42 minutes ago Up 42 minutes nostalgic_banach
582265e28589 ceph/ceph:v15 "/usr/bin/ceph-crash…" 18 hours ago Up 18 hours ceph-8fde54d0-45e9-11eb-86ab-a23d47ea900e-crash.osd07
0b5b8f6b09d9 ceph/ceph:v15 "/usr/bin/ceph-mds -…" 18 hours ago Up 18 hours ceph-8fde54d0-45e9-11eb-86ab-a23d47ea900e-mds.cephfs-sata.osd07.kofdrp
fb7cc0b58ef5 prom/node-exporter:v0.18.1 "/bin/node_exporter …" 18 hours ago Up 18 hours ceph-8fde54d0-45e9-11eb-86ab-a23d47ea900e-node-exporter.osd07
Here's my ceph -s output:
root@osd07:/var/lib/ceph/osd# ceph -s
  cluster:
    id:     8fde54d0-45e9-11eb-86ab-a23d47ea900e
    health: HEALTH_WARN
            1 osds down
            1 host (8 osds) down
            Degraded data redundancy: 396924/7903527 objects degraded (5.022%), 64 pgs degraded, 64 pgs undersized

  services:
    mon: 3 daemons, quorum osd04,osd06,osd05 (age 5w)
    mgr: osd04.wljcez(active, since 5w), standbys: osd05.uvfdor
    mds: cephfs-sata:1 {0=cephfs-sata.osd05.evynxa=up:active} 1 up:standby
    osd: 26 osds: 18 up (since 24h), 19 in (since 24h); 20 remapped pgs

  data:
    pools:   4 pools, 577 pgs
    objects: 3.95M objects, 41 TiB
    usage:   79 TiB used, 128 TiB / 207 TiB avail
    pgs:     396924/7903527 objects degraded (5.022%)
             513 active+clean
             44  active+undersized+degraded
             16  active+undersized+degraded+remapped+backfill_wait
             4   active+undersized+degraded+remapped+backfilling

  io:
    client:   37 MiB/s rd, 14 MiB/s wr, 158 op/s rd, 31 op/s wr
    recovery: 98 MiB/s, 5 objects/s
I then tried to start the OSD services manually, but they fail every time, all 8 of them. I've attached the log to this issue.
I have no clue why they won't start. I already tried downgrading Docker to the previous version, without any luck. I also checked all the permissions on the devices and directories and compared them with the other nodes; they all look good as well.
Any idea how to fix this without recreating the OSDs?
Thanks!
Updated by Sebastian Wagner about 3 years ago
- Project changed from Ceph to Orchestrator
- Subject changed from All OSD down after docker upgrade to cephadm: All OSD down after docker upgrade
- Category deleted (OSD)
Updated by Sebastian Wagner about 3 years ago
- Project changed from Orchestrator to Ceph
- Subject changed from cephadm: All OSD down after docker upgrade to All OSD down after docker upgrade: KernelDevice.cc: 999: FAILED ceph_assert(is_valid_io(off, len))
Nope, this does not seem to be related to cephadm.
Updated by Igor Fedotov about 3 years ago
Don't you have a custom value for the bdev_block_size param?
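You can check with something like this (a minimal sketch, assuming the value was set via the centralized config database rather than a local ceph.conf):

ceph config get osd bdev_block_size        # value from the centralized config
grep bdev_block_size /etc/ceph/ceph.conf   # or a local override, if any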
Updated by Igor Fedotov about 3 years ago
This line looks suspicious:
Feb 04 09:33:09 osd07 bash[32010]: debug -12> 2021-02-04T08:33:09.857+0000 7f52e0bfff40 1 bdev(0x5605c585c700 /var/lib/ceph/osd/ceph-18/block) open backing device/file reports st_blksize 4096, using bdev_block_size 32768 anyway
Updated by Igor Fedotov about 3 years ago
And here is the code snippet which reports the above:
// Operate as though the block size is 4 KB.  The backing file
// blksize doesn't strictly matter except that some file systems may
// require a read/modify/write if we write something smaller than
// it.
block_size = cct->_conf->bdev_block_size;
if (block_size != (unsigned)st.st_blksize) {
  dout(1) << __func__ << " backing device/file reports st_blksize "
          << st.st_blksize << ", using bdev_block_size "
          << block_size << " anyway" << dendl;
}
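The assert from the subject line fires further down the I/O path: KernelDevice validates every read and write against this block_size. Paraphrased from memory (not a verbatim quote of the Octopus source), the check is roughly:

// Every I/O must be aligned to block_size, so with
// bdev_block_size = 32768 an offset or length that is only
// 4 KiB-aligned fails the modulus checks and trips the ceph_assert.
bool is_valid_io(uint64_t off, uint64_t len) const {
  return (off % block_size == 0 &&
          len % block_size == 0 &&
          len > 0 &&
          off < dev_size &&
          off + len <= dev_size);
}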
Updated by Martijn Kools about 3 years ago
Igor Fedotov wrote:
Don't you have a custom value for the bdev_block_size param?
Correct, I have it set to 32768. Could that be the issue? Should I try reverting to 4096?
Updated by Igor Fedotov about 3 years ago
- Tracker changed from Support to Bug
- Status changed from New to Triaged
- Regression set to No
- Severity set to 3 - minor
Updated by Igor Fedotov about 3 years ago
Martijn - please let us know if reverting helps, so we can close the ticket...
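A minimal sketch of the revert, assuming the override lives in the centralized config (adjust accordingly if it was set in ceph.conf instead):

ceph config rm osd bdev_block_size    # drop the override; falls back to the 4096 default
# then restart the OSDs; with cephadm the units follow this pattern:
systemctl restart ceph-8fde54d0-45e9-11eb-86ab-a23d47ea900e@osd.18.service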
Updated by Martijn Kools about 3 years ago
Igor Fedotov wrote:
Martijn - please let us know if reverting helps, so we can close the ticket...
BINGO! Finally, after two days... I can't believe this was it. Everything immediately started running again.
Thanks so much, guys!
Updated by Igor Fedotov about 3 years ago
- Status changed from Triaged to Rejected