Bug #49166

All OSD down after docker upgrade: KernelDevice.cc: 999: FAILED ceph_assert(is_valid_io(off, len))

Added by Martijn Kools 21 days ago. Updated 21 days ago.

Status:
Rejected
Priority:
Normal
Assignee:
-
Category:
-
Target version:
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature:

Description

Hello,

I'm running a 4-node Ceph cluster on Debian 10.7. There was a Docker update from version 20.10.1 to version 20.10.3, so a minor update. However, after installing the update with 'apt upgrade', all the Ceph containers went down, obviously because of the Docker restart. I rebooted the machine, but the OSDs are not coming back.

When I type docker ps I can see a few services running:

6e28dc0c0b6a   ceph/ceph:v15                "bash"                   42 minutes ago   Up 42 minutes             nostalgic_banach
582265e28589   ceph/ceph:v15                "/usr/bin/ceph-crash…"   18 hours ago     Up 18 hours               ceph-8fde54d0-45e9-11eb-86ab-a23d47ea900e-crash.osd07
0b5b8f6b09d9   ceph/ceph:v15                "/usr/bin/ceph-mds -…"   18 hours ago     Up 18 hours               ceph-8fde54d0-45e9-11eb-86ab-a23d47ea900e-mds.cephfs-sata.osd07.kofdrp
fb7cc0b58ef5   prom/node-exporter:v0.18.1   "/bin/node_exporter …"   18 hours ago     Up 18 hours               ceph-8fde54d0-45e9-11eb-86ab-a23d47ea900e-node-exporter.osd07

Here's my ceph -s output:

root@osd07:/var/lib/ceph/osd# ceph -s
  cluster:
    id:     8fde54d0-45e9-11eb-86ab-a23d47ea900e
    health: HEALTH_WARN
            1 osds down
            1 host (8 osds) down
            Degraded data redundancy: 396924/7903527 objects degraded (5.022%), 64 pgs degraded, 64 pgs undersized

  services:
    mon: 3 daemons, quorum osd04,osd06,osd05 (age 5w)
    mgr: osd04.wljcez(active, since 5w), standbys: osd05.uvfdor
    mds: cephfs-sata:1 {0=cephfs-sata.osd05.evynxa=up:active} 1 up:standby
    osd: 26 osds: 18 up (since 24h), 19 in (since 24h); 20 remapped pgs

  data:
    pools:   4 pools, 577 pgs
    objects: 3.95M objects, 41 TiB
    usage:   79 TiB used, 128 TiB / 207 TiB avail
    pgs:     396924/7903527 objects degraded (5.022%)
             513 active+clean
             44  active+undersized+degraded
             16  active+undersized+degraded+remapped+backfill_wait
             4   active+undersized+degraded+remapped+backfilling

  io:
    client:   37 MiB/s rd, 14 MiB/s wr, 158 op/s rd, 31 op/s wr
    recovery: 98 MiB/s, 5 objects/s

I then tried to start the OSD service manually, but it fails every time, for all 8 of them. I've attached the log to this issue.
I have no clue why they won't start. I already tried downgrading Docker to the previous version, without any luck. I also checked all permissions on the devices and directories and compared them with the other nodes; they all look good as well.

Any idea how to fix this without recreating the OSDs?

Thanks!

osdlog.txt (862 KB) Martijn Kools, 02/04/2021 01:49 PM

History

#1 Updated by Sebastian Wagner 21 days ago

  • Project changed from Ceph to Orchestrator
  • Subject changed from All OSD down after docker upgrade to cephadm: All OSD down after docker upgrade
  • Category deleted (OSD)

#2 Updated by Sebastian Wagner 21 days ago

  • Project changed from Orchestrator to Ceph
  • Subject changed from cephadm: All OSD down after docker upgrade to All OSD down after docker upgrade: KernelDevice.cc: 999: FAILED ceph_assert(is_valid_io(off, len))

nope. does not seem to be related to cephadm.

#3 Updated by Igor Fedotov 21 days ago

Don't you have a custom value for the bdev_block_size param?

#4 Updated by Igor Fedotov 21 days ago

This line looks suspicious:

Feb 04 09:33:09 osd07 bash32010: debug -12> 2021-02-04T08:33:09.857+0000 7f52e0bfff40 1 bdev(0x5605c585c700 /var/lib/ceph/osd/ceph-18/block) open backing device/file reports st_blksize 4096, using bdev_block_size 32768 anyway

#5 Updated by Igor Fedotov 21 days ago

And here is the code snippet which reports the above:

  // Operate as though the block size is 4 KB.  The backing file
  // blksize doesn't strictly matter except that some file systems may
  // require a read/modify/write if we write something smaller than
  // it.
  block_size = cct->_conf->bdev_block_size;
  if (block_size != (unsigned)st.st_blksize) {
    dout(1) << __func__ << " backing device/file reports st_blksize "
            << st.st_blksize << ", using bdev_block_size "
            << block_size << " anyway" << dendl;
  }
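A minimal sketch of why this mismatch can trip the assert. Assuming (from its name in the subject line) that is_valid_io(off, len) enforces block-size alignment of each I/O, an offset or length that is only 4 KiB-aligned fails once bdev_block_size is forced to 32 KiB. The function below is a hypothetical re-creation for illustration, not the actual KernelDevice implementation:

```python
# Hypothetical stand-in for the check behind
# ceph_assert(is_valid_io(off, len)); the real KernelDevice code
# may test additional conditions.
def is_valid_io(off: int, length: int, block_size: int) -> bool:
    """An I/O must start and end on block_size boundaries."""
    return length > 0 and off % block_size == 0 and length % block_size == 0

# A 4 KiB-aligned write is fine when bdev_block_size is 4096...
print(is_valid_io(4096, 4096, block_size=4096))    # True
# ...but the same write fails when bdev_block_size is forced to 32768:
print(is_valid_io(4096, 4096, block_size=32768))   # False
```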

#6 Updated by Martijn Kools 21 days ago

Igor Fedotov wrote:

don't you have custom value for bdev_block_size param?

Correct. I have it set to 32768. Could that be the issue? Should I try to revert to 4096?

#7 Updated by Igor Fedotov 21 days ago

Absolutely! Please revert
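On a cephadm-managed cluster like this one, the revert can typically be done through the central config; a sketch, assuming the custom value was set at the global or osd level (adjust the section to wherever it was actually set):

```shell
# Clear the custom value so the OSDs fall back to the 4096 default...
ceph config rm osd bdev_block_size
# ...or set it back to the default explicitly:
ceph config set osd bdev_block_size 4096
```

The OSD daemons then need to be restarted to pick up the new value.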

#8 Updated by Igor Fedotov 21 days ago

  • Tracker changed from Support to Bug
  • Status changed from New to Triaged
  • Regression set to No
  • Severity set to 3 - minor

#9 Updated by Igor Fedotov 21 days ago

Martijn - please let us know if reverting helps, so we can close the ticket...

#10 Updated by Martijn Kools 21 days ago

Igor Fedotov wrote:

Martijn - please let us know if reverting helps, so we can close the ticket...

BINGO! Finally, after two days. I can't believe this was it. Everything immediately started running again.

Thanks so much guys!

#11 Updated by Igor Fedotov 21 days ago

  • Status changed from Triaged to Rejected
