Bug #49166

closed

All OSDs down after Docker upgrade: KernelDevice.cc: 999: FAILED ceph_assert(is_valid_io(off, len))

Added by Martijn Kools over 3 years ago. Updated over 3 years ago.

Status: Rejected
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: -
Tags: -
Backport: -
Regression: No
Severity: 3 - minor
Reviewed: -
Affected Versions: -
ceph-qa-suite: -
Pull request ID: -
Crash signature (v1): -
Crash signature (v2): -

Description

Hello,

I'm running a 4-node Ceph cluster on Debian 10.7. There was a Docker update from version 20.10.1 to version 20.10.3, so a minor update. However, after installing it with 'apt upgrade', all the Ceph containers went down, obviously because of the Docker daemon restart. I rebooted the machine, but the OSDs are not coming back.
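
For reference, a minimal sketch of Docker's live-restore option, which keeps containers running while dockerd itself restarts; it is only documented to work across patch-level daemon upgrades like this one, and the config below assumes there is no existing /etc/docker/daemon.json to merge with.

# /etc/docker/daemon.json
{
  "live-restore": true
}

# Apply the new setting without restarting running containers
systemctl reload docker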

When I run docker ps, I can see a few services still running:

6e28dc0c0b6a   ceph/ceph:v15                "bash"                   42 minutes ago   Up 42 minutes             nostalgic_banach
582265e28589   ceph/ceph:v15                "/usr/bin/ceph-crash…"   18 hours ago     Up 18 hours               ceph-8fde54d0-45e9-11eb-86ab-a23d47ea900e-crash.osd07
0b5b8f6b09d9   ceph/ceph:v15                "/usr/bin/ceph-mds -…"   18 hours ago     Up 18 hours               ceph-8fde54d0-45e9-11eb-86ab-a23d47ea900e-mds.cephfs-sata.osd07.kofdrp
fb7cc0b58ef5   prom/node-exporter:v0.18.1   "/bin/node_exporter …"   18 hours ago     Up 18 hours               ceph-8fde54d0-45e9-11eb-86ab-a23d47ea900e-node-exporter.osd07
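
The OSD containers are missing from that list; as a sketch (assuming a standard cephadm deployment), the exited containers and the daemon states cephadm tracks on this host can be listed as follows, reusing the cluster fsid visible above as the name filter:

# Include stopped containers, filtered to this cluster's container name prefix
docker ps -a --filter "name=ceph-8fde54d0-45e9-11eb-86ab-a23d47ea900e"

# List every daemon cephadm manages on this host, with its current state
cephadm ls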

Here's my ceph -s output:

root@osd07:/var/lib/ceph/osd# ceph -s
  cluster:
    id:     8fde54d0-45e9-11eb-86ab-a23d47ea900e
    health: HEALTH_WARN
            1 osds down
            1 host (8 osds) down
            Degraded data redundancy: 396924/7903527 objects degraded (5.022%), 64 pgs degraded, 64 pgs undersized

  services:
    mon: 3 daemons, quorum osd04,osd06,osd05 (age 5w)
    mgr: osd04.wljcez(active, since 5w), standbys: osd05.uvfdor
    mds: cephfs-sata:1 {0=cephfs-sata.osd05.evynxa=up:active} 1 up:standby
    osd: 26 osds: 18 up (since 24h), 19 in (since 24h); 20 remapped pgs

  data:
    pools:   4 pools, 577 pgs
    objects: 3.95M objects, 41 TiB
    usage:   79 TiB used, 128 TiB / 207 TiB avail
    pgs:     396924/7903527 objects degraded (5.022%)
             513 active+clean
             44  active+undersized+degraded
             16  active+undersized+degraded+remapped+backfill_wait
             4   active+undersized+degraded+remapped+backfilling

  io:
    client:   37 MiB/s rd, 14 MiB/s wr, 158 op/s rd, 31 op/s wr
    recovery: 98 MiB/s, 5 objects/s
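
For completeness, the down OSDs can be pinpointed with the standard Ceph CLI (nothing here is specific to this cluster):

# Show only the down portion of the OSD tree
ceph osd tree down

# One-line summary of how many OSDs are up/in
ceph osd stat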

I then tried to start the OSD service manually, but it fails every time, and all 8 of them fail the same way. I've attached the log to this issue.
I have no clue why they won't start. I already tried downgrading Docker to the previous version, without any luck. I also checked all the permissions on the devices and directories, compared them with the other nodes, and they all look good.
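
For context, a sketch of how a cephadm-managed OSD is started and how its log is pulled; osd.0 below is only a placeholder id:

# cephadm daemons run as systemd units named after the cluster fsid
systemctl restart ceph-8fde54d0-45e9-11eb-86ab-a23d47ea900e@osd.0.service

# The ceph_assert backtrace from the title ends up in the unit's journal
journalctl -u ceph-8fde54d0-45e9-11eb-86ab-a23d47ea900e@osd.0.service --no-pager

# Or let cephadm locate the right container log
cephadm logs --name osd.0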

Any idea how to fix this without recreating the OSDs?

Thanks!


Files

osdlog.txt (862 KB) — Martijn Kools, 02/04/2021 01:49 PM