Bug #49166

All OSD down after docker upgrade: KernelDevice.cc: 999: FAILED ceph_assert(is_valid_io(off, len))

Added by Martijn Kools 21 days ago. Updated 21 days ago.

Status:
Rejected
Priority:
Normal
Assignee:
-
Category:
-
Target version:
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature:

Description

Hello,

I'm running a 4-node Ceph cluster on Debian 10.7. There was a Docker update from version 20.10.1 to version 20.10.3, so a minor update. However, after installing the update with 'apt upgrade', all the Ceph containers went down, obviously because of the Docker restart. I rebooted the machine, but the OSDs are not coming back.

When I type docker ps I can see a few services running:

6e28dc0c0b6a   ceph/ceph:v15                "bash"                   42 minutes ago   Up 42 minutes             nostalgic_banach
582265e28589   ceph/ceph:v15                "/usr/bin/ceph-crash…"   18 hours ago     Up 18 hours               ceph-8fde54d0-45e9-11eb-86ab-a23d47ea900e-crash.osd07
0b5b8f6b09d9   ceph/ceph:v15                "/usr/bin/ceph-mds -…"   18 hours ago     Up 18 hours               ceph-8fde54d0-45e9-11eb-86ab-a23d47ea900e-mds.cephfs-sata.osd07.kofdrp
fb7cc0b58ef5   prom/node-exporter:v0.18.1   "/bin/node_exporter …"   18 hours ago     Up 18 hours               ceph-8fde54d0-45e9-11eb-86ab-a23d47ea900e-node-exporter.osd07

Here's my ceph -s output:

root@osd07:/var/lib/ceph/osd# ceph -s
  cluster:
    id:     8fde54d0-45e9-11eb-86ab-a23d47ea900e
    health: HEALTH_WARN
            1 osds down
            1 host (8 osds) down
            Degraded data redundancy: 396924/7903527 objects degraded (5.022%), 64 pgs degraded, 64 pgs undersized

  services:
    mon: 3 daemons, quorum osd04,osd06,osd05 (age 5w)
    mgr: osd04.wljcez(active, since 5w), standbys: osd05.uvfdor
    mds: cephfs-sata:1 {0=cephfs-sata.osd05.evynxa=up:active} 1 up:standby
    osd: 26 osds: 18 up (since 24h), 19 in (since 24h); 20 remapped pgs

  data:
    pools:   4 pools, 577 pgs
    objects: 3.95M objects, 41 TiB
    usage:   79 TiB used, 128 TiB / 207 TiB avail
    pgs:     396924/7903527 objects degraded (5.022%)
             513 active+clean
             44  active+undersized+degraded
             16  active+undersized+degraded+remapped+backfill_wait
             4   active+undersized+degraded+remapped+backfilling

  io:
    client:   37 MiB/s rd, 14 MiB/s wr, 158 op/s rd, 31 op/s wr
    recovery: 98 MiB/s, 5 objects/s

I then tried to start the OSD service manually, but it fails every time, for all 8 of them. I've attached the log to this issue.
I have no clue why they won't start. I already tried downgrading Docker to the previous version, without any luck. I also checked all permissions on the devices and directories and compared them with the other nodes; they all look good as well.

Any idea how to fix this without recreating the OSDs?

Thanks!

osdlog.txt (862 KB) Martijn Kools, 02/04/2021 01:49 PM

History

#1 Updated by Sebastian Wagner 21 days ago

  • Project changed from Ceph to Orchestrator
  • Subject changed from All OSD down after docker upgrade to cephadm: All OSD down after docker upgrade
  • Category deleted (OSD)

#2 Updated by Sebastian Wagner 21 days ago

  • Project changed from Orchestrator to Ceph
  • Subject changed from cephadm: All OSD down after docker upgrade to All OSD down after docker upgrade: KernelDevice.cc: 999: FAILED ceph_assert(is_valid_io(off, len))

nope. does not seem to be related to cephadm.

#3 Updated by Igor Fedotov 21 days ago

Don't you have a custom value for the bdev_block_size param?

#4 Updated by Igor Fedotov 21 days ago

This line looks suspicious:

Feb 04 09:33:09 osd07 bash32010: debug -12> 2021-02-04T08:33:09.857+0000 7f52e0bfff40 1 bdev(0x5605c585c700 /var/lib/ceph/osd/ceph-18/block) open backing device/file reports st_blksize 4096, using bdev_block_size 32768 anyway

#5 Updated by Igor Fedotov 21 days ago

And here is the code snippet which reports the above:

  // Operate as though the block size is 4 KB.  The backing file
  // blksize doesn't strictly matter except that some file systems may
  // require a read/modify/write if we write something smaller than
  // it.
  block_size = cct->_conf->bdev_block_size;
  if (block_size != (unsigned)st.st_blksize) {
    dout(1) << __func__ << " backing device/file reports st_blksize "
            << st.st_blksize << ", using bdev_block_size "
            << block_size << " anyway" << dendl;
  }
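A minimal sketch of why this mismatch can trip the assert. Assuming (from its name in the subject line) that is_valid_io(off, len) enforces block-size alignment of each I/O, an offset or length that is only 4 KiB-aligned fails once bdev_block_size is forced to 32 KiB. The function below is a hypothetical re-creation for illustration, not the actual KernelDevice implementation:

```python
# Hypothetical stand-in for the check behind
# ceph_assert(is_valid_io(off, len)); the real KernelDevice code
# may test additional conditions.
def is_valid_io(off: int, length: int, block_size: int) -> bool:
    """An I/O must start and end on block_size boundaries."""
    return length > 0 and off % block_size == 0 and length % block_size == 0

# A 4 KiB-aligned write is fine when bdev_block_size is 4096...
print(is_valid_io(4096, 4096, block_size=4096))    # True
# ...but the same write fails when bdev_block_size is forced to 32768:
print(is_valid_io(4096, 4096, block_size=32768))   # False
```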

#6 Updated by Martijn Kools 21 days ago

Igor Fedotov wrote:

don't you have custom value for bdev_block_size param?

Correct. I have it set to 32768. Could that be the issue? Should I try to revert to 4096?

#7 Updated by Igor Fedotov 21 days ago

Absolutely! Please revert
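On a cephadm-managed cluster like this one, the revert can typically be done through the central config; a sketch, assuming the custom value was set at the global or osd level (adjust the section to wherever it was actually set):

```shell
# Clear the custom value so the OSDs fall back to the 4096 default...
ceph config rm osd bdev_block_size
# ...or set it back to the default explicitly:
ceph config set osd bdev_block_size 4096
```

The OSD daemons then need to be restarted to pick up the new value.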

#8 Updated by Igor Fedotov 21 days ago

  • Tracker changed from Support to Bug
  • Status changed from New to Triaged
  • Regression set to No
  • Severity set to 3 - minor

#9 Updated by Igor Fedotov 21 days ago

Martijn - please let us know if reverting helps, so we can close the ticket...

#10 Updated by Martijn Kools 21 days ago

Igor Fedotov wrote:

Martijn - please let us know if reverting helps, so we can close the ticket...

BINGO! Finally, after two days. I can't believe this was it. Everything immediately started running again.

Thanks so much guys!

#11 Updated by Igor Fedotov 21 days ago

  • Status changed from Triaged to Rejected
