Backport #58952

closed

reef: OSD::mkfs: ObjectStore::mkfs failed with error (5) Input/output error

Added by Radoslaw Zarzynski about 1 year ago. Updated 4 months ago.

Status:
Resolved
Priority:
Normal
Target version:
-
Release:
reef

Description

When upgrading to Rook 1.8.3 (Ceph 16.2.7) we experience issues with OSD initialization: only about 50% of the raw devices are actually picked up by Ceph, i.e. get an OSD instance. All others are eventually ignored.

It seems that the first osd-prepare run fails but leaves the OSD setup in an invalid state from which Ceph cannot recover by simply respawning the Rook operator or the OSD pods.

The osd-prepare log eventually looks like this:

2022-01-24 16:50:57.929115 D | exec: Running command: udevadm info --query=property /dev/sdc1
2022-01-24 16:50:57.938006 D | exec: Running command: lsblk /dev/sdc1 --bytes --nodeps --pairs --paths --output SIZE,ROTA,RO,TYPE,PKNAME,NAME,KNAME
2022-01-24 16:50:57.941195 D | exec: Running command: ceph-volume inventory --format json /dev/sdc1
2022-01-24 16:50:58.423904 I | cephosd: skipping device "sdc1": ["Has BlueStore device label"].
2022-01-24 16:50:58.423934 D | exec: Running command: udevadm info --query=property /dev/sdd1
2022-01-24 16:50:58.432716 D | exec: Running command: lsblk /dev/sdd1 --bytes --nodeps --pairs --paths --output SIZE,ROTA,RO,TYPE,PKNAME,NAME,KNAME
2022-01-24 16:50:58.436077 D | exec: Running command: ceph-volume inventory --format json /dev/sdd1
2022-01-24 16:50:58.888559 I | cephosd: skipping device "sdd1": ["Has BlueStore device label"].
2022-01-24 16:50:58.888590 D | exec: Running command: udevadm info --query=property /dev/sde1
2022-01-24 16:50:58.897610 D | exec: Running command: lsblk /dev/sde1 --bytes --nodeps --pairs --paths --output SIZE,ROTA,RO,TYPE,PKNAME,NAME,KNAME
2022-01-24 16:50:58.901015 D | exec: Running command: ceph-volume inventory --format json /dev/sde1
2022-01-24 16:50:59.363622 I | cephosd: skipping device "sde1": ["Has BlueStore device label"].
2022-01-24 16:50:59.363646 I | cephosd: skipping 'dm' device "dm-0"
2022-01-24 16:50:59.370400 I | cephosd: configuring osd devices: {"Entries":{}}
2022-01-24 16:50:59.370426 I | cephosd: no new devices to configure. returning devices already configured with ceph-volume.
2022-01-24 16:50:59.370718 D | exec: Running command: stdbuf -oL ceph-volume --log-path /tmp/ceph-log lvm list --format json
2022-01-24 16:50:59.741959 D | cephosd: {}
2022-01-24 16:50:59.742004 I | cephosd: 0 ceph-volume lvm osd devices configured on this node
2022-01-24 16:50:59.742038 D | exec: Running command: stdbuf -oL ceph-volume --log-path /tmp/ceph-log raw list --format json
2022-01-24 16:51:00.581917 D | cephosd: {
    "0f6d36fb-23ed-4888-9469-a5ab91370bd3": {
        "ceph_fsid": "576c5282-5b6c-4204-80d3-80049ef7326d",
        "device": "/dev/sdc1",
        "osd_id": 1,
        "osd_uuid": "0f6d36fb-23ed-4888-9469-a5ab91370bd3",
        "type": "bluestore"
    },
    "51b3b924-f70f-4257-88cf-91d881faf5e3": {
        "ceph_fsid": "576c5282-5b6c-4204-80d3-80049ef7326d",
        "device": "/dev/sde1",
        "osd_id": 4,
        "osd_uuid": "51b3b924-f70f-4257-88cf-91d881faf5e3",
        "type": "bluestore"
    }
}

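As can be seen, sdd1 is skipped because it has a BlueStore device label, yet it is missing from the "ceph-volume raw list" output, so it never gets an OSD.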
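The mismatch can be spotted mechanically by comparing the devices that osd-prepare skipped for having a BlueStore label against the devices reported by "ceph-volume raw list". The following is only a minimal sketch: the function name and the sample strings are hypothetical, the regular expression assumes the exact log wording shown in this report, and it assumes the short device name maps to /dev/<name>.

```python
import json
import re

# Matches the skip message exactly as it appears in the osd-prepare log of
# this report; other Rook versions may word it differently (assumption).
SKIP_RE = re.compile(
    r'cephosd: skipping device "([^"]+)": \["Has BlueStore device label"\]'
)


def labeled_but_unlisted(prepare_log: str, raw_list_json: str) -> list:
    """Return devices that carry a BlueStore label but are absent from
    the JSON printed by `ceph-volume raw list --format json`."""
    # Assumption: the short name in the log corresponds to /dev/<name>.
    skipped = {"/dev/" + name for name in SKIP_RE.findall(prepare_log)}
    listed = {entry["device"] for entry in json.loads(raw_list_json).values()}
    return sorted(skipped - listed)


# Sample input trimmed from the log excerpt in this report.
SAMPLE_LOG = '''
2022-01-24 16:50:58.423904 I | cephosd: skipping device "sdc1": ["Has BlueStore device label"].
2022-01-24 16:50:58.888559 I | cephosd: skipping device "sdd1": ["Has BlueStore device label"].
2022-01-24 16:50:59.363622 I | cephosd: skipping device "sde1": ["Has BlueStore device label"].
'''

SAMPLE_RAW_LIST = '''
{
  "0f6d36fb-23ed-4888-9469-a5ab91370bd3": {"device": "/dev/sdc1", "osd_id": 1},
  "51b3b924-f70f-4257-88cf-91d881faf5e3": {"device": "/dev/sde1", "osd_id": 4}
}
'''

if __name__ == "__main__":
    print(labeled_but_unlisted(SAMPLE_LOG, SAMPLE_RAW_LIST))
```

For the excerpt above this reports /dev/sdd1 as the only device that is labeled but not listed.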

Essentially the same issue is reported in https://github.com/rook/rook/issues/8023, but that report was closed on the assumption that Ceph 16.2.6 provides a fix.

After testing, I can confirm that 16.2.7 still has this issue.

It seems our setup somehow triggers this issue:
- CentOS 7 (kernel 3.10)
- 3 spinning disks per host
- rook 1.8.3

I can also report that with "bluefs_buffered_io = false" set, the issue does not occur, although we do not yet fully understand the further implications of setting this option.
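For reference, the workaround can be written as a ceph.conf-style fragment. The sketch below only renders that fragment. Placing the option in an [osd] section is an assumption, the function name is hypothetical, and how the fragment is delivered to the cluster is not stated in this report.

```python
import configparser
import io


def render_override(buffered_io: bool = False) -> str:
    """Render a ceph.conf-style fragment that sets bluefs_buffered_io."""
    cfg = configparser.ConfigParser()
    # Assumption: the option is placed in the [osd] section; the report
    # only gives the key/value pair, not the section it was set in.
    cfg["osd"] = {"bluefs_buffered_io": "true" if buffered_io else "false"}
    out = io.StringIO()
    cfg.write(out)
    return out.getvalue()


if __name__ == "__main__":
    print(render_override())
```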

Operator logs are attached:
- operator.log: standard 16.2.7, experiencing the issue
- operator-no-buffering.log: same, but with the option set; does not show the issue.

I can easily reproduce this issue and can add further debug statements, but since I am fairly new to Ceph I need some guidance on what to debug and how.


Files

operator-no-buffering.log (67 KB) Paul Bormans, 01/26/2022 01:25 PM
operator.log (90.1 KB) Paul Bormans, 01/26/2022 01:25 PM

Related issues 1 (0 open, 1 closed)

Copied from bluestore - Bug #54019: OSD::mkfs: ObjectStore::mkfs failed with error (5) Input/output error (Resolved, Adam Kupczyk)

Actions #1

Updated by Radoslaw Zarzynski about 1 year ago

  • Copied from Bug #54019: OSD::mkfs: ObjectStore::mkfs failed with error (5) Input/output error added
Actions #2

Updated by Radoslaw Zarzynski about 1 year ago

  • Status changed from Pending Backport to In Progress
Actions #3

Updated by Yuri Weinstein about 1 year ago

Radoslaw Zarzynski wrote:

https://github.com/ceph/ceph/pull/50475

merged

Actions #4

Updated by Igor Fedotov about 1 year ago

  • Status changed from In Progress to Resolved
Actions #5

Updated by Konstantin Shalygin 4 months ago

  • Release set to reef