Bug #51034 (Closed)

osd: failed to initialize OSD in Rook

Added by Satoru Takeuchi almost 3 years ago. Updated almost 2 years ago.

Status: Closed
Priority: Normal
Assignee: -
Target version:
% Done: 0%
Source:
Tags: container
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I tried to create a Rook/Ceph cluster consisting of several OSDs, but some of the
OSDs failed to initialize. Although this problem happened in Rook, I suspect it is
a Ceph problem.

OSD initialization failed in
`ceph-volume --log-path /var/log/ceph/<osd specific dir> raw prepare --bluestore --data /dev/mapper/<lv name>`.

I read `ceph-volume.log` and found that the following command failed.

```
echo <keyring> | /usr/bin/ceph-osd --cluster ceph --osd-objectstore bluestore --mkfs -i 5 --monmap /var/lib/ceph/osd/ceph-5/activate.monmap --keyfile - --osd-data /var/lib/ceph/osd/ceph-5/ --osd-uuid <OSD uuid> --setuser ceph --setgroup ceph
```

Based on my investigation, I suspect that Ceph has a buffer-cache-related bug.

Here is a binary dump of the target block device, taken just after the mkfs failure:

```
...
01d000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d020 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d030 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d040 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d050 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d060 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d070 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d080 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d090 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d0a0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d0b0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d0c0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d0d0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d0e0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d0f0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d100 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d110 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d120 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d130 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d140 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d150 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d160 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d170 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d180 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d190 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d1a0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d1b0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
...
```

This region is filled with zeros even though it corresponds to part of the transaction log.

Here is a binary dump of the same region, captured after flushing the buffer
caches with `echo 3 > /proc/sys/vm/drop_caches`:

```
...
01d000 01 01 a5 00 00 00 6a 15 1b d6 bb 18 4a 96 b4 21 >......j.....J..!<
01d010 cf 88 f1 92 ef 17 0e 00 00 00 00 00 00 00 85 00 >................<
01d020 00 00 04 02 00 00 00 64 62 0e 00 00 00 4f 50 54 >.......db....OPT<
01d030 49 4f 4e 53 2d 30 30 30 30 31 30 0b 00 00 00 00 >IONS-000010.....<
01d040 00 00 00 05 02 00 00 00 64 62 14 00 00 00 4f 50 >........db....OP<
01d050 54 49 4f 4e 53 2d 30 30 30 30 30 39 2e 64 62 74 >TIONS-000009.dbt<
01d060 6d 70 05 02 00 00 00 64 62 0e 00 00 00 4f 50 54 >mp.....db....OPT<
01d070 49 4f 4e 53 2d 30 30 30 30 30 35 09 07 00 00 00 >IONS-000005.....<
01d080 00 00 00 00 08 01 01 1c 00 00 00 08 c5 01 f6 b5 >................<
01d090 b5 60 62 84 44 2d 00 01 00 00 00 01 01 06 00 00 >.`b.D-..........<
01d0a0 00 15 01 00 00 43 01 9c 9b ec d3 00 00 00 00 00 >.....C..........<
01d0b0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d0c0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d0d0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d0e0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d0f0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d100 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d110 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d120 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d130 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d140 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d150 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d160 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d170 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d180 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d190 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d1a0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d1b0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d1c0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d1d0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d1e0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d1f0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d200 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
...
```
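
For reference, the comparison above can be captured along these lines. This is only a sketch: the device path reuses the placeholder from the prepare command, and `0x1d000` is the offset shown in the dumps.

```sh
# rough sketch: dump the same on-device region before and after dropping the page cache
DEV='/dev/mapper/<lv name>'   # placeholder for the data device used by this OSD
hexdump -C -s 0x1d000 -n 512 "$DEV" > /tmp/dump-cached.txt    # buffered read through the page cache
sync
echo 3 > /proc/sys/vm/drop_caches
hexdump -C -s 0x1d000 -n 512 "$DEV" > /tmp/dump-dropped.txt   # the same region, read again after the cache is dropped
diff /tmp/dump-cached.txt /tmp/dump-dropped.txt               # a non-empty diff means the cached view was stale
```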

I suspect there might be a cache-sync problem in bluefs. I briefly looked at the
bluefs code and found that it opens two fds for the device, one with O_DIRECT and
the other without this flag. However, I'm not sure about the details.
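
I have not attached such a trace, but one way to double-check those two opens would be to trace the open calls made by the mkfs run. A rough sketch (the trace file name is arbitrary):

```sh
# rough sketch: record how ceph-osd --mkfs opens the block device, then check whether
# one open uses O_DIRECT and another does not
echo <keyring> | strace -f -e trace=open,openat -o /tmp/mkfs-open.trace \
    /usr/bin/ceph-osd --cluster ceph --osd-objectstore bluestore --mkfs -i 5 \
    --monmap /var/lib/ceph/osd/ceph-5/activate.monmap --keyfile - \
    --osd-data /var/lib/ceph/osd/ceph-5/ --osd-uuid <OSD uuid> --setuser ceph --setgroup ceph
grep 'block' /tmp/mkfs-open.trace
```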

1. how to reproduce

1. Run the commands equivalent to the above-mentioned `ceph-volume raw prepare` command, up to [just before `osd_mkfs_bluestore`](https://github.com/ceph/ceph/blob/v16.2.4/src/ceph-volume/ceph_volume/devices/raw/prepare.py#L64).

2. Run the following script.

```sh
while :; do
    # overwrite the beginning of the block device with zeros
    dd if=/dev/zero of=/var/lib/ceph/osd/ceph-5/block bs=1024 count=1024
    # remove the metadata files left in the OSD data dir by the previous mkfs
    rm /var/lib/ceph/osd/ceph-5/{bfm_blocks_per_key,bfm_size,bluefs,fsid,kv_backend,mkfs_done,ready,whoami,bfm_blocks,bfm_bytes_per_block,ceph_fsid,magic,osd_key,type}
    # flush dirty data and drop the page cache before retrying
    sync
    echo 3 > /proc/sys/vm/drop_caches
    # retry mkfs; stop the loop on the first failure
    echo <keyring> | /usr/bin/ceph-osd --cluster ceph --osd-objectstore bluestore --mkfs -i 5 --monmap /var/lib/ceph/osd/ceph-5/activate.monmap --keyfile - --osd-data /var/lib/ceph/osd/ceph-5/ --osd-uuid <OSD uuid> --setuser ceph --setgroup ceph
    ret=$?
    if [ $ret -ne 0 ]; then
        break
    fi
    echo $ret
done
```

2. reproduction probability

About 10%

3. workaround

Set `bluefs_buffered_io=false`

I ran the reproducer with this configuration overnight, and the problem did not occur.
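
For the standalone reproducer above, one way to apply the option is through `ceph.conf`. A rough sketch, assuming the default path and the `[osd]` section rather than whatever the attached ceph.conf actually uses:

```sh
# rough sketch: persist the workaround so the ceph-osd --mkfs runs pick it up
cat >> /etc/ceph/ceph.conf <<'EOF'
[osd]
bluefs_buffered_io = false
EOF
```

In the Rook cluster itself, the same setting would typically be applied through Rook's ceph.conf override ConfigMap instead.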

4. attached files (zipped)

- Ceph-related files
  - ceph.conf
  - ceph-volume.log
  - osd-mkfs-1622521334.log: a failure log of `ceph-osd --mkfs`
- Rook-specific files
  - rook-ceph-osd-prepare-set1-data-5.yaml: the problematic OSD's prepare deployment
  - osd-prepare-set1-data-5.pod.log: the log of the OSD prepare pod
  - osd-prepare-event.log: the prepare pod's Kubernetes event log
  - cephcluster yaml: the CephCluster cluster resource

5. environment

- Rook 1.6.3
- Ceph 16.2.4
- kernel 5.10.38-flatcar

6. additional information

This problem did not happen with Ceph v15.2.8.


Files

ceph.zip (41.6 KB) - Satoru Takeuchi, 06/01/2021 11:03 AM