Bug #51034

closed

osd: failed to initialize OSD in Rook

Added by Satoru Takeuchi almost 3 years ago. Updated almost 2 years ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Target version:
% Done:

0%

Source:
Tags:
container
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I tried to create a Rook/Ceph cluster consisting of several OSDs, but some
OSDs failed to initialize. Although this problem happened in Rook, I suspect
it is a Ceph problem.

OSD initialization failed in
`ceph-volume --log-path /var/log/ceph/<osd specific dir> raw prepare --bluestore --data /dev/mapper/<lv name>`.

I read `ceph-volume.log` and found that the following command failed.

```
echo <keyring> | /usr/bin/ceph-osd --cluster ceph --osd-objectstore bluestore --mkfs -i 5 --monmap /var/lib/ceph/osd/ceph-5/activate.monmap --keyfile - --osd-data /var/lib/ceph/osd/ceph-5/ --osd-uuid <OSD uuid> --setuser ceph --setgroup ceph
```

As a result of my investigation, I suspect that Ceph has a buffer-cache related bug.

Here is the binary dump of the target block device just after mkfs failure.

```
...
01d000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d020 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d030 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d040 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d050 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d060 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d070 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d080 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d090 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d0a0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d0b0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d0c0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d0d0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d0e0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d0f0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d100 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d110 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d120 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d130 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d140 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d150 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d160 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d170 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d180 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d190 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d1a0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d1b0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
...
```

This region is filled with zeros even though it corresponds to part of the BlueFS transaction log.

Here is the binary dump of the same region, captured after flushing the page
cache with `echo 3 > /proc/sys/vm/drop_caches`.

```
...
01d000 01 01 a5 00 00 00 6a 15 1b d6 bb 18 4a 96 b4 21 >......j.....J..!<
01d010 cf 88 f1 92 ef 17 0e 00 00 00 00 00 00 00 85 00 >................<
01d020 00 00 04 02 00 00 00 64 62 0e 00 00 00 4f 50 54 >.......db....OPT<
01d030 49 4f 4e 53 2d 30 30 30 30 31 30 0b 00 00 00 00 >IONS-000010.....<
01d040 00 00 00 05 02 00 00 00 64 62 14 00 00 00 4f 50 >........db....OP<
01d050 54 49 4f 4e 53 2d 30 30 30 30 30 39 2e 64 62 74 >TIONS-000009.dbt<
01d060 6d 70 05 02 00 00 00 64 62 0e 00 00 00 4f 50 54 >mp.....db....OPT<
01d070 49 4f 4e 53 2d 30 30 30 30 30 35 09 07 00 00 00 >IONS-000005.....<
01d080 00 00 00 00 08 01 01 1c 00 00 00 08 c5 01 f6 b5 >................<
01d090 b5 60 62 84 44 2d 00 01 00 00 00 01 01 06 00 00 >.`b.D-..........<
01d0a0 00 15 01 00 00 43 01 9c 9b ec d3 00 00 00 00 00 >.....C..........<
01d0b0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d0c0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d0d0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d0e0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d0f0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d100 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d110 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d120 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d130 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d140 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d150 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d160 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d170 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d180 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d190 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d1a0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d1b0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d1c0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d1d0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d1e0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d1f0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
01d200 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
...
```
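For reference, here is a minimal sketch of how such a comparison can be made. The tool choice and the exact offsets are illustrative (the `<lv name>` placeholder follows the ones used above; 0x1d000 = 29 * 4096 is the block shown in the dumps):

```sh
DEV=/dev/mapper/<lv name>   # placeholder for the OSD's data device, as above

# Read the suspect 4 KiB block through the page cache.
dd if=$DEV bs=4096 skip=29 count=1 2>/dev/null | od -A x -t x1z > cached.dump

# Drop the page cache and read the same block again.
sync
echo 3 > /proc/sys/vm/drop_caches
dd if=$DEV bs=4096 skip=29 count=1 2>/dev/null | od -A x -t x1z > dropped.dump

# Any difference means the cached view diverged from the on-disk contents.
diff cached.dump dropped.dump
```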

I suspect that there might be a cache-coherency problem in BlueFS. I briefly looked at the
BlueFS code and found that it uses two file descriptors for the same device, one opened
with O_DIRECT and the other without it. However, I'm not sure about the details.
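To illustrate the kind of hazard that mixing direct and buffered I/O on the same block device can cause, here is a hypothetical sketch of the suspected pattern (not a claim about what BlueFS actually does; device and offsets are placeholders, and whether the stale read occurs depends on kernel behavior and timing):

```sh
DEV=/dev/mapper/<lv name>   # placeholder

# Populate the page cache with the current contents of a 4 KiB block.
dd if=$DEV bs=4096 skip=29 count=1 of=/dev/null 2>/dev/null

# Overwrite the same block with O_DIRECT, bypassing the page cache.
dd if=/dev/urandom of=$DEV bs=4096 seek=29 count=1 oflag=direct conv=notrunc 2>/dev/null

# A later buffered read may still be served from a stale cached page and show
# the old contents, which would match the zero-filled dump seen before
# drop_caches versus the real data seen after it.
dd if=$DEV bs=4096 skip=29 count=1 2>/dev/null | od -A x -t x1z
```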

1. How to reproduce

1. Run the commands equivalent to the above-mentioned `ceph-volume raw prepare` command, [up to just before `osd_mkfs_bluestore`](https://github.com/ceph/ceph/blob/v16.2.4/src/ceph-volume/ceph_volume/devices/raw/prepare.py#L64).

2. Run the following script.

```sh
while :; do
    # Wipe the start of the OSD's block device and remove the files left by
    # the previous mkfs attempt so each iteration starts from scratch.
    dd if=/dev/zero of=/var/lib/ceph/osd/ceph-5/block bs=1024 count=1024
    rm /var/lib/ceph/osd/ceph-5/{bfm_blocks_per_key,bfm_size,bluefs,fsid,kv_backend,mkfs_done,ready,whoami,bfm_blocks,bfm_bytes_per_block,ceph_fsid,magic,osd_key,type}
    # Start each attempt from a cold page cache.
    sync
    echo 3 > /proc/sys/vm/drop_caches
    # Retry mkfs until it fails.
    echo <keyring> | /usr/bin/ceph-osd --cluster ceph --osd-objectstore bluestore --mkfs -i 5 --monmap /var/lib/ceph/osd/ceph-5/activate.monmap --keyfile - --osd-data /var/lib/ceph/osd/ceph-5/ --osd-uuid <OSD uuid> --setuser ceph --setgroup ceph
    ret=$?
    if [ $ret -ne 0 ]; then
        break
    fi
    echo $ret
done
```

2. Reproduction probability

About 10%

3. Workaround

Set `bluefs_buffered_io=false`.

I ran the reproducer with this setting overnight, and the problem did not happen.
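For reference, a minimal sketch of how this setting can be applied; both forms below are standard ways of setting a Ceph option, though which one fits depends on the deployment (in Rook it is usually injected through the cluster's config override):

```sh
# At runtime, via the monitor-backed config store:
ceph config set osd bluefs_buffered_io false

# Or statically in ceph.conf:
# [osd]
# bluefs_buffered_io = false
```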

4. Attached files (zipped)

- Ceph-related files:
  - ceph.conf
  - ceph-volume.log
  - osd-mkfs-1622521334.log (a failure log of `ceph-osd --mkfs`)
- Rook-specific files:
  - rook-ceph-osd-prepare-set1-data-5.yaml (the problematic OSD's deployment)
  - osd-prepare-set1-data-5.pod.log (the log of the OSD prepare pod)
  - osd-prepare-event.log (the prepare pod's Kubernetes event log)
  - cephcluster yaml (the CephCluster resource)

5. Environment

- Rook 1.6.3
- Ceph 16.2.4
- kernel 5.10.38-flatcar

6. Additional information

This problem didn't happen in Ceph v15.2.8.


Files

ceph.zip (41.6 KB) — Satoru Takeuchi, 06/01/2021 11:03 AM
Actions #1

Updated by Igor Fedotov almost 3 years ago

  • Project changed from Ceph to bluestore
  • Category deleted (OSD)
Actions #2

Updated by Satoru Takeuchi almost 3 years ago

Any comments? I can get more information if needed.

Actions #3

Updated by Satoru Takeuchi almost 3 years ago

Another person hit this problem.

https://github.com/rook/rook/issues/8023#issuecomment-863176780

The Ceph version is v16.2, the same as mine.

In addition, the following PR may be related to this issue, but I'm not sure yet.

https://github.com/ceph/ceph/pull/41889

Actions #4

Updated by Satoru Takeuchi almost 3 years ago

Two more inputs.

1. In Ceph v15.2.8, where this problem didn't happen, `bluefs_buffered_io` is disabled by default. On the other hand, this option is enabled by default in v16.2.4 (one way to check the effective value is sketched below).
2. This problem also happened with `ceph-bluestore-tool bluefs-bdev-expand`.
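A minimal sketch of how the effective value can be checked, assuming standard Ceph tooling (`osd.0` is just an example daemon name):

```sh
# Value as seen by the local configuration parser (falls back to the default):
ceph-conf --show-config-value bluefs_buffered_io

# Effective value on a running OSD (requires a reachable cluster):
ceph config show osd.0 bluefs_buffered_io
```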

Actions #5

Updated by Igor Fedotov over 2 years ago

  • Tags set to container
Actions #6

Updated by Igor Fedotov over 2 years ago

  • Subject changed from osd: failed to initialize OSD to osd: failed to initialize OSD in Rook
Actions #7

Updated by Igor Fedotov over 2 years ago

@satoru, I'm wondering whether you would be able to reproduce this with the upcoming 16.2.6.
Perhaps this has been fixed by https://github.com/ceph/ceph/pull/42424

Actions #8

Updated by Satoru Takeuchi over 2 years ago

@Igor Gajowiak, Thank you! I'll do it later.

Actions #9

Updated by Satoru Takeuchi over 2 years ago

I confirmed that this problem disappeared in a Ceph v16.2.6 and Rook v1.7.5 environment. However, another problem, reported at the following URL, happened in this environment.

https://tracker.ceph.com/issues/53184

Actions #10

Updated by Satoru Takeuchi over 2 years ago

Please close this ticket, thank you very much for handling this problem!

Actions #11

Updated by Igor Fedotov over 2 years ago

  • Status changed from New to Closed
Actions #12

Updated by Satoru Takeuchi almost 2 years ago

Some Rook users have still encountered similar issues.

https://github.com/rook/rook/issues/9646#issuecomment-1021505103
```
2022-01-25 18:38:32.269143 I | op-osd: OSD orchestration status for node dev2-cmp1l is "failed"
2022-01-25 18:38:32.269193 E | op-osd: failed to provision OSD on node dev2-cmp1l. &{OSDs:[] Status:failed PvcBackedOSD:false Message:failed to configure devices: failed to initialize osd: failed to run ceph-volume raw command. Running command: /usr/bin/ceph-authtool --gen-print-key
Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring i - osd new f4a0ae55-0b96-4304-a4cc-2f2f3d13a341
Running command: /usr/bin/ceph-authtool --gen-print-key
Running command: /usr/bin/mount -t tmpfs tmpfs /var/lib/ceph/osd/ceph-0
Running command: /usr/bin/chown -R ceph:ceph /dev/sde1
Running command: /usr/bin/ln -s /dev/sde1 /var/lib/ceph/osd/ceph-0/block
Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring mon getmap -o /var/lib/ceph/osd/ceph-0/activate.monmap
stderr: got monmap epoch 3
Running command: /usr/bin/ceph-authtool /var/lib/ceph/osd/ceph-0/keyring --create-keyring --name osd.0 --add-key AQClQ/Bhr58XLRAACuf8ROY8okq1izdbu4HZaQ==
stdout: creating /var/lib/ceph/osd/ceph-0/keyring
added entity osd.0 auth(key=AQClQ/Bhr58XLRAACuf8ROY8okq1izdbu4HZaQ==)
Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-0/keyring
Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-0/
Running command: /usr/bin/ceph-osd --cluster ceph --osd-objectstore bluestore --mkfs -i 0 --monmap /var/lib/ceph/osd/ceph-0/activate.monmap --keyfile - --osd-data /var/lib/ceph/osd/ceph-0/ --osd-uuid f4a0ae55-0b96-4304-a4cc-2f2f3d13a341 --setuser ceph --setgroup ceph
stderr: 2022-01-25T18:38:30.769+0000 7f89304df080 -1 bluestore(/var/lib/ceph/osd/ceph-0/) _read_fsid unparsable uuid
stderr: 2022-01-25T18:38:30.786+0000 7f89304df080 -1 bluefs _replay 0x0: stop: uuid 00000000-0000-0000-0000-000000000000 != super.uuid 6f36f879-3c86-4ae1-8b46-abf13cc438c7, block dump:
stderr: 00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
stderr:
stderr: 00000ff0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
stderr: 00001000
stderr: 2022-01-25T18:38:31.281+0000 7f89304df080 -1 rocksdb: verify_sharding unable to list column families: NotFound:
stderr: 2022-01-25T18:38:31.281+0000 7f89304df080 -1 bluestore(/var/lib/ceph/osd/ceph-0/) _open_db erroring opening db:
stderr: 2022-01-25T18:38:31.783+0000 7f89304df080 -1 OSD::mkfs: ObjectStore::mkfs failed with error (5) Input/output error
stderr: 2022-01-25T18:38:31.783+0000 7f89304df080 -1 *
ERROR: error creating empty object store in /var/lib/ceph/osd/ceph-0/: (5) Input/output error
-
> Was unable to complete a new OSD, will rollback changes
Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring osd purge-new osd.0 --yes-i-really-mean-it
stderr: purged osd.0
Traceback (most recent call last):
File "/usr/sbin/ceph-volume", line 11, in <module>
load_entry_point('ceph-volume==1.0.0', 'console_scripts', 'ceph-volume')()
File "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 40, in init
self.main(self.argv)
File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 59, in newfunc
return f(*a, **kw)
File "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 152, in main
terminal.dispatch(self.mapper, subcommand_args)
File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch
instance.main()
File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/main.py", line 32, in main
terminal.dispatch(self.mapper, self.argv)
File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch
instance.main()
File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/prepare.py", line 169, in main
self.safe_prepare(self.args)
File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/prepare.py", line 91, in safe_prepare
self.prepare()
File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 16, in is_root
return func(*a, **kw)
File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/prepare.py", line 134, in prepare
tmpfs,
File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/prepare.py", line 68, in prepare_bluestore
db=db
File "/usr/lib/python3.6/site-packages/ceph_volume/util/prepare.py", line 481, in osd_mkfs_bluestore
raise RuntimeError('Command failed with exit code %s: %s' % (returncode, ' '.join(command)))
RuntimeError: Command failed with exit code 250: /usr/bin/ceph-osd --cluster ceph --osd-objectstore bluestore --mkfs -i 0 --monmap /var/lib/ceph/osd/ceph-0/activate.monmap --keyfile - --osd-data /var/lib/ceph/osd/ceph-0/ --osd-uuid f4a0ae55-0b96-4304-a4cc-2f2f3d13a341 --setuser ceph --setgroup ceph: exit status 1}

```

This problem disappeared if `bluefs_buffered_io=false` was set.

https://github.com/rook/rook/issues/9646#issuecomment-1022021781
```
And i can confirm that:

- rook 1.8.3 / 16.2.7 still have this issue operator.log
- rook 1.8.3 / 16.2.7 + bluefs_buffered_io = false does not experience this issue. operator-no-buffering.log
```

And here is another report.

https://github.com/rook/rook/issues/10160
```
stderr: 2022-04-25T18:12:29.836+0000 7f023a2a43c0 1 bluestore(/var/lib/ceph/osd/ceph-1/) _read_fsid unparsable uuid
stderr: 2022-04-25T18:12:30.024+0000 7f023a2a43c0 -1 bluefs _replay 0x0: stop: uuid 00000000-0000-0000-0000-000000000000 != super.uuid 68ba3e0d-e42c-46b0-a51b-e8e6ab918ad6, block dump:
stderr: 00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
stderr:
stderr: 00000ff0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
stderr: 00001000
stderr: 2022-04-25T18:12:30.664+0000 7f023a2a43c0 -1 rocksdb: verify_sharding unable to list column families: NotFound:
stderr: 2022-04-25T18:12:30.664+0000 7f023a2a43c0 -1 bluestore(/var/lib/ceph/osd/ceph-1/) _open_db erroring opening db:
stderr: 2022-04-25T18:12:31.224+0000 7f023a2a43c0 -1 OSD::mkfs: ObjectStore::mkfs failed with error (5) Input/output error
stderr: 2022-04-25T18:12:31.224+0000 7f023a2a43c0 -1 *
ERROR: error creating empty object store in /var/lib/ceph/osd/ceph-1/: (5) Input/output error
-
> Was unable to complete a new OSD, will rollback changes
```

The log is very similar to the first report. He used v17.2.0. But it's not clear whether he tried `bluefs_buffered_io=false`. I suspect that `bluefs_buffered_io=true` (the default) still has some problems.

Please let me know if it's better to open a new issue.

Actions #13

Updated by Satoru Takeuchi almost 2 years ago

Oops, sorry. This problem has already been reported in the following ticket.
https://tracker.ceph.com/issues/54019
