Bug #47985

When WAL is closed, osd cannot be restarted

Added by jiaxu li about 1 month ago. Updated 30 days ago.

Status: Need More Info
Priority: Normal
Assignee: -
Target version:
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature:

Description

Compile the master branch source code, use vstart to deploy the cluster, close bluestore wal during deployment, and place bluestore wal/db and block on different disks. After the deployment is complete, restarting an osd fails with the error message: unable to read osd superblock. The problem does not occur on every restart, but it reproduces with high probability. Clusters deployed without vstart show the same problem.

The problem can be reproduced as follows:
1. compile and install

2. deploy the cluster

3. bluestore wal info
  1. ./bin/ceph daemon osd.0 config show | grep rocksdb
    *** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
    "bluestore_kvbackend": "rocksdb",
    "bluestore_rocksdb_cf": "true",
    "bluestore_rocksdb_cfs": "m(3) O(3,0-13) L",
    "bluestore_rocksdb_options": "compression=kNoCompression,max_write_buffer_number=4,min_write_buffer_number_to_merge=1,recycle_log_file_num=4,write_buffer_size=268435456,writable_file_max_buffer_size=0,compaction_readahead_size=2097152,max_background_compactions=2,disableWAL=true",
    ......
4. bluestore path info
  1. tree dev/
    dev/
    ├── mgr.x
    │   └── keyring
    ├── mon.a
    │   ├── kv_backend
    │   ├── min_mon_release
    │   └── store.db
    │       ├── 000039.log
    │       ├── 000041.sst
    │       ├── CURRENT
    │       ├── IDENTITY
    │       ├── LOCK
    │       ├── MANIFEST-000020
    │       ├── OPTIONS-000008
    │       └── OPTIONS-000023
    ├── osd0
    │   ├── bfm_blocks
    │   ├── bfm_blocks_per_key
    │   ├── bfm_bytes_per_block
    │   ├── bfm_size
    │   ├── block -> /dev/sda
    │   ├── block.db -> /dev/sdg6
    │   ├── block.wal -> /dev/sdg5
    │   ├── bluefs
    │   ├── ceph_fsid
    │   ├── fsid
    │   ├── keyring
    │   ├── kv_backend
    │   ├── magic
    │   ├── mkfs_done
    │   ├── ready
    │   ├── require_osd_release
    │   ├── type
    │   └── whoami
    ├── osd1
    │   ├── bfm_blocks
    │   ├── bfm_blocks_per_key
    │   ├── bfm_bytes_per_block
    │   ├── bfm_size
    │   ├── block -> /dev/sdb
    │   ├── block.db -> /dev/sdg8
    │   ├── block.wal -> /dev/sdg7
    │   ├── bluefs
    │   ├── ceph_fsid
    │   ├── fsid
    │   ├── keyring
    │   ├── kv_backend
    │   ├── magic
    │   ├── mkfs_done
    │   ├── ready
    │   ├── require_osd_release
    │   ├── type
    │   └── whoami
    ├── osd2
    │   ├── bfm_blocks
    │   ├── bfm_blocks_per_key
    │   ├── bfm_bytes_per_block
    │   ├── bfm_size
    │   ├── block -> /dev/sdc
    │   ├── block.db -> /dev/sdg10
    │   ├── block.wal -> /dev/sdg9
    │   ├── bluefs
    │   ├── ceph_fsid
    │   ├── fsid
    │   ├── keyring
    │   ├── kv_backend
    │   ├── magic
    │   ├── mkfs_done
    │   ├── ready
    │   ├── require_osd_release
    │   ├── type
    │   └── whoami
    ├── osd3
    │   ├── bfm_blocks
    │   ├── bfm_blocks_per_key
    │   ├── bfm_bytes_per_block
    │   ├── bfm_size
    │   ├── block -> /dev/sdd
    │   ├── block.db -> /dev/sdg12
    │   ├── block.wal -> /dev/sdg11
    │   ├── bluefs
    │   ├── ceph_fsid
    │   ├── fsid
    │   ├── keyring
    │   ├── kv_backend
    │   ├── magic
    │   ├── mkfs_done
    │   ├── ready
    │   ├── require_osd_release
    │   ├── type
    │   └── whoami
    └── osd4
        ├── bfm_blocks
        ├── bfm_blocks_per_key
        ├── bfm_bytes_per_block
        ├── bfm_size
        ├── block -> /dev/sde
        ├── block.db -> /dev/sdg14
        ├── block.wal -> /dev/sdg13
        ├── bluefs
        ├── ceph_fsid
        ├── fsid
        ├── keyring
        ├── kv_backend
        ├── magic
        ├── mkfs_done
        ├── ready
        ├── require_osd_release
        ├── type
        └── whoami
    8 directories, 101 files
5. get the osd process id
  1. ps -ef | grep ceph
    root 2719 1 1 16:36 ? 00:00:09 /home/ljx/ceph-master/build/bin/ceph-mgr -i x -c /home/ljx/ceph-master/build/ceph.conf
    root 2801 1 0 16:36 ? 00:00:06 /home/ljx/ceph-master/build/bin/ceph-mon -i a -c /home/ljx/ceph-master/build/ceph.conf
    root 3792 1 0 16:39 ? 00:00:02 ./bin/ceph-osd -i 0
    root 5262 1 0 16:40 ? 00:00:02 ./bin/ceph-osd -i 1
    root 6733 1 0 16:40 ? 00:00:02 ./bin/ceph-osd -i 2
    root 8203 1 0 16:40 ? 00:00:02 ./bin/ceph-osd -i 3
    root 9673 1 0 16:40 ? 00:00:02 ./bin/ceph-osd -i 4
    root 10384 1549 0 16:48 pts/0 00:00:00 grep --color=auto ceph
6. restart osd.1
  1. kill -9 5262
  2. ps -ef | grep ceph
    root 2719 1 1 16:36 ? 00:00:09 /home/ljx/ceph-master/build/bin/ceph-mgr -i x -c /home/ljx/ceph-master/build/ceph.conf
    root 2801 1 0 16:36 ? 00:00:06 /home/ljx/ceph-master/build/bin/ceph-mon -i a -c /home/ljx/ceph-master/build/ceph.conf
    root 3792 1 0 16:39 ? 00:00:02 ./bin/ceph-osd -i 0
    root 6733 1 0 16:40 ? 00:00:02 ./bin/ceph-osd -i 2
    root 8203 1 0 16:40 ? 00:00:02 ./bin/ceph-osd -i 3
    root 9673 1 0 16:40 ? 00:00:02 ./bin/ceph-osd -i 4
    root 10389 1549 0 16:49 pts/0 00:00:00 grep --color=auto ceph
  3. ./bin/ceph-osd -i 1
    2020-10-26T16:49:09.456+0800 7f1eadb01f40 -1 WARNING: all dangerous and experimental features are enabled.
    2020-10-26T16:49:09.462+0800 7f1eadb01f40 -1 WARNING: all dangerous and experimental features are enabled.
    2020-10-26T16:49:09.466+0800 7f1eadb01f40 -1 WARNING: all dangerous and experimental features are enabled.
    2020-10-26T16:49:10.547+0800 7f1eadb01f40 -1 Falling back to public interface
    2020-10-26T16:49:13.552+0800 7f1eadb01f40 -1 bluestore(/home/ljx/ceph-master/build/dev/osd1/) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0x4319887b, expected 0xa1f41464, device location [0x2000~1000], logical extent 0x0~1000, object #-1:7b3f43c4:::osd_superblock:0#
    2020-10-26T16:49:13.552+0800 7f1eadb01f40 -1 bluestore(/home/ljx/ceph-master/build/dev/osd1/) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0x4319887b, expected 0xa1f41464, device location [0x2000~1000], logical extent 0x0~1000, object #-1:7b3f43c4:::osd_superblock:0#
    2020-10-26T16:49:13.553+0800 7f1eadb01f40 -1 bluestore(/home/ljx/ceph-master/build/dev/osd1/) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0x4319887b, expected 0xa1f41464, device location [0x2000~1000], logical extent 0x0~1000, object #-1:7b3f43c4:::osd_superblock:0#
    2020-10-26T16:49:13.562+0800 7f1eadb01f40 -1 bluestore(/home/ljx/ceph-master/build/dev/osd1/) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0x4319887b, expected 0xa1f41464, device location [0x2000~1000], logical extent 0x0~1000, object #-1:7b3f43c4:::osd_superblock:0#
    2020-10-26T16:49:13.562+0800 7f1eadb01f40 -1 osd.1 0 OSD::init() : unable to read osd superblock
    2020-10-26T16:49:14.333+0800 7f1eadb01f40 -1 ** ERROR: osd init failed: (22) Invalid argument
7. Operating system version and source code version
  1. cat /etc/redhat-release
    CentOS Linux release 8.2.2004 (Core)
  2. ./bin/ceph -v
    *** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
    ceph version 16.0.0-6584-gcdf596c8ca (cdf596c8ca3646908c04f922261a22058fa8730e) pacific (dev)
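
For triage, the `_verify_csum` lines in the restart output above can be pulled apart mechanically. The following is a minimal sketch, assuming the message layout matches the lines quoted in this report exactly; the regular expression and the helper name `parse_csum_error` are illustrative and not part of Ceph, and other Ceph versions may format the message differently.

```python
import re

# Pattern modeled only on the _verify_csum lines quoted in this report.
CSUM_RE = re.compile(
    r"_verify_csum bad (?P<algo>\w+)/(?P<chunk>0x[0-9a-f]+) checksum "
    r"at blob offset (?P<blob_offset>0x[0-9a-f]+), "
    r"got (?P<got>0x[0-9a-f]+), expected (?P<expected>0x[0-9a-f]+), "
    r"device location \[(?P<device_location>[^\]]+)\], "
    r"logical extent (?P<logical_extent>\S+), "
    r"object (?P<object>\S+)"
)


def parse_csum_error(line):
    """Return the fields of a _verify_csum error line, or None if no match."""
    match = CSUM_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    # The two checksums are hex literals; convert them so they can be compared.
    fields["got"] = int(fields["got"], 16)
    fields["expected"] = int(fields["expected"], 16)
    return fields


# One of the lines from the failed restart of osd.1 (timestamp omitted).
SAMPLE = (
    "bluestore(/home/ljx/ceph-master/build/dev/osd1/) _verify_csum bad "
    "crc32c/0x1000 checksum at blob offset 0x0, got 0x4319887b, expected "
    "0xa1f41464, device location [0x2000~1000], logical extent 0x0~1000, "
    "object #-1:7b3f43c4:::osd_superblock:0#"
)

if __name__ == "__main__":
    print(parse_csum_error(SAMPLE))
```

Applied to the log above, this shows that all four errors refer to the same object (`osd_superblock`) and the same device location, with identical `got` and `expected` values each time.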

rgw-P99.9lantecy-before-after-disable-wal.png - rgw put P99.9 latency before and after disabling the osd WAL (116 KB) Jiaying Ren, 10/28/2020 06:00 AM

History

#1 Updated by Igor Fedotov about 1 month ago

  • Status changed from New to Need More Info

It's not clear what you mean by "close bluestore wal during deployment, and place bluestore wal/db and block on different disks". Please provide the exact steps.

#2 Updated by jiaxu li about 1 month ago

The detailed steps to deploy the cluster are as follows:
1. deploy a cluster without osd
```
MON=1 OSD=0 MDS=0 MGR=1 ../src/vstart.sh -b -d -n -X -l --without-dashboard
```
2. modify the configuration file, add bluestore_rocksdb_options and bluestore paths
```
bluestore_rocksdb_options = compression=kNoCompression,max_write_buffer_number=4,min_write_buffer_number_to_merge=1,recycle_log_file_num=4,write_buffer_size=268435456,writable_file_max_buffer_size=0,compaction_readahead_size=2097152,max_background_compactions=2,disableWAL=true

[osd.0]
host = ceph
osd data = /home/ljx/ceph-master/build/dev/osd0/
bluestore block path = /dev/sda
bluestore block db path = /dev/sdg6
bluestore block wal path = /dev/sdg5
[osd.1]
host = ceph
osd data = /home/ljx/ceph-master/build/dev/osd1/
bluestore block path = /dev/sdb
bluestore block db path = /dev/sdg8
bluestore block wal path = /dev/sdg7
...
```
3. deploy osds
```
# Create, initialize, register, and start five OSDs, one per iteration.
i=1
while [ "$i" -le 5 ]
do
    osdID=$(./bin/ceph osd create)
    echo "current osd is: $osdID"
    ./bin/ceph-osd -i "$osdID" --mkfs --mkkey -c /home/ljx/ceph-master/build/ceph.conf
    ./bin/ceph auth add "osd.$osdID" osd 'allow *' mon 'allow profile osd' -i "/home/ljx/ceph-master/build/dev/osd$osdID/keyring"
    ./bin/ceph-osd -i "$osdID"
    i=$((i + 1))
done
```

In this way, each osd has its block path and wal path on different disks. For example, osd.0 uses disk 'sda' for the block path and a partition of 'sdg' for the wal path. Because bluestore_rocksdb_options contains disableWAL=true, the wal is already closed when the osd is deployed, as the following query result shows:
```
./bin/ceph daemon osd.0 config show | grep rocksdb
...
"compression=kNoCompression,max_write_buffer_number=4,min_write_buffer_number_to_merge=1,recycle_log_file_num=4,write_buffer_size=268435456,writable_file_max_buffer_size=0,compaction_readahead_size=2097152,max_background_compactions=2,disableWAL=true",
...
```
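
To check the same thing without reading the long option string by eye, something like the sketch below could be used. It assumes the flat, comma-separated `key=value` form shown in this report; the helper names `parse_rocksdb_options` and `wal_disabled` are hypothetical and are not Ceph or RocksDB APIs, and nested or differently delimited option strings are not handled.

```python
def parse_rocksdb_options(options):
    """Split a flat 'key=value,key=value' option string into a dict."""
    parsed = {}
    for item in options.split(","):
        item = item.strip()
        if not item:
            continue
        # partition() keeps any '=' inside the value intact.
        key, _, value = item.partition("=")
        parsed[key.strip()] = value.strip()
    return parsed


def wal_disabled(options):
    """True if the option string explicitly sets disableWAL=true."""
    value = parse_rocksdb_options(options).get("disableWAL", "false")
    return value.lower() == "true"


# The bluestore_rocksdb_options value quoted in this report.
REPORTED_OPTIONS = (
    "compression=kNoCompression,max_write_buffer_number=4,"
    "min_write_buffer_number_to_merge=1,recycle_log_file_num=4,"
    "write_buffer_size=268435456,writable_file_max_buffer_size=0,"
    "compaction_readahead_size=2097152,max_background_compactions=2,"
    "disableWAL=true"
)

if __name__ == "__main__":
    print(wal_disabled(REPORTED_OPTIONS))
```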

#3 Updated by Igor Fedotov about 1 month ago

I haven't investigated this in depth, but what's the rationale for disableWAL? Generally this breaks the data write consistency model. Most likely the csum errors you're observing are a consequence of this.

#4 Updated by jiaxu li about 1 month ago

In some application scenarios, I want to close the wal to get lower latency and higher IOPS. After closing the wal, the osd can still store business data, but it cannot be restarted.

#5 Updated by jiaxu li about 1 month ago

In addition, I have also tried deploying the osd first and then modifying bluestore_rocksdb_options in the configuration file. Restarting the osd fails in that case too, with the same error message.

#6 Updated by Jiaying Ren 30 days ago

Hi Igor:

1. We've found that disabling the WAL reduces latency (measured as P99.9 latency). We tested an rgw put workload (io sizes from 4k to 1m) with the osd WAL enabled and disabled.

This shows there is a lot of room to improve bluestore write latency by customizing the WAL.

2. We expected that an osd restarted with the WAL disabled might lose some objects, which would be reasonable. What confuses us is that the osd fails to restart at all.

#7 Updated by Igor Fedotov 30 days ago

  • Severity changed from 1 - critical to 3 - minor

I doubt it will work this way, as there would no longer be any consistency guarantee for onode metadata... In your case the superblock is updated via a deferred write, which means there are at least two update operations to the KV store (which should go within a single transaction): the user data payload for the deferred write and the onode metadata update. It looks like the transaction is interrupted in the middle and the onode metadata isn't updated properly, hence it still contains the previous checksums, which causes the subsequent csum verification failures.
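
The failure mode described above can be illustrated with a toy model. This is only a sketch under loud assumptions: `ToyStore` is a hypothetical class, not BlueStore code; it collapses the KV store and the block device into two Python attributes; and it uses `zlib.crc32` (plain CRC-32) as a stand-in for the crc32c checksum named in the log.

```python
import zlib


class ToyStore:
    """Toy model: 'device' holds the data, 'metadata' holds its checksum."""

    def __init__(self, data):
        self.device = data
        self.metadata = {"csum": zlib.crc32(data)}

    def write(self, new_data, lose_metadata_update=False):
        # Update 1: the new payload ends up on the device.
        self.device = new_data
        # Update 2: the metadata must record the matching checksum.
        # Simulate a kill where this second update never becomes durable.
        if lose_metadata_update:
            return
        self.metadata["csum"] = zlib.crc32(new_data)

    def verify(self):
        """Mimic a checksum verification on read."""
        return zlib.crc32(self.device) == self.metadata["csum"]


if __name__ == "__main__":
    ok = ToyStore(b"superblock v1")
    ok.write(b"superblock v2")
    print("both updates applied:", ok.verify())

    torn = ToyStore(b"superblock v1")
    torn.write(b"superblock v2", lose_metadata_update=True)
    print("metadata update lost:", torn.verify())
```

In the second case the stored checksum still describes the old data, so every later read fails verification, which is the same shape as the repeated `got ... expected ...` mismatch in the report.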

You can probably work around this specific(!) case by disabling deferred writes and/or csum verification, or by somehow improving the BlueStore shutdown procedure to flush all the KV data to disk. But I'm not at all sure that any of the above would make the solution 100% operational: the current design relies heavily on consistent DB operations.

Finally, IMO this isn't a proper ticket for Ceph: disabling the KV store's WAL isn't a proper mode of operation in general...

#8 Updated by Igor Fedotov 30 days ago

  • Project changed from Ceph to bluestore
  • Category deleted (OSD)
