Bug #47985: When WAL is closed, osd cannot be restarted - bluestore - Ceph

Bug #47985

open

When WAL is closed, osd cannot be restarted

Added by jiaxu li over 3 years ago. Updated over 3 years ago.

Status:

Need More Info

Priority:

Normal

Assignee:

Target version:

Ceph - v16.0.0

% Done:

Source:

Tags:

Backport:

Regression:

Severity:

3 - minor

Reviewed:

Affected Versions:

Ceph - v15.2.4

ceph-qa-suite:

Pull request ID:

Crash signature (v1):

Crash signature (v2):

Description

Compile the master branch source code, use vstart to deploy the cluster, close bluestore wal during deployment, and place bluestore wal/db and block on different disks. After the deployment is complete, restart the osd. The restart fails with the error message: unable to read osd superblock. The problem does not occur every time, but the probability is high. Clusters deployed without vstart also have this problem.

The problem can be reproduced as follows:
1. compile and install

2. deploy the cluster

3. bluestore wal info

./bin/ceph daemon osd.0 config show | grep rocksdb
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
  "bluestore_kvbackend": "rocksdb",
  "bluestore_rocksdb_cf": "true",
  "bluestore_rocksdb_cfs": "m(3) O(3,0-13) L",
  "bluestore_rocksdb_options": "compression=kNoCompression,max_write_buffer_number=4,min_write_buffer_number_to_merge=1,recycle_log_file_num=4,write_buffer_size=268435456,writable_file_max_buffer_size=0,compaction_readahead_size=2097152,max_background_compactions=2,disableWAL=true",
  ......

4. bluestore path info

tree dev/
dev/
├── mgr.x
│   └── keyring
├── mon.a
│   ├── kv_backend
│   ├── min_mon_release
│   └── store.db
│       ├── 000039.log
│       ├── 000041.sst
│       ├── CURRENT
│       ├── IDENTITY
│       ├── LOCK
│       ├── MANIFEST-000020
│       ├── OPTIONS-000008
│       └── OPTIONS-000023
├── osd0
│   ├── bfm_blocks
│   ├── bfm_blocks_per_key
│   ├── bfm_bytes_per_block
│   ├── bfm_size
│   ├── block -> /dev/sda
│   ├── block.db -> /dev/sdg6
│   ├── block.wal -> /dev/sdg5
│   ├── bluefs
│   ├── ceph_fsid
│   ├── fsid
│   ├── keyring
│   ├── kv_backend
│   ├── magic
│   ├── mkfs_done
│   ├── ready
│   ├── require_osd_release
│   ├── type
│   └── whoami
├── osd1
│   ├── bfm_blocks
│   ├── bfm_blocks_per_key
│   ├── bfm_bytes_per_block
│   ├── bfm_size
│   ├── block -> /dev/sdb
│   ├── block.db -> /dev/sdg8
│   ├── block.wal -> /dev/sdg7
│   ├── bluefs
│   ├── ceph_fsid
│   ├── fsid
│   ├── keyring
│   ├── kv_backend
│   ├── magic
│   ├── mkfs_done
│   ├── ready
│   ├── require_osd_release
│   ├── type
│   └── whoami
├── osd2
│   ├── bfm_blocks
│   ├── bfm_blocks_per_key
│   ├── bfm_bytes_per_block
│   ├── bfm_size
│   ├── block -> /dev/sdc
│   ├── block.db -> /dev/sdg10
│   ├── block.wal -> /dev/sdg9
│   ├── bluefs
│   ├── ceph_fsid
│   ├── fsid
│   ├── keyring
│   ├── kv_backend
│   ├── magic
│   ├── mkfs_done
│   ├── ready
│   ├── require_osd_release
│   ├── type
│   └── whoami
├── osd3
│   ├── bfm_blocks
│   ├── bfm_blocks_per_key
│   ├── bfm_bytes_per_block
│   ├── bfm_size
│   ├── block -> /dev/sdd
│   ├── block.db -> /dev/sdg12
│   ├── block.wal -> /dev/sdg11
│   ├── bluefs
│   ├── ceph_fsid
│   ├── fsid
│   ├── keyring
│   ├── kv_backend
│   ├── magic
│   ├── mkfs_done
│   ├── ready
│   ├── require_osd_release
│   ├── type
│   └── whoami
└── osd4
    ├── bfm_blocks
    ├── bfm_blocks_per_key
    ├── bfm_bytes_per_block
    ├── bfm_size
    ├── block -> /dev/sde
    ├── block.db -> /dev/sdg14
    ├── block.wal -> /dev/sdg13
    ├── bluefs
    ├── ceph_fsid
    ├── fsid
    ├── keyring
    ├── kv_backend
    ├── magic
    ├── mkfs_done
    ├── ready
    ├── require_osd_release
    ├── type
    └── whoami
8 directories, 101 files

5. get the osd process id

ps -ef | grep ceph
root 2719 1 1 16:36 ? 00:00:09 /home/ljx/ceph-master/build/bin/ceph-mgr -i x -c /home/ljx/ceph-master/build/ceph.conf
root 2801 1 0 16:36 ? 00:00:06 /home/ljx/ceph-master/build/bin/ceph-mon -i a -c /home/ljx/ceph-master/build/ceph.conf
root 3792 1 0 16:39 ? 00:00:02 ./bin/ceph-osd -i 0
root 5262 1 0 16:40 ? 00:00:02 ./bin/ceph-osd -i 1
root 6733 1 0 16:40 ? 00:00:02 ./bin/ceph-osd -i 2
root 8203 1 0 16:40 ? 00:00:02 ./bin/ceph-osd -i 3
root 9673 1 0 16:40 ? 00:00:02 ./bin/ceph-osd -i 4
root 10384 1549 0 16:48 pts/0 00:00:00 grep --color=auto ceph

6. restart osd.1

kill -9 5262
ps -ef | grep ceph
root 2719 1 1 16:36 ? 00:00:09 /home/ljx/ceph-master/build/bin/ceph-mgr -i x -c /home/ljx/ceph-master/build/ceph.conf
root 2801 1 0 16:36 ? 00:00:06 /home/ljx/ceph-master/build/bin/ceph-mon -i a -c /home/ljx/ceph-master/build/ceph.conf
root 3792 1 0 16:39 ? 00:00:02 ./bin/ceph-osd -i 0
root 6733 1 0 16:40 ? 00:00:02 ./bin/ceph-osd -i 2
root 8203 1 0 16:40 ? 00:00:02 ./bin/ceph-osd -i 3
root 9673 1 0 16:40 ? 00:00:02 ./bin/ceph-osd -i 4
root 10389 1549 0 16:49 pts/0 00:00:00 grep --color=auto ceph

./bin/ceph-osd -i 1
2020-10-26T16:49:09.456+0800 7f1eadb01f40 -1 WARNING: all dangerous and experimental features are enabled.
2020-10-26T16:49:09.462+0800 7f1eadb01f40 -1 WARNING: all dangerous and experimental features are enabled.
2020-10-26T16:49:09.466+0800 7f1eadb01f40 -1 WARNING: all dangerous and experimental features are enabled.
2020-10-26T16:49:10.547+0800 7f1eadb01f40 -1 Falling back to public interface
2020-10-26T16:49:13.552+0800 7f1eadb01f40 -1 bluestore(/home/ljx/ceph-master/build/dev/osd1/) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0x4319887b, expected 0xa1f41464, device location [0x2000~1000], logical extent 0x0~1000, object #-1:7b3f43c4:::osd_superblock:0#
2020-10-26T16:49:13.552+0800 7f1eadb01f40 -1 bluestore(/home/ljx/ceph-master/build/dev/osd1/) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0x4319887b, expected 0xa1f41464, device location [0x2000~1000], logical extent 0x0~1000, object #-1:7b3f43c4:::osd_superblock:0#
2020-10-26T16:49:13.553+0800 7f1eadb01f40 -1 bluestore(/home/ljx/ceph-master/build/dev/osd1/) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0x4319887b, expected 0xa1f41464, device location [0x2000~1000], logical extent 0x0~1000, object #-1:7b3f43c4:::osd_superblock:0#
2020-10-26T16:49:13.562+0800 7f1eadb01f40 -1 bluestore(/home/ljx/ceph-master/build/dev/osd1/) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0x4319887b, expected 0xa1f41464, device location [0x2000~1000], logical extent 0x0~1000, object #-1:7b3f43c4:::osd_superblock:0#
2020-10-26T16:49:13.562+0800 7f1eadb01f40 -1 osd.1 0 OSD::init() : unable to read osd superblock
2020-10-26T16:49:14.333+0800 7f1eadb01f40 -1 ** ERROR: osd init failed: (22) Invalid argument
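
The repeated `_verify_csum` lines above all describe the same failing read. As a reading aid, here is a minimal sketch that collapses such lines into one row per object, location, and checksum pair. The function name `summarize_csum_errors` is hypothetical, and the parsing assumes the exact line layout shown in this log; other Ceph versions may format the message differently.

```shell
#!/usr/bin/env bash
# Summarize BlueStore checksum failures from an OSD log supplied on stdin.
# Assumes the "_verify_csum bad ..." line layout seen in the log above;
# matching lines in any other layout pass through unparsed.
summarize_csum_errors() {
    grep '_verify_csum bad' |
        sed -E 's/.*got (0x[0-9a-f]+), expected (0x[0-9a-f]+), device location \[([^]]+)\].*object (#[^ ]+#).*/\4 location=\3 got=\1 expected=\2/' |
        sort | uniq -c | awk '{$1 = $1; print}'
}

# One line copied from the log above, used as sample input.
sample='2020-10-26T16:49:13.552+0800 7f1eadb01f40 -1 bluestore(/home/ljx/ceph-master/build/dev/osd1/) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0x4319887b, expected 0xa1f41464, device location [0x2000~1000], logical extent 0x0~1000, object #-1:7b3f43c4:::osd_superblock:0#'

# Feed the sample twice to show that identical failures collapse to one row.
summary=$(printf '%s\n%s\n' "$sample" "$sample" | summarize_csum_errors)
echo "$summary"
```

Each output row starts with the number of identical failures, followed by the object name, device location, and the two checksums. Piping the full restart log through the function would show whether every failure hits the `osd_superblock` object at the same location.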

Operating system version and source code version

cat /etc/redhat-release
CentOS Linux release 8.2.2004 (Core)
./bin/ceph -v
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
ceph version 16.0.0-6584-gcdf596c8ca (cdf596c8ca3646908c04f922261a22058fa8730e) pacific (dev)

Files

rgw-P99.9lantecy-before-after-disable-wal.png (116 KB)

RGW PUT P99.9 latency before and after disabling the OSD WAL

Jiaying Ren, 10/28/2020 06:00 AM

Updated by Igor Fedotov over 3 years ago

Status changed from New to Need More Info

It's not clear what you meant by "close bluestore wal during deployment, and place bluestore wal/db and block on different disks". Please provide the exact steps.

Updated by jiaxu li over 3 years ago

The detailed steps to deploy the cluster are as follows:
1. deploy a cluster without osd
```
MON=1 OSD=0 MDS=0 MGR=1 ../src/vstart.sh -b -d -n -X -l --without-dashboard
```
2. modify the configuration file, add bluestore_rocksdb_options and bluestore paths
```
bluestore_rocksdb_options = compression=kNoCompression,max_write_buffer_number=4,min_write_buffer_number_to_merge=1,recycle_log_file_num=4,write_buffer_size=268435456,writable_file_max_buffer_size=0,compaction_readahead_size=2097152,max_background_compactions=2,disableWAL=true

[osd.0]
host = ceph
osd data = /home/ljx/ceph-master/build/dev/osd0/
bluestore block path = /dev/sda
bluestore block db path = /dev/sdg6
bluestore block wal path = /dev/sdg5
[osd.1]
host = ceph
osd data = /home/ljx/ceph-master/build/dev/osd1/
bluestore block path = /dev/sdb
bluestore block db path = /dev/sdg8
bluestore block wal path = /dev/sdg7
...
```
3. deploy osds
```
# Create, initialize, register, and start five OSDs.
i=1
while [ "$i" -le 5 ]
do
    # "ceph osd create" prints the newly allocated OSD id.
    osdID=$(./bin/ceph osd create)
    echo "current osd is: $osdID"
    ./bin/ceph-osd -i "$osdID" --mkfs --mkkey -c /home/ljx/ceph-master/build/ceph.conf
    ./bin/ceph auth add "osd.$osdID" osd 'allow *' mon 'allow profile osd' -i "/home/ljx/ceph-master/build/dev/osd$osdID/keyring"
    ./bin/ceph-osd -i "$osdID"
    i=$((i + 1))
done
```

In this way, for each osd, the block path and the wal path are on different disks. For example, the block path of osd.0 uses disk 'sda', and the wal path uses a partition of 'sdg'. Because of the disableWAL=true option in bluestore_rocksdb_options, the WAL is already disabled when the osd is deployed, as shown in the following query result:
```

./bin/ceph daemon osd.0 config show | grep rocksdb
...
"compression=kNoCompression,max_write_buffer_number=4,min_write_buffer_number_to_merge=1,recycle_log_file_num=4,write_buffer_size=268435456,writable_file_max_buffer_size=0,compaction_readahead_size=2097152,max_background_compactions=2,disableWAL=true",
...
```
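
To check a given option string for this setting without reading it by eye, a small helper along these lines could be used. It is only a sketch: `wal_disabled` is a hypothetical name, it does plain string matching on the comma-separated value, and it does not query a running OSD.

```shell
#!/usr/bin/env bash
# Print "yes" if a comma-separated RocksDB option string contains
# disableWAL=true, otherwise "no". Pure string matching; it does not query Ceph.
wal_disabled() {
    case ",$1," in
        *,disableWAL=true,*) echo yes ;;
        *) echo no ;;
    esac
}

# Shortened option strings for illustration (not the full value from the report).
wal_disabled "compression=kNoCompression,max_background_compactions=2,disableWAL=true"
wal_disabled "compression=kNoCompression,max_background_compactions=2"
```

Wrapping the argument in commas lets the pattern match the option whether it appears first, last, or in the middle of the string.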

Updated by Igor Fedotov over 3 years ago

I haven't investigated this in depth, but what's the rationale for disableWAL? In general it breaks the data write consistency model. The csum errors you're observing are very likely a consequence of this.

Updated by jiaxu li over 3 years ago

In some application scenarios, I want to disable the WAL in order to get lower latency and higher IOPS. With the WAL disabled, the osd can still be used to store business data, but it cannot be restarted.

Updated by jiaxu li over 3 years ago

In addition, I have also tried deploying the osd first and then modifying bluestore_rocksdb_options in the configuration file. Restarting the osd also fails, with the same error message as above.

Updated by Jiaying Ren over 3 years ago

File rgw-P99.9lantecy-before-after-disable-wal.png added

Hi Igor:

1. We've found that disabling the WAL reduces latency (measured as P99.9 latency). We tested an rgw put workload (io sizes from 4k to 1m) with the osd WAL enabled and disabled.

This shows us that there is a lot of room to improve bluestore write latency by customizing the WAL.

2. We expected that an osd restarted with the WAL disabled might lose some objects, which would be reasonable. What confuses us is that the osd fails to restart at all.

Updated by Igor Fedotov over 3 years ago

Severity changed from 1 - critical to 3 - minor

I doubt it will work this way, as there would no longer be any consistency guarantee for onode metadata. In your case the superblock is updated via a deferred write, which means there are at least two update operations to the KV store (which should go within a single transaction): the user data payload for the deferred write and the onode metadata update. It looks like the transaction is interrupted in the middle and the onode metadata isn't updated properly, so it still contains the previous checksums, which causes the subsequent csum verification failures.
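
The failure mode hypothesized above can be illustrated with a toy model. This is not BlueStore code: plain files stand in for the block device, the durable KV state, and the unflushed in-memory KV state, POSIX `cksum` stands in for crc32c, and every function name is hypothetical. It only shows how new data on the device paired with a stale durable checksum produces a verification failure after an unclean stop.

```shell
#!/usr/bin/env bash
# Toy model only: plain files stand in for the block device and the KV store.
set -eu
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT

# Checksum of a file's contents (cksum is a stand-in for crc32c).
csum_of() { cksum < "$1" | awk '{print $1}'; }

# Write new data to the "device"; record its checksum only in the memtable.
write_superblock() {
    printf '%s' "$1" > "$work/device"
    csum_of "$work/device" > "$work/memtable"
}

# Make the checksum durable (what a WAL or a clean shutdown would ensure).
flush_kv() { cp "$work/memtable" "$work/durable"; }

# kill -9 with the WAL disabled: unflushed memtable contents vanish.
crash() { rm -f "$work/memtable"; }

# On restart, compare the device contents against the durable checksum.
verify() {
    if [ "$(csum_of "$work/device")" = "$(cat "$work/durable")" ]; then
        echo ok
    else
        echo "bad checksum"
    fi
}

write_superblock "epoch 1"; flush_kv   # first update is fully persisted
write_superblock "epoch 2"             # metadata update never reaches disk
crash
result=$(verify)
echo "$result"
```

In this model the second write changes the device but its checksum exists only in memory, so the check after the simulated crash compares new data against the old checksum and fails.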

You can probably work around this specific(!) case by disabling deferred writes and/or csum verification, or by somehow improving the BlueStore shutdown procedure to flush all the KV data to disk. But I'm not at all sure that any of the above would make the solution 100% operational: the current design relies heavily on consistent DB operations.

Finally, IMO this isn't a proper ticket for Ceph: disabling the KV store's WAL isn't a proper mode of operation in general.

Updated by Igor Fedotov over 3 years ago

Project changed from Ceph to bluestore
Category deleted (~~OSD~~)
