Bug #48025 (open)

OSD start-up failed when the OSD superblock CRC check fails

Added by Bo Zhang over 3 years ago. Updated over 3 years ago.

Status: New
Priority: Normal
Assignee: -
Target version:
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: No
Severity: 2 - major
Reviewed: 10/28/2020
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

【version】
14.2.8
【trigger operation】
While the cluster was operating normally, the equipment was manually powered down and then powered up again.
【appearance】
One OSD in the cluster failed to start up.
The log shows that the OSD superblock CRC checksum is inconsistent with the actual data on disk.
With CRC validation turned off, the dumped OSD superblock decodes normally and its fields are correct.
Analyzing the OSD superblock refresh process confirms that the refresh operation is a deferred write transaction.
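For readers not familiar with the term, the snippet below is a minimal toy model, in Python, of what a deferred write means here. It is purely illustrative and not BlueStore code; the class and method names are invented. The payload is first persisted in the key/value store together with the rest of the transaction and the write is acknowledged; only later is the payload copied onto the block device and the intent dropped, so a power cut in between is expected to be repaired by replay on the next start-up.

# Toy model of a deferred write, for illustration only; this is not
# BlueStore code and all names are invented.
class ToyDeferredStore:
    def __init__(self):
        self.kv = {}                          # stand-in for the RocksDB key/value store
        self.block = bytearray(1024 * 1024)   # stand-in for the raw block device

    def deferred_write(self, offset, data):
        # 1. Persist the intent in the KV store; it is committed together
        #    with the rest of the transaction, and the write is acknowledged.
        self.kv[("deferred", offset)] = bytes(data)

    def flush_deferred(self):
        # 2. Later, copy each pending payload onto the block device and
        #    drop the intent from the KV store.
        for (tag, offset), data in list(self.kv.items()):
            if tag == "deferred":
                self.block[offset:offset + len(data)] = data
                del self.kv[(tag, offset)]

    def replay_after_power_loss(self):
        # 3. On start-up, any intent still present in the KV store is
        #    re-applied, so an acknowledged deferred write should survive
        #    a power cut.
        self.flush_deferred()

If the superblock rewrite really follows such a path, a power loss should presumably leave either the old or the new content on disk, each with a matching checksum, which is what makes the mismatch reported below surprising.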
【dump osd super block no crc】
[root@node145 ~]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-23 --type bluestore --op dump-super
{
    "cluster_fsid": "a28d823e-eb42-4eba-9603-e1722e8bc884",
    "osd_fsid": "53b6d708-2ac2-4323-bedc-7082e08c5791",
    "whoami": 23,
    "current_epoch": 6112,
    "oldest_map": 5249,
    "newest_map": 6112,
    "weight": 0,
    "compat": {
        "compat": {},
        "ro_compat": {},
        "incompat": {
            "feature_1": "initial feature set(~v.18)",
            "feature_2": "pginfo object",
            "feature_3": "object locator",
            "feature_4": "last_epoch_clean",
            "feature_5": "categories",
            "feature_6": "hobjectpool",
            "feature_7": "biginfo",
            "feature_8": "leveldbinfo",
            "feature_9": "leveldblog",
            "feature_10": "snapmapper",
            "feature_11": "sharded objects",
            "feature_12": "transaction hints",
            "feature_13": "pg meta object",
            "feature_14": "explicit missing set",
            "feature_15": "fastinfo pg attr",
            "feature_16": "deletes in missing set"
        }
    },
    "clean_thru": 6112,
    "last_epoch_mounted": 5884
}
【log】
2020-10-09 15:00:22.296 7f637345fdc0 -1 bluestore(/var/lib/ceph/osd/ceph-15) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0xce54de81, expected 0xebc2f895, device location [0x10000~1000], logical extent 0x0~1000, object #-1:7b3f43c4:::osd_superblock:0# [BlueStore::_verify_csum:9368]
2020-10-09 15:00:22.297 7f637345fdc0 -1 bluestore(/var/lib/ceph/osd/ceph-15) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0xce54de81, expected 0xebc2f895, device location [0x10000~1000], logical extent 0x0~1000, object #-1:7b3f43c4:::osd_superblock:0# [BlueStore::_verify_csum:9368]
2020-10-09 15:00:22.297 7f637345fdc0 -1 bluestore(/var/lib/ceph/osd/ceph-15) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0xce54de81, expected 0xebc2f895, device location [0x10000~1000], logical extent 0x0~1000, object #-1:7b3f43c4:::osd_superblock:0# [BlueStore::_verify_csum:9368]
2020-10-09 15:00:22.297 7f637345fdc0 -1 osd.15 0 OSD::init() : unable to read osd superblock [OSD::init:3115]
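To make the log lines easier to read: the "crc32c/0x1000" prefix means BlueStore keeps one crc32c checksum per 0x1000-byte (4 KiB) chunk and compares it against the data actually read back. The snippet below is a rough Python illustration of such a per-chunk check, assuming the third-party crc32c package (pip install crc32c); it only shows what "got X, expected Y" means and is not a re-implementation of BlueStore::_verify_csum.

# Rough illustration of a per-chunk crc32c check, similar in spirit to
# BlueStore::_verify_csum. Requires the third-party "crc32c" package
# (pip install crc32c); this is not Ceph code.
import crc32c

CSUM_CHUNK = 0x1000  # 4 KiB, matching "crc32c/0x1000" in the log

def verify_chunks(data, expected_csums):
    # Compare the crc32c of every 4 KiB chunk with its stored checksum.
    for i, expected in enumerate(expected_csums):
        chunk = data[i * CSUM_CHUNK:(i + 1) * CSUM_CHUNK]
        got = crc32c.crc32c(chunk)
        if got != expected:
            # This corresponds to the "_verify_csum bad crc32c" log line:
            # the data read back from disk no longer matches the checksum
            # recorded when the blob was written.
            print(f"bad crc32c at chunk {i}: got {got:#010x}, expected {expected:#010x}")

In this report the recomputed value (0xce54de81) differs from the stored one (0xebc2f895) for the first 4 KiB of the osd_superblock object, so OSD::init() refuses to read the superblock and the OSD fails to start.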

Actions #1

Updated by Igor Fedotov over 3 years ago

Just in case - don't you have any custom settings for RocksDB, e.g. disabled WAL?

Actions #2

Updated by Bo Zhang over 3 years ago

Igor Fedotov wrote:

Just in case - don't you have any custom settings for RocksDB, e.g. disabled WAL?

It has been changed in the configuration file ceph.conf as follows:
bluestore_rocksdb_options = "compression=kNoCompression,max_write_buffer_number=32,min_write_buffer_number_to_merge=2,recycle_log_file_num=4,write_buffer_size=536870912,writable_file_max_buffer_size=0,compaction_readahead_size=2097152"
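For reference, bluestore_rocksdb_options is a comma-separated list of RocksDB key=value settings, and none of the entries above touch the write-ahead log; they only tune compression, write buffers and compaction. A small hypothetical helper like the Python sketch below (not part of Ceph) makes that kind of inspection explicit:

# Hypothetical helper for inspecting a bluestore_rocksdb_options string;
# this is not Ceph code, just a convenience for reading the settings above.
OPTIONS = ("compression=kNoCompression,max_write_buffer_number=32,"
           "min_write_buffer_number_to_merge=2,recycle_log_file_num=4,"
           "write_buffer_size=536870912,writable_file_max_buffer_size=0,"
           "compaction_readahead_size=2097152")

def parse_rocksdb_options(opts):
    # Turn "k1=v1,k2=v2,..." into a dict for easy inspection.
    return dict(pair.split("=", 1) for pair in opts.split(",") if pair)

print(parse_rocksdb_options(OPTIONS))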

Actions #3

Updated by Bo Zhang over 3 years ago

Igor Fedotov wrote:

Just in case - don't you have any custom settings for RocksDB, e.g. disabled WAL?

The WAL is NOT disabled.

Actions #4

Updated by Bo Zhang over 3 years ago

Bo Zhang wrote:

Igor Fedotov wrote:

Just in case - don't you have any custom settings for RocksDB, e.g. disabled WAL?

The WAL is NOT disabled.

Actions #5

Updated by Igor Fedotov over 3 years ago

Bo Zhang, I didn't get your last comments on the disabled WAL, please elaborate.

From the RocksDB config line I don't see any WAL disablement, hence it's enabled by default.
Also wondering whether this was a single occurrence or you're able to reproduce it on a more or less regular basis?

Actions #6

Updated by Bo Zhang over 3 years ago

Igor Fedotov wrote:

Bo Zhang, I didn't get your last comments on the disabled WAL, please elaborate.

From the RocksDB config line I don't see any WAL disablement, hence it's enabled by default.
Also wondering whether this was a single occurrence or you're able to reproduce it on a more or less regular basis?

Sorry for my late response.
1. I didn't disable the WAL.
2. So far, I haven't found a way to reproduce it, which is the difficult part of this problem.
3. The device was powered down for eight days over a holiday, and the problem occurred after it was powered up again, but I don't know whether the two are related.

Actions #7

Updated by Bo Zhang over 3 years ago

Another bug also appears on the same node (https://tracker.ceph.com/issues/48061).

Actions #8

Updated by Igor Fedotov over 3 years ago

Bo Zhang wrote:

Another bug also appears on the same node (https://tracker.ceph.com/issues/48061).

That other bug happened to a different OSD, I presume?
Are both failing OSDs sharing the same disk for the DB volume? Or maybe a disk controller or something? What are the models of the drive(s) behind the DB?

Could you please check for H/W errors using dmesg and smartctl?

Actions #9

Updated by Bo Zhang over 3 years ago

Igor Fedotov wrote:

Bo Zhang wrote:

Another bug also appears on the same node (https://tracker.ceph.com/issues/48061).

That other bug happened to a different OSD, I presume?
Are both failing OSDs sharing the same disk for the DB volume? Or maybe a disk controller or something? What are the models of the drive(s) behind the DB?

Could you please check for H/W errors using dmesg and smartctl?

1. Yes, a different OSD. The result is that the OSD fails to start after power-up.
2. No; separate DB and WAL devices are not set up in this environment, and both OSDs are on HDD disks.
3. I checked with dmesg and smartctl and there are no errors.
