Bug #48025
OSD startup fails when the OSD superblock CRC check fails
Description
【version】
14.2.8
【trigger operation】
With the cluster operating normally, the equipment was manually powered down and then powered up again.
【symptoms】
One OSD in the cluster failed to start up.
The log shows that the OSD superblock's CRC checksum is inconsistent with the actual data on disk.
With validation turned off, the dumped OSD superblock decodes normally and its fields are correct.
Analysis of the OSD superblock flush path confirms that the flush is performed as a deferred write transaction.
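A deferred write of this kind can be sketched as follows. This is a simplified toy model with assumed names (`DeferredStore`, `write_deferred`, `flush_deferred`), not Ceph code: the small write is first committed to the key/value store (RocksDB, whose WAL makes it durable) together with the updated checksum metadata, and only later copied to its final block-device location. If that second step is lost after a power cut while the metadata survives, a later read recomputes a checksum over stale block data and mismatches — the symptom reported here.

```python
import zlib  # crc32 stands in for BlueStore's crc32c in this sketch


class DeferredStore:
    """Toy model of a deferred small-write path (assumed names, not Ceph code)."""

    def __init__(self):
        self.kv = {}        # stands in for RocksDB: txn log + blob metadata
        self.block = {}     # stands in for the raw block device
        self.pending = []   # deferred payloads queued for later replay

    def write_deferred(self, offset: int, data: bytes) -> None:
        # Step 1: data + new checksum are committed atomically to the KV store.
        self.kv[("deferred", offset)] = data
        self.kv[("csum", offset)] = zlib.crc32(data)
        self.pending.append(offset)

    def flush_deferred(self) -> None:
        # Step 2: apply queued payloads to the block device, then drop them.
        for off in self.pending:
            self.block[off] = self.kv.pop(("deferred", off))
        self.pending.clear()

    def read_verified(self, offset: int) -> bytes:
        # Read path: recompute the checksum and compare with stored metadata.
        data = self.block.get(offset, b"\x00" * 4096)  # stale if never flushed
        if zlib.crc32(data) != self.kv[("csum", offset)]:
            raise IOError("bad crc32 checksum (stale data vs. new metadata)")
        return data
```

If the deferred payload never reaches the device while the checksum metadata is durable, `read_verified()` fails exactly the way `_verify_csum` does in the log excerpt in this report.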
【osd superblock dump, CRC validation off】
[root@node145 ~]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-23 --type bluestore --op dump-super
{
    "cluster_fsid": "a28d823e-eb42-4eba-9603-e1722e8bc884",
    "osd_fsid": "53b6d708-2ac2-4323-bedc-7082e08c5791",
    "whoami": 23,
    "current_epoch": 6112,
    "oldest_map": 5249,
    "newest_map": 6112,
    "weight": 0,
    "compat": {
        "compat": {},
        "ro_compat": {},
        "incompat": {
            "feature_1": "initial feature set(~v.18)",
            "feature_2": "pginfo object",
            "feature_3": "object locator",
            "feature_4": "last_epoch_clean",
            "feature_5": "categories",
            "feature_6": "hobjectpool",
            "feature_7": "biginfo",
            "feature_8": "leveldbinfo",
            "feature_9": "leveldblog",
            "feature_10": "snapmapper",
            "feature_11": "sharded objects",
            "feature_12": "transaction hints",
            "feature_13": "pg meta object",
            "feature_14": "explicit missing set",
            "feature_15": "fastinfo pg attr",
            "feature_16": "deletes in missing set"
        }
    },
    "clean_thru": 6112,
    "last_epoch_mounted": 5884
}
【log】
2020-10-09 15:00:22.296 7f637345fdc0 -1 bluestore(/var/lib/ceph/osd/ceph-15) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0xce54de81, expected 0xebc2f895, device location [0x10000~1000], logical extent 0x0~1000, object #-1:7b3f43c4:::osd_superblock:0# [BlueStore::_verify_csum:9368]
2020-10-09 15:00:22.297 7f637345fdc0 -1 bluestore(/var/lib/ceph/osd/ceph-15) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0xce54de81, expected 0xebc2f895, device location [0x10000~1000], logical extent 0x0~1000, object #-1:7b3f43c4:::osd_superblock:0# [BlueStore::_verify_csum:9368]
2020-10-09 15:00:22.297 7f637345fdc0 -1 bluestore(/var/lib/ceph/osd/ceph-15) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0xce54de81, expected 0xebc2f895, device location [0x10000~1000], logical extent 0x0~1000, object #-1:7b3f43c4:::osd_superblock:0# [BlueStore::_verify_csum:9368]
2020-10-09 15:00:22.297 7f637345fdc0 -1 osd.15 0 OSD::init() : unable to read osd superblock [OSD::init:3115]
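In the log, `_verify_csum` has recomputed crc32c over the 0x1000-byte extent read from disk and compared it against the checksum stored in the blob metadata; here they differ (got 0xce54de81, expected 0xebc2f895), meaning the data at device offset 0x10000 no longer matches the metadata describing it. A minimal sketch of that check — a plain bitwise CRC-32C, not BlueStore's actual implementation:

```python
def crc32c(data: bytes, crc: int = 0) -> int:
    """Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78."""
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def verify_csum(extent: bytes, expected: int) -> bool:
    # BlueStore-style check: recompute over the extent read from disk and
    # compare with the checksum recorded in the blob metadata.
    return crc32c(extent) == expected
```

A mismatch therefore means either the stored data or the stored checksum is stale — it says nothing about which side is wrong, which is why dumping the superblock with validation off is a useful next step.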
Updated by Igor Fedotov over 3 years ago
Just in case - don't you have any custom settings for RocksDB, e.g. disabled WAL?
Updated by Bo Zhang over 3 years ago
Igor Fedotov wrote:
Just in case - don't you have any custom settings for RocksDB, e.g. disabled WAL?
The following has been changed in the configuration file ceph.conf:
bluestore_rocksdb_options = "compression=kNoCompression,max_write_buffer_number=32,min_write_buffer_number_to_merge=2,recycle_log_file_num=4,write_buffer_size=536870912,writable_file_max_buffer_size=0,compaction_readahead_size=2097152"
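The `bluestore_rocksdb_options` value is a comma-separated list of `key=value` pairs; a quick way to inspect it (and to confirm that none of these keys touches the WAL):

```python
opts = ("compression=kNoCompression,max_write_buffer_number=32,"
        "min_write_buffer_number_to_merge=2,recycle_log_file_num=4,"
        "write_buffer_size=536870912,writable_file_max_buffer_size=0,"
        "compaction_readahead_size=2097152")

# Split the option string into a dict of RocksDB settings.
parsed = dict(item.split("=", 1) for item in opts.split(","))
```

As Igor notes below, nothing in this list disables the RocksDB WAL, so the default (WAL enabled) applies.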
Updated by Bo Zhang over 3 years ago
Igor Fedotov wrote:
Just in case - don't you have any custom settings for RocksDB, e.g. disabled WAL?
The WAL is NOT disabled.
Updated by Igor Fedotov over 3 years ago
Bo Zhang, I didn't get your last comment about the disabled WAL; please elaborate.
From the RocksDB config line I don't see any WAL disablement, hence it's enabled by default.
I'm also wondering whether this was a single occurrence, or whether you are able to reproduce it on a more or less regular basis?
Updated by Bo Zhang over 3 years ago
Igor Fedotov wrote:
Bo Zhang, I didn't get your last comment about the disabled WAL; please elaborate.
From the RocksDB config line I don't see any WAL disablement, hence it's enabled by default.
I'm also wondering whether this was a single occurrence, or whether you are able to reproduce it on a more or less regular basis?
Sorry for my late response.
1. I didn't disable the WAL.
2. So far I haven't found a way to reproduce it, which is the difficult part of this problem.
3. The equipment was powered down for eight days over a holiday, and the problem occurred after it was powered up again, but I don't know whether the two are related.
Updated by Bo Zhang over 3 years ago
Another bug also appeared on the same node (https://tracker.ceph.com/issues/48061).
Updated by Igor Fedotov over 3 years ago
Bo Zhang wrote:
Another bug also appears on the same node.(https://tracker.ceph.com/issues/48061)
This other bug happened on a different OSD, I presume?
Are both failing OSDs sharing the same disk for the DB volume? Or maybe a disk controller or something? What are the models of the drive(s) behind the DB?
Could you please check for H/W errors using dmesg and smartctl?
Updated by Bo Zhang over 3 years ago
Igor Fedotov wrote:
Bo Zhang wrote:
Another bug also appears on the same node.(https://tracker.ceph.com/issues/48061)
This other bug happened on a different OSD, I presume?
Are both failing OSDs sharing the same disk for the DB volume? Or maybe a disk controller or something? What are the models of the drive(s) behind the DB?
Could you please check for H/W errors using dmesg and smartctl?
1. Yes, a different OSD. The result is that the OSD fails to start after power-up.
2. No; DB and WAL are not set up separately in this environment. Both are HDD disks.
3. I checked with dmesg and smartctl and there are no errors.
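For reference, a minimal sketch of the kind of H/W error check suggested above. The dmesg-style sample lines below are illustrative, not taken from this node; on a real node you would filter the live `dmesg` and `smartctl` output as shown in the comments:

```shell
# On the real node (assumes root and an HDD at /dev/sdb; adjust the device):
#   dmesg -T | grep -icE 'I/O error|blk_update_request|medium error'
#   smartctl -a /dev/sdb | grep -iE 'Reallocated_Sector|Current_Pending|Offline_Uncorrectable'
# Below, an illustrative dmesg-style sample is filtered the same way.
sample='[Fri Oct  9 14:58:01 2020] blk_update_request: I/O error, dev sdb, sector 12345
[Fri Oct  9 14:58:01 2020] Buffer I/O error on dev sdb1, logical block 1543
[Fri Oct  9 14:58:02 2020] XFS (sdb1): metadata I/O error'
printf '%s\n' "$sample" | grep -icE 'I/O error'
```

A count of zero on both checks (as reported here) makes a plain media failure less likely, though it does not rule out controller or power-loss-protection issues.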