Bug #22464 (closed): Bluestore: many checksum errors, always 0x6706be76 (which matches a zero block)

Added by Martin Preuss over 6 years ago. Updated over 1 year ago.

Status: Won't Fix
Priority: Urgent
Assignee: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite: ceph-deploy
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I'm new to Ceph. I started a Ceph cluster from scratch on Debian 9,
consisting of 3 hosts, each with 3-4 OSDs (using 4 TB HDDs, currently
totalling 10 HDDs).

Right from the start I kept receiving random scrub errors telling me
that some checksums didn't match the expected value; these were fixable
with "ceph pg repair".

I looked at the ceph-osd logfiles on each of the hosts and compared them
with the corresponding syslogs. I never found any hardware error, so there
was no problem reading or writing a sector hardware-wise. There was also
never any other suspicious syslog entry around the time a checksum error
was reported.

When I looked at the checksum error entries I found that the reported
bad checksum was always "0x6706be76".
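
To double-check the "zero block" interpretation, here is a minimal sketch (not taken from the cluster). It assumes BlueStore's crc32c is seeded with 0xffffffff and applies no final inversion, and it simply guesses 4 KiB and 64 KiB as plausible checksum chunk sizes, printing the crc32c of all-zero buffers of those sizes for comparison against 0x6706be76:

#!/usr/bin/env python3
# Sketch: crc32c of all-zero blocks, for comparison against 0x6706be76.
# Assumptions: seed 0xffffffff and no final inversion (my reading of
# BlueStore's crc32c); 4 KiB / 64 KiB are guessed chunk sizes, not values
# taken from this cluster.

def crc32c(data: bytes, seed: int = 0xFFFFFFFF) -> int:
    """Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78."""
    crc = seed & 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 1:
                crc = (crc >> 1) ^ 0x82F63B78
            else:
                crc >>= 1
    return crc

if __name__ == "__main__":
    for size in (4096, 65536):  # guessed checksum chunk sizes
        print("%6d zero bytes -> 0x%08x" % (size, crc32c(bytes(size))))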

The cluster was created with version 12.2.1 (the errors already occurred with that version) and later updated to 12.2.2.
All 3 nodes run Debian 9 with packages from "http://eu.ceph.com/debian-luminous/".

Cluster status:
services:
mon: 3 daemons, quorum ceph1,ceph2,ceph3
mgr: ceph1(active), standbys: ceph2
mds: cephfs-1/1/1 up {0=ceph1=up:active}, 2 up:standby
osd: 10 osds: 10 up, 10 in

data:
pools: 5 pools, 256 pgs
objects: 8097k objects, 10671 GB
usage: 25403 GB used, 11856 GB / 37259 GB avail
pgs: 256 active+clean

Pools:
pool 1 'cephfs_metadata' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 last_change 1184 flags hashpspool stripe_width 0 application cephfs
pool 2 'cephfs_data' replicated size 2 min_size 2 crush_rule 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1184 lfor 0/772 flags hashpspool stripe_width 0 compression_algorithm zlib compression_mode force application cephfs
pool 3 'cephfs_home' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 last_change 1184 lfor 0/463 flags hashpspool stripe_width 0 compression_algorithm zlib compression_mode force application cephfs
pool 4 'cephfs_multimedia' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1184 lfor 0/705 flags hashpspool stripe_width 0 application cephfs
pool 5 'cephfs_vdr' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1184 lfor 0/632 flags hashpspool stripe_width 0 application cephfs

OSD tree:
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 36.38596 root default
-3 10.91579 host ceph1
0 hdd 3.63860 osd.0 up 0.79999 1.00000
1 hdd 3.63860 osd.1 up 0.70000 1.00000
2 hdd 3.63860 osd.2 up 1.00000 1.00000
-5 14.55438 host ceph2
3 hdd 3.63860 osd.3 up 1.00000 1.00000
4 hdd 3.63860 osd.4 up 1.00000 1.00000
5 hdd 3.63860 osd.5 up 1.00000 1.00000
9 hdd 3.63860 osd.9 up 1.00000 1.00000
-7 10.91579 host ceph3
6 hdd 3.63860 osd.6 up 1.00000 1.00000
7 hdd 3.63860 osd.7 up 1.00000 1.00000
8 hdd 3.63860 osd.8 up 1.00000 1.00000


Files

ceph-errors (5.95 KB) - List of bad pgs per day - Martin Preuss, 01/19/2018 07:09 PM

Related issues: 2 (0 open, 2 closed)

Related to bluestore - Bug #22102: BlueStore crashed on rocksdb checksum mismatch (Won't Fix, 11/10/2017)

Related to bluestore - Bug #25006: bad csum during upgrade test (Can't reproduce, 07/19/2018)

