Bug #37326

Daily inconsistent objects

Added by OMC OMC about 1 year ago. Updated 11 months ago.

Status: Need More Info
Priority: Normal
Assignee: -
Category: Scrub/Repair
Target version:
Start date: 11/19/2018
Due date:
% Done: 0%
Source: Support
Tags:
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite: fs
Component(RADOS): BlueStore, Monitor, OSD
Pull request ID:
Crash signature:

Description

We have many Ceph Mimic 13.2.1 installations with similar configurations on Ubuntu, but on one of them we get inconsistent objects on a daily basis.
The objects are repaired within a few hours, after we run ceph pg repair.
This is part of the log on the OSD node:
/var/log/ceph/ceph-osd.0.log.1.gz:171-2018-11-18 07:28:54.966 7f9392ab6700 0 log_channel(cluster) log [DBG] : 2.5c scrub starts
/var/log/ceph/ceph-osd.0.log.1.gz:172-2018-11-18 07:28:54.978 7f9392ab6700 0 log_channel(cluster) log [DBG] : 2.5c scrub ok
/var/log/ceph/ceph-osd.0.log.1.gz:173-2018-11-18 07:29:19.968 7f9392ab6700 0 log_channel(cluster) log [DBG] : 1.16a deep-scrub starts
/var/log/ceph/ceph-osd.0.log.1.gz:174-2018-11-18 07:29:48.062 7f93b0d2c700 0 -- 192.168.32.103:6800/3967491 >> 192.168.32.31:6800/1342180418 conn(0x564831f7d100 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg: challenging authorizer
/var/log/ceph/ceph-osd.0.log.1.gz:175-2018-11-18 07:38:41.728 7f93b0d2c700 0 -- 192.168.32.103:6802/3967491 >> 192.168.32.102:6802/2026 conn(0x56483922a300 :6802 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg: challenging authorizer
/var/log/ceph/ceph-osd.0.log.1.gz:176-2018-11-18 07:38:41.728 7f93b0d2c700 0 -- 192.168.32.103:6802/3967491 >> 192.168.32.102:6802/2026 conn(0x56484e7fea00 :-1 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=1890 cs=7 l=0).handle_connect_msg: challenging authorizer
/var/log/ceph/ceph-osd.0.log.1.gz:177-2018-11-18 07:38:42.500 7f93afd2a700 0 -- 192.168.32.103:6802/3967491 >> 192.168.32.104:6803/2049 conn(0x564831f7a000 :6802 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg: challenging authorizer
/var/log/ceph/ceph-osd.0.log.1.gz:178-2018-11-18 07:38:42.504 7f93afd2a700 0 -- 192.168.32.103:6802/3967491 >> 192.168.32.104:6803/2049 conn(0x56484e83f500 :6802 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=2014 cs=5 l=0).handle_connect_msg: challenging authorizer
/var/log/ceph/ceph-osd.0.log.1.gz:179-2018-11-18 07:40:43.138 7f9392ab6700 1 bluestore(/var/lib/ceph/osd/ceph-0) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x11000, got 0x6706be76, expected 0xaa94f4ee, device location [0x9c29ff31000~1000], logical extent 0x191000~1000, object 0#1:56dd4516:::100003aec89.0000088e:head#
/var/log/ceph/ceph-osd.0.log.1.gz:180:2018-11-18 07:40:43.198 7f9392ab6700 -1 log_channel(cluster) log [ERR] : 1.16as0 shard 0(0): soid 1:56dd4516:::100003aec89.0000088e:head candidate had a read error
/var/log/ceph/ceph-osd.0.log.1.gz:181-2018-11-18 07:41:38.574 7f93b052b700 0 -- 192.168.32.103:6802/3967491 >> 192.168.32.101:6801/2283 conn(0x564831f7c300 :6802 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg: challenging authorizer
/var/log/ceph/ceph-osd.0.log.1.gz:182-2018-11-18 07:41:38.574 7f93b052b700 0 -- 192.168.32.103:6802/3967491 >> 192.168.32.101:6801/2283 conn(0x56484e7fc700 :-1 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=1797 cs=5 l=0).handle_connect_msg: challenging authorizer
/var/log/ceph/ceph-osd.0.log.1.gz:183-2018-11-18 07:44:48.065 7f93afd2a700 0 -- 192.168.32.103:6800/3967491 >> 192.168.32.31:6800/1342180418 conn(0x564831f7ca00 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg: challenging authorizer
/var/log/ceph/ceph-osd.0.log.1.gz:184:2018-11-18 07:44:57.338 7f9392ab6700 -1 log_channel(cluster) log [ERR] : 1.16as0 deep-scrub 0 missing, 1 inconsistent objects
/var/log/ceph/ceph-osd.0.log.1.gz:185:2018-11-18 07:44:57.338 7f9392ab6700 -1 log_channel(cluster) log [ERR] : 1.16a deep-scrub 1 errors
/var/log/ceph/ceph-osd.0.log.1.gz:186-2018-11-18 07:45:53.118 7f9392ab6700 0 log_channel(cluster) log [DBG] : 2.c5 deep-scrub starts
/var/log/ceph/ceph-osd.0.log.1.gz:187-2018-11-18 07:45:53.166 7f9392ab6700 0 log_channel(cluster) log [DBG] : 2.c5 deep-scrub ok
/var/log/ceph/ceph-osd.0.log.1.gz:188-2018-11-18 07:56:31.252 7f9392ab6700 0 log_channel(cluster) log [DBG] : 1.165 scrub starts
/var/log/ceph/ceph-osd.0.log.1.gz:189-2018-11-18 07:57:16.584 7f9392ab6700 0 log_channel(cluster) log [DBG] : 1.165 scrub ok

This is the corresponding part of the syslog:
/var/log/syslog.1:Nov 18 07:40:43 osd1103 ceph-osd3967491: 2018-11-18 07:40:43.138 7f9392ab6700 -1 bluestore(/var/lib/ceph/osd/ceph-0) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x11000, got 0x6706be76, expected 0xaa94f4ee, device location [0x9c29ff31000~1000], logical extent 0x191000~1000, object 0#1:56dd4516:::100003aec89.0000088e:head#
/var/log/syslog.1:Nov 18 07:40:43 osd1103 ceph-osd3967491: 2018-11-18 07:40:43.198 7f9392ab6700 -1 log_channel(cluster) log [ERR] : 1.16as0 shard 0(0): soid 1:56dd4516:::100003aec89.0000088e:head candidate had a read error
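
The _verify_csum message above packs several values into one line (checksum type and block size, the computed and stored checksums, the device offset, and the object name). A minimal Python sketch for pulling those fields out when comparing many such errors follows. It assumes the message layout is exactly as quoted in this report and that the lengths after `~` are hexadecimal; the function and field names are hypothetical, not part of Ceph.

```python
import re

# Matches the BlueStore "_verify_csum bad ..." message in the layout quoted above.
# Assumption: the layout is stable; other Ceph versions may word it differently.
VERIFY_CSUM_RE = re.compile(
    r"bluestore\((?P<osd_path>[^)]+)\) _verify_csum bad "
    r"(?P<csum_type>\w+)/0x(?P<csum_block>[0-9a-f]+) checksum "
    r"at blob offset 0x(?P<blob_offset>[0-9a-f]+), "
    r"got 0x(?P<got>[0-9a-f]+), expected 0x(?P<expected>[0-9a-f]+), "
    r"device location \[0x(?P<dev_offset>[0-9a-f]+)~(?P<dev_length>[0-9a-f]+)\], "
    r"logical extent 0x(?P<logical_offset>[0-9a-f]+)~(?P<logical_length>[0-9a-f]+), "
    r"object (?P<object>\S+)"
)

# Fields that are printed in hexadecimal and are converted to integers.
HEX_FIELDS = (
    "csum_block", "blob_offset", "got", "expected",
    "dev_offset", "dev_length", "logical_offset", "logical_length",
)


def parse_verify_csum(line):
    """Return a dict of fields from one _verify_csum line, or None if it does not match."""
    match = VERIFY_CSUM_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    for name in HEX_FIELDS:
        # Assumption: lengths after "~" are hex even without a "0x" prefix.
        fields[name] = int(fields[name], 16)
    return fields
```

Applied to the line quoted above, this yields the OSD data path /var/lib/ceph/osd/ceph-0 and a checksum block size of 0x1000 (4096 bytes), which makes it easier to compare device offsets and objects across repeated errors.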

ceph-osd.0.log.1.gz (53.5 KB) OMC OMC, 11/19/2018 10:31 AM

History

#1 Updated by Josh Durgin about 1 year ago

  • Status changed from New to Need More Info

Is this happening on the same disk all the time, or the same node? If so, that suggests a piece of hardware (e.g. controller, disk, memory) going bad.
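
One way to check this is to count the checksum errors per host and per OSD across the collected syslogs. The following is a rough Python sketch, assuming syslog lines in the format quoted in the description (timestamp, hostname, then the ceph-osd message); the function name is hypothetical.

```python
import re
from collections import Counter

# Assumption: classic syslog prefix "Mon DD HH:MM:SS hostname ..." followed by
# the BlueStore _verify_csum message, as in the lines quoted in the description.
SYSLOG_CSUM_RE = re.compile(
    r"[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2} (?P<host>\S+) .*"
    r"bluestore\((?P<osd_path>[^)]+)\) _verify_csum bad "
)


def tally_checksum_errors(lines):
    """Count _verify_csum errors per (hostname, OSD data path) pair."""
    counts = Counter()
    for line in lines:
        match = SYSLOG_CSUM_RE.search(line)
        if match:
            counts[(match.group("host"), match.group("osd_path"))] += 1
    return counts
```

If the counts pile up on one host or one OSD path, that points toward a hardware component on that node; a spread across many hosts and OSDs points elsewhere in the stack.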

#2 Updated by OMC OMC about 1 year ago

It happens on different disks, even on different host nodes.

#3 Updated by OMC OMC 12 months ago

Does anyone have any idea?

#4 Updated by Josh Durgin 11 months ago

The ceph-users list may be able to help debug this faster - it could be many things in the hw/sw stack.
