Bug #38012

OSD bad crc causes the whole cluster to stop accepting new requests.

Added by frank lin about 5 years ago. Updated about 5 years ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS): OSD
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I have encountered this problem on both a Jewel cluster and a Luminous cluster.
The symptom is that some requests get blocked forever and the whole cluster is no longer able to accept any data. Further investigation shows the blocked requests happened on 2 OSDs (the pool size is 2, so I guess it affects all OSDs in the same acting set). The OSD logs contain many repeated error messages like the ones below.

Error messages in the Jewel cluster
First affected OSD

2019-01-21 10:08:18.836864 941f9b10  0 bad crc in data 1155321283 != exp 1909584237
2019-01-21 10:08:18.837013 941f9b10  0 -- 10.0.2.39:6800/23795 >> 10.0.2.40:6804/28471 pipe(0x7c15c000 sd=149 :53242 s=2 pgs=5 cs=1 l=0 c=0x7a697e60).fault, initiating reconnect
2019-01-21 10:08:18.839328 7cbb4b10  0 -- 10.0.2.39:6800/23795 >> 10.0.2.40:6804/28471 pipe(0x6b782b00 sd=235 :6800 s=0 pgs=0 cs=0 l=0 c=0x8b2f5440).accept connect_seq 2 vs existing 2 state connecting
2019-01-21 10:08:18.850772 7cbb4b10  0 bad crc in data 1155321283 != exp 1909584237
2019-01-21 10:08:18.850910 7cbb4b10  0 -- 10.0.2.39:6800/23795 >> 10.0.2.40:6804/28471 pipe(0x6b782b00 sd=235 :6800 s=2 pgs=58 cs=3 l=0 c=0x7a697e60).fault with nothing to send, going to standby

Second affected OSD

2019-01-21 10:06:12.282115 9513cb10  0 bad crc in data 1035875608 != exp 3787091679
2019-01-21 10:06:12.290395 abdcdb10  0 -- 10.0.1.40:6804/28471 submit_message osd_op_reply(1031289084 rbd_data.28ae2238e1f29.0000000000a7df16 [set-alloc-hint object_size 4194304 write_size 4194304,write 65536~524288] v5503'1224666 uv1224666 ondisk = 0) v7 remote, 10.0.1.121:0/3226500701, failed lossy con, dropping message 0x70df6800
2019-01-21 10:06:12.297356 9eb3cb10  0 -- 10.0.1.40:6804/28471 submit_message osd_op_reply(1031289067 rbd_data.28ae2238e1f29.0000000000a7defd [set-alloc-hint object_size 4194304 write_size 4194304,write 3211264~524288] v5503'1236405 uv1236405 ondisk = 0) v7 remote, 10.0.1.121:0/3226500701, failed lossy con, dropping message 0x716a0e00
2019-01-21 10:06:12.303597 abdcdb10  0 -- 10.0.1.40:6804/28471 submit_message osd_op_reply(1031289085 rbd_data.28ae2238e1f29.0000000000a7df16 [set-alloc-hint object_size 4194304 write_size 4194304,write 589824~524288] v5503'1224667 uv1224667 ondisk = 0) v7 remote, 10.0.1.121:0/3226500701, failed lossy con, dropping message 0x6e537000
2019-01-21 10:06:12.310642 9c33cb10  0 -- 10.0.1.40:6804/28471 submit_message osd_op_reply(1031289069 rbd_data.28ae2238e1f29.0000000000a7defd [set-alloc-hint object_size 4194304 write_size 4194304,write 3735552~458752] v5503'1236406 uv1236406 ondisk = 0) v7 remote, 10.0.1.121:0/3226500701, failed lossy con, dropping message 0x71655000
2019-01-21 10:08:18.837438 94b3cb10  0 -- 10.0.2.40:6804/28471 >> 10.0.2.39:6800/23795 pipe(0x888acd00 sd=129 :6804 s=2 pgs=3916 cs=1 l=0 c=0x9202a7e0).fault, initiating reconnect
2019-01-21 10:08:18.839301 9323cb10  0 -- 10.0.2.40:6804/28471 >> 10.0.2.39:6800/23795 pipe(0x702a6000 sd=80 :6804 s=0 pgs=0 cs=0 l=0 c=0x8d086480).accept connect_seq 2 vs existing 2 state connecting
2019-01-21 10:08:18.851839 94b3cb10  0 -- 10.0.2.40:6804/28471 >> 10.0.2.39:6800/23795 pipe(0x888acd00 sd=129 :42636 s=2 pgs=3930 cs=3 l=0 c=0x9202a7e0).fault, initiating reconnect
2019-01-21 10:08:18.860245 7eaf5b10  0 -- 10.0.2.40:6804/28471 >> 10.0.2.39:6800/23795 pipe(0x888acd00 sd=129 :42636 s=1 pgs=3930 cs=4 l=0 c=0x9202a7e0).fault
2019-01-21 10:08:18.877537 94b3cb10  0 -- 10.0.2.40:6804/28471 >> 10.0.2.39:6800/23795 pipe(0x888acd00 sd=80 :42638 s=2 pgs=3931 cs=5 l=0 c=0x9202a7e0).fault, initiating reconnect

Error messages in the Luminous cluster
First affected OSD

2018-12-11 23:14:43.034926 b560c8e0 0 -- 10.0.2.2:6802/15865 >> 10.0.2.37:6800/13016 conn(0x7648e00 :-1 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=8870304 cs=8856031 l=0).handle_connect_msg: challenging authorizer
2018-12-11 23:14:43.042915 b560c8e0 0 bad crc in front 1566330326 != exp 3283985696
2018-12-11 23:14:43.044587 b4e0c8e0 0 -- 10.0.2.2:6802/15865 >> 10.0.2.37:6800/13016 conn(0x919e700 :6802 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg: challenging authorizer
2018-12-11 23:14:43.045153 b4e0c8e0 0 -- 10.0.2.2:6802/15865 >> 10.0.2.37:6800/13016 conn(0x919e700 :6802 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 8856034 vs existing csq=8856033 existing_state=STATE_STANDBY

Second affected OSD

2018-12-11 23:15:23.693508 b56158e0 0 -- 10.0.2.37:6800/13016 >> 10.0.2.2:6802/15865 conn(0x2f984e00 :6800 s=STATE_OPEN pgs=4450977 cs=8863341 l=0).fault initiating reconnect
2018-12-11 23:15:23.704284 b56158e0 0 -- 10.0.2.37:6800/13016 >> 10.0.2.2:6802/15865 conn(0x2f984e00 :6800 s=STATE_OPEN pgs=4450978 cs=8863343 l=0).fault initiating reconnect
2018-12-11 23:15:23.714925 b56158e0 0 -- 10.0.2.37:6800/13016 >> 10.0.2.2:6802/15865 conn(0x2f984e00 :6800 s=STATE_OPEN pgs=4450979 cs=8863345 l=0).fault initiating reconnect
2018-12-11 23:15:23.725507 b56158e0 0 -- 10.0.2.37:6800/13016 >> 10.0.2.2:6802/15865 conn(0x2f984e00 :6800 s=STATE_OPEN pgs=4450980 cs=8863347 l=0).fault initiating reconnect

What could cause the bad crc error? Faulty memory or other hardware?

Either way, I don't think it is right for a few faulty OSDs to stop the whole cluster from working.
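
For anyone hitting the same symptom, one quick way to narrow it down is to check which OSDs on a host are currently logging "bad crc". A minimal Python sketch, assuming default log paths on the OSD host (the path and tail length are illustrative, not part of this report):

#!/usr/bin/env python
# Hypothetical helper: list OSD logs on this host whose recent entries
# contain "bad crc". Log path and tail length are assumptions.
import glob
import subprocess

def osds_with_bad_crc(log_glob="/var/log/ceph/ceph-osd.*.log", tail_lines=1000):
    suspects = []
    for path in glob.glob(log_glob):
        tail = subprocess.check_output(["tail", "-n", str(tail_lines), path])
        if b"bad crc" in tail:
            suspects.append(path)
    return suspects

if __name__ == "__main__":
    for path in osds_with_bad_crc():
        print("recent bad crc entries in", path)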

Actions #1

Updated by Josh Durgin about 5 years ago

This is likely to be bad networking hardware - the CRC at the Ceph level that is failing is designed to detect exactly that kind of issue, and we have seen it do so in the past; iirc it was usually a bad NIC, or less likely a driver bug.

Are the heartbeat messages still getting through somehow, so that the OSDs aren't marked as failed? With most networking hardware failures the heartbeats should be affected as well, so the failing OSDs would be marked down and the whole cluster would not be affected.
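
One quick way to answer that question on a live cluster is to compare the OSDs that are logging crc errors with what the monitors think of them. A rough sketch, assuming the ceph CLI is reachable; the OSD IDs below are placeholders:

#!/usr/bin/env python
# Check whether suspect OSDs are still marked "up" by the monitors.
# The OSD IDs are placeholders; substitute the ones logging "bad crc".
import json
import subprocess

SUSPECT_OSDS = [39, 40]  # hypothetical IDs, not from the actual cluster

def osd_up_states():
    out = subprocess.check_output(["ceph", "osd", "dump", "--format", "json"])
    dump = json.loads(out)
    return {entry["osd"]: entry["up"] for entry in dump["osds"]}

if __name__ == "__main__":
    states = osd_up_states()
    for osd_id in SUSPECT_OSDS:
        print("osd.%d up=%s" % (osd_id, states.get(osd_id)))

If the suspect OSDs still show up=1 while requests are blocked, the heartbeats are indeed still passing despite the data-path corruption.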

Actions #2

Updated by frank lin about 5 years ago

Josh Durgin wrote:

This is likely to be bad networking hardware - the CRC at the Ceph level that is failing is designed to detect exactly that kind of issue, and we have seen it do so in the past; iirc it was usually a bad NIC, or less likely a driver bug.

Are the heartbeat messages still getting through somehow, so that the OSDs aren't marked as failed? With most networking hardware failures the heartbeats should be affected as well, so the failing OSDs would be marked down and the whole cluster would not be affected.

I am not sure what the root cause is. I thought it was related to bad memory, because I have seen this problem on a node with a faulty memory module (non-ECC memory).
My concern is how to let the cluster handle this kind of problem automatically, because in this situation the OSD was not marked down and all requests were blocked until I manually restarted the OSD.
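
Until the cluster can handle it itself, one stopgap is an external per-host watchdog that automates the manual restart described above. A rough sketch, assuming systemd-managed OSDs and default log paths (both assumptions; the threshold is arbitrary):

#!/usr/bin/env python
# Crude per-host watchdog: restart an OSD that keeps logging "bad crc".
# Assumes systemd units named ceph-osd@<id> and default log locations.
import subprocess
import time

LOG_TEMPLATE = "/var/log/ceph/ceph-osd.%d.log"
THRESHOLD = 50        # new "bad crc" lines between checks before acting
CHECK_INTERVAL = 60   # seconds between checks

def bad_crc_count(osd_id, tail_lines=5000):
    tail = subprocess.check_output(
        ["tail", "-n", str(tail_lines), LOG_TEMPLATE % osd_id])
    return tail.count(b"bad crc")

def watch(osd_ids):
    last = {osd_id: bad_crc_count(osd_id) for osd_id in osd_ids}
    while True:
        time.sleep(CHECK_INTERVAL)
        for osd_id in osd_ids:
            count = bad_crc_count(osd_id)
            # Only act on a burst of new crc errors since the previous
            # check, so one restart is not repeated forever.
            if count - last[osd_id] >= THRESHOLD:
                print("osd.%d: crc error burst, restarting" % osd_id)
                subprocess.call(["systemctl", "restart", "ceph-osd@%d" % osd_id])
            last[osd_id] = count

if __name__ == "__main__":
    watch([39, 40])  # placeholder OSD IDs

Restarting only papers over the problem, of course; the faulty hardware still needs to be found and replaced.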

Actions #3

Updated by Josh Durgin about 5 years ago

It seems the first step would be reporting the crc mismatches via a perfcounter. Then the mgr could look at those to either raise a warning or automatically mark affected OSDs out.
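
To make the shape of that concrete, a rough sketch of the mgr side is below. It is only an illustration: the "msgr_bad_crc" perf counter does not exist yet and is exactly what would first have to be added on the OSD side, and the threshold and health-check name are made up.

# Hypothetical ceph-mgr module sketch: turn per-OSD crc mismatch counters
# into a cluster health warning. The "msgr_bad_crc" counter name, the
# threshold and the OSD_BAD_CRC check are assumptions for illustration.
import time

from mgr_module import MgrModule


class Module(MgrModule):
    BAD_CRC_COUNTER = "msgr_bad_crc"   # would need to be added on the OSD side
    THRESHOLD = 10                     # arbitrary for this sketch

    def serve(self):
        while True:
            self._check_crc_counters()
            time.sleep(60)

    def _check_crc_counters(self):
        affected = []
        for daemon, counters in self.get_all_perf_counters().items():
            if not daemon.startswith("osd."):
                continue
            counter = counters.get(self.BAD_CRC_COUNTER)
            if counter and counter.get("value", 0) >= self.THRESHOLD:
                affected.append("%s: %d crc mismatches"
                                % (daemon, counter["value"]))
        checks = {}
        if affected:
            checks["OSD_BAD_CRC"] = {
                "severity": "warning",
                "summary": "%d OSD(s) reporting crc mismatches" % len(affected),
                "detail": affected,
            }
        self.set_health_checks(checks)

Automatically marking the affected OSDs out could then be layered on top of the same check.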

Actions #4

Updated by frank lin about 5 years ago

Josh Durgin wrote:

It seems the first step would be reporting the crc mismatches via a perfcounter. Then the mgr could look at those to either raise a warning or automatically mark affected OSDs out.

I think this would be a good approach.

Will there be a patch for this kind of problem?
