Bug #38012
Bad crc on one OSD causes the whole cluster to stop accepting new requests
Description
I have encountered this problem on both a jewel cluster and a luminous cluster.
The symptom is that some requests get blocked forever and the whole
cluster is no longer able to receive any data. Further investigation
shows the blocked requests happened on 2 OSDs (the pool size is 2, so I
guess it affects all OSDs in the same acting set). The OSD logs contain
many repeated error messages like the ones below.
error messages in the jewel cluster
first affected OSD
2019-01-21 10:08:18.836864 941f9b10 0 bad crc in data 1155321283 != exp 1909584237
2019-01-21 10:08:18.837013 941f9b10 0 -- 10.0.2.39:6800/23795 >> 10.0.2.40:6804/28471 pipe(0x7c15c000 sd=149 :53242 s=2 pgs=5 cs=1 l=0 c=0x7a697e60).fault, initiating reconnect
2019-01-21 10:08:18.839328 7cbb4b10 0 -- 10.0.2.39:6800/23795 >> 10.0.2.40:6804/28471 pipe(0x6b782b00 sd=235 :6800 s=0 pgs=0 cs=0 l=0 c=0x8b2f5440).accept connect_seq 2 vs existing 2 state connecting
2019-01-21 10:08:18.850772 7cbb4b10 0 bad crc in data 1155321283 != exp 1909584237
2019-01-21 10:08:18.850910 7cbb4b10 0 -- 10.0.2.39:6800/23795 >> 10.0.2.40:6804/28471 pipe(0x6b782b00 sd=235 :6800 s=2 pgs=58 cs=3 l=0 c=0x7a697e60).fault with nothing to send, going to standby
second affected OSD
2019-01-21 10:06:12.282115 9513cb10 0 bad crc in data 1035875608 != exp 3787091679
2019-01-21 10:06:12.290395 abdcdb10 0 -- 10.0.1.40:6804/28471 submit_message osd_op_reply(1031289084 rbd_data.28ae2238e1f29.0000000000a7df16 [set-alloc-hint object_size 4194304 write_size 4194304,write 65536~524288] v5503'1224666 uv1224666 ondisk = 0) v7 remote, 10.0.1.121:0/3226500701, failed lossy con, dropping message 0x70df6800
2019-01-21 10:06:12.297356 9eb3cb10 0 -- 10.0.1.40:6804/28471 submit_message osd_op_reply(1031289067 rbd_data.28ae2238e1f29.0000000000a7defd [set-alloc-hint object_size 4194304 write_size 4194304,write 3211264~524288] v5503'1236405 uv1236405 ondisk = 0) v7 remote, 10.0.1.121:0/3226500701, failed lossy con, dropping message 0x716a0e00
2019-01-21 10:06:12.303597 abdcdb10 0 -- 10.0.1.40:6804/28471 submit_message osd_op_reply(1031289085 rbd_data.28ae2238e1f29.0000000000a7df16 [set-alloc-hint object_size 4194304 write_size 4194304,write 589824~524288] v5503'1224667 uv1224667 ondisk = 0) v7 remote, 10.0.1.121:0/3226500701, failed lossy con, dropping message 0x6e537000
2019-01-21 10:06:12.310642 9c33cb10 0 -- 10.0.1.40:6804/28471 submit_message osd_op_reply(1031289069 rbd_data.28ae2238e1f29.0000000000a7defd [set-alloc-hint object_size 4194304 write_size 4194304,write 3735552~458752] v5503'1236406 uv1236406 ondisk = 0) v7 remote, 10.0.1.121:0/3226500701, failed lossy con, dropping message 0x71655000
2019-01-21 10:08:18.837438 94b3cb10 0 -- 10.0.2.40:6804/28471 >> 10.0.2.39:6800/23795 pipe(0x888acd00 sd=129 :6804 s=2 pgs=3916 cs=1 l=0 c=0x9202a7e0).fault, initiating reconnect
2019-01-21 10:08:18.839301 9323cb10 0 -- 10.0.2.40:6804/28471 >> 10.0.2.39:6800/23795 pipe(0x702a6000 sd=80 :6804 s=0 pgs=0 cs=0 l=0 c=0x8d086480).accept connect_seq 2 vs existing 2 state connecting
2019-01-21 10:08:18.851839 94b3cb10 0 -- 10.0.2.40:6804/28471 >> 10.0.2.39:6800/23795 pipe(0x888acd00 sd=129 :42636 s=2 pgs=3930 cs=3 l=0 c=0x9202a7e0).fault, initiating reconnect
2019-01-21 10:08:18.860245 7eaf5b10 0 -- 10.0.2.40:6804/28471 >> 10.0.2.39:6800/23795 pipe(0x888acd00 sd=129 :42636 s=1 pgs=3930 cs=4 l=0 c=0x9202a7e0).fault
2019-01-21 10:08:18.877537 94b3cb10 0 -- 10.0.2.40:6804/28471 >> 10.0.2.39:6800/23795 pipe(0x888acd00 sd=80 :42638 s=2 pgs=3931 cs=5 l=0 c=0x9202a7e0).fault, initiating reconnect
error messages in the luminous cluster
first affected OSD
2018-12-11 23:14:43.034926 b560c8e0 0 -- 10.0.2.2:6802/15865 >> 10.0.2.37:6800/13016 conn(0x7648e00 :-1 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=8870304 cs=8856031 l=0).handle_connect_msg: challenging authorizer
2018-12-11 23:14:43.042915 b560c8e0 0 bad crc in front 1566330326 != exp 3283985696
2018-12-11 23:14:43.044587 b4e0c8e0 0 -- 10.0.2.2:6802/15865 >> 10.0.2.37:6800/13016 conn(0x919e700 :6802 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg: challenging authorizer
2018-12-11 23:14:43.045153 b4e0c8e0 0 -- 10.0.2.2:6802/15865 >> 10.0.2.37:6800/13016 conn(0x919e700 :6802 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 8856034 vs existing csq=8856033 existing_state=STATE_STANDBY
second affected OSD
2018-12-11 23:15:23.693508 b56158e0 0 -- 10.0.2.37:6800/13016 >> 10.0.2.2:6802/15865 conn(0x2f984e00 :6800 s=STATE_OPEN pgs=4450977 cs=8863341 l=0).fault initiating reconnect
2018-12-11 23:15:23.704284 b56158e0 0 -- 10.0.2.37:6800/13016 >> 10.0.2.2:6802/15865 conn(0x2f984e00 :6800 s=STATE_OPEN pgs=4450978 cs=8863343 l=0).fault initiating reconnect
2018-12-11 23:15:23.714925 b56158e0 0 -- 10.0.2.37:6800/13016 >> 10.0.2.2:6802/15865 conn(0x2f984e00 :6800 s=STATE_OPEN pgs=4450979 cs=8863345 l=0).fault initiating reconnect
2018-12-11 23:15:23.725507 b56158e0 0 -- 10.0.2.37:6800/13016 >> 10.0.2.2:6802/15865 conn(0x2f984e00 :6800 s=STATE_OPEN pgs=4450980 cs=8863347 l=0).fault initiating reconnect
What could cause the bad crc error? Faulty memory or other hardware?
Either way, I don't think it is right for a few faulty OSDs to make the whole cluster stop working.
Updated by Josh Durgin about 5 years ago
This is likely to be bad networking hardware - the CRC at the ceph level that is failing is designed to detect exactly that kind of issue, and we have seen it do so in the past, iirc usually a bad NIC, or less probably a driver bug.
Are the heartbeat messages still getting through somehow, so that the OSDs aren't marked as failed? In most networking hardware failures they should be affected as well, so the whole cluster would not be affected.
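The kind of corruption this messenger-level checksum is designed to catch can be illustrated with a short Python sketch. Note this uses zlib.crc32 as a stand-in for the crc32c algorithm Ceph actually uses; the payload bytes are made up for the example:

```python
import zlib

# Sender computes a checksum over the message payload and transmits both.
payload = bytearray(b"rbd_data.28ae2238e1f29 write 65536~524288")
expected_crc = zlib.crc32(bytes(payload))

# Faulty hardware (NIC, memory) flips a single bit somewhere in transit.
corrupted = bytearray(payload)
corrupted[10] ^= 0x01

# The receiver recomputes the checksum over what it received; any bit
# flip produces a mismatch, which is what the "bad crc in data X != exp Y"
# log lines above are reporting.
actual_crc = zlib.crc32(bytes(corrupted))
assert actual_crc != expected_crc
```

The connection is then faulted and reconnected, which is why the logs show the same mismatch repeating: the sender keeps retransmitting through the same bad path.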
Updated by frank lin about 5 years ago
Josh Durgin wrote:
This is likely to be bad networking hardware - the CRC at the ceph level that is failing is designed to detect exactly that kind of issue, and we have seen it do so in the past, iirc usually a bad NIC, or less probably a driver bug.
Are the heartbeat messages still getting through somehow, so that the OSDs aren't marked as failed? In most networking hardware failures they should be affected as well, so the whole cluster would not be affected.
I am not sure what the root cause is. I thought it was related to bad memory, because I have seen this problem on a node that had a faulty memory module (non-ECC memory).
My concern is how to make the cluster deal with this kind of problem automatically, because in this situation the OSD was not marked down and all requests were blocked until I manually restarted the OSD.
Updated by Josh Durgin about 5 years ago
It seems the first step would be reporting the crc mismatches via a perfcounter. Then the mgr could look at those to either raise a warning or automatically mark affected OSDs out.
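The mgr-side policy being proposed could look roughly like the following sketch. The counter name "msgr_crc_mismatches" and both thresholds are hypothetical, not existing Ceph perfcounters or defaults:

```python
# Sketch: given per-OSD crc-mismatch counts (as would be exposed via a
# perfcounter such as a hypothetical "msgr_crc_mismatches"), decide which
# OSDs to warn about and which to propose marking out.
WARN_THRESHOLD = 10   # hypothetical: raise a health warning
OUT_THRESHOLD = 100   # hypothetical: candidate for automatic mark-out

def evaluate_osds(counters):
    """counters: dict mapping OSD id -> observed crc mismatch count."""
    warnings, mark_out = [], []
    for osd_id, mismatches in sorted(counters.items()):
        if mismatches >= OUT_THRESHOLD:
            mark_out.append(osd_id)
        elif mismatches >= WARN_THRESHOLD:
            warnings.append(osd_id)
    return warnings, mark_out

# Example: osd.3 shows occasional mismatches, osd.7 is clearly bad.
warnings, mark_out = evaluate_osds({1: 0, 3: 25, 7: 500})
assert warnings == [3] and mark_out == [7]
```

In a real mgr module this would run periodically against the collected perfcounters, with the thresholds made configurable and some rate-of-change logic so a single transient error does not mark an OSD out.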
Updated by frank lin about 5 years ago
Josh Durgin wrote:
It seems the first step would be reporting the crc mismatches via a perfcounter. Then the mgr could look at those to either raise a warning or automatically mark affected OSDs out.
I think that would be a good approach.
Will there be a patch for this kind of problem?