Project

General

Profile

Actions

Bug #47003

closed

ceph_test_rados test error. Reponses out of order due to the connection drops data.

Added by Zhenyi Shu over 3 years ago. Updated about 3 years ago.

Status:
Duplicate
Priority:
Urgent
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

http://pulpito.front.sepia.ceph.com/shuzhenyi-2020-08-17_01:22:26-rados:thrash-wip-shuzhenyi-testing-2020-08-15-1304-distro-basic-smithi/5349936/

I found logs below:

2020-08-17T01:36:37.872 INFO:tasks.rados.rados.0.smithi158.stdout:780:  writing smithi15817201-780 from 692653 to 1179630 tid 1
2020-08-17T01:36:37.878 INFO:tasks.rados.rados.0.smithi158.stdout:780:  writing smithi15817201-780 from 1602061 to 2249664 tid 2
2020-08-17T01:36:37.883 INFO:tasks.rados.rados.0.smithi158.stdout:780:  writing smithi15817201-780 from 2992629 to 3092653 tid 3
2020-08-17T01:36:38.478 INFO:tasks.rados.rados.0.smithi158.stdout:780:  finishing write tid 1 to smithi15817201-780
2020-08-17T01:36:38.478 INFO:tasks.rados.rados.0.smithi158.stdout:780:  finishing write tid 3 to smithi15817201-780
2020-08-17T01:36:38.478 INFO:tasks.rados.rados.0.smithi158.stdout:780:  finishing write tid 5 to smithi15817201-780
2020-08-17T01:36:38.478 INFO:tasks.rados.rados.0.smithi158.stdout:780:  finishing write tid 2 to smithi15817201-780
2020-08-17T01:36:38.479 INFO:tasks.rados.rados.0.smithi158.stderr:Error: finished tid 2 when last_acked_tid was 5
2020-08-17T01:36:38.479 INFO:tasks.rados.rados.0.smithi158.stderr:/build/ceph-16.0.0-4262-g9943626/src/test/osd/RadosModel.h: In function 'virtual void WriteOp::_finish(TestOp::CallbackInfo*)' thread 7f3ab6ffd700 time 2020-08-17T01:36:38.435642+0000
2020-08-17T01:36:38.479 INFO:tasks.rados.rados.0.smithi158.stderr:/build/ceph-16.0.0-4262-g9943626/src/test/osd/RadosModel.h: 907: ceph_abort_msg("abort() called")
2020-08-17T01:36:38.479 INFO:tasks.rados.rados.0.smithi158.stderr: ceph version 16.0.0-4262-g9943626 (9943626c67ac05201c815e9e878fd203ee096d94) pacific (dev)
2020-08-17T01:36:38.479 INFO:tasks.rados.rados.0.smithi158.stderr: 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xe1) [0x7f3aca390997]
2020-08-17T01:36:38.480 INFO:tasks.rados.rados.0.smithi158.stderr: 2: (WriteOp::_finish(TestOp::CallbackInfo*)+0x59c) [0x55d5add9c94c]
2020-08-17T01:36:38.480 INFO:tasks.rados.rados.0.smithi158.stderr: 3: (write_callback(void*, void*)+0x19) [0x55d5addb9009]
2020-08-17T01:36:38.480 INFO:tasks.rados.rados.0.smithi158.stderr: 4: (()+0x98897) [0x7f3ad3068897]
2020-08-17T01:36:38.480 INFO:tasks.rados.rados.0.smithi158.stderr: 5: (()+0xb2665) [0x7f3ad3082665]
2020-08-17T01:36:38.480 INFO:tasks.rados.rados.0.smithi158.stderr: 6: (()+0xb54ff) [0x7f3ad30854ff]
2020-08-17T01:36:38.481 INFO:tasks.rados.rados.0.smithi158.stderr: 7: (()+0xb5846) [0x7f3ad3085846]
2020-08-17T01:36:38.481 INFO:tasks.rados.rados.0.smithi158.stderr: 8: (()+0xbd6df) [0x7f3ac9c426df]
2020-08-17T01:36:38.481 INFO:tasks.rados.rados.0.smithi158.stderr: 9: (()+0x76db) [0x7f3ac9f156db]
2020-08-17T01:36:38.481 INFO:tasks.rados.rados.0.smithi158.stderr: 10: (clone()+0x3f) [0x7f3ac969da3f]

Then I grep ceph-osd.5.log, which is primary osd's log, and found logs below:

375153:2020-08-17T01:36:37.882+0000 7f551125e700  1 -- v2:172.21.15.66:6800/13577 --> 172.21.15.158:0/3294166163 -- osd_op_reply(3505 smithi15817201-780 [write 692653~486977,stat] v29'170 uv170 ondisk = 0) v8 -- 0x561fb0ce2880 con 0x561fb3824000
375156:2020-08-17T01:36:37.882+0000 7f551125e700  5 --2- v2:172.21.15.66:6800/13577 >> 172.21.15.158:0/3294166163 conn(0x561fb3824000 0x561fb0da3900 crc :-1 s=READY pgs=19 cs=0 l=1 rev1=1 rx=0 tx=0).send_message enqueueing message m=0x561fb0ce2880 type=43 osd_op_reply(3505 smithi15817201-780 [write 692653~486977,stat] v29'170 uv170 ondisk = 0) v8
375170:2020-08-17T01:36:37.882+0000 7f552d12d700 20 --2- v2:172.21.15.66:6800/13577 >> 172.21.15.158:0/3294166163 conn(0x561fb3824000 0x561fb0da3900 crc :-1 s=READY pgs=19 cs=0 l=1 rev1=1 rx=0 tx=0).prepare_send_message m=osd_op_reply(3505 smithi15817201-780 [write 692653~486977,stat] v29'170 uv170 ondisk = 0) v8
375172:2020-08-17T01:36:37.882+0000 7f552d12d700 20 --2- v2:172.21.15.66:6800/13577 >> 172.21.15.158:0/3294166163 conn(0x561fb3824000 0x561fb0da3900 crc :-1 s=READY pgs=19 cs=0 l=1 rev1=1 rx=0 tx=0).prepare_send_message encoding features 4540138292837744639 0x561fb0ce2880 osd_op_reply(3505 smithi15817201-780 [write 692653~486977,stat] v29'170 uv170 ondisk = 0) v8
375174:2020-08-17T01:36:37.882+0000 7f552d12d700  5 --2- v2:172.21.15.66:6800/13577 >> 172.21.15.158:0/3294166163 conn(0x561fb3824000 0x561fb0da3900 crc :-1 s=READY pgs=19 cs=0 l=1 rev1=1 rx=0 tx=0).write_message sending message m=0x561fb0ce2880 seq=91 osd_op_reply(3505 smithi15817201-780 [write 692653~486977,stat] v29'170 uv170 ondisk = 0) v8
376811:2020-08-17T01:36:37.886+0000 7f550d256700  1 -- v2:172.21.15.66:6800/13577 --> 172.21.15.158:0/3294166163 -- osd_op_reply(3506 smithi15817201-780 [write 1602061~647603] v29'171 uv171 ondisk = 0) v8 -- 0x561fb0ce1440 con 0x561fb3824000
376812:2020-08-17T01:36:37.886+0000 7f550d256700 10 --2- v2:172.21.15.66:6800/13577 >> 172.21.15.158:0/3294166163 conn(0x561fb3824000 0x561fb0da3900 unknown :-1 s=CLOSED pgs=19 cs=0 l=1 rev1=1 rx=0 tx=0).send_message connection closed. Drop message 0x561fb0ce1440
376860:2020-08-17T01:36:37.886+0000 7f551125e700  1 -- v2:172.21.15.66:6800/13577 --> 172.21.15.158:0/3294166163 -- osd_op_reply(3507 smithi15817201-780 [write 2992629~100024] v29'172 uv172 ondisk = 0) v8 -- 0x561fb38bc400 con 0x561faac7bc00
376861:2020-08-17T01:36:37.886+0000 7f551125e700  5 --2- v2:172.21.15.66:6800/13577 >> 172.21.15.158:0/3294166163 conn(0x561faac7bc00 0x561facccca00 crc :-1 s=READY pgs=26 cs=0 l=1 rev1=1 rx=0 tx=0).send_message enqueueing message m=0x561fb38bc400 type=43 osd_op_reply(3507 smithi15817201-780 [write 2992629~100024] v29'172 uv172 ondisk = 0) v8
376868:2020-08-17T01:36:37.886+0000 7f552d12d700 20 --2- v2:172.21.15.66:6800/13577 >> 172.21.15.158:0/3294166163 conn(0x561faac7bc00 0x561facccca00 crc :-1 s=READY pgs=26 cs=0 l=1 rev1=1 rx=0 tx=0).prepare_send_message m=osd_op_reply(3507 smithi15817201-780 [write 2992629~100024] v29'172 uv172 ondisk = 0) v8
376869:2020-08-17T01:36:37.886+0000 7f552d12d700 20 --2- v2:172.21.15.66:6800/13577 >> 172.21.15.158:0/3294166163 conn(0x561faac7bc00 0x561facccca00 crc :-1 s=READY pgs=26 cs=0 l=1 rev1=1 rx=0 tx=0).prepare_send_message encoding features 4540138292837744639 0x561fb38bc400 osd_op_reply(3507 smithi15817201-780 [write 2992629~100024] v29'172 uv172 ondisk = 0) v8
376870:2020-08-17T01:36:37.886+0000 7f552d12d700  5 --2- v2:172.21.15.66:6800/13577 >> 172.21.15.158:0/3294166163 conn(0x561faac7bc00 0x561facccca00 crc :-1 s=READY pgs=26 cs=0 l=1 rev1=1 rx=0 tx=0).write_message sending message m=0x561fb38bc400 seq=1 osd_op_reply(3507 smithi15817201-780 [write 2992629~100024] v29'172 uv172 ondisk = 0) v8
376972:2020-08-17T01:36:37.890+0000 7f550d256700  1 -- v2:172.21.15.66:6800/13577 --> 172.21.15.158:0/3294166163 -- osd_op_reply(3506 smithi15817201-780 [write 1602061~647603] v29'171 uv171 ondisk = 0) v8 -- 0x561fb0ce1440 con 0x561faac7bc00
376973:2020-08-17T01:36:37.890+0000 7f550d256700  5 --2- v2:172.21.15.66:6800/13577 >> 172.21.15.158:0/3294166163 conn(0x561faac7bc00 0x561facccca00 crc :-1 s=READY pgs=26 cs=0 l=1 rev1=1 rx=0 tx=0).send_message enqueueing message m=0x561fb0ce1440 type=43 osd_op_reply(3506 smithi15817201-780 [write 1602061~647603] v29'171 uv171 ondisk = 0) v8
376978:2020-08-17T01:36:37.890+0000 7f552d12d700 20 --2- v2:172.21.15.66:6800/13577 >> 172.21.15.158:0/3294166163 conn(0x561faac7bc00 0x561facccca00 crc :-1 s=READY pgs=26 cs=0 l=1 rev1=1 rx=0 tx=0).prepare_send_message m=osd_op_reply(3506 smithi15817201-780 [write 1602061~647603] v29'171 uv171 ondisk = 0) v8
376979:2020-08-17T01:36:37.890+0000 7f552d12d700 20 --2- v2:172.21.15.66:6800/13577 >> 172.21.15.158:0/3294166163 conn(0x561faac7bc00 0x561facccca00 crc :-1 s=READY pgs=26 cs=0 l=1 rev1=1 rx=0 tx=0).prepare_send_message encoding features 4540138292837744639 0x561fb0ce1440 osd_op_reply(3506 smithi15817201-780 [write 1602061~647603] v29'171 uv171 ondisk = 0) v8
376980:2020-08-17T01:36:37.890+0000 7f552d12d700  5 --2- v2:172.21.15.66:6800/13577 >> 172.21.15.158:0/3294166163 conn(0x561faac7bc00 0x561facccca00 crc :-1 s=READY pgs=26 cs=0 l=1 rev1=1 rx=0 tx=0).write_message sending message m=0x561fb0ce1440 seq=3 osd_op_reply(3506 smithi15817201-780 [write 1602061~647603] v29'171 uv171 ondisk = 0) v8

It seems like at the point of sending tid2's data, connection dropped the data and cloed.


Related issues 1 (0 open1 closed)

Is duplicate of RADOS - Bug #48793: out of order opResolvedRonen Friedman

Actions
Actions #1

Updated by Zhenyi Shu over 3 years ago

The logs I grep "0x561fb3824000" below shows that the connection was reset.

375180:2020-08-17T01:36:37.882+0000 7f552d12d700 10 --2- v2:172.21.15.66:6800/13577 >> 172.21.15.158:0/3294166163 conn(0x561fb3824000 0x561fb0da3900 crc :-1 s=READY pgs=19 cs=0 l=1 rev1=1 rx=0 tx=0).write_message sending 0x561fb0ce2880 done.
375187:2020-08-17T01:36:37.882+0000 7f552d12d700 20 -- v2:172.21.15.66:6800/13577 >> 172.21.15.158:0/3294166163 conn(0x561fb3824000 msgr2=0x561fb0da3900 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).process
375189:2020-08-17T01:36:37.882+0000 7f552d12d700 20 -- v2:172.21.15.66:6800/13577 >> 172.21.15.158:0/3294166163 conn(0x561fb3824000 msgr2=0x561fb0da3900 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read continue len=32
375190:2020-08-17T01:36:37.882+0000 7f552d12d700  1 -- v2:172.21.15.66:6800/13577 >> 172.21.15.158:0/3294166163 conn(0x561fb3824000 msgr2=0x561fb0da3900 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_bulk peer close file descriptor 93
375191:2020-08-17T01:36:37.882+0000 7f552d12d700  1 -- v2:172.21.15.66:6800/13577 >> 172.21.15.158:0/3294166163 conn(0x561fb3824000 msgr2=0x561fb0da3900 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_until read failed
375193:2020-08-17T01:36:37.882+0000 7f552d12d700 20 --2- v2:172.21.15.66:6800/13577 >> 172.21.15.158:0/3294166163 conn(0x561fb3824000 0x561fb0da3900 crc :-1 s=READY pgs=19 cs=0 l=1 rev1=1 rx=0 tx=0).handle_read_frame_preamble_main r=-1
375195:2020-08-17T01:36:37.882+0000 7f552d12d700  1 --2- v2:172.21.15.66:6800/13577 >> 172.21.15.158:0/3294166163 conn(0x561fb3824000 0x561fb0da3900 crc :-1 s=READY pgs=19 cs=0 l=1 rev1=1 rx=0 tx=0).handle_read_frame_preamble_main read frame preamble failed r=-1 ((1) Operation not permitted)
375196:2020-08-17T01:36:37.882+0000 7f552d12d700 10 --2- v2:172.21.15.66:6800/13577 >> 172.21.15.158:0/3294166163 conn(0x561fb3824000 0x561fb0da3900 crc :-1 s=READY pgs=19 cs=0 l=1 rev1=1 rx=0 tx=0)._fault
375197:2020-08-17T01:36:37.882+0000 7f552d12d700  2 --2- v2:172.21.15.66:6800/13577 >> 172.21.15.158:0/3294166163 conn(0x561fb3824000 0x561fb0da3900 crc :-1 s=READY pgs=19 cs=0 l=1 rev1=1 rx=0 tx=0)._fault on lossy channel, failing
375198:2020-08-17T01:36:37.882+0000 7f552d12d700  1 --2- v2:172.21.15.66:6800/13577 >> 172.21.15.158:0/3294166163 conn(0x561fb3824000 0x561fb0da3900 crc :-1 s=READY pgs=19 cs=0 l=1 rev1=1 rx=0 tx=0).stop
375200:2020-08-17T01:36:37.882+0000 7f552d12d700  5 --2- v2:172.21.15.66:6800/13577 >> 172.21.15.158:0/3294166163 conn(0x561fb3824000 0x561fb0da3900 crc :-1 s=READY pgs=19 cs=0 l=1 rev1=1 rx=0 tx=0).reset_recv_state
375201:2020-08-17T01:36:37.882+0000 7f552d12d700  5 --2- v2:172.21.15.66:6800/13577 >> 172.21.15.158:0/3294166163 conn(0x561fb3824000 0x561fb0da3900 crc :-1 s=READY pgs=19 cs=0 l=1 rev1=1 rx=0 tx=0).reset_security
375202:2020-08-17T01:36:37.882+0000 7f552d12d700 10 --2- v2:172.21.15.66:6800/13577 >> 172.21.15.158:0/3294166163 conn(0x561fb3824000 0x561fb0da3900 unknown :-1 s=READY pgs=19 cs=0 l=1 rev1=1 rx=0 tx=0).discard_out_queue started
375206:2020-08-17T01:36:37.882+0000 7f551c274700  2 osd.5 29 ms_handle_reset con 0x561fb3824000 session 0x561fb2db2000
376811:2020-08-17T01:36:37.886+0000 7f550d256700  1 -- v2:172.21.15.66:6800/13577 --> 172.21.15.158:0/3294166163 -- osd_op_reply(3506 smithi15817201-780 [write 1602061~647603] v29'171 uv171 ondisk = 0) v8 -- 0x561fb0ce1440 con 0x561fb3824000
376812:2020-08-17T01:36:37.886+0000 7f550d256700 10 --2- v2:172.21.15.66:6800/13577 >> 172.21.15.158:0/3294166163 conn(0x561fb3824000 0x561fb0da3900 unknown :-1 s=CLOSED pgs=19 cs=0 l=1 rev1=1 rx=0 tx=0).send_message connection closed. Drop message 0x561fb0ce1440
376988:2020-08-17T01:36:37.890+0000 7f550d256700 15 osd.5 pg_epoch: 29 pg[3.0( v 29'173 (29'170,29'173] local-lis/les=19/20 n=50 ec=19/19 lis/c=19/19 les/c/f=20/20/0 sis=19) [5,7,2] r=0 lpr=19 luod=29'171 crt=29'173 lcod 29'170 mlcod 29'170 active+clean] do_osd_op_effects client.4600 con 0x561fb3824000
Actions #2

Updated by Neha Ojha over 3 years ago

  • Project changed from Ceph to RADOS
  • Priority changed from Normal to Urgent

rados/thrash-erasure-code/{ceph clusters/{fixed-2 openstack} fast/fast mon_election/classic msgr-failures/few objectstore/bluestore-comp-zlib rados recovery-overrides/{more-active-recovery} supported-random-distro$/{centos_8} thrashers/minsize_recovery thrashosds-health workloads/ec-small-objects-many-deletes}

/a/yuriw-2021-01-15_19:06:33-rados-wip-yuri8-testing-master-2021-01-15-0935-distro-basic-smithi/5789308

Actions #3

Updated by Kefu Chai about 3 years ago

/a//kchai-2021-01-28_03:28:19-rados-wip-kefu-testing-2021-01-27-1353-distro-basic-smithi/5834177

Actions #4

Updated by Neha Ojha about 3 years ago

  • Status changed from New to Duplicate
Actions #5

Updated by Neha Ojha about 3 years ago

  • Is duplicate of Bug #48793: out of order op added
Actions

Also available in: Atom PDF