Bug #21173

closed

OSD crash trying to decode erasure-coded data from corrupted shards

Added by Mustafa Muhammad over 6 years ago. Updated over 6 years ago.

Status: Won't Fix
Priority: Normal
Assignee: David Zafman
Category: OSD
Target version:
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

OSDs crash continuously when they try to backfill or recover certain objects. After inspecting these objects, I found a size mismatch between their shard files. This is the crashing thread:

3941> 2017-08-29 22:09:07.666374 7fd478bc2700  1 - 192.168.216.114:6816/37284 --> 192.168.216.105:6804/27366 -- osd_map(1106979..1106980 src has 960094..1106980) v3 -- 0x7fd500f01680 con 0
3940> 2017-08-29 22:09:07.666439 7fd478bc2700 1 - 192.168.216.114:6816/37284 --> 192.168.216.105:6804/27366 -- MOSDECSubOpRead(143.371s3 1106980/1105690 ECSubRead(tid=23684, to_read={143:8efa89b4:::default.63296332.1__shadow_304299676.2~3Mnd94mMNaUb0tK0fEB5bZn8YrB-A-F.1_2:head=0,932160,0}, attrs_to_read=)) v3 -- 0x7fd50fdda080 con 0
3939> 2017-08-29 22:09:07.666493 7fd478bc2700 1 - 192.168.216.114:6816/37284 --> 192.168.216.114:6816/37284 -- MOSDECSubOpRead(143.371s0 1106980/1105690 ECSubRead(tid=23684, to_read={143:8efa89b4:::default.63296332.1__shadow_304299676.2~3Mnd94mMNaUb0tK0fEB5bZn8YrB-A-F.1_2:head=0,932160,0}, attrs_to_read=)) v3 -- 0x7fd4ad516a00 con 0
3938> 2017-08-29 22:09:07.666534 7fd478bc2700 1 - 192.168.216.114:6816/37284 --> 192.168.216.113:6802/12506 -- osd_map(1106979..1106980 src has 960094..1106980) v3 -- 0x7fd504198280 con 0
3937> 2017-08-29 22:09:07.666560 7fd478bc2700 1 - 192.168.216.114:6816/37284 --> 192.168.216.113:6802/12506 -- MOSDECSubOpRead(143.371s9 1106980/1105690 ECSubRead(tid=23684, to_read={143:8efa89b4:::default.63296332.1__shadow_304299676.2~3Mnd94mMNaUb0tK0fEB5bZn8YrB-A-F.1_2:head=0,932160,0}, attrs_to_read=)) v3 -- 0x7fd50fdda580 con 0
3935> 2017-08-29 22:09:07.666665 7fd478bc2700 1 - 192.168.216.114:6816/37284 --> 192.168.216.117:6802/12636 -- MOSDECSubOpRead(143.371s8 1106980/1105690 ECSubRead(tid=23684, to_read={143:8efa89b4:::default.63296332.1__shadow_304299676.2~3Mnd94mMNaUb0tK0fEB5bZn8YrB-A-F.1_2:head=0,932160,0}, attrs_to_read=)) v3 -- 0x7fd52e537980 con 0
3934> 2017-08-29 22:09:07.666735 7fd478bc2700 1 - 192.168.216.114:6816/37284 --> 192.168.216.109:6816/13028 -- osd_map(1106980..1106980 src has 960094..1106980) v3 -- 0x7fd50ca64c80 con 0
3933> 2017-08-29 22:09:07.666749 7fd478bc2700 1 - 192.168.216.114:6816/37284 --> 192.168.216.109:6816/13028 -- MOSDECSubOpRead(143.371s7 1106980/1105690 ECSubRead(tid=23684, to_read={143:8efa89b4:::default.63296332.1__shadow_304299676.2~3Mnd94mMNaUb0tK0fEB5bZn8YrB-A-F.1_2:head=0,932160,0}, attrs_to_read=)) v3 -- 0x7fd5170e8f00 con 0
3932> 2017-08-29 22:09:07.666819 7fd478bc2700 1 - 192.168.216.114:6816/37284 --> 192.168.216.110:6802/12139 -- osd_map(1106979..1106980 src has 960094..1106980) v3 -- 0x7fd4e85e4d00 con 0
3931> 2017-08-29 22:09:07.666837 7fd478bc2700 1 - 192.168.216.114:6816/37284 --> 192.168.216.110:6802/12139 -- MOSDECSubOpRead(143.371s5 1106980/1105690 ECSubRead(tid=23684, to_read={143:8efa89b4:::default.63296332.1__shadow_304299676.2~3Mnd94mMNaUb0tK0fEB5bZn8YrB-A-F.1_2:head=0,932160,0}, attrs_to_read=)) v3 -- 0x7fd503fe2300 con 0
3930> 2017-08-29 22:09:07.666863 7fd478bc2700 1 - 192.168.216.114:6816/37284 --> 192.168.216.123:6810/34596 -- MOSDECSubOpRead(143.371s1 1106980/1105690 ECSubRead(tid=23684, to_read={143:8efa89b4:::default.63296332.1__shadow_304299676.2~3Mnd94mMNaUb0tK0fEB5bZn8YrB-A-F.1_2:head=0,932160,0}, attrs_to_read=)) v3 -- 0x7fd50cff3980 con 0
3929> 2017-08-29 22:09:07.666900 7fd478bc2700 1 - 192.168.216.114:6816/37284 --> 192.168.216.135:6814/17687 -- MOSDECSubOpRead(143.371s6 1106980/1105690 ECSubRead(tid=23684, to_read={143:8efa89b4:::default.63296332.1__shadow_304299676.2~3Mnd94mMNaUb0tK0fEB5bZn8YrB-A-F.1_2:head=0,932160,0}, attrs_to_read=)) v3 -- 0x7fd4ae1bef00 con 0
3928> 2017-08-29 22:09:07.666926 7fd478bc2700 1 - 192.168.216.114:6816/37284 --> 192.168.216.136:6810/46475 -- MOSDECSubOpRead(143.371s4 1106980/1105690 ECSubRead(tid=23684, to_read={143:8efa89b4:::default.63296332.1__shadow_304299676.2~3Mnd94mMNaUb0tK0fEB5bZn8YrB-A-F.1_2:head=0,932160,0}, attrs_to_read=)) v3 -- 0x7fd50803e080 con 0
3927> 2017-08-29 22:09:07.666984 7fd478bc2700 1 - 192.168.216.114:6816/37284 --> 192.168.216.102:6807/31441 -- MOSDScrubReserve(143.1c8s0 GRANT e1106980) v1 -- 0x7fd500f5c800 con 0
3926> 2017-08-29 22:09:07.667437 7fd478bc2700 1 - 192.168.216.114:6816/37284 --> 192.168.216.114:6816/37284 -- MOSDECSubOpReadReply(143.371s0 1106980/1105690 ECSubReadReply(tid=23684, attrs_read=0)) v2 -- 0x7fd50ac558c0 con 0
0> 2017-08-29 22:09:08.098602 7fd478bc2700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.1.4/rpm/el7/BUILD/ceph-12.1.4/src/osd/ECUtil.cc: In function 'int ECUtil::decode(const ECUtil::stripe_info_t&, ceph::ErasureCodeInterfaceRef&, std::map<int, ceph::buffer::list>&, std::map<int, ceph::buffer::list*>&)' thread 7fd478bc2700 time 2017-08-29 22:09:08.094250
2017-08-29 22:09:08.169428 7fd478bc2700 -1 * Caught signal (Aborted) *
in thread 7fd478bc2700 thread_name:tp_osd_tp
0> 2017-08-29 22:09:08.169428 7fd478bc2700 -1 * Caught signal (Aborted) *
in thread 7fd478bc2700 thread_name:tp_osd_tp

We are trying to find the corrupted files using some tricks (after the OSD dies), but there are a great many of them. Below is a sample: most shards are 456K, but some are 60K. When we stop the OSD, remove these files, and start it again, the crash doesn't happen and recovery proceeds.

thread = 7fd478bc2700
PG=143.371
NAME_PART=3Mnd94mMNaUb0tK0fEB5bZn8YrB-A-F.1
143.371 63780 0 0 0 0 386436908769 3298 3298 activating+degraded+remapped 2017-08-29 22:22:00.092502 1105203'307439 1107026:3400572 [132,624,167,68,652,620,535,218,187,549,266,234] 132 [132,292,167,68,544,234,535,218,187,145,266,309] 132 960594'307163 2017-08-07 04:47:48.178394 958883'299934 2017-07-24 12:18:02.085306
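
The thread, PG, and object name above can be pulled out of the OSD log mechanically rather than by eye. The following is a minimal sketch, assuming the log lines have the same shape as the excerpt in this report; the regular expression and the helper name `sub_reads` are illustrative and not part of Ceph.

```python
import re

# Pattern for the MOSDECSubOpRead lines as they appear in the excerpt above.
# This is an assumption based on that excerpt, not a documented log format.
SUBREAD_RE = re.compile(
    r"\b(?P<thread>[0-9a-f]{12})\b.*?"
    r"--> (?P<peer>\d+\.\d+\.\d+\.\d+:\d+/\d+) -- "
    r"MOSDECSubOpRead\((?P<pg>\d+\.[0-9a-f]+)s(?P<shard>\d+) .*?"
    r"to_read=\{(?P<oid>[^=]+)="
)

# One line copied from the excerpt, used here as sample input.
SAMPLE = (
    "3940> 2017-08-29 22:09:07.666439 7fd478bc2700 1 - "
    "192.168.216.114:6816/37284 --> 192.168.216.105:6804/27366 -- "
    "MOSDECSubOpRead(143.371s3 1106980/1105690 ECSubRead(tid=23684, "
    "to_read={143:8efa89b4:::default.63296332.1__shadow_304299676."
    "2~3Mnd94mMNaUb0tK0fEB5bZn8YrB-A-F.1_2:head=0,932160,0}, "
    "attrs_to_read=)) v3 -- 0x7fd50fdda080 con 0"
)


def sub_reads(lines, thread):
    """Yield (pg, shard, peer, object) for each sub-read sent by `thread`."""
    for line in lines:
        m = SUBREAD_RE.search(line)
        if m and m.group("thread") == thread:
            yield (m.group("pg"), int(m.group("shard")),
                   m.group("peer"), m.group("oid"))


if __name__ == "__main__":
    for pg, shard, peer, oid in sub_reads([SAMPLE], "7fd478bc2700"):
        print(pg, shard, peer, oid)
```

Feeding the whole log through `sub_reads` with the crashing thread id gives the PG, the shards that were asked to read, and the object name to look for on disk.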

456K /var/lib/ceph/osd/ceph-132/current/143.371s0_head/DIR_1/DIR_7/DIR_F/DIR_5/DIR_1/default.63296332.1\u\ushadow\u304299676.2~3Mnd94mMNaUb0tK0fEB5bZn8YrB-A-F.1\u2__head_2D915F71__8f_ffffffffffffffff_0
456K /var/lib/ceph/osd/ceph-68/current/143.371s3_head/DIR_1/DIR_7/DIR_F/DIR_5/DIR_1/default.63296332.1\u\ushadow\u304299676.2~3Mnd94mMNaUb0tK0fEB5bZn8YrB-A-F.1\u2__head_2D915F71__8f_ffffffffffffffff_3
456K /var/lib/ceph/osd/ceph-132/current/143.371s0_head/DIR_1/DIR_7/DIR_F/DIR_5/DIR_1/default.63296332.1\u\ushadow\u304299676.2~3Mnd94mMNaUb0tK0fEB5bZn8YrB-A-F.1\u2__head_2D915F71__8f_ffffffffffffffff_0
456K /var/lib/ceph/osd/ceph-145/current/143.371s9_head/DIR_1/DIR_7/DIR_F/DIR_5/DIR_1/default.63296332.1\u\ushadow\u304299676.2~3Mnd94mMNaUb0tK0fEB5bZn8YrB-A-F.1\u2__head_2D915F71__8f_ffffffffffffffff_9
60K /var/lib/ceph/osd/ceph-167/current/143.371s2_head/DIR_1/DIR_7/DIR_F/DIR_5/DIR_1/default.63296332.1\u\ushadow\u304299676.2~3Mnd94mMNaUb0tK0fEB5bZn8YrB-A-F.1\u2__head_2D915F71__8f_ffffffffffffffff_2
60K /var/lib/ceph/osd/ceph-187/current/143.371s8_head/DIR_1/DIR_7/DIR_F/DIR_5/DIR_1/default.63296332.1\u\ushadow\u304299676.2~3Mnd94mMNaUb0tK0fEB5bZn8YrB-A-F.1\u2__head_2D915F71__8f_ffffffffffffffff_8
456K /var/lib/ceph/osd/ceph-218/current/143.371s7_head/DIR_1/DIR_7/DIR_F/DIR_5/DIR_1/default.63296332.1\u\ushadow\u304299676.2~3Mnd94mMNaUb0tK0fEB5bZn8YrB-A-F.1\u2__head_2D915F71__8f_ffffffffffffffff_7
456K /var/lib/ceph/osd/ceph-266/current/143.371s10_head/DIR_1/DIR_7/DIR_F/DIR_5/DIR_1/default.63296332.1\u\ushadow\u304299676.2~3Mnd94mMNaUb0tK0fEB5bZn8YrB-A-F.1\u2__head_2D915F71__8f_ffffffffffffffff_a
456K /var/lib/ceph/osd/ceph-292/current/143.371s1_head/DIR_1/DIR_7/DIR_F/DIR_5/DIR_1/default.63296332.1\u\ushadow\u304299676.2~3Mnd94mMNaUb0tK0fEB5bZn8YrB-A-F.1\u2__head_2D915F71__8f_ffffffffffffffff_1
456K /var/lib/ceph/osd/ceph-309/current/143.371s11_head/DIR_1/DIR_7/DIR_F/DIR_5/DIR_1/default.63296332.1\u\ushadow\u304299676.2~3Mnd94mMNaUb0tK0fEB5bZn8YrB-A-F.1\u2__head_2D915F71__8f_ffffffffffffffff_b
456K /var/lib/ceph/osd/ceph-535/current/143.371s6_head/DIR_1/DIR_7/DIR_F/DIR_5/DIR_1/default.63296332.1\u\ushadow\u304299676.2~3Mnd94mMNaUb0tK0fEB5bZn8YrB-A-F.1\u2__head_2D915F71__8f_ffffffffffffffff_6
456K /var/lib/ceph/osd/ceph-544/current/143.371s4_head/DIR_1/DIR_7/DIR_F/DIR_5/DIR_1/default.63296332.1\u\ushadow\u304299676.2~3Mnd94mMNaUb0tK0fEB5bZn8YrB-A-F.1\u2__head_2D915F71__8f_ffffffffffffffff_4
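
The size comparison in the listing above can be automated once the shard files of an object have been located. The following is a rough sketch, assuming FileStore shard file names end in `_<shard id in hex>` as in the listing, and that the most common size among an object's shards is the expected one. Both are assumptions, and all function names are hypothetical. The script only reports candidates and does not remove anything.

```python
import os
import posixpath
import re
from collections import Counter, defaultdict

# Trailing "_<hex shard id>" as seen in the listing above (assumption).
SHARD_SUFFIX_RE = re.compile(r"_[0-9a-f]+$")


def object_key(path):
    """Drop the directory and shard suffix so all shards of an object match."""
    return SHARD_SUFFIX_RE.sub("", posixpath.basename(path))


def odd_sized_shards(sizes_by_path):
    """Return {object_key: [(path, size), ...]} for shards whose size
    differs from the most common size among that object's shards."""
    groups = defaultdict(list)
    for path, size in sizes_by_path.items():
        groups[object_key(path)].append((path, size))

    suspects = {}
    for key, entries in groups.items():
        common_size, _ = Counter(s for _, s in entries).most_common(1)[0]
        odd = sorted((p, s) for p, s in entries if s != common_size)
        if odd:
            suspects[key] = odd
    return suspects


def collect_sizes(paths):
    """Read on-disk sizes in bytes for the given shard file paths."""
    return {p: os.path.getsize(p) for p in paths}
```

A shard that differs from the majority is only a candidate for inspection. The heuristic cannot tell which side is actually corrupt, and it gives no useful answer when the sizes are evenly split.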
