osd/ECUtil.h: 117: FAILED assert(old_size == total_chunk_size) in 0.80.10
I have an erasure code pool setup with k=10,m=3 jerasure and ruleset-failure-domain=osd and I seem to be hitting a crash loop with the osds. Only a few of them are crashing them but it's enough that the entire ceph cluster is looking pretty banged up. I'm not exactly sure what caused the problem. We're just putting test data in the cluster so we're free to dump it if needed.
Here's a paste of the crash:
The log output is a mile long but I think I posted the most relevant parts.
#1 Updated by Chris Holcombe over 5 years ago
ceph df: https://gist.github.com/cholcombe973/c47e44d2615f3e6efbde the .rgw.buckets pool is erasure coded with k=10,m=3
rgw conf file: https://gist.github.com/cholcombe973/a7ec3e91290bb08b969a
There's no cache tier involved.
This cluster is using the new civetweb version of radosgw.
debug ms 20 of one of the crashing osd is attached.
I ran an xfs check on the partition and it came back clean.
#4 Updated by Chris Holcombe over 5 years ago
The workload that was happening before the crash was a boto multipart upload. There was a cron job of about 24 batches of these going. I killed the radosgw and restarted it to see if this continues happening.
Copying the irc convo here for clarity:
<sjusthm> cholcombe: ok, in master, an op like the one rgw is sending would have gotten an ENOTSUPP
<sjusthm> rather than crashing the osd
<cholcombe> i see
<sjusthm> rgw is writing a new object at offset 2096640
<sjusthm> which is illegal on an EC pool
<cholcombe> oh right..
<sjusthm> I guess we didn't backport the ENOTSUPP return fix
sjusthm> but it must be exactly an append
<cholcombe> i see
<sjusthm> which means it must be have offset 0
<sjusthm> since the object does not currently have any data
#6 Updated by Samuel Just over 5 years ago
2015-12-03 09:53:41.845187 7f68aecb2700 10 osd.80 8518 dequeue_op 0x7f690866f3b0 prio 63 cost 2096640 latency 24.577112 osd_op(client.11661.0:10010125 default.11661.7__shadow_NBC_SUITS_F6307_HD_10CH_FR_EN_DA000449566_16X9_178_2398_DIGITAL_FINAL_33620.mov.2~2XoKTxTsYORRDqGbdFS_K9QzzTicUMb.37_13 [write 2096640~2096640] 23.fca94ccd RETRY=1220 ack+ondisk+retry+write e8502) v4 pg pg[23.4cds0( v 7397'9268 (1614'6255,7397'9268]
local-les=8518 n=1033 ec=1595 les/c 8518/8462 8511/8511/8502) [80,99,87,49,156,89,115,101,84,67,108,36,160] r=0 lpr=8511 pi=8457-8510/10 crt=1743'9254 lcod 0'0 mlcod 0'0 active]
The object doesn't exist, 0.80.11 would have rejected that write with an ENOTSUPP since the offset doesn't match the object size. I seem to recall a bug in radosgw which could cause it to not send the initial write? Or maybe a bug in the osd which reordered the writes?
#7 Updated by Samuel Just over 5 years ago
The log above only contains two writes on the object at offset 2096640~2096640 and 4193280~1024. If it recurs after restarting radosgw, the next step would be to get matching logs on the osd and radosgw
debug ms = 1
debug objecter = 20
(whatever the right radosgw logging is?)
during the event.