Bug #13972
closed
osd/ECUtil.h: 117: FAILED assert(old_size == total_chunk_size) in 0.80.10
Added by Chris Holcombe over 8 years ago.
Updated over 7 years ago.
Description
I have an erasure-coded pool set up with k=10, m=3 jerasure and ruleset-failure-domain=osd, and I seem to be hitting a crash loop with the osds. Only a few of them are crashing, but it's enough that the entire ceph cluster is looking pretty banged up. I'm not exactly sure what caused the problem. We're just putting test data in the cluster, so we're free to dump it if needed.
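For context, a pool like the one described is typically created along these lines (profile and pool names here are illustrative, not taken from the ticket; `ruleset-failure-domain` is the firefly-era spelling of what later became `crush-failure-domain`):

```shell
# Sketch of creating a k=10,m=3 jerasure EC pool with an osd failure domain.
ceph osd erasure-code-profile set ec-10-3 \
    k=10 m=3 plugin=jerasure \
    ruleset-failure-domain=osd
ceph osd pool create ecpool 256 256 erasure ec-10-3
```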
Here's a paste of the crash:
https://gist.github.com/cholcombe973/eaacc5effc4a1af33ce6
The log output is a mile long, but I think I posted the most relevant parts.
- Project changed from Stable releases to Ceph
- Priority changed from High to Urgent
The ceph osd log is located here: ceph-post-file: 5023e3eb-eeba-472d-aeea-86022efdc83d
The workload that was happening before the crash was a boto multipart upload. There was a cron job of about 24 batches of these going. I killed radosgw and restarted it to see if this continues happening.
Copying the IRC conversation here for clarity:
<sjusthm> cholcombe: ok, in master, an op like the one rgw is sending would have gotten an ENOTSUPP
<sjusthm> rather than crashing the osd
<cholcombe> i see
<sjusthm> rgw is writing a new object at offset 2096640
<sjusthm> which is illegal on an EC pool
<cholcombe> oh right..
<sjusthm> I guess we didn't backport the ENOTSUPP return fix
..<snip>..
<sjusthm> but it must be exactly an append
<cholcombe> i see
<sjusthm> which means it must have offset 0
<sjusthm> since the object does not currently have any data
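The constraint sjusthm describes can be sketched as a small check (a hypothetical helper, not the actual OSD code; Python's errno module has no ENOTSUPP, so EOPNOTSUPP stands in for the error the ticket mentions):

```python
import errno

def check_ec_write(object_size, write_offset):
    """On an EC pool a write must be exactly an append: its offset has
    to equal the object's current size (0 for a brand-new object).
    Hypothetical sketch of the rule, not the real OSD code."""
    if write_offset != object_size:
        return -errno.EOPNOTSUPP
    return 0

# rgw wrote a brand-new (size 0) object at offset 2096640 -- illegal:
assert check_ec_write(0, 2096640) == -errno.EOPNOTSUPP
# True appends are fine:
assert check_ec_write(0, 0) == 0
assert check_ec_write(2096640, 2096640) == 0
```

With the backported fix the client gets this error back; without it, the OSD trips the `old_size == total_chunk_size` assert instead.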
After rolling the rgw nodes, the cluster went back to active+clean and the osds aren't crashing anymore.
2015-12-03 09:53:41.845187 7f68aecb2700 10 osd.80 8518 dequeue_op 0x7f690866f3b0 prio 63 cost 2096640 latency 24.577112 osd_op(client.11661.0:10010125 default.11661.7__shadow_NBC_SUITS_F6307_HD_10CH_FR_EN_DA000449566_16X9_178_2398_DIGITAL_FINAL_33620.mov.2~2XoKTxTsYORRDqGbdFS_K9QzzTicUMb.37_13 [write 2096640~2096640] 23.fca94ccd RETRY=1220 ack+ondisk+retry+write e8502) v4 pg pg[23.4cds0( v 7397'9268 (1614'6255,7397'9268]
local-les=8518 n=1033 ec=1595 les/c 8518/8462 8511/8511/8502) [80,99,87,49,156,89,115,101,84,67,108,36,160] r=0 lpr=8511 pi=8457-8510/10 crt=1743'9254 lcod 0'0 mlcod 0'0 active]
The object doesn't exist; 0.80.11 would have rejected that write with ENOTSUPP since the offset doesn't match the object size. I seem to recall a bug in radosgw that could cause it to not send the initial write? Or maybe a bug in the osd that reordered the writes?
The log above only contains two writes on the object, at offsets 2096640~2096640 and 4193280~1024. If it recurs after restarting radosgw, the next step would be to get matching logs on the osd and radosgw with:
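For what it's worth, the two offsets quoted above line up as a clean append chain with exactly one missing initial write; a quick arithmetic check (values copied from the log, conclusion is consistent with, though not proof of, the missing-initial-write theory):

```python
# Offsets and lengths quoted from the log above.
first_off, first_len = 2096640, 2096640   # [write 2096640~2096640]
second_off, second_len = 4193280, 1024    # [write 4193280~1024]

# The second write starts exactly where the first one ends...
assert second_off == first_off + first_len
# ...and the first starts exactly one write-length past zero,
# i.e. a missing initial [write 0~2096640] would complete the chain.
assert first_off == 0 + first_len
```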
debug ms = 1
debug objecter = 20
(whatever the right radosgw logging is?)
during the event.
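Concretely, those settings would go in ceph.conf on the relevant daemons; a minimal sketch (section placement is illustrative, and the `debug rgw` line is my assumption about the radosgw-side option asked about above):

```ini
[global]
    ; scope these per daemon if you prefer; placement here is illustrative
    debug ms = 1
    debug objecter = 20
    debug rgw = 20   ; assumption: the radosgw-side debug knob
```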
- Assignee deleted (Samuel Just)
- Tracker changed from Tasks to Bug
- Status changed from New to Need More Info
- Release set to firefly
- Release set to hammer
- Affected Versions deleted (v0.80.10)
- File cm-20151207.txt added
crushmap per Loic's request.
- File ceph-osd-dump-20151207.txt added
- File ceph-pg-dump-20151207.txt.gz added
- Release deleted (firefly)
- Release deleted (hammer)
- Affected Versions 0.80 added
- File deleted (cm-20151207.txt)
- File deleted (ceph-osd-dump-20151207.txt)
- File deleted (ceph-pg-dump-20151207.txt.gz)
- Status changed from Need More Info to Can't reproduce