Bug #13972
closed
osd/ECUtil.h: 117: FAILED assert(old_size == total_chunk_size) in 0.80.10
Added by Chris Holcombe over 8 years ago.
Updated over 7 years ago.
Description
I have an erasure-coded pool set up with k=10, m=3 jerasure and ruleset-failure-domain=osd, and I seem to be hitting a crash loop with the osds. Only a few of them are crashing, but it's enough that the entire ceph cluster is looking pretty banged up. I'm not exactly sure what caused the problem. We're just putting test data in the cluster, so we're free to dump it if needed.
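For context, a pool like the one described is typically created along these lines (profile and pool names here are illustrative, not taken from the ticket; `ruleset-failure-domain` is the firefly-era spelling of what later became `crush-failure-domain`):

```shell
# Sketch of creating a k=10,m=3 jerasure EC pool with an osd failure domain.
ceph osd erasure-code-profile set ec-10-3 \
    k=10 m=3 plugin=jerasure \
    ruleset-failure-domain=osd
ceph osd pool create ecpool 256 256 erasure ec-10-3
```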
Here's a paste of the crash:
https://gist.github.com/cholcombe973/eaacc5effc4a1af33ce6
The log output is a mile long, but I think I posted the most relevant parts.
- Project changed from Stable releases to Ceph
- Priority changed from High to Urgent
The ceph osd log is located here: ceph-post-file: 5023e3eb-eeba-472d-aeea-86022efdc83d
The workload that was happening before the crash was a boto multipart upload. There was a cron job of about 24 batches of these going. I killed radosgw and restarted it to see if this continues happening.
Copying the IRC conversation here for clarity:
<sjusthm> cholcombe: ok, in master, an op like the one rgw is sending would have gotten an ENOTSUPP
<sjusthm> rather than crashing the osd
<cholcombe> i see
<sjusthm> rgw is writing a new object at offset 2096640
<sjusthm> which is illegal on an EC pool
<cholcombe> oh right..
<sjusthm> I guess we didn't backport the ENOTSUPP return fix
..<snip>..
<sjusthm> but it must be exactly an append
<cholcombe> i see
<sjusthm> which means it must have offset 0
<sjusthm> since the object does not currently have any data
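The constraint sjusthm describes can be sketched as a small check (a hypothetical helper, not the actual OSD code; Python's errno module has no ENOTSUPP, so EOPNOTSUPP stands in for the error the ticket mentions):

```python
import errno

def check_ec_write(object_size, write_offset):
    """On an EC pool a write must be exactly an append: its offset has
    to equal the object's current size (0 for a brand-new object).
    Hypothetical sketch of the rule, not the real OSD code."""
    if write_offset != object_size:
        return -errno.EOPNOTSUPP
    return 0

# rgw wrote a brand-new (size 0) object at offset 2096640 -- illegal:
assert check_ec_write(0, 2096640) == -errno.EOPNOTSUPP
# True appends are fine:
assert check_ec_write(0, 0) == 0
assert check_ec_write(2096640, 2096640) == 0
```

With the backported fix the client gets this error back; without it, the OSD trips the `old_size == total_chunk_size` assert instead.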
After rolling the rgw nodes, the cluster went back to active+clean and the osds aren't crashing anymore.
2015-12-03 09:53:41.845187 7f68aecb2700 10 osd.80 8518 dequeue_op 0x7f690866f3b0 prio 63 cost 2096640 latency 24.577112 osd_op(client.11661.0:10010125 default.11661.7__shadow_NBC_SUITS_F6307_HD_10CH_FR_EN_DA000449566_16X9_178_2398_DIGITAL_FINAL_33620.mov.2~2XoKTxTsYORRDqGbdFS_K9QzzTicUMb.37_13 [write 2096640~2096640] 23.fca94ccd RETRY=1220 ack+ondisk+retry+write e8502) v4 pg pg[23.4cds0( v 7397'9268 (1614'6255,7397'9268]
local-les=8518 n=1033 ec=1595 les/c 8518/8462 8511/8511/8502) [80,99,87,49,156,89,115,101,84,67,108,36,160] r=0 lpr=8511 pi=8457-8510/10 crt=1743'9254 lcod 0'0 mlcod 0'0 active]
The object doesn't exist; 0.80.11 would have rejected that write with ENOTSUPP since the offset doesn't match the object size. I seem to recall a bug in radosgw that could cause it to not send the initial write? Or maybe a bug in the osd that reordered the writes?
The log above only contains two writes on the object, at offsets 2096640~2096640 and 4193280~1024. If it recurs after restarting radosgw, the next step would be to get matching logs on the osd and radosgw with:
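For what it's worth, the two offsets quoted above line up as a clean append chain with exactly one missing initial write; a quick arithmetic check (values copied from the log, conclusion is consistent with, though not proof of, the missing-initial-write theory):

```python
# Offsets and lengths quoted from the log above.
first_off, first_len = 2096640, 2096640   # [write 2096640~2096640]
second_off, second_len = 4193280, 1024    # [write 4193280~1024]

# The second write starts exactly where the first one ends...
assert second_off == first_off + first_len
# ...and the first starts exactly one write-length past zero,
# i.e. a missing initial [write 0~2096640] would complete the chain.
assert first_off == 0 + first_len
```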
debug ms = 1
debug objecter = 20
(whatever the right radosgw logging is?)
during the event.
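Concretely, those settings would go in ceph.conf on the relevant daemons; a minimal sketch (section placement is illustrative, and the `debug rgw` line is my assumption about the radosgw-side option asked about above):

```ini
[global]
    ; scope these per daemon if you prefer; placement here is illustrative
    debug ms = 1
    debug objecter = 20
    debug rgw = 20   ; assumption: the radosgw-side debug knob
```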
- Assignee deleted (Samuel Just)
- Tracker changed from Tasks to Bug
- Status changed from New to Need More Info
- Release set to firefly
- Release set to hammer
- Affected Versions deleted (v0.80.10)
- File cm-20151207.txt added
crushmap per Loic's request.
- File ceph-osd-dump-20151207.txt added
- File ceph-pg-dump-20151207.txt.gz added
- Release deleted (firefly)
- Release deleted (hammer)
- Affected Versions 0.80 added
- File deleted (cm-20151207.txt)
- File deleted (ceph-osd-dump-20151207.txt)
- File deleted (ceph-pg-dump-20151207.txt.gz)
- Status changed from Need More Info to Can't reproduce