Bug #13972

osd/ECUtil.h: 117: FAILED assert(old_size == total_chunk_size) in 0.80.10

Added by Chris Holcombe over 5 years ago. Updated almost 5 years ago.

Status:
Can't reproduce
Priority:
Urgent
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
Severity:
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I have an erasure code pool set up with k=10,m=3 jerasure and ruleset-failure-domain=osd, and I seem to be hitting a crash loop with the osds. Only a few of them are crashing, but it's enough that the entire ceph cluster is looking pretty banged up. I'm not exactly sure what caused the problem. We're just putting test data in the cluster, so we're free to dump it if needed.

Here's a paste of the crash:
https://gist.github.com/cholcombe973/eaacc5effc4a1af33ce6
The log output is a mile long but I think I posted the most relevant parts.
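For reference, the k=10,m=3 profile in the report has this basic shape (the numbers below are plain arithmetic about erasure coding, not Ceph API calls):

```python
# Illustrative arithmetic for the reporter's profile: k=10 data chunks,
# m=3 coding chunks, failure domain = osd. Not Ceph code.
k, m = 10, 3

# Each logical object is striped across k data chunks plus m coding chunks,
# so every write touches k + m = 13 osds.
shards = k + m

# Raw space consumed per logical byte, vs. 3x for triple replication.
raw_per_logical = (k + m) / k

# The pool survives the loss of up to m shards of any object.
failures_tolerated = m

print(shards, raw_per_logical, failures_tolerated)
```

So the cluster pays about 1.3x raw space and tolerates three osd failures per placement group, which is why a crash loop on "only a few" osds can still degrade many PGs at once.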

History

#1 Updated by Chris Holcombe over 5 years ago

Ceph.conf: https://gist.github.com/cholcombe973/5293a14d63d4d573d4c6
ceph df: https://gist.github.com/cholcombe973/c47e44d2615f3e6efbde the .rgw.buckets pool is erasure coded with k=10,m=3
rgw conf file: https://gist.github.com/cholcombe973/a7ec3e91290bb08b969a

There's no cache tier involved.
This cluster is using the new civetweb version of radosgw.

A debug ms = 20 log from one of the crashing osds is attached.
I ran an xfs check on the partition and it came back clean.

#2 Updated by Samuel Just over 5 years ago

  • Project changed from Stable releases to Ceph
  • Priority changed from High to Urgent

#3 Updated by Chris Holcombe over 5 years ago

The ceph osd log is located here: ceph-post-file: 5023e3eb-eeba-472d-aeea-86022efdc83d

#4 Updated by Chris Holcombe over 5 years ago

The workload running before the crash was a boto multipart upload; there was a cron job of about 24 batches of these going. I killed radosgw and restarted it to see if this continues happening.

Copying the irc convo here for clarity:
<sjusthm> cholcombe: ok, in master, an op like the one rgw is sending would have gotten an ENOTSUPP
<sjusthm> rather than crashing the osd
<cholcombe> i see
<sjusthm> rgw is writing a new object at offset 2096640
<sjusthm> which is illegal on an EC pool
<cholcombe> oh right..
<sjusthm> I guess we didn't backport the ENOTSUPP return fix
..<snip>..
<sjusthm> but it must be exactly an append
<cholcombe> i see
<sjusthm> which means it must be have offset 0
<sjusthm> since the object does not currently have any data
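The rule sjusthm describes can be sketched as follows. This is a hedged analogue in Python, not the actual OSD code: on an EC pool a write is only legal if it is an exact append, i.e. its offset equals the object's current size (0 for a brand-new object). Master returned ENOTSUPP for anything else, while 0.80.10 hit the assert instead.

```python
# Hedged sketch of the EC-pool write rule (illustrative, not Ceph source).
import errno

def validate_ec_write(current_size: int, offset: int) -> int:
    """Return 0 if the write is a legal append onto an EC object,
    else a negative errno (Ceph's ENOTSUPP; EOPNOTSUPP stands in here)."""
    if offset != current_size:
        return -errno.EOPNOTSUPP
    return 0

# rgw's op from the log: a write at offset 2096640 on an object with no data.
assert validate_ec_write(0, 2096640) != 0   # master: rejected, osd survives
assert validate_ec_write(0, 0) == 0         # initial write at offset 0 is fine
assert validate_ec_write(2096640, 2096640) == 0  # exact append is fine
```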

#5 Updated by Chris Holcombe over 5 years ago

After rolling the rgw nodes, the cluster went back to active+clean and the osds aren't crashing anymore.

#6 Updated by Samuel Just over 5 years ago

2015-12-03 09:53:41.845187 7f68aecb2700 10 osd.80 8518 dequeue_op 0x7f690866f3b0 prio 63 cost 2096640 latency 24.577112 osd_op(client.11661.0:10010125 default.11661.7__shadow_NBC_SUITS_F6307_HD_10CH_FR_EN_DA000449566_16X9_178_2398_DIGITAL_FINAL_33620.mov.2~2XoKTxTsYORRDqGbdFS_K9QzzTicUMb.37_13 [write 2096640~2096640] 23.fca94ccd RETRY=1220 ack+ondisk+retry+write e8502) v4 pg pg[23.4cds0( v 7397'9268 (1614'6255,7397'9268]
local-les=8518 n=1033 ec=1595 les/c 8518/8462 8511/8511/8502) [80,99,87,49,156,89,115,101,84,67,108,36,160] r=0 lpr=8511 pi=8457-8510/10 crt=1743'9254 lcod 0'0 mlcod 0'0 active]

The object doesn't exist; 0.80.11 would have rejected that write with ENOTSUPP since the offset doesn't match the object size. I seem to recall a bug in radosgw that could cause it to not send the initial write? Or maybe a bug in the osd that reordered the writes?

#7 Updated by Samuel Just over 5 years ago

The log above contains only two writes on the object, at 2096640~2096640 and 4193280~1024 (offset~length). If it recurs after restarting radosgw, the next step would be to get matching logs on the osd and radosgw with

debug ms = 1
debug objecter = 20
(whatever the right radosgw logging is?)

during the event.
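A quick check on the two extents from the log (plain arithmetic over the offset~length notation, not Ceph code) shows they are contiguous with each other but leave the start of the object uncovered, consistent with the missing initial write suspected above:

```python
# Parse RADOS-style "offset~length" extents and check coverage.
def parse_extent(ext: str) -> tuple[int, int]:
    """Parse 'offset~length' into a half-open byte range (start, end)."""
    off, length = (int(x) for x in ext.split("~"))
    return off, off + length

writes = sorted(parse_extent(e) for e in ("2096640~2096640", "4193280~1024"))

# The two writes butt up against each other...
assert writes[0][1] == writes[1][0]

# ...but bytes [0, 2096640) were never written, so the first write seen by
# the osd is not an append onto the empty object.
first_covered_byte = writes[0][0]
print(first_covered_byte)   # 2096640, not 0
```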

#8 Updated by Samuel Just over 5 years ago

  • Assignee deleted (Samuel Just)

#9 Updated by Nathan Cutler over 5 years ago

  • Tracker changed from Tasks to Bug
  • Status changed from New to Need More Info

#10 Updated by Loïc Dachary over 5 years ago

  • Release set to firefly
  • Release set to hammer

#11 Updated by Loïc Dachary over 5 years ago

  • Affected Versions deleted (v0.80.10)

#12 Updated by Laurent GUERBY over 5 years ago

  • File cm-20151207.txt added

Crushmap, per Loic's request.

#13 Updated by Laurent GUERBY over 5 years ago

  • File ceph-osd-dump-20151207.txt added

osd and pg dump

#14 Updated by Laurent GUERBY over 5 years ago

  • File ceph-pg-dump-20151207.txt.gz added

pg dump

#15 Updated by Loïc Dachary over 5 years ago

  • Release deleted (firefly)
  • Release deleted (hammer)
  • Affected Versions 0.80 added

#16 Updated by Loïc Dachary over 5 years ago

  • File deleted (cm-20151207.txt)

#17 Updated by Loïc Dachary over 5 years ago

  • File deleted (ceph-osd-dump-20151207.txt)

#18 Updated by Loïc Dachary over 5 years ago

  • File deleted (ceph-pg-dump-20151207.txt.gz)

#19 Updated by Samuel Just almost 5 years ago

  • Status changed from Need More Info to Can't reproduce
