Bug #10211

gf-complete exit(1) because of misaligned structure

Added by Loïc Dachary over 9 years ago. Updated about 9 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: OSD
Target version: -
% Done: 80%
Source: other
Tags:
Backport: giant
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Steps to reproduce (this is fragile because it depends on the version of the allocator):

  • rm -fr dev out ; mkdir -p dev ; MON=1 OSD=5 ./vstart.sh -X -d -n -l mon osd
  • ./ceph osd erasure-code-profile set myprofile2 k=2 m=2 ruleset-failure-domain=osd
  • ./ceph osd pool create ecpool2 12 12 erasure myprofile2
  • ./rados put --pool ecpool2 SOMETHING /etc/group

osd.1, which is the primary for the PG storing SOMETHING in ecpool2, calls exit(1):

#0  __GI_exit (status=1) at exit.c:104
#1  0x00007ffff18f3312 in gf_set_region_data (rd=0x7fffd9b8b0a0, gf=0x485dd40, src=0x4a7b31f, dest=0x4d25800, bytes=2048, val=143, xor=1, align=16) at erasure-code/jerasure/gf-complete/src/gf.c:817
#2  0x00007ffff1932364 in gf_w8_split_multiply_region_sse (gf=0x485dd40, src=0x4a7b31f, dest=0x4d25800, val=143, bytes=2048, xor=1) at erasure-code/jerasure/gf-complete/src/gf_w8.c:1071
#3  0x00007ffff18db9cc in galois_w08_region_multiply (region=0x4a7b31f "", multby=143, nbytes=2048, r2=0x4d25800 "root:x:0:\ndaemon:x:1:\nbin:x:2:\nsys:x:3:\nadm:x:4:loic,swift,syslog\ntty:x:5:\ndisk:x:6:\nlp:x:7:\nmail:x:8:\nnews:x:9:\nuucp:x:10:\nman:x:12:\nproxy:x:13:\nkmem:x:15:\ndialout:x:20:\nfax:x:21:\nvoice:x:22:\ncdrom:x"..., add=1) at erasure-code/jerasure/jerasure/src/galois.c:295
#4  0x00007ffff18ddb42 in jerasure_matrix_dotprod (k=2, w=8, matrix_row=0x47fa558, src_ids=0x0, dest_id=3, data_ptrs=0x7fffd9b8b260, coding_ptrs=0x7fffd9b8b270, size=2048) at erasure-code/jerasure/jerasure/src/jerasure.c:626

https://github.com/ceph/gf-complete/blob/v1/src/gf.c#L811 indeed checks that src=0x4a7b31f, dest=0x4d25800 are aligned and finds they are not.
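
The alignment test can be reproduced by hand from the addresses in the backtrace. The sketch below is only a model of the condition described by the error message ("aligned with respect to each other along a 16 byte boundary"), not the gf-complete source; the function name `mutually_aligned` is made up for illustration.

```python
ALIGN = 16  # the align= argument shown in frame #1 of the backtrace


def mutually_aligned(src: int, dest: int, align: int = ALIGN) -> bool:
    """True when src and dest sit at the same offset within an align-byte block.

    Hypothetical helper: it models the condition, it is not the gf-complete code.
    """
    return src % align == dest % align


# Pointer values copied from frames #1 and #2 of the backtrace.
src, dest = 0x4A7B31F, 0x4D25800

print(src % ALIGN)                  # offset of src within its 16-byte block
print(dest % ALIGN)                 # offset of dest within its 16-byte block
print(mutually_aligned(src, dest))  # False means the region call is rejected
```

src is 15 bytes past a 16-byte boundary while dest sits exactly on one, so the two pointers cannot be processed together with 16-byte SIMD loads.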

Related issues

Duplicated by Ceph - Bug #10065: hung ec-lost-unfound.yaml, failed of osd.{0,2,3} Duplicate 11/11/2014

Associated revisions

Revision 4e955f41 (diff)
Added by Loic Dachary over 9 years ago

erasure-code: enforce chunk size alignment

Suppose the ErasureCode::encode function is given a 4096-byte
bufferlist made of a 1249-byte bufferptr followed by a 2847-byte
bufferptr, both properly starting on a SIMD_ALIGN address. When
bufferlist::substr_of extracts the second 2048-byte chunk, the chunk
starts 799 bytes (2048 - 1249) after the beginning of the 2847-byte
bufferptr and is not SIMD_ALIGN'ed, so it has to be reallocated.

The ErasureCode::encode must enforce a size alignment based on the chunk
size in addition to the memory alignment required by SIMD operations,
using the bufferlist::rebuild_aligned_size_and_memory function instead of
bufferlist::rebuild_aligned.

http://tracker.ceph.com/issues/10211 Fixes: #10211

Signed-off-by: Loic Dachary <>
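
The arithmetic in the commit message above can be checked with a small model. This is a sketch that treats a bufferlist as a plain list of segment lengths; `chunk_starts` is a hypothetical helper, and SIMD_ALIGN is assumed to be 16 because the backtrace shows align=16.

```python
SIMD_ALIGN = 16  # assumption: matches align=16 in the backtrace


def chunk_starts(segment_lengths, chunk_size):
    """Return (segment_index, offset_in_segment) for the start of each chunk.

    segment_lengths lists the byte length of each segment in order; this is a
    simplified stand-in for a bufferlist, not the Ceph data structure.
    """
    starts = []
    for pos in range(0, sum(segment_lengths), chunk_size):
        remaining = pos
        for index, length in enumerate(segment_lengths):
            if remaining < length:
                starts.append((index, remaining))
                break
            remaining -= length
    return starts


# The example from the commit message: 1249 + 2847 = 4096 bytes, 2048-byte chunks.
starts = chunk_starts([1249, 2847], 2048)
print(starts)

# Each segment is assumed to begin on an aligned address, so a chunk is
# aligned only if its offset inside the segment is a multiple of SIMD_ALIGN.
print([offset % SIMD_ALIGN for _, offset in starts])
```

The second chunk begins 799 bytes into the second segment, and 799 is 15 modulo 16, which is the same residue as the low four bits of src=0x4a7b31f in the backtrace.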

Revision cc469b23 (diff)
Added by Loic Dachary over 9 years ago

erasure-code: enforce chunk size alignment

Suppose the ErasureCode::encode function is given a 4096-byte
bufferlist made of a 1249-byte bufferptr followed by a 2847-byte
bufferptr, both properly starting on a SIMD_ALIGN address. When
bufferlist::substr_of extracts the second 2048-byte chunk, the chunk
starts 799 bytes (2048 - 1249) after the beginning of the 2847-byte
bufferptr and is not SIMD_ALIGN'ed, so it has to be reallocated.

The ErasureCode::encode must enforce a size alignment based on the chunk
size in addition to the memory alignment required by SIMD operations,
using the bufferlist::rebuild_aligned_size_and_memory function instead of
bufferlist::rebuild_aligned.

http://tracker.ceph.com/issues/10211 Fixes: #10211

Signed-off-by: Loic Dachary <>
(cherry picked from commit 4e955f41297283798236c505c3d21bdcabb5caa0)
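
Why enforcing a size alignment on top of the memory alignment is sufficient can be illustrated with a simplified model. The sketch assumes that, after the rebuild, every segment starts on an aligned address and has a length that is a multiple of the chunk size; it does not reproduce how bufferlist::rebuild_aligned_size_and_memory is implemented, and both function names below are hypothetical.

```python
def chunk_offsets_after_rebuild(segment_lengths, chunk_size):
    """Offsets, within their own segment, at which chunks begin.

    Assumes every segment length is a multiple of chunk_size, which is the
    size-alignment property the fix is meant to enforce.
    """
    for length in segment_lengths:
        if length % chunk_size != 0:
            raise ValueError("segment length %d is not a multiple of %d"
                             % (length, chunk_size))
    return [offset
            for length in segment_lengths
            for offset in range(0, length, chunk_size)]


def all_chunks_aligned(segment_lengths, chunk_size, align=16):
    """True when every chunk starts at a multiple of align inside its segment."""
    offsets = chunk_offsets_after_rebuild(segment_lengths, chunk_size)
    return all(offset % align == 0 for offset in offsets)


# One 4096-byte segment, or two 2048-byte segments: chunks start at 0 or 2048.
print(all_chunks_aligned([4096], 2048))
print(all_chunks_aligned([2048, 2048], 2048))
```

Under these assumptions no chunk can straddle a segment boundary, so the 1249 + 2847 layout that produced the misaligned pointer is rejected by the model instead of yielding an offset of 799.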

History

#2 Updated by Loïc Dachary over 9 years ago

  • Status changed from 12 to In Progress
  • % Done changed from 0 to 80

#3 Updated by Loïc Dachary over 9 years ago

  • Description updated (diff)

#4 Updated by Loïc Dachary over 9 years ago

  • Status changed from In Progress to Fix Under Review

#5 Updated by Loïc Dachary over 9 years ago

For the record, strace on the process shows:

[pid 22934] write(2, "Error in region multiply operation.\n", 36) = -1 EBADF (Bad file descriptor)
[pid 22934] write(2, "The source & destination pointers must be aligned with respect\n", 63) = -1 EBADF (Bad file descriptor)
[pid 22934] write(2, "to each other along a 16 byte boundary.\n", 40) = -1 EBADF (Bad file descriptor)
[pid 22934] write(2, "Src = 0x4402152.  Dest = 0x6d88d00\n", 35) = -1 EBADF (Bad file descriptor)

#6 Updated by Loïc Dachary over 9 years ago

  • Status changed from Fix Under Review to Pending Backport

#7 Updated by Loïc Dachary over 9 years ago

  • Status changed from Pending Backport to Fix Under Review

#8 Updated by Sage Weil about 9 years ago

  • Status changed from Fix Under Review to Pending Backport

#9 Updated by Loïc Dachary about 9 years ago

  • Status changed from Pending Backport to Resolved
