Bug #8442
rgw: does not detect/adapt to erasure pool stripe size
Status: Closed
Description
Steps:
1. Put a 4MB object via the rados gateway (S3).
2. An error is returned: HTTP/1.1 500 Internal Server Error.
3. The put to the cluster fails.
Rados gateway's log:
2014-05-27 07:38:05.353186 7f65e73f1700 2 req 61:0.000547:s3:PUT /my_bucket/4MB:put_obj:verifying op params
2014-05-27 07:38:05.353189 7f65e73f1700 2 req 61:0.000550:s3:PUT /my_bucket/4MB:put_obj:executing
2014-05-27 07:38:05.357837 7f65e73f1700 1 -- IP1 --> IP2 -- osd_op(client.4273.0:12239 default.4273.1__shadow_.CRFflfLsbIteptc69HYb7vEaRhVsVVe_1 [writefull 0~524288] 7.623b17f2 ack+ondisk+write e210) v4 -- ?+0 0x7f64ec00c060 con 0x22e6fc0
2014-05-27 07:38:05.360712 7f65e73f1700 1 -- IP1 --> IP2 -- osd_op(client.4273.0:12240 default.4273.1__shadow_.CRFflfLsbIteptc69HYb7vEaRhVsVVe_1 [write 524288~524288] 7.623b17f2 ack+ondisk+write e210) v4 -- ?+0 0x7f64ec00cd30 con 0x22e6fc0
2014-05-27 07:38:05.370620 7f666f841700 1 -- IP1 <== osd.22 IP2 ==== osd_op_reply(12240 default.4273.1__shadow_.CRFflfLsbIteptc69HYb7vEaRhVsVVe_1 [write 524288~524288] v0'0 uv0 ondisk = -95 ((95) Operation not supported)) v6 ==== 224+0+0 (1784400611 0 0) 0x7f6618002480 con 0x22e6fc0
Analysis:
1. According to the log, the operation fails when OP_WRITE is used.
(e.g. [write 524288~524288]: the OSD replied -95 ((95) Operation not
supported))
2. Each OP_WRITE uses 512KB as the basic write unit.
(e.g. [write 524288~524288] [write 1048576~524288]: the first number is the write offset, the second is the write length)
3. The OP_WRITE source code for erasure coding:

    if (pool.info.requires_aligned_append() &&
        (op.extent.offset % pool.info.required_alignment() != 0)) {
      result = -EOPNOTSUPP;
      break;
    }

Every write offset (a multiple of the 512KB basic unit) is checked against the required alignment; if the offset is not evenly divisible by it, an error is returned. The alignment is derived from the erasure profile (parameters such as w, packetsize, and so on).
4. In my test, I used a 4MB object and the alignment was 640KB.
The object is divided into two parts:
512KB --> the first chunk, by default
3.5MB --> the second part, written in seven writes of 512KB each
The error occurred in the second part, because the 512KB offsets are not evenly divisible by 640KB.
Questions:
1. Why was this condition added to the source? [op.extent.offset % pool.info.required_alignment() != 0]
2. Why does OP_WRITE write 512KB at a time?
Updated by Guang Yang almost 10 years ago
Hello,
May I ask what the design consideration is behind using multiple AIO writes (default size 512KB) for a single stripe (typical size 4MB)? Is it aimed at improving write latency via parallel writes to a single file (within the filestore thread)? It comes with the downside of introducing several times as many ops, and those ops are eventually serialized at the OSD anyway.
Updated by Sage Weil almost 10 years ago
- Subject changed from Can not put object to cluster with erasure coding pool to rgw: Can not put object to cluster with erasure coding pool
- Status changed from New to 12
- Priority changed from Normal to High
- Source changed from other to Community (user)
Updated by Sage Weil almost 10 years ago
You can manually adjust rgw_max_chunk_size to 640KB to work around this for now. We need to make rgw automagically detect the pool alignment requirements.
Updated by Guang Yang almost 10 years ago
Hi Sage,
Can you comment on the design consideration asked about in comment #1? With erasure coding, I think it might be more efficient to do a single write for a 4MB chunk instead of splitting it into multiple AIO writes. Did I miss anything obvious here?
Thanks,
Guang
Updated by Jingjing Zhao almost 10 years ago
Hi sage,
Thanks, it works!
In addition, I would also really like to know the design consideration asked about in comment #1.
Thanks,
Jing
Updated by Sage Weil almost 10 years ago
- Subject changed from rgw: Can not put object to cluster with erasure coding pool to rgw: does not detect/adapt to erasure pool stripe size
- Status changed from Resolved to 12
Updated by Josh Durgin almost 10 years ago
- Status changed from 12 to Pending Backport
- Backport set to firefly
Updated by Sage Weil over 9 years ago
- Status changed from Pending Backport to Resolved