Bug #8442
rgw: does not detect/adapt to erasure pool stripe size
Status: Closed
Description
Steps:
1. Put a 4MB object via the rados gateway (S3).
2. An error is returned: HTTP/1.1 500 Internal Server Error.
3. The put to the cluster fails.
Rados gateway's log:
2014-05-27 07:38:05.353186 7f65e73f1700 2 req 61:0.000547:s3:PUT /my_bucket/4MB:put_obj:verifying op params
2014-05-27 07:38:05.353189 7f65e73f1700 2 req 61:0.000550:s3:PUT /my_bucket/4MB:put_obj:executing
2014-05-27 07:38:05.357837 7f65e73f1700 1 -- IP1 --> IP2 -- osd_op(client.4273.0:12239 default.4273.1__shadow_.CRFflfLsbIteptc69HYb7vEaRhVsVVe_1 [writefull 0~524288] 7.623b17f2 ack+ondisk+write e210) v4 -- ?+0 0x7f64ec00c060 con 0x22e6fc0
2014-05-27 07:38:05.360712 7f65e73f1700 1 -- IP1 --> IP2 -- osd_op(client.4273.0:12240 default.4273.1__shadow_.CRFflfLsbIteptc69HYb7vEaRhVsVVe_1 [write 524288~524288] 7.623b17f2 ack+ondisk+write e210) v4 -- ?+0 0x7f64ec00cd30 con 0x22e6fc0
2014-05-27 07:38:05.370620 7f666f841700 1 -- IP1 <== osd.22 IP2 ==== osd_op_reply(12240 default.4273.1__shadow_.CRFflfLsbIteptc69HYb7vEaRhVsVVe_1 [write 524288~524288] v0'0 uv0 ondisk = -95 ((95) Operation not supported)) v6 ==== 224+0+0 (1784400611 0 0) 0x7f6618002480 con 0x22e6fc0
Analysis:
1. According to the log, the operation fails when OP_WRITE is used.
(e.g. [write 524288~524288]: the OSD replied -95 ((95) Operation not
supported))
2. Each OP_WRITE uses 512KB as the basic write unit.
(e.g. [write 524288~524288] [write 1048576~524288]: the first number is the write offset, the second is the write length)
3. The OP_WRITE source code for erasure coding:

    if (pool.info.requires_aligned_append() &&
        (op.extent.offset % pool.info.required_alignment() != 0)) {
      result = -EOPNOTSUPP;
      break;
    }

Every write offset (a multiple of the 512KB basic unit) is checked against the required alignment; if the offset is not evenly divisible by it, an error is returned. The alignment is derived from the erasure profile (parameters such as w, packetsize, and so on).
4. In my test, I used a 4MB object and the alignment was 640KB.
The object is divided into two parts:
512KB --> the first chunk, by default
3.5MB --> the second part, written in seven writes of 512KB each
The error occurred in the second part, because the 512KB offsets are not evenly divisible by 640KB.
Questions:
1. Why was this condition added to the source? [op.extent.offset % pool.info.required_alignment() != 0]
2. Why does OP_WRITE write 512KB at a time?
Updated by Guang Yang almost 10 years ago
Hello,
May I ask what the design consideration is behind using multiple AIO writes (default size 512KB) for a single stripe (typical size 4MB)? Is it aimed at improving write latency via parallel writes to a single file (within the filestore thread)? It comes with the downside of introducing several times as many ops, and those ops are eventually serialized at the OSD anyway.
Updated by Sage Weil almost 10 years ago
- Subject changed from Can not put object to cluster with erasure coding pool to rgw: Can not put object to cluster with erasure coding pool
- Status changed from New to 12
- Priority changed from Normal to High
- Source changed from other to Community (user)
Updated by Sage Weil almost 10 years ago
You can manually adjust rgw_max_chunk_size to 640KB to work around this for now. We need to make rgw automagically detect the pool alignment requirements.
Updated by Guang Yang almost 10 years ago
Hi Sage,
Can you comment on the design consideration asked about in comment #1? With erasure coding, I think it might be more efficient to do a single write for a 4MB chunk instead of splitting it into multiple AIO writes. Did I miss anything obvious here?
Thanks,
Guang
Updated by Jingjing Zhao almost 10 years ago
Hi sage,
Thanks, it works!
In addition, I would also really like to know the design consideration asked about in comment #1.
Thanks,
Jing
Updated by Sage Weil almost 10 years ago
- Subject changed from rgw: Can not put object to cluster with erasure coding pool to rgw: does not detect/adapt to erasure pool stripe size
- Status changed from Resolved to 12
Updated by Josh Durgin almost 10 years ago
- Status changed from 12 to Pending Backport
- Backport set to firefly
Updated by Sage Weil over 9 years ago
- Status changed from Pending Backport to Resolved