Bug #57920

mds:ESubtreeMap event size is too large

Added by zhikuo du over 1 year ago. Updated over 1 year ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

In our production environment we have a problem: the ESubtreeMap event size is too large.

1. The ESubtreeMap event size is about 75 MB (79384286 bytes).

For example (from an MDS log):
3701845:2022-10-25T14:45:27.983+0800 7f7ab4013700 5 mds.0.log _submit_thread 161682556023939~79384286 : ESubtreeMap 2 subtrees , 0 ambiguous [metablob 0x1, 2904 dirs]
3701923:2022-10-25T14:45:29.171+0800 7f7ab4013700 5 mds.0.log _submit_thread 161682635408245~1026 : EUpdate scatter_writebehind [metablob 0x10000000000, 1 dirs]
3701928:2022-10-25T14:45:29.171+0800 7f7ab4013700 5 mds.0.log _submit_thread 161682635409291~1026 : EUpdate scatter_writebehind [metablob 0x10000000000, 1 dirs]
3701956:2022-10-25T14:45:29.171+0800 7f7ab4013700 5 mds.0.log _submit_thread 161682635410337~411678 : EUpdate scatter_writebehind [metablob 0x1, 1128 dirs]
3701963:2022-10-25T14:45:29.171+0800 7f7ab4013700 5 mds.0.log _submit_thread 161682635822035~6224 : EUpdate cap update [metablob 0x10000000000, 7 dirs]
3702858:2022-10-25T14:45:30.155+0800 7f7ab4013700 5 mds.0.log _submit_thread 161682635828279~79384286 : ESubtreeMap 2 subtrees , 0 ambiguous [metablob 0x1, 2904 dirs]
3702954:2022-10-25T14:45:31.415+0800 7f7ab4013700 5 mds.0.log _submit_thread 161682715212585~7606 : EUpdate cap update [metablob 0x1, 8 dirs]
3702959:2022-10-25T14:45:31.415+0800 7f7ab4013700 5 mds.0.log _submit_thread 161682715220211~1856 : EUpdate scatter_writebehind [metablob 0x1, 2 dirs]
3702964:2022-10-25T14:45:31.415+0800 7f7ab4013700 5 mds.0.log _submit_thread 161682715222087~1958 : EUpdate cap update [metablob 0x1000d2632f8, 2 dirs]
3702969:2022-10-25T14:45:31.415+0800 7f7ab4013700 5 mds.0.log _submit_thread 161682715224065~1958 : EUpdate cap update [metablob 0x1000d2632f8, 2 dirs]
3703027:2022-10-25T14:45:31.415+0800 7f7ab4013700 5 mds.0.log _submit_thread 161682715251247~1856 : EUpdate scatter_writebehind [metablob 0x1, 2 dirs]
......
3703086:2022-10-25T14:45:31.419+0800 7f7ab4013700 5 mds.0.log _submit_thread 161682715379407~29284 : EUpdate openc [metablob 0x50001b4b21c, 2 dirs]
3703094:2022-10-25T14:45:31.419+0800 7f7ab4013700 5 mds.0.log _submit_thread 161682715408711~1874 : EUpdate scatter_writebehind [metablob 0x40001f0d358, 2 dirs]
3703098:2022-10-25T14:45:31.419+0800 7f7ab4013700 5 mds.0.log _submit_thread 161682715410605~405212 : EUpdate scatter_writebehind [metablob 0x50001b4b1a6, 1121 dirs]
3703106:2022-10-25T14:45:31.423+0800 7f7ab4013700 5 mds.0.log _submit_thread 161682715815837~2408 : EUpdate scatter_writebehind [metablob 0x1, 2 dirs]
3703296:2022-10-25T14:45:32.439+0800 7f7ab4013700 5 mds.0.log _submit_thread 161682715818265~79384286 : ESubtreeMap 2 subtrees , 0 ambiguous [metablob 0x1, 2904 dirs]
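
For scale: the two consecutive ESubtreeMap events above start at offsets 161682556023939 and 161682635828279, so the whole segment spans 79804340 bytes, of which the 79384286-byte (~75.7 MiB) ESubtreeMap alone accounts for over 99%; the EUpdate events in between make up the rest.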

2. In this situation:

(1) The time spent in ESubtreeMap::encode is mostly between 512 ms and 1000 ms (the histogram buckets below are in nanoseconds):
my_hist:
[512M, 1G)   23 |@@@@@@@@@@@@@@@@@@@@@@@@@@@|
real 0m33.978s
user 0m0.157s
sys 0m0.045s
(2) The time spent in MDCache::create_subtree_map is mostly between 256 ms and 512 ms:
my_hist:
[128M, 256M)  7 |@@@@@                       |
[256M, 512M) 32 |@@@@@@@@@@@@@@@@@@@@@@@@@@  |
[512M, 1G)    6 |@@@@                        |
real 0m26.269s
user 0m0.169s
sys 0m0.035s
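
For anyone who wants to reproduce such measurements without tracing tooling, a minimal scoped timer around the calls in question gives comparable numbers (hypothetical instrumentation, not code from the Ceph tree):

    #include <chrono>
    #include <iostream>

    // Hypothetical helper: prints how long the enclosing scope took.
    // One could wrap e.g. the MDCache::create_subtree_map() call or the
    // ESubtreeMap encode step with it to spot-check the latencies above.
    struct ScopedTimer {
      const char* name;
      std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();
      ~ScopedTimer() {
        auto ms = std::chrono::duration_cast<std::chrono::duration<double, std::milli>>(
            std::chrono::steady_clock::now() - start).count();
        std::cerr << name << " took " << ms << " ms\n";
      }
    };

    // Usage sketch: { ScopedTimer t{"create_subtree_map"}; /* call under test */ }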

3. So the CPU utilization of md_submit (which calls ESubtreeMap::encode) and ms_dispatch (which calls MDCache::create_subtree_map) is high:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
12085 ceph      20   0  133.2g  19.7g   9660 R  93.8  5.2  15040:53 ms_dispatch
18518 ceph      20   0  133.2g  19.7g   9660 R  93.8  5.2  29573:53 md_submit
62468 ceph      20   0  133.2g  19.7g   9660 S   0.0  5.2 290:34.59 ceph-mds
12074 ceph      20   0  133.2g  19.7g   9660 S   0.0  5.2 164:03.83 log
12075 ceph      20   0  133.2g  19.7g   9660 S   0.0  5.2   2172:08 msgr-worker-0
12076 ceph      20   0  133.2g  19.7g   9660 S   0.0  5.2   2003:44 msgr-worker-1
12077 ceph      20   0  133.2g  19.7g   9660 S   0.0  5.2   2067:13 msgr-worker-2
12082 ceph      20   0  133.2g  19.7g   9660 S   0.0  5.2   1:01.53 service
12083 ceph      20   0  133.2g  19.7g   9660 S   0.0  5.2   0:02.35 admin_socket
12084 ceph      20   0  133.2g  19.7g   9660 S   0.0  5.2   0:00.01 signal_handler
12086 ceph      20   0  133.2g  19.7g   9660 S   0.0  5.2   0:00.00 ms_local
12087 ceph      20   0  133.2g  19.7g   9660 S   0.0  5.2   6:01.68 safe_timer
12088 ceph      20   0  133.2g  19.7g   9660 S   0.0  5.2   0:00.05 fn_anonymous
12089 ceph      20   0  133.2g  19.7g   9660 S   0.0  5.2 162:41.85 safe_timer
12090 ceph      20   0  133.2g  19.7g   9660 S   0.0  5.2   2:26.91 ceph-mds
12121 ceph      20   0  133.2g  19.7g   9660 S   0.0  5.2   6:57.85 safe_timer
18510 ceph      20   0  133.2g  19.7g   9660 S   0.0  5.2  11:42.19 ms_dispatch
18511 ceph      20   0  133.2g  19.7g   9660 S   0.0  5.2  89:07.23 OpHistorySvc
18512 ceph      20   0  133.2g  19.7g   9660 S   0.0  5.2 262:38.54 ms_dispatch
18513 ceph      20   0  133.2g  19.7g   9660 S   0.0  5.2  13:08.91 mds_rank_progr
18514 ceph      20   0  133.2g  19.7g   9660 S   0.0  5.2  97:22.64 PQ_Finisher
18515 ceph      20   0  133.2g  19.7g   9660 S   0.0  5.2   0:28.42 safe_timer
18516 ceph      20   0  133.2g  19.7g   9660 S   0.0  5.2   2646:53 MR_Finisher

4. I think this problem will seriously affect read and write IOPS.

5. @Xiubo Li has provided a patch:
https://github.com/ceph/ceph/pull/44180 "mds/MDLog: use a larger sigment size if the event max size is too large"
But this patch has not been merged yet.
Will anybody review it again?

We use ceph version 15.2.16 octopus (stable).

cpu.png View (113 KB) zhikuo du, 10/25/2022 09:23 AM

create_subtree_map.png View (21.2 KB) zhikuo du, 10/25/2022 09:23 AM

encode.png View (17.3 KB) zhikuo du, 10/25/2022 09:23 AM

ESubtree Size.png View (318 KB) zhikuo du, 10/25/2022 09:24 AM

History

#1 Updated by Xiubo Li over 1 year ago

zhikuo du wrote:
[...]

4. I think this problem will seriously affect read and write IOPS.

5. @Xiubo Li has provided a patch:
https://github.com/ceph/ceph/pull/44180 "mds/MDLog: use a larger sigment size if the event max size is too large"
But this patch has not been merged yet.
Will anybody review it again?

Yeah, this bug has been seen several times and my PR [1] is still under review. Zheng also has a PR [2] that removes the ESubtreeMap event, but the problem is that the ESubtreeMap is not the only event that can grow to a large size and trigger this bug. As I remember from my earlier tests, some other events can also reach large sizes and would eventually trigger it sooner or later.

[1] https://github.com/ceph/ceph/pull/44180
[2] https://github.com/ceph/ceph/pull/38042

We use ceph version 15.2.16 octopus (stable).

#2 Updated by zhikuo du over 1 year ago

Xiubo Li wrote:

zhikuo du wrote:
[...]

4. I think this problem will seriously affect read and write IOPS.

5. @Xiubo Li has provided a patch:
https://github.com/ceph/ceph/pull/44180 "mds/MDLog: use a larger sigment size if the event max size is too large"
But this patch has not been merged yet.
Will anybody review it again?

Yeah, this bug has been seen several times and my PR [1] is still under review. Zheng also has a PR [2] that removes the ESubtreeMap event, but the problem is that the ESubtreeMap is not the only event that can grow to a large size and trigger this bug. As I remember from my earlier tests, some other events can also reach large sizes and would eventually trigger it sooner or later.

[1] https://github.com/ceph/ceph/pull/44180
[2] https://github.com/ceph/ceph/pull/38042

We use ceph version 15.2.16 octopus (stable).

1. Zheng's change is very big.
2. If we want to reduce the impact of this problem in a production environment, PR [1] should be feasible.
3. By the way:
(1) In the dynamic subtree migration mechanism, is it possible to add a new migration method that keeps the ESubtreeMap event from getting too big? (Maybe it is not easy.)
(2) If we cannot control the size of the ESubtreeMap event and it grows even larger, for example to 300 MB, I think ESubtreeMap::encode and MDCache::create_subtree_map will take even more time (more than 1 second, I think), and the IOPS will drop even further!

#3 Updated by Xiubo Li over 1 year ago

zhikuo du wrote:

Xiubo Li wrote:

zhikuo du wrote:
[...]

4. I think this problem will seriously affect read and write IOPS.

5. @Xiubo Li has provided a patch:
https://github.com/ceph/ceph/pull/44180 "mds/MDLog: use a larger sigment size if the event max size is too large"
But this patch has not been merged yet.
Will anybody review it again?

Yeah, this bug has been seen several times and my PR [1] is still under review. Zheng also has a PR [2] that removes the ESubtreeMap event, but the problem is that the ESubtreeMap is not the only event that can grow to a large size and trigger this bug. As I remember from my earlier tests, some other events can also reach large sizes and would eventually trigger it sooner or later.

[1] https://github.com/ceph/ceph/pull/44180
[2] https://github.com/ceph/ceph/pull/38042

We use ceph version 15.2.16 octopus (stable).

1. Zheng's change is very big.

Yeah, right!

2. If we want to reduce the impact of this problem in a production environment, PR [1] should be feasible.
3. By the way:
(1) In the dynamic subtree migration mechanism, is it possible to add a new migration method that keeps the ESubtreeMap event from getting too big? (Maybe it is not easy.)

I am afraid we have to do what Zheng has done in his PR [2] then.

(2) If we cannot control the size of the ESubtreeMap event and it grows even larger, for example to 300 MB, I think ESubtreeMap::encode and MDCache::create_subtree_map will take even more time (more than 1 second, I think), and the IOPS will drop even further!

#4 Updated by zhikuo du over 1 year ago

Xiubo Li wrote:

zhikuo du wrote:
[...]

4. I think this problem will seriously affect read and write IOPS.

5. @Xiubo Li has provided a patch:
https://github.com/ceph/ceph/pull/44180 "mds/MDLog: use a larger sigment size if the event max size is too large"
But this patch has not been merged yet.
Will anybody review it again?

Yeah, this bug has been seen several times and my PR [1] is still under review. Zheng also has a PR [2] that removes the ESubtreeMap event, but the problem is that the ESubtreeMap is not the only event that can grow to a large size and trigger this bug. As I remember from my earlier tests, some other events can also reach large sizes and would eventually trigger it sooner or later.

[1] https://github.com/ceph/ceph/pull/44180
[2] https://github.com/ceph/ceph/pull/38042

We use ceph version 15.2.16 octopus (stable).

May I ask you a question:
What factors decide how many events each ESubtreeMap event must cover? And what is the maximum event count?
In PR [1], the number looks like 10000?
@Xiubo Li

#5 Updated by zhikuo du over 1 year ago

zhikuo du wrote:

Xiubo Li wrote:

zhikuo du wrote:
[...]

4. I think this problem will seriously affect read and write IOPS.

5. @Xiubo Li has provided a patch:
https://github.com/ceph/ceph/pull/44180 "mds/MDLog: use a larger sigment size if the event max size is too large"
But this patch has not been merged yet.
Will anybody review it again?

Yeah, this bug has been seen several times and my PR [1] is still under review. Zheng also has a PR [2] that removes the ESubtreeMap event, but the problem is that the ESubtreeMap is not the only event that can grow to a large size and trigger this bug. As I remember from my earlier tests, some other events can also reach large sizes and would eventually trigger it sooner or later.

[1] https://github.com/ceph/ceph/pull/44180
[2] https://github.com/ceph/ceph/pull/38042

We use ceph version 15.2.16 octopus (stable).

May I ask you a question:
What factors decide how many events each ESubtreeMap event must cover? And what is the maximum event count?
In PR [1], the number looks like 10000?
@Xiubo Li

Some additional information:
With dynamic migration, rank 0's ESubtreeMap event size may be much larger than the other ranks'. For example: rank 0 is about 75 MB while rank 9 is about 5 MB.
In this situation, with PR [1], rank 0 will have more events per ESubtreeMap event than the other ranks.

#6 Updated by zhikuo du over 1 year ago

May I ask you a question:
What factors decide how many events each ESubtreeMap event must cover? And what is the maximum event count?
In PR [1], the number looks like 10000?
@Xiubo Li

@Xiubo Li @Venky Shankar

I read the code for how segments are trimmed and how ESubtreeMap/EImportStart/EImportFinish/EExport are replayed.
I also read Zheng's design summary documentation.

Now I think that if we keep the ESubtreeMap as the first event of each segment and set mds_log_events_per_segment to a larger value, for example 10240 or 20480, then logically it should work well too.

Do you think so?

#7 Updated by zhikuo du over 1 year ago

@Xiubo Li @Venky Shankar

I read the code for how segments are trimmed and how ESubtreeMap/EImportStart/EImportFinish/EExport are replayed.
I also read Zheng's design summary documentation.

Now I think that if we keep the ESubtreeMap as the first event of each segment and set mds_log_events_per_segment to a larger value, for example 10240 or 20480, then logically it should work well too.

Do you think so?

I mean that:
Until we can remove the ESubtreeMap event or apply some other method, and as long as the disk throughput/bandwidth cost is acceptable, we can delete the "ls->end/period != ls->offset/period" check in _submit_entry(); it should work well too.
I will run some tests for this.
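
For context, the check being discussed lives in the segment-rollover logic of MDLog::_submit_entry(). Here is a paraphrased sketch (reconstructed from the Octopus-era source, so details may differ by version); the proposal amounts to dropping the first half of the condition:

    // Paraphrased sketch of the rollover decision in MDLog::_submit_entry();
    // `period` is the journal layout period (4 MiB by default) and `ls` is
    // the current LogSegment.
    uint64_t period = journaler->get_layout_period();
    if (ls->end / period != ls->offset / period ||   // segment crossed a period
                                                     // boundary (size check), or
        ls->num_events >= g_conf()->mds_log_events_per_segment) {  // segment "full"
      _start_new_segment();  // the new segment begins with a fresh ESubtreeMap
    }
    // The proposal: remove the size check so that only
    // mds_log_events_per_segment decides when a new segment (and thus a new
    // ESubtreeMap) starts.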

#8 Updated by Xiubo Li over 1 year ago

zhikuo du wrote:

May I ask you a question:
What factors decide how many events each ESubtreeMap event must cover? And what is the maximum event count?
In PR [1], the number looks like 10000?
@Xiubo Li

@Xiubo Li @Venky Shankar

I read the code for how segments are trimmed and how ESubtreeMap/EImportStart/EImportFinish/EExport are replayed.
I also read Zheng's design summary documentation.

Now I think that if we keep the ESubtreeMap as the first event of each segment and set mds_log_events_per_segment to a larger value, for example 10240 or 20480, then logically it should work well too.

Do you think so?

I am afraid this won't work. As I remember from my earlier tests, the size of the ESubtreeMap can reach up to several MBs, while the default segment size is 4 MB. That means that no matter what mds_log_events_per_segment is set to, each segment will only include 2 events in this case: the first one is the ESubtreeMap event and the second one is any other event except ESubtreeMap.
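
A quick illustration, using the numbers from this ticket, of why raising mds_log_events_per_segment alone does not help (assuming the default 4 MiB layout period):

    #include <cstdint>
    #include <iostream>

    int main() {
      const uint64_t period = 4ull << 20;          // default journal layout period: 4 MiB
      const uint64_t offset = 161682556023939ull;  // segment start, from the log above
      const uint64_t end = offset + 79384286ull;   // position after the ~75.7 MiB ESubtreeMap

      // The same size test applied at submit time: has the segment crossed a
      // period boundary? With a ~75 MiB first event it always has, so the very
      // next event starts a new segment no matter how large
      // mds_log_events_per_segment is.
      std::cout << std::boolalpha
                << (end / period != offset / period) << '\n';  // prints: true
    }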

#9 Updated by zhikuo du over 1 year ago

I am afraid this won't work. As I remember from my earlier tests, the size of the ESubtreeMap can reach up to several MBs, while the default segment size is 4 MB. That means that no matter what mds_log_events_per_segment is set to, each segment will only include 2 events in this case: the first one is the ESubtreeMap event and the second one is any other event except ESubtreeMap.

I mean that:
First, in _submit_entry(), we delete the "ls->end/period != ls->offset/period" check. Second, we set mds_log_events_per_segment to a larger value, for example 20480. I think it should work well.

If that works, then even if the size of the ESubtreeMap is about 75 MB, setting mds_log_events_per_segment to a larger value can effectively reduce the impact of this problem in a production environment (the IOPS will not drop as much).
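
A rough back-of-the-envelope with the numbers above: today each segment effectively carries the ~75.7 MiB ESubtreeMap for about one other event, so nearly every journaled operation pays for the full map. With the size check removed and mds_log_events_per_segment = 20480, that overhead would amortize to roughly 79384286 / 20480 ≈ 3.8 KB of subtree-map bytes per event, at the cost of much larger segments to replay and trim.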

#10 Updated by Xiubo Li over 1 year ago

zhikuo du wrote:

I am afraid this won't work. As I remember from my earlier tests, the size of the ESubtreeMap can reach up to several MBs, while the default segment size is 4 MB. That means that no matter what mds_log_events_per_segment is set to, each segment will only include 2 events in this case: the first one is the ESubtreeMap event and the second one is any other event except ESubtreeMap.

I mean that:
First, in _submit_entry(), we delete the "ls->end/period != ls->offset/period" check. Second, we set mds_log_events_per_segment to a larger value, for example 20480. I think it should work well.

I am not sure whether we can delete this code.

#11 Updated by Venky Shankar over 1 year ago

Hi,

Could the list of PRs that try to address this issue be linked (so that we don't lose track of them)?

As far as the issue is concerned, I left comments on the PRs themselves, and this requires more discussion.

#12 Updated by zhikuo du over 1 year ago

Venky Shankar wrote:

Hi,

Could the list of PRs that try to address this issue be linked (so that we don't lose track of them)?

As far as the issue is concerned, I left comments on the PRs themselves, and this requires more discussion.

Sure. Thanks!
What I know so far is the following:
[1] https://github.com/ceph/ceph/pull/44180
[2] https://github.com/ceph/ceph/pull/38042
[3] https://github.com/ceph/ceph/pull/48615
