Bug #48778 (open)
Setting quota triggered ops storms on meta pool
Description
A few days ago, I experimented with setting a quota on a handful (fewer than 60) of directories on my CephFS filesystem. At the time, I was running version 15.2.1. After a while, I noticed that the number of operations on my meta pool was extremely high: normally it is near zero, or occasionally a few hundred, but now it was at about 10K. I removed all the quotas I had set in the hope that this would resolve the issue, but to no avail.
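For context, the quotas were set and later removed through extended attributes, along these lines (the mount point and directory name here are placeholders, not my actual paths):

```shell
# Set a quota of ~100 GB on a CephFS directory (requires a mounted CephFS)
setfattr -n ceph.quota.max_bytes -v 100000000000 /mnt/cephfs/somedir

# Removing a quota is done by setting the value back to 0
setfattr -n ceph.quota.max_bytes -v 0 /mnt/cephfs/somedir
```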
As I was due for an upgrade anyway, I upgraded to 15.2.8 yesterday. At first, things seemed to be back to normal. After a few hours, however, a strange pattern emerged: operations on the meta pool would spike to 20K for approximately 1h40m and then return to normal for approximately 3h15m, recurring roughly every 5 hours.
I have attached a screenshot to illustrate what this looks like. The first "wave" of 10K IOPS lasts until the upgrade; the 20K spikes that follow are from after the upgrade.
During these spikes, dump_ops_in_flight shows a large number of pending operations, many of them pertaining to ceph.quota (despite no quotas being set anymore), all with at least one "failed to xlock, waiting" event. An excerpt is below.
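The excerpt below was collected from the active MDS via its admin socket, with something like this (the daemon name "myhost" is a placeholder for the actual MDS name):

```shell
# List operations currently in flight on the MDS, via its local admin socket
ceph daemon mds.myhost dump_ops_in_flight
```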
{ "description": "client_request(mds.0:61937455 setxattr #0x10006a6465b ceph.quota caller_uid=0, caller_gid=0{})", "initiated_at": "2021-01-06T09:45:41.610493+0000", "age": 0.0010133469999999999, "duration": 0.001879094, "type_data": { "flag_point": "failed to xlock, waiting", "reqid": "mds.0:61937455", "op_type": "client_request", "client_info": { "client": "mds.0", "tid": 61937455 }, "events": [ { "time": "2021-01-06T09:45:41.610493+0000", "event": "initiated" }, { "time": "2021-01-06T09:45:41.610493+0000", "event": "throttled" }, { "time": "2021-01-06T09:45:41.610493+0000", "event": "header_read" }, { "time": "2021-01-06T09:45:41.610493+0000", "event": "all_read" }, { "time": "2021-01-06T09:45:41.610598+0000", "event": "dispatched" }, { "time": "2021-01-06T09:45:41.610607+0000", "event": "failed to xlock, waiting" }, { "time": "2021-01-06T09:45:41.612276+0000", "event": "failed to xlock, waiting" } ] } },

{ "description": "client_request(mds.0:61937456 setxattr #0x100000014c1 ceph.quota caller_uid=0, caller_gid=0{})", "initiated_at": "2021-01-06T09:45:41.610515+0000", "age": 0.00099128499999999995, "duration": 0.00189281, "type_data": { "flag_point": "failed to xlock, waiting", "reqid": "mds.0:61937456", "op_type": "client_request", "client_info": { "client": "mds.0", "tid": 61937456 }, "events": [ { "time": "2021-01-06T09:45:41.610515+0000", "event": "initiated" }, { "time": "2021-01-06T09:45:41.610515+0000", "event": "throttled" }, { "time": "2021-01-06T09:45:41.610515+0000", "event": "header_read" }, { "time": "2021-01-06T09:45:41.610515+0000", "event": "all_read" }, { "time": "2021-01-06T09:45:41.610609+0000", "event": "dispatched" }, { "time": "2021-01-06T09:45:41.610615+0000", "event": "failed to xlock, waiting" }, { "time": "2021-01-06T09:45:41.612294+0000", "event": "failed to xlock, waiting" } ] } },

{ "description": "client_request(client.75073:35290007 rmdir #0x10006acee5b/fonts 2021-01-06T09:45:41.609114+0000 caller_uid=3166, caller_gid=1000{})", "initiated_at": "2021-01-06T09:45:41.610660+0000", "age": 0.00084625900000000003, "duration": 0.001782741, "type_data": { "flag_point": "acquired locks", "reqid": "client.75073:35290007", "op_type": "client_request", "client_info": { "client": "client.75073", "tid": 35290007 }, "events": [ { "time": "2021-01-06T09:45:41.610660+0000", "event": "initiated" }, { "time": "2021-01-06T09:45:41.610661+0000", "event": "throttled" }, { "time": "2021-01-06T09:45:41.610660+0000", "event": "header_read" }, { "time": "2021-01-06T09:45:41.610663+0000", "event": "all_read" }, { "time": "2021-01-06T09:45:41.610673+0000", "event": "dispatched" }, { "time": "2021-01-06T09:45:41.610690+0000", "event": "acquired locks" }, { "time": "2021-01-06T09:45:41.610697+0000", "event": "failed to xlock, waiting" }, { "time": "2021-01-06T09:45:41.612426+0000", "event": "acquired locks" } ] } },

{ "description": "client_request(client.75073:35287363 setfilelock rule 2, type 2, owner 13411631674495090448, pid 9437, start 0, length 0, wait 1 #0x10006accd98 2021-01-06T09:44:08.248412+0000 caller_uid=2730, caller_gid=1000{})", "initiated_at": "2021-01-06T09:44:08.251306+0000", "age": 93.360200323000001, "duration": 93.36117634, "type_data": { "flag_point": "failed to add lock, waiting", "reqid": "client.75073:35287363", "op_type": "client_request", "client_info": { "client": "client.75073", "tid": 35287363 }, "events": [ { "time": "2021-01-06T09:44:08.251306+0000", "event": "initiated" }, { "time": "2021-01-06T09:44:08.251308+0000", "event": "throttled" }, { "time": "2021-01-06T09:44:08.251306+0000", "event": "header_read" }, { "time": "2021-01-06T09:44:08.251314+0000", "event": "all_read" }, { "time": "2021-01-06T09:44:08.251328+0000", "event": "dispatched" }, { "time": "2021-01-06T09:44:08.251353+0000", "event": "acquired locks" }, { "time": "2021-01-06T09:44:08.251360+0000", "event": "failed to add lock, waiting" }, { "time": "2021-01-06T09:44:10.683395+0000", "event": "acquired locks" }, { "time": "2021-01-06T09:44:10.683397+0000", "event": "failed to add lock, waiting" }, { "time": "2021-01-06T09:44:17.626619+0000", "event": "acquired locks" }, { "time": "2021-01-06T09:44:17.626621+0000", "event": "failed to add lock, waiting" }, { "time": "2021-01-06T09:44:46.166481+0000", "event": "acquired locks" }, { "time": "2021-01-06T09:44:46.166483+0000", "event": "failed to add lock, waiting" }, { "time": "2021-01-06T09:44:46.371158+0000", "event": "acquired locks" }, { "time": "2021-01-06T09:44:46.371159+0000", "event": "failed to add lock, waiting" }, { "time": "2021-01-06T09:44:47.581885+0000", "event": "acquired locks" }, { "time": "2021-01-06T09:44:47.581886+0000", "event": "failed to add lock, waiting" }, { "time": "2021-01-06T09:44:52.718687+0000", "event": "acquired locks" }, { "time": "2021-01-06T09:44:52.718689+0000", "event": "failed to add lock, waiting" }, { "time": "2021-01-06T09:45:31.331148+0000", "event": "acquired locks" }, { "time": "2021-01-06T09:45:31.331150+0000", "event": "failed to add lock, waiting" } ] } },

{ "description": "client_request(mds.0:61937457 setxattr #0x10006a64656 ceph.quota caller_uid=0, caller_gid=0{})", "initiated_at": "2021-01-06T09:45:41.610708+0000", "age": 0.00079801799999999995, "duration": 0.0018484860000000001, "type_data": { "flag_point": "submit entry: journal_and_reply", "reqid": "mds.0:61937457", "op_type": "client_request", "client_info": { "client": "mds.0", "tid": 61937457 }, "events": [ { "time": "2021-01-06T09:45:41.610708+0000", "event": "initiated" }, { "time": "2021-01-06T09:45:41.610708+0000", "event": "throttled" }, { "time": "2021-01-06T09:45:41.610708+0000", "event": "header_read" }, { "time": "2021-01-06T09:45:41.610708+0000", "event": "all_read" }, { "time": "2021-01-06T09:45:41.610728+0000", "event": "dispatched" }, { "time": "2021-01-06T09:45:41.610750+0000", "event": "failed to xlock, waiting" }, { "time": "2021-01-06T09:45:41.612514+0000", "event": "acquired locks" }, { "time": "2021-01-06T09:45:41.612523+0000", "event": "submit entry: journal_and_reply" } ] } } ],
I'm at a loss as to what to do next, but I do think I have hit a bug somehow: I don't see why this happened in the first place, and especially why it is still happening now, after removing all quotas and upgrading (and thus restarting everything).