https://tracker.ceph.com/
https://tracker.ceph.com/favicon.ico
2022-07-25T15:42:22Z
Ceph
mgr - Bug #56675: mgr messenger policy not initialized properly
https://tracker.ceph.com/issues/56675?journal_id=221227
2022-07-25T15:42:22Z
Kamoltat (Junior) Sirivadhna
<ul><li><strong>Status</strong> changed from <i>New</i> to <i>In Progress</i></li><li><strong>Pull request ID</strong> set to <i>47238</i></li></ul>
mgr - Bug #56675: mgr messenger policy not initialized properly
https://tracker.ceph.com/issues/56675?journal_id=221240
2022-07-25T17:26:08Z
Kamoltat (Junior) Sirivadhna
<ul><li><strong>Assignee</strong> set to <i>Xinying Song</i></li></ul>
mgr - Bug #56675: mgr messenger policy not initialized properly
https://tracker.ceph.com/issues/56675?journal_id=221452
2022-07-28T00:32:03Z
Xinying Song
<ul></ul><p>log snippet:<br />At first, a `pg dump` command comes, all goes well<br />```<br />2022-07-20 16:57:29.724832 7f79ad631700 20 -- 10.94.39.164:6804/27529 >> 10.94.39.164:0/666659707 conn(0x5635c2394800 :6804 s=STATE_OPEN pgs=3 cs=1 l=1).process prev state is STATE_OPEN<br />2022-07-20 16:57:29.724840 7f79ad631700 20 -- 10.94.39.164:6804/27529 >> 10.94.39.164:0/666659707 conn(0x5635c2394800 :6804 s=STATE_OPEN_MESSAGE_HEADER pgs=3 cs=1 l=1).process prev state is STATE_OPEN<br />2022-07-20 16:57:29.724842 7f79ad631700 20 -- 10.94.39.164:6804/27529 >> 10.94.39.164:0/666659707 conn(0x5635c2394800 :6804 s=STATE_OPEN_MESSAGE_HEADER pgs=3 cs=1 l=1).process begin MSG<br />2022-07-20 16:57:29.724845 7f79ad631700 20 -- 10.94.39.164:6804/27529 >> 10.94.39.164:0/666659707 conn(0x5635c2394800 :6804 s=STATE_OPEN_MESSAGE_HEADER pgs=3 cs=1 l=1).process got MSG header<br />2022-07-20 16:57:29.724848 7f79ad631700 20 -- 10.94.39.164:6804/27529 >> 10.94.39.164:0/666659707 conn(0x5635c2394800 :6804 s=STATE_OPEN_MESSAGE_HEADER pgs=3 cs=1 l=1).process got envelope type=97 src client.5217516 front=93 data=0 off 0<br />2022-07-20 16:57:29.724851 7f79ad631700 20 -- 10.94.39.164:6804/27529 >> 10.94.39.164:0/666659707 conn(0x5635c2394800 :6804 s=STATE_OPEN_MESSAGE_THROTTLE_MESSAGE pgs=3 cs=1 l=1).process prev state is STATE_OPEN_MESSAGE_HEADER<br />2022-07-20 16:57:29.724853 7f79ad631700 10 -- 10.94.39.164:6804/27529 >> 10.94.39.164:0/666659707 conn(0x5635c2394800 :6804 s=STATE_OPEN_MESSAGE_THROTTLE_MESSAGE pgs=3 cs=1 l=1).process wants 1 message from policy throttler 0/128<br />2022-07-20 16:57:29.724856 7f79ad631700 20 -- 10.94.39.164:6804/27529 >> 10.94.39.164:0/666659707 conn(0x5635c2394800 :6804 s=STATE_OPEN_MESSAGE_THROTTLE_BYTES pgs=3 cs=1 l=1).process prev state is STATE_OPEN_MESSAGE_THROTTLE_MESSAGE<br />2022-07-20 16:57:29.724859 7f79ad631700 10 -- 10.94.39.164:6804/27529 >> 10.94.39.164:0/666659707 conn(0x5635c2394800 :6804 s=STATE_OPEN_MESSAGE_THROTTLE_BYTES pgs=3 cs=1 l=1).process wants 93 bytes from policy throttler 0/134217728<br />2022-07-20 16:57:29.724862 7f79ad631700 20 -- 10.94.39.164:6804/27529 >> 10.94.39.164:0/666659707 conn(0x5635c2394800 :6804 s=STATE_OPEN_MESSAGE_THROTTLE_DISPATCH_QUEUE pgs=3 cs=1 l=1).process prev state is STATE_OPEN_MESSAGE_THROTTLE_BYTES<br />2022-07-20 16:57:29.724865 7f79ad631700 20 -- 10.94.39.164:6804/27529 >> 10.94.39.164:0/666659707 conn(0x5635c2394800 :6804 s=STATE_OPEN_MESSAGE_READ_FRONT pgs=3 cs=1 l=1).process prev state is STATE_OPEN_MESSAGE_THROTTLE_DISPATCH_QUEUE<br />2022-07-20 16:57:29.724868 7f79ad631700 20 -- 10.94.39.164:6804/27529 >> 10.94.39.164:0/666659707 conn(0x5635c2394800 :6804 s=STATE_OPEN_MESSAGE_READ_FRONT pgs=3 cs=1 l=1).process got front 93<br />2022-07-20 16:57:29.724873 7f79ad631700 10 -- 10.94.39.164:6804/27529 >> 10.94.39.164:0/666659707 conn(0x5635c2394800 :6804 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=3 cs=1 l=1).process aborted = 0<br />2022-07-20 16:57:29.724876 7f79ad631700 20 -- 10.94.39.164:6804/27529 >> 10.94.39.164:0/666659707 conn(0x5635c2394800 :6804 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=3 cs=1 l=1).process got 93 + 0 + 0 byte message<br />2022-07-20 16:57:29.724887 7f79ad631700 5 -- 10.94.39.164:6804/27529 >> 10.94.39.164:0/666659707 conn(0x5635c2394800 :6804 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=3 cs=1 l=1). rx client.5217516 seq 1 0x5635bf4ce200 command(tid 0: {"dumpcontents": ["pgs"], "prefix": "pg dump", "target": ["mgr", ""]}) v1<br />2022-07-20 16:57:29.724893 7f79ad631700 20 -- 10.94.39.164:6804/27529 queue 0x5635bf4ce200 prio 127<br />2022-07-20 16:57:29.724898 7f79ad631700 20 -- 10.94.39.164:6804/27529 >> 10.94.39.164:0/666659707 conn(0x5635c2394800 :6804 s=STATE_OPEN pgs=3 cs=1 l=1).process prev state is STATE_OPEN_MESSAGE_READ_FRONT<br />2022-07-20 16:57:29.724903 7f7991b3e700 1 -- 10.94.39.164:6804/27529 <== client.5217516 10.94.39.164:0/666659707 1 ==== command(tid 0: {"dumpcontents": ["pgs"], "prefix": "pg dump", "target": ["mgr", ""]}) v1 ==== 93+0+0 (2844267270 0 0) 0x5635bf4ce200 con 0x5635c2394800<br />```<br />A few microseconds later, messages from osd are throttled<br />```<br />2022-07-20 16:57:30.246374 7f79ac62f700 10 -- 10.94.39.164:6804/27529 >> 10.94.35.11:6826/1391972 conn(0x5635c4bbf800 :6804 s=STATE_OPEN_MESSAGE_THROTTLE_MESSAGE pgs=59789223 cs=1 l=1).process wants 1 message from policy throttle 128/128 failed, just wait.<br />2022-07-20 16:57:30.246503 7f79ad631700 5 -- 10.94.39.164:6804/27529 >> 10.94.35.26:6800/615347 conn(0x5635cd705000 :6804 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=60341386 cs=1 l=1). rx osd.99 seq 80619 0x5635ca2afb80 pg_stats(117 pgs tid 0 v 0) v1<br />2022-07-20 16:57:30.246509 7f79ad631700 20 -- 10.94.39.164:6804/27529 queue 0x5635ca2afb80 prio 127<br />2022-07-20 16:57:30.246514 7f79ad631700 20 -- 10.94.39.164:6804/27529 >> 10.94.35.26:6800/615347 conn(0x5635cd705000 :6804 s=STATE_OPEN pgs=60341386 cs=1 l=1).process prev state is STATE_OPEN_MESSAGE_READ_FRONT<br />2022-07-20 16:57:30.246770 7f79ac62f700 20 -- 10.94.39.164:6804/27529 >> 10.94.35.11:6826/1391972 conn(0x5635c4bbf800 :6804 s=STATE_OPEN_MESSAGE_THROTTLE_MESSAGE pgs=59789223 cs=1 l=1).process prev state is STATE_OPEN_MESSAGE_THROTTLE_MESSAGE<br />2022-07-20 16:57:30.246776 7f79ac62f700 10 -- 10.94.39.164:6804/27529 >> 10.94.35.11:6826/1391972 conn(0x5635c4bbf800 :6804 s=STATE_OPEN_MESSAGE_THROTTLE_MESSAGE pgs=59789223 cs=1 l=1).process wants 1 message from policy throttler 128/128<br />2022-07-20 16:57:30.246780 7f79ac62f700 10 -- 10.94.39.164:6804/27529 >> 10.94.35.11:6826/1391972 conn(0x5635c4bbf800 :6804 s=STATE_OPEN_MESSAGE_THROTTLE_MESSAGE pgs=59789223 cs=1 l=1).process wants 1 message from policy throttle 128/128 failed, just wait.<br />2022-07-20 16:57:30.246890 7f79ac62f700 20 -- 10.94.39.164:6804/27529 >> 10.94.35.11:6826/1391972 conn(0x5635c4bbf800 :6804 s=STATE_OPEN_MESSAGE_THROTTLE_MESSAGE pgs=59789223 cs=1 l=1).process prev state is STATE_OPEN_MESSAGE_THROTTLE_MESSAGE<br />2022-07-20 16:57:30.246895 7f79ac62f700 10 -- 10.94.39.164:6804/27529 >> 10.94.35.11:6826/1391972 conn(0x5635c4bbf800 :6804 s=STATE_OPEN_MESSAGE_THROTTLE_MESSAGE pgs=59789223 cs=1 l=1).process wants 1 message from policy throttler 128/128<br />2022-07-20 16:57:30.246899 7f79ac62f700 10 -- 10.94.39.164:6804/27529 >> 10.94.35.11:6826/1391972 conn(0x5635c4bbf800 :6804 s=STATE_OPEN_MESSAGE_THROTTLE_MESSAGE pgs=59789223 cs=1 l=1).process wants 1 message from policy throttle 128/128 failed, just wait.<br />2022-07-20 16:57:30.246924 7f79ac62f700 20 -- 10.94.39.164:6804/27529 >> 10.94.35.11:6826/1391972 conn(0x5635c4bbf800 :6804 s=STATE_OPEN_MESSAGE_THROTTLE_MESSAGE pgs=59789223 cs=1 l=1).process prev state is STATE_OPEN_MESSAGE_THROTTLE_MESSAGE<br />2022-07-20 16:57:30.246929 7f79ac62f700 10 -- 10.94.39.164:6804/27529 >> 10.94.35.11:6826/1391972 conn(0x5635c4bbf800 :6804 s=STATE_OPEN_MESSAGE_THROTTLE_MESSAGE pgs=59789223 cs=1 l=1).process wants 1 message from policy throttler 128/128<br />2022-07-20 16:57:30.246932 7f79ac62f700 10 -- 10.94.39.164:6804/27529 >> 10.94.35.11:6826/1391972 conn(0x5635c4bbf800 :6804 s=STATE_OPEN_MESSAGE_THROTTLE_MESSAGE pgs=59789223 cs=1 l=1).process wants 1 message from policy throttle 128/128 failed, just wait.<br />2022-07-20 16:57:30.250222 7f79ad631700 20 -- 10.94.39.164:6804/27529 >> 10.94.34.26:6812/4110130 conn(0x5635d60f8800 :6804 s=STATE_OPEN pgs=63559770 cs=1 l=1).process prev state is STATE_OPEN<br />2022-07-20 16:57:30.250233 7f79ad631700 20 -- 10.94.39.164:6804/27529 >> 10.94.34.26:6812/4110130 conn(0x5635d60f8800 :6804 s=STATE_OPEN_MESSAGE_HEADER pgs=63559770 cs=1 l=1).process prev state is STATE_OPEN<br />2022-07-20 16:57:30.250236 7f79ad631700 20 -- 10.94.39.164:6804/27529 >> 10.94.34.26:6812/4110130 conn(0x5635d60f8800 :6804 s=STATE_OPEN_MESSAGE_HEADER pgs=63559770 cs=1 l=1).process begin MSG<br />2022-07-20 16:57:30.250242 7f79ad631700 20 -- 10.94.39.164:6804/27529 >> 10.94.34.26:6812/4110130 conn(0x5635d60f8800 :6804 s=STATE_OPEN_MESSAGE_HEADER pgs=63559770 cs=1 l=1).process got MSG header<br />2022-07-20 16:57:30.250246 7f79ad631700 20 -- 10.94.39.164:6804/27529 >> 10.94.34.26:6812/4110130 conn(0x5635d60f8800 :6804 s=STATE_OPEN_MESSAGE_HEADER pgs=63559770 cs=1 l=1).process got envelope type=1794 src osd.162 front=577 data=0 off 0<br />2022-07-20 16:57:30.250250 7f79ad631700 20 -- 10.94.39.164:6804/27529 >> 10.94.34.26:6812/4110130 conn(0x5635d60f8800 :6804 s=STATE_OPEN_MESSAGE_THROTTLE_MESSAGE pgs=63559770 cs=1 l=1).process prev state is STATE_OPEN_MESSAGE_HEADER<br />2022-07-20 16:57:30.250253 7f79ad631700 10 -- 10.94.39.164:6804/27529 >> 10.94.34.26:6812/4110130 conn(0x5635d60f8800 :6804 s=STATE_OPEN_MESSAGE_THROTTLE_MESSAGE pgs=63559770 cs=1 l=1).process wants 1 message from policy throttler 128/128<br />2022-07-20 16:57:30.250256 7f79ad631700 10 -- 10.94.39.164:6804/27529 >> 10.94.34.26:6812/4110130 conn(0x5635d60f8800 :6804 s=STATE_OPEN_MESSAGE_THROTTLE_MESSAGE pgs=63559770 cs=1 l=1).process wants 1 message from policy throttle 128/128 failed, just wait.<br />2022-07-20 16:57:30.250689 7f79ad631700 20 -- 10.94.39.164:6804/27529 >> 10.94.34.26:6812/4110130 conn(0x5635d60f8800 :6804 s=STATE_OPEN_MESSAGE_THROTTLE_MESSAGE pgs=63559770 cs=1 l=1).process prev state is STATE_OPEN_MESSAGE_THROTTLE_MESSAGE<br />2022-07-20 16:57:30.250695 7f79ad631700 10 -- 10.94.39.164:6804/27529 >> 10.94.34.26:6812/4110130 conn(0x5635d60f8800 :6804 s=STATE_OPEN_MESSAGE_THROTTLE_MESSAGE pgs=63559770 cs=1 l=1).process wants 1 message from policy throttler 128/128<br />2022-07-20 16:57:30.250699 7f79ad631700 10 -- 10.94.39.164:6804/27529 >> 10.94.34.26:6812/4110130 conn(0x5635d60f8800 :6804 s=STATE_OPEN_MESSAGE_THROTTLE_MESSAGE pgs=63559770 cs=1 l=1).process wants 1 message from policy throttle 128/128 failed, just wait.<br />2022-07-20 16:57:30.250776 7f79ad631700 20 -- 10.94.39.164:6804/27529 >> 10.94.34.26:6812/4110130 conn(0x5635d60f8800 :6804 s=STATE_OPEN_MESSAGE_THROTTLE_MESSAGE pgs=63559770 cs=1 l=1).process prev state is STATE_OPEN_MESSAGE_THROTTLE_MESSAGE<br />2022-07-20 16:57:30.250781 7f79ad631700 10 -- 10.94.39.164:6804/27529 >> 10.94.34.26:6812/4110130 conn(0x5635d60f8800 :6804 s=STATE_OPEN_MESSAGE_THROTTLE_MESSAGE pgs=63559770 cs=1 l=1).process wants 1 message from policy throttler 128/128<br />2022-07-20 16:57:30.250785 7f79ad631700 10 -- 10.94.39.164:6804/27529 >> 10.94.34.26:6812/4110130 conn(0x5635d60f8800 :6804 s=STATE_OPEN_MESSAGE_THROTTLE_MESSAGE pgs=63559770 cs=1 l=1).process wants 1 message from policy throttle 128/128 failed, just wait.<br />2022-07-20 16:57:30.250806 7f79ad631700 20 -- 10.94.39.164:6804/27529 >> 10.94.34.26:6812/4110130 conn(0x5635d60f8800 :6804 s=STATE_OPEN_MESSAGE_THROTTLE_MESSAGE pgs=63559770 cs=1 l=1).process prev state is STATE_OPEN_MESSAGE_THROTTLE_MESSAGE<br />2022-07-20 16:57:30.250811 7f79ad631700 10 -- 10.94.39.164:6804/27529 >> 10.94.34.26:6812/4110130 conn(0x5635d60f8800 :6804 s=STATE_OPEN_MESSAGE_THROTTLE_MESSAGE pgs=63559770 cs=1 l=1).process wants 1 message from policy throttler 128/128<br />2022-07-20 16:57:30.250814 7f79ad631700 10 -- 10.94.39.164:6804/27529 >> 10.94.34.26:6812/4110130 conn(0x5635d60f8800 :6804 s=STATE_OPEN_MESSAGE_THROTTLE_MESSAGE pgs=63559770 cs=1 l=1).process wants 1 message from policy throttle 128/128 failed, just wait.<br />```<br />the throttle warning is consistent until `pg dump` command is finished by dispatch-queue thread<br />```<br />2022-07-20 16:57:30.421196 7f79ac62f700 10 -- 10.94.39.164:6804/27529 >> 10.94.36.12:6830/2043440 conn(0x5635cee1d000 :6804 s=STATE_OPEN_MESSAGE_THROTTLE_MESSAGE pgs=61380867 cs=1 l=1).process wants 1 message from policy throttle 128/128 failed, just wait.<br />2022-07-20 16:57:30.421230 7f79ac62f700 20 -- 10.94.39.164:6804/27529 >> 10.94.36.12:6830/2043440 conn(0x5635cee1d000 :6804 s=STATE_OPEN_MESSAGE_THROTTLE_MESSAGE pgs=61380867 cs=1 l=1).process prev state is STATE_OPEN_MESSAGE_THROTTLE_MESSAGE<br />2022-07-20 16:57:30.421235 7f79ac62f700 10 -- 10.94.39.164:6804/27529 >> 10.94.36.12:6830/2043440 conn(0x5635cee1d000 :6804 s=STATE_OPEN_MESSAGE_THROTTLE_MESSAGE pgs=61380867 cs=1 l=1).process wants 1 message from policy throttler 128/128<br />2022-07-20 16:57:30.421239 7f79ac62f700 10 -- 10.94.39.164:6804/27529 >> 10.94.36.12:6830/2043440 conn(0x5635cee1d000 :6804 s=STATE_OPEN_MESSAGE_THROTTLE_MESSAGE pgs=61380867 cs=1 l=1).process wants 1 message from policy throttle 128/128 failed, just wait.<br />2022-07-20 16:57:30.421261 7f79ac62f700 20 -- 10.94.39.164:6804/27529 >> 10.94.36.12:6830/2043440 conn(0x5635cee1d000 :6804 s=STATE_OPEN_MESSAGE_THROTTLE_MESSAGE pgs=61380867 cs=1 l=1).process prev state is STATE_OPEN_MESSAGE_THROTTLE_MESSAGE<br />2022-07-20 16:57:30.421264 7f79ac62f700 10 -- 10.94.39.164:6804/27529 >> 10.94.36.12:6830/2043440 conn(0x5635cee1d000 :6804 s=STATE_OPEN_MESSAGE_THROTTLE_MESSAGE pgs=61380867 cs=1 l=1).process wants 1 message from policy throttler 128/128<br />2022-07-20 16:57:30.421268 7f79ac62f700 10 -- 10.94.39.164:6804/27529 >> 10.94.36.12:6830/2043440 conn(0x5635cee1d000 :6804 s=STATE_OPEN_MESSAGE_THROTTLE_MESSAGE pgs=61380867 cs=1 l=1).process wants 1 message from policy throttle 128/128 failed, just wait.<br />2022-07-20 16:57:30.422552 7f79a7625700 1 mgr send_beacon active<br />2022-07-20 16:57:30.423047 7f79a7625700 1 -- 10.94.39.164:0/1281059177 --> 10.94.37.161:6789/0 -- mgrbeacon mgr.public-bjxy-c610(af28a938-4713-49df-beb1-440a0e985857,5107680, 10.94.39.164:6804/27529, 1) v6 -- 0x5635c2f00600 con 0<br />2022-07-20 16:57:30.423060 7f79a7625700 15 -- 10.94.39.164:0/1281059177 >> 10.94.37.161:6789/0 conn(0x5635bf3c0800 :-1 s=STATE_OPEN pgs=1099769 cs=1 l=1).send_message inline write is denied, reschedule m=0x5635c2f00600<br />```<br />from there, the throttle warning logs disappeared.</p>
<p>when a `pg dump` command is running, `mgr perf dump` will show that only 'mon_mgr_throttler' is full, and other throttles are empty. That is unexpected since we have setup different throttles in mgr. This can be solved by adding different messenger policies in mgr, then different throttlers will be bound to the corresponding policy instead the default policy.</p>
<p>The root cause for throttler being full is that there is no fast-dispatch in mgr. Acquiring throttle token is processed in multi asyn-msg-worker threads and release token is processed in single dispatch-queue thread. Commands like `pg dump` in a large-scale cluster will cost significant time in dispatch-queue thread, so all messages who have acquired token have to wait until `pg dump` finishes.</p>
<p>Adding a fast-dispatch in mgr maybe need more discussion, but binding different throttlers to corresponding policy instead to the default one is necessary and can be done now.</p>
mgr - Bug #56675: mgr messenger policy not initialized properly
https://tracker.ceph.com/issues/56675?journal_id=221455
2022-07-28T00:37:43Z
Xinying Song
<ul></ul><p>We can reproduce this in a test environment by modifying mgr handle_command interface, do a sleep when handle `pg dump`.</p>
mgr - Bug #56675: mgr messenger policy not initialized properly
https://tracker.ceph.com/issues/56675?journal_id=221916
2022-07-30T09:53:08Z
Xinying Song
<ul></ul><p>Oops, the root cause for throttler to be full is not because of missing fast-dispatch in mgr, it's just because message process speed is too slow. <br />This patch is trying to fix the problem that a client message will fill mgr_mon_meesager throttle fully.</p>
mgr - Bug #56675: mgr messenger policy not initialized properly
https://tracker.ceph.com/issues/56675?journal_id=225286
2022-09-08T18:09:35Z
Neha Ojha
nojha@redhat.com
<ul><li><strong>Status</strong> changed from <i>In Progress</i> to <i>Fix Under Review</i></li></ul>