Ceph: Issues — https://tracker.ceph.com/ — 2023-10-23T19:39:33Z
mgr - Cleanup #63294 (New): mgr: enable per-subinterpreter GIL (Python >= 3.12) — https://tracker.ceph.com/issues/63294 — 2023-10-23T19:39:33Z — Ernesto Puerta
<p>Starting Python 3.12, CPython supports instantiating <a href="https://peps.python.org/pep-0684/#pyinterpreterconfig-own-gil" class="external">subinterpreters with their own GIL</a>.</p>
<p>This would solve performance issues and allow ceph-mgr modules to fully utilize multiple cores (so far, ceph-mgr performance has been limited to a single core).</p>
<p>Additionally, this improves mgr module isolation, which could (perhaps) allow restarting only specific mgr modules, instead of respawning the whole ceph-mgr every time a module is enabled or disabled.</p>
<p>In order for this to work, the <code>Py_NewInterpreter()</code> call has to be replaced with:<br /><pre><code class="cpp syntaxhl"><span class="CodeRay">PyInterpreterConfig config = {
.check_multi_interp_extensions = <span class="integer">1</span>,
.gil = PyInterpreterConfig_OWN_GIL,
};
PyThreadState *tstate = <span class="predefined-constant">NULL</span>;
PyStatus status = Py_NewInterpreterFromConfig(&tstate, &config);
</span></code></pre></p>
Risks:
<ul>
<li>After this change, issues might start appearing with data sharing across sub-interpreters, e.g. the ceph-mgr <code>remote()</code> method. This could already be an issue without a per-subinterpreter GIL, but with independent GILs such issues would be more likely to surface.</li>
<li>No major LTS distro release ships Python 3.12 yet (it was released in October 2023). Whenever CentOS/Ubuntu/Debian start adopting it, we'll need to provide updated packages for the Python dependencies (this could be partly solved by <a href="https://github.com/ceph/ceph/pull/47501" class="external">embedding Python deps</a>).</li>
</ul>
Related issues:
<ul>
<li><a href="https://tracker.ceph.com/issues/38407" class="external">Funny issues with python sub-interpreters</a></li>
<li><a href="https://tracker.ceph.com/issues/45574" class="external">subinterpreters: ceph/mgr/rook RuntimeError on import of RookOrchestrator - ceph cluster does not start</a></li>
</ul>
References:
<ul>
<li><a class="external" href="https://docs.python.org/3.12/whatsnew/3.12.html#pep-684-a-per-interpreter-gil">https://docs.python.org/3.12/whatsnew/3.12.html#pep-684-a-per-interpreter-gil</a></li>
<li><a class="external" href="https://peps.python.org/pep-0684/">https://peps.python.org/pep-0684/</a></li>
<li><a class="external" href="https://github.com/python/cpython/pull/104210">https://github.com/python/cpython/pull/104210</a></li>
</ul>
Dashboard - Cleanup #50963 (Resolved): mgr/dashboard: stop polling when page is not visible — https://tracker.ceph.com/issues/50963 — 2021-05-24T18:47:23Z — Ernesto Puerta
<p>Modern browsers provide support to detect whether a page is visible to the user or not (<a class="external" href="https://www.w3.org/TR/page-visibility/">https://www.w3.org/TR/page-visibility/</a>).</p>
<p>We should rely on that to poll the back-end only while the browser tab is visible. That would heavily reduce the load generated by the front-end. This matters, for example, when a user has many Dashboard tabs open, which multiplies the number of HTTP requests to the back-end.</p>
Dashboard - Feature #48388 (In Progress): mgr,mgr/dashboard: implement multi-layered caching — https://tracker.ceph.com/issues/48388 — 2020-11-27T16:52:06Z — Ernesto Puerta
<a name="Summary"></a>
<h3 >Summary<a href="#Summary" class="wiki-anchor">¶</a></h3>
<p>In order to reduce the chances of mgr modules, and specifically the dashboard, compromising Ceph cluster performance through an increased frequency of API calls (unlike other mgr modules, Dashboard load is predominantly user-driven), a multi-layered caching approach should be put in place.</p>
<a name="Current-status"></a>
<h3 >Current status<a href="#Current-status" class="wiki-anchor">¶</a></h3>
Existing caching approaches and issues:
<ul>
<li>Ceph-mgr itself caches a lot of API calls (get_module_option, get, get_server, get_metadata, ...), so not every request to the <a href="https://docs.ceph.com/en/latest/mgr/modules/" class="external">ceph-mgr API</a> hits the Ceph cluster. However, <code>send_command()</code> is not cached and might have a performance impact.</li>
<li>Additionally, one bottleneck in ceph-mgr is the <code>PyFormatter</code>, the class responsible for deserializing C++ binary structs into Python objects. For big objects (osd_map) this deserialization is not negligible, so it might be worthwhile to cache the resulting deserialized Python object, or to explore an incremental approach that doesn't involve processing the same data over and over.</li>
<li>Dashboard back-end: <a href="https://github.com/ceph/ceph/pull/20103/commits/349c1ff3d278cacc64e88f17c0edbbbc154ac4d0" class="external">ViewCache</a> decouples REST controller requests from ceph-mgr API calls and allows for asynchronous fetching of data.</li>
</ul>
<p>The following picture shows the existing approaches (ViewCache) and the ones to explore, as well as other potential points (PyFormatter and the non-cached <code>send_command</code>):</p>
<p><img src="https://tracker.ceph.com/attachments/download/5430/Ceph%20Dashboard%20Caching.png" alt="" /></p>
<a name="Proposal"></a>
<h3 >Proposal<a href="#Proposal" class="wiki-anchor">¶</a></h3>
Layers:
<ul>
<li><strong>Ceph-mgr API</strong>:
<ul>
<li>C++: this is optimal, as the cached data is shared across modules. However, it is less trivial to implement.</li>
<li>Python: <a href="https://cachetools.readthedocs.io/en/stable/#cachetools.func.ttl_cache" class="external"><code>cachetools</code></a>. This could be introduced at a per-module level (every module interacts with its own version of the cached ceph-mgr API methods) or shared (all modules consume the same cached versions of the ceph-mgr API methods, although this could bring issues with modules modifying the objects returned by the cached methods).</li>
</ul>
</li>
<li><strong>Dashboard back-end</strong>:
<ul>
<li>Python: <code>cachetools</code></li>
<li><a href="https://docs.cherrypy.org/en/3.2.6/refman/lib/caching.html" class="external">CherryPy Cache</a> (it also takes care of the HTTP caching)</li>
</ul>
</li>
<li><strong>Dashboard front-end</strong>:
<ul>
<li>HTTP Cache (browser provided?)</li>
<li>Typescript/JS: <a href="https://www.npmjs.com/package/memoize-cache-decorator" class="external">memoize-cache-decorator</a> or <a href="https://www.npmjs.com/package/typescript-cacheable" class="external">ts-cacheable</a></li>
</ul></li>
</ul>
Pros:
<ul>
<li>Reduced load in ceph-mgr</li>
<li>Shorter response times</li>
</ul>
Cons:
<ul>
<li>Increased memory usage</li>
<li>Stale data (though TTL caches can mitigate this)</li>
<li>Data serialization issues</li>
<li>Leaks/ref counting issues</li>
</ul>
<a name="Implementation-details"></a>
<h3 >Implementation details<a href="#Implementation-details" class="wiki-anchor">¶</a></h3>
As caching is an optimization strategy and optimization needs to be benchmarked, the first step would be to implement a way to measure the effectiveness of caching:
<ul>
<li>From the backend/inner side of the system, that could be the number of calls that actually hit the ceph-mgr API (log message, new mgr CLI command, ...)</li>
<li>From the user-facing side of the system, that could be the latency of the call (cold, when the cache is not yet populated, vs. warm, when the cache is populated and hit by the request). Here there are 2 possible user-facing points: the direct RESTful interface (via curl) and the WebUI (via Angular/JS). The second is more realistic but less fine-grained, so the first would be preferred for quick benchmarking.</li>
</ul>
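<p>The back-end measurement point can be sketched as a cache wrapper that counts hits and misses, so caching effectiveness could later be reported via a log message or a new mgr CLI command. This is a hypothetical sketch; <code>get_server</code> here is a stand-in, not the real ceph-mgr API method:</p>

```python
import functools

def instrumented_cache(fn):
    """Hedged sketch: memoize fn and count hits/misses so cache
    effectiveness can be measured."""
    cache, stats = {}, {"hits": 0, "misses": 0}

    @functools.wraps(fn)
    def wrapper(*args):
        if args in cache:
            stats["hits"] += 1
        else:
            stats["misses"] += 1
            cache[args] = fn(*args)
        return cache[args]

    wrapper.stats = stats  # exposed for reporting/benchmarking
    return wrapper

@instrumented_cache
def get_server(hostname):  # hypothetical stand-in for a ceph-mgr API call
    return {"hostname": hostname}

get_server("node1"); get_server("node1"); get_server("node2")
print(get_server.stats)  # {'hits': 1, 'misses': 2}
```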
<p>Additionally, it would be interesting to shortlist the quick wins: those ceph-mgr API calls and dashboard RESTful API endpoints that would benefit most from this caching scheme, i.e. those performing more calls to the ceph-mgr API. Keep in mind that the number of calls might depend on the scale of the cluster (e.g.: <code>GET /rbd</code> might require one or many calls per RBD image created, and there are some users with thousands of RBD images).</p>
Dashboard - Tasks #40766 (New): mgr/dashboard: Perform scalability tests with large amounts of RG... — https://tracker.ceph.com/issues/40766 — 2019-07-12T15:30:35Z — Lenz Grimmer
<p>Evaluate the performance of the dashboard with increasing numbers of RGW users to identify potential bottlenecks and performance improvements.</p>
Dashboard - Bug #40753 (Resolved): mgr/dashboard: Perform scalability tests with large amounts of... — https://tracker.ceph.com/issues/40753 — 2019-07-12T11:49:28Z — Lenz Grimmer
<p>Evaluate the performance of the dashboard with increasing numbers of RGW buckets to identify potential bottlenecks and performance improvements.</p>
Dashboard - Tasks #40752 (New): mgr/dashboard: Perform scalability tests with large amounts of RBDs — https://tracker.ceph.com/issues/40752 — 2019-07-12T11:47:14Z — Lenz Grimmer
<p>As part of testing the dashboard in larger production environments, the performance of RBD management and of the dashboard overall should be evaluated in environments with several hundred or even thousands of RBDs.</p>
Dashboard - Bug #39996 (Resolved): mgr/dashboard: Angular is creating multiple instances of the s... — https://tracker.ceph.com/issues/39996 — 2019-05-22T09:03:29Z — Tiago Melo
<p>Just noticed that when we use the TaskListService, a new instance of SummaryService is created.<br />This results in multiple simultaneous requests for the summary.</p>
Dashboard - Feature #39944 (Resolved): mgr/dashboard: Reduce the number of renders on the tables — https://tracker.ceph.com/issues/39944 — 2019-05-15T17:07:38Z — Tiago Melo
<p>Currently, each time we hover over a table row or the data is refreshed, each cell/row is rendered too many times.<br />This should be improved.</p>
Dashboard - Bug #39667 (New): mgr/dashboard: Optimize RBD list request on iSCSI form — https://tracker.ceph.com/issues/39667 — 2019-05-10T09:22:08Z — Ricardo Marques <rimarques@suse.com>
<p>The iSCSI form takes too long to open due to a request for the RBD image list, which is then used to populate the "Images" dropdown:</p>
<p><img src="https://tracker.ceph.com/attachments/download/4169/rbd-request-on-iscsi-form.png" alt="" /></p>
<p>This request is returning a lot of information that is not needed in this form (e.g. disk usage):</p>
<p><img src="https://tracker.ceph.com/attachments/download/4170/iscsi-rbd-response.png" alt="" /></p>
<p>We should only request the required RBD image fields in order to optimize the initial load of this form.</p>
Dashboard - Bug #39492 (Resolved): mgr/dashboard: iSCSI GET requests should not be logged — https://tracker.ceph.com/issues/39492 — 2019-04-26T10:08:55Z — Ricardo Marques <rimarques@suse.com>
<p>Each time the iSCSI targets table is refreshed, the following line is written to the log file:</p>
<p><code>2019-04-26 10:00:12.176 7f3dc04ad700 20 mgr[dashboard] iSCSI: Getting targetinfo: iqn.2001-07.com.ceph:1556272706112</code></p>
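<p>A minimal sketch of the intended behaviour: suppress read-only requests while still logging mutating ones. This is a hypothetical helper, not the actual <code>ceph-iscsi</code> REST client code:</p>

```python
import logging

MUTATING_METHODS = {"POST", "PUT", "DELETE"}

def log_request(logger, method, url):
    """Hedged sketch: only log mutating REST-client requests, so periodic
    GETs from table refreshes don't flood the log."""
    if method.upper() in MUTATING_METHODS:
        logger.info("iSCSI REST: %s %s", method, url)

# Demo: capture log records in a list to show what gets through.
captured = []

class _ListHandler(logging.Handler):
    def emit(self, record):
        captured.append(record.getMessage())

demo_logger = logging.getLogger("iscsi-demo")
demo_logger.setLevel(logging.INFO)
demo_logger.addHandler(_ListHandler())

log_request(demo_logger, "GET", "/api/targetinfo")      # suppressed
log_request(demo_logger, "DELETE", "/api/target/iqn1")  # logged
```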
<p>Only POST, PUT and DELETE requests should be logged by the <code>ceph-iscsi</code> REST client.</p>
Dashboard - Bug #39301 (Won't Fix): mgr/dashboard: Optimize RBD list by reducing the amount of da... — https://tracker.ceph.com/issues/39301 — 2019-04-15T16:25:18Z — Ricardo Marques <rimarques@suse.com>
<p>Some RBD image information may require a lot of processing, and we are currently calculating this information for every RBD image every "X" seconds, when the table automatically refreshes.</p>
<p>For optimization, the RBD "LIST" endpoint should only return the information needed to populate the table.</p>
<p>Additional information (like snapshot information, etc.) that is only displayed in the details view should be returned by a dedicated GET request for the selected image only.</p>
Dashboard - Bug #39140 (Fix Under Review): mgr/dashboard: decouple RBD image disk usage calculati... — https://tracker.ceph.com/issues/39140 — 2019-04-08T14:47:09Z — Ricardo Dias <rdias@suse.com>
<p>In the current ceph-dashboard RBD management implementation, for each image that has the <code>fast-diff</code> feature enabled, the image disk usage is calculated upon an image list request. The disk-usage calculation can be a time-consuming action, which makes the RBD image listing operation take a long time.</p>
<p>We should decouple the disk usage calculation from the listing operation by calculating the disk usage of each image in the background.</p>
Dashboard - Bug #36453 (New): mgr/dashboard: Some REST endpoints grow linearly with OSD count — https://tracker.ceph.com/issues/36453 — 2018-10-16T10:29:33Z — Ernesto Puerta
<p>Endpoints providing information on OSDs show linear size growth with OSD count.</p>
<ul>
<li><code>/health</code> grows 1-2 kB/OSD. It embeds all relevant Ceph maps (mgr_map, fs_map, osd_map, mon_map). Landing page is the main consumer of this endpoint, but it only needs osd_map to print "X total OSD (Y up, Z in)". <em>Solution</em>: calculate in a backend controller (<code>/summary</code>?) all the info needed for the Landing page.</li>
<li><code>/osd</code> grows 1-2 kB/OSD.</li>
</ul>
<p>That would mean payloads of around 1-2 MB every 5 seconds per dashboard instance for a 1000-OSD deployment. The resulting size can vary widely, as the payload is plain-text JSON with lots of variable-length strings and numbers.</p>
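<p>To illustrate why a delta JSON (PATCH-like) approach helps here, a hedged sketch that sends only the keys that changed between two polls. The payload shape is hypothetical; real OSD maps are nested and larger:</p>

```python
def shallow_delta(old, new):
    """Hedged sketch of a PATCH-like delta payload: instead of resending
    the full JSON on every refresh, send only changed keys plus a list
    of removed ones."""
    _missing = object()  # sentinel so absent keys always count as changed
    changed = {k: v for k, v in new.items() if old.get(k, _missing) != v}
    removed = [k for k in old if k not in new]
    return {"changed": changed, "removed": removed}

# In a large cluster, little changes between 5-second polls:
old = {"osd.0": {"up": 1, "in": 1}, "osd.1": {"up": 1, "in": 1}}
new = {"osd.0": {"up": 0, "in": 1}, "osd.1": {"up": 1, "in": 1}}
delta = shallow_delta(old, new)
print(delta)  # {'changed': {'osd.0': {'up': 0, 'in': 1}}, 'removed': []}
```

The client would apply the delta to its cached copy; the payload then scales with the number of changed OSDs rather than the total OSD count.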
The above might impact the following:
<ul>
<li><strong>Networking</strong>: especially wireless networks.
<ul>
<li><em>Solution</em>: enable compression on the wire <b>FIXED</b>: <a class="external" href="https://github.com/ceph/ceph/pull/24727">https://github.com/ceph/ceph/pull/24727</a></li>
<li><em>Solution</em>: delta JSON (PATCH-like).</li>
<li><em>Solution</em>: more compact data exchange formats (BSON, MessagePack).</li>
</ul>
</li>
<li><strong>Server-side</strong>: caching either Ceph-mgr results or endpoint payloads could improve performance.
<ul>
<li><em>Solution</em>: cache ceph-mgr responses.</li>
<li><em>Solution</em>: using HTTP cache control (single-user multiple-requests).</li>
<li><em>Solution</em>: cache REST payloads internally (multiple-user).</li>
</ul>
</li>
<li><strong>Client-side</strong>: user experience may be negatively affected by parsing and processing large chunks of JSON.
<ul>
<li><em>Solution</em>: lightweight data exchange formats (BSON, MessagePack).</li>
<li><em>Solution</em>: delta JSON (PATCH-like).</li>
<li><em>Solution</em>: more specialized REST Resources (instead of generalistic ones, like <code>/health</code>).</li>
<li><em>Solution</em>: REST API pagination support.</li>
<li><em>Solution</em>: REST API field selector/filtering support.</li>
</ul></li>
</ul>
Dashboard - Tasks #36451 (New): mgr/dashboard: Scalability testing — https://tracker.ceph.com/issues/36451 — 2018-10-16T09:23:08Z — Ernesto Puerta
<p>This issue is meant to track all scalability-related issues.</p>