Ceph : Issues
https://tracker.ceph.com/ | 2024-03-26T02:08:51Z | Ceph
rgw - Bug #65131 (New): RGW Put Ops from perf counter is invalid for objects > 16MB
https://tracker.ceph.com/issues/65131 | 2024-03-26T02:08:51Z | Paul Cuzner
<p>The rgw daemon exposes a perf counter, "rgw.puts", representing the number of PUT ops performed by the gateway.</p>
<p>On small objects the value is fine, but beyond 16MB the counter appears to increment per 16MB of data rather than per object. For example, a workload of 64MB PUTs reports 4x the op count it should when cross-checked with the client.<br />(GET counters are not affected by this issue.)</p>
<p>To reproduce, use warp with something like<br /><pre>
warp put --warp-client warp-1 --host rgw-1 --access-key $ACCESS_KEY --secret-key $SECRET_KEY --bucket $bucket --obj.size 64MB --concurrent 1 --duration 1m
</pre></p>
<p>And grab the counter stats via the admin socket, e.g.<br /><pre>
#!/usr/bin/bash
# sample the rgw "put" counter from the admin socket every 5s as "epoch,count"
write_stats () {
  put_count=$(ceph daemon /var/run/ceph/ceph-client.rgw.group2.storage-13-09008.ujvjki.7.94246064814240.asok perf dump | jq ".rgw.put")
  now=$(date '+%s')
  echo "${now},${put_count}"
}

while true; do
  write_stats
  sleep 5
done
</pre></p>
<p>Here are my results:</p>
<p>From the rgw script<br /><pre>
1711416974,2108690
1711416980,2108696
1711416985,2108727
1711416990,2108759
1711416995,2108790
1711417000,2108823
1711417005,2108858
1711417010,2108891
1711417015,2108923
1711417021,2108954
1711417026,2108985
1711417031,2109014
1711417036,2109044
1711417041,2109062
1711417046,2109062
</pre><br />The above shows a delta of 31-32 every 5s, which implies a PUT rate of around 6 ops/sec</p>
<p>However, the results from warp show;<br /><pre>
started workload, 1 client(s) with 64MB objects at Tue Mar 26 01:36:14 UTC 2024
warp: Benchmark data written to "put-experiment/ec8-2/client_count_1/clients_1_64MB_PUT.csv.zst"
----------------------------------------
Operation: PUT. Concurrency: 1
* Average: 94.18 MiB/s, 1.54 obj/s
Throughput, split into 58 x 1s:
* Fastest: 109.1MiB/s, 1.79 obj/s
* 50% Median: 94.4MiB/s, 1.55 obj/s
* Slowest: 84.2MiB/s, 1.38 obj/s
warp: Cleanup done.
workload completed, 1 client(s) with 64MB objects at Tue Mar 26 01:37:20 UTC 2024
</pre></p>
<p>If you account for the rgw counter incrementing in 16MB units rather than per 64MB object and divide by 4, the RGW rate becomes ~1.5 ops/sec, which matches the value reported by warp.</p>
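<p>As a quick cross-check, here is a minimal sketch (the socket path, sample window and object size are parameters you supply; the helper script itself is hypothetical) that derives ops/sec from two samples of the counter and applies the suspected divide-by-(objsize/16MB) correction:<br /><pre>
#!/usr/bin/bash
# hypothetical helper: ./put_rate.sh /var/run/ceph/<rgw>.asok 64
ASOK=$1                # rgw admin socket path
OBJ_MB=${2:-64}        # object size used by the workload, in MB
p1=$(ceph daemon "$ASOK" perf dump | jq ".rgw.put"); t1=$(date +%s)
sleep 30
p2=$(ceph daemon "$ASOK" perf dump | jq ".rgw.put"); t2=$(date +%s)
raw=$(echo "scale=2; ($p2 - $p1) / ($t2 - $t1)" | bc)
adj=$(echo "scale=2; $raw / ($OBJ_MB / 16)" | bc)
echo "raw: ${raw} ops/s  adjusted: ${adj} ops/s"
</pre></p>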
<p>This affects any monitoring or grafana graphs that show PUT ops, and any that show PUT latency, since latency calculations may rely on the count of ops processed in an interval.</p>
nvme-of - Feature #64578 (New): Add a top tool to the nvmeof CLI to support troubleshooting
https://tracker.ceph.com/issues/64578 | 2024-02-27T03:35:22Z | Paul Cuzner
<p>By adding a top subcommand, the admin should be able to understand the performance of the gateway, from reactor CPU usage to overall IOPS down to individual namespaces. Each namespace should be shown with iostat-like stats (r/s, wMB/s, r-await, r-reqsz, etc.).</p>
<p>The tool should ideally provide a batch mode for capturing output to a file, and a console mode for regular use like linux 'top'.</p>
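<p>Purely as an illustration of the request (the subcommand and flags below are hypothetical and do not exist in the current CLI), usage could look something like:<br /><pre>
# hypothetical interactive view, refreshed like linux 'top'
nvmeof-cli top
# hypothetical batch mode, capturing a sample every 5s to a file
nvmeof-cli top --batch --interval 5 --output /tmp/gw-stats.csv
</pre></p>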
Ceph - Bug #64456 (New): Missing entries for hardware alerts from the MIB file
https://tracker.ceph.com/issues/64456 | 2024-02-15T22:49:15Z | Paul Cuzner
<p>While working on adding nvmeof alerts with an SNMP trap, I noticed that although the hardware alerts reference an OID, they are missing from the CEPH-MIB file.</p>
<p>When you run validate_rules.py manually (and have snmptranslate installed - from net-snmp-utils), you see the following;</p>
<p>Problem Report</p>
<pre><code>
Group     Severity  Alert Name              Problem Description
-----     --------  ----------              -------------------
hardware  Error     HardwareStorageError    rule defines an OID 1.3.6.1.4.1.50495.1.2.1.13.1 that is missing from the MIB file (CEPH-MIB.txt)
hardware  Error     HardwareMemoryError     rule defines an OID 1.3.6.1.4.1.50495.1.2.1.13.2 that is missing from the MIB file (CEPH-MIB.txt)
hardware  Error     HardwareProcessorError  rule defines an OID 1.3.6.1.4.1.50495.1.2.1.13.3 that is missing from the MIB file (CEPH-MIB.txt)
hardware  Error     HardwareNetworkError    rule defines an OID 1.3.6.1.4.1.50495.1.2.1.13.4 that is missing from the MIB file (CEPH-MIB.txt)
hardware  Error     HardwarePowerError      rule defines an OID 1.3.6.1.4.1.50495.1.2.1.13.5 that is missing from the MIB file (CEPH-MIB.txt)
hardware  Error     HardwareFanError        rule defines an OID 1.3.6.1.4.1.50495.1.2.1.13.6 that is missing from the MIB file (CEPH-MIB.txt)
</code></pre>
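<p>For reference, each missing OID can also be checked directly with snmptranslate; a minimal sketch, assuming the MIB lives in monitoring/snmp/ of the ceph source tree:<br /><pre>
# resolves to a name only once the corresponding entry exists in CEPH-MIB.txt
snmptranslate -M +./monitoring/snmp -m +CEPH-MIB .1.3.6.1.4.1.50495.1.2.1.13.1
</pre></p>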
<p>No problems detected in the unit tests file.</p>
Ceph - Feature #64335 (New): Add alerts to ceph monitoring stack for the nvmeof gateways
https://tracker.ceph.com/issues/64335 | 2024-02-06T23:03:09Z | Paul Cuzner
<p>Add alerts to ceph-mixins, for at least</p>
<p>Subsystem namespace count near limit<br />Subsystem namespaces exhausted<br />Gateway CPU usage is high<br />Average read I/O latency is over 10ms<br />Average write I/O latency is over 10ms<br />Subsystem host security is disabled<br />Subsystem is reaching its max controller ID<br />Gateway approaching maximum number of supported subsystems<br />Gateway has reached maximum number of supported subsystems</p>
<p>All of the above are severity warning events and do not require SNMP integration.</p>
Orchestrator - Feature #64334 (Duplicate): The nvmeof gateway has an embedded prometheus exporter...
https://tracker.ceph.com/issues/64334 | 2024-02-06T22:45:41Z | Paul Cuzner
<p>Starting with version 1.0.0 of the gateway, an exporter is available on port 10008 of the gateway. This endpoint should be scraped for each nvmeof gateway in the cluster.</p>
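<p>A quick way to confirm an endpoint is responding (the gateway hostname is a placeholder, and plain HTTP is assumed; the exporter may be configured for TLS):<br /><pre>
# dump the first few metrics from the gateway's built-in exporter
curl -s http://nvmeof-gw-1:10008/metrics | head -20
</pre></p>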
mgr - Bug #64119 (New): During OSD recovery, performance stats reported by mgr/prometheus are bogus
https://tracker.ceph.com/issues/64119 | 2024-01-22T21:19:12Z | Paul Cuzner
<p>During an OSD recovery period the pool stats can show IOPS and throughput numbers which do not reflect the state of the system.</p>
<p>For example, I've seen IOPS at 175,592,917 and throughput at 9.39TiB/s!</p>
<p>mgr/prometheus is calling the mgr df interface for these stats, which appears to use a different internal function (pg_map.dump_pool_stats_full) than the mgr osd_pool_stats call, which uses pg_map.dump_pool_stats_and_io_rate.</p>
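<p>A simple way to see the discrepancy while recovery is running is to compare the scraped pool counters against ceph osd pool stats (the mgr hostname is a placeholder; 9283 is the default mgr/prometheus port):<br /><pre>
# pool read/write counters as exported to prometheus (df path)
curl -s http://ceph-mgr-1:9283/metrics | grep -E '^ceph_pool_(rd|wr)'
# per-pool client I/O rates from the osd_pool_stats path
ceph osd pool stats
</pre></p>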
<p>Whilst this is not a major issue, it does pollute the monitoring data, making the return to a normal I/O rate difficult to see in the dashboard and grafana.</p>
Orchestrator - Bug #64020 (Resolved): cephadm is not accounting for the memory required nvme gate...
https://tracker.ceph.com/issues/64020 | 2024-01-14T21:47:46Z | Paul Cuzner
<p>When the osd_memory_target_autotune is TRUE, the Memory Autotuner class does not account for any nvmf memory requirement when defining the osd memory limits.</p>
<p>I think this should be 8GiB, to account for the largest supported gateway.</p>
Orchestrator - Bug #63865 (Resolved): ceph orch host ls --detail reports the incorrect CPU thread...
https://tracker.ceph.com/issues/63865 | 2023-12-19T22:32:29Z | Paul Cuzner
<p>For example, this is the output from some physical machines that are 16c/32t and 32c/64t:<br /><pre>
ceph orch host ls --detail
HOST ADDR LABELS STATUS VENDOR/MODEL CPU RAM HDD SSD NIC
index-13-09018 146.118.59.141 Dell Inc. PowerEdge (PowerEdge R6515) 16C/512T 63 GiB 1/240.0GB 16/51.2TB 5
index-14-09038 146.118.58.246 Dell Inc. PowerEdge (PowerEdge R6515) 16C/512T 63 GiB 1/240.0GB 16/51.2TB 5
index-15-09058 146.118.59.223 Dell Inc. PowerEdge (PowerEdge R6515) 16C/512T 63 GiB 1/240.0GB 16/51.2TB 5
index-16-09078 146.118.59.12 Dell Inc. PowerEdge (PowerEdge R6515) 16C/512T 63 GiB 1/240.0GB 16/51.2TB 5
storage-13-09002 10.242.8.219 _admin,mgr Dell Inc. PowerEdge (PowerEdge R6515) 32C/2048T 252 GiB 60/1.3PB 9/13.3TB 5
storage-13-09004 146.118.58.247 Dell Inc. PowerEdge (PowerEdge R6515) 32C/2048T 252 GiB 60/1.3PB 9/13.3TB 5
</pre></p>
<p>This looks like a bug here:<br /><a class="external" href="https://github.com/ceph/ceph/blob/c737def7988e17b177746634f58716b8dc5fb6e6/src/pybind/mgr/orchestrator/module.py#L99">https://github.com/ceph/ceph/blob/c737def7988e17b177746634f58716b8dc5fb6e6/src/pybind/mgr/orchestrator/module.py#L99</a></p>
Orchestrator - Feature #63864 (Resolved): When listing devices it would be helpful to have a summ...
https://tracker.ceph.com/issues/63864 | 2023-12-19T21:49:40Z | Paul Cuzner
<p>Currently (18.2.0), when you issue a ceph orch device ls <host> you get the devices, but on hosts containing a large device count the output could benefit from a summary footer to confirm quickly whether the device count is as expected. Maybe this could be an additional switch on the command?</p>
<p>e.g.<br /><pre>
ceph orch device ls storage-13-09002 --totals
HOST PATH TYPE DEVICE ID SIZE AVAILABLE REFRESHED REJECT REASONS
storage-13-09002 /dev/nvme0n1 ssd Dell_Ent_NVMe_PM1735a_MU_1.6TB_S6UVNE0TA02403 1490G No 9s ago Has a FileSystem, LVM detected
storage-13-09002 /dev/nvme1n1 ssd Dell_Ent_NVMe_PM1735a_MU_1.6TB_S6UVNE0TA02408 1490G No 9s ago Has a FileSystem, LVM detected
storage-13-09002 /dev/nvme2n1 ssd Dell_Ent_NVMe_PM1735a_MU_1.6TB_S6UVNE0TA02405 1490G No 9s ago Has a FileSystem, LVM detected
storage-13-09002 /dev/nvme3n1 ssd Dell_Ent_NVMe_PM1735a_MU_1.6TB_S6UVNE0TA02406 1490G No 9s ago Has a FileSystem, LVM detected
storage-13-09002 /dev/sda ssd PERC_H740P_Mini_6f4ee080543e2a002be90d9d1f352512 446G No 9s ago Has GPT headers, Has partitions
0 HDD, 5 SSD, 0 devices free
</pre></p>
Orchestrator - Bug #63863 (New): When cephadm hits its timeout, the issue is not flagged as a he...
https://tracker.ceph.com/issues/63863 | 2023-12-19T21:39:11Z | Paul Cuzner
<p>I noticed that when creating OSDs on a node with 60 HDDs, the cephadm timeout was hit (I was watching the cephadm log).</p>
<p>The error is not escalated to the admin as a healthcheck, so it can go unnoticed, which is not helpful when creating large ceph clusters!</p>
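<p>As a stopgap when building nodes with very large device counts, the cephadm command timeout can be raised (assuming the option name is unchanged in the release in use):<br /><pre>
# allow long-running ceph-volume batch calls to complete (value in seconds)
ceph config set mgr mgr/cephadm/default_cephadm_command_timeout 3600
</pre></p>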
<p>Also note that this situation occurred <b>after</b> all the lvm work had been done, so enforcing the timeout left the capacity consumed but without any systemd configuration to actually use the OSDs!</p>
ceph-volume - Bug #63862 (New): Used mpath devices are not returned in the inventory
https://tracker.ceph.com/issues/63862 | 2023-12-19T21:31:59Z | Paul Cuzner
<p>I am expecting orch device ls to show all devices, but mpath devices that have been used for osds are not listed.</p>
<p>ceph version 18.2.0-1252-g6a0590bd</p>
<p>for example;<br /><pre>
root@storage-13-09002:/var/lib/ceph/b46162de-9d48-11ee-8c88-2fbb19f34eb8/home# ceph orch device ls storage-16-09072
HOST PATH TYPE DEVICE ID SIZE AVAILABLE REFRESHED REJECT REASONS
storage-16-09072 /dev/nvme0n1 ssd Dell_Ent_NVMe_PM1735a_MU_1.6TB_S6UVNE0TA01903 1490G No 15s ago LVM detected, locked
storage-16-09072 /dev/nvme1n1 ssd Dell_Ent_NVMe_PM1735a_MU_1.6TB_S6UVNE0TA01915 1490G No 15s ago LVM detected, locked
storage-16-09072 /dev/nvme2n1 ssd Dell_Ent_NVMe_PM1735a_MU_1.6TB_S6UVNE0TA01912 1490G No 15s ago LVM detected, locked
storage-16-09072 /dev/nvme3n1 ssd Dell_Ent_NVMe_PM1735a_MU_1.6TB_S6UVNE0TA01916 1490G No 15s ago LVM detected, locked
storage-16-09072 /dev/sda ssd PERC_H740P_Mini_6f4ee080543d4c002be90901fca09f22 446G No 15s ago Has GPT headers, Has partitions, locked
</pre></p>
<p>which shows the NVMe devices as consumed, but the HDDs are missing. Looking at lsblk on the node shows:<br /><pre>
lsblk -o name,size,type
NAME SIZE TYPE
loop0 63.5M loop
loop1 91.9M loop
loop2 40.9M loop
loop3 63.9M loop
sda 446.6G disk
├─sda1 488M part
└─sda2 446.2G part
sdb 20T disk
└─mpatha 20T mpath
└─ceph--7c7eb905--1bc0--49b2--876d--10b30ac49688-osd--block--e36d08f7--5d8e--4516--a381--84a2774421f8 20T lvm
sdc 20T disk
└─mpathb 20T mpath
└─ceph--0314ba65--7aec--42dc--b528--ff5330a903ef-osd--block--7026a506--43af--447b--9bd7--2b9a26910230 20T lvm
sdd 20T disk
└─mpathm 20T mpath
└─ceph--a8f56ffa--3710--41a7--99d8--929be5ed19aa-osd--block--c3c5062f--7c41--4b07--a327--c2441abf5774 20T lvm
sde 20T disk
└─mpathx 20T mpath
└─ceph--87ab2527--5cea--49b5--9f3b--7e289352d3a5-osd--block--f9e07259--a77d--48ca--b0fb--504707968da1 20T lvm
</pre></p>
<p>ceph-volume on the host shows<br /><pre>
root@storage-16-09072:/# ceph-volume inventory
Device Path Size Device nodes rotates available Model name
/dev/nvme0n1 1.46 TB nvme0n1 False False Dell Ent NVMe PM1735a MU 1.6TB
/dev/nvme1n1 1.46 TB nvme1n1 False False Dell Ent NVMe PM1735a MU 1.6TB
/dev/nvme2n1 1.46 TB nvme2n1 False False Dell Ent NVMe PM1735a MU 1.6TB
/dev/nvme3n1 1.46 TB nvme3n1 False False Dell Ent NVMe PM1735a MU 1.6TB
/dev/sda 446.62 GB sda False False PERC H740P Mini
</pre></p>
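<p>A quick way to show the mismatch on the host, as a minimal sketch (assumes jq is installed on the host):<br /><pre>
# mpath devices known to the kernel
lsblk -l -o NAME,TYPE --noheadings | awk '$2=="mpath"' | wc -l
# devices reported back by ceph-volume
cephadm shell -- ceph-volume inventory --format json | jq 'length'
</pre></p>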
<p>So it would appear that once the mpath devices have been used, they are no longer returned to the mgr and consequently are not visible in the ceph orch device ls output.</p>
ceph-volume - Bug #63851 (New): osd creation failure on external enclosure
https://tracker.ceph.com/issues/63851 | 2023-12-19T03:00:06Z | Paul Cuzner
<p>Testing with 18.2.0-1252-g6a0590bd on servers with an external drive enclosure is hitting a deployment error, preventing osds from being created on several nodes.</p>
<p>This issue has been seen on 2 of the 26 servers - the same osd spec worked fine for the other machines, so I suspect this is environmental but need some help understanding the issue.</p>
<p>I've attached a mgr log that shows the errors thrown during deployment.</p>
Orchestrator - Feature #63224 (New): [RFE] Add an alert for swap space usage
https://tracker.ceph.com/issues/63224 | 2023-10-17T07:37:50Z | Paul Cuzner
<p>Ideally swap should not be active on a ceph node, but if swap is in use it has the potential to affect performance.</p>
<p>This RFE defines a requirement to add an alert to the prometheus rules based on the node_exporter swap metrics (node_memory_SwapFree_bytes and node_memory_SwapTotal_bytes).</p>
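<p>A minimal sketch of the kind of expression such an alert could use, checked here against the Prometheus HTTP API (the prometheus hostname is a placeholder and the 25% threshold is only an example):<br /><pre>
# fires for any node using more than 25% of its swap space
curl -s http://prometheus:9090/api/v1/query \
  --data-urlencode 'query=(1 - node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes) > 0.25'
</pre></p>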
CephFS - Bug #62674 (Duplicate): cephfs snapshot remains visible in nfs export after deletion and...
https://tracker.ceph.com/issues/62674 | 2023-08-31T23:01:31Z | Paul Cuzner
<p>When a snapshot is taken of the subvolume, the .snap directory shows the snapshot when viewed from the NFS mount and the cephfs mount.</p>
<p>However, when the snapshot is deleted the cephfs mount correctly shows an empty .snap directory, but the NFS mount still shows the snapshot entry and lists its directories and files. You can even copy a file from the deleted snapshot to the main filesystem!</p>
<p>e.g.<br /><pre>
[root@nfs-client-02 ~]# ls /mnt/cephfs/.snap/_snapit_1099511628282/
172.18.199.3.workload.0.0 smalldfile smallfile specstorage testfile
[root@nfs-client-02 ~]# ls /mnt/nfs/.snap/_snapit_1099511628282/
172.18.199.3.workload.0.0 smalldfile smallfile specstorage testfile
[root@nfs-client-02 ~]# # snapshot has been deleted
[root@nfs-client-02 ~]# ls /mnt/cephfs/.snap
[root@nfs-client-02 ~]# ls /mnt/nfs/.snap/_snapit_1099511628282/
172.18.199.3.workload.0.0 smalldfile smallfile specstorage testfile <--- still shows a snapshot!
[root@nfs-client-02 ~]# cp /mnt/nfs/.snap/_snapit_1099511628282/testfile ~/. <- copying a file from a snapshot that has been deleted worked!
[root@nfs-client-02 ~]# cat testfile
hello from the test file
</pre></p>
<p>df on the client shows the following<br /><pre>
172.16.36.61:/volumes/_nogroup/nfs-client-02/c53381bb-c6e8-4ac2-89bf-c07b042ab56b 100G 5.0G 95G 5% /mnt/cephfs
172.18.200.10:/nfs-client-02 100G 5.0G 95G 5% /mnt/nfs
172.18.200.10:/nfs-client-02/.snap 100G 5.0G 95G 5% /mnt/nfs/.snap
172.18.200.10:/nfs-client-02/.snap/_snapit_1099511628282 100G 5.0G 95G 5% /mnt/nfs/.snap/_snapit_1099511628282
</pre></p>
<p>There appears to be a sync issue when there are multiple snapshots too.</p>
<p>When I create several snapshots on cephfs I see this<br /><pre>
[root@nfs-client-02 ~]# ls /mnt/cephfs/.snap
_snapit-again_1099511628282 _snapit-again-2_1099511628282
</pre></p>
<p>But when I view the nfs mount which is the same subvolume<br /><pre>
[root@nfs-client-02 ~]# ls /mnt/nfs/.snap <-- _snapit-again-2 is missing!
_snapit-again_1099511628282
</pre></p>
<p>I'm not sure where the issue lies (Ganesha or cephfs), so I'm raising it against cephfs.</p>
CephFS - Bug #62673 (New): cephfs subvolume resize does not accept 'unit'
https://tracker.ceph.com/issues/62673 | 2023-08-31T22:26:12Z | Paul Cuzner
<p>Specifying the quota or resize for a subvolume requires the value in bytes. This value should be accepted as <num><unit> and, if no unit is given, default to bytes.</p>
<p>It looks like the resize value is treated as an int currently<br /><a class="external" href="https://github.com/ceph/ceph/blob/9d7c18257836dc888b4a300f3b5af9f080910986/src/pybind/mgr/volumes/fs/operations/versions/subvolume_base.py#L282">https://github.com/ceph/ceph/blob/9d7c18257836dc888b4a300f3b5af9f080910986/src/pybind/mgr/volumes/fs/operations/versions/subvolume_base.py#L282</a></p>
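<p>Until a unit suffix is accepted, the new size has to be pre-computed in bytes, e.g. (volume and subvolume names are placeholders):<br /><pre>
# resize the subvolume to 20GiB by converting the size to bytes in the shell
ceph fs subvolume resize cephfs subvol-01 $((20 * 1024 * 1024 * 1024))
</pre></p>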