Bug #57385
OSDs “slow ops” with multi-hour delay.
Description
Sorry, going to be kinda long, been working on this for a while, will try and be as to-the-point as possible.
Cluster Basics:
Clean Install (testing my process)
2x Hosts at the moment, 4x was planned with more drives later.
4x Intel Optane NVMe SSDs (16GBs each, with newest firmware) – Originally all “SSD”, later 2x “SSD” & 2x “Meta-SSD”
7x HDDs ranging from 3 to 10TBs (some SAS, some SATA).
10Gbit Networking
All pools other than the listed are tiny as there is no real data there yet.
Pools:
CephFS-Root (3X Replication HDD by OSD) – This is a stub (hardlinks?), and has no data in it.
CephFS-Meta (2X Replication Meta-SSD by OSD)
CephFS-Migration-Data (EC by OSD on HDD)
CephFS-Migration-Cache (2X Replication SSD by OSD) – writeback overlay for CephFS-Migration-Data:
quota_max_bytes = 5368709120
target_max_bytes = 4294967296
size = 2 (originally 3, more on this later)
min_size = 2
I set up an rsync to start loading data into CephFS-Migration-Data using:
setfattr -n ceph.dir.layout.pool -v CephFS-Migration-Data /CephFS/Migration/
All was working perfectly, the cache pool was taking in the writes and the OSDs were flushing it out at a good speed, averaging about 100MBps. This went on for maybe 8 hours (about 3TBs in), then I started seeing “slow io” alerts on two of the SSDs; these would never clear, causing the MDS IO to back up and effectively stop (400 slow ops, oldest one blocked for 31770 sec). None of the OSDs filled (quotas in place, well under limits).
Originally I had all four SSDs as just “SSD”, but I was able to remedy the blocked IOs by re-tagging a few OSDs as “ssd-meta”, then re-pointing the metadata pool at that class, and finally dropping it from a size of 3 down to 2.
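For anyone following along, the re-tagging went roughly like this (a sketch from memory: the OSD ids and the rule name here are stand-ins for my actual ones, and “ssd-meta” is the class name used above):

```shell
# Clear the old device class, then assign the new one (osd ids are examples)
ceph osd crush rm-device-class osd.0 osd.4
ceph osd crush set-device-class ssd-meta osd.0 osd.4

# Replicated CRUSH rule restricted to the new class (rule name is a stand-in)
ceph osd crush rule create-replicated BasicRep-ByOSD-MetaSSD default osd ssd-meta

# Point the metadata pool at it and drop the replication size
ceph osd pool set CephFS-Meta crush_rule BasicRep-ByOSD-MetaSSD
ceph osd pool set CephFS-Meta size 2 --yes-i-really-mean-it
ceph osd pool set CephFS-Meta min_size 2
```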
I thought that somehow there was a conflict or over-utilization, and re-started my copy (I had planned to dedicate disks to meta anyway).
Again, it worked for several hours and then the “slow op” started again. This time in one OSD for the cache, and one for an HDD spindle… I see no other errors in any of the logs.
I attempted to adjust the OSD labels again, but that didn’t seem to clear it, though I only made one half-baked attempt.
So I decided to just reboot the node, as this time both of the affected OSDs were on the same node. It went into some sort of crash loop at shutdown (after 15 minutes I hard powered off the box). See:
https://imgur.com/a/ghGJ6ZB
When it rebooted, the alarms had cleared, and I could again write to the file system.
I don’t think it is a hardware issue, as not all of the affected OSDs are on the same hosts (the first time), of the same type (ssd or hdd), or the same class of hardware (CPU, SATA and SAS), other than the Optane disks that I am using (re-ran Intel firmware updates just to be sure).
Am I missing something, or have I found some sort of bug? Maybe I am maxing something out in the kernel somehow???
Updated by Brian Woods over 1 year ago
Just want to note that I am holding off on further testing, as this is a reproducible issue.
I would like to take this to a production configuration at this point.
Updated by Brian Woods over 1 year ago
I have been looking through old bug reports and found this:
https://tracker.ceph.com/issues/37568
But I am not attempting to remove any snapshots. So probably not related, but the impact is similar.
I think my next step will be to re-run my tests with Pacific.
Updated by Brian Woods over 1 year ago
I think I have found the trigger...
If the cache pool hits its quota, there is a high probability that it will block new IOs at that moment (causing them to hang indefinitely, rather than just until space can be made by eviction). The issue seems to be aggravated by having multiple file copies going at once; testing with a single file copy does not consistently reproduce it.
Maybe this is a documented limitation that I missed, but this really feels like a nasty bug. :/ The whole point of the quota was to prevent the OSDs from filling. If hitting the quota causes an issue that is almost as bad...
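For reference, here is how I understand the knobs interacting (the max_bytes/target_max_bytes values mirror the ones listed in the description; the two ratio settings are ones I had NOT tuned, so the values below are illustrative, not what my cluster ran):

```shell
# Hard quota: once this is hit, new writes to the pool block outright
ceph osd pool set-quota CephFS-Migration-Cache max_bytes 5368709120     # 5GB

# Soft target: the tiering agent flushes/evicts toward this
ceph osd pool set CephFS-Migration-Cache target_max_bytes 4294967296    # 4GB

# Eviction starts at this fraction of target_max_bytes; lowering these
# (illustrative values) buys headroom before the hard quota is reached
ceph osd pool set CephFS-Migration-Cache cache_target_full_ratio 0.8
ceph osd pool set CephFS-Migration-Cache cache_target_dirty_ratio 0.4
```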
Updated by Brian Woods over 1 year ago
I waited a good day to hit send on that, just to be sure I had fixed the issue... 10 minutes after posting, it locked again, nowhere near the quota... AHHHH!!!
Updated by Brian Woods over 1 year ago
This SEEMS to only be an issue while operating CephFS volumes. Running a standard block device using the same write cache and EC data volume doesn't seem to be impacted. Continuing my tests...
I cannot reproduce this on my second cluster at work... It's all virtual and pure SSD... Running out of ideas.
Updated by Brian Woods over 1 year ago
Reverting to Pacific did not resolve the issue.
Ran IO to a raw block device all night without issue.
Rebuilt to Pacific this morning and it started happening within about 10 minutes with CephFS.
Found notes here: https://documentation.suse.com/ses/7/html/ses-all/bp-troubleshooting-status.html
# ceph daemon osd.0 dump_historic_ops
{ "size": 20, "duration": 600, "ops": [] }
And:
# ceph daemon osd.0 ops
{ "ops": [
  { "description": "osd_op(osd.9.45:8828 4.1es0 4:796ca7cc:::10000001978.00000000:head [copy-from ver 2417 in=58b] snapc 1=[] ondisk+write+ignore_overlay+enforce_snapc+known_if_redirected e116)",
    "initiated_at": "2022-09-12T21:24:38.838966+0000", "age": 1168.318204314, "duration": 1168.318219262,
    "type_data": { "flag_point": "started", "events": [
      { "event": "initiated", "time": "2022-09-12T21:24:38.838966+0000", "duration": 0 },
      { "event": "throttled", "time": "2022-09-12T21:24:38.838966+0000", "duration": 4294967295.9999995 },
      { "event": "header_read", "time": "2022-09-12T21:24:38.838966+0000", "duration": 1.8670000000000001e-06 },
      { "event": "all_read", "time": "2022-09-12T21:24:38.838968+0000", "duration": 1.73e-07 },
      { "event": "dispatched", "time": "2022-09-12T21:24:38.838968+0000", "duration": 5.0900000000000002e-07 },
      { "event": "queued_for_pg", "time": "2022-09-12T21:24:38.838969+0000", "duration": 7.6323999999999996e-05 },
      { "event": "reached_pg", "time": "2022-09-12T21:24:38.839045+0000", "duration": 5.0130000000000001e-06 },
      { "event": "started", "time": "2022-09-12T21:24:38.839050+0000", "duration": 8.3566999999999995e-05 } ] } },
  { "description": "osd_op(osd.9.45:8830 4.1es0 4:79728a87:::10000000e6f.00000000:head [copy-from ver 1455 in=58b] snapc 1=[] ondisk+write+ignore_overlay+enforce_snapc+known_if_redirected e116)",
    "initiated_at": "2022-09-12T21:24:38.839100+0000", "age": 1168.318070543, "duration": 1168.3181186510001,
    "type_data": { "flag_point": "started", "events": [
      { "event": "initiated", "time": "2022-09-12T21:24:38.839100+0000", "duration": 0 },
      { "event": "throttled", "time": "2022-09-12T21:24:38.839100+0000", "duration": 4294967295.9999995 },
      { "event": "header_read", "time": "2022-09-12T21:24:38.839100+0000", "duration": 1.6899999999999999e-06 },
      { "event": "all_read", "time": "2022-09-12T21:24:38.839101+0000", "duration": 1.37e-07 },
      { "event": "dispatched", "time": "2022-09-12T21:24:38.839102+0000", "duration": 7.5899999999999995e-07 },
      { "event": "queued_for_pg", "time": "2022-09-12T21:24:38.839102+0000", "duration": 4.7132000000000002e-05 },
      { "event": "reached_pg", "time": "2022-09-12T21:24:38.839149+0000", "duration": 5.1819999999999997e-06 },
      { "event": "started", "time": "2022-09-12T21:24:38.839155+0000", "duration": 5.4580000000000003e-05 } ] } },
  { "description": "osd_op(osd.9.45:8832 4.1es0 4:799e00d3:::1000000130e.00000000:head [copy-from ver 1887 in=58b] snapc 1=[] ondisk+write+ignore_overlay+enforce_snapc+known_if_redirected e116)",
    "initiated_at": "2022-09-12T21:24:38.839256+0000", "age": 1168.3179147369999, "duration": 1168.3179915830001,
    "type_data": { "flag_point": "started", "events": [
      { "event": "initiated", "time": "2022-09-12T21:24:38.839256+0000", "duration": 0 },
      { "event": "throttled", "time": "2022-09-12T21:24:38.839256+0000", "duration": 4294967295.9999995 },
      { "event": "header_read", "time": "2022-09-12T21:24:38.839255+0000", "duration": 1.64e-06 },
      { "event": "all_read", "time": "2022-09-12T21:24:38.839257+0000", "duration": 1.54e-07 },
      { "event": "dispatched", "time": "2022-09-12T21:24:38.839257+0000", "duration": 4.4299999999999998e-07 },
      { "event": "queued_for_pg", "time": "2022-09-12T21:24:38.839258+0000", "duration": 3.6322000000000002e-05 },
      { "event": "reached_pg", "time": "2022-09-12T21:24:38.839294+0000", "duration": 4.0450000000000002e-06 },
      { "event": "started", "time": "2022-09-12T21:24:38.839298+0000", "duration": 4.2228999999999998e-05 } ] } },
  { "description": "osd_op(osd.9.45:8834 4.1es0 4:79c1296e:::100000010ae.00000000:head [copy-from ver 1816 in=58b] snapc 1=[] ondisk+write+ignore_overlay+enforce_snapc+known_if_redirected e116)",
    "initiated_at": "2022-09-12T21:24:38.839346+0000", "age": 1168.317824408, "duration": 1168.3179286080001,
    "type_data": { "flag_point": "started", "events": [
      { "event": "initiated", "time": "2022-09-12T21:24:38.839346+0000", "duration": 0 },
      { "event": "throttled", "time": "2022-09-12T21:24:38.839346+0000", "duration": 4294967295.9999995 },
      { "event": "header_read", "time": "2022-09-12T21:24:38.839346+0000", "duration": 5.1279999999999999e-06 },
      { "event": "all_read", "time": "2022-09-12T21:24:38.839351+0000", "duration": 1.54e-07 },
      { "event": "dispatched", "time": "2022-09-12T21:24:38.839351+0000", "duration": 1.114e-06 },
      { "event": "queued_for_pg", "time": "2022-09-12T21:24:38.839352+0000", "duration": 3.3144e-05 },
      { "event": "reached_pg", "time": "2022-09-12T21:24:38.839385+0000", "duration": 3.9779999999999999e-06 },
      { "event": "started", "time": "2022-09-12T21:24:38.839389+0000", "duration": 4.3167999999999997e-05 } ] } }
], "num_ops": 4 }
IDK if that helps at all, but...
I have also found that just restarting the docker for the OSD clears the problem.
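In case anyone wants the exact workaround: any of the following bounce the stuck OSD daemon. The container name and fsid below are placeholders, not my actual values; use whichever matches your deployment.

```shell
# If managed by the cephadm orchestrator:
ceph orch daemon restart osd.9

# Or directly via systemd on the host (fsid is a placeholder):
systemctl restart ceph-<fsid>@osd.9.service

# Or, for a raw Docker deployment like mine, restart the OSD's container
# (container name is a placeholder):
docker restart <osd-9-container>
```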
Updated by Brian Woods over 1 year ago
From the second OSD in this state:
# ceph daemon osd.9 dump_historic_ops
{ "size": 20, "duration": 600, "ops": [
  { "description": "osd_repop(mds.0.5:2574 3.6 e116/88 3:654134d2:::mds0_openfiles.0:head v 116'116, mlcod=116'114)",
    "initiated_at": "2022-09-12T21:51:01.814339+0000", "age": 73.645018769999993, "duration": 0.001372341,
    "type_data": { "flag_point": "commit sent; apply or cleanup", "events": [
      { "event": "initiated", "time": "2022-09-12T21:51:01.814339+0000", "duration": 0 },
      { "event": "throttled", "time": "2022-09-12T21:51:01.814339+0000", "duration": 4294967295.9999981 },
      { "event": "header_read", "time": "2022-09-12T21:51:01.814337+0000", "duration": 2.7494000000000001e-05 },
      { "event": "all_read", "time": "2022-09-12T21:51:01.814364+0000", "duration": 2.6759999999999999e-06 },
      { "event": "dispatched", "time": "2022-09-12T21:51:01.814367+0000", "duration": 7.525e-06 },
      { "event": "queued_for_pg", "time": "2022-09-12T21:51:01.814375+0000", "duration": 4.7308999999999997e-05 },
      { "event": "reached_pg", "time": "2022-09-12T21:51:01.814422+0000", "duration": 1.9072e-05 },
      { "event": "started", "time": "2022-09-12T21:51:01.814441+0000", "duration": 0.0012391119999999999 },
      { "event": "commit_sent", "time": "2022-09-12T21:51:01.815680+0000", "duration": 3.1170999999999997e-05 },
      { "event": "done", "time": "2022-09-12T21:51:01.815711+0000", "duration": 0.001372341 } ] } } ] }
And:
# ceph daemon osd.9 ops
### See attached, 600K file. ###
Updated by Brian Woods over 1 year ago
Notice that the durations are insanely long, like there is an int that has rolled over or something.
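A quick sanity check on that hunch: the suspicious "throttled" durations (4294967295.9999995 and 4294967295.9999981) both sit just below 2^32 = 4294967296, which is exactly what a tiny negative time delta looks like after being wrapped into an unsigned 32-bit range. That also matches the dumps above, where the header_read timestamp lands a couple of microseconds BEFORE initiated_at. A minimal check, plain awk, no cluster access needed:

```shell
# Check whether each suspicious duration is within 1 second below 2^32,
# i.e. consistent with a small negative delta wrapped unsigned.
for d in 4294967295.9999995 4294967295.9999981; do
  awk -v d="$d" 'BEGIN {
    gap = 4294967296 - d            # distance below 2^32
    if (gap >= 0 && gap < 1)
      printf "%s: 2^32 minus %.1e -> looks like a wrapped negative delta\n", d, gap
    else
      printf "%s: does not look like a 32-bit wrap\n", d
  }'
done
```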
Updated by Brian Woods over 1 year ago
Tried moving the cache pool off of the Optane disks in case there was some sort of compatibility issue, and started another copy. It did not help; if anything, it made it MUCH worse.
Now after restarting the affected OSDs, it goes right back into the SLOW_OPS state after just a few seconds. A good amount of data built up in the cache (spinning rust and all) before I canceled it.
Pre:
# rados df; ceph -s
POOL_NAME               USED     OBJECTS  CLONES  COPIES  MISSING_ON_PRIMARY  UNFOUND  DEGRADED  RD_OPS  RD      WR_OPS  WR       USED COMPR  UNDER COMPR
CephFS-Meta             315 MiB  49       0       147     0                   0        0         88      72 MiB  4756    108 MiB  0 B         0 B
CephFS-Migration-Cache  3.1 GiB  416      0       1248    0                   0        0         3431    0 B     29402   20 GiB   0 B         0 B
CephFS-Migration-Data   28 GiB   12587    0       62935   0                   0        0         1008    0 B     13545   19 GiB   2.1 GiB     4.3 GiB
...
health: HEALTH_WARN
        1 cache pools at or near target size
        531 slow ops, oldest one blocked for 242 sec, daemons [osd.2,osd.3,osd.7,osd.9] have slow ops.
After:
# rados df; ceph -s
POOL_NAME               USED     OBJECTS  CLONES  COPIES  MISSING_ON_PRIMARY  UNFOUND  DEGRADED  RD_OPS  RD      WR_OPS  WR       USED COMPR  UNDER COMPR
CephFS-Meta             315 MiB  49       0       147     0                   0        32        88      72 MiB  4756    108 MiB  0 B         0 B
CephFS-Migration-Cache  3.0 GiB  413      0       1239    0                   0        245       3431    0 B     29405   20 GiB   0 B         0 B
CephFS-Migration-Data   28 GiB   12590    0       62950   0                   0        0         1008    0 B     13548   19 GiB   2.1 GiB     4.3 GiB
...
health: HEALTH_WARN
        1 cache pools at or near target size
        4 slow ops, oldest one blocked for 37 sec, osd.3 has slow ops
Notice the drop in the object count (and the duration of the blocking). It is trying to flush objects, then it just stalls out. If I keep restarting the OSDs, it slowly flushes the writes back to the EC pool.
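Rather than bouncing OSDs repeatedly, the tiering agent can in theory be told to drain the cache directly. This is the documented command for it, though I have not confirmed how it behaves while the OSDs are already wedged in this state:

```shell
# Flush all dirty objects to the backing pool and evict the clean ones
rados -p CephFS-Migration-Cache cache-flush-evict-all
```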
I moved the cache pool back to SSDs, and the issue persists.
Updated by Brian Woods over 1 year ago
Also, it seemed to have no issue moving the data from HDD to SSD. Only took a few minutes. So....
Updated by Brian Woods over 1 year ago
Can anyone make any suggestions on what I can check next? I am really running out of ideas.
Updated by Brian Woods over 1 year ago
Tried switching the hosts over to Proxmox and forwent all of my normal customizations.
I was happy to see a significant increase in performance, but it ended up doing the same thing after about 10 minutes...
Updated by Brian Woods over 1 year ago
Corrections and Additional Info
RBDs ARE impacted, not just CephFS. It seems that you have to do a LOT more IO to trigger this for RBDs… aka I ran a Windows 10 VM and let it patch. Took about 30 minutes.
I thought it might have been some sort of hardware compatibility issue, so I re-formatted and created two separate stand-alone clusters. Both clusters had the issue. The only common hardware between the two nodes was the ConnectX-3 10Gb Ethernet, the 16GB Optane disks, and LSI SAS2008 controllers.
If it helps, this is the basic script I am using to do my deployment:
#Install Docker
sudo apt-get remove -y docker docker-engine docker.io containerd runc
sudo apt-get update
sudo apt-get install -y ca-certificates curl gnupg lsb-release
sudo mkdir -p /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg \
  | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
echo "deb [arch=$(dpkg --print-architecture) \
  signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" \
  | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update
sudo apt-get install docker-ce docker-ce-cli containerd.io docker-compose-plugin docker-compose

#Setup data classes:
ceph osd crush rule create-replicated BasicRep-ByOSD default osd
ceph osd crush rule create-replicated BasicRep-ByOSD-SSD default osd ssd
ceph osd crush rule create-replicated BasicRep-ByHost default host
ceph osd crush rule create-replicated BasicRep-ByHost-SSD default host ssd
ceph osd erasure-code-profile set Erasure-ByOSD-D3F1 crush-failure-domain=osd k=3 m=1 plugin=jerasure technique=reed_sol_van crush-device-class=hdd
ceph osd erasure-code-profile set Erasure-ByOSD-D4F1 crush-failure-domain=osd k=4 m=1 plugin=jerasure technique=reed_sol_van crush-device-class=hdd
ceph osd erasure-code-profile set Erasure-ByOSD-D3F2 crush-failure-domain=osd k=3 m=2 plugin=jerasure technique=reed_sol_van crush-device-class=hdd
ceph osd erasure-code-profile set Erasure-ByOSD-D7F3 crush-failure-domain=osd k=7 m=3 plugin=jerasure technique=reed_sol_van crush-device-class=hdd

#Setup MDS services in Gui
#Setup RGW services in Gui

#Setup services and set rules:
#SSD:
#ceph osd pool set "device_health_metrics" crush_rule BasicRep-ByHost-SSD
ceph osd pool set ".mgr" crush_rule BasicRep-ByOSD-SSD
ceph osd pool set ".rgw.root" crush_rule BasicRep-ByOSD-SSD
ceph osd pool set "default.rgw.control" crush_rule BasicRep-ByOSD-SSD
ceph osd pool set "default.rgw.meta" crush_rule BasicRep-ByOSD-SSD
#HDD:
ceph osd pool set "default.rgw.log" crush_rule BasicRep-ByOSD

PGsRoot=1
PGsMeta=8
PGsCache=8
PGsData=64

#Lets setup CephFS:
CephFSName=CephFS
RootPoolName=$CephFSName-Root
RootRepProfile=BasicRep-ByOSD
RootRepCount=3
RootRepMin=2
MetaPoolName=$CephFSName-Meta
MetaRepProfile=BasicRep-ByOSD-SSD
MetaRepCount=3
MetaRepMin=2
DataPoolName=$CephFSName-Migration-Data
DataPoolProfile=Erasure-ByOSD-D3F2
DataPoolMinSize=4 #Should be a MINIMUM of k (data blocks) + 1, + 2 is recommended.
CachePoolName=$CephFSName-Migration-Cache
CachePoolProfile=BasicRep-ByOSD-SSD
CachePoolRepCount=3
CachePoolRepMin=2
CachePool_cache_max_bytes=16106127360 #15GB
CachePool_cache_target_max_bytes=1073741824 #1GB
CompressionMode=aggressive
CompressionAlgorithm=zstd
CompressionRequired=.90

#Root Pool
ceph osd pool create $RootPoolName $PGsRoot $PGsRoot replicated $RootRepProfile --autoscale-mode=warn
ceph osd pool set $RootPoolName size $RootRepCount --yes-i-really-mean-it
ceph osd pool set $RootPoolName min_size $RootRepMin --yes-i-really-mean-it

#Meta Pool
ceph osd pool create $MetaPoolName $PGsMeta $PGsMeta replicated $MetaRepProfile --autoscale-mode=warn
ceph osd pool set $MetaPoolName size $MetaRepCount --yes-i-really-mean-it
ceph osd pool set $MetaPoolName min_size $MetaRepMin --yes-i-really-mean-it

#Data Pool
ceph osd pool create $DataPoolName $PGsData $PGsData erasure $DataPoolProfile --autoscale-mode=warn
ceph osd pool set $DataPoolName min_size $DataPoolMinSize
ceph osd pool set $DataPoolName compression_algorithm $CompressionAlgorithm
ceph osd pool set $DataPoolName compression_mode $CompressionMode
ceph osd pool set $DataPoolName compression_required_ratio $CompressionRequired

#Cache Pool
ceph osd pool create $CachePoolName $PGsCache $PGsCache replicated $CachePoolProfile --autoscale-mode=warn
ceph osd pool set-quota $CachePoolName max_bytes $CachePool_cache_max_bytes
ceph osd pool set $CachePoolName size $CachePoolRepCount --yes-i-really-mean-it
ceph osd pool set $CachePoolName min_size $CachePoolRepMin --yes-i-really-mean-it

#Create tiers
ceph osd tier add $DataPoolName $CachePoolName
ceph osd tier set-overlay $DataPoolName $CachePoolName
ceph osd tier cache-mode $CachePoolName writeback

#More cache pool settings, must be done after tiering.
ceph osd pool set $CachePoolName target_max_bytes $CachePool_cache_target_max_bytes
ceph osd pool set $MetaPoolName crush_rule $CachePoolProfile
ceph osd pool set $CachePoolName crush_rule $CachePoolProfile
ceph osd pool set $CachePoolName hit_set_type bloom

ceph fs new $CephFSName $MetaPoolName $RootPoolName --force
ceph fs add_data_pool $CephFSName $DataPoolName

#Mount on another host:
apt install -y ceph-fuse ceph-common
sudo mkdir -p /CephFS
KEY=`cat /etc/ceph/ceph.client.admin.keyring | grep key | cut -d ' ' -f 3`
sudo mount -t ceph 192.168.10.210:6789:/ /CephFS -o name=admin,secret=$KEY
mkdir -p /CephFS/Migration
#Are your variables set?
setfattr -n ceph.dir.layout.pool -v "$DataPoolName" /CephFS/Migration/
getfattr -n ceph.dir.layout /CephFS/Migration
Updated by Brian Woods over 1 year ago
Tried disabling compression, that did not resolve the issue.
Updated by Brian Woods over 1 year ago
I think WE have found it, THANKS DISCORD! Well, where the problem is, at least.
I have been able to work with several other people in the community to find that the issue is not just data draining from cache pools, but having cache pools at all (such as read-proxy). A pool with replication shows the issue as well, even while using an RBD; it just takes MUCH more IO for the issue to express itself.
On my home lab demo cluster I created an erasure pool, with erasure overwrites enabled (but NO cache pool) and have had no issues for almost a week now with about 50TBs of IO so far.
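For anyone wanting to replicate that no-cache-tier setup, it looked roughly like this (pool name and PG counts here are stand-ins; the EC profile name matches my deployment script above — the key bit is allow_ec_overwrites in place of a cache tier):

```shell
# EC pool that CephFS can write to directly, with no cache tier in front
ceph osd pool create EC-Data 64 64 erasure Erasure-ByOSD-D3F2 --autoscale-mode=warn
ceph osd pool set EC-Data allow_ec_overwrites true
ceph fs add_data_pool CephFS EC-Data
```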
I also was able to somewhat re-create the issue on my virtual cluster; however, the errors will often clear after some time. Sometimes it clears after only 60 seconds or so, sometimes after like 20 minutes, with no ongoing IO or any indication as to why.
What I did to re-create the issue in my VM cluster (VMware vSAN backed): I created a 6 node cluster with three disks each.
- A 40GB OS disk
- A 10GB cache disk.
- A 1TB data disk.
After adding an IO policy to the data disk limiting the IOps to 256, it almost instantly caused slow ops that would last for several minutes, even after IO had stopped. Removing the read-proxy cache after a reboot prevents the issue, and IO can be done without a problem.
I then wiped the cluster (as I have been messing with it) and did the same setup without a cache pool added and ran a continuous IO load in a loop without any issues. After a few hours without a problem, I added the read-proxy cache pool while the loop was running and within 30 seconds started to see the issue again.
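Adding and removing the tier between runs followed the standard tiering sequence (a sketch; pool names here are stand-ins for the ones on the VM cluster, and I am going from memory on which steps need the safety flag):

```shell
# Stop admitting new objects into the tier, drain it, then detach it
ceph osd tier cache-mode Cache-Pool proxy
rados -p Cache-Pool cache-flush-evict-all
ceph osd tier cache-mode Cache-Pool none --yes-i-really-mean-it
ceph osd tier remove-overlay Data-Pool
ceph osd tier remove Data-Pool Cache-Pool
```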
I know the answer is always “you need more disks” :) and though I don't disagree, there is definitely some sort of a race condition that happens when trying to evict/delete cache pool objects when IO is heavy. I have tested this with both Ceph 16 and 17; both show the issue.
Updated by Brian Woods over 1 year ago
The VM cluster was left to its own devices for a while. It became stuck recovering with no errors, or even slow IO alerts. It was like that for several days with no movement. Even the recovery timer was stuck at 2 hours.
Removed the IO limits and rebooted one of the nodes and it appears to be recovering again.