Bug #57385

OSDs “slow ops” with multi-hour delays.

Added by Brian Woods over 1 year ago. Updated over 1 year ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Sorry, this is going to be kinda long; I've been working on this for a while and will try to be as to-the-point as possible.

Cluster Basics:

Clean Install (testing my process)
2x hosts at the moment; 4x were planned, with more drives, later.
4x Intel Optane NVMe SSDs (16 GB each, with the newest firmware)

Originally all “SSD”
Later 2x “SSD” & 2x “Meta-SSD”

7x HDDs ranging from 3 to 10 TB (some SAS, some SATA).
10Gbit Networking
All pools other than those listed are tiny, as there is no real data in them yet.
Pools:

CephFS-Root (3X Replication HDD by OSD) – This is a stub (hardlinks?), and has no data in it.
CephFS-Meta (2X Replication Meta-SSD by OSD)
CephFS-Migration-Data (EC by OSD on HDD)
CephFS-Migration-Cache (2X Replication SSD by OSD)

writeback overlay for: CephFS-Migration-Data
quota_max_bytes = 5368709120
target_max_bytes = 4294967296
size = 2 (originally 3, more on this later)
min_size = 2

I set up an rsync to start loading data into CephFS-Migration-Data using:
setfattr -n ceph.dir.layout.pool -v CephFS-Migration-Data /CephFS/Migration/
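(For reference, those cache limits map onto the standard pool commands roughly like this; a sketch using this report's pool names, not a transcript of what was actually run. The full deployment script is in a later comment.)

#Hard quota on the cache pool, plus the flush/evict target the tiering agent works toward:
ceph osd pool set-quota CephFS-Migration-Cache max_bytes 5368709120
ceph osd pool set CephFS-Migration-Cache target_max_bytes 4294967296
ceph osd pool set CephFS-Migration-Cache size 2 --yes-i-really-mean-it
ceph osd pool set CephFS-Migration-Cache min_size 2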

All was working perfectly: the cache pool was taking in the writes and the OSDs were flushing it out at a good speed, averaging about 100 MBps. This went on for maybe 8 hours (about 3 TB in), then I started seeing “slow io” alerts on two of the SSDs. These would never clear, causing the MDS IO to back up and effectively stop (400 slow ops, oldest one blocked for 31770 sec). None of the OSDs filled (quotas in place, well under limits).

Originally I had all four SSDs as just “SSD”, but I was able to remedy the blocked IOs by re-tagging a few OSDs as “ssd-meta”, then re-tagging the MDS pool to use that class, and finally dropping from a size of 3 down to 2.
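(Roughly, that re-tagging amounts to the following; a sketch with placeholder OSD IDs and a placeholder rule name.)

#Move two of the Optane OSDs into a dedicated device class (IDs are placeholders):
ceph osd crush rm-device-class osd.4 osd.5
ceph osd crush set-device-class ssd-meta osd.4 osd.5
#New replicated rule restricted to that class, then point the metadata pool at it and drop to size 2:
ceph osd crush rule create-replicated BasicRep-ByOSD-MetaSSD default osd ssd-meta
ceph osd pool set CephFS-Meta crush_rule BasicRep-ByOSD-MetaSSD
ceph osd pool set CephFS-Meta size 2 --yes-i-really-mean-it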

I thought that somehow there was a conflict or over-utilization, and restarted my copy (I had planned to dedicate disks to metadata anyway).

Again, it worked for several hours and then the “slow op” alerts started again, this time on one OSD for the cache and one for an HDD spindle… I see no other errors in any of the logs.

I attempted to adjust the OSD labels again, but that didn’t seem to clear it; I only made one half-baked attempt, though.

So I decided to just reboot the node, as this time both of the affected OSDs were on the same node. It went into some sort of crash loop at shutdown (after 15 minutes I hard-powered the box off). See:
https://imgur.com/a/ghGJ6ZB

When it rebooted, the alarms had cleared, and I could again write to the file system.

I don’t think it is a hardware issue, as the affected OSDs do not all share a host (the first time), a drive type (SSD or HDD), or a class of hardware (CPU, SATA, and SAS), other than the Optane disks that I am using (I re-ran the Intel firmware updates just to be sure).

Am I missing something, or have I found some sort of bug? Maybe I am maxing something out in the kernel somehow???


Files

Log.txt (596 KB): ceph daemon osd.9 ops. Brian Woods, 09/12/2022 10:00 PM
Actions #1

Updated by Brian Woods over 1 year ago

Just want to note that this is holding up my further testing, as it is a reproducible issue.

I would like to take this to a production configuration at this point.

Actions #2

Updated by Brian Woods over 1 year ago

I have been looking through old bug reports and found this:
https://tracker.ceph.com/issues/37568

But I am not attempting to remove any snapshots, so it is probably not related; the impact is similar, though.

I think my next step will be to re-run my tests with Pacific.

Actions #3

Updated by Brian Woods over 1 year ago

I think I have found the trigger...

If the cache pool hits its quota, there is a high probability that it will block new IOs at that moment (causing them to hang indefinitely rather than just until space can be made by eviction). This issue seems to be aggravated by having multiple file copies going at once; testing with a single file copy does not seem to consistently reproduce the issue.

Maybe this is a documented limitation that I missed, but this really feels like a nasty bug. :/ The whole point of the quota was to prevent the OSDs from filling; if hitting the quota causes an issue that is almost as bad...
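If the quota really is the trigger, these are the relevant checks and knobs (a sketch; the 0.8/0.4 ratios are just example values, not settings from this cluster):

#How close is the cache pool to its quota and flush target?
ceph df
ceph osd pool get-quota CephFS-Migration-Cache
ceph osd pool get CephFS-Migration-Cache target_max_bytes
#Make the tiering agent flush/evict earlier, so the hard quota is never what stops IO:
ceph osd pool set CephFS-Migration-Cache cache_target_full_ratio 0.8
ceph osd pool set CephFS-Migration-Cache cache_target_dirty_ratio 0.4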

Actions #4

Updated by Brian Woods over 1 year ago

I waited a good day to hit send on that, just to be sure I had fixed the issue... 10 minutes after posting, it locked again, nowhere near the quota... AHHHH!!!

Actions #5

Updated by Brian Woods over 1 year ago

This SEEMS to only be an issue while operating on CephFS volumes. A standard block device using the same write cache and EC data volume doesn't seem to be impacted. Continuing my tests...

I cannot reproduce this on my second cluster at work... It's all virtual and pure SSD... Running out of ideas.

Actions #6

Updated by Brian Woods over 1 year ago

Reverting to Pacific did not resolve the issue.
Ran IO to a raw block device all night without issue.
Rebuilt to Pacific this morning and it started happening within about 10 minutes with CephFS.

Found notes here: https://documentation.suse.com/ses/7/html/ses-all/bp-troubleshooting-status.html

# ceph daemon osd.0 dump_historic_ops
{
    "size": 20,
    "duration": 600,
    "ops": []
}

And:

# ceph daemon osd.0 ops
{
    "ops": [
        {
            "description": "osd_op(osd.9.45:8828 4.1es0 4:796ca7cc:::10000001978.00000000:head [copy-from ver 2417 in=58b] snapc 1=[] ondisk+write+ignore_overlay+enforce_snapc+known_if_redirected e116)",
            "initiated_at": "2022-09-12T21:24:38.838966+0000",
            "age": 1168.318204314,
            "duration": 1168.318219262,
            "type_data": {
                "flag_point": "started",
                "events": [
                    {
                        "event": "initiated",
                        "time": "2022-09-12T21:24:38.838966+0000",
                        "duration": 0
                    },
                    {
                        "event": "throttled",
                        "time": "2022-09-12T21:24:38.838966+0000",
                        "duration": 4294967295.9999995
                    },
                    {
                        "event": "header_read",
                        "time": "2022-09-12T21:24:38.838966+0000",
                        "duration": 1.8670000000000001e-06
                    },
                    {
                        "event": "all_read",
                        "time": "2022-09-12T21:24:38.838968+0000",
                        "duration": 1.73e-07
                    },
                    {
                        "event": "dispatched",
                        "time": "2022-09-12T21:24:38.838968+0000",
                        "duration": 5.0900000000000002e-07
                    },
                    {
                        "event": "queued_for_pg",
                        "time": "2022-09-12T21:24:38.838969+0000",
                        "duration": 7.6323999999999996e-05
                    },
                    {
                        "event": "reached_pg",
                        "time": "2022-09-12T21:24:38.839045+0000",
                        "duration": 5.0130000000000001e-06
                    },
                    {
                        "event": "started",
                        "time": "2022-09-12T21:24:38.839050+0000",
                        "duration": 8.3566999999999995e-05
                    }
                ]
            }
        },
        {
            "description": "osd_op(osd.9.45:8830 4.1es0 4:79728a87:::10000000e6f.00000000:head [copy-from ver 1455 in=58b] snapc 1=[] ondisk+write+ignore_overlay+enforce_snapc+known_if_redirected e116)",
            "initiated_at": "2022-09-12T21:24:38.839100+0000",
            "age": 1168.318070543,
            "duration": 1168.3181186510001,
            "type_data": {
                "flag_point": "started",
                "events": [
                    {
                        "event": "initiated",
                        "time": "2022-09-12T21:24:38.839100+0000",
                        "duration": 0
                    },
                    {
                        "event": "throttled",
                        "time": "2022-09-12T21:24:38.839100+0000",
                        "duration": 4294967295.9999995
                    },
                    {
                        "event": "header_read",
                        "time": "2022-09-12T21:24:38.839100+0000",
                        "duration": 1.6899999999999999e-06
                    },
                    {
                        "event": "all_read",
                        "time": "2022-09-12T21:24:38.839101+0000",
                        "duration": 1.37e-07
                    },
                    {
                        "event": "dispatched",
                        "time": "2022-09-12T21:24:38.839102+0000",
                        "duration": 7.5899999999999995e-07
                    },
                    {
                        "event": "queued_for_pg",
                        "time": "2022-09-12T21:24:38.839102+0000",
                        "duration": 4.7132000000000002e-05
                    },
                    {
                        "event": "reached_pg",
                        "time": "2022-09-12T21:24:38.839149+0000",
                        "duration": 5.1819999999999997e-06
                    },
                    {
                        "event": "started",
                        "time": "2022-09-12T21:24:38.839155+0000",
                        "duration": 5.4580000000000003e-05
                    }
                ]
            }
        },
        {
            "description": "osd_op(osd.9.45:8832 4.1es0 4:799e00d3:::1000000130e.00000000:head [copy-from ver 1887 in=58b] snapc 1=[] ondisk+write+ignore_overlay+enforce_snapc+known_if_redirected e116)",
            "initiated_at": "2022-09-12T21:24:38.839256+0000",
            "age": 1168.3179147369999,
            "duration": 1168.3179915830001,
            "type_data": {
                "flag_point": "started",
                "events": [
                    {
                        "event": "initiated",
                        "time": "2022-09-12T21:24:38.839256+0000",
                        "duration": 0
                    },
                    {
                        "event": "throttled",
                        "time": "2022-09-12T21:24:38.839256+0000",
                        "duration": 4294967295.9999995
                    },
                    {
                        "event": "header_read",
                        "time": "2022-09-12T21:24:38.839255+0000",
                        "duration": 1.64e-06
                    },
                    {
                        "event": "all_read",
                        "time": "2022-09-12T21:24:38.839257+0000",
                        "duration": 1.54e-07
                    },
                    {
                        "event": "dispatched",
                        "time": "2022-09-12T21:24:38.839257+0000",
                        "duration": 4.4299999999999998e-07
                    },
                    {
                        "event": "queued_for_pg",
                        "time": "2022-09-12T21:24:38.839258+0000",
                        "duration": 3.6322000000000002e-05
                    },
                    {
                        "event": "reached_pg",
                        "time": "2022-09-12T21:24:38.839294+0000",
                        "duration": 4.0450000000000002e-06
                    },
                    {
                        "event": "started",
                        "time": "2022-09-12T21:24:38.839298+0000",
                        "duration": 4.2228999999999998e-05
                    }
                ]
            }
        },
        {
            "description": "osd_op(osd.9.45:8834 4.1es0 4:79c1296e:::100000010ae.00000000:head [copy-from ver 1816 in=58b] snapc 1=[] ondisk+write+ignore_overlay+enforce_snapc+known_if_redirected e116)",
            "initiated_at": "2022-09-12T21:24:38.839346+0000",
            "age": 1168.317824408,
            "duration": 1168.3179286080001,
            "type_data": {
                "flag_point": "started",
                "events": [
                    {
                        "event": "initiated",
                        "time": "2022-09-12T21:24:38.839346+0000",
                        "duration": 0
                    },
                    {
                        "event": "throttled",
                        "time": "2022-09-12T21:24:38.839346+0000",
                        "duration": 4294967295.9999995
                    },
                    {
                        "event": "header_read",
                        "time": "2022-09-12T21:24:38.839346+0000",
                        "duration": 5.1279999999999999e-06
                    },
                    {
                        "event": "all_read",
                        "time": "2022-09-12T21:24:38.839351+0000",
                        "duration": 1.54e-07
                    },
                    {
                        "event": "dispatched",
                        "time": "2022-09-12T21:24:38.839351+0000",
                        "duration": 1.114e-06
                    },
                    {
                        "event": "queued_for_pg",
                        "time": "2022-09-12T21:24:38.839352+0000",
                        "duration": 3.3144e-05
                    },
                    {
                        "event": "reached_pg",
                        "time": "2022-09-12T21:24:38.839385+0000",
                        "duration": 3.9779999999999999e-06
                    },
                    {
                        "event": "started",
                        "time": "2022-09-12T21:24:38.839389+0000",
                        "duration": 4.3167999999999997e-05
                    }
                ]
            }
        }
    ],
    "num_ops": 4
}

IDK if that helps at all, but...

I have also found that just restarting the Docker container for the OSD clears the problem.
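(Depending on how the OSD containers are managed, that restart is something along these lines; fsid and container ID are placeholders.)

#cephadm-managed:
ceph orch daemon restart osd.9
#or via systemd (fsid is a placeholder):
systemctl restart ceph-<fsid>@osd.9.service
#or find the container and restart it directly:
docker ps | grep osd.9
docker restart <container id>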

Actions #7

Updated by Brian Woods over 1 year ago

From the second OSD in this state:

# ceph daemon osd.9 dump_historic_ops
{
    "size": 20,
    "duration": 600,
    "ops": [
        {
            "description": "osd_repop(mds.0.5:2574 3.6 e116/88 3:654134d2:::mds0_openfiles.0:head v 116'116, mlcod=116'114)",
            "initiated_at": "2022-09-12T21:51:01.814339+0000",
            "age": 73.645018769999993,
            "duration": 0.001372341,
            "type_data": {
                "flag_point": "commit sent; apply or cleanup",
                "events": [
                    {
                        "event": "initiated",
                        "time": "2022-09-12T21:51:01.814339+0000",
                        "duration": 0
                    },
                    {
                        "event": "throttled",
                        "time": "2022-09-12T21:51:01.814339+0000",
                        "duration": 4294967295.9999981
                    },
                    {
                        "event": "header_read",
                        "time": "2022-09-12T21:51:01.814337+0000",
                        "duration": 2.7494000000000001e-05
                    },
                    {
                        "event": "all_read",
                        "time": "2022-09-12T21:51:01.814364+0000",
                        "duration": 2.6759999999999999e-06
                    },
                    {
                        "event": "dispatched",
                        "time": "2022-09-12T21:51:01.814367+0000",
                        "duration": 7.525e-06
                    },
                    {
                        "event": "queued_for_pg",
                        "time": "2022-09-12T21:51:01.814375+0000",
                        "duration": 4.7308999999999997e-05
                    },
                    {
                        "event": "reached_pg",
                        "time": "2022-09-12T21:51:01.814422+0000",
                        "duration": 1.9072e-05
                    },
                    {
                        "event": "started",
                        "time": "2022-09-12T21:51:01.814441+0000",
                        "duration": 0.0012391119999999999
                    },
                    {
                        "event": "commit_sent",
                        "time": "2022-09-12T21:51:01.815680+0000",
                        "duration": 3.1170999999999997e-05
                    },
                    {
                        "event": "done",
                        "time": "2022-09-12T21:51:01.815711+0000",
                        "duration": 0.001372341
                    }
                ]
            }
        }
    ]
}

And:

# ceph daemon osd.9 ops
### See attached Log.txt (~600 KB). ###

Actions #8

Updated by Brian Woods over 1 year ago

Notice that the durations are insanely long, like there is an int that has rolled over or something.
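For what it's worth, the huge "throttled" durations in the dumps above sit exactly at the unsigned 32-bit boundary (quick arithmetic check):

#4294967295.xxx is (2^32 - 1) plus a fractional part:
echo $(( (1 << 32) - 1 ))
#prints 4294967295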

Actions #9

Updated by Brian Woods over 1 year ago

Tried moving the cache pool off of the Optane disks in case there was some sort of compatibility issue, and started another copy. It did not help; if anything, it made it MUCH worse.

Now, after restarting the affected OSDs, it goes right back into the SLOW_OPS state after just a few seconds. A good amount of data built up in the cache (spinning rust and all) before I canceled it.

Pre:

# rados df; ceph -s
POOL_NAME                  USED  OBJECTS  CLONES  COPIES  MISSING_ON_PRIMARY  UNFOUND  DEGRADED  RD_OPS      RD  WR_OPS       WR  USED COMPR  UNDER COMPR
CephFS-Meta             315 MiB       49       0     147                   0        0         0      88  72 MiB    4756  108 MiB         0 B          0 B
CephFS-Migration-Cache  3.1 GiB      416       0    1248                   0        0         0    3431     0 B   29402   20 GiB         0 B          0 B
CephFS-Migration-Data    28 GiB    12587       0   62935                   0        0         0    1008     0 B   13545   19 GiB     2.1 GiB      4.3 GiB
...
    health: HEALTH_WARN
            1 cache pools at or near target size
            531 slow ops, oldest one blocked for 242 sec, daemons [osd.2,osd.3,osd.7,osd.9] have slow ops.

After:

# rados df; ceph -s
POOL_NAME                  USED  OBJECTS  CLONES  COPIES  MISSING_ON_PRIMARY  UNFOUND  DEGRADED  RD_OPS      RD  WR_OPS       WR  USED COMPR  UNDER COMPR
CephFS-Meta             315 MiB       49       0     147                   0        0        32      88  72 MiB    4756  108 MiB         0 B          0 B
CephFS-Migration-Cache  3.0 GiB      413       0    1239                   0        0       245    3431     0 B   29405   20 GiB         0 B          0 B
CephFS-Migration-Data    28 GiB    12590       0   62950                   0        0         0    1008     0 B   13548   19 GiB     2.1 GiB      4.3 GiB
...
    health: HEALTH_WARN
            1 cache pools at or near target size
            4 slow ops, oldest one blocked for 37 sec, osd.3 has slow ops

Notice the drop in the count of objects (and the duration of the blocking). It tries to flush objects, then it just stalls out. If I keep restarting the OSDs, it slowly flushes the writes back to the EC pool.

I moved the cache pool back to SSDs, and the issue persists.
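(The moves between device classes were just crush_rule changes on the cache pool, along these lines; a sketch using the rule names from the deployment script below.)

#Cache pool onto the HDD rule for the test, then back onto the SSD rule:
ceph osd pool set CephFS-Migration-Cache crush_rule BasicRep-ByOSD
ceph osd pool set CephFS-Migration-Cache crush_rule BasicRep-ByOSD-SSD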

Actions #10

Updated by Brian Woods over 1 year ago

Also, it seemed to have no issue moving the data from HDD to SSD. It only took a few minutes. So....

Actions #11

Updated by Brian Woods over 1 year ago

Can anyone make any suggestions on what I can check next? I am really running out of ideas.

Actions #12

Updated by Brian Woods over 1 year ago

Tried switching the hosts over to Proxmox and forwent all of my normal customizations.

I was happy to see a significant increase in performance, but it ended up doing the same thing after about 10 minutes...

Actions #13

Updated by Brian Woods over 1 year ago

Corrections and Additional Info
RBDs ARE impacted, not just CephFS. It seems that you have to do a LOT more IO to trigger this for RBDs… e.g., I ran a Windows 10 VM and let it patch; it took about 30 minutes.

I thought it might have been some sort of hardware compatibility issue, so I re-formatted and created two separate standalone clusters. Both clusters had the issue. The only common hardware between the two nodes was the ConnectX-3 10Gb Ethernet, the 16GB Optane disks, and the LSI SAS2008 controllers.

If it helps, this is the basic script I am using to do my deployment:

#InstallDocker
sudo apt-get remove -y docker docker-engine docker.io containerd runc
sudo apt-get update
sudo apt-get install -y ca-certificates curl gnupg lsb-release
sudo mkdir -p /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg \
  | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
echo "deb [arch=$(dpkg --print-architecture) \
  signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" \
  | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update
sudo apt-get install -y docker-ce docker-ce-cli containerd.io docker-compose-plugin docker-compose

#Setup data classes:
ceph osd crush rule create-replicated BasicRep-ByOSD default osd
ceph osd crush rule create-replicated BasicRep-ByOSD-SSD default osd ssd
ceph osd crush rule create-replicated BasicRep-ByHost default host
ceph osd crush rule create-replicated BasicRep-ByHost-SSD default host ssd
ceph osd erasure-code-profile set Erasure-ByOSD-D3F1 crush-failure-domain=osd k=3 m=1 plugin=jerasure technique=reed_sol_van crush-device-class=hdd
ceph osd erasure-code-profile set Erasure-ByOSD-D4F1 crush-failure-domain=osd k=4 m=1 plugin=jerasure technique=reed_sol_van crush-device-class=hdd
ceph osd erasure-code-profile set Erasure-ByOSD-D3F2 crush-failure-domain=osd k=3 m=2 plugin=jerasure technique=reed_sol_van crush-device-class=hdd
ceph osd erasure-code-profile set Erasure-ByOSD-D7F3 crush-failure-domain=osd k=7 m=3 plugin=jerasure technique=reed_sol_van crush-device-class=hdd

#Setup MDS services in Gui
#Setup RGW services in Gui

#Setup services and set rules:
#SSD:
#ceph osd pool set "device_health_metrics" crush_rule BasicRep-ByHost-SSD
ceph osd pool set ".mgr" crush_rule BasicRep-ByOSD-SSD
ceph osd pool set ".rgw.root" crush_rule BasicRep-ByOSD-SSD
ceph osd pool set "default.rgw.control" crush_rule BasicRep-ByOSD-SSD
ceph osd pool set "default.rgw.meta" crush_rule BasicRep-ByOSD-SSD

#HDD:
ceph osd pool set "default.rgw.log" crush_rule BasicRep-ByOSD

PGsRoot=1
PGsMeta=8
PGsCache=8
PGsData=64

#Lets setup CephFS:
CephFSName=CephFS

RootPoolName=$CephFSName-Root
RootRepProfile=BasicRep-ByOSD
RootRepCount=3
RootRepMin=2

MetaPoolName=$CephFSName-Meta
MetaRepProfile=BasicRep-ByOSD-SSD
MetaRepCount=3
MetaRepMin=2

DataPoolName=$CephFSName-Migration-Data
DataPoolProfile=Erasure-ByOSD-D3F2
DataPoolMinSize=4   #Should be a MINIMUM of k (data blocks) + 1; + 2 is recommended.

CachePoolName=$CephFSName-Migration-Cache
CachePoolProfile=BasicRep-ByOSD-SSD
CachePoolRepCount=3
CachePoolRepMin=2
CachePool_cache_max_bytes=16106127360              #15GB
CachePool_cache_target_max_bytes=1073741824        #1GB

CompressionMode=aggressive
CompressionAlgorithm=zstd
CompressionRequired=.90

#Root Pool
ceph osd pool create $RootPoolName $PGsRoot $PGsRoot replicated $RootRepProfile --autoscale-mode=warn
ceph osd pool set $RootPoolName size $RootRepCount --yes-i-really-mean-it
ceph osd pool set $RootPoolName min_size $RootRepMin --yes-i-really-mean-it

#Meta Pool
ceph osd pool create $MetaPoolName $PGsMeta $PGsMeta replicated $MetaRepProfile --autoscale-mode=warn
ceph osd pool set $MetaPoolName size $MetaRepCount --yes-i-really-mean-it
ceph osd pool set $MetaPoolName min_size $MetaRepMin --yes-i-really-mean-it

#Data Pool
ceph osd pool create $DataPoolName $PGsData $PGsData erasure $DataPoolProfile --autoscale-mode=warn
ceph osd pool set $DataPoolName min_size $DataPoolMinSize
ceph osd pool set $DataPoolName compression_algorithm $CompressionAlgorithm
ceph osd pool set $DataPoolName compression_mode $CompressionMode
ceph osd pool set $DataPoolName compression_required_ratio $CompressionRequired

#Cache Pool
ceph osd pool create $CachePoolName $PGsCache $PGsCache replicated $CachePoolProfile --autoscale-mode=warn
ceph osd pool set-quota $CachePoolName max_bytes $CachePool_cache_max_bytes
ceph osd pool set $CachePoolName size $CachePoolRepCount --yes-i-really-mean-it
ceph osd pool set $CachePoolName min_size $CachePoolRepMin --yes-i-really-mean-it

#Create tiers
ceph osd tier add $DataPoolName $CachePoolName
ceph osd tier set-overlay $DataPoolName $CachePoolName
ceph osd tier cache-mode $CachePoolName writeback

#More cache pool settings, must be done after tiering.
ceph osd pool set $CachePoolName target_max_bytes $CachePool_cache_target_max_bytes
ceph osd pool set $MetaPoolName crush_rule $CachePoolProfile
ceph osd pool set $CachePoolName crush_rule $CachePoolProfile
ceph osd pool set $CachePoolName hit_set_type bloom

ceph fs new $CephFSName $MetaPoolName $RootPoolName --force
ceph fs add_data_pool $CephFSName $DataPoolName
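
#(Optional sanity check, not part of the original script: shows tier_of / cache_mode / quota
# settings for each pool so the overlay setup above can be verified.)
ceph osd pool ls detail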

#Mount on another host:
apt install -y ceph-fuse ceph-common

sudo mkdir -p /CephFS
KEY=`cat /etc/ceph/ceph.client.admin.keyring | grep key | cut -d ' ' -f 3`
sudo mount -t ceph 192.168.10.210:6789:/ /CephFS -o name=admin,secret=$KEY
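#(Alternative, not in the original script: a secretfile keeps the key out of the process list.
# It would replace the mount line above with something like:)
#sudo mount -t ceph 192.168.10.210:6789:/ /CephFS -o name=admin,secretfile=/etc/ceph/admin.secret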

mkdir -p /CephFS/Migration

#Are your variables set?
setfattr -n ceph.dir.layout.pool -v "$DataPoolName" /CephFS/Migration/
getfattr -n ceph.dir.layout /CephFS/Migration

Actions #14

Updated by Brian Woods over 1 year ago

Tried disabling compression, that did not resolve the issue.

Actions #15

Updated by Brian Woods over 1 year ago

I think WE have found it, THANKS DISCORD! Well, where the problem is, at least.

I have been able to work with several other people in the community to find that the issue is not just data draining from cache pools, but having a cache pool at all (such as read-proxy). A pool with replication shows the issue as well, even while using an RBD; it just takes MUCH more IO for the issue to express itself.

On my home lab demo cluster I created an erasure-coded pool with EC overwrites enabled (but NO cache pool) and have had no issues for almost a week now, with about 50 TB of IO so far.
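That stable setup is essentially just this (a sketch; the EC pool name here is a placeholder, while the profile and filesystem names are the ones from the deployment script above):

#EC data pool with overwrites enabled, used directly as a CephFS data pool, no cache tier:
ceph osd pool create Migration-Data-EC 64 64 erasure Erasure-ByOSD-D3F2
ceph osd pool set Migration-Data-EC allow_ec_overwrites true
ceph fs add_data_pool CephFS Migration-Data-EC
setfattr -n ceph.dir.layout.pool -v Migration-Data-EC /CephFS/Migration/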

I was also able to somewhat re-create the issue on my virtual cluster; however, the errors will often clear after some time. Sometimes it clears after only 60 seconds or so, sometimes after something like 20 minutes, with no ongoing IO or any indication as to why.

What I did to re-create the issue in my VM Cluster (VMware vSAN backed):
I created a 6 node cluster with three disks each.
  • A 40GB OS disk
  • A 10GB cache disk.
  • A 1TB data Disk.

After adding an IO policy to the data disk limiting it to 256 IOPS, it almost instantly caused slow ops that would last for several minutes, even after IO had stopped. Removing the read-proxy cache after a reboot prevents the issue, and IO can be done without a problem.

I then wiped the cluster (as I had been messing with it), did the same setup without a cache pool added, and ran a continuous IO load in a loop without any issues. After a few hours without a problem, I added the read-proxy cache pool while the loop was running and within 30 seconds started to see the issue again.

I know the answer is always “you need more disks” :) and though I don't disagree, there is definitely some sort of race condition that happens when trying to evict/delete cache pool objects while IO is heavy. I have tested this with both Ceph 16 and 17; both show the issue.
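For anyone who would rather back the cache tier out entirely than keep fighting it, the standard removal sequence is roughly this (a sketch; best done while IO is quiet):

#Stop new writes landing in the cache, flush everything down to the EC pool, then detach the tier:
ceph osd tier cache-mode CephFS-Migration-Cache proxy
rados -p CephFS-Migration-Cache cache-flush-evict-all
ceph osd tier remove-overlay CephFS-Migration-Data
ceph osd tier remove CephFS-Migration-Data CephFS-Migration-Cache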

Actions #16

Updated by Brian Woods over 1 year ago

The VM cluster was left to its own devices for a while. It became stuck recovering, with no errors or even slow IO alerts. It was like that for several days with no movement. Even the recovery timer was stuck at 2 hours.

Removed the IO limits and rebooted one of the nodes, and it appears to be recovering again.
