Bug #44184

Slow / Hanging Ops after pool creation

Added by Wido den Hollander almost 2 years ago. Updated about 1 month ago.

Status:
Need More Info
Priority:
High
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Community (dev)
Tags:
osd,slow ops,osd_pg_create
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
OSD
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

On a cluster with 1405 OSDs I've run into a situation, for the second time now, where a pool creation resulted in massive slow ops on the OSDs serving that pool.

I'll summarize it first before I continue with more information:

- A pool is created
- Once the balancer starts its first balancing round, slow ops start to appear
- The slow ops are gone as soon as the pool is removed

Details about the cluster:

- 1405 OSDs
- Mixture of HDD, SSD, FileStore and BlueStore
- Nautilus 14.2.4 (80%) and 14.2.6 (20%)

This situation happened two times:

- December 12th 2019
- February 18th 2020

In both cases a pool was created.

Creating the pool went fine; nothing happened. All PGs were created and became active+clean.

The balancer module is configured to only rebalance between 0700 and 1800 and both pools were created outside this window.

At 07:00 the balancer module rebalanced the cluster, and this led to massive slow ops on the OSDs, so many that client I/O was massively impacted.

Now keep in mind that the pool was created hours before this happened and all PGs were active+clean. It was not until the balancer started rebalancing the cluster that these slow ops occurred.

In the last 24h this happened again because somebody ran the command:

radosgw-admin usage show

This caused the RGW to create the usage pool for RGW.

It wasn't until 18-02-2020 07:02 that the slow ops started to appear:

2020-02-18 07:03:10.518 7fd1f0344700 -1 osd.1646 1064314 get_health_metrics reporting 1 slow ops, oldest is osd_pg_create(e1064314 40.66:1064099 40.6a:1064099)
2020-02-18 07:05:15.112 7fd1f0344700 -1 osd.1646 1064325 get_health_metrics reporting 4 slow ops, oldest is osd_op(client.1804403322.0:43816957 25.6c8 25.f23496c8 (undecoded) ondisk+write+known_if_redirected e1064326)
2020-02-18 07:05:16.158 7fd1f0344700 -1 osd.1646 1064325 get_health_metrics reporting 14 slow ops, oldest is osd_op(client.1804403322.0:43816957 25.6c8 25.f23496c8 (undecoded) ondisk+write+known_if_redirected e1064326)
2020-02-18 07:05:17.198 7fd1f0344700 -1 osd.1646 1064325 get_health_metrics reporting 17 slow ops, oldest is osd_op(client.1804403322.0:43816957 25.6c8 25.f23496c8 (undecoded) ondisk+write+known_if_redirected e1064326)
2020-02-18 07:05:18.209 7fd1f0344700 -1 osd.1646 1064325 get_health_metrics reporting 21 slow ops, oldest is osd_op(client.1804403322.0:43816957 25.6c8 25.f23496c8 (undecoded) ondisk+write+known_if_redirected e1064326)
2020-02-18 07:05:19.248 7fd1f0344700 -1 osd.1646 1064325 get_health_metrics reporting 25 slow ops, oldest is osd_op(client.1804403322.0:43816957 25.6c8 25.f23496c8 (undecoded) ondisk+write+known_if_redirected e1064326)
2020-02-18 07:05:36.235 7fd1f0344700 -1 osd.1646 1064329 get_health_metrics reporting 1 slow ops, oldest is osd_pg_create(e1064328 40.66:1064099 40.6a:1064099)
2020-02-18 07:05:37.242 7fd1f0344700 -1 osd.1646 1064329 get_health_metrics reporting 1 slow ops, oldest is osd_pg_create(e1064328 40.66:1064099 40.6a:1064099)
2020-02-18 07:05:38.242 7fd1f0344700 -1 osd.1646 1064329 get_health_metrics reporting 1 slow ops, oldest is osd_pg_create(e1064328 40.66:1064099 40.6a:1064099)
2020-02-18 07:05:39.262 7fd1f0344700 -1 osd.1646 1064329 get_health_metrics reporting 1 slow ops, oldest is osd_pg_create(e1064329 40.66:1064099 40.6a:1064099)
2020-02-18 07:05:40.248 7fd1f0344700 -1 osd.1646 1064329 get_health_metrics reporting 1 slow ops, oldest is osd_pg_create(e1064329 40.66:1064099 40.6a:1064099)
2020-02-18 07:05:41.284 7fd1f0344700 -1 osd.1646 1064329 get_health_metrics reporting 1 slow ops, oldest is osd_pg_create(e1064329 40.66:1064099 40.6a:1064099)
2020-02-18 07:05:42.235 7fd1f0344700 -1 osd.1646 1064329 get_health_metrics reporting 1 slow ops, oldest is osd_pg_create(e1064329 40.66:1064099 40.6a:1064099)
2020-02-18 07:05:43.199 7fd1f0344700 -1 osd.1646 1064329 get_health_metrics reporting 1 slow ops, oldest is osd_pg_create(e1064329 40.66:1064099 40.6a:1064099)
2020-02-18 07:05:44.182 7fd1f0344700 -1 osd.1646 1064329 get_health_metrics reporting 1 slow ops, oldest is osd_pg_create(e1064329 40.66:1064099 40.6a:1064099)
2020-02-18 07:05:45.132 7fd1f0344700 -1 osd.1646 1064329 get_health_metrics reporting 1 slow ops, oldest is osd_pg_create(e1064329 40.66:1064099 40.6a:1064099)

This went on and on. All the OSDs involved with this new pool (pool ID 40) were reporting slow ops.

I downloaded osdmap e1064314 and checked the map. It shows:

epoch 1064314
fsid 5ec1249f-7a87-4b4a-b360-2b5d17e24984
created 2015-11-25 16:50:51.738355
modified 2020-02-18 07:02:39.535132
flags sortbitwise,recovery_deletes,purged_snapdirs
...
...
pool 40 '.zm1.usage' replicated size 3 min_size 2 crush_rule 2 object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode warn last_change 1064100 flags hashpspool,nodelete stripe_width 0 application rgw
...
...
pg_upmap_items 3.0 [2491,2338]
pg_upmap_items 3.3 [1468,2393]
...
pg_temp 4.3c2 [254,2597]
pg_temp 10.c6e [421,2193,2180]
pg_temp 25.362 [674,941,2180]
pg_temp 25.3c9 [2203,829,1088]

There are no pg_upmap_items nor pg_temp entries for pool 40's Placement Groups.
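For reference, scanning a plain-text osdmap dump for a pool's entries can be scripted; a minimal sketch (a hypothetical helper, parsing `osdmaptool --print`-style lines like the excerpt above — this is illustrative, not an official tool):

```python
def entries_for_pool(dump_text, pool_id):
    """Collect pg_upmap_items and pg_temp lines whose PG belongs to pool_id."""
    prefix = f"{pool_id}."
    hits = []
    for line in dump_text.splitlines():
        parts = line.split()
        # pg_upmap_items / pg_temp lines look like: "pg_temp 4.3c2 [254,2597]"
        if len(parts) >= 2 and parts[0] in ("pg_upmap_items", "pg_temp"):
            if parts[1].startswith(prefix):
                hits.append(line.strip())
    return hits

dump = """\
pg_upmap_items 3.0 [2491,2338]
pg_temp 4.3c2 [254,2597]
pg_temp 25.362 [674,941,2180]
"""
print(entries_for_pool(dump, 25))  # only the pool-25 pg_temp line
print(entries_for_pool(dump, 40))  # [] -- matching the observation above
```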

Finally we removed the pool and within a couple of seconds all the slow requests were gone.

To me this seems to be related to the pool being created, but it's only triggered after the balancer starts an optimization.

Has anybody seen this before or has a clue what might be happening here?

ceph-13.2.8-debug-creating-pgs.diff View (513 Bytes) Nikola Ciprich, 03/23/2020 03:39 PM

History

#1 Updated by Wido den Hollander almost 2 years ago

On the Ceph users list there are multiple reports of people experiencing this:

- https://www.spinics.net/lists/ceph-users/msg54910.html
- https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/QSFGP5H7KJAW7Y3MZR7B2UWCHUQCHSLK/?sort=thread

This seems to be Nautilus related.

I'll see if I can reproduce it somehow, but that doesn't seem to be very simple. None of my test systems experience this.

#2 Updated by Neha Ojha almost 2 years ago

  • Status changed from New to Need More Info

Hi Wido,

I did come across something like this while investigating https://tracker.ceph.com/issues/43048. It was also on Nautilus, but the slow osd_pg_create ops eventually went away once the PGs were able to fetch the required osdmaps for the corresponding epochs. In your case, do you observe the same? In other words, how long do you see these slow ops persist? If you happen to have OSD logs, please attach them.

#3 Updated by Wido den Hollander almost 2 years ago

Neha Ojha wrote:

Hi Wido,

I did come across something like this while investigating https://tracker.ceph.com/issues/43048. It was also on Nautilus, but the slow osd_pg_create ops eventually went away once the PGs were able to fetch the required osdmaps for the corresponding epochs. In your case, do you observe the same? In other words, how long do you see these slow ops persist? If you happen to have OSD logs, please attach them.

The logs we have are very minimal, but the slow ops never went away. What logs I have, I've posted at the beginning of this issue.

All PGs are active+clean.

This started around 07:00 in the morning (when the balancer became active) and the slow ops remained until ~11:30 until we removed the newly created pool.

I think I can reproduce this easily on this production cluster, but with a few PB online I simply can't test this again and see what happens.

#4 Updated by Neha Ojha almost 2 years ago

  • Priority changed from Normal to High

#5 Updated by Igor Fedotov over 1 year ago

We've got a similar case with plenty of slow op indications, many of them osd_pg_create ones, which eventually go away and reappear after a while.

Nautilus v14.2.5

#6 Updated by Jan Fajerski over 1 year ago

Hi everyone, we were able to collect some debug logs that seem to exhibit this issue. The log is fairly large; you can find a compressed version of it under https://framadrop.org/r/zMEeWuFJF7#X6kbAAfodjjvIR9K5h3AdeZfaS2Jf9AuHeiJtiuU7wM= (it will self-delete Thursday, April 9, 2020 5:13 PM).

#7 Updated by Paul Emmerich over 1 year ago

I've encountered another cluster in the wild that hit this bug. It seems to be triggered somewhat reliably a few hours after creating new pools. Deleting the new pools resolves it. Keeping the pools causes random problems when anything in the osdmap changes.

The new pool works without any problems until the issue occurs, which only happens after a few hours.

It might be related to scrubbing or snapshot operations, but this may be a coincidence: it triggered at around the same time as scrubbing started and a snapshot cron job ran. The balancer is disabled on this cluster.

And yeah, unfortunately I can't debug it on this system either because it's a somewhat important production cluster :(

#8 Updated by Paul Emmerich over 1 year ago

Oh, running mostly 14.2.4 and some OSDs on 14.2.5 on Ubuntu 16.04. Small cluster with only 127 OSDs.

#9 Updated by Paul Emmerich over 1 year ago

Forgot to mention: the cluster has been running since Jewel 10.2.4. I think upgrading from Jewel or older seems to be a prerequisite for this to happen?

I've seen this on two setups and one has been upgraded from Jewel, one from Firefly. Both have been fully migrated to BlueStore, so no mixed setup like in the original description.

#10 Updated by Wido den Hollander over 1 year ago

So this message came along on the users mailinglist:

so I can confirm that at least in my case, the problem is caused
by old osdmaps not being pruned for some reason, and thus not fitting
into the cache. When I increased the osdmap cache to 5000 the problem was gone.

The question is why they're not being pruned, even though the cluster is in a
healthy state. But you can try checking:

ceph daemon osd.X status (to see how many maps your OSDs are storing)
ceph daemon osd.X perf dump | grep osd_map_cache_miss

to see if you're experiencing a similar problem..

so I'm going to debug further..
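A sketch of the second check, pulling the counter out of the `perf dump` JSON instead of grepping (the nesting of the counter under an "osd" section is an assumption about the perf-dump layout, and the sample blob is hypothetical):

```python
import json

def map_cache_misses(perf_dump):
    """Extract the osd_map_cache_miss counter from a parsed
    `ceph daemon osd.X perf dump` blob; 0 if the counter is absent."""
    return perf_dump.get("osd", {}).get("osd_map_cache_miss", 0)

# Hypothetical sample, trimmed to the relevant counter:
sample = json.loads('{"osd": {"osd_map_cache_miss": 123456}}')
print(map_cache_misses(sample))  # 123456
```

Running this per OSD and sorting by the counter would show whether the misses are cluster-wide or concentrated on the OSDs serving the new pool.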

#11 Updated by Jan Fajerski over 1 year ago

Can confirm seeing issues with osd map pruning ("oldest_map": 41985 vs "newest_map": 83376) and large osd_map_cache_miss counter values.
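That gap can be computed directly from the `ceph daemon osd.X status` output; a minimal sketch (assuming the `oldest_map` / `newest_map` keys quoted above):

```python
def osdmap_backlog(status):
    """Number of osdmap epochs an OSD is still holding, from the parsed
    output of `ceph daemon osd.X status`."""
    return status["newest_map"] - status["oldest_map"]

# The figures reported in this comment:
status = {"oldest_map": 41985, "newest_map": 83376}
print(osdmap_backlog(status))  # 41391 epochs that were never pruned
```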

#12 Updated by Dan van der Ster over 1 year ago

Jan Fajerski wrote:

Can confirm seeing issues with osd map pruning ("oldest_map": 41985 vs "newest_map": 83376) and large osd_map_cache_miss counter values.

See https://tracker.ceph.com/issues/37875

With `ceph pg dump -f json | jq .osd_epochs` you can see which OSD is needing osdmap 41985. What's the status of that osd?

#13 Updated by Andrew Mitroshin over 1 year ago

Could you please submit output for the command

ceph osd dump | grep require

#14 Updated by Wout van Heeswijk over 1 year ago

Another customer of ours has reported this behaviour. This cluster was, most likely, installed with Hammer and is now running Nautilus (14.2.4/14.2.5).

created 2016-05-25
require_min_compat_client firefly
min_compat_client firefly

#15 Updated by Jan Fajerski over 1 year ago

Dan van der Ster wrote:

See https://tracker.ceph.com/issues/37875

With `ceph pg dump -f json | jq .osd_epochs` you can see which OSD is needing osdmap 41985. What's the status of that osd?

This is no longer part of the pg dump in Nautilus, unfortunately.

We also tried the workaround mentioned in https://tracker.ceph.com/issues/37875, i.e. removing all down or out OSDs from the crushmap and then restarting the MON daemons. This did not result in OSD maps being pruned.

Still seeing

# ceph report | jq '.osdmap_manifest.pinned_maps | length'
4162

#16 Updated by Jan Fajerski over 1 year ago

Andrew Mitroshin wrote:

Could you please submit output for the command

[...]

% ceph osd dump | grep require :(
require_min_compat_client jewel
require_osd_release luminous

#17 Updated by Andrew Mitroshin over 1 year ago

Jan Fajerski wrote:

Andrew Mitroshin wrote:

Could you please submit output for the command

[...]

% ceph osd dump | grep require :(
require_min_compat_client jewel
require_osd_release luminous

When upgrading to Nautilus you should issue the command:

ceph osd require-osd-release nautilus

As mentioned in https://docs.ceph.com/docs/master/releases/nautilus/#upgrading-from-mimic-or-luminous
(step 10)

This will eliminate the slow ops and osd_map_cache_misses.

#18 Updated by Nikola Ciprich over 1 year ago

Hi, I'm dealing with a similar (not sure whether the same) problem on 13.2.8. I've narrowed it down to osdmaps not being pruned, even though the cluster is in a healthy state and there are no down OSDs. Digging further, I got to the OSDMonitor::get_trim_to function, namely this part:

std::lock_guard<std::mutex> l(creating_pgs_lock);
if (!creating_pgs.pgs.empty()) {
  return 0;
}

Indeed, the problem in this cluster is that there is a bunch of PGs in the monitors' DB which the monitors believe are in creating state, even though they're in fact active+clean. I'm attaching a 13.2.8 patch to get a dump of those, as I wasn't able to get the list using ceph-kvstore-tool. So if anybody is interested, you can give it a try. Now the question is why this happened and how I can get rid of those..
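A toy model of that gate (hypothetical Python, not the actual monitor code): while the monitor still records any PG as creating, the trim target is pinned to 0, so no old osdmap epochs can be removed. The 500-epoch cap below is purely illustrative.

```python
def get_trim_to(creating_pgs, first_committed, latest_full):
    """Toy rendition of OSDMonitor::get_trim_to: refuse to trim any osdmap
    epochs while the monitor believes PGs are still being created."""
    if creating_pgs:
        return 0  # stale 'creating' entries pin every old epoch
    # Illustrative trim cap: trim in chunks, never past the latest full map.
    return min(first_committed + 500, latest_full)

# PGs stuck in 'creating' (though actually active+clean) block all pruning:
print(get_trim_to({"40.66", "40.6a"}, 41985, 83376))  # 0
print(get_trim_to(set(), 41985, 83376))               # 42485
```

This matches the symptom in comment #11: a healthy-looking cluster whose oldest_map never advances.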

#19 Updated by hoan nv over 1 year ago

Andrew Mitroshin wrote:

Jan Fajerski wrote:

Andrew Mitroshin wrote:

Could you please submit output for the command

[...]

% ceph osd dump | grep require :(
require_min_compat_client jewel
require_osd_release luminous

When upgrading to Nautilus you should issue the command:
[...]

As mentioned in https://docs.ceph.com/docs/master/releases/nautilus/#upgrading-from-mimic-or-luminous
(step 10)

This will eliminate the slow ops and osd_map_cache_misses.

Are you sure?

My cluster has 2 pools: pool1 has all of its OSDs on the Nautilus version, pool2 has all of its OSDs on the Mimic version.
Now I can upgrade pool2 to Nautilus and then run the command 'ceph osd require-osd-release nautilus'. After the above step I can create a new pool without slow ops.

Thanks.

#20 Updated by Wido den Hollander over 1 year ago

When searching through the code I found this in src/mon/OSDMonitor.cc

    // update creating pgs first so that we can remove the created pgid and
    // process the pool flag removal below in the same osdmap epoch.
    auto pending_creatings = update_pending_pgs(pending_inc, tmp);
    bufferlist creatings_bl;
    encode(pending_creatings, creatings_bl);
    t->put(OSD_PG_CREATING_PREFIX, "creating", creatings_bl);
    // remove any old (or incompat) POOL_CREATING flags
    for (auto& i : tmp.get_pools()) {
      if (tmp.require_osd_release < CEPH_RELEASE_NAUTILUS) {
        // pre-nautilus OSDMaps shouldn't get this flag.
        if (pending_inc.new_pools.count(i.first)) {
          pending_inc.new_pools[i.first].flags &= ~pg_pool_t::FLAG_CREATING;
        }
      }
      if (i.second.has_flag(pg_pool_t::FLAG_CREATING) &&
          !pending_creatings.still_creating_pool(i.first)) {
        dout(10) << __func__ << " done creating pool " << i.first
                 << ", clearing CREATING flag" << dendl;
        if (pending_inc.new_pools.count(i.first) == 0) {
          pending_inc.new_pools[i.first] = i.second;
        }
        pending_inc.new_pools[i.first].flags &= ~pg_pool_t::FLAG_CREATING;
      }
    }

So if require_osd_release is NOT set to nautilus, the code handles OSDMap pruning differently when PGs are in creating state.

Now, I find it very weird that the clusters I've seen this on do not have this flag set to nautilus.

Forgetting this once is a possibility, but having it happen on multiple clusters is very weird. It's a step which I always take (and other people do), and it's clearly stated in the Release Notes and such.

Still, this issue shouldn't happen, but the code I just posted might explain why it happens.
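The interaction can be sketched with a toy model (hypothetical Python; the flag bit and the loop shape are simplifications of the OSDMonitor.cc snippet quoted above, not the real implementation):

```python
FLAG_CREATING = 1 << 10  # placeholder bit; the real value lives in pg_pool_t

def clear_creating_flags(pools, still_creating, require_nautilus):
    """Toy rendition of the flag-clearing loop: pre-nautilus OSDMaps never
    carry FLAG_CREATING; otherwise the flag is cleared once a pool has no
    PGs left to create. pools maps pool_id -> flags bitmask."""
    for pool_id, flags in pools.items():
        if not require_nautilus:
            flags &= ~FLAG_CREATING  # pre-nautilus maps shouldn't get this flag
        elif (flags & FLAG_CREATING) and pool_id not in still_creating:
            flags &= ~FLAG_CREATING  # done creating, clear the flag
        pools[pool_id] = flags
    return pools

# Pool 40 finished creating: on a nautilus-flagged cluster the flag is cleared.
print(clear_creating_flags({40: FLAG_CREATING}, still_creating=set(),
                           require_nautilus=True))  # {40: 0}
```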

#21 Updated by Jan Fajerski over 1 year ago

FWIW, on the cluster I'm seeing this on, I did set this flag after tidying the osdmap (removed a couple of destroyed OSDs). The behaviour stayed the same and the pinned_maps count is still just as high.

#22 Updated by Ist Gab about 2 months ago

Wido den Hollander wrote:

On a cluster with 1405 OSDs I've run into a situation, for the second time now, where a pool creation resulted in massive slow ops on the OSDs serving that pool.

Was there any solution to this issue? I'm facing it in my object store multisite environment in 2 out of the 3 clusters, and it also brings RGW down, causing an outage from the user's point of view.

#23 Updated by Wido den Hollander about 2 months ago

Ist Gab wrote:

Wido den Hollander wrote:

On a cluster with 1405 OSDs I've run into a situation, for the second time now, where a pool creation resulted in massive slow ops on the OSDs serving that pool.

Was there any solution to this issue? I'm facing it in my object store multisite environment in 2 out of the 3 clusters, and it also brings RGW down, causing an outage from the user's point of view.

Are you sure recovery_deletes is set in the OSDMap?

#24 Updated by Ist Gab about 2 months ago

Are you sure recovery_deletes is set in the OSDMap?

yeah, these are the flags:
flags noout,sortbitwise,recovery_deletes,purged_snapdirs,pglog_hardlimit

#25 Updated by Wido den Hollander about 2 months ago

Ist Gab wrote:

Are you sure recovery_deletes is set in the OSDMap?

yeah, these are the flags:
flags noout,sortbitwise,recovery_deletes,purged_snapdirs,pglog_hardlimit

I have never encountered this with that flag enabled, so in that case I wouldn't know.

#26 Updated by Neha Ojha about 1 month ago

Ist Gab wrote:

Wido den Hollander wrote:

On a cluster with 1405 OSDs I've run into a situation, for the second time now, where a pool creation resulted in massive slow ops on the OSDs serving that pool.

Was there any solution to this issue? I'm facing it in my object store multisite environment in 2 out of the 3 clusters, and it also brings RGW down, causing an outage from the user's point of view.

Which version are you using?

#27 Updated by Ist Gab about 1 month ago

Neha Ojha wrote:

Which version are you using?

Octopus 15.2.14
