Bug #64546


client io requests hang when issued before the creation of the related pgs

Added by Xuehan Xu 2 months ago. Updated about 1 month ago.

Status: Pending Backport
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags: backport_processed
Backport: squid
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

The issue is as follows:

1. Pool X is created at osdmap epoch 16.
2. The monitor sends pg_create messages for the pool's pgs to osd.A at osdmap epoch 17.
3. The client, still on osdmap epoch 16, sends an IO request to osd.A before receiving epoch 17.
4. Before osd.A processes the client's IO request, osdmap 17 reaches it and the corresponding pgs are created with "same_interval_since" set to 17.
5. The client's IO request is discarded because its osdmap epoch, 16, is less than 17.

Although the client receives osdmap 17 later, it doesn't resend the IO request, because nothing relevant to the request changes between epochs 16 and 17; the request can only be resent after the client's op timeout, which defaults to half an hour.
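
For illustration, a minimal sketch of the client-side resend decision described above. The names (MapView, should_resend) are made up for this sketch and are not the actual Objecter code; the point is only that, with no mapping change between epochs 16 and 17, the client sees no reason to resend.

    // Hypothetical sketch of the client-side resend decision; MapView and
    // should_resend are illustrative names, not the actual Objecter code.
    #include <vector>

    struct MapView {
      unsigned epoch;
      unsigned pg_num;          // pg count of the op's pool in this epoch
      std::vector<int> acting;  // acting set the op's pg maps to in this epoch
    };

    // The client only resends an in-flight op when something about its
    // mapping changed between the epoch it was sent at and a newly received
    // epoch.  In the scenario above nothing about pg 46.0 changes between
    // epochs 16 and 17, so this returns false and the op waits for the
    // (~30 minute) op timeout.
    bool should_resend(const MapView& sent_at, const MapView& received) {
      return sent_at.pg_num != received.pg_num ||
             sent_at.acting != received.acting;
    }

    int main() {
      MapView e16{16, 32, {2, 0}};              // map the op was sent with
      MapView e17{17, 32, {2, 0}};              // map that instantiated pg 46.0
      return should_resend(e16, e17) ? 1 : 0;   // exits 0: the op is not resent
    }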

The mon/osd logs are here: https://drive.google.com/drive/folders/1bqcFgS1CR8bvhtFUNmvI4Ah52V2JPZur?usp=drive_link

PG 46.0 and the IO request client.4303.0:2 are the example.


Related issues (1 open, 0 closed)

Copied to crimson - Backport #65253: squid: client io requests hang when issued before the creation of the related pgs (New)
Actions #1

Updated by Xuehan Xu 2 months ago

  • Description updated (diff)
Actions #2

Updated by Samuel Just 2 months ago

osd.0 crash is unrelated -- should be fixed by https://github.com/ceph/ceph/pull/55705

Actions #3

Updated by Samuel Just 2 months ago

2024-02-22T09:56:52.307+0000 2b476bef4700 10 mon.a@0(leader).osd e16 update_pending_pgs pg 46.0 just added, up [2,0] p 2 acting [2,0] p 2 history ec=17/17 lis/c=0/0 les/c/f=0/0/0 sis=17 past_intervals ([0,0] all_participants={} intervals=[])

46.0 was actually created with same_interval_since set to 17.

Actions #4

Updated by Samuel Just 2 months ago

DEBUG 2024-02-22 09:56:51,337 [shard 0:main] osd - client_request(id=2536, detail=m=[osd_op(client.4303.0:2 46.0 46.71d521c0 (undecoded) ondisk+write+known_if_redirected+supports_pool_eio e16) v8]): starting start_pg_operation

IO sent at epoch 16. Seems like a bug in OSDMonitor -- history.sis should have been initialized to 16.

Xuehan: how did you reproduce this? If the above is correct, it should reproduce just as easily on classic.

Actions #5

Updated by Samuel Just 2 months ago

OSDMonitor::update_pending_pgs limits the number of pgs it's willing to process in each invocation -- mon_osd_max_creating_pgs. Thus, there may be several epochs between when the OSDMap with the pool is exposed to clients and when OSDMonitor::update_pending_pgs actually generates the history for the pg and sends creates to OSDs.
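
As a rough illustration of that batching (a hypothetical sketch with made-up names such as PendingPG and update_pending_batch, not the actual OSDMonitor code): only up to mon_osd_max_creating_pgs entries are drained per proposal, so a pg whose pool appeared at epoch 16 can end up being registered, and stamped, at epoch 17 or later.

    #include <cstdint>
    #include <deque>
    #include <map>
    #include <utility>

    struct PendingPG { int64_t pool; uint32_t ps; };

    // Each proposal drains at most `max_creating` entries from the queue.
    void update_pending_batch(
        std::deque<PendingPG>& queue,
        std::map<std::pair<int64_t, uint32_t>, unsigned>& create_epoch,
        unsigned current_epoch,
        unsigned max_creating /* mon_osd_max_creating_pgs */) {
      unsigned n = 0;
      while (!queue.empty() && n < max_creating) {
        PendingPG pg = queue.front();
        queue.pop_front();
        // Stamps the *current* epoch, not the pool-creation epoch.
        create_epoch[{pg.pool, pg.ps}] = current_epoch;
        ++n;
      }
      // Anything still queued is picked up by a later epoch's proposal.
    }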

The trouble is:

      const pg_t pgid{ps, static_cast<uint64_t>(poolid)};
      // NOTE: use the *current* epoch as the PG creation epoch so that the
      // OSD does not have to generate a long set of PastIntervals.
      pending_creatings.pgs.emplace(
        pgid,
        creating_pgs_t::pg_create_info(inc.epoch,
                                       p->second.modified));

If update_pending_pgs doesn't get to a particular PG until an epoch later than the one in which the pool was created, the pg ends up with same_interval_since set to that later epoch rather than the pool-creation epoch. Any entity other than the monitor, however, would consider the pool-creation epoch to be the interval start. The latter value is the correct one: monitor-internal execution details aren't visible externally and can't be part of the definition of an interval.
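
To make step 5 of the description concrete, here is a minimal sketch of the OSD-side check that drops the op; PGHistory and discard_op are illustrative names, not the exact crimson-osd code.

    #include <cstdint>

    struct PGHistory {
      uint32_t same_interval_since;   // 17 for pg 46.0 in the logs above
    };

    // An op carries the osdmap epoch the client sent it with.  If that epoch
    // predates the pg's current interval, the OSD can't be sure the client
    // would still map the op here, so it drops the op and relies on the
    // client to resend once it sees a newer map, which, as described above,
    // never happens in this scenario.
    bool discard_op(uint32_t op_map_epoch, const PGHistory& history) {
      return op_map_epoch < history.same_interval_since;  // 16 < 17: dropped
    }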

The simple solution is to remove this config entirely and require the monitor to initialize creating_pgs_t entries for every pg of a pool in the epoch in which the pool is created.
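
A hypothetical sketch of the shape of that approach (PGKey and register_new_pool are made up for illustration; this is not the actual change in the wip branch linked below):

    #include <cstdint>
    #include <map>

    struct PGKey {
      int64_t pool;
      uint32_t ps;
      bool operator<(const PGKey& o) const {
        return pool < o.pool || (pool == o.pool && ps < o.ps);
      }
    };

    // All pgs of a new pool get creating_pgs_t-style entries in the epoch
    // that creates the pool, so same_interval_since matches what every other
    // entity computes from the osdmap.  No mon_osd_max_creating_pgs cap.
    void register_new_pool(std::map<PGKey, unsigned>& create_epoch,
                           int64_t poolid, uint32_t pg_num,
                           unsigned pool_creation_epoch) {
      for (uint32_t ps = 0; ps < pg_num; ++ps) {
        create_epoch[PGKey{poolid, ps}] = pool_creation_epoch;
      }
    }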

Actions #6

Updated by Samuel Just 2 months ago

In the logs above, it seems that pools 39 (8 pgs), 40 (8 pgs), 41 (32 pgs), 42 (32 pgs), 43 (32 pgs), 44 (32 pgs), 46 (32 pgs), 48 (32 pgs), and 49 (32 pgs) were all created in the same epoch (16) at the same timestamp (2024-02-22T09:56:51.296239+0000).

mon_osd_max_creating_pgs defaults to 1024 -- more than the sum of the above pg counts. However, pools 3-38 were created during epochs 13, 14, and 15, most with 32 pgs (36 pools at up to 32 pgs each is on the order of 1,100 pgs, on top of the ~240 created in epoch 16). That suffices to push the pg creation queue size over the configured 1024 limit.

What prompted so many pools to be created so close in time?

Actions #8

Updated by Xuehan Xu 2 months ago

Samuel Just wrote:

https://github.com/athanatos/ceph/tree/sjust/wip-64546-max-creating-pgs

We've just finished the test; this branch fixes the issue.

Actions #10

Updated by Samuel Just about 1 month ago

  • Status changed from New to Pending Backport
  • Backport set to squid
Actions #11

Updated by Backport Bot about 1 month ago

  • Copied to Backport #65253: squid: client io requests hang when issued before the creation of the related pgs added
Actions #12

Updated by Backport Bot about 1 month ago

  • Tags set to backport_processed