Bug #61762

PGs are stuck in creating+peering when starting up OSDs

Added by Venky Shankar 11 months ago. Updated 8 months ago.

Status:
New
Priority:
Normal
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

/a/vshankar-2023-06-20_10:07:44-fs-wip-vshankar-testing-20230620.052303-testing-default-smithi/7308858

qa/tasks/cephfs/filesystem.py::create() creates a new Ceph file system and blocks until all PGs are clean. This routine also creates the data and metadata pools with --pg_num_min=64. ceph_manager.py::wait_for_clean() times out waiting for all PGs to become clean.
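
For reference, the check in wait_for_clean() is essentially a poll-until-clean loop. A minimal sketch of that pattern (not the actual qa/tasks/ceph_manager.py code; the "pg_stats" JSON key is an assumption and the layout varies between Ceph releases):

import json
import subprocess
import time

def wait_for_clean(timeout=1200, interval=5):
    # Poll PG states until every PG is active+clean, or raise once the
    # timeout expires (mirrors the failure mode seen in this run).
    deadline = time.time() + timeout
    while True:
        out = subprocess.check_output(
            ["ceph", "pg", "dump", "pgs_brief", "--format", "json"])
        pgs = json.loads(out)["pg_stats"]  # key name assumed; varies by release
        unclean = [p["pgid"] for p in pgs if p["state"] != "active+clean"]
        if not unclean:
            return
        if time.time() > deadline:
            raise RuntimeError("wait_for_clean: failed before timeout "
                               "expired, unclean PGs: %s" % unclean)
        time.sleep(interval)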

I haven't seen this issue before, so creating a tracker. It looks unrelated to CephFS and might require looking at the OSD logs to infer why the PGs were not clean.

Update:

Looks like the problem is PGs stuck in creating+peering for 20 minutes ever since the OSDs were started.


Related issues 1 (1 open, 0 closed)

Related to RADOS - Bug #59172: test_pool_min_size: AssertionError: wait_for_clean: failed before timeout expired due to down PGs (Pending Backport; assignee: Kamoltat (Junior) Sirivadhna)

Actions #1

Updated by Laura Flores 11 months ago

  • Tags set to test-failure

Last pg map before failure:

{
  "pgs_by_state": [
    {
      "state_name": "active+clean",
      "count": 106
    },
    {
      "state_name": "creating+peering",
      "count": 20
    },
    {
      "state_name": "unknown",
      "count": 11
    }
  ],
  "num_pgs": 137,
  "num_pools": 5,
  "num_objects": 24,
  "data_bytes": 594959,
  "bytes_used": 227508224,
  "bytes_avail": 772866605056,
  "bytes_total": 773094113280,
  "unknown_pgs_ratio": 0.08029197156429291,
  "inactive_pgs_ratio": 0.14598539471626282
}
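
The two ratios line up with the counts above: the 11 unknown PGs out of 137 give ~0.0803, and the 20 creating+peering (inactive) PGs give ~0.1460. A small illustrative snippet recomputing them from the excerpt (layout as shown above):

# Recompute the ratios from the pg map excerpt above (illustrative only).
pg_map = {
    "pgs_by_state": [
        {"state_name": "active+clean", "count": 106},
        {"state_name": "creating+peering", "count": 20},
        {"state_name": "unknown", "count": 11},
    ],
    "num_pgs": 137,
}

counts = {s["state_name"]: s["count"] for s in pg_map["pgs_by_state"]}
unknown_ratio = counts["unknown"] / pg_map["num_pgs"]            # ~0.0803
inactive_ratio = counts["creating+peering"] / pg_map["num_pgs"]  # ~0.1460
print(unknown_ratio, inactive_ratio)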

Actions #2

Updated by Radoslaw Zarzynski 11 months ago

Let's observe.

Actions #3

Updated by Matan Breizman 10 months ago

  • Related to Bug #59172: test_pool_min_size: AssertionError: wait_for_clean: failed before timeout expired due to down PGs added
Actions #4

Updated by Kamoltat (Junior) Sirivadhna 9 months ago

  • Subject changed from "qa: wait_for_clean: failed before timeout expired" to "PGs are stuck in creating+peering when starting up OSDs"
  • Description updated (diff)

Changing the title to a more accurate one.

Actions #5

Updated by Kamoltat (Junior) Sirivadhna 9 months ago

Looking at one of the PGs in creating+peering, we can see that it is blocked by OSD.2:

2023-06-20T13:15:18.268 INFO:tasks.cephfs.filesystem.ceph_manager:{'pgid': '5.19', 'version': "0'0", 'reported_seq': 3, 'reported_epoch': 26, 'state': 'creating+peering', 'last_fresh': '2023-06-20T13:14:49.625280+0000', 'last_change': '2023-06-20T13:14:47.453902+0000', 'last_active': '2023-06-20T13:14:47.443881+0000', 'last_peered': '2023-06-20T13:14:47.443881+0000', 'last_clean': '2023-06-20T13:14:47.443881+0000', 'last_became_active': '0.000000', 'last_became_peered': '0.000000', 'last_unstale': '2023-06-20T13:14:49.625280+0000', 'last_undegraded': '2023-06-20T13:14:49.625280+0000', 'last_fullsized': '2023-06-20T13:14:49.625280+0000', 'mapping_epoch': 25, 'log_start': "0'0", 'ondisk_log_start': "0'0", 'created': 25, 'last_epoch_clean': 0, 'parent': '0.0', 'parent_split_bits': 0, 'last_scrub': "0'0", 'last_scrub_stamp': '2023-06-20T13:14:47.443881+0000', 'last_deep_scrub': "0'0", 'last_deep_scrub_stamp': '2023-06-20T13:14:47.443881+0000', 'last_clean_scrub_stamp': '2023-06-20T13:14:47.443881+0000', 'objects_scrubbed': 0, 'log_size': 0, 'log_dups_size': 0, 'ondisk_log_size': 0, 'stats_invalid': False, 'dirty_stats_invalid': False, 'omap_stats_invalid': False, 'hitset_stats_invalid': False, 'hitset_bytes_stats_invalid': False, 'pin_stats_invalid': False, 'manifest_stats_invalid': False, 'snaptrimq_len': 0, 'last_scrub_duration': 0, 'scrub_schedule': 'no scrub is scheduled', 'scrub_duration': 0, 'objects_trimmed': 0, 'snaptrim_duration': 0, 'stat_sum': {'num_bytes': 0, 'num_objects': 0, 'num_object_clones': 0, 'num_object_copies': 0, 'num_objects_missing_on_primary': 0, 'num_objects_missing': 0, 'num_objects_degraded': 0, 'num_objects_misplaced': 0, 'num_objects_unfound': 0, 'num_objects_dirty': 0, 'num_whiteouts': 0, 'num_read': 0, 'num_read_kb': 0, 'num_write': 0, 'num_write_kb': 0, 'num_scrub_errors': 0, 'num_shallow_scrub_errors': 0, 'num_deep_scrub_errors': 0, 'num_objects_recovered': 0, 'num_bytes_recovered': 0, 'num_keys_recovered': 0, 'num_objects_omap': 0, 'num_objects_hit_set_archive': 0, 'num_bytes_hit_set_archive': 0, 'num_flush': 0, 'num_flush_kb': 0, 'num_evict': 0, 'num_evict_kb': 0, 'num_promote': 0, 'num_flush_mode_high': 0, 'num_flush_mode_low': 0, 'num_evict_mode_some': 0, 'num_evict_mode_full': 0, 'num_objects_pinned': 0, 'num_legacy_snapsets': 0, 'num_large_omap_objects': 0, 'num_objects_manifest': 0, 'num_omap_bytes': 0, 'num_omap_keys': 0, 'num_objects_repaired': 0}, 'up': [1, 5, 7, 2], 'acting': [1, 5, 7, 2], 'avail_no_missing': [], 'object_location_counts': [], 'blocked_by': [2], 'up_primary': 1, 'acting_primary': 1, 'purged_snaps': []}
Relevant fields: 'up': [1, 5, 7, 2], 'acting': [1, 5, 7, 2], 'avail_no_missing': [], 'object_location_counts': [], 'blocked_by': [2], 'up_primary': 1, 'acting_primary': 1
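
One way to surface PGs in this situation across the whole cluster (a hedged sketch, not part of the QA suite; it assumes a pg dump JSON with the same fields as the dump above, which vary by Ceph release):

import json
import subprocess

# List PGs reporting a non-empty blocked_by, plus the OSDs blocking them.
out = subprocess.check_output(["ceph", "pg", "dump", "--format", "json"])
pg_stats = json.loads(out)["pg_map"]["pg_stats"]  # key names assumed

for pg in pg_stats:
    if pg.get("blocked_by"):
        print(pg["pgid"], pg["state"], "blocked_by", pg["blocked_by"])

# For pg 5.19 above this would print:
#   5.19 creating+peering blocked_by [2]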
Actions #6

Updated by Radoslaw Zarzynski 8 months ago

Seems worth looking at OSD.2's log to determine why it's the blocker.
