Bug #61762

PGs are stuck in creating+peering when starting up OSDs

Added by Venky Shankar 11 months ago. Updated 8 months ago.

Status:
New
Priority:
Normal
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

/a/vshankar-2023-06-20_10:07:44-fs-wip-vshankar-testing-20230620.052303-testing-default-smithi/7308858

qa/tasks/cephfs/filesystem.py::create() creates a new Ceph file system and blocks until all PGs are clean. This routine also creates the data and metadata pools with --pg_num_min=64. ceph_manager.py::wait_for_clean() times out waiting for all PGs to become clean.
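
For reference, the check in wait_for_clean() is essentially a poll-until-clean loop. A minimal sketch of that pattern (not the actual qa/tasks/ceph_manager.py code; the "pg_stats" JSON key is an assumption and the layout varies between Ceph releases):

import json
import subprocess
import time

def wait_for_clean(timeout=1200, interval=5):
    # Poll PG states until every PG is active+clean, or raise once the
    # timeout expires (mirrors the failure mode seen in this run).
    deadline = time.time() + timeout
    while True:
        out = subprocess.check_output(
            ["ceph", "pg", "dump", "pgs_brief", "--format", "json"])
        pgs = json.loads(out)["pg_stats"]  # key name assumed; varies by release
        unclean = [p["pgid"] for p in pgs if p["state"] != "active+clean"]
        if not unclean:
            return
        if time.time() > deadline:
            raise RuntimeError("wait_for_clean: failed before timeout "
                               "expired, unclean PGs: %s" % unclean)
        time.sleep(interval)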

I haven't seen this issue before, so creating a tracker. It looks unrelated to CephFS and might require looking at the OSD logs to infer why the PGs were not clean.

Update:

Looks like the problem is PGs stuck in creating+peering for 20 minutes ever since the OSDs were started.


Related issues 1 (1 open, 0 closed)

Related to RADOS - Bug #59172: test_pool_min_size: AssertionError: wait_for_clean: failed before timeout expired due to down PGs (Pending Backport; assignee: Kamoltat (Junior) Sirivadhna)

Actions #1

Updated by Laura Flores 11 months ago

  • Tags set to test-failure

Last pg map before failure:

{
  "pgs_by_state": [
    {
      "state_name": "active+clean",
      "count": 106
    },
    {
      "state_name": "creating+peering",
      "count": 20
    },
    {
      "state_name": "unknown",
      "count": 11
    }
  ],
  "num_pgs": 137,
  "num_pools": 5,
  "num_objects": 24,
  "data_bytes": 594959,
  "bytes_used": 227508224,
  "bytes_avail": 772866605056,
  "bytes_total": 773094113280,
  "unknown_pgs_ratio": 0.08029197156429291,
  "inactive_pgs_ratio": 0.14598539471626282
}
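
The two ratios line up with the counts above: the 11 unknown PGs out of 137 give ~0.0803, and the 20 creating+peering (inactive) PGs give ~0.1460. A small illustrative snippet recomputing them from the excerpt (layout as shown above):

# Recompute the ratios from the pg map excerpt above (illustrative only).
pg_map = {
    "pgs_by_state": [
        {"state_name": "active+clean", "count": 106},
        {"state_name": "creating+peering", "count": 20},
        {"state_name": "unknown", "count": 11},
    ],
    "num_pgs": 137,
}

counts = {s["state_name"]: s["count"] for s in pg_map["pgs_by_state"]}
unknown_ratio = counts["unknown"] / pg_map["num_pgs"]            # ~0.0803
inactive_ratio = counts["creating+peering"] / pg_map["num_pgs"]  # ~0.1460
print(unknown_ratio, inactive_ratio)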

Actions #2

Updated by Radoslaw Zarzynski 11 months ago

Let's observe.

Actions #3

Updated by Matan Breizman 10 months ago

  • Related to Bug #59172: test_pool_min_size: AssertionError: wait_for_clean: failed before timeout expired due to down PGs added
Actions #4

Updated by Kamoltat (Junior) Sirivadhna 9 months ago

  • Subject changed from "qa: wait_for_clean: failed before timeout expired" to "PGs are stuck in creating+peering when starting up OSDs"
  • Description updated (diff)

Changing the title to a more accurate one.

Actions #5

Updated by Kamoltat (Junior) Sirivadhna 9 months ago

Looking at one of the PGs in creating+peering, we can see that it is blocked by OSD.2:

2023-06-20T13:15:18.268 INFO:tasks.cephfs.filesystem.ceph_manager:{'pgid': '5.19', 'version': "0'0", 'reported_seq': 3, 'reported_epoch': 26, 'state': 'creating+peering', 'last_fresh': '2023-06-20T13:14:49.625280+0000', 'last_change': '2023-06-20T13:14:47.453902+0000', 'last_active': '2023-06-20T13:14:47.443881+0000', 'last_peered': '2023-06-20T13:14:47.443881+0000', 'last_clean': '2023-06-20T13:14:47.443881+0000', 'last_became_active': '0.000000', 'last_became_peered': '0.000000', 'last_unstale': '2023-06-20T13:14:49.625280+0000', 'last_undegraded': '2023-06-20T13:14:49.625280+0000', 'last_fullsized': '2023-06-20T13:14:49.625280+0000', 'mapping_epoch': 25, 'log_start': "0'0", 'ondisk_log_start': "0'0", 'created': 25, 'last_epoch_clean': 0, 'parent': '0.0', 'parent_split_bits': 0, 'last_scrub': "0'0", 'last_scrub_stamp': '2023-06-20T13:14:47.443881+0000', 'last_deep_scrub': "0'0", 'last_deep_scrub_stamp': '2023-06-20T13:14:47.443881+0000', 'last_clean_scrub_stamp': '2023-06-20T13:14:47.443881+0000', 'objects_scrubbed': 0, 'log_size': 0, 'log_dups_size': 0, 'ondisk_log_size': 0, 'stats_invalid': False, 'dirty_stats_invalid': False, 'omap_stats_invalid': False, 'hitset_stats_invalid': False, 'hitset_bytes_stats_invalid': False, 'pin_stats_invalid': False, 'manifest_stats_invalid': False, 'snaptrimq_len': 0, 'last_scrub_duration': 0, 'scrub_schedule': 'no scrub is scheduled', 'scrub_duration': 0, 'objects_trimmed': 0, 'snaptrim_duration': 0, 'stat_sum': {'num_bytes': 0, 'num_objects': 0, 'num_object_clones': 0, 'num_object_copies': 0, 'num_objects_missing_on_primary': 0, 'num_objects_missing': 0, 'num_objects_degraded': 0, 'num_objects_misplaced': 0, 'num_objects_unfound': 0, 'num_objects_dirty': 0, 'num_whiteouts': 0, 'num_read': 0, 'num_read_kb': 0, 'num_write': 0, 'num_write_kb': 0, 'num_scrub_errors': 0, 'num_shallow_scrub_errors': 0, 'num_deep_scrub_errors': 0, 'num_objects_recovered': 0, 'num_bytes_recovered': 0, 'num_keys_recovered': 0, 'num_objects_omap': 0, 'num_objects_hit_set_archive': 0, 'num_bytes_hit_set_archive': 0, 'num_flush': 0, 'num_flush_kb': 0, 'num_evict': 0, 'num_evict_kb': 0, 'num_promote': 0, 'num_flush_mode_high': 0, 'num_flush_mode_low': 0, 'num_evict_mode_some': 0, 'num_evict_mode_full': 0, 'num_objects_pinned': 0, 'num_legacy_snapsets': 0, 'num_large_omap_objects': 0, 'num_objects_manifest': 0, 'num_omap_bytes': 0, 'num_omap_keys': 0, 'num_objects_repaired': 0}, 'up': [1, 5, 7, 2], 'acting': [1, 5, 7, 2], 'avail_no_missing': [], 'object_location_counts': [], 'blocked_by': [2], 'up_primary': 1, 'acting_primary': 1, 'purged_snaps': []}
Relevant fields: 'up': [1, 5, 7, 2], 'acting': [1, 5, 7, 2], 'avail_no_missing': [], 'object_location_counts': [], 'blocked_by': [2], 'up_primary': 1, 'acting_primary': 1
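
One way to surface PGs in this situation across the whole cluster (a hedged sketch, not part of the QA suite; it assumes a pg dump JSON with the same fields as the dump above, which vary by Ceph release):

import json
import subprocess

# List PGs reporting a non-empty blocked_by, plus the OSDs blocking them.
out = subprocess.check_output(["ceph", "pg", "dump", "--format", "json"])
pg_stats = json.loads(out)["pg_map"]["pg_stats"]  # key names assumed

for pg in pg_stats:
    if pg.get("blocked_by"):
        print(pg["pgid"], pg["state"], "blocked_by", pg["blocked_by"])

# For pg 5.19 above this would print:
#   5.19 creating+peering blocked_by [2]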
Actions #6

Updated by Radoslaw Zarzynski 8 months ago

Seems worth looking at OSD.2's log to determine why it's the blocker.
