Bug #51904
Closed
test_pool_min_size:AssertionError:wait_for_clean:failed before timeout expired due to down PGs
Description
2021-07-25T15:18:21.688 INFO:tasks.ceph.ceph_manager.ceph:PG 1.b is not active+clean 2021-07-25T15:18:21.688 INFO:tasks.ceph.ceph_manager.ceph:{'pgid': '1.b', 'version': "22'3", 'reported_seq': 475, 'reported_epoch': 478, 'state': 'down', 'last_fresh': '2021-07-25T15:18:15.848533+0000', 'last_change': '2021-07-25T14:58:09.960725+0000', 'last_active': '2021-07-25T14:58:09.960471+0000', 'last_peered': '2021-07-25T14:58:01.054827+0000', 'last_clean': '2021-07-25T14:58:01.054827+0000', 'last_became_active': '2021-07-25T14:58:00.955866+0000', 'last_became_peered': '2021-07-25T14:58:00.955866+0000', 'last_unstale': '2021-07-25T15:18:15.848533+0000', 'last_undegraded': '2021-07-25T15:18:15.848533+0000', 'last_fullsized': '2021-07-25T15:18:15.848533+0000', 'mapping_epoch': 31, 'log_start': "0'0", 'ondisk_log_start': "0'0", 'created': 18, 'last_epoch_clean': 22, 'parent': '0.0', 'parent_split_bits': 0, 'last_scrub': "0'0", 'last_scrub_stamp': '2021-07-25T14:57:55.882472+0000', 'last_deep_scrub': "0'0", 'last_deep_scrub_stamp': '2021-07-25T14:57:55.882472+0000', 'last_clean_scrub_stamp': '2021-07-25T14:57:55.882472+0000', 'log_size': 3, 'ondisk_log_size': 3, 'stats_invalid': False, 'dirty_stats_invalid': False, 'omap_stats_invalid': False, 'hitset_stats_invalid': False, 'hitset_bytes_stats_invalid': False, 'pin_stats_invalid': False, 'manifest_stats_invalid': False, 'snaptrimq_len': 0, 'stat_sum': {'num_bytes': 2554631, 'num_objects': 1, 'num_object_clones': 0, 'num_object_copies': 3, 'num_objects_missing_on_primary': 0, 'num_objects_missing': 0, 'num_objects_degraded': 0, 'num_objects_misplaced': 0, 'num_objects_unfound': 0, 'num_objects_dirty': 1, 'num_whiteouts': 0, 'num_read': 1, 'num_read_kb': 0, 'num_write': 4, 'num_write_kb': 1011, 'num_scrub_errors': 0, 'num_shallow_scrub_errors': 0, 'num_deep_scrub_errors': 0, 'num_objects_recovered': 0, 'num_bytes_recovered': 0, 'num_keys_recovered': 0, 'num_objects_omap': 0, 'num_objects_hit_set_archive': 0, 
'num_bytes_hit_set_archive': 0, 'num_flush': 0, 'num_flush_kb': 0, 'num_evict': 0, 'num_evict_kb': 0, 'num_promote': 0, 'num_flush_mode_high': 0, 'num_flush_mode_low': 0, 'num_evict_mode_some': 0, 'num_evict_mode_full': 0, 'num_objects_pinned': 0, 'num_legacy_snapsets': 0, 'num_large_omap_objects': 0, 'num_objects_manifest': 0, 'num_omap_bytes': 0, 'num_omap_keys': 0, 'num_objects_repaired': 0}, 'up': [2, 0, 5], 'acting': [2, 0, 5], 'avail_no_missing': [], 'object_location_counts': [], 'blocked_by': [1], 'up_primary': 0, 'acting_primary': 0, 'purged_snaps': [{'start': '1', 'length': '1'}]} 2021-07-25T15:18:21.688 INFO:tasks.ceph.ceph_manager.ceph:PG 1.3 is not active+clean 2021-07-25T15:18:21.689 INFO:tasks.ceph.ceph_manager.ceph:{'pgid': '1.3', 'version': "0'0", 'reported_seq': 449, 'reported_epoch': 478, 'state': 'down', 'last_fresh': '2021-07-25T15:18:16.232816+0000', 'last_change': '2021-07-25T14:58:09.960717+0000', 'last_active': '0.000000', 'last_peered': '0.000000', 'last_clean': '0.000000', 'last_became_active': '0.000000', 'last_became_peered': '0.000000', 'last_unstale': '2021-07-25T15:18:16.232816+0000', 'last_undegraded': '2021-07-25T15:18:16.232816+0000', 'last_fullsized': '2021-07-25T15:18:16.232816+0000', 'mapping_epoch': 31, 'log_start': "0'0", 'ondisk_log_start': "0'0", 'created': 18, 'last_epoch_clean': 22, 'parent': '0.0', 'parent_split_bits': 0, 'last_scrub': "0'0", 'last_scrub_stamp': '2021-07-25T14:57:55.882472+0000', 'last_deep_scrub': "0'0", 'last_deep_scrub_stamp': '2021-07-25T14:57:55.882472+0000', 'last_clean_scrub_stamp': '2021-07-25T14:57:55.882472+0000', 'log_size': 0, 'ondisk_log_size': 0, 'stats_invalid': False, 'dirty_stats_invalid': False, 'omap_stats_invalid': False, 'hitset_stats_invalid': False, 'hitset_bytes_stats_invalid': False, 'pin_stats_invalid': False, 'manifest_stats_invalid': False, 'snaptrimq_len': 0, 'stat_sum': {'num_bytes': 0, 'num_objects': 0, 'num_object_clones': 0, 'num_object_copies': 0, 
'num_objects_missing_on_primary': 0, 'num_objects_missing': 0, 'num_objects_degraded': 0, 'num_objects_misplaced': 0, 'num_objects_unfound': 0, 'num_objects_dirty': 0, 'num_whiteouts': 0, 'num_read': 0, 'num_read_kb': 0, 'num_write': 0, 'num_write_kb': 0, 'num_scrub_errors': 0, 'num_shallow_scrub_errors': 0, 'num_deep_scrub_errors': 0, 'num_objects_recovered': 0, 'num_bytes_recovered': 0, 'num_keys_recovered': 0, 'num_objects_omap': 0, 'num_objects_hit_set_archive': 0, 'num_bytes_hit_set_archive': 0, 'num_flush': 0, 'num_flush_kb': 0, 'num_evict': 0, 'num_evict_kb': 0, 'num_promote': 0, 'num_flush_mode_high': 0, 'num_flush_mode_low': 0, 'num_evict_mode_some': 0, 'num_evict_mode_full': 0, 'num_objects_pinned': 0, 'num_legacy_snapsets': 0, 'num_large_omap_objects': 0, 'num_objects_manifest': 0, 'num_omap_bytes': 0, 'num_omap_keys': 0, 'num_objects_repaired': 0}, 'up': [4, 6, 5], 'acting': [4, 6, 5], 'avail_no_missing': [], 'object_location_counts': [], 'blocked_by': [1], 'up_primary': 4, 'acting_primary': 4, 'purged_snaps': []} 2021-07-25T15:18:21.689 INFO:tasks.thrashosds.thrasher:Traceback (most recent call last): File "/home/teuthworker/src/github.com_ceph_ceph-c_3df2a2c35be348281dd870cd64ba15314e6776a9/qa/tasks/ceph_manager.py", line 188, in wrapper return func(self) File "/home/teuthworker/src/github.com_ceph_ceph-c_3df2a2c35be348281dd870cd64ba15314e6776a9/qa/tasks/ceph_manager.py", line 1411, in _do_thrash self.choose_action()() File "/home/teuthworker/src/github.com_ceph_ceph-c_3df2a2c35be348281dd870cd64ba15314e6776a9/qa/tasks/ceph_manager.py", line 979, in test_pool_min_size self.ceph_manager.wait_for_clean(timeout=self.config.get('timeout')) File "/home/teuthworker/src/github.com_ceph_ceph-c_3df2a2c35be348281dd870cd64ba15314e6776a9/qa/tasks/ceph_manager.py", line 2719, in wait_for_clean 'wait_for_clean: failed before timeout expired' AssertionError: wait_for_clean: failed before timeout expired
rados/thrash-erasure-code-overwrites/{bluestore-bitmap ceph clusters/{fixed-2 openstack} fast/normal mon_election/connectivity msgr-failures/few rados recovery-overrides/{more-async-partial-recovery} supported-random-distro$/{rhel_8} thrashers/minsize_recovery thrashosds-health workloads/ec-pool-snaps-few-objects-overwrites}
/a/yuriw-2021-07-25_14:33:55-rados-wip-yuri7-testing-master-7.23.21-distro-basic-smithi/6292358 - no logs
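For reference, the fields in the pg dumps above that matter for this failure are `state`, `up`, `acting`, and `blocked_by` (here both stuck PGs report `blocked_by: [1]`, i.e. osd.1 is blocking peering). A small standalone helper, hypothetical and not part of the qa suite, that extracts them from such a dict:

```python
# Hypothetical helper: summarize why a PG from a `ceph pg dump`-style dict
# is not active+clean. Field names taken from the log excerpt above.

def summarize_pg(pg: dict) -> str:
    """Return a one-line diagnosis for a PG stats dict."""
    flags = pg.get('state', 'unknown').split('+')
    if 'active' in flags and 'clean' in flags:
        return f"{pg['pgid']}: healthy ({pg['state']})"
    line = f"{pg['pgid']}: {pg['state']}, up={pg.get('up')}, acting={pg.get('acting')}"
    blocked_by = pg.get('blocked_by', [])
    if blocked_by:
        # Peering is waiting on these OSDs; in this tracker, osd.1 blocks both PGs.
        line += f", blocked by osd(s) {blocked_by}"
    return line

pg = {'pgid': '1.b', 'state': 'down', 'up': [2, 0, 5],
      'acting': [2, 0, 5], 'blocked_by': [1]}
print(summarize_pg(pg))
```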
Updated by Neha Ojha over 2 years ago
/a/yuriw-2021-12-07_16:04:59-rados-wip-yuri5-testing-2021-12-06-1619-distro-default-smithi/6551120
pg map right before Traceback:
{
    "pgs_by_state": [
        {
            "state_name": "active+clean",
            "count": 15
        },
        {
            "state_name": "down",
            "count": 1
        }
    ],
    "num_pgs": 16,
    "num_pools": 1,
    "num_objects": 78,
    "data_bytes": 150511044,
    "bytes_used": 385671168,
    "bytes_avail": 676071677952,
    "bytes_total": 676457349120,
    "inactive_pgs_ratio": 0.0625
}
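Each pgmap in these updates also reports an `inactive_pgs_ratio`. A sketch of how that figure relates to `pgs_by_state`, under the assumption (consistent with the numbers in this thread) that a PG counts as inactive when `active` is absent from its state string:

```python
# Sketch, not Ceph source: derive "inactive_pgs_ratio" from "pgs_by_state".
# Assumption: a PG is inactive when the "active" flag is missing from its
# state string (e.g. "down", "recovery_wait+...+peered").

def inactive_pgs_ratio(pgs_by_state):
    total = sum(e['count'] for e in pgs_by_state)
    inactive = sum(e['count'] for e in pgs_by_state
                   if 'active' not in e['state_name'].split('+'))
    return inactive / total if total else 0.0

pgmap = [
    {'state_name': 'active+clean', 'count': 15},
    {'state_name': 'down', 'count': 1},
]
print(inactive_pgs_ratio(pgmap))  # 1 of 16 PGs inactive -> 0.0625
```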
Updated by Laura Flores over 2 years ago
/a/yuriw-2022-01-12_21:37:22-rados-wip-yuri6-testing-2022-01-12-1131-distro-default-smithi/6611439
last pg map before Traceback:
{
    "pgs_by_state": [
        {
            "state_name": "active+clean",
            "count": 15
        },
        {
            "state_name": "down+remapped",
            "count": 1
        }
    ],
    "num_pgs": 16,
    "num_pools": 1,
    "num_objects": 44,
    "data_bytes": 117580353,
    "bytes_used": 245002240,
    "bytes_avail": 676212346880,
    "bytes_total": 676457349120,
    "inactive_pgs_ratio": 0.0625
}
Updated by Laura Flores about 2 years ago
/a/yuriw-2022-02-05_22:51:11-rados-wip-yuri2-testing-2022-02-04-1646-pacific-distro-default-smithi/6663906
last pg map before Traceback:
{
    "pgs_by_state": [
        {
            "state_name": "active+clean",
            "count": 25
        },
        {
            "state_name": "active+recovery_wait+undersized+degraded+remapped",
            "count": 7
        },
        {
            "state_name": "active+recovering+undersized+degraded+remapped",
            "count": 1
        },
        {
            "state_name": "recovery_wait+undersized+degraded+remapped+peered",
            "count": 1
        }
    ],
    "num_pgs": 34,
    "num_pools": 2,
    "num_objects": 615,
    "data_bytes": 1190117376,
    "bytes_used": 5568552960,
    "bytes_avail": 1057435852800,
    "bytes_total": 1063004405760,
    "inactive_pgs_ratio": 0.029411764815449715,
    "degraded_objects": 297,
    "degraded_total": 3690,
    "degraded_ratio": 0.08048780487804878
}
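The degraded figures in this pgmap are related by a simple division, `degraded_objects / degraded_total` (names from the JSON above; the relationship is an observation about the numbers in this thread, not taken from Ceph source):

```python
# Observation: degraded_ratio in these pgmaps equals
# degraded_objects / degraded_total.

def degraded_ratio(degraded_objects: int, degraded_total: int) -> float:
    return degraded_objects / degraded_total if degraded_total else 0.0

print(degraded_ratio(297, 3690))  # matches the 0.0804... in this pgmap
```

Note the ratio can exceed 1.0, as in the 2022-03-19 update further down where 1081/730 comes out to about 1.48.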
Updated by Laura Flores about 2 years ago
/a/yuriw-2022-02-08_17:00:23-rados-wip-yuri5-testing-2022-02-08-0733-pacific-distro-default-smithi/6670539
last pg map before Traceback:
{
    "pgs_by_state": [
        {
            "state_name": "active+clean",
            "count": 91
        },
        {
            "state_name": "active+recovery_wait+undersized+degraded+remapped",
            "count": 7
        },
        {
            "state_name": "active+undersized+degraded+remapped+backfill_wait",
            "count": 4
        },
        {
            "state_name": "recovery_wait+undersized+degraded+remapped+peered",
            "count": 2
        },
        {
            "state_name": "active+recovering+undersized+degraded+remapped",
            "count": 1
        }
    ],
    "num_pgs": 105,
    "num_pools": 3,
    "num_objects": 351,
    "data_bytes": 812849152,
    "bytes_used": 5592006656,
    "bytes_avail": 1057412399104,
    "bytes_total": 1063004405760,
    "inactive_pgs_ratio": 0.01904761977493763,
    "degraded_objects": 470,
    "degraded_total": 2106,
    "degraded_ratio": 0.22317188983855651
}
Updated by Laura Flores about 2 years ago
/a/yuriw-2022-03-19_14:37:23-rados-wip-yuri6-testing-2022-03-18-1104-distro-default-smithi/6746893
{
    "pgs_by_state": [
        {
            "state_name": "active+clean",
            "count": 40
        },
        {
            "state_name": "recovery_wait+undersized+degraded+remapped+peered",
            "count": 11
        },
        {
            "state_name": "active+recovery_wait+undersized+degraded+remapped",
            "count": 10
        },
        {
            "state_name": "active+remapped+backfill_wait",
            "count": 2
        },
        {
            "state_name": "active+recovery_wait",
            "count": 1
        },
        {
            "state_name": "active+recovering+undersized+remapped",
            "count": 1
        }
    ],
    "num_pgs": 65,
    "num_pools": 3,
    "num_objects": 123,
    "data_bytes": 380551712,
    "bytes_used": 3641274368,
    "bytes_avail": 866089603072,
    "bytes_total": 869730877440,
    "inactive_pgs_ratio": 0.16923077404499054,
    "degraded_objects": 1081,
    "degraded_total": 730,
    "degraded_ratio": 1.4808219178082191,
    "misplaced_objects": 30,
    "misplaced_total": 730,
    "misplaced_ratio": 0.0410958904109589
}
Updated by Laura Flores about 2 years ago
/a/yuriw-2022-03-24_16:44:32-rados-wip-yuri-testing-2022-03-24-0726-distro-default-smithi/6758238
Updated by Laura Flores about 2 years ago
/a/yuriw-2022-03-25_18:42:52-rados-wip-yuri7-testing-2022-03-24-1341-pacific-distro-default-smithi/6761700
{
    "pgs_by_state": [
        {
            "state_name": "active+clean",
            "count": 29
        },
        {
            "state_name": "active+recovery_wait+undersized+degraded+remapped",
            "count": 3
        },
        {
            "state_name": "recovery_wait+undersized+degraded+remapped+peered",
            "count": 2
        },
        {
            "state_name": "active+recovering+undersized+degraded+remapped",
            "count": 1
        }
    ],
    "num_pgs": 35,
    "num_pools": 3,
    "num_objects": 198,
    "data_bytes": 162054144,
    "bytes_used": 1095106560,
    "bytes_avail": 1061909299200,
    "bytes_total": 1063004405760,
    "inactive_pgs_ratio": 0.05714285746216774,
    "degraded_objects": 124,
    "degraded_total": 1188,
    "degraded_ratio": 0.10437710437710437
}
Updated by Neha Ojha about 2 years ago
Laura Flores wrote:
/a/yuriw-2022-03-25_18:42:52-rados-wip-yuri7-testing-2022-03-24-1341-pacific-distro-default-smithi/6761700
[...]
This does not have down PGs, some other bug maybe?
Updated by Laura Flores almost 2 years ago
Neha Ojha wrote:
Laura Flores wrote:
/a/yuriw-2022-03-25_18:42:52-rados-wip-yuri7-testing-2022-03-24-1341-pacific-distro-default-smithi/6761700
[...]
This does not have down PGs, some other bug maybe?
Yeah, that one looks very similar pg-map-wise to #51076. It checks out too, since that test uses the "mapgap" thrasher.
Description: rados/thrash-erasure-code-big/{ceph cluster/{12-osds openstack} mon_election/connectivity msgr-failures/few objectstore/bluestore-hybrid rados recovery-overrides/{default} supported-random-distro$/{ubuntu_latest} thrashers/mapgap thrashosds-health workloads/ec-rados-plugin=jerasure-k=4-m=2}
Updated by Laura Flores almost 2 years ago
- Backport changed from quincy, pacific to quincy, pacific, octopus
Updated by Radoslaw Zarzynski almost 2 years ago
Not urgent, perhaps not low-hanging-fruit but still good as a training issue.
Updated by Laura Flores almost 2 years ago
Other reported instances of this `wait_for_clean` assertion failure where the pgmap has a pg stuck in recovery have likely been fixed by #51076. The following, however, seems to be an example of what this tracker is reporting: the assertion failure happened in a `rados/thrash-erasure-code-overwrites` test, and there were down pgs:
/a/yuriw-2022-05-27_21:59:17-rados-wip-yuri-testing-2022-05-27-0934-distro-default-smithi/6851025
Description: rados/thrash-erasure-code-overwrites/{bluestore-bitmap ceph clusters/{fixed-2 openstack} fast/normal mon_election/connectivity msgr-failures/few rados recovery-overrides/{more-async-recovery} supported-random-distro$/{rhel_8} thrashers/minsize_recovery thrashosds-health workloads/ec-pool-snaps-few-objects-overwrites}
Last pgmap before the assertion failure:
{
    "pgs_by_state": [
        {
            "state_name": "active+clean",
            "count": 12
        },
        {
            "state_name": "down",
            "count": 3
        },
        {
            "state_name": "active+clean+scrubbing",
            "count": 1
        }
    ],
    "num_pgs": 16,
    "num_pools": 1,
    "num_objects": 43,
    "data_bytes": 112235760,
    "bytes_used": 221577216,
    "bytes_avail": 676235771904,
    "bytes_total": 676457349120,
    "inactive_pgs_ratio": 0.1875
}
Updated by Laura Flores almost 2 years ago
This `wait_for_clean` assertion failure is happening with the minsize_recovery thrasher, which is used by the rados/thrash-erasure-code and rados/thrash-erasure-code-overwrites tests. This suggests that the minsize_recovery thrasher is somehow involved. See https://tracker.ceph.com/issues/49777#note-10 for more recorded instances.
Updated by Laura Flores almost 2 years ago
- Related to Bug #49777: test_pool_min_size: 'check for active or peered' reached maximum tries (5) after waiting for 25 seconds added
Updated by Laura Flores almost 2 years ago
- Related to Bug #54511: test_pool_min_size: AssertionError: not clean before minsize thrashing starts added
Updated by Nitzan Mordechai almost 2 years ago
Talked with Ronen Friedman regarding this issue; it may be related to another bug he is working on, where a scrub did not complete.
Updated by Laura Flores almost 2 years ago
Oh interesting. Is there a reason that bug would only be affecting minsize_recovery?
Updated by Laura Flores almost 2 years ago
Found an instance of this failure that did not use minsize_recovery. It's possible that it has a different root cause, but here we're also seeing down pgs in the pgmap:
Description: rados/thrash-erasure-code-isa/{arch/x86_64 ceph clusters/{fixed-2 openstack} mon_election/connectivity msgr-failures/fastclose objectstore/bluestore-comp-zstd rados recovery-overrides/{more-async-recovery} supported-random-distro$/{rhel_8} thrashers/mapgap thrashosds-health workloads/ec-rados-plugin=isa-k=2-m=1}
/a/yuriw-2022-06-22_22:13:20-rados-wip-yuri3-testing-2022-06-22-1121-pacific-distro-default-smithi/6892939
{
    "pgs_by_state": [
        {
            "state_name": "active+clean",
            "count": 34
        },
        {
            "state_name": "down+remapped",
            "count": 1
        }
    ],
    "num_pgs": 35,
    "num_pools": 3,
    "num_objects": 112,
    "data_bytes": 239910912,
    "bytes_used": 2221826048,
    "bytes_avail": 577598758912,
    "bytes_total": 579820584960,
    "inactive_pgs_ratio": 0.02857142873108387
}
Updated by Kamoltat (Junior) Sirivadhna almost 2 years ago
/a/ksirivad-2022-07-01_21:00:49-rados:thrash-erasure-code-main-distro-default-smithi/6910169/
Updated by Ronen Friedman almost 2 years ago
Kamoltat Sirivadhna wrote:
/a/ksirivad-2022-07-01_21:00:49-rados:thrash-erasure-code-main-distro-default-smithi/6910169/
That one failed in ~Mon(), asserting on session_map.sessions.empty()
Updated by Neha Ojha almost 2 years ago
/a/ksirivad-2022-07-01_21:00:49-rados:thrash-erasure-code-main-distro-default-smithi/6910169/ first timed out, and then we see the shutdown crash:
2022-07-01T23:42:43.073 INFO:tasks.thrashosds.thrasher:Traceback (most recent call last):
...
2022-07-01T23:42:50.766 INFO:tasks.ceph.mon.c.smithi105.stderr:/home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-13311-g57487367/rpm/el8/BUILD/ceph-17.0.0-13311-g57487367/src/mon/Monitor.cc: 285: FAILED ceph_assert(session_map.sessions.empty())
Updated by Kamoltat (Junior) Sirivadhna almost 2 years ago
- Subject changed from AssertionError: wait_for_clean: failed before timeout expired due to down PGs to test_pool_min_size:AssertionError:wait_for_clean:failed before timeout expired due to down PGs
- Assignee changed from Nitzan Mordechai to Kamoltat (Junior) Sirivadhna
- Pull request ID set to 47138
Updated by Kamoltat (Junior) Sirivadhna almost 2 years ago
- Status changed from New to Fix Under Review
Updated by Kamoltat (Junior) Sirivadhna almost 2 years ago
I was able to reproduce the problem after modifying qa/tasks/ceph_manager.py: https://github.com/ceph/ceph/pull/46931/commits/1f6bcbb3d680d8589e498b993d2cf566480e2c3e.
Here are the runs that I reproduced with the modified test PR:
ksirivad-2022-07-09_05:39:52-rados:thrash-erasure-code-main-distro-default-smithi/6921376
ksirivad-2022-07-14_03:50:33-rados:thrash-erasure-code-main-distro-default-smithi/6929814
ksirivad-2022-07-14_03:50:33-rados:thrash-erasure-code-main-distro-default-smithi/6929818
ksirivad-2022-07-14_03:50:33-rados:thrash-erasure-code-main-distro-default-smithi/6929819
ksirivad-2022-07-14_03:50:33-rados:thrash-erasure-code-main-distro-default-smithi/6929855
The Problem:
The problem was that the ceph_manager.wait_for_clean() call in test_pool_min_size() did not have a long enough timeout. According to my analysis (https://docs.google.com/document/d/1YQeMF2Rc_unQvS2GAtLRXTJEb9a-D66McUQbBdXWZAE/edit?usp=sharing), the PGs were temporarily in states like `snaptrim` and `backfill_full`, but eventually left those states and became active+clean; sometimes 60 seconds is simply not enough.
The Solution:
Increase the timeout of the ceph_manager.wait_for_clean() call in test_pool_min_size() from 60s to 180s.
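The effect of the fix can be illustrated with a self-contained sketch. `wait_for_clean` below is a simplified, hypothetical stand-in for the helper in qa/tasks/ceph_manager.py, polling once per simulated second:

```python
import itertools

def wait_for_clean(get_num_unclean, timeout):
    """Simplified sketch of the qa-suite poll (hypothetical, not the real
    helper): wait until no unclean PGs remain, else fail like the tracker."""
    for elapsed in itertools.count():
        if get_num_unclean(elapsed) == 0:
            return elapsed
        assert elapsed < timeout, \
            'wait_for_clean: failed before timeout expired'

# PGs that sit in snaptrim/backfill_full for 90 simulated seconds
# before going active+clean:
slow_recovery = lambda elapsed: 0 if elapsed >= 90 else 1

try:
    wait_for_clean(slow_recovery, timeout=60)
except AssertionError as e:
    print(e)                                       # the failure from this tracker
print(wait_for_clean(slow_recovery, timeout=180))  # 90 -> passes
```

With the old 60s timeout the simulated cluster trips the assertion even though it would have gone clean at 90s; with 180s it settles well within the limit.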
Updated by Kamoltat (Junior) Sirivadhna over 1 year ago
- Status changed from Fix Under Review to Pending Backport
- Backport changed from quincy, pacific, octopus to quincy, pacific
Updated by Backport Bot over 1 year ago
- Copied to Backport #57025: quincy: test_pool_min_size:AssertionError:wait_for_clean:failed before timeout expired due to down PGs added
Updated by Backport Bot over 1 year ago
- Copied to Backport #57026: pacific: test_pool_min_size:AssertionError:wait_for_clean:failed before timeout expired due to down PGs added
Updated by Neha Ojha about 1 year ago
- Status changed from Pending Backport to Resolved
Updated by Laura Flores about 1 year ago
- Related to Bug #59172: test_pool_min_size: AssertionError: wait_for_clean: failed before timeout expired due to down PGs added