Project

General

Profile

Actions

Bug #64504

open

aio ops queued but never executed

Added by Nitzan Mordechai 2 months ago. Updated 2 months ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Few of teuthology tests were failed when trying to execute aio_write and then wait_for_complete never completed.
the client shows the op sent:

2024-02-13T17:25:17.599+0000 7f67c83ce740 10 client.5066.objecter _op_submit op 0x555f070ec010
2024-02-13T17:25:17.599+0000 7f67c83ce740 20 client.5066.objecter _calc_target epoch 81 base foo @43;nspace precalc_pgid 0 pgid 0.0 is_write
2024-02-13T17:25:17.599+0000 7f67c83ce740 20 client.5066.objecter _calc_target target foo @43;nspace -> pgid 43.1272949
2024-02-13T17:25:17.599+0000 7f67c83ce740 10 client.5066.objecter _calc_target  raw pgid 43.1272949 -> actual 43.9 acting [1,6,7] primary 1
2024-02-13T17:25:17.599+0000 7f67c83ce740  1 --2- 172.21.15.84:0/686509261 >> [v2:172.21.15.84:6800/3364418886,v1:172.21.15.84:6801/3364418886] conn(0x555f071a2640 0x555f071a4a30 unknown :-1 s=NONE pgs=0 cs=0 l=1 rev1=0 crypto rx=0 tx=0 comp rx=0 tx=0).connect
2024-02-13T17:25:17.599+0000 7f67c83ce740 20 client.5066.objecter _get_session s=0x555f070ece90 osd=1 3
2024-02-13T17:25:17.599+0000 7f67c83ce740 10 client.5066.objecter _op_submit oid foo '@43;nspace' '@43;nspace' [write 0~128 in=128b] tid 3 osd.1
2024-02-13T17:25:17.599+0000 7f67c83ce740 20 client.5066.objecter get_session s=0x555f070ece90 osd=1 3
2024-02-13T17:25:17.599+0000 7f67c83ce740 15 client.5066.objecter _session_op_assign 1 3
2024-02-13T17:25:17.599+0000 7f67c83ce740 15 client.5066.objecter _send_op 3 to 43.9 on osd.1
2024-02-13T17:25:17.599+0000 7f67c83ce740  1 -- 172.21.15.84:0/686509261 --> [v2:172.21.15.84:6800/3364418886,v1:172.21.15.84:6801/3364418886] -- osd_op(unknown.0.0:3 43.9 43:9294e480:nspace::foo:head [write 0~128 in=128b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e81) v8 -- 0x555f071a4fb0 con 0x555f071a2640

the osd shows the op queued:

2024-02-13T17:25:17.599+0000 7fa2f661d640 15 osd.1 85 enqueue_op osd_op(client.5066.0:3 43.9 43.1272949 (undecoded) ondisk+write+known_if_redirected+supports_pool_eio e81) v8 prio 63 type 42 cost 128 latency 0.000016 epoch 81 osd_op(client.5066.0:3 43.9 43.1272949 (undecoded) ondisk+write+known_if_redirected+supports_pool_eio e81) v8
2024-02-13T17:25:17.599+0000 7fa2f661d640 20 osd.1 op_wq(1) _enqueue OpSchedulerItem(43.9 PGOpItem(op=osd_op(client.5066.0:3 43.9 43.1272949 (undecoded) ondisk+write+known_if_redirected+supports_pool_eio e81) v8) class_id 3 prio 63 cost 128 e81)

but not dequeue and execute that op.
/a/lflores-2024-01-22_04:12:37-rados-wip-lflores-testing-2024-01-18-2129-distro-default-smithi/7525368
/a/yuriw-2024-02-13_23:41:23-rados-wip-yuri6-testing-2024-02-13-0904-distro-default-smithi/7558996
/a/lflores-2024-02-13_16:18:32-rados-wip-yuri5-testing-2024-02-12-1152-distro-default-smithi/7558384


Related issues 1 (1 open0 closed)

Related to RADOS - Bug #58130: LibRadosAio.SimpleWrite hang and pkillIn ProgressNitzan Mordechai

Actions
Actions #1

Updated by Nitzan Mordechai 2 months ago

  • Related to Bug #58130: LibRadosAio.SimpleWrite hang and pkill added
Actions #2

Updated by Radoslaw Zarzynski 2 months ago

Asked Sridhar to judge whether it's dmclock-related.

Actions #3

Updated by Sridhar Seshasayee 2 months ago

So far what I see is that the client op didn't get scheduled at all on osd.1 due a continuous stream of higher priority peering ops on the same OSD shard . As a result the slow request warning is reported. But this warning is ignored since it is a part of the ignore list. I didn't see any rados api test failures though. The test fails because a few PGs didn't get backfilled in time. Each of those PGs has a single missing object which for some reason is not getting backfilled shown below:

2024-02-13T19:18:34.399 INFO:tasks.ceph.ceph_manager.ceph:PG 13.1 is not active+clean
2024-02-13T19:18:34.399 INFO:tasks.ceph.ceph_manager.ceph:{'pgid': '13.1', 'version': "1222'7", 'reported_seq': 5407, 'reported_epoch': 4192, 'state': 'active+remapped+backfilling', 'last_fresh': '2024-02-13T19:18:32.968671+0000', 'last_change': '2024-02-13T18:33:52.663811+0000', 'last_active': '2024-02-13T19:18:32.968671+0000', 'last_peered': '2024-02-13T19:18:32.968671+0000', 'last_clean': '2024-02-13T18:19:16.469528+0000', 'last_became_active': '2024-02-13T18:27:20.655505+0000', 'last_became_peered': '2024-02-13T18:27:20.655505+0000', 'last_unstale': '2024-02-13T19:18:32.968671+0000', 'last_undegraded': '2024-02-13T19:18:32.968671+0000', 'last_fullsized': '2024-02-13T19:18:32.968671+0000', 'mapping_epoch': 2381, 'log_start': "1222'5", 'ondisk_log_start': "1222'5", 'created': 70, 'last_epoch_clean': 2149, 'parent': '0.0', 'parent_split_bits': 0, 'last_scrub': "0'0", 'last_scrub_stamp': '2024-02-13T17:25:01.973344+0000', 'last_deep_scrub': "0'0", 'last_deep_scrub_stamp': '2024-02-13T17:25:01.973344+0000', 'last_clean_scrub_stamp': '2024-02-13T17:25:01.973344+0000', 'objects_scrubbed': 0, 'log_size': 2, 'log_dups_size': 5, 'ondisk_log_size': 2, 'stats_invalid': False, 'dirty_stats_invalid': False, 'omap_stats_invalid': False, 'hitset_stats_invalid': False, 'hitset_bytes_stats_invalid': False, 'pin_stats_invalid': False, 'manifest_stats_invalid': False, 'snaptrimq_len': 0, 'last_scrub_duration': 0, 'scrub_schedule': 'no scrub is scheduled', 'scrub_duration': 0, 'objects_trimmed': 0, 'snaptrim_duration': 0, 'stat_sum': {'num_bytes': 5, 'num_objects': 1, 'num_object_clones': 0, 'num_object_copies': 3, 'num_objects_missing_on_primary': 0, 'num_objects_missing': 0, 'num_objects_degraded': 0, 'num_objects_misplaced': 1, 'num_objects_unfound': 0, 'num_objects_dirty': 1, 'num_whiteouts': 0, 'num_read': 6, 'num_read_kb': 6, 'num_write': 7, 'num_write_kb': 7, 'num_scrub_errors': 0, 'num_shallow_scrub_errors': 0, 'num_deep_scrub_errors': 0, 'num_objects_recovered': 2, 'num_bytes_recovered': 10, 'num_keys_recovered': 0, 'num_objects_omap': 0, 'num_objects_hit_set_archive': 0, 'num_bytes_hit_set_archive': 0, 'num_flush': 0, 'num_flush_kb': 0, 'num_evict': 0, 'num_evict_kb': 0, 'num_promote': 0, 'num_flush_mode_high': 0, 'num_flush_mode_low': 0, 'num_evict_mode_some': 0, 'num_evict_mode_full': 0, 'num_objects_pinned': 0, 'num_legacy_snapsets': 0, 'num_large_omap_objects': 0, 'num_objects_manifest': 0, 'num_omap_bytes': 0, 'num_omap_keys': 0, 'num_objects_repaired': 0}, 'up': [1, 4, 3], 'acting': [1, 4, 0], 'avail_no_missing': ['1', '0', '4'], 'object_location_counts': [{'shards': '0,1,4', 'objects': 1}], 'blocked_by': [], 'up_primary': 1, 'acting_primary': 1, 'purged_snaps': []}
2024-02-13T19:18:34.399 INFO:tasks.ceph.ceph_manager.ceph:PG 5.11 is not active+clean
2024-02-13T19:18:34.399 INFO:tasks.ceph.ceph_manager.ceph:{'pgid': '5.11', 'version': "2578'7", 'reported_seq': 5062, 'reported_epoch': 4191, 'state': 'active+remapped+backfilling', 'last_fresh': '2024-02-13T19:18:30.998467+0000', 'last_change': '2024-02-13T18:34:10.970040+0000', 'last_active': '2024-02-13T19:18:30.998467+0000', 'last_peered': '2024-02-13T19:18:30.998467+0000', 'last_clean': '2024-02-13T18:26:49.076456+0000', 'last_became_active': '2024-02-13T18:27:08.059580+0000', 'last_became_peered': '2024-02-13T18:27:08.059580+0000', 'last_unstale': '2024-02-13T19:18:30.998467+0000', 'last_undegraded': '2024-02-13T19:18:30.998467+0000', 'last_fullsized': '2024-02-13T19:18:30.998467+0000', 'mapping_epoch': 2423, 'log_start': "2243'4", 'ondisk_log_start': "2243'4", 'created': 70, 'last_epoch_clean': 2411, 'parent': '0.0', 'parent_split_bits': 0, 'last_scrub': "0'0", 'last_scrub_stamp': '2024-02-13T17:25:01.973344+0000', 'last_deep_scrub': "0'0", 'last_deep_scrub_stamp': '2024-02-13T17:25:01.973344+0000', 'last_clean_scrub_stamp': '2024-02-13T17:25:01.973344+0000', 'objects_scrubbed': 0, 'log_size': 3, 'log_dups_size': 4, 'ondisk_log_size': 3, 'stats_invalid': False, 'dirty_stats_invalid': False, 'omap_stats_invalid': False, 'hitset_stats_invalid': False, 'hitset_bytes_stats_invalid': False, 'pin_stats_invalid': False, 'manifest_stats_invalid': False, 'snaptrimq_len': 0, 'last_scrub_duration': 0, 'scrub_schedule': 'no scrub is scheduled', 'scrub_duration': 0, 'objects_trimmed': 0, 'snaptrim_duration': 0, 'stat_sum': {'num_bytes': 128, 'num_objects': 1, 'num_object_clones': 0, 'num_object_copies': 3, 'num_objects_missing_on_primary': 0, 'num_objects_missing': 0, 'num_objects_degraded': 0, 'num_objects_misplaced': 1, 'num_objects_unfound': 0, 'num_objects_dirty': 1, 'num_whiteouts': 0, 'num_read': 0, 'num_read_kb': 0, 'num_write': 7, 'num_write_kb': 4, 'num_scrub_errors': 0, 'num_shallow_scrub_errors': 0, 'num_deep_scrub_errors': 0, 'num_objects_recovered': 1, 'num_bytes_recovered': 128, 'num_keys_recovered': 0, 'num_objects_omap': 0, 'num_objects_hit_set_archive': 0, 'num_bytes_hit_set_archive': 0, 'num_flush': 0, 'num_flush_kb': 0, 'num_evict': 0, 'num_evict_kb': 0, 'num_promote': 0, 'num_flush_mode_high': 0, 'num_flush_mode_low': 0, 'num_evict_mode_some': 0, 'num_evict_mode_full': 0, 'num_objects_pinned': 0, 'num_legacy_snapsets': 0, 'num_large_omap_objects': 0, 'num_objects_manifest': 0, 'num_omap_bytes': 0, 'num_omap_keys': 0, 'num_objects_repaired': 0}, 'up': [3, 2, 4], 'acting': [3, 2, 5], 'avail_no_missing': ['3', '2', '5'], 'object_location_counts': [{'shards': '2,3,5', 'objects': 1}], 'blocked_by': [], 'up_primary': 3, 'acting_primary': 3, 'purged_snaps': []}
2024-02-13T19:18:34.399 INFO:tasks.ceph.ceph_manager.ceph:PG 5.1d is not active+clean
2024-02-13T19:18:34.399 INFO:tasks.ceph.ceph_manager.ceph:{'pgid': '5.1d', 'version': "2578'7", 'reported_seq': 7113, 'reported_epoch': 4192, 'state': 'active+remapped+backfilling', 'last_fresh': '2024-02-13T19:18:32.969131+0000', 'last_change': '2024-02-13T18:33:46.389640+0000', 'last_active': '2024-02-13T19:18:32.969131+0000', 'last_peered': '2024-02-13T19:18:32.969131+0000', 'last_clean': '2024-02-13T18:27:05.509667+0000', 'last_became_active': '2024-02-13T18:27:08.564327+0000', 'last_became_peered': '2024-02-13T18:27:08.564327+0000', 'last_unstale': '2024-02-13T19:18:32.969131+0000', 'last_undegraded': '2024-02-13T19:18:32.969131+0000', 'last_fullsized': '2024-02-13T19:18:32.969131+0000', 'mapping_epoch': 2423, 'log_start': "2243'4", 'ondisk_log_start': "2243'4", 'created': 70, 'last_epoch_clean': 2409, 'parent': '0.0', 'parent_split_bits': 0, 'last_scrub': "0'0", 'last_scrub_stamp': '2024-02-13T17:36:30.244815+0000', 'last_deep_scrub': "0'0", 'last_deep_scrub_stamp': '2024-02-13T17:25:01.973344+0000', 'last_clean_scrub_stamp': '2024-02-13T17:36:30.244815+0000', 'objects_scrubbed': 0, 'log_size': 3, 'log_dups_size': 4, 'ondisk_log_size': 3, 'stats_invalid': False, 'dirty_stats_invalid': False, 'omap_stats_invalid': False, 'hitset_stats_invalid': False, 'hitset_bytes_stats_invalid': False, 'pin_stats_invalid': False, 'manifest_stats_invalid': False, 'snaptrimq_len': 0, 'last_scrub_duration': 1, 'scrub_schedule': 'no scrub is scheduled', 'scrub_duration': 902, 'objects_trimmed': 0, 'snaptrim_duration': 0, 'stat_sum': {'num_bytes': 128, 'num_objects': 1, 'num_object_clones': 0, 'num_object_copies': 3, 'num_objects_missing_on_primary': 0, 'num_objects_missing': 0, 'num_objects_degraded': 0, 'num_objects_misplaced': 1, 'num_objects_unfound': 0, 'num_objects_dirty': 1, 'num_whiteouts': 0, 'num_read': 0, 'num_read_kb': 0, 'num_write': 7, 'num_write_kb': 4, 'num_scrub_errors': 0, 'num_shallow_scrub_errors': 0, 'num_deep_scrub_errors': 0, 'num_objects_recovered': 0, 'num_bytes_recovered': 0, 'num_keys_recovered': 0, 'num_objects_omap': 0, 'num_objects_hit_set_archive': 0, 'num_bytes_hit_set_archive': 0, 'num_flush': 0, 'num_flush_kb': 0, 'num_evict': 0, 'num_evict_kb': 0, 'num_promote': 0, 'num_flush_mode_high': 0, 'num_flush_mode_low': 0, 'num_evict_mode_some': 0, 'num_evict_mode_full': 0, 'num_objects_pinned': 0, 'num_legacy_snapsets': 0, 'num_large_omap_objects': 0, 'num_objects_manifest': 0, 'num_omap_bytes': 0, 'num_omap_keys': 0, 'num_objects_repaired': 0}, 'up': [1, 4, 6], 'acting': [1, 6, 2], 'avail_no_missing': ['1', '2', '6'], 'object_location_counts': [{'shards': '1,2,6', 'objects': 1}], 'blocked_by': [], 'up_primary': 1, 'acting_primary': 1, 'purged_snaps': []}
2024-02-13T19:18:34.401 INFO:tasks.thrashosds.thrasher:Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_ceph-c_3d2b0d7b9d303971fd78b4efe33cf2a06e3a7a4d/qa/tasks/ceph_manager.py", line 190, in wrapper
    return func(self)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_3d2b0d7b9d303971fd78b4efe33cf2a06e3a7a4d/qa/tasks/ceph_manager.py", line 1425, in _do_thrash
    self.ceph_manager.wait_for_recovery(
  File "/home/teuthworker/src/git.ceph.com_ceph-c_3d2b0d7b9d303971fd78b4efe33cf2a06e3a7a4d/qa/tasks/ceph_manager.py", line 2873, in wait_for_recovery
    assert now - start < timeout, \
AssertionError: wait_for_recovery: failed before timeout expired

I will update with additional details if I find anything else.

Actions

Also available in: Atom PDF