Bug #52948

osd: fails to come up: "teuthology.misc:7 of 8 OSDs are up"

Added by Patrick Donnelly over 1 year ago. Updated 8 months ago.

Status: New
Priority: Normal
Category: -
Target version:
% Done: 0%
Source: Q/A
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite: fs
Component(RADOS): OSD
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2021-10-15T03:22:32.089 INFO:teuthology.misc.health.smithi089.stdout:{"epoch":17,"fsid":"dcc6a9bc-1936-44b9-aa05-4bd710856c95","created":"2021-10-15T03:11:55.278461+0000","modified":"2021-10-15T03:22:05.151198+0000","last_up_change":"2021-10-15T03:12:06.037577+0000","last_in_change":"2021-10-15T03:22:05.151198+0000","flags":"sortbitwise,recovery_deletes,purged_snapdirs,pglog_hardlimit","flags_num":5799936,"flags_set":["pglog_hardlimit","purged_snapdirs","recovery_deletes","sortbitwise"],"crush_version":4,"full_ratio":0.94999998807907104,"backfillfull_ratio":0.89999997615814209,"nearfull_ratio":0.85000002384185791,"cluster_snapshot":"","pool_max":1,"max_osd":8,"require_min_compat_client":"luminous","min_compat_client":"jewel","require_osd_release":"quincy","pools":[{"pool":1,"pool_name":".mgr","create_time":"2021-10-15T03:12:06.169415+0000","flags":1,"flags_names":"hashpspool","type":1,"size":2,"min_size":1,"crush_rule":0,"peering_crush_bucket_count":0,"peering_crush_bucket_target":0,"peering_crush_bucket_barrier":0,"peering_crush_bucket_mandatory_member":2147483647,"object_hash":2,"pg_autoscale_mode":"off","pg_num":1,"pg_placement_num":1,"pg_placement_num_target":1,"pg_num_target":1,"pg_num_pending":1,"last_pg_merge_meta":{"source_pgid":"0.0","ready_epoch":0,"last_epoch_started":0,"last_epoch_clean":0,"source_version":"0'0","target_version":"0'0"},"last_change":"16","last_force_op_resend":"0","last_force_op_resend_prenautilus":"0","last_force_op_resend_preluminous":"0","auid":0,"snap_mode":"selfmanaged","snap_seq":0,"snap_epoch":0,"pool_snaps":[],"removed_snaps":"[]","quota_max_bytes":0,"quota_max_objects":0,"tiers":[],"tier_of":-1,"read_tier":-1,"write_tier":-1,"cache_mode":"none","target_max_bytes":0,"target_max_objects":0,"cache_target_dirty_ratio_micro":400000,"cache_target_dirty_high_ratio_micro":600000,"cache_target_full_ratio_micro":800000,"cache_min_flush_age":0,"cache_min_evict_age":0,"erasure_code_profile":"","hit_set_params":{"type":"none"},"hit_set_perio
d":0,"hit_set_count":0,"use_gmt_hitset":true,"min_read_recency_for_promote":0,"min_write_recency_for_promote":0,"hit_set_grade_decay_rate":0,"hit_set_search_last_n":0,"grade_table":[],"stripe_width":0,"expected_num_objects":0,"fast_read":false,"options":{"pg_num_min":1},"application_metadata":{"mgr":{}}}],"osds":[{"osd":0,"uuid":"cd5c8454-0a20-454e-a59b-51cb3c3e9a78","up":1,"in":1,"weight":1,"primary_affinity":1,"last_clean_begin":0,"last_clean_end":0,"up_from":13,"up_thru":0,"down_at":0,"lost_at":0,"public_addrs":{"addrvec":[{"type":"v2","addr":"172.21.15.89:6823","nonce":36889},{"type":"v1","addr":"172.21.15.89:6825","nonce":36889}]},"cluster_addrs":{"addrvec":[{"type":"v2","addr":"172.21.15.89:6826","nonce":36889},{"type":"v1","addr":"172.21.15.89:6827","nonce":36889}]},"heartbeat_back_addrs":{"addrvec":[{"type":"v2","addr":"172.21.15.89:6830","nonce":36889},{"type":"v1","addr":"172.21.15.89:6831","nonce":36889}]},"heartbeat_front_addrs":{"addrvec":[{"type":"v2","addr":"172.21.15.89:6828","nonce":36889},{"type":"v1","addr":"172.21.15.89:6829","nonce":36889}]},"public_addr":"172.21.15.89:6825/36889","cluster_addr":"172.21.15.89:6827/36889","heartbeat_back_addr":"172.21.15.89:6831/36889","heartbeat_front_addr":"172.21.15.89:6829/36889","state":["exists","up"]},{"osd":1,"uuid":"23b82270-88f9-4b1e-954d-609d5b4f7f3a","up":1,"in":1,"weight":1,"primary_affinity":1,"last_clean_begin":0,"last_clean_end":0,"up_from":13,"up_thru":14,"down_at":0,"lost_at":0,"public_addrs":{"addrvec":[{"type":"v2","addr":"172.21.15.89:6808","nonce":36887},{"type":"v1","addr":"172.21.15.89:6809","nonce":36887}]},"cluster_addrs":{"addrvec":[{"type":"v2","addr":"172.21.15.89:6810","nonce":36887},{"type":"v1","addr":"172.21.15.89:6811","nonce":36887}]},"heartbeat_back_addrs":{"addrvec":[{"type":"v2","addr":"172.21.15.89:6814","nonce":36887},{"type":"v1","addr":"172.21.15.89:6815","nonce":36887}]},"heartbeat_front_addrs":{"addrvec":[{"type":"v2","addr":"172.21.15.89:6812","nonce":36887},{"type":"v
1","addr":"172.21.15.89:6813","nonce":36887}]},"public_addr":"172.21.15.89:6809/36887","cluster_addr":"172.21.15.89:6811/36887","heartbeat_back_addr":"172.21.15.89:6815/36887","heartbeat_front_addr":"172.21.15.89:6813/36887","state":["exists","up"]},{"osd":2,"uuid":"166f20d4-cc82-4f6e-ba0a-1700c16cdea4","up":1,"in":1,"weight":1,"primary_affinity":1,"last_clean_begin":0,"last_clean_end":0,"up_from":13,"up_thru":0,"down_at":0,"lost_at":0,"public_addrs":{"addrvec":[{"type":"v2","addr":"172.21.15.89:6800","nonce":36886},{"type":"v1","addr":"172.21.15.89:6801","nonce":36886}]},"cluster_addrs":{"addrvec":[{"type":"v2","addr":"172.21.15.89:6802","nonce":36886},{"type":"v1","addr":"172.21.15.89:6803","nonce":36886}]},"heartbeat_back_addrs":{"addrvec":[{"type":"v2","addr":"172.21.15.89:6806","nonce":36886},{"type":"v1","addr":"172.21.15.89:6807","nonce":36886}]},"heartbeat_front_addrs":{"addrvec":[{"type":"v2","addr":"172.21.15.89:6804","nonce":36886},{"type":"v1","addr":"172.21.15.89:6805","nonce":36886}]},"public_addr":"172.21.15.89:6801/36886","cluster_addr":"172.21.15.89:6803/36886","heartbeat_back_addr":"172.21.15.89:6807/36886","heartbeat_front_addr":"172.21.15.89:6805/36886","state":["exists","up"]},{"osd":3,"uuid":"cbe05e41-eb54-4d93-b06b-ef8adffd6261","up":0,"in":0,"weight":0,"primary_affinity":1,"last_clean_begin":0,"last_clean_end":0,"up_from":0,"up_thru":0,"down_at":0,"lost_at":0,"public_addrs":{"addrvec":[]},"cluster_addrs":{"addrvec":[]},"heartbeat_back_addrs":{"addrvec":[]},"heartbeat_front_addrs":{"addrvec":[]},"public_addr":"(unrecognized address family 0)/0","cluster_addr":"(unrecognized address family 0)/0","heartbeat_back_addr":"(unrecognized address family 0)/0","heartbeat_front_addr":"(unrecognized address family 
0)/0","state":["autoout","exists","new"]},{"osd":4,"uuid":"8389a217-b195-4550-9874-ddea95eb0eab","up":1,"in":1,"weight":1,"primary_affinity":1,"last_clean_begin":0,"last_clean_end":0,"up_from":13,"up_thru":0,"down_at":0,"lost_at":0,"public_addrs":{"addrvec":[{"type":"v2","addr":"172.21.15.25:6808","nonce":37119},{"type":"v1","addr":"172.21.15.25:6809","nonce":37119}]},"cluster_addrs":{"addrvec":[{"type":"v2","addr":"172.21.15.25:6810","nonce":37119},{"type":"v1","addr":"172.21.15.25:6811","nonce":37119}]},"heartbeat_back_addrs":{"addrvec":[{"type":"v2","addr":"172.21.15.25:6814","nonce":37119},{"type":"v1","addr":"172.21.15.25:6815","nonce":37119}]},"heartbeat_front_addrs":{"addrvec":[{"type":"v2","addr":"172.21.15.25:6812","nonce":37119},{"type":"v1","addr":"172.21.15.25:6813","nonce":37119}]},"public_addr":"172.21.15.25:6809/37119","cluster_addr":"172.21.15.25:6811/37119","heartbeat_back_addr":"172.21.15.25:6815/37119","heartbeat_front_addr":"172.21.15.25:6813/37119","state":["exists","up"]},{"osd":5,"uuid":"4d1f724e-4fc9-4862-a8a6-041554d0586f","up":1,"in":1,"weight":1,"primary_affinity":1,"last_clean_begin":0,"last_clean_end":0,"up_from":13,"up_thru":0,"down_at":0,"lost_at":0,"public_addrs":{"addrvec":[{"type":"v2","addr":"172.21.15.25:6816","nonce":37135},{"type":"v1","addr":"172.21.15.25:6817","nonce":37135}]},"cluster_addrs":{"addrvec":[{"type":"v2","addr":"172.21.15.25:6818","nonce":37135},{"type":"v1","addr":"172.21.15.25:6819","nonce":37135}]},"heartbeat_back_addrs":{"addrvec":[{"type":"v2","addr":"172.21.15.25:6822","nonce":37135},{"type":"v1","addr":"172.21.15.25:6823","nonce":37135}]},"heartbeat_front_addrs":{"addrvec":[{"type":"v2","addr":"172.21.15.25:6820","nonce":37135},{"type":"v1","addr":"172.21.15.25:6821","nonce":37135}]},"public_addr":"172.21.15.25:6817/37135","cluster_addr":"172.21.15.25:6819/37135","heartbeat_back_addr":"172.21.15.25:6823/37135","heartbeat_front_addr":"172.21.15.25:6821/37135","state":["exists","up"]},{"osd":6,"uuid":"adf4251
6-390e-46ac-9ed4-f70e4642c5c8","up":1,"in":1,"weight":1,"primary_affinity":1,"last_clean_begin":0,"last_clean_end":0,"up_from":13,"up_thru":0,"down_at":0,"lost_at":0,"public_addrs":{"addrvec":[{"type":"v2","addr":"172.21.15.25:6800","nonce":37118},{"type":"v1","addr":"172.21.15.25:6801","nonce":37118}]},"cluster_addrs":{"addrvec":[{"type":"v2","addr":"172.21.15.25:6802","nonce":37118},{"type":"v1","addr":"172.21.15.25:6803","nonce":37118}]},"heartbeat_back_addrs":{"addrvec":[{"type":"v2","addr":"172.21.15.25:6806","nonce":37118},{"type":"v1","addr":"172.21.15.25:6807","nonce":37118}]},"heartbeat_front_addrs":{"addrvec":[{"type":"v2","addr":"172.21.15.25:6804","nonce":37118},{"type":"v1","addr":"172.21.15.25:6805","nonce":37118}]},"public_addr":"172.21.15.25:6801/37118","cluster_addr":"172.21.15.25:6803/37118","heartbeat_back_addr":"172.21.15.25:6807/37118","heartbeat_front_addr":"172.21.15.25:6805/37118","state":["exists","up"]},{"osd":7,"uuid":"ca84ff07-273e-4a21-9ac7-e8761c0e1264","up":1,"in":1,"weight":1,"primary_affinity":1,"last_clean_begin":0,"last_clean_end":0,"up_from":13,"up_thru":0,"down_at":0,"lost_at":0,"public_addrs":{"addrvec":[{"type":"v2","addr":"172.21.15.25:6824","nonce":37122},{"type":"v1","addr":"172.21.15.25:6825","nonce":37122}]},"cluster_addrs":{"addrvec":[{"type":"v2","addr":"172.21.15.25:6826","nonce":37122},{"type":"v1","addr":"172.21.15.25:6827","nonce":37122}]},"heartbeat_back_addrs":{"addrvec":[{"type":"v2","addr":"172.21.15.25:6830","nonce":37122},{"type":"v1","addr":"172.21.15.25:6831","nonce":37122}]},"heartbeat_front_addrs":{"addrvec":[{"type":"v2","addr":"172.21.15.25:6828","nonce":37122},{"type":"v1","addr":"172.21.15.25:6829","nonce":37122}]},"public_addr":"172.21.15.25:6825/37122","cluster_addr":"172.21.15.25:6827/37122","heartbeat_back_addr":"172.21.15.25:6831/37122","heartbeat_front_addr":"172.21.15.25:6829/37122","state":["exists","up"]}],"osd_xinfo":[{"osd":0,"down_stamp":"0.000000","laggy_probability":0,"laggy_interval":0,"f
eatures":4540138303579357183,"old_weight":0,"last_purged_snaps_scrub":"2021-10-15T03:12:02.877193+0000","dead_epoch":0},{"osd":1,"down_stamp":"0.000000","laggy_probability":0,"laggy_interval":0,"features":4540138303579357183,"old_weight":0,"last_purged_snaps_scrub":"2021-10-15T03:12:02.903567+0000","dead_epoch":0},{"osd":2,"down_stamp":"0.000000","laggy_probability":0,"laggy_interval":0,"features":4540138303579357183,"old_weight":0,"last_purged_snaps_scrub":"2021-10-15T03:12:02.901545+0000","dead_epoch":0},{"osd":3,"down_stamp":"0.000000","laggy_probability":0,"laggy_interval":0,"features":0,"old_weight":65536,"last_purged_snaps_scrub":"0.000000","dead_epoch":0},{"osd":4,"down_stamp":"0.000000","laggy_probability":0,"laggy_interval":0,"features":4540138303579357183,"old_weight":0,"last_purged_snaps_scrub":"2021-10-15T03:12:02.855567+0000","dead_epoch":0},{"osd":5,"down_stamp":"0.000000","laggy_probability":0,"laggy_interval":0,"features":4540138303579357183,"old_weight":0,"last_purged_snaps_scrub":"2021-10-15T03:12:02.657142+0000","dead_epoch":0},{"osd":6,"down_stamp":"0.000000","laggy_probability":0,"laggy_interval":0,"features":4540138303579357183,"old_weight":0,"last_purged_snaps_scrub":"2021-10-15T03:12:02.826109+0000","dead_epoch":0},{"osd":7,"down_stamp":"0.000000","laggy_probability":0,"laggy_interval":0,"features":4540138303579357183,"old_weight":0,"last_purged_snaps_scrub":"2021-10-15T03:12:02.930599+0000","dead_epoch":0}],"pg_upmap":[],"pg_upmap_items":[],"pg_temp":[],"primary_temp":[],"blocklist":{},"erasure_code_profiles":{"default":{"crush-failure-domain":"osd","k":"2","m":"1","plugin":"jerasure","technique":"reed_sol_van"}},"removed_snaps_queue":[],"new_removed_snaps":[],"new_purged_snaps":[],"crush_node_flags":{},"device_class_flags":{},"stretch_mode":{"stretch_mode_enabled":false,"stretch_bucket_count":0,"degraded_stretch_mode":0,"recovering_stretch_mode":0,"stretch_mode_bucket":0}}
...
2021-10-15T03:22:32.101 DEBUG:teuthology.misc:7 of 8 OSDs are up
2021-10-15T03:22:32.102 ERROR:teuthology.contextutil:Saw exception from nested tasks
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_423bd94291582cb44cbfac84a53ec84580eb3f08/teuthology/contextutil.py", line 31, in nested
    vars.append(enter())
  File "/usr/lib/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_9bcded95726562b598e8b1357a01cf828b42f563/qa/tasks/ceph.py", line 403, in create_rbd_pool
    ceph_cluster=cluster_name,
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_423bd94291582cb44cbfac84a53ec84580eb3f08/teuthology/misc.py", line 858, in wait_until_osds_up
    while proceed():
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_423bd94291582cb44cbfac84a53ec84580eb3f08/teuthology/contextutil.py", line 133, in __call__
    raise MaxWhileTries(error_msg)
teuthology.exceptions.MaxWhileTries: reached maximum tries (90) after waiting for 540 seconds

From: /ceph/teuthology-archive/pdonnell-2021-10-15_02:49:27-fs-wip-pdonnell-testing-20211012.192211-distro-basic-smithi/6443932/teuthology.log
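For reference, the "7 of 8" count can be read straight out of an osdmap dump like the one above (osd.3 has "up":0 and state ["autoout","exists","new"]). A small sketch against a stand-in of that JSON:

```python
import json

def down_osds(osdmap_json: str):
    """Return (up_count, total, [ids of OSDs not up]) from an osdmap dump."""
    osds = json.loads(osdmap_json)["osds"]
    down = [o["osd"] for o in osds if not o["up"]]
    return len(osds) - len(down), len(osds), down

# Minimal stand-in for the dump above: only osd.3 is down.
sample = json.dumps({"osds": [
    {"osd": 0, "up": 1}, {"osd": 1, "up": 1}, {"osd": 2, "up": 1},
    {"osd": 3, "up": 0}, {"osd": 4, "up": 1}, {"osd": 5, "up": 1},
    {"osd": 6, "up": 1}, {"osd": 7, "up": 1}]})
print(down_osds(sample))  # → (7, 8, [3])
```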

OSD log:

/ceph/teuthology-archive/pdonnell-2021-10-15_02:49:27-fs-wip-pdonnell-testing-20211012.192211-distro-basic-smithi/6443932/remote/smithi089/log/ceph-osd.3.log.gz

Other jobs affected:

Failure: reached maximum tries (90) after waiting for 540 seconds
4 jobs: ['6443932', '6443950', '6443942', '6443924']
suites intersection: ['2-workunit/suites/ffsb}}', 'clusters/1a5s-mds-1c-client', 'conf/{client', 'fs/thrash/workloads/{begin', 'mds', 'mon', 'msgr-failures/osd-mds-delay', 'osd}', 'overrides/{frag', 'ranks/3', 'session_timeout', 'thrashosds-health', 'whitelist_health', 'whitelist_wrongly_marked_down}']
suites union: ['2-workunit/suites/ffsb}}', 'clusters/1a5s-mds-1c-client', 'conf/{client', 'distro/{centos_8.stream}', 'distro/{centos_8}', 'distro/{rhel_8}', 'fs/thrash/workloads/{begin', 'k-testing}', 'mds', 'mon', 'mount/fuse', 'mount/kclient/{mount', 'ms-die-on-skipped}}', 'msgr-failures/osd-mds-delay', 'objectstore-ec/bluestore-comp-ec-root', 'objectstore-ec/bluestore-ec-root', 'osd}', 'overrides/{distro/testing/{flavor/centos_latest', 'overrides/{frag', 'ranks/3', 'session_timeout', 'tasks/{1-thrash/mds', 'tasks/{1-thrash/mon', 'tasks/{1-thrash/osd', 'thrashosds-health', 'whitelist_health', 'whitelist_wrongly_marked_down}']
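The MaxWhileTries failure is teuthology's bounded polling loop giving up: 90 tries at what works out to 6-second intervals is the 540 seconds in the message. A minimal sketch of the pattern (names and defaults here are illustrative, not teuthology's exact API):

```python
import time

class MaxWhileTries(Exception):
    """Raised when the bounded wait loop exhausts its tries."""

def wait_until(predicate, tries=90, sleep=6):
    """Poll `predicate` up to `tries` times, sleeping `sleep` seconds
    between attempts; returns the attempt index that succeeded."""
    for attempt in range(tries):
        if predicate():
            return attempt
        time.sleep(sleep)
    raise MaxWhileTries(
        f"reached maximum tries ({tries}) "
        f"after waiting for {tries * sleep} seconds")
```

With osd.3 never coming up, the predicate "all OSDs up" stays false on every attempt and the loop raises, which is the traceback above.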

History

#1 Updated by Neha Ojha over 1 year ago

This could be related to Gabi's work on removing allocation metadata from RocksDB; I will have him verify.

Patrick: any idea when this first started showing up?

#2 Updated by Patrick Donnelly over 1 year ago

Neha Ojha wrote:

This could be related to Gabi's work on removing allocation metadata from RocksDB; I will have him verify.

Patrick: any idea when this first started showing up?

The first occurrence I can find is:

/ceph/teuthology-archive/pdonnell-2021-10-02_19:36:16-fs-wip-pdonnell-testing-20211002.163337-distro-basic-smithi/6418203/teuthology.log

#3 Updated by Deepika Upadhyay over 1 year ago

2021-10-26T09:37:02.937 DEBUG:teuthology.misc:5 of 6 OSDs are up
2021-10-26T09:37:02.938 ERROR:teuthology.contextutil:Saw exception from nested tasks
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_c56135d151713269e811ede3163c9743c2e269de/teuthology/contextutil.py", line 31, in nested
    vars.append(enter())
  File "/usr/lib/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/home/teuthworker/src/github.com_ceph_ceph-c_0efa758f5dd904ae1b1b1db0654cffb844da3f9f/qa/tasks/ceph.py", line 403, in create_rbd_pool
    ceph_cluster=cluster_name,
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_c56135d151713269e811ede3163c9743c2e269de/teuthology/misc.py", line 858, in wait_until_osds_up
    while proceed():
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_c56135d151713269e811ede3163c9743c2e269de/teuthology/contextutil.py", line 133, in __call__
    raise MaxWhileTries(error_msg)
teuthology.exceptions.MaxWhileTries: reached maximum tries (90) after waiting for 540 seconds
2021-10-26T09:37:02.938 INFO:teuthology.misc:Shutting down osd daemons...

@Mykola, it looks like we were seeing this failure in the test run for https://pulpito.ceph.com/trociny-2021-10-26_08:59:17-rbd-wip-mgolub-testing-distro-basic-smithi/
http://qa-proxy.ceph.com/teuthology/trociny-2021-10-26_08:59:17-rbd-wip-mgolub-testing-distro-basic-smithi/6462007/teuthology.log

#4 Updated by Neha Ojha about 1 year ago

  • Assignee set to Sridhar Seshasayee

Sridhar has started looking into this.

#5 Updated by Sridhar Seshasayee about 1 year ago

Analysis of logs from JobID: 6443924

During the "ceph" teuthology task, osd.3 never completed initialization. As part of OSD initialization,
the "osd bench" test is run to determine the OSD's overall IOPS capacity. The test succeeded, and the
next step is to persist the result in the mon's config store: the "config set" command is executed to
store the osd_mclock_max_capacity_iops_ssd config value. But before the command from osd.3 could be
ack'd by the mon, a socket error injection was introduced between osd.3 and mon.b (the leader):

2021-10-15T03:29:39.763+0000 7f38f4dee0c0  1 osd.3 0  bench count 12288000 bsize 4 KiB
...
2021-10-15T03:29:48.062+0000 7f38f4dee0c0  1 osd.3 0 maybe_override_max_osd_capacity_for_qos osd bench result - bandwidth (MiB/sec): 3.340 iops: 854.959 elapsed_sec: 3.509
2021-10-15T03:29:48.062+0000 7f38f4dee0c0  1 -- [v2:172.21.15.72:6808/60999,v1:172.21.15.72:6809/60999] --> [v2:172.21.15.37:3300/0,v1:172.21.15.37:6789/0] -- mon_command([{prefix=config set, name=osd_mclock_max_capacity_iops_ssd}] v 0) v1 -- 0x56395f2cc900 con 0x56395ef45c00
...
2021-10-15T03:29:48.066+0000 7f38ee7b3700  0 -- [v2:172.21.15.72:6808/60999,v1:172.21.15.72:6809/60999] >> [v2:172.21.15.37:3300/0,v1:172.21.15.37:6789/0] conn(0x56395ef45c00 msgr2=0x56395e1dc680 secure :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_until injecting socket failure
2021-10-15T03:29:48.066+0000 7f38ee7b3700  1 -- [v2:172.21.15.72:6808/60999,v1:172.21.15.72:6809/60999] >> [v2:172.21.15.37:3300/0,v1:172.21.15.37:6789/0] conn(0x56395ef45c00 msgr2=0x56395e1dc680 secure :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_bulk peer close file descriptor 57
2021-10-15T03:29:48.066+0000 7f38ee7b3700  1 -- [v2:172.21.15.72:6808/60999,v1:172.21.15.72:6809/60999] >> [v2:172.21.15.37:3300/0,v1:172.21.15.37:6789/0] conn(0x56395ef45c00 msgr2=0x56395e1dc680 secure :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_until read failed
2021-10-15T03:29:48.066+0000 7f38da71f700  1 -- [v2:172.21.15.72:6808/60999,v1:172.21.15.72:6809/60999] <== mon.0 v2:172.21.15.37:3300/0 10 ==== config(1 keys) v1 ==== 54+0+0 (secure 0 0 0) 0x56395f2cc900 con 0x56395ef45c00
2021-10-15T03:29:48.066+0000 7f38ee7b3700  1 --2- [v2:172.21.15.72:6808/60999,v1:172.21.15.72:6809/60999] >> [v2:172.21.15.37:3300/0,v1:172.21.15.37:6789/0] conn(0x56395ef45c00 0x56395e1dc680 secure :-1 s=READY pgs=35 cs=0 l=1 rev1=1 crypto rx=0x56395f2d1b60 tx=0x56395e161aa0 comp rx=0 tx=0).handle_read_frame_preamble_main read frame preamble failed r=-1 ((1) Operation not permitted)
2021-10-15T03:29:48.066+0000 7f38ee7b3700  1 --2- [v2:172.21.15.72:6808/60999,v1:172.21.15.72:6809/60999] >> [v2:172.21.15.37:3300/0,v1:172.21.15.37:6789/0] conn(0x56395ef45c00 0x56395e1dc680 secure :-1 s=READY pgs=35 cs=0 l=1 rev1=1 crypto rx=0x56395f2d1b60 tx=0x56395e161aa0 comp rx=0 tx=0).stop
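As a sanity check, the bench figures in the osd.3 log above are internally consistent: "osd bench" wrote 12288000 bytes in 4 KiB blocks, so the reported IOPS and bandwidth can be recomputed from the logged elapsed time (the small residual is just rounding of elapsed_sec in the log line):

```python
count = 12288000      # total bytes written by "osd bench"
bsize = 4096          # 4 KiB block size
elapsed = 3.509       # elapsed_sec as logged (rounded)

iops = (count / bsize) / elapsed             # ~854.9, logged 854.959
bandwidth_mib = count / elapsed / (1 << 20)  # ~3.340 MiB/s, logged 3.340
```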

The connection between osd.3 and mon.b was stopped.
The logs below show the connection being reset on mon.b around the same time.

mon.b logs (leader):

2021-10-15T03:29:48.063+0000 7f4df4965700  1 -- [v2:172.21.15.37:3300/0,v1:172.21.15.37:6789/0] <== osd.3 v2:172.21.15.72:6808/60999 9 ==== mon_command([{prefix=config set, name=osd_mclock_max_capacity_iops_ssd}] v 0) v1 ==== 149+0+0 (secure 0 0 0) 0x5600451e2c00 con 0x560044b10400
2021-10-15T03:29:48.063+0000 7f4df4965700 20 mon.b@0(leader) e1 _ms_dispatch existing session 0x560044fc1b00 for osd.3
2021-10-15T03:29:48.063+0000 7f4df4965700 20 mon.b@0(leader) e1  entity_name osd.3 global_id 4239 (new_ok) caps allow profile osd
2021-10-15T03:29:48.063+0000 7f4df4965700  0 mon.b@0(leader) e1 handle_command mon_command([{prefix=config set, name=osd_mclock_max_capacity_iops_ssd}] v 0) v1
2021-10-15T03:29:48.063+0000 7f4df4965700 20 is_capable service=config command=config set read write addr v2:172.21.15.72:6808/60999 on cap allow profile osd
2021-10-15T03:29:48.063+0000 7f4df4965700 20  allow so far , doing grant allow profile osd
2021-10-15T03:29:48.063+0000 7f4df4965700 20  match
2021-10-15T03:29:48.063+0000 7f4df4965700 10 mon.b@0(leader) e1 _allowed_command capable
2021-10-15T03:29:48.063+0000 7f4df4965700 10 mon.b@0(leader).paxosservice(config 1..7) dispatch 0x5600451e2c00 mon_command([{prefix=config set, name=osd_mclock_max_capacity_iops_ssd}] v 0) v1 from osd.3 v2:172.21.15.72:6808/60999 con 0x560044b10400
2021-10-15T03:29:48.063+0000 7f4df4965700  5 mon.b@0(leader).paxos(paxos active c 1..48) is_readable = 1 - now=2021-10-15T03:29:48.064280+0000 lease_expire=2021-10-15T03:29:53.042698+0000 has v0 lc 48
2021-10-15T03:29:48.063+0000 7f4df4965700  7 mon.b@0(leader).config prepare_update mon_command([{prefix=config set, name=osd_mclock_max_capacity_iops_ssd}] v 0) v1 from osd.3 v2:172.21.15.72:6808/60999
2021-10-15T03:29:48.064+0000 7f4df4965700 20 mon.b@0(leader).config encode_pending_to_kvmon set config/osd.3/osd_mclock_max_capacity_iops_ssd
2021-10-15T03:29:48.064+0000 7f4df4965700 10 mon.b@0(leader).paxosservice(kv 1..9) propose_pending
2021-10-15T03:29:48.064+0000 7f4df4965700 10 mon.b@0(leader).kv  10
2021-10-15T03:29:48.064+0000 7f4df4965700 20 mon.b@0(leader).kv encode_pending set config-history/8/
2021-10-15T03:29:48.064+0000 7f4df4965700 20 mon.b@0(leader).kv encode_pending set config-history/8/+osd.3/osd_mclock_max_capacity_iops_ssd
2021-10-15T03:29:48.064+0000 7f4df4965700 20 mon.b@0(leader).kv encode_pending set config/osd.3/osd_mclock_max_capacity_iops_ssd

...

2021-10-15T03:29:48.066+0000 7f4df2160700 10 mon.b@0(leader).config maybe_send_config to osd.3 (changed)
2021-10-15T03:29:48.066+0000 7f4df2160700 10 mon.b@0(leader).config send_config to osd.3
2021-10-15T03:29:48.066+0000 7f4df2160700  1 -- [v2:172.21.15.37:3300/0,v1:172.21.15.37:6789/0] --> [v2:172.21.15.72:6808/60999,v1:172.21.15.72:6809/60999] -- config(1 keys) v1 -- 0x5600451e3e00 con 0x560044b10400

...

2021-10-15T03:29:48.067+0000 7f4df2160700  0 log_channel(audit) log [INF] : from='osd.3 [v2:172.21.15.72:6808/60999,v1:172.21.15.72:6809/60999]' entity='osd.3'
2021-10-15T03:29:48.067+0000 7f4df896d700  1 -- [v2:172.21.15.37:3300/0,v1:172.21.15.37:6789/0] >> [v2:172.21.15.72:6808/60999,v1:172.21.15.72:6809/60999] conn(0x560044b10400 msgr2=0x560043c5c580 secure :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_bulk peer close file descriptor 46
2021-10-15T03:29:48.067+0000 7f4df896d700  1 -- [v2:172.21.15.37:3300/0,v1:172.21.15.37:6789/0] >> [v2:172.21.15.72:6808/60999,v1:172.21.15.72:6809/60999] conn(0x560044b10400 msgr2=0x560043c5c580 secure :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_until read failed
2021-10-15T03:29:48.067+0000 7f4df2160700  1 -- [v2:172.21.15.37:3300/0,v1:172.21.15.37:6789/0] --> [v2:172.21.15.37:3300/0,v1:172.21.15.37:6789/0] -- log(1 entries from seq 209 at 2021-10-15T03:29:48.067497+0000) v1 -- 0x5600451e76c0 con 0x560043c1a800
2021-10-15T03:29:48.067+0000 7f4df896d700  1 --2- [v2:172.21.15.37:3300/0,v1:172.21.15.37:6789/0] >> [v2:172.21.15.72:6808/60999,v1:172.21.15.72:6809/60999] conn(0x560044b10400 0x560043c5c580 secure :-1 s=READY pgs=2 cs=0 l=1 rev1=1 crypto rx=0x5600452dec60 tx=0x560044b252c0 comp rx=0 tx=0).handle_read_frame_preamble_main read frame preamble failed r=-1 ((1) Operation not permitted)
2021-10-15T03:29:48.067+0000 7f4df896d700  1 --2- [v2:172.21.15.37:3300/0,v1:172.21.15.37:6789/0] >> [v2:172.21.15.72:6808/60999,v1:172.21.15.72:6809/60999] conn(0x560044b10400 0x560043c5c580 secure :-1 s=READY pgs=2 cs=0 l=1 rev1=1 crypto rx=0x5600452dec60 tx=0x560044b252c0 comp rx=0 tx=0).stop
2021-10-15T03:29:48.067+0000 7f4df2160700  2 mon.b@0(leader) e1 send_reply 0x56004539bc20 0x5600456ee9c0 mon_command_ack([{prefix=config set, name=osd_mclock_max_capacity_iops_ssd}]=0  v8)=0  v8) v1
2021-10-15T03:29:48.067+0000 7f4df2160700  1 -- [v2:172.21.15.37:3300/0,v1:172.21.15.37:6789/0] --> [v2:172.21.15.72:6808/60999,v1:172.21.15.72:6809/60999] -- mon_command_ack([{prefix=config set, name=osd_mclock_max_capacity_iops_ssd}]=0  v8)=0  v8) v1 -- 0x5600456ee9c0 con 0x560044b10400
2021-10-15T03:29:48.067+0000 7f4df2160700 20 mon.b@0(leader).paxos(paxos active c 1..49) finish_round waiting_for_readable
2021-10-15T03:29:48.067+0000 7f4df4965700 10 mon.b@0(leader) e1 ms_handle_reset 0x560044b10400 v2:172.21.15.72:6808/60999
2021-10-15T03:29:48.067+0000 7f4df2160700 20 mon.b@0(leader).paxos(paxos active c 1..49) finish_round waiting_for_writeable
2021-10-15T03:29:48.067+0000 7f4df2160700 10 mon.b@0(leader).paxos(paxos active c 1..49) finish_round done w/ waiters, state active
2021-10-15T03:29:48.067+0000 7f4df4965700 10 mon.b@0(leader) e1 reset/close on session osd.3 [v2:172.21.15.72:6808/60999,v1:172.21.15.72:6809/60999]
2021-10-15T03:29:48.067+0000 7f4df4965700 10 mon.b@0(leader) e1 remove_session 0x560044fc1b00 osd.3 [v2:172.21.15.72:6808/60999,v1:172.21.15.72:6809/60999] features 0x3f01cfbb7ffdfff

The expectation, however, is that a new session would be established and an acknowledgement sent back
to osd.3. The mon.b logs do show a new session being established and a mon_command_ack sent back on
that session, but osd.3 never shows the command being ack'd. The subsequent auth requests from osd.3
are handled by mon.b, yet osd.3 never receives an "auth_reply".
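As a toy illustration (plain Python, not Ceph code) of why the lost ack matters: this init step is effectively a synchronous wait on the mon_command_ack, so if the ack is never received the OSD stays parked in initialization instead of failing fast:

```python
import threading

ack = threading.Event()  # would be set when mon_command_ack arrives

def wait_for_ack(timeout=None):
    """Block the init step until the ack arrives; with no timeout
    (the default), a dropped ack hangs initialization forever."""
    return ack.wait(timeout)

# The ack was sent on the new session but never received, so:
assert wait_for_ack(timeout=0.1) is False  # init stays blocked
```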

The logs below show the new session being established and auth_replies sent from mon.b.

New connection established with osd.3:


2021-10-15T03:29:48.069+0000 7f4df2961700  1 --2- [v2:172.21.15.37:3300/0,v1:172.21.15.37:6789/0] >>  conn(0x560045083800 0x560043c98100 unknown :-1 s=NONE pgs=0 cs=0 l=0 rev1=0 crypto rx=0 tx=0 comp rx=0 tx=0).accept
2021-10-15T03:29:48.069+0000 7f4df896d700  1 --2- [v2:172.21.15.37:3300/0,v1:172.21.15.37:6789/0] >>  conn(0x560045083800 0x560043c98100 unknown :-1 s=BANNER_ACCEPTING pgs=0 cs=0 l=0 rev1=0 crypto rx=0 tx=0 comp rx=0 tx=0)._handle_peer_banner_payload supported=3 required=0
2021-10-15T03:29:48.069+0000 7f4df896d700 10 mon.b@0(leader) e1 handle_auth_request con 0x560045083800 (start) method 2 payload 18
2021-10-15T03:29:48.069+0000 7f4df896d700 10 start_session entity_name=osd.3 global_id=4239 is_new_global_id=0
2021-10-15T03:29:48.069+0000 7f4df896d700 10 cephx server osd.3: start_session server_challenge 7c53f5b3e701e9ce
2021-10-15T03:29:48.069+0000 7f4df896d700 10 mon.b@0(leader) e1 handle_auth_request con 0x560045083800 (more) method 2 payload 132
2021-10-15T03:29:48.069+0000 7f4df896d700 10 cephx server osd.3: handle_request get_auth_session_key for osd.3
2021-10-15T03:29:48.069+0000 7f4df896d700 20 cephx server osd.3:  checking key: req.key=d18a67fbd00bd99e expected_key=d18a67fbd00bd99e
2021-10-15T03:29:48.069+0000 7f4df896d700 20 cephx server osd.3:  checking old_ticket: secret_id=2 len=96, old_ticket_may_be_omitted=0
2021-10-15T03:29:48.069+0000 7f4df896d700 20 cephx server osd.3:  decoded old_ticket: global_id=4239
2021-10-15T03:29:48.069+0000 7f4df896d700 10 cephx server osd.3:  allowing reclaim of global_id 4239 (valid ticket presented, will encrypt new ticket)
2021-10-15T03:29:48.069+0000 7f4df896d700 10 cephx: build_service_ticket_reply encoding 1 tickets with secret AQCA9WhhbisQIhAAK4Yqds9QJXxOfq5SVi3eKA==
2021-10-15T03:29:48.069+0000 7f4df896d700 10 cephx: build_service_ticket service auth secret_id 2 ticket_info.ticket.name=osd.3 ticket.global_id 4239
2021-10-15T03:29:48.069+0000 7f4df896d700 10 cephx keyserverdata: get_caps: name=osd.3
2021-10-15T03:29:48.069+0000 7f4df896d700 10 cephx keyserverdata: get_secret: num of caps=3
2021-10-15T03:29:48.069+0000 7f4df896d700 10 mon.b@0(leader) e1 ms_handle_authentication session 0x5600449b1d40 con 0x560045083800 addr - MonSession(unknown.0  is open , features 0x0 (unknown))
2021-10-15T03:29:48.069+0000 7f4df4965700 10 mon.b@0(leader) e1 ms_handle_accept con 0x560045083800 session 0x5600449b1d40 registering session for [v2:172.21.15.72:6808/60999,v1:172.21.15.72:6809/60999]
2021-10-15T03:29:48.069+0000 7f4df896d700  1 --2- [v2:172.21.15.37:3300/0,v1:172.21.15.37:6789/0] >> [v2:172.21.15.72:6808/60999,v1:172.21.15.72:6809/60999] conn(0x560045083800 0x560043c98100 secure :-1 s=READY pgs=14 cs=0 l=1 rev1=1 crypto rx=0x56004511bd70 tx=0x5600449ad3e0 comp rx=0 tx=0).ready entity=osd.3 client_cookie=0 server_cookie=0 in_seq=0 out_seq=0
2021-10-15T03:29:48.070+0000 7f4df4965700  1 -- [v2:172.21.15.37:3300/0,v1:172.21.15.37:6789/0] <== osd.3 v2:172.21.15.72:6808/60999 1 ==== mon_getmap magic: 0 v1 ==== 0+0+0 (secure 0 0 0) 0x560045022280 con 0x560045083800
2021-10-15T03:29:48.070+0000 7f4df4965700 20 mon.b@0(leader) e1 _ms_dispatch existing session 0x5600449b1d40 for osd.3
2021-10-15T03:29:48.070+0000 7f4df4965700 20 mon.b@0(leader) e1  entity_name osd.3 global_id 4239 (reclaim_ok) caps allow profile osd

...

2021-10-15T03:29:48.070+0000 7f4df4965700  1 -- [v2:172.21.15.37:3300/0,v1:172.21.15.37:6789/0] <== osd.3 v2:172.21.15.72:6808/60999 3 ==== mon_command([{prefix=config set, name=osd_mclock_max_capacity_iops_ssd}] v 0) v1 ==== 149+0+0 (secure 0 0 0) 0x5600451e3e00 con 0x560045083800
...
2021-10-15T03:29:48.070+0000 7f4df4965700 20 mon.b@0(leader) e1  entity_name osd.3 global_id 4239 (reclaim_ok) caps allow profile osd
2021-10-15T03:29:48.070+0000 7f4df4965700  0 mon.b@0(leader) e1 handle_command mon_command([{prefix=config set, name=osd_mclock_max_capacity_iops_ssd}] v 0) v1
2021-10-15T03:29:48.070+0000 7f4df4965700 20 is_capable service=config command=config set read write addr v2:172.21.15.72:6808/60999 on cap allow profile osd
2021-10-15T03:29:48.070+0000 7f4df4965700 20  allow so far , doing grant allow profile osd
2021-10-15T03:29:48.070+0000 7f4df4965700 20  match
2021-10-15T03:29:48.070+0000 7f4df4965700 10 mon.b@0(leader) e1 _allowed_command capable
2021-10-15T03:29:48.070+0000 7f4df4965700 10 mon.b@0(leader).paxosservice(config 1..8) dispatch 0x5600451e3e00 mon_command([{prefix=config set, name=osd_mclock_max_capacity_iops_ssd}] v 0) v1 from osd.3 v2:172.21.15.72:6808/60999 con 0x560045083800
2021-10-15T03:29:48.070+0000 7f4df4965700  5 mon.b@0(leader).paxos(paxos active c 1..49) is_readable = 1 - now=2021-10-15T03:29:48.070993+0000 lease_expire=2021-10-15T03:29:53.067405+0000 has v0 lc 49
2021-10-15T03:29:48.070+0000 7f4df4965700  7 mon.b@0(leader).config prepare_update mon_command([{prefix=config set, name=osd_mclock_max_capacity_iops_ssd}] v 0) v1 from osd.3 v2:172.21.15.72:6808/60999
2021-10-15T03:29:48.070+0000 7f4df4965700  2 mon.b@0(leader) e1 send_reply 0x56004539bef0 0x5600453769c0 mon_command_ack([{prefix=config set, name=osd_mclock_max_capacity_iops_ssd}]=0  v8)=0  v8) v1
2021-10-15T03:29:48.070+0000 7f4df4965700  1 -- [v2:172.21.15.37:3300/0,v1:172.21.15.37:6789/0] --> [v2:172.21.15.72:6808/60999,v1:172.21.15.72:6809/60999] -- mon_command_ack([{prefix=config set, name=osd_mclock_max_capacity_iops_ssd}]=0  v8)=0  v8) v1 -- 0x5600453769c0 con 0x560045083800
2021-10-15T03:29:48.071+0000 7f4df4965700  1 -- [v2:172.21.15.37:3300/0,v1:172.21.15.37:6789/0] <== mon.2 v2:172.21.15.37:3301/0 199 ==== paxos(lease_ack lc 49 fc 1 pn 0 opn 0) v4 ==== 190+0+0 (secure 0 0 0) 0x5600449aac00 con 0x560044abc000

mon.b keeps considering osd.3 down, as shown below:


2021-10-15T03:29:50.655+0000 7f4df716a700 20 mon.b@0(leader).osd e17 osd.3 laggy halflife 3600 decay_k -0.000192541 down for 5.001168 decay 0.999038
...
2021-10-15T03:29:55.656+0000 7f4df716a700 20 mon.b@0(leader).osd e17 osd.3 laggy halflife 3600 decay_k -0.000192541 down for 10.002136 decay 0.998076
...
2021-10-15T03:30:00.657+0000 7f4df716a700 20 mon.b@0(leader).osd e17 osd.3 laggy halflife 3600 decay_k -0.000192541 down for 15.003203 decay 0.997115
...
2021-10-15T03:32:35.685+0000 7f4df716a700 20 mon.b@0(leader).osd e17 osd.3 laggy halflife 3600 decay_k -0.000192541 down for 170.030632 decay 0.967792
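
The decay values in these lines follow a plain exponential half-life relation: decay_k = ln(0.5)/halflife and decay = exp(decay_k × down_for). A quick sketch in plain Python (an illustration of the logged arithmetic, not Ceph code) reproduces the logged figures:

```python
import math

# Values taken from the log lines above: halflife 3600 s, decay_k -0.000192541
halflife = 3600.0
decay_k = math.log(0.5) / halflife   # ln(0.5)/3600 = -0.000192541

def laggy_decay(down_for):
    """Decay factor applied to the OSD's laggy probability after it
    has been down for `down_for` seconds (exponential half-life)."""
    return math.exp(decay_k * down_for)

print(round(laggy_decay(5.001168), 6))     # 0.999038
print(round(laggy_decay(10.002136), 6))    # 0.998076
print(round(laggy_decay(170.030632), 6))   # 0.967792
```

In other words, the monitor's laggy estimate for osd.3 decays as expected; the anomaly is that the OSD stays marked down at all, not the arithmetic.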

Logs showing the auth message being handled and an auth_reply sent back:

2021-10-15T03:32:38.574+0000 7f4df4965700  1 -- [v2:172.21.15.37:3300/0,v1:172.21.15.37:6789/0] <== osd.3 v2:172.21.15.72:6808/60999 4 ==== auth(proto 2 165 bytes epoch 0) v1 ==== 195+0+0 (secure 0 0 0) 0x5600456d9200 con 0x560045083800
2021-10-15T03:32:38.574+0000 7f4df4965700 20 mon.b@0(leader) e1 _ms_dispatch existing session 0x5600449b1d40 for osd.3
2021-10-15T03:32:38.574+0000 7f4df4965700 20 mon.b@0(leader) e1  entity_name osd.3 global_id 4239 (reclaim_ok) caps allow profile osd
2021-10-15T03:32:38.574+0000 7f4df4965700 10 mon.b@0(leader).paxosservice(auth 1..2) dispatch 0x5600456d9200 auth(proto 2 165 bytes epoch 0) v1 from osd.3 v2:172.21.15.72:6808/60999 con 0x560045083800
2021-10-15T03:32:38.574+0000 7f4df4965700  5 mon.b@0(leader).paxos(paxos active c 1..275) is_readable = 1 - now=2021-10-15T03:32:38.575567+0000 lease_expire=2021-10-15T03:32:42.844034+0000 has v0 lc 275
2021-10-15T03:32:38.574+0000 7f4df4965700 10 mon.b@0(leader).auth v2 preprocess_query auth(proto 2 165 bytes epoch 0) v1 from osd.3 v2:172.21.15.72:6808/60999
2021-10-15T03:32:38.574+0000 7f4df4965700 10 mon.b@0(leader).auth v2 prep_auth() blob_size=165
2021-10-15T03:32:38.574+0000 7f4df4965700 10 cephx server osd.3: handle_request get_principal_session_key
2021-10-15T03:32:38.574+0000 7f4df4965700 10 cephx: verify_authorizer decrypted service auth secret_id=2
2021-10-15T03:32:38.574+0000 7f4df4965700 10 cephx: verify_authorizer global_id=4239
2021-10-15T03:32:38.574+0000 7f4df4965700 10 cephx: verify_authorizer ok nonce 9c278287ce64cbdf reply_bl.length()=36
2021-10-15T03:32:38.574+0000 7f4df4965700 10 cephx server osd.3:  ticket_req.keys = 21
2021-10-15T03:32:38.574+0000 7f4df4965700 10 cephx server osd.3:  adding key for service mon
2021-10-15T03:32:38.574+0000 7f4df4965700 10 cephx server osd.3:  adding key for service osd
2021-10-15T03:32:38.574+0000 7f4df4965700 10 cephx keyserverdata: get_caps: name=osd.3
2021-10-15T03:32:38.574+0000 7f4df4965700 10 cephx keyserverdata: get_secret: num of caps=3
2021-10-15T03:32:38.574+0000 7f4df4965700 10 cephx server osd.3:  adding key for service mgr
2021-10-15T03:32:38.574+0000 7f4df4965700 10 cephx keyserverdata: get_caps: name=osd.3
2021-10-15T03:32:38.574+0000 7f4df4965700 10 cephx keyserverdata: get_secret: num of caps=3
2021-10-15T03:32:38.574+0000 7f4df4965700 10 cephx: build_service_ticket_reply encoding 3 tickets with secret AQCs9Whhc4QpBBAAIwCLUBMeSOm/CesEw79xHA==
2021-10-15T03:32:38.574+0000 7f4df4965700 10 cephx: build_service_ticket service mon secret_id 2 ticket_info.ticket.name=osd.3 ticket.global_id 4239
2021-10-15T03:32:38.574+0000 7f4df4965700 10 cephx: build_service_ticket service osd secret_id 2 ticket_info.ticket.name=osd.3 ticket.global_id 4239
2021-10-15T03:32:38.574+0000 7f4df4965700 10 cephx: build_service_ticket service mgr secret_id 2 ticket_info.ticket.name=osd.3 ticket.global_id 4239
2021-10-15T03:32:38.574+0000 7f4df4965700  2 mon.b@0(leader) e1 send_reply 0x560045ac4780 0x560043ce3a00 auth_reply(proto 2 0 (0) Success) v1
2021-10-15T03:32:38.574+0000 7f4df4965700  1 -- [v2:172.21.15.37:3300/0,v1:172.21.15.37:6789/0] --> [v2:172.21.15.72:6808/60999,v1:172.21.15.72:6809/60999] -- auth_reply(proto 2 0 (0) Success) v1 -- 0x560043ce3a00 con 0x560045083800

osd.3 keeps resending the auth message indefinitely, as shown below, since it never receives an auth_reply:

2021-10-15T03:32:38.574+0000 7f38d971d700  1 -- [v2:172.21.15.72:6808/60999,v1:172.21.15.72:6809/60999] --> [v2:172.21.15.37:3300/0,v1:172.21.15.37:6789/0] -- auth(proto 2 165 bytes epoch 0) v1 -- 0x56395e0d9980 con 0x56395f55b400
2021-10-15T03:32:39.575+0000 7f38d971d700  1 -- [v2:172.21.15.72:6808/60999,v1:172.21.15.72:6809/60999] --> [v2:172.21.15.37:3300/0,v1:172.21.15.37:6789/0] -- auth(proto 2 165 bytes epoch 0) v1 -- 0x56395f2cc900 con 0x56395f55b400
2021-10-15T03:32:40.575+0000 7f38d971d700  1 -- [v2:172.21.15.72:6808/60999,v1:172.21.15.72:6809/60999] --> [v2:172.21.15.37:3300/0,v1:172.21.15.37:6789/0] -- auth(proto 2 165 bytes epoch 0) v1 -- 0x56395f2cd380 con 0x56395f55b400
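
The one-second cadence of these resends is a fixed-interval retry: the OSD keeps re-issuing the auth request because no auth_reply ever arrives. A minimal sketch of that retry pattern (plain Python, a hypothetical simplification — not Ceph's actual MonClient code, which resends on a timer rather than blocking):

```python
def send_auth_until_reply(send, wait_for_reply, max_tries=3):
    """Re-send an auth request until a reply arrives or max_tries is
    exhausted. Returns the reply, or None if every attempt times out
    (the None case is what the log above shows)."""
    for _ in range(max_tries):
        send()
        reply = wait_for_reply()
        if reply is not None:
            return reply
    return None

# Simulate the failure mode in the log: replies are never delivered.
sent = []
result = send_auth_until_reply(
    send=lambda: sent.append("auth(proto 2 165 bytes epoch 0)"),
    wait_for_reply=lambda: None,   # auth_reply never arrives
)
print(len(sent), result)   # 3 None
```

The mon-side log shows the auth_reply being built and sent each time, so the loss has to be between the monitor's send and the OSD's dispatch, which is why the msgr layer is the suspect.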

The above indicates that communication between osd.3 and the leader mon.b was never restored. This needs more investigation.

#8 Updated by Radoslaw Zarzynski 10 months ago

Hello Sridhar! Is there anything new? Have we discussed it already maybe?

#9 Updated by Sridhar Seshasayee 10 months ago

Radoslaw Zarzynski wrote:

Hello Sridhar! Is there anything new? Have we discussed it already maybe?

Hello Radek, yes, we briefly discussed my findings, and I recall that this was probably a msgr quirk where an auth_reply message was not received by the OSD. This needs some more analysis from the msgr perspective. I was hoping you could take a look and share your thoughts on this tracker. In any case, I am not sure whether we are still seeing this in the teuthology runs.

#10 Updated by Radoslaw Zarzynski 10 months ago

  • Priority changed from High to Normal

Lowering the priority as the last reproduction is 5 months old.

#11 Updated by Sridhar Seshasayee 8 months ago

Quick Update
This was again hit recently in
/a/yuriw-2022-06-09_03:58:30-smoke-quincy-release-distro-default-smithi/6869659

The issue appears to be the same: the leader mon (mon.a) considers an OSD down indefinitely after socket failures are induced. In the latest failure, osd.3 is considered down indefinitely by mon.a, as shown below:

2022-06-09T04:31:48.657+0000 7eff7a872700 20 mon.a@0(leader).osd e13 osd.3 laggy halflife 3600 decay_k -0.000192541 down for 595.217742 decay 0.891719
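
The decay figure here remains consistent with the exponential half-life relation decay = exp(ln(0.5)/3600 × down_for); a quick check in plain Python (an illustration of the logged arithmetic, not Ceph code):

```python
import math

# halflife 3600 s and down_for 595.217742 s, as logged above
decay_k = math.log(0.5) / 3600.0
print(round(math.exp(decay_k * 595.217742), 6))  # 0.891719, matching the log
# After a full halflife down (3600 s), the factor would reach exactly 0.5:
print(round(math.exp(decay_k * 3600.0), 6))      # 0.5
```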

This needs a deeper investigation from the msgr perspective. I will update this tracker if I find anything new as I look into the latest logs.

#13 Updated by Yuri Weinstein 8 months ago

Sridhar Seshasayee wrote:

Quick Update
This was again hit recently in
/a/yuriw-2022-06-09_03:58:30-smoke-quincy-release-distro-default-smithi/6869659

The issue appears to be the same: the leader mon (mon.a) considers an OSD down indefinitely after socket failures are induced. In the latest failure, osd.3 is considered down indefinitely by mon.a, as shown below:

[...]

This needs a deeper investigation from the msgr perspective. I will update this tracker if I find anything new as I look into the latest logs.

Seen in two runs:
http://pulpito.front.sepia.ceph.com/yuriw-2022-06-09_03:58:30-smoke-quincy-release-distro-default-smithi
http://pulpito.front.sepia.ceph.com/yuriw-2022-06-09_00:45:34-smoke-quincy-release-distro-default-smithi

#14 Updated by Aishwarya Mathuria 8 months ago

/a/yuriw-2022-06-09_03:58:30-smoke-quincy-release-distro-default-smithi/6869659/
Test description: smoke/basic/{clusters/{fixed-3-cephfs openstack} objectstore/bluestore-bitmap supported-random-distro$/{rhel_8} tasks/{0-install test/mon_thrash}}

Tried to reproduce this failure by running the same test in the smoke suite and saw it happen again (2 out of 10 jobs):
/a/amathuri-2022-06-10_10:08:28-smoke-quincy-release-distro-default-smithi/6872871
/a/amathuri-2022-06-10_10:08:28-smoke-quincy-release-distro-default-smithi/6872875
