Project

General

Profile

Actions

Bug #56574

open

rados/valgrind-leaks: cluster [WRN] Health check failed: 2 osds down (OSD_DOWN)" in cluster log

Added by Laura Flores almost 2 years ago. Updated over 1 year ago.

Status:
Need More Info
Priority:
Normal
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Description: rados/valgrind-leaks/{1-start 2-inject-leak/osd centos_latest}

/a/nojha-2022-07-14_20:32:09-rados-snapshot_key_conversion-distro-default-smithi/6931015

2022-07-14T23:01:41.552 INFO:teuthology.misc.health.smithi145.stdout:{"epoch":13,"fsid":"24de2949-251d-4169-919f-ec93cdf5f733","created":"2022-07-14T22:51:15.685917+0000","modified":"2022-07-14T22:52:15.383144+0000","last_up_change":"2022-07-14T22:52:11.619286+0000","last_in_change":"2022-07-14T22:51:19.303295+0000","flags":"sortbitwise,recovery_deletes,purged_snapdirs,pglog_hardlimit","flags_num":5799936,"flags_set":["pglog_hardlimit","purged_snapdirs","recovery_deletes","sortbitwise"],"crush_version":5,"full_ratio":0.94999998807907104,"backfillfull_ratio":0.89999997615814209,"nearfull_ratio":0.85000002384185791,"cluster_snapshot":"","pool_max":1,"max_osd":3,"require_min_compat_client":"luminous","min_compat_client":"jewel","require_osd_release":"quincy","pools":[{"pool":1,"pool_name":".mgr","create_time":"2022-07-14T22:52:12.379399+0000","flags":1,"flags_names":"hashpspool","type":1,"size":2,"min_size":1,"crush_rule":0,"peering_crush_bucket_count":0,"peering_crush_bucket_target":0,"peering_crush_bucket_barrier":0,"peering_crush_bucket_mandatory_member":2147483647,"object_hash":2,"pg_autoscale_mode":"off","pg_num":1,"pg_placement_num":1,"pg_placement_num_target":1,"pg_num_target":1,"pg_num_pending":1,"last_pg_merge_meta":{"source_pgid":"0.0","ready_epoch":0,"last_epoch_started":0,"last_epoch_clean":0,"source_version":"0'0","target_version":"0'0"},"last_change":"13","last_force_op_resend":"0","last_force_op_resend_prenautilus":"0","last_force_op_resend_preluminous":"0","auid":0,"snap_mode":"selfmanaged","snap_seq":0,"snap_epoch":0,"pool_snaps":[],"removed_snaps":"[]","quota_max_bytes":0,"quota_max_objects":0,"tiers":[],"tier_of":-1,"read_tier":-1,"write_tier":-1,"cache_mode":"none","target_max_bytes":0,"target_max_objects":0,"cache_target_dirty_ratio_micro":400000,"cache_target_dirty_high_ratio_micro":600000,"cache_target_full_ratio_micro":800000,"cache_min_flush_age":0,"cache_min_evict_age":0,"erasure_code_profile":"","hit_set_params":{"type":"none"},"hit_set_period":0,"hit_set_count":0,"use_gmt_hitset":true,"min_read_recency_for_promote":0,"min_write_recency_for_promote":0,"hit_set_grade_decay_rate":0,"hit_set_search_last_n":0,"grade_table":[],"stripe_width":0,"expected_num_objects":0,"fast_read":false,"options":{"pg_num_max":32,"pg_num_min":1},"application_metadata":{"mgr":{}}}],"osds":[{"osd":0,"uuid":"da4088a2-ef48-4e6e-bafe-29b8472b672c","up":1,"in":1,"weight":1,"primary_affinity":1,"last_clean_begin":0,"last_clean_end":0,"up_from":9,"up_thru":0,"down_at":0,"lost_at":0,"public_addrs":{"addrvec":[{"type":"v2","addr":"172.21.15.145:6810","nonce":103817},{"type":"v1","addr":"172.21.15.145:6811","nonce":103817}]},"cluster_addrs":{"addrvec":[{"type":"v2","addr":"172.21.15.145:6812","nonce":103817},{"type":"v1","addr":"172.21.15.145:6813","nonce":103817}]},"heartbeat_back_addrs":{"addrvec":[{"type":"v2","addr":"172.21.15.145:6816","nonce":103817},{"type":"v1","addr":"172.21.15.145:6817","nonce":103817}]},"heartbeat_front_addrs":{"addrvec":[{"type":"v2","addr":"172.21.15.145:6814","nonce":103817},{"type":"v1","addr":"172.21.15.145:6815","nonce":103817}]},"public_addr":"172.21.15.145:6811/103817","cluster_addr":"172.21.15.145:6813/103817","heartbeat_back_addr":"172.21.15.145:6817/103817","heartbeat_front_addr":"172.21.15.145:6815/103817","state":["exists","up"]},{"osd":1,"uuid":"5420bad2-daeb-4064-bb0f-a58aed6bab71","up":1,"in":1,"weight":1,"primary_affinity":1,"last_clean_begin":0,"last_clean_end":0,"up_from":10,"up_thru":11,"down_at":0,"lost_at":0,"public_addrs":{"addrvec":[{"type":"v2","addr":"172.21.15.145:6818","nonce":103818},{"type":"v1","addr":"172.21.15.145:6819","nonce":103818}]},"cluster_addrs":{"addrvec":[{"type":"v2","addr":"172.21.15.145:6820","nonce":103818},{"type":"v1","addr":"172.21.15.145:6821","nonce":103818}]},"heartbeat_back_addrs":{"addrvec":[{"type":"v2","addr":"172.21.15.145:6824","nonce":103818},{"type":"v1","addr":"172.21.15.145:6825","nonce":103818}]},"heartbeat_front_addrs":{"addrvec":[{"type":"v2","addr":"172.21.15.145:6822","nonce":103818},{"type":"v1","addr":"172.21.15.145:6823","nonce":103818}]},"public_addr":"172.21.15.145:6819/103818","cluster_addr":"172.21.15.145:6821/103818","heartbeat_back_addr":"172.21.15.145:6825/103818","heartbeat_front_addr":"172.21.15.145:6823/103818","state":["exists","up"]},{"osd":2,"uuid":"3e6b48d7-1062-486a-8504-b043154edeb2","up":0,"in":1,"weight":1,"primary_affinity":1,"last_clean_begin":0,"last_clean_end":0,"up_from":0,"up_thru":0,"down_at":0,"lost_at":0,"public_addrs":{"addrvec":[]},"cluster_addrs":{"addrvec":[]},"heartbeat_back_addrs":{"addrvec":[]},"heartbeat_front_addrs":{"addrvec":[]},"public_addr":"(unrecognized address family 0)/0","cluster_addr":"(unrecognized address family 0)/0","heartbeat_back_addr":"(unrecognized address family 0)/0","heartbeat_front_addr":"(unrecognized address family 0)/0","state":["exists","new"]}],"osd_xinfo":[{"osd":0,"down_stamp":"0.000000","laggy_probability":0,"laggy_interval":0,"features":4540138320759226367,"old_weight":0,"last_purged_snaps_scrub":"2022-07-14T22:51:57.552229+0000","dead_epoch":0},{"osd":1,"down_stamp":"0.000000","laggy_probability":0,"laggy_interval":0,"features":4540138320759226367,"old_weight":0,"last_purged_snaps_scrub":"2022-07-14T22:51:58.124476+0000","dead_epoch":0},{"osd":2,"down_stamp":"0.000000","laggy_probability":0,"laggy_interval":0,"features":0,"old_weight":0,"last_purged_snaps_scrub":"0.000000","dead_epoch":0}],"pg_upmap":[],"pg_upmap_items":[],"pg_temp":[],"primary_temp":[],"blocklist":{},"range_blocklist":{},"erasure_code_profiles":{"default":{"crush-failure-domain":"osd","k":"2","m":"1","plugin":"jerasure","technique":"reed_sol_van"}},"removed_snaps_queue":[],"new_removed_snaps":[],"new_purged_snaps":[],"crush_node_flags":{},"device_class_flags":{},"stretch_mode":{"stretch_mode_enabled":false,"stretch_bucket_count":0,"degraded_stretch_mode":0,"recovering_stretch_mode":0,"stretch_mode_bucket":0}}
2022-07-14T23:01:41.566 DEBUG:teuthology.misc:2 of 3 OSDs are up

Actions #1

Updated by Laura Flores almost 2 years ago

/a/nojha-2022-07-15_14:45:04-rados-snapshot_key_conversion-distro-default-smithi/6932156

Actions #2

Updated by Nitzan Mordechai over 1 year ago

osd.0 is still down..

The valagrind for osd.0 shows:

<fatal_signal>
  <tid>1</tid>
  <signo>15</signo>
  <signame>SIGTERM</signame>

that coming from wait to mon_cmd_set_config

the code in OSD.CC shows for function mon_cmd_set_config:

int OSD::mon_cmd_set_config(const std::string &key, const std::string &val)
{
  std::string cmd =
    "{" 
      "\"prefix\": \"config set\", " 
      "\"who\": \"osd." + std::to_string(whoami) + "\", " 
      "\"name\": \"" + key + "\", " 
      "\"value\": \"" + val + "\"" 
    "}";

  vector<std::string> vcmd{cmd};
  bufferlist inbl;
  std::string outs;
  C_SaferCond cond;
  monc->start_mon_command(vcmd, inbl, nullptr, &outs, &cond);
  int r = cond.wait();
  if (r < 0) {

looks like we are waiting but failed during that, but mon.c accepted the set and also ack

mon.c showing it ack the set of osd_mclock_max_capacity_iops_ssd:

2022-07-15T15:17:00.355+0000 1721f700  1 -- [v2:172.21.15.36:3302/0,v1:172.21.15.36:6791/0] <== osd.0 v2:172.21.15.36:6802/103698 3 ==== mon_command([{prefix=config set, name=osd_mclock_max_capacity_iops_ssd}] v 0) v1 ==== 149+0+0 (secure 0 0 0) 0x1c2a97b0 con 0x1c6ccbf0
2022-07-15T15:17:00.359+0000 1721f700 20 is_capable service=config command=config set read write addr v2:172.21.15.36:6802/103698 on cap allow profile osd
2022-07-15T15:17:00.373+0000 1721f700 10 mon.c@2(peon).paxosservice(config 1..1) dispatch 0x1c2a97b0 mon_command([{prefix=config set, name=osd_mclock_max_capacity_iops_ssd}] v 0) v1 from osd.0 v2:172.21.15.36:6802/103698 con 0x1c6ccbf0
2022-07-15T15:17:00.635+0000 1721f700  1 -- [v2:172.21.15.36:3302/0,v1:172.21.15.36:6791/0] --> [v2:172.21.15.36:6802/103698,v1:172.21.15.36:6803/103698] -- config(1 keys) v1 -- 0x1c79f000 con 0x1c6ccbf0
2022-07-15T15:17:00.649+0000 1721f700  1 -- [v2:172.21.15.36:3302/0,v1:172.21.15.36:6791/0] --> [v2:172.21.15.36:6802/103698,v1:172.21.15.36:6803/103698] -- mon_command_ack([{prefix=config set, name=osd_mclock_max_capacity_iops_ssd}]=0  v2)=0  v2) v1 -- 0x1246f350 con 0x1c6ccbf0

Actions #3

Updated by Radoslaw Zarzynski over 1 year ago

  • Status changed from New to Need More Info

Watching for more reoccurances.

Actions #4

Updated by Laura Flores over 1 year ago

  • Assignee set to Sridhar Seshasayee

Found another occurrence here: /a/yuriw-2022-07-18_18:20:02-rados-wip-yuri8-testing-2022-07-18-0918-distro-default-smithi/6936397
Description: rados/valgrind-leaks/{1-start 2-inject-leak/mon centos_latest}

Based on the valgrind (same from what Nitzan shared), it seems like it could be something to do with mClock. (Or a separate issue that is interacting badly with mClock). This was the first PR that this failure occurred on: https://github.com/ceph/ceph/pull/46908. During testing, we reran valgrind-leaks tests 10 times, and they all passed after failing initially; thus, we deemed the PR unrelated. See the PR for more testing info.

@Sridhar, do you have an idea of what this could be?

Actions #5

Updated by Sridhar Seshasayee over 1 year ago

This looks similar to https://tracker.ceph.com/issues/52948. See comment https://tracker.ceph.com/issues/52948#note-5 for more details. The osd in question doesn't get the ack for the command even though the mon sends it. In the original tracker the issue looks to be msgr related as there were socket failure injections in that test. I will look more into the logs of this tracker soon, but it could help if this can be looked into more from the msgr perspective.

Actions #6

Updated by Laura Flores over 1 year ago

Ah, thanks Sridhar. I will compare the two Trackers and mark this one as a duplicate if needed.

Actions #7

Updated by Nitzan Mordechai over 1 year ago

Just a note, i was able to recreate it with vstart, without error injection but with valgrind
as soon as we step into that snap code that i posted, we will hang in that wait (i think that more than 1 mon is needed for recreation)

Actions #8

Updated by Radoslaw Zarzynski over 1 year ago

Nitzan, could it be a different issue?

Actions #9

Updated by Laura Flores over 1 year ago

/a/yuriw-2022-08-22_16:21:19-rados-wip-yuri8-testing-2022-08-22-0646-distro-default-smithi/6985175

Actions #10

Updated by Laura Flores over 1 year ago

  • Translation missing: en.field_tag_list set to test-failure
Actions

Also available in: Atom PDF