Bug #56574
rados/valgrind-leaks: cluster [WRN] Health check failed: 2 osds down (OSD_DOWN)" in cluster log
0%
Description
Description: rados/valgrind-leaks/{1-start 2-inject-leak/osd centos_latest}
/a/nojha-2022-07-14_20:32:09-rados-snapshot_key_conversion-distro-default-smithi/6931015
2022-07-14T23:01:41.552 INFO:teuthology.misc.health.smithi145.stdout:{"epoch":13,"fsid":"24de2949-251d-4169-919f-ec93cdf5f733","created":"2022-07-14T22:51:15.685917+0000","modified":"2022-07-14T22:52:15.383144+0000","last_up_change":"2022-07-14T22:52:11.619286+0000","last_in_change":"2022-07-14T22:51:19.303295+0000","flags":"sortbitwise,recovery_deletes,purged_snapdirs,pglog_hardlimit","flags_num":5799936,"flags_set":["pglog_hardlimit","purged_snapdirs","recovery_deletes","sortbitwise"],"crush_version":5,"full_ratio":0.94999998807907104,"backfillfull_ratio":0.89999997615814209,"nearfull_ratio":0.85000002384185791,"cluster_snapshot":"","pool_max":1,"max_osd":3,"require_min_compat_client":"luminous","min_compat_client":"jewel","require_osd_release":"quincy","pools":[{"pool":1,"pool_name":".mgr","create_time":"2022-07-14T22:52:12.379399+0000","flags":1,"flags_names":"hashpspool","type":1,"size":2,"min_size":1,"crush_rule":0,"peering_crush_bucket_count":0,"peering_crush_bucket_target":0,"peering_crush_bucket_barrier":0,"peering_crush_bucket_mandatory_member":2147483647,"object_hash":2,"pg_autoscale_mode":"off","pg_num":1,"pg_placement_num":1,"pg_placement_num_target":1,"pg_num_target":1,"pg_num_pending":1,"last_pg_merge_meta":{"source_pgid":"0.0","ready_epoch":0,"last_epoch_started":0,"last_epoch_clean":0,"source_version":"0'0","target_version":"0'0"},"last_change":"13","last_force_op_resend":"0","last_force_op_resend_prenautilus":"0","last_force_op_resend_preluminous":"0","auid":0,"snap_mode":"selfmanaged","snap_seq":0,"snap_epoch":0,"pool_snaps":[],"removed_snaps":"[]","quota_max_bytes":0,"quota_max_objects":0,"tiers":[],"tier_of":-1,"read_tier":-1,"write_tier":-1,"cache_mode":"none","target_max_bytes":0,"target_max_objects":0,"cache_target_dirty_ratio_micro":400000,"cache_target_dirty_high_ratio_micro":600000,"cache_target_full_ratio_micro":800000,"cache_min_flush_age":0,"cache_min_evict_age":0,"erasure_code_profile":"","hit_set_params":{"type":"none"},"hit_set_period":0,"hit_set_count":0,"use_gmt_hitset":true,"min_read_recency_for_promote":0,"min_write_recency_for_promote":0,"hit_set_grade_decay_rate":0,"hit_set_search_last_n":0,"grade_table":[],"stripe_width":0,"expected_num_objects":0,"fast_read":false,"options":{"pg_num_max":32,"pg_num_min":1},"application_metadata":{"mgr":{}}}],"osds":[{"osd":0,"uuid":"da4088a2-ef48-4e6e-bafe-29b8472b672c","up":1,"in":1,"weight":1,"primary_affinity":1,"last_clean_begin":0,"last_clean_end":0,"up_from":9,"up_thru":0,"down_at":0,"lost_at":0,"public_addrs":{"addrvec":[{"type":"v2","addr":"172.21.15.145:6810","nonce":103817},{"type":"v1","addr":"172.21.15.145:6811","nonce":103817}]},"cluster_addrs":{"addrvec":[{"type":"v2","addr":"172.21.15.145:6812","nonce":103817},{"type":"v1","addr":"172.21.15.145:6813","nonce":103817}]},"heartbeat_back_addrs":{"addrvec":[{"type":"v2","addr":"172.21.15.145:6816","nonce":103817},{"type":"v1","addr":"172.21.15.145:6817","nonce":103817}]},"heartbeat_front_addrs":{"addrvec":[{"type":"v2","addr":"172.21.15.145:6814","nonce":103817},{"type":"v1","addr":"172.21.15.145:6815","nonce":103817}]},"public_addr":"172.21.15.145:6811/103817","cluster_addr":"172.21.15.145:6813/103817","heartbeat_back_addr":"172.21.15.145:6817/103817","heartbeat_front_addr":"172.21.15.145:6815/103817","state":["exists","up"]},{"osd":1,"uuid":"5420bad2-daeb-4064-bb0f-a58aed6bab71","up":1,"in":1,"weight":1,"primary_affinity":1,"last_clean_begin":0,"last_clean_end":0,"up_from":10,"up_thru":11,"down_at":0,"lost_at":0,"public_addrs":{"addrvec":[{"type":"v2","addr":"172.21.15.145:6818","nonce":103818},{"type":"v1","addr":"172.21.15.145:6819","nonce":103818}]},"cluster_addrs":{"addrvec":[{"type":"v2","addr":"172.21.15.145:6820","nonce":103818},{"type":"v1","addr":"172.21.15.145:6821","nonce":103818}]},"heartbeat_back_addrs":{"addrvec":[{"type":"v2","addr":"172.21.15.145:6824","nonce":103818},{"type":"v1","addr":"172.21.15.145:6825","nonce":103818}]},"heartbeat_front_addrs":{"addrvec":[{"type":"v2","addr":"172.21.15.145:6822","nonce":103818},{"type":"v1","addr":"172.21.15.145:6823","nonce":103818}]},"public_addr":"172.21.15.145:6819/103818","cluster_addr":"172.21.15.145:6821/103818","heartbeat_back_addr":"172.21.15.145:6825/103818","heartbeat_front_addr":"172.21.15.145:6823/103818","state":["exists","up"]},{"osd":2,"uuid":"3e6b48d7-1062-486a-8504-b043154edeb2","up":0,"in":1,"weight":1,"primary_affinity":1,"last_clean_begin":0,"last_clean_end":0,"up_from":0,"up_thru":0,"down_at":0,"lost_at":0,"public_addrs":{"addrvec":[]},"cluster_addrs":{"addrvec":[]},"heartbeat_back_addrs":{"addrvec":[]},"heartbeat_front_addrs":{"addrvec":[]},"public_addr":"(unrecognized address family 0)/0","cluster_addr":"(unrecognized address family 0)/0","heartbeat_back_addr":"(unrecognized address family 0)/0","heartbeat_front_addr":"(unrecognized address family 0)/0","state":["exists","new"]}],"osd_xinfo":[{"osd":0,"down_stamp":"0.000000","laggy_probability":0,"laggy_interval":0,"features":4540138320759226367,"old_weight":0,"last_purged_snaps_scrub":"2022-07-14T22:51:57.552229+0000","dead_epoch":0},{"osd":1,"down_stamp":"0.000000","laggy_probability":0,"laggy_interval":0,"features":4540138320759226367,"old_weight":0,"last_purged_snaps_scrub":"2022-07-14T22:51:58.124476+0000","dead_epoch":0},{"osd":2,"down_stamp":"0.000000","laggy_probability":0,"laggy_interval":0,"features":0,"old_weight":0,"last_purged_snaps_scrub":"0.000000","dead_epoch":0}],"pg_upmap":[],"pg_upmap_items":[],"pg_temp":[],"primary_temp":[],"blocklist":{},"range_blocklist":{},"erasure_code_profiles":{"default":{"crush-failure-domain":"osd","k":"2","m":"1","plugin":"jerasure","technique":"reed_sol_van"}},"removed_snaps_queue":[],"new_removed_snaps":[],"new_purged_snaps":[],"crush_node_flags":{},"device_class_flags":{},"stretch_mode":{"stretch_mode_enabled":false,"stretch_bucket_count":0,"degraded_stretch_mode":0,"recovering_stretch_mode":0,"stretch_mode_bucket":0}}
2022-07-14T23:01:41.566 DEBUG:teuthology.misc:2 of 3 OSDs are up
History
#1 Updated by Laura Flores about 1 year ago
/a/nojha-2022-07-15_14:45:04-rados-snapshot_key_conversion-distro-default-smithi/6932156
#2 Updated by Nitzan Mordechai about 1 year ago
osd.0 is still down..
The valagrind for osd.0 shows:
<fatal_signal>
<tid>1</tid>
<signo>15</signo>
<signame>SIGTERM</signame>
that coming from wait to mon_cmd_set_config
the code in OSD.CC shows for function mon_cmd_set_config:
int OSD::mon_cmd_set_config(const std::string &key, const std::string &val)
{
std::string cmd =
"{"
"\"prefix\": \"config set\", "
"\"who\": \"osd." + std::to_string(whoami) + "\", "
"\"name\": \"" + key + "\", "
"\"value\": \"" + val + "\""
"}";
vector<std::string> vcmd{cmd};
bufferlist inbl;
std::string outs;
C_SaferCond cond;
monc->start_mon_command(vcmd, inbl, nullptr, &outs, &cond);
int r = cond.wait();
if (r < 0) {
looks like we are waiting but failed during that, but mon.c accepted the set and also ack
mon.c showing it ack the set of osd_mclock_max_capacity_iops_ssd:
2022-07-15T15:17:00.355+0000 1721f700 1 -- [v2:172.21.15.36:3302/0,v1:172.21.15.36:6791/0] <== osd.0 v2:172.21.15.36:6802/103698 3 ==== mon_command([{prefix=config set, name=osd_mclock_max_capacity_iops_ssd}] v 0) v1 ==== 149+0+0 (secure 0 0 0) 0x1c2a97b0 con 0x1c6ccbf0
2022-07-15T15:17:00.359+0000 1721f700 20 is_capable service=config command=config set read write addr v2:172.21.15.36:6802/103698 on cap allow profile osd
2022-07-15T15:17:00.373+0000 1721f700 10 mon.c@2(peon).paxosservice(config 1..1) dispatch 0x1c2a97b0 mon_command([{prefix=config set, name=osd_mclock_max_capacity_iops_ssd}] v 0) v1 from osd.0 v2:172.21.15.36:6802/103698 con 0x1c6ccbf0
2022-07-15T15:17:00.635+0000 1721f700 1 -- [v2:172.21.15.36:3302/0,v1:172.21.15.36:6791/0] --> [v2:172.21.15.36:6802/103698,v1:172.21.15.36:6803/103698] -- config(1 keys) v1 -- 0x1c79f000 con 0x1c6ccbf0
2022-07-15T15:17:00.649+0000 1721f700 1 -- [v2:172.21.15.36:3302/0,v1:172.21.15.36:6791/0] --> [v2:172.21.15.36:6802/103698,v1:172.21.15.36:6803/103698] -- mon_command_ack([{prefix=config set, name=osd_mclock_max_capacity_iops_ssd}]=0 v2)=0 v2) v1 -- 0x1246f350 con 0x1c6ccbf0
#3 Updated by Radoslaw Zarzynski about 1 year ago
- Status changed from New to Need More Info
Watching for more reoccurances.
#4 Updated by Laura Flores about 1 year ago
- Assignee set to Sridhar Seshasayee
Found another occurrence here: /a/yuriw-2022-07-18_18:20:02-rados-wip-yuri8-testing-2022-07-18-0918-distro-default-smithi/6936397
Description: rados/valgrind-leaks/{1-start 2-inject-leak/mon centos_latest}
Based on the valgrind (same from what Nitzan shared), it seems like it could be something to do with mClock. (Or a separate issue that is interacting badly with mClock). This was the first PR that this failure occurred on: https://github.com/ceph/ceph/pull/46908. During testing, we reran valgrind-leaks tests 10 times, and they all passed after failing initially; thus, we deemed the PR unrelated. See the PR for more testing info.
@Sridhar, do you have an idea of what this could be?
#5 Updated by Sridhar Seshasayee about 1 year ago
This looks similar to https://tracker.ceph.com/issues/52948. See comment https://tracker.ceph.com/issues/52948#note-5 for more details. The osd in question doesn't get the ack for the command even though the mon sends it. In the original tracker the issue looks to be msgr related as there were socket failure injections in that test. I will look more into the logs of this tracker soon, but it could help if this can be looked into more from the msgr perspective.
#6 Updated by Laura Flores about 1 year ago
Ah, thanks Sridhar. I will compare the two Trackers and mark this one as a duplicate if needed.
#7 Updated by Nitzan Mordechai about 1 year ago
Just a note, i was able to recreate it with vstart, without error injection but with valgrind
as soon as we step into that snap code that i posted, we will hang in that wait (i think that more than 1 mon is needed for recreation)
#8 Updated by Radoslaw Zarzynski about 1 year ago
Nitzan, could it be a different issue?
#9 Updated by Laura Flores about 1 year ago
/a/yuriw-2022-08-22_16:21:19-rados-wip-yuri8-testing-2022-08-22-0646-distro-default-smithi/6985175
#10 Updated by Laura Flores about 1 year ago
- Tags set to test-failure