Bug #43975
closedSlow Requests/OP's types not getting logged
0%
Description
- From ceph.log
2020-01-30 05:00:34.367139 mon.node1 (mon.0) 53851 : cluster [WRN] Health check update: Degraded data redundancy: 5916359/1069099881 objects degraded (0.553%), 201 pgs degraded (PG_DEGRADED) 2020-01-30 05:00:39.368143 mon.node1 (mon.0) 53853 : cluster [WRN] Health check update: Degraded data redundancy: 5916361/1069100427 objects degraded (0.553%), 201 pgs degraded (PG_DEGRADED) 2020-01-30 05:00:41.100123 mon.node1 (mon.0) 53854 : cluster [WRN] Health check failed: 668 slow ops, oldest one blocked for 207 sec, daemons [osd,0,osd,100,osd,102,osd,106,osd,108,osd,109,osd,110,osd,112,osd,119,osd,120]... have slow ops. (SLOW_OPS) 2020-01-30 05:00:32.949857 mgr.node1 (mgr.14052) 108028 : cluster [DBG] pgmap v109693: 4768 pgs: 1 active+recovery_wait, 8 active+undersized, 151 active+undersized+degraded, 50 active+recovery_wait+degraded, 6 active+clean+scrubbing, 4552 active+clean; 2.4 TiB data, 75 TiB used, 285 TiB / 361 TiB avail; 9.0 MiB/s rd, 1.2 MiB/s wr, 4.19k op/s; 5916359/1069099881 objects degraded (0.553%); 11 KiB/s, 0 objects/s recovering
- Earlier in the luminous, we used to log slow requests/ops type, for example -
reached pg queued for pg op applied waiting for sub ops
Files
Updated by Vikhyat Umrao about 4 years ago
- Component(RADOS) MonClient added
- Component(RADOS) deleted (
Manager (RADOS bits))
Updated by Vikhyat Umrao about 4 years ago
- Types - src/osd/OpRequest.h
std::string_view state_string() const override { switch(latest_flag_point) { case flag_queued_for_pg: return "queued for pg"; case flag_reached_pg: return "reached pg"; case flag_delayed: return "delayed"; case flag_started: return "started"; case flag_sub_op_sent: return "waiting for sub ops"; case flag_commit_sent: return "commit sent; apply or cleanup"; default: break; } return "no flag points reached"; } void mark_queued_for_pg() { mark_flag_point(flag_queued_for_pg, "queued_for_pg"); } void mark_reached_pg() { mark_flag_point(flag_reached_pg, "reached_pg"); } void mark_delayed(const string& s) { mark_flag_point_string(flag_delayed, s); } void mark_started() { mark_flag_point(flag_started, "started"); } void mark_sub_op_sent(const string& s) { mark_flag_point_string(flag_sub_op_sent, s); } void mark_commit_sent() { mark_flag_point(flag_commit_sent, "commit_sent"); }
Updated by Sridhar Seshasayee about 4 years ago
The logging to cluster logs was removed as part of re-factoring effort in mimic. Here are the commits of interest:
721cff9a971a14c884addf2bc492ab6356b57a09 and
f4b74125e44fe78154fb377fa06fc08b3325859d
I am working on re-introducing the logging of the warning along with the slow request/op type into the cluster logs.
Updated by Neha Ojha about 4 years ago
- Status changed from New to Fix Under Review
- Backport set to nautilus
- Pull request ID set to 33328
Updated by Sridhar Seshasayee about 4 years ago
- Assignee changed from Sridhar Seshasayee to Yehuda Sadeh
- Severity deleted (
3 - minor)
Before and after logs to show the extra information relating to slow op/types:
Before:
--------
2020-02-12 20:39:55.642 7f49eee91700 -1 osd.0 13 get_health_metrics reporting 16 slow ops, oldest is osd_op(client.4245.0:1030 3.4 3:2ca66b65:::benchmark_data_localhost.localdomain_ 393675_object1029:head [set-alloc-hint object_size 1024 write_size 1024,write 0~1024] snapc 0=[] ondisk+write+known_if_redirected e13)
After (from cluster logs):
--------------------------
2020-02-14 09:18:21.117073 osd.0 (osd.0) 89 : cluster [WRN] slow op osd_op(client.4257.0:1874 3.f 3.dcef9fdf (undecoded) ondisk+write+known_if_redirected e13) initiated 2020-02-14 09:18 :21.110722 currently queued for pg ... ... 2020-02-14 09:18:30.200961 osd.0 (osd.0) 214 : cluster [WRN] slow op osd_op(client.4257.0:5565 3.1 3:8f1150fa:::benchmark_data_localhost.localdomain_425610_object5564:head [set-alloc-hi nt object_size 1024 write_size 1024,write 01024] snapc 0=[] ondisk+write+known_if_redirected e13) initiated 2020-02-14 09:18:30.168480 currently commit sent; apply or cleanup 2020-02-14 09:18:30.200992 osd.0 (osd.0) 215 : cluster [WRN] slow op osd_op(client.4257.0:5567 3.1 3:876433d0:::benchmark_data_localhost.localdomain_425610_object5566:head [set-alloc-hi nt object_size 1024 write_size 1024,write 01024] snapc 0=[] ondisk+write+known_if_redirected e13) initiated 2020-02-14 09:18:30.172254 currently started
Updated by Sridhar Seshasayee about 4 years ago
- Assignee changed from Yehuda Sadeh to Sridhar Seshasayee
Updated by Sridhar Seshasayee about 4 years ago
- Copied to Backport #44259: nautilus: Slow Requests/OP's types not getting logged added
Updated by Sridhar Seshasayee about 4 years ago
- Status changed from Fix Under Review to Pending Backport
Updated by Nathan Cutler about 4 years ago
- Status changed from Pending Backport to Resolved
While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".