Bug #43975
Slow Requests/OP's types not getting logged
0%
Description
- From ceph.log
2020-01-30 05:00:34.367139 mon.node1 (mon.0) 53851 : cluster [WRN] Health check update: Degraded data redundancy: 5916359/1069099881 objects degraded (0.553%), 201 pgs degraded (PG_DEGRADED) 2020-01-30 05:00:39.368143 mon.node1 (mon.0) 53853 : cluster [WRN] Health check update: Degraded data redundancy: 5916361/1069100427 objects degraded (0.553%), 201 pgs degraded (PG_DEGRADED) 2020-01-30 05:00:41.100123 mon.node1 (mon.0) 53854 : cluster [WRN] Health check failed: 668 slow ops, oldest one blocked for 207 sec, daemons [osd,0,osd,100,osd,102,osd,106,osd,108,osd,109,osd,110,osd,112,osd,119,osd,120]... have slow ops. (SLOW_OPS) 2020-01-30 05:00:32.949857 mgr.node1 (mgr.14052) 108028 : cluster [DBG] pgmap v109693: 4768 pgs: 1 active+recovery_wait, 8 active+undersized, 151 active+undersized+degraded, 50 active+recovery_wait+degraded, 6 active+clean+scrubbing, 4552 active+clean; 2.4 TiB data, 75 TiB used, 285 TiB / 361 TiB avail; 9.0 MiB/s rd, 1.2 MiB/s wr, 4.19k op/s; 5916359/1069099881 objects degraded (0.553%); 11 KiB/s, 0 objects/s recovering
- Earlier in the luminous, we used to log slow requests/ops type, for example -
reached pg queued for pg op applied waiting for sub ops
Related issues
History
#1 Updated by Vikhyat Umrao over 3 years ago
- Component(RADOS) MonClient added
- Component(RADOS) deleted (
Manager (RADOS bits))
#2 Updated by Vikhyat Umrao over 3 years ago
#3 Updated by Vikhyat Umrao over 3 years ago
- Types - src/osd/OpRequest.h
std::string_view state_string() const override {
switch(latest_flag_point) {
case flag_queued_for_pg: return "queued for pg";
case flag_reached_pg: return "reached pg";
case flag_delayed: return "delayed";
case flag_started: return "started";
case flag_sub_op_sent: return "waiting for sub ops";
case flag_commit_sent: return "commit sent; apply or cleanup";
default: break;
}
return "no flag points reached";
}
void mark_queued_for_pg() {
mark_flag_point(flag_queued_for_pg, "queued_for_pg");
}
void mark_reached_pg() {
mark_flag_point(flag_reached_pg, "reached_pg");
}
void mark_delayed(const string& s) {
mark_flag_point_string(flag_delayed, s);
}
void mark_started() {
mark_flag_point(flag_started, "started");
}
void mark_sub_op_sent(const string& s) {
mark_flag_point_string(flag_sub_op_sent, s);
}
void mark_commit_sent() {
mark_flag_point(flag_commit_sent, "commit_sent");
}
#4 Updated by Neha Ojha over 3 years ago
- Assignee set to Sridhar Seshasayee
#5 Updated by Sridhar Seshasayee over 3 years ago
The logging to cluster logs was removed as part of re-factoring effort in mimic. Here are the commits of interest:
721cff9a971a14c884addf2bc492ab6356b57a09 and
f4b74125e44fe78154fb377fa06fc08b3325859d
I am working on re-introducing the logging of the warning along with the slow request/op type into the cluster logs.
#6 Updated by Neha Ojha over 3 years ago
- Status changed from New to Fix Under Review
- Backport set to nautilus
- Pull request ID set to 33328
#7 Updated by Sridhar Seshasayee over 3 years ago
- Assignee changed from Sridhar Seshasayee to Yehuda Sadeh
- Severity deleted (
3 - minor)
Before and after logs to show the extra information relating to slow op/types:
Before:
--------
2020-02-12 20:39:55.642 7f49eee91700 -1 osd.0 13 get_health_metrics reporting 16 slow ops, oldest is osd_op(client.4245.0:1030 3.4 3:2ca66b65:::benchmark_data_localhost.localdomain_ 393675_object1029:head [set-alloc-hint object_size 1024 write_size 1024,write 0~1024] snapc 0=[] ondisk+write+known_if_redirected e13)
After (from cluster logs):
--------------------------
2020-02-14 09:18:21.117073 osd.0 (osd.0) 89 : cluster [WRN] slow op osd_op(client.4257.0:1874 3.f 3.dcef9fdf (undecoded) ondisk+write+known_if_redirected e13) initiated 2020-02-14 09:18 :21.110722 currently queued for pg ... ... 2020-02-14 09:18:30.200961 osd.0 (osd.0) 214 : cluster [WRN] slow op osd_op(client.4257.0:5565 3.1 3:8f1150fa:::benchmark_data_localhost.localdomain_425610_object5564:head [set-alloc-hi nt object_size 1024 write_size 1024,write 01024] snapc 0=[] ondisk+write+known_if_redirected e13) initiated 2020-02-14 09:18:30.168480 currently commit sent; apply or cleanup 2020-02-14 09:18:30.200992 osd.0 (osd.0) 215 : cluster [WRN] slow op osd_op(client.4257.0:5567 3.1 3:876433d0:::benchmark_data_localhost.localdomain_425610_object5566:head [set-alloc-hi nt object_size 1024 write_size 1024,write 01024] snapc 0=[] ondisk+write+known_if_redirected e13) initiated 2020-02-14 09:18:30.172254 currently started
#8 Updated by Sridhar Seshasayee over 3 years ago
- Severity set to 3 - minor
#9 Updated by Sridhar Seshasayee over 3 years ago
- Assignee changed from Yehuda Sadeh to Sridhar Seshasayee
#10 Updated by Sridhar Seshasayee over 3 years ago
- Copied to Backport #44259: nautilus: Slow Requests/OP's types not getting logged added
#11 Updated by Sridhar Seshasayee over 3 years ago
- Status changed from Fix Under Review to Pending Backport
#12 Updated by Nathan Cutler over 3 years ago
- Status changed from Pending Backport to Resolved
While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".