Project

General

Profile

Bug #43975

Slow Requests/OP's types not getting logged

Added by Vikhyat Umrao about 2 months ago. Updated 11 days ago.

Status:
Resolved
Priority:
Normal
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
nautilus
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
MonClient
Pull request ID:
Crash signature:

Description

- From ceph.log

2020-01-30 05:00:34.367139 mon.node1 (mon.0) 53851 : cluster [WRN] Health check update: Degraded data redundancy: 5916359/1069099881 objects degraded (0.553%), 201 pgs degraded (PG_DEGRADED)
2020-01-30 05:00:39.368143 mon.node1 (mon.0) 53853 : cluster [WRN] Health check update: Degraded data redundancy: 5916361/1069100427 objects degraded (0.553%), 201 pgs degraded (PG_DEGRADED)

2020-01-30 05:00:41.100123 mon.node1 (mon.0) 53854 : cluster [WRN] Health check failed: 668 slow ops, oldest one blocked for 207 sec, daemons [osd,0,osd,100,osd,102,osd,106,osd,108,osd,109,osd,110,osd,112,osd,119,osd,120]... have slow ops. (SLOW_OPS)

2020-01-30 05:00:32.949857 mgr.node1 (mgr.14052) 108028 : cluster [DBG] pgmap v109693: 4768 pgs: 1 active+recovery_wait, 8 active+undersized, 151 active+undersized+degraded, 50 active+recovery_wait+degraded, 6 active+clean+scrubbing, 4552 active+clean; 2.4 TiB data, 75 TiB used, 285 TiB / 361 TiB avail; 9.0 MiB/s rd, 1.2 MiB/s wr, 4.19k op/s; 5916359/1069099881 objects degraded (0.553%); 11 KiB/s, 0 objects/s recovering

- Earlier in the luminous, we used to log slow requests/ops type, for example -

reached pg
queued for pg
op applied
waiting for sub ops

slow.txt View (127 KB) Vikhyat Umrao, 02/05/2020 01:09 PM


Related issues

Copied to RADOS - Backport #44259: nautilus: Slow Requests/OP's types not getting logged Resolved

History

#1 Updated by Vikhyat Umrao about 2 months ago

  • Component(RADOS) MonClient added
  • Component(RADOS) deleted (Manager (RADOS bits))

#2 Updated by Vikhyat Umrao about 2 months ago

#3 Updated by Vikhyat Umrao about 2 months ago

- Types - src/osd/OpRequest.h

  std::string_view state_string() const override {
    switch(latest_flag_point) {
    case flag_queued_for_pg: return "queued for pg";
    case flag_reached_pg: return "reached pg";
    case flag_delayed: return "delayed";
    case flag_started: return "started";
    case flag_sub_op_sent: return "waiting for sub ops";
    case flag_commit_sent: return "commit sent; apply or cleanup";
    default: break;
    }
    return "no flag points reached";
  }

  void mark_queued_for_pg() {
    mark_flag_point(flag_queued_for_pg, "queued_for_pg");
  }
  void mark_reached_pg() {
    mark_flag_point(flag_reached_pg, "reached_pg");
  }
  void mark_delayed(const string& s) {
    mark_flag_point_string(flag_delayed, s);
  }
  void mark_started() {
    mark_flag_point(flag_started, "started");
  }
  void mark_sub_op_sent(const string& s) {
    mark_flag_point_string(flag_sub_op_sent, s);
  }
  void mark_commit_sent() {
    mark_flag_point(flag_commit_sent, "commit_sent");
  }

#4 Updated by Neha Ojha about 2 months ago

  • Assignee set to Sridhar Seshasayee

#5 Updated by Sridhar Seshasayee about 1 month ago

The logging to cluster logs was removed as part of re-factoring effort in mimic. Here are the commits of interest:

721cff9a971a14c884addf2bc492ab6356b57a09 and
f4b74125e44fe78154fb377fa06fc08b3325859d

I am working on re-introducing the logging of the warning along with the slow request/op type into the cluster logs.

#6 Updated by Neha Ojha about 1 month ago

  • Status changed from New to Fix Under Review
  • Backport set to nautilus
  • Pull request ID set to 33328

#7 Updated by Sridhar Seshasayee about 1 month ago

  • Assignee changed from Sridhar Seshasayee to Yehuda Sadeh
  • Severity deleted (3 - minor)

Before and after logs to show the extra information relating to slow op/types:

Before:
--------

2020-02-12 20:39:55.642 7f49eee91700 -1 osd.0 13 get_health_metrics reporting 16 slow ops, oldest is osd_op(client.4245.0:1030 3.4 3:2ca66b65:::benchmark_data_localhost.localdomain_ 393675_object1029:head [set-alloc-hint object_size 1024 write_size 1024,write 0~1024] snapc 0=[] ondisk+write+known_if_redirected e13)

After (from cluster logs):
--------------------------

2020-02-14 09:18:21.117073 osd.0 (osd.0) 89 : cluster [WRN] slow op osd_op(client.4257.0:1874 3.f 3.dcef9fdf (undecoded) ondisk+write+known_if_redirected e13) initiated 2020-02-14 09:18 :21.110722 currently queued for pg
...
...
2020-02-14 09:18:30.200961 osd.0 (osd.0) 214 : cluster [WRN] slow op osd_op(client.4257.0:5565 3.1 3:8f1150fa:::benchmark_data_localhost.localdomain_425610_object5564:head [set-alloc-hi nt object_size 1024 write_size 1024,write 01024] snapc 0=[] ondisk+write+known_if_redirected e13) initiated 2020-02-14 09:18:30.168480 currently commit sent; apply or cleanup
2020-02-14 09:18:30.200992 osd.0 (osd.0) 215 : cluster [WRN] slow op osd_op(client.4257.0:5567 3.1 3:876433d0:::benchmark_data_localhost.localdomain_425610_object5566:head [set-alloc-hi nt object_size 1024 write_size 1024,write 01024] snapc 0=[] ondisk+write+known_if_redirected e13) initiated 2020-02-14 09:18:30.172254 currently started

#8 Updated by Sridhar Seshasayee about 1 month ago

  • Severity set to 3 - minor

#9 Updated by Sridhar Seshasayee about 1 month ago

  • Assignee changed from Yehuda Sadeh to Sridhar Seshasayee

#10 Updated by Sridhar Seshasayee about 1 month ago

  • Copied to Backport #44259: nautilus: Slow Requests/OP's types not getting logged added

#11 Updated by Sridhar Seshasayee about 1 month ago

  • Status changed from Fix Under Review to Pending Backport

#12 Updated by Nathan Cutler 11 days ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".

Also available in: Atom PDF