Feature #11589 (closed)

teuthology runs should install collectl and collect collectl logs after run

Added by Samuel Just almost 9 years ago. Updated over 7 years ago.

Status: Resolved
Priority: High
Assignee:
Category: -
% Done: 0%
Source: other
Tags:
Backport:
Reviewed:
Affected Versions:
Actions #1

Updated by Samuel Just almost 9 years ago

(02:35:01 PM) sjusthm: dmick: that is, can that become a part of the things installed with teuthology?
(02:35:41 PM) dmick: the task could ask for it to be installed
(02:35:42 PM) nhm: sjusthm: collectl can also be used to look for major pagefaults, which is something that was causing heartbeat timeouts on burnupiX back when it was running ubuntu precise.
(02:36:17 PM) andrewschoen: we install collectl on ubuntu anyway, https://github.com/ceph/ceph-cm-ansible/blob/c452a3af1f75642e00d523a432b973a406fc6132/roles/testnode/vars/ubuntu.yml#L74
(02:36:25 PM) nhm: sjusthm: even though there was plenty of memory, for some reason the VM layer decided rather than using buffercache it would swap the ceph-osd process out and that was enough to cause heartbeat timeout failures under extreme load.
(02:36:38 PM) andrewschoen: is collectl something we should just always have installed and configured on testnodes?
(02:37:24 PM) sjusthm: andrewschoen: I think so
(02:37:36 PM) andrewschoen: sjusthm dmick: if collectl is ok to always be installed and running on testnodes, I say we put it in ceph-cm-ansible.
(02:38:03 PM) dmick: dunno if we always want it running
(02:38:11 PM) sjusthm: andrewschoen: created http://tracker.ceph.com/issues/11589
(02:38:21 PM) sjusthm: nhm: is there a reason to not have it always running?
(02:38:22 PM) andrewschoen: dmick: maybe not
(02:38:29 PM) dmick: usually there's some collection overhead, and log space is very much at a premium on typica
(02:38:56 PM) sjusthm: nhm: how much log does it generate?
(02:39:43 PM) andrewschoen: log space is sparse on typica nodes
(02:40:04 PM) **dmick would swear someone just said that :)
(02:40:32 PM) **andrewschoen was agreeing on the point :)
(02:40:44 PM) andrewschoen: cm-ansible could install it, teuthology could turn it on
(02:40:51 PM) nhm: sjusthm: In daemon mode I think it records general data every 10s and per process data every minute.
(02:41:00 PM) andrewschoen: but, then we'd have to notice it's a problem and rerun if we wanted those logs
(02:41:00 PM) sjusthm: that seems like a pretty tiny amount of data
(02:41:25 PM) nhm: sjusthm: it's intended to not have much impact.
(02:41:29 PM) sjusthm: yep
(02:41:46 PM) dmick: 2MB compressed a day is what they say. that's not much.
(02:41:47 PM) sjusthm: dmick andrewschoen: I'd vote on having it always on and collected at the end of each run unless it becomes a problem
(02:41:54 PM) nhm: sjusthm: I run it directly during collectl tests with 1s intervals for general data and 10s for per process data and it's still about 1% CPU usage.
(02:42:27 PM) nhm: s/collectl tests/cbt tests
(02:42:36 PM) andrewschoen: sjusthm: who / what would collect it?
(02:42:52 PM) sjusthm: at the end of the test when it grabs syslog and kern.log, it can also grab the collectl log
(02:43:05 PM) andrewschoen: ok, so teuthology
(02:43:09 PM) sjusthm: yes
(02:44:02 PM) nhm: if you want to invoke it directly, before tests, you can do: "collectl -s+mYZ -i 1:10 -F0 -f <log file>" and then kill the pid when you are done.
(02:44:17 PM) sjusthm: or that
(02:44:26 PM) sjusthm: ceph task setup teardown could do that instead
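
The direct invocation nhm describes above maps onto a simple setup/teardown pattern. Below is a minimal sketch of that pattern in plain Python, not the actual teuthology task API: the output directory, archive location, and helper names are assumptions, and the collectl flags are taken verbatim from the command in the log.

# Sketch only: start collectl for the duration of a run, then stop it and
# gather its output, following the direct invocation nhm gives above.
# COLLECTL_DIR, the archive location, and the function names are assumptions.
import glob
import os
import shutil
import signal
import subprocess

# collectl names its own output files under this prefix, e.g. <host>-<date>.raw.gz
COLLECTL_DIR = "/tmp/collectl"

def start_collectl():
    """Start collectl: 1s general samples, 10s per-process samples, written to disk."""
    os.makedirs(COLLECTL_DIR, exist_ok=True)
    return subprocess.Popen(
        ["collectl", "-s+mYZ", "-i", "1:10", "-F0", "-f", COLLECTL_DIR + "/"]
    )

def stop_and_collect(proc, archive_dir):
    """Kill the collectl pid when the run is done and copy the logs alongside
    the syslog/kern.log archives."""
    proc.send_signal(signal.SIGTERM)
    proc.wait()
    os.makedirs(archive_dir, exist_ok=True)
    for path in glob.glob(os.path.join(COLLECTL_DIR, "*.raw.gz")):
        shutil.copy(path, archive_dir)

if __name__ == "__main__":
    proc = start_collectl()
    # ... run the workload under test here ...
    stop_and_collect(proc, "collectl-logs")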

Actions #2

Updated by Samuel Just almost 9 years ago

  • Project changed from Ceph to teuthology
  • Category deleted (teuthology)
  • Affected Versions sprint27 added
  • Affected Versions deleted (v9.0.2)
Actions #3

Updated by Mark Nelson almost 9 years ago

Once the log file is in place, information about specific disks can be gathered via a command like the one below:

[nhm@burnupiX collectl]$ collectl -sD -oT -p burnupiX-20150507-130144.raw.gz --dskfilt sdj | head -n 10

# DISK STATISTICS (/sec)
#                   <---------reads---------><---------writes---------><--------averages--------> Pct
#Time     Name       KBytes Merged  IOs Size  KBytes Merged  IOs Size  RWSize  QLen  Wait SvcTim Util
13:01:46 sdj            632      0   79    8  1301807      0 2797  465     452    10     2      0   79
13:01:47 sdj            312      0   39    8  1301491      0 2830  460     453     9     2      0   81
13:01:48 sdj            208      0   26    8  1132760      1 2423  468     462    41     9      0   69
13:01:49 sdj            304      0   39    8  1451129      2 3034  478     472    90    27      0   86
13:01:50 sdj            624      0   78    8  1393120   1644 3159  441     430    13     3      0   87
13:01:51 sdj             96      0   12    8  1171385      0 2535  462     459     9     2      0   72

In this case specifically, the QLen, Wait, and SvcTim fields tell us whether a particular block device (say, a disk under an OSD) is building up a backlog of waiting IOs, and how long it is taking to service them on average over each sampling interval.
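
As a follow-up, here is a small, hypothetical helper that replays a raw.gz file with the same command as above and prints the intervals where Wait exceeds a threshold. The field positions follow the header in the sample output; the file name, device, and threshold value are illustrative only.

# Sketch only: replay a raw.gz file with the same command as above and report
# intervals where the disk's Wait exceeds a threshold (same units as the Wait
# column in the output above). File name, device, and threshold are illustrative.
import subprocess

def high_wait_intervals(raw_file, device="sdj", wait_threshold=20):
    cmd = ["collectl", "-sD", "-oT", "-p", raw_file, "--dskfilt", device]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    hits = []
    for line in out.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the commented header
        fields = line.split()
        # fields: Time Name <4 read cols> <4 write cols> RWSize QLen Wait SvcTim Util
        time, qlen, wait = fields[0], float(fields[11]), float(fields[12])
        if wait > wait_threshold:
            hits.append((time, qlen, wait))
    return hits

if __name__ == "__main__":
    for time, qlen, wait in high_wait_intervals("burnupiX-20150507-130144.raw.gz"):
        print(time, "QLen", qlen, "Wait", wait)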

Actions #4

Updated by Zack Cerza over 7 years ago

  • Status changed from New to Resolved
  • Assignee set to Zack Cerza

We use PCP now :)
