Feature #11589
teuthology runs should install collectl and collect collectl logs after run (Closed)
Updated by Samuel Just almost 9 years ago
(02:35:01 PM) sjusthm: dmick: that is, can that become a part of the things installed with teuthology?
(02:35:41 PM) dmick: the task could ask for it to be installed
(02:35:42 PM) nhm: sjusthm: collectl can also be used to look for major pagefaults, which is something that was causing heartbeat timeouts on burnupiX back when it was running ubuntu precise.
(02:36:17 PM) andrewschoen: we install collectl on ubuntu anyway, https://github.com/ceph/ceph-cm-ansible/blob/c452a3af1f75642e00d523a432b973a406fc6132/roles/testnode/vars/ubuntu.yml#L74
(02:36:25 PM) nhm: sjusthm: even though there was plenty of memory, for some reason the VM layer decided rather than using buffercache it would swap the ceph-osd process out and that was enough to cause heartbeat timeout failures under extreme load.
(02:36:38 PM) andrewschoen: is collectl something we should just always have installed and configured on testnodes?
(02:37:24 PM) sjusthm: andrewschoen: I think so
(02:37:36 PM) andrewschoen: sjusthm dmick: if collectl is ok to always be installed and running on testnodes, I say we put it in ceph-cm-ansible.
(02:38:03 PM) dmick: dunno if we always want it running
(02:38:11 PM) sjusthm: andrewschoen: created http://tracker.ceph.com/issues/11589
(02:38:21 PM) sjusthm: nhm: is there a reason to not have it always running?
(02:38:22 PM) andrewschoen: dmick: maybe not
(02:38:29 PM) dmick: usually there's some collection overhead, and log space is very much at a premium on typica
(02:38:56 PM) sjusthm: nhm: how much log does it generate?
(02:39:43 PM) andrewschoen: log space is sparse on typica nodes
(02:40:04 PM) * dmick would swear someone just said that :)
(02:40:32 PM) * andrewschoen was agreeing on the point :)
(02:40:44 PM) andrewschoen: cm-ansible could install it, teuthology could turn it on
(02:40:51 PM) nhm: sjusthm: In daemon mode I think it records general data every 10s and per process data every minute.
(02:41:00 PM) andrewschoen: but, then we'd have to notice it's a problem and rerun if we wanted those logs
(02:41:00 PM) sjusthm: that seems like a pretty tiny amount of data
(02:41:25 PM) nhm: sjusthm: it's intended to not have much impact.
(02:41:29 PM) sjusthm: yep
(02:41:46 PM) dmick: 2MB compressed a day is what they say. that's not much.
(02:41:47 PM) sjusthm: dmick andrewschoen: I'd vote on having it always on and collected at the end of each run unless it becomes a problem
(02:41:54 PM) nhm: sjusthm: I run it directly during collectl tests with 1s intervals for general data and 10s for per process data and it's still about 1% CPU usage.
(02:42:27 PM) nhm: s/collectl tests/cbt tests
(02:42:36 PM) andrewschoen: sjusthm: who / what would collect it?
(02:42:52 PM) sjusthm: at the end of the test when it grabs syslog and kern.log, it can also grab the collectl log
(02:43:05 PM) andrewschoen: ok, so teuthology
(02:43:09 PM) sjusthm: yes
(02:44:02 PM) nhm: if you want to invoke it directly, before tests, you can do: "collectl -s+mYZ -i 1:10 -F0 -f <log file>" and then kill the pid when you are done.
(02:44:17 PM) sjusthm: or that
(02:44:26 PM) sjusthm: ceph task setup teardown could do that instead
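For reference, a minimal sketch of what a setup/teardown hook along these lines could look like, assuming a local subprocess and using the flags nhm suggested above; the function name, workload hook, and log path are placeholders, not the actual teuthology task:

import signal
import subprocess

def run_with_collectl(workload_cmd, logfile="/tmp/collectl-run"):
    # -s+mYZ adds memory, slab and per-process data; -i 1:10 samples general
    # data every second and per-process data every 10s; -F0 flushes output
    # with every sample so the log is complete even if the run dies.
    collectl = subprocess.Popen(
        ["collectl", "-s+mYZ", "-i", "1:10", "-F0", "-f", logfile]
    )
    try:
        subprocess.run(workload_cmd, check=True)
    finally:
        # Stop collectl cleanly so it finishes writing its raw.gz file,
        # which could then be archived alongside syslog and kern.log.
        collectl.send_signal(signal.SIGTERM)
        collectl.wait()

# Example: run_with_collectl(["sleep", "30"])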
Updated by Samuel Just almost 9 years ago
- Project changed from Ceph to teuthology
- Category deleted (teuthology)
- Affected Versions sprint27 added
- Affected Versions deleted (v9.0.2)
Updated by Mark Nelson almost 9 years ago
Once the log file is in place, information about specific disks can be gathered via a command like the one below:
[nhm@burnupiX collectl]$ collectl -sD -oT -p burnupiX-20150507-130144.raw.gz --dskfilt sdj | head -n 10
# DISK STATISTICS (/sec)
#                   <---------reads---------><---------writes---------><--------averages--------> Pct
#Time     Name       KBytes Merged  IOs Size  KBytes Merged  IOs Size  RWSize  QLen  Wait SvcTim Util
13:01:46 sdj            632      0   79    8 1301807      0 2797  465     452    10     2      0   79
13:01:47 sdj            312      0   39    8 1301491      0 2830  460     453     9     2      0   81
13:01:48 sdj            208      0   26    8 1132760      1 2423  468     462    41     9      0   69
13:01:49 sdj            304      0   39    8 1451129      2 3034  478     472    90    27      0   86
13:01:50 sdj            624      0   78    8 1393120   1644 3159  441     430    13     3      0   87
13:01:51 sdj             96      0   12    8 1171385      0 2535  462     459     9     2      0   72
In this case specifically, the QLen, Wait, and SvcTim fields tell us whether a particular block device (say, a disk under an OSD) is building up a backlog of waiting IOs and how long, on average, it takes to service them over the specified interval.
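As a rough post-processing sketch (not part of teuthology), the same playback output could be scanned for samples where a device shows queue build-up; the raw file name, field positions, and thresholds below are illustrative only:

import subprocess

def find_slow_samples(raw_file, device="sdj", max_qlen=50, max_wait_ms=20):
    out = subprocess.run(
        ["collectl", "-sD", "-oT", "-p", raw_file, "--dskfilt", device],
        capture_output=True, text=True, check=True,
    ).stdout
    slow = []
    for line in out.splitlines():
        fields = line.split()
        # Data rows: Time Name KBytes Merged IOs Size KBytes Merged IOs Size
        #            RWSize QLen Wait SvcTim Util
        if len(fields) == 15 and fields[1] == device:
            qlen, wait = float(fields[11]), float(fields[12])
            if qlen > max_qlen or wait > max_wait_ms:
                slow.append((fields[0], qlen, wait))
    return slow

# Example: find_slow_samples("burnupiX-20150507-130144.raw.gz")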
Updated by Zack Cerza over 7 years ago
- Status changed from New to Resolved
- Assignee set to Zack Cerza
We use PCP now :)