Feature #4875: gather logs on hung tasks - teuthology - Ceph

closed

Added by Greg Farnum almost 11 years ago. Updated about 3 years ago.

Status: Resolved
Priority: Urgent
Source: other

Description

Right now, if a run hangs, we eventually erase all evidence of it. We should gather up any logs which might exist on the machines first!

Updated by Anonymous almost 11 years ago

Tracker changed from Bug to Feature

Updated by Greg Farnum over 10 years ago

This is causing me trouble again. Can we move it up the backlog, pretty please? :) I don't think a best-effort version should take too long.

Updated by Greg Farnum about 9 years ago

Priority changed from Normal to High

bump

Updated by Greg Farnum over 8 years ago

Priority changed from High to Urgent

bump

We had a long string of jobs hang on MDS crash #12711, which we managed to diagnose. But http://pulpito.ceph.com/teuthology-2015-08-17_23:04:01-fs-master---basic-multi/1020414/, for instance, also shows a monitor running out of memory, and it would be really nice if we could examine that.

Updated by Greg Farnum over 8 years ago

Also, in http://pulpito.ceph.com/teuthology-2015-08-17_23:04:01-fs-master---basic-multi/1020395/ secondary evidence suggests that a ceph-fuse process crashed, but there are no traces left of it. :(

Updated by Zack Cerza over 8 years ago

An idea that was floated just now by Greg and Sam was:

Make teuthology-kill attempt to gather logs when passed a certain flag; teuthology-worker, when killing a job that's taking too long, could use that flag. We'd have to make sure that any exceptions raised would not take down the worker process.
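
A rough sketch of that idea, not teuthology's actual code: the function name `kill_hung_job`, the `gather_logs` flag, and the `gather`/`terminate` callables are all hypothetical stand-ins for whatever teuthology-kill and teuthology-worker would really use. The point is only that log gathering is best-effort and any exception it raises is logged instead of propagating into the worker.

```python
import logging
from typing import Callable

log = logging.getLogger("worker_sketch")


def kill_hung_job(
    job_id: str,
    terminate: Callable[[str], None],
    gather: Callable[[str], None],
    gather_logs: bool = False,
) -> bool:
    """Kill a job, optionally trying to archive its logs first.

    All names here are illustrative assumptions. Returns True only if
    log gathering was requested and completed without raising.
    """
    gathered = False
    if gather_logs:
        try:
            gather(job_id)
            gathered = True
        except Exception:
            # Best effort: record the failure and carry on, so a broken
            # node or unreachable host cannot take down the caller.
            log.exception("could not gather logs for job %s", job_id)
    # The job is terminated whether or not gathering succeeded.
    terminate(job_id)
    return gathered
```

In this sketch the worker would pass `gather_logs=True` when it kills a job for exceeding its time limit, while a manual kill could leave the flag off. Only the gathering step is guarded here; a real worker would probably also want to guard the call as a whole.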
