Feature #4875
closed
gather logs on hung tasks
Description
Right now, if a run hangs, we eventually erase all evidence of it. We should gather up any logs which might exist on the machines first!
Updated by Greg Farnum over 10 years ago
This is causing me trouble again. Can we move it up the backlog, pretty please? :) I don't think a best-effort one should take too long.
Updated by Greg Farnum over 8 years ago
- Priority changed from High to Urgent
bump
We had a long string of jobs get hung on MDS crash #12711, which we managed to diagnose, but http://pulpito.ceph.com/teuthology-2015-08-17_23:04:01-fs-master---basic-multi/1020414/ for instance also contains a monitor running out of memory and it would be really nice if we could examine that.
Updated by Greg Farnum over 8 years ago
Also in http://pulpito.ceph.com/teuthology-2015-08-17_23:04:01-fs-master---basic-multi/1020395/ it looks like a ceph-fuse process crashed from secondary evidence, but there are no traces left of it. :(
Updated by Zack Cerza over 8 years ago
An idea that was floated just now by Greg and Sam was:
Make teuthology-kill attempt to gather logs when passed a certain flag; teuthology-worker, when killing a job that's taking too long, could use that flag. We'd have to make sure that any exceptions raised would not take down the worker process.
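A rough sketch of the "gather logs, but never take down the worker" idea might look like the following. It is not teuthology code: `kill_job`, `kill`, and `gather_logs` are hypothetical names standing in for whatever teuthology-kill and the worker would really call. The only point is that log gathering is best effort and any exception it raises is logged rather than propagated.

```python
import logging

log = logging.getLogger(__name__)


def kill_job(job_id, kill, gather_logs=None):
    """Kill a hung job, optionally gathering logs first (best effort).

    `kill` and `gather_logs` are hypothetical callables that take a job id;
    they are assumptions for illustration, not real teuthology functions.
    Returns True if logs were gathered, False if skipped or failed.
    """
    gathered = False
    if gather_logs is not None:
        try:
            gather_logs(job_id)
            gathered = True
        except Exception:
            # Log and carry on: a failed archive attempt must not stop
            # the kill or propagate into the worker process.
            log.exception("could not gather logs for job %s", job_id)
    # The job is killed whether or not log gathering succeeded.
    kill(job_id)
    return gathered
```

The worker would pass a `gather_logs` callable only when killing a job that has run too long, which corresponds to the flag described above; a manual kill could leave it unset.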
Updated by Kyrylo Shatskyy almost 4 years ago
Gregory, is this ticket still relevant, or can we close it?
Updated by Josh Durgin about 3 years ago
- Status changed from New to Resolved
teuthology-dispatcher does this: https://github.com/ceph/teuthology/pull/1546