Bug #8166
closedpaddles: deadlock updating run.updated?
0%
Description
A teuthology job and worker died because of this completely unexpected error:
2014-04-18T04:31:03.475 INFO:teuthology.worker:Creating archive dir /var/lib/teuthworker/archive/gregf-2014-04-17_23:48:51-rados-wip-fast-dispatch-testing-basic-plana/199983 2014-04-18T04:31:03.475 INFO:teuthology.worker:Running job 199983 2014-04-18T04:31:03.488 INFO:teuthology.worker:Job archive: /var/lib/teuthworker/archive/gregf-2014-04-17_23:48:51-rados-wip-fast-dispatch-testing-basic-plana/199983 2014-04-18T04:31:03.489 INFO:teuthology.worker:Job PID: 3767 2014-04-18T04:31:03.489 INFO:teuthology.worker:Running with watchdog 2014-04-18T07:01:40.343 ERROR:teuthology.worker:run_with_watchdog had an unhandled exception Traceback (most recent call last): File "/home/teuthworker/teuthology-master/teuthology/worker.py", line 303, in run_job run_with_watchdog(p, job_config) File "/home/teuthworker/teuthology-master/teuthology/worker.py", line 230, in run_with_watchdog report.try_push_job_info(job_info, dict(status='running')) File "/home/teuthworker/teuthology-master/teuthology/report.py", line 410, in try_push_job_info push_job_info(run_name, job_id, job_info) File "/home/teuthworker/teuthology-master/teuthology/report.py", line 373, in push_job_info reporter.report_job(run_name, job_id, job_info) File "/home/teuthworker/teuthology-master/teuthology/report.py", line 277, in report_job resp_json = response.json() File "/home/teuthworker/teuthology-master/virtualenv/local/lib/python2.7/site-packages/requests/models.py", line 741, in json File "/usr/lib/python2.7/dist-packages/simplejson/__init__.py", line 413, in loads return _default_decoder.decode(s) File "/usr/lib/python2.7/dist-packages/simplejson/decoder.py", line 402, in decode obj, end = self.raw_decode(s, idx=_w(s, 0).end()) File "/usr/lib/python2.7/dist-packages/simplejson/decoder.py", line 420, in raw_decode raise JSONDecodeError("No JSON object could be decoded", s, idx) JSONDecodeError: No JSON object could be decoded: line 1 column 0 (char 0)
I looked in paddles' logs to see what was up:
[huge traceback] DBAPIError: (TransactionRollbackError) deadlock detected DETAIL: Process 3874 waits for ShareLock on transaction 500780; blocked by process 3896. Process 3896 waits for ExclusiveLock on tuple (184,19) of relation 16397 of database 16385; blocked by process 3874. HINT: See server log for query details. 'UPDATE runs SET updated=%(updated)s WHERE runs.id = %(runs_id)s' {'runs_id': 7031, 'updated': datetime.datetime(2014, 4, 18, 14, 1, 36, 556867)}
And now in the PostgreSQL log:
2014-04-18 07:01:37 PDT ERROR: deadlock detected 2014-04-18 07:01:37 PDT DETAIL: Process 3874 waits for ShareLock on transaction 500780; blocked by process 3896. Process 3896 waits for ExclusiveLock on tuple (184,19) of relation 16397 of database 16385; blocked by process 3874. Process 3874: UPDATE runs SET updated='2014-04-18T14:01:36.556867'::timestamp WHERE runs.id = 7031 Process 3896: UPDATE runs SET updated='2014-04-18T14:01:36.558376'::timestamp WHERE runs.id = 7031 2014-04-18 07:01:37 PDT HINT: See server log for query details. 2014-04-18 07:01:37 PDT STATEMENT: UPDATE runs SET updated='2014-04-18T14:01:36.556867'::timestamp WHERE runs.id = 7031 2014-04-18 07:01:37 PDT LOG: duration: 999.330 ms statement: UPDATE runs SET updated='2014-04-18T14:01:36.558376'::timestamp WHERE runs.id = 7031 2014-04-18 07:01:37 PDT LOG: duration: 895.216 ms statement: UPDATE runs SET updated='2014-04-18T14:01:36.666787'::timestamp WHERE runs.id = 7031
Updated by Zack Cerza about 10 years ago
I put a teuthology workaround in place here:
https://github.com/ceph/teuthology/commit/1448cdf5fc1d9726fbc28c172f38ea96c50f89b7
Updated by Zack Cerza about 10 years ago
- Status changed from New to In Progress
- Assignee set to Zack Cerza
Updated by Zack Cerza about 10 years ago
I thought the fix for #8190 would fix this, but it looks like we're not totally out of the woods yet:
Apr 24 01:27:20 pulpito postgres[6112]: [1159-1] ERROR: deadlock detected Apr 24 01:27:20 pulpito postgres[6112]: [1159-2] DETAIL: Process 6112 waits for ShareLock on transaction 828930; blocked by process 6103. Apr 24 01:27:20 pulpito postgres[6112]: [1159-3] #011Process 6103 waits for ExclusiveLock on tuple (46,46) of relation 16397 of database 16385; blocked by process 6112. Apr 24 01:27:20 pulpito postgres[6112]: [1159-4] #011Process 6112: UPDATE runs SET updated='2014-04-24T08:27:19.219395'::timestamp WHERE runs.id = 7146 Apr 24 01:27:20 pulpito postgres[6112]: [1159-5] #011Process 6103: UPDATE runs SET updated='2014-04-24T08:27:19.223676'::timestamp WHERE runs.id = 7146 Apr 24 01:27:20 pulpito postgres[6112]: [1159-6] HINT: See server log for query details. Apr 24 01:27:20 pulpito postgres[6112]: [1159-7] STATEMENT: UPDATE runs SET updated='2014-04-24T08:27:19.219395'::timestamp WHERE runs.id = 7146
Updated by Zack Cerza almost 10 years ago
- Status changed from In Progress to 12
This is happening less now, but still happening. I am not actively working on it, so changing status.
Updated by Zack Cerza almost 10 years ago
Saw it happen a few times this morning. :-/
Updated by Zack Cerza almost 10 years ago
- Priority changed from High to Normal
Dropping priority since this happens very infrequently now.
Updated by Zack Cerza almost 10 years ago
- Status changed from 12 to 7
Testing this as a fix:
https://github.com/ceph/paddles/commit/91c0840504ad9c3afb2dd6c718417a6b1f1e4847
Updated by Zack Cerza almost 10 years ago
Hasn't happened since 5/21; I'll leave this open for another one or two days
Updated by Zack Cerza almost 10 years ago
- Status changed from 7 to Resolved
Still hasn't happened. Marking resolved.