Bug #58696

open

Entire teuthology runs are dying

Added by Laura Flores about 1 year ago. Updated about 1 year ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):


Files

Screenshot from 2023-02-13 10-27-40.png (220 KB), pulpito screenshot, Laura Flores, 02/13/2023 04:29 PM

Related issues (1 open, 1 closed)

Related to Infrastructure - Bug #58697: Some teuthology jobs are getting scheduled as "unknown" (New)
Has duplicate sepia - Bug #58711: Jobs dead (Duplicate)
Actions #1

Updated by Laura Flores about 1 year ago

  • Related to Bug #58697: Some teuthology jobs are getting scheduled as "unknown" added
Actions #2

Updated by Laura Flores about 1 year ago

Note from Zack regarding this issue:

looking at paddles logs again, I see this:
Feb 09 16:35:16 pulpito sudo[32458]:     root : TTY=pts/0 ; PWD=/root ; USER=root ; COMMAND=/usr/bin/docker exec -it fa31f1474a1a sh -c pecan expire_jobs config.py -q 0 -r 600
and I am pretty sure the -q 0 there would cause all queued jobs to be marked dead in paddles - but the expire_jobs tool does not touch the beanstalkd queue, which is the source of truth. this is a bizarre situation, but i believe those jobs are actually still queued and will run eventually
Actions #3

Updated by Yuri Weinstein about 1 year ago

This is from #sepia slack and may be related:

 I see 8581 jobs in the queue (maybe ~4k are mine). I suspect that the jobs that have shown up as dead over the last couple of days were not actually dead (I think 
@Zack Cerza
 mentioned something similar), and I am not even sure how to manage it. Maybe let it digest whatever it has ATM?
Actions #4

Updated by Zack Cerza about 1 year ago

The runs aren't dying; job statuses were set to the wrong value by a botched maintenance command invocation.

Pasting an explanation I wrote up yesterday in Slack:

looking at paddles logs again, I see this:
Feb 09 16:35:16 pulpito sudo[32458]: root : TTY=pts/0 ; PWD=/root ; USER=root ; COMMAND=/usr/bin/docker exec -it fa31f1474a1a sh -c pecan expire_jobs config.py -q 0 -r 600
and I am pretty sure the -q 0 there would cause all queued jobs to be marked dead in paddles - but the expire_jobs tool does not touch the beanstalkd queue, which is the source of truth. this is a bizarre situation, but i believe those jobs are actually still queued and will run eventually
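The mechanism Zack describes can be illustrated with a minimal sketch. The function and field names below are illustrative guesses, not the actual paddles internals: with a queued-age threshold of 0 seconds, every queued job is older than the cutoff, so all of them get flipped to "dead" in the database, while beanstalkd, which expire_jobs never touches, still holds the work.

```python
from datetime import datetime, timedelta

# Hypothetical, simplified version of what `pecan expire_jobs -q 0 -r 600`
# likely does on the paddles side: mark any job queued longer than
# `queued_secs` as dead. With -q 0 the cutoff is "now", so EVERY queued
# job matches regardless of age. The beanstalkd queue is untouched.
def expire_queued(jobs, queued_secs, now=None):
    now = now or datetime.utcnow()
    cutoff = now - timedelta(seconds=queued_secs)
    expired = []
    for job in jobs:
        if job["status"] == "queued" and job["posted"] <= cutoff:
            job["status"] = "dead"  # only the DB record changes
            expired.append(job["id"])
    return expired

now = datetime.utcnow()
jobs = [
    {"id": 1, "status": "queued", "posted": now - timedelta(seconds=5)},
    {"id": 2, "status": "queued", "posted": now - timedelta(hours=2)},
    {"id": 3, "status": "running", "posted": now - timedelta(hours=2)},
]
# -q 0: even a 5-second-old queued job is "expired"
print(expire_queued(jobs, queued_secs=0, now=now))  # [1, 2]
```

With a sane threshold (e.g. -q 3600) only jobs queued for over an hour would be expired; the 5-second-old job above would be left alone.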

Actions #6

Updated by Laura Flores about 1 year ago

  • Project changed from Infrastructure to sepia
Actions #7

Updated by Laura Flores about 1 year ago

Actions #9

Updated by Laura Flores about 1 year ago

If you look at http://pulpito.front.sepia.ceph.com/?page=2, you can tell right away from the number of dead jobs which runs were affected. The outages appear sporadic.

See attached screenshot, as this link will change views as more runs are scheduled.

Actions #10

Updated by Laura Flores about 1 year ago

Finished+dead jobs: http://pulpito.front.sepia.ceph.com/?suite=rados&status=finished+dead
This is a more comprehensive history of failed runs.

Actions #11

Updated by Laura Flores about 1 year ago

Dead runs remain in the queue. For instance:

https://pulpito.ceph.com/yuriw-2023-02-09_23:12:16-krbd-wip-yuri6-testing-2023-02-09-0734-testing-default-smithi/
Check the queue => it's still there

Actions #12

Updated by adam kraitman about 1 year ago

I restarted the pulpito DB after seeing this error while clearing some old jobs from the queue, and the error disappeared. Maybe that had some effect on the currently running jobs.

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 1224, in _execute_context
    cursor, statement, parameters, context
  File "/usr/local/lib/python3.6/site-packages/sqlalchemy/dialects/postgresql/psycopg2.py", line 725, in do_executemany
    cursor.executemany(statement, parameters)
psycopg2.errors.SerializationFailure: could not serialize access due to read/write dependencies among transactions
DETAIL: Reason code: Canceled on identification as a pivot, during conflict out checking.
HINT: The transaction might succeed if retried.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/bin/pecan", line 11, in <module>
    sys.exit(CommandRunner.handle_command_line())
  File "/usr/local/lib/python3.6/site-packages/pecan/commands/base.py", line 96, in handle_command_line
    runner.run(sys.argv[1:])
  File "/usr/local/lib/python3.6/site-packages/pecan/commands/base.py", line 91, in run
    self.commands[ns.command_name]().run(ns)
  File "/usr/local/lib/python3.6/site-packages/paddles/commands/expire_jobs.py", line 44, in run
    self.expire_queued()
  File "/usr/local/lib/python3.6/site-packages/paddles/commands/expire_jobs.py", line 72, in expire_queued
    self._do_expire(to_expire, 'queued')
  File "/usr/local/lib/python3.6/site-packages/paddles/commands/expire_jobs.py", line 56, in _do_expire
    runs.add(job.run)
  File "/usr/local/lib/python3.6/site-packages/sqlalchemy/orm/attributes.py", line 276, in __get__
    return self.impl.get(instance_state(instance), dict_)
  File "/usr/local/lib/python3.6/site-packages/sqlalchemy/orm/attributes.py", line 682, in get
    value = self.callable_(state, passive)
  File "/usr/local/lib/python3.6/site-packages/sqlalchemy/orm/strategies.py", line 722, in _load_for_state
    session, state, primary_key_identity, passive
  File "<string>", line 1, in <lambda>
  File "/usr/local/lib/python3.6/site-packages/sqlalchemy/orm/strategies.py", line 812, in _emit_lazyload
    session.query(self.mapper), primary_key_identity
  File "/usr/local/lib/python3.6/site-packages/sqlalchemy/ext/baked.py", line 602, in _load_on_pk_identity
    result = list(bq.for_session(self.session).params(**params))
  File "/usr/local/lib/python3.6/site-packages/sqlalchemy/ext/baked.py", line 429, in __iter__
    self.session._autoflush()
  File "/usr/local/lib/python3.6/site-packages/sqlalchemy/orm/session.py", line 1587, in _autoflush
    util.raise_from_cause(e)
  File "/usr/local/lib/python3.6/site-packages/sqlalchemy/util/compat.py", line 383, in raise_from_cause
    reraise(type(exception), exception, tb=exc_tb, cause=cause)
  File "/usr/local/lib/python3.6/site-packages/sqlalchemy/util/compat.py", line 129, in reraise
    raise value
  File "/usr/local/lib/python3.6/site-packages/sqlalchemy/orm/session.py", line 1576, in _autoflush
    self.flush()
  File "/usr/local/lib/python3.6/site-packages/sqlalchemy/orm/session.py", line 2451, in flush
    self._flush(objects)
  File "/usr/local/lib/python3.6/site-packages/sqlalchemy/orm/session.py", line 2589, in _flush
    transaction.rollback(_capture_exception=True)
  File "/usr/local/lib/python3.6/site-packages/sqlalchemy/util/langhelpers.py", line 68, in __exit__
    compat.reraise(exc_type, exc_value, exc_tb)
  File "/usr/local/lib/python3.6/site-packages/sqlalchemy/util/compat.py", line 129, in reraise
    raise value
  File "/usr/local/lib/python3.6/site-packages/sqlalchemy/orm/session.py", line 2549, in _flush
    flush_context.execute()
  File "/usr/local/lib/python3.6/site-packages/sqlalchemy/orm/unitofwork.py", line 422, in execute
    rec.execute(self)
  File "/usr/local/lib/python3.6/site-packages/sqlalchemy/orm/unitofwork.py", line 589, in execute
    uow,
  File "/usr/local/lib/python3.6/site-packages/sqlalchemy/orm/persistence.py", line 236, in save_obj
    update,
  File "/usr/local/lib/python3.6/site-packages/sqlalchemy/orm/persistence.py", line 978, in _emit_update_statements
    statement, multiparams
  File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 988, in execute
    return meth(self, multiparams, params)
  File "/usr/local/lib/python3.6/site-packages/sqlalchemy/sql/elements.py", line 287, in _execute_on_connection
    return connection._execute_clauseelement(self, multiparams, params)
  File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 1107, in _execute_clauseelement
    distilled_params,
  File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 1248, in _execute_context
    e, statement, parameters, cursor, context
  File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 1466, in _handle_dbapi_exception
    util.raise_from_cause(sqlalchemy_exception, exc_info)
  File "/usr/local/lib/python3.6/site-packages/sqlalchemy/util/compat.py", line 383, in raise_from_cause
    reraise(type(exception), exception, tb=exc_tb, cause=cause)
  File "/usr/local/lib/python3.6/site-packages/sqlalchemy/util/compat.py", line 128, in reraise
    raise value.with_traceback(tb)
  File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 1224, in _execute_context
    cursor, statement, parameters, context
  File "/usr/local/lib/python3.6/site-packages/sqlalchemy/dialects/postgresql/psycopg2.py", line 725, in do_executemany
    cursor.executemany(statement, parameters)
sqlalchemy.exc.OperationalError: (raised as a result of Query-invoked autoflush; consider using a session.no_autoflush block if this flush is occurring prematurely)
(psycopg2.errors.SerializationFailure) could not serialize access due to read/write dependencies among transactions
DETAIL: Reason code: Canceled on identification as a pivot, during conflict out checking.
HINT: The transaction might succeed if retried.

[SQL: UPDATE jobs SET status=%(status)s WHERE jobs.id = %(jobs_id)s]
[parameters: ({'status': 'dead', 'jobs_id': 8478540}, {'status': 'dead', 'jobs_id': 8478541}, {'status': 'dead', 'jobs_id': 8478542}, {'status': 'dead', 'jobs_id': 8478543}, {'status': 'dead', 'jobs_id': 8478544}, {'status': 'dead', 'jobs_id': 8478545}, {'status': 'dead', 'jobs_id': 8478546}, {'status': 'dead', 'jobs_id': 8478547} ... displaying 10 of 293 total bound parameter sets ... {'status': 'dead', 'jobs_id': 8479032}, {'status': 'dead', 'jobs_id': 8479033})]
(Background on this error at: http://sqlalche.me/e/e3q8)
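The HINT in that traceback ("The transaction might succeed if retried") points at the standard remedy for conflicts under SERIALIZABLE isolation: retry the whole transaction with backoff. A generic sketch of that pattern follows; it uses a stand-in exception class so it runs standalone, where real code would catch psycopg2.errors.SerializationFailure instead.

```python
import time

class SerializationFailure(Exception):
    """Stand-in for psycopg2.errors.SerializationFailure."""

def run_with_retry(txn, max_attempts=5, base_delay=0.01):
    # Retry a transaction on serialization failure, as the Postgres
    # HINT suggests. Backs off exponentially between attempts and
    # re-raises if the final attempt still fails.
    for attempt in range(max_attempts):
        try:
            return txn()
        except SerializationFailure:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))

# Demo: a transaction that fails twice, then succeeds.
attempts = {"n": 0}
def flaky_txn():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise SerializationFailure("could not serialize access")
    return "committed"

print(run_with_retry(flaky_txn))  # committed
```

Each retry must re-run the entire transaction from the beginning, since Postgres has already rolled back the failed one.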

Actions #13

Updated by Laura Flores about 1 year ago

  • Assignee set to Laura Flores
Actions #14

Updated by Laura Flores about 1 year ago

  • Assignee deleted (Laura Flores)