Bug #49864
closedPassed jobs classified as dead due to "Could not report results to http://paddles.front.sepia.ceph.com/"
0%
Description
2021-03-16T19:38:09.304 INFO:teuthology.run:Summary data: description: rados/singleton/{all/max-pg-per-osd.from-replica msgr-failures/few msgr/async-v2only objectstore/bluestore-comp-zstd rados supported-random-distro$/{rhel_latest}} duration: 820.3036348819733 flavor: basic owner: scheduled_yuriw@teuthology success: true 2021-03-16T19:38:09.305 DEBUG:teuthology.report:Pushing job info to http://paddles.front.sepia.ceph.com/ 2021-03-16T19:38:09.338 ERROR:teuthology.report:Could not report results to http://paddles.front.sepia.ceph.com/ Traceback (most recent call last): File "/home/teuthworker/src/git.ceph.com_git_teuthology_b96569170f15eae4604f361990ea65737b28dff1/teuthology/report.py", line 498, in try_push_job_info push_job_info(run_name, job_id, job_info) File "/home/teuthworker/src/git.ceph.com_git_teuthology_b96569170f15eae4604f361990ea65737b28dff1/teuthology/report.py", line 463, in push_job_info reporter.report_job(run_name, job_id, job_info) File "/home/teuthworker/src/git.ceph.com_git_teuthology_b96569170f15eae4604f361990ea65737b28dff1/teuthology/report.py", line 320, in report_job response.raise_for_status() File "/home/teuthworker/src/git.ceph.com_git_teuthology_b96569170f15eae4604f361990ea65737b28dff1/virtualenv/lib/python3.6/site-packages/requests/models.py", line 943, in raise_for_status raise HTTPError(http_error_msg, response=self) requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: http://paddles.front.sepia.ceph.com/runs/yuriw-2021-03-16_18:33:54-rados-octopus-distro-basic-smithi/jobs/5971023/ 2021-03-16T19:38:09.340 INFO:teuthology.run:pass
/a/yuriw-2021-03-16_18:33:54-rados-octopus-distro-basic-smithi/5971023
many more in https://pulpito.ceph.com/yuriw-2021-03-16_18:33:54-rados-octopus-distro-basic-smithi/
Updated by Neha Ojha about 3 years ago
somewhat similar
2021-03-17T21:26:51.583 INFO:teuthology.run:Summary data: description: rados:thrash/{0-size-min-size-overrides/3-size-2-min-size 1-pg-log-overrides/short_pg_log 2-recovery-overrides/{more-async-partial-recovery} backoff/normal ceph clusters/{fixed-2 openstack} crc-failures/default d-balancer/on mon_election/connectivity msgr-failures/fastclose msgr/async-v1only objectstore/bluestore-comp-zlib rados supported-random-distro$/{ubuntu_latest} thrashers/morepggrow thrashosds-health workloads/rados_api_tests} duration: 2478.8178923130035 failure_reason: Exiting scrub checking -- not all pgs scrubbed. flavor: basic owner: scheduled_nojha@teuthology status: fail success: false 2021-03-17T21:26:51.583 DEBUG:teuthology.report:Pushing job info to http://paddles.front.sepia.ceph.com/ 2021-03-17T21:26:51.628 ERROR:teuthology.report:Could not report results to http://paddles.front.sepia.ceph.com/ Traceback (most recent call last): File "/home/teuthworker/src/git.ceph.com_git_teuthology_b96569170f15eae4604f361990ea65737b28dff1/teuthology/run.py", line 409, in main run_tasks(tasks=config['tasks'], ctx=fake_ctx) File "/home/teuthworker/src/git.ceph.com_git_teuthology_b96569170f15eae4604f361990ea65737b28dff1/teuthology/run_tasks.py", line 204, in run_tasks raise SystemExit(1) SystemExit: 1 During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/home/teuthworker/src/git.ceph.com_git_teuthology_b96569170f15eae4604f361990ea65737b28dff1/teuthology/report.py", line 498, in try_push_job_info push_job_info(run_name, job_id, job_info) File "/home/teuthworker/src/git.ceph.com_git_teuthology_b96569170f15eae4604f361990ea65737b28dff1/teuthology/report.py", line 463, in push_job_info reporter.report_job(run_name, job_id, job_info) File "/home/teuthworker/src/git.ceph.com_git_teuthology_b96569170f15eae4604f361990ea65737b28dff1/teuthology/report.py", line 320, in report_job response.raise_for_status() File "/home/teuthworker/src/git.ceph.com_git_teuthology_b96569170f15eae4604f361990ea65737b28dff1/virtualenv/lib/python3.6/site-packages/requests/models.py", line 943, in raise_for_status raise HTTPError(http_error_msg, response=self) requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: http://paddles.front.sepia.ceph.com/runs/nojha-2021-03-17_19:06:39-rados:thrash-master-distro-basic-gibba/jobs/5974441/ 2021-03-17T21:26:51.632 INFO:teuthology.run:FAIL
/a/nojha-2021-03-17_19:06:39-rados:thrash-master-distro-basic-gibba/5974441
Updated by Neha Ojha about 3 years ago
- Priority changed from Normal to Urgent
Updated by Neha Ojha about 3 years ago
during job scheduling
Job scheduled with name nojha-2021-03-18_23:17:30-rados:thrash-master-distro-basic-gibba and ID 5978592 2021-03-18 23:21:02,825.825 ERROR:teuthology.report:POST to http://paddles.front.sepia.ceph.com:80/runs/nojha-2021-03-18_23:17:30-rados:thrash-master-distro-basic-gibba/jobs/ failed with status 500: <html> <head> <title>Internal Server Error</title> </head> <body> <h1><p>Internal Server Error</p></h1> </body> </html> 2021-03-18 23:21:02,825.825 ERROR:teuthology.report:Could not report results to http://paddles.front.sepia.ceph.com:80 Traceback (most recent call last): File "/home/nojha/py3_teuthology/teuthology/report.py", line 498, in try_push_job_info push_job_info(run_name, job_id, job_info) File "/home/nojha/py3_teuthology/teuthology/report.py", line 463, in push_job_info reporter.report_job(run_name, job_id, job_info) File "/home/nojha/py3_teuthology/teuthology/report.py", line 320, in report_job response.raise_for_status() File "/home/nojha/py3_teuthology/virtualenv/lib/python3.6/site-packages/requests/models.py", line 940, in raise_for_status raise HTTPError(http_error_msg, response=self) requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: http://paddles.front.sepia.ceph.com:80/runs/nojha-2021-03-18_23:17:30-rados:thrash-master-distro-basic-gibba/jobs/
Updated by Neha Ojha about 3 years ago
- Related to Bug #49893: "500 Server Error" during runs scheduling added
Updated by Josh Durgin about 3 years ago
These are all the same cause under the hood - db conflicts throwing exceptions in paddles. Testing adding retries and serializable isolation level to all the write ops in paddles to avoid this (it's also causing dispatchers to die on a 500 they don't tolerate when the postgres connection that hit a problem is reused for a query and isn't in a good state).
An example from paddles logs:
2021-03-19 07:03:00,049 INFO [paddles.controllers.jobs] Job nojha-2021-03-18_23:17:30-rados:thrash-master-distro-basic-gibba/5978616 status changed from running to pass [2021-03-19 07:03:04 +0000] [31938] [ERROR] Error handling request Traceback (most recent call last): File "/home/ubuntu/.virtualenvs/paddles/local/lib/python2.7/site-packages/gunicorn/workers/sync.py", line 130, in handle self.handle_request(listener, req, client, addr) File "/home/ubuntu/.virtualenvs/paddles/local/lib/python2.7/site-packages/gunicorn/workers/sync.py", line 171, in handle_request respiter = self.wsgi(environ, resp.start_response) File "/home/ubuntu/.virtualenvs/paddles/local/lib/python2.7/site-packages/pecan/middleware/recursive.py", line 56, in __call__ return self.application(environ, start_response) File "/home/ubuntu/.virtualenvs/paddles/local/lib/python2.7/site-packages/pecan/core.py", line 810, in __call__ return super(Pecan, self).__call__(environ, start_response) File "/home/ubuntu/.virtualenvs/paddles/local/lib/python2.7/site-packages/pecan/core.py", line 659, in __call__ self.invoke_controller(controller, args, kwargs, state) File "/home/ubuntu/.virtualenvs/paddles/local/lib/python2.7/site-packages/pecan/core.py", line 559, in invoke_controller result = controller(*args, **kwargs) File "/home/ubuntu/paddles/paddles/controllers/jobs.py", line 53, in index_post self.job.update(request.json) File "/home/ubuntu/paddles/paddles/models/jobs.py", line 206, in update self.set_or_update(json_data) File "/home/ubuntu/paddles/paddles/models/jobs.py", line 156, in set_or_update if len(json_data.get('targets', {})) > len(target_nodes_q.all()): File "/home/ubuntu/.virtualenvs/paddles/local/lib/python2.7/site-packages/sqlalchemy/orm/query.py", line 2399, in all return list(self) File "/home/ubuntu/.virtualenvs/paddles/local/lib/python2.7/site-packages/sqlalchemy/orm/query.py", line 2515, in __iter__ self.session._autoflush() File "/home/ubuntu/.virtualenvs/paddles/local/lib/python2.7/site-packages/sqlalchemy/orm/session.py", line 1292, in _autoflush util.raise_from_cause(e) File "/home/ubuntu/.virtualenvs/paddles/local/lib/python2.7/site-packages/sqlalchemy/util/compat.py", line 199, in raise_from_cause reraise(type(exception), exception, tb=exc_tb) File "/home/ubuntu/.virtualenvs/paddles/local/lib/python2.7/site-packages/sqlalchemy/orm/session.py", line 1282, in _autoflush self.flush() File "/home/ubuntu/.virtualenvs/paddles/local/lib/python2.7/site-packages/sqlalchemy/orm/session.py", line 2004, in flush self._flush(objects) File "/home/ubuntu/.virtualenvs/paddles/local/lib/python2.7/site-packages/sqlalchemy/orm/session.py", line 2122, in _flush transaction.rollback(_capture_exception=True) File "/home/ubuntu/.virtualenvs/paddles/local/lib/python2.7/site-packages/sqlalchemy/util/langhelpers.py", line 60, in __exit__ compat.reraise(exc_type, exc_value, exc_tb) File "/home/ubuntu/.virtualenvs/paddles/local/lib/python2.7/site-packages/sqlalchemy/orm/session.py", line 2086, in _flush flush_context.execute() File "/home/ubuntu/.virtualenvs/paddles/local/lib/python2.7/site-packages/sqlalchemy/orm/unitofwork.py", line 373, in execute rec.execute(self) File "/home/ubuntu/.virtualenvs/paddles/local/lib/python2.7/site-packages/sqlalchemy/orm/unitofwork.py", line 532, in execute uow File "/home/ubuntu/.virtualenvs/paddles/local/lib/python2.7/site-packages/sqlalchemy/orm/persistence.py", line 170, in save_obj mapper, table, update) File "/home/ubuntu/.virtualenvs/paddles/local/lib/python2.7/site-packages/sqlalchemy/orm/persistence.py", line 692, in _emit_update_statements execute(statement, multiparams) File "/home/ubuntu/.virtualenvs/paddles/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 914, in execute return meth(self, multiparams, params) File "/home/ubuntu/.virtualenvs/paddles/local/lib/python2.7/site-packages/sqlalchemy/sql/elements.py", line 323, in _execute_on_connection return connection._execute_clauseelement(self, multiparams, params) File "/home/ubuntu/.virtualenvs/paddles/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1010, in _execute_clauseelement compiled_sql, distilled_params File "/home/ubuntu/.virtualenvs/paddles/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1146, in _execute_context context) File "/home/ubuntu/.virtualenvs/paddles/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1341, in _handle_dbapi_exception exc_info File "/home/ubuntu/.virtualenvs/paddles/local/lib/python2.7/site-packages/sqlalchemy/util/compat.py", line 199, in raise_from_cause reraise(type(exception), exception, tb=exc_tb) File "/home/ubuntu/.virtualenvs/paddles/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1139, in _execute_context context) File "/home/ubuntu/.virtualenvs/paddles/local/lib/python2.7/site-packages/sqlalchemy/engine/default.py", line 450, in do_execute cursor.execute(statement, parameters) OperationalError: (raised as a result of Query-invoked autoflush; consider using a session.no_autoflush block if this flush is occurring prematurely) (psycopg2.extensions.TransactionRollbackError) could not serialize access due to concurrent update [SQL: 'UPDATE runs SET updated=%(updated)s WHERE runs.id = %(runs_id)s'] [parameters: {'runs_id': 101009, 'updated': datetime.datetime(2021, 3, 19, 7, 3, 4, 176354)}]
Updated by Zack Cerza over 2 years ago
- Related to Bug #52101: paddles: race condition in Job.set_or_update() added
Updated by Zack Cerza over 2 years ago
I'm thinking it might make sense to try using a no_autoflush block around the self.run.set_status()
call mentioned in #52101 and also the target_nodes_q.all()
call (factoring it out of that if
statement first) to see if that works to avoid these conflicts.
We would want to test that thoroughly before sending to production, however.
Updated by Zack Cerza almost 2 years ago
- Status changed from New to Resolved
- Assignee changed from Josh Durgin to Zack Cerza
Updated by Ilya Dryomov about 1 year ago
Hi Zack,
I hit something similar on the teuthology side, not related to paddles. Here a passing job just got stuck:
2023-02-26T10:18:46.330 DEBUG:teuthology.run_tasks:Unwinding manager install 2023-02-26T10:18:46.339 INFO:teuthology.task.install.util:Removing shipped files: /home/ubuntu/cephtest/valgrind.supp /usr/bin/daemon-helper /usr/bin/adjust-ulimits... 2023-02-26T10:18:46.339 DEBUG:teuthology.orchestra.run.smithi122:> sudo rm -f -- /home/ubuntu/cephtest/valgrind.supp /usr/bin/daemon-helper /usr/bin/adjust-ulimits 2023-02-26T10:18:46.342 DEBUG:teuthology.orchestra.run.smithi131:> sudo rm -f -- /home/ubuntu/cephtest/valgrind.supp /usr/bin/daemon-helper /usr/bin/adjust-ulimits 2023-02-26T10:18:46.344 DEBUG:teuthology.orchestra.run.smithi165:> sudo rm -f -- /home/ubuntu/cephtest/valgrind.supp /usr/bin/daemon-helper /usr/bin/adjust-ulimits 2023-02-26T10:19:13.990 ERROR:teuthology.run_tasks:Manager failed: install Traceback (most recent call last): File "/home/teuthworker/src/git.ceph.com_teuthology_e06b2e089b00edb8cb2ffc1647f3d9093a53c565/teuthology/run_tasks.py", line 188, in run_tasks suppress = manager.__exit__(*exc_info) File "/usr/lib/python3.8/contextlib.py", line 120, in __exit__ next(self.gen) File "/home/teuthworker/src/git.ceph.com_teuthology_e06b2e089b00edb8cb2ffc1647f3d9093a53c565/teuthology/task/install/__init__.py", line 619, in task yield File "/usr/lib/python3.8/contextlib.py", line 120, in __exit__ next(self.gen) File "/home/teuthworker/src/git.ceph.com_teuthology_e06b2e089b00edb8cb2ffc1647f3d9093a53c565/teuthology/contextutil.py", line 55, in nested raise exc[1] File "/home/teuthworker/src/git.ceph.com_teuthology_e06b2e089b00edb8cb2ffc1647f3d9093a53c565/teuthology/contextutil.py", line 47, in nested if exit(*exc): File "/usr/lib/python3.8/contextlib.py", line 120, in __exit__ next(self.gen) File "/home/teuthworker/src/git.ceph.com_teuthology_e06b2e089b00edb8cb2ffc1647f3d9093a53c565/teuthology/task/install/__init__.py", line 222, in install remove_packages(ctx, config, package_list) File "/home/teuthworker/src/git.ceph.com_teuthology_e06b2e089b00edb8cb2ffc1647f3d9093a53c565/teuthology/task/install/__init__.py", line 105, in remove_packages if not remote.is_reimageable or cleanup: File "/home/teuthworker/src/git.ceph.com_teuthology_e06b2e089b00edb8cb2ffc1647f3d9093a53c565/teuthology/orchestra/remote.py", line 471, in is_reimageable return self.machine_type in self._reimage_types File "/home/teuthworker/src/git.ceph.com_teuthology_e06b2e089b00edb8cb2ffc1647f3d9093a53c565/teuthology/orchestra/remote.py", line 463, in machine_type remote_info = teuthology.lock.query.get_status(self.hostname) File "/home/teuthworker/src/git.ceph.com_teuthology_e06b2e089b00edb8cb2ffc1647f3d9093a53c565/teuthology/lock/query.py", line 20, in get_status while proceed(): File "/home/teuthworker/src/git.ceph.com_teuthology_e06b2e089b00edb8cb2ffc1647f3d9093a53c565/teuthology/contextutil.py", line 133, in __call__ raise MaxWhileTries(error_msg) teuthology.exceptions.MaxWhileTries: 'get_status smithi122.front.sepia.ceph.com' reached maximum tries (10) after waiting for 32.5 seconds
2023-02-26T10:19:16.180 INFO:teuthology.task.internal:Tidying up after the test... 2023-02-26T10:19:16.180 DEBUG:teuthology.orchestra.run.smithi122:> find /home/ubuntu/cephtest -ls ; rmdir -- /home/ubuntu/cephtest 2023-02-26T10:19:16.183 DEBUG:teuthology.orchestra.run.smithi131:> find /home/ubuntu/cephtest -ls ; rmdir -- /home/ubuntu/cephtest 2023-02-26T10:19:16.185 DEBUG:teuthology.orchestra.run.smithi165:> find /home/ubuntu/cephtest -ls ; rmdir -- /home/ubuntu/cephtest 2023-02-26T10:19:16.198 INFO:teuthology.orchestra.run.smithi122.stdout: 393230 4 drwxr-xr-x 2 ubuntu ubuntu 4096 Feb 26 10:19 /home/ubuntu/cephtest 2023-02-26T10:19:16.200 INFO:teuthology.orchestra.run.smithi131.stdout: 393230 4 drwxr-xr-x 2 ubuntu ubuntu 4096 Feb 26 10:19 /home/ubuntu/cephtest 2023-02-26T10:19:16.202 INFO:teuthology.orchestra.run.smithi165.stdout: 393279 4 drwxr-xr-x 2 ubuntu ubuntu 4096 Feb 26 10:19 /home/ubuntu/cephtest 2023-02-26T10:19:16.203 DEBUG:teuthology.run_tasks:Unwinding manager console_log 2023-02-26T10:19:16.292 DEBUG:teuthology.run_tasks:Exception was not quenched, exiting: MaxWhileTries: 'get_status smithi122.front.sepia.ceph.com' reached maximum tries (10) after waiting for 32.5 seconds 2023-02-26T10:19:16.293 INFO:teuthology.nuke:Checking targets against current locks 2023-02-26T10:19:43.922 CRITICAL:teuthology:Uncaught exception Traceback (most recent call last): File "/home/teuthworker/src/git.ceph.com_teuthology_e06b2e089b00edb8cb2ffc1647f3d9093a53c565/teuthology/run.py", line 411, in main run_tasks(tasks=config['tasks'], ctx=fake_ctx) File "/home/teuthworker/src/git.ceph.com_teuthology_e06b2e089b00edb8cb2ffc1647f3d9093a53c565/teuthology/run_tasks.py", line 216, in run_tasks raise SystemExit(1) SystemExit: 1 During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/home/teuthworker/src/git.ceph.com_teuthology_e06b2e089b00edb8cb2ffc1647f3d9093a53c565/virtualenv/bin/teuthology", line 8, in <module> sys.exit(main()) File "/home/teuthworker/src/git.ceph.com_teuthology_e06b2e089b00edb8cb2ffc1647f3d9093a53c565/scripts/run.py", line 38, in main teuthology.run.main(args) File "/home/teuthworker/src/git.ceph.com_teuthology_e06b2e089b00edb8cb2ffc1647f3d9093a53c565/teuthology/run.py", line 414, in main report_outcome(config, archive, fake_ctx.summary, fake_ctx) File "/home/teuthworker/src/git.ceph.com_teuthology_e06b2e089b00edb8cb2ffc1647f3d9093a53c565/teuthology/run.py", line 271, in report_outcome nuke(fake_ctx, fake_ctx.lock) File "/home/teuthworker/src/git.ceph.com_teuthology_e06b2e089b00edb8cb2ffc1647f3d9093a53c565/teuthology/nuke/__init__.py", line 245, in nuke status = get_status(target) File "/home/teuthworker/src/git.ceph.com_teuthology_e06b2e089b00edb8cb2ffc1647f3d9093a53c565/teuthology/lock/query.py", line 20, in get_status while proceed(): File "/home/teuthworker/src/git.ceph.com_teuthology_e06b2e089b00edb8cb2ffc1647f3d9093a53c565/teuthology/contextutil.py", line 133, in __call__ raise MaxWhileTries(error_msg) teuthology.exceptions.MaxWhileTries: 'get_status smithi122.front.sepia.ceph.com' reached maximum tries (10) after waiting for 32.5 seconds
/a/dis-2023-02-26_09:14:25-rbd-wip-dis-testing-distro-default-smithi/7188287
I killed it manually 27 hours later. The test nodes were unnecessarily locked all that time.