Project

General

Profile

Actions

Bug #49864

closed

Passed jobs classified as dead due to "Could not report results to http://paddles.front.sepia.ceph.com/"

Added by Neha Ojha about 3 years ago. Updated about 1 year ago.

Status:
Resolved
Priority:
Urgent
Assignee:
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

2021-03-16T19:38:09.304 INFO:teuthology.run:Summary data:
description: rados/singleton/{all/max-pg-per-osd.from-replica msgr-failures/few msgr/async-v2only
  objectstore/bluestore-comp-zstd rados supported-random-distro$/{rhel_latest}}
duration: 820.3036348819733
flavor: basic
owner: scheduled_yuriw@teuthology
success: true

2021-03-16T19:38:09.305 DEBUG:teuthology.report:Pushing job info to http://paddles.front.sepia.ceph.com/
2021-03-16T19:38:09.338 ERROR:teuthology.report:Could not report results to http://paddles.front.sepia.ceph.com/
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_b96569170f15eae4604f361990ea65737b28dff1/teuthology/report.py", line 498, in try_push_job_info
    push_job_info(run_name, job_id, job_info)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_b96569170f15eae4604f361990ea65737b28dff1/teuthology/report.py", line 463, in push_job_info
    reporter.report_job(run_name, job_id, job_info)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_b96569170f15eae4604f361990ea65737b28dff1/teuthology/report.py", line 320, in report_job
    response.raise_for_status()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_b96569170f15eae4604f361990ea65737b28dff1/virtualenv/lib/python3.6/site-packages/requests/models.py", line 943, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: http://paddles.front.sepia.ceph.com/runs/yuriw-2021-03-16_18:33:54-rados-octopus-distro-basic-smithi/jobs/5971023/
2021-03-16T19:38:09.340 INFO:teuthology.run:pass

/a/yuriw-2021-03-16_18:33:54-rados-octopus-distro-basic-smithi/5971023
many more in https://pulpito.ceph.com/yuriw-2021-03-16_18:33:54-rados-octopus-distro-basic-smithi/


Related issues 2 (1 open1 closed)

Related to paddles - Bug #49893: "500 Server Error" during runs scheduling Duplicate

Actions
Related to paddles - Bug #52101: paddles: race condition in Job.set_or_update()NewAishwarya Mathuria

Actions
Actions #1

Updated by Neha Ojha about 3 years ago

somewhat similar

2021-03-17T21:26:51.583 INFO:teuthology.run:Summary data:
description: rados:thrash/{0-size-min-size-overrides/3-size-2-min-size 1-pg-log-overrides/short_pg_log
  2-recovery-overrides/{more-async-partial-recovery} backoff/normal ceph clusters/{fixed-2
  openstack} crc-failures/default d-balancer/on mon_election/connectivity msgr-failures/fastclose
  msgr/async-v1only objectstore/bluestore-comp-zlib rados supported-random-distro$/{ubuntu_latest}
  thrashers/morepggrow thrashosds-health workloads/rados_api_tests}
duration: 2478.8178923130035
failure_reason: Exiting scrub checking -- not all pgs scrubbed.
flavor: basic
owner: scheduled_nojha@teuthology
status: fail
success: false

2021-03-17T21:26:51.583 DEBUG:teuthology.report:Pushing job info to http://paddles.front.sepia.ceph.com/
2021-03-17T21:26:51.628 ERROR:teuthology.report:Could not report results to http://paddles.front.sepia.ceph.com/
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_b96569170f15eae4604f361990ea65737b28dff1/teuthology/run.py", line 409, in main
    run_tasks(tasks=config['tasks'], ctx=fake_ctx)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_b96569170f15eae4604f361990ea65737b28dff1/teuthology/run_tasks.py", line 204, in run_tasks
    raise SystemExit(1)
SystemExit: 1

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_b96569170f15eae4604f361990ea65737b28dff1/teuthology/report.py", line 498, in try_push_job_info
    push_job_info(run_name, job_id, job_info)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_b96569170f15eae4604f361990ea65737b28dff1/teuthology/report.py", line 463, in push_job_info
    reporter.report_job(run_name, job_id, job_info)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_b96569170f15eae4604f361990ea65737b28dff1/teuthology/report.py", line 320, in report_job
    response.raise_for_status()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_b96569170f15eae4604f361990ea65737b28dff1/virtualenv/lib/python3.6/site-packages/requests/models.py", line 943, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: http://paddles.front.sepia.ceph.com/runs/nojha-2021-03-17_19:06:39-rados:thrash-master-distro-basic-gibba/jobs/5974441/
2021-03-17T21:26:51.632 INFO:teuthology.run:FAIL

/a/nojha-2021-03-17_19:06:39-rados:thrash-master-distro-basic-gibba/5974441

Actions #4

Updated by David Galloway about 3 years ago

  • Assignee set to Josh Durgin
Actions #5

Updated by Neha Ojha about 3 years ago

during job scheduling

Job scheduled with name nojha-2021-03-18_23:17:30-rados:thrash-master-distro-basic-gibba and ID 5978592
2021-03-18 23:21:02,825.825 ERROR:teuthology.report:POST to http://paddles.front.sepia.ceph.com:80/runs/nojha-2021-03-18_23:17:30-rados:thrash-master-distro-basic-gibba/jobs/ failed with status 500: <html>
  <head>
    <title>Internal Server Error</title>
  </head>
  <body>
    <h1><p>Internal Server Error</p></h1>

  </body>
</html>

2021-03-18 23:21:02,825.825 ERROR:teuthology.report:Could not report results to http://paddles.front.sepia.ceph.com:80
Traceback (most recent call last):
  File "/home/nojha/py3_teuthology/teuthology/report.py", line 498, in try_push_job_info
    push_job_info(run_name, job_id, job_info)
  File "/home/nojha/py3_teuthology/teuthology/report.py", line 463, in push_job_info
    reporter.report_job(run_name, job_id, job_info)
  File "/home/nojha/py3_teuthology/teuthology/report.py", line 320, in report_job
    response.raise_for_status()
  File "/home/nojha/py3_teuthology/virtualenv/lib/python3.6/site-packages/requests/models.py", line 940, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: http://paddles.front.sepia.ceph.com:80/runs/nojha-2021-03-18_23:17:30-rados:thrash-master-distro-basic-gibba/jobs/
Actions #6

Updated by Neha Ojha about 3 years ago

  • Related to Bug #49893: "500 Server Error" during runs scheduling added
Actions #7

Updated by Josh Durgin about 3 years ago

These are all the same cause under the hood - db conflicts throwing exceptions in paddles. Testing adding retries and serializable isolation level to all the write ops in paddles to avoid this (it's also causing dispatchers to die on a 500 they don't tolerate when the postgres connection that hit a problem is reused for a query and isn't in a good state).

An example from paddles logs:

2021-03-19 07:03:00,049 INFO  [paddles.controllers.jobs] Job nojha-2021-03-18_23:17:30-rados:thrash-master-distro-basic-gibba/5978616 status changed from running to pass
[2021-03-19 07:03:04 +0000] [31938] [ERROR] Error handling request
Traceback (most recent call last):
  File "/home/ubuntu/.virtualenvs/paddles/local/lib/python2.7/site-packages/gunicorn/workers/sync.py", line 130, in handle
    self.handle_request(listener, req, client, addr)
  File "/home/ubuntu/.virtualenvs/paddles/local/lib/python2.7/site-packages/gunicorn/workers/sync.py", line 171, in handle_request
    respiter = self.wsgi(environ, resp.start_response)
  File "/home/ubuntu/.virtualenvs/paddles/local/lib/python2.7/site-packages/pecan/middleware/recursive.py", line 56, in __call__
    return self.application(environ, start_response)
  File "/home/ubuntu/.virtualenvs/paddles/local/lib/python2.7/site-packages/pecan/core.py", line 810, in __call__
    return super(Pecan, self).__call__(environ, start_response)
  File "/home/ubuntu/.virtualenvs/paddles/local/lib/python2.7/site-packages/pecan/core.py", line 659, in __call__
    self.invoke_controller(controller, args, kwargs, state)
  File "/home/ubuntu/.virtualenvs/paddles/local/lib/python2.7/site-packages/pecan/core.py", line 559, in invoke_controller
    result = controller(*args, **kwargs)
  File "/home/ubuntu/paddles/paddles/controllers/jobs.py", line 53, in index_post
    self.job.update(request.json)
  File "/home/ubuntu/paddles/paddles/models/jobs.py", line 206, in update
    self.set_or_update(json_data)
  File "/home/ubuntu/paddles/paddles/models/jobs.py", line 156, in set_or_update
    if len(json_data.get('targets', {})) > len(target_nodes_q.all()):
  File "/home/ubuntu/.virtualenvs/paddles/local/lib/python2.7/site-packages/sqlalchemy/orm/query.py", line 2399, in all
    return list(self)
  File "/home/ubuntu/.virtualenvs/paddles/local/lib/python2.7/site-packages/sqlalchemy/orm/query.py", line 2515, in __iter__
    self.session._autoflush()
  File "/home/ubuntu/.virtualenvs/paddles/local/lib/python2.7/site-packages/sqlalchemy/orm/session.py", line 1292, in _autoflush
    util.raise_from_cause(e)
  File "/home/ubuntu/.virtualenvs/paddles/local/lib/python2.7/site-packages/sqlalchemy/util/compat.py", line 199, in raise_from_cause
    reraise(type(exception), exception, tb=exc_tb)
  File "/home/ubuntu/.virtualenvs/paddles/local/lib/python2.7/site-packages/sqlalchemy/orm/session.py", line 1282, in _autoflush
    self.flush()
  File "/home/ubuntu/.virtualenvs/paddles/local/lib/python2.7/site-packages/sqlalchemy/orm/session.py", line 2004, in flush
    self._flush(objects)
  File "/home/ubuntu/.virtualenvs/paddles/local/lib/python2.7/site-packages/sqlalchemy/orm/session.py", line 2122, in _flush
    transaction.rollback(_capture_exception=True)
  File "/home/ubuntu/.virtualenvs/paddles/local/lib/python2.7/site-packages/sqlalchemy/util/langhelpers.py", line 60, in __exit__
    compat.reraise(exc_type, exc_value, exc_tb)
  File "/home/ubuntu/.virtualenvs/paddles/local/lib/python2.7/site-packages/sqlalchemy/orm/session.py", line 2086, in _flush
    flush_context.execute()
  File "/home/ubuntu/.virtualenvs/paddles/local/lib/python2.7/site-packages/sqlalchemy/orm/unitofwork.py", line 373, in execute
    rec.execute(self)
  File "/home/ubuntu/.virtualenvs/paddles/local/lib/python2.7/site-packages/sqlalchemy/orm/unitofwork.py", line 532, in execute
    uow
  File "/home/ubuntu/.virtualenvs/paddles/local/lib/python2.7/site-packages/sqlalchemy/orm/persistence.py", line 170, in save_obj
    mapper, table, update)
  File "/home/ubuntu/.virtualenvs/paddles/local/lib/python2.7/site-packages/sqlalchemy/orm/persistence.py", line 692, in _emit_update_statements
    execute(statement, multiparams)
  File "/home/ubuntu/.virtualenvs/paddles/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 914, in execute
    return meth(self, multiparams, params)
  File "/home/ubuntu/.virtualenvs/paddles/local/lib/python2.7/site-packages/sqlalchemy/sql/elements.py", line 323, in _execute_on_connection
    return connection._execute_clauseelement(self, multiparams, params)
  File "/home/ubuntu/.virtualenvs/paddles/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1010, in _execute_clauseelement
    compiled_sql, distilled_params
  File "/home/ubuntu/.virtualenvs/paddles/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1146, in _execute_context
    context)
  File "/home/ubuntu/.virtualenvs/paddles/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1341, in _handle_dbapi_exception
    exc_info
  File "/home/ubuntu/.virtualenvs/paddles/local/lib/python2.7/site-packages/sqlalchemy/util/compat.py", line 199, in raise_from_cause
    reraise(type(exception), exception, tb=exc_tb)
  File "/home/ubuntu/.virtualenvs/paddles/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1139, in _execute_context
    context)
  File "/home/ubuntu/.virtualenvs/paddles/local/lib/python2.7/site-packages/sqlalchemy/engine/default.py", line 450, in do_execute
    cursor.execute(statement, parameters)
OperationalError: (raised as a result of Query-invoked autoflush; consider using a session.no_autoflush block if this flush is occurring prematurely) (psycopg2.extensions.TransactionRollbackError) could not serialize access due to concurrent update
 [SQL: 'UPDATE runs SET updated=%(updated)s WHERE runs.id = %(runs_id)s'] [parameters: {'runs_id': 101009, 'updated': datetime.datetime(2021, 3, 19, 7, 3, 4, 176354)}]

Actions #8

Updated by Zack Cerza over 2 years ago

  • Related to Bug #52101: paddles: race condition in Job.set_or_update() added
Actions #9

Updated by Zack Cerza over 2 years ago

I'm thinking it might make sense to try using a no_autoflush block around the self.run.set_status() call mentioned in #52101 and also the target_nodes_q.all() call (factoring it out of that if statement first) to see if that works to avoid these conflicts.

We would want to test that thoroughly before sending to production, however.

Actions #10

Updated by Zack Cerza almost 2 years ago

  • Status changed from New to Resolved
  • Assignee changed from Josh Durgin to Zack Cerza
Actions #11

Updated by Ilya Dryomov about 1 year ago

Hi Zack,

I hit something similar on the teuthology side, not related to paddles. Here a passing job just got stuck:

2023-02-26T10:18:46.330 DEBUG:teuthology.run_tasks:Unwinding manager install
2023-02-26T10:18:46.339 INFO:teuthology.task.install.util:Removing shipped files: /home/ubuntu/cephtest/valgrind.supp /usr/bin/daemon-helper /usr/bin/adjust-ulimits...
2023-02-26T10:18:46.339 DEBUG:teuthology.orchestra.run.smithi122:> sudo rm -f -- /home/ubuntu/cephtest/valgrind.supp /usr/bin/daemon-helper /usr/bin/adjust-ulimits
2023-02-26T10:18:46.342 DEBUG:teuthology.orchestra.run.smithi131:> sudo rm -f -- /home/ubuntu/cephtest/valgrind.supp /usr/bin/daemon-helper /usr/bin/adjust-ulimits
2023-02-26T10:18:46.344 DEBUG:teuthology.orchestra.run.smithi165:> sudo rm -f -- /home/ubuntu/cephtest/valgrind.supp /usr/bin/daemon-helper /usr/bin/adjust-ulimits
2023-02-26T10:19:13.990 ERROR:teuthology.run_tasks:Manager failed: install
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_teuthology_e06b2e089b00edb8cb2ffc1647f3d9093a53c565/teuthology/run_tasks.py", line 188, in run_tasks
    suppress = manager.__exit__(*exc_info)
  File "/usr/lib/python3.8/contextlib.py", line 120, in __exit__
    next(self.gen)
  File "/home/teuthworker/src/git.ceph.com_teuthology_e06b2e089b00edb8cb2ffc1647f3d9093a53c565/teuthology/task/install/__init__.py", line 619, in task
    yield
  File "/usr/lib/python3.8/contextlib.py", line 120, in __exit__
    next(self.gen)
  File "/home/teuthworker/src/git.ceph.com_teuthology_e06b2e089b00edb8cb2ffc1647f3d9093a53c565/teuthology/contextutil.py", line 55, in nested
    raise exc[1]
  File "/home/teuthworker/src/git.ceph.com_teuthology_e06b2e089b00edb8cb2ffc1647f3d9093a53c565/teuthology/contextutil.py", line 47, in nested
    if exit(*exc):
  File "/usr/lib/python3.8/contextlib.py", line 120, in __exit__
    next(self.gen)
  File "/home/teuthworker/src/git.ceph.com_teuthology_e06b2e089b00edb8cb2ffc1647f3d9093a53c565/teuthology/task/install/__init__.py", line 222, in install
    remove_packages(ctx, config, package_list)
  File "/home/teuthworker/src/git.ceph.com_teuthology_e06b2e089b00edb8cb2ffc1647f3d9093a53c565/teuthology/task/install/__init__.py", line 105, in remove_packages
    if not remote.is_reimageable or cleanup:
  File "/home/teuthworker/src/git.ceph.com_teuthology_e06b2e089b00edb8cb2ffc1647f3d9093a53c565/teuthology/orchestra/remote.py", line 471, in is_reimageable
    return self.machine_type in self._reimage_types
  File "/home/teuthworker/src/git.ceph.com_teuthology_e06b2e089b00edb8cb2ffc1647f3d9093a53c565/teuthology/orchestra/remote.py", line 463, in machine_type
    remote_info = teuthology.lock.query.get_status(self.hostname)
  File "/home/teuthworker/src/git.ceph.com_teuthology_e06b2e089b00edb8cb2ffc1647f3d9093a53c565/teuthology/lock/query.py", line 20, in get_status
    while proceed():
  File "/home/teuthworker/src/git.ceph.com_teuthology_e06b2e089b00edb8cb2ffc1647f3d9093a53c565/teuthology/contextutil.py", line 133, in __call__
    raise MaxWhileTries(error_msg)
teuthology.exceptions.MaxWhileTries: 'get_status smithi122.front.sepia.ceph.com' reached maximum tries (10) after waiting for 32.5 seconds
2023-02-26T10:19:16.180 INFO:teuthology.task.internal:Tidying up after the test...
2023-02-26T10:19:16.180 DEBUG:teuthology.orchestra.run.smithi122:> find /home/ubuntu/cephtest -ls ; rmdir -- /home/ubuntu/cephtest
2023-02-26T10:19:16.183 DEBUG:teuthology.orchestra.run.smithi131:> find /home/ubuntu/cephtest -ls ; rmdir -- /home/ubuntu/cephtest
2023-02-26T10:19:16.185 DEBUG:teuthology.orchestra.run.smithi165:> find /home/ubuntu/cephtest -ls ; rmdir -- /home/ubuntu/cephtest
2023-02-26T10:19:16.198 INFO:teuthology.orchestra.run.smithi122.stdout:   393230      4 drwxr-xr-x   2  ubuntu   ubuntu       4096 Feb 26 10:19 /home/ubuntu/cephtest
2023-02-26T10:19:16.200 INFO:teuthology.orchestra.run.smithi131.stdout:   393230      4 drwxr-xr-x   2  ubuntu   ubuntu       4096 Feb 26 10:19 /home/ubuntu/cephtest
2023-02-26T10:19:16.202 INFO:teuthology.orchestra.run.smithi165.stdout:   393279      4 drwxr-xr-x   2  ubuntu   ubuntu       4096 Feb 26 10:19 /home/ubuntu/cephtest
2023-02-26T10:19:16.203 DEBUG:teuthology.run_tasks:Unwinding manager console_log
2023-02-26T10:19:16.292 DEBUG:teuthology.run_tasks:Exception was not quenched, exiting: MaxWhileTries: 'get_status smithi122.front.sepia.ceph.com' reached maximum tries (10) after waiting for 32.5 seconds
2023-02-26T10:19:16.293 INFO:teuthology.nuke:Checking targets against current locks
2023-02-26T10:19:43.922 CRITICAL:teuthology:Uncaught exception
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_teuthology_e06b2e089b00edb8cb2ffc1647f3d9093a53c565/teuthology/run.py", line 411, in main
    run_tasks(tasks=config['tasks'], ctx=fake_ctx)
  File "/home/teuthworker/src/git.ceph.com_teuthology_e06b2e089b00edb8cb2ffc1647f3d9093a53c565/teuthology/run_tasks.py", line 216, in run_tasks
    raise SystemExit(1)
SystemExit: 1

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_teuthology_e06b2e089b00edb8cb2ffc1647f3d9093a53c565/virtualenv/bin/teuthology", line 8, in <module>
    sys.exit(main())
  File "/home/teuthworker/src/git.ceph.com_teuthology_e06b2e089b00edb8cb2ffc1647f3d9093a53c565/scripts/run.py", line 38, in main
    teuthology.run.main(args)
  File "/home/teuthworker/src/git.ceph.com_teuthology_e06b2e089b00edb8cb2ffc1647f3d9093a53c565/teuthology/run.py", line 414, in main
    report_outcome(config, archive, fake_ctx.summary, fake_ctx)
  File "/home/teuthworker/src/git.ceph.com_teuthology_e06b2e089b00edb8cb2ffc1647f3d9093a53c565/teuthology/run.py", line 271, in report_outcome
    nuke(fake_ctx, fake_ctx.lock)
  File "/home/teuthworker/src/git.ceph.com_teuthology_e06b2e089b00edb8cb2ffc1647f3d9093a53c565/teuthology/nuke/__init__.py", line 245, in nuke
    status = get_status(target)
  File "/home/teuthworker/src/git.ceph.com_teuthology_e06b2e089b00edb8cb2ffc1647f3d9093a53c565/teuthology/lock/query.py", line 20, in get_status
    while proceed():
  File "/home/teuthworker/src/git.ceph.com_teuthology_e06b2e089b00edb8cb2ffc1647f3d9093a53c565/teuthology/contextutil.py", line 133, in __call__
    raise MaxWhileTries(error_msg)
teuthology.exceptions.MaxWhileTries: 'get_status smithi122.front.sepia.ceph.com' reached maximum tries (10) after waiting for 32.5 seconds

/a/dis-2023-02-26_09:14:25-rbd-wip-dis-testing-distro-default-smithi/7188287

I killed it manually 27 hours later. The test nodes were unnecessarily locked all that time.

Actions

Also available in: Atom PDF