Bug #19737

EAGAIN encountered during pg scrub (jewel)

Added by Nathan Cutler about 7 years ago. Updated about 6 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: Tests
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

test description: rados/singleton-nomsgr/{all/lfn-upgrade-infernalis.yaml rados.yaml}

http://qa-proxy.ceph.com/teuthology/smithfarm-2017-04-21_05:45:14-rados-wip-jewel-backports-distro-basic-smithi/1052551/teuthology.log

  1. Infernalis is installed
  2. HEALTH_OK is reached
  3. test pool created with pg_num 1
  4. create_verify_lfn_objects task completes
  5. "sequential" block starts
  6. create_verify_lfn_objects task runs again (wtf?)
  7. cluster is upgraded to jewel
  8. all daemons except osd.2 are restarted (so osd.2 continues on infernalis)
  9. ceph_manager.wait_for_clean runs
  10. ceph_manager.do_pg_scrub runs
  11. sequential block ends
  12. ceph_manager.do_pg_scrub runs again (wtf?)
  13. create_verify_lfn_objects task runs on the mixed cluster
  14. osd.2 is restarted (becoming jewel)
  15. ceph osd set require_jewel_osds
  16. ceph_manager.do_pg_scrub runs
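The "sequential" block in the steps above is driven by teuthology's sequential task, which (per the sequential.py and run_tasks.py frames in the traceback) dispatches each sub-task in order via run_one_task. A minimal sketch of that dispatch loop, with an illustrative registry standing in for teuthology's dynamic task lookup (names here are assumptions, not the real teuthology API):

```python
# Hypothetical taskname -> callable registry; real teuthology resolves the
# task module dynamically instead.
TASKS = {}

def run_one_task(taskname, ctx, config):
    # Look the task up and invoke it, mirroring run_tasks.run_one_task.
    return TASKS[taskname](ctx=ctx, config=config)

def sequential(ctx, config):
    # config is a list of single-entry dicts, e.g.
    # [{'create_verify_lfn_objects': None}, {'ceph_manager.do_pg_scrub': {...}}].
    # Each sub-task runs to completion before the next starts; an unhandled
    # exception (such as the EAGAIN failure in this report) aborts the block.
    for entry in config:
        for taskname, subconfig in entry.items():
            run_one_task(taskname, ctx=ctx, config=subconfig)
```

Because each sub-task is run synchronously, a failure in the last do_pg_scrub propagates straight up through sequential's task() and fails the whole job.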

Reading the log, everything seems to work fine up to and including "ceph_manager.wait_for_clean".

At this point, all Ceph daemons except for osd.2 are running jewel; osd.2 is running infernalis.

The last step of the sequential block - the do_pg_scrub task - starts, does some work, and fails with EAGAIN:

2017-04-21T06:47:16.670 INFO:tasks.ceph.ceph_manager.ceph:clean!
2017-04-21T06:47:16.670 INFO:teuthology.task.sequential:In sequential, running task ceph_manager.do_pg_scrub...
...
2017-04-21T06:47:17.486 INFO:teuthology.orchestra.run.smithi176:Running: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph --cluster ceph pg scrub 1.0'
2017-04-21T06:47:17.633 INFO:teuthology.orchestra.run.smithi176.stderr:Error EAGAIN: pg 1.0 primary osd.1 not up
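The "Error EAGAIN" here is the monitor refusing the scrub because the PG's primary OSD is not up; the ceph CLI passes the errno back as its exit status, which is why the CommandFailedError further down reports status 11 (EAGAIN is errno 11 on Linux). A quick check of that mapping with Python's standard errno module:

```python
import errno
import os

# errno.EAGAIN is 11 on Linux, matching the "status 11" in the
# CommandFailedError for this ceph command.
print(errno.EAGAIN, errno.errorcode[errno.EAGAIN], os.strerror(errno.EAGAIN))
```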

Immediately after that, we find ourselves in "create_verify_lfn_objects" instead of the expected second do_pg_scrub:

2017-04-21T06:47:17.641 INFO:tasks.create_verify_lfn_objects:ceph_verify_lfn_objects verifying...

That task completes, but it does not appear to be relevant, because right on its heels comes the traceback from the EAGAIN:

Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/run_tasks.py", line 86, in run_tasks
    manager = run_one_task(taskname, ctx=ctx, config=config)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/run_tasks.py", line 65, in run_one_task
    return task(**kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/task/sequential.py", line 46, in task
    mgr = run_tasks.run_one_task(taskname, ctx=ctx, config=confg)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/run_tasks.py", line 65, in run_one_task
    return task(**kwargs)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-jewel-backports/qa/tasks/ceph_manager.py", line 2041, in task
    fn(*args, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-jewel-backports/qa/tasks/ceph_manager.py", line 1469, in do_pg_scrub
    self.raw_cluster_cmd('pg', stype, self.get_pgid(pool, pgnum))
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-jewel-backports/qa/tasks/ceph_manager.py", line 865, in raw_cluster_cmd
    stdout=StringIO(),
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/remote.py", line 193, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 414, in run
    r.wait()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 149, in wait
    self._raise_for_status()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 171, in _raise_for_status
    node=self.hostname, label=self.label
CommandFailedError: Command failed on smithi176 with status 11: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph --cluster ceph pg scrub 1.0'
2017-04-21T06:47:33.266 ERROR:teuthology.run_tasks: Sentry event: http://sentry.ceph.com/sepia/teuthology/?q=0f9ca46556a642158e873d093d39cd2c
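Since "pg scrub" can transiently return EAGAIN while the primary OSD is restarting during a staggered upgrade like this one, one plausible hardening (an assumption sketched here, not the change that actually resolved this ticket) is to retry the scrub command while the monitor keeps answering EAGAIN:

```python
import time

class CommandFailedError(Exception):
    """Stand-in for teuthology's exception; carries the remote exit status."""
    def __init__(self, exitstatus):
        super().__init__("Command failed with status %d" % exitstatus)
        self.exitstatus = exitstatus

def scrub_with_retry(run_scrub, attempts=10, delay=5.0):
    # run_scrub() would issue `ceph pg scrub <pgid>` and raise on failure.
    # Retry while the mon answers EAGAIN (exit status 11, "primary osd not
    # up"); any other failure, or exhausting the retries, re-raises.
    for attempt in range(attempts):
        try:
            return run_scrub()
        except CommandFailedError as e:
            if e.exitstatus != 11 or attempt == attempts - 1:
                raise
            time.sleep(delay)
```

This would have let the test ride out the window in which osd.1's restart leaves pg 1.0 without an up primary.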

Related issues (2: 0 open, 2 closed)

Related to Ceph - Bug #23007: jewel integration testing: ceph pg scrub 1.0 fails in create_verify_lfn_objects (Duplicate, David Zafman, 02/15/2018)

Has duplicate Ceph - Bug #16692: deep-scrub Error EAGAIN in cephmanager task in rados run (Duplicate, 07/14/2016)
