Bug #19737

EAGAIN encountered during pg scrub (jewel)

Added by Nathan Cutler about 7 years ago. Updated about 6 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: Tests
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

test description: rados/singleton-nomsgr/{all/lfn-upgrade-infernalis.yaml rados.yaml}

http://qa-proxy.ceph.com/teuthology/smithfarm-2017-04-21_05:45:14-rados-wip-jewel-backports-distro-basic-smithi/1052551/teuthology.log

  1. Infernalis is installed
  2. HEALTH_OK is reached
  3. test pool created with pg_num 1
  4. create_verify_lfn_objects task completes
  5. "sequential" block starts
  6. create_verify_lfn_objects task runs again (wtf?)
  7. cluster is upgraded to jewel
  8. all daemons except osd.2 are restarted (so osd.2 continues on infernalis)
  9. ceph_manager.wait_for_clean runs
  10. ceph_manager.do_pg_scrub runs
  11. sequential block ends
  12. ceph_manager.do_pg_scrub runs again (wtf?)
  13. create_verify_lfn_objects task runs on the mixed cluster
  14. osd.2 is restarted (becoming jewel)
  15. ceph osd set require_jewel_osds
  16. ceph_manager.do_pg_scrub runs
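The "sequential" block in the steps above is driven by teuthology's sequential task, which (per the sequential.py and run_tasks.py frames in the traceback) dispatches each sub-task in order via run_one_task. A minimal sketch of that dispatch loop, with an illustrative registry standing in for teuthology's dynamic task lookup (names here are assumptions, not the real teuthology API):

```python
# Hypothetical taskname -> callable registry; real teuthology resolves the
# task module dynamically instead.
TASKS = {}

def run_one_task(taskname, ctx, config):
    # Look the task up and invoke it, mirroring run_tasks.run_one_task.
    return TASKS[taskname](ctx=ctx, config=config)

def sequential(ctx, config):
    # config is a list of single-entry dicts, e.g.
    # [{'create_verify_lfn_objects': None}, {'ceph_manager.do_pg_scrub': {...}}].
    # Each sub-task runs to completion before the next starts; an unhandled
    # exception (such as the EAGAIN failure in this report) aborts the block.
    for entry in config:
        for taskname, subconfig in entry.items():
            run_one_task(taskname, ctx=ctx, config=subconfig)
```

Because each sub-task is run synchronously, a failure in the last do_pg_scrub propagates straight up through sequential's task() and fails the whole job.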

Reading the log, everything seems to work fine up to and including "ceph_manager.wait_for_clean".

At this point, all Ceph daemons except for osd.2 are running jewel; osd.2 is running infernalis.

The last step of the sequential block - the do_pg_scrub task - starts, does some work, and fails with EAGAIN:

2017-04-21T06:47:16.670 INFO:tasks.ceph.ceph_manager.ceph:clean!
2017-04-21T06:47:16.670 INFO:teuthology.task.sequential:In sequential, running task ceph_manager.do_pg_scrub...
...
2017-04-21T06:47:17.486 INFO:teuthology.orchestra.run.smithi176:Running: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph --cluster ceph pg scrub 1.0'
2017-04-21T06:47:17.633 INFO:teuthology.orchestra.run.smithi176.stderr:Error EAGAIN: pg 1.0 primary osd.1 not up
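The "Error EAGAIN" here is the monitor refusing the scrub because the PG's primary OSD is not up; the ceph CLI passes the errno back as its exit status, which is why the CommandFailedError further down reports status 11 (EAGAIN is errno 11 on Linux). A quick check of that mapping with Python's standard errno module:

```python
import errno
import os

# errno.EAGAIN is 11 on Linux, matching the "status 11" in the
# CommandFailedError for this ceph command.
print(errno.EAGAIN, errno.errorcode[errno.EAGAIN], os.strerror(errno.EAGAIN))
```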

Immediately after that, we find ourselves in "create_verify_lfn_objects" instead of the expected second do_pg_scrub:

2017-04-21T06:47:17.641 INFO:tasks.create_verify_lfn_objects:ceph_verify_lfn_objects verifying...

That task completes, but it does not appear to be relevant, because right on its heels comes the traceback from the EAGAIN:

Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/run_tasks.py", line 86, in run_tasks
    manager = run_one_task(taskname, ctx=ctx, config=config)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/run_tasks.py", line 65, in run_one_task
    return task(**kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/task/sequential.py", line 46, in task
    mgr = run_tasks.run_one_task(taskname, ctx=ctx, config=confg)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/run_tasks.py", line 65, in run_one_task
    return task(**kwargs)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-jewel-backports/qa/tasks/ceph_manager.py", line 2041, in task
    fn(*args, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-jewel-backports/qa/tasks/ceph_manager.py", line 1469, in do_pg_scrub
    self.raw_cluster_cmd('pg', stype, self.get_pgid(pool, pgnum))
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-jewel-backports/qa/tasks/ceph_manager.py", line 865, in raw_cluster_cmd
    stdout=StringIO(),
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/remote.py", line 193, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 414, in run
    r.wait()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 149, in wait
    self._raise_for_status()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 171, in _raise_for_status
    node=self.hostname, label=self.label
CommandFailedError: Command failed on smithi176 with status 11: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph --cluster ceph pg scrub 1.0'
2017-04-21T06:47:33.266 ERROR:teuthology.run_tasks: Sentry event: http://sentry.ceph.com/sepia/teuthology/?q=0f9ca46556a642158e873d093d39cd2c
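Since "pg scrub" can transiently return EAGAIN while the primary OSD is restarting during a staggered upgrade like this one, one plausible hardening (an assumption sketched here, not the change that actually resolved this ticket) is to retry the scrub command while the monitor keeps answering EAGAIN:

```python
import time

class CommandFailedError(Exception):
    """Stand-in for teuthology's exception; carries the remote exit status."""
    def __init__(self, exitstatus):
        super().__init__("Command failed with status %d" % exitstatus)
        self.exitstatus = exitstatus

def scrub_with_retry(run_scrub, attempts=10, delay=5.0):
    # run_scrub() would issue `ceph pg scrub <pgid>` and raise on failure.
    # Retry while the mon answers EAGAIN (exit status 11, "primary osd not
    # up"); any other failure, or exhausting the retries, re-raises.
    for attempt in range(attempts):
        try:
            return run_scrub()
        except CommandFailedError as e:
            if e.exitstatus != 11 or attempt == attempts - 1:
                raise
            time.sleep(delay)
```

This would have let the test ride out the window in which osd.1's restart leaves pg 1.0 without an up primary.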

Related issues (2: 0 open, 2 closed)

Related to Ceph - Bug #23007: jewel integration testing: ceph pg scrub 1.0 fails in create_verify_lfn_objects (Duplicate, David Zafman, 02/15/2018)

Has duplicate Ceph - Bug #16692: deep-scrub Error EAGAIN in cephmanager task in rados run (Duplicate, 07/14/2016)
