Bug #19737 (closed): EAGAIN encountered during pg scrub (jewel)

Added by Nathan Cutler about 7 years ago. Updated about 6 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: Tests
Target version: -
% Done: 0%
Source: -
Tags: -
Backport: -
Regression: No
Severity: 3 - minor
Reviewed: -
Affected Versions: -
ceph-qa-suite: -
Component(RADOS): -
Pull request ID: -
Crash signature (v1): -
Crash signature (v2): -

Description

test description: rados/singleton-nomsgr/{all/lfn-upgrade-infernalis.yaml rados.yaml}

http://qa-proxy.ceph.com/teuthology/smithfarm-2017-04-21_05:45:14-rados-wip-jewel-backports-distro-basic-smithi/1052551/teuthology.log

  1. Infernalis is installed
  2. HEALTH_OK is reached
  3. test pool created with pg_num 1
  4. create_verify_lfn_objects task completes
  5. "sequential" block starts
  6. create_verify_lfn_objects task runs again (wtf?)
  7. cluster is upgraded to jewel
  8. all daemons except osd.2 are restarted (so osd.2 continues on infernalis)
  9. ceph_manager.wait_for_clean runs
  10. ceph_manager.do_pg_scrub runs
  11. sequential block ends
  12. ceph_manager.do_pg_scrub runs again (wtf?)
  13. create_verify_lfn_objects task runs on the mixed cluster
  14. osd.2 is restarted (becoming jewel)
  15. ceph osd set require_jewel_osds
  16. ceph_manager.do_pg_scrub runs
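
A minimal sketch of how teuthology's "sequential" meta-task appears to dispatch the nested steps above, inferred from the sequential.py frames in the traceback further down (the real task takes a richer config and does more bookkeeping):

def sequential(ctx, config, run_one_task):
    # each entry is a one-key dict mapping a task name, e.g.
    # 'ceph_manager.do_pg_scrub', to that task's own config
    for entry in config:
        ((taskname, task_config),) = entry.items()
        # nested tasks run in-line, so a failure in any step aborts
        # the whole sequential block
        run_one_task(taskname, ctx=ctx, config=task_config)

Because the steps run in-line, a failure in the block's last step (do_pg_scrub) fails the entire sequential block, which is what happens below.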

Reading the log, everything seems to work fine up to and including "ceph_manager.wait_for_clean".

At this point, all Ceph daemons except for osd.2 are running jewel; osd.2 is running infernalis.

The last step of the sequential block - do_pg_scrub task - starts, does some work, and fails with EAGAIN:

2017-04-21T06:47:16.670 INFO:tasks.ceph.ceph_manager.ceph:clean!
2017-04-21T06:47:16.670 INFO:teuthology.task.sequential:In sequential, running task ceph_manager.do_pg_scrub...
...
2017-04-21T06:47:17.486 INFO:teuthology.orchestra.run.smithi176:Running: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph --cluster ceph pg scrub 1.0'
2017-04-21T06:47:17.633 INFO:teuthology.orchestra.run.smithi176.stderr:Error EAGAIN: pg 1.0 primary osd.1 not up

Immediately after that, we find ourselves in "create_verify_lfn_objects" instead of the expected second do_pg_scrub:

2017-04-21T06:47:17.641 INFO:tasks.create_verify_lfn_objects:ceph_verify_lfn_objects verifying...

That task completes, but does not appear to be relevant, because right on its heels comes the traceback from the EAGAIN failure:

Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/run_tasks.py", line 86, in run_tasks
    manager = run_one_task(taskname, ctx=ctx, config=config)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/run_tasks.py", line 65, in run_one_task
    return task(**kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/task/sequential.py", line 46, in task
    mgr = run_tasks.run_one_task(taskname, ctx=ctx, config=confg)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/run_tasks.py", line 65, in run_one_task
    return task(**kwargs)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-jewel-backports/qa/tasks/ceph_manager.py", line 2041, in task
    fn(*args, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-jewel-backports/qa/tasks/ceph_manager.py", line 1469, in do_pg_scrub
    self.raw_cluster_cmd('pg', stype, self.get_pgid(pool, pgnum))
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-jewel-backports/qa/tasks/ceph_manager.py", line 865, in raw_cluster_cmd
    stdout=StringIO(),
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/remote.py", line 193, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 414, in run
    r.wait()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 149, in wait
    self._raise_for_status()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 171, in _raise_for_status
    node=self.hostname, label=self.label
CommandFailedError: Command failed on smithi176 with status 11: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph --cluster ceph pg scrub 1.0'
2017-04-21T06:47:33.266 ERROR:teuthology.run_tasks: Sentry event: http://sentry.ceph.com/sepia/teuthology/?q=0f9ca46556a642158e873d093d39cd2c
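
For reference, the failing call chain boils down to the sketch below, pieced together from the ceph_manager.py frames above; names follow the traceback, but the real helpers carry extra arguments and retry logic. The "status 11" in the CommandFailedError is simply EAGAIN (errno 11) surfacing as the command's exit code.

def do_pg_scrub(manager, pool, pgnum, stype='scrub'):
    # get_pgid() maps (pool, pgnum) to a pgid string such as "1.0"
    pgid = manager.get_pgid(pool, pgnum)
    # raw_cluster_cmd() shells out on a mon host to:
    #   ceph --cluster ceph pg scrub 1.0
    # and raises CommandFailedError if the command exits non-zero;
    # here it exits with status 11 because the mon answered
    # "Error EAGAIN: pg 1.0 primary osd.1 not up"
    manager.raw_cluster_cmd('pg', stype, pgid)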

Related issues (2): 0 open, 2 closed

Related to Ceph - Bug #23007: jewel integration testing: ceph pg scrub 1.0 fails in create_verify_lfn_objects (Duplicate, David Zafman, 02/15/2018)

Has duplicate Ceph - Bug #16692: deep-scrub Error EAGAIN in cephmanager task in rados run (Duplicate, 07/14/2016)
Updated by Nathan Cutler almost 7 years ago (#2)

  • Subject changed from EAGAIN encountered during pg scrub (jewel 10.2.8 integration testing) to EAGAIN encountered during pg scrub (jewel)
Updated by Josh Durgin almost 7 years ago (#4)

  • Has duplicate Bug #16692: deep-scrub Error EAGAIN in cephmanager task in rados run added
Updated by Greg Farnum almost 7 years ago (#5)

Is the message that the primary OSD is down incorrect? We've seen a few things like this that are test bugs around having the correct (number of) OSDs running when the test invokes other commands.

Updated by Greg Farnum almost 7 years ago (#6)

  • Project changed from Ceph to RADOS
  • Category set to Tests

(Optimistically sorting it as a test issue.)

Updated by Josh Durgin about 6 years ago (#7)

  • Related to Bug #23007: jewel integration testing: ceph pg scrub 1.0 fails in create_verify_lfn_objects added
Updated by Josh Durgin about 6 years ago (#8)

Looked at the logs from http://pulpito.front.sepia.ceph.com/smithfarm-2018-02-06_21:07:15-rados-wip-jewel-backports-distro-basic-smithi/2160760 and it's clear this is a test issue.

The test upgrades the osds, restarts them, and waits for 'ceph health' to report clean; however, it does not wait for the osds to finish booting. The error occurs after osd.1 has been restarted and has even gotten as far as sending its boot message:

2018-02-07 09:41:41.752620 7f909e213700  1 -- 172.21.15.60:6813/25192 --> 172.21.15.60:6791/0 -- osd_boot(osd.1 booted 0 features 576460752032874495 v6) v6 -- ?+0 0x7f90c6572800 con 0x7f90c6680100

but the scrub command is sent:

2018-02-07T09:41:42.911 INFO:teuthology.orchestra.run.smithi060:Running: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph --cluster ceph pg scrub 1.0'
2018-02-07T09:41:43.048 INFO:teuthology.orchestra.run.smithi060.stderr:Error EAGAIN: pg 1.0 primary osd.1 not up

before the monitor processes the boot message:

02-07 09:41:43.775471 mon.0 172.21.15.60:6789/0 21 : cluster [INF] osd.1 172.21.15.60:6813/25192 boot

It looks like this would be fixed by backporting at least 1b7552c9cb331978cb0bfd4d7dc4dcde4186c176 and 86c0d07e32205e2b6aa417a0e4ae03f0084a1888.
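
For illustration, the kind of wait that seems to be missing could look like the sketch below. It is written against the raw_cluster_cmd() helper seen in the traceback, assuming it returns the command's stdout; the actual backported commits may implement the wait differently.

import json
import time

def wait_until_all_osds_up(manager, timeout=300, interval=3):
    # poll 'ceph osd dump --format=json' until every OSD reports up,
    # so a subsequent 'pg scrub' does not race with an OSD that has
    # sent its boot message but is not yet marked up in the osdmap
    deadline = time.time() + timeout
    while time.time() < deadline:
        osd_dump = json.loads(manager.raw_cluster_cmd('osd', 'dump', '--format=json'))
        if all(osd['up'] for osd in osd_dump['osds']):
            return
        time.sleep(interval)
    raise RuntimeError('timed out waiting for all OSDs to come up')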

Updated by Nathan Cutler about 6 years ago (#11)

  • Status changed from New to Resolved
