Bug #19737 (closed): EAGAIN encountered during pg scrub (jewel)

Added by Nathan Cutler about 7 years ago. Updated about 6 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: Tests
Target version: -
% Done: 0%
Source: -
Tags: -
Backport: -
Regression: No
Severity: 3 - minor
Reviewed: -
Affected Versions: -
ceph-qa-suite: -
Component(RADOS): -
Pull request ID: -
Crash signature (v1): -
Crash signature (v2): -

Description

test description: rados/singleton-nomsgr/{all/lfn-upgrade-infernalis.yaml rados.yaml}

http://qa-proxy.ceph.com/teuthology/smithfarm-2017-04-21_05:45:14-rados-wip-jewel-backports-distro-basic-smithi/1052551/teuthology.log

  1. Infernalis is installed
  2. HEALTH_OK is reached
  3. test pool created with pg_num 1
  4. create_verify_lfn_objects task completes
  5. "sequential" block starts
  6. create_verify_lfn_objects task runs again (wtf?)
  7. cluster is upgraded to jewel
  8. all daemons except osd.2 are restarted (so osd.2 continues on infernalis)
  9. ceph_manager.wait_for_clean runs
  10. ceph_manager.do_pg_scrub runs
  11. sequential block ends
  12. ceph_manager.do_pg_scrub runs again (wtf?)
  13. create_verify_lfn_objects task runs on the mixed cluster
  14. osd.2 is restarted (becoming jewel)
  15. ceph osd set require_jewel_osds
  16. ceph_manager.do_pg_scrub runs
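
A minimal sketch of how teuthology's "sequential" meta-task appears to dispatch the nested steps above, inferred from the sequential.py frames in the traceback further down (the real task takes a richer config and does more bookkeeping):

def sequential(ctx, config, run_one_task):
    # each entry is a one-key dict mapping a task name, e.g.
    # 'ceph_manager.do_pg_scrub', to that task's own config
    for entry in config:
        ((taskname, task_config),) = entry.items()
        # nested tasks run in-line, so a failure in any step aborts
        # the whole sequential block
        run_one_task(taskname, ctx=ctx, config=task_config)

Because the steps run in-line, a failure in the block's last step (do_pg_scrub) fails the entire sequential block, which is what happens below.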

Reading the log, everything seems to work fine up to and including "ceph_manager.wait_for_clean".

At this point, all Ceph daemons except for osd.2 are running jewel; osd.2 is running infernalis.

The last step of the sequential block - do_pg_scrub task - starts, does some work, and fails with EAGAIN:

2017-04-21T06:47:16.670 INFO:tasks.ceph.ceph_manager.ceph:clean!
2017-04-21T06:47:16.670 INFO:teuthology.task.sequential:In sequential, running task ceph_manager.do_pg_scrub...
...
2017-04-21T06:47:17.486 INFO:teuthology.orchestra.run.smithi176:Running: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph --cluster ceph pg scrub 1.0'
2017-04-21T06:47:17.633 INFO:teuthology.orchestra.run.smithi176.stderr:Error EAGAIN: pg 1.0 primary osd.1 not up

Immediately after that, we find ourselves in "create_verify_lfn_objects" instead of the expected second do_pg_scrub:

2017-04-21T06:47:17.641 INFO:tasks.create_verify_lfn_objects:ceph_verify_lfn_objects verifying...

That task completes, but does not appear to be relevant, because right on its heels comes the traceback from the EAGAIN failure:

Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/run_tasks.py", line 86, in run_tasks
    manager = run_one_task(taskname, ctx=ctx, config=config)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/run_tasks.py", line 65, in run_one_task
    return task(**kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/task/sequential.py", line 46, in task
    mgr = run_tasks.run_one_task(taskname, ctx=ctx, config=confg)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/run_tasks.py", line 65, in run_one_task
    return task(**kwargs)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-jewel-backports/qa/tasks/ceph_manager.py", line 2041, in task
    fn(*args, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-jewel-backports/qa/tasks/ceph_manager.py", line 1469, in do_pg_scrub
    self.raw_cluster_cmd('pg', stype, self.get_pgid(pool, pgnum))
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-jewel-backports/qa/tasks/ceph_manager.py", line 865, in raw_cluster_cmd
    stdout=StringIO(),
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/remote.py", line 193, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 414, in run
    r.wait()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 149, in wait
    self._raise_for_status()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 171, in _raise_for_status
    node=self.hostname, label=self.label
CommandFailedError: Command failed on smithi176 with status 11: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph --cluster ceph pg scrub 1.0'
2017-04-21T06:47:33.266 ERROR:teuthology.run_tasks: Sentry event: http://sentry.ceph.com/sepia/teuthology/?q=0f9ca46556a642158e873d093d39cd2c
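
For reference, the failing call chain boils down to the sketch below, pieced together from the ceph_manager.py frames above; names follow the traceback, but the real helpers carry extra arguments and retry logic. The "status 11" in the CommandFailedError is simply EAGAIN (errno 11) surfacing as the command's exit code.

def do_pg_scrub(manager, pool, pgnum, stype='scrub'):
    # get_pgid() maps (pool, pgnum) to a pgid string such as "1.0"
    pgid = manager.get_pgid(pool, pgnum)
    # raw_cluster_cmd() shells out on a mon host to:
    #   ceph --cluster ceph pg scrub 1.0
    # and raises CommandFailedError if the command exits non-zero;
    # here it exits with status 11 because the mon answered
    # "Error EAGAIN: pg 1.0 primary osd.1 not up"
    manager.raw_cluster_cmd('pg', stype, pgid)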

Related issues (2): 0 open, 2 closed

Related to Ceph - Bug #23007: jewel integration testing: ceph pg scrub 1.0 fails in create_verify_lfn_objects (Duplicate, David Zafman, 02/15/2018)

Has duplicate Ceph - Bug #16692: deep-scrub Error EAGAIN in cephmanager task in rados run (Duplicate, 07/14/2016)
Updated by Nathan Cutler almost 7 years ago (#2)

  • Subject changed from EAGAIN encountered during pg scrub (jewel 10.2.8 integration testing) to EAGAIN encountered during pg scrub (jewel)
Updated by Josh Durgin almost 7 years ago (#4)

  • Has duplicate Bug #16692: deep-scrub Error EAGAIN in cephmanager task in rados run added
Updated by Greg Farnum almost 7 years ago (#5)

Is the message that the primary OSD is down incorrect? We've seen a few things like this that are test bugs around having the correct (number of) OSDs running when the test invokes other commands.

Updated by Greg Farnum almost 7 years ago (#6)

  • Project changed from Ceph to RADOS
  • Category set to Tests

(Optimistically sorting it as a test issue.)

Updated by Josh Durgin about 6 years ago (#7)

  • Related to Bug #23007: jewel integration testing: ceph pg scrub 1.0 fails in create_verify_lfn_objects added
Updated by Josh Durgin about 6 years ago (#8)

Looked at the logs from http://pulpito.front.sepia.ceph.com/smithfarm-2018-02-06_21:07:15-rados-wip-jewel-backports-distro-basic-smithi/2160760 and it's clear this is a test issue.

The test upgrades the osds, restarts them, and waits for 'ceph health' to report clean; however, it does not wait for the osds to finish booting. The error occurs after osd.1 has been restarted and has even gotten as far as sending its boot message:

2018-02-07 09:41:41.752620 7f909e213700  1 -- 172.21.15.60:6813/25192 --> 172.21.15.60:6791/0 -- osd_boot(osd.1 booted 0 features 576460752032874495 v6) v6 -- ?+0 0x7f90c6572800 con 0x7f90c6680100

but the scrub command is sent:

2018-02-07T09:41:42.911 INFO:teuthology.orchestra.run.smithi060:Running: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph --cluster ceph pg scrub 1.0'
2018-02-07T09:41:43.048 INFO:teuthology.orchestra.run.smithi060.stderr:Error EAGAIN: pg 1.0 primary osd.1 not up

before the monitor processes the boot message:

02-07 09:41:43.775471 mon.0 172.21.15.60:6789/0 21 : cluster [INF] osd.1 172.21.15.60:6813/25192 boot

It looks like this would be fixed by backporting at least 1b7552c9cb331978cb0bfd4d7dc4dcde4186c176 and 86c0d07e32205e2b6aa417a0e4ae03f0084a1888.
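
For illustration, the kind of wait that seems to be missing could look like the sketch below. It is written against the raw_cluster_cmd() helper seen in the traceback, assuming it returns the command's stdout; the actual backported commits may implement the wait differently.

import json
import time

def wait_until_all_osds_up(manager, timeout=300, interval=3):
    # poll 'ceph osd dump --format=json' until every OSD reports up,
    # so a subsequent 'pg scrub' does not race with an OSD that has
    # sent its boot message but is not yet marked up in the osdmap
    deadline = time.time() + timeout
    while time.time() < deadline:
        osd_dump = json.loads(manager.raw_cluster_cmd('osd', 'dump', '--format=json'))
        if all(osd['up'] for osd in osd_dump['osds']):
            return
        time.sleep(interval)
    raise RuntimeError('timed out waiting for all OSDs to come up')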

Updated by Nathan Cutler about 6 years ago (#11)

  • Status changed from New to Resolved
