Fix #6673
'osd pool set metadata pg_num 34' broken

Added by Sage Weil over 10 years ago. Updated over 10 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
Monitor
Target version:
-
% Done:

0%

Source:
Q/A

Description

2013-10-29T03:50:14.118 DEBUG:teuthology.orchestra.run:Running [10.214.131.28]: 'adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph osd pool set metadata pg_num 34'
2013-10-29T03:50:14.345 INFO:teuthology.task.ceph.mon.b.err:[10.214.131.17]: 2013-10-29 03:50:14.345069 7efffa578700 -1 bad boost::get: key val is not type long
2013-10-29T03:50:14.347 INFO:teuthology.task.ceph.mon.b.err:[10.214.131.17]: 2013-10-29 03:50:14.346887 7efffa578700 -1 0x7efffa573ce8
2013-10-29T03:50:14.347 INFO:teuthology.task.ceph.mon.b.err:[10.214.131.17]: 2013-10-29 03:50:14.346911 7efffa578700 -1 bad boost::get: key val is not type float
2013-10-29T03:50:14.349 INFO:teuthology.task.ceph.mon.b.err:[10.214.131.17]: 2013-10-29 03:50:14.348718 7efffa578700 -1 0x7efffa573cf8
2013-10-29T03:50:14.352 INFO:teuthology.orchestra.run.err:[10.214.131.28]: Error EAGAIN: currently creating pgs, wait
2013-10-29T03:50:14.364 INFO:teuthology.task.thrashosds.ceph_manager:got EAGAIN setting pool property, waiting a few seconds...

Also, let's fix the arg-parsing noise (the "bad boost::get" messages above).
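For context, the EAGAIN above is a transient refusal from the monitor, and the teuthology side keeps retrying the pool-set until it succeeds or a deadline passes. A minimal sketch of that retry pattern (not the actual ceph_manager helper; the function name, return convention, and defaults are illustrative):

```python
import time

def set_pool_property(run_cmd, pool, prop, val, timeout=60, interval=3):
    """Retry 'ceph osd pool set' while the monitor answers EAGAIN.

    Simplified model of the retry loop in teuthology's ceph_manager;
    run_cmd, the timeout, and the interval are assumptions for
    illustration, not the real helper's API or defaults.
    """
    deadline = time.time() + timeout
    while True:
        code, err = run_cmd(['ceph', 'osd', 'pool', 'set', pool, prop, str(val)])
        if code == 0:
            return  # the property change was accepted
        if 'EAGAIN' not in err:
            raise RuntimeError(err)  # a real error, not a transient refusal
        if time.time() > deadline:
            raise Exception(
                'timed out getting EAGAIN when setting pool property '
                '%s %s = %s' % (pool, prop, val))
        time.sleep(interval)
```

If the monitor never stops answering EAGAIN, this loop raises the same "timed out getting EAGAIN" exception seen in the traceback in comment #2.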

Actions #1

Updated by Greg Farnum over 10 years ago

What's broken about this, besides the ridiculous parsing output? We deliberately prevent splitting while creating the PGs.

Or is this supposed to be a teuthology bug to not split while creating?

Actions #2

Updated by Sage Weil over 10 years ago

The test later fails with:

2013-10-29T04:01:46.191 ERROR:teuthology.run_tasks:Manager failed: <contextlib.GeneratorContextManager object at 0x1c20cd0>
Traceback (most recent call last):
  File "/home/teuthworker/teuthology-master/teuthology/run_tasks.py", line 84, in run_tasks
    suppress = manager.__exit__(*exc_info)
  File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/home/teuthworker/teuthology-master/teuthology/task/thrashosds.py", line 170, in task
    thrash_proc.do_join()
  File "/home/teuthworker/teuthology-master/teuthology/task/ceph_manager.py", line 105, in do_join
    self.thread.get()
  File "/usr/lib/python2.7/dist-packages/gevent/greenlet.py", line 308, in get
    raise self._exception
Exception: timed out getting EAGAIN when setting pool property metadata pg_num = 34
2013-10-29T04:01:46.192 DEBUG:teuthology.run_tasks:Unwinding manager <contextlib.GeneratorContextManager object at 0x1a56290>
2013-10-29T04:01:46.192 ERROR:teuthology.contextutil:Saw exception from nested tasks
Traceback (most recent call last):
  File "/home/teuthworker/teuthology-master/teuthology/contextutil.py", line 27, in nested
    yield vars
  File "/home/teuthworker/teuthology-master/teuthology/task/ceph.py", line 1356, in task
    yield
  File "/home/teuthworker/teuthology-master/teuthology/run_tasks.py", line 84, in run_tasks
    suppress = manager.__exit__(*exc_info)
  File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/home/teuthworker/teuthology-master/teuthology/task/thrashosds.py", line 170, in task
    thrash_proc.do_join()
  File "/home/teuthworker/teuthology-master/teuthology/task/ceph_manager.py", line 105, in do_join
    self.thread.get()
  File "/usr/lib/python2.7/dist-packages/gevent/greenlet.py", line 308, in get
    raise self._exception
Exception: timed out getting EAGAIN when setting pool property metadata pg_num = 34

Actions #3

Updated by Sage Weil over 10 years ago

  • Status changed from In Progress to Fix Under Review

Actions #4

Updated by Sage Weil over 10 years ago

  • Status changed from Fix Under Review to Resolved

Actions #5

Updated by Greg Farnum over 10 years ago

  • Category set to Monitor
  • Status changed from Resolved to In Progress
  • Assignee changed from Sage Weil to Greg Farnum

We saw this again: /a/dzafman-2013-10-31_14:29:25-rados-wip-flush-5855-testing-basic-plana/77511.
The PGs were indeed still creating on the first two attempts to increase pg_num, but for the remaining 48 attempts they were active+clean. The monitor just checks pgmap::creating_pgs.empty() to decide whether it can increase the count, so it appears that set is somehow not being maintained correctly (and the failure seems to be a fairly new bug, or at least a newly exposed one).
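The global gate described here can be modeled roughly as follows (a Python sketch, not the actual C++ monitor code; representing creating_pgs as a set of (pool_id, pg_seed) tuples is an assumption for illustration):

```python
EAGAIN = 11

def check_pg_num_change(creating_pgs):
    """Sketch of the monitor gate: a pg_num increase is refused with
    EAGAIN while *any* PG anywhere in the cluster is still creating.

    creating_pgs stands in for pgmap.creating_pgs; the tuple layout
    and errno-style return are illustrative assumptions.
    """
    if creating_pgs:  # the real check is pgmap.creating_pgs.empty()
        return (-EAGAIN, 'currently creating pgs, wait')
    return (0, '')
```

Because the check is cluster-wide, creating PGs in a completely unrelated pool are enough to refuse the change, which is the behavior the next comment digs into.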

Actions #6

Updated by Greg Farnum over 10 years ago

  • Tracker changed from Bug to Fix

Actually, looking at this again, I've realized that the teuthology test is running the rados API tests, which constantly create and remove new pools. Of the 518 pgmaps, only 93 do not have PGs in the "creating" state. All the create attempts I spot-checked did in fact have creating PGs in the previous pgmap.

The best idea I can come up with to limit this issue is to check specifically for PGs being created in the pool whose pg_num we want to change. I'll push a proof-of-concept branch shortly.
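That per-pool check could look something like the following sketch (a Python stand-in for the C++ monitor change; the (pool_id, pg_seed) data layout is again an assumption, and this mirrors the idea rather than the merged implementation):

```python
def can_change_pg_num(creating_pgs, pool_id):
    """Per-pool refinement of the gate: only PGs still creating in the
    *target* pool block the change, so churn from unrelated short-lived
    pools no longer triggers EAGAIN.

    creating_pgs is modeled as a set of (pool_id, pg_seed) tuples.
    """
    return all(pid != pool_id for pid, _seed in creating_pgs)
```

Under this check, the rados API tests' constant pool creation would only delay pg_num changes on the pools actually being created, not on the pool the thrasher is resizing.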

Actions #7

Updated by Greg Farnum over 10 years ago

  • Status changed from In Progress to Fix Under Review

Actions #8

Updated by Greg Farnum over 10 years ago

  • Status changed from Fix Under Review to Resolved

This got merged a while ago.
