Project

General

Profile

Bug #38157

progress: KeyError on pg_to_state[pg_str]['stat_sum']['num_bytes_recovered']

Added by Sage Weil 7 months ago. Updated 3 months ago.

Status:
Resolved
Priority:
Urgent
Assignee:
-
Category:
-
Target version:
-
Start date:
02/03/2019
Due date:
% Done:

0%

Source:
Tags:
Backport:
nautilus
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:

Description

i've noticed this a few times

2019-02-03T04:16:50.847 INFO:tasks.ceph.mgr.x.smithi072.stderr:2019-02-03 04:16:50.837 7f23348fa700 -1 mgr notify Traceback (most recent call last):
2019-02-03T04:16:50.847 INFO:tasks.ceph.mgr.x.smithi072.stderr:  File "/usr/lib/ceph/mgr/progress/module.py", line 363, in notify
2019-02-03T04:16:50.847 INFO:tasks.ceph.mgr.x.smithi072.stderr:    self._osdmap_changed(old_osdmap, self._latest_osdmap)
2019-02-03T04:16:50.847 INFO:tasks.ceph.mgr.x.smithi072.stderr:  File "/usr/lib/ceph/mgr/progress/module.py", line 345, in _osdmap_changed
2019-02-03T04:16:50.847 INFO:tasks.ceph.mgr.x.smithi072.stderr:    self._osd_out(old_osdmap, old_dump, new_osdmap, osd_id)
2019-02-03T04:16:50.847 INFO:tasks.ceph.mgr.x.smithi072.stderr:  File "/usr/lib/ceph/mgr/progress/module.py", line 320, in _osd_out
2019-02-03T04:16:50.847 INFO:tasks.ceph.mgr.x.smithi072.stderr:    ev.pg_update(self.get("pg_dump"), self.log)
2019-02-03T04:16:50.848 INFO:tasks.ceph.mgr.x.smithi072.stderr:  File "/usr/lib/ceph/mgr/progress/module.py", line 147, in pg_update
2019-02-03T04:16:50.848 INFO:tasks.ceph.mgr.x.smithi072.stderr:    pg_to_state[pg_str]['stat_sum']['num_bytes_recovered']
2019-02-03T04:16:50.848 INFO:tasks.ceph.mgr.x.smithi072.stderr:KeyError: ('1.43',)

/a/sage-2019-02-03_00:19:05-rados-wip-sage-testing-2019-02-02-1454-distro-basic-smithi/3542892


Related issues

Copied to mgr - Backport #39344: nautilus: progress: KeyError on pg_to_state[pg_str]['stat_sum']['num_bytes_recovered'] Resolved

History

#1 Updated by Kefu Chai 7 months ago

by looking at the python backtrace, pg_str was a tuple with only a single element in it, but it should have been a string. that's weird..

#2 Updated by Sage Weil 6 months ago

similar?

2019-02-16T13:40:30.630 INFO:tasks.ceph.mgr.y.smithi148.stderr:2019-02-16 13:40:30.625 7f74ead29700 -1 mgr notify Traceback (most recent call last):
2019-02-16T13:40:30.630 INFO:tasks.ceph.mgr.y.smithi148.stderr:  File "/usr/share/ceph/mgr/progress/module.py", line 399, in notify
2019-02-16T13:40:30.630 INFO:tasks.ceph.mgr.y.smithi148.stderr:    self._osdmap_changed(old_osdmap, self._latest_osdmap)
2019-02-16T13:40:30.630 INFO:tasks.ceph.mgr.y.smithi148.stderr:  File "/usr/share/ceph/mgr/progress/module.py", line 381, in _osdmap_changed
2019-02-16T13:40:30.630 INFO:tasks.ceph.mgr.y.smithi148.stderr:    self._osd_out(old_osdmap, old_dump, new_osdmap, osd_id)
2019-02-16T13:40:30.631 INFO:tasks.ceph.mgr.y.smithi148.stderr:  File "/usr/share/ceph/mgr/progress/module.py", line 356, in _osd_out
2019-02-16T13:40:30.631 INFO:tasks.ceph.mgr.y.smithi148.stderr:    ev.pg_update(self.get("pg_dump"), self.log)
2019-02-16T13:40:30.631 INFO:tasks.ceph.mgr.y.smithi148.stderr:  File "/usr/share/ceph/mgr/progress/module.py", line 155, in pg_update
2019-02-16T13:40:30.631 INFO:tasks.ceph.mgr.y.smithi148.stderr:    pg_to_state[pg_str]['stat_sum']['num_bytes_recovered']
2019-02-16T13:40:30.631 INFO:tasks.ceph.mgr.y.smithi148.stderr:KeyError: ('1.a',)

/a/kchai-2019-02-16_11:36:29-rados-wip-sage-testing-2019-02-16-1748-distro-basic-smithi/3601223

#3 Updated by Kefu Chai 6 months ago

  • Assignee set to Kefu Chai

#5 Updated by Sage Weil 6 months ago

/a/sage-2019-02-19_23:03:51-rados-wip-sage3-testing-2019-02-19-1008-distro-basic-smithi/3614261

#6 Updated by Sage Weil 6 months ago

/a/sage-2019-02-21_21:52:17-rados-wip-sage3-testing-2019-02-21-1359-distro-basic-smithi/3622638

#7 Updated by Sage Weil 6 months ago

/a/sage-2019-02-23_23:02:18-rados-wip-sage2-testing-2019-02-23-1354-distro-basic-smithi/3631867

#8 Updated by Sebastian Wagner 6 months ago

I wasn't really able to reproduce this locally in my vstart cluster.

Also, don't really get how this is supposed to work, because in progress.module.Module#_osd_out, affected_pgs is something like

affected_pgs = [PgId(poll_id, ps) for ps in range(0, pool['pg_num']) if ...]
PgRecoveryEvent(which_pgs=affected_pgs)

where ps will never be a character, but in the Traceback, it clearly asks for a pg_str containing characters.

Anyway. found it.

#9 Updated by Sage Weil 6 months ago

/a/rdias-2019-02-26_22:35:27-rados-wip-rdias2-testing-distro-basic-smithi/3642422

theory: rados_api_tests triggers this

#10 Updated by Sebastian Wagner 6 months ago

No luck with finding the cause using type annotations and mypy: https://github.com/sebastian-philipp/ceph/commit/a79a0846c3aee84e6edac5d30d55224ebd663233

I don't think this is caused by Python code.

I'm not sure it makes sense to create a PR for this branch.

#11 Updated by Kefu Chai 5 months ago

  • Assignee deleted (Kefu Chai)

reassigning from this ticket. i don't have a clue.

#13 Updated by Sage Weil 4 months ago

2019-04-11 16:52:56.429 7f635385d700 10 mgr.server operator() pool 1 pg_num_target 18 pg_num 23 -> 22 (merging 1.16 and 1.6)
...
2019-04-11 16:52:59.309 7f6360c27700 20 check_osd_map removing merged 1.16
2019-04-11 16:52:59.309 7f6360c27700 20  deleted pool 218
...
2019-04-11 16:52:59.441 7f6355060700  4 mgr[progress] got KeyError, see http://tracker.ceph.com/issues/38157
2019-04-11 16:52:59.441 7f6355060700  4 mgr[progress] pg_to_state as string: .... (does not include 1.16) ....
2019-04-11 16:52:59.441 7f6355060700  4 mgr[progress] pg_str "'1.16'" 
2019-04-11 16:52:59.441 7f6355060700 -1 mgr notify progress.notify:
2019-04-11 16:52:59.441 7f6355060700 -1 mgr notify Traceback (most recent call last):
  File "/usr/share/ceph/mgr/progress/module.py", line 408, in notify
    self._osdmap_changed(old_osdmap, self._latest_osdmap)
  File "/usr/share/ceph/mgr/progress/module.py", line 390, in _osdmap_changed
    self._osd_out(old_osdmap, old_dump, new_osdmap, osd_id)
  File "/usr/share/ceph/mgr/progress/module.py", line 365, in _osd_out
    ev.pg_update(self.get("pg_dump"), self.log)
  File "/usr/share/ceph/mgr/progress/module.py", line 156, in pg_update
    pg_to_state[pg_str]['stat_sum']['num_bytes_recovered']
KeyError: '1.16'

#14 Updated by Sage Weil 4 months ago

  • Status changed from Verified to Need Review

#15 Updated by Sage Weil 4 months ago

  • Status changed from Need Review to Pending Backport
  • Backport set to nautilus

#16 Updated by Nathan Cutler 4 months ago

  • Copied to Backport #39344: nautilus: progress: KeyError on pg_to_state[pg_str]['stat_sum']['num_bytes_recovered'] added

#17 Updated by Nathan Cutler 3 months ago

  • Status changed from Pending Backport to Resolved

Also available in: Atom PDF