Bug #48819

open

fsck error: found stray (per-pg) omap data on omap_head

Added by Kefu Chai over 3 years ago. Updated almost 2 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

/a/kchai-2021-01-10_13:20:22-rados-master-distro-basic-smithi/


Related issues 1 (0 open, 1 closed)

Related to crimson - Bug #48810: "mount fsck found 1 errors" in crimson-rados-master (Can't reproduce)

Actions #1

Updated by Josh Durgin over 3 years ago

This is happening due to an exception while running ceph-objectstore-tool:

2021-01-10T13:59:28.286 ERROR:tasks.thrashosds.thrasher:exception:
Traceback (most recent call last):
  File "/home/teuthworker/src/github.com_ceph_ceph_master/qa/tasks/ceph_manager.py", line 1069, in do_thrash
    self._do_thrash()
  File "/home/teuthworker/src/github.com_ceph_ceph_master/qa/tasks/ceph_manager.py", line 115, in wrapper
    return func(self)
  File "/home/teuthworker/src/github.com_ceph_ceph_master/qa/tasks/ceph_manager.py", line 1201, in _do_thrash
    self.choose_action()()
  File "/home/teuthworker/src/github.com_ceph_ceph_master/qa/tasks/ceph_manager.py", line 308, in kill_osd
    format(ret=proc.exitstatus))
Exception: ceph-objectstore-tool: exp list-pgs failure with status 1

Since nothing handles the exception, it halts the thrasher thread and leaves the cluster with at least one osd down. Adding exception handling around these ceph-objectstore-tool calls would fix the problem.

As for why ceph-objectstore-tool failed, it may be that the osd was not completely stopped when the tool was run.
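A minimal sketch of the suggested fix, assuming a wrapper around the tool invocation (the names `run_objectstore_tool` and `kill_osd_with_tool` are hypothetical stand-ins; the real call sites are in qa/tasks/ceph_manager.py):

```python
import logging

log = logging.getLogger(__name__)

def run_objectstore_tool(args):
    # Hypothetical stand-in for the real ceph-objectstore-tool wrapper,
    # which raises on a nonzero exit status. Simulated here so the
    # sketch is self-contained.
    raise Exception(
        "ceph-objectstore-tool: {op} failure with status 1".format(op=args[0]))

def kill_osd_with_tool(args):
    # Catch the tool failure locally so do_thrash() keeps running
    # instead of the whole thrasher thread dying with an osd down.
    try:
        run_objectstore_tool(args)
    except Exception:
        log.exception("ceph-objectstore-tool failed, continuing to thrash")
        return False
    return True
```

With this in place, a failed `list-pgs` run would be logged and the thrasher would move on to its next action rather than leaving the cluster degraded.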

Actions #2

Updated by Josh Durgin over 3 years ago

  • Project changed from RADOS to bluestore

The cause appears to be a bluestore issue: manually running the objectstore command with --log-to-stderr shows it failing fsck with stray per-pg omap data:

https://pulpito.ceph.com/yuriw-2021-01-15_19:06:33-rados-wip-yuri8-testing-master-2021-01-15-0935-distro-basic-smithi/5789365/

2021-01-15T23:56:00.067+0000 7f77a73d6dc0  1 bluefs fsck
2021-01-15T23:56:00.067+0000 7f77a73d6dc0  1 bluestore(/var/lib/ceph/osd/ceph-4) _fsck_on_open walking object keyspace
2021-01-15T23:56:00.069+0000 7f77a73d6dc0  1 bluestore(/var/lib/ceph/osd/ceph-4) _fsck_on_open checking shared_blobs
2021-01-15T23:56:00.069+0000 7f77a73d6dc0  1 bluestore(/var/lib/ceph/osd/ceph-4) _fsck_on_open checking pool_statfs
2021-01-15T23:56:00.069+0000 7f77a73d6dc0  1 bluestore(/var/lib/ceph/osd/ceph-4) _fsck_on_open checking for stray omap data 
2021-01-15T23:56:00.069+0000 7f77a73d6dc0 -1 bluestore(/var/lib/ceph/osd/ceph-4) fsck error: found stray (per-pg) omap data on omap_head 16092768843478336515 0 0
2021-01-15T23:56:00.069+0000 7f77a73d6dc0  1 bluestore(/var/lib/ceph/osd/ceph-4) _fsck_on_open checking deferred events
2021-01-15T23:56:00.069+0000 7f77a73d6dc0  1 bluestore(/var/lib/ceph/osd/ceph-4) _fsck_on_open checking freelist vs allocated
2021-01-15T23:56:00.141+0000 7f77a73d6dc0  1 bluestore(/var/lib/ceph/osd/ceph-4) _fsck_on_open <<<FINISH>>> with 1 errors, 0 warnings, 0 repaired, 1 remaining in 0.075985 seconds
2021-01-15T23:56:00.141+0000 7f7769ffb700  0 bluestore(/var/lib/ceph/osd/ceph-4)  allocation stats probe 0: cnt: 0 frags: 0 size: 0
2021-01-15T23:56:00.141+0000 7f7769ffb700  0 bluestore(/var/lib/ceph/osd/ceph-4)  probe -1: 0,  0, 0
2021-01-15T23:56:00.141+0000 7f7769ffb700  0 bluestore(/var/lib/ceph/osd/ceph-4)  probe -2: 0,  0, 0
2021-01-15T23:56:00.141+0000 7f7769ffb700  0 bluestore(/var/lib/ceph/osd/ceph-4)  probe -4: 0,  0, 0
2021-01-15T23:56:00.141+0000 7f7769ffb700  0 bluestore(/var/lib/ceph/osd/ceph-4)  probe -8: 0,  0, 0
2021-01-15T23:56:00.141+0000 7f7769ffb700  0 bluestore(/var/lib/ceph/osd/ceph-4)  probe -16: 0,  0, 0
2021-01-15T23:56:00.141+0000 7f7769ffb700  0 bluestore(/var/lib/ceph/osd/ceph-4) ------------
2021-01-15T23:56:00.141+0000 7f77a73d6dc0  4 rocksdb: [db_impl/db_impl.cc:397] Shutdown: canceling all background work
2021-01-15T23:56:00.141+0000 7f77a73d6dc0  4 rocksdb: [db_impl/db_impl.cc:573] Shutdown complete
2021-01-15T23:56:00.141+0000 7f77a73d6dc0  1 bluefs umount
2021-01-15T23:56:00.141+0000 7f77a73d6dc0  1 bdev(0x55dfbbe73bb0 /var/lib/ceph/osd/ceph-4/block) close
2021-01-15T23:56:00.321+0000 7f77a73d6dc0  1 freelist shutdown
2021-01-15T23:56:00.321+0000 7f77a73d6dc0  1 stupidalloc 0x0x55dfbbe7f7e0 shutdown
2021-01-15T23:56:00.321+0000 7f77a73d6dc0  1 bdev(0x55dfbbe3ba00 /var/lib/ceph/osd/ceph-4/block) close
Mount failed with '(5) Input/output error'
2021-01-15T23:56:00.554+0000 7f77a73d6dc0 -1 bluestore(/var/lib/ceph/osd/ceph-4) _mount fsck found 1 errors
Actions #3

Updated by Neha Ojha over 3 years ago

  • Related to Bug #48810: "mount fsck found 1 errors" in crimson-rados-master added
Actions #4

Updated by Igor Fedotov over 3 years ago

This looks related to my recent PR introducing the per-pg omap naming scheme: https://github.com/ceph/ceph/pull/38651

I'm wondering how I can manually (and locally) run the test case to reproduce the issue.
Or at least, some insight into the test case scenario would be highly appreciated.

Actions #5

Updated by Josh Durgin over 3 years ago

Igor, you can log on to ubuntu@smithi092 right now and check out osd.4.

The test isn't doing anything special: it's thrashing osds (taking them up/down) and testing monitor recovery. It's not doing much I/O other than osdmaps, I would think.

Actions #6

Updated by Josh Durgin over 3 years ago

https://github.com/ceph/ceph/pull/38929: this will at least let us see the error in teuthology logs

Actions #7

Updated by Josh Durgin over 3 years ago

  • Subject changed from verify (valgrind) test times out to fsck error: found stray (per-pg) omap data on omap_head
Actions #8

Updated by Neha Ojha over 3 years ago

  • Priority changed from Normal to Urgent
Actions #9

Updated by Neha Ojha about 3 years ago

  • Priority changed from Urgent to Normal
Actions #10

Updated by Kamoltat (Junior) Sirivadhna almost 2 years ago

/a/yuriw-2022-07-22_03:30:40-rados-wip-yuri3-testing-2022-07-21-1604-distro-default-smithi/6943842/

Actions #11

Updated by Kamoltat (Junior) Sirivadhna almost 2 years ago

/a/yuriw-2022-07-22_03:30:40-rados-wip-yuri3-testing-2022-07-21-1604-distro-default-smithi/6944045/
