Bug #17177

cephfs metadata pool: deep-scrub error "omap_digest != best guess omap_digest"

Added by Goncalo Borges 7 months ago. Updated about 1 month ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
OSD
Target version:
-
Start date:
08/31/2016
Due date:
% Done:

0%

Source:
Community (user)
Tags:
Backport:
jewel
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
rados
Release:
jewel
Needs Doc:
No

Description

From time to time we experience deep-scrub errors in pgs inside our cephfs metadata pool. We do not see any hardware errors on the osds themselves so the error must have some other source. We also never saw this in infernalis.

The error itself is the following:

# cat /var/log/ceph/ceph.log | grep 5.3d0
    2016-08-30 00:30:53.492626 osd.78 192.231.127.171:6828/6072 331 : cluster [INF] 5.3d0 deep-scrub starts
    2016-08-30 00:30:54.276134 osd.78 192.231.127.171:6828/6072 332 : cluster [ERR] 5.3d0 shard 78: soid 5:0bd6d154:::602.00000000:head omap_digest 0xf3fdfd0c != best guess omap_digest 0x23b2eae0 from auth shard 49
    2016-08-30 00:30:54.747795 osd.78 192.231.127.171:6828/6072 333 : cluster [ERR] 5.3d0 deep-scrub 0 missing, 1 inconsistent objects
    2016-08-30 00:30:54.747801 osd.78 192.231.127.171:6828/6072 334 : cluster [ERR] 5.3d0 deep-scrub 1 errors

The result of a pg query for the inconsistent pg is here: https://dl.dropboxusercontent.com/u/2946024/pg_query.debug.log

The results of a manual deep-scrub after increasing the log level to debug_ms 1, debug_filestore 20 and debug_osd 30 are here:
- osd.78 (primary): https://dl.dropboxusercontent.com/u/2946024/ceph-osd.78.debug.log.gz
- osd.59 (peer): https://dl.dropboxusercontent.com/u/2946024/ceph-osd.59.debug.log.gz
- osd.49 (peer): https://dl.dropboxusercontent.com/u/2946024/ceph-osd.49.debug.log.gz

Diving into the previous logs, it seems that the primary is the problematic osd, because the two peers 49 and 59 appear consistent with each other and the pg was only recently migrated to osd.78. The logs of osd.49 and osd.59 suggest that the divergence is in the getattrs for the object:

2016-08-31 00:55:01.444902 7f2dcbaaa700 10 filestore(/var/lib/ceph/osd/ceph-49) stat 5.3d0_head/#5:0bd6d154:::602.00000000:head# = 0 (size 0)
2016-08-31 00:55:01.444909 7f2dcbaaa700 15 filestore(/var/lib/ceph/osd/ceph-49) getattrs 5.3d0_head/#5:0bd6d154:::602.00000000:head#
2016-08-31 00:55:01.444953 7f2dcbaaa700 10 filestore(/var/lib/ceph/osd/ceph-49) -ERANGE, len is 208
2016-08-31 00:55:01.444964 7f2dcbaaa700 10 filestore(/var/lib/ceph/osd/ceph-49) -ERANGE, got 104
2016-08-31 00:55:01.444967 7f2dcbaaa700 20 filestore(/var/lib/ceph/osd/ceph-49) fgetattrs 315 getting '_'
2016-08-31 00:55:01.444974 7f2dcbaaa700 20 filestore(/var/lib/ceph/osd/ceph-49) fgetattrs 315 getting '_parent'
2016-08-31 00:55:01.444980 7f2dcbaaa700 20 filestore(/var/lib/ceph/osd/ceph-49) fgetattrs 315 getting 'snapset'
2016-08-31 00:55:01.444986 7f2dcbaaa700 20 filestore(/var/lib/ceph/osd/ceph-49) fgetattrs 315 getting '_layout'
2016-08-31 00:55:01.444992 7f2dcbaaa700 10 filestore(/var/lib/ceph/osd/ceph-49) getattrs no xattr exists in object_map r = 0
2016-08-31 00:55:01.444994 7f2dcbaaa700 10 filestore(/var/lib/ceph/osd/ceph-49) getattrs 5.3d0_head/#5:0bd6d154:::602.00000000:head# = 0
2016-08-31 00:55:01.444998 7f2dcbaaa700 10 osd.49 pg_epoch: 23099 pg[5.3d0( v 23099'104738 (23099'101639,23099'104738] local-les=22440 n=257 ec=339 les/c/f 22440/22440/0 19928/22439/22439) [78,59,49] r=2 lpr=22439 pi=4173-22438/25 luod=0'0 crt=23099'104736 lcod 23099'104737 active] be_deep_scrub 5:0bd6d154:::602.00000000:head seed 4294967295

I conclude that the replication of the pg did not go as expected. On the mailing list there are other reports of similar situations, which seems to point to a systematic issue. For us it has already happened more than once, which is why I am opening this bug.
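For context on what the scrub is comparing: the omap_digest is a 32-bit checksum computed over the object's omap entries, seeded with 0xffffffff (visible as "seed 4294967295" in the osd.49 log excerpt above). A rough sketch of the idea, using Python's zlib.crc32 in place of Ceph's crc32c and made-up keys, shows why a single missing key flips the digest:

```python
import zlib

def omap_digest(omap, seed=0xFFFFFFFF):
    """Illustrative digest over sorted omap key/value pairs.

    Ceph's deep scrub uses crc32c over the omap header and entries;
    zlib.crc32 stands in here purely for demonstration.
    """
    crc = seed
    for key in sorted(omap):
        crc = zlib.crc32(key.encode(), crc)   # fold in the key
        crc = zlib.crc32(omap[key], crc)      # fold in the value
    return crc & 0xFFFFFFFF

# Two replicas that agree, and one that lost a key (hypothetical data).
replica = {"100004e7d1e_head": b"v1", "100004e7d21_head": b"v2"}
primary = {"100004e7d1e_head": b"v1"}         # one key missing

assert omap_digest(replica) != omap_digest(primary)
```

Identical key/value sets always hash to the same digest, so a mismatch between shards means the omap contents genuinely diverged, not just their on-disk layout.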

osd17_list-omap.txt View (234 KB) sean redmond, 09/06/2016 11:40 AM

osd6_list-omap.txt View (234 KB) sean redmond, 09/06/2016 01:15 PM

osd28_list-omap.txt View (234 KB) sean redmond, 09/06/2016 01:16 PM

07_09_16.zip (198 KB) sean redmond, 09/07/2016 08:44 AM

pg_query.txt View (16.2 KB) sean redmond, 12/05/2016 11:32 AM

osd.2-pg5.1fa-omap.list View (211 KB) sean redmond, 12/05/2016 11:33 AM

osd.2-pg5.1fa.oplog.gz (80.9 KB) sean redmond, 12/05/2016 11:34 AM

osd.8-5.1fa-omap.list View (211 KB) sean redmond, 12/05/2016 11:38 AM

osd.8-5.1fa-op.log.gz (81.3 KB) sean redmond, 12/05/2016 11:38 AM

osd.12-5.1fa-omap.list View (211 KB) sean redmond, 12/05/2016 11:41 AM

osd.12-5.1fa-op.log.gz (82.1 KB) sean redmond, 12/05/2016 11:41 AM


Related issues

Related to Bug #17867: jewel: omap deep scrub error during power cycle run Can't reproduce 11/11/2016
Copied to Backport #18183: jewel: cephfs metadata pool: deep-scrub error "omap_digest != best guess omap_digest" Resolved

History

#1 Updated by Brad Hubbard 7 months ago

  • Assignee set to Brad Hubbard

#2 Updated by Brad Hubbard 7 months ago

  • Category set to OSD
  • Source changed from other to Community (user)
  • ceph-qa-suite rados added
  • ceph-qa-suite deleted (fs)

#3 Updated by Brad Hubbard 7 months ago

Thanks for the detailed report and the data Goncalo.

Please give me a day or two to review the logs and the code.

#4 Updated by sean redmond 7 months ago

I also seem to be hitting an issue that looks a little like this today, and I cannot see any hardware problem to account for it.

The error itself is the following:

root@ceph-osd3:~# grep -Hn 'ERR' /var/log/syslog.1
/var/log/syslog.1:4401:Aug 31 02:42:45 ceph-osd3 ceph-osd: 2016-08-31 02:42:45.802629 7fc468d39700 -1 log_channel(cluster) log [ERR] : 5.62 shard 28: soid 5:46045301:::100008ad802.00000000:head omap_digest 0x9222da74 != best guess omap_digest 0x84fd2685 from auth shard 6
/var/log/syslog.1:4402:Aug 31 02:42:45 ceph-osd3 ceph-osd50754: 2016-08-31 02:42:45.802629 7fc468d39700 -1 log_channel(cluster) log [ERR] : 5.62 shard 28: soid 5:46045301:::100008ad802.00000000:head omap_digest 0x9222da74 != best guess omap_digest 0x84fd2685 from auth shard 6
/var/log/syslog.1:4427:Aug 31 02:43:20 ceph-osd3 ceph-osd: 2016-08-31 02:43:20.808692 7fc468d39700 -1 log_channel(cluster) log [ERR] : 5.62 deep-scrub 0 missing, 1 inconsistent objects
/var/log/syslog.1:4428:Aug 31 02:43:20 ceph-osd3 ceph-osd50754: 2016-08-31 02:43:20.808692 7fc468d39700 -1 log_channel(cluster) log [ERR] : 5.62 deep-scrub 0 missing, 1 inconsistent objects
/var/log/syslog.1:4429:Aug 31 02:43:20 ceph-osd3 ceph-osd50754: 2016-08-31 02:43:20.808698 7fc468d39700 -1 log_channel(cluster) log [ERR] : 5.62 deep-scrub 1 errors
/var/log/syslog.1:4430:Aug 31 02:43:20 ceph-osd3 ceph-osd: 2016-08-31 02:43:20.808698 7fc468d39700 -1 log_channel(cluster) log [ERR] : 5.62 deep-scrub 1 errors
root@ceph-osd3:~#

The result of a pg query for the inconsistent pg is here: http://pastebin.com/N1KbKnaZ

The results of a manual deep-scrub after increasing the log level to debug_ms 1, debug_filestore 20 and debug_osd 30 are here:

osd.17 = https://www.dropbox.com/s/wwsn65ikmm5o2ae/ceph-osd.17.log.gz
osd.28 = https://www.dropbox.com/s/26bloqx053qj8td/ceph-osd.28.log.gz
osd.6 = https://www.dropbox.com/s/029vsmlax1255oi/ceph-osd.6.log.gz

I tried to get data using ceph-objectstore-tool but I was unsure of the syntax; if this is useful I am more than happy to collect it.

ceph-objectstore-tool 100008ad802.00000000 get-omaphdr --pgid 5.62 --data-path /var/lib/ceph/osd/ceph-6 --dry-run

Ceph health detail, xfs_info, version and arch check here: http://pastebin.com/piAyW88w

#5 Updated by sean redmond 7 months ago

I just noticed that 'user.cephos.seq' was missing from osd.17; not sure of the relevance.

http://pastebin.com/LxCiJUnj

#6 Updated by Yuri Weinstein 7 months ago

per IRC discussion with badone

(02:39:18 PM) badone: yuriw, dis: http://tracker.ceph.com/issues/17177 "omap_digest != best guess omap_digest" on jewel opened yesterday FYI
(02:41:14 PM) yuriw: badone - is is something has to to be in for jewel 10.2.3 ?
(02:41:36 PM) yuriw: I don't recall anything similar :(
(02:41:58 PM) badone:  dis | [01:59:40] and cluster [ERR] 1.4 shard 1: soid 1:3ed09add:::607.00000000:head omap_digest 0xbce78084 != best guess
(02:42:00 PM) badone:      | omap_digest 0x3e64643e from auth shard 0" in cluster log                                                          
(02:42:12 PM) badone: yuriw: ^
(02:42:30 PM) badone: from your conversation about your hammer failure
(02:42:35 PM) yuriw: I see, this what we talked about in the am
(02:43:09 PM) badone: seems an odd coincidence
(02:43:50 PM) yuriw: I am rerunning that suite and passing now, hmm ...
(02:44:19 PM) badone: OK, just FYI for now then..
(02:48:44 PM) yuriw: badone - thanks do you think it's the same?
(02:49:35 PM) badone: yuriw: well, it's really just an indication that one pg has a divergent omap so it is rather generic but it seems fairly rare
(02:50:01 PM) badone: a pg repair should fix it I guess
(02:50:13 PM) badone: but one wonders why they are different
(02:50:41 PM) yuriw: I will note on the ticket and have it listed on issues to approve for the release

Run: http://pulpito.front.sepia.ceph.com/yuriw-2016-08-30_15:41:01-powercycle-master-testing-basic-mira/
Job: 392468
Logs: http://qa-proxy.ceph.com/teuthology/yuriw-2016-08-30_15:41:01-powercycle-master-testing-basic-mira/392468/teuthology.log

cluster [ERR] 1.4 shard 1:
    soid 1:3ed09add:::607.00000000:head omap_digest 0xbce78084 != best guess omap_digest
    0x3e64643e from auth shard 0" in cluster log

#7 Updated by Brad Hubbard 7 months ago

So it turns out that a command to get information about inconsistencies was introduced in jewel, and it may prove helpful here.

Could we try the following using the relevant pg?

# rados list-inconsistent-obj [pgid]

So for Goncalo that's pgid = 5.3d0 and for Sean pgid = 5.62 and Yuri pgid = 1.4 (Yuri, I realise your test environment may be gone so next time).
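To go from that JSON straight to the suspect shard(s), something like the following works (a sketch, assuming the per-object "shards" array with an "omap_digest" field per shard that jewel's list-inconsistent-obj emits):

```python
import json
from collections import Counter

def divergent_shards(inconsistent_objs):
    """Map each inconsistent object name to the OSDs whose omap_digest
    disagrees with the majority of shards."""
    result = {}
    for obj in inconsistent_objs:
        digests = Counter(shard["omap_digest"] for shard in obj["shards"])
        majority = digests.most_common(1)[0][0]
        result[obj["object"]["name"]] = [
            shard["osd"] for shard in obj["shards"]
            if shard["omap_digest"] != majority
        ]
    return result

# e.g.: divergent_shards(json.loads(
#     subprocess.check_output(["rados", "list-inconsistent-obj", "5.3d0"])))
```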

#8 Updated by Goncalo Borges 7 months ago

Hi Brad...

Here it is. It doesn't add much to what we already knew in the sense that it confirms that the inconsistency is on the omap_digest and that osd.78 is the one different from all the others.

# rados list-inconsistent-obj 5.3d0 | /usr/bin/json_reformat
[
    {
        "object": {
            "name": "602.00000000",
            "nspace": "",
            "locator": "",
            "snap": "head"
        },
        "missing": false,
        "stat_err": false,
        "read_err": false,
        "data_digest_mismatch": false,
        "omap_digest_mismatch": true,
        "size_mismatch": false,
        "attr_mismatch": false,
        "shards": [
            {
                "osd": 49,
                "missing": false,
                "read_error": false,
                "data_digest_mismatch": false,
                "omap_digest_mismatch": false,
                "size_mismatch": false,
                "data_digest_mismatch_oi": false,
                "omap_digest_mismatch_oi": false,
                "size_mismatch_oi": false,
                "size": 0,
                "omap_digest": "0x546f0914",
                "data_digest": "0xffffffff"
            },
            {
                "osd": 59,
                "missing": false,
                "read_error": false,
                "data_digest_mismatch": false,
                "omap_digest_mismatch": false,
                "size_mismatch": false,
                "data_digest_mismatch_oi": false,
                "omap_digest_mismatch_oi": false,
                "size_mismatch_oi": false,
                "size": 0,
                "omap_digest": "0x546f0914",
                "data_digest": "0xffffffff"
            },
            {
                "osd": 78,
                "missing": false,
                "read_error": false,
                "data_digest_mismatch": false,
                "omap_digest_mismatch": true,
                "size_mismatch": false,
                "data_digest_mismatch_oi": false,
                "omap_digest_mismatch_oi": false,
                "size_mismatch_oi": false,
                "size": 0,
                "omap_digest": "0x1dabd301",
                "data_digest": "0xffffffff"
            }
        ]
    }
]

#9 Updated by Goncalo Borges 7 months ago

To YuriW and Badone:

(02:49:35 PM) badone: yuriw: well, it's really just an indication that one pg has a divergent omap so it is rather generic but it seems fairly rare
(02:50:01 PM) badone: a pg repair should fix it I guess

We should be rather careful: 'pg repair' is not the solution in every case. In my example, the primary osd is the one containing the inconsistent object in its pg. So, unless the 'pg repair' behaviour changed in Jewel (it previously just pushed the primary osd's pg contents to its peer osds), we shouldn't run it. I have actually asked about this quite a lot on the ML, but no one has come forward saying what the true 'pg repair' behaviour is and whether there is a smarter implementation in Jewel.

#10 Updated by Brad Hubbard 7 months ago

In Sean's case we identified that 5:46045301:::100008ad802.00000000:head was referring to the file current/5.62_head/DIR_2/DIR_6/DIR_0/DIR_2/100008ad802.00000000__head_80CA2062__5 on disk and that the extended attribute user.cephos.seq was different on each OSD.

OSD 6
user.cephos.seq=0x010110000000f376480300000000000000000200000000

OSD 17 (Primary)
user.cephos.seq is not present

OSD 28
user.cephos.seq=0x010110000000368e0b0000000000000000000200000000

Given user.cephos.seq is REPLAY_GUARD_XATTR, recorded to avoid replaying an older transaction, it may no longer be relevant once peering has completed and the OSD logs and the object are consistent, but I'm still looking into that. I'm also concerned about how three OSDs can all have different values for this extended attribute.

#11 Updated by Greg Farnum 7 months ago

Differing replay seqs shouldn't be an issue as that's entirely local journal state; not a part of the shared PG state.
Unless there's a bug, of course. Is scrub looking at that attribute? I believe it should be excluded (subject to correction from somebody who's actually worked on those code paths recently, of course).

#12 Updated by Goncalo Borges 7 months ago

I think I can confirm Brad's findings for Sean's case. If I do a getfattr on the objects on the different OSDs, I see that the user.cephos.seq attribute exists on OSD.49 and OSD.59 (albeit with different values) but not on OSD.78.

--- OSD.78: ---

# getfattr -d 602.00000000__head_2A8B6BD0__5
# file: 602.00000000__head_2A8B6BD0__5
    user.ceph._=0sDwjxAAAABAMtAAAAAAAAAAwAAAA2MDIuMDAwMDAwMDD+/////////9BriyoAAAAAAAUAAAAAAAAABgMcAAAABQAAAAAAAAD/////AAAAAAAAAAD//////////wAAAAAmmQEAAAAAADtaAAAlmQEAAAAAADtaAAACAhUAAAACAAAAAAAAAAAq55oGAAAAACUDAAAAAAAAAAAAADKtxldhWKERAgIVAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAJpkBAAAAAAAAAAAAAAAAAAAcAAAAMq3GV5Ul0BH//////////w==
    user.ceph._layout=0sAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA==
    user.ceph._parent=0sBQQ4AAAAAgYAAAAAAAABAAAAAgIaAAAAAAEAAAAAAAAGAAAAc3RyYXkyCCsAAAAAAAAFAAAAAAAAAAAAAAA=
    user.ceph.snapset=0sAgIZAAAAAAAAAAAAAAABAAAAAAAAAAAAAAAAAAAAAA==
    user.cephos.spill_out=0sMAA=

--- OSD.49: ---

# getfattr -d 602.00000000__head_2A8B6BD0__5
# file: 602.00000000__head_2A8B6BD0__5
    user.ceph._=0sDwjxAAAABAMtAAAAAAAAAAwAAAA2MDIuMDAwMDAwMDD+/////////9BriyoAAAAAAAUAAAAAAAAABgMcAAAABQAAAAAAAAD/////AAAAAAAAAAD//////////wAAAAAmmQEAAAAAADtaAAAlmQEAAAAAADtaAAACAhUAAAACAAAAAAAAAAAq55oGAAAAACUDAAAAAAAAAAAAADKtxldhWKERAgIVAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAJpkBAAAAAAAAAAAAAAAAAAAcAAAAMq3GV5Ul0BH//////////w==
    user.ceph._layout=0sAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA==
    user.ceph._parent=0sBQQ4AAAAAgYAAAAAAAABAAAAAgIaAAAAAAEAAAAAAAAGAAAAc3RyYXkyCCsAAAAAAAAFAAAAAAAAAAAAAAA=
    user.ceph.snapset=0sAgIZAAAAAAAAAAAAAAABAAAAAAAAAAAAAAAAAAAAAA==
    user.cephos.seq=0sAQEQAAAAYYKuAAAAAAAAAAAAAgAAAAA=
    user.cephos.spill_out=0sMAA=

--- OSD.59: ---

# getfattr -d 602.00000000__head_2A8B6BD0__5
# file: 602.00000000__head_2A8B6BD0__5
    user.ceph._=0sDwjxAAAABAMtAAAAAAAAAAwAAAA2MDIuMDAwMDAwMDD+/////////9BriyoAAAAAAAUAAAAAAAAABgMcAAAABQAAAAAAAAD/////AAAAAAAAAAD//////////wAAAAAmmQEAAAAAADtaAAAlmQEAAAAAADtaAAACAhUAAAACAAAAAAAAAAAq55oGAAAAACUDAAAAAAAAAAAAADKtxldhWKERAgIVAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAJpkBAAAAAAAAAAAAAAAAAAAcAAAAMq3GV5Ul0BH//////////w==
    user.ceph._layout=0sAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA==
    user.ceph._parent=0sBQQ4AAAAAgYAAAAAAAABAAAAAgIaAAAAAAEAAAAAAAAGAAAAc3RyYXkyCCsAAAAAAAAFAAAAAAAAAAAAAAA=
    user.ceph.snapset=0sAgIZAAAAAAAAAAAAAAABAAAAAAAAAAAAAAAAAAAAAA==
    user.cephos.seq=0sAQEQAAAAOfFyAQAAAAAAAAAAAgAAAAA=
    user.cephos.spill_out=0sMAA=
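As an aside, getfattr prints binary values base64-encoded with a '0s' prefix, so the two user.cephos.seq values above can be compared byte-for-byte after decoding; they are the same length and share a common header, differing only in the embedded sequence bytes:

```python
import base64

def decode_xattr(value):
    """Decode a getfattr-style value; the '0s' prefix marks base64."""
    return base64.b64decode(value[2:]) if value.startswith("0s") else value.encode()

seq_49 = decode_xattr("0sAQEQAAAAYYKuAAAAAAAAAAAAAgAAAAA=")  # from OSD.49
seq_59 = decode_xattr("0sAQEQAAAAOfFyAQAAAAAAAAAAAgAAAAA=")  # from OSD.59

assert len(seq_49) == len(seq_59) == 23  # 23-byte values
assert seq_49 != seq_59                  # different sequence numbers
assert seq_49[:6] == seq_59[:6]          # identical leading header bytes
```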

#13 Updated by Brad Hubbard 7 months ago

Goncalo, could you gather "attr -l" output for the file on each OSD please?

#14 Updated by Goncalo Borges 7 months ago

Here it goes

-- OSD.78: --
# attr -l 602.00000000__head_2A8B6BD0__5
    Attribute "cephos.spill_out" has a 2 byte value for 602.00000000__head_2A8B6BD0__5
    Attribute "ceph._layout" has a 28 byte value for 602.00000000__head_2A8B6BD0__5
    Attribute "ceph._parent" has a 62 byte value for 602.00000000__head_2A8B6BD0__5
    Attribute "ceph._" has a 247 byte value for 602.00000000__head_2A8B6BD0__5
    Attribute "ceph.snapset" has a 31 byte value for 602.00000000__head_2A8B6BD0__5
-- OSD.49: --
# attr -l 602.00000000__head_2A8B6BD0__5
    Attribute "cephos.spill_out" has a 2 byte value for 602.00000000__head_2A8B6BD0__5
    Attribute "ceph._layout" has a 28 byte value for 602.00000000__head_2A8B6BD0__5
    Attribute "ceph._parent" has a 62 byte value for 602.00000000__head_2A8B6BD0__5
    Attribute "cephos.seq" has a 23 byte value for 602.00000000__head_2A8B6BD0__5
    Attribute "ceph._" has a 247 byte value for 602.00000000__head_2A8B6BD0__5
    Attribute "ceph.snapset" has a 31 byte value for 602.00000000__head_2A8B6BD0__5
-- OSD.59: --
# attr -l 602.00000000__head_2A8B6BD0__5
    Attribute "cephos.spill_out" has a 2 byte value for 602.00000000__head_2A8B6BD0__5
    Attribute "ceph._layout" has a 28 byte value for 602.00000000__head_2A8B6BD0__5
    Attribute "ceph._parent" has a 62 byte value for 602.00000000__head_2A8B6BD0__5
    Attribute "cephos.seq" has a 23 byte value for 602.00000000__head_2A8B6BD0__5
    Attribute "ceph._" has a 247 byte value for 602.00000000__head_2A8B6BD0__5
    Attribute "ceph.snapset" has a 31 byte value for 602.00000000__head_2A8B6BD0__5

#15 Updated by Brad Hubbard 7 months ago

I may have sidetracked this ticket by focussing on the ERANGE errors seen in the OP, as I wanted to eliminate them first. So far I have been able to gather very little information about why we are seeing those errors, and I'm pretty confident now that they are not relevant; neither is the presence or absence of "cephos.seq", as Greg has mentioned.

I have looked at the logs for both sets of OSDs and in Sean's case osd.28 contains a key in the omap for C2B08449-DF6C-52EB-E9BFC1CFF6FF2D6F.jpg_head which the other two OSDs do not have.

$ grep --text C2B08449-DF6C-52EB-E9BFC1CFF6FF2D6F.jpg_head ceph*.log
ceph-osd.28.log:2016-08-31 15:31:45.901438 7fd8a13d9700 25 osd.28 pg_epoch: 27188 pg[5.62( v 27176'332497 (27145'329178,27176'332497] local-les=27187 n=7311 ec=121 les/c/f 27187/27187/0 27184/27186/27180) [17,28,6] r=1 lpr=27186 pi=27177-27185/5 luod=0'0 crt=27176'332497 lcod 0'0 active] CRC key C2B08449-DF6C-52EB-E9BFC1CFF6FF2D6F.jpg_head value:

In the case of Goncalo's logs there appear to be around 500 fewer keys in the osd.78 omap than in the others.

$ diff 49.out 59.out
$

$ diff 49.out 78.out|wc
513 1017 9711

$ diff 49.out 78.out|tail -5
< 100004e7d0e_head
< 100004e7d18_head
< 100004e7d19_head
< 100004e7d1e_head
< 100004e7d21_head

$ grep --text 100004e7d21_head ceph*.log
ceph-osd.49.debug.log:2016-08-31 00:55:02.082312 7f2dcbaaa700 25 osd.49 pg_epoch: 23099 pg[5.3d0( v 23099'104738 (23099'101639,23099'104738] local-les=22440 n=257 ec=339 les/c/f 22440/22440/0 19928/22439/22439) [78,59,49] r=2 lpr=22439 pi=4173-22438/25 luod=0'0 crt=23099'104736 lcod 23099'104737 active] CRC key 100004e7d21_head value:
ceph-osd.59.debug.log:2016-08-31 00:55:01.853879 7f335510b700 25 osd.59 pg_epoch: 23099 pg[5.3d0( v 23099'104738 (23099'101639,23099'104738] local-les=22440 n=257 ec=339 les/c/f 22440/22440/0 19928/22439/22439) [78,59,49] r=1 lpr=22439 pi=19928-22438/1 luod=0'0 crt=23099'104736 lcod 23099'104737 active] CRC key 100004e7d21_head value:

$ grep --text 100004e7d1e_head ceph*.log
ceph-osd.49.debug.log:2016-08-31 00:55:02.082163 7f2dcbaaa700 25 osd.49 pg_epoch: 23099 pg[5.3d0( v 23099'104738 (23099'101639,23099'104738] local-les=22440 n=257 ec=339 les/c/f 22440/22440/0 19928/22439/22439) [78,59,49] r=2 lpr=22439 pi=4173-22438/25 luod=0'0 crt=23099'104736 lcod 23099'104737 active] CRC key 100004e7d1e_head value:
ceph-osd.59.debug.log:2016-08-31 00:55:01.853769 7f335510b700 25 osd.59 pg_epoch: 23099 pg[5.3d0( v 23099'104738 (23099'101639,23099'104738] local-les=22440 n=257 ec=339 les/c/f 22440/22440/0 19928/22439/22439) [78,59,49] r=1 lpr=22439 pi=19928-22438/1 luod=0'0 crt=23099'104736 lcod 23099'104737 active] CRC key 100004e7d1e_head value:

Goncalo, in terms of the way the repair works, I believe the copying is now done from what is considered the authoritative shard, which in your case would be osd.49, to the "bad peer", which in your case I believe would be the primary, osd.78.

In Sean's case osd.28 is considered to be the "bad peer" and osd.6 authoritative, so osd.28's copy would be overwritten by the copy on osd.6 (neither of which is currently primary).

https://github.com/ceph/ceph/blob/jewel/src/osd/PG.cc#L4422

https://github.com/ceph/ceph/blob/jewel/src/osd/PG.cc#L3831

Still looking into this.

#16 Updated by Goncalo Borges 7 months ago

Oohhh.... I see what you are doing. Something like:

$ for osd in 49 59 78; do grep --text 5.3d0 ceph-osd.${osd}.debug.log | grep --text "CRC key" | awk '{print $(NF-1)}' | sort > ceph-osd.${osd}.debug.keys; done

$ ll *.debug.keys
-rw-r--r-- 1 goncalo goncalo 118218 Sep 1 15:05 ceph-osd.49.debug.keys
-rw-r--r-- 1 goncalo goncalo 118218 Sep 1 15:05 ceph-osd.59.debug.keys
-rw-r--r-- 1 goncalo goncalo 109650 Sep 1 15:05 ceph-osd.78.debug.keys

$ diff ceph-osd.49.debug.keys ceph-osd.78.debug.keys | wc -l
513

How 513 keys were lost in the replication process is still a mystery.

#17 Updated by Brad Hubbard 7 months ago

Yes, that's it.

For a demonstration of how ceph-objectstore-tool works, here's some testing I've done on a jewel development cluster (so the paths are a little different).

Set noout and stop osd.0.

$ ls dev/osd0/
ceph_fsid current fsid journal keyring magic ready store_version superblock type whoami

$ ./ceph-objectstore-tool --op list --pgid 2.5 --data-path dev/osd0/
["2.5",{"oid":"604.00000000","key":"","snapid":-2,"hash":1139301525,"max":0,"pool":2,"namespace":"","max":0}]
["2.5",{"oid":"mds_snaptable","key":"","snapid":-2,"hash":3640815789,"max":0,"pool":2,"namespace":"","max":0}]
["2.5",{"oid":"605.00000000","key":"","snapid":-2,"hash":3502345757,"max":0,"pool":2,"namespace":"","max":0}]

$ ./ceph-objectstore-tool --op log --pgid 2.5 --data-path dev/osd0/ | head -15
{
    "pg_log_t": {
        "head": "12'5",
        "tail": "0'0",
        "log": [
            {
                "op": "modify ",
                "object": "2:a93a17c2:::604.00000000:head",
                "version": "12'1",
                "prior_version": "0'0",
                "reqid": "mds.0.6:13",
                "extra_reqids": [],
                "mtime": "2016-08-31 23:20:39.036112",
                "mod_desc": {
                    "object_mod_desc": {

$ ./ceph-objectstore-tool --data-path dev/osd0/ benchmark_data_rex001.example.com_22492_object55 list-omap
$ ./ceph-objectstore-tool --data-path dev/osd0/ benchmark_data_rex001.example.com_22492_object55 dump

Obviously you'll use the relevant pgid, path to the osd store, and object id (which you'll get from the "--op list" command), and the journal if you need to specify it. I'd like to see the output from all of the OSDs involved; hopefully the logs go back far enough that we can get a better idea of what happened and when.

#18 Updated by David Zafman 7 months ago

I think we should add a cephfs test with thrasher to see if we get an omap inconsistency.

UPDATE: I found suites/fs/thrash, but wonder if it needs improvement.

#19 Updated by Brad Hubbard 7 months ago

Goncalo and Sean,

Could you provide the complete history of these clusters, including all versions they have run in the past, and when they were last upgraded and from what version? We can work on a thrash test for CephFS to see if we can reproduce this, but if it's due to some sort of upgrade issue it may not reproduce easily.

Could you also let us know how these pools are configured; is there any sort of tiering set up?

#20 Updated by Goncalo Borges 7 months ago

Hi Brad. All...

Regarding the history of this cluster:
- It was originally configured with infernalis 9.2.0. According to 'ceph osd dump' that was in 2015-12-01.
- Then, on the 27th of June 2016, we upgraded directly to Jewel 10.2.2.
- The configuration of the pools follows below. There is no tier setup, just standard/default pools.

# ceph osd dump | head -n 10
    epoch 23099
    fsid a9431bc6-3ee1-4b0a-8d21-0ad883a4d2ed
    created 2015-12-01 01:02:58.332319
    modified 2016-08-29 03:58:37.941700
    flags sortbitwise
    pool 4 'coepp_ceph_objectstore' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 1024 pgp_num 1024 last_change 337 flags hashpspool stripe_width 0
    pool 5 'coepp_cephfs_metadata' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 1024 pgp_num 1024 last_change 339 flags hashpspool stripe_width 0
    pool 6 'coepp_cephfs_data' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 2048 pgp_num 2048 last_change 9121 flags hashpspool crash_replay_interval 45 stripe_width 0

#21 Updated by Goncalo Borges 7 months ago

Brad...

Regarding your ceph-objectstore-tool log request, I would like to ask whether it is ok to recover the pg now, or if you really want the logs with the pg still in the inconsistent state. Just asking because I would feel more comfortable stopping the osds with the pg already recovered (maybe it doesn't make a difference, but I am just following my gut feeling).

Moreover, I am having the same issues I faced when I initially tried to use ceph-objectstore-tool. Please note that, for now, I am just doing some tests on an osd which was included in the same cluster some time ago but which we removed in the meantime. However, we still have its data and journal available.

- This is the version I have in my osds:

# rpm -qf /usr/bin/ceph-objectstore-tool
    ceph-osd-10.2.2-0.el7.x86_64

- This works:

# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-70 --journal-path /var/lib/ceph/osd/ceph-70/journal --op list --pgid 5.2e9 | head -n 1
    SG_IO: bad/missing sense data, sb[]: 70 00 05 00 00 00 00 0d 00 00 00 00 20 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    SG_IO: bad/missing sense data, sb[]: 70 00 05 00 00 00 00 0d 00 00 00 00 20 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    ["5.2e9",{"oid":"10000427299.00000000","key":"","snapid":-2,"hash":1949303529,"max":0,"pool":5,"namespace":"","max":0}]

- This doesn't:

# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-70 --journal-path /var/lib/ceph/osd/ceph-70/journal -pgid 5.2e9 10000427299.00000000 list-omap
    SG_IO: bad/missing sense data, sb[]: 70 00 05 00 00 00 00 0d 00 00 00 00 20 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    SG_IO: bad/missing sense data, sb[]: 70 00 05 00 00 00 00 0d 00 00 00 00 20 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    Without --pgid the object '5.2e9' must be a JSON array
# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-70 --journal-path /var/lib/ceph/osd/ceph-70/journal --pgid 5.2e9 10000427299.00000000 list-omap
    SG_IO: bad/missing sense data, sb[]: 70 00 05 00 00 00 00 0d 00 00 00 00 20 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    SG_IO: bad/missing sense data, sb[]: 70 00 05 00 00 00 00 0d 00 00 00 00 20 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    Decode object JSON error: value type is 5 not 0

I even tried to give the object as a json array using different formats like:

{oid:10000427299.00000000}
[{oid:10000427299.00000000}]
[oid:10000427299.00000000]
["oid":"10000427299.00000000"]
[{"oid":"10000427299.00000000"}]
["5.2e9",{"oid":"10000427299.00000000"}]

That actually makes the tool progress a bit further, but it always exits with "No object id '[{oid:10000427299.00000000}]' found or invalid JSON specified":

# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-70 --journal-path /var/lib/ceph/osd/ceph-70/journal [{oid:10000427299.00000000}] list-omap
    SG_IO: bad/missing sense data, sb[]: 70 00 05 00 00 00 00 0d 00 00 00 00 20 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    SG_IO: bad/missing sense data, sb[]: 70 00 05 00 00 00 00 0d 00 00 00 00 20 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    Error getting attr on : 4.124_head,#-6:24800000:::scrub_4.124:head#, (61) No data available
    Error getting attr on : 4.2c7_head,#-6:e3400000:::scrub_4.2c7:head#, (61) No data available
    Error getting attr on : 4.f4_head,#-6:2f000000:::scrub_4.f4:head#, (61) No data available
    Error getting attr on : 4.236_head,#-6:6c400000:::scrub_4.236:head#, (61) No data available
    Error getting attr on : 6.792_head,#-8:49e00000:::scrub_6.792:head#, (61) No data available
    Error getting attr on : 6.37b_head,#-8:dec00000:::scrub_6.37b:head#, (61) No data available
    Error getting attr on : 6.323_head,#-8:c4c00000:::scrub_6.323:head#, (61) No data available
    Error getting attr on : 6.57f_head,#-8:fea00000:::scrub_6.57f:head#, (61) No data available
    Error getting attr on : 6.6c_head,#-8:36000000:::scrub_6.6c:head#, (61) No data available
    Error getting attr on : 5.232_head,#-7:4c400000:::scrub_5.232:head#, (61) No data available
    Error getting attr on : 4.268_head,#-6:16400000:::scrub_4.268:head#, (61) No data available
    Error getting attr on : 6.223_head,#-8:c4400000:::scrub_6.223:head#, (61) No data available
    Error getting attr on : 4.356_head,#-6:6ac00000:::scrub_4.356:head#, (61) No data available
    Error getting attr on : 6.cd_head,#-8:b3000000:::scrub_6.cd:head#, (61) No data available
    Error getting attr on : 5.387_head,#-7:e1c00000:::scrub_5.387:head#, (61) No data available
    Error getting attr on : 6.7a7_head,#-8:e5e00000:::scrub_6.7a7:head#, (61) No data available
    Error getting attr on : 6.3e4_head,#-8:27c00000:::scrub_6.3e4:head#, (61) No data available
    Error getting attr on : 6.634_head,#-8:2c600000:::scrub_6.634:head#, (61) No data available
    Error getting attr on : 6.7d7_head,#-8:ebe00000:::scrub_6.7d7:head#, (61) No data available
    Error getting attr on : 6.41e_head,#-8:78200000:::scrub_6.41e:head#, (61) No data available
    Error getting attr on : 4.275_head,#-6:ae400000:::scrub_4.275:head#, (61) No data available
    Error getting attr on : 4.65_head,#-6:a6000000:::scrub_4.65:head#, (61) No data available
    Error getting attr on : 6.794_head,#-8:29e00000:::scrub_6.794:head#, (61) No data available
    Error getting attr on : 5.1bc_head,#-7:3d800000:::scrub_5.1bc:head#, (61) No data available
    Error getting attr on : 6.4db_head,#-8:db200000:::scrub_6.4db:head#, (61) No data available
    Error getting attr on : 4.375_head,#-6:aec00000:::scrub_4.375:head#, (61) No data available
    Error getting attr on : 6.42f_head,#-8:f4200000:::scrub_6.42f:head#, (61) No data available
    Error getting attr on : 4.223_head,#-6:c4400000:::scrub_4.223:head#, (61) No data available
    Error getting attr on : 6.42c_head,#-8:34200000:::scrub_6.42c:head#, (61) No data available
    Error getting attr on : 6.f2_head,#-8:4f000000:::scrub_6.f2:head#, (61) No data available
    Error getting attr on : 6.5f0_head,#-8:0fa00000:::scrub_6.5f0:head#, (61) No data available
    Error getting attr on : 5.23d_head,#-7:bc400000:::scrub_5.23d:head#, (61) No data available
    Error getting attr on : 6.3c7_head,#-8:e3c00000:::scrub_6.3c7:head#, (61) No data available
    Error getting attr on : 6.575_head,#-8:aea00000:::scrub_6.575:head#, (61) No data available
    Error getting attr on : 6.9b_head,#-8:d9000000:::scrub_6.9b:head#, (61) No data available
    Error getting attr on : 5.221_head,#-7:84400000:::scrub_5.221:head#, (61) No data available
    Error getting attr on : 4.254_head,#-6:2a400000:::scrub_4.254:head#, (61) No data available
    No object id '[{oid:10000427299.00000000}]' found or invalid JSON specified

Must be doing something wrong here. Not much documentation on the tool to help me though...

#22 Updated by Brad Hubbard 7 months ago

Just a quick update about the SCSI errors, more on the object store tool later today I hope.

$ sg_decode_sense 70 00 05 00 00 00 00 0d 00 00 00 00 20 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Fixed format, current; Sense key: Illegal Request
Additional sense: Invalid command operation code

Do you understand exactly why you are getting these errors and the implications? Generally this can indicate a bad driver, fw, hw, or combination of the above.

Do you see these errors on all of the OSDs?

#23 Updated by Goncalo Borges 7 months ago

I do not think they are relevant. They appear if you, for example, run 'hdparm -N /dev/sdX'. That message is there for all hosts in our ceph cluster, for our non-ceph storage servers, and even on my Fedora desktop.

    # hdparm -N /dev/sdb
    /dev/sdb:
    SG_IO: bad/missing sense data, sb[]: 70 00 05 00 00 00 00 0a 00 00 00 00 24 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    SG_IO: bad/missing sense data, sb[]: 70 00 05 00 00 00 00 0a 00 00 00 00 24 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    max sectors = 0/1, HPA is enabled

#24 Updated by Brad Hubbard 7 months ago

That error is different and I would be looking at both of them if I were you.

$ sg_decode_sense -v 70 00 05 00 00 00 00 0d 00 00 00 00 20 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Fixed format, current; Sense key: Illegal Request
Additional sense: Invalid command operation code
Raw sense data (in hex):
70 00 05 00 00 00 00 0d 00 00 00 00 20 00 00 00
00 00 00 00 00

$ sg_decode_sense -v 70 00 05 00 00 00 00 0a 00 00 00 00 24 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Fixed format, current; Sense key: Illegal Request
Additional sense: Invalid field in cdb
Raw sense data (in hex):
70 00 05 00 00 00 00 0a 00 00 00 00 24 00 00 00
00 00

I'd be asking the question why a userspace application like ceph-objectstore-tool, which is merely reading files from the disk, would trigger such messages from the hardware. I've seen this with system utilities before, such as parted invoked by ceph-disk/ceph-deploy, or hdparm as you've demonstrated, but this is not a system disk utility we are talking about. I would want to know what was behind those, and would be paying careful attention to the driver/controller combination and any upstream work that may have been done to rectify the situation for said controller.

Anyway, this troubles me but we have no evidence that it is related as you mention so I leave it as an aside atm but I suggest it should be investigated.

#25 Updated by Goncalo Borges 7 months ago

Hi Brad...

We get the EXACT same message even on (ceph and non-ceph) servers long in production, both on enclosures and system disks managed by different controllers (although both DELL/LSI ones). In the machine where we have the problematic osd, we do have the latest firmware on controllers and disks. I updated everything before joining the machine to the cluster. At this point I think this is something unrelated.

#26 Updated by sean redmond 7 months ago

In my case the cluster has always been running 10.2.2; it only started its life around 1 month ago.

Interestingly these disks also sit behind a DELL / LSI card in non-RAID mode.

#27 Updated by Brad Hubbard 7 months ago

Oh, wow, that is interesting.

Could I get both of you to provide the specific controller and firmware version?

It may also be worth contacting the vendor to see if they are aware of any issues with hw/fw/sw combinations running on these machines.

#28 Updated by sean redmond 7 months ago

The details here are:

PERC H730 Mini (Firmware Version 25.3.0.0016)
All disks are intel SSDSC2BA80 (Revision G2010140)

I have checked and these too are running the latest firmware from the vendor. I will drop them an email to see if they are aware of any related known issues.

#29 Updated by Goncalo Borges 7 months ago

Hi...

In our case, our storage servers have:
- Perc H710 Mini (Firmware Version: 21.3.2-0005) controlling 4 INTEL SSDSC2BB120G6 SSDs working as journals.
- Perc H810 Adapter (Firmware Version: 21.3.2-0005, Driver Version: 06.807.10.00-rh1) managing two enclosures with 8 active OSDs running on SAS disks.
Everything is on the latest firmware versions for both controllers and disks.

I do not want to narrow down any hypothesis but, for me, it doesn't feel like the problem we are seeing is hardware related. These servers worked for a couple of years in an intensively used production infrastructure of ours (Tier-2 for ATLAS WLCG), and once their warranty expired, we migrated them to the ceph cluster. Also, if it were a hardware problem, I do not understand why it only manifests itself in the cephfs metadata pool. We never had degraded pgs without corresponding I/O problems in the data pool, for example (at least, until now).

#30 Updated by Brad Hubbard 7 months ago

All talk of other things aside, I think our best opportunity to RCA this is to gather "ceph-objectstore-tool --op log" data from the relevant pg and try to find log entries relevant to the divergent object that may help us to establish a timeline and allow us to focus on the OSDs' logs from the period of time when the replicas diverged.

In regards to the issue with the ceph-objectstore-tool command for list-omap, dump, etc., this is probably my fault for giving a misleading example. The key is in the help for that command.

$ bin/ceph-objectstore-tool --help 2>&1|gawk '/list-omap/||/displayed/||/--op list/'
ceph-objectstore-tool ... <object> list-omap
<object> can be a JSON object description as displayed
by --op list.

So, in the examples above the correct commands should be.

$ ./ceph-objectstore-tool --data-path dev/osd0/ '["2.5",{"oid":"605.00000000","key":"","snapid":-2,"hash":3502345757,"max":0,"pool":2,"namespace":"","max":0}]' list-omap

$ ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-70 --journal-path /var/lib/ceph/osd/ceph-70/journal '["5.2e9",{"oid":"10000427299.00000000","key":"","snapid":-2,"hash":1949303529,"max":0,"pool":5,"namespace":"","max":0}]' list-omap

Note that not all objects will have omap entries but the objects we are interested in should.

At this stage I'm most interested in "--op list" output but I'm trying to allow us to gather as much data as possible.

#31 Updated by Brad Hubbard 7 months ago

I've also run a fs thrash test in an attempt to reproduce this but as yet no luck.

#32 Updated by sean redmond 7 months ago

Hi,

Interestingly the 'network' issue that I think caused the problem to start with happened again at the same time just a week later....

I had a look in cron.weekly with the help of some atopd logs; I found that fstrim gets called, then the OSDs start to flap. This looks to be because the Dell/LSI perc card does not allow the discard command to pass through it.

I have disabled fstrim from running, as the DC3710 drives seem to have garbage collection in the firmware that is already doing the needful. This should mean the 'cause' should not repeat; I will report back on that next week.

This has caused a further PG to report a problem, and the log here (http://pastebin.com/LFNe6WnR) shows it's in the omap_digest again.

A query of that PG is here: http://pastebin.com/t35nqRhi

I tried to gather the data you are looking for. If you need more or I am going wrong, just let me know:

ceph-objectstore-tool --op list --pgid 5.62 --data-path /var/lib/ceph/osd/ceph-17 | grep 100008ad802
["5.62",{"oid":"100008ad802.00000000","key":"","snapid":-2,"hash":2160730210,"max":0,"pool":5,"namespace":"","max":0}]
root@ceph-osd3:~#

=========================

root@ceph-osd3:~# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-17 '["5.62",{"oid":"100008ad802.00000000","key":"","snapid":-2,"hash":2160730210,"max":0,"pool":5,"namespace":"","max":0}]' list-omap | less > osd17_list-omap.txt

The file is enclosed.

#33 Updated by sean redmond 7 months ago

Hi,

Not sure if you need the list-omap from the other OSDs in scope of that PG, but here it is just in case:

root@ceph-osd4:~# ceph-objectstore-tool --op list --pgid 5.62 --data-path /var/lib/ceph/osd/ceph-28 | grep 100008ad802
sh: 1: /sbin/hdparm: not found
sh: 1: /sbin/hdparm: not found
["5.62",{"oid":"100008ad802.00000000","key":"","snapid":-2,"hash":2160730210,"max":0,"pool":5,"namespace":"","max":0}]
root@ceph-osd4:~#

root@ceph-osd2:~# ceph-objectstore-tool --op list --pgid 5.62 --data-path /var/lib/ceph/osd/ceph-6 | grep 100008ad802
sh: 1: /sbin/hdparm: not found
sh: 1: /sbin/hdparm: not found
["5.62",{"oid":"100008ad802.00000000","key":"","snapid":-2,"hash":2160730210,"max":0,"pool":5,"namespace":"","max":0}]
root@ceph-osd2:~#

#34 Updated by sean redmond 7 months ago

The below shows a quick diff:

[root@korn ceph]# ll
total 732
-rw-r--r--. 1 root root 239490 Sep 6 14:18 osd17_list-omap.txt
-rw-r--r--. 1 root root 239490 Sep 6 14:18 osd28_list-omap.txt
-rw-r--r--. 1 root root 239445 Sep 6 14:18 osd6_list-omap.txt
[root@korn ceph]# diff osd28_list-omap.txt osd6_list-omap.txt
144d143
< C2B08449-DF6C-52EB-E9BFC1CFF6FF2D6F.jpg_head
[root@korn ceph]#

#35 Updated by sean redmond 7 months ago

I also collected the data for PG 5.15b

root@ceph-osd3:~# ceph-objectstore-tool --op list --pgid 5.15b --data-path /var/lib/ceph/osd/ceph-15 | grep 10002639d84
sh: 1: /sbin/hdparm: not found
sh: 1: /sbin/hdparm: not found
Error getting attr on : 5.15b_head,#-7:da800000:::scrub_5.15b:head#, (61) No data available
["5.15b",{"oid":"10002639d84.00000000","key":"","snapid":-2,"hash":251335003,"max":0,"pool":5,"namespace":"","max":0}]
root@ceph-osd3:~#

root@ceph-osd1:~# ceph-objectstore-tool --op list --pgid 5.15b --data-path /var/lib/ceph/osd/ceph-1 | grep 10002639d84
sh: 1: /sbin/hdparm: not found
sh: 1: /sbin/hdparm: not found
["5.15b",{"oid":"10002639d84.00000000","key":"","snapid":-2,"hash":251335003,"max":0,"pool":5,"namespace":"","max":0}]
root@ceph-osd1:~#

root@ceph-osd5:~# ceph-objectstore-tool --op list --pgid 5.15b --data-path /var/lib/ceph/osd/ceph-34 | grep 10002639d84
sh: 1: /sbin/hdparm: not found
sh: 1: /sbin/hdparm: not found
["5.15b",{"oid":"10002639d84.00000000","key":"","snapid":-2,"hash":251335003,"max":0,"pool":5,"namespace":"","max":0}]
root@ceph-osd5:~#

--The below shows a diff--

[root@korn ceph2]# ll
total 756
-rw-r--r--. 1 root root 248715 Sep 6 14:46 osd15_list-omap.txt
-rw-r--r--. 1 root root 248760 Sep 6 14:46 osd1_list-omap.txt
-rw-r--r--. 1 root root 248715 Sep 6 14:46 osd34_list-omap.txt
[root@korn ceph2]# diff osd15_list-omap.txt osd1_list-omap.txt
192a193
> C1A09B32-C098-30A4-A08D15352B8BFB97.jpg_head
[root@korn ceph2]#

#36 Updated by Goncalo Borges 7 months ago

Hi Brad, Guys...

The outputs of ceph-objecstore-tool for each osd involved in the inconsistent pg are available in the same dropbox place (to keep all logs in the same location):

https://dl.dropboxusercontent.com/u/2946024/ceph-objectstore-tool_osd.49.tgz
https://dl.dropboxusercontent.com/u/2946024/ceph-objectstore-tool_osd.59.tgz
https://dl.dropboxusercontent.com/u/2946024/ceph-objectstore-tool_osd.78.tgz

Each of the tars includes the following files (I think their names are self explanatory):

$ tar tvf ceph-objectstore-tool_osd.49.tgz
-rw-r--r-- goncalo/goncalo 28135 2016-09-07 10:37 ceph-objectstore-tool_osd.49_pgid5.3d0_list-omap.txt
-rw-r--r-- goncalo/goncalo 1178 2016-09-07 10:37 ceph-objectstore-tool_osd.49_pgid5.3d0_oid-602.00000000-dump.txt
-rw-r--r-- goncalo/goncalo 32072 2016-09-07 10:37 ceph-objectstore-tool_osd.49_pgid5.3d0_op-list.txt
-rw-r--r-- goncalo/goncalo 1859157 2016-09-07 10:37 ceph-objectstore-tool_osd.49_pgid5.3d0_op-log.txt

(btw, the tricky part of the ceph-objectstore-tool command is that it needs single quotes around the JSON array representing the object. I had tried to include the full JSON array without the quotes before, which fails).

Now a couple of side comments:
- Only the ceph-objectstore-tool_osd.X_pgid5.3d0_list-omap.txt is different between X=78 and X=49,59. All the other files have exactly the same content.
- My situation is different from Sean's since my OSDs do not run on SSDs. We use SSDs for journals, but up to now we have not used fstrim. I have no hints that there are hardware problems, but I will keep looking.
- 'rados list-inconsistent-obj 5.3d0' actually did not return anything when I ran it this time. I had to trigger a deep-scrub on the pg for that command to return some valuable output again.

Cheers
Goncalo

#37 Updated by Goncalo Borges 7 months ago

Just an additional comment on the 'pg repair' behaviour.

I confirm what Brad reported in
http://tracker.ceph.com/issues/17177#note-15

Running a 'pg repair' results in an updated object on osd.78 (which is the primary osd). The timestamps of the same objects on the peers remain unchanged.

    # ll /var/lib/ceph/osd/ceph-78/current/5.3d0_head/602.00000000__head_2A8B6BD0__5
    -rw-r--r--. 1 ceph ceph 0 Sep 7 02:04 /var/lib/ceph/osd/ceph-78/current/5.3d0_head/602.00000000__head_2A8B6BD0__5

I also collected the object's list-omap after recovery and checked that all the previously missing keys are now present. The following command doesn't return any output:

$ for entry in `diff ceph-objectstore-tool_osd.49_pgid5.3d0_list-omap.txt ceph-objectstore-tool_osd.78_pgid5.3d0_list-omap.txt | grep "^<" | cut -f2 -d' '`; do exists=`grep $entry ceph-objectstore-tool_osd.78_pgid5.3d0_list-omap.txt.repaired`; if [ "X$exists" == "X" ]; then echo "$entry does not exist in repaired osd.78"; fi; done
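For what it's worth, the same check can be written as a small script (a sketch; the filenames below are the ones from this thread, and each dump is assumed to hold one omap key per line as produced by "list-omap"):

```python
# Compare omap key listings dumped by
# "ceph-objectstore-tool ... '<object JSON>' list-omap > osdN_list-omap.txt".

def load_keys(path):
    """Return the set of non-empty lines (omap keys) in a dump file."""
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

def missing_keys(reference, repaired):
    """Keys present in the reference listing but absent from the other."""
    return sorted(load_keys(reference) - load_keys(repaired))

# e.g. missing_keys("ceph-objectstore-tool_osd.49_pgid5.3d0_list-omap.txt",
#                   "ceph-objectstore-tool_osd.78_pgid5.3d0_list-omap.txt.repaired")
```

An empty result, as in the shell loop above, means the repaired copy is no longer missing any keys.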

Cheers
Goncalo

#38 Updated by Brad Hubbard 7 months ago

Goncalo,

The earliest entry in the log related to the divergent object is this.

{     
"op": "modify ",
"object": "5:0bd6d154:::602.00000000:head",
"version": "23099'104711",
"prior_version": "23099'102844",
"reqid": "mds.0.805:109278642",
"extra_reqids": [],
"mtime": "2016-08-29 20:26:20.500693",
"mod_desc": {
"object_mod_desc": {
"can_local_rollback": false,
"rollback_info_completed": false,
"ops": []
}
}
},

All log views are the same in relation to this operation (and all subsequent operations), so it is possible the log does not go back far enough (this was only a few hours before the scrub error, and the earliest entry in the logs is 2016-08-29 20:14:48.037782). It is also possible that all OSDs did commit the operation in question (whichever operation that was) and that the data divergence happened due to some other factor or factors. I'm afraid this gets us no closer to establishing a timeline; for now, all we know is that this object became inconsistent some time between the last deep-scrub that passed and the first deep-scrub that identified the discrepancy. If you can find the logs for this period (between the last time this pg passed a deep scrub and the first time it failed) I can take a look, but it's unlikely that they will show us anything at the default log level. I'd suggest being vigilant about this and grabbing syslog, dmesg and the osd log of the divergent copy as soon as possible after this happens next time.

If you can provide the logs I mentioned above I will take a look though.

Given this is possibly related to a journal being truncated, and given you are using Intel SSDs (same as Sean) for this purpose, it's possible the drives may be the issue. Note that I am not accusing the drives but just trying to cover all bases and apply focus where I believe it is warranted.

The mystery of why the SG_IO errors are being triggered by ceph-objectstore-tool is solved by the error messages Sean is getting, "sh: 1: /sbin/hdparm: not found". It turns out that deep in the filestore code we do call hdparm, which I was unaware of; my bad.

https://github.com/badone/ceph/blob/master/src/os/filestore/FileJournal.cc#L187-L193

Glad to hear the object repaired without issue and as expected.

#39 Updated by Goncalo Borges 7 months ago

Hi Brad...

Thanks for all the help. Unfortunately, I do not have further logs. Based on my experience, I do feel this will reappear in a couple of weeks time. Next time we will be faster in determining and collecting all relevant logs.

I do not exclude that this is journal related. The SSD journals are the only new component introduced in our well-exercised (but commodity) setup. Do you have any suggestions on how I can check / tune the journals and avoid truncation? Maybe a more frequent flush of the journals? The default settings are currently:

    # ceph daemon osd.78 config show | egrep "(filestore)" | grep sync_interval
    "filestore_max_sync_interval": "5",
    "filestore_min_sync_interval": "0.01",

I did find something funny in ceph-osd.admin.log on all osds. For the host where osd.78 is, the log rotated on 20160825 and has remained empty since. But the previous file has entries such as:

    # gunzip -c ceph-osd.admin.log-20160825.gz | grep sde3
    2016-08-24 06:14:26.032337 7eff29172800 -1 bluestore(/dev/sde3) _read_bdev_label unable to decode label at offset 66: buffer::malformed_input: void bluestore_bdev_label_t::decode(ceph::buffer::list::iterator&) decode past end of struct encoding
    2016-08-24 06:14:26.034716 7eff29172800 1 journal _open /dev/sde3 fd 4: 20971520000 bytes, block size 4096 bytes, directio = 0, aio = 0
    2016-08-24 06:14:26.034752 7eff29172800 1 journal close /dev/sde3
    2016-08-24 06:14:26.034783 7eff29172800 0 probe_block_device_fsid /dev/sde3 is filestore, 6d2c0635-8d7e-4ef4-84f3-9e2efdbf624b
    2016-08-24 23:56:03.394090 7fd5605a9800 -1 bluestore(/dev/sde3) _read_bdev_label unable to decode label at offset 66: buffer::malformed_input: void bluestore_bdev_label_t::decode(ceph::buffer::list::iterator&) decode past end of struct encoding
    2016-08-24 23:56:03.398251 7fd5605a9800 1 journal _open /dev/sde3 fd 4: 20971520000 bytes, block size 4096 bytes, directio = 0, aio = 0
    2016-08-24 23:56:03.398609 7fd5605a9800 1 journal close /dev/sde3
    2016-08-24 23:56:03.398737 7fd5605a9800 0 probe_block_device_fsid /dev/sde3 is filestore, 6d2c0635-8d7e-4ef4-84f3-9e2efdbf624b

Those entries look normal to me except the bluestore call. Why is bluestore trying to do something on the journals? Isn't bluestore a replacement for filestore, which should come into production in future releases?

Cheers
Goncalo

#40 Updated by Brad Hubbard 7 months ago

This looks normal.

https://github.com/badone/ceph/blob/master/src/os/ObjectStore.cc#L88-L115

No recommendations on journal tuning until we understand the exact nature of the issue.

#41 Updated by Brad Hubbard 7 months ago

Hi Sean,

Following up for your issue specifically.

I found that fstrim gets called then the OSD's start to flap.

You are running fstrim on which filesystems? The OSD data filesystem? The journal partition? Could I see the actual command from your cron script?

Under normal circumstances the OSDs should never "flap" so this is good data and I'd like to explore this in more depth.

Previously I said, "our best opportunity to RCA this is to gather "ceph-objectstore-tool --op log" data from the relevant pg and try to find log entries relevant to the divergent object that may help us to establish a timeline". Unfortunately the log data was the only thing you didn't provide but, given we have a definite timeline now, I don't think we need it just yet; keep it in mind though. All the discussion of "list-omap" probably threw the focus onto it, so don't worry too much, as that part got a bit confused unfortunately.

sh: 1: /sbin/hdparm: not found

You don't have hdparm in your environment? If you take a look at the following code you can see what it's doing and I don't think it is relevant to you so I don't see it as a problem.

https://github.com/badone/ceph/blob/master/src/os/filestore/FileJournal.cc#L171

Ideally, what I'd like to see moving forward is this reproduced with debug_filestore and debug_journal set to 20 on the OSDs (note that this may cause considerable log growth so keep an eye on it).

If that is not immediately possible perhaps we can start with dmesg (only if the machine has not been rebooted since the issue) output, syslog, and the ceph osd logs covering the period from when the fstrim call was made to when the scrub error was discovered. Please make note of the time the fstrim call was issued so I know where to start.

#42 Updated by Brad Hubbard 7 months ago

Actually Sean,

I changed my mind. If you get the chance to grab the following output from each OSD it may provide a useful data point.

    # ceph-objectstore-tool --op log --pgid 5.15b --data-path /var/lib/ceph/osd/ceph-15
    # ceph-objectstore-tool --op log --pgid 5.15b --data-path /var/lib/ceph/osd/ceph-1
    # ceph-objectstore-tool --op log --pgid 5.15b --data-path /var/lib/ceph/osd/ceph-34

In this case we will be interested in entries referring to 5:da88df70:::10002639d84.00000000:head

[Edit] fixed up commands to match correct pgid and OSDs as per most recent pg query and deep-scrub log error [/edit]

#43 Updated by Brad Hubbard 7 months ago

Note that I edited the last comment.

#44 Updated by sean redmond 7 months ago

Hi,

I was not running fstrim explicitly; it seems Ubuntu 16.04 comes with the below cron.weekly script:

root@ceph-osd1:~# cat /etc/cron.weekly/fstrim
#!/bin/sh
# trim all mounted file systems which support it
/sbin/fstrim --all || true
root@ceph-osd1:~#

This appears to run on the OSD file systems. This cluster is all SSD with on-disk journals.

hdparm is not installed; maybe this should be a package dependency?

I have enclosed the ceph-objectstore-tool logs.

Thanks

#45 Updated by Brad Hubbard 7 months ago

These log views are consistent.

$ md5sum *
8b850c41d84c013d95be15ea10682f05 07_09_16_osd_15.txt
8b850c41d84c013d95be15ea10682f05 07_09_16_osd_1.txt
8b850c41d84c013d95be15ea10682f05 07_09_16_osd_34.txt

The logs include the last two operations on 5:da88df70:::10002639d84.00000000:head

$ grep -B2 -A13 5:da88df70:::10002639d84.00000000:head 07_09_16_osd_15.txt
            {
                "op": "modify  ",
                "object": "5:da88df70:::10002639d84.00000000:head",
                "version": "28705'372045",
                "prior_version": "28705'370709",
                "reqid": "mds.0.34101:12287097",
                "extra_reqids": [],
                "mtime": "2016-09-05 11:00:51.246239",
                "mod_desc": {
                    "object_mod_desc": {
                        "can_local_rollback": false,
                        "rollback_info_completed": false,
                        "ops": []
                    }
                }
            },
--
            {
                "op": "modify  ",
                "object": "5:da88df70:::10002639d84.00000000:head",
                "version": "28713'373017",
                "prior_version": "28705'372045",
                "reqid": "mds.0.34101:12877404",
                "extra_reqids": [],
                "mtime": "2016-09-06 09:07:32.491670",
                "mod_desc": {
                    "object_mod_desc": {
                        "can_local_rollback": false,
                        "rollback_info_completed": false,
                        "ops": []
                    }
                }
            },

I'm going to do some more research on the implications of this but I suspect it means the divergence happened after the transactions were committed to the database.

Sean, In the mean time are you able to provide any of the other data I asked for?

#46 Updated by Brad Hubbard 7 months ago

hdparm is not installed; maybe this should be a package dependency?

It is for rpm based installations.

# rpm -q --requires ceph-base|grep hdparm
hdparm

I'll look into why it's not for Ubuntu 16.04

#47 Updated by Ken Dreyer 6 months ago

  • Needs Doc set to No

After looking into the hdparm thing, Brad dropped the requirement for hdparm from master, https://github.com/ceph/ceph/pull/11073

#48 Updated by Kjetil Joergensen 6 months ago

"Me too!"

Of note:
  • Ceph Jewel 10.2.2
  • Started manifesting recently, after putting actual load onto cephfs (i.e. rsync -avH --delete /allourstuff/ /cephfs/allourstuff)
  • Only observed for the cephfs metadata pool, and only after putting some amount of load on it.
  • We do have tiering for the cephfs data pool
  • Cephfs metadata pool SSD only: Intel DC S3500, SSDSC2BB80, FW 2370 on LSI 3008 based/branded HBA (FW: 12.00.00.00, BIOS: 08.29.00.00)

Yes, that's some awfully similar hardware. Although, like previous commenters, we've run RBD load on this for a while now without this manifesting itself. It really only appeared after we started running our rsync(s), and while rsync is running it happens with some frequency.

For us, this seems to have happened exclusively for the object maps of 60X.00000000 (where I'm guessing that'd be the stray catalogs? ~mds/strayX), although that might simply be because we exercise the stray catalogs when running rsync --delete.

#49 Updated by Brad Hubbard 6 months ago

What would really help here is a minimal reproducer that reliably demonstrates the issue. If we can get that it should reproduce here and be much easier to work out and fix. Of course if it is hardware specific we would have trouble doing so but at least that will also be a datapoint. If anyone can provide instructions on creating a minimal reproducer it would be appreciated.

#50 Updated by Zheng Yan 6 months ago

A user who has corrupted cephfs metadata says he encountered this issue before. The way the cephfs metadata got corrupted is that some omaps have more entries than they should.

I wonder if the patch below helps. (FileStore::lfn_unlink does not always clear the omap.)

diff --git a/src/osd/ReplicatedBackend.cc b/src/osd/ReplicatedBackend.cc
index a090c2d..7d3700c 100644
--- a/src/osd/ReplicatedBackend.cc
+++ b/src/osd/ReplicatedBackend.cc
@@ -1729,6 +1729,7 @@ void ReplicatedBackend::submit_push_data(
     t->remove(coll, ghobject_t(target_oid));
     t->touch(coll, ghobject_t(target_oid));
     t->truncate(coll, ghobject_t(target_oid), recovery_info.size);
+    t->omap_clear(coll, ghobject_t(target_oid));
     t->omap_setheader(coll, ghobject_t(target_oid), omap_header);

     bufferlist bv = attrs[OI_ATTR];

#51 Updated by David Zafman 6 months ago

I think that 82fdacda1bf35b029ae8474ddf48c33e62ab81f6 may have caused this issue. I haven't looked closely at the expectations in this function, but if there is an object with an omap already there and no omap keys coming in the push, then the old omap keys will remain.

#52 Updated by Goncalo Borges 6 months ago

To Brad: It is difficult to actually say when this issue appears and how to reproduce it. When I opened the ticket I had already seen a couple of issues like this. Since the last occurrence, I have not seen any more.

To all: Following the discussions in this ticket and in the ML, it seems that the majority of these problems happen with objects like 60X.00000000. This was my case and Kjetil Joergensen's case (see comment #48). Assuming that 60X.00000000 are objects related to stray directories, this points to something not ok when deleting data. If that is true, it aligns with the case Zheng Yan is referring to (see comment #50 and the 'CephFS: No space left on device' thread in the users ML), since the metadata corruption was spotted when the user was trying to delete data.

#53 Updated by Brad Hubbard 6 months ago

The problem is that, given the difficulty of reproducing, it will be hard to determine whether any patch fixes the issue (although that should not stop us trying the patch). If it can be reproduced reliably by someone in some fashion then we can be much more certain that a particular patch fixes the issue.

#54 Updated by Zheng Yan 6 months ago

David Zafman wrote:

I think that 82fdacda1bf35b029ae8474ddf48c33e62ab81f6 may have caused this issue. I haven't looked closely at the expectations in this function, but if there is an object with an omap already there and no omap keys coming in the push, then the old omap keys will remain.

t->omap_setkeys(..) does not delete existing keys
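A toy model of why this matters during a recovery push (a sketch, not Ceph code; the omap is modeled as a plain dict, and omap_setkeys is modeled as dict.update, i.e. it writes the pushed keys but never deletes keys absent from the push):

```python
# Toy model of omap semantics during a recovery push (not Ceph code).
# Without an explicit omap_clear first, stale keys on the target object
# survive the push, which is the fix proposed in comment #50.

def omap_setkeys(omap, pushed):
    """Set/overwrite keys like omap_setkeys: no deletions."""
    omap.update(pushed)

def push_object(target_omap, auth_omap, clear_first=False):
    """Simulate rewriting an object's omap from the authoritative copy."""
    if clear_first:
        target_omap.clear()  # drop everything before applying the push
    omap_setkeys(target_omap, auth_omap)
    return target_omap

stale = {"old_dentry": b"gone on auth copy"}  # leftover key on the target
auth = {"kept_dentry": b"current value"}      # authoritative omap content

without_clear = push_object(dict(stale), auth)                   # stale key survives
with_clear = push_object(dict(stale), auth, clear_first=True)    # matches auth
```

The two copies then have different omap contents, which is exactly what a deep-scrub reports as an omap_digest mismatch.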

#55 Updated by Kjetil Joergensen 6 months ago

Hi,

So, a reproducible test case is hard. Our reproducer is a mixed rbd/cephfs load, where the triggering cephfs load is "not in production yet". It may well be that it's the "attempted migration load" (aka rsync) that's triggering this. That being said, I've tried this in my "lab" by essentially replicating the filesystem-layer load, without much luck (a bunch of hardlinks changing positions in the directory tree(s)).

The other thing I just remembered, which in hindsight likely isn't relevant, we are running ceph-mds v10.2.2 with a cherry-picked commit 4d15eb1 (mds: only open non-regular inode with mode FILE_MODE_PIN).

-KJ

#56 Updated by Zheng Yan 5 months ago

I think I found the root cause. FileStore::lfn_unlink does two operations: unlinking the object file from the filesystem, and clearing the object's keys from leveldb. If the OSD machine crashes after FileStore::lfn_unlink, the file unlink may have succeeded while leveldb failed to flush the change to disk. Later, when the OSD recovers and replays the REMOVE operation, the corresponding object file does not exist, so FileStore::lfn_unlink does nothing (it returns early when the hardlink count is 0) and the corresponding object keys never get removed. If we then re-create an object with the same name, the new object's omap will include the old object's omap keys.
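This failure sequence can be modeled with a toy store (a sketch, not Ceph code) under the assumptions above: file unlinks are durable immediately, leveldb (omap) changes are only durable after a sync, and replay of REMOVE returns early when the file is already gone:

```python
# Toy model of the FileStore::lfn_unlink crash window (not Ceph code).

class ToyStore:
    def __init__(self):
        self.files = set()
        self.omap = {}            # durable leveldb state
        self.omap_dirty = None    # staged leveldb state, lost on crash

    def create(self, name, keys):
        self.files.add(name)
        # like omap_setkeys: merge keys, never clears existing ones
        self.omap.setdefault(name, {}).update(keys)

    def unlink(self, name):
        if name not in self.files:
            return                # replay path: file gone, nothing done
        self.files.discard(name)  # file unlink is durable immediately
        pending = dict(self.omap)
        pending.pop(name, None)
        self.omap_dirty = pending # omap clear staged, not yet flushed

    def crash(self):
        self.omap_dirty = None    # unsynced leveldb changes are lost

store = ToyStore()
store.create("602.00000000", {"key1": b"v1"})
store.unlink("602.00000000")      # file removed, omap clear not yet synced
store.crash()
store.unlink("602.00000000")      # journal replay of REMOVE: early return
store.create("602.00000000", {"key2": b"v2"})
# the re-created object now carries the stale "key1" from the old object
```

On a replica that did not crash, the same sequence leaves only "key2", so the copies diverge exactly as the scrub errors in this ticket show.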

steps to reproduce:
- create a two OSDs cluster.
- create a test object, add an omap entry with key 'key1'
- kill the non-primary OSD for the test object (to prevent the test object's primary OSD from changing)
- gdb attach to the test object's primary OSD, set breakpoints at DBObjectMap::sync and DBObjectMap::clear
- use 'rados -p test_pool rm test_object' to remove the test object. When gdb hits the DBObjectMap::clear breakpoint, make DBObjectMap::clear return 0 directly (type 'return 0'). When gdb hits the DBObjectMap::sync breakpoint, kill the OSD process.
- restart the primary OSD
- re-create test object with same name, add an omap entry with key 'key2'
- list test object's omap keys

#57 Updated by Zheng Yan 5 months ago

  • Status changed from New to Need Review

#58 Updated by Zheng Yan 5 months ago

I think FileStore::_collection_move_rename has a similar bug. During replay, it may fail to clone the omap to the new object, or fail to remove the omap of the old object (FileStore::_close_replay_guard does not guarantee the object map is flushed).

#59 Updated by Yuri Weinstein 5 months ago

  • Status changed from Need Review to Resolved

#61 Updated by Samuel Just 4 months ago

  • Related to Bug #17867: jewel: omap deep scrub error during power cycle run added

#62 Updated by sean redmond 4 months ago

Hi,

Not sure if it helps, but I hit what I think is this issue today, not caused by fstrim this time or anything hardware-related I can put my finger on. The related logs are below:

=================================

ceph-mon1 ceph.log

+deep, 6 active+clean+scrubbing, 4663 active+clean; 3964 GB data, 13165 GB used, 13468 GB / 26634 GB avail; 203 MB/s rd, 482 MB/s wr, 240 op/s
2016-12-05 09:55:29.198813 osd.8 172.27.175.5:6805/24569 12115 : cluster [ERR] 5.1fa deep-scrub 0 missing, 1 inconsistent objects
2016-12-05 09:55:29.198819 osd.8 172.27.175.5:6805/24569 12116 : cluster [ERR] 5.1fa deep-scrub 1 errors

=================================

ceph-osd2 syslog

Dec 5 09:55:08 ceph-osd2 ceph-osd: 2016-12-05 09:55:08.198178 7fd366747700 -1 log_channel(cluster) log [ERR] : 5.1fa shard 12: soid 5:5fbfa2b9:::10002639ceb.00000000:head omap_digest 0x99e641fc != best guess omap_digest 0x29569b31 from auth shard 2
Dec 5 09:55:08 ceph-osd2 ceph-osd24569: 2016-12-05 09:55:08.198178 7fd366747700 -1 log_channel(cluster) log [ERR] : 5.1fa shard 12: soid 5:5fbfa2b9:::10002639ceb.00000000:head omap_digest 0x99e641fc != best guess omap_digest 0x29569b31 from auth shard 2
Dec 5 09:55:12 ceph-osd2 ceph-osd: 2016-12-05 09:55:12.000263 7f6b424ea700 1 leveldb: Level-0 table #223990: started
Dec 5 09:55:12 ceph-osd2 ceph-osd: 2016-12-05 09:55:12.036738 7f6b424ea700 1 leveldb: Level-0 table #223990: 1415571 bytes OK
Dec 5 09:55:12 ceph-osd2 ceph-osd: 2016-12-05 09:55:12.038159 7f6b424ea700 1 leveldb: Delete type=0 #223987
Dec 5 09:55:29 ceph-osd2 ceph-osd: 2016-12-05 09:55:29.198809 7fd366747700 -1 log_channel(cluster) log [ERR] : 5.1fa deep-scrub 0 missing, 1 inconsistent objects
Dec 5 09:55:29 ceph-osd2 ceph-osd: 2016-12-05 09:55:29.198818 7fd366747700 -1 log_channel(cluster) log [ERR] : 5.1fa deep-scrub 1 errors
Dec 5 09:55:29 ceph-osd2 ceph-osd[24569]: 2016-12-05 09:55:29.198809 7fd366747700 -1 log_channel(cluster) log [ERR] : 5.1fa deep-scrub 0 missing, 1 inconsistent objects
Dec 5 09:55:29 ceph-osd2 ceph-osd[24569]: 2016-12-05 09:55:29.198818 7fd366747700 -1 log_channel(cluster) log [ERR] : 5.1fa deep-scrub 1 errors

=================================

root@ceph-mon1:~/# for i in osd.12-5.1fa-omap.list osd.8-5.1fa-omap.list osd.2-pg5.1fa-omap.list ; do sum $i ; done
61312 211
27435 211
27435 211
root@ceph-mon1:~/#
root@ceph-mon1:~/5# diff osd.12-5.1fa-omap.list osd.8-5.1fa-omap.list
569a570
> B9025058-BD06-FFBD-C5110E94FF05B631.jpg_head
root@ceph-mon1:~/#

=================================

I have also enclosed the below log files:

->osd.8-pg5.1fa.oplog.gz
ceph-objectstore-tool --op log --pgid 5.1fa --data-path /var/lib/ceph/osd/ceph-8

->osd.2-pg5.1fa.oplog.gz
ceph-objectstore-tool --op log --pgid 5.1fa --data-path /var/lib/ceph/osd/ceph-2

->osd.12-pg5.1fa.oplog.gz
ceph-objectstore-tool --op log --pgid 5.1fa --data-path /var/lib/ceph/osd/ceph-12

->pg_query.txt
ceph pg query 5.1fa

->osd.2-pg5.1fa-omap.list
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1 '["5.15b",{"oid":"10002639d84.00000000","key":"","snapid":-2,"hash":251335003,"max":0,"pool":5,"namespace":"","max":0}]' list-omap

->osd.8-5.1fa-omap.list
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-8 '["5.1fa",{"oid":"10002639ceb.00000000","key":"","snapid":-2,"hash":2638609914,"max":0,"pool":5,"namespace":"","max":0}]' list-omap

->osd.12-5.1fa-omap.list
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 '["5.1fa",{"oid":"10002639ceb.00000000","key":"","snapid":-2,"hash":2638609914,"max":0,"pool":5,"namespace":"","max":0}]' list-omap
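
The sum/diff comparison above can be wrapped in a small shell helper that flags the diverging replica; file names are illustrative, and the dumps are assumed to have been produced beforehand with `ceph-objectstore-tool ... list-omap > osd.N-omap.list` while the OSD was stopped:

```shell
#!/bin/sh
# Compare per-replica omap key dumps (one key per line) against a
# reference dump (e.g. the one from the auth shard) and print the
# differing keys for any replica that diverges.
compare_omap_lists() {
    ref="$1"   # reference dump
    shift
    for f in "$@"; do
        if diff -q "$ref" "$f" >/dev/null 2>&1; then
            echo "OK: $f matches $ref"
        else
            echo "DIVERGES: $f vs $ref"
            # show only the lines that differ ('<' ref-only, '>' replica-only)
            diff "$ref" "$f" | grep '^[<>]'
        fi
    done
}
```

Run as, e.g., `compare_omap_lists osd.2-pg5.1fa-omap.list osd.8-5.1fa-omap.list osd.12-5.1fa-omap.list` to reproduce the comparison shown above.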

If any more detail or logs would be helpful just let me know.

Thanks

#63 Updated by Brad Hubbard 4 months ago

Hi Sean,

What version are you currently running? Still 10.2.2?

#64 Updated by sean redmond 4 months ago

Hi Brad,

The cluster was upgraded a while back:

root@ceph-mon1:~# ceph -v
ceph version 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)
root@ceph-mon1:~#

#65 Updated by Brad Hubbard 4 months ago

Hey Sean,

This is the fix.

$ git describe 73a1b45ec01a7debe0520015a11afdaa11dfd269
v11.0.2-652-g73a1b45

It has not been backported to jewel, but I can request that if you want. Alternatively, you can wait for Kraken and upgrade then.
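
The `git describe` check above can also be done the other way around: listing every tagged release whose history already contains the fix commit. A sketch (run inside a ceph.git clone; commit hash from the comment above):

```shell
# Print every tag whose history contains the given commit. If no
# v10.* (jewel) tag appears, the fix has not reached a jewel release.
releases_containing() {
    git tag --contains "$1"
}

# Example, inside a ceph.git checkout:
#   releases_containing 73a1b45ec01a7debe0520015a11afdaa11dfd269
```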

#66 Updated by sean redmond 4 months ago

It would be very useful to see this in jewel, as I prefer to use the LTS releases over the 'stable' releases.

#67 Updated by Goncalo Borges 3 months ago

According to Zheng's comments on the users mailing list, this is a bug that causes data corruption (although I have not explicitly seen it). So, in my opinion, a bug generating data corruption should be treated urgently and should definitely be backported to the Jewel LTS, which will be around for quite some time.

#68 Updated by Brad Hubbard 3 months ago

  • Status changed from Resolved to Pending Backport

Sure, I see no problem with that.

#69 Updated by Brad Hubbard 3 months ago

  • Copied to Backport #18183: jewel: cephfs metadata pool: deep-scrub error "omap_digest != best guess omap_digest" added

#70 Updated by Loic Dachary 3 months ago

  • Backport set to jewel

#71 Updated by sean redmond 3 months ago

In the last 24 hours there was corruption in the metadata pool. It resulted in an assert loop on the MDS until the corruption was manually cleaned up. I presume this is related to hitting this bug.

#72 Updated by Brad Hubbard 3 months ago

Sorry to hear that, Sean. The backport is in progress, so it will hopefully make it into the next Jewel release.

#73 Updated by Brad Hubbard about 1 month ago

  • Status changed from Pending Backport to Resolved
