Bug #44072

Adding new Bluestore OSDs to a Filestore cluster leads to scrub errors (union_shard_errors=missing)

Added by Aleksandr Rudenko about 4 years ago. Updated about 4 years ago.

Source: Community (user)
Severity: 1 - critical
I set severity=Critical to draw attention because I think this is a serious problem!

We have two different Luminous clusters (12.2.12). All OSD pools are replicated with size=3 and min_size=2. The clusters are used for S3 (RadosGW).
The upgrade to Luminous was completed about 1.5 years ago. All recommended flags were set ('sortbitwise' etc.).
Until now all OSDs were filestore (journal on SSD) and everything was fine.

About 3 months ago we added the first bluestore (BS) OSDs to our small cluster.
After some time we got 'pgs inconsistent' errors.

After facing this issue, we tried adding the first BS OSDs to our second cluster, but we set primary-affinity=0 for them. We thought this might help.
After about a month I saw that many PGs on these BS OSDs had been scrubbed successfully without any errors.
But today we got the first 'pgs inconsistent' error on the second cluster. One of the OSDs in the affected PG is BS.

Some info about our cluster settings:

ceph osd dump | head -n 12

epoch 290315
fsid {truncated}
created 2015-07-31 16:05:27.389478
modified 2020-02-11 10:03:24.517865
flags sortbitwise,recovery_deletes,purged_snapdirs
crush_version 1946
full_ratio 0.95
backfillfull_ratio 0.9
nearfull_ratio 0.85
require_min_compat_client jewel
min_compat_client jewel
require_osd_release luminous

debug filestore = 0
debug journal = 0
debug ms = 0
debug osd = 0
filestore fd cache size = 512
filestore op threads = 6
# TODO: reduce these timeouts after enabling autoresharding
filestore op thread timeout = 180
filestore op thread suicide timeout = 240
osd enable op tracker = false
osd journal size = 1000
osd max backfills = 1
osd recovery max active = 1
osd recovery sleep hdd = 0.2
osd scrub begin hour = 0
osd scrub end hour = 8
osd scrub sleep = 2
osd scrub chunk min = 1
osd scrub chunk max = 2
osd disk thread ioprio class = idle
osd disk thread ioprio priority = 7
osd disk threads = 6 # formerly known as 'osd op threads'
osd peering wq threads = 6
# TODO: reduce this timeout after enabling autoresharding
osd op thread timeout = 120
throttler perf counter = false

# filestore OSD
    osd uuid = {truncated}-89fb-000000000000
    host = xx
    public addr = xx.xx.xx.xx
    osd journal = /dev/ceph1/journal-0
# bluestore OSD
    osd uuid = {truncated}-89fb-000000000490
    host = xxx
    public addr = xx.xx.xx.xx

I am not adding more details because all the useful information has already been written here:
This issue is very important for us because we can't continue migrating to BS. One of our clusters is big (more than 1 PB of data) and we can't quickly and easily migrate all FS OSDs to BS. And we can't skip scrubbing during the migration because scrub is very important for data consistency.


#1 Updated by David Zafman about 4 years ago

Two questions:

Do all the objects with missing copies have names that include multi-byte characters?

Are the OSDs with missing copies always filestore or always bluestore?

#2 Updated by Aleksandr Rudenko about 4 years ago

Hi, David

Do all the objects with missing copies have names that include multi-byte characters?

Yes, most of the missing objects have names that include multi-byte characters.
But there are also objects with ASCII-only names, for example:

ovBck/SEKRETERYA 13.09.2019/SEVGI belgelerim/FATURA/OXETTE ucret.doc  

This name was checked with grep -v -P "[^\x00-\x7F]"

Overall, ~90 of the missing objects have names that include multi-byte characters and ~3 have names that do NOT.

Are the OSDs with missing copies always filestore or always bluestore?

No. For most PGs I see missing objects on FS OSDs, but sometimes I see missing objects on BS too.
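
To make this answer less dependent on eyeballing, the per-shard errors can be tallied by object store type. The following is only a minimal sketch, assuming the JSON layout that `rados list-inconsistent-obj <pgid> --format=json` prints in Luminous (an `inconsistents` list whose entries carry a `shards` list with `osd` and `errors` fields). The function name, the `backend_by_osd` mapping, and the sample data are hypothetical and not taken from the affected clusters.

```python
import json
from collections import Counter


def count_missing_by_backend(report: dict, backend_by_osd: dict) -> Counter:
    """Count shards flagged 'missing', grouped by object store type."""
    counts = Counter()
    for obj in report.get("inconsistents", []):
        for shard in obj.get("shards", []):
            if "missing" in shard.get("errors", []):
                # OSDs absent from the hand-made mapping are counted as 'unknown'.
                backend = backend_by_osd.get(shard["osd"], "unknown")
                counts[backend] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample in the assumed layout; real input would come from
    # `rados list-inconsistent-obj <pgid> --format=json`.
    sample = json.loads("""
    {"epoch": 1, "inconsistents": [
      {"object": {"name": "obj-a"}, "union_shard_errors": ["missing"],
       "shards": [{"osd": 0, "errors": ["missing"]},
                  {"osd": 490, "errors": []},
                  {"osd": 7, "errors": []}]}
    ]}
    """)
    # Hypothetical mapping of OSD id to backend, filled in by hand.
    backends = {0: "filestore", 7: "filestore", 490: "bluestore"}
    print(dict(count_missing_by_backend(sample, backends)))
```

Running this over every inconsistent PG and summing the counters would show whether the missing copies lean toward one backend.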

#3 Updated by Aleksandr Rudenko about 4 years ago

grep for checking ASCII-only names:

grep -v -P "[^\x00-\x7F]" 
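
The same check can be sketched in Python, which is convenient when the object names come from JSON output instead of a text file. This is a minimal sketch rather than the exact procedure used for the numbers above; `is_ascii_only` and `split_by_ascii` are hypothetical helper names, and the second sample name is made up for illustration.

```python
def is_ascii_only(name: str) -> bool:
    """Return True if every character is in the ASCII range (0x00-0x7F)."""
    return all(ord(ch) <= 0x7F for ch in name)


def split_by_ascii(names):
    """Split names into (ascii_only, multi_byte) lists, preserving order."""
    ascii_only, multi_byte = [], []
    for name in names:
        (ascii_only if is_ascii_only(name) else multi_byte).append(name)
    return ascii_only, multi_byte


if __name__ == "__main__":
    sample = [
        "ovBck/SEKRETERYA 13.09.2019/SEVGI belgelerim/FATURA/OXETTE ucret.doc",
        "ovBck/SEKRETERYA/\u00fccret.doc",  # made-up name with a non-ASCII character
    ]
    ascii_only, multi_byte = split_by_ascii(sample)
    print(len(ascii_only), "ASCII-only,", len(multi_byte), "with multi-byte characters")
```

Like the grep above, `is_ascii_only` accepts a name only when it contains no character outside \x00-\x7F.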
