Bug #39103 (closed)

bitmap allocator makes %USE of OSD go over 100%

Added by hoan nv about 5 years ago. Updated almost 5 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

After pull request https://github.com/ceph/ceph/pull/26983 was merged, I installed these packages:

ceph-common-13.2.5-101.ga1aa89a.el7.x86_64
ceph-mgr-13.2.5-101.ga1aa89a.el7.x86_64
ceph-mds-13.2.5-101.ga1aa89a.el7.x86_64
ceph-13.2.5-101.ga1aa89a.el7.x86_64
libcephfs2-13.2.5-101.ga1aa89a.el7.x86_64
ceph-selinux-13.2.5-101.ga1aa89a.el7.x86_64
ceph-osd-13.2.5-101.ga1aa89a.el7.x86_64
python-cephfs-13.2.5-101.ga1aa89a.el7.x86_64
ceph-base-13.2.5-101.ga1aa89a.el7.x86_64
ceph-mon-13.2.5-101.ga1aa89a.el7.x86_64

and added the following configuration:

[osd]
bluestore_allocator = bitmap
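
To confirm after a restart that an OSD has actually picked up the new allocator, the value can be read back through the admin socket on the OSD's host (osd.0 here is only an example):

ceph daemon osd.0 config get bluestore_allocator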

After restarting one OSD, the %USE of that OSD went over 100%:

ceph osd df tree
ID CLASS WEIGHT  REWEIGHT SIZE    USE     AVAIL   %USE          VAR  PGS TYPE NAME
-4       7.19995        - 7.3 TiB  16 EiB 7.3 TiB  230759568.00 1.13   - root ssd
-3       3.59998        - 3.6 TiB 5.3 GiB 3.6 TiB          0.14    0   -     host ssd-ceph-2
 0   ssd 0.89999  1.00000 930 GiB 1.3 GiB 929 GiB          0.14    0 128         osd.0
 1   ssd 0.89999  1.00000 930 GiB 1.3 GiB 929 GiB          0.14    0 139         osd.1
 2   ssd 0.89999  1.00000 930 GiB 1.3 GiB 929 GiB          0.14    0 114         osd.2
 3   ssd 0.89999  1.00000 930 GiB 1.3 GiB 929 GiB          0.14    0 131         osd.3
-5       3.59998        - 3.6 TiB  16 EiB 3.7 TiB  461405824.00 2.25   -     host ssd-ceph-3
 4   ssd 0.89999  1.00000 930 GiB  16 EiB 965 GiB 1846529920.00 9.00  36         osd.4
 5   ssd 0.89999  1.00000 931 GiB 1.4 GiB 930 GiB          0.15    0 152         osd.5
 6   ssd 0.89999  1.00000 931 GiB 1.3 GiB 930 GiB          0.14    0 133         osd.6
 7   ssd 0.89999  1.00000 931 GiB 1.3 GiB 930 GiB          0.14    0 125         osd.7
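
For scale, the bogus USE value of 16 EiB equals 2^64 bytes, which looks like an unsigned 64-bit counter that wrapped below zero; 2^64 B divided by the ~930 GiB device size, times 100, is roughly 1.85e9, the same order of magnitude as the %USE reported for osd.4.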

This makes my cluster stop working (IOPS = 0), and ceph health reports:

1 full osd(s)
1 pool(s) full
Degraded data redundancy: 52/924 objects degraded (5.628%), 38 pgs degraded, 38 pgs undersized
Degraded data redundancy (low space): 2 pgs backfill_toofull, 36 pgs recovery_toofull

The OSD startup log contains:

2019-04-04 15:44:37.580 7fa86eb29700 -1 log_channel(cluster) log [ERR] : full status failsafe engaged, dropping updates, now 1846530048% full

A detailed log is in the attached file.


Files

bitmap-error.log (217 KB), hoan nv, 04/04/2019 09:02 AM
ceph-osd.4.log.tar.gz (664 KB), OSD startup log with debug bluestore = 20, hoan nv, 04/11/2019 09:16 AM
Actions #1

Updated by Igor Fedotov about 5 years ago

  • Pull request ID set to 27366

Probably that's the missed backport which triggers the issue.
See https://github.com/ceph/ceph/pull/27366

Actions #2

Updated by Igor Fedotov about 5 years ago

  • Status changed from New to Need More Info

@hoan nv, could you please apply the patch once it's available and report back whether it helps?

Actions #3

Updated by hoan nv about 5 years ago

Igor Fedotov wrote:

@hoan nv, could you please apply the patch once it's available and report back whether it helps?

I will try it and report back.

Actions #4

Updated by hoan nv about 5 years ago

Igor Fedotov wrote:

Probably that's the missed backport which triggers the issue.
See https://github.com/ceph/ceph/pull/27366

This pull request does not fix my issue.

Actions #5

Updated by Igor Fedotov about 5 years ago

Could you please collect OSD startup log with 'debug bluestore' set to 20?

Actions #6

Updated by hoan nv about 5 years ago

Igor Fedotov wrote:

Could you please collect OSD startup log with 'debug bluestore' set to 20?

Yes, my log is in the attached file.

Actions #7

Updated by Igor Fedotov about 5 years ago

I just managed to reproduce your issue on mimic HEAD.
After applying the patch from https://github.com/ceph/ceph/pull/27366, the reporting is fixed:

before:
ID CLASS WEIGHT  REWEIGHT SIZE    USE    AVAIL   %USE          VAR  PGS TYPE NAME
-1       0.90860        - 930 GiB 16 EiB 965 GiB 1846529920.00 1.00    - root default
-3       0.90860        - 930 GiB 16 EiB 965 GiB 1846529920.00 1.00    -     host crius
 0   ssd 0.90860  1.00000 930 GiB 16 EiB 965 GiB 1846529920.00 1.00   16         osd.0
                   TOTAL  930 GiB 16 EiB 965 GiB 1846529920.00

after:
ID CLASS WEIGHT  REWEIGHT SIZE    USE     AVAIL   %USE VAR  PGS TYPE NAME
-1       0.90860        - 930 GiB 1.0 GiB 929 GiB 0.11 1.00   - root default
-3       0.90860        - 930 GiB 1.0 GiB 929 GiB 0.11 1.00   -     host crius
 0   ssd 0.90860  1.00000 930 GiB 1.0 GiB 929 GiB 0.11 1.00  16         osd.0
                   TOTAL  930 GiB 1.0 GiB 929 GiB 0.11

Could you please double check if you applied the patch properly?

Actions #8

Updated by hoan nv about 5 years ago

Could you please double check if you applied the patch properly?

Yes, I will rebuild and recheck.

Actions #9

Updated by hoan nv about 5 years ago

These are my steps to build and install the RPMs (roughly the shell commands sketched after the list):

Check out the remotes/origin/mimic branch from github.com/ceph/ceph
Apply this patch
Run ./make-srpm.sh
Run rpmbuild --rebuild ceph-13.2.5-140.g5ae3e4b.el7.src.rpm
Install the packages on the server
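
Roughly, as shell commands (the local branch name, fetching the PR as a .patch, and the exact SRPM filename are only illustrations; the real filename depends on the build):

git clone https://github.com/ceph/ceph.git && cd ceph
git checkout -b mimic-fix origin/mimic
curl -L https://github.com/ceph/ceph/pull/27366.patch | git am   # apply the fix on top of mimic
./make-srpm.sh                                                   # produces a ceph-*.src.rpm in the source tree
rpmbuild --rebuild ceph-13.2.5-*.el7.src.rpm
# then install the resulting RPMs on the server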

Did I get any step wrong?

Actions #10

Updated by Igor Fedotov about 5 years ago

Looks good to me.

Suggest inserting some new logging, e.g.:

int BlueStore::_mount(bool kv_only, bool open_db) {
  dout(1) << __func__ << " path " << path << " MY CODE" << dendl;
  ...

to make sure you are running the new code.

Then check whether the message appears in the log after a restart.
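
For example, something like this should find it, assuming the default log location for osd.4:

grep "MY CODE" /var/log/ceph/ceph-osd.4.log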

Actions #11

Updated by hoan nv about 5 years ago

Igor Fedotov wrote:

Looks good to me.

Suggest inserting some new logging, e.g.:

int BlueStore::_mount(bool kv_only, bool open_db) {
  dout(1) << __func__ << " path " << path << " MY CODE" << dendl;
  ...

to make sure you are running the new code.

Then check whether the message appears in the log after a restart.

I added your logging code and rebuilt, but the new build does not seem to be patched.
I don't know why.
Do you have an SRPM file which has the patch? Please share it with me if you can.
Thanks.

Actions #12

Updated by Igor Fedotov about 5 years ago

Unfortunately I don't use RPMs in my lab, and I have pretty limited expertise in this area, so I can hardly advise on it.

Also, I'm not sure whether RPMs (if any) built in my lab, which runs SUSE Linux, would be applicable to your environment.

Actions #13

Updated by hoan nv about 5 years ago

I patched the code.

It works :)

Thanks.

Actions #14

Updated by Igor Fedotov about 5 years ago

  • Status changed from Need More Info to Fix Under Review
Actions #15

Updated by Igor Fedotov almost 5 years ago

  • Status changed from Fix Under Review to Resolved