Bug #23365
Ceph device class not honored for erasure coding.
Description
To start, this cluster isn't happy. It is my destructive testing/learning cluster.
Recently I rebuilt the cluster, adding SSDs (it had used only HDDs before), and I have been having some issues. First it was performance, which dropped by a fair amount (down to just 2 MB/s per stream); then I had PGs suddenly go "not active" without any failures. And now it is all sorts of mad. BUT, for this case, that is just context.
When I first added the SSDs I had their reweight set VERY low to prevent data from the old pools (since removed) being migrated onto them. Thinking that could have been causing some of my issues, I returned the weight to normal. This triggered some rebalancing, but by the next day the SSDs had all filled (one died in the process, bad hardware). This puzzled me, as the metadata pool is only about 80 MB and the write cache hovers around 3.5 GB.
So I started digging, thinking that the data pool may have been migrated to the SSDs for some reason.
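(For reference, something like the commands below is what I mean by setting the reweight low and then returning it to normal; the OSD id and values are purely illustrative, not the exact ones I used:)
ceph osd reweight 11 0.001
ceph osd reweight 11 1.0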
Let's get our data pool:
root@MediaServer:~# ceph df
...
    NAME                       ID     USED       %USED     MAX AVAIL     OBJECTS
    MigrationPool              17      6308G    100.00             0     2019061
    MigrationPool-Meta         18     70932k    100.00             0       88098
    MigrationPool-WriteCache   19      3395M    100.00             0         875
...
We are interested in ID 17; here is that pool's erasure-code profile:
root@MediaServer:~# ceph osd pool get MigrationPool all
...
erasure_code_profile: Erasure-D5F1-HDD
...
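(It may also be worth dumping the CRUSH rule the pool actually maps through; for a class-restricted rule, the take step should reference an hdd shadow root such as default~hdd. The rule name below is a placeholder, take it from the first command's output:)
ceph osd pool get MigrationPool crush_rule
ceph osd crush rule dump <rule-name-from-above>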
That profile's device class:
root@MediaServer:~# ceph osd erasure-code-profile get Erasure-D5F1-HDD
crush-device-class=hdd
...
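(For context, a profile like this would normally have been created along the lines below; k, m and the failure domain here are guesses based on the profile name, not values read from the cluster:)
ceph osd erasure-code-profile set Erasure-D5F1-HDD k=5 m=1 crush-failure-domain=osd crush-device-class=hdd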
Let's pick an SSD:
root@MediaServer:~# ceph osd df | sort -n -k1
ID CLASS WEIGHT  REWEIGHT SIZE   USE    AVAIL %USE  VAR  PGS
...
11 ssd   0.09999 1.00000  95392M 88634M 6757M 92.92 1.55  14
...
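(Another sanity check, on releases with device classes, is to confirm that OSD 11 only appears under the ssd shadow hierarchy:)
ceph osd crush class ls
ceph osd crush tree --show-shadow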
And finally, let's look for pool ID 17 on that SSD (note OSD 11 in the acting set of every PG below):
root@MediaServer:~# ceph pg ls-by-osd 11 | grep '17\.'
17.16 31854 0 0 0 0 106917534705 1557 1557 active+clean+remapped 2018-03-14 04:11:17.392929 2996'82494 3490:274645 [2,1,2147483647,12,11,3] 2 [2,1,6,12,11,3] 2 2996'82494 2018-03-14 04:11:17.392736 2996'82494 2018-03-10 23:07:48.358707
17.35 31370 0 0 0 0 104997447663 1623 1623 active+clean 2018-03-14 05:31:53.160644 2996'81594 3490:315943 [12,3,1,6,2,11] 12 [12,3,1,6,2,11] 12 2996'81594 2018-03-14 05:31:53.160520 2993'81192 2018-03-07 15:01:11.463540
17.36 31787 0 0 0 0 106702587303 1500 1500 active+clean 2018-03-14 04:10:48.600589 2996'82305 3490:418755 [12,2,1,3,8,11] 12 [12,2,1,3,8,11] 12 2996'82305 2018-03-14 04:10:48.600464 2996'82305 2018-03-12 05:00:40.453809
17.3e 31662 0 0 0 0 106070903943 1553 1553 active+clean 2018-03-14 03:49:39.645138 3013'81823 3490:289769 [2,1,3,6,11,12] 2 [2,1,3,6,11,12] 2 3013'81823 2018-03-14 03:49:39.645014 3013'81823 2018-03-12 03:38:14.041181
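(To rule out a stale listing, the mapping of one of those PGs can be confirmed directly; the PG id is taken from the output above:)
ceph pg map 17.35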
That's not good.... Ideas?