Bug #23365
Ceph device class not honored for erasure coding.
Description
To start, this cluster isn't happy. It is my destructive testing/learning cluster.
Recently I rebuilt the cluster, adding SSDs (having used just HDDs before), and I have been having some issues since. First, performance dropped by a fair amount (down to just 2MBps per stream); then PGs suddenly went "not active" without any failures; and now it is all sorts of mad. BUT, for this case, that is just some context.
When I first added the SSDs I set their reweight VERY low to prevent data from the old pools (since removed) being migrated onto them. Thinking that could have been causing some of my issues, I returned the weight to normal. That triggered some rebalancing, but by the next day the SSDs had all filled (one died in the process; bad hardware). This puzzled me, as the metadata pool is only about 80MB and the write cache hovers around 3.5GB.
So I started digging, thinking that the data pool may have been migrated to the SSDs for some reason.
Let's get our data pool:
root@MediaServer:~# ceph df
...
NAME                     ID USED   %USED  MAX AVAIL OBJECTS
MigrationPool            17 6308G  100.00         0 2019061
MigrationPool-Meta       18 70932k 100.00         0   88098
MigrationPool-WriteCache 19 3395M  100.00         0     875
...
We are interested in ID 17; here is that pool's profile:
root@MediaServer:~# ceph osd pool get MigrationPool all
...
erasure_code_profile: Erasure-D5F1-HDD
...
That profile's class:
root@MediaServer:~# ceph osd erasure-code-profile get Erasure-D5F1-HDD
crush-device-class=hdd
...
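The profile sets crush-device-class=hdd, so the CRUSH rule generated from it should "take" from the hdd class shadow tree (default~hdd) rather than from plain "default". A quick way to verify that is a sketch like the following; the rule name and the sample JSON are hypothetical stand-ins for the real dump:

```shell
# On the live cluster you would run (pool name from this report):
#   ceph osd pool get MigrationPool crush_rule
#   ceph osd crush rule dump <rule-name>
# A class-restricted rule's "take" step references the class shadow tree,
# e.g. "default~hdd". Offline check against a saved dump (sample JSON is
# hypothetical):
cat > /tmp/rule.json <<'EOF'
{
  "rule_name": "MigrationPool-rule",
  "steps": [
    { "op": "take", "item": -4, "item_name": "default~hdd" },
    { "op": "chooseleaf_indep", "num": 0, "type": "host" },
    { "op": "emit" }
  ]
}
EOF
if grep -q '"item_name": "default~hdd"' /tmp/rule.json; then
  echo "rule takes from default~hdd (class restriction present)"
else
  echo "rule is NOT class-restricted"
fi
```

If the take step references plain "default" instead of "default~hdd", the class restriction never made it into the rule, and PGs from the pool can legitimately land on the SSDs.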
Let's pick an SSD:
root@MediaServer:~# ceph osd df | sort -n -k1
ID CLASS WEIGHT  REWEIGHT SIZE   USE    AVAIL %USE  VAR  PGS
...
11 ssd   0.09999 1.00000  95392M 88634M 6757M 92.92 1.55  14
...
And finally, let's look for pool ID 17 on that SSD:
root@MediaServer:~# ceph pg ls-by-osd 11 | grep '17\.'
17.16 31854 0 0 0 0 106917534705 1557 1557 active+clean+remapped 2018-03-14 04:11:17.392929 2996'82494 3490:274645 [2,1,2147483647,12,11,3] 2 [2,1,6,12,11,3] 2 2996'82494 2018-03-14 04:11:17.392736 2996'82494 2018-03-10 23:07:48.358707
17.35 31370 0 0 0 0 104997447663 1623 1623 active+clean 2018-03-14 05:31:53.160644 2996'81594 3490:315943 [12,3,1,6,2,11] 12 [12,3,1,6,2,11] 12 2996'81594 2018-03-14 05:31:53.160520 2993'81192 2018-03-07 15:01:11.463540
17.36 31787 0 0 0 0 106702587303 1500 1500 active+clean 2018-03-14 04:10:48.600589 2996'82305 3490:418755 [12,2,1,3,8,11] 12 [12,2,1,3,8,11] 12 2996'82305 2018-03-14 04:10:48.600464 2996'82305 2018-03-12 05:00:40.453809
17.3e 31662 0 0 0 0 106070903943 1553 1553 active+clean 2018-03-14 03:49:39.645138 3013'81823 3490:289769 [2,1,3,6,11,12] 2 [2,1,3,6,11,12] 2 3013'81823 2018-03-14 03:49:39.645014 3013'81823 2018-03-12 03:38:14.041181
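To quantify the problem, the listing above can be tallied into a count of pool-17 PGs mapped to the SSD. This is a sketch; the saved listing below is a trimmed, hypothetical stand-in for the real `ceph pg ls-by-osd` output:

```shell
# On the live cluster: ceph pg ls-by-osd 11 | awk '$1 ~ /^17\./' | wc -l
# Offline, against a saved copy of the listing (sample lines are trimmed
# and hypothetical):
cat > /tmp/pgs-osd11.txt <<'EOF'
17.16 active+clean+remapped
17.35 active+clean
17.36 active+clean
17.3e active+clean
19.1a active+clean
EOF
count=$(awk '$1 ~ /^17\./' /tmp/pgs-osd11.txt | wc -l)
echo "pool 17 PGs on osd.11: $count"
```

Any nonzero count here means erasure-coded data from the hdd-profiled pool is sitting on the SSD, which the crush-device-class setting should have prevented.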
That's not good.... Ideas?
Updated by Greg Farnum about 6 years ago
- Project changed from Ceph to RADOS
- Category deleted (common)
What version are you running? How are your OSDs configured?
There was a bug with BlueStore SSDs being misreported as rotational for some purposes that may have caused this.
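One way to check for that misreporting is to compare the kernel's rotational flag for the device with what the OSD recorded in its metadata. This is a sketch; the OSD id, device name, and sample JSON values are hypothetical:

```shell
# On the OSD host, for the device backing the SSD OSD (sdX is a placeholder):
#   cat /sys/block/sdX/queue/rotational      # 0 = SSD, 1 = spinning
#   ceph osd metadata 11 | grep -i rotational
# Offline check of a saved metadata dump (sample values are hypothetical;
# "1" on an SSD-backed OSD would match the misreporting described above):
cat > /tmp/osd11-meta.json <<'EOF'
{
  "bluestore_bdev_rotational": "1",
  "rotational": "1"
}
EOF
if grep -q '"rotational": "1"' /tmp/osd11-meta.json; then
  echo "osd reports its device as rotational -- suspicious for an SSD"
fi
```

If the kernel says 0 but the OSD metadata says 1, the OSD registered itself with the wrong rotational property and related heuristics may misbehave.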
Updated by Brian Woods about 6 years ago
I put 12.2.2, but that is incorrect. It is actually ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable).
OSD Tree:
ID  CLASS WEIGHT  TYPE NAME            STATUS REWEIGHT PRI-AFF
-10       0       datacenter home
 -1       9.53093 root default
 -2       9.10095     host MediaServer
  1 hdd   4.20000         osd.1            up  1.00000 1.00000
  2 hdd   1.54648         osd.2            up  1.00000 1.00000
  3 hdd   1.45551         osd.3            up  1.00000 1.00000
  4 hdd   0.17000         osd.4            up  1.00000 1.00000
  5 hdd   0.00899         osd.5            up  1.00000 1.00000
 12 hdd   1.50000         osd.12           up  1.00000 1.00000
  9 ssd   0.06000         osd.9          down  1.00000 1.00000
 10 ssd   0.06000         osd.10           up  1.00000 1.00000
 11 ssd   0.09999         osd.11           up  1.00000 1.00000
 -3       0.42998     host TheMonolith
  6 hdd   0.17000         osd.6            up  1.00000 1.00000
  8 hdd   0.15999         osd.8            up  1.00000 1.00000
  7 ssd   0.09999         osd.7            up  0.20000 1.00000
Side note, the OSD that I thought had died is actually in a crash loop of some sort.
ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)
 1: (()+0xa74234) [0x558d65225234]
 2: (()+0x11390) [0x7f66bdbd5390]
 3: (gsignal()+0x38) [0x7f66bcb70428]
 4: (abort()+0x16a) [0x7f66bcb7202a]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x28e) [0x558d652689fe]
 6: (pi_compact_rep::add_interval(bool, PastIntervals::pg_interval_t const&)+0x435) [0x558d64eed115]
 7: (PastIntervals::check_new_interval(int, int, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, int, int, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, unsigned int, unsigned int, std::shared_ptr<OSDMap const>, std::shared_ptr<OSDMap const>, pg_t, IsPGRecoverablePredicate*, PastIntervals*, std::ostream*)+0x396) [0x558d64eca8a6]
 8: (OSD::build_past_intervals_parallel()+0xd9d) [0x558d64c7417d]
 9: (OSD::load_pgs()+0x14fb) [0x558d64c76c7b]
 10: (OSD::init()+0x2217) [0x558d64c94d07]
 11: (main()+0x2f07) [0x558d64ba3f17]
 12: (__libc_start_main()+0xf0) [0x7f66bcb5b830]
 13: (_start()+0x29) [0x558d64c2f6b9]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   1/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 1 reserver
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 xio
   1/ 5 compressor
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   4/ 5 memdb
   1/ 5 kinetic
   1/ 5 fuse
   1/ 5 mgr
   1/ 5 mgrc
   1/ 5 dpdk
   1/ 5 eventtrace
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent 10000
  max_new 1000
  log_file /var/log/ceph/ceph-osd.9.log
--- end dump of recent events ---
Updated by Brian Woods about 6 years ago
A quote from Greg Farnum on the crash from another ticket:
Brian, that's a separate bug; the code address you've picked up on is just part of the generic failure handling code.
Reading it again, maybe it is just a hardware failure.