Bug #14900
Closed
hammer: Massive crash of OSD daemons in cache tier pool
Status:
Rejected
Priority:
High
Assignee:
-
Category:
OSD
Target version:
-
% Done:
0%
Source:
other
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Description
Cluster:
- 5 servers
- CentOS 7.2 with the latest updates
- Ceph 0.94.6 from the Inktank repo
- 4 SSDs and 12 HDDs in each server
- HDDs and SSDs have their own roots in the CRUSH map ("slow" for HDD, "ssd" for SSD)
- a journal for each drive placed on the same drive
- two pools created:
- a replicated pool on SSD, used as a tier for the EC pool; tier mode: "readonly"
- an EC ISA pool on HDD
A rados benchmark on the EC pool without the cache tier works fine.
But after I add the cache tier pool and try to create an RBD image, I see a massive crash of OSD daemons in the cache tier pool: 14 of 20 drives crashed instantly.
In the log files I found:
    -2> 2016-02-25 16:21:09.674398 7f17ed73b700  3 osd.0 403 handle_osd_map epochs [404,404], i have 403, src has [1,404]
    -1> 2016-02-25 16:21:09.675503 7f17ed73b700  0 osd.0 404 crush map has features 17971250724864, adjusting msgr requires for osds
     0> 2016-02-25 16:21:09.681995 7f17e4729700 -1 ./include/interval_set.h: In function 'void interval_set<T>::erase(T, T) [with T = snapid_t]' thread 7f17e4729700 time 2016-02-25 16:21:09.675951
./include/interval_set.h: 386: FAILED assert(_size >= 0)
 ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0xbc9195]
 2: (interval_set<snapid_t>::subtract(interval_set<snapid_t> const&)+0xc0) [0x81ec80]
 3: (PGPool::update(std::tr1::shared_ptr<OSDMap const>)+0x54e) [0x7f21be]
 4: (PG::handle_advance_map(std::tr1::shared_ptr<OSDMap const>, std::tr1::shared_ptr<OSDMap const>, std::vector<int, std::allocator<int> >&, int, std::vector<int, std::allocator<int> >&, int, PG::RecoveryCtx*)+0x2a2) [0x7f26f2]
 5: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PG::RecoveryCtx*, std::set<boost::intrusive_ptr<PG>, std::less<boost::intrusive_ptr<PG> >, std::allocator<boost::intrusive_ptr<PG> > >*)+0x2da) [0x6a591a]
 6: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x22c) [0x6a643c]
 7: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x28) [0x701d88]
 8: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa76) [0xbb9966]
 9: (ThreadPool::WorkThread::entry()+0x10) [0xbba9f0]
 10: (()+0x7dc5) [0x7f17fd224dc5]
 11: (clone()+0x6d) [0x7f17fbd0628d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- logging levels --- <skip> --- end dump of recent events ---
<skip: the same assert with an identical backtrace repeats for thread 7f17e3f28700 at 2016-02-25 16:21:09.676003, both inline and again in the dump of recent events>
     0> 2016-02-25 16:21:09.739878 7f17f50b8700  5 osd.0 404 tick
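The failing frame is `interval_set<snapid_t>::subtract()` called from `PGPool::update()` during map advance. `subtract()` requires every interval being removed to already be contained in the set: `erase()` decrements the cached `_size` and asserts it stays non-negative (`interval_set.h:386`). Since `PGPool::update()` works on the pool's removed-snaps interval sets, a snap-interval mismatch between what the cache pool cached and what the new OSDMap reports would trip exactly this assert. A minimal Python sketch of that invariant (a toy model for illustration only, not Ceph's actual C++ code):

```python
class IntervalSet:
    """Toy model of Ceph's interval_set<T>: disjoint [start, start+len)
    intervals plus a cached total size (_size)."""

    def __init__(self):
        self.m = {}       # start -> length
        self._size = 0

    def insert(self, start, length):
        self.m[start] = length
        self._size += length

    def erase(self, start, length):
        # Mirrors interval_set.h:386 -- _size must never go negative.
        self._size -= length
        assert self._size >= 0, "FAILED assert(_size >= 0)"
        self.m.pop(start, None)
        # (The real code also splits partially overlapping intervals.)

    def subtract(self, other):
        # Every interval in `other` must be contained in `self`,
        # otherwise erase() trips the assert above.
        for start, length in other.m.items():
            self.erase(start, length)


cached = IntervalSet()
cached.insert(1, 2)       # OSD's cached removed_snaps: snaps 1..2

newer = IntervalSet()
newer.insert(1, 5)        # new map claims snaps 1..5 removed

try:
    cached.subtract(newer)  # subtracts more than is present
except AssertionError as e:
    print(e)                # -> FAILED assert(_size >= 0)
```

Every OSD hosting a PG of the cache pool runs the same `PG::handle_advance_map` → `PGPool::update` path when the new map arrives, which would explain why the cache-tier OSDs all assert within the same epoch while the EC-pool OSDs survive.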
At that time I could see that the other 8 OSDs were still alive and some PGs were stale:
[root@c2 ~]# ceph -s
    cluster 74c6d1fa-cb9b-44d5-8c6f-6b81b8c6c838
     health HEALTH_WARN
            1412 pgs stale
            14/80 in osds are down
     monmap e1: 5 mons at {c1=10.22.11.20:6789/0,c2=10.22.11.21:6789/0,c3=10.22.11.22:6789/0,c4=10.22.11.23:6789/0,c5=10.22.11.24:6789/0}
            election epoch 12, quorum 0,1,2,3,4 c1,c2,c3,c4,c5
     osdmap e410: 80 osds: 66 up, 80 in
      pgmap v5086: 4096 pgs, 2 pools, 0 bytes data, 2 objects
            8251 MB used, 221 TB / 221 TB avail
                2684 active+clean
                1412 stale+active+clean
[root@c2 ~]# ceph osd tree
ID  WEIGHT    TYPE NAME       UP/DOWN REWEIGHT PRIMARY-AFFINITY
-12   3.39996 root ssd
 -7   0.67999     host c2-ssd
  0   0.17000         osd.0      down  1.00000          1.00000
  1   0.17000         osd.1        up  1.00000          1.00000
  2   0.17000         osd.2        up  1.00000          1.00000
  3   0.17000         osd.3      down  1.00000          1.00000
 -8   0.67999     host c3-ssd
 16   0.17000         osd.16     down  1.00000          1.00000
 17   0.17000         osd.17       up  1.00000          1.00000
 18   0.17000         osd.18       up  1.00000          1.00000
 19   0.17000         osd.19     down  1.00000          1.00000
 -9   0.67999     host c4-ssd
 32   0.17000         osd.32     down  1.00000          1.00000
 33   0.17000         osd.33     down  1.00000          1.00000
 34   0.17000         osd.34       up  1.00000          1.00000
 35   0.17000         osd.35     down  1.00000          1.00000
-10   0.67999     host c5-ssd
 48   0.17000         osd.48     down  1.00000          1.00000
 49   0.17000         osd.49     down  1.00000          1.00000
 61   0.17000         osd.61     down  1.00000          1.00000
 62   0.17000         osd.62     down  1.00000          1.00000
-11   0.67999     host c1-ssd
 64   0.17000         osd.64     down  1.00000          1.00000
 65   0.17000         osd.65       up  1.00000          1.00000
 66   0.17000         osd.66     down  1.00000          1.00000
 67   0.17000         osd.67     down  1.00000          1.00000
 -1 217.79999 root slow
 -2  43.56000     host c2-hdd
  4   3.62999         osd.4        up  1.00000          1.00000
  5   3.62999         osd.5        up  1.00000          1.00000
  6   3.62999         osd.6        up  1.00000          1.00000
<skip because HDDs are not affected>
But after about 20 seconds the remaining 8 drives died too:
   -11> 2016-02-25 16:21:28.553391 7f11597b8700  1 -- 10.22.11.21:0/28635 mark_down 0x553cc00 -- 0x4d32000
   -10> 2016-02-25 16:21:28.553407 7f11597b8700  1 -- 10.22.11.21:0/28635 mark_down 0x553a6e0 -- 0x5217000
    -9> 2016-02-25 16:21:28.553429 7f11597b8700  1 -- 10.22.10.21:6802/28635 mark_down 10.22.10.20:6804/8084 -- 0x7602000
    -8> 2016-02-25 16:21:28.553466 7f11597b8700  1 -- 10.22.11.21:0/28635 mark_down 0x4e963c0 -- 0x7685000
    -7> 2016-02-25 16:21:28.553474 7f1157fb5700  1 osd.1 405 ms_handle_reset con 0x553c260 session 0x58fcd00
    -6> 2016-02-25 16:21:28.553480 7f11597b8700  1 -- 10.22.11.21:0/28635 mark_down 0x48c7340 -- 0x75a6000
    -5> 2016-02-25 16:21:28.553499 7f11597b8700  1 -- 10.22.10.21:6802/28635 mark_down 10.22.10.20:6806/8672 -- 0x4b81000
    -4> 2016-02-25 16:21:28.553581 7f11597b8700  1 -- 10.22.11.21:0/28635 mark_down 0x48c62c0 -- 0x759d000
    -3> 2016-02-25 16:21:28.553585 7f1157fb5700  1 osd.1 405 ms_handle_reset con 0x553adc0 session 0x5f2d900
    -2> 2016-02-25 16:21:28.553600 7f11597b8700  1 -- 10.22.11.21:0/28635 mark_down 0x48c6b00 -- 0x75c6000
    -1> 2016-02-25 16:21:28.553797 7f11597b8700  0 osd.1 406 crush map has features 17971250724864, adjusting msgr requires for osds
     0> 2016-02-25 16:21:28.560825 7f11507a6700 -1 ./include/interval_set.h: In function 'void interval_set<T>::erase(T, T) [with T = snapid_t]' thread 7f11507a6700 time 2016-02-25 16:21:28.554765
./include/interval_set.h: 386: FAILED assert(_size >= 0)
 ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0xbc9195]
 2: (interval_set<snapid_t>::subtract(interval_set<snapid_t> const&)+0xc0) [0x81ec80]
 3: (PGPool::update(std::tr1::shared_ptr<OSDMap const>)+0x54e) [0x7f21be]
 4: (PG::handle_advance_map(std::tr1::shared_ptr<OSDMap const>, std::tr1::shared_ptr<OSDMap const>, std::vector<int, std::allocator<int> >&, int, std::vector<int, std::allocator<int> >&, int, PG::RecoveryCtx*)+0x2a2) [0x7f26f2]
 5: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PG::RecoveryCtx*, std::set<boost::intrusive_ptr<PG>, std::less<boost::intrusive_ptr<PG> >, std::allocator<boost::intrusive_ptr<PG> > >*)+0x2da) [0x6a591a]
 6: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x22c) [0x6a643c]
 7: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x28) [0x701d88]
 8: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa76) [0xbb9966]
 9: (ThreadPool::WorkThread::entry()+0x10) [0xbba9f0]
 10: (()+0x7dc5) [0x7f11692a1dc5]
 11: (clone()+0x6d) [0x7f1167d8328d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
The OSDs that hold the EC pool were still alive.
How I got it
- Set up the EC profile
# ceph osd erasure-code-profile get isa
directory=/usr/lib64/ceph/erasure-code
k=3
m=2
plugin=isa
ruleset-failure-domain=host
ruleset-root=slow
technique=reed_sol_van
- Create the EC pool
# ceph osd pool create data 2048 2048 erasure isa
pool 'data' created
- Create the cache tier
[root@c1 ~]# ceph osd tier add data cachetier
pool 'cachetier' is now (or already was) a tier of 'data'
[root@c1 ~]# ceph osd tier cache-mode cachetier readonly
set cache-mode for pool 'cachetier' to readonly
[root@c1 ~]# ceph osd tier set-overlay data cachetier
overlay for 'data' is now (or already was) 'cachetier'
- Try to create an RBD image
[root@c1 ~]# rbd create --size 512000 data/test
rbd: create error: (95) Operation not supported
2016-02-25 16:06:08.669203 7fc6589d27c0 -1 librbd: error adding image to directory: (95) Operation not supported
[root@c1 ~]# rbd create --size 51200 -p data test
2016-02-25 16:21:09.776618 7f00f11f4700  0 -- 10.22.11.20:0/2941374872 >> 10.22.11.21:6800/28111 pipe(0x7f00e801cad0 sd=11 :0 s=1 pgs=0 cs=0 l=1 c=0x7f00e8020d70).fault
2016-02-25 16:21:09.780317 7f00f16f9700  0 -- 10.22.11.20:0/2941374872 >> 10.22.11.23:6800/22554 pipe(0x7f00e8022990 sd=8 :0 s=1 pgs=0 cs=0 l=1 c=0x7f00e8026c30).fault
2016-02-25 16:21:10.957194 7f00f17fa700  0 -- 10.22.11.20:0/2941374872 >> 10.22.11.20:6804/8084 pipe(0x7f00e8028500 sd=9 :0 s=1 pgs=0 cs=0 l=1 c=0x7f00e802c7a0).fault
2016-02-25 16:21:28.525895 7f00f16f9700  0 -- 10.22.11.20:0/2941374872 >> 10.22.11.21:6804/29154 pipe(0x7f00e8022990 sd=7 :0 s=1 pgs=0 cs=0 l=1 c=0x7f00e801de40).fault
^C
The OSDs crashed.