Bug #14900 (closed)

hammer: massive crash of OSD daemons in cache tier pool

Added by Mike Almateia about 8 years ago. Updated almost 7 years ago.

Status: Rejected
Priority: High
Assignee: -
Category: OSD
Target version: -
% Done: 0%
Source: other
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Cluster:
  • 5 servers
  • CentOS 7.2 with the latest updates
  • Ceph 0.94.6 from the Inktank repo
  • 4 SSDs and 12 HDDs in each server
  • HDDs and SSDs each have their own root in the CRUSH map ("slow" for HDD, "ssd" for SSD)
  • each drive's journal is placed on the same drive
  • two pools were created (a sketch of the pool creation follows this list):
    • a replicated pool on SSD, used as the cache tier for the EC pool (tier mode: "readonly")
    • an EC ISA pool on HDD
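The reproduce steps below never show how the 'cachetier' pool itself was created; as a reference, here is a minimal sketch of how a replicated pool on the "ssd" root could be created on Hammer. The rule name ssd_rule and the PG counts are assumptions, not values taken from this report.

    # create a simple replication rule rooted at the "ssd" CRUSH root (rule name assumed)
    ceph osd crush rule create-simple ssd_rule ssd host
    # create the replicated pool that is later attached as the cache tier (PG counts assumed)
    ceph osd pool create cachetier 1024 1024 replicated ssd_rule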

A rados benchmark on the EC pool without the cache tier works normally.
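The report does not show the exact benchmark invocation; a typical write benchmark against the EC pool would look something like the line below (the duration and the --no-cleanup flag are assumptions).

    # 60-second write benchmark against the EC "data" pool (parameters assumed)
    rados bench -p data 60 write --no-cleanup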

But after I added the cache tier pool and tried to create an RBD image, I saw a massive crash of OSD daemons in the cache tier pool: 14 of the 20 cache-tier OSDs crashed instantly.
In the log files I found:

   -2> 2016-02-25 16:21:09.674398 7f17ed73b700  3 osd.0 403 handle_osd_map epochs [404,404], i have 403, src has [1,404]
    -1> 2016-02-25 16:21:09.675503 7f17ed73b700  0 osd.0 404 crush map has features 17971250724864, adjusting msgr requires for osds
     0> 2016-02-25 16:21:09.681995 7f17e4729700 -1 ./include/interval_set.h: In function 'void interval_set<T>::erase(T, T) [with T = snapid_t]' thread 7f17e4729700 time 2016-02-25 16:21:09.675951
./include/interval_set.h: 386: FAILED assert(_size >= 0)

 ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0xbc9195]
 2: (interval_set<snapid_t>::subtract(interval_set<snapid_t> const&)+0xc0) [0x81ec80]
 3: (PGPool::update(std::tr1::shared_ptr<OSDMap const>)+0x54e) [0x7f21be]
 4: (PG::handle_advance_map(std::tr1::shared_ptr<OSDMap const>, std::tr1::shared_ptr<OSDMap const>, std::vector<int, std::allocator<int> >&, int, std::vector<int, std::allocator<int> >&, int, PG::RecoveryCtx*)+0x2a2) [0x7f26f2]
 5: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PG::RecoveryCtx*, std::set<boost::intrusive_ptr<PG>, std::less<boost::intrusive_ptr<PG> >, std::allocator<boost::intrusive_ptr<PG> > >*)+0x2da) [0x6a591a]
 6: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x22c) [0x6a643c]
 7: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x28) [0x701d88]
 8: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa76) [0xbb9966]
 9: (ThreadPool::WorkThread::entry()+0x10) [0xbba9f0]
 10: (()+0x7dc5) [0x7f17fd224dc5]
 11: (clone()+0x6d) [0x7f17fbd0628d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
<skip>
--- end dump of recent events ---
2016-02-25 16:21:09.682005 7f17e3f28700 -1 ./include/interval_set.h: In function 'void interval_set<T>::erase(T, T) [with T = snapid_t]' thread 7f17e3f28700 time 2016-02-25 16:21:09.676003
./include/interval_set.h: 386: FAILED assert(_size >= 0)

 ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0xbc9195]
 2: (interval_set<snapid_t>::subtract(interval_set<snapid_t> const&)+0xc0) [0x81ec80]
 3: (PGPool::update(std::tr1::shared_ptr<OSDMap const>)+0x54e) [0x7f21be]
 4: (PG::handle_advance_map(std::tr1::shared_ptr<OSDMap const>, std::tr1::shared_ptr<OSDMap const>, std::vector<int, std::allocator<int> >&, int, std::vector<int, std::allocator<int> >&, int, PG::RecoveryCtx*)+0x2a2) [0x7f26f2]
 5: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PG::RecoveryCtx*, std::set<boost::intrusive_ptr<PG>, std::less<boost::intrusive_ptr<PG> >, std::allocator<boost::intrusive_ptr<PG> > >*)+0x2da) [0x6a591a]
 6: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x22c) [0x6a643c]
 7: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x28) [0x701d88]
 8: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa76) [0xbb9966]
 9: (ThreadPool::WorkThread::entry()+0x10) [0xbba9f0]
 10: (()+0x7dc5) [0x7f17fd224dc5]
 11: (clone()+0x6d) [0x7f17fbd0628d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
    -1> 2016-02-25 16:21:09.682005 7f17e3f28700 -1 ./include/interval_set.h: In function 'void interval_set<T>::erase(T, T) [with T = snapid_t]' thread 7f17e3f28700 time 2016-02-25 16:21:09.676003
./include/interval_set.h: 386: FAILED assert(_size >= 0)

 ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0xbc9195]
 2: (interval_set<snapid_t>::subtract(interval_set<snapid_t> const&)+0xc0) [0x81ec80]
 3: (PGPool::update(std::tr1::shared_ptr<OSDMap const>)+0x54e) [0x7f21be]
 4: (PG::handle_advance_map(std::tr1::shared_ptr<OSDMap const>, std::tr1::shared_ptr<OSDMap const>, std::vector<int, std::allocator<int> >&, int, std::vector<int, std::allocator<int> >&, int, PG::RecoveryCtx*)+0x2a2) [0x7f26f2]
 5: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PG::RecoveryCtx*, std::set<boost::intrusive_ptr<PG>, std::less<boost::intrusive_ptr<PG> >, std::allocator<boost::intrusive_ptr<PG> > >*)+0x2da) [0x6a591a]
 6: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x22c) [0x6a643c]
 7: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x28) [0x701d88]
 8: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa76) [0xbb9966]
 9: (ThreadPool::WorkThread::entry()+0x10) [0xbba9f0]
 10: (()+0x7dc5) [0x7f17fd224dc5]
 11: (clone()+0x6d) [0x7f17fbd0628d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

     0> 2016-02-25 16:21:09.739878 7f17f50b8700  5 osd.0 404 tick

At that time I saw that the other 8 OSDs were still alive and some PGs were stale:

[root@c2 ~]# ceph -s
    cluster 74c6d1fa-cb9b-44d5-8c6f-6b81b8c6c838
     health HEALTH_WARN
            1412 pgs stale
            14/80 in osds are down
     monmap e1: 5 mons at {c1=10.22.11.20:6789/0,c2=10.22.11.21:6789/0,c3=10.22.11.22:6789/0,c4=10.22.11.23:6789/0,c5=10.22.11.24:6789/0}
            election epoch 12, quorum 0,1,2,3,4 c1,c2,c3,c4,c5
     osdmap e410: 80 osds: 66 up, 80 in
      pgmap v5086: 4096 pgs, 2 pools, 0 bytes data, 2 objects
            8251 MB used, 221 TB / 221 TB avail
                2684 active+clean
                1412 stale+active+clean
[root@c2 ~]# ceph osd tree
ID  WEIGHT    TYPE NAME       UP/DOWN REWEIGHT PRIMARY-AFFINITY 
-12   3.39996 root ssd                                          
 -7   0.67999     host c2-ssd                                   
  0   0.17000         osd.0      down  1.00000          1.00000 
  1   0.17000         osd.1        up  1.00000          1.00000 
  2   0.17000         osd.2        up  1.00000          1.00000 
  3   0.17000         osd.3      down  1.00000          1.00000 
 -8   0.67999     host c3-ssd                                   
 16   0.17000         osd.16     down  1.00000          1.00000 
 17   0.17000         osd.17       up  1.00000          1.00000 
 18   0.17000         osd.18       up  1.00000          1.00000 
 19   0.17000         osd.19     down  1.00000          1.00000 
 -9   0.67999     host c4-ssd                                   
 32   0.17000         osd.32     down  1.00000          1.00000 
 33   0.17000         osd.33     down  1.00000          1.00000 
 34   0.17000         osd.34       up  1.00000          1.00000 
 35   0.17000         osd.35     down  1.00000          1.00000 
-10   0.67999     host c5-ssd                                   
 48   0.17000         osd.48     down  1.00000          1.00000 
 49   0.17000         osd.49     down  1.00000          1.00000 
 61   0.17000         osd.61     down  1.00000          1.00000 
 62   0.17000         osd.62     down  1.00000          1.00000 
-11   0.67999     host c1-ssd                                   
 64   0.17000         osd.64     down  1.00000          1.00000 
 65   0.17000         osd.65       up  1.00000          1.00000 
 66   0.17000         osd.66     down  1.00000          1.00000 
 67   0.17000         osd.67     down  1.00000          1.00000 
-1 217.79999 root slow                                         
 -2  43.56000     host c2-hdd                                   
  4   3.62999         osd.4        up  1.00000          1.00000 
  5   3.62999         osd.5        up  1.00000          1.00000 
  6   3.62999         osd.6        up  1.00000          1.00000
<skip because HDDs are not affected>

But after 20 seconds the remaining 8 drives died:

   -11> 2016-02-25 16:21:28.553391 7f11597b8700  1 -- 10.22.11.21:0/28635 mark_down 0x553cc00 -- 0x4d32000
   -10> 2016-02-25 16:21:28.553407 7f11597b8700  1 -- 10.22.11.21:0/28635 mark_down 0x553a6e0 -- 0x5217000
    -9> 2016-02-25 16:21:28.553429 7f11597b8700  1 -- 10.22.10.21:6802/28635 mark_down 10.22.10.20:6804/8084 -- 0x7602000
    -8> 2016-02-25 16:21:28.553466 7f11597b8700  1 -- 10.22.11.21:0/28635 mark_down 0x4e963c0 -- 0x7685000
    -7> 2016-02-25 16:21:28.553474 7f1157fb5700  1 osd.1 405 ms_handle_reset con 0x553c260 session 0x58fcd00
    -6> 2016-02-25 16:21:28.553480 7f11597b8700  1 -- 10.22.11.21:0/28635 mark_down 0x48c7340 -- 0x75a6000
    -5> 2016-02-25 16:21:28.553499 7f11597b8700  1 -- 10.22.10.21:6802/28635 mark_down 10.22.10.20:6806/8672 -- 0x4b81000
    -4> 2016-02-25 16:21:28.553581 7f11597b8700  1 -- 10.22.11.21:0/28635 mark_down 0x48c62c0 -- 0x759d000
    -3> 2016-02-25 16:21:28.553585 7f1157fb5700  1 osd.1 405 ms_handle_reset con 0x553adc0 session 0x5f2d900
    -2> 2016-02-25 16:21:28.553600 7f11597b8700  1 -- 10.22.11.21:0/28635 mark_down 0x48c6b00 -- 0x75c6000
    -1> 2016-02-25 16:21:28.553797 7f11597b8700  0 osd.1 406 crush map has features 17971250724864, adjusting msgr requires for osds
     0> 2016-02-25 16:21:28.560825 7f11507a6700 -1 ./include/interval_set.h: In function 'void interval_set<T>::erase(T, T) [with T = snapid_t]' thread 7f11507a6700 time 2016-02-25 16:21:28.554765
./include/interval_set.h: 386: FAILED assert(_size >= 0)

 ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0xbc9195]
 2: (interval_set<snapid_t>::subtract(interval_set<snapid_t> const&)+0xc0) [0x81ec80]
 3: (PGPool::update(std::tr1::shared_ptr<OSDMap const>)+0x54e) [0x7f21be]
 4: (PG::handle_advance_map(std::tr1::shared_ptr<OSDMap const>, std::tr1::shared_ptr<OSDMap const>, std::vector<int, std::allocator<int> >&, int, std::vector<int, std::allocator<int> >&, int, PG::RecoveryCtx*)+0x2a2) [0x7f26f2]
 5: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PG::RecoveryCtx*, std::set<boost::intrusive_ptr<PG>, std::less<boost::intrusive_ptr<PG> >, std::allocator<boost::intrusive_ptr<PG> > >*)+0x2da) [0x6a591a]
 6: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x22c) [0x6a643c]
 7: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x28) [0x701d88]
 8: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa76) [0xbb9966]
 9: (ThreadPool::WorkThread::entry()+0x10) [0xbba9f0]
 10: (()+0x7dc5) [0x7f11692a1dc5]
 11: (clone()+0x6d) [0x7f1167d8328d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

The OSDs that hold the EC pool are still alive.

How I got it

  1. Set up the EC profile
    # ceph osd erasure-code-profile get isa
    directory=/usr/lib64/ceph/erasure-code
    k=3
    m=2
    plugin=isa
    ruleset-failure-domain=host
    ruleset-root=slow
    technique=reed_sol_van
    
  2. Create EC pool
    # ceph osd pool create data 2048 2048 erasure isa
    
    pool 'data' created
    
  3. Create cache tier
    [root@c1 ~]# ceph osd tier add data cachetier
    pool 'cachetier' is now (or already was) a tier of 'data'
    [root@c1 ~]# ceph osd tier cache-mode cachetier readonly
    set cache-mode for pool 'cachetier' to readonly
    [root@c1 ~]# ceph osd tier set-overlay data cachetier
    overlay for 'data' is now (or already was) 'cachetier'
    
  4. Try to create an RBD image
    [root@c1 ~]# rbd create --size 512000 data/test
    rbd: create error: (95) Operation not supported
    2016-02-25 16:06:08.669203 7fc6589d27c0 -1 librbd: error adding image to directory: (95) Operation not supported
    [root@c1 ~]# rbd create --size 51200 -p data test
    2016-02-25 16:21:09.776618 7f00f11f4700  0 -- 10.22.11.20:0/2941374872 >> 10.22.11.21:6800/28111 pipe(0x7f00e801cad0 sd=11 :0 s=1 pgs=0 cs=0 l=1 c=0x7f00e8020d70).fault
    2016-02-25 16:21:09.780317 7f00f16f9700  0 -- 10.22.11.20:0/2941374872 >> 10.22.11.23:6800/22554 pipe(0x7f00e8022990 sd=8 :0 s=1 pgs=0 cs=0 l=1 c=0x7f00e8026c30).fault
    2016-02-25 16:21:10.957194 7f00f17fa700  0 -- 10.22.11.20:0/2941374872 >> 10.22.11.20:6804/8084 pipe(0x7f00e8028500 sd=9 :0 s=1 pgs=0 cs=0 l=1 c=0x7f00e802c7a0).fault
    2016-02-25 16:21:28.525895 7f00f16f9700  0 -- 10.22.11.20:0/2941374872 >> 10.22.11.21:6804/29154 pipe(0x7f00e8022990 sd=7 :0 s=1 pgs=0 cs=0 l=1 c=0x7f00e801de40).fault
    ^C
    
    

The OSDs crashed.


Files

osd_crushed_de (5.46 KB) - CRUSH map - Mike Almateia, 02/26/2016 09:25 AM
ceph-osd.0.log-20160226.gz (233 KB) - Some log OSD.0 - Mike Almateia, 02/26/2016 09:28 AM
ceph-osd.2.log-20160226.gz (227 KB) - Some log OSD.2, crashed after the first 14 - Mike Almateia, 02/26/2016 09:29 AM
ceph-osd.3.log-20160226.gz (233 KB) - Some log OSD.3 - Mike Almateia, 02/26/2016 09:30 AM
#1

Updated by Mike Almateia about 8 years ago

If any additional logs are needed, just ask.

#2

Updated by Sage Weil about 7 years ago

  • Status changed from New to Rejected

Readonly isn't supported. This works if you use writeback.
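Following that suggestion, switching the existing tier to writeback might look like the sketch below; the hit-set and size values are illustrative assumptions, not settings recommended in this ticket.

    # switch the cache tier from readonly to writeback
    ceph osd tier cache-mode cachetier writeback
    # a writeback tier is normally given hit-set tracking and a size target (example values only)
    ceph osd pool set cachetier hit_set_type bloom
    ceph osd pool set cachetier hit_set_count 1
    ceph osd pool set cachetier hit_set_period 3600
    ceph osd pool set cachetier target_max_bytes 100000000000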

#3

Updated by Shengzuo Yu almost 7 years ago

I hit the same condition. Did you find out the reason?
