Bug #15313

Updated by Loïc Dachary about 8 years ago

http://167.114.252.97:8081/ubuntu-2016-03-29_19:53:23-rados-wip-15171---basic-openstack/ 

 The test above succeeded only after logging in to one of the instances and setting the primary affinity of all OSDs to 1. 

 The logs from one of the targets are at http://teuthology-logs.public.ceph.com/logs-15313/ ; the rest were lost because the job ran with "archive-on-error" and ultimately succeeded, so nothing else was archived. 
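 The exact commands used for the workaround are not captured in the report; a minimal sketch of resetting the primary affinity of every OSD to 1, assuming the osd.0 through osd.5 ids shown in the tree below, would look like this: 

 <pre> 
 # Sketch (not the exact commands run during the test): restore the default
 # primary affinity of 1 on each OSD so all of them can again be selected as
 # PG primaries.
 for i in 0 1 2 3 4 5 ; do
     sudo ceph osd primary-affinity osd.$i 1
 done
 </pre> 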

 <pre> 
 ubuntu@target167114227065:~$ sudo ceph -s 
     cluster 5d792e64-8050-4f6e-908c-8df3060d7e8d 
      health HEALTH_WARN 
             1 pgs backfill 
             1 pgs backfilling 
             3 pgs degraded 
             1 pgs recovering 
             2 pgs recovery_wait 
             3 pgs stuck degraded 
             5 pgs stuck unclean 
             recovery 5859/23854 objects degraded (24.562%) 
             recovery 3145/23854 objects misplaced (13.184%) 
             pool rbd pg_num 64 > pgp_num 34 
             mon.b has mon_osd_down_out_interval set to 0 
      monmap e1: 3 mons at {a=167.114.227.65:6789/0,b=167.114.227.66:6789/0,c=167.114.227.65:6790/0} 
             election epoch 4, quorum 0,1,2 a,b,c 
      osdmap e102: 6 osds: 6 up, 6 in; 2 remapped pgs 
       pgmap v1691: 72 pgs, 3 pools, 24548 MB data, 8609 objects 
             44129 MB used, 136 GB / 179 GB avail 
             5859/23854 objects degraded (24.562%) 
             3145/23854 objects misplaced (13.184%) 
                   67 active+clean 
                    2 active+recovery_wait+degraded 
                    1 active+recovering+degraded 
                    1 active+remapped+backfilling 
                    1 active+remapped+wait_backfill 
 recovery io 12294 kB/s, 4 objects/s 
 ubuntu@target167114227065:~$ sudo ceph osd tree 
 ID WEIGHT    TYPE NAME                UP/DOWN REWEIGHT PRIMARY-AFFINITY  
 -1 6.00000 root default                                              
 -3 6.00000       rack localrack                                        
 -2 6.00000           host localhost                                    
  0 1.00000               osd.0             up    1.00000            1.00000  
  1 1.00000               osd.1             up    1.00000            0.21001  
  2 1.00000               osd.2             up    1.00000                  0  
  3 1.00000               osd.3             up    1.00000            1.00000  
  4 1.00000               osd.4             up    1.00000                  0  
  5 1.00000               osd.5             up    1.00000            1.00000  
 </pre> 
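 The tree shows osd.2 and osd.4 with a primary affinity of 0 and osd.1 at roughly 0.21, which lines up with the workaround of raising all affinities back to 1. A hedged way to pull those values out programmatically, using jq and assuming the JSON form of the same command exposes a per-OSD primary_affinity field (the source of the PRIMARY-AFFINITY column above): 

 <pre> 
 # Sketch: list each OSD's primary affinity from the JSON output of ceph osd tree.
 sudo ceph osd tree -f json | \
     jq '.nodes[] | select(.type == "osd") | {name, primary_affinity}'
 </pre> 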

 <pre> 
 ubuntu@target167114227065:~$ sudo ceph -s 
     cluster 5d792e64-8050-4f6e-908c-8df3060d7e8d 
      health HEALTH_WARN 
             1 pgs backfill 
             2 pgs degraded 
             2 pgs recovering 
             1 pgs stuck degraded 
             3 pgs stuck unclean 
             recovery 2769/22869 objects degraded (12.108%) 
             recovery 1970/22869 objects misplaced (8.614%) 
             pool rbd pg_num 64 > pgp_num 34 
             mon.c has mon_osd_down_out_interval set to 0 
      monmap e1: 3 mons at {a=167.114.227.65:6789/0,b=167.114.227.66:6789/0,c=167.114.227.65:6790/0} 
             election epoch 4, quorum 0,1,2 a,b,c 
      osdmap e110: 6 osds: 6 up, 6 in; 1 remapped pgs 
       pgmap v1858: 72 pgs, 3 pools, 24548 MB data, 8609 objects 
             41379 MB used, 139 GB / 179 GB avail 
             2769/22869 objects degraded (12.108%) 
             1970/22869 objects misplaced (8.614%) 
                   69 active+clean 
                    2 active+recovering+degraded 
                    1 active+remapped+wait_backfill 
 </pre> 

 <pre> 
 2016-03-29 21:57:58.630797 mon.0 [INF] pgmap v1869: 72 pgs: 1 active+remapped+wait_backfill, 69 active+clean, 2 active+recovering+degraded; 24548 MB data, 41734 MB used, 139 GB / 179 GB avail; 2549/22869 objects degraded (11.146%); 1970/22869 objects misplaced (8.614%) 
 2016-03-29 21:58:00.863474 mon.0 [INF] pgmap v1870: 72 pgs: 1 active+remapped+wait_backfill, 69 active+clean, 2 active+recovering+degraded; 24548 MB data, 41734 MB used, 139 GB / 179 GB avail; 2523/22869 objects degraded (11.032%); 1970/22869 objects misplaced (8.614%) 
 2016-03-29 21:57:54.640063 osd.5 [INF] 0.10 scrub starts 
 2016-03-29 21:57:54.706198 osd.5 [INF] 0.10 scrub ok 
 2016-03-29 21:58:02.339356 osd.3 [INF] 0.13 scrub starts 
 2016-03-29 21:58:02.354004 osd.3 [INF] 0.13 scrub ok 
 </pre> 

 <pre> 
 [...] objects/s recovering 
 2016-03-29 22:09:12.497228 osd.4 [INF] 0.2d scrub starts 
 2016-03-29 22:09:12.498801 osd.4 [INF] 0.2d scrub ok 
 2016-03-29 22:09:14.705764 mon.0 [INF] pgmap v2234: 72 pgs: 1 active+remapped+backfilling, 71 active+clean; 24548 MB data, 42626 MB used, 138 GB / 179 GB avail; 1150/22869 objects misplaced (5.029%) 
 2016-03-29 22:09:07.106207 osd.3 [INF] 2.0 scrub ok 
 </pre> 

 <pre> 
 ubuntu@target167114227065:~$ sudo ceph -w 
     cluster 5d792e64-8050-4f6e-908c-8df3060d7e8d 
      health HEALTH_WARN 
             pool rbd pg_num 64 > pgp_num 34 
             mon.b has mon_osd_down_out_interval set to 0 
      monmap e1: 3 mons at {a=167.114.227.65:6789/0,b=167.114.227.66:6789/0,c=167.114.227.65:6790/0} 
             election epoch 4, quorum 0,1,2 a,b,c 
      osdmap e112: 6 osds: 6 up, 6 in 
       pgmap v2281: 72 pgs, 3 pools, 24548 MB data, 8609 objects 
             41910 MB used, 138 GB / 179 GB avail 
                   71 active+clean 
                    1 active+clean+scrubbing 
 recovery io 13157 kB/s, 4 objects/s 
 </pre> 
