Bug #15313

closed

cluster stuck and thrashosd waiting for it to be clean

Added by Loïc Dachary about 8 years ago. Updated about 8 years ago.

Status:
Rejected
Priority:
Normal
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

http://167.114.252.97:8081/ubuntu-2016-03-29_19:53:23-rados-wip-15171---basic-openstack/

The test above succeeded only after logging in to one of the instances and setting the primary affinity of all OSDs to 1.
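
For reference, resetting the primary affinity can be done with the standard ceph CLI; a minimal sketch, assuming the six OSDs shown in the tree below:

for i in 0 1 2 3 4 5 ; do sudo ceph osd primary-affinity osd.$i 1 ; done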

The logs from one of the targets are at http://teuthology-logs.public.ceph.com/logs-15313/; the rest were lost because the job ran with "archive-on-error" and ultimately succeeded.

ubuntu@target167114227065:~$ sudo ceph -s
    cluster 5d792e64-8050-4f6e-908c-8df3060d7e8d
     health HEALTH_WARN
            1 pgs backfill
            1 pgs backfilling
            3 pgs degraded
            1 pgs recovering
            2 pgs recovery_wait
            3 pgs stuck degraded
            5 pgs stuck unclean
            recovery 5859/23854 objects degraded (24.562%)
            recovery 3145/23854 objects misplaced (13.184%)
            pool rbd pg_num 64 > pgp_num 34
            mon.b has mon_osd_down_out_interval set to 0
     monmap e1: 3 mons at {a=167.114.227.65:6789/0,b=167.114.227.66:6789/0,c=167.114.227.65:6790/0}
            election epoch 4, quorum 0,1,2 a,b,c
     osdmap e102: 6 osds: 6 up, 6 in; 2 remapped pgs
      pgmap v1691: 72 pgs, 3 pools, 24548 MB data, 8609 objects
            44129 MB used, 136 GB / 179 GB avail
            5859/23854 objects degraded (24.562%)
            3145/23854 objects misplaced (13.184%)
                  67 active+clean
                   2 active+recovery_wait+degraded
                   1 active+recovering+degraded
                   1 active+remapped+backfilling
                   1 active+remapped+wait_backfill
recovery io 12294 kB/s, 4 objects/s
ubuntu@target167114227065:~$ sudo ceph osd tree
ID WEIGHT  TYPE NAME              UP/DOWN REWEIGHT PRIMARY-AFFINITY 
-1 6.00000 root default                                             
-3 6.00000     rack localrack                                       
-2 6.00000         host localhost                                   
 0 1.00000             osd.0           up  1.00000          1.00000 
 1 1.00000             osd.1           up  1.00000          0.21001 
 2 1.00000             osd.2           up  1.00000                0 
 3 1.00000             osd.3           up  1.00000          1.00000 
 4 1.00000             osd.4           up  1.00000                0 
 5 1.00000             osd.5           up  1.00000          1.00000 
ubuntu@target167114227065:~$ sudo ceph -s
    cluster 5d792e64-8050-4f6e-908c-8df3060d7e8d
     health HEALTH_WARN
            1 pgs backfill
            2 pgs degraded
            2 pgs recovering
            1 pgs stuck degraded
            3 pgs stuck unclean
            recovery 2769/22869 objects degraded (12.108%)
            recovery 1970/22869 objects misplaced (8.614%)
            pool rbd pg_num 64 > pgp_num 34
            mon.c has mon_osd_down_out_interval set to 0
     monmap e1: 3 mons at {a=167.114.227.65:6789/0,b=167.114.227.66:6789/0,c=167.114.227.65:6790/0}
            election epoch 4, quorum 0,1,2 a,b,c
     osdmap e110: 6 osds: 6 up, 6 in; 1 remapped pgs
      pgmap v1858: 72 pgs, 3 pools, 24548 MB data, 8609 objects
            41379 MB used, 139 GB / 179 GB avail
            2769/22869 objects degraded (12.108%)
            1970/22869 objects misplaced (8.614%)
                  69 active+clean
                   2 active+recovering+degraded
                   1 active+remapped+wait_backfill
2016-03-29 21:57:58.630797 mon.0 [INF] pgmap v1869: 72 pgs: 1 active+remapped+wait_backfill, 69 active+clean, 2 active+recovering+degraded; 24548 MB data, 41734 MB used, 139 GB / 179 GB avail; 2549/22869 objects degraded (11.146%); 1970/22869 objects misplaced (8.614%)
2016-03-29 21:58:00.863474 mon.0 [INF] pgmap v1870: 72 pgs: 1 active+remapped+wait_backfill, 69 active+clean, 2 active+recovering+degraded; 24548 MB data, 41734 MB used, 139 GB / 179 GB avail; 2523/22869 objects degraded (11.032%); 1970/22869 objects misplaced (8.614%)
2016-03-29 21:57:54.640063 osd.5 [INF] 0.10 scrub starts
2016-03-29 21:57:54.706198 osd.5 [INF] 0.10 scrub ok
2016-03-29 21:58:02.339356 osd.3 [INF] 0.13 scrub starts
2016-03-29 21:58:02.354004 osd.3 [INF] 0.13 scrub ok
2016-03-29 22:09:12.497228 osd.4 [INF] 0.2d scrub starts
2016-03-29 22:09:12.498801 osd.4 [INF] 0.2d scrub ok
2016-03-29 22:09:14.705764 mon.0 [INF] pgmap v2234: 72 pgs: 1 active+remapped+backfilling, 71 active+clean; 24548 MB data, 42626 MB used, 138 GB / 179 GB avail; 1150/22869 objects misplaced (5.029%)
2016-03-29 22:09:07.106207 osd.3 [INF] 2.0 scrub ok
ubuntu@target167114227065:~$ sudo ceph -w
    cluster 5d792e64-8050-4f6e-908c-8df3060d7e8d
     health HEALTH_WARN
            pool rbd pg_num 64 > pgp_num 34
            mon.b has mon_osd_down_out_interval set to 0
     monmap e1: 3 mons at {a=167.114.227.65:6789/0,b=167.114.227.66:6789/0,c=167.114.227.65:6790/0}
            election epoch 4, quorum 0,1,2 a,b,c
     osdmap e112: 6 osds: 6 up, 6 in
      pgmap v2281: 72 pgs, 3 pools, 24548 MB data, 8609 objects
            41910 MB used, 138 GB / 179 GB avail
                  71 active+clean
                   1 active+clean+scrubbing
recovery io 13157 kB/s, 4 objects/s
Actions #1

Updated by Loïc Dachary about 8 years ago

  • Description updated (diff)
Actions #2

Updated by Loïc Dachary about 8 years ago

  • Status changed from New to Need More Info
  • Assignee set to Loïc Dachary
Actions #3

Updated by Loïc Dachary about 8 years ago

  • Status changed from Need More Info to Rejected

The cluster was not stuck, just slow; I failed to notice that progress was being made (about 0.01% at a time).
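
In a similar situation, slow-but-steady recovery can be confirmed by sampling the degraded/misplaced counters over time; a sketch, not taken from the original run:

watch -n 60 'sudo ceph -s | grep -E "degraded|misplaced"'

A decreasing object count between samples distinguishes slow recovery from a genuinely stuck PG.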
