Bug #26971 (closed): failed to become clean before timeout expired

Added by David Zafman over 5 years ago. Updated about 5 years ago.

Status: Duplicate
Priority: High
Assignee: David Zafman
Category: -
Target version: -
% Done: 0%
Source: Development
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

http://qa-proxy.ceph.com/teuthology/dzafman-2018-08-16_17:35:08-rados:thrash-wip-zafman-testing4-distro-basic-smithi/2911870

1.8 0 0 0 0 0 0 0 0 active+undersized 2018-08-17 11:30:11.506468 0'0 617:557 [5,7] 5 [5,7] 5 0'0 2018-08-17 11:26:39.848677 0'0 2018-08-17 11:25:05.348213 0

2018-08-17T12:01:15.935 INFO:tasks.thrashosds.thrasher:Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-zafman-testing4/qa/tasks/ceph_manager.py", line 861, in wrapper
    return func(self)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-zafman-testing4/qa/tasks/ceph_manager.py", line 980, in do_thrash
    self.test_map_discontinuity()
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-zafman-testing4/qa/tasks/ceph_manager.py", line 779, in test_map_discontinuity
    timeout=self.config.get('timeout')
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-zafman-testing4/qa/tasks/ceph_manager.py", line 2112, in wait_for_clean
    'failed to become clean before timeout expired'
AssertionError: failed to become clean before timeout expired
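
The assertion comes from the QA-side check that polls PG state until everything is active+clean; a minimal sketch of that logic (simplified, not the actual ceph_manager.py wait_for_clean, whose progress tracking is more involved):

import time

def wait_for_clean(get_pg_states, timeout=1200, poll_interval=5):
    # get_pg_states is a caller-supplied callable returning the current list of
    # PG state strings (e.g. parsed from 'ceph pg dump'); the timeout and poll
    # interval defaults here are illustrative only.
    deadline = time.time() + timeout
    while True:
        if all(s == 'active+clean' for s in get_pg_states()):
            return
        if time.time() > deadline:
            # The assertion seen in the traceback above: one PG was still
            # active+undersized when the timer ran out.
            raise AssertionError('failed to become clean before timeout expired')
        time.sleep(poll_interval)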


Related issues: 1 (0 open, 1 closed)

Is duplicate of sepia - Bug #39149: smithi025 has LVs that are 10x the others (Resolved, David Galloway, 04/08/2019)

Actions #1

Updated by Greg Farnum over 5 years ago

Looks like a PG is stuck in the active+undersized state. Maybe the balancer screwed up?

Actions #2

Updated by David Zafman about 5 years ago

  • Assignee set to David Zafman
Actions #3

Updated by David Zafman about 5 years ago

Seen recently in luminous.

yuriw-2019-02-28_14:42:05-rados-wip-yuri4-testing-2019-02-27-2159-luminous-distro-basic-smithi/3650479

I'm retesting on master, filtering for the closest equivalent:

--suite rados:thrash --filter '0-size-min-size-overrides/3-size-2-min-size.yaml 1-pg-log-overrides/short_pg_log.yaml 2-recovery-overrides/{more-active-recovery.yaml} backoff/peering_and_degraded.yaml ceph.yaml clusters/{fixed-2.yaml openstack.yaml} d-balancer/upmap.yaml msgr-failures/osd-delay.yaml msgr/random.yaml objectstore/filestore-xfs.yaml rados.yaml supported-random-distro$/{ubuntu_latest.yaml} thrashers/mapgap.yaml thrashosds-health.yaml workloads/radosbench.yaml'

Actions #4

Updated by David Zafman about 5 years ago

  • Status changed from New to 12
  • Priority changed from Normal to High

Reproduced on master in 1 of 10 duplicate runs:

dzafman-2019-03-05_10:43:39-rados:thrash-master-distro-basic-smithi/3671598

Actions #5

Updated by David Zafman about 5 years ago

I'm not sure what this means, but pg 1.0 (size 3) needs to pick one of the 2 remaining OSDs (4 OSDs are in) to recover to. However, we hit this during Peering.

2019-03-05 19:30:13.102 7f90e1c87700 10 osd.3 pg_epoch: 601 pg[1.0( empty local-lis/les=576/577 n=0 ec=14/14 lis/c 576/576 les/c/f 577/577/0 601/601/601) [3,6] r=0 lpr=601 pi=[576,601)/1 crt=0'0 mlcod 0'0 peering mbc={}] up_thru 585 < same_since 601, must notify monitor
2019-03-05 19:30:13.106 7f90e5c8f700 10 osd.3 pg_epoch: 601 pg[1.0( empty local-lis/les=576/577 n=0 ec=14/14 lis/c 576/576 les/c/f 577/577/0 601/601/601) [3,6] r=0 lpr=601 pi=[576,601)/1 crt=0'0 mlcod 0'0 peering mbc={}] state<Started/Primary/Peering/GetMissing>: still need up_thru update before going active
2019-03-05 19:30:14.110 7f90e1c87700 10 osd.3 pg_epoch: 602 pg[1.0( empty local-lis/les=576/577 n=0 ec=14/14 lis/c 576/576 les/c/f 577/577/0 601/601/601) [3,6] r=0 lpr=601 pi=[576,601)/1 crt=0'0 mlcod 0'0 peering mbc={}] adjust_need_up_thru now 601, need_up_thru now false
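
For reference, the gate in these messages reduces to comparing the OSD's recorded up_thru with the epoch the current interval started; a rough sketch of that condition (not the actual PeeringState code):

def need_up_thru_update(up_thru, same_interval_since):
    # The primary cannot go active until the monitor has recorded an up_thru
    # at or beyond the epoch this interval began. Mirrors the messages above:
    # "up_thru 585 < same_since 601, must notify monitor" and later
    # "adjust_need_up_thru now 601, need_up_thru now false".
    return up_thru < same_interval_since

assert need_up_thru_update(585, 601)        # must notify the monitor first
assert not need_up_thru_update(601, 601)    # clears once up_thru catches up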

Actions #6

Updated by David Zafman about 5 years ago

dzafman-2019-03-26_16:39:54-rados:thrash-wip-zafman-26971-diag-distro-basic-smithi/3776762

Another run with diagnostics added to calc_replicated_acting():

This shows that OSDs 6 and 7 are available for these other PGs.

1.4           0                  0        0         0       0     0           0          0   0        0 active+undersized 2019-03-27 00:14:02.954625     0'0  659:716   [3,0]          3   [3,0]              3        0'0 2019-03-27 00:12:55.474184             0'0 2019-03-27 00:07:35.533921             0
1.3           0                  0        0         0       0     0           0          0   0        0      active+clean 2019-03-27 00:54:22.082685     0'0  660:526 [7,0,3]          7 [7,0,3]              7        0'0 2019-03-27 00:54:22.082626             0'0 2019-03-27 00:54:22.082626             0
1.2           0                  0        0         0       0     0           0          0   0        0      active+clean 2019-03-27 00:53:56.778513     0'0  660:761 [0,3,6]          0 [0,3,6]              0        0'0 2019-03-27 00:53:56.778380             0'0 2019-03-27 00:52:17.282249             0

I added a message for the "want->size() >= size of" case, and it did not output, so "up" must NOT include 6 and 7. Also, candidate_by_last_update is empty, so 6 and 7 aren't strays and weren't in acting to be rejected (no message for that either).

I guess I could have assumed that, with up/acting already [3,0] coming in here, the "up" set is the problem before we even try to compute "acting".

2019-03-27 00:14:01.945 7f6ec2a23700 10 osd.3 pg_epoch: 398 pg[1.4( empty local-lis/les=232/233 n=0 ec=14/14 lis/c 232/232 les/c/f 233/233/0 398/398/14) [3,0] r=0 lpr=398 pi=[232,398)/1 crt=0'0 mlcod 0'0 peering mbc={}] calc_replicated_acting newest update on osd.3 with 1.4( empty local-lis/les=232/233 n=0 ec=14/14 lis/c 232/232 les/c/f 233/233/0 398/398/14)
up_primary: 3) selected as primary
primary is osd.3 with 1.4( empty local-lis/les=232/233 n=0 ec=14/14 lis/c 232/232 les/c/f 233/233/0 398/398/14)
osd.0 (up) accepted 1.4( empty local-lis/les=232/233 n=0 ec=14/14 lis/c 232/232 les/c/f 233/233/0 398/398/14)
candidate_by_last_update is empty
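
To make the diagnostic output above easier to read, here is a rough model of the selection it instruments: take the up member with the newest update as primary, accept the remaining up members, and only then fill from strays ordered by last update. This is a simplification of calc_replicated_acting with made-up inputs, but it shows why osd.6 and osd.7 never appear: they are not in "up" and candidate_by_last_update is empty.

def choose_acting_sketch(up, peer_last_update, size):
    # peer_last_update maps osd id -> last_update for peers we have info from.
    primary = max(up, key=lambda o: peer_last_update[o])
    want = [primary] + [o for o in up if o != primary]

    # Strays: peers with info that are not in up, newest first.
    candidate_by_last_update = sorted(
        (o for o in peer_last_update if o not in up),
        key=lambda o: peer_last_update[o], reverse=True)
    for osd in candidate_by_last_update:
        if len(want) >= size:
            break
        want.append(osd)
    return want

# With up=[3,0] and no stray peer info, a size-3 pool still ends up with [3, 0].
print(choose_acting_sketch([3, 0], {3: 100, 0: 100}, size=3))
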
Actions #7

Updated by Neha Ojha about 5 years ago

The up set seems to be the problem here.

This is the point at which we find out that osd.5 is down:

2019-03-27 00:14:01.941 7f6ebea1b700 10 osd.3 pg_epoch: 398 pg[1.4( empty local-lis/les=232/233 n=0 ec=14/14 lis/c 232/232 les/c/f 233/233/0 398/398/14) [3,0] r=0 lpr=398 pi=[232,398)/1 crt=0'0 mlcod 0'0 peering mbc={}] build_prior  prior osd.5 is down

Beyond that, newup/newacting is always reported as [3,0]/[3,0]:

2019-03-27 00:14:01.941 7f6ebea1b700 10 osd.3 pg_epoch: 397 pg[1.4( empty local-lis/les=232/233 n=0 ec=14/14 lis/c 232/232 les/c/f 233/233/0 232/232/14) [3,0,5] r=0 lpr=232 crt=0'0 mlcod 0'0 active+clean] handle_advance_map [3,0]/[3,0] -- 3/3
2019-03-27 00:14:02.941 7f6ec2a23700 10 osd.3 pg_epoch: 398 pg[1.4( empty local-lis/les=232/233 n=0 ec=14/14 lis/c 232/232 les/c/f 233/233/0 398/398/14) [3,0] r=0 lpr=398 pi=[232,398)/1 crt=0'0 mlcod 0'0 peering mbc={}] handle_advance_map [3,0]/[3,0] -- 3/3
2019-03-27 00:14:03.945 7f6ec2a23700 10 osd.3 pg_epoch: 399 pg[1.4( empty local-lis/les=398/399 n=0 ec=14/14 lis/c 398/232 les/c/f 399/233/0 398/398/14) [3,0] r=0 lpr=398 pi=[232,398)/1 crt=0'0 mlcod 0'0 active+undersized mbc={}] handle_advance_map [3,0]/[3,0] -- 3/3
2019-03-27 00:14:04.949 7f6ebea1b700 10 osd.3 pg_epoch: 400 pg[1.4( empty local-lis/les=398/399 n=0 ec=14/14 lis/c 398/232 les/c/f 399/233/0 398/398/14) [3,0] r=0 lpr=398 pi=[232,398)/1 crt=0'0 mlcod 0'0 active+undersized mbc={}] handle_advance_map [3,0]/[3,0] -- 3/3
2019-03-27 00:14:06.957 7f6ebea1b700 10 osd.3 pg_epoch: 401 pg[1.4( empty local-lis/les=398/399 n=0 ec=14/14 lis/c 398/232 les/c/f 399/233/0 398/398/14) [3,0] r=0 lpr=398 pi=[232,398)/1 crt=0'0 mlcod 0'0 active+undersized mbc={}] handle_advance_map [3,0]/[3,0] -- 3/3
2019-03-27 00:14:09.981 7f6ebea1b700 10 osd.3 pg_epoch: 402 pg[1.4( empty local-lis/les=398/399 n=0 ec=14/14 lis/c 398/232 les/c/f 399/233/0 398/398/14) [3,0] r=0 lpr=398 pi=[232,398)/1 crt=0'0 mlcod 0'0 active+undersized mbc={}] handle_advance_map [3,0]/[3,0] -- 3/3
2019-03-27 00:14:12.749 7f6ec2a23700 10 osd.3 pg_epoch: 403 pg[1.4( empty local-lis/les=398/399 n=0 ec=14/14 lis/c 398/232 les/c/f 399/233/0 398/398/14) [3,0] r=0 lpr=398 pi=[232,398)/1 crt=0'0 mlcod 0'0 active+undersized mbc={}] handle_advance_map [3,0]/[3,0] -- 3/3
2019-03-27 00:14:13.769 7f6ebea1b700 10 osd.3 pg_epoch: 404 pg[1.4( empty local-lis/les=398/399 n=0 ec=14/14 lis/c 398/232 les/c/f 399/233/0 398/398/14) [3,0] r=0 lpr=398 pi=[232,398)/1 crt=0'0 mlcod 0'0 active+undersized mbc={}] handle_advance_map [3,0]/[3,0] -- 3/3
2019-03-27 00:14:16.785 7f6ec2a23700 10 osd.3 pg_epoch: 405 pg[1.4( empty local-lis/les=398/399 n=0 ec=14/14 lis/c 398/232 les/c/f 399/233/0 398/398/14) [3,0] r=0 lpr=398 pi=[232,398)/1 crt=0'0 mlcod 0'0 active+undersized mbc={}] handle_advance_map [3,0]/[3,0] -- 3/3
2019-03-27 00:14:19.821 7f6ec2a23700 10 osd.3 pg_epoch: 406 pg[1.4( empty local-lis/les=398/399 n=0 ec=14/14 lis/c 398/232 les/c/f 399/233/0 398/398/14) [3,0] r=0 lpr=398 pi=[232,398)/1 crt=0'0 mlcod 0'0 active+undersized mbc={}] handle_advance_map [3,0]/[3,0] -- 3/3
2019-03-27 00:14:22.725 7f6ebea1b700 10 osd.3 pg_epoch: 407 pg[1.4( empty local-lis/les=398/399 n=0 ec=14/14 lis/c 398/232 les/c/f 399/233/0 398/398/14) [3,0] r=0 lpr=398 pi=[232,398)/1 crt=0'0 mlcod 0'0 active+undersized mbc={}] handle_advance_map [3,0]/[3,0] -- 3/3
2019-03-27 00:14:23.829 7f6ec2a23700 10 osd.3 pg_epoch: 408 pg[1.4( empty local-lis/les=398/399 n=0 ec=14/14 lis/c 398/232 les/c/f 399/233/0 398/398/14) [3,0] r=0 lpr=398 pi=[232,398)/1 crt=0'0 mlcod 0'0 active+undersized mbc={}] handle_advance_map [3,0]/[3,0] -- 3/3
2019-03-27 00:14:24.849 7f6ec2a23700 10 osd.3 pg_epoch: 409 pg[1.4( empty local-lis/les=398/399 n=0 ec=14/14 lis/c 398/232 les/c/f 399/233/0 398/398/14) [3,0] r=0 lpr=398 pi=[232,398)/1 crt=0'0 mlcod 0'0 active+undersized mbc={}] handle_advance_map [3,0]/[3,0] -- 3/3
2019-03-27 00:14:26.857 7f6ebea1b700 10 osd.3 pg_epoch: 410 pg[1.4( empty local-lis/les=398/399 n=0 ec=14/14 lis/c 398/232 les/c/f 399/233/0 398/398/14) [3,0] r=0 lpr=398 pi=[232,398)/1 crt=0'0 mlcod 0'0 active+undersized mbc={}] handle_advance_map [3,0]/[3,0] -- 3/3
2019-03-27 00:14:27.717 7f6ebea1b700 10 osd.3 pg_epoch: 411 pg[1.4( empty local-lis/les=398/399 n=0 ec=14/14 lis/c 398/232 les/c/f 399/233/0 398/398/14) [3,0] r=0 lpr=398 pi=[232,398)/1 crt=0'0 mlcod 0'0 active+undersized mbc={}] handle_advance_map [3,0]/[3,0] -- 3/3
2019-03-27 00:14:30.893 7f6ebea1b700 10 osd.3 pg_epoch: 412 pg[1.4( empty local-lis/les=398/399 n=0 ec=14/14 lis/c 398/232 les/c/f 399/233/0 398/398/14) [3,0] r=0 lpr=398 pi=[232,398)/1 crt=0'0 mlcod 0'0 active+undersized mbc={}] handle_advance_map [3,0]/[3,0] -- 3/3
2019-03-27 00:14:32.725 7f6ebea1b700 10 osd.3 pg_epoch: 413 pg[1.4( empty local-lis/les=398/399 n=0 ec=14/14 lis/c 398/232 les/c/f 399/233/0 398/398/14) [3,0] r=0 lpr=398 pi=[232,398)/1 crt=0'0 mlcod 0'0 active+undersized mbc={}] handle_advance_map [3,0]/[3,0] -- 3/3
2019-03-27 00:14:33.925 7f6ec2a23700 10 osd.3 pg_epoch: 414 pg[1.4( empty local-lis/les=398/399 n=0 ec=14/14 lis/c 398/232 les/c/f 399/233/0 398/398/14) [3,0] r=0 lpr=398 pi=[232,398)/1 crt=0'0 mlcod 0'0 active+undersized mbc={}] handle_advance_map [3,0]/[3,0] -- 3/3
2019-03-27 00:14:34.961 7f6ebea1b700 10 osd.3 pg_epoch: 415 pg[1.4( empty local-lis/les=398/399 n=0 ec=14/14 lis/c 398/232 les/c/f 399/233/0 398/398/14) [3,0] r=0 lpr=398 pi=[232,398)/1 crt=0'0 mlcod 0'0 active+undersized mbc={}] handle_advance_map [3,0]/[3,0] -- 3/3
...
Actions #8

Updated by Sage Weil about 5 years ago

This is just CRUSH failing. I extracted the osdmap from the data/mon.a.tgz and verified with osdmaptool that it's just failing to find a third available osd after 50 tries.

In theory the probability of that happening is about .00000056, or (roughly) one in a million. I think it's coming up in QA quite a bit more often than that (even considering there are 10s of PGs in each run)... probably because the hash that CRUSH is using isn't as strong as it could be?
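
For what it's worth, the .00000056 figure matches a simple model in which each of the 50 retries independently has about a 1-in-4 chance of landing on a usable third OSD (assuming, roughly, equal host weights and two usable OSDs out of four on the second host):

# Per retry: ~0.5 chance of picking the second host, ~0.5 chance of a usable
# OSD within it, so ~0.25 chance of success per attempt.
p_success_per_try = 0.5 * 0.5
p_all_50_fail = (1 - p_success_per_try) ** 50
print(p_all_50_fail)   # ~5.7e-07, i.e. roughly the .00000056 quoted above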

for the record,

$ ~/src/ceph/build/bin/osdmaptool  640 --test-map-pg 1.4
...
 crush_bucket_choose -5 x=1039746536 r=51
weight 0xdf76 item 1
weight 0xdf76 item 0
weight 0xdf76 item 3
weight 0xdf76 item 2
  item 0 type 0
  reject 0  collide 1  ftotal 50  flocal 1
 crush_bucket_choose -1 x=1039746536 r=52
weight 0x5964 item -3
weight 0x37dd8 item -5
  item -5 type 1
 crush_bucket_choose -5 x=1039746536 r=52
weight 0xdf76 item 1
weight 0xdf76 item 0
weight 0xdf76 item 3
weight 0xdf76 item 2
  item 1 type 0
  reject 1  collide 0  ftotal 51  flocal 1
skip rep
CHOOSE returns 2
1.4 raw ([3,0], p3) up ([3,0], p3) acting ([3,0], p3)

with this patch applied:
diff --git a/src/crush/mapper.c b/src/crush/mapper.c
index 73f92a77d9f..c0ade62839e 100644
--- a/src/crush/mapper.c
+++ b/src/crush/mapper.c
@@ -25,7 +25,7 @@
 #include "crush_ln_table.h" 
 #include "mapper.h" 

-#define dprintk(args...) /* printf(args) */
+#define dprintk(args...)  printf(args) 

 /*
  * Implement the core CRUSH mapping algorithm.

Actions #9

Updated by Sage Weil about 5 years ago

Oh, it's because there's also only a 1/10th probability of choosing the second host:

ID CLASS WEIGHT  TYPE NAME          STATUS REWEIGHT PRI-AFF 
-1       3.84076 root default                               
-5       3.49158     host smithi025                         
 0   ssd 0.87289         osd.0          up  1.00000 1.00000 
 1   ssd 0.87289         osd.1          up        0 1.00000 
 2   ssd 0.87289         osd.2          up        0 1.00000 
 3   ssd 0.87289         osd.3          up  1.00000 1.00000 
-3       0.34918     host smithi124                         
 4   ssd 0.08730         osd.4          up        0 1.00000 
 5   ssd 0.08730         osd.5        down        0 0.43417 
 6   ssd 0.08730         osd.6          up  1.00000 1.00000 
 7   ssd 0.08730         osd.7          up  1.00000 1.00000 

because the LVs on smithi025 are 10x the size of those on smithi124:

teuthology:3820599  10:15 PM $ ssh smithi124 lsblk
NAME           MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda              8:0    0 931.5G  0 disk 
└─sda1           8:1    0 931.5G  0 part /
nvme0n1        259:0    0 372.6G  0 disk 
├─vg_nvme-lv_5 253:0    0  14.9G  0 lvm  /var/lib/ceph
├─vg_nvme-lv_4 253:1    0  89.4G  0 lvm  
├─vg_nvme-lv_3 253:2    0  89.4G  0 lvm  
├─vg_nvme-lv_2 253:3    0  89.4G  0 lvm  
└─vg_nvme-lv_1 253:4    0  89.4G  0 lvm  
teuthology:3820599  10:15 PM $ ssh smithi025 lsblk
NAME           MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda              8:0    0 931.5G  0 disk 
`-sda1           8:1    0 931.5G  0 part /
nvme0n1        259:0    0   3.7T  0 disk 
|-vg_nvme-lv_5 253:0    0   149G  0 lvm  /var/lib/ceph
|-vg_nvme-lv_4 253:1    0 894.2G  0 lvm  /var/lib/ceph/osd/ceph-4
|-vg_nvme-lv_3 253:2    0 894.2G  0 lvm  /var/lib/ceph/osd/ceph-7
|-vg_nvme-lv_2 253:3    0 894.2G  0 lvm  /var/lib/ceph/osd/ceph-6
`-vg_nvme-lv_1 253:4    0 894.2G  0 lvm  /var/lib/ceph/osd/ceph-5
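
Plugging the actual ~10:1 host weights from the tree above into the same rough model shows why this reproduces so often in QA (the real CRUSH retry behaviour is more subtle, so treat this as an estimate):

# The second host is chosen with probability ~0.35 / (3.49 + 0.35) ~= 0.09 per
# retry, and only 2 of its 4 OSDs (osd.6, osd.7) are usable.
p_second_host = 0.34918 / (3.49158 + 0.34918)
p_success_per_try = p_second_host * 0.5
p_all_50_fail = (1 - p_success_per_try) ** 50
print(p_all_50_fail)   # ~0.1 -- roughly one PG in ten fails to find a third OSD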

Actions #11

Updated by David Zafman about 5 years ago

  • Is duplicate of Bug #39149: smithi025 has LVs that are 10x the others added
Actions #12

Updated by David Zafman about 5 years ago

  • Status changed from 12 to Duplicate