Bug #5901

closed

stuck incomplete immediately after clean

Added by Samuel Just over 10 years ago. Updated over 10 years ago.

Status: Duplicate
Priority: Urgent
Assignee:
Category: -
Target version: -
% Done: 0%
Source: other
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2013-08-06T19:19:00.289 INFO:teuthology.orchestra.run.err:[10.214.133.31]: dumped all in format json
2013-08-06T19:19:00.295 INFO:teuthology.task.radosbench.radosbench.0.out:[10.214.133.35]: 588 16 7403 7387 50.2441 32 0.707152 1.27163
2013-08-06T19:19:00.318 DEBUG:teuthology.misc:with jobid basedir: 98004
2013-08-06T19:19:00.318 DEBUG:teuthology.orchestra.run:Running [10.214.133.31]: '/home/ubuntu/cephtest/98004/adjust-ulimits ceph-coverage /home/ubuntu/cephtest/98004/archive/coverage ceph -s'
2013-08-06T19:19:00.623 INFO:teuthology.task.thrashosds.ceph_manager: cluster 3e717ca8-bd17-439a-a1cd-80770f68c0a6
health HEALTH_OK
monmap e1: 3 mons at {a=10.214.133.31:6789/0,b=10.214.133.35:6789/0,c=10.214.133.31:6790/0}, election epoch 8, quorum 0,1,2 a,b,c
osdmap e38: 6 osds: 6 up, 4 in
pgmap v275: 102 pgs: 102 active+clean; 45452 MB data, 62311 MB used, 2727 GB / 2794 GB avail; 78386KB/s wr, 19op/s
mdsmap e5: 1/1/1 up {0=a=up:active}

2013-08-06T19:19:00.623 INFO:teuthology.task.thrashosds.ceph_manager:clean!
2013-08-06T19:19:00.623 INFO:teuthology.task.thrashosds.thrasher:Recovered, killing an osd
2013-08-06T19:19:00.623 INFO:teuthology.task.thrashosds.thrasher:Killing osd 1, live_osds are [1, 3, 2, 5, 4, 0]
2013-08-06T19:19:00.624 DEBUG:teuthology.task.ceph.osd.1:waiting for process to exit
2013-08-06T19:19:00.694 INFO:teuthology.task.ceph.osd.1:Stopped
2013-08-06T19:19:00.695 DEBUG:teuthology.misc:with jobid basedir: 98004
2013-08-06T19:19:00.695 DEBUG:teuthology.orchestra.run:Running [10.214.133.31]: '/home/ubuntu/cephtest/98004/adjust-ulimits ceph-coverage /home/ubuntu/cephtest/98004/archive/coverage ceph osd down 1'
2013-08-06T19:19:00.695 INFO:teuthology.task.radosbench.radosbench.0.err:[10.214.133.35]: 2013-08-06 19:19:49.697563 7f51d41df700 0 -- 10.214.133.35:0/1018686 >> 10.214.133.31:6800/15075 pipe(0x7f51d0016280 sd=9 :0 s=1 pgs=0 cs=0 l=1 c=0x7f51d001de00).fault
2013-08-06T19:19:01.294 INFO:teuthology.task.radosbench.radosbench.0.out:[10.214.133.35]: 589 16 7420 7404 50.2743 68 0.16186 1.27114
2013-08-06T19:19:01.416 INFO:teuthology.orchestra.run.err:[10.214.133.31]: marked down osd.1.
2013-08-06T19:19:01.428 INFO:teuthology.task.thrashosds.thrasher:Removing osd 1, in_osds are: [2, 3, 1, 4]
2013-08-06T19:19:01.428 DEBUG:teuthology.misc:with jobid basedir: 98004
2013-08-06T19:19:01.428 DEBUG:teuthology.orchestra.run:Running [10.214.133.31]: '/home/ubuntu/cephtest/98004/adjust-ulimits ceph-coverage /home/ubuntu/cephtest/98004/archive/coverage ceph osd out 1'
2013-08-06T19:19:02.294 INFO:teuthology.task.radosbench.radosbench.0.out:[10.214.133.35]: 590 16 7428 7412 50.2433 32 0.788347 1.27108
2013-08-06T19:19:02.429 INFO:teuthology.orchestra.run.err:[10.214.133.31]: marked out osd.1.
2013-08-06T19:19:02.439 INFO:teuthology.task.thrashosds.thrasher:Waiting for clean again
2013-08-06T19:19:02.439 INFO:teuthology.task.thrashosds.ceph_manager:waiting for clean
2013-08-06T19:19:02.440 DEBUG:teuthology.misc:with jobid basedir: 98004
2013-08-06T19:19:02.440 DEBUG:teuthology.orchestra.run:Running [10.214.133.31]: '/home/ubuntu/cephtest/98004/adjust-ulimits ceph-coverage /home/ubuntu/cephtest/98004/archive/coverage ceph pg dump --format=json'
2013-08-06T19:19:02.741 INFO:teuthology.orchestra.run.err:[10.214.133.31]: dumped all in format json
2013-08-06T19:19:02.754 DEBUG:teuthology.misc:with jobid basedir: 98004
2013-08-06T19:19:02.754 DEBUG:teuthology.orchestra.run:Running [10.214.133.31]: '/home/ubuntu/cephtest/98004/adjust-ulimits ceph-coverage /home/ubuntu/cephtest/98004/archive/coverage ceph pg dump --format=json'
2013-08-06T19:19:03.023 INFO:teuthology.orchestra.run.err:[10.214.133.31]: dumped all in format json
2013-08-06T19:19:03.045 DEBUG:teuthology.misc:with jobid basedir: 98004
2013-08-06T19:19:03.045 DEBUG:teuthology.orchestra.run:Running [10.214.133.31]: '/home/ubuntu/cephtest/98004/adjust-ulimits ceph-coverage /home/ubuntu/cephtest/98004/archive/coverage ceph -s'
2013-08-06T19:19:03.294 INFO:teuthology.task.radosbench.radosbench.0.out:[10.214.133.35]: 591 16 7437 7421 50.2192 36 0.158046 1.27007
2013-08-06T19:19:03.298 INFO:teuthology.task.thrashosds.ceph_manager: cluster 3e717ca8-bd17-439a-a1cd-80770f68c0a6
health HEALTH_WARN 25 pgs stale
monmap e1: 3 mons at {a=10.214.133.31:6789/0,b=10.214.133.35:6789/0,c=10.214.133.31:6790/0}, election epoch 8, quorum 0,1,2 a,b,c
osdmap e40: 6 osds: 5 up, 3 in
pgmap v277: 102 pgs: 77 active+clean, 25 stale+active+clean; 45452 MB data, 48367 MB used, 2276 GB / 2328 GB avail
mdsmap e5: 1/1/1 up {0=a=up:active}

2013-08-06T19:19:03.298 DEBUG:teuthology.misc:with jobid basedir: 98004
2013-08-06T19:19:03.298 DEBUG:teuthology.orchestra.run:Running [10.214.133.31]: '/home/ubuntu/cephtest/98004/adjust-ulimits ceph-coverage /home/ubuntu/cephtest/98004/archive/coverage ceph pg dump --format=json'
2013-08-06T19:19:03.764 INFO:teuthology.orchestra.run.err:[10.214.133.31]: dumped all in format json
2013-08-06T19:19:04.294 INFO:teuthology.task.radosbench.radosbench.0.out:[10.214.133.35]: 592 16 7440 7424 50.1546 12 3.88858 1.27099
2013-08-06T19:19:05.294 INFO:teuthology.task.radosbench.radosbench.0.out:[10.214.133.35]: 593 16 7457 7441 50.1847 68 0.667485 1.27231
2013-08-06T19:19:06.295 INFO:teuthology.task.radosbench.radosbench.0.out:[10.214.133.35]: 594 16 7465 7449 50.1541 32 0.547518 1.27189
2013-08-06T19:19:06.772 DEBUG:teuthology.misc:with jobid basedir: 98004
2013-08-06T19:19:06.773 DEBUG:teuthology.orchestra.run:Running [10.214.133.31]: '/home/ubuntu/cephtest/98004/adjust-ulimits ceph-coverage /home/ubuntu/cephtest/98004/archive/coverage ceph pg dump --format=json'
2013-08-06T19:19:07.001 INFO:teuthology.orchestra.run.err:[10.214.133.31]: dumped all in format json
2013-08-06T19:19:07.021 DEBUG:teuthology.misc:with jobid basedir: 98004
2013-08-06T19:19:07.021 DEBUG:teuthology.orchestra.run:Running [10.214.133.31]: '/home/ubuntu/cephtest/98004/adjust-ulimits ceph-coverage /home/ubuntu/cephtest/98004/archive/coverage ceph -s'
2013-08-06T19:19:07.274 INFO:teuthology.task.thrashosds.ceph_manager: cluster 3e717ca8-bd17-439a-a1cd-80770f68c0a6
health HEALTH_WARN 25 pgs stale
monmap e1: 3 mons at {a=10.214.133.31:6789/0,b=10.214.133.35:6789/0,c=10.214.133.31:6790/0}, election epoch 8, quorum 0,1,2 a,b,c
osdmap e41: 6 osds: 5 up, 3 in
pgmap v278: 102 pgs: 77 active+clean, 25 stale+active+clean; 45452 MB data, 48367 MB used, 2276 GB / 2328 GB avail
mdsmap e5: 1/1/1 up {0=a=up:active}

2013-08-06T19:19:07.275 DEBUG:teuthology.misc:with jobid basedir: 98004
2013-08-06T19:19:07.275 DEBUG:teuthology.orchestra.run:Running [10.214.133.31]: '/home/ubuntu/cephtest/98004/adjust-ulimits ceph-coverage /home/ubuntu/cephtest/98004/archive/coverage ceph pg dump --format=json'
2013-08-06T19:19:07.295 INFO:teuthology.task.radosbench.radosbench.0.out:[10.214.133.35]: 595 16 7481 7465 50.1773 64 1.35989 1.2742
2013-08-06T19:19:07.681 INFO:teuthology.orchestra.run.err:[10.214.133.31]: dumped all in format json
2013-08-06T19:19:08.296 INFO:teuthology.task.radosbench.radosbench.0.out:[10.214.133.35]: 596 16 7486 7470 50.1267 20 1.72277 1.27431
2013-08-06T19:19:09.295 INFO:teuthology.task.radosbench.radosbench.0.out:[10.214.133.35]: 597 16 7502 7486 50.1499 64 0.636611 1.27468
2013-08-06T19:19:10.295 INFO:teuthology.task.radosbench.radosbench.0.out:[10.214.133.35]: 598 16 7506 7490 50.0928 16 1.57742 1.27478
2013-08-06T19:19:10.689 DEBUG:teuthology.misc:with jobid basedir: 98004
2013-08-06T19:19:10.689 DEBUG:teuthology.orchestra.run:Running [10.214.133.31]: '/home/ubuntu/cephtest/98004/adjust-ulimits ceph-coverage /home/ubuntu/cephtest/98004/archive/coverage ceph pg dump --format=json'
2013-08-06T19:19:10.906 INFO:teuthology.orchestra.run.err:[10.214.133.31]: dumped all in format json
2013-08-06T19:19:10.923 DEBUG:teuthology.misc:with jobid basedir: 98004
2013-08-06T19:19:10.923 DEBUG:teuthology.orchestra.run:Running [10.214.133.31]: '/home/ubuntu/cephtest/98004/adjust-ulimits ceph-coverage /home/ubuntu/cephtest/98004/archive/coverage ceph -s'
2013-08-06T19:19:11.180 INFO:teuthology.task.thrashosds.ceph_manager: cluster 3e717ca8-bd17-439a-a1cd-80770f68c0a6
health HEALTH_WARN 5 pgs backfilling; 5 pgs degraded; 1 pgs incomplete; 12 pgs recovering; 1 pgs stuck inactive; 1 pgs stuck unclean; recovery 7986/22978 degraded (34.755%); recovering 2 o/s, 13500KB/s

ubuntu@teuthology:/a/teuthology-2013-08-06_01:00:25-rados-next-testing-basic-plana/98004


Related issues 1 (0 open, 1 closed)

Related to Ceph - Bug #5923: osd: 6 up, 5 in; 91 active+clean, 1 remapped (Duplicate, Samuel Just, 08/09/2013)

Actions #1

Updated by Samuel Just over 10 years ago

~/teuthology [mine?] » ./virtualenv/bin/teuthology-schedule --name "samuelj-5901-0" -n 10 --owner samuelj@slider testruns/5799.yaml | tee 5901-0.jobs

Job scheduled with ID 101560
Job scheduled with ID 101561
Job scheduled with ID 101562
Job scheduled with ID 101563
Job scheduled with ID 101564
Job scheduled with ID 101565
Job scheduled with ID 101566
Job scheduled with ID 101567
Job scheduled with ID 101568
Job scheduled with ID 101569

Actions #2

Updated by Samuel Just over 10 years ago

./virtualenv/bin/teuthology-schedule --name "samuelj-5901-1" -n 50 --owner samuelj@slider testruns/5799.yaml | tee 5901-1.jobs

~/teuthology [mine?] » cat 5901-1.jobs
Job scheduled with ID 101570
Job scheduled with ID 101571
Job scheduled with ID 101572
Job scheduled with ID 101573
Job scheduled with ID 101574
Job scheduled with ID 101575
Job scheduled with ID 101576
Job scheduled with ID 101577
Job scheduled with ID 101578
Job scheduled with ID 101579
Job scheduled with ID 101580
Job scheduled with ID 101581
Job scheduled with ID 101582
Job scheduled with ID 101583
Job scheduled with ID 101584
Job scheduled with ID 101585
Job scheduled with ID 101586
Job scheduled with ID 101587
Job scheduled with ID 101588
Job scheduled with ID 101589
Job scheduled with ID 101590
Job scheduled with ID 101591
Job scheduled with ID 101592
Job scheduled with ID 101593
Job scheduled with ID 101594
Job scheduled with ID 101595
Job scheduled with ID 101596
Job scheduled with ID 101597
Job scheduled with ID 101598
Job scheduled with ID 101599
Job scheduled with ID 101600
Job scheduled with ID 101601
Job scheduled with ID 101602
Job scheduled with ID 101603
Job scheduled with ID 101604
Job scheduled with ID 101605
Job scheduled with ID 101606
Job scheduled with ID 101607
Job scheduled with ID 101608
Job scheduled with ID 101609
Job scheduled with ID 101610
Job scheduled with ID 101611
Job scheduled with ID 101612
Job scheduled with ID 101613
Job scheduled with ID 101614
Job scheduled with ID 101615
Job scheduled with ID 101616
Job scheduled with ID 101617
Job scheduled with ID 101618
Job scheduled with ID 101619

Actions #3

Updated by Samuel Just over 10 years ago

The bug is that the primary can report that the pg is clean before the replica sees the OP_BACKFILL_FINISH message. This is a pretty tight race that probably won't happen outside of our tests. We do need to fix it, but the fix probably involves adding an ack message from replica->primary before we officially stop backfilling. In that case, we'll have to implement it in master and backport it to dumpling at a later date.
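
As an illustration of the proposed fix, here is a minimal, self-contained C++ sketch of a replica->primary ack handshake. The names below (PrimaryPG, ReplicaPG, OP_BACKFILL_FINISH_ACK) are illustrative only and do not correspond to the real Ceph classes or messages; the point is simply that the primary does not count backfill as finished (and so cannot report the pg clean) until the replica has acknowledged the finish message.

// Hypothetical sketch only: these types and message names model the
// proposed handshake; they are not the actual Ceph implementation.
#include <cstdint>
#include <iostream>

enum class BackfillMsg : uint8_t {
  OP_BACKFILL_FINISH,      // primary -> replica: backfill is done
  OP_BACKFILL_FINISH_ACK,  // replica -> primary: finish has been applied
};

struct PrimaryPG {
  bool finish_sent = false;
  bool finish_acked = false;

  // Current behaviour: the pg can be reported clean as soon as the
  // primary decides backfill is done. Proposed behaviour: wait for the
  // replica's ack, so a pg dump taken on the replica can never still
  // show the pg backfilling after the primary has reported clean.
  bool can_report_clean() const { return finish_sent && finish_acked; }

  BackfillMsg finish_backfill() {
    finish_sent = true;
    return BackfillMsg::OP_BACKFILL_FINISH;
  }

  void handle(BackfillMsg m) {
    if (m == BackfillMsg::OP_BACKFILL_FINISH_ACK)
      finish_acked = true;
  }
};

struct ReplicaPG {
  bool finish_applied = false;

  BackfillMsg handle(BackfillMsg m) {
    // Apply the finish locally, then ack back to the primary.
    if (m == BackfillMsg::OP_BACKFILL_FINISH)
      finish_applied = true;
    return BackfillMsg::OP_BACKFILL_FINISH_ACK;
  }
};

int main() {
  PrimaryPG primary;
  ReplicaPG replica;

  BackfillMsg finish = primary.finish_backfill();
  std::cout << "clean before ack? " << primary.can_report_clean() << "\n";  // prints 0

  BackfillMsg ack = replica.handle(finish);
  primary.handle(ack);
  std::cout << "clean after ack?  " << primary.can_report_clean() << "\n";  // prints 1
  return 0;
}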

Actions #4

Updated by Sage Weil over 10 years ago

  • Status changed from New to In Progress
Actions #5

Updated by Sage Weil over 10 years ago

  • Status changed from In Progress to Duplicate