Bug #20221
kill osd + osd out leads to stale PGs
Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
When the OSD is killed before running ceph osd out, its PGs stay in the stale state.
Reproducer
From the build directory
source ../src/test/detect-build-env-vars.sh
../src/stop.sh ; rm -fr out dev ; rm -f ceph.conf ; bash ../src/vstart.sh -n -l --short --nolockdep --mon_num 1 --osd_num 3 --mds_num 0
ceph osd pool set rbd size 1
pkill -f 'ceph-osd -i 0'
sleep 10
ceph osd out osd.0
sleep 10
ceph pg dump
=> osd.0 still mapped to stale PGs
All is well if ceph osd out osd.0 is done before killing the OSD.
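For scripted verification of the reproducer, the symptom can be detected by parsing the plain-text ceph pg dump output for PGs whose state includes "stale" while the acting set still contains the killed OSD. A minimal sketch (the helper name and column handling are illustrative, not part of Ceph):

```python
# Detect PGs that remain mapped to a given OSD in a stale state,
# based on the plain-text `ceph pg dump` row format shown in this report.

def stale_pgs_on_osd(dump_text, osd_id):
    """Return pgids whose STATE contains 'stale' and whose ACTING
    set still includes osd_id."""
    stale = []
    for line in dump_text.splitlines():
        fields = line.split()
        # PG rows start with a pgid like '0.6'; skip headers and sum rows.
        if not fields or '.' not in fields[0] or len(fields) < 17:
            continue
        state = fields[9]    # STATE column
        acting = fields[16]  # ACTING column, e.g. '[0]'
        if 'stale' in state and str(osd_id) in acting.strip('[]').split(','):
            stale.append(fields[0])
    return stale

# Two rows taken from the sample run below, after osd.0 was killed:
sample = """\
0.7 0 0 0 0 0 0 0 0 active+clean 2017-06-08 11:37:23.100644 0'0 16:32 [1] 1 [1] 1 0'0 2017-06-08 11:37:01.091443 0'0 2017-06-08 11:37:01.091443
0.6 0 0 0 0 0 0 0 0 stale+active+clean 2017-06-08 11:37:23.100645 0'0 16:41 [0] 0 [0] 0 0'0 2017-06-08 11:37:01.091443 0'0 2017-06-08 11:37:01.091443
"""
print(stale_pgs_on_osd(sample, 0))  # → ['0.6']
```

A test could run this against the full third pg dump of the sample run and fail while any pgid is returned for osd.0.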
Sample run
(virtualenv) loic@fold:~/software/ceph/ceph/build$ ../src/stop.sh ; rm -fr out dev ; rm -f ceph.conf ; bash ../src/vstart.sh -n -l --short --nolockdep --mon_num 1 --osd_num 3 --mds_num 0
(virtualenv) loic@fold:~/software/ceph/ceph/build$ ceph osd pool set rbd size 1
set pool 0 size to 1
(virtualenv) loic@fold:~/software/ceph/ceph/build$ ceph pg dump
dumped all
version 8
stamp 2017-06-08 11:37:27.021636
last_osdmap_epoch 0
last_pg_scan 0
full_ratio 0
nearfull_ratio 0
PG_STAT OBJECTS MISSING_ON_PRIMARY DEGRADED MISPLACED UNFOUND BYTES LOG DISK_LOG STATE STATE_STAMP VERSION REPORTED UP UP_PRIMARY ACTING ACTING_PRIMARY LAST_SCRUB SCRUB_STAMP LAST_DEEP_SCRUB DEEP_SCRUB_STAMP
0.7 0 0 0 0 0 0 0 0 active+clean 2017-06-08 11:37:23.100644 0'0 16:32 [1] 1 [1] 1 0'0 2017-06-08 11:37:01.091443 0'0 2017-06-08 11:37:01.091443
0.6 0 0 0 0 0 0 0 0 active+clean 2017-06-08 11:37:23.100645 0'0 16:41 [0] 0 [0] 0 0'0 2017-06-08 11:37:01.091443 0'0 2017-06-08 11:37:01.091443
0.5 0 0 0 0 0 0 0 0 active+clean 2017-06-08 11:37:23.201360 0'0 16:41 [0] 0 [0] 0 0'0 2017-06-08 11:37:01.091443 0'0 2017-06-08 11:37:01.091443
0.4 0 0 0 0 0 0 0 0 active+clean 2017-06-08 11:37:23.201217 0'0 16:41 [0] 0 [0] 0 0'0 2017-06-08 11:37:01.091443 0'0 2017-06-08 11:37:01.091443
0.0 0 0 0 0 0 0 0 0 active+clean 2017-06-08 11:37:23.201318 0'0 16:36 [0] 0 [0] 0 0'0 2017-06-08 11:37:01.091443 0'0 2017-06-08 11:37:01.091443
0.1 0 0 0 0 0 0 0 0 active+clean 2017-06-08 11:37:23.756707 0'0 16:11 [2] 2 [2] 2 0'0 2017-06-08 11:37:01.091443 0'0 2017-06-08 11:37:01.091443
0.2 0 0 0 0 0 0 0 0 active+clean 2017-06-08 11:37:23.201392 0'0 16:41 [0] 0 [0] 0 0'0 2017-06-08 11:37:01.091443 0'0 2017-06-08 11:37:01.091443
0.3 0 0 0 0 0 0 0 0 active+clean 2017-06-08 11:37:23.756706 0'0 16:11 [2] 2 [2] 2 0'0 2017-06-08 11:37:01.091443 0'0 2017-06-08 11:37:01.091443
0 0 0 0 0 0 0 0 0
sum 0 0 0 0 0 0 0 0
OSD_STAT USED AVAIL TOTAL HB_PEERS PG_SUM PRIMARY_PG_SUM
2 174G 8076M 182G [0,1] 2 2
1 174G 8076M 182G [0,2] 1 1
0 174G 8076M 182G [1,2] 5 5
sum 523G 24229M 547G
(virtualenv) loic@fold:~/software/ceph/ceph/build$ rados put BAR3 /etc/group --pool rbd
(virtualenv) loic@fold:~/software/ceph/ceph/build$ ceph pg dump
dumped all
version 14
stamp 2017-06-08 11:37:39.022935
last_osdmap_epoch 0
last_pg_scan 0
full_ratio 0
nearfull_ratio 0
PG_STAT OBJECTS MISSING_ON_PRIMARY DEGRADED MISPLACED UNFOUND BYTES LOG DISK_LOG STATE STATE_STAMP VERSION REPORTED UP UP_PRIMARY ACTING ACTING_PRIMARY LAST_SCRUB SCRUB_STAMP LAST_DEEP_SCRUB DEEP_SCRUB_STAMP
0.7 0 0 0 0 0 0 0 0 active+clean 2017-06-08 11:37:23.100644 0'0 16:32 [1] 1 [1] 1 0'0 2017-06-08 11:37:01.091443 0'0 2017-06-08 11:37:01.091443
0.6 0 0 0 0 0 0 0 0 active+clean 2017-06-08 11:37:23.100645 0'0 16:41 [0] 0 [0] 0 0'0 2017-06-08 11:37:01.091443 0'0 2017-06-08 11:37:01.091443
0.5 0 0 0 0 0 0 0 0 active+clean 2017-06-08 11:37:23.201360 0'0 16:41 [0] 0 [0] 0 0'0 2017-06-08 11:37:01.091443 0'0 2017-06-08 11:37:01.091443
0.4 0 0 0 0 0 0 0 0 active+clean 2017-06-08 11:37:23.201217 0'0 16:41 [0] 0 [0] 0 0'0 2017-06-08 11:37:01.091443 0'0 2017-06-08 11:37:01.091443
0.0 0 0 0 0 0 0 0 0 active+clean 2017-06-08 11:37:23.201318 0'0 16:36 [0] 0 [0] 0 0'0 2017-06-08 11:37:01.091443 0'0 2017-06-08 11:37:01.091443
0.1 0 0 0 0 0 0 0 0 active+clean 2017-06-08 11:37:23.756707 0'0 16:11 [2] 2 [2] 2 0'0 2017-06-08 11:37:01.091443 0'0 2017-06-08 11:37:01.091443
0.2 0 0 0 0 0 0 0 0 active+clean 2017-06-08 11:37:23.201392 0'0 16:41 [0] 0 [0] 0 0'0 2017-06-08 11:37:01.091443 0'0 2017-06-08 11:37:01.091443
0.3 0 0 0 0 0 0 0 0 active+clean 2017-06-08 11:37:23.756706 0'0 16:11 [2] 2 [2] 2 0'0 2017-06-08 11:37:01.091443 0'0 2017-06-08 11:37:01.091443
0 0 0 0 0 0 0 0 0
sum 0 0 0 0 0 0 0 0
OSD_STAT USED AVAIL TOTAL HB_PEERS PG_SUM PRIMARY_PG_SUM
2 174G 8076M 182G [0,1] 2 2
1 174G 8076M 182G [0,2] 1 1
0 174G 8076M 182G [1,2] 5 5
sum 523G 24229M 547G
(virtualenv) loic@fold:~/software/ceph/ceph/build$ ps fauwwx | grep ceph-osd
loic 18093 0.0 0.0 15796 1008 pts/24 S+ 11:38 0:00 | | | \_ grep ceph-osd
loic 17433 0.1 0.2 762160 33584 ? Ssl 11:37 0:00 \_ /home/loic/software/ceph/ceph/build/bin/ceph-osd -i 0 -c /home/loic/software/ceph/ceph/build/ceph.conf
loic 17641 0.1 0.2 761136 32700 ? Ssl 11:37 0:00 \_ /home/loic/software/ceph/ceph/build/bin/ceph-osd -i 1 -c /home/loic/software/ceph/ceph/build/ceph.conf
loic 17848 0.1 0.1 761140 31720 ? Ssl 11:37 0:00 \_ /home/loic/software/ceph/ceph/build/bin/ceph-osd -i 2 -c /home/loic/software/ceph/ceph/build/ceph.conf
(virtualenv) loic@fold:~/software/ceph/ceph/build$ kill 17433
(virtualenv) loic@fold:~/software/ceph/ceph/build$ ceph pg dump
dumped all
version 40
stamp 2017-06-08 11:38:29.028347
last_osdmap_epoch 0
last_pg_scan 0
full_ratio 0
nearfull_ratio 0
PG_STAT OBJECTS MISSING_ON_PRIMARY DEGRADED MISPLACED UNFOUND BYTES LOG DISK_LOG STATE STATE_STAMP VERSION REPORTED UP UP_PRIMARY ACTING ACTING_PRIMARY LAST_SCRUB SCRUB_STAMP LAST_DEEP_SCRUB DEEP_SCRUB_STAMP
0.7 0 0 0 0 0 0 0 0 active+clean 2017-06-08 11:37:23.100644 0'0 16:32 [1] 1 [1] 1 0'0 2017-06-08 11:37:01.091443 0'0 2017-06-08 11:37:01.091443
0.6 0 0 0 0 0 0 0 0 stale+active+clean 2017-06-08 11:37:23.100645 0'0 16:41 [0] 0 [0] 0 0'0 2017-06-08 11:37:01.091443 0'0 2017-06-08 11:37:01.091443
0.5 0 0 0 0 0 0 0 0 stale+active+clean 2017-06-08 11:37:23.201360 0'0 16:41 [0] 0 [0] 0 0'0 2017-06-08 11:37:01.091443 0'0 2017-06-08 11:37:01.091443
0.4 0 0 0 0 0 0 0 0 stale+active+clean 2017-06-08 11:37:23.201217 0'0 16:41 [0] 0 [0] 0 0'0 2017-06-08 11:37:01.091443 0'0 2017-06-08 11:37:01.091443
0.0 0 0 0 0 0 0 0 0 stale+active+clean 2017-06-08 11:37:23.201318 0'0 16:36 [0] 0 [0] 0 0'0 2017-06-08 11:37:01.091443 0'0 2017-06-08 11:37:01.091443
0.1 0 0 0 0 0 0 0 0 active+clean 2017-06-08 11:37:23.756707 0'0 16:11 [2] 2 [2] 2 0'0 2017-06-08 11:37:01.091443 0'0 2017-06-08 11:37:01.091443
0.2 1 0 0 0 0 1499 1 1 stale+active+clean 2017-06-08 11:37:23.201392 16'1 16:42 [0] 0 [0] 0 0'0 2017-06-08 11:37:01.091443 0'0 2017-06-08 11:37:01.091443
0.3 0 0 0 0 0 0 0 0 active+clean 2017-06-08 11:37:23.756706 0'0 16:11 [2] 2 [2] 2 0'0 2017-06-08 11:37:01.091443 0'0 2017-06-08 11:37:01.091443
0 1 0 0 0 0 1499 1 1
sum 1 0 0 0 0 1499 1 1
OSD_STAT USED AVAIL TOTAL HB_PEERS PG_SUM PRIMARY_PG_SUM
2 174G 8076M 182G [0,1] 2 2
1 174G 8076M 182G [0,2] 1 1
0 174G 8076M 182G [1,2] 5 5
sum 523G 24228M 547G
(virtualenv) loic@fold:~/software/ceph/ceph/build$ rados --debug-objecter 20 --debug-rados 20 put BAR2 /etc/group --pool rbd
2017-06-08 11:41:52.928208 7f1e52998e00 1 librados: starting msgr at -
2017-06-08 11:41:52.928224 7f1e52998e00 1 librados: starting objecter
2017-06-08 11:41:52.928500 7f1e52998e00 1 librados: setting wanted keys
2017-06-08 11:41:52.928501 7f1e52998e00 1 librados: calling monclient init
2017-06-08 11:41:52.929470 7f1e3debb700 10 client.?.objecter ms_handle_connect 0x5597ace69ec0
2017-06-08 11:41:52.929489 7f1e3debb700 10 client.?.objecter resend_mon_ops
2017-06-08 11:41:52.930829 7f1e52998e00 10 client.4120.objecter _maybe_request_map subscribing (onetime) to next osd map
2017-06-08 11:41:52.930870 7f1e52998e00 1 librados: init done
2017-06-08 11:41:52.930877 7f1e52998e00 10 librados: wait_for_osdmap waiting
2017-06-08 11:41:52.931078 7f1e3debb700 10 client.4120.objecter ms_dispatch 0x5597acd413a0 osd_map(18..18 src has 1..18) v3
2017-06-08 11:41:52.931087 7f1e3debb700 3 client.4120.objecter handle_osd_map got epochs [18,18] > 0
2017-06-08 11:41:52.931088 7f1e3debb700 3 client.4120.objecter handle_osd_map decoding full epoch 18
2017-06-08 11:41:52.931207 7f1e3debb700 20 client.4120.objecter dump_active .. 0 homeless
2017-06-08 11:41:52.931227 7f1e52998e00 10 librados: wait_for_osdmap done waiting
2017-06-08 11:41:52.931265 7f1e52998e00 10 librados: writefull oid=BAR2 nspace=
2017-06-08 11:41:52.931278 7f1e52998e00 10 client.4120.objecter _op_submit op 0x5597ace6dac0
2017-06-08 11:41:52.931281 7f1e52998e00 20 client.4120.objecter _calc_target epoch 18 base BAR2 @0 precalc_pgid 0 pgid 0.0 is_write
2017-06-08 11:41:52.931293 7f1e52998e00 20 client.4120.objecter _calc_target target BAR2 @0 -> pgid 0.39aeb7e2
2017-06-08 11:41:52.931297 7f1e3debb700 10 client.4120.objecter ms_handle_connect 0x7f1e28009690
2017-06-08 11:41:52.931313 7f1e52998e00 10 client.4120.objecter _calc_target raw pgid 0.39aeb7e2 -> actual 0.2 acting [] primary -1
2017-06-08 11:41:52.931319 7f1e52998e00 20 client.4120.objecter _get_session osd=-1 returning homeless
2017-06-08 11:41:52.931322 7f1e52998e00 10 client.4120.objecter _maybe_request_map subscribing (onetime) to next osd map
2017-06-08 11:41:52.931343 7f1e52998e00 10 client.4120.objecter _op_submit oid BAR2 '@0' '@0' [writefull 0~1499] tid 1 osd.-1
2017-06-08 11:41:52.931351 7f1e52998e00 15 client.4120.objecter _session_op_assign -1 1
2017-06-08 11:41:52.931352 7f1e52998e00 5 client.4120.objecter 1 in flight
2017-06-08 11:41:57.930919 7f1e3e70c700 10 client.4120.objecter tick
2017-06-08 11:41:57.930935 7f1e3e70c700 10 client.4120.objecter _maybe_request_map subscribing (onetime) to next osd map
2017-06-08 11:42:02.931046 7f1e3e70c700 10 client.4120.objecter tick
2017-06-08 11:42:02.931057 7f1e3e70c700 10 client.4120.objecter _maybe_request_map subscribing (onetime) to next osd map
2017-06-08 11:42:07.931192 7f1e3e70c700 10 client.4120.objecter tick
2017-06-08 11:42:07.931208 7f1e3e70c700 10 client.4120.objecter _maybe_request_map subscribing (onetime) to next osd map
2017-06-08 11:42:12.931317 7f1e3e70c700 10 client.4120.objecter tick
2017-06-08 11:42:12.931334 7f1e3e70c700 10 client.4120.objecter _maybe_request_map subscribing (onetime) to next osd map
2017-06-08 11:42:17.931406 7f1e3e70c700 10 client.4120.objecter tick
2017-06-08 11:42:17.931423 7f1e3e70c700 10 client.4120.objecter _maybe_request_map subscribing (onetime) to next osd map
2017-06-08 11:42:22.931567 7f1e3e70c700 10 client.4120.objecter tick
2017-06-08 11:42:22.931586 7f1e3e70c700 10 client.4120.objecter _maybe_request_map subscribing (onetime) to next osd map
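The objecter trace shows why the rados put never completes: _calc_target resolves pg 0.2 to "acting [] primary -1", the op is parked on the homeless session (osd=-1), and every tick just re-subscribes to the next osdmap. A rough log-scraping sketch of that diagnosis (the regexes and function are illustrative helpers, not Ceph code):

```python
import re

# Flag ops the objecter parked on the 'homeless' session (osd=-1),
# and count how many times the client re-subscribed for a new osdmap.

HOMELESS_RE = re.compile(r"_session_op_assign -1 (\d+)")
SUBSCRIBE_RE = re.compile(r"_maybe_request_map subscribing")

def summarize(log_text):
    """Return the tids of homeless ops and the number of osdmap
    re-subscriptions seen in an objecter debug-20 log."""
    return {
        "homeless_tids": HOMELESS_RE.findall(log_text),
        "map_subscriptions": len(SUBSCRIBE_RE.findall(log_text)),
    }

# Excerpt of the trace above (timestamps and thread ids trimmed):
log = """\
client.4120.objecter _calc_target raw pgid 0.39aeb7e2 -> actual 0.2 acting [] primary -1
client.4120.objecter _get_session osd=-1 returning homeless
client.4120.objecter _session_op_assign -1 1
client.4120.objecter tick
client.4120.objecter _maybe_request_map subscribing (onetime) to next osd map
"""
print(summarize(log))
```

On the full trace this reports tid 1 as homeless and one subscription per five-second tick, matching the hang observed above.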