Project

General

Profile

Bug #2094

osd: pgs remapped to down+out osd

Added by Josh Durgin over 8 years ago. Updated over 8 years ago.

Status:
Resolved
Priority:
High
Assignee:
-
Category:
OSD
Target version:
% Done:

0%

Source:
Development
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature:

Description

This is why the dump_stuck test fails on master. When one osd is marked out, the pg is remapped incorrectly:

2012-02-22 18:42:55.809068 mon <- [osd,dump]
2012-02-22 18:42:55.809762 mon.0 -> 'dumped osdmap epoch 234' (0)
epoch 234
fsid ed6c6aab-41ac-42e3-b2af-fda425a5a835
created 2012-02-22 18:25:37.311048
modifed 2012-02-22 18:42:54.751454
flags 

pool 0 'data' rep size 2 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 lpg_num 2 lpgp_num 2 last_change 1 owner 0 crash_replay_interval 45
pool 1 'metadata' rep size 2 crush_ruleset 1 object_hash rjenkins pg_num 8 pgp_num 8 lpg_num 2 lpgp_num 2 last_change 1 owner 0
pool 2 'rbd' rep size 2 crush_ruleset 2 object_hash rjenkins pg_num 8 pgp_num 8 lpg_num 2 lpgp_num 2 last_change 1 owner 0

max_osd 2
osd.0 down out weight 0 up_from 232 up_thru 122 down_at 234 last_clean_interval [2,231) 10.3.14.175:6800/24051 10.3.14.175:6802/24051 10.3.14.175:6807/24051 exists
osd.1 up   in  weight 1 up_from 196 up_thru 232 down_at 195 last_clean_interval [3,195) 10.3.14.174:6800/11841 10.3.14.174:6801/11841 10.3.14.174:6802/11841 exists,up

pg_temp 0.0 [1,0]
pg_temp 0.1 [1,0]
pg_temp 0.2 [1,0]
pg_temp 0.3 [1,0]
pg_temp 0.4 [1,0]
pg_temp 0.5 [1,0]
pg_temp 0.6 [1,0]
pg_temp 0.7 [1,0]
pg_temp 0.0p0 [1,0]
pg_temp 0.1p0 [1,0]
pg_temp 0.0p1 [1,0]
pg_temp 0.1p1 [1,0]
pg_temp 1.0 [1,0]
pg_temp 1.1 [1,0]
pg_temp 1.2 [1,0]
pg_temp 1.3 [1,0]
pg_temp 1.4 [1,0]
pg_temp 1.5 [1,0]
pg_temp 1.6 [1,0]
pg_temp 1.7 [1,0]
pg_temp 1.0p0 [1,0]
pg_temp 1.1p0 [1,0]
pg_temp 1.0p1 [1,0]
pg_temp 1.1p1 [1,0]
pg_temp 2.0 [1,0]
pg_temp 2.1 [1,0]
pg_temp 2.2 [1,0]
pg_temp 2.3 [1,0]
pg_temp 2.4 [1,0]
pg_temp 2.5 [1,0]
pg_temp 2.6 [1,0]
pg_temp 2.7 [1,0]
pg_temp 2.0p0 [1,0]
pg_temp 2.1p0 [1,0]
pg_temp 2.0p1 [1,0]
pg_temp 2.1p1 [1,0]
2012-02-22 18:56:23.154864 mon <- [pg,dump]
2012-02-22 18:56:23.155765 mon.0 -> 'dumped all in format plain' (0)
version 443
last_osdmap_epoch 364
last_pg_scan 1
full_ratio 0.95
nearfull_ratio 0.85
pg_stat    objects    mip    degr    unf    bytes    log    disklog    state    state_stamp    v    reported    up    acting    last_scrub    scrub_stamp
1.0p1    0    0    0    0    0    0    0    active+clean+remapped    2012-02-22 18:55:35.193199    0'0    360'428    [1]    [1,0]    0'0    2012-02-22 18:46:53.289635
1.1p0    0    0    0    0    0    0    0    active+clean+remapped    2012-02-22 18:55:35.193130    0'0    360'427    [1]    [1,0]    0'0    2012-02-22 18:46:52.289457
0.1p1    0    0    0    0    0    0    0    active+clean+remapped    2012-02-22 18:55:35.193165    0'0    360'428    [1]    [1,0]    0'0    2012-02-22 18:46:29.286840
0.0p0    0    0    0    0    0    0    0    active+clean+remapped    2012-02-22 18:55:35.193092    0'0    360'427    [1]    [1,0]    0'0    2012-02-22 18:46:25.286337
1.1p1    0    0    0    0    0    0    0    active+clean+remapped    2012-02-22 18:55:35.193344    0'0    360'428    [1]    [1,0]    0'0    2012-02-22 18:46:56.290013
1.0p0    0    0    0    0    0    0    0    active+clean+remapped    2012-02-22 18:55:35.193267    0'0    360'427    [1]    [1,0]    0'0    2012-02-22 18:46:51.289321
0.0p1    0    0    0    0    0    0    0    active+clean+remapped    2012-02-22 18:55:35.193304    0'0    360'428    [1]    [1,0]    0'0    2012-02-22 18:46:28.286759
0.1p0    0    0    0    0    0    0    0    active+clean+remapped    2012-02-22 18:55:35.193233    0'0    360'427    [1]    [1,0]    0'0    2012-02-22 18:46:26.286477
2.1p1    0    0    0    0    0    0    0    active+clean+remapped    2012-02-22 18:55:35.193415    0'0    360'428    [1]    [1,0]    0'0    2012-02-22 18:47:55.528783
2.0p0    0    0    0    0    0    0    0    active+clean+remapped    2012-02-22 18:55:35.193380    0'0    360'427    [1]    [1,0]    0'0    2012-02-22 18:47:39.999936
2.0p1    0    0    0    0    0    0    0    active+clean+remapped    2012-02-22 18:55:35.193485    0'0    360'428    [1]    [1,0]    0'0    2012-02-22 18:47:50.222973
2.1p0    0    0    0    0    0    0    0    active+clean+remapped    2012-02-22 18:55:35.193450    0'0    360'427    [1]    [1,0]    0'0    2012-02-22 18:47:49.387450
2.5    0    0    0    0    0    0    0    active+clean+remapped    2012-02-22 18:55:35.193599    0'0    360'428    [1]    [1,0]    0'0    2012-02-22 18:47:24.293020
1.6    1    0    0    0    205    116    116    active+clean+remapped    2012-02-22 18:55:35.193561    9'1    360'429    [1]    [1,0]    9'1    2012-02-22 18:46:49.289771
0.7    0    0    0    0    0    0    0    active+clean+remapped    2012-02-22 18:55:35.193521    0'0    360'428    [1]    [1,0]    0'0    2012-02-22 18:46:24.286268
2.4    0    0    0    0    0    0    0    active+clean+remapped    2012-02-22 18:55:35.193730    0'0    360'428    [1]    [1,0]    0'0    2012-02-22 18:47:21.292609
1.7    5    0    0    0    1395    585    585    active+clean+remapped    2012-02-22 18:55:35.193691    9'5    360'433    [1]    [1,0]    9'5    2012-02-22 18:46:50.290696
0.6    0    0    0    0    0    0    0    active+clean+remapped    2012-02-22 18:56:10.882835    0'0    360'430    [1]    [1,0]    0'0    2012-02-22 18:56:10.882807
2.7    0    0    0    0    0    0    0    active+clean+remapped    2012-02-22 18:55:35.193869    0'0    360'428    [1]    [1,0]    0'0    2012-02-22 18:47:35.347991
1.4    5    0    0    0    1724    816    816    active+clean+remapped    2012-02-22 18:55:35.193831    9'7    360'432    [1]    [1,0]    9'7    2012-02-22 18:46:37.289327
0.5    0    0    0    0    0    0    0    active+clean+remapped    2012-02-22 18:55:35.193769    0'0    360'427    [1]    [1,0]    0'0    2012-02-22 18:46:20.285813
2.6    0    0    0    0    0    0    0    active+clean+remapped    2012-02-22 18:55:35.194000    0'0    360'428    [1]    [1,0]    0'0    2012-02-22 18:47:29.293549
1.5    3    0    0    0    451    349    349    active+clean+remapped    2012-02-22 18:55:35.193960    9'3    360'431    [1]    [1,0]    9'3    2012-02-22 18:46:46.290116
0.4    0    0    0    0    0    0    0    active+clean+remapped    2012-02-22 18:56:09.883158    0'0    360'430    [1]    [1,0]    0'0    2012-02-22 18:56:09.883131
2.1    0    0    0    0    0    0    0    active+clean+remapped    2012-02-22 18:55:35.194120    0'0    360'427    [1]    [1,0]    0'0    2012-02-22 18:47:14.291870
1.2    0    0    0    0    0    0    0    active+clean+remapped    2012-02-22 18:55:35.194079    0'0    360'427    [1]    [1,0]    0'0    2012-02-22 18:46:33.287326
0.3    0    0    0    0    0    0    0    active+clean+remapped    2012-02-22 18:56:16.883499    0'0    360'429    [1]    [1,0]    0'0    2012-02-22 18:56:16.883472
2.0    0    0    0    0    0    0    0    active+clean+remapped    2012-02-22 18:55:35.194269    0'0    360'428    [1]    [1,0]    0'0    2012-02-22 18:47:02.290673
1.3    3    0    0    0    4135    352    352    active+clean+remapped    2012-02-22 18:55:35.194221    9'3    360'431    [1]    [1,0]    9'3    2012-02-22 18:46:36.288397
0.2    0    0    0    0    0    0    0    active+clean+remapped    2012-02-22 18:56:01.881836    0'0    360'430    [1]    [1,0]    0'0    2012-02-22 18:56:01.881809
2.3    0    0    0    0    0    0    0    active+clean+remapped    2012-02-22 18:55:35.194425    0'0    360'427    [1]    [1,0]    0'0    2012-02-22 18:47:17.292216
1.0    4    0    0    0    820    464    464    active+clean+remapped    2012-02-22 18:55:35.194374    4'4    360'429    [1]    [1,0]    4'4    2012-02-22 18:46:31.288174
0.1    0    0    0    0    0    0    0    active+clean+remapped    2012-02-22 18:56:14.883395    0'0    360'432    [1]    [1,0]    0'0    2012-02-22 18:56:14.883369
2.2    0    0    0    0    0    0    0    active+clean+remapped    2012-02-22 18:55:35.194576    0'0    360'428    [1]    [1,0]    0'0    2012-02-22 18:47:15.291877
1.1    0    0    0    0    0    0    0    active+clean+remapped    2012-02-22 18:55:35.194524    0'0    360'428    [1]    [1,0]    0'0    2012-02-22 18:46:32.287193
0.0    0    0    0    0    0    0    0    active+clean+remapped    2012-02-22 18:56:07.883842    0'0    360'432    [1]    [1,0]    0'0    2012-02-22 18:56:07.883813
pool 0    0    0    0    0    0    0    0
pool 1    21    0    0    0    8730    2682    2682
pool 2    0    0    0    0    0    0    0
 sum    21    0    0    0    8730    2682    2682
osdstat    kbused    kbavail    kb    hb in    hb out
0    32560448    38491776    74798272    []    [1]
1    29821900    41230324    74798272    [0]    []
 sum    62382348    79722100    149596544

I think this was uncovered due to 8e6f9ca8ac70b6aafabcce4d96e467bfb46e6505 dropping the up != acting check before going clean.

Associated revisions

Revision e8bc42ff (diff)
Added by Sage Weil over 8 years ago

osd: only set CLEAN when we are not remapped (up == acting)

If we have a temporary mapping for this PG, consider that unclean. This
makes CLEAN and REMAPPED mutually exclusive. For example, a 2 node cluster
with 2x replication and one osd marked out will make the pgs all
active+remapped, not active+clean+remapped.

Fixes: #2094
Signed-off-by: Sage Weil <>

History

#1 Updated by Sage Weil over 8 years ago

  • Priority changed from Normal to High
  • Target version set to v0.43

#2 Updated by Josh Durgin over 8 years ago

Reproduced again with stats flushing. This seems to happen every time with this configuration (maybe having only 2 osds is a factor?):

ubuntu@sepia48:~$ LD_LIBRARY_PRELOAD=/tmp/cephtest/binary/usr/local/lib /tmp/cephtest/enable-coredump /tmp/cephtest/binary/usr/local/bin/ceph-coverage /tmp/cephtest/archive/coverage /tmp/cephtest/binary/usr/local/bin/ceph -k /tmp/cephtest/ceph.keyring -c /tmp/cephtest/ceph.conf tell osd.1 flush_pg_stats
2012-02-23 10:37:44.164037 osd.1 <- [flush_pg_stats]
2012-02-23 10:37:44.516739 osd.1 -> '' (0)
ubuntu@sepia48:~$ LD_LIBRARY_PRELOAD=/tmp/cephtest/binary/usr/local/lib /tmp/cephtest/enable-coredump /tmp/cephtest/binary/usr/local/bin/ceph-coverage /tmp/cephtest/archive/coverage /tmp/cephtest/binary/usr/local/bin/ceph -k /tmp/cephtest/ceph.keyring -c /tmp/cephtest/ceph.conf pg dump 
2012-02-23 10:37:50.802456 mon <- [pg,dump]
2012-02-23 10:37:50.803405 mon.2 -> 'dumped all in format plain' (0)
version 264
last_osdmap_epoch 193
last_pg_scan 1
full_ratio 0.8
nearfull_ratio 0.9
pg_stat    objects    mip    degr    unf    bytes    log    disklog    state    state_stamp    v    reported    up    acting    last_scrub    scrub_stamp
1.0p1    0    0    0    0    0    0    0    active+clean+remapped    2012-02-23 10:36:40.606036    0'0    191'217    [1]    [1,0]    0'0    2012-02-23 10:27:37.428191
1.1p0    0    0    0    0    0    0    0    active+clean+remapped    2012-02-23 10:36:40.605966    0'0    191'219    [1]    [1,0]    0'0    2012-02-23 10:27:32.427559
0.1p1    0    0    0    0    0    0    0    active+clean+remapped    2012-02-23 10:37:11.150707    0'0    191'219    [1]    [1,0]    0'0    2012-02-23 10:37:11.150681
0.0p0    0    0    0    0    0    0    0    active+clean+remapped    2012-02-23 10:37:18.151059    0'0    191'223    [1]    [1,0]    0'0    2012-02-23 10:37:18.151033
1.1p1    0    0    0    0    0    0    0    active+clean+remapped    2012-02-23 10:36:40.606170    0'0    191'217    [1]    [1,0]    0'0    2012-02-23 10:27:39.428359
1.0p0    0    0    0    0    0    0    0    active+clean+remapped    2012-02-23 10:37:36.153874    0'0    191'223    [1]    [1,0]    0'0    2012-02-23 10:37:36.153130
0.0p1    0    0    0    0    0    0    0    active+clean+remapped    2012-02-23 10:37:01.149118    0'0    191'219    [1]    [1,0]    0'0    2012-02-23 10:37:01.149094
0.1p0    0    0    0    0    0    0    0    active+clean+remapped    2012-02-23 10:37:22.151556    0'0    191'223    [1]    [1,0]    0'0    2012-02-23 10:37:22.151530
2.1p1    0    0    0    0    0    0    0    active+clean+remapped    2012-02-23 10:36:40.606240    0'0    191'217    [1]    [1,0]    0'0    2012-02-23 10:28:04.431177
2.0p0    0    0    0    0    0    0    0    active+clean+remapped    2012-02-23 10:36:40.606205    0'0    191'219    [1]    [1,0]    0'0    2012-02-23 10:27:57.430411
2.0p1    0    0    0    0    0    0    0    active+clean+remapped    2012-02-23 10:36:40.606307    0'0    191'217    [1]    [1,0]    0'0    2012-02-23 10:27:59.430575
2.1p0    0    0    0    0    0    0    0    active+clean+remapped    2012-02-23 10:36:40.606274    0'0    191'219    [1]    [1,0]    0'0    2012-02-23 10:27:58.430480
2.5    0    0    0    0    0    0    0    active+clean+remapped    2012-02-23 10:36:40.606420    0'0    191'217    [1]    [1,0]    0'0    2012-02-23 10:27:47.429325
1.6    1    0    0    0    205    116    116    active+clean+remapped    2012-02-23 10:36:40.606385    3'1    191'219    [1]    [1,0]    3'1    2012-02-23 10:27:30.427608
0.7    0    0    0    0    0    0    0    active+clean+remapped    2012-02-23 10:36:55.148497    0'0    191'219    [1]    [1,0]    0'0    2012-02-23 10:36:55.148474
2.4    0    0    0    0    0    0    0    active+clean+remapped    2012-02-23 10:36:40.606549    0'0    191'217    [1]    [1,0]    0'0    2012-02-23 10:27:45.429023
1.7    5    0    0    0    1395    585    585    active+clean+remapped    2012-02-23 10:36:40.606508    3'5    191'223    [1]    [1,0]    3'5    2012-02-23 10:27:31.428703
0.6    0    0    0    0    0    0    0    active+clean+remapped    2012-02-23 10:36:48.148929    0'0    191'219    [1]    [1,0]    0'0    2012-02-23 10:36:48.148907
2.7    0    0    0    0    0    0    0    active+clean+remapped    2012-02-23 10:36:40.606689    0'0    191'217    [1]    [1,0]    0'0    2012-02-23 10:27:55.430189
1.4    5    0    0    0    1724    700    700    active+clean+remapped    2012-02-23 10:37:33.153973    3'6    191'234    [1]    [1,0]    3'6    2012-02-23 10:37:33.153943
0.5    0    0    0    0    0    0    0    active+clean+remapped    2012-02-23 10:37:07.149807    0'0    191'223    [1]    [1,0]    0'0    2012-02-23 10:37:07.149782
2.6    0    0    0    0    0    0    0    active+clean+remapped    2012-02-23 10:36:40.606828    0'0    191'217    [1]    [1,0]    0'0    2012-02-23 10:27:48.429413
1.5    3    0    0    0    451    349    349    active+clean+remapped    2012-02-23 10:37:44.154857    3'3    191'223    [1]    [1,0]    3'3    2012-02-23 10:37:44.154826
0.4    0    0    0    0    0    0    0    active+clean+remapped    2012-02-23 10:36:45.147370    0'0    191'219    [1]    [1,0]    0'0    2012-02-23 10:36:45.147346
2.1    0    0    0    0    0    0    0    active+clean+remapped    2012-02-23 10:36:40.606959    0'0    191'219    [1]    [1,0]    0'0    2012-02-23 10:27:41.428551
1.2    0    0    0    0    0    0    0    active+clean+remapped    2012-02-23 10:37:28.152174    0'0    191'223    [1]    [1,0]    0'0    2012-02-23 10:37:28.152150
0.3    0    0    0    0    0    0    0    active+clean+remapped    2012-02-23 10:36:59.148858    0'0    191'223    [1]    [1,0]    0'0    2012-02-23 10:36:59.148833
2.0    0    0    0    0    0    0    0    active+clean+remapped    2012-02-23 10:36:40.607104    0'0    191'217    [1]    [1,0]    0'0    2012-02-23 10:27:40.428467
1.3    3    0    0    0    4135    352    352    active+clean+remapped    2012-02-23 10:37:39.154143    3'3    191'223    [1]    [1,0]    3'3    2012-02-23 10:37:39.154115
0.2    0    0    0    0    0    0    0    active+clean+remapped    2012-02-23 10:36:42.146919    0'0    191'219    [1]    [1,0]    0'0    2012-02-23 10:36:42.146895
2.3    0    0    0    0    0    0    0    active+clean+remapped    2012-02-23 10:36:40.607262    0'0    191'219    [1]    [1,0]    0'0    2012-02-23 10:27:43.428766
1.0    4    0    0    0    820    464    464    active+clean+remapped    2012-02-23 10:37:30.153656    3'4    191'231    [1]    [1,0]    3'4    2012-02-23 10:37:30.153627
0.1    0    0    0    0    0    0    0    active+clean+remapped    2012-02-23 10:36:53.148204    0'0    191'223    [1]    [1,0]    0'0    2012-02-23 10:36:53.148179
2.2    0    0    0    0    0    0    0    active+clean+remapped    2012-02-23 10:36:40.607413    0'0    191'217    [1]    [1,0]    0'0    2012-02-23 10:27:42.428729
1.1    0    0    0    0    0    0    0    active+clean+remapped    2012-02-23 10:37:26.152511    0'0    191'219    [1]    [1,0]    0'0    2012-02-23 10:37:26.152485
0.0    0    0    0    0    0    0    0    active+clean+remapped    2012-02-23 10:36:47.147505    0'0    191'223    [1]    [1,0]    0'0    2012-02-23 10:36:47.147481
pool 0    0    0    0    0    0    0    0
pool 1    21    0    0    0    8730    2566    2566
pool 2    0    0    0    0    0    0    0
 sum    21    0    0    0    8730    2566    2566
osdstat    kbused    kbavail    kb    hb in    hb out
0    32555924    38496300    74798272    []    []
1    29815000    41237224    74798272    [0]    []
 sum    62370924    79733524    149596544

ubuntu@sepia48:~$ LD_LIBRARY_PRELOAD=/tmp/cephtest/binary/usr/local/lib /tmp/cephtest/enable-coredump /tmp/cephtest/binary/usr/local/bin/ceph-coverage /tmp/cephtest/archive/coverage /tmp/cephtest/binary/usr/local/bin/ceph -k /tmp/cephtest/ceph.keyring -c /tmp/cephtest/ceph.conf osd dump 191
2012-02-23 10:37:58.905827 mon <- [osd,dump,191]
2012-02-23 10:37:58.906632 mon.0 -> 'dumped osdmap epoch 191' (0)
epoch 191
fsid b6285e8d-d934-47cc-9fc8-cd8cf11f00b1
created 2012-02-23 10:15:58.567549
modifed 2012-02-23 10:36:37.813010
flags 

pool 0 'data' rep size 2 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 lpg_num 2 lpgp_num 2 last_change 1 owner 0 crash_replay_interval 45
pool 1 'metadata' rep size 2 crush_ruleset 1 object_hash rjenkins pg_num 8 pgp_num 8 lpg_num 2 lpgp_num 2 last_change 1 owner 0
pool 2 'rbd' rep size 2 crush_ruleset 2 object_hash rjenkins pg_num 8 pgp_num 8 lpg_num 2 lpgp_num 2 last_change 1 owner 0

max_osd 2
osd.0 down out weight 0 up_from 184 up_thru 184 down_at 187 last_clean_interval [2,183) 10.3.14.175:6800/23225 10.3.14.175:6801/23225 10.3.14.175:6802/23225 exists
osd.1 up   in  weight 1 up_from 191 up_thru 188 down_at 190 last_clean_interval [4,190) 10.3.14.174:6800/24087 10.3.14.174:6807/24087 10.3.14.174:6808/24087 exists,up

pg_temp 0.0 [1,0]
pg_temp 0.1 [1,0]
pg_temp 0.2 [1,0]
pg_temp 0.3 [1,0]
pg_temp 0.4 [1,0]
pg_temp 0.5 [1,0]
pg_temp 0.6 [1,0]
pg_temp 0.7 [1,0]
pg_temp 0.0p0 [1,0]
pg_temp 0.1p0 [1,0]
pg_temp 0.0p1 [1,0]
pg_temp 0.1p1 [1,0]
pg_temp 1.0 [1,0]
pg_temp 1.1 [1,0]
pg_temp 1.2 [1,0]
pg_temp 1.3 [1,0]
pg_temp 1.4 [1,0]
pg_temp 1.5 [1,0]
pg_temp 1.6 [1,0]
pg_temp 1.7 [1,0]
pg_temp 1.0p0 [1,0]
pg_temp 1.1p0 [1,0]
pg_temp 1.0p1 [1,0]
pg_temp 1.1p1 [1,0]
pg_temp 2.0 [1,0]
pg_temp 2.1 [1,0]
pg_temp 2.2 [1,0]
pg_temp 2.3 [1,0]
pg_temp 2.4 [1,0]
pg_temp 2.5 [1,0]
pg_temp 2.6 [1,0]
pg_temp 2.7 [1,0]
pg_temp 2.0p0 [1,0]
pg_temp 2.1p0 [1,0]
pg_temp 2.0p1 [1,0]
pg_temp 2.1p1 [1,0]

#3 Updated by Sage Weil over 8 years ago

  • Status changed from New to Resolved

making remapped and clean mutually exclusive. e8bc42ff435e5648b88b818775d8fa47989af5dc

Also available in: Atom PDF