https://tracker.ceph.com/
https://tracker.ceph.com/favicon.ico
2018-04-17T00:16:54Z
Ceph
RADOS - Bug #23763: upgrade: bad pg num and stale health status in mixed luminous/mimic cluster
https://tracker.ceph.com/issues/23763?journal_id=111374
2018-04-17T00:16:54Z
Yuri Weinstein
yweinste@redhat.com
<ul><li><strong>ceph-qa-suite</strong> <i>upgrade/luminous-x</i> added</li></ul>
RADOS - Bug #23763: upgrade: bad pg num and stale health status in mixed luminous/mimic cluster
https://tracker.ceph.com/issues/23763?journal_id=111457
2018-04-18T15:34:06Z
Kefu Chai
tchaikov@gmail.com
<ul></ul><p>The pgs reported as "creating" or "unknown" by "pg dump" were active+clean after 2018-04-16 22:47, so the output of the last "pg dump" was stale.</p>
<p>But the pg_num of test-rados-api-ovh086-65141-1 is odd: it is 11, where the default is 8.</p>
<p>@Yuri, is this issue reproducible?</p>
RADOS - Bug #23763: upgrade: bad pg num and stale health status in mixed luminous/mimic cluster
https://tracker.ceph.com/issues/23763?journal_id=111552
2018-04-20T00:32:24Z
Josh Durgin
<ul></ul><p>Yuri reproduced the bad pg_num in 1 of 2 runs:</p>
<pre>
$ date
Fri Apr 20 00:29:29 UTC 2018
[ubuntu@smithi037 ~]$ sudo ceph osd dump
epoch 1203
fsid e7d2fd7c-dcf5-44bf-8f03-4605e82192b2
created 2018-04-19 23:43:04.465761
modified 2018-04-20 00:28:18.092987
flags sortbitwise,recovery_deletes,purged_snapdirs
crush_version 198
full_ratio 0.95
backfillfull_ratio 0.9
nearfull_ratio 0.85
require_min_compat_client jewel
min_compat_client jewel
require_osd_release luminous
pool 1 'rbd' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 13 flags hashpspool stripe_width 0 application rbd
pool 2 'cephfs_metadata' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 6 pgp_num 6 last_change 17 flags hashpspool stripe_width 0 application cephfs
pool 3 'cephfs_data' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 6 pgp_num 6 last_change 17 flags hashpspool stripe_width 0 application cephfs
pool 256 '.rgw.buckets' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 658 flags hashpspool stripe_width 0 application rgw
pool 257 '.rgw.root' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 660 flags hashpspool stripe_width 0 application rgw
pool 258 'default.rgw.control' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 662 flags hashpspool stripe_width 0 application rgw
pool 259 'default.rgw.meta' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 664 flags hashpspool stripe_width 0 application rgw
pool 260 'default.rgw.log' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 666 flags hashpspool stripe_width 0 application rgw
pool 261 'default.rgw.buckets.index' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 669 flags hashpspool stripe_width 0 application rgw
pool 262 'default.rgw.buckets.data' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 671 flags hashpspool stripe_width 0 application rgw
pool 263 'default.rgw.buckets.non-ec' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 674 flags hashpspool stripe_width 0 application rgw
pool 267 'test-rados-api-smithi106-855-1' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 11 pgp_num 8 last_change 730 lfor 0/730 flags hashpspool stripe_width 0 application rados
max_osd 6
osd.0 up in weight 1 up_from 21 up_thru 1196 down_at 20 last_clean_interval [11,19) 172.21.15.37:6805/2165 172.21.15.37:6806/2165 172.21.15.37:6807/2165 172.21.15.37:6808/2165 exists,up a567f23c-930e-4ea7-8eb9-cfd7c51e0717
osd.1 up in weight 1 up_from 24 up_thru 1161 down_at 23 last_clean_interval [11,22) 172.21.15.37:6801/3613 172.21.15.37:6802/3613 172.21.15.37:6803/3613 172.21.15.37:6804/3613 exists,up e23ed5b6-4d8b-4205-848c-312fc7f954d0
osd.2 up in weight 1 up_from 28 up_thru 1183 down_at 26 last_clean_interval [11,25) 172.21.15.37:6809/5166 172.21.15.37:6810/5166 172.21.15.37:6811/5166 172.21.15.37:6812/5166 exists,up 04ee2e4c-3946-4cf0-a041-91cc491756f7
osd.3 up in weight 1 up_from 32 up_thru 1196 down_at 30 last_clean_interval [11,29) 172.21.15.170:6800/1675 172.21.15.170:6801/1675 172.21.15.170:6802/1675 172.21.15.170:6803/1675 exists,up c21c5b0c-c592-4fad-b964-0a65a05882ee
osd.4 up in weight 1 up_from 35 up_thru 1196 down_at 34 last_clean_interval [11,33) 172.21.15.170:6808/1797 172.21.15.170:6809/1797 172.21.15.170:6810/1797 172.21.15.170:6811/1797 exists,up 6c0f9144-e18d-47f4-9cea-2639daa3c7d1
osd.5 up in weight 1 up_from 38 up_thru 1196 down_at 37 last_clean_interval [11,36) 172.21.15.170:6804/1919 172.21.15.170:6805/1919 172.21.15.170:6806/1919 172.21.15.170:6807/1919 exists,up 1f4f2110-f34c-49b4-9188-2941ad725c7c
blacklist 172.21.15.37:6813/850180070 expires 2018-04-21 00:18:42.823676
</pre>
<p>The cluster is still up, with logs in teuthology:~yuriw/logs/2413738 - and on machines smithi037 smithi106 smithi170</p>
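<p>As an aside (my sketch, not part of the ticket): pool 267 in the dump above is the one that stands out, because its pg_num (11) and pgp_num (8) disagree. A rough diagnostic that scans the plain-text pool lines of <code>ceph osd dump</code> for such mismatches, assuming the pool-line format shown above:</p>
<pre>
import re

# Two pool lines copied from the dump above; in practice this would be
# the full `ceph osd dump` output.
DUMP = """\
pool 1 'rbd' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 13 flags hashpspool stripe_width 0 application rbd
pool 267 'test-rados-api-smithi106-855-1' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 11 pgp_num 8 last_change 730 lfor 0/730 flags hashpspool stripe_width 0 application rados
"""

POOL_RE = re.compile(r"pool \d+ '([^']+)' .*?pg_num (\d+) pgp_num (\d+)")

# Collect pools whose pg_num and pgp_num differ.
mismatched = [
    (m.group(1), int(m.group(2)), int(m.group(3)))
    for m in (POOL_RE.search(line) for line in DUMP.splitlines())
    if m and m.group(2) != m.group(3)
]
print(mismatched)  # [('test-rados-api-smithi106-855-1', 11, 8)]
</pre>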
RADOS - Bug #23763: upgrade: bad pg num and stale health status in mixed luminous/mimic cluster
https://tracker.ceph.com/issues/23763?journal_id=111573
2018-04-20T10:09:47Z
Kefu Chai
tchaikov@gmail.com
<ul></ul><p>I think pg_num = 11 is set by LibRadosList.EnumerateObjects:</p>
<pre>
// Ensure a non-power-of-two PG count to avoid only
// touching the easy path.
std::string err_str = set_pg_num(&s_cluster, pool_name, 11);
ASSERT_TRUE(err_str.empty());
</pre>
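<p>For background (a sketch of mine, not from the ticket): Ceph maps object hashes to PGs with a stable modulo, so a pg_num that is not a power of two takes a fold-back branch rather than the mask-only fast path, which is the "hard path" the test comment refers to. A rough Python rendering, modeled on Ceph's ceph_stable_mod():</p>
<pre>
def ceph_stable_mod(x, b, bmask):
    # Map hash x into b buckets; bmask is (next power of two >= b) - 1.
    # Masked values that fall past b are folded back with a smaller
    # mask, so buckets only split (never reshuffle) as b grows.
    if (x & bmask) < b:
        return x & bmask
    return x & (bmask >> 1)

def pg_for_hash(x, pg_num):
    # Compute the mask for pg_num, then apply the stable modulo.
    bmask = 1
    while bmask < pg_num:
        bmask <<= 1
    return ceph_stable_mod(x, pg_num, bmask - 1)

# With pg_num = 11 (bmask = 15), every hash still lands in PGs 0..10,
# but hashes with (x & 15) >= 11 go through the fold-back branch:
pgs = sorted({pg_for_hash(h, 11) for h in range(256)})
print(pgs)  # 0 through 10, nothing >= 11
</pre>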
RADOS - Bug #23763: upgrade: bad pg num and stale health status in mixed luminous/mimic cluster
https://tracker.ceph.com/issues/23763?journal_id=111574
2018-04-20T11:11:54Z
Kefu Chai
tchaikov@gmail.com
<ul><li><strong>Category</strong> changed from <i>Correctness/Safety</i> to <i>Tests</i></li><li><strong>Status</strong> changed from <i>New</i> to <i>Fix Under Review</i></li><li><strong>Assignee</strong> set to <i>Kefu Chai</i></li><li><strong>Backport</strong> set to <i>luminous</i></li></ul><p><a class="external" href="https://github.com/ceph/ceph/pull/21555">https://github.com/ceph/ceph/pull/21555</a></p>
RADOS - Bug #23763: upgrade: bad pg num and stale health status in mixed luminous/mimic cluster
https://tracker.ceph.com/issues/23763?journal_id=111580
2018-04-20T12:55:01Z
Kefu Chai
tchaikov@gmail.com
<ul><li><strong>Copied to</strong> <i><a class="issue tracker-9 status-3 priority-6 priority-high2 closed" href="/issues/23808">Backport #23808</a>: luminous: upgrade: bad pg num and stale health status in mixed luminous/mimic cluster</i> added</li></ul>
RADOS - Bug #23763: upgrade: bad pg num and stale health status in mixed luminous/mimic cluster
https://tracker.ceph.com/issues/23763?journal_id=111726
2018-04-24T15:45:37Z
Kefu Chai
tchaikov@gmail.com
<ul><li><strong>Status</strong> changed from <i>Fix Under Review</i> to <i>Pending Backport</i></li></ul>
RADOS - Bug #23763: upgrade: bad pg num and stale health status in mixed luminous/mimic cluster
https://tracker.ceph.com/issues/23763?journal_id=113449
2018-05-17T15:52:18Z
Kefu Chai
tchaikov@gmail.com
<ul><li><strong>Status</strong> changed from <i>Pending Backport</i> to <i>Resolved</i></li></ul>