<p><strong>Ceph RADOS - Bug #22113: osd: pg limit on replica test failure</strong><br />Note by Kefu Chai (tchaikov@gmail.com) on 2017-11-14T08:51:33Z (<a class="external" href="https://tracker.ceph.com/issues/22113?journal_id=102213">journal #102213</a>)</p>
<p>The reason osd.2 did not finish creating pg 1.0 is that after unique_pool_14 (pool 15) was removed by mon.a, the monitor did not send the updated osdmap to osd.2, which was the primary OSD of pg 14.0. Instead, it sent osdmap.61 to a random OSD: osd.1.</p>
<p>osd.2 did not get the osdmap (61..) until after wait_for_clean timed out.</p>
<p>osd.2<br /><pre>
2017-11-10 18:07:02.551 7fc704b83700 5 osd.2 60 maybe_wait_for_max_pg withhold creation of pg 1.0: 1 >= 1
2017-11-10 18:21:29.675 7fc705b85700 20 osd.2 60 _dispatch 0x55ef39334080 osd_map(61..62 src has 1..62) v4
2017-11-10 18:21:29.675 7fc705b85700 3 osd.2 60 handle_osd_map epochs [61,62], i have 60, src has [1,62]
2017-11-10 18:21:29.676 7fc70cbbd700 10 osd.2 60 _committed_osd_maps 61..62
2017-11-10 18:21:29.676 7fc70cbbd700 7 osd.2 62 consume_map version 62
</pre></p>
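<p>The "withhold creation of pg 1.0: 1 &gt;= 1" line above is the per-OSD pg cap refusing to instantiate a new pg while the OSD is at its limit. A minimal sketch of that decision, with hypothetical names (<code>PgLimiter</code>, <code>max_pgs_per_osd</code>) rather than Ceph's actual code:</p>

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch of the withhold check seen in the osd.2 log above.
// Names are illustrative; this is not Ceph's implementation.
struct PgLimiter {
  uint64_t max_pgs_per_osd;           // configured cap (1 in the failing test)
  uint64_t num_pgs = 0;               // pgs this OSD currently hosts
  std::vector<std::string> withheld;  // creations parked until a slot frees up

  // Mirrors the "withhold creation of pg X: N >= M" decision:
  // refuse to instantiate a new pg while the cap is reached.
  bool maybe_wait_for_max_pg(const std::string& pgid) {
    if (num_pgs >= max_pgs_per_osd) {
      withheld.push_back(pgid);  // deferred until an osdmap removes a pool/pg
      return true;               // creation withheld
    }
    ++num_pgs;
    return false;                // creation proceeds
  }

  // Called when a newer osdmap deletes a pg (e.g. pool removal): free the
  // slot and retry any withheld creations.
  void on_pg_removed() {
    if (num_pgs > 0) --num_pgs;
    auto pending = std::move(withheld);
    withheld.clear();
    for (auto& pgid : pending)
      maybe_wait_for_max_pg(pgid);
  }
};
```

<p>The failure mode in this bug is the retry path: <code>on_pg_removed</code> only fires once the OSD sees the osdmap that deleted the pool, and osd.2 never received that map in time.</p>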
<p>mon.a<br /><pre>
2017-11-10 18:07:17.569 7fa0e18f7700 10 mon.a@0(leader).osd e60 _prepare_remove_pool 15
..
2017-11-10 18:07:17.579 7fa0e40fc700 10 mon.a@0(leader).osd e60 encode_pending e 61
2017-11-10 18:07:17.581 7fa0dd0ee700 1 mon.a@0(leader).osd e61 e61: 4 total, 4 up, 3 in
2017-11-10 18:07:17.581 7fa0dd0ee700 10 mon.a@0(leader).osd e61 check_osdmap_subs
2017-11-10 18:07:17.581 7fa0dd0ee700 10 mon.a@0(leader).osd e61 check_osdmap_sub 0x562ae75a6d00 next 61 (onetime)
2017-11-10 18:07:17.581 7fa0dd0ee700 5 mon.a@0(leader).osd e61 send_incremental [61..61] to client.4097 172.21.15.36:0/1016650811
2017-11-10 18:07:17.581 7fa0dd0ee700 10 mon.a@0(leader).osd e61 build_incremental [61..61]
2017-11-10 18:07:17.581 7fa0dd0ee700 20 mon.a@0(leader).osd e61 build_incremental inc 61 220 bytes
2017-11-10 18:07:17.581 7fa0dd0ee700 1 -- 172.21.15.36:6789/0 --> 172.21.15.36:0/1016650811 -- osd_map(61..61 src has 1..61) v4 -- 0x562ae7458a00 con 0
2017-11-10 18:07:17.581 7fa0dd0ee700 20 mon.a@0(leader).osd e61 check_pg_creates_sub .. osd.2 172.21.15.36:6805/23262
2017-11-10 18:07:17.581 7fa0dd0ee700 10 mon.a@0(leader).osd e61 committed, telling random osd.1 172.21.15.36:6801/23261 all about it
</pre></p>
<p>I think the current design normally works fine: the objecter subscribes to the monitor continuously once it gets a fullmap, and if an OSD runs into a request that requires a newer osdmap, it requests a new map from the monitor, so no harm is done there either. Even if an OSD is out of sync and some of the pg(s) it serves no longer exist, that is fine: the pg will get removed eventually, once the OSD receives the updated osdmap. It is just a matter of time.</p>
<p>But this design leads to a problem once free pg slots become a scarce resource: we need to subscribe to the monitor continuously while there is any pending pg, and stop doing so once all pending pgs are created.</p>
<p>Note by Kefu Chai (tchaikov@gmail.com) on 2017-11-14T10:48:27Z (<a class="external" href="https://tracker.ceph.com/issues/22113?journal_id=102216">journal #102216</a>)</p>
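<p>The continuous-subscription idea from the previous note could look roughly like the following. This is a hypothetical sketch (names like <code>MapSubscriber</code> and <code>tick</code> are invented here), not the code from the actual fix:</p>

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch: keep renewing the osdmap subscription with the
// monitor while any pg creation is withheld, and drop back to on-demand
// fetching once the backlog drains. Not Ceph's actual implementation.
struct MapSubscriber {
  uint64_t pending_creates = 0;  // pg creations currently withheld
  bool subscribed = false;       // continuous "osdmap" sub active at the mon

  // Run after consuming each new osdmap epoch.
  void tick() {
    if (pending_creates > 0) {
      // Without this, the mon may only tell a *random* OSD about the next
      // epoch, and this OSD would never learn that slots were freed.
      subscribed = true;   // (re)request a continuous osdmap subscription
    } else {
      subscribed = false;  // backlog drained: stop the continuous sub
    }
  }
};
```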
<ul><li><strong>Status</strong> changed from <i>12</i> to <i>Fix Under Review</i></li></ul><p><a class="external" href="https://github.com/ceph/ceph/pull/18916">https://github.com/ceph/ceph/pull/18916</a></p>
<p>Note by Kefu Chai (tchaikov@gmail.com) on 2017-11-20T06:42:53Z (<a class="external" href="https://tracker.ceph.com/issues/22113?journal_id=102416">journal #102416</a>)</p>
<ul><li><strong>Status</strong> changed from <i>Fix Under Review</i> to <i>Pending Backport</i></li><li><strong>Backport</strong> set to <i>luminous</i></li></ul>
<p>Note by Nathan Cutler (ncutler@suse.cz) on 2017-11-20T11:05:13Z (<a class="external" href="https://tracker.ceph.com/issues/22113?journal_id=102437">journal #102437</a>)</p>
<ul><li><strong>Copied to</strong> <i><a class="issue tracker-9 status-3 priority-4 priority-default closed" href="/issues/22176">Backport #22176</a>: luminous: osd: pg limit on replica test failure</i> added</li></ul>
<p>Note by Nathan Cutler (ncutler@suse.cz) on 2018-02-01T23:24:38Z (<a class="external" href="https://tracker.ceph.com/issues/22113?journal_id=106500">journal #106500</a>)</p>
<ul><li><strong>Status</strong> changed from <i>Pending Backport</i> to <i>Resolved</i></li></ul>