Ceph RADOS - Bug #46180: qa: Scrubbing terminated -- not all pgs were active and clean.https://tracker.ceph.com/issues/46180?journal_id=1690452020-06-25T15:15:23ZSridhar Seshasayee
<ul></ul><p>Saw the same error during this run:<br /><a class="external" href="http://pulpito.ceph.com/sseshasa-2020-06-24_17:46:09-rados-wip-sseshasa-testing-2020-06-24-1858-distro-basic-smithi/">http://pulpito.ceph.com/sseshasa-2020-06-24_17:46:09-rados-wip-sseshasa-testing-2020-06-24-1858-distro-basic-smithi/</a></p>
<p>Job ID: 5176265</p>
<p>Failure Reason:<br />Scrubbing terminated -- not all pgs were active and clean.</p> RADOS - Bug #46180: qa: Scrubbing terminated -- not all pgs were active and clean.https://tracker.ceph.com/issues/46180?journal_id=1690552020-06-25T17:29:35ZPatrick Donnellypdonnell@redhat.com
<ul><li><strong>Duplicated by</strong> <i><a class="issue tracker-1 status-10 priority-6 priority-high2 closed" href="/issues/46211">Bug #46211</a>: qa: pools stuck in creating</i> added</li></ul> RADOS - Bug #46180: qa: Scrubbing terminated -- not all pgs were active and clean.https://tracker.ceph.com/issues/46180?journal_id=1690572020-06-25T17:29:58ZPatrick Donnellypdonnell@redhat.com
<ul><li><strong>Priority</strong> changed from <i>Normal</i> to <i>Immediate</i></li><li><strong>Target version</strong> set to <i>v16.0.0</i></li></ul> RADOS - Bug #46180: qa: Scrubbing terminated -- not all pgs were active and clean.https://tracker.ceph.com/issues/46180?journal_id=1690652020-06-25T19:40:50ZNeha Ojhanojha@redhat.com
<ul></ul><p>The common factor in all of these is that the tests fail while running the ceph task; no thrashing or anything else is going on at that point.</p>
<p>From /a/sseshasa-2020-06-24_17:46:09-rados-wip-sseshasa-testing-2020-06-24-1858-distro-basic-smithi/5176265</p>
<p>This is where we start seeing PGs getting stuck:<br /><pre>
2020-06-24T21:05:50.958+0000 7fa42b7ce700 10 mgr.server operator() 8 pgs: 2 creating+peering, 6 active+clean; 0 B data, 320 KiB used, 297 GiB / 300 GiB avail
</pre></p>
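<p>For anyone scanning mgr logs for the same symptom, here is a minimal sketch that pulls the PG state counts out of a summary line like the one above and checks whether every PG is active+clean. The function names are made up for illustration, and the regular expression assumes the summary always has the "N pgs: count state, ...;" shape shown in this log, which may not hold for every Ceph version.</p>

```python
import re


def parse_pg_summary(line):
    """Return (total, {state: count}) from an 'N pgs: ...;' summary, or None.

    Hypothetical helper: it assumes the layout seen in the mgr log line
    quoted above and is not part of any Ceph tooling.
    """
    m = re.search(r"(\d+) pgs: ([^;]+);", line)
    if m is None:
        return None
    total = int(m.group(1))
    states = {}
    for part in m.group(2).split(","):
        count, state = part.strip().split(" ", 1)
        states[state] = int(count)
    return total, states


def all_active_clean(line):
    """True only if every PG in the summary is reported as active+clean."""
    parsed = parse_pg_summary(line)
    if parsed is None:
        return False
    total, states = parsed
    return states.get("active+clean", 0) == total


LINE = (
    "2020-06-24T21:05:50.958+0000 7fa42b7ce700 10 mgr.server operator() "
    "8 pgs: 2 creating+peering, 6 active+clean; 0 B data, 320 KiB used, "
    "297 GiB / 300 GiB avail"
)

if __name__ == "__main__":
    print(parse_pg_summary(LINE))
    print(all_active_clean(LINE))
```

<p>Run against the line above, the two creating+peering PGs make the check fail, which matches the "not all pgs were active and clean" failure reason.</p>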
<p>Just before that, there seem to be problems in the msgr, which lead to reap_dead start:</p>
<pre>
2020-06-24T21:05:50.181+0000 7fa442f31700 1 --2- [v2:172.21.15.205:6824/19930,v1:172.21.15.205:6825/19930] >> 172.21.15.205:0/627840989 conn(0x5647f6eb8000 0x5647f6ed3900 secure :-1 s=READY pgs=3 cs=0 l=1 rev1=1 rx=0x5647f6eb1740 tx=0x5647f6dd2120).ready entity=client.4179 client_cookie=0 server_cookie=0 in_seq=0 out_seq=0
2020-06-24T21:05:50.199+0000 7fa4227c8700 1 -- [v2:172.21.15.205:6824/19930,v1:172.21.15.205:6825/19930] <== osd.0 v2:172.21.15.205:6806/20452 6 ==== pg_stats(1 pgs tid 0 v 0) v2 ==== 1199+0+0 (secure 0 0 0) 0x5647f6de0900 con 0x5647f6d47400
2020-06-24T21:05:50.200+0000 7fa442f31700 1 -- [v2:172.21.15.205:6824/19930,v1:172.21.15.205:6825/19930] >> 172.21.15.205:0/627840989 conn(0x5647f6eb8000 msgr2=0x5647f6ed3900 secure :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_bulk peer close file descriptor 42
2020-06-24T21:05:50.200+0000 7fa442f31700 1 -- [v2:172.21.15.205:6824/19930,v1:172.21.15.205:6825/19930] >> 172.21.15.205:0/627840989 conn(0x5647f6eb8000 msgr2=0x5647f6ed3900 secure :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_until read failed
2020-06-24T21:05:50.200+0000 7fa442f31700 1 --2- [v2:172.21.15.205:6824/19930,v1:172.21.15.205:6825/19930] >> 172.21.15.205:0/627840989 conn(0x5647f6eb8000 0x5647f6ed3900 secure :-1 s=READY pgs=3 cs=0 l=1 rev1=1 rx=0x5647f6eb1740 tx=0x5647f6dd2120).handle_read_frame_preamble_main read frame preamble failed r=-1 ((1) Operation not permitted)
2020-06-24T21:05:50.200+0000 7fa442f31700 1 --2- [v2:172.21.15.205:6824/19930,v1:172.21.15.205:6825/19930] >> 172.21.15.205:0/627840989 conn(0x5647f6eb8000 0x5647f6ed3900 secure :-1 s=READY pgs=3 cs=0 l=1 rev1=1 rx=0x5647f6eb1740 tx=0x5647f6dd2120).stop
2020-06-24T21:05:50.200+0000 7fa442f31700 1 -- [v2:172.21.15.205:6824/19930,v1:172.21.15.205:6825/19930] reap_dead start
</pre> RADOS - Bug #46180: qa: Scrubbing terminated -- not all pgs were active and clean.https://tracker.ceph.com/issues/46180?journal_id=1690772020-06-25T22:43:11ZSebastian Wagner
<ul><li><strong>Related to</strong> <i><a class="issue tracker-1 status-10 priority-6 priority-high2 closed" href="/issues/46178">Bug #46178</a>: slow request osd_op(... (undecoded) ondisk+retry+read+ignore_overlay+known_if_redirected e49) </i> added</li></ul> RADOS - Bug #46180: qa: Scrubbing terminated -- not all pgs were active and clean.https://tracker.ceph.com/issues/46180?journal_id=1690892020-06-26T05:04:46ZIlya Dryomov
<ul><li><strong>Category</strong> set to <i>Correctness/Safety</i></li><li><strong>Assignee</strong> set to <i>Ilya Dryomov</i></li><li><strong>Component(RADOS)</strong> <i>Messenger</i> added</li></ul><p>This is a msgr2.1 issue.</p> RADOS - Bug #46180: qa: Scrubbing terminated -- not all pgs were active and clean.https://tracker.ceph.com/issues/46180?journal_id=1690922020-06-26T06:50:00ZIlya Dryomov
<ul></ul><p>I think it has to do with reconnect handling and how connections are reused.</p>
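<p>To make the concern about connection reuse concrete, here is a toy model of how a per-connection protocol-revision flag could be dropped when a connection object is reset for reuse. Everything in it is hypothetical: the class names, the reset methods and the one-byte revision marker are invented for illustration and do not reflect how ProtocolV2 or its frame format is actually implemented.</p>

```python
class ToyFrameAssembler:
    """Toy stand-in for a frame assembler; NOT Ceph's FrameAssembler."""

    def __init__(self, is_rev1=False):
        self.is_rev1 = is_rev1

    def assemble(self, payload):
        # Invented one-byte marker: 0x01 for a "2.1" frame, 0x00 for "2.0".
        marker = b"\x01" if self.is_rev1 else b"\x00"
        return marker + payload


class ToyConnection:
    """Toy connection that is reset and reused across a reconnect."""

    def __init__(self):
        self.assembler = ToyFrameAssembler()

    def negotiate(self, is_rev1):
        self.assembler.is_rev1 = is_rev1

    def reset_losing_state(self):
        # Hypothetical bug: a fresh assembler forgets the negotiated revision.
        self.assembler = ToyFrameAssembler()

    def reset_keeping_state(self):
        # Hypothetical fix: carry the negotiated revision across the reset.
        self.assembler = ToyFrameAssembler(self.assembler.is_rev1)

    def send(self, payload):
        return self.assembler.assemble(payload)


def peer_accepts(frame, expects_rev1):
    """The peer rejects any frame whose marker does not match its revision."""
    return frame[:1] == (b"\x01" if expects_rev1 else b"\x00")


if __name__ == "__main__":
    conn = ToyConnection()
    conn.negotiate(is_rev1=True)
    print(peer_accepts(conn.send(b"ping"), expects_rev1=True))
    conn.reset_losing_state()
    print(peer_accepts(conn.send(b"ping"), expects_rev1=True))
```

<p>In this model the peer keeps expecting the negotiated revision, so the first frame sent after a state-losing reset is rejected, while a reset that carries the flag over keeps working.</p>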
<p>This part of ProtocolV2 is pretty fragile, as evidenced by the steadily accumulating workarounds for invalid memory use issues during msgr2.0 development and after (the most recent just in March). Most likely, the FrameAssembler is_rev1 state gets lost, so a 2.0 frame ends up being assembled while the peer is expecting a 2.1 frame. I'll confirm and put out a fix ASAP.</p> RADOS - Bug #46180: qa: Scrubbing terminated -- not all pgs were active and clean.https://tracker.ceph.com/issues/46180?journal_id=1691532020-06-26T17:12:12ZNeha Ojhanojha@redhat.com
<ul><li><strong>Related to</strong> <i><a class="issue tracker-1 status-10 priority-4 priority-default closed" href="/issues/46179">Bug #46179</a>: Health check failed: Reduced data availability: PG_AVAILABILITY</i> added</li></ul> RADOS - Bug #46180: qa: Scrubbing terminated -- not all pgs were active and clean.https://tracker.ceph.com/issues/46180?journal_id=1691702020-06-26T17:40:16ZNeha Ojhanojha@redhat.com
<ul><li><strong>Related to</strong> <i><a class="issue tracker-1 status-10 priority-4 priority-default closed" href="/issues/46225">Bug #46225</a>: Health check failed: 1 osds down (OSD_DOWN)</i> added</li></ul> RADOS - Bug #46180: qa: Scrubbing terminated -- not all pgs were active and clean.https://tracker.ceph.com/issues/46180?journal_id=1691742020-06-26T18:21:05ZNeha Ojhanojha@redhat.com
<ul></ul><p>Here's a reliable reproducer for the issue:</p>
<p>-s rados/singleton-nomsgr -c master --filter 'all/health-warnings rados' -N 20</p>
<p><a class="external" href="https://pulpito.ceph.com/nojha-2020-06-26_17:24:37-rados:singleton-nomsgr-master-distro-basic-smithi/">https://pulpito.ceph.com/nojha-2020-06-26_17:24:37-rados:singleton-nomsgr-master-distro-basic-smithi/</a></p> RADOS - Bug #46180: qa: Scrubbing terminated -- not all pgs were active and clean.https://tracker.ceph.com/issues/46180?journal_id=1691952020-06-28T10:45:09ZIlya Dryomov
<ul><li><strong>Status</strong> changed from <i>New</i> to <i>Fix Under Review</i></li><li><strong>Pull request ID</strong> set to <i>35816</i></li></ul> RADOS - Bug #46180: qa: Scrubbing terminated -- not all pgs were active and clean.https://tracker.ceph.com/issues/46180?journal_id=1695652020-07-02T08:56:29ZIlya Dryomov
<ul><li><strong>Status</strong> changed from <i>Fix Under Review</i> to <i>Resolved</i></li></ul><p>Will be cherry-picked into <a class="external" href="https://github.com/ceph/ceph/pull/35720">https://github.com/ceph/ceph/pull/35720</a> and <a class="external" href="https://github.com/ceph/ceph/pull/35733">https://github.com/ceph/ceph/pull/35733</a>.</p> RADOS - Bug #46180: qa: Scrubbing terminated -- not all pgs were active and clean.https://tracker.ceph.com/issues/46180?journal_id=1695702020-07-02T09:04:17ZIlya Dryomov
<ul><li><strong>Related to</strong> deleted (<i><a class="issue tracker-1 status-10 priority-6 priority-high2 closed" href="/issues/46178">Bug #46178</a>: slow request osd_op(... (undecoded) ondisk+retry+read+ignore_overlay+known_if_redirected e49) </i>)</li></ul> RADOS - Bug #46180: qa: Scrubbing terminated -- not all pgs were active and clean.https://tracker.ceph.com/issues/46180?journal_id=1695722020-07-02T09:04:20ZIlya Dryomov
<ul><li><strong>Duplicated by</strong> <i><a class="issue tracker-1 status-10 priority-6 priority-high2 closed" href="/issues/46178">Bug #46178</a>: slow request osd_op(... (undecoded) ondisk+retry+read+ignore_overlay+known_if_redirected e49) </i> added</li></ul> RADOS - Bug #46180: qa: Scrubbing terminated -- not all pgs were active and clean.https://tracker.ceph.com/issues/46180?journal_id=1696112020-07-02T16:35:59ZNeha Ojhanojha@redhat.com
<ul><li><strong>Related to</strong> deleted (<i><a class="issue tracker-1 status-10 priority-4 priority-default closed" href="/issues/46225">Bug #46225</a>: Health check failed: 1 osds down (OSD_DOWN)</i>)</li></ul> RADOS - Bug #46180: qa: Scrubbing terminated -- not all pgs were active and clean.https://tracker.ceph.com/issues/46180?journal_id=1696132020-07-02T16:36:00ZNeha Ojhanojha@redhat.com
<ul><li><strong>Duplicated by</strong> <i><a class="issue tracker-1 status-10 priority-4 priority-default closed" href="/issues/46225">Bug #46225</a>: Health check failed: 1 osds down (OSD_DOWN)</i> added</li></ul> RADOS - Bug #46180: qa: Scrubbing terminated -- not all pgs were active and clean.https://tracker.ceph.com/issues/46180?journal_id=1696162020-07-02T16:37:23ZNeha Ojhanojha@redhat.com
<ul><li><strong>Related to</strong> deleted (<i><a class="issue tracker-1 status-10 priority-4 priority-default closed" href="/issues/46179">Bug #46179</a>: Health check failed: Reduced data availability: PG_AVAILABILITY</i>)</li></ul> RADOS - Bug #46180: qa: Scrubbing terminated -- not all pgs were active and clean.https://tracker.ceph.com/issues/46180?journal_id=1696182020-07-02T16:37:37ZNeha Ojhanojha@redhat.com
<ul><li><strong>Related to</strong> <i><a class="issue tracker-1 status-10 priority-4 priority-default closed" href="/issues/46179">Bug #46179</a>: Health check failed: Reduced data availability: PG_AVAILABILITY</i> added</li></ul>