Ceph - Bug #18960: PG stuck peering after host reboot https://tracker.ceph.com/issues/18960?journal_id=86162 2017-02-16T13:53:35Z George Vasilakakos george.vasilakakos@stfc.ac.uk
<ul></ul><p>Upload ID from ceph-post-file:</p>
<p>84bf663c-cfa4-48d5-9836-0fda8a4d8990</p> Ceph - Bug #18960: PG stuck peering after host reboothttps://tracker.ceph.com/issues/18960?journal_id=863122017-02-19T00:00:59ZBrad Hubbardbhubbard@redhat.com
<ul><li><strong>Category</strong> set to <i>OSD</i></li><li><strong>Assignee</strong> set to <i>Brad Hubbard</i></li></ul><p>The debug log shows the following.</p>
<p>2017-02-15 18:25:13.649706 We transition to primary and begin peering.</p>
<p>2017-02-15 18:25:13.649706 7fb9c1745700 1 osd.595 pg_epoch: 20826 pg[1.323s0( v 16806'62416 (16717'58985,16806'62416] local-les=16806 n=2342 ec=764 les/c/f 16806/16806/0 20823/20824/20824) [595,1391,240,127,937,362,267,320,7,634,716] r=0 lpr=20824 pi=16805-20823/250 crt=16806'62414 lcod 0'0 mlcod 0'0 inactive NIBBLEWISE] state<Start>: transitioning to Primary</p>
<p>18:25:15.342591 we respond to a new map which shows osd.595 (the current and primary OSD) down. This causes us to leave peering and reset.</p>
<p>2017-02-15 18:25:15.342591 7fb9c1f46700 10 osd.595 pg_epoch: 20827 pg[1.323s0( v 16806'62416 (16717'58985,16806'62416] local-les=16806 n=2342 ec=764 les/c/f 16806/16806/0 20823/20824/20824) [595,1391,240,127,937,362,267,320,7,634,716] r=0 lpr=20824 pi=16805-20823/250 crt=16806'62414 lcod 0'0 mlcod 0'0 peering NIBBLEWISE] state<Started/Primary/Peering>: Peering advmap<br />2017-02-15 18:25:15.342607 7fb9c1f46700 10 osd.595 pg_epoch: 20827 pg[1.323s0( v 16806'62416 (16717'58985,16806'62416] local-les=16806 n=2342 ec=764 les/c/f 16806/16806/0 20823/20824/20824) [595,1391,240,127,937,362,267,320,7,634,716] r=0 lpr=20824 pi=16805-20823/250 crt=16806'62414 lcod 0'0 mlcod 0'0 peering NIBBLEWISE] PriorSet: affected_by_map osd.595 now down<br />2017-02-15 18:25:15.342637 7fb9c1f46700 5 osd.595 pg_epoch: 20827 pg[1.323s0( v 16806'62416 (16717'58985,16806'62416] local-les=16806 n=2342 ec=764 les/c/f 16806/16806/0 20823/20824/20824) [595,1391,240,127,937,362,267,320,7,634,716] r=0 lpr=20824 pi=16805-20823/250 crt=16806'62414 lcod 0'0 mlcod 0'0 peering NIBBLEWISE] exit Started/Primary/Peering/GetInfo 1.692859 2 0.004968<br />2017-02-15 18:25:15.342658 7fb9c1f46700 10 osd.595 pg_epoch: 20827 pg[1.323s0( v 16806'62416 (16717'58985,16806'62416] local-les=16806 n=2342 ec=764 les/c/f 16806/16806/0 20823/20824/20824) [595,1391,240,127,937,362,267,320,7,634,716] r=0 lpr=20824 pi=16805-20823/250 crt=16806'62414 lcod 0'0 mlcod 0'0 peering NIBBLEWISE] state<Started/Primary/Peering>: Leaving Peering</p>
<p>Do the other logs for this time period give any indication why osd.595 was marked down in the map with epoch 20827? Is there an indication it is missing heartbeats or any indication of any other timeouts?</p> Ceph - Bug #18960: PG stuck peering after host reboothttps://tracker.ceph.com/issues/18960?journal_id=863882017-02-20T17:03:58ZGeorge Vasilakakosgeorge.vasilakakos@stfc.ac.uk
<ul></ul><p>Brad Hubbard wrote:</p>
<blockquote>
<p>The debug log shows the following.</p>
<p>2017-02-15 18:25:13.649706 We transition to primary and begin peering.</p>
<p>2017-02-15 18:25:13.649706 7fb9c1745700 1 osd.595 pg_epoch: 20826 pg[1.323s0( v 16806'62416 (16717'58985,16806'62416] local-les=16806 n=2342 ec=764 les/c/f 16806/16806/0 20823/20824/20824) [595,1391,240,127,937,362,267,320,7,634,716] r=0 lpr=20824 pi=16805-20823/250 crt=16806'62414 lcod 0'0 mlcod 0'0 inactive NIBBLEWISE] state<Start>: transitioning to Primary</p>
<p>18:25:15.342591 we respond to a new map which shows osd.595 (the current and primary OSD) down. This causes us to leave peering and reset.</p>
<p>2017-02-15 18:25:15.342591 7fb9c1f46700 10 osd.595 pg_epoch: 20827 pg[1.323s0( v 16806'62416 (16717'58985,16806'62416] local-les=16806 n=2342 ec=764 les/c/f 16806/16806/0 20823/20824/20824) [595,1391,240,127,937,362,267,320,7,634,716] r=0 lpr=20824 pi=16805-20823/250 crt=16806'62414 lcod 0'0 mlcod 0'0 peering NIBBLEWISE] state<Started/Primary/Peering>: Peering advmap<br />2017-02-15 18:25:15.342607 7fb9c1f46700 10 osd.595 pg_epoch: 20827 pg[1.323s0( v 16806'62416 (16717'58985,16806'62416] local-les=16806 n=2342 ec=764 les/c/f 16806/16806/0 20823/20824/20824) [595,1391,240,127,937,362,267,320,7,634,716] r=0 lpr=20824 pi=16805-20823/250 crt=16806'62414 lcod 0'0 mlcod 0'0 peering NIBBLEWISE] PriorSet: affected_by_map osd.595 now down<br />2017-02-15 18:25:15.342637 7fb9c1f46700 5 osd.595 pg_epoch: 20827 pg[1.323s0( v 16806'62416 (16717'58985,16806'62416] local-les=16806 n=2342 ec=764 les/c/f 16806/16806/0 20823/20824/20824) [595,1391,240,127,937,362,267,320,7,634,716] r=0 lpr=20824 pi=16805-20823/250 crt=16806'62414 lcod 0'0 mlcod 0'0 peering NIBBLEWISE] exit Started/Primary/Peering/GetInfo 1.692859 2 0.004968<br />2017-02-15 18:25:15.342658 7fb9c1f46700 10 osd.595 pg_epoch: 20827 pg[1.323s0( v 16806'62416 (16717'58985,16806'62416] local-les=16806 n=2342 ec=764 les/c/f 16806/16806/0 20823/20824/20824) [595,1391,240,127,937,362,267,320,7,634,716] r=0 lpr=20824 pi=16805-20823/250 crt=16806'62414 lcod 0'0 mlcod 0'0 peering NIBBLEWISE] state<Started/Primary/Peering>: Leaving Peering</p>
<p>Do the other logs for this time period give any indication why osd.595 was marked down in the map with epoch 20827? Is there an indication it is missing heartbeats or any indication of any other timeouts?</p>
</blockquote>
<p>I have checked the logs of other OSDs in the PG (though not exhaustively) and cannot find anything out of the ordinary in any of the ones I checked. We don't have debug logs from other OSDs at that time.</p>
<p>We have since removed osd.595 in an effort to get peering to happen with a new set but this has not worked.<br />The new OSD chosen was osd.307 as can be seen in the up set but the acting set still contains ITEM_NONE:</p>
<pre>
osdmap e22022 pg 1.323 (1.323) -> up [307,1391,240,127,937,362,267,320,7,634,716] acting [2147483647,1391,2147483647,127,937,362,267,320,7,634,716]
</pre>
<p>In the meantime the recovery triggered by removing osd.595 from the CRUSH map has stalled. The percentage of objects degraded keeps going up and no recoveries are happening.</p>
<p>Looking on the actual OSD with ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-595 --op info --pgid 1.323 I get "PG '1.323' not found". This is the response I get from all other OSDs when running that command...</p> Ceph - Bug #18960: PG stuck peering after host reboothttps://tracker.ceph.com/issues/18960?journal_id=864662017-02-21T17:17:36ZGeorge Vasilakakosgeorge.vasilakakos@stfc.ac.uk
<ul></ul><p>I have uploaded two more logs, obtained with osd debug = 10 in ceph.conf.</p>
<p>ceph-post-file ID: 1bd61b3e-d035-4a58-ba06-a6608e553e9e</p>
<p>The logs were obtained after restarting each of those OSDs.<br />307 is currently rank 0 in the up set, but the acting set has ITEM_NONE in place of the primary.<br />1391 has always been rank 1 in the PG. I think its log is the more interesting one: it tries to peer as the primary and then resets.</p>
<p>If there is anything else that is required for debugging this issue, please request that by 2pm GMT tomorrow (6am PST Wed Feb 22nd).<br />After this time we will attempt to resolve this issue by re-creating the PG and/or any other means necessary and may no longer be able to provide the necessary information.</p> Ceph - Bug #18960: PG stuck peering after host reboothttps://tracker.ceph.com/issues/18960?journal_id=865162017-02-22T02:10:15ZBrad Hubbardbhubbard@redhat.com
<ul></ul><p>We'd need logging at a minimum of debug osd = 20 and debug ms = 20. What we need going forward depends on what the actual issue is which is as yet undetermined. Did you try "--pgid 1.323s0" with the ceph-objectstore-tool and maybe add the "debug" option to produce more output? Since this is an EC pg you are looking for shards on the OSDs which will take the form of 1.323sX rather than 1.323.</p> Ceph - Bug #18960: PG stuck peering after host reboothttps://tracker.ceph.com/issues/18960?journal_id=865172017-02-22T02:20:32ZBrad Hubbardbhubbard@redhat.com
<ul></ul><p>The recently uploaded logs show the same behaviour.</p>
<p>$ grep "PriorSet: affected_by_map" ceph-osd*<br />ceph-osd.1391-debug-10-kraken.log:2017-02-21 16:43:34.178331 7f85b723e700 10 osd.1391 pg_epoch: 22182 pg[1.323s1( v 16806'62416 (16717'58985,16806'62416] local-les=16806 n=2342 ec=764 les/c/f 16806/16806/0 22178/22179/22179) [307,1391,240,127,937,362,267,320,7,634,716]/[2147483647,1391,240,127,937,362,267,320,7,634,716] r=1 lpr=22179 pi=14671-22178/643 crt=16806'62414 lcod 0'0 mlcod 0'0 peering NIBBLEWISE] PriorSet: affected_by_map osd.1391 now down<br />ceph-osd.1391-debug-10-kraken.log:2017-02-21 16:43:36.420378 7f85b723e700 10 osd.1391 pg_epoch: 22187 pg[1.323s1( v 16806'62416 (16717'58985,16806'62416] local-les=16806 n=2342 ec=764 les/c/f 16806/16806/0 22185/22186/22186) [307,1391,240,127,937,362,267,320,7,634,716]/[2147483647,1391,240,127,937,362,267,320,7,634,716] r=1 lpr=22186 pi=14671-22185/646 bft=307(0) crt=16806'62414 lcod 0'0 mlcod 0'0 remapped+peering] PriorSet: affected_by_map osd.595 no longer exists<br />ceph-osd.1391-debug-10-kraken.log:2017-02-21 16:43:47.335677 7f85b723e700 10 osd.1391 pg_epoch: 22188 pg[1.323s1( v 16806'62416 (16717'58985,16806'62416] local-les=16806 n=2342 ec=764 les/c/f 16806/16806/0 22185/22186/22186) [307,1391,240,127,937,362,267,320,7,634,716]/[2147483647,1391,240,127,937,362,267,320,7,634,716] r=1 lpr=22187 pi=14671-22185/646 bft=307(0) crt=16806'62414 lcod 0'0 mlcod 0'0 remapped+peering] PriorSet: affected_by_map osd.595 no longer exists<br />ceph-osd.307-debug-10-kraken.log:2017-02-21 15:18:37.944469 7f5016223700 10 osd.307 pg_epoch: 22163 pg[11.1f7s0( v 21192'2097 (0'0,21192'2097] local-les=22123 n=45 ec=992 les/c/f 22123/22123/0 22121/22122/22122) [307,1056,1220,605,939,136,1325,270,487,226,1425] r=0 lpr=22123 crt=21156'2096 lcod 0'0 mlcod 0'0 peering NIBBLEWISE] PriorSet: affected_by_map osd.307 now down<br />ceph-osd.307-debug-10-kraken.log:2017-02-21 15:18:37.958737 7f5016223700 10 osd.307 pg_epoch: 22163 pg[2.3bds0( v 18344'17772 (14843'14731,18344'17772] 
local-les=22154 n=5879 ec=767 les/c/f 22154/22154/0 22151/22153/22122) [307,837,61,267,602,704,987,476,796,654,497] r=0 lpr=22154 crt=18305'17770 lcod 0'0 mlcod 0'0 peering NIBBLEWISE] PriorSet: affected_by_map osd.307 now down<br />ceph-osd.307-debug-10-kraken.log:2017-02-21 15:18:37.960232 7f5016223700 10 osd.307 pg_epoch: 22163 pg[11.22s0( v 17135'2221 (0'0,17135'2221] local-les=22123 n=60 ec=992 les/c/f 22123/22123/0 22121/22122/22122) [307,68,538,168,1394,384,436,1226,792,1265,1151] r=0 lpr=22123 crt=16717'2219 lcod 0'0 mlcod 0'0 peering NIBBLEWISE] PriorSet: affected_by_map osd.307 now down<br />ceph-osd.307-debug-10-kraken.log:2017-02-21 15:18:37.968345 7f5016223700 10 osd.307 pg_epoch: 22163 pg[3.154s0( empty local-les=22123 n=0 ec=770 les/c/f 22123/22123/0 22121/22122/22122) [307,183,382,9,294,114,1394,627,104,539,789] r=0 lpr=22123 crt=0'0 mlcod 0'0 peering NIBBLEWISE] PriorSet: affected_by_map osd.307 now down<br />ceph-osd.307-debug-10-kraken.log:2017-02-21 15:18:37.974919 7f5016223700 10 osd.307 pg_epoch: 22163 pg[2.32ds0( v 18344'17095 (14843'14000,18344'17095] local-les=22122 n=5886 ec=767 les/c/f 22122/22123/0 22121/22121/22121) [307,1387,602,865,448,1315,888,404,104,293,467] r=0 lpr=22123 crt=18170'17092 lcod 0'0 mlcod 0'0 peering NIBBLEWISE] PriorSet: affected_by_map osd.307 now down</p> Ceph - Bug #18960: PG stuck peering after host reboothttps://tracker.ceph.com/issues/18960?journal_id=865352017-02-22T11:07:14ZGeorge Vasilakakosgeorge.vasilakakos@stfc.ac.uk
<ul></ul><p>Brad Hubbard wrote:</p>
<blockquote>
<p>We'd need logging at a minimum of debug osd = 20 and debug ms = 20. What we need going forward depends on what the actual issue is which is as yet undetermined. Did you try "--pgid 1.323s0" with the ceph-objectstore-tool and maybe add the "debug" option to produce more output? Since this is an EC pg you are looking for shards on the OSDs which will take the form of 1.323sX rather than 1.323.</p>
</blockquote>
<p>Thanks, that way works, I get what I guess would be the 'ceph pg X query' output. I'll be grabbing the outputs from 595 (original primary), 307 (new primary) and 1391 (normally rank 1 but tried to be primary).</p>
<p>I'll also grab some debug logs from these three. I expect them to be rather large, though.</p> Ceph - Bug #18960: PG stuck peering after host reboothttps://tracker.ceph.com/issues/18960?journal_id=865362017-02-22T11:42:21ZGeorge Vasilakakosgeorge.vasilakakos@stfc.ac.uk
<ul></ul><p>I have uploaded ceph-objectstore-tool output for 307, 595, 1391.<br />I have also uploaded logs from 307 and 1391, from OSD start to peering complete (with the exception of this PG).</p>
<p>ID: 8ff74898-4d61-4c03-9f5d-fba1640f215a</p> Ceph - Bug #18960: PG stuck peering after host reboothttps://tracker.ceph.com/issues/18960?journal_id=865482017-02-22T14:26:18ZGeorge Vasilakakosgeorge.vasilakakos@stfc.ac.uk
<ul></ul><p>I have uploaded two more log files: cd758c7b-7e74-4ff3-a00c-24b1391c77a2</p>
<p>These are from osd.7, one is the normal operating log from Feb 15, the other is a freshly obtained debug with osd & ms set to 20, from startup to peering complete on all but the one PG.</p> Ceph - Bug #18960: PG stuck peering after host reboothttps://tracker.ceph.com/issues/18960?journal_id=866642017-02-23T18:09:36ZGeorge Vasilakakosgeorge.vasilakakos@stfc.ac.uk
<ul></ul><p>Since we need this pool to work again, we decided to take the data loss and try to move on.</p>
<p>So far, no luck. We tried a force create but, as expected, with a PG that is not peering this did absolutely nothing.<br />We also tried rm-past-intervals and remove with ceph-objectstore-tool, and manually deleting the data directories on the disks. The PG remains down+remapped with two OSDs failing to join the acting set. These have been restarted multiple times to no avail.</p>
<pre>
# ceph pg map 1.323
osdmap e23122 pg 1.323 (1.323) -> up [595,1391,240,127,937,362,267,320,986,634,716] acting [595,1391,240,127,937,362,267,320,986,2147483647,2147483647]
</pre>
<p>We have also seen some very odd behaviour:</p>
<pre>
# ceph pg map 1.323
osdmap e22909 pg 1.323 (1.323) -> up [595,1391,240,127,937,362,267,320,986,634,716] acting [595,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647]
</pre>
<p>Straight after a restart of all OSDs in the PG and after everything else has settled down. From that state restarting 595 results in:</p>
<pre>
# ceph pg map 1.323
osdmap e22921 pg 1.323 (1.323) -> up [595,1391,240,127,937,362,267,320,986,634,716] acting [2147483647,1391,240,127,937,362,267,320,986,634,716]
</pre>
<p>Restarting 595 again doesn't change this. Another restart of all OSDs in the PG results in the state seen above, with the last two slots replaced by ITEM_NONE.</p>
<p>Another strange thing is that on osd.7 (the one originally at rank 8 that was restarted and caused this problem) the objectstore tool fails to remove the PG and crashes out:</p>
<pre>
# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-7 --op remove --pgid 1.323s8
marking collection for removal
setting '_remove' omap key
finish_remove_pgs 1.323s8_head removing 1.323s8
*** Caught signal (Aborted) **
 in thread 7fa713782700 thread_name:tp_fstore_op
 ceph version 11.2.0 (f223e27eeb35991352ebc1f67423d4ebc252adb7)
 1: (()+0x97463a) [0x7fa71c47563a]
 2: (()+0xf370) [0x7fa71935a370]
 3: (snappy::RawUncompress(snappy::Source*, char*)+0x374) [0x7fa71abd0cd4]
 4: (snappy::RawUncompress(char const*, unsigned long, char*)+0x3d) [0x7fa71abd0e2d]
 5: (leveldb::ReadBlock(leveldb::RandomAccessFile*, leveldb::ReadOptions const&, leveldb::BlockHandle const&, leveldb::BlockContents*)+0x35e) [0x7fa71b08007e]
 6: (leveldb::Table::BlockReader(void*, leveldb::ReadOptions const&, leveldb::Slice const&)+0x276) [0x7fa71b081196]
 7: (()+0x3c820) [0x7fa71b083820]
 8: (()+0x3c9cd) [0x7fa71b0839cd]
 9: (()+0x3ca3e) [0x7fa71b083a3e]
 10: (()+0x39c75) [0x7fa71b080c75]
 11: (()+0x21e20) [0x7fa71b068e20]
 12: (()+0x223c5) [0x7fa71b0693c5]
 13: (LevelDBStore::LevelDBWholeSpaceIteratorImpl::seek_to_first(std::string const&)+0x3d) [0x7fa71c3ecb1d]
 14: (LevelDBStore::LevelDBTransactionImpl::rmkeys_by_prefix(std::string const&)+0x138) [0x7fa71c3ec028]
 15: (DBObjectMap::clear_header(std::shared_ptr<DBObjectMap::_Header>, std::shared_ptr<KeyValueDB::TransactionImpl>)+0x1d0) [0x7fa71c400a40]
 16: (DBObjectMap::_clear(std::shared_ptr<DBObjectMap::_Header>, std::shared_ptr<KeyValueDB::TransactionImpl>)+0xa1) [0x7fa71c401171]
 17: (DBObjectMap::clear(ghobject_t const&, SequencerPosition const*)+0x1ff) [0x7fa71c4075bf]
 18: (FileStore::lfn_unlink(coll_t const&, ghobject_t const&, SequencerPosition const&, bool)+0x241) [0x7fa71c2c0d41]
 19: (FileStore::_remove(coll_t const&, ghobject_t const&, SequencerPosition const&)+0x8e) [0x7fa71c2c171e]
 20: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, int, ThreadPool::TPHandle*)+0x433e) [0x7fa71c2d8c6e]
 21: (FileStore::_do_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, unsigned long, ThreadPool::TPHandle*)+0x3b) [0x7fa71c2db75b]
 22: (FileStore::_do_op(FileStore::OpSequencer*, ThreadPool::TPHandle&)+0x2cd) [0x7fa71c2dba5d]
 23: (ThreadPool::worker(ThreadPool::WorkThread*)+0xb59) [0x7fa71c63e189]
 24: (ThreadPool::WorkThread::entry()+0x10) [0x7fa71c63f160]
 25: (()+0x7dc5) [0x7fa719352dc5]
 26: (clone()+0x6d) [0x7fa71843e73d]
Aborted
</pre>
<p>At this point all we want to achieve is for the PG to peer again (and soon) without us having to delete the pool.</p>
<p>Any help would be appreciated...</p> Ceph - Bug #18960: PG stuck peering after host reboothttps://tracker.ceph.com/issues/18960?journal_id=867032017-02-24T00:50:57ZBrad Hubbardbhubbard@redhat.com
<ul></ul><p>This looks like corruption of the leveldb database on osd.7. Could you upload the omap directory from osd.7's data partition?</p> Ceph - Bug #18960: PG stuck peering after host reboothttps://tracker.ceph.com/issues/18960?journal_id=867092017-02-24T03:12:33ZBrad Hubbardbhubbard@redhat.com
<ul></ul><p>Given the amount of retries in the logs and the history of <a class="external" href="http://tracker.ceph.com/issues/18508">http://tracker.ceph.com/issues/18508</a> could you try with async messaging disabled?</p> Ceph - Bug #18960: PG stuck peering after host reboothttps://tracker.ceph.com/issues/18960?journal_id=867152017-02-24T09:32:54ZGeorge Vasilakakosgeorge.vasilakakos@stfc.ac.uk
<ul></ul><p>Brad Hubbard wrote:</p>
<blockquote>
<p>Given the amount of retries in the logs and the history of <a class="external" href="http://tracker.ceph.com/issues/18508">http://tracker.ceph.com/issues/18508</a> could you try with async messaging disabled?</p>
</blockquote>
<p>We have tried that too, to no avail. I forgot to include it here, but setting ms type to simple didn't have any effect.</p> Ceph - Bug #18960: PG stuck peering after host reboothttps://tracker.ceph.com/issues/18960?journal_id=867192017-02-24T09:56:10ZGeorge Vasilakakosgeorge.vasilakakos@stfc.ac.uk
<ul></ul><p>Brad Hubbard wrote:</p>
<blockquote>
<p>This looks like corruption of the leveldb database on osd.7. Could you upload the omap directory from osd.7's data partition?</p>
</blockquote>
<p>Done: 9b5e9602-d843-42f8-bda1-5c5542f9e778</p> Ceph - Bug #18960: PG stuck peering after host reboothttps://tracker.ceph.com/issues/18960?journal_id=867212017-02-24T10:10:18ZBrad Hubbardbhubbard@redhat.com
<ul></ul><p>The next step then would be recreating osd.7 (marking it out+down, then lost, re-formatting it and making a new osd in its place).</p> Ceph - Bug #18960: PG stuck peering after host reboothttps://tracker.ceph.com/issues/18960?journal_id=867242017-02-24T10:55:09ZGeorge Vasilakakosgeorge.vasilakakos@stfc.ac.uk
<ul></ul><p>Brad Hubbard wrote:</p>
<blockquote>
<p>The next step then would be recreating osd.7 (marking it out+down, then lost, re-formatting it and making a new osd in its place).</p>
</blockquote>
<p>Will be doing that today.</p>
<p>Two points of note:</p>
<p>osd.7 has been stopped and marked out and backfill on its replacements has completed, for this PG that was osd.986.</p>
<p>We have already manually deleted the data on the others and done an rm-past-intervals and remove with the objectstore tool. Once this is done, very little information will likely remain to help troubleshoot this issue on our cluster.</p>
<p>Can you think of anything else for me to extract and post before I do this?</p> Ceph - Bug #18960: PG stuck peering after host reboothttps://tracker.ceph.com/issues/18960?journal_id=867352017-02-24T15:57:05ZGeorge Vasilakakosgeorge.vasilakakos@stfc.ac.uk
<ul></ul><p>Brad Hubbard wrote:</p>
<blockquote>
<p>The next step then would be recreating osd.7 (marking it out+down, then lost, re-formatting it and making a new osd in its place).</p>
</blockquote>
<p>I did this.</p>
<p>Once the OSD was brought into the cluster the PG went incomplete. Seeing as it was empty anyway, I stopped the primary (osd.595) and used ceph-objectstore-tool to mark-complete the stuck PG. I then started the primary up again. The second in the set (osd.1391) was hanging so I restarted that one too.</p>
<p>The PG became active+clean immediately as there was no data in it. Other PGs using osd.7 are still backfilling but the cluster basically looks like it simply had a disk replaced and all PGs are active.</p>
<p>I think this solution should be able to get a PG unstuck without having to incur the loss of all data in the PG.</p>
<p>Thanks a lot for your help Brad!</p> Ceph - Bug #18960: PG stuck peering after host reboothttps://tracker.ceph.com/issues/18960?journal_id=867632017-02-24T23:36:32ZBrad Hubbardbhubbard@redhat.com
<ul></ul><p>George Vasilakakos wrote:</p>
<blockquote>
<p>I think this solution should be able to get a PG unstuck without having to incur the loss of all data in the PG.</p>
</blockquote>
<p>I agree. Unfortunately, this would have taken more time and analysis to work out the correct procedure. I understand that for you, in this instance, it was not possible to take that extra time. We will take a look at the omap corruption from osd.7 and see if we can formulate a theory on what actually happened and how it might have been alternatively resolved.</p>
<blockquote>
<p>Thanks a lot for your help Brad!</p>
</blockquote>
<p>Welcome, glad to hear you are back up and running.</p> Ceph - Bug #18960: PG stuck peering after host reboothttps://tracker.ceph.com/issues/18960?journal_id=876322017-03-17T04:25:01ZBrad Hubbardbhubbard@redhat.com
<ul></ul><p>I've done extensive analysis of the uploaded omap database to determine the source of any corruption, including running various versions of the following code.</p>
<pre>
#include <iostream>
#include "leveldb/db.h"

int main()
{
    leveldb::DB* db;
    leveldb::Options options;
    options.max_open_files = 512;
    leveldb::Status status = leveldb::DB::Open(options, "./omap", &db);
    if (!status.ok())
        std::cerr << status.ToString() << std::endl;
    // Walk every key, printing it, checking iterator status, then deleting it.
    leveldb::Iterator* it = db->NewIterator(leveldb::ReadOptions());
    for (it->SeekToFirst(); it->Valid(); it->Next()) {
        std::cout << it->key().ToString() << ": " << it->value().ToString() << std::endl;
        if (!(it->status().ok()))
            std::cerr << it->status().ToString() << std::endl;
        status = db->Delete(leveldb::WriteOptions(), it->key());
        if (!status.ok())
            std::cerr << status.ToString() << std::endl;
    }
    //status = leveldb::RepairDB("./omap", options);
    //if (!status.ok())
    //    std::cerr << status.ToString() << std::endl;
    delete it;
    delete db;
    return 0;
}
</pre>
<p>The code does not detect any corruption in the database, either during repair, reading of every key, or deletion of every key. Without additional data I'm not sure how we can proceed here.</p> Ceph - Bug #18960: PG stuck peering after host reboothttps://tracker.ceph.com/issues/18960?journal_id=881362017-03-28T22:48:38ZBrad Hubbardbhubbard@redhat.com
<ul><li><strong>Status</strong> changed from <i>New</i> to <i>Need More Info</i></li></ul> Ceph - Bug #18960: PG stuck peering after host reboothttps://tracker.ceph.com/issues/18960?journal_id=883242017-04-02T23:35:29ZBrad Hubbardbhubbard@redhat.com
<ul><li><strong>Status</strong> changed from <i>Need More Info</i> to <i>Closed</i></li></ul> Ceph - Bug #18960: PG stuck peering after host reboothttps://tracker.ceph.com/issues/18960?journal_id=884832017-04-05T02:53:57Zzhiyuan wangzhiyuan.wang@istuary.com
<ul></ul><p>Hi Brad,<br />I have hit the same issue in my environment. Could you help explain why this issue was closed?<br />Was there not enough information to locate it?</p> Ceph - Bug #18960: PG stuck peering after host reboothttps://tracker.ceph.com/issues/18960?journal_id=884842017-04-05T03:38:00ZBrad Hubbardbhubbard@redhat.com
<ul></ul><p>Hello,</p>
<p>Yes, in the specific case above there was not enough information to determine the exact cause.</p>
<p>What specifically is the issue you are facing?</p> Ceph - Bug #18960: PG stuck peering after host reboothttps://tracker.ceph.com/issues/18960?journal_id=884972017-04-05T09:02:59ZGeorge Vasilakakosgeorge.vasilakakos@stfc.ac.uk
<ul></ul><p>Hi Brad, sorry I couldn't get around to this sooner. Unfortunately we did not collect any more data than what I've uploaded.<br />By now, I'd expect most useful information will have been replaced by more recent data.</p> Ceph - Bug #18960: PG stuck peering after host reboothttps://tracker.ceph.com/issues/18960?journal_id=885222017-04-06T01:36:15ZBrad Hubbardbhubbard@redhat.com
<ul></ul><p>Right, and in the absence of further information this tracker has been closed.</p> Ceph - Bug #18960: PG stuck peering after host reboothttps://tracker.ceph.com/issues/18960?journal_id=887402017-04-10T02:28:08Zzhiyuan wangzhiyuan.wang@istuary.com
<ul></ul><p>Hi,<br />My test case is:<br />Three servers with one monitor and 5 OSDs each (Jewel, BlueStore).<br />All three servers were switched off while writing (a hard power-off, not a normal shutdown), then powered back on.<br />Two OSDs hit this issue.<br />The pg query output showed:</p>
<pre>
"peering_blocked_by": [],
"peering_blocked_by_detail": [
    {
        "detail": "peering_blocked_by_history_les_bound"
</pre>
<p>After restarting the two OSDs, the issue disappeared,<br />but it has only occurred once and could not be reproduced since.</p> Ceph - Bug #18960: PG stuck peering after host reboothttps://tracker.ceph.com/issues/18960?journal_id=887412017-04-10T03:26:52ZBrad Hubbardbhubbard@redhat.com
<ul></ul><p>zhiyuan wang, that is not the same issue. If you can reproduce it and provide some debugging data, please feel free to open a new tracker so we can investigate your issue further.</p>