Bug #18960

PG stuck peering after host reboot

Added by George Vasilakakos about 7 years ago. Updated almost 7 years ago.

Status: Closed
Priority: Normal
Category: OSD
Target version: -
% Done: 0%
Tags: ec, peering, crush, osd, msgr
Regression: No
Severity: 2 - major

Description

On a cluster running Jewel 10.2.5, rebooting a host resulted in a PG getting stuck in the peering state.

pg 1.323 is stuck inactive for 73352.498493, current state peering, last acting [595,1391,240,127,937,362,267,320,7,634,716]

Restarting OSDs or hosts does nothing to help, or sometimes results in things like this:

pg 1.323 is remapped+peering, acting [2147483647,1391,240,127,937,362,267,320,7,634,716]

In fact, we have since upgraded to Kraken 11.2.0, and every OSD that was restarted has been replaced by CRUSH_ITEM_NONE in the acting set. This is only rectified by restarting OSDs 595 and 1391.

The host that was rebooted originally is home to osd.7 (rank 8 in the PG). When I would go onto it to look at the logs for osd.7, this is what I would see:

$ tail -f /var/log/ceph/ceph-osd.7.log
2017-02-08 15:41:00.445247 7f5fcc2bd700 0 -- XXX.XXX.XXX.172:6905/20510 >> XXX.XXX.XXX.192:6921/55371 pipe(0x7f6074a0b400 sd=34 :42828 s=2 pgs=319 cs=471 l=0 c=0x7f6070086700).fault, initiating reconnect

I'm assuming that in IP1:port1/PID1 >> IP2:port2/PID2 the >> indicates the direction of communication. I've traced these to osd.7 (rank 8 in the stuck PG) reaching out to osd.595 (the primary in the stuck PG).

Meanwhile, looking at the logs of osd.595 I would see this:

$ tail -f /var/log/ceph/ceph-osd.595.log
2017-02-08 15:41:15.760708 7f1765673700 0 -- XXX.XXX.XXX.192:6921/55371 >> XXX.XXX.XXX.172:6905/20510 pipe(0x7f17b2911400 sd=101 :6921 s=0 pgs=0 cs=0 l=0 c=0x7f17b7beaf00).accept connect_seq 478 vs existing 477 state standby
2017-02-08 15:41:20.768844 7f1765673700 0 bad crc in front 1941070384 != exp 3786596716

which again shows osd.595 reaching out to osd.7; from what I could gather, the CRC problem is in the messaging layer.

The CRC messages no longer seem to be logged by osd.595, but osd.7 is still logging reconnects.

ceph pg 1.323 query seems to hang forever but it completed once early on and I noticed this:

"peering_blocked_by_detail": [
    {
"detail": "peering_blocked_by_history_les_bound"
}

We have seen this before and it was cleared by setting osd_find_best_info_ignore_history_les to true for the first two OSDs on the stuck PGs (this was on a 3 replica pool). This hasn't worked in this case.
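
For reference, applying that option at runtime looked roughly like this (a sketch; osd.595 and osd.1391 are the first two OSDs in this PG's set):

    ceph tell osd.595 injectargs '--osd_find_best_info_ignore_history_les=true'
    ceph tell osd.1391 injectargs '--osd_find_best_info_ignore_history_les=true'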

PG 1.323 is the only PG with both 595 and 7 in its set.
There are another 217 PGs with OSDs from both these hosts in their sets which makes this seem less like a networking issue.

ceph.conf (1.05 KB) George Vasilakakos, 02/16/2017 01:47 PM

crushmap.txt (72 KB) George Vasilakakos, 02/16/2017 01:47 PM

health-detail-pg-1.323.txt (2.42 KB) George Vasilakakos, 02/16/2017 01:48 PM

History

#1 Updated by George Vasilakakos about 7 years ago

Upload ID from ceph-post-file:

84bf663c-cfa4-48d5-9836-0fda8a4d8990

#2 Updated by Brad Hubbard about 7 years ago

  • Category set to OSD
  • Assignee set to Brad Hubbard

The debug log shows the following.

2017-02-15 18:25:13.649706 We transition to primary and begin peering.

2017-02-15 18:25:13.649706 7fb9c1745700 1 osd.595 pg_epoch: 20826 pg[1.323s0( v 16806'62416 (16717'58985,16806'62416] local-les=16806 n=2342 ec=764 les/c/f 16806/16806/0 20823/20824/20824) [595,1391,240,127,937,362,267,320,7,634,716] r=0 lpr=20824 pi=16805-20823/250 crt=16806'62414 lcod 0'0 mlcod 0'0 inactive NIBBLEWISE] state<Start>: transitioning to Primary

18:25:15.342591 we respond to a new map which shows osd.595 (the current and primary OSD) down. This causes us to leave peering and reset.

2017-02-15 18:25:15.342591 7fb9c1f46700 10 osd.595 pg_epoch: 20827 pg[1.323s0( v 16806'62416 (16717'58985,16806'62416] local-les=16806 n=2342 ec=764 les/c/f 16806/16806/0 20823/20824/20824) [595,1391,240,127,937,362,267,320,7,634,716] r=0 lpr=20824 pi=16805-20823/250 crt=16806'62414 lcod 0'0 mlcod 0'0 peering NIBBLEWISE] state<Started/Primary/Peering>: Peering advmap
2017-02-15 18:25:15.342607 7fb9c1f46700 10 osd.595 pg_epoch: 20827 pg[1.323s0( v 16806'62416 (16717'58985,16806'62416] local-les=16806 n=2342 ec=764 les/c/f 16806/16806/0 20823/20824/20824) [595,1391,240,127,937,362,267,320,7,634,716] r=0 lpr=20824 pi=16805-20823/250 crt=16806'62414 lcod 0'0 mlcod 0'0 peering NIBBLEWISE] PriorSet: affected_by_map osd.595 now down
2017-02-15 18:25:15.342637 7fb9c1f46700 5 osd.595 pg_epoch: 20827 pg[1.323s0( v 16806'62416 (16717'58985,16806'62416] local-les=16806 n=2342 ec=764 les/c/f 16806/16806/0 20823/20824/20824) [595,1391,240,127,937,362,267,320,7,634,716] r=0 lpr=20824 pi=16805-20823/250 crt=16806'62414 lcod 0'0 mlcod 0'0 peering NIBBLEWISE] exit Started/Primary/Peering/GetInfo 1.692859 2 0.004968
2017-02-15 18:25:15.342658 7fb9c1f46700 10 osd.595 pg_epoch: 20827 pg[1.323s0( v 16806'62416 (16717'58985,16806'62416] local-les=16806 n=2342 ec=764 les/c/f 16806/16806/0 20823/20824/20824) [595,1391,240,127,937,362,267,320,7,634,716] r=0 lpr=20824 pi=16805-20823/250 crt=16806'62414 lcod 0'0 mlcod 0'0 peering NIBBLEWISE] state<Started/Primary/Peering>: Leaving Peering

Do the other logs for this time period give any indication why osd.595 was marked down in the map with epoch 20827? Is there any indication it is missing heartbeats, or of any other timeouts?
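
For example (a sketch; the cluster log lives on the monitor hosts and the paths below are the defaults):

    # failure reports and down markings are recorded in the monitor cluster log
    grep osd.595 /var/log/ceph/ceph.log | grep -E "failed|marked|boot"

    # the osdmap for that specific epoch can also be dumped
    ceph osd dump 20827 | grep -w osd.595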

#3 Updated by George Vasilakakos about 7 years ago

Brad Hubbard wrote:

[...]

Do the other logs for this time period give any indication why osd.595 was marked down in the map with epoch 20827? Is there any indication it is missing heartbeats, or of any other timeouts?

I have checked the logs of the other OSDs in the PG (though not exhaustively) and cannot find anything out of the ordinary. We don't have debug logs from the other OSDs at that time.

We have since removed osd.595 in an effort to get peering to happen with a new set but this has not worked.
The new OSD chosen was osd.307 as can be seen in the up set but the acting set still contains ITEM_NONE:

osdmap e22022 pg 1.323 (1.323) -> up [307,1391,240,127,937,362,267,320,7,634,716] acting [2147483647,1391,2147483647,127,937,362,267,320,7,634,716]

In the meantime the recovery triggered by removing osd.595 from the CRUSH map has stalled. The percentage of objects degraded keeps going up and no recoveries are happening.

Running ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-595 --op info --pgid 1.323 on the actual OSD, I get "PG '1.323' not found". This is the response I get from all the other OSDs when running that command...

#4 Updated by George Vasilakakos about 7 years ago

I have uploaded two more logs, obtained with debug osd = 10 in ceph.conf.

ceph-post-file ID: 1bd61b3e-d035-4a58-ba06-a6608e553e9e

The logs were obtained after an OSD restart of each of those OSDs.
307 is currently rank 0 in the up set but the acting set has ITEM_NONE in place of the primary.
1391 has always been rank 1 in the PG. I think its log is more interesting. It tries to peer as the primary and then resets.

If there is anything else that is required for debugging this issue, please request that by 2pm GMT tomorrow (6am PST Wed Feb 22nd).
After this time we will attempt to resolve this issue by re-creating the PG and/or any other means necessary and may no longer be able to provide the necessary information.

#5 Updated by Brad Hubbard about 7 years ago

We'd need logging at a minimum of debug osd = 20 and debug ms = 20. What we need going forward depends on what the actual issue is, which is as yet undetermined. Did you try "--pgid 1.323s0" with ceph-objectstore-tool, and maybe add the "debug" option to produce more output? Since this is an EC PG you are looking for shards on the OSDs, which take the form 1.323sX rather than 1.323.
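
Something like this (a sketch; run the objectstore tool with the OSD stopped, and use the shard suffix for that OSD's rank in the PG):

    # raise logging on the OSDs of interest at runtime
    ceph tell osd.595 injectargs '--debug_osd 20 --debug_ms 20'

    # query the EC shard directly in the OSD's store
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-595 --op info --pgid 1.323s0 --debug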

#6 Updated by Brad Hubbard about 7 years ago

The recently uploaded logs show the same behaviour.

$ grep "PriorSet: affected_by_map" ceph-osd*
ceph-osd.1391-debug-10-kraken.log:2017-02-21 16:43:34.178331 7f85b723e700 10 osd.1391 pg_epoch: 22182 pg[1.323s1( v 16806'62416 (16717'58985,16806'62416] local-les=16806 n=2342 ec=764 les/c/f 16806/16806/0 22178/22179/22179) [307,1391,240,127,937,362,267,320,7,634,716]/[2147483647,1391,240,127,937,362,267,320,7,634,716] r=1 lpr=22179 pi=14671-22178/643 crt=16806'62414 lcod 0'0 mlcod 0'0 peering NIBBLEWISE] PriorSet: affected_by_map osd.1391 now down
ceph-osd.1391-debug-10-kraken.log:2017-02-21 16:43:36.420378 7f85b723e700 10 osd.1391 pg_epoch: 22187 pg[1.323s1( v 16806'62416 (16717'58985,16806'62416] local-les=16806 n=2342 ec=764 les/c/f 16806/16806/0 22185/22186/22186) [307,1391,240,127,937,362,267,320,7,634,716]/[2147483647,1391,240,127,937,362,267,320,7,634,716] r=1 lpr=22186 pi=14671-22185/646 bft=307(0) crt=16806'62414 lcod 0'0 mlcod 0'0 remapped+peering] PriorSet: affected_by_map osd.595 no longer exists
ceph-osd.1391-debug-10-kraken.log:2017-02-21 16:43:47.335677 7f85b723e700 10 osd.1391 pg_epoch: 22188 pg[1.323s1( v 16806'62416 (16717'58985,16806'62416] local-les=16806 n=2342 ec=764 les/c/f 16806/16806/0 22185/22186/22186) [307,1391,240,127,937,362,267,320,7,634,716]/[2147483647,1391,240,127,937,362,267,320,7,634,716] r=1 lpr=22187 pi=14671-22185/646 bft=307(0) crt=16806'62414 lcod 0'0 mlcod 0'0 remapped+peering] PriorSet: affected_by_map osd.595 no longer exists
ceph-osd.307-debug-10-kraken.log:2017-02-21 15:18:37.944469 7f5016223700 10 osd.307 pg_epoch: 22163 pg[11.1f7s0( v 21192'2097 (0'0,21192'2097] local-les=22123 n=45 ec=992 les/c/f 22123/22123/0 22121/22122/22122) [307,1056,1220,605,939,136,1325,270,487,226,1425] r=0 lpr=22123 crt=21156'2096 lcod 0'0 mlcod 0'0 peering NIBBLEWISE] PriorSet: affected_by_map osd.307 now down
ceph-osd.307-debug-10-kraken.log:2017-02-21 15:18:37.958737 7f5016223700 10 osd.307 pg_epoch: 22163 pg[2.3bds0( v 18344'17772 (14843'14731,18344'17772] local-les=22154 n=5879 ec=767 les/c/f 22154/22154/0 22151/22153/22122) [307,837,61,267,602,704,987,476,796,654,497] r=0 lpr=22154 crt=18305'17770 lcod 0'0 mlcod 0'0 peering NIBBLEWISE] PriorSet: affected_by_map osd.307 now down
ceph-osd.307-debug-10-kraken.log:2017-02-21 15:18:37.960232 7f5016223700 10 osd.307 pg_epoch: 22163 pg[11.22s0( v 17135'2221 (0'0,17135'2221] local-les=22123 n=60 ec=992 les/c/f 22123/22123/0 22121/22122/22122) [307,68,538,168,1394,384,436,1226,792,1265,1151] r=0 lpr=22123 crt=16717'2219 lcod 0'0 mlcod 0'0 peering NIBBLEWISE] PriorSet: affected_by_map osd.307 now down
ceph-osd.307-debug-10-kraken.log:2017-02-21 15:18:37.968345 7f5016223700 10 osd.307 pg_epoch: 22163 pg[3.154s0( empty local-les=22123 n=0 ec=770 les/c/f 22123/22123/0 22121/22122/22122) [307,183,382,9,294,114,1394,627,104,539,789] r=0 lpr=22123 crt=0'0 mlcod 0'0 peering NIBBLEWISE] PriorSet: affected_by_map osd.307 now down
ceph-osd.307-debug-10-kraken.log:2017-02-21 15:18:37.974919 7f5016223700 10 osd.307 pg_epoch: 22163 pg[2.32ds0( v 18344'17095 (14843'14000,18344'17095] local-les=22122 n=5886 ec=767 les/c/f 22122/22123/0 22121/22121/22121) [307,1387,602,865,448,1315,888,404,104,293,467] r=0 lpr=22123 crt=18170'17092 lcod 0'0 mlcod 0'0 peering NIBBLEWISE] PriorSet: affected_by_map osd.307 now down

#7 Updated by George Vasilakakos about 7 years ago

Brad Hubbard wrote:

We'd need logging at a minimum of debug osd = 20 and debug ms = 20. What we need going forward depends on what the actual issue is, which is as yet undetermined. Did you try "--pgid 1.323s0" with ceph-objectstore-tool, and maybe add the "debug" option to produce more output? Since this is an EC PG you are looking for shards on the OSDs, which take the form 1.323sX rather than 1.323.

Thanks, that way works; I get what I guess would be the 'ceph pg X query' output. I'll be grabbing the output from 595 (original primary), 307 (new primary) and 1391 (normally rank 1 but tried to be primary).

I'll also grab some debug logs from these three. I expect them to be rather large, though.

#8 Updated by George Vasilakakos about 7 years ago

I have uploaded ceph-objectstore-tool output for 307, 595, 1391.
I have also uploaded logs from 307 and 1391, from OSD start to peering complete (with the exception of this PG).

ID: 8ff74898-4d61-4c03-9f5d-fba1640f215a

#9 Updated by George Vasilakakos about 7 years ago

I have uploaded two more log files: cd758c7b-7e74-4ff3-a00c-24b1391c77a2

These are from osd.7: one is the normal operating log from Feb 15, the other is a freshly obtained debug log with debug osd and debug ms set to 20, from startup to peering complete on all but the one PG.

#10 Updated by George Vasilakakos about 7 years ago

Since we need this pool to work again, we decided to take the data loss and try to move on.

So far, no luck. We tried a force create but, as expected, with a PG that is not peering this did absolutely nothing.
We also tried rm-past-intervals and remove from ceph-objectstore-tool, and manually deleted the data directories on the disks. The PG remains down+remapped with two OSDs failing to join the acting set. These have been restarted multiple times to no avail.
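
For reference, the objectstore-tool operations were of roughly this form (a sketch; run with the affected OSD stopped, with the shard suffix matching that OSD's rank):

    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-595 --op rm-past-intervals --pgid 1.323s0
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-595 --op remove --pgid 1.323s0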

  # ceph pg map 1.323
    osdmap e23122 pg 1.323 (1.323) -> up [595,1391,240,127,937,362,267,320,986,634,716] acting [595,1391,240,127,937,362,267,320,986,2147483647,2147483647]

We have also seen some very odd behaviour:

  # ceph pg map 1.323
    osdmap e22909 pg 1.323 (1.323) -> up [595,1391,240,127,937,362,267,320,986,634,716] acting [595,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647]

That is straight after a restart of all OSDs in the PG, once everything else has settled down. From that state, restarting 595 results in:

  # ceph pg map 1.323
    osdmap e22921 pg 1.323 (1.323) -> up [595,1391,240,127,937,362,267,320,986,634,716] acting [2147483647,1391,240,127,937,362,267,320,986,634,716]

Restarting 595 again doesn't change this. Another restart of all OSDs in the PG results in the state seen above, with the last two replaced by ITEM_NONE.

Another strange thing is that on osd.7 (the one originally at rank 8 that was restarted and caused this problem) the objectstore tool fails to remove the PG and crashes out:

  # ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-7 --op remove --pgid 1.323s8
    marking collection for removal
    setting '_remove' omap key
    finish_remove_pgs 1.323s8_head removing 1.323s8
    *** Caught signal (Aborted) **
    in thread 7fa713782700 thread_name:tp_fstore_op
    ceph version 11.2.0 (f223e27eeb35991352ebc1f67423d4ebc252adb7)
    1: (()+0x97463a) [0x7fa71c47563a]
    2: (()+0xf370) [0x7fa71935a370]
    3: (snappy::RawUncompress(snappy::Source*, char*)+0x374) [0x7fa71abd0cd4]
    4: (snappy::RawUncompress(char const*, unsigned long, char*)+0x3d) [0x7fa71abd0e2d]
    5: (leveldb::ReadBlock(leveldb::RandomAccessFile*, leveldb::ReadOptions const&, leveldb::BlockHandle const&, leveldb::BlockContents*)+0x35e) [0x7fa71b08007e]
    6: (leveldb::Table::BlockReader(void*, leveldb::ReadOptions const&, leveldb::Slice const&)+0x276) [0x7fa71b081196]
    7: (()+0x3c820) [0x7fa71b083820]
    8: (()+0x3c9cd) [0x7fa71b0839cd]
    9: (()+0x3ca3e) [0x7fa71b083a3e]
    10: (()+0x39c75) [0x7fa71b080c75]
    11: (()+0x21e20) [0x7fa71b068e20]
    12: (()+0x223c5) [0x7fa71b0693c5]
    13: (LevelDBStore::LevelDBWholeSpaceIteratorImpl::seek_to_first(std::string const&)+0x3d) [0x7fa71c3ecb1d]
    14: (LevelDBStore::LevelDBTransactionImpl::rmkeys_by_prefix(std::string const&)+0x138) [0x7fa71c3ec028]
    15: (DBObjectMap::clear_header(std::shared_ptr<DBObjectMap::_Header>, std::shared_ptr<KeyValueDB::TransactionImpl>)+0x1d0) [0x7fa71c400a40]
    16: (DBObjectMap::_clear(std::shared_ptr<DBObjectMap::_Header>, std::shared_ptr<KeyValueDB::TransactionImpl>)+0xa1) [0x7fa71c401171]
    17: (DBObjectMap::clear(ghobject_t const&, SequencerPosition const*)+0x1ff) [0x7fa71c4075bf]
    18: (FileStore::lfn_unlink(coll_t const&, ghobject_t const&, SequencerPosition const&, bool)+0x241) [0x7fa71c2c0d41]
    19: (FileStore::_remove(coll_t const&, ghobject_t const&, SequencerPosition const&)+0x8e) [0x7fa71c2c171e]
    20: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, int, ThreadPool::TPHandle*)+0x433e) [0x7fa71c2d8c6e]
    21: (FileStore::_do_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, unsigned long, ThreadPool::TPHandle*)+0x3b) [0x7fa71c2db75b]
    22: (FileStore::_do_op(FileStore::OpSequencer*, ThreadPool::TPHandle&)+0x2cd) [0x7fa71c2dba5d]
    23: (ThreadPool::worker(ThreadPool::WorkThread*)+0xb59) [0x7fa71c63e189]
    24: (ThreadPool::WorkThread::entry()+0x10) [0x7fa71c63f160]
    25: (()+0x7dc5) [0x7fa719352dc5]
    26: (clone()+0x6d) [0x7fa71843e73d]
    Aborted

At this point all we want to achieve is for the PG to peer again (and soon) without us having to delete the pool.

Any help would be appreciated...

#11 Updated by Brad Hubbard about 7 years ago

This looks like corruption of the leveldb database on osd.7. Could you upload the omap directory from osd.7's data partition?

#12 Updated by Brad Hubbard about 7 years ago

Given the amount of retries in the logs and the history of http://tracker.ceph.com/issues/18508 could you try with async messaging disabled?
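
A sketch of how to do that: set the messenger type back to simple in ceph.conf on the OSD hosts (the section placement here is illustrative) and restart the OSDs.

    [global]
        ms type = simple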

#13 Updated by George Vasilakakos about 7 years ago

Brad Hubbard wrote:

Given the amount of retries in the logs and the history of http://tracker.ceph.com/issues/18508 could you try with async messaging disabled?

We have tried that too, to no avail. I forgot to include it here, but setting ms type to simple didn't have any effect.

#14 Updated by George Vasilakakos about 7 years ago

Brad Hubbard wrote:

This looks like corruption of the leveldb database on osd.7. Could you upload the omap directory from osd.7's data partition?

Done: 9b5e9602-d843-42f8-bda1-5c5542f9e778

#15 Updated by Brad Hubbard about 7 years ago

The next step then would be recreating osd.7 (marking it out+down, then lost, re-formatting it and making a new osd in its place).
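
A minimal sketch of that procedure, assuming a typical manually deployed OSD on a systemd host (the exact steps depend on how the OSD was provisioned):

    systemctl stop ceph-osd@7
    ceph osd out 7
    ceph osd lost 7 --yes-i-really-mean-it
    ceph osd crush remove osd.7
    ceph auth del osd.7
    ceph osd rm 7
    # then re-format the disk and create a new OSD in its place (e.g. with ceph-disk prepare/activate)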

#16 Updated by George Vasilakakos about 7 years ago

Brad Hubbard wrote:

The next step then would be recreating osd.7 (marking it out+down, then lost, re-formatting it and making a new osd in its place).

Will be doing that today.

Two points of note:

osd.7 has been stopped and marked out, and backfill onto its replacements has completed; for this PG the replacement was osd.986.

We have already manually deleted the data on the others and done a rm-past-intervals and remove with the objectstore tool. Once this is done, very little information will likely remain to help troubleshoot the issue on our cluster.

Can you think of anything else for me to extract and post before I do this?

#17 Updated by George Vasilakakos about 7 years ago

Brad Hubbard wrote:

The next step then would be recreating osd.7 (marking it out+down, then lost, re-formatting it and making a new osd in its place).

I did this.

Once the OSD was brought into the cluster, the PG went incomplete. Seeing as it was empty anyway, I stopped the primary (osd.595) and used ceph-objectstore-tool to mark-complete the stuck PG, then started the primary up again. The second in the set (osd.1391) was hanging, so I restarted that one too.
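
Roughly, for the record (a sketch; the shard suffix is the primary's rank, and the systemctl calls assume a systemd host):

    systemctl stop ceph-osd@595
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-595 --op mark-complete --pgid 1.323s0
    systemctl start ceph-osd@595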

The PG became active+clean immediately as there was no data in it. Other PGs using osd.7 are still backfilling but the cluster basically looks like it simply had a disk replaced and all PGs are active.

I think this solution should be able to get a PG unstuck without having to incur the loss of all data in the PG.

Thanks a lot for your help Brad!

#18 Updated by Brad Hubbard about 7 years ago

George Vasilakakos wrote:

I think this solution should be able to get a PG unstuck without having to incur the loss of all data in the PG.

I agree. Unfortunately, it would have taken more time and analysis to work out the correct procedure, and I understand that for you, in this instance, it was not possible to take that extra time. We will take a look at the omap corruption from osd.7 and see if we can formulate a theory on what actually happened and how it might otherwise have been resolved.

Thanks a lot for your help Brad!

Welcome, glad to hear you are back up and running.

#19 Updated by Brad Hubbard about 7 years ago

I've done extensive analysis of the uploaded omap database to determine the source of any corruption, including running various versions of the following code.

#include <iostream>
#include "leveldb/db.h"

int main()
{
    // Open the copied omap leveldb database from osd.7.
    leveldb::DB* db;
    leveldb::Options options;
    options.max_open_files = 512;
    leveldb::Status status = leveldb::DB::Open(options, "./omap", &db);
    if (!status.ok())
        std::cerr << status.ToString() << std::endl;

    // Walk every key/value pair, reporting any read errors,
    // then delete each key to exercise the write path as well.
    leveldb::Iterator* it = db->NewIterator(leveldb::ReadOptions());
    for (it->SeekToFirst(); it->Valid(); it->Next()) {
        std::cout << it->key().ToString() << ": " << it->value().ToString() << std::endl;
        if (!(it->status().ok()))
            std::cerr << it->status().ToString() << std::endl;
        status = db->Delete(leveldb::WriteOptions(), it->key());
        if (!status.ok())
            std::cerr << status.ToString() << std::endl;
    }

    // Alternative run: leveldb's built-in repair.
    //status = leveldb::RepairDB("./omap", options);
    //if (!status.ok())
    //    std::cerr << status.ToString() << std::endl;

    delete it;
    delete db;

    return 0;
}
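
(For reference, this was built and run against the uploaded omap directory with something like g++ omap_check.cc -lleveldb -o omap_check; the file name and exact flags here are assumptions.)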

The code does not detect any corruption in the database, either during repair, reading of every key, or deletion of every key. Without additional data I'm not sure how we can proceed here.

#20 Updated by Brad Hubbard almost 7 years ago

  • Status changed from New to Need More Info

#21 Updated by Brad Hubbard almost 7 years ago

  • Status changed from Need More Info to Closed

#22 Updated by zhiyuan wang almost 7 years ago

Hi Brad,
I have hit the same issue in my environment. Could you explain why this issue was closed?
Was there not enough information to locate it?

#23 Updated by Brad Hubbard almost 7 years ago

Hello,

Yes, in the specific case above there was not enough information to determine the exact cause.

What specifically is the issue you are facing?

#24 Updated by George Vasilakakos almost 7 years ago

Hi Brad, sorry I couldn't get around to this sooner. Unfortunately we did not collect any more data than what I've uploaded.
By now, I'd expect most useful information will have been replaced by more recent data.

#25 Updated by Brad Hubbard almost 7 years ago

Right, and in the absence of further information this tracker has been closed.

#26 Updated by zhiyuan wang almost 7 years ago

Hi,
My test case is:
Three servers with one monitor and 5 OSDs each (Jewel, bluestore).
Switch off all three servers while writing (not a normal power-off), then power the three servers back on.
Two OSDs hit this issue.
I got the following pg information:

"peering_blocked_by": [],
"peering_blocked_by_detail": [
    {
        "detail": "peering_blocked_by_history_les_bound"
    }
]

After restarting the two OSDs, the issue disappeared.
But the issue has only occurred once and could not be reproduced so far.

#27 Updated by Brad Hubbard almost 7 years ago

zhiyuan wang, that is not the same issue. If you can reproduce it and provide some debugging data, please feel free to open a new tracker so we can investigate your issue further.
