Project

General

Profile

Support #22531

OSD flapping under repair/scrub after recieve inconsistent PG LFNIndex.cc: 439: FAILED assert(long_name == short_name)

Added by Jan Michlik over 3 years ago. Updated about 3 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
% Done:

0%

Tags:
Reviewed:
Affected Versions:
Component(RADOS):
OSD
Pull request ID:

Description

Hi.
I have a problem when repairing PG 1.f under copy from OSD.3 on OSD.0. During the upgrade to 12.2.2, all OSDs were not started and for that reason an inconsistency was probably caused. Now during repair of PG 1.f OSD.0 is restarting itself. If I remove OSD.0, the OSD is restarted, where PG 1.f is copied. It does not help deep scrub / repair.

Logs from OSD.0
2017-12-15 11:28:53.772760 7fdf338ee700 10 osd.0 pg_epoch: 17464 pg[1.f( v 17464'641517 (17036'639973,17464'641517] local-lis/les=17462/17463 n=16457 ec=20/20 lis/c 17462/17462 les/c/f 17463/17463/14330 17462/17462/17462) [3,0] r=1 lpr=174
62 luod=0'0 crt=17464'641517 lcod 17464'641515 active+inconsistent] _handle_message: 0x55c4fd2f3c80
2017-12-15 11:28:53.772766 7fdf338ee700 7 osd.0 pg_epoch: 17464 pg[1.f( v 17464'641517 (17036'639973,17464'641517] local-lis/les=17462/17463 n=16457 ec=20/20 lis/c 17462/17462 les/c/f 17463/17463/14330 17462/17462/17462) [3,0] r=1 lpr=174
62 luod=0'0 crt=17464'641517 lcod 17464'641515 active+inconsistent] replica_scrub
2017-12-15 11:28:53.772770 7fdf338ee700 10 osd.0 pg_epoch: 17464 pg[1.f( v 17464'641517 (17036'639973,17464'641517] local-lis/les=17462/17463 n=16457 ec=20/20 lis/c 17462/17462 les/c/f 17463/17463/14330 17462/17462/17462) [3,0] r=1 lpr=174
62 luod=0'0 crt=17464'641517 lcod 17464'641515 active+inconsistent] build_scrub_map_chunk [1:f0dca08d:::rbd_data.11f574b0dc51.0000000000005828:3895,1:f0e1df65:::rbd_data.5dd47238e1f29.000000000000b92c:ba2) seed 4294967295
2017-12-15 11:28:53.845944 7fdf338ee700 -1 /build/ceph-12.2.2/src/os/filestore/LFNIndex.cc: In function 'int LFNIndex::list_objects(const std::vector<std::basic_string<char> >&, int, long int*, std::map<std::basic_string<char>, ghobject_t>
*)' thread 7fdf338ee700 time 2017-12-15 11:28:53.840677
/build/ceph-12.2.2/src/os/filestore/LFNIndex.cc: 439: FAILED assert(long_name == short_name)

ceph version 12.2.2 (cf0baeeeeba3b47f9427c6c97e2144b094b7e5ba) luminous (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x55c4ec5d09b2]
2: (LFNIndex::list_objects(std::vector&lt;std::string, std::allocator&lt;std::string&gt; > const&, int, long*, std::map&lt;std::string, ghobject_t, std::less&lt;std::string&gt;, std::allocator&lt;std::pair&lt;std::string const, ghobject_t&gt; > >)+0x239) [0x55c4ec
3bd579]
3: (HashIndex::get_path_contents_by_hash_bitwise(std::vector&lt;std::string, std::allocator&lt;std::string&gt; > const&, ghobject_t const
, std::set&lt;std::string, HashIndex::CmpHexdigitStringBitwise, std::allocator&lt;std::string&gt; >, std::set&lt;std::pa
ir&lt;std::string, ghobject_t&gt;, HashIndex::CmpPairBitwise, std::allocator&lt;std::pair&lt;std::string, ghobject_t&gt; > >
)+0x86) [0x55c4ec52b7e6]
4: (HashIndex::list_by_hash_bitwise(std::vector&lt;std::string, std::allocator&lt;std::string&gt; > const&, ghobject_t const&, int, ghobject_t*, std::vector&lt;ghobject_t, std::allocator&lt;ghobject_t&gt; >)+0x140) [0x55c4ec52c2e0]
5: (HashIndex::list_by_hash_bitwise(std::vector&lt;std::string, std::allocator&lt;std::string&gt; > const&, ghobject_t const&, int, ghobject_t
, std::vector&lt;ghobject_t, std::allocator&lt;ghobject_t&gt; >)+0x536) [0x55c4ec52c6d6]
6: (HashIndex::list_by_hash_bitwise(std::vector&lt;std::string, std::allocator&lt;std::string&gt; > const&, ghobject_t const&, int, ghobject_t
, std::vector&lt;ghobject_t, std::allocator&lt;ghobject_t&gt; >)+0x536) [0x55c4ec52c6d6]
7: (HashIndex::list_by_hash_bitwise(std::vector&lt;std::string, std::allocator&lt;std::string&gt; > const&, ghobject_t const&, int, ghobject_t
, std::vector&lt;ghobject_t, std::allocator&lt;ghobject_t&gt; >)+0x536) [0x55c4ec52c6d6]
8: (HashIndex::_collection_list_partial(ghobject_t const&, ghobject_t const&, int, std::vector&lt;ghobject_t, std::allocator&lt;ghobject_t&gt; >
, ghobject_t*)+0x1c7) [0x55c4ec52d3f7]
9: (FileStore::collection_list(coll_t const&, ghobject_t const&, ghobject_t const&, int, std::vector&lt;ghobject_t, std::allocator&lt;ghobject_t&gt; >, ghobject_t)+0x3c5) [0x55c4ec359ca5]
10: (PGBackend::objects_list_range(hobject_t const&, hobject_t const&, snapid_t, std::vector&lt;hobject_t, std::allocator&lt;hobject_t&gt; >, std::vector&lt;ghobject_t, std::allocator&lt;ghobject_t&gt; >)+0x18f) [0x55c4ec21d8ff]
11: (PG::build_scrub_map_chunk(ScrubMap&, hobject_t, hobject_t, bool, unsigned int, ThreadPool::TPHandle&)+0x1de) [0x55c4ec0cee0e]
12: (PG::replica_scrub(boost::intrusive_ptr&lt;OpRequest&gt;, ThreadPool::TPHandle&)+0x624) [0x55c4ec0cf784]
13: (PrimaryLogPG::do_request(boost::intrusive_ptr&lt;OpRequest&gt;&, ThreadPool::TPHandle&)+0x8b6) [0x55c4ec18e436]
14: (OSD::dequeue_op(boost::intrusive_ptr&lt;PG&gt;, boost::intrusive_ptr&lt;OpRequest&gt;, ThreadPool::TPHandle&)+0x3f7) [0x55c4ec01ae07]
15: (PGQueueable::RunVis::operator()(boost::intrusive_ptr&lt;OpRequest&gt; const&)+0x57) [0x55c4ec292a67]
16: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x108c) [0x55c4ec049f5c]
17: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x88d) [0x55c4ec5d663d]
18: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55c4ec5d8600]
19: (()+0x7494) [0x7fdf4f09d494]
20: (clone()+0x3f) [0x7fdf4e124aff]
NOTE: a copy of the executable, or `objdump rdS &lt;executable&gt;` is needed to interpret this.
...
-14> 2017-12-15 11:28:53.621104 7fdf400ec700 20 osd.0 pg_epoch: 17464 pg[1.31( v 17464'715154 (17095'713621,17464'715154] local-lis/les=17451/17452 n=16574 ec=20/20 lis/c 17451/17451 les/c/f 17452/17452/14330 17451/17451/17451) [0,2] r=
0 lpr=17451 crt=17464'715154 lcod 17464'715152 mlcod 17464'715152 active+clean] remove_repop repgather(0x55c4fcf3b680 17464'715154 rep_tid=3292 committed?=1 applied?=1 r=0)
-13> 2017-12-15 11:28:53.715536 7fdf320eb700 20 osd.0 op_wq(3) _process empty q, waiting
-12> 2017-12-15 11:28:53.772657 7fdf4b4fc700 5 -
192.168.0.49:6810/2195 >> 192.168.0.49:6818/11206 conn(0x55c501db5000 :6810 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=3 cs=1 l=0). rx osd.3 seq 1069 0x55c501a73200 replica scrub
(pg: 1.f,from:0'0,to:0'0,epoch:17464/17462,start:1:f0dca08d:::rbd_data.11f574b0dc51.0000000000005828:3895,end:1:f0e1df65:::rbd_data.5dd47238e1f29.000000000000b92c:ba2,chunky:1,deep:1,seed:4294967295,version:7) v7
11> 2017-12-15 11:28:53.772678 7fdf4b4fc700 1 - 192.168.0.49:6810/2195 <== osd.3 192.168.0.49:6818/11206 1069 ==== replica scrub(pg: 1.f,from:0'0,to:0'0,epoch:17464/17462,start:1:f0dca08d:::rbd_data.11f574b0dc51.0000000000005828:3895
,end:1:f0e1df65:::rbd_data.5dd47238e1f29.000000000000b92c:ba2,chunky:1,deep:1,seed:4294967295,version:7) v7 ==== 211+0+0 (1697000901 0 0) 0x55c501a73200 con 0x55c501db5000
-10> 2017-12-15 11:28:53.772690 7fdf4b4fc700 15 osd.0 17464 enqueue_op 0x55c4fd2f3c80 prio 127 cost 0 latency 0.000248 epoch 17464 replica scrub(pg: 1.f,from:0'0,to:0'0,epoch:17464/17462,start:1:f0dca08d:::rbd_data.11f574b0dc51.00000000
00005828:3895,end:1:f0e1df65:::rbd_data.5dd47238e1f29.000000000000b92c:ba2,chunky:1,deep:1,seed:4294967295,version:7) v7
-9> 2017-12-15 11:28:53.772698 7fdf4b4fc700 20 osd.0 op_wq(0) _enqueue 1.f PGQueueable(0x55c4fd2f3c80 prio 127 cost 0 e17464)
-8> 2017-12-15 11:28:53.772723 7fdf338ee700 20 osd.0 op_wq(0) _process 1.f item PGQueueable(0x55c4fd2f3c80 prio 127 cost 0 e17464) queued
-7> 2017-12-15 11:28:53.772732 7fdf338ee700 20 osd.0 op_wq(0) _process 1.f item PGQueueable(0x55c4fd2f3c80 prio 127 cost 0 e17464) pg 0x55c4fb172000
-6> 2017-12-15 11:28:53.772735 7fdf338ee700 10 osd.0 17464 dequeue_op 0x55c4fd2f3c80 prio 127 cost 0 latency 0.000292 replica scrub(pg: 1.f,from:0'0,to:0'0,epoch:17464/17462,start:1:f0dca08d:::rbd_data.11f574b0dc51.0000000000005828:389
5,end:1:f0e1df65:::rbd_data.5dd47238e1f29.000000000000b92c:ba2,chunky:1,deep:1,seed:4294967295,version:7) v7 pg pg[1.f( v 17464'641517 (17036'639973,17464'641517] local-lis/les=17462/17463 n=16457 ec=20/20 lis/c 17462/17462 les/c/f 17463/1
7463/14330 17462/17462/17462) [3,0] r=1 lpr=17462 luod=0'0 crt=17464'641517 lcod 17464'641515 active+inconsistent]
-5> 2017-12-15 11:28:53.772751 7fdf338ee700 20 osd.0 17464 share_map osd.3 192.168.0.49:6818/11206 17464
-4> 2017-12-15 11:28:53.772754 7fdf338ee700 20 osd.0 17464 should_share_map osd.3 192.168.0.49:6818/11206 17464
-3> 2017-12-15 11:28:53.772760 7fdf338ee700 10 osd.0 pg_epoch: 17464 pg[1.f( v 17464'641517 (17036'639973,17464'641517] local-lis/les=17462/17463 n=16457 ec=20/20 lis/c 17462/17462 les/c/f 17463/17463/14330 17462/17462/17462) [3,0] r=1
lpr=17462 luod=0'0 crt=17464'641517 lcod 17464'641515 active+inconsistent] _handle_message: 0x55c4fd2f3c80
-2> 2017-12-15 11:28:53.772766 7fdf338ee700 7 osd.0 pg_epoch: 17464 pg[1.f( v 17464'641517 (17036'639973,17464'641517] local-lis/les=17462/17463 n=16457 ec=20/20 lis/c 17462/17462 les/c/f 17463/17463/14330 17462/17462/17462) [3,0] r=1
lpr=17462 luod=0'0 crt=17464'641517 lcod 17464'641515 active+inconsistent] replica_scrub
-1> 2017-12-15 11:28:53.772770 7fdf338ee700 10 osd.0 pg_epoch: 17464 pg[1.f( v 17464'641517 (17036'639973,17464'641517] local-lis/les=17462/17463 n=16457 ec=20/20 lis/c 17462/17462 les/c/f 17463/17463/14330 17462/17462/17462) [3,0] r=1
lpr=17462 luod=0'0 crt=17464'641517 lcod 17464'641515 active+inconsistent] build_scrub_map_chunk [1:f0dca08d:::rbd_data.11f574b0dc51.0000000000005828:3895,1:f0e1df65:::rbd_data.5dd47238e1f29.000000000000b92c:ba2) seed 4294967295
0> 2017-12-15 11:28:53.845944 7fdf338ee700 -1 /build/ceph-12.2.2/src/os/filestore/LFNIndex.cc: In function 'int LFNIndex::list_objects(const std::vector&lt;std::basic_string&lt;char&gt; >&, int, long int*, std::map&lt;std::basic_string&lt;char&gt;, gho
bject_t>*)' thread 7fdf338ee700 time 2017-12-15 11:28:53.840677
/build/ceph-12.2.2/src/os/filestore/LFNIndex.cc: 439: FAILED assert(long_name == short_name)
ceph version 12.2.2 (cf0baeeeeba3b47f9427c6c97e2144b094b7e5ba) luminous (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x55c4ec5d09b2]
2: (LFNIndex::list_objects(std::vector&lt;std::string, std::allocator&lt;std::string&gt; > const&, int, long*, std::map&lt;std::string, ghobject_t, std::less&lt;std::string&gt;, std::allocator&lt;std::pair&lt;std::string const, ghobject_t&gt; > >)+0x239) [0x55c4ec
3bd579]
3: (HashIndex::get_path_contents_by_hash_bitwise(std::vector&lt;std::string, std::allocator&lt;std::string&gt; > const&, ghobject_t const
, std::set&lt;std::string, HashIndex::CmpHexdigitStringBitwise, std::allocator&lt;std::string&gt; >, std::set&lt;std::pa
ir&lt;std::string, ghobject_t&gt;, HashIndex::CmpPairBitwise, std::allocator&lt;std::pair&lt;std::string, ghobject_t&gt; > >
)+0x86) [0x55c4ec52b7e6]
4: (HashIndex::list_by_hash_bitwise(std::vector&lt;std::string, std::allocator&lt;std::string&gt; > const&, ghobject_t const&, int, ghobject_t*, std::vector&lt;ghobject_t, std::allocator&lt;ghobject_t&gt; >)+0x140) [0x55c4ec52c2e0]
5: (HashIndex::list_by_hash_bitwise(std::vector&lt;std::string, std::allocator&lt;std::string&gt; > const&, ghobject_t const&, int, ghobject_t
, std::vector&lt;ghobject_t, std::allocator&lt;ghobject_t&gt; >)+0x536) [0x55c4ec52c6d6]
6: (HashIndex::list_by_hash_bitwise(std::vector&lt;std::string, std::allocator&lt;std::string&gt; > const&, ghobject_t const&, int, ghobject_t
, std::vector&lt;ghobject_t, std::allocator&lt;ghobject_t&gt; >)+0x536) [0x55c4ec52c6d6]
7: (HashIndex::list_by_hash_bitwise(std::vector&lt;std::string, std::allocator&lt;std::string&gt; > const&, ghobject_t const&, int, ghobject_t
, std::vector&lt;ghobject_t, std::allocator&lt;ghobject_t&gt; >)+0x536) [0x55c4ec52c6d6]
8: (HashIndex::_collection_list_partial(ghobject_t const&, ghobject_t const&, int, std::vector&lt;ghobject_t, std::allocator&lt;ghobject_t&gt; >
, ghobject_t*)+0x1c7) [0x55c4ec52d3f7]
9: (FileStore::collection_list(coll_t const&, ghobject_t const&, ghobject_t const&, int, std::vector&lt;ghobject_t, std::allocator&lt;ghobject_t&gt; >, ghobject_t)+0x3c5) [0x55c4ec359ca5]
10: (PGBackend::objects_list_range(hobject_t const&, hobject_t const&, snapid_t, std::vector&lt;hobject_t, std::allocator&lt;hobject_t&gt; >, std::vector&lt;ghobject_t, std::allocator&lt;ghobject_t&gt; >)+0x18f) [0x55c4ec21d8ff]
11: (PG::build_scrub_map_chunk(ScrubMap&, hobject_t, hobject_t, bool, unsigned int, ThreadPool::TPHandle&)+0x1de) [0x55c4ec0cee0e]
12: (PG::replica_scrub(boost::intrusive_ptr&lt;OpRequest&gt;, ThreadPool::TPHandle&)+0x624) [0x55c4ec0cf784]
13: (PrimaryLogPG::do_request(boost::intrusive_ptr&lt;OpRequest&gt;&, ThreadPool::TPHandle&)+0x8b6) [0x55c4ec18e436]
14: (OSD::dequeue_op(boost::intrusive_ptr&lt;PG&gt;, boost::intrusive_ptr&lt;OpRequest&gt;, ThreadPool::TPHandle&)+0x3f7) [0x55c4ec01ae07]
15: (PGQueueable::RunVis::operator()(boost::intrusive_ptr&lt;OpRequest&gt; const&)+0x57) [0x55c4ec292a67]
16: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x108c) [0x55c4ec049f5c]
17: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x88d) [0x55c4ec5d663d]
18: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55c4ec5d8600]
19: (()+0x7494) [0x7fdf4f09d494]
20: (clone()+0x3f) [0x7fdf4e124aff]
NOTE: a copy of the executable, or `objdump -rdS &lt;executable&gt;` is needed to interpret this.

osd_ceph-3_current_1.f_head-ls-la.txt.gz - List directory OSD3 pg 1.f (204 KB) Jan Michlik, 01/15/2018 10:57 AM

History

#1 Updated by Patrick Donnelly over 3 years ago

  • Project changed from Ceph to RADOS
  • Category deleted (OSD)
  • Component(RADOS) OSD added

#2 Updated by Josh Durgin over 3 years ago

  • Tracker changed from Bug to Support

Can you provide a directory listing for pg 1.f? It seems a file that does not obey the internal naming rules of filestore has been created there somehow.

#3 Updated by Jan Michlik over 3 years ago

Josh Durgin wrote:

Can you provide a directory listing for pg 1.f? It seems a file that does not obey the internal naming rules of filestore has been created there somehow.

Hello.
Thank you for your answer. I am uploading a file listing.

#4 Updated by Chris Dunlop about 3 years ago

For the record...

I was also suffering this problem on a pg repair. That was because I was following the procedure at https://ceph.com/geen-categorie/ceph-manually-repair-object/ and at the point of "move the bad object to another location" I did a "mv ${obj} ${obj}.bad", thus creating in Josh's words above "a file that does not obey the internal naming rules of filestore".

To resolve the issue I moved the bad object elsewhere entirely (to save it for further examination) and restarted the osd.

Also available in: Atom PDF