Bug #39956

OSD: Cancel copy op causes memory leak

Added by tao ning almost 5 years ago. Updated almost 5 years ago.

Status: New
Priority: Normal
Assignee: -
Category: Tiering
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS): OSD
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

ceph version 12.2.7

00:00:06:00.712 3722687 15,237,248 (2,770,560 direct, 12,466,688 indirect) bytes in 3,848 blocks are definitely lost in loss record 21,210 of 21,212
00:00:06:00.712 3722687 at 0xA3E8888: operator new[](unsigned long) (vg_replace_malloc.c:423)
00:00:06:00.712 3722687 by 0x8A261B: Objecter::_prepare_osd_op(Objecter::Op*) (Objecter.cc:3232)
00:00:06:00.712 3722687 by 0x8AA5E8: Objecter::_op_submit(Objecter::Op*, ceph::shunique_lock<boost::shared_mutex>&, unsigned long*) (Objecter.cc:2514)
00:00:06:00.712 3722687 by 0x8B7734: Objecter::_op_submit_with_budget(Objecter::Op*, ceph::shunique_lock<boost::shared_mutex>&, unsigned long*, int*) (Objecter.cc:2351)
00:00:06:00.712 3722687 by 0x8B7A32: Objecter::op_submit(Objecter::Op*, unsigned long*, int*) (Objecter.cc:2318)
00:00:06:00.712 3722687 by 0x76A0E5: read (Objecter.h:2315)
00:00:06:00.712 3722687 by 0x76A0E5: PrimaryLogPG::_copy_some(std::shared_ptr<ObjectContext>, std::shared_ptr<PrimaryLogPG::CopyOp>) (PrimaryLogPG.cc:8291)
00:00:06:00.712 3722687 by 0x76B0E3: PrimaryLogPG::start_copy(PrimaryLogPG::CopyCallback*, std::shared_ptr<ObjectContext>, hobject_t, object_locator_t, unsigned long, unsigned int, bool, unsigned int, unsigned int) (PrimaryLogPG.cc:8228)
00:00:06:00.712 3722687 by 0x7838E4: PrimaryLogPG::promote_object(std::shared_ptr<ObjectContext>, hobject_t const&, object_locator_t const&, boost::intrusive_ptr<OpRequest>, std::shared_ptr<ObjectContext>*) (PrimaryLogPG.cc:3298)
00:00:06:00.712 3722687 by 0x785CDE: PrimaryLogPG::maybe_handle_cache_detail(boost::intrusive_ptr<OpRequest>, bool, std::shared_ptr<ObjectContext>, int, hobject_t, bool, bool, std::shared_ptr<ObjectContext>*) (PrimaryLogPG.cc:2669)
00:00:06:00.712 3722687 by 0x807BE9: PrimaryLogPG::maybe_handle_cache(boost::intrusive_ptr<OpRequest>, bool, std::shared_ptr<ObjectContext>, int, hobject_t const&, bool, bool) (PrimaryLogPG.h:1165)

00:00:06:00.712 3722687 15,548,416 bytes in 3,796 blocks are indirectly lost in loss record 21,211 of 21,212
00:00:06:00.712 3722687 at 0xA3E9C3C: memalign (vg_replace_malloc.c:857)
00:00:06:00.712 3722687 by 0xA3E9D43: posix_memalign (vg_replace_malloc.c:1021)
00:00:06:00.712 3722687 by 0xB7E265: create (buffer.cc:311)
00:00:06:00.712 3722687 by 0xB7E265: ceph::buffer::list::append(char const*, unsigned int) (buffer.cc:1948)
00:00:06:00.713 3722687 by 0x842495: encode_raw<unsigned char> (encoding.h:69)
00:00:06:00.713 3722687 by 0x842495: encode (encoding.h:81)
00:00:06:00.713 3722687 by 0x842495: object_copy_cursor_t::encode(ceph::buffer::list&) const (osd_types.cc:4468)
00:00:06:00.713 3722687 by 0x769B95: encode (osd_types.h:4352)
00:00:06:00.713 3722687 by 0x769B95: copy_get (Objecter.h:833)
00:00:06:00.713 3722687 by 0x769B95: PrimaryLogPG::_copy_some(std::shared_ptr<ObjectContext>, std::shared_ptr<PrimaryLogPG::CopyOp>) (PrimaryLogPG.cc:8278)
00:00:06:00.713 3722687 by 0x76B0E3: PrimaryLogPG::start_copy(PrimaryLogPG::CopyCallback*, std::shared_ptr<ObjectContext>, hobject_t, object_locator_t, unsigned long, unsigned int, bool, unsigned int, unsigned int) (PrimaryLogPG.cc:8228)
00:00:06:00.713 3722687 by 0x7838E4: PrimaryLogPG::promote_object(std::shared_ptr<ObjectContext>, hobject_t const&, object_locator_t const&, boost::intrusive_ptr<OpRequest>, std::shared_ptr<ObjectContext>*) (PrimaryLogPG.cc:3298)
00:00:06:00.713 3722687 by 0x785CDE: PrimaryLogPG::maybe_handle_cache_detail(boost::intrusive_ptr<OpRequest>, bool, std::shared_ptr<ObjectContext>, int, hobject_t, bool, bool, std::shared_ptr<ObjectContext>*) (PrimaryLogPG.cc:2669)
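
The blocks reported as "definitely lost" are allocated in Objecter::_prepare_osd_op() when the promote's copy-get is submitted; the report's thesis is that they are orphaned when the copy op is cancelled before its completion path runs. Below is a minimal standalone C++ sketch of that pattern, not Ceph's actual code; FakeOp, submit_op, and the cancel_op_* functions are invented names.

#include <cstdio>
#include <map>

// Stand-in for the per-op state an Objecter-like layer keeps for an
// in-flight request.
struct FakeOp {
    int *ops = nullptr;   // stands in for the new[] seen in _prepare_osd_op
    int  nops = 0;
};

static std::map<int, FakeOp*> inflight;  // tid -> op
static int next_tid = 1;

int submit_op(int nops) {
    FakeOp *op = new FakeOp;
    op->ops = new int[nops];             // the block valgrind flags as lost
    op->nops = nops;
    int tid = next_tid++;
    inflight[tid] = op;
    return tid;
}

// Buggy cancel: unlinks the op from the in-flight map but never frees the
// per-op buffers, so the allocation above becomes unreachable.
void cancel_op_leaky(int tid) {
    inflight.erase(tid);
}

// Correct cancel: tears down everything the submit path allocated.
void cancel_op_fixed(int tid) {
    auto it = inflight.find(tid);
    if (it == inflight.end()) return;
    delete[] it->second->ops;
    delete it->second;
    inflight.erase(it);
}

int main() {
    cancel_op_leaky(submit_op(16));      // leaks the FakeOp and its array
    cancel_op_fixed(submit_op(16));      // no leak
    std::puts("run both paths under valgrind to see the difference");
    return 0;
}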
#1

Updated by tao ning almost 5 years ago

If two clients access the same snap object at the same time and the object needs to be promoted, then before the promotion completes the two ops cancel each other, aggravating the memory leak (a toy simulation of the cycle follows the logs below).
1. client.a starts the copy
start_copy 3:6e3a6ad3:::10000308246.00000000:46ec from 3:6e3a6ad3:::10000308246.00000000:46ec @2 v0 flags 30
2019-05-10 04:37:21.500270 7fe478957700 10 osd.17 pg_epoch: 60755 pg[3.76( v 60755'1023259 (60011'1021684,60755'1023259] local-lis/les=60740/60741 n=8 ec=100/100 lis/c 60740/60721 les/c/f 60741/60722/0 60740/60740/60738) [17,6,23,15] r=0 lpr=60740 pi=[60721,60740)/1 crt=60755'1023259 lcod 60755'1023258 mlcod 60755'1023258 active+undersized+degraded]
_copy_some 0x555b98e50a00 0x555babfe0e18
2019-05-10 04:37:21.508270 7fe478957700 10 osd.17 pg_epoch: 60755 pg[3.76( v 60755'1023259 (60011'1021684,60755'1023259] local-lis/les=60740/60741 n=8 ec=100/100 lis/c 60740/60721 les/c/f 60741/60722/0 60740/60740/60738) [17,6,23,15] r=0 lpr=60740 pi=[60721,60740)/1 crt=60755'1023259 lcod 60755'1023258 mlcod 60755'1023258 active+undersized+degraded]
wait_for_blocked_object 3:6e3a6ad3:::10000308246.00000000:46ec 0x555bb5ec5b00
2. client.b cancels client.a
2019-05-10 04:37:21.508597 7fe47a15a700 10 osd.17 pg_epoch: 60755 pg[3.76( v 60755'1023259 (60011'1021684,60755'1023259] local-lis/les=60740/60741 n=8 ec=100/100 lis/c 60740/60721 les/c/f 60741/60722/0 60740/60740/60738) [17,6,23,15] r=0 lpr=60740 pi=[60721,60740)/1 crt=60755'1023259 lcod 60755'1023258 mlcod 60755'1023258 active+undersized+degraded]
start_copy 3:6e3a6ad3:::10000308246.00000000:46ec from 3:6e3a6ad3:::10000308246.00000000:46ec @2 v0 flags 30
2019-05-10 04:37:21.508608 7fe47a15a700 10 osd.17 pg_epoch: 60755 pg[3.76( v 60755'1023259 (60011'1021684,60755'1023259] local-lis/les=60740/60741 n=8 ec=100/100 lis/c 60740/60721 les/c/f 60741/60722/0 60740/60740/60738) [17,6,23,15] r=0 lpr=60740 pi=[60721,60740)/1 crt=60755'1023259 lcod 60755'1023258 mlcod 60755'1023258 active+undersized+degraded]
cancel_copy 3:6e3a6ad3:::10000308246.00000000:46ec from 3:6e3a6ad3:::10000308246.00000000:46ec @2 v0
2019-05-10 04:37:21.508618 7fe47a15a700 10 osd.17 pg_epoch: 60755 pg[3.76( v 60755'1023259 (60011'1021684,60755'1023259] local-lis/les=60740/60741 n=8 ec=100/100 lis/c 60740/60721 les/c/f 60741/60722/0 60740/60740/60738) [17,6,23,15] r=0 lpr=60740 pi=[60721,60740)/1 crt=60755'1023259 lcod 60755'1023258 mlcod 60755'1023258 active+undersized+degraded]
kick_object_context_blocked 3:6e3a6ad3:::10000308246.00000000:46ec requeuing 1 requests
2019-05-10 04:37:21.508672 7fe47a15a700 10 osd.17 pg_epoch: 60755 pg[3.76( v 60755'1023259 (60011'1021684,60755'1023259] local-lis/les=60740/60741 n=8 ec=100/100 lis/c 60740/60721 les/c/f 60741/60722/0 60740/60740/60738) [17,6,23,15] r=0 lpr=60740 pi=[60721,60740)/1 crt=60755'1023259 lcod 60755'1023258 mlcod 60755'1023258 active+undersized+degraded]
_copy_some 0x555b98e50a00 0x555b8a219518
3. client.a cancels client.b
start_copy 3:6e3a6ad3:::10000308246.00000000:46ec from 3:6e3a6ad3:::10000308246.00000000:46ec @2 v0 flags 30
2019-05-10 04:37:21.508906 7fe47a15a700 10 osd.17 pg_epoch: 60755 pg[3.76( v 60755'1023259 (60011'1021684,60755'1023259] local-lis/les=60740/60741 n=8 ec=100/100 lis/c 60740/60721 les/c/f 60741/60722/0 60740/60740/60738) [17,6,23,15] r=0 lpr=60740 pi=[60721,60740)/1 crt=60755'1023259 lcod 60755'1023258 mlcod 60755'1023258 active+undersized+degraded]
kick_object_context_blocked 3:6e3a6ad3:::10000308246.00000000:46ec requeuing 1 requests
2019-05-10 04:37:21.508897 7fe47a15a700 10 osd.17 pg_epoch: 60755 pg[3.76( v 60755'1023259 (60011'1021684,60755'1023259] local-lis/les=60740/60741 n=8 ec=100/100 lis/c 60740/60721 les/c/f 60741/60722/0 60740/60740/60738) [17,6,23,15] r=0 lpr=60740 pi=[60721,60740)/1 crt=60755'1023259 lcod 60755'1023258 mlcod 60755'1023258 active+undersized+degraded]
cancel_copy 3:6e3a6ad3:::10000308246.00000000:46ec from 3:6e3a6ad3:::10000308246.00000000:46ec @2 v0

...and so on, forming a loop.
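
Here is a toy simulation of the sequence above, assuming the behaviour the logs show (each client's start_copy for the blocked object cancels the other client's in-flight copy and requeues its request); the names and structure are illustrative, not Ceph code.

#include <cstdio>

int main() {
    int inflight_owner = -1;                  // -1 means no copy in flight
    const char *names[2] = {"client.a", "client.b"};
    for (int step = 0; step < 6; ++step) {
        int c = step % 2;                     // requeued requests alternate a, b, a, ...
        if (inflight_owner >= 0 && inflight_owner != c)
            std::printf("%s: cancel_copy of %s's op (cancelled op leaks, see trace)\n",
                        names[c], names[inflight_owner]);
        inflight_owner = c;
        std::printf("%s: start_copy; %s requeued behind the blocked object\n",
                    names[c], names[1 - c]);
    }
    std::puts("... neither promote completes; each cycle cancels (and leaks) one op");
    return 0;
}

Under this model the loop only breaks if one copy completes before the competing request is requeued, which matches the observation that concurrent access to the same snap object aggravates the leak.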

#2

Updated by Josh Durgin almost 5 years ago

  • Category set to Tiering