Project

General

Profile

Actions

Bug #7891

closed

osd: leaked pg refs on shutdown

Added by Sage Weil about 10 years ago. Updated over 9 years ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
OSD
Target version:
-
% Done:

0%

Source:
Q/A
Tags:
Backport:
firefly
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

  -330> 2014-03-27 18:26:28.535862 1ff69700 20 osd.3 80  kicking pg 23.6
  -329> 2014-03-27 18:26:28.535984 1ff69700 30 osd.3 pg_epoch: 80 pg[23.6( v 80'8 (0'0,80'8] local-les=80 n=1 ec=79 les/c 80/80 79/79/79) [3,1] r=0 lpr=79 luod=0'0 crt=0'0 lcod 80'6 mlcod 0'0 active+clean] lock
  -328> 2014-03-27 18:26:28.537515 1ff69700 -1 osd.3 80 pgid 23.6 has ref count of 4
  -327> 2014-03-27 18:26:28.642681 1ff69700 -1 osd/OSD.cc: In function 'int OSD::shutdown()' thread 1ff69700 time 2014-03-27 18:26:28.541435
osd/OSD.cc: 1689: FAILED assert(0)

 ceph version 0.78-367-gd9a2dea (d9a2dea755a62e4f9fe0795410f37b68a15ae054)
 1: (OSD::shutdown()+0x1557) [0x62f3c7]
 2: (OSD::handle_signal(int)+0x11d) [0x62fabd]
 3: (SignalHandler::entry()+0x18b) [0x98139b]
 4: (()+0x7e9a) [0x52a8e9a]
 5: (clone()+0x6d) [0x6d1e3fd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

ubuntu@teuthology:/var/lib/teuthworker/archive/teuthology-2014-03-27_02:30:16-rados-firefly-distro-basic-plana/149035

Related issues 1 (0 open1 closed)

Related to Ceph - Bug #9630: osd: leaked pg refs on shutdown (dumpling)ResolvedLoïc Dachary09/30/2014

Actions
Actions #1

Updated by Samuel Just about 10 years ago

  • Assignee set to Samuel Just
Actions #2

Updated by Samuel Just about 10 years ago

  • Assignee deleted (Samuel Just)
Actions #3

Updated by Samuel Just about 10 years ago

ubuntu@teuthology:/a/teuthology-2014-04-02_02:30:02-rados-master-testing-basic-plana/161303/remote$

Actions #4

Updated by Sage Weil about 10 years ago

ubuntu@teuthology:/var/lib/teuthworker/archive/teuthology-2014-04-04_02:30:19-rados-master-testing-basic-plana/168600

Actions #5

Updated by Sage Weil about 10 years ago

ubuntu@teuthology:/var/lib/teuthworker/archive/gregf-2014-04-04_22:05:49-rados-wip-7994-testing-basic-plana/171097

Actions #6

Updated by Sage Weil about 10 years ago

  • Status changed from 12 to In Progress
  • Assignee set to Sage Weil
Actions #8

Updated by Sage Weil about 10 years ago

ubuntu@teuthology:/a/teuthology-2014-04-15_02:30:04-rados-firefly-distro-basic-plana/193023 but not debug patch applied

Actions #9

Updated by Samuel Just almost 10 years ago

ubuntu@teuthology:/a/sage-2014-04-23_18:03:07-rados-firefly-testing-basic-plana/211808

Actions #10

Updated by Samuel Just almost 10 years ago

ubuntu@teuthology:/a/teuthology-2014-04-25_02:30:12-rados-master-testing-basic-plana/214284/remote

Actions #11

Updated by Sage Weil almost 10 years ago

ubuntu@teuthology:/a/samuelj-2014-05-08_12:44:29-rados-firefly-testing-basic-plana/243892 (shared_cache assert on shutdown)

Actions #12

Updated by Samuel Just almost 10 years ago

ubuntu@teuthology:/a/samuelj-2014-05-08_12:44:29-rados-firefly-testing-basic-plana/243892/remote

Actions #13

Updated by Sage Weil almost 10 years ago

ubuntu@teuthology:/var/lib/teuthworker/archive/sage-2014-05-11_09:34:50-rados-firefly-testing-basic-plana/249163

Actions #14

Updated by Samuel Just almost 10 years ago

ubuntu@teuthology:/a/samuelj-2014-05-09_14:11:19-rados-wip-sam-testing-testing-basic-plana/245865

Actions #15

Updated by Sage Weil almost 10 years ago

  • Priority changed from Urgent to High
Actions #16

Updated by Samuel Just almost 10 years ago

two instances:
ubuntu@teuthology:/a/samuelj-2014-05-16_23:11:10-rados-wip-sam-testing-testing-basic-plana/259903

Actions #17

Updated by Tamilarasi muthamizhan almost 10 years ago

recent logs from the nightlies:ubuntu@teuthology:/a/teuthology-2014-05-18_02:30:21-rados-master-testing-basic-plana/260740

2014-05-18 12:46:11.967890 24b04700 20 osd.1 1540  kicking pg 84.6
2014-05-18 12:46:11.968011 24b04700 30 osd.1 pg_epoch: 1540 pg[84.6( v 1477'162 (0'0,1477'162] local-les=1516 n=90 ec=1303 les/c 1516/1481 1533/1533/1533) [0,1] r=1 lpr=1533 pi=1303-1532/6 crt=1383'156 lcod 1477'160 inactive NOTIFY] lock
2014-05-18 12:46:11.971784 24b04700 20 osd.1 1540  kicking pg 85.6
2014-05-18 12:46:11.971908 24b04700 30 osd.1 pg_epoch: 1540 pg[85.6( v 1531'14 (0'0,1531'14] local-les=1446 n=14 ec=1304 les/c 1446/1457 1437/1437/1437) [1,5] r=0 lpr=1437 luod=0'0 crt=0'0 lcod 1528'13 mlcod 0'0 active+clean] lock
2014-05-18 12:46:11.973464 24b04700 -1 osd.1 1540 pgid 85.6 has ref count of 3
2014-05-18 12:46:12.088728 24b04700 -1 osd/OSD.cc: In function 'int OSD::shutdown()' thread 24b04700 time 2014-05-18 12:46:11.977360
osd/OSD.cc: 1733: FAILED assert(0)

 ceph version 0.80-469-g991f7f1 (991f7f15a6e107b33a24bbef1169f21eb7fcce2c)
 1: (OSD::shutdown()+0x16d3) [0x6342b3]
 2: (OSD::handle_signal(int)+0x11d) [0x634a1d]
 3: (SignalHandler::entry()+0x18b) [0x99561b]
 4: (()+0x7e9a) [0x52a8e9a]
 5: (clone()+0x6d) [0x6d1e3fd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Actions #18

Updated by Sage Weil almost 10 years ago

ubuntu@teuthology:/a/teuthology-2014-05-28_02:30:04-rados-next-testing-basic-plana/278239

Actions #19

Updated by Sage Weil almost 10 years ago

ubuntu@teuthology:/a/teuthology-2014-06-02_02:30:05-rados-master-testing-basic-plana/285880

Actions #20

Updated by Sage Weil almost 10 years ago

ubuntu@teuthology:/a/teuthology-2014-06-02_02:30:05-rados-master-testing-basic-plana/285882

Actions #22

Updated by Sahana Lokeshappa almost 10 years ago

Steps to reproduce:

setup 8 nodes , 3 osds per node.
While Io was going on and cluster was also in rebalancing state, removed osd.23 from cluster ,steps followed
sudo stop ceph-osd id=23,
sudo ceph osd out 23
sudo ceph osd crush remove osd.23
sudo ceph auth del osd.23
sudo ceph osd rm 23

logs in osd.23.log:

-1> 2014-06-19 08:28:29.313157 7fa95b007700 -1 osd.23 1551 pgid 4.53 has ref count of 6
0> 2014-06-19 08:28:29.313935 7fa95b007700 -1 osd/OSD.cc: In function 'int OSD::shutdown()' thread 7fa95b007700 time 2014-06-19 08:28:29.313161
osd/OSD.cc: 1740: FAILED assert(0)
ceph version andisk-sprint-2-drop-3-390-g2dbd85c (2dbd85c94cf27a1ff0419c5ea9359af7fe30e9b6)
1: (OSD::shutdown()+0x176f) [0x61e94f]
2: (OSD::handle_signal(int)+0x60) [0x61f210]
3: (SignalHandler::entry()+0x19f) [0x97584f]
4: (()+0x8182) [0x7fa9836b2182]
5: (clone()+0x6d) [0x7fa981a5330d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

pg 4.53 has refcount more than resulted in this assertion failure

no cores generated till now for this osd crash.

Actions #23

Updated by Sage Weil almost 10 years ago

  • Status changed from In Progress to 12
Actions #24

Updated by Samuel Just almost 10 years ago

ubuntu@teuthology:/a/teuthology-2014-06-17_02:30:05-rados-firefly-distro-basic-plana/313621/remote

Actions #25

Updated by Samuel Just almost 10 years ago

Blessed[Gen]Context doesn't clean up the wrapped Context if the destructor is called. Running flab:/home/sage/src/teuthology/bug-7995.yaml

Actions #26

Updated by Sage Weil almost 10 years ago

teuthology-2014-07-01_02:30:37-rados-firefly-distro-basic-plana/337102

Actions #27

Updated by Sage Weil almost 10 years ago

ubuntu@teuthology:/var/lib/teuthworker/archive/sage-2014-07-05_19:34:55-rados-wip-msgr-testing-basic-plana/344755

Actions #28

Updated by Sage Weil almost 10 years ago

This correlates with the applying in-progress repops:

 -1296> 2014-07-08 15:12:11.257577 29063700 10 osd.0 pg_epoch: 8 pg[1.0( v 8'4 (0'0,8'4] local-les=8 n=4 ec=1 les/c 8/8 7/7/7) [0] r=0 lpr=7 luod=0'0 crt=0'0 lcod 0'0 mlcod 0'0 active+degraded]  applying repop tid 1
 -1295> 2014-07-08 15:12:11.265540 29063700 20 osd.0 pg_epoch: 8 pg[1.0( v 8'4 (0'0,8'4] local-les=8 n=4 ec=1 les/c 8/8 7/7/7) [0] r=0 lpr=7 luod=0'0 crt=0'0 lcod 0'0 mlcod 0'0 active+degraded] remove_repop repgather(0x29c24820 8'1 rep_tid=1 committed?=0 applied?=0 lock=0 op=osd_op(mds.0.1:5 600.00000000 [??? 1~0,omap-set-header 0~222] 1.c2e541b0 ondisk+write+known_if_redirected e7) v4)
 -1294> 2014-07-08 15:12:11.276721 29063700 15 osd.0 pg_epoch: 8 pg[1.0( v 8'4 (0'0,8'4] local-les=8 n=4 ec=1 les/c 8/8 7/7/7) [0] r=0 lpr=7 luod=0'0 crt=0'0 lcod 0'0 mlcod 0'0 active+degraded]  requeue_ops 
 -1293> 2014-07-08 15:12:11.293444 29063700 10 osd.0 pg_epoch: 8 pg[1.0( v 8'4 (0'0,8'4] local-les=8 n=4 ec=1 les/c 8/8 7/7/7) [0] r=0 lpr=7 luod=0'0 crt=0'0 lcod 0'0 mlcod 0'0 active+degraded]  applying repop tid 3
 -1292> 2014-07-08 15:12:11.293965 29063700 20 osd.0 pg_epoch: 8 pg[1.0( v 8'4 (0'0,8'4] local-les=8 n=4 ec=1 les/c 8/8 7/7/7) [0] r=0 lpr=7 luod=0'0 crt=0'0 lcod 0'0 mlcod 0'0 active+degraded] remove_repop repgather(0x29d4bc20 8'2 rep_tid=3 committed?=0 applied?=0 lock=0 op=osd_op(mds.0.1:6 601.00000000 [??? 1~0,omap-set-header 0~222] 1.81580290 ondisk+write+known_if_redirected e7) v4)
 -1291> 2014-07-08 15:12:11.294755 29063700 15 osd.0 pg_epoch: 8 pg[1.0( v 8'4 (0'0,8'4] local-les=8 n=4 ec=1 les/c 8/8 7/7/7) [0] r=0 lpr=7 luod=0'0 crt=0'0 lcod 0'0 mlcod 0'0 active+degraded]  requeue_ops 
 -1290> 2014-07-08 15:12:11.295290 29063700 10 osd.0 pg_epoch: 8 pg[1.0( v 8'4 (0'0,8'4] local-les=8 n=4 ec=1 les/c 8/8 7/7/7) [0] r=0 lpr=7 luod=0'0 crt=0'0 lcod 0'0 mlcod 0'0 active+degraded]  applying repop tid 4
 -1289> 2014-07-08 15:12:11.295789 29063700 20 osd.0 pg_epoch: 8 pg[1.0( v 8'4 (0'0,8'4] local-les=8 n=4 ec=1 les/c 8/8 7/7/7) [0] r=0 lpr=7 luod=0'0 crt=0'0 lcod 0'0 mlcod 0'0 active+degraded] remove_repop repgather(0x29e9b440 8'3 rep_tid=4 committed?=0 applied?=0 lock=0 op=osd_op(mds.0.1:7 602.00000000 [??? 1~0,omap-set-header 0~222] 1.2a8b6bd0 ondisk+write+known_if_redirected e7) v4)
 -1288> 2014-07-08 15:12:11.296625 29063700 15 osd.0 pg_epoch: 8 pg[1.0( v 8'4 (0'0,8'4] local-les=8 n=4 ec=1 les/c 8/8 7/7/7) [0] r=0 lpr=7 luod=0'0 crt=0'0 lcod 0'0 mlcod 0'0 active+degraded]  requeue_ops 
 -1270> 2014-07-08 15:12:11.299097 29063700 10 osd.0 pg_epoch: 8 pg[1.0( v 8'4 (0'0,8'4] local-les=8 n=4 ec=1 les/c 8/8 7/7/7) [0] r=0 lpr=7 luod=0'0 crt=0'0 lcod 0'0 mlcod 0'0 active+degraded]  applying repop tid 5
 -1269> 2014-07-08 15:12:11.465247 29063700 20 osd.0 pg_epoch: 8 pg[1.0( v 8'4 (0'0,8'4] local-les=8 n=4 ec=1 les/c 8/8 7/7/7) [0] r=0 lpr=7 luod=0'0 crt=0'0 lcod 0'0 mlcod 0'0 active+degraded] remove_repop repgather(0x29edab80 8'4 rep_tid=5 committed?=0 applied?=0 lock=0 op=osd_op(mds.0.1:8 603.00000000 [??? 1~0,omap-set-header 0~222] 1.7a4d91b0 ondisk+write+known_if_redirected e7) v4)

... and appears to be 3 refs per repop.

Actions #29

Updated by Sage Weil almost 10 years ago

  • Status changed from 12 to Pending Backport
Actions #30

Updated by Sahana Lokeshappa almost 10 years ago

  • File monlogs.gz added
  • File log_osd_678.gz added
  • File log_osd_345.gz added
  • File log_osd_012.gz added
Actions #31

Updated by Sage Weil over 9 years ago

  • Status changed from Pending Backport to Resolved
  • Backport set to firefly
Actions #32

Updated by Loïc Dachary over 9 years ago

  • File deleted (monlogs.gz)
Actions #33

Updated by Loïc Dachary over 9 years ago

  • File deleted (log_osd_678.gz)
Actions #34

Updated by Loïc Dachary over 9 years ago

  • File deleted (log_osd_345.gz)
Actions #35

Updated by Loïc Dachary over 9 years ago

  • File deleted (log_osd_012.gz)
Actions

Also available in: Atom PDF