Bug #7891
closedosd: leaked pg refs on shutdown
0%
Description
-330> 2014-03-27 18:26:28.535862 1ff69700 20 osd.3 80 kicking pg 23.6 -329> 2014-03-27 18:26:28.535984 1ff69700 30 osd.3 pg_epoch: 80 pg[23.6( v 80'8 (0'0,80'8] local-les=80 n=1 ec=79 les/c 80/80 79/79/79) [3,1] r=0 lpr=79 luod=0'0 crt=0'0 lcod 80'6 mlcod 0'0 active+clean] lock -328> 2014-03-27 18:26:28.537515 1ff69700 -1 osd.3 80 pgid 23.6 has ref count of 4 -327> 2014-03-27 18:26:28.642681 1ff69700 -1 osd/OSD.cc: In function 'int OSD::shutdown()' thread 1ff69700 time 2014-03-27 18:26:28.541435 osd/OSD.cc: 1689: FAILED assert(0) ceph version 0.78-367-gd9a2dea (d9a2dea755a62e4f9fe0795410f37b68a15ae054) 1: (OSD::shutdown()+0x1557) [0x62f3c7] 2: (OSD::handle_signal(int)+0x11d) [0x62fabd] 3: (SignalHandler::entry()+0x18b) [0x98139b] 4: (()+0x7e9a) [0x52a8e9a] 5: (clone()+0x6d) [0x6d1e3fd] NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.ubuntu@teuthology:/var/lib/teuthworker/archive/teuthology-2014-03-27_02:30:16-rados-firefly-distro-basic-plana/149035
Updated by Samuel Just about 10 years ago
ubuntu@teuthology:/a/teuthology-2014-04-02_02:30:02-rados-master-testing-basic-plana/161303/remote$
Updated by Sage Weil about 10 years ago
ubuntu@teuthology:/var/lib/teuthworker/archive/teuthology-2014-04-04_02:30:19-rados-master-testing-basic-plana/168600
Updated by Sage Weil about 10 years ago
ubuntu@teuthology:/var/lib/teuthworker/archive/gregf-2014-04-04_22:05:49-rados-wip-7994-testing-basic-plana/171097
Updated by Sage Weil about 10 years ago
- Status changed from 12 to In Progress
- Assignee set to Sage Weil
Updated by Sage Weil about 10 years ago
ubuntu@teuthology:/a/teuthology-2014-04-15_02:30:04-rados-firefly-distro-basic-plana/193023 but not debug patch applied
Updated by Samuel Just almost 10 years ago
ubuntu@teuthology:/a/sage-2014-04-23_18:03:07-rados-firefly-testing-basic-plana/211808
Updated by Samuel Just almost 10 years ago
ubuntu@teuthology:/a/teuthology-2014-04-25_02:30:12-rados-master-testing-basic-plana/214284/remote
Updated by Sage Weil almost 10 years ago
ubuntu@teuthology:/a/samuelj-2014-05-08_12:44:29-rados-firefly-testing-basic-plana/243892 (shared_cache assert on shutdown)
Updated by Samuel Just almost 10 years ago
ubuntu@teuthology:/a/samuelj-2014-05-08_12:44:29-rados-firefly-testing-basic-plana/243892/remote
Updated by Sage Weil almost 10 years ago
ubuntu@teuthology:/var/lib/teuthworker/archive/sage-2014-05-11_09:34:50-rados-firefly-testing-basic-plana/249163
Updated by Samuel Just almost 10 years ago
ubuntu@teuthology:/a/samuelj-2014-05-09_14:11:19-rados-wip-sam-testing-testing-basic-plana/245865
Updated by Samuel Just almost 10 years ago
two instances:
ubuntu@teuthology:/a/samuelj-2014-05-16_23:11:10-rados-wip-sam-testing-testing-basic-plana/259903
Updated by Tamilarasi muthamizhan almost 10 years ago
recent logs from the nightlies:ubuntu@teuthology:/a/teuthology-2014-05-18_02:30:21-rados-master-testing-basic-plana/260740
2014-05-18 12:46:11.967890 24b04700 20 osd.1 1540 kicking pg 84.6 2014-05-18 12:46:11.968011 24b04700 30 osd.1 pg_epoch: 1540 pg[84.6( v 1477'162 (0'0,1477'162] local-les=1516 n=90 ec=1303 les/c 1516/1481 1533/1533/1533) [0,1] r=1 lpr=1533 pi=1303-1532/6 crt=1383'156 lcod 1477'160 inactive NOTIFY] lock 2014-05-18 12:46:11.971784 24b04700 20 osd.1 1540 kicking pg 85.6 2014-05-18 12:46:11.971908 24b04700 30 osd.1 pg_epoch: 1540 pg[85.6( v 1531'14 (0'0,1531'14] local-les=1446 n=14 ec=1304 les/c 1446/1457 1437/1437/1437) [1,5] r=0 lpr=1437 luod=0'0 crt=0'0 lcod 1528'13 mlcod 0'0 active+clean] lock 2014-05-18 12:46:11.973464 24b04700 -1 osd.1 1540 pgid 85.6 has ref count of 3 2014-05-18 12:46:12.088728 24b04700 -1 osd/OSD.cc: In function 'int OSD::shutdown()' thread 24b04700 time 2014-05-18 12:46:11.977360 osd/OSD.cc: 1733: FAILED assert(0) ceph version 0.80-469-g991f7f1 (991f7f15a6e107b33a24bbef1169f21eb7fcce2c) 1: (OSD::shutdown()+0x16d3) [0x6342b3] 2: (OSD::handle_signal(int)+0x11d) [0x634a1d] 3: (SignalHandler::entry()+0x18b) [0x99561b] 4: (()+0x7e9a) [0x52a8e9a] 5: (clone()+0x6d) [0x6d1e3fd] NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Updated by Sage Weil almost 10 years ago
ubuntu@teuthology:/a/teuthology-2014-05-28_02:30:04-rados-next-testing-basic-plana/278239
Updated by Sage Weil almost 10 years ago
ubuntu@teuthology:/a/teuthology-2014-06-02_02:30:05-rados-master-testing-basic-plana/285880
Updated by Sage Weil almost 10 years ago
ubuntu@teuthology:/a/teuthology-2014-06-02_02:30:05-rados-master-testing-basic-plana/285882
Updated by Sahana Lokeshappa almost 10 years ago
Steps to reproduce:
setup 8 nodes , 3 osds per node.
While Io was going on and cluster was also in rebalancing state, removed osd.23 from cluster ,steps followed
sudo stop ceph-osd id=23,
sudo ceph osd out 23
sudo ceph osd crush remove osd.23
sudo ceph auth del osd.23
sudo ceph osd rm 23
logs in osd.23.log:
-1> 2014-06-19 08:28:29.313157 7fa95b007700 -1 osd.23 1551 pgid 4.53 has ref count of 6
0> 2014-06-19 08:28:29.313935 7fa95b007700 -1 osd/OSD.cc: In function 'int OSD::shutdown()' thread 7fa95b007700 time 2014-06-19 08:28:29.313161
osd/OSD.cc: 1740: FAILED assert(0)
ceph version andisk-sprint-2-drop-3-390-g2dbd85c (2dbd85c94cf27a1ff0419c5ea9359af7fe30e9b6)
1: (OSD::shutdown()+0x176f) [0x61e94f]
2: (OSD::handle_signal(int)+0x60) [0x61f210]
3: (SignalHandler::entry()+0x19f) [0x97584f]
4: (()+0x8182) [0x7fa9836b2182]
5: (clone()+0x6d) [0x7fa981a5330d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
pg 4.53 has refcount more than resulted in this assertion failure
no cores generated till now for this osd crash.
Updated by Samuel Just almost 10 years ago
ubuntu@teuthology:/a/teuthology-2014-06-17_02:30:05-rados-firefly-distro-basic-plana/313621/remote
Updated by Samuel Just almost 10 years ago
Blessed[Gen]Context doesn't clean up the wrapped Context if the destructor is called. Running flab:/home/sage/src/teuthology/bug-7995.yaml
Updated by Sage Weil almost 10 years ago
teuthology-2014-07-01_02:30:37-rados-firefly-distro-basic-plana/337102
Updated by Sage Weil almost 10 years ago
ubuntu@teuthology:/var/lib/teuthworker/archive/sage-2014-07-05_19:34:55-rados-wip-msgr-testing-basic-plana/344755
Updated by Sage Weil almost 10 years ago
This correlates with the applying in-progress repops:
-1296> 2014-07-08 15:12:11.257577 29063700 10 osd.0 pg_epoch: 8 pg[1.0( v 8'4 (0'0,8'4] local-les=8 n=4 ec=1 les/c 8/8 7/7/7) [0] r=0 lpr=7 luod=0'0 crt=0'0 lcod 0'0 mlcod 0'0 active+degraded] applying repop tid 1 -1295> 2014-07-08 15:12:11.265540 29063700 20 osd.0 pg_epoch: 8 pg[1.0( v 8'4 (0'0,8'4] local-les=8 n=4 ec=1 les/c 8/8 7/7/7) [0] r=0 lpr=7 luod=0'0 crt=0'0 lcod 0'0 mlcod 0'0 active+degraded] remove_repop repgather(0x29c24820 8'1 rep_tid=1 committed?=0 applied?=0 lock=0 op=osd_op(mds.0.1:5 600.00000000 [??? 1~0,omap-set-header 0~222] 1.c2e541b0 ondisk+write+known_if_redirected e7) v4) -1294> 2014-07-08 15:12:11.276721 29063700 15 osd.0 pg_epoch: 8 pg[1.0( v 8'4 (0'0,8'4] local-les=8 n=4 ec=1 les/c 8/8 7/7/7) [0] r=0 lpr=7 luod=0'0 crt=0'0 lcod 0'0 mlcod 0'0 active+degraded] requeue_ops -1293> 2014-07-08 15:12:11.293444 29063700 10 osd.0 pg_epoch: 8 pg[1.0( v 8'4 (0'0,8'4] local-les=8 n=4 ec=1 les/c 8/8 7/7/7) [0] r=0 lpr=7 luod=0'0 crt=0'0 lcod 0'0 mlcod 0'0 active+degraded] applying repop tid 3 -1292> 2014-07-08 15:12:11.293965 29063700 20 osd.0 pg_epoch: 8 pg[1.0( v 8'4 (0'0,8'4] local-les=8 n=4 ec=1 les/c 8/8 7/7/7) [0] r=0 lpr=7 luod=0'0 crt=0'0 lcod 0'0 mlcod 0'0 active+degraded] remove_repop repgather(0x29d4bc20 8'2 rep_tid=3 committed?=0 applied?=0 lock=0 op=osd_op(mds.0.1:6 601.00000000 [??? 1~0,omap-set-header 0~222] 1.81580290 ondisk+write+known_if_redirected e7) v4) -1291> 2014-07-08 15:12:11.294755 29063700 15 osd.0 pg_epoch: 8 pg[1.0( v 8'4 (0'0,8'4] local-les=8 n=4 ec=1 les/c 8/8 7/7/7) [0] r=0 lpr=7 luod=0'0 crt=0'0 lcod 0'0 mlcod 0'0 active+degraded] requeue_ops -1290> 2014-07-08 15:12:11.295290 29063700 10 osd.0 pg_epoch: 8 pg[1.0( v 8'4 (0'0,8'4] local-les=8 n=4 ec=1 les/c 8/8 7/7/7) [0] r=0 lpr=7 luod=0'0 crt=0'0 lcod 0'0 mlcod 0'0 active+degraded] applying repop tid 4 -1289> 2014-07-08 15:12:11.295789 29063700 20 osd.0 pg_epoch: 8 pg[1.0( v 8'4 (0'0,8'4] local-les=8 n=4 ec=1 les/c 8/8 7/7/7) [0] r=0 lpr=7 luod=0'0 crt=0'0 lcod 0'0 mlcod 0'0 active+degraded] remove_repop repgather(0x29e9b440 8'3 rep_tid=4 committed?=0 applied?=0 lock=0 op=osd_op(mds.0.1:7 602.00000000 [??? 1~0,omap-set-header 0~222] 1.2a8b6bd0 ondisk+write+known_if_redirected e7) v4) -1288> 2014-07-08 15:12:11.296625 29063700 15 osd.0 pg_epoch: 8 pg[1.0( v 8'4 (0'0,8'4] local-les=8 n=4 ec=1 les/c 8/8 7/7/7) [0] r=0 lpr=7 luod=0'0 crt=0'0 lcod 0'0 mlcod 0'0 active+degraded] requeue_ops -1270> 2014-07-08 15:12:11.299097 29063700 10 osd.0 pg_epoch: 8 pg[1.0( v 8'4 (0'0,8'4] local-les=8 n=4 ec=1 les/c 8/8 7/7/7) [0] r=0 lpr=7 luod=0'0 crt=0'0 lcod 0'0 mlcod 0'0 active+degraded] applying repop tid 5 -1269> 2014-07-08 15:12:11.465247 29063700 20 osd.0 pg_epoch: 8 pg[1.0( v 8'4 (0'0,8'4] local-les=8 n=4 ec=1 les/c 8/8 7/7/7) [0] r=0 lpr=7 luod=0'0 crt=0'0 lcod 0'0 mlcod 0'0 active+degraded] remove_repop repgather(0x29edab80 8'4 rep_tid=5 committed?=0 applied?=0 lock=0 op=osd_op(mds.0.1:8 603.00000000 [??? 1~0,omap-set-header 0~222] 1.7a4d91b0 ondisk+write+known_if_redirected e7) v4)
... and appears to be 3 refs per repop.
Updated by Sage Weil almost 10 years ago
- Status changed from 12 to Pending Backport
Updated by Sahana Lokeshappa almost 10 years ago
- File monlogs.gz added
- File log_osd_678.gz added
- File log_osd_345.gz added
- File log_osd_012.gz added
Updated by Sage Weil over 9 years ago
- Status changed from Pending Backport to Resolved
- Backport set to firefly