Project

General

Profile

Bug #696

osd: _put_pool, assert(p->num_pg > 0)

Added by Wido den Hollander over 8 years ago. Updated over 8 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
OSD
Target version:
Start date:
01/09/2011
Due date:
% Done:

0%

Spent time:
Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:

Description

Yesterday I tried to remove some pools from my Ceph system on 'noisy', running unstable commit eace4398cb163a670ff6bbd657de9aa5c917fcb9

I tried to remove the pools created by the RADOS gateway (.rgw, .users, .users.email) and one of my 4 OSD's went down with:

#0  0x00007f6dc69897bb in raise () from /lib/libpthread.so.0
#1  0x000000000061a96b in handle_fatal_signal (signum=6) at config.cc:253
#2  <signal handler called>
#3  0x00007f6dc5559a75 in raise () from /lib/libc.so.6
#4  0x00007f6dc555d5c0 in abort () from /lib/libc.so.6
#5  0x00007f6dc5e0f8e5 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/libstdc++.so.6
#6  0x00007f6dc5e0dd16 in ?? () from /usr/lib/libstdc++.so.6
#7  0x00007f6dc5e0dd43 in std::terminate() () from /usr/lib/libstdc++.so.6
#8  0x00007f6dc5e0de3e in __cxa_throw () from /usr/lib/libstdc++.so.6
#9  0x0000000000604992 in ceph::__ceph_assert_fail (assertion=0x63d077 "p->num_pg > 0", file=<value optimized out>, 
    line=874, func=<value optimized out>) at common/assert.cc:25
#10 0x00000000004e9b27 in OSD::_put_pool (this=0x151c000, p=0x293b000) at osd/OSD.cc:874
#11 0x00000000004eba93 in OSD::_remove_pg (this=0x151c000, pg=<value optimized out>) at osd/OSD.cc:4609
#12 0x0000000000607a76 in ThreadPool::worker (this=0x151c5e8) at common/WorkQueue.cc:44
#13 0x0000000000523b7d in ThreadPool::WorkThread::entry() ()
#14 0x0000000000481f0a in Thread::_entry_func (arg=0x6185) at ./common/Thread.h:39
#15 0x00007f6dc69809ca in start_thread () from /lib/libpthread.so.0
#16 0x00007f6dc560c70d in clone () from /lib/libc.so.6
#17 0x0000000000000000 in ?? ()

The last few log lines (logging was a bit low).

2011-01-08 19:47:10.922490 7f6dba9f7700 osd3 149 pg[17.1( v 143'1 (0'0,143'1] n=1 ec=141 les=143 141/141/141) [3,1,0] r=0 mlc
od 0'0 active+clean] oi.user_version=143'1 is_modify=0
2011-01-08 19:47:10.928918 7f6dbcafc700 osd3 149 OSD::ms_handle_reset()
2011-01-08 19:47:10.928960 7f6dbcafc700 osd3 149 OSD::ms_handle_reset() s=0x2d4eea0
osd/OSD.cc: In function 'void OSD::_put_pool(PGPool*)', In thread 7f6db91f4700
osd/OSD.cc:874: FAILED assert(p->num_pg > 0)
 ceph version 0.25~rc (commit:eace4398cb163a670ff6bbd657de9aa5c917fcb9)
 1: (OSD::_put_pool(PGPool*)+0x267) [0x4e9b27]
 2: (OSD::_remove_pg(PG*)+0x1653) [0x4eba93]
 3: (ThreadPool::worker()+0x556) [0x607a76]
 4: (ThreadPool::WorkThread::entry()+0xd) [0x523b7d]
 5: (Thread::_entry_func(void*)+0xa) [0x481f0a]
 6: (()+0x69ca) [0x7f6dc69809ca]
 7: (clone()+0x6d) [0x7f6dc560c70d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
*** Caught signal (Aborted) ***
in thread 7f6db91f4700
 ceph version 0.25~rc (commit:eace4398cb163a670ff6bbd657de9aa5c917fcb9)
 1: (handle_fatal_signal(int)+0x13c) [0x61a92c]
 2: (()+0xf8f0) [0x7f6dc69898f0]
 3: (gsignal()+0x35) [0x7f6dc5559a75]
 4: (abort()+0x180) [0x7f6dc555d5c0]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7f6dc5e0f8e5]
 6: (()+0xcad16) [0x7f6dc5e0dd16]
 7: (()+0xcad43) [0x7f6dc5e0dd43]
 8: (()+0xcae3e) [0x7f6dc5e0de3e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x2f2) [0x604992]
 a: (OSD::_put_pool(PGPool*)+0x267) [0x4e9b27]
 b: (OSD::_remove_pg(PG*)+0x1653) [0x4eba93]
 c: (ThreadPool::worker()+0x556) [0x607a76]
 d: (ThreadPool::WorkThread::entry()+0xd) [0x523b7d]
 e: (Thread::_entry_func(void*)+0xa) [0x481f0a]
 f: (()+0x69ca) [0x7f6dc69809ca]
 10: (clone()+0x6d) [0x7f6dc560c70d]

I could start the OSD afterwards and it recovered nicely.

I removed the pools rather quickly, they were all removed within a few seconds.

I've collected the data and uploaded it to logger.ceph.widodh.nl:/srv/ceph/issues/osd_crash_put_pool


Related issues

Related to Ceph - Bug #629: cosd segfaults when deleting a pool containing degraded objects Resolved 12/02/2010

History

#1 Updated by Sage Weil over 8 years ago

  • Assignee set to Colin McCabe
  • Target version set to v0.24.2

We may want to add get/put pool debug prints (if there aren't any already) so we can nail this one down...

#2 Updated by Wido den Hollander over 8 years ago

I'm able to reproduce the crash on my 'noisy' machine.

After setting the logging to 20 I got:

2011-01-10 09:59:46.918344 7fd3f02d2700 filestore(/var/lib/ceph/osd.2) write /var/lib/ceph/osd.2/current/meta/pglog_24.7_0 0~0 = 0
2011-01-10 09:59:46.918358 7fd3f02d2700 filestore(/var/lib/ceph/osd.2) collection_setattr /var/lib/ceph/osd.2/current/24.7_head 'ondisklog' len 17
2011-01-10 09:59:46.918372 7fd3f02d2700 filestore(/var/lib/ceph/osd.2) collection_setattr /var/lib/ceph/osd.2/current/24.7_head 'ondisklog' len 17 = -2
2011-01-10 09:59:46.918383 7fd3f02d2700 filestore(/var/lib/ceph/osd.2) _do_op 0x4046580 209423 r = 0, finisher 0x32b06a0 0
2011-01-10 09:59:46.918392 7fd3f02d2700 filestore(/var/lib/ceph/osd.2) _finish_op on osr 0x32d2a50/0x4071b98
2011-01-10 09:59:46.918400 7fd3f02d2700 filestore(/var/lib/ceph/osd.2) _finish_op 209423 next_finish 209423 queueing 0x32b06a0 doing 0
2011-01-10 09:59:46.918426 7fd3ea1c5700 filestore(/var/lib/ceph/osd.2) _flush_op_queue waiting for apply finisher
2011-01-10 09:59:46.918438 7fd3ea1c5700 filestore(/var/lib/ceph/osd.2) flush complete
2011-01-10 09:59:46.918445 7fd3ea1c5700 osd2 206 _remove_pg 24.7 taking osd_lock
2011-01-10 09:59:46.918454 7fd3ea1c5700 osd2 206 _remove_pg 24.7 removing final
2011-01-10 09:59:46.918469 7fd3ea1c5700 filestore(/var/lib/ceph/osd.2) queue_transactions existing osr 0x3943000/0x29f7538
2011-01-10 09:59:46.918478 7fd3ea1c5700 filestore(/var/lib/ceph/osd.2) queue_transactions (parallel) 209424 0x3b32e00
2011-01-10 09:59:46.918495 7fd3ea1c5700 filestore(/var/lib/ceph/osd.2) queue_op 0x40f6400 seq 209424 122 bytes   (queue has 1 ops and 122 bytes)
2011-01-10 09:59:46.918510 7fd3ea1c5700 osd2 206 _put_pool 24 0 -> -1
osd/OSD.cc: In function 'void OSD::_put_pool(PGPool*)', In thread 7fd3ea1c5700
osd/OSD.cc:874: FAILED assert(p->num_pg > 0)
 ceph version 0.25~rc (commit:eace4398cb163a670ff6bbd657de9aa5c917fcb9)
 1: (OSD::_put_pool(PGPool*)+0x267) [0x4e9b27]
 2: (OSD::_remove_pg(PG*)+0x1653) [0x4eba93]
 3: (ThreadPool::worker()+0x556) [0x607a76]
 4: (ThreadPool::WorkThread::entry()+0xd) [0x523b7d]
 5: (Thread::_entry_func(void*)+0xa) [0x481f0a]
 6: (()+0x69ca) [0x7fd3f7a529ca]
 7: (clone()+0x6d) [0x7fd3f66de70d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
*** Caught signal (Aborted) ***
in thread 7fd3ea1c5700
 ceph version 0.25~rc (commit:eace4398cb163a670ff6bbd657de9aa5c917fcb9)
 1: (handle_fatal_signal(int)+0x13c) [0x61a92c]
 2: (()+0xf8f0) [0x7fd3f7a5b8f0]
 3: (gsignal()+0x35) [0x7fd3f662ba75]
 4: (abort()+0x180) [0x7fd3f662f5c0]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7fd3f6ee18e5]
 6: (()+0xcad16) [0x7fd3f6edfd16]
 7: (()+0xcad43) [0x7fd3f6edfd43]
 8: (()+0xcae3e) [0x7fd3f6edfe3e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x2f2) [0x604992]
 a: (OSD::_put_pool(PGPool*)+0x267) [0x4e9b27]
 b: (OSD::_remove_pg(PG*)+0x1653) [0x4eba93]
 c: (ThreadPool::worker()+0x556) [0x607a76]
 d: (ThreadPool::WorkThread::entry()+0xd) [0x523b7d]
 e: (Thread::_entry_func(void*)+0xa) [0x481f0a]
 f: (()+0x69ca) [0x7fd3f7a529ca]
 10: (clone()+0x6d) [0x7fd3f66de70d]

I'll stick to my current code and won't change anything so you can track this one down.

#3 Updated by Colin McCabe over 8 years ago

It looks like we already have "_get_pool" and "_put_pool" printed out in the logs when the relevant functions are called. Looks like a refcount issue, will investigate.

#4 Updated by Colin McCabe over 8 years ago

Hi Wido,

I took a quick glance at the logs in osd_3_crash_rgw_pool_remove.tar.gz, but it looks like those are the old ones, with the log level set to low. Can you reproduce this with the high logging level and post that?

Also, Sage, is there a reason why PG::~PG is doing a pool->put(), but OSD::_remove_pg() also results in a pg->pool->put()? Unless I'm mistaken, this leads to the reference count getting screwed up. If the pool structure was mistakenly freed, and later zeroed, this would explain all the grief in _put_pool.

C.

#5 Updated by Wido den Hollander over 8 years ago

Hi Colin,

The logging was indeed low. But I was able to reproduce the crash yesterday with a higher logging enabled, i've uploaded osd.2.log.1.gz to the logger machine, it should have the information.

But doing a quick grep on the logfile showed me:

root@noisy:/var/log/ceph# zcat osd.2.log.1.gz |grep put_pool
2011-01-10 09:59:46.907408 7fd3ea1c5700 osd2 206 _put_pool 24 6 -> 5
2011-01-10 09:59:46.911705 7fd3ea1c5700 osd2 206 _put_pool 24 5 -> 4
2011-01-10 09:59:46.912814 7fd3ea1c5700 osd2 206 _put_pool 24 4 -> 3
2011-01-10 09:59:46.914182 7fd3ea1c5700 osd2 206 _put_pool 24 3 -> 2
2011-01-10 09:59:46.915284 7fd3ea1c5700 osd2 206 _put_pool 24 2 -> 1
2011-01-10 09:59:46.916337 7fd3ea1c5700 osd2 206 _put_pool 24 1 -> 0
2011-01-10 09:59:46.917564 7fd3ea1c5700 osd2 206 _put_pool 25 7 -> 6
2011-01-10 09:59:46.918510 7fd3ea1c5700 osd2 206 _put_pool 24 0 -> -1
osd/OSD.cc: In function 'void OSD::_put_pool(PGPool*)', In thread 7fd3ea1c5700
 1: (OSD::_put_pool(PGPool*)+0x267) [0x4e9b27]
 a: (OSD::_put_pool(PGPool*)+0x267) [0x4e9b27]
2011-01-10 10:00:53.636842 7fa592422700 osd2 211 _put_pool 26 8 -> 7
2011-01-10 10:00:55.289396 7fa592422700 osd2 214 _put_pool 26 7 -> 6
2011-01-10 10:00:55.338038 7fa592422700 osd2 214 _put_pool 26 6 -> 5
2011-01-10 10:00:55.341779 7fa592422700 osd2 214 _put_pool 26 5 -> 4
2011-01-10 10:00:55.391337 7fa592422700 osd2 214 _put_pool 26 4 -> 3
2011-01-10 10:00:55.392650 7fa592422700 osd2 214 _put_pool 26 3 -> 2
2011-01-10 10:00:55.395315 7fa592422700 osd2 214 _put_pool 26 2 -> 1
2011-01-10 10:00:55.468614 7fa592422700 osd2 214 _put_pool 26 1 -> 0
2011-01-10 10:00:55.469753 7fa592422700 osd2 214 _put_pool 26 0 -> -1
osd/OSD.cc: In function 'void OSD::_put_pool(PGPool*)', In thread 7fa592422700
 1: (OSD::_put_pool(PGPool*)+0x267) [0x4e9b27]
 a: (OSD::_put_pool(PGPool*)+0x267) [0x4e9b27]
2011-01-10 17:25:10.098724 7f9999841700 osd2 224 _put_pool 27 8 -> 7
2011-01-10 17:25:10.100035 7f9999841700 osd2 224 _put_pool 27 7 -> 6
2011-01-10 17:25:10.101181 7f9999841700 osd2 224 _put_pool 27 6 -> 5
2011-01-10 17:25:10.102264 7f9999841700 osd2 224 _put_pool 27 5 -> 4
2011-01-10 17:25:10.103372 7f9999841700 osd2 224 _put_pool 27 4 -> 3
2011-01-10 17:25:10.104502 7f9999841700 osd2 224 _put_pool 27 3 -> 2
2011-01-10 17:25:10.105594 7f9999841700 osd2 224 _put_pool 27 2 -> 1
2011-01-10 17:25:10.106633 7f9999841700 osd2 224 _put_pool 27 1 -> 0
root@noisy:/var/log/ceph# 

As yo can see, it goes to -1 and then crashes.

But you can take a look at the log yourself, see if it shows anything relevant for you.

#6 Updated by Greg Farnum over 8 years ago

Colin McCabe wrote:

Also, Sage, is there a reason why PG::~PG is doing a pool->put(), but OSD::_remove_pg() also results in a pg->pool->put()? Unless I'm mistaken, this leads to the reference count getting screwed up. If the pool structure was mistakenly freed, and later zeroed, this would explain all the grief in _put_pool.

Notice that _get_pool() is called by _open_lock_pg(), which is called by _create_lock_pg() and _create_lock_new_pg(). This doesn't mean that there isn't a ref-counting problem involving this, but the PGPool has a ref held by the PG on creation, plus should have one for each of its PGs that the OSD is accessing directly. Thus _remove_pg() puts a ref. :)

#7 Updated by Colin McCabe over 8 years ago

  • Status changed from New to Testing

See if this fixes it up: 27f172f4c7d1afafeda8c9ec1d530f5e02ea1b7a

in the latest unstable.

#8 Updated by Wido den Hollander over 8 years ago

Tried your fix, seems to be working, but like I said on IRC, I can't be 100% sure. I couldn't reproduce it today, so we can't rule out it is still present.

Simple test I did:

$ for i in {1..100}; do rados mkpool foo${i}; done
$ rados lspools|grep foo|xargs -n 1 rados rmpool

Went fine, all 4 OSD's are still up. I'll keep trying, hope it stays away! Tnx.

#9 Updated by Colin McCabe over 8 years ago

  • Status changed from Testing to Resolved

Also available in: Atom PDF